新年开坑, 从零开始实现一个LSM-Tree的KV存储引擎。这个项目我会使用C++从零开始实现基于LSM-Tree的KV存储引擎。我的代码: https://github.com/ToniXWD/toni-lsm
1 LSM Tree 整体架构
图片摘自https://zhuanlan.zhihu.com/p/560421550?utm_id=0 , 侵删
上图为LSM-Tree
的整体架构, LSM
全名为Log Structured Merge Tree
, 其核心思路就是, 只对数据进行最佳写入, 而不改变已写入部分, 从而利用追加写入结合硬件局部性的特点, 最大化写入效率。
结合结构体其整体的架构介绍如下:
MemoryTable :
内存中有一个称为**
MemoryTable**
的结构,用于暂时存储新写入的数据。MemoryTable
通常是一个有序的数据结构,如跳表或平衡树,以支持快速插入和查找。
Immutable MemoryTable
Immutable MemoryTable
为不可更改的内存数据结构
WAL Log(Write-Ahead Log) :
在数据写入MemoryTable
之前,通常会先写入WAL Log
。这是一种持久化日志,用于在系统崩溃时恢复未持久化的数据。
SSTable(Sorted String Table) :
当MemoryTable达到一定大小(例如10 MB)时,它会被冻结并转换为一个不可变的SSTable,然后写入硬盘。SSTable是按键排序的文件,支持高效的区间查询。
需要注意的是, Level 0
的SST
是未排序的, 因此查询可能需要遍历所有Level 0
的SST
文件。但后续高层的SST
是通过低层的SST
进行compact
形成的, 此过程包含排序, 因此查询时可以进行二分查找。
读写过程如下:
读操作(Read) :
读取数据时,系统首先检查MemoryTable
,然后在各级SSTable
中查找。由于除了Level 0
的SST
是未排序的, 其余SST
都是有序的, 因此可以使用二分查找等高效算法来定位数据。
合并(Compaction) :
为了管理不同层次的SSTable并减少读取时的查找次数,系统会定期执行合并操作。合并过程会将多个SSTable文件合并为一个更大的文件,并删除重复或过期的数据。
这里只是浅浅介绍了LSM-Tree
的架构, 具体的计数细节在后续具体的模块代码实现中会有体现, 也可以参考知友的文章: https://zhuanlan.zhihu.com/p/560421550?utm_id=0
2 跳表实现 现在首先实现内存的数据结构, 其实我一开始想直接用基于红黑树的std::map
的, 但发现LevelDB
和RocksDB
使用的低层数据结构都是跳表, 主要是因为:
跳表天然支持顺序遍历,适合 LSM Tree 的 Compaction 和范围查询
跳表的代码实现和维护成本低
跳表的插入更简单, 而红黑树需要通过左旋或右旋来保证黑色节点高度的平衡
红黑树的原理和实现可以参考我的另一篇文章: https://zhuanlan.zhihu.com/p/681229703
这里简单介绍一下跳表是什么, 跳表就是过个链表, 每个链表的步长不同, 且链表节点是排序的, 查找或插入时, 先使用最大步长层级的链表, 然后定位大区间后, 转入下层低步长层级的链表继续查询, 是不是非常简单? :smile:
具体细节就不过多介绍了, 八股里面太多了
2.1. 数据结构设计 2.1.1 节点结构 1 2 3 4 5 6 7 8 struct SkipListNode { std::string key; std::string value; std::vector<std::shared_ptr<SkipListNode>> forward; SkipListNode (const std::string &k, const std::string &v, int level) : key (k), value (v), forward(level, nullptr ) {} };
每个节点包含:
key-value对
多层前向指针数组,用于快速跳跃
使用智能指针管理内存
2.1.2 跳表主体 1 2 3 4 5 6 7 8 9 10 11 class SkipList {private : std::shared_ptr<SkipListNode> head; int max_level; int current_level; size_t size_bytes; public : SkipList (int max_lvl = 16 ); };
2.2 核心操作实现 2.2.1 随机层级生成 1 2 3 4 5 6 7 int SkipList::random_level () { int level = 1 ; while (random () < SKIPLIST_P && level < max_level) { level++; } return level; }
使用概率分布确保层级分布合理
较低层级的概率更高,形成金字塔结构
2.2.2 插入操作 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 void SkipList::put (const std::string &key, const std::string &value) { std::vector<std::shared_ptr<SkipListNode>> update (max_level); auto current = head; for (int i = current_level - 1 ; i >= 0 ; --i) { while (current->forward[i] && current->forward[i]->key < key) { current = current->forward[i]; } update[i] = current; } current = current->forward[0 ]; if (current && current->key == key) { current->value = value; } else { int new_level = random_level (); } }
2.2.3 查询操作 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 std::optional<std::string> SkipList::get (const std::string &key) const { auto current = head; for (int i = current_level - 1 ; i >= 0 ; --i) { while (current->forward[i] && current->forward[i]->key < key) { current = current->forward[i]; } } current = current->forward[0 ]; if (current && current->key == key) { return current->value; } return std::nullopt ; }
2.3 迭代器支持 1 2 3 4 5 6 7 8 9 10 11 12 class SkipListIterator {private : std::shared_ptr<SkipListNode> current; public : SkipListIterator (std::shared_ptr<SkipListNode> node) : current (node) {} std::pair<std::string, std::string> operator *() const ; SkipListIterator& operator ++(); SkipListIterator operator ++(int ); };
迭代器实现支持:
2.4. LSM Tree特定优化 2.4.1 内存跟踪
2.4.2 批量导出 1 2 3 4 5 6 7 8 9 std::vector<std::pair<std::string, std::string>> flush () { std::vector<std::pair<std::string, std::string>> data; auto node = head->forward[0 ]; while (node) { data.emplace_back (node->key, node->value); node = node->forward[0 ]; } return data; }
支持高效的批量数据导出
用于LSM Tree的刷盘操作
3 xmake管理项目 3.1 项目构建管理 之前学习和做项目的时候C++
的项目管理基本上用的是CMake
, 总所周知, CMake
的语法复杂臃肿(大佬轻喷), 所以本次我采用xmake
进行项目管理, xmake
的用法可以参考官网: https://xmake.io/#/ , 其实语法很简单, 基本上见名知义了, 这里给出跳表部分的构建:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 set_project("toni-lsm" ) set_version("0.0.1" ) add_rules("mode.debug" , "mode.release" ) add_requires("gtest" ) target("skiplist" ) set_kind("static" ) add_files("src/skiplist/*.cpp" ) add_includedirs("include" , {public = true }) target("test_skiplist" ) set_kind("binary" ) set_group("tests" ) add_files("test/test_skiplist.cpp" ) add_deps("skiplist" ) add_packages("gtest" )
项目结构如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 . ├── include │ ├── consts.h │ ├── memtable │ │ ├── iterator.h │ │ └── memtable.h │ └── skiplist │ └── skiplist.h ├── src │ ├── memtable │ │ ├── iterator.cpp │ │ └── memtable.cpp │ └── skiplist │ └── skipList.cpp ├── test │ ├── test_memtable.cpp │ └── test_skiplist.cpp └── xmake.lua
可以看到, xmake
支持类似Linux
文件系统的通配符*
来添加依赖文件, 且添加依赖只需要add_deps
, 而不是想CMake
那样在每个文件夹下面都写一份CMakeLists.txt
, 爽!!!
3.2 单元测试 3.2.1 依赖准备 xmake
不仅仅是构建工具, 更包含简单的包管理工具和项目管理, 可以看成轻量的cargo
, 这里单元测试还需要安装gtest
:
如果你无法再IDE中获取代码高亮提示和补全的话, 需要手动引入gtest
的头文件, 这里我的开发环境是WSL+clangd+VSCode
, 可以直接在项目根目录添加一个.clangd
配置文件:
1 2 3 CompileFlags: Add: - "-isystem/home/toni/proj/vcpkg/installed/x64-linux/include"
3.2.2 单元测试编写 简单测试跳表:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 #include "../include/skiplist/skiplist.h" #include <gtest/gtest.h> #include <vector> #include <string> #include <algorithm> #include <unordered_set> TEST (SkipListTest, BasicOperations) { SkipList skipList; skipList.put ("key1" , "value1" ); EXPECT_EQ (skipList.get ("key1" ).value (), "value1" ); skipList.put ("key1" , "new_value" ); EXPECT_EQ (skipList.get ("key1" ).value (), "new_value" ); skipList.remove ("key1" ); EXPECT_FALSE (skipList.get ("key1" ).has_value ()); } TEST (SkipListTest, Iterator) { SkipList skipList; skipList.put ("key1" , "value1" ); skipList.put ("key2" , "value2" ); skipList.put ("key3" , "value3" ); std::vector<std::pair<std::string, std::string>> result; for (auto it = skipList.begin (); it != skipList.end (); ++it) { result.push_back (*it); } EXPECT_EQ (result.size (), 3 ); EXPECT_EQ (result[0 ].first, "key1" ); EXPECT_EQ (result[1 ].first, "key2" ); EXPECT_EQ (result[2 ].first, "key3" ); } TEST (SkipListTest, LargeScaleInsertAndGet) { SkipList skipList; const int num_elements = 10000 ; for (int i = 0 ; i < num_elements; ++i) { std::string key = "key" + std::to_string (i); std::string value = "value" + std::to_string (i); skipList.put (key, value); } for (int i = 0 ; i < num_elements; ++i) { std::string key = "key" + std::to_string (i); std::string expected_value = "value" + std::to_string (i); EXPECT_EQ (skipList.get (key).value (), expected_value); } } TEST (SkipListTest, LargeScaleRemove) { SkipList skipList; const int num_elements = 10000 ; for (int i = 0 ; i < num_elements; ++i) { std::string key = "key" + std::to_string (i); std::string value = "value" + std::to_string (i); skipList.put (key, value); } for (int i = 0 ; i < num_elements; ++i) { std::string key = "key" + std::to_string (i); skipList.remove (key); } for (int i = 0 ; i < num_elements; ++i) { std::string key = "key" + std::to_string (i); EXPECT_FALSE (skipList.get (key).has_value ()); } } TEST (SkipListTest, DuplicateInsert) { SkipList skipList; skipList.put ("key1" , "value1" ); skipList.put ("key1" , "value2" ); skipList.put ("key1" , "value3" ); EXPECT_EQ (skipList.get ("key1" ).value (), "value3" ); } TEST (SkipListTest, EmptySkipList) { SkipList skipList; EXPECT_FALSE (skipList.get ("nonexistent_key" ).has_value ()); skipList.remove ("nonexistent_key" ); } TEST (SkipListTest, RandomInsertAndRemove) { SkipList skipList; std::unordered_set<std::string> keys; const int num_operations = 10000 ; for (int i = 0 ; i < num_operations; ++i) { std::string key = "key" + std::to_string (rand () % 1000 ); std::string value = "value" + std::to_string (rand () % 1000 ); if (keys.find (key) == keys.end ()) { skipList.put (key, value); keys.insert (key); } else { skipList.remove (key); keys.erase (key); } if (keys.find (key) != keys.end ()) { EXPECT_EQ (skipList.get (key).value (), value); } else { EXPECT_FALSE (skipList.get (key).has_value ()); } } } TEST (SkipListTest, MemorySizeTracking) { SkipList skipList; skipList.put ("key1" , "value1" ); skipList.put ("key2" , "value2" ); size_t expected_size = sizeof ("key1" ) - 1 + sizeof ("value1" ) - 1 + sizeof ("key2" ) - 1 + sizeof ("value2" ) - 1 ; EXPECT_EQ (skipList.get_size (), expected_size); skipList.remove ("key1" ); expected_size -= sizeof ("key1" ) - 1 + sizeof ("value1" ) - 1 ; EXPECT_EQ (skipList.get_size (), expected_size); skipList.clear (); EXPECT_EQ (skipList.get_size (), 0 ); } int main (int argc, char **argv) { ::testing::InitGoogleTest (&argc, argv); return RUN_ALL_TESTS (); }
3.3 构建运行 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 xmake xmake run test_skiplist [==========] Running 8 tests from 1 test suite. [----------] Global test environment set-up. [----------] 8 tests from SkipListTest [ RUN ] SkipListTest.BasicOperations [ OK ] SkipListTest.BasicOperations (0 ms) [ RUN ] SkipListTest.Iterator [ OK ] SkipListTest.Iterator (0 ms) [ RUN ] SkipListTest.LargeScaleInsertAndGet [ OK ] SkipListTest.LargeScaleInsertAndGet (43 ms) [ RUN ] SkipListTest.LargeScaleRemove [ OK ] SkipListTest.LargeScaleRemove (42 ms) [ RUN ] SkipListTest.DuplicateInsert [ OK ] SkipListTest.DuplicateInsert (0 ms) [ RUN ] SkipListTest.EmptySkipList [ OK ] SkipListTest.EmptySkipList (0 ms) [ RUN ] SkipListTest.RandomInsertAndRemove [ OK ] SkipListTest.RandomInsertAndRemove (25 ms) [ RUN ] SkipListTest.MemorySizeTracking [ OK ] SkipListTest.MemorySizeTracking (0 ms) [----------] 8 tests from SkipListTest (111 ms total) [----------] Global test environment tear-down [==========] 8 tests from 1 test suite ran. (111 ms total) [ PASSED ] 8 tests.