-
basic concept
- Disk-based B+ trees
- Queries and Indexes
-
Design Choices
- Node Size
- Merge Thredshold
- Variable-length Keys
- Intra-Node Search (INS)
-
Optimization tools
- Pointer Swizzling
- Bε-trees
- Bulk Insert
- Prefix Compression
- Deduplication
- Suffix Truncation
basic concept
Disk-based B+ trees
Why use B+ numbers for data access (Access Method):
- Naturally organized, supports range finding
- Supports traversing all data, utilizing sequential IO
- The time complexity is\(O(logn)\)Meet Performance Needs
- Compared to B-tree, data accesses are at leaf nodes: high disk space utilization; fewer concurrent conflicts
A basic B+ tree:
- Three types of borrowing points: root nodes, intermediate nodes, and leaf nodes
- Data distribution: root and intermediate nodes store only indexes, leaf nodes store data
- Pointer relationships: parent-child pointers, brother pointers
Disk-based B+ tree mapping:
A node is stored in a Heap File page; the PageId acts in place of a pointer.
-
key-value union storage
-
The keys are stored separately
The leaf nodes of a B+ tree store the actual data, and how this data is interpreted depends on different database implementations: some store RecordIDs, and some databases based on Index-Organized Storage (IOS) store tuple (tuple) data directly.
If you don't know about RecordID, the way the data is organized, see theThis blog post。
Queries and Indexes
leftmost prefix matching
With union index<a,b,c>
The following query conditions are supported
(a=1 AND b=2 AND c=3)
(a=1 AND b=2)
If all do not satisfy the leftmost prefix match principle, a full table scan is required.
How to handle duplicate keys
-
Add RecordID to make it a unique key
-
Leaf node overflow (not used in actual systems)
clustered index
- A table can only have one clustered index
- Index keys and values are stored together
- The data is sorted by the key of the index
- Synchronize indexes when manipulating data
Clustered indexes are non-essential, depending on the database specific implementation, data in Mysql and SQLite is organized directly with clustered indexes.
Implementing clustered indexes with B+ trees makes it easy to implement range queries and facilitates the full utilization of sequential IO.
For non-clustered indexes, although the keys of the index are ordered, the corresponding data are not necessarily stored in order on disk, so a very effective way is to get the PageID first, then sort according to the PageID, and finally get the data to take full advantage of sequential IO.
Design Choices
Node Size
The slower the storage device reads data and the more it needs to utilize sequential IO, the larger the node will be;
The faster the storage device reads data and the more redundant data reads need to be reduced, the smaller the node will be.
- HDD:~1MB
- SSD:~10KB
- In-Memory:~512B
Merge Thredshold
When the number of keys in a node is less than half-full, the merge is not performed immediately, but small nodes are allowed to exist before the whole tree is periodically rebuilt.
PostgreSQL calls it an unbalanced B+ tree ("non-balanced" B+Tree, nbtree).
Variable-length Keys
- Pointer: key stores a pointer to the actual data [can't utilize sequential IO because you have to jump to read the contents of the pointer]
- variable length node (math.)
- Padding data (Padding)
Index data in the actual system and the heap file data, is able to store the node on the node in the store, is in the store does not store pointers.
Intra-Node Search (INS)
-
Linear lookup: due to the existence of the SIMD instruction set, the actual sequential lookup, which can actually be batch processing
-
binary search
-
Interpolation: when the key has no gap (self-incrementing), the offset can be calculated directly
Optimization tools
Pointer Swizzling
The basic idea: when an object is loaded from disk into memory, its disk address is converted to a memory address (swizzling) in order for the program to be accessed directly in memory by the pointer.
Example: For example, the root node of a B+ tree indexed by the primary key will be pinned and not replaced by the Buffer Pool when it is read, so at this point you can access the root node directly with a memory pointer, omitting the step of asking the Buffer Pool for the memory address using the PageID.
Illustration:
Bε-trees
A B+-tree writing optimization.
Basic idea: Instead of modifying data directly when updating, logging is done (similar to log-structured data storage).
The log is recorded on the node, and when the node log is full, the log for that node is pushed down to the child node.
Bulk Insert
Basic idea: create a B+ tree from bottom to top, not top to bottom.
Reduces structural changes in the tree during insertion, provided that the data needs to be pre-sorted.
Keys: 3, 7, 9, 13, 6, 1
Sorted Keys: 1, 3, 6, 7, 9, 13
Prefix Compression
Basic idea: dictionary ordering compresses prefixes.
Deduplication
Basic idea: avoid storing the same key repeatedly in non-unique indexes.
Suffix Truncation
The basic idea: the intermediate nodes just act as guides, so it is sufficient to store the smallest prefix that is recognizable.