cmu15545 - Data Access Method: B+Tree (B+Tree)

basic concept
- Disk-based B+ trees
- Queries and Indexes
Design Choices
- Node Size
- Merge Thredshold
- Variable-length Keys
- Intra-Node Search (INS)
Optimization tools
- Pointer Swizzling
- Bε-trees
- Bulk Insert
- Prefix Compression
- Deduplication
- Suffix Truncation

basic concept

Disk-based B+ trees

Why use B+ numbers for data access (Access Method):

Naturally organized, supports range finding
Supports traversing all data, utilizing sequential IO
The time complexity is\(O(logn)\)Meet Performance Needs
Compared to B-tree, data accesses are at leaf nodes: high disk space utilization; fewer concurrent conflicts

A basic B+ tree:

Three types of borrowing points: root nodes, intermediate nodes, and leaf nodes
Data distribution: root and intermediate nodes store only indexes, leaf nodes store data
Pointer relationships: parent-child pointers, brother pointers

Disk-based B+ tree mapping:

A node is stored in a Heap File page; the PageId acts in place of a pointer.

key-value union storage
The keys are stored separately

The leaf nodes of a B+ tree store the actual data, and how this data is interpreted depends on different database implementations: some store RecordIDs, and some databases based on Index-Organized Storage (IOS) store tuple (tuple) data directly.

If you don't know about RecordID, the way the data is organized, see theThis blog post。

Queries and Indexes

leftmost prefix matching

With union index<a,b,c>The following query conditions are supported

(a=1 AND b=2 AND c=3)
(a=1 AND b=2)

If all do not satisfy the leftmost prefix match principle, a full table scan is required.

How to handle duplicate keys

Add RecordID to make it a unique key
Leaf node overflow (not used in actual systems)

clustered index

A table can only have one clustered index
Index keys and values are stored together
The data is sorted by the key of the index
Synchronize indexes when manipulating data

Clustered indexes are non-essential, depending on the database specific implementation, data in Mysql and SQLite is organized directly with clustered indexes.

Implementing clustered indexes with B+ trees makes it easy to implement range queries and facilitates the full utilization of sequential IO.

For non-clustered indexes, although the keys of the index are ordered, the corresponding data are not necessarily stored in order on disk, so a very effective way is to get the PageID first, then sort according to the PageID, and finally get the data to take full advantage of sequential IO.

Design Choices

Node Size

The slower the storage device reads data and the more it needs to utilize sequential IO, the larger the node will be;

The faster the storage device reads data and the more redundant data reads need to be reduced, the smaller the node will be.

HDD：～1MB
SSD：~10KB
In-Memory：~512B

Merge Thredshold

When the number of keys in a node is less than half-full, the merge is not performed immediately, but small nodes are allowed to exist before the whole tree is periodically rebuilt.

PostgreSQL calls it an unbalanced B+ tree ("non-balanced" B+Tree, nbtree).

Variable-length Keys

Pointer: key stores a pointer to the actual data [can't utilize sequential IO because you have to jump to read the contents of the pointer]
variable length node (math.)
Padding data (Padding)

Index data in the actual system and the heap file data, is able to store the node on the node in the store, is in the store does not store pointers.

Intra-Node Search (INS)

Linear lookup: due to the existence of the SIMD instruction set, the actual sequential lookup, which can actually be batch processing
binary search
Interpolation: when the key has no gap (self-incrementing), the offset can be calculated directly

Optimization tools

Pointer Swizzling

The basic idea: when an object is loaded from disk into memory, its disk address is converted to a memory address (swizzling) in order for the program to be accessed directly in memory by the pointer.

Example: For example, the root node of a B+ tree indexed by the primary key will be pinned and not replaced by the Buffer Pool when it is read, so at this point you can access the root node directly with a memory pointer, omitting the step of asking the Buffer Pool for the memory address using the PageID.

Illustration:

Bε-trees

A B+-tree writing optimization.

Basic idea: Instead of modifying data directly when updating, logging is done (similar to log-structured data storage).

The log is recorded on the node, and when the node log is full, the log for that node is pushed down to the child node.

Bulk Insert

Basic idea: create a B+ tree from bottom to top, not top to bottom.

Reduces structural changes in the tree during insertion, provided that the data needs to be pre-sorted.

Keys: 3, 7, 9, 13, 6, 1
Sorted Keys: 1, 3, 6, 7, 9, 13

Prefix Compression

Basic idea: dictionary ordering compresses prefixes.

Deduplication

Basic idea: avoid storing the same key repeatedly in non-unique indexes.

Suffix Truncation

The basic idea: the intermediate nodes just act as guides, so it is sufficient to store the smallest prefix that is recognizable.