- Overview
-
Nested Loop Join
- Naïve
- Block
- Index
- Sort-Merge Join
-
Hash Join
- Simple Hash Join
- Partition Hash Join
- summarize
Overview
Form of output: early materialization vs. late materialization (OLAP is generally late materialization)
Cost analysis: generally calculated using the number of IOs (the final result may or may not fall on the disk, so we only count the number of IOs before the output result).
JoinThe left side is called the outer table (Outer Table) and the right side is called the inner table (Inner Join), the outer table is usually a small table.
Nested Loop Join
Naïve
Prerequisites: Buffer size 3, one outer table input, one inner table input, and one output.
Basic idea: a double loop, pairing each tuple (Tuple) and reading the S-table m times.
Cost:\(M+(m*N)\)
Block
Prerequisites: Buffer size 3, one outer table input, one inner table input, and one output.
Basic idea: double loop, pairing within each block (Block, same page Page), so read S-table M times.
Cost:\(M+(M*N)\)
If the buffer capacity is B, i.e., it can hold B blocks (pages), B-2 blocks for outer table input, one block for inner table input, and one block for output.
Cost:\(M+(⌈M/(B-2)⌉*N)\)
Index
Prerequisites: Buffer size 3, one outer table input, one inner table input, and one output.
The basic idea: if the external table has an index, then the inner loop doesn't need to traverse, just query the index.
Cost:\(M+(m*C)\)
Sort-Merge Join
Basic idea: sorted sequences make it easier to find matches.
There are two steps:
- Sort: Sort R and S using any sort.
- Merge: move two pointers to find a match, may need to back out pointers during the process.
These two steps are the same as those mentioned in the previous sectionouter subsumption sortSame idea, but not the same thing.
SortCost(R):\(2M*(1 + ⌈ log_{B-1} ⌈M / B⌉ ⌉)\)
SortCost(S):\(2N*(1 + ⌈ log_{B-1} ⌈N / B⌉ ⌉)\)
MergeCost:\(M+N\)
Total Cost:Sort + Merge
When the same element is stored in R and also in S, the pointer needs to go all the way back, and the Sort-Merge Join degenerates into a Nest Loop Join.
Hash Join
Simple Hash Join
Basic idea: matches are mapped to the same hash bucket.
There are two steps:
- Constructing a hash table: applying a hash function to the R-table\(h_1\)Perform a hash to get a hash table, containing different hash buckets (different hash tables can be used, but thechain hash(most in line with needs).
- Detection: put the S-table tuple in a hash function\(h_1\)Perform a hash to get the corresponding hash bucket location and then look for matches in the hash bucket.
Optimization measures: Bloom filters.
Bloom filters are constructed incidentally when the hash table is created, and the probing phase goes through the Bloom filter before the hash bucket.
Problems i: The algorithm needs to ensure that the hash table can exist in memory, if the hash table is too large to be stored in memory, it needs to be constantly swapped in and out, which affects efficiency. But unfortunately, most of the time, we can't guarantee that the memory can fully store the hash table.
Partition Hash Join
The basic idea: the two tables were hashed with the same hash function, the same hash bucket between the pairing, if the hash bucket can not be stored, it will be hashed again until it can be stored.
Just read the corresponding hash buckets into memory for pairing.
Partition Cost:\(2(M+N)\) [Read data + hash bucket drop disk (hash space complexity is\(O(n)\))】
Probe Cost:\(M+N\)
Total Cost:\(3(M+N)\)
summarize
Algorithm | IO Cost | Example |
---|---|---|
Naïve Nested Loop Join | \(M + (m \cdot N)\) | 1.3 hours |
Block Nested Loop Join | $ M + (\lceil \frac{M}{B-2} \rceil \cdot N)$ | 0.55 seconds |
Index Nested Loop Join | \(M + (m \cdot C)\) | Variable |
Sort-Merge Join | $ M + N + (\text{sort cost})$ | 0.75 seconds |
Hash Join | \(3 \cdot (M + N)\) | 0.45 seconds |
Select Partition Hash Join and use Sort-Merge Join when the following situation occurs:
-
Severe Data Skew: Hash Join Degraded to Sort-Merge Join
-
The data itself needs to be sorted: at this point the Sort-Merge Join only requires an additional cost of\(M+N\) You can realize Join
Hash Join and Sort-Merge Join are implemented in general databases.