cmu15545 Notes - Join Algorithms (Join Algorithms)

Overview
Nested Loop Join
- Naïve
- Block
- Index
Sort-Merge Join
Hash Join
- Simple Hash Join
- Partition Hash Join
summarize

Overview

Form of output: early materialization vs. late materialization (OLAP is generally late materialization)

Cost analysis: generally calculated using the number of IOs (the final result may or may not fall on the disk, so we only count the number of IOs before the output result).

JoinThe left side is called the outer table (Outer Table) and the right side is called the inner table (Inner Join), the outer table is usually a small table.

Nested Loop Join

Naïve

Prerequisites: Buffer size 3, one outer table input, one inner table input, and one output.

Basic idea: a double loop, pairing each tuple (Tuple) and reading the S-table m times.

Cost：$M+(m*N)$

Block

Prerequisites: Buffer size 3, one outer table input, one inner table input, and one output.

Basic idea: double loop, pairing within each block (Block, same page Page), so read S-table M times.

Cost：$M+(M*N)$

If the buffer capacity is B, i.e., it can hold B blocks (pages), B-2 blocks for outer table input, one block for inner table input, and one block for output.

Cost：$M+(⌈M/(B-2)⌉*N)$

Index

Prerequisites: Buffer size 3, one outer table input, one inner table input, and one output.

The basic idea: if the external table has an index, then the inner loop doesn't need to traverse, just query the index.

Cost：$M+(m*C)$

Sort-Merge Join

Basic idea: sorted sequences make it easier to find matches.

There are two steps:

Sort: Sort R and S using any sort.
Merge: move two pointers to find a match, may need to back out pointers during the process.

These two steps are the same as those mentioned in the previous sectionouter subsumption sortSame idea, but not the same thing.

SortCost(R)：$2M*(1 + ⌈ log_{B-1} ⌈M / B⌉ ⌉)$

SortCost(S)：$2N*(1 + ⌈ log_{B-1} ⌈N / B⌉ ⌉)$

MergeCost：$M+N$

Total Cost：Sort + Merge

When the same element is stored in R and also in S, the pointer needs to go all the way back, and the Sort-Merge Join degenerates into a Nest Loop Join.

Hash Join

Simple Hash Join

Basic idea: matches are mapped to the same hash bucket.

There are two steps:

Constructing a hash table: applying a hash function to the R-table$h_1$Perform a hash to get a hash table, containing different hash buckets (different hash tables can be used, but thechain hash(most in line with needs).
Detection: put the S-table tuple in a hash function$h_1$Perform a hash to get the corresponding hash bucket location and then look for matches in the hash bucket.

Optimization measures: Bloom filters.

Bloom filters are constructed incidentally when the hash table is created, and the probing phase goes through the Bloom filter before the hash bucket.

Problems i: The algorithm needs to ensure that the hash table can exist in memory, if the hash table is too large to be stored in memory, it needs to be constantly swapped in and out, which affects efficiency. But unfortunately, most of the time, we can't guarantee that the memory can fully store the hash table.

Partition Hash Join

The basic idea: the two tables were hashed with the same hash function, the same hash bucket between the pairing, if the hash bucket can not be stored, it will be hashed again until it can be stored.

Just read the corresponding hash buckets into memory for pairing.

Partition Cost：$2(M+N)$ [Read data + hash bucket drop disk (hash space complexity is$O(n)$）】

Probe Cost：$M+N$

Total Cost：$3(M+N)$

summarize

Algorithm	IO Cost	Example
Naïve Nested Loop Join	$M + (m \cdot N)$	1.3 hours
Block Nested Loop Join	$ M + (\lceil \frac{M}{B-2} \rceil \cdot N)$	0.55 seconds
Index Nested Loop Join	$M + (m \cdot C)$	Variable
Sort-Merge Join	$ M + N + (\text{sort cost})$	0.75 seconds
Hash Join	$3 \cdot (M + N)$	0.45 seconds

Select Partition Hash Join and use Sort-Merge Join when the following situation occurs:

Severe Data Skew: Hash Join Degraded to Sort-Merge Join
The data itself needs to be sorted: at this point the Sort-Merge Join only requires an additional cost of$M+N$ You can realize Join

Hash Join and Sort-Merge Join are implemented in general databases.

Algorithm	IO Cost	Example
Naïve Nested Loop Join	\(M + (m \cdot N)\)	1.3 hours
Block Nested Loop Join	$ M + (\lceil \frac{M}{B-2} \rceil \cdot N)$	0.55 seconds
Index Nested Loop Join	\(M + (m \cdot C)\)	Variable
Sort-Merge Join	$ M + N + (\text{sort cost})$	0.75 seconds
Hash Join	\(3 \cdot (M + N)\)	0.45 seconds