cmu15545 Notes - Query Optimization (Query Optimization)

summarize
Heuristics / Rules
Cost-based Search
- Single relation
- Mutiple relation
  - Genertive / Bottom-Up
  - Transformation / Top-Down
- Nested sub-queries
  - Decomposing Queries
  - Expression/Queries Rewriting
Statistics

summarize

The execution process of a database system:

Steps designed from the optimizer to the disk:

Query optimization is divided into two categories:

Heuristics / Rules: Heuristic, rule-based
Cost-based Search: Based on a cost model. Choosing the least costly path
1. Single relation: single table
2. Multiple relation: Multiple tables. Bottom-Up or Top-Down.
3. Nested sub-queries: nested subqueries. Rewriting, decomposition.

Heuristics / Rules

Basic idea: equivalent transformations in relational algebra.

Predicate underpropagation, projective substitution of Cartesian product, projective underpropagation (not represented in the figure below).

Cost-based Search

Query optimization is NP-hard, so we cannot exhaust all query plans, but only select a subset of them, while the cost cannot be calculated accurately, but only estimated:

Logical costs: predicate selectivity, logical complexity of operators or algorithms, intermediate result size
Physical cost: IO cost, CPU cost, memory usage, data distribution and storage structure
Costs are calculated from statistical data: histograms, snapshots, sampling
Search by Bottom-Top and Top-Down

Statistics -> Logical cost -> Physical cost -> Search for the plan with the least physical cost.

End condition: find the least expensive of all the current query plans or reach the time limit.

Single relation

For single-table operations, a simple heuristic optimization (Heuristics) + Reasonable ordering of predicates is often sufficient; it is only necessary to consider which database access method to choose.

full table scan
Binary lookup (clustered index)
index scan

SELECT * 
FROM Employees 
WHERE age > 40 AND department = 'IT';

In this SQL, theage > 40with a selectivity of 0.1.department = 'IT'is 0.9.

Both are indexed: separate index scan filters are performed and the result is taken as the intersection.
An indexed (e.g.age): Yes.agePerform an index scan to getage>40data, and then use thedepartment = 'IT'Filter the results.
Neither is indexed: full table scan, filter with two conditions to get the final result.

Mutiple relation

Genertive / Bottom-Up

The basic idea: by starting with the smallest subset of data (e.g., a single table or intermediate results), progressively merging these subsets and using dynamic programming techniques to find the optimal execution plan for the entire query.

Databases: IBM System R, DB2, MySQL, Postgres, most open-source DBMSs

Example: IBM System R

Determine how table data is accessed, listing possible connections

Searching for minimum cost: dynamic programming + pruning

![image-20241119145258994](/Users/iven/Library/Application Support/typora-user-images/)

Transformation / Top-Down

The basic idea: starting from a logical query plan, using the branch-and-bound method, gradually converting it to a physical query plan, retaining the optimal solution in the search space, and at the same time directly considering the physical attributes of the data in the planning process.

Database: MSSQL, Greenplum, CockroachDB

There exists a mandatory rule (enforer) for constraints, which is directly pruned if it is not satisfied. The mandatory rule can be an accepted cost floor or a disabled operation, such as a Hash Join when an ordered result is required but a Hash Join is used.

Nested sub-queries

Decomposing Queries

Basic idea: nested subqueries are decomposed into a single query if they are not related to an external query.

![image-20241119150608999](/Users/iven/Library/Application Support/typora-user-images/)

Expression/Queries Rewriting

Basic idea: tautological substitution of queries or expressions.

SELECT * FROM A WHERE 1 = 0; -> false
SELECT * FROM A WHERE NOW() IS NULL; -> false
SELECT * FROM A WHERE false; -> No inquiries

SELECT * FROM A
WHERE val BETWEEN 1 AND 100
	OR val BETWEEN 50 AND 150; -> Where val BETWEEN 1 AND 150

SELECT name FROM sailors AS S
WHERE EXISTS (
	SELECT * FROM reserves AS R
		WHERE  = 
			AND  = '2022-10-25'
); -> 
SELECT name FROM sailors AS S, reserves AS R
WHERE  = 
	AND  = '2022-10-25'

Statistics

The database will be in the internal data catalog (Data Catalog) stores statistical information about tables, attributes, and indexes (Statistics), different systems will be updated at different times.

Called manually:

Postgres/SQLite：ANALYZE
Oracle/MySQL：ANALYZE TABLE
SQL Server：UPDATE STATISTICS
DB2：RUNSTATS

Histogram: equal-width histogram; equal-depth histogram

Snapshot (sketch): a lightweight approximate statistical tool suitable for use in big data and real-time streaming scenarios.

The ability to dynamically adapt to data changes and significantly reduce storage and computation costs compared to histograms.
In selective estimation, snapshots can effectively improve the efficiency of database optimization, especially for high and dynamic data.

Sampling (sampling): when the data size is large, selectivity can be predicted based on samples.