Location>code7788 >text

cmu15545 Notes - Query Optimization (Query Optimization)

Popularity:33 ℃/2024-11-19 15:40:33

catalogs
  • summarize
  • Heuristics / Rules
  • Cost-based Search
    • Single relation
    • Mutiple relation
      • Genertive / Bottom-Up
      • Transformation / Top-Down
    • Nested sub-queries
      • Decomposing Queries
      • Expression/Queries Rewriting
  • Statistics

summarize

The execution process of a database system:

image-20241119110634550

Steps designed from the optimizer to the disk:

imageimage

Query optimization is divided into two categories:

  1. Heuristics / Rules: Heuristic, rule-based
  2. Cost-based Search: Based on a cost model. Choosing the least costly path
    1. Single relation: single table
    2. Multiple relation: Multiple tables. Bottom-Up or Top-Down.
    3. Nested sub-queries: nested subqueries. Rewriting, decomposition.

Heuristics / Rules

Basic idea: equivalent transformations in relational algebra.

Predicate underpropagation, projective substitution of Cartesian product, projective underpropagation (not represented in the figure below).

image-20241119121200785

Cost-based Search

Query optimization is NP-hard, so we cannot exhaust all query plans, but only select a subset of them, while the cost cannot be calculated accurately, but only estimated:

  • Logical costs: predicate selectivity, logical complexity of operators or algorithms, intermediate result size

  • Physical cost: IO cost, CPU cost, memory usage, data distribution and storage structure

  • Costs are calculated from statistical data: histograms, snapshots, sampling

  • Search by Bottom-Top and Top-Down

Statistics -> Logical cost -> Physical cost -> Search for the plan with the least physical cost.

End condition: find the least expensive of all the current query plans or reach the time limit.

Single relation

For single-table operations, a simple heuristic optimization (Heuristics) + Reasonable ordering of predicates is often sufficient; it is only necessary to consider which database access method to choose.

  • full table scan
  • Binary lookup (clustered index)
  • index scan
SELECT * 
FROM Employees 
WHERE age > 40 AND department = 'IT';

In this SQL, theage > 40with a selectivity of 0.1.department = 'IT'is 0.9.

  • Both are indexed: separate index scan filters are performed and the result is taken as the intersection.
  • An indexed (e.g.age): Yes.agePerform an index scan to getage>40data, and then use thedepartment = 'IT'Filter the results.
  • Neither is indexed: full table scan, filter with two conditions to get the final result.

Mutiple relation

Genertive / Bottom-Up

The basic idea: by starting with the smallest subset of data (e.g., a single table or intermediate results), progressively merging these subsets and using dynamic programming techniques to find the optimal execution plan for the entire query.

Databases: IBM System R, DB2, MySQL, Postgres, most open-source DBMSs

Example: IBM System R

  1. Determine how table data is accessed, listing possible connections

image-20241119145002129

  1. Searching for minimum cost: dynamic programming + pruning
imageimage
imageimage

![image-20241119145258994](/Users/iven/Library/Application Support/typora-user-images/)

Transformation / Top-Down

The basic idea: starting from a logical query plan, using the branch-and-bound method, gradually converting it to a physical query plan, retaining the optimal solution in the search space, and at the same time directly considering the physical attributes of the data in the planning process.

Database: MSSQL, Greenplum, CockroachDB

There exists a mandatory rule (enforer) for constraints, which is directly pruned if it is not satisfied. The mandatory rule can be an accepted cost floor or a disabled operation, such as a Hash Join when an ordered result is required but a Hash Join is used.

image-20241119150055670

Nested sub-queries

Decomposing Queries

Basic idea: nested subqueries are decomposed into a single query if they are not related to an external query.

![image-20241119150608999](/Users/iven/Library/Application Support/typora-user-images/)

Expression/Queries Rewriting

Basic idea: tautological substitution of queries or expressions.

SELECT * FROM A WHERE 1 = 0; -> false
SELECT * FROM A WHERE NOW() IS NULL; -> false
SELECT * FROM A WHERE false; -> No inquiries

SELECT * FROM A
WHERE val BETWEEN 1 AND 100
	OR val BETWEEN 50 AND 150; -> Where val BETWEEN 1 AND 150
SELECT name FROM sailors AS S
WHERE EXISTS (
	SELECT * FROM reserves AS R
		WHERE  = 
			AND  = '2022-10-25'
); -> 
SELECT name FROM sailors AS S, reserves AS R
WHERE  = 
	AND  = '2022-10-25'

Statistics

The database will be in the internal data catalog (Data Catalog) stores statistical information about tables, attributes, and indexes (Statistics), different systems will be updated at different times.

Called manually:

  • Postgres/SQLite:ANALYZE
  • Oracle/MySQL:ANALYZE TABLE
  • SQL Server:UPDATE STATISTICS
  • DB2:RUNSTATS

Histogram: equal-width histogram; equal-depth histogram

image-20241119152252673

image-20241119152507457

Snapshot (sketch): a lightweight approximate statistical tool suitable for use in big data and real-time streaming scenarios.

  • The ability to dynamically adapt to data changes and significantly reduce storage and computation costs compared to histograms.
  • In selective estimation, snapshots can effectively improve the efficiency of database optimization, especially for high and dynamic data.

Sampling (sampling): when the data size is large, selectivity can be predicted based on samples.

image-20241119153128691