Query Optimization

Query optimization is the process of selecting the most efficient execution plan from among all possible plans for a given SQL query. The optimizer's goal is to minimize

What is Query Optimization?

Query optimization is the process of selecting the most efficient execution plan from among all possible plans for a given SQL query. The optimizer's goal is to minimize the total cost (usually measured in disk I/O operations).

Plan 1	(A ⋈ B) ⋈ C — join A and B first, then with C
Plan 2	A ⋈ (B ⋈ C) — join B and C first, then with A
Plan 3	(A ⋈ C) ⋈ B — join A and C first (if possible), then B
(n tables	n! possible join orders, but smart pruning reduces this)

Types of Query Optimization

1. Rule-Based (Heuristic) Optimization

Apply algebraic transformation rules to rewrite the query into a more efficient form, without calculating actual costs.

Key Heuristic Rules:

Rule 1	Push selections as early as possible (reduce rows early)
BEFORE	σ_{salary>70000}(Employee ⋈ Department)
AFTER	σ_{salary>70000}(Employee) ⋈ Department
Rule 2	Push projections as early as possible (reduce columns)
BEFORE	π_{name, dept_name}(Employee ⋈ Department)
AFTER	π_{name, dept_id}(Employee) ⋈ π_{dept_id, dept_name}(Department)
Rule 3	Most restrictive selection first
Rule 4	Join order — smaller intermediate results first
Rule 5	Use indexes when available
If an index exists on the WHERE predicate column	use it

2. Cost-Based Optimization

Generate multiple candidate plans and estimate the cost of each. Choose the minimum-cost plan.

Cost Model

Total Cost = Disk I/O Cost + CPU Cost + Memory Cost

Disk I/O dominates for large data → most optimizers minimize I/O

Estimate for each plan

- Number of disk block reads for each operator

- Size (cardinality) of intermediate results

- Available memory for sorting / hashing

Cost Estimation Inputs

The optimizer uses statistics stored in the system catalog:

For each table T

- n(T) = number of tuples (rows)

- b(T) = number of blocks

- bfr(T) = blocking factor (rows per block)

- V(A, T) = number of distinct values of attribute A

- min/max values of each attribute

- Histogram of value distributions

For each index I

- Height of B+ tree (number of levels)

- Number of leaf blocks

Selectivity and Cardinality Estimation

Selectivity of a predicate = fraction of tuples that satisfy it.

Selectivity Formulas

Equality: A = v

sel = 1 / V(A, T) (assume uniform distribution)

Range: A > v

sel = (max(A) - v) / (max(A) - min(A))

Conjunction: P1 AND P2

sel = sel(P1) × sel(P2) (independence assumption)

Disjunction: P1 OR P2

sel = sel(P1) + sel(P2) - sel(P1) × sel(P2)

Expected output cardinality

|σ_P(T)| = sel(P) × n(T)

Join Cardinality Estimation

|R ⋈ S| (natural join on attribute A):

= |R| × |S| / max(V(A,R), V(A,S))

Intuition: each tuple in R matches |S| / V(A,S) tuples in S on average.

For foreign key join:

|Employee ⋈ Department| = |Employee|

(each employee belongs to exactly one department)

Plan Enumeration

For 2-3 Tables: Enumerate All Plans

Plans	A ⋈ B or B ⋈ A (with different algorithms for each)
Plans	(A⋈B)⋈C, (A⋈C)⋈B, (B⋈C)⋈A, A⋈(B⋈C), B⋈(A⋈C), C⋈(A⋈B)
For each	choose join algorithm (NL, sort-merge, hash)

For Many Tables: Dynamic Programming

Selinger's Dynamic Programming Algorithm: 1. Compute best plan for each single table (access path) 2. For each pair of tables, compute best join plan 3. Build on best pairs to get best triples 4. Continue until all tables are joined Prune dominated plans at each step. Time: O(3^n) with pruning, much better than O(n!) full enumeration

Execution Plan Output (EXPLAIN)

sql exampleWoHoTech

EXPLAIN SELECT e.name, d.dept_name
FROM Employee e
JOIN Department d ON e.dept_id = d.dept_id
WHERE e.salary > 70000
ORDER BY e.name;

-- Example Output (MySQL):
+----+--------+------------+--------+------+------+--------+
| id | type   | table      | key    | rows | filt | Extra  |
+----+--------+------------+--------+------+------+--------+
|  1 | ALL    | d          | NULL   |   10 |  100 | NULL   |
|  1 | ref    | e          | idx_d  | 5000 |   33 | Using  |
|    |        |            |        |      |      | where; |
|    |        |            |        |      |      | filesort|
+----+--------+------------+--------+------+------+--------+

-- Interpretation:
-- d (Department): full table scan (small table, OK)
-- e (Employee): index ref on dept_id, filter salary > 70000
-- filesort: ORDER BY requires sorting (no covering index)