Cost Estimation

Cost estimation is the process of predicting the execution cost of a query plan without actually running it. The query optimizer uses cost estimates to compare alternativ

Overview

Cost estimation is the process of predicting the execution cost of a query plan without actually running it. The query optimizer uses cost estimates to compare alternative plans and select the cheapest one.

The primary cost metric is number of disk block transfers (I/Os), since disk access is orders of magnitude slower than CPU operations.

Notation

Symbol	Meaning
n(R)	Number of tuples in relation R
b(R)	Number of blocks occupied by R
bfr(R)	Blocking factor — tuples per block
V(A, R)	Number of distinct values of attribute A in R
h_i	Height of B+ tree index
l_i	Number of leaf blocks in B+ tree index

Cost of Basic Operations

1. Linear (Full) Scan

Reads every block of the relation.

Cost = b(R) block reads Example: R has 10,000 tuples, 10 tuples/block b(R) = 10,000 / 10 = 1,000 blocks Cost = 1,000 block reads

2. Binary Search (on sorted file)

Cost = ⌈log₂(b(R))⌉ + m − 1 Where m = number of blocks containing matching records For a unique key search (m = 1): Cost = ⌈log₂(b(R))⌉ Example: b(R) = 1,000 Cost = ⌈log₂(1000)⌉ = 10 block reads

3. Primary Index (B+ Tree) on Key Attribute

Equality search: Cost = h_i + 1 (traverse tree height + 1 leaf/data block access) Example: B+ tree height = 3 Cost = 3 + 1 = 4 block reads For n(R) = 1,000,000, order 200 B+ tree: h_i = ⌈log₁₀₀(1,000,000)⌉ = 3 Cost = 3 + 1 = 4 (vs. 1,000,000 / 200 = 5,000 for full scan)

4. Secondary Index on Non-Key Attribute

Worst case: each matching tuple is in a different block	very expensive
Example	V(DeptID, Employee) = 10 departments, n(R) = 10,000
If each is in a different block	Cost = 3 + 1,000 = 1,003 block reads

5. Selection Cost with Simple Predicate

Using primary B+ tree index	h_i + 1
Using secondary index	h_i + n(R)/V(A,R) (worst case)
Full table scan	b(R) / 2 (average for equality, unsorted)
Binary search (sorted file)	⌈log₂(b(R))⌉ + n(R)/(2×bfr(R)×V(A,R))

6. Sorting Cost (External Merge Sort)

Example	b(R) = 1,000, M = 50 buffer pages
Runs after Phase 1	⌈1000/50⌉ = 20 runs
Merge passes	⌈log₄₉(20)⌉ = 1 pass
Total	2 × 1,000 × 2 = 4,000 block I/Os

7. Join Cost Estimation

Nested Loop Join

R ⋈_θ S (R as outer, S as inner)

Cost = b(R) + n(R) × b(S) (block-level nested loop)

Better — Block Nested Loop:

Cost = b(R) + ⌈b(R)/(M−2)⌉ × b(S)

Where M = available memory buffer pages

Sort-Merge Join

Cost = Sort(R) + Sort(S) + (b(R) + b(S)) (final merge pass) ≈ 3 × (b(R) + b(S)) (rough estimate with enough memory)

Hash Join

Phase 1 (Partition)	2 × (b(R) + b(S))
Phase 2 (Probe)	b(R) + b(S)
Total	3 × (b(R) + b(S))
Requirement	smaller relation must fit in memory after partitioning

Cardinality Estimation for Join Result

|R ⋈_{R.A=S.B} S| = n(R) × n(S) / max(V(A,R), V(B,S))

Assumptions:

- Uniform distribution

- Attribute independence

Example:

n(Employee) = 10,000

n(Department) = 100

V(DeptID, Employee) = 100

V(DeptID, Department) = 100 (PK, so = n(Department))

|Employee ⋈ Department| = 10,000 × 100 / max(100, 100)

= 10,000 ✓ (each employee → 1 dept)

Statistics Maintenance

Statistics are stored in the system catalog and updated by

1. Automatic: DBMS updates stats periodically

2. Manual: DBA runs ANALYZE / UPDATE STATISTICS

SQL commands

ANALYZE TABLE Employee; -- MySQL

ANALYZE Employee; -- PostgreSQL

UPDATE STATISTICS Employee; -- SQL Server

Outdated stats → poor plan choices → slow queries!

Best practice: Run ANALYZE after bulk loads or major data changes.

Cost Summary Table

Operation	Cost (block I/Os)
Full table scan	b(R)
B+ tree equality (primary)	h_i + 1
B+ tree equality (secondary)	h_i + n(R)/V(A,R)
Binary search (sorted)	⌈log₂(b(R))⌉
External sort	2 × b(R) × (passes + 1)
Hash join	3 × (b(R) + b(S))
Sort-merge join	≈ 3 × (b(R) + b(S))
Block nested loop join	b(R) + b(R)/M × b(S)

Overview

The primary cost metric is number of disk block transfers (I/Os), since disk access is orders of magnitude slower than CPU operations.

Notation

Symbol	Meaning
n(R)	Number of tuples in relation R
b(R)	Number of blocks occupied by R
bfr(R)	Blocking factor — tuples per block
V(A, R)	Number of distinct values of attribute A in R
h_i	Height of B+ tree index
l_i	Number of leaf blocks in B+ tree index

Cost of Basic Operations

1. Linear (Full) Scan

Reads every block of the relation.

Cost = b(R) block reads Example: R has 10,000 tuples, 10 tuples/block b(R) = 10,000 / 10 = 1,000 blocks Cost = 1,000 block reads

2. Binary Search (on sorted file)

3. Primary Index (B+ Tree) on Key Attribute

4. Secondary Index on Non-Key Attribute

Worst case: each matching tuple is in a different block	very expensive
Example	V(DeptID, Employee) = 10 departments, n(R) = 10,000
If each is in a different block	Cost = 3 + 1,000 = 1,003 block reads

5. Selection Cost with Simple Predicate

Using primary B+ tree index	h_i + 1
Using secondary index	h_i + n(R)/V(A,R) (worst case)
Full table scan	b(R) / 2 (average for equality, unsorted)
Binary search (sorted file)	⌈log₂(b(R))⌉ + n(R)/(2×bfr(R)×V(A,R))

6. Sorting Cost (External Merge Sort)

Example	b(R) = 1,000, M = 50 buffer pages
Runs after Phase 1	⌈1000/50⌉ = 20 runs
Merge passes	⌈log₄₉(20)⌉ = 1 pass
Total	2 × 1,000 × 2 = 4,000 block I/Os

7. Join Cost Estimation

Nested Loop Join

R ⋈_θ S (R as outer, S as inner)

Cost = b(R) + n(R) × b(S) (block-level nested loop)

Better — Block Nested Loop:

Cost = b(R) + ⌈b(R)/(M−2)⌉ × b(S)

Where M = available memory buffer pages

Sort-Merge Join

Cost = Sort(R) + Sort(S) + (b(R) + b(S)) (final merge pass) ≈ 3 × (b(R) + b(S)) (rough estimate with enough memory)

Hash Join

Phase 1 (Partition)	2 × (b(R) + b(S))
Phase 2 (Probe)	b(R) + b(S)
Total	3 × (b(R) + b(S))
Requirement	smaller relation must fit in memory after partitioning

Cardinality Estimation for Join Result

|R ⋈_{R.A=S.B} S| = n(R) × n(S) / max(V(A,R), V(B,S))

Assumptions:

- Uniform distribution

- Attribute independence

Example:

n(Employee) = 10,000

n(Department) = 100

V(DeptID, Employee) = 100

V(DeptID, Department) = 100 (PK, so = n(Department))

|Employee ⋈ Department| = 10,000 × 100 / max(100, 100)

= 10,000 ✓ (each employee → 1 dept)

Statistics Maintenance

Statistics are stored in the system catalog and updated by

1. Automatic: DBMS updates stats periodically

2. Manual: DBA runs ANALYZE / UPDATE STATISTICS

SQL commands

ANALYZE TABLE Employee; -- MySQL

ANALYZE Employee; -- PostgreSQL

UPDATE STATISTICS Employee; -- SQL Server

Outdated stats → poor plan choices → slow queries!

Best practice: Run ANALYZE after bulk loads or major data changes.

Cost Summary Table

Operation	Cost (block I/Os)
Full table scan	b(R)
B+ tree equality (primary)	h_i + 1
B+ tree equality (secondary)	h_i + n(R)/V(A,R)
Binary search (sorted)	⌈log₂(b(R))⌉
External sort	2 × b(R) × (passes + 1)
Hash join	3 × (b(R) + b(S))
Sort-merge join	≈ 3 × (b(R) + b(S))
Block nested loop join	b(R) + b(R)/M × b(S)

Cost Estimation

Overview

Notation

Cost of Basic Operations

1. Linear (Full) Scan

2. Binary Search (on sorted file)

3. Primary Index (B+ Tree) on Key Attribute

4. Secondary Index on Non-Key Attribute

5. Selection Cost with Simple Predicate

6. Sorting Cost (External Merge Sort)

7. Join Cost Estimation

Nested Loop Join

Sort-Merge Join

Hash Join

Cardinality Estimation for Join Result

Statistics Maintenance

Statistics are stored in the system catalog and updated by

SQL commands

Cost Summary Table

Continue learning this concept

Cost Estimation

Overview

Notation

Cost of Basic Operations

1. Linear (Full) Scan

2. Binary Search (on sorted file)

3. Primary Index (B+ Tree) on Key Attribute

4. Secondary Index on Non-Key Attribute

5. Selection Cost with Simple Predicate

6. Sorting Cost (External Merge Sort)

7. Join Cost Estimation

Nested Loop Join

Sort-Merge Join

Hash Join

Cardinality Estimation for Join Result

Statistics Maintenance

Statistics are stored in the system catalog and updated by

SQL commands

Cost Summary Table

Continue learning this concept