File Organization

File organization refers to how data records are physically stored on disk. The choice of file organization affects the speed of data retrieval, insertion, deletion, and

Overview

File organization refers to how data records are physically stored on disk. The choice of file organization affects the speed of data retrieval, insertion, deletion, and update operations.

Basic Storage Concepts

Disk Storage Hierarchy

┌────────────────────────────────────────────────────┐

│ Database File │

│ ├── Segment (tablespace) │

│ │ ├── Extent (group of pages) │

│ │ │ ├── Page / Block (8KB–16KB) │

│ │ │ │ └── Records (rows/tuples) │

│ │ │ │ └── Fields (columns/attributes)│

└────────────────────────────────────────────────────┘

Disk I/O is the bottleneck

Reading one page from disk ≈ 5–10ms

Reading from RAM ≈ nanoseconds

Goal: Minimize disk I/O for all operations

1. Heap File (Unordered File)

Records are stored in no particular order — new records are appended wherever space is available.

│ Page 1	[R3][R7][R1][R9] │
│ Page 2	[R5][R2][R8][R4] │
│ Page 3	[R6][R10][ empty ][ empty ] │

Operation	Cost
Insert	O(1) — append to last page
Search (no index)	O(N) — full scan
Delete	Mark as deleted; space reclaimed later
Update	Find record + modify

Use case: Bulk loading data, temporary tables, when full-scan queries are common.

2. Sequential (Ordered) File

Records stored in sorted order based on a key attribute (the ordering field/key).

│ Page 1	[E101][E102][E103][E104] │
│ Page 2	[E105][E106][E107][E108] │
│ Page 3	[E109][E110] │

Operation	Cost
Search by ordering key	O(log N) — binary search
Insert	O(N) — must maintain order; expensive
Range queries	Efficient — physically contiguous

Use case: Range queries, reports, when sorted access is common.

3. Hashed File

Records stored based on a hash function applied to a key attribute. The hash function maps a key value to a bucket (page).

Hash Function	bucket = hash(EmpID) % 4
Bucket 0	[E104][E108]
Bucket 1	[E101][E105][E109]
Bucket 2	[E102][E106]
Bucket 3	[E103][E107][E110]

Operation	Cost
Search by hash key	O(1) — compute hash, go to bucket
Insert	O(1) — compute hash, insert in bucket
Range queries	Poor — records not in order

Use case: Equality searches (WHERE emp_id = 105).

4. Clustered (Indexed Sequential) File

Records stored in sorted order AND an index (usually B+ tree) is built on the ordering key. This is how most RDBMS primary keys work.

B+ Tree Index [105] / \ [101-104] [105-110] ↓ ↓ Data pages Data pages (sorted) (sorted)

Operation	Cost
Search by key	O(log N) via index
Range query	Excellent — sorted + indexed
Insert	Moderate — may need page splits

Use case: Most production OLTP tables; primary key access.

Variable-Length Records

Some records have variable-length fields (VARCHAR, TEXT, BLOB):

Fixed-length	\| EmpID(4) \| Name(50) \| Salary(8) \| ← easy to navigate
Variable-length	\| EmpID(4) \| Name(??) \| Salary(8) \| ← need extra info
1. Slotted Page	page header has array of (offset, length) per record
2. Null bitmap	track which nullable fields contain null
3. Overflow pages	long fields stored in separate overflow pages

Record Formats

Fixed-Length Records

EmpID	Name (padded to 50 chars)	Salary
(4 bytes)	(50 bytes)	(8 bytes)

Comparison of File Organizations

Organization	Insert	Search	Delete	Range Query
Heap	Fast	O(N)	Easy	Poor
Sequential	Slow	O(log N)	Moderate	Excellent
Hashed	Fast	O(1)	Fast	Poor
Clustered	Moderate	O(log N)	Moderate	Excellent

Overview

File organization refers to how data records are physically stored on disk. The choice of file organization affects the speed of data retrieval, insertion, deletion, and update operations.

Basic Storage Concepts

Disk Storage Hierarchy

┌────────────────────────────────────────────────────┐

│ Database File │

│ ├── Segment (tablespace) │

│ │ ├── Extent (group of pages) │

│ │ │ ├── Page / Block (8KB–16KB) │

│ │ │ │ └── Records (rows/tuples) │

│ │ │ │ └── Fields (columns/attributes)│

└────────────────────────────────────────────────────┘

Disk I/O is the bottleneck

Reading one page from disk ≈ 5–10ms

Reading from RAM ≈ nanoseconds

Goal: Minimize disk I/O for all operations

1. Heap File (Unordered File)

Records are stored in no particular order — new records are appended wherever space is available.

│ Page 1	[R3][R7][R1][R9] │
│ Page 2	[R5][R2][R8][R4] │
│ Page 3	[R6][R10][ empty ][ empty ] │

Operation	Cost
Insert	O(1) — append to last page
Search (no index)	O(N) — full scan
Delete	Mark as deleted; space reclaimed later
Update	Find record + modify

Use case: Bulk loading data, temporary tables, when full-scan queries are common.

2. Sequential (Ordered) File

Records stored in sorted order based on a key attribute (the ordering field/key).

│ Page 1	[E101][E102][E103][E104] │
│ Page 2	[E105][E106][E107][E108] │
│ Page 3	[E109][E110] │

Operation	Cost
Search by ordering key	O(log N) — binary search
Insert	O(N) — must maintain order; expensive
Range queries	Efficient — physically contiguous

Use case: Range queries, reports, when sorted access is common.

3. Hashed File

Records stored based on a hash function applied to a key attribute. The hash function maps a key value to a bucket (page).

Hash Function	bucket = hash(EmpID) % 4
Bucket 0	[E104][E108]
Bucket 1	[E101][E105][E109]
Bucket 2	[E102][E106]
Bucket 3	[E103][E107][E110]

Operation	Cost
Search by hash key	O(1) — compute hash, go to bucket
Insert	O(1) — compute hash, insert in bucket
Range queries	Poor — records not in order

Use case: Equality searches (WHERE emp_id = 105).

4. Clustered (Indexed Sequential) File

Records stored in sorted order AND an index (usually B+ tree) is built on the ordering key. This is how most RDBMS primary keys work.

B+ Tree Index [105] / \ [101-104] [105-110] ↓ ↓ Data pages Data pages (sorted) (sorted)

Operation	Cost
Search by key	O(log N) via index
Range query	Excellent — sorted + indexed
Insert	Moderate — may need page splits

Use case: Most production OLTP tables; primary key access.

Variable-Length Records

Some records have variable-length fields (VARCHAR, TEXT, BLOB):

Fixed-length	\| EmpID(4) \| Name(50) \| Salary(8) \| ← easy to navigate
Variable-length	\| EmpID(4) \| Name(??) \| Salary(8) \| ← need extra info
1. Slotted Page	page header has array of (offset, length) per record
2. Null bitmap	track which nullable fields contain null
3. Overflow pages	long fields stored in separate overflow pages

Record Formats

Fixed-Length Records

EmpID	Name (padded to 50 chars)	Salary
(4 bytes)	(50 bytes)	(8 bytes)

Comparison of File Organizations

Organization	Insert	Search	Delete	Range Query
Heap	Fast	O(N)	Easy	Poor
Sequential	Slow	O(log N)	Moderate	Excellent
Hashed	Fast	O(1)	Fast	Poor
Clustered	Moderate	O(log N)	Moderate	Excellent

File Organization

Overview

Basic Storage Concepts

Disk Storage Hierarchy

Disk I/O is the bottleneck

1. Heap File (Unordered File)

2. Sequential (Ordered) File

3. Hashed File

4. Clustered (Indexed Sequential) File

Variable-Length Records

Record Formats

Fixed-Length Records

Comparison of File Organizations

Continue learning this concept

File Organization

Overview

Basic Storage Concepts

Disk Storage Hierarchy

Disk I/O is the bottleneck

1. Heap File (Unordered File)

2. Sequential (Ordered) File

3. Hashed File

4. Clustered (Indexed Sequential) File

Variable-Length Records

Record Formats

Fixed-Length Records

Comparison of File Organizations

Continue learning this concept