Big Data and DBMS
Last Updated : 21 May, 2026
What is Big Data?
Big Data refers to datasets that are so large, fast-moving, or complex that traditional database systems cannot process them efficiently. It requires new technologies and approaches for storage, processing, and analysis.
The 5 V's of Big Data
| V | What it means |
|---|---|
| Volume | Terabytes to petabytes of data ([…] 4 PB/day; Twitter: 500M tweets/day) |
| Velocity | Speed at which data is generated |
| Variety | Different types and formats of data |
| Veracity | Uncertainty and quality of data |
| Value | Extracting useful insights from raw data |
Why Traditional DBMS Fails for Big Data
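| Limitation | Why it matters |
|---|---|
| 1. Vertical scaling only | A single powerful server eventually hits hardware limits |
| 2. Fixed schema | Cannot easily handle unstructured or semi-structured data |
| 3. ACID overhead | Transactional guarantees reduce throughput for write-heavy workloads |
| 4. Row-oriented storage | Inefficient for analytical queries that read a few columns across many rows |
| 6. Single point of failure | A single-server RDBMS needs additional high-availability (HA) solutions |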
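The row-oriented storage limitation can be illustrated with a small sketch. This is plain Python, not how a real engine lays out pages on disk, and the table contents and function names (`avg_salary_row_store`, `avg_salary_column_store`) are hypothetical. It only shows that an analytical query over one field has to walk whole records in a row layout, while a column layout reads a single array.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "Asha", "department": "Sales", "salary": 50000},
    {"id": 2, "name": "Ben", "department": "IT", "salary": 70000},
    {"id": 3, "name": "Chen", "department": "IT", "salary": 90000},
]

# Column-oriented layout: each field is stored as its own array.
columns = {
    "id": [1, 2, 3],
    "name": ["Asha", "Ben", "Chen"],
    "department": ["Sales", "IT", "IT"],
    "salary": [50000, 70000, 90000],
}

def avg_salary_row_store(rows):
    # Every full record is visited even though only one field is needed.
    return sum(record["salary"] for record in rows) / len(rows)

def avg_salary_column_store(columns):
    # Only the salary column is read; the other columns are never touched.
    salaries = columns["salary"]
    return sum(salaries) / len(salaries)

print(avg_salary_row_store(rows))        # 70000.0
print(avg_salary_column_store(columns))  # 70000.0
```

Both functions return the same answer; the difference is how much data each one has to read to get there, which is what matters at big-data scale.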
Big Data Ecosystem
Apache Hadoop
Hadoop is an open-source framework for distributed storage and processing of Big Data.
| Component | Role |
|---|---|
| HDFS (Hadoop Distributed File System) | Stores data across many nodes; replicates each block 3 times by default (fault tolerant); default block size is 128 MB |
| MapReduce | Distributed processing framework. Map phase: process each record. Shuffle: group by key. Reduce phase: aggregate per key |
| YARN (resource manager) | Manages cluster resources (CPU, RAM) |
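The HDFS figures above (128 MB blocks, 3 replicas) can be turned into a quick capacity estimate. The sketch below is plain Python arithmetic, not a Hadoop API. The function name `hdfs_footprint` and its parameters are hypothetical, and the defaults simply assume the values stated above.

```python
MB = 1024 * 1024

def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Estimate how one file is split and replicated (assumed defaults)."""
    # A file is split into fixed-size blocks; the last block may be partial.
    # -(-a // b) is integer ceiling division.
    num_blocks = -(-file_size_bytes // block_size_bytes)
    # Each block is stored `replication` times on different nodes.
    block_replicas = num_blocks * replication
    # A partial last block only occupies its actual size, so raw usage
    # is the file size multiplied by the replication factor.
    raw_bytes = file_size_bytes * replication
    return num_blocks, block_replicas, raw_bytes

blocks, replicas, raw = hdfs_footprint(1000 * MB)
print(blocks, replicas, raw // MB)  # 8 24 3000
```

Under these assumptions a 1000 MB file becomes 8 blocks, 24 block replicas across the cluster, and about 3000 MB of raw disk usage.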
MapReduce Example — Word Count
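Input: "apple banana apple cherry banana apple"

Map output (one pair per word):

| Key | Value |
|---|---|
| apple | 1 |
| banana | 1 |
| apple | 1 |
| cherry | 1 |
| banana | 1 |
| apple | 1 |

Shuffle output (values grouped by key):

| Key | Values |
|---|---|
| apple | [1, 1, 1] |
| banana | [1, 1] |
| cherry | [1] |

Reduce output (sum per key):

| Key | Count |
|---|---|
| apple | 3 |
| banana | 2 |
| cherry | 1 |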
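The word count walk-through can be sketched as three small functions. This is a single-process Python simulation of the map, shuffle, and reduce steps, not Hadoop's actual Java API; the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are illustrative only.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input.
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_phase(groups):
    # Reduce: aggregate the list of values for each key.
    return {key: sum(values) for key, values in groups.items()}

text = "apple banana apple cherry banana apple"
mapped = map_phase(text)
grouped = shuffle_phase(mapped)
counts = reduce_phase(grouped)
print(grouped)  # {'apple': [1, 1, 1], 'banana': [1, 1], 'cherry': [1]}
print(counts)   # {'apple': 3, 'banana': 2, 'cherry': 1}
```

On a real cluster the map and reduce functions run in parallel on many nodes, and the shuffle moves data over the network so that all values for one key reach the same reducer.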
Apache Spark
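Apache Spark is a general-purpose distributed processing engine and, for many workloads, a faster alternative to Hadoop MapReduce: it keeps intermediate data in memory rather than writing it to disk between stages.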
| Component | Purpose |
|---|---|
| Spark SQL | SQL on big data |
| Spark Streaming | Real-time analytics |
| Spark MLlib | Machine learning |
| GraphX | Graph analytics |
Big Data Storage Technologies
| Technology | Type | Best For |
|---|---|---|
| HDFS | Distributed file system | Batch processing, large files |
| HBase | Column-family NoSQL on HDFS | Random read/write on big data |
| Apache Cassandra | Column-family NoSQL | High write throughput, time-series |
| Apache Kafka | Message streaming | Real-time data pipelines |
| Amazon S3 | Object storage | Data lake, unstructured data |
| Elasticsearch | Document search engine | Full-text search, log analytics |
Big Data Architecture Patterns
Lambda Architecture
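| Layer | Role |
|---|---|
| Batch Layer | Processes all historical data; accurate but slow (hours) |
| Speed Layer | Processes recent data as it arrives; fast, but covers only what the batch layer has not yet processed |
| Serving Layer | Combines batch + speed layer results for queries |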
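How the serving layer combines batch and speed results can be sketched in a few lines. This is a minimal illustration in plain Python that assumes an additive metric (page-view counts), so merging is simple addition. The view contents and the function name `serve_query` are hypothetical, and real systems use dedicated stores for each view.

```python
# Batch view: recomputed from all historical data (accurate, but hours old).
batch_view = {"page_a": 1000, "page_b": 250}

# Speed view: counts only for events that arrived after the last batch run.
speed_view = {"page_a": 12, "page_c": 3}

def serve_query(key, batch_view, speed_view):
    # Serving layer: combine the batch and speed results for one key.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

for page in ("page_a", "page_b", "page_c"):
    print(page, serve_query(page, batch_view, speed_view))
# page_a 1012
# page_b 250
# page_c 3
```

When the next batch run finishes, its view absorbs the recent events and the corresponding speed-view entries can be discarded.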
Kappa Architecture
Kappa Architecture is simpler than Lambda: it uses only a streaming layer (for example, Kafka + Spark Streaming) for both real-time and historical data, so there is no separate batch layer to maintain.
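One common way to picture the Kappa approach is a single append-only log plus one processing function: historical results come from replaying the log from the start, and real-time results come from applying new events to the same state. The sketch below is plain Python under that assumption. The list `event_log` merely stands in for a Kafka topic, and `process_stream` is a hypothetical name, not a Kafka or Spark API.

```python
# A single append-only event log stands in for a Kafka topic (assumption).
event_log = [
    {"offset": 0, "user": "u1", "amount": 30},
    {"offset": 1, "user": "u2", "amount": 20},
    {"offset": 2, "user": "u1", "amount": 50},
]

def process_stream(events, state=None):
    # The one processing function used for both live and historical data.
    totals = dict(state or {})
    for event in events:
        totals[event["user"]] = totals.get(event["user"], 0) + event["amount"]
    return totals

# Historical view: replay the log from offset 0.
historical = process_stream(event_log)

# Real-time view: apply newly arrived events on top of the existing state.
new_events = [{"offset": 3, "user": "u2", "amount": 5}]
live = process_stream(new_events, state=historical)

print(historical)  # {'u1': 80, 'u2': 20}
print(live)        # {'u1': 80, 'u2': 25}
```

Because there is only one code path, changing the logic means redeploying the function and replaying the log, rather than keeping batch and streaming implementations in sync.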
Big Data and DBMS Integration
Hive — SQL on Big Data
Apache Hive allows writing SQL-like queries (HiveQL) on data stored in HDFS:
-- HiveQL (looks like SQL, runs on Hadoop cluster)
SELECT department, AVG(salary) AS avg_sal
FROM employee_data
WHERE year = 2024
GROUP BY department
HAVING AVG(salary) > 60000;
-- Compiles to MapReduce or Spark jobs behind the scenes
Exam Focus
Revise definitions, diagrams, examples, and short-answer points for Big Data and DBMS.
Interview Use
Prepare one clear explanation, one practical example, and one common mistake for this DBMS topic.