Big Data and DBMS

What is Big Data?

Big Data refers to datasets that are so large, fast-moving, or complex that traditional database systems cannot process them efficiently. It requires new technologies and approaches for storage, processing, and analysis.

The 5 V's of Big Data

V	Meaning
Volume	Terabytes to petabytes of data (e.g., Facebook: 4 PB/day; Twitter: 500M tweets/day)
Velocity	Speed at which data is generated
Variety	Different types and formats of data
Veracity	Uncertainty and quality of data
Value	Extracting useful insights from raw data
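To make the Volume and Velocity figures more concrete, the daily totals quoted above can be converted into average per-second rates. This is only a rough sketch: it assumes decimal units (1 PB = 10^15 bytes) and a perfectly even load across the day, which real traffic does not have.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_second(per_day: float) -> float:
    """Convert a daily total into an average per-second rate."""
    return per_day / SECONDS_PER_DAY


# Daily figures quoted in the table above.
# Assumption: decimal units, 1 PB = 10**15 bytes.
facebook_bytes_per_day = 4 * 10**15
twitter_tweets_per_day = 500_000_000

facebook_gb_per_sec = per_second(facebook_bytes_per_day) / 10**9
tweets_per_sec = per_second(twitter_tweets_per_day)

print(f"Facebook: about {facebook_gb_per_sec:.1f} GB/s")  # about 46.3 GB/s
print(f"Twitter: about {tweets_per_sec:.0f} tweets/s")    # about 5787 tweets/s
```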

Why Traditional DBMS Fails for Big Data

1. Vertical scaling only: a single powerful server eventually hits hardware limits
2. Fixed schema: handles unstructured and semi-structured data poorly
3. ACID overhead: transactional guarantees reduce throughput for write-heavy workloads
4. Row-oriented storage: inefficient for analytical queries that read only a few columns
5. Single point of failure: a single-server RDBMS needs separate high-availability (HA) solutions
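The row-oriented storage point can be illustrated with a small sketch. The employee records and the "fields read" counter below are hypothetical and stand in for disk I/O. The idea is that an aggregate over one column touches every field in a row store but only the needed values in a column store.

```python
# Hypothetical employee records, invented for illustration.
rows = [
    {"id": 1, "name": "Asha", "department": "Sales", "salary": 50000},
    {"id": 2, "name": "Ben", "department": "Sales", "salary": 70000},
    {"id": 3, "name": "Chen", "department": "IT", "salary": 90000},
]


def avg_salary_row_store(records):
    """Row layout: each record is stored together, so a scan loads every field."""
    fields_read = 0
    total = 0
    for record in records:
        fields_read += len(record)  # the whole row is read
        total += record["salary"]
    return total / len(records), fields_read


# Column layout: the values of each column are stored together.
columns = {key: [r[key] for r in rows] for key in rows[0]}


def avg_salary_column_store(cols):
    """Column layout: only the salary column is read."""
    salaries = cols["salary"]
    return sum(salaries) / len(salaries), len(salaries)


row_avg, row_fields = avg_salary_row_store(rows)
col_avg, col_fields = avg_salary_column_store(columns)
print(row_avg, row_fields)  # 70000.0 12
print(col_avg, col_fields)  # 70000.0 3
```

Both layouts give the same answer, but the row layout reads four times as many values here. The gap grows with the number of columns in the table.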

Big Data Ecosystem

Apache Hadoop

Hadoop is an open-source framework for distributed storage and processing of Big Data.

	HDFS (Hadoop Distributed File System)
	→ Stores data across many nodes
	→ Replicates each block 3 times (fault tolerant)
	→ Block size: 128 MB default
	MapReduce
	→ Distributed processing framework
	→ Map phase: process each record
	→ Shuffle: group by key
	→ Reduce phase: aggregate per key
	YARN (Resource Manager)
	→ Manages cluster resources (CPU, RAM)
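The HDFS defaults above (128 MB blocks, 3 replicas) determine how many blocks a file occupies and how much raw disk it consumes. A minimal sketch of that arithmetic follows. The function name hdfs_footprint is hypothetical, and the calculation ignores metadata overhead.

```python
import math

BLOCK_SIZE_MB = 128     # default HDFS block size, as stated above
REPLICATION_FACTOR = 3  # default replication, as stated above


def hdfs_footprint(file_size_mb: float) -> dict:
    """Estimate the block count and raw storage for one file (hypothetical helper)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        "blocks": blocks,
        "block_replicas": blocks * REPLICATION_FACTOR,
        # The last block only occupies as much disk as the data it holds,
        # so raw usage is the file size times the replication factor.
        "raw_storage_mb": file_size_mb * REPLICATION_FACTOR,
    }


footprint = hdfs_footprint(1024)  # a 1 GB file
print(footprint)
# {'blocks': 8, 'block_replicas': 24, 'raw_storage_mb': 3072}
```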

MapReduce Example — Word Count

Input	"apple banana apple cherry banana apple"

Map output (one pair per word)
apple	1
banana	1
apple	1
cherry	1
banana	1
apple	1

Shuffle output (values grouped by key)
apple	[1, 1, 1]
banana	[1, 1]
cherry	[1]

Reduce output (sum per key)
apple	3
banana	2
cherry	1
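The three phases of this word count can be imitated on a single machine in plain Python. This is only a sketch of the data flow, not Hadoop's API: the function names are made up for illustration, and a real job runs the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(text):
    """Map: emit a (word, 1) pair for every word in the input."""
    return [(word, 1) for word in text.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped):
    """Reduce: aggregate the values for each key (here, sum them)."""
    return {key: sum(values) for key, values in grouped.items()}


text = "apple banana apple cherry banana apple"
mapped = map_phase(text)
grouped = shuffle_phase(mapped)
counts = reduce_phase(grouped)

print(grouped)  # {'apple': [1, 1, 1], 'banana': [1, 1], 'cherry': [1]}
print(counts)   # {'apple': 3, 'banana': 2, 'cherry': 1}
```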

Apache Spark

Apache Spark is a distributed processing engine that is typically faster than Hadoop MapReduce because it keeps intermediate data in memory instead of writing it to disk between stages.

Component	Purpose
Spark SQL	SQL on big data
Spark Streaming	Real-time analytics
Spark MLlib	Machine learning
GraphX	Graph analytics

Big Data Storage Technologies

Technology	Type	Best For
HDFS	Distributed file system	Batch processing, large files
HBase	Column-family NoSQL on HDFS	Random read/write on big data
Apache Cassandra	Column-family NoSQL	High write throughput, time-series
Apache Kafka	Message streaming	Real-time data pipelines
Amazon S3	Object storage	Data lake, unstructured data
Elasticsearch	Document search engine	Full-text search, log analytics

Big Data Architecture Patterns

Lambda Architecture

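Layer	Role
Batch Layer	Processes all historical data; accurate but slow (hours)
Speed Layer	Processes recent data in real time to cover the gap until the next batch run
Serving Layer	Combines batch + speed layer results for queries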
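How the serving layer combines the two results can be sketched with simple page-view counts. Everything here is a hypothetical stand-in: the event data is invented, and the batch and speed views are plain in-memory counters rather than real batch or streaming jobs.

```python
from collections import Counter


def batch_view(historical_events):
    """Batch layer: recompute counts from all historical events (slow, complete)."""
    return Counter(event["page"] for event in historical_events)


def speed_view(recent_events):
    """Speed layer: count only events that arrived since the last batch run."""
    return Counter(event["page"] for event in recent_events)


def serve(batch, speed, page):
    """Serving layer: answer a query by merging both views."""
    return batch[page] + speed[page]  # a Counter returns 0 for missing keys


# Hypothetical events, invented for illustration.
historical = [{"page": "home"}, {"page": "home"}, {"page": "cart"}]
recent = [{"page": "home"}, {"page": "checkout"}]

batch = batch_view(historical)
speed = speed_view(recent)

print(serve(batch, speed, "home"))      # 3
print(serve(batch, speed, "checkout"))  # 1
```

After the next batch run absorbs the recent events, the speed view for that period would be discarded, so events are not counted twice.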

Kappa Architecture

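Simpler than Lambda: uses only a streaming layer (Kafka + Spark Streaming) for both real-time and historical data. Historical data is reprocessed by replaying the stored event log through the same streaming code.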
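The single-code-path idea behind Kappa can be sketched as follows. A Python list stands in for an append-only log such as a Kafka topic, and process and replay are hypothetical names. The same function handles both live events and a full replay of history.

```python
def process(state, event):
    """The one stream processor, used for both live and replayed events."""
    user = event["user"]
    state[user] = state.get(user, 0) + event["amount"]
    return state


def replay(log, from_offset=0):
    """Rebuild state by re-reading the log from a given offset."""
    state = {}
    for event in log[from_offset:]:
        process(state, event)
    return state


# Hypothetical append-only log, standing in for a Kafka topic.
log = [
    {"user": "u1", "amount": 10},
    {"user": "u2", "amount": 5},
    {"user": "u1", "amount": 7},
]

live_state = replay(log)  # initial state built from history

# A new event arrives: append it to the log and process it live.
log.append({"user": "u2", "amount": 20})
process(live_state, log[-1])

print(live_state)                 # {'u1': 17, 'u2': 25}
print(replay(log) == live_state)  # True
```

Because a replay produces the same state as live processing, no separate batch code path is needed.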

Big Data and DBMS Integration

Traditional RDBMSs are still used alongside Big Data tools. A typical data flow:

Source Systems (Apps, IoT, Web logs)
        │
        ▼
Big Data Ingestion (Kafka, Flume)
        │
        ▼
Big Data Storage (HDFS, S3, Cassandra)
        │
        ▼
Big Data Processing (Spark, Hive)
        │
        ▼
Data Warehouse / RDBMS (for reporting & analytics)
        │
        ▼
BI Tools (Tableau, Power BI, Grafana)

Hive — SQL on Big Data

Apache Hive allows writing SQL-like queries (HiveQL) on data stored in HDFS:

-- HiveQL (looks like SQL, runs on Hadoop cluster)
SELECT department, AVG(salary) AS avg_sal
FROM employee_data
WHERE year = 2024
GROUP BY department
HAVING AVG(salary) > 60000;
-- Compiles to MapReduce or Spark jobs behind the scenes
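To see what the query above returns without a Hadoop cluster, the same statement can be run against an in-memory SQLite table. SQLite is only a stand-in here: it shares none of Hive's execution model, and the sample rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_data (department TEXT, salary REAL, year INTEGER)"
)

# Hypothetical sample rows, invented for illustration.
conn.executemany(
    "INSERT INTO employee_data VALUES (?, ?, ?)",
    [
        ("Engineering", 80000, 2024),
        ("Engineering", 70000, 2024),
        ("Sales", 50000, 2024),
        ("Sales", 65000, 2024),
        ("Engineering", 40000, 2023),  # excluded by WHERE year = 2024
    ],
)

query = """
SELECT department, AVG(salary) AS avg_sal
FROM employee_data
WHERE year = 2024
GROUP BY department
HAVING AVG(salary) > 60000
"""

result = conn.execute(query).fetchall()
print(result)  # [('Engineering', 75000.0)]
```

Sales averages 57,500 for 2024 and is dropped by the HAVING clause. On Hive, the same grouping and filtering would be spread across the cluster.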
