# Big Data and DBMS
## What is Big Data?
**Big Data** refers to datasets that are so large, fast-moving, or complex that traditional database systems cannot process them efficiently. Handling them requires new technologies and approaches for storage, processing, and analysis.
---
## The 5 V's of Big Data
```
┌──────────────────────────────────────────────────────────┐
│ The 5 V's of Big Data │
│ │
│ Volume → Terabytes to Petabytes of data │
│ (Facebook: 4 PB/day; Twitter: 500M tweets/day)│
│ │
│ Velocity → Speed at which data is generated │
│ Real-time streams (IoT sensors, stock ticks)│
│ │
│ Variety → Different types and formats of data │
│ Structured (DB), Semi-structured (JSON), │
│ Unstructured (images, videos, logs) │
│ │
│ Veracity → Uncertainty and quality of data │
│ Noisy, incomplete, inconsistent sources │
│ │
│ Value → Extracting useful insights from raw data │
│ The ultimate goal — business intelligence │
└──────────────────────────────────────────────────────────┘
```
---
## Why Traditional DBMS Fails for Big Data
```
Traditional RDBMS limitations:
─────────────────────────────
1. Mostly vertical scaling → a single server eventually hits hardware limits
2. Fixed schema → cannot handle unstructured/semi-structured data
3. ACID overhead → slows throughput for write-heavy workloads
4. Row-oriented storage → inefficient for analytical queries
5. No native stream processing → data is stored first, then queried later
6. Single point of failure → needs a separate high-availability (HA) setup (replication, failover)
```
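Point 4 in the list above can be made concrete with a small sketch. It compares a row-oriented layout with a column-oriented one for a single aggregate. The employee records and function names are hypothetical, and plain Python lists stand in for a real storage engine.

```python
# Hypothetical employee records; names and values are illustrative only.
rows = [
    {"id": 1, "name": "Asha", "department": "Sales", "salary": 52000},
    {"id": 2, "name": "Ben", "department": "Engineering", "salary": 71000},
    {"id": 3, "name": "Chen", "department": "Engineering", "salary": 69000},
]

def avg_salary_row_store(records):
    """Row-oriented: the aggregate visits every whole record."""
    return sum(r["salary"] for r in records) / len(records)

def to_column_store(records):
    """Convert records into one array per field (column-oriented layout)."""
    columns = {key: [] for key in records[0]}
    for record in records:
        for key, value in record.items():
            columns[key].append(value)
    return columns

def avg_salary_column_store(columns):
    """Column-oriented: the aggregate reads only the salary column."""
    salaries = columns["salary"]
    return sum(salaries) / len(salaries)

columns = to_column_store(rows)
print(avg_salary_row_store(rows))        # 64000.0
print(avg_salary_column_store(columns))  # 64000.0
```

Both layouts return the same answer. The difference is how much data must be read: the column layout touches one field, while the row layout touches every field of every record.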
---
## Big Data Ecosystem
### Apache Hadoop
Hadoop is an open-source framework for distributed storage and processing of Big Data.
```
Hadoop Core Components:
┌──────────────────────────────────────────────────┐
│ HADOOP │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ HDFS (Hadoop Distributed File System) │ │
│ │ → Stores data across many nodes │ │
│ │ → Replicates each block 3 times (fault │ │
│ │ tolerant) │ │
│ │ → Block size: 128 MB default │ │
│ └──────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ MapReduce │ │
│ │ → Distributed processing framework │ │
│ │ → Map phase: process each record │ │
│ │ → Shuffle: group by key │ │
│ │ → Reduce phase: aggregate per key │ │
│ └──────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ YARN (Resource Manager) │ │
│ │ → Manages cluster resources (CPU, RAM) │ │
│ └──────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
```
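The HDFS defaults shown above (128 MB blocks, 3 replicas) lead to some simple arithmetic, sketched below. The function name `hdfs_footprint` and the 1,000 MB file are assumptions for illustration and are not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128      # HDFS default block size, as stated above
REPLICATION_FACTOR = 3   # default replication, as stated above

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The final block only occupies as much space as the data it holds,
    # so raw storage is file size times replication.
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb

# Hypothetical 1,000 MB file
print(hdfs_footprint(1000))  # (8, 24, 3000)
```

A 1,000 MB file is split into 8 blocks, stored as 24 block replicas spread across the cluster, and uses roughly 3,000 MB of raw disk.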
### MapReduce Example — Word Count
```
Input: "apple banana apple cherry banana apple"
MAP phase (splits and emits key-value pairs):
apple → 1
banana → 1
apple → 1
cherry → 1
banana → 1
apple → 1
SHUFFLE (groups same keys together):
apple → [1, 1, 1]
banana → [1, 1]
cherry → [1]
REDUCE (aggregates):
apple → 3
banana → 2
cherry → 1
```
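The three phases traced above can be sketched in plain Python. This is a single-process illustration of the idea, not Hadoop's actual API; the function names are hypothetical.

```python
from collections import defaultdict

def map_phase(text):
    """Emit a (word, 1) pair for every word in the input."""
    return [(word, 1) for word in text.split()]

def shuffle_phase(pairs):
    """Group emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_phase(grouped):
    """Sum the values for each key."""
    return {key: sum(values) for key, values in grouped.items()}

text = "apple banana apple cherry banana apple"
counts = reduce_phase(shuffle_phase(map_phase(text)))
print(counts)  # {'apple': 3, 'banana': 2, 'cherry': 1}
```

On a real cluster the map and reduce functions run in parallel on many nodes, and the shuffle moves data across the network so that all values for one key reach the same reducer.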
---
### Apache Spark
Apache Spark is a faster alternative to Hadoop MapReduce that processes data **in-memory**.
```
Hadoop MapReduce vs Apache Spark:
──────────────────────────────────
MapReduce: Writes intermediate results to DISK between steps → slow
Spark: Keeps intermediate results in RAM → up to 100x faster for some in-memory workloads
Spark Components:
┌────────────────────────────────────────────────┐
│ Apache Spark │
│ Spark Core (RDD — Resilient Distributed Dataset)│
│ ┌──────────┬──────────┬──────────┬──────────┐ │
│ │Spark SQL │Spark │Spark │GraphX │ │
│ │(SQL on │Streaming │MLlib │(Graph │ │
│ │big data) │(real-time│(machine │analytics)│ │
│ │ │analytics)│learning) │ │ │
│ └──────────┴──────────┴──────────┴──────────┘ │
└────────────────────────────────────────────────┘
```
---
### Big Data Storage Technologies
| Technology | Type | Best For |
|-----------|------|---------|
| HDFS | Distributed file system | Batch processing, large files |
| HBase | Column-family NoSQL on HDFS | Random read/write on big data |
| Apache Cassandra | Column-family NoSQL | High write throughput, time-series |
| Apache Kafka | Message streaming | Real-time data pipelines |
| Amazon S3 | Object storage | Data lake, unstructured data |
| Elasticsearch | Document search engine | Full-text search, log analytics |
---
## Big Data Architecture Patterns
### Lambda Architecture
```
Incoming Data
│
├──────────────────────► Batch Layer (Hadoop/Spark)
│ → Processes all historical data
│ → Accurate but slow (hours)
│
└──────────────────────► Speed Layer (Kafka + Spark Streaming)
→ Processes recent real-time data
→ Fast but approximate
Serving Layer: Combines batch + speed layer results for queries
```
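The way the serving layer merges the two views can be sketched as follows. The page-view events, the cutoff value, and all function names are assumptions made for illustration; in a real deployment the batch view would come from Hadoop or Spark and the speed view from a streaming job.

```python
from collections import Counter

# Hypothetical page-view events: (timestamp, page)
events = [(1, "home"), (2, "cart"), (3, "home"), (4, "home"), (5, "cart")]
BATCH_CUTOFF = 3  # assumption: the last batch run covered timestamps <= 3

def batch_view(all_events, cutoff):
    """Batch layer: recompute counts from all data up to the cutoff."""
    return Counter(page for ts, page in all_events if ts <= cutoff)

def speed_view(all_events, cutoff):
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(page for ts, page in all_events if ts > cutoff)

def serve(page, batch, speed):
    """Serving layer: merge both views to answer a query."""
    return batch[page] + speed[page]

batch = batch_view(events, BATCH_CUTOFF)
speed = speed_view(events, BATCH_CUTOFF)
print(serve("home", batch, speed))  # 3
print(serve("cart", batch, speed))  # 2
```

When the next batch run completes, the cutoff moves forward and the speed view for the covered period can be discarded.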
### Kappa Architecture
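The Kappa architecture is simpler: it uses only a streaming layer (for example, Kafka + Spark Streaming) for both real-time and historical data. Historical results are recomputed by replaying the stored event log through the same streaming code.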
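A minimal sketch of the Kappa idea, assuming a Python list stands in for a replayable log such as a Kafka topic. The names `process` and `replay` are hypothetical.

```python
# Hypothetical append-only event log, standing in for a Kafka topic.
log = [("home", 1), ("cart", 1), ("home", 1), ("home", 1)]

def process(view, event):
    """One streaming step: fold a single event into the running view."""
    page, count = event
    view[page] = view.get(page, 0) + count
    return view

def replay(event_log, from_offset=0):
    """Rebuild a view by replaying the log from a given offset."""
    view = {}
    for event in event_log[from_offset:]:
        process(view, event)
    return view

# Live processing: events are folded in as they arrive.
live_view = {}
for event in log:
    process(live_view, event)

# Historical reprocessing: replay the same log through the same code.
rebuilt_view = replay(log)
print(live_view == rebuilt_view)  # True
print(rebuilt_view)               # {'home': 3, 'cart': 1}
```

Because one code path serves both cases, there is no separate batch job to keep in sync with the streaming logic.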
---
## Big Data and DBMS Integration
```
Traditional RDBMSs are still used alongside Big Data tools:
Data Flow:
Source Systems (Apps, IoT, Web logs)
│
▼
Big Data Ingestion (Kafka, Flume)
│
▼
Big Data Storage (HDFS, S3, Cassandra)
│
▼
Big Data Processing (Spark, Hive)
│
▼
Data Warehouse / RDBMS (for reporting & analytics)
│
▼
BI Tools (Tableau, Power BI, Grafana)
```
---
## Hive — SQL on Big Data
**Apache Hive** allows writing SQL-like queries (HiveQL) on data stored in HDFS:
```sql
-- HiveQL (looks like SQL, runs on Hadoop cluster)
SELECT department, AVG(salary) AS avg_sal
FROM employee_data
WHERE year = 2024
GROUP BY department
HAVING AVG(salary) > 60000;
-- Hive compiles this into MapReduce, Tez, or Spark jobs behind the scenes
```