DBMS - Learn DBMS Online | WohoTech

Unit 6

# Big Data and DBMS

## What is Big Data?

**Big Data** refers to datasets that are so large, fast-moving, or complex that traditional database systems cannot process them efficiently. Handling them calls for different technologies and approaches to storage, processing, and analysis.

---

## The 5 V's of Big Data

```
┌────────────────────────────────────────────────────────────┐
│                   The 5 V's of Big Data                    │
│                                                            │
│  Volume   → Terabytes to petabytes of data                 │
│             (Facebook: 4 PB/day; Twitter: 500M tweets/day) │
│                                                            │
│  Velocity → Speed at which data is generated               │
│             Real-time streams (IoT sensors, stock ticks)   │
│                                                            │
│  Variety  → Different types and formats of data            │
│             Structured (DB), semi-structured (JSON),       │
│             unstructured (images, videos, logs)            │
│                                                            │
│  Veracity → Uncertainty and quality of data                │
│             Noisy, incomplete, inconsistent sources        │
│                                                            │
│  Value    → Extracting useful insights from raw data       │
│             The ultimate goal: business intelligence       │
└────────────────────────────────────────────────────────────┘
```

---

## Why Traditional DBMS Fails for Big Data

```
Traditional RDBMS limitations:
─────────────────────────────
1. Mainly vertical scaling  → a single server eventually hits hardware limits
2. Fixed schema             → poor fit for unstructured/semi-structured data
3. ACID overhead            → slows throughput for write-heavy workloads
4. Row-oriented storage     → inefficient for analytical queries
5. No native streaming      → cannot process data in real time as it arrives
6. Single point of failure  → needs a separate high-availability (HA) setup
```

---

## Big Data Ecosystem

### Apache Hadoop

Hadoop is an open-source framework for distributed storage and processing of Big Data.
```
Hadoop Core Components:
┌────────────────────────────────────────────────┐
│                     HADOOP                     │
│                                                │
│  ┌──────────────────────────────────────────┐  │
│  │  HDFS (Hadoop Distributed File System)   │  │
│  │  → Stores data across many nodes         │  │
│  │  → Replicates each block 3 times         │  │
│  │    (fault tolerant)                      │  │
│  │  → Block size: 128 MB default            │  │
│  └──────────────────────────────────────────┘  │
│                                                │
│  ┌──────────────────────────────────────────┐  │
│  │  MapReduce                               │  │
│  │  → Distributed processing framework      │  │
│  │  → Map phase: process each record        │  │
│  │  → Shuffle: group by key                 │  │
│  │  → Reduce phase: aggregate per key       │  │
│  └──────────────────────────────────────────┘  │
│                                                │
│  ┌──────────────────────────────────────────┐  │
│  │  YARN (Yet Another Resource Negotiator)  │  │
│  │  → Manages cluster resources (CPU, RAM)  │  │
│  └──────────────────────────────────────────┘  │
└────────────────────────────────────────────────┘
```

### MapReduce Example: Word Count

```
Input: "apple banana apple cherry banana apple"

MAP phase (splits and emits key-value pairs):
  apple  → 1
  banana → 1
  apple  → 1
  cherry → 1
  banana → 1
  apple  → 1

SHUFFLE (groups same keys together):
  apple  → [1, 1, 1]
  banana → [1, 1]
  cherry → [1]

REDUCE (aggregates):
  apple  → 3
  banana → 2
  cherry → 1
```

---

### Apache Spark

Apache Spark is a faster alternative to Hadoop MapReduce that processes data **in-memory**.
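To make the in-memory idea concrete, the word-count job from the MapReduce example can be written as three small functions that hand their results to one another directly in RAM. This is only a single-process sketch in plain Python, not the Hadoop or Spark API. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen to mirror the phases.

```python
from collections import defaultdict


def map_phase(text):
    """Emit one (word, 1) pair per word, like the MAP step."""
    return [(word, 1) for word in text.split()]


def shuffle_phase(pairs):
    """Group the emitted values by key, like the SHUFFLE step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups):
    """Sum the values for each key, like the REDUCE step."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(text):
    # Each phase passes its result straight to the next one in memory.
    # Nothing is written to disk between the steps in this sketch.
    return reduce_phase(shuffle_phase(map_phase(text)))


counts = word_count("apple banana apple cherry banana apple")
print(counts)  # {'apple': 3, 'banana': 2, 'cherry': 1}
```

On a real cluster each phase runs in parallel on many nodes. The sketch only shows how data moves from one phase to the next.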
```
Hadoop MapReduce vs Apache Spark:
──────────────────────────────────
MapReduce: Writes intermediate results to DISK between steps → slow
Spark:     Keeps intermediate results in RAM → up to 100x faster on some workloads

Spark Components:
┌─────────────────────────────────────────────────┐
│                  Apache Spark                   │
│ Spark Core (RDD: Resilient Distributed Dataset) │
│  ┌──────────┬──────────┬──────────┬──────────┐  │
│  │Spark SQL │Spark     │Spark     │GraphX    │  │
│  │(SQL on   │Streaming │MLlib     │(graph    │  │
│  │big data) │(real-time│(machine  │analytics)│  │
│  │          │analytics)│learning) │          │  │
│  └──────────┴──────────┴──────────┴──────────┘  │
└─────────────────────────────────────────────────┘
```

---

### Big Data Storage Technologies

| Technology | Type | Best For |
|-----------|------|---------|
| HDFS | Distributed file system | Batch processing, large files |
| HBase | Column-family NoSQL on HDFS | Random read/write on big data |
| Apache Cassandra | Column-family NoSQL | High write throughput, time-series |
| Apache Kafka | Message streaming | Real-time data pipelines |
| Amazon S3 | Object storage | Data lake, unstructured data |
| Elasticsearch | Document search engine | Full-text search, log analytics |

---

## Big Data Architecture Patterns

### Lambda Architecture

```
Incoming Data
     │
     ├──────────────────────► Batch Layer (Hadoop/Spark)
     │                          → Processes all historical data
     │                          → Accurate but slow (hours)
     │
     └──────────────────────► Speed Layer (Kafka + Spark Streaming)
                                → Processes recent real-time data
                                → Fast but approximate

Serving Layer: Combines batch + speed layer results for queries
```

### Kappa Architecture

A simpler pattern: it uses only a streaming layer (Kafka + Spark Streaming) for both real-time and historical data. Historical results are recomputed by replaying the stored event log through the same streaming code.
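One way to picture the Kappa idea is a single processing function that handles both replayed history and newly arriving events. The sketch below is plain Python with an in-memory list standing in for a Kafka topic. The event fields (`user`, `clicks`) and the names `process` and `replay` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def process(state, event):
    """The single streaming code path: update per-user click totals."""
    state[event["user"]] += event["clicks"]
    return state


def replay(log):
    """Rebuild state from history by replaying the log from the start."""
    state = Counter()
    for event in log:
        process(state, event)
    return state


# Append-only event log (assumption: stands in for a Kafka topic).
event_log = [
    {"user": "ana", "clicks": 2},
    {"user": "ben", "clicks": 1},
    {"user": "ana", "clicks": 3},
]

state = replay(event_log)              # historical data
live_event = {"user": "ben", "clicks": 4}
event_log.append(live_event)           # a new event arrives
process(state, live_event)             # handled by the same function

print(dict(state))  # {'ana': 5, 'ben': 5}
```

Because history and live traffic share one code path, replaying the full log at any time should reproduce the current state. Lambda gets the same result by merging the output of its separate batch and speed layers.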
---

## Big Data and DBMS Integration

```
Traditional RDBMS still used alongside Big Data tools:

Data Flow:
Source Systems (Apps, IoT, Web logs)
          │
          ▼
Big Data Ingestion (Kafka, Flume)
          │
          ▼
Big Data Storage (HDFS, S3, Cassandra)
          │
          ▼
Big Data Processing (Spark, Hive)
          │
          ▼
Data Warehouse / RDBMS (for reporting & analytics)
          │
          ▼
BI Tools (Tableau, Power BI, Grafana)
```

---

## Hive: SQL on Big Data

**Apache Hive** allows writing SQL-like queries (HiveQL) on data stored in HDFS:

```sql
-- HiveQL (looks like SQL, runs on a Hadoop cluster)
SELECT department, AVG(salary) AS avg_sal
FROM employee_data
WHERE year = 2024
GROUP BY department
HAVING AVG(salary) > 60000;
-- Compiles to MapReduce, Tez, or Spark jobs behind the scenes
```
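To see roughly how such a query maps onto map, shuffle, and reduce steps, here is a conceptual sketch in plain Python. It is not how Hive actually plans or executes queries, which involves partial aggregation and other optimizations. The sample rows and the `run_query` function are hypothetical stand-ins for the `employee_data` table and the compiled job.

```python
from collections import defaultdict

# Hypothetical rows standing in for the employee_data table.
employee_data = [
    {"department": "Engineering", "salary": 90000, "year": 2024},
    {"department": "Engineering", "salary": 70000, "year": 2024},
    {"department": "Support", "salary": 50000, "year": 2024},
    {"department": "Support", "salary": 55000, "year": 2024},
    {"department": "Sales", "salary": 65000, "year": 2023},
]


def run_query(rows, year=2024, min_avg=60000):
    # Map: apply the WHERE filter and emit (department, salary) pairs.
    pairs = [(r["department"], r["salary"]) for r in rows if r["year"] == year]

    # Shuffle: group salaries by department (the GROUP BY key).
    groups = defaultdict(list)
    for dept, salary in pairs:
        groups[dept].append(salary)

    # Reduce: compute AVG(salary) per group, then apply the HAVING filter.
    averages = {dept: sum(vals) / len(vals) for dept, vals in groups.items()}
    return {dept: avg for dept, avg in averages.items() if avg > min_avg}


print(run_query(employee_data))  # {'Engineering': 80000.0}
```

The `WHERE` clause filters rows before grouping, while `HAVING` filters the groups after aggregation. That is why they fall in different phases of the sketch.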
