Data Mining

What is Data Mining?

Data Mining is the process of discovering interesting, previously unknown patterns, correlations, anomalies, and insights from large amounts of data stored in databases and data warehouses.

It is part of the larger KDD (Knowledge Discovery in Databases) process.

KDD Process (Knowledge Discovery in Databases)

Raw Data
   │
   ▼  Selection & Sampling
Target Data
   │
   ▼  Cleaning & Preprocessing
Preprocessed Data
   │
   ▼  Transformation & Feature Engineering
Transformed Data
   │
   ▼  DATA MINING  ← core step
Patterns & Models
   │
   ▼  Interpretation & Evaluation
Knowledge
   │
   ▼  Action / Decision Making
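To make the flow of stages concrete, here is a minimal sketch that maps each KDD stage to one small Python function. The records, field names, and the age-band cutoff are all hypothetical and exist only for illustration; real pipelines do far more work at each stage.

```python
# Toy KDD pipeline: one function per stage of the process described above.
# All records, field names, and thresholds are hypothetical.

raw_data = [
    {"customer": "A", "age": 25, "spend": 1200.0},
    {"customer": "B", "age": None, "spend": 300.0},  # missing age
    {"customer": "C", "age": 41, "spend": 950.0},
    {"customer": "D", "age": 67, "spend": 150.0},
]

def select(records):
    """Selection: keep only the fields relevant to the analysis."""
    return [{"age": r["age"], "spend": r["spend"]} for r in records]

def clean(records):
    """Cleaning: drop records that have missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transformation: derive a feature (age band) from a raw attribute."""
    return [
        {**r, "age_band": "under_40" if r["age"] < 40 else "40_plus"}
        for r in records
    ]

def mine(records):
    """Data mining: here, simply the average spend per age band."""
    groups = {}
    for r in records:
        groups.setdefault(r["age_band"], []).append(r["spend"])
    return {band: sum(v) / len(v) for band, v in groups.items()}

def evaluate(patterns):
    """Interpretation: rank the age bands by average spend."""
    return sorted(patterns.items(), key=lambda kv: kv[1], reverse=True)

knowledge = evaluate(mine(transform(clean(select(raw_data)))))
print(knowledge)
```

The nested call on the last lines mirrors the top-to-bottom order of the stages: each function consumes the output of the previous one.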

Data Mining Tasks

1. Classification

Assigns data items to predefined categories based on their attributes.

Goal	Given attributes, predict a category label.
Input	Age=35, Income=75000, Married=Yes, Owns_Home=No
Output	Credit Risk = LOW / MEDIUM / HIGH

2. Clustering

Groups data items into clusters such that items in the same cluster are similar and items in different clusters are dissimilar. No predefined labels.

Goal	Group data items by similarity (unsupervised).
Customer Data	Group into clusters:
Cluster 1	Young, high income, tech-savvy
Cluster 2	Middle-aged, average income, family-oriented
Cluster 3	Senior, low income, low digital engagement
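One common way to form such groups is k-means, which the text does not name; it is used here only as an illustrative choice. The sketch below is a plain standard-library version run on hypothetical customers described by (age, annual income in thousands), with fixed starting centroids so the result is reproducible.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # New centroid = mean of the cluster's points (keep the old one if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical customers: (age, annual income in thousands).
customers = [
    (22, 90), (25, 95), (27, 88),   # young, high income
    (40, 50), (45, 55), (42, 48),   # middle-aged, average income
    (68, 20), (72, 18), (70, 25),   # senior, low income
]
initial = [customers[0], customers[3], customers[6]]  # fixed start for reproducibility

clusters, centroids = kmeans(customers, initial)
for cluster, centroid in zip(clusters, centroids):
    print(centroid, cluster)
```

No labels are supplied anywhere: the three groups emerge purely from distances between points, which is what makes the task unsupervised. In practice the features would usually be scaled first so that no single attribute dominates the distance.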

3. Association Rule Mining

Discovers interesting relationships (associations) between variables in large datasets.

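Goal	Find rules of the form: IF {A, B} THEN {C}
Example	IF {Bread, Butter} THEN {Milk}
Support	40% of all transactions contain Bread, Butter, and Milk together
Confidence	80% of transactions with Bread and Butter also contain Milk
Lift	Confidence divided by the overall share of transactions containing Milk; a value above 1 means the items co-occur more often than chance would predict
Algorithm	Apriori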
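The three measures can be written down precisely. The following uses the standard definitions for a rule X ⇒ Y; the symbols X and Y are notation introduced here, not taken from the text.

```latex
\begin{align*}
\mathrm{support}(X \cup Y) &= \frac{\text{number of transactions containing every item of } X \cup Y}{\text{total number of transactions}} \\
\mathrm{confidence}(X \Rightarrow Y) &= \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)} \\
\mathrm{lift}(X \Rightarrow Y) &= \frac{\mathrm{confidence}(X \Rightarrow Y)}{\mathrm{support}(Y)}
\end{align*}
```

With the figures above (support 40%, confidence 80%), the second formula implies that Bread and Butter appear together in 0.40 / 0.80 = 50% of all transactions. Lift cannot be computed from those two figures alone, because it also needs the support of Milk, which is not given.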

4. Regression

Predicts a continuous numerical value based on input attributes.

Goal	Predict a numeric output.
Input	Size=1500 sqft, Bedrooms=3, Location=Delhi
Output	House Price = ₹85,00,000
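As a minimal sketch of regression, the code below fits a straight line by ordinary least squares using only the size attribute. The training data is hypothetical and was chosen so that the prediction for 1500 sqft lands on the ₹85,00,000 (85 lakh) figure above; bedrooms and location are ignored to keep the example short.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (single feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: house size in sqft -> price in lakh rupees.
sizes = [1000, 1200, 1800, 2000]
prices = [60, 70, 100, 110]

slope, intercept = fit_line(sizes, prices)

def predict(size_sqft):
    """Predicted price in lakh rupees for a given size."""
    return slope * size_sqft + intercept

print(round(predict(1500), 2))  # price in lakh rupees for a 1500 sqft house
```

The output is a number on a continuous scale rather than a category, which is the difference between regression and classification.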

5. Anomaly Detection (Outlier Analysis)

Identifies data points that deviate significantly from normal behavior.

Goal: Find unusual, unexpected data items.

Example: A credit card transaction of ₹2,00,000 at 3 AM in Moscow for an account normally used in Delhi → ANOMALY (potential fraud)

Applications:
- Credit card fraud detection
- Network intrusion detection
- Manufacturing defect detection
- Medical diagnosis (unusual test results)
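One simple way to quantify "deviates significantly" is a z-score against the account's own history. The sketch below assumes a hypothetical list of past transaction amounts and a threshold of 3 standard deviations; it looks only at the amount, whereas a real fraud system would also consider time, location, and other signals.

```python
import statistics

# Hypothetical past transaction amounts (in rupees) for one account.
history = [1200, 800, 1500, 950, 2200, 1100, 1800, 1300, 900, 1600]

mean = statistics.mean(history)
stdev = statistics.stdev(history)

def is_anomalous(amount, threshold=3.0):
    """Flag an amount whose z-score against the account history exceeds the threshold."""
    z_score = abs(amount - mean) / stdev
    return z_score > threshold

print(is_anomalous(1400))    # an amount in the usual range
print(is_anomalous(200000))  # the ₹2,00,000 transaction from the example
```

The baseline is computed from past behavior only, so a single extreme new value cannot inflate the standard deviation it is being compared against.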

Decision Tree Example

Classify loan applicants as LOW / MEDIUM / HIGH risk:

                 Income > 50000?
                 /             \
               YES              NO
                │                │
           Owns Home?    Credit Score > 650?
            /     \          /      \
          YES     NO       YES      NO
           │       │        │        │
          LOW   MEDIUM   MEDIUM    HIGH
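The tree above translates directly into nested conditions. The function and parameter names below are hypothetical; the thresholds and outcomes are the ones shown in the tree.

```python
def classify_applicant(income, owns_home, credit_score):
    """Walk the decision tree from the root and return the risk label at the leaf."""
    if income > 50000:
        # Left subtree: the decision depends on home ownership.
        return "LOW" if owns_home else "MEDIUM"
    # Right subtree: the decision depends on credit score.
    return "MEDIUM" if credit_score > 650 else "HIGH"

# Income and home ownership follow the applicant from the classification
# example (Income=75000, Owns_Home=No); the credit score is made up.
print(classify_applicant(income=75000, owns_home=False, credit_score=700))
```

Each root-to-leaf path is one rule, for example "IF Income > 50000 AND Owns Home THEN LOW". Note that the credit score is never consulted on the left branch, and home ownership is never consulted on the right.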

Apriori Algorithm — Step by Step

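Transactions (the min_support threshold is not stated; the ✓/✗ marks below are consistent with any threshold above 40% and up to 60%, such as 50%):
T1	{Bread, Butter, Milk}
T2	{Bread, Butter}
T3	{Butter, Milk}
T4	{Bread, Milk}
T5	{Bread, Butter, Milk}

Step 1: Support of 1-itemsets
{Bread}	4/5 = 80% ✓
{Butter}	4/5 = 80% ✓
{Milk}	4/5 = 80% ✓

Step 2: Support of 2-itemsets
{Bread, Butter}	3/5 = 60% ✓
{Bread, Milk}	3/5 = 60% ✓
{Butter, Milk}	3/5 = 60% ✓

Step 3: Support of 3-itemsets
{Bread, Butter, Milk}	2/5 = 40% ✗ (below min_support)

Step 4: Association rules from {Bread, Butter} (support 60%)
Bread → Butter	confidence = 3/4 = 75% (3 transactions contain both, 4 contain Bread)
Butter → Bread	confidence = 3/4 = 75% (3 transactions contain both, 4 contain Butter)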
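The same walk-through can be reproduced in code. This is a compact sketch of the level-wise Apriori idea on the five transactions above, not an optimized implementation. `MIN_SUPPORT = 0.5` is an assumption: the text does not state the threshold, but 50% is consistent with the ✓/✗ marks.

```python
from itertools import combinations

transactions = [
    {"Bread", "Butter", "Milk"},  # T1
    {"Bread", "Butter"},          # T2
    {"Butter", "Milk"},           # T3
    {"Bread", "Milk"},            # T4
    {"Bread", "Butter", "Milk"},  # T5
]
MIN_SUPPORT = 0.5  # assumed threshold; not stated in the text

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori():
    """Level-wise search: size-k candidates are built only from frequent size-(k-1) itemsets."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items if support({i}) >= MIN_SUPPORT]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: union pairs of frequent itemsets into size-k candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must be frequent, then check support.
        level = [
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and support(c) >= MIN_SUPPORT
        ]
    return frequent

frequent = apriori()
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{sup:.0%}")

# Confidence of Bread -> Butter = support({Bread, Butter}) / support({Bread})
conf = frequent[frozenset({"Bread", "Butter"})] / frequent[frozenset({"Bread"})]
print(f"Bread -> Butter confidence: {conf:.0%}")
```

The prune step is what distinguishes Apriori from brute-force counting: a candidate is discarded without scanning the transactions if any of its subsets is already known to be infrequent.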

Data Mining vs Machine Learning vs Statistics

Statistics	Mathematical framework for inference from data
Data Mining	Discovering patterns in DATABASES (large-scale)
Machine Learning	Algorithms that LEARN from data to make predictions

Data Mining Tools

Tool	Description
Weka	Open-source ML/DM toolkit (Java)
RapidMiner	Visual DM workflow designer
Python (scikit-learn, pandas)	Widely used DM/ML library ecosystem
R	Statistical computing and DM
KNIME	Open-source analytics platform
Apache Mahout	Distributed ML on Hadoop
SQL with analytic functions	RANK and other window functions (OVER with PARTITION BY)
