This course covers advanced ML algorithms, deep learning architectures, and current research areas. Prerequisite: Basic ML (Linear/Logistic Regression, Decision Trees, Clustering).
Total Error = Bias² + Variance + Irreducible Noise
High Bias (Underfitting):
- Model too simple
- High training AND test error
- Fix: More complex model, more features
High Variance (Overfitting):
- Model too complex
- Low training error, high test error
- Fix: Regularization, more data, dropout
Regularization:
| Method | Technique | Effect |
|--------|-----------|--------|
| L1 (Lasso) | Add λΣ|wᵢ| to loss | Sparse weights, feature selection |
| L2 (Ridge) | Add λΣwᵢ² to loss | Small weights, no sparsity |
| Elastic Net | L1 + L2 combined | Both effects |
| Dropout | Randomly zero neurons during training | Ensemble effect |
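The L1 and L2 penalties in the table can be sketched in plain NumPy (ridge_loss and lasso_penalty are illustrative names, not library functions):

```python
import numpy as np

def ridge_loss(w, X, y, lam):
    """MSE plus the L2 penalty lam * sum(w_i^2)."""
    residual = X @ w - y
    return np.mean(residual ** 2) + lam * np.sum(w ** 2)

def lasso_penalty(w, lam):
    """L1 penalty lam * sum(|w_i|); its subgradient pushes small weights to exactly 0."""
    return lam * np.sum(np.abs(w))
```

With a perfect fit the ridge loss reduces to the penalty alone, e.g. w = [1, -2] gives λ(1² + 2²) = 5λ.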
Gradient Descent Variants:
# SGD with Momentum
v = β*v - α*∇L(w)
w = w + v
# AdaGrad — adapts lr per parameter
cache += ∇L²
w -= α * ∇L / (√cache + ε)
# Adam (most popular)
m = β₁*m + (1-β₁)*∇L # first moment
v = β₂*v + (1-β₂)*∇L² # second moment
m_hat = m/(1-β₁ᵗ) # bias correction
v_hat = v/(1-β₂ᵗ)
w -= α * m_hat / (√v_hat + ε)
Recommended defaults for Adam: α=0.001, β₁=0.9, β₂=0.999, ε=1e-8
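The Adam equations above translate directly into a runnable NumPy sketch (adam_minimize is an illustrative helper, not a library API); here it minimizes f(w) = (w - 3)²:

```python
import numpy as np

def adam_minimize(grad, w0, alpha=0.01, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=5000):
    """Apply the Adam update rule `steps` times and return the final weights."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g            # first moment
        v = beta2 * v + (1 - beta2) * g ** 2       # second moment
        m_hat = m / (1 - beta1 ** t)               # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w

# f(w) = (w - 3)^2 has gradient 2(w - 3); the minimum is at w = 3
w_final = adam_minimize(lambda w: 2 * (w - 3), np.array([0.0]))
```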
Input: H × W × C (height, width, channels)
Filter: k × k × C (kernel size × kernel size × channels)
Output: (⌊(H - k + 2P)/S⌋ + 1) × (⌊(W - k + 2P)/S⌋ + 1) × num_filters
Where:
P = padding, S = stride
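The output-size formula can be checked with a few lines of Python (conv_output_size is an illustrative helper; integer division implements the floor):

```python
def conv_output_size(h, w, k, padding=0, stride=1):
    """Spatial output size: floor((dim - k + 2P) / S) + 1 per dimension."""
    out_h = (h - k + 2 * padding) // stride + 1
    out_w = (w - k + 2 * padding) // stride + 1
    return out_h, out_w

# "Same" padding with a 3×3 filter keeps a 32×32 input at 32×32:
print(conv_output_size(32, 32, 3, padding=1, stride=1))    # (32, 32)
# ResNet stem: 224×224 input, 7×7 filter, P=3, S=2 → 112×112:
print(conv_output_size(224, 224, 7, padding=3, stride=2))  # (112, 112)
```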
Key Operations:
- Convolution: slide learned k × k filters over the input; shared weights detect local patterns
- Pooling (Max/Average): downsample spatial dimensions, adding translation tolerance
- Activation (ReLU): element-wise non-linearity between layers
- Fully Connected: flatten feature maps for the final classification head
| Architecture | Year | Key Innovation | Parameters |
|--------------|------|----------------|------------|
| LeNet-5 | 1998 | First modern CNN | ~60K |
| AlexNet | 2012 | ReLU, Dropout, GPU | ~60M |
| VGGNet | 2014 | Deep with small 3×3 filters | ~138M |
| GoogLeNet/Inception | 2014 | Inception modules, 1×1 conv | ~7M |
| ResNet | 2015 | Skip connections, 152 layers | ~25M (ResNet-50) |
| DenseNet | 2017 | Dense connections (every to every) | ~8M |
| EfficientNet | 2019 | Compound scaling | ~5.3M (B0) |
| ViT | 2020 | Pure Transformer for vision | ~86M (Base) |
# Standard Block
x → Conv → BN → ReLU → Conv → BN → + input → ReLU
# Mathematical insight:
# Instead of learning the target mapping H(x) directly, learn the residual F(x) = H(x) - x, so the block outputs F(x) + x
# Gradient flows directly through shortcut connections
# Solves vanishing gradient for very deep networks (100+ layers)
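A minimal NumPy sketch of the residual idea (BatchNorm omitted and fully-connected layers used instead of conv for brevity; W1 and W2 are illustrative weight matrices):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = ReLU(F(x) + x): the shortcut adds the input past two layers."""
    out = relu(W1 @ x)        # first layer + activation
    out = W2 @ out            # second layer
    return relu(out + x)      # identity shortcut, then activation

# Even with zero weights (F(x) = 0), the block passes a positive x through unchanged:
x = np.array([1.0, 2.0])
y = residual_block(x, np.zeros((2, 2)), np.zeros((2, 2)))
```

Because the shortcut is the identity, gradients reach early layers without having to shrink through F.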
import torch
import torchvision.models as models

# Load pretrained ResNet
# (newer torchvision: models.resnet50(weights=models.ResNet50_Weights.DEFAULT))
model = models.resnet50(pretrained=True)

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace final layer for new task (e.g., 5 classes)
num_features = model.fc.in_features
model.fc = torch.nn.Linear(num_features, 5)

# Only train the new layer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)
Strategies:
- Feature extraction: freeze the pretrained backbone and train only the new head (as above)
- Fine-tuning: unfreeze some or all layers and train end-to-end with a small learning rate
- Rule of thumb: the smaller and more similar the new dataset is to the pretraining data, the fewer layers to unfreeze
Standard RNN: hₜ = tanh(Wₕhₜ₋₁ + Wₓxₜ + b)
Problem: Gradient = ∂L/∂h₁ = ∂L/∂hₜ × ∏ᵢ (∂hᵢ/∂hᵢ₋₁)
If |∂hᵢ/∂hᵢ₋₁| < 1 for many steps → gradient → 0 (vanishing)
If |∂hᵢ/∂hᵢ₋₁| > 1 for many steps → gradient → ∞ (exploding)
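A two-line numeric check makes the effect of this product concrete, assuming 100 identical factors just below or just above 1:

```python
import numpy as np

vanish = np.prod(np.full(100, 0.9))    # 0.9^100 ≈ 2.7e-5: gradient vanishes
explode = np.prod(np.full(100, 1.1))   # 1.1^100 ≈ 1.4e4: gradient explodes
print(vanish, explode)
```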
LSTM Cell has 3 gates + cell state (Cₜ):
Forget Gate: fₜ = σ(Wf[hₜ₋₁, xₜ] + bf)
Input Gate: iₜ = σ(Wi[hₜ₋₁, xₜ] + bi)
Cell Candidate: C̃ₜ = tanh(Wc[hₜ₋₁, xₜ] + bc)
Cell Update: Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ
Output Gate: oₜ = σ(Wo[hₜ₋₁, xₜ] + bo)
Hidden State: hₜ = oₜ⊙tanh(Cₜ)
σ = sigmoid (0 to 1)
⊙ = element-wise multiplication
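The six equations translate directly into a NumPy step function (lstm_step is an illustrative name; each gate matrix acts on the concatenation [hₜ₋₁, xₜ]):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step; each W_* has shape (hidden, hidden + input)."""
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    c_tilde = np.tanh(W_c @ z + b_c)       # cell candidate
    c_t = f_t * c_prev + i_t * c_tilde     # cell update
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(c_t)               # hidden state
    return h_t, c_t
```

A quick sanity check: with all-zero weights every gate outputs σ(0) = 0.5, so the cell state simply halves at each step.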
GRU Cell (simplified LSTM):
Reset Gate: rₜ = σ(Wr[hₜ₋₁, xₜ])
Update Gate: zₜ = σ(Wz[hₜ₋₁, xₜ])
Candidate: h̃ₜ = tanh(W[rₜ⊙hₜ₋₁, xₜ])
Output: hₜ = (1-zₜ)⊙hₜ₋₁ + zₜ⊙h̃ₜ
GRU vs LSTM:
- GRU: 2 gates, no separate cell state → fewer parameters, faster training
- LSTM: 3 gates plus cell state → more expressive, often better on long sequences
- Performance is similar on most tasks; prefer GRU when efficiency matters
# Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QKᵀ / √d_k) × V
Where:
Q = Query matrix (what we're looking for)
K = Key matrix (what each position has)
V = Value matrix (what to retrieve)
d_k = dimension of keys (scaling factor)
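A direct NumPy implementation of the formula (single head, no masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V, softmax over each row."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilize exp
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

Each output row is a convex combination of the rows of V: the attention weights of every row sum to 1.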
# Multi-Head Attention
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) × W^O
where headᵢ = Attention(QW^Q_i, KW^K_i, VW^V_i)
Encoder:
Input → Embedding → Positional Encoding
→ [Multi-Head Self-Attention → Add&Norm
→ Feed Forward → Add&Norm] × N layers
→ Encoder Output
Decoder:
Output → Embedding → Positional Encoding
→ [Masked Multi-Head Self-Attention → Add&Norm
→ Multi-Head Cross-Attention (K,V from encoder) → Add&Norm
→ Feed Forward → Add&Norm] × N layers
→ Linear → Softmax → Output probabilities
Positional Encoding:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
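The two formulas produce the whole encoding matrix in a few NumPy lines (assumes an even d_model):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]          # even dims: 0, 2, 4, ...
    angle = pos / np.power(10000.0, two_i / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe
```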
Pre-training Tasks:
- Masked Language Modeling (MLM): randomly mask ~15% of input tokens and predict them
- Next Sentence Prediction (NSP): classify whether sentence B actually follows sentence A
Fine-tuning for downstream tasks:
from transformers import BertForSequenceClassification, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits # shape: (batch_size, num_labels)
BERT Variants:
| Model | Parameters | Key Difference |
|-------|------------|----------------|
| BERT-Base | 110M | 12 layers, 12 heads |
| BERT-Large | 340M | 24 layers, 16 heads |
| DistilBERT | 66M | 40% smaller, 97% performance |
| RoBERTa | 125M | Better pre-training (no NSP, more data) |
| ALBERT | 12M | Parameter sharing, very small |
| GPT-3/4 | 175B–1T+ | Decoder-only, generative |
GAN = Generator (G) + Discriminator (D)
Training Objective (minimax game):
min_G max_D V(D,G) = E[log D(x)] + E[log(1 - D(G(z)))]
Generator: G(z) → fake data (z = random noise)
Discriminator: D(x) → probability x is real
G tries to fool D
D tries to distinguish real from fake
| Problem | Symptom | Solution |
|---------|---------|----------|
| Mode Collapse | G produces limited varieties | Minibatch discrimination, WGAN |
| Training Instability | Loss oscillates wildly | WGAN, spectral normalization |
| Vanishing Gradients | G gets no feedback | WGAN-GP, LS-GAN |
| Non-convergence | D always wins or G always wins | Balance training steps, TTUR |
WGAN (Wasserstein GAN):
Instead of log loss, use Wasserstein distance:
D_loss = -E[D(real)] + E[D(fake)]
G_loss = -E[D(fake)]
Require: Lipschitz constraint (clip weights to [-c, c] or gradient penalty)
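The two WGAN losses are just sign-flipped means of raw critic scores; a tiny sketch (wgan_losses is an illustrative helper operating on precomputed critic outputs):

```python
import numpy as np

def wgan_losses(d_real, d_fake):
    """Critic wants real scores high and fake scores low; G wants the opposite."""
    d_loss = -np.mean(d_real) + np.mean(d_fake)
    g_loss = -np.mean(d_fake)
    return d_loss, g_loss
```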
| GAN Type | Key Innovation | Application |
|----------|----------------|-------------|
| DCGAN | Convolutional layers | Image generation |
| CGAN | Class label conditioning | Class-specific generation |
| CycleGAN | Cycle consistency loss | Image-to-image translation |
| StyleGAN2 | Style-based generator | Photorealistic faces |
| Pix2Pix | Paired image translation | Sketch → Photo |
| BigGAN | Large-scale, class-conditional | High resolution images |
MDP = (S, A, P, R, γ)
S = State space
A = Action space
P = Transition probability P(s'|s,a)
R = Reward function R(s,a)
γ = Discount factor (0 < γ ≤ 1)
Goal: Find policy π*(a|s) that maximizes:
G_t = Σ γᵏ R_{t+k+1} (sum of discounted future rewards)
State-Value Function: V^π(s) = E_π[G_t | S_t = s]
Action-Value Function (Q-function): Q^π(s,a) = E_π[G_t | S_t = s, A_t = a]
Bellman Equations:
V^π(s) = Σ_a π(a|s) Σ_s' P(s'|s,a) [R(s,a,s') + γV^π(s')]
Q^π(s,a) = Σ_s' P(s'|s,a) [R(s,a,s') + γ Σ_a' π(a'|s') Q^π(s',a')]
Q-Learning Update (off-policy, model-free):
Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
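The update rule alone is enough to solve a toy problem. Here is a tabular sketch on a hypothetical 5-state chain (action 1 = right, reward 1 at the right end; the environment, seed, and hyperparameters are all illustrative):

```python
import numpy as np

# Hypothetical 5-state chain MDP: action 1 moves right, action 0 moves left;
# reaching the rightmost state gives reward 1 and ends the episode.
n_states, n_actions = 5, 2
alpha, gamma = 0.5, 0.9
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def env_step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    done = s_next == n_states - 1
    return s_next, reward, done

for _ in range(300):
    s, done = 0, False
    while not done:
        a = int(rng.integers(n_actions))   # fully random exploration (off-policy)
        s_next, r, done = env_step(s, a)
        # Q-learning update from above
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
```

Although the behavior policy is purely random, the learned greedy policy chooses "right" in every non-terminal state, and Q(3, right) converges to the true value 1.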
DQN Innovations (DeepMind, 2015):
- Experience Replay: store transitions in a buffer and train on random minibatches, breaking temporal correlation
- Target Network: a periodically synced copy of the Q-network provides stable TD targets
- ε-greedy exploration with ε annealed over training
# DQN pseudocode
step = 0
for episode in range(num_episodes):
    s = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q_network, s)  # explore vs exploit
        s_next, r, done = env.step(a)
        replay_buffer.push(s, a, r, s_next, done)
        s = s_next
        step += 1
        if len(replay_buffer) > batch_size:
            batch = replay_buffer.sample(batch_size)
            # target = r + γ * max_a' Q_target(s', a')
            # loss = MSE(Q_network(s, a), target)
            optimizer.step()
        if step % target_update_freq == 0:
            Q_target.load_state_dict(Q_network.state_dict())
Classification:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
AUC-ROC = Area under ROC curve (1.0 = perfect)
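The three classification formulas in runnable form (counts come from a confusion matrix; classification_metrics is an illustrative helper):

```python
def classification_metrics(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 8 true positives, 2 false positives, 2 false negatives:
print(classification_metrics(8, 2, 2))  # (0.8, 0.8, 0.8)
```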
Regression:
MAE = (1/n) Σ |y - ŷ|
MSE = (1/n) Σ (y - ŷ)²
RMSE = √MSE
R² = 1 - SS_res/SS_tot (1.0 = perfect)
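The four regression formulas as one helper (regression_metrics is an illustrative name):

```python
import numpy as np

def regression_metrics(y, y_hat):
    """MAE, MSE, RMSE, and R² exactly as defined above."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    mae = np.mean(np.abs(y - y_hat))
    mse = np.mean((y - y_hat) ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return mae, mse, rmse, 1 - ss_res / ss_tot
```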
from sklearn.model_selection import KFold, StratifiedKFold

# K-Fold (regression)
kf = KFold(n_splits=5, shuffle=True)
for train_idx, val_idx in kf.split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    # train and evaluate model

# Stratified K-Fold (classification — preserves class ratio)
skf = StratifiedKFold(n_splits=5)
for train_idx, val_idx in skf.split(X, y):
    # ...
Training → Evaluation → Export → Serve → Monitor
Tools:
Training: PyTorch / TensorFlow / scikit-learn
Export: ONNX, TorchScript, SavedModel, Pickle
Serving: FastAPI + uvicorn, TorchServe, TF Serving
Container: Docker → Kubernetes
Monitoring: MLflow, Weights & Biases, Prometheus + Grafana
# FastAPI Model Serving
from fastapi import FastAPI
import torch

app = FastAPI()
model = torch.load('model.pt')
model.eval()

@app.post("/predict")
async def predict(data: dict):
    input_tensor = torch.tensor(data['features']).float()
    with torch.no_grad():
        output = model(input_tensor)
    return {"prediction": output.numpy().tolist()}
Advanced ML and Deep Learning for MTech/MCA students. Covers CNNs, RNNs, Transformers, Attention mechanism, GAN, Reinforcement Learning, NLP with BERT, and model deployment.
52 pages · 2.1 MB · Updated 2026-03-11
In deep networks, gradients become extremely small during backpropagation through many layers, making early layers train very slowly or stop learning. Solutions: ReLU activation, Batch Normalization, ResNet skip connections, gradient clipping, LSTM/GRU for sequential data.
LSTM has 3 gates (input, forget, output) and separate cell state and hidden state — more expressive but slower. GRU has 2 gates (reset, update) and single hidden state — faster and similar performance on most tasks. GRU is preferred when computational efficiency matters.
Attention computes a weighted sum of values based on query-key similarity. Self-attention: each word attends to all other words. Formula: Attention(Q,K,V) = softmax(QK^T/√d_k)V. Multi-head attention runs multiple attention heads in parallel for richer representations.