This course covers advanced ML algorithms, deep learning architectures, and current research areas. Prerequisite: Basic ML (Linear/Logistic Regression, Decision Trees, Clustering).
Total Error = Bias² + Variance + Irreducible Noise
High Bias (Underfitting):
- Model too simple
- High training AND test error
- Fix: More complex model, more features
High Variance (Overfitting):
- Model too complex
- Low training error, high test error
- Fix: Regularization, more data, dropout
Regularization:
| Method | Technique | Effect |
|--------|-----------|--------|
| L1 (Lasso) | Add λΣ|wᵢ| to loss | Sparse weights, feature selection |
| L2 (Ridge) | Add λΣwᵢ² to loss | Small weights, no sparsity |
| Elastic Net | L1 + L2 combined | Both effects |
| Dropout | Randomly zero neurons during training | Ensemble effect |
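The L1 and L2 penalties in the table can be sketched in plain NumPy (ridge_loss and lasso_penalty are illustrative names, not library functions):

```python
import numpy as np

def ridge_loss(w, X, y, lam):
    """MSE plus the L2 penalty lam * sum(w_i^2)."""
    residual = X @ w - y
    return np.mean(residual ** 2) + lam * np.sum(w ** 2)

def lasso_penalty(w, lam):
    """L1 penalty lam * sum(|w_i|); its subgradient pushes small weights to exactly 0."""
    return lam * np.sum(np.abs(w))
```

With a perfect fit the ridge loss reduces to the penalty alone, e.g. w = [1, -2] gives λ(1² + 2²) = 5λ.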
Gradient Descent Variants:
# SGD with Momentum
v = β*v - α*∇L(w)
w = w + v
# AdaGrad — adapts lr per parameter
cache += ∇L²
w -= α * ∇L / (√cache + ε)
# Adam (most popular)
m = β₁*m + (1-β₁)*∇L # first moment
v = β₂*v + (1-β₂)*∇L² # second moment
m_hat = m/(1-β₁ᵗ) # bias correction
v_hat = v/(1-β₂ᵗ)
w -= α * m_hat / (√v_hat + ε)
Recommended defaults for Adam: α=0.001, β₁=0.9, β₂=0.999, ε=1e-8
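The Adam equations above translate directly into a runnable NumPy sketch (adam_minimize is an illustrative helper, not a library API); here it minimizes f(w) = (w - 3)²:

```python
import numpy as np

def adam_minimize(grad, w0, alpha=0.01, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=5000):
    """Apply the Adam update rule `steps` times and return the final weights."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g            # first moment
        v = beta2 * v + (1 - beta2) * g ** 2       # second moment
        m_hat = m / (1 - beta1 ** t)               # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w

# f(w) = (w - 3)^2 has gradient 2(w - 3); the minimum is at w = 3
w_final = adam_minimize(lambda w: 2 * (w - 3), np.array([0.0]))
```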
Input: H × W × C (height, width, channels)
Filter: k × k × C (kernel size × kernel size × channels)
Output: (⌊(H - k + 2P)/S⌋ + 1) × (⌊(W - k + 2P)/S⌋ + 1) × num_filters
Where:
P = padding, S = stride
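The output-size formula can be checked with a few lines of Python (conv_output_size is an illustrative helper; integer division implements the floor):

```python
def conv_output_size(h, w, k, padding=0, stride=1):
    """Spatial output size: floor((dim - k + 2P) / S) + 1 per dimension."""
    out_h = (h - k + 2 * padding) // stride + 1
    out_w = (w - k + 2 * padding) // stride + 1
    return out_h, out_w

# "Same" padding with a 3×3 filter keeps a 32×32 input at 32×32:
print(conv_output_size(32, 32, 3, padding=1, stride=1))    # (32, 32)
# ResNet stem: 224×224 input, 7×7 filter, P=3, S=2 → 112×112:
print(conv_output_size(224, 224, 7, padding=3, stride=2))  # (112, 112)
```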
Key Operations:
- Convolution: slide learned k × k filters over the input; shared weights detect local patterns
- Pooling (Max/Average): downsample spatial dimensions, adding translation tolerance
- Activation (ReLU): element-wise non-linearity between layers
- Fully Connected: flatten feature maps for the final classification head
| Architecture | Year | Key Innovation | Parameters |
|--------------|------|----------------|------------|
| LeNet-5 | 1998 | First modern CNN | ~60K |
| AlexNet | 2012 | ReLU, Dropout, GPU | ~60M |
| VGGNet | 2014 | Deep with small 3×3 filters | ~138M |
| GoogLeNet/Inception | 2014 | Inception modules, 1×1 conv | ~7M |
| ResNet | 2015 | Skip connections, 152 layers | ~25M (ResNet-50) |
| DenseNet | 2017 | Dense connections (every to every) | ~8M |
| EfficientNet | 2019 | Compound scaling | ~5.3M (B0) |
| ViT | 2020 | Pure Transformer for vision | ~86M (Base) |
# Standard Block
x → Conv → BN → ReLU → Conv → BN → + input → ReLU
# Mathematical insight:
# Instead of learning the target mapping H(x) directly, learn the residual F(x) = H(x) - x, so the block outputs F(x) + x
# Gradient flows directly through shortcut connections
# Solves vanishing gradient for very deep networks (100+ layers)
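A minimal NumPy sketch of the residual idea (BatchNorm omitted and fully-connected layers used instead of conv for brevity; W1 and W2 are illustrative weight matrices):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = ReLU(F(x) + x): the shortcut adds the input past two layers."""
    out = relu(W1 @ x)        # first layer + activation
    out = W2 @ out            # second layer
    return relu(out + x)      # identity shortcut, then activation

# Even with zero weights (F(x) = 0), the block passes a positive x through unchanged:
x = np.array([1.0, 2.0])
y = residual_block(x, np.zeros((2, 2)), np.zeros((2, 2)))
```

Because the shortcut is the identity, gradients reach early layers without having to shrink through F.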
import torch
import torchvision.models as models

# Load pretrained ResNet
# (newer torchvision: models.resnet50(weights=models.ResNet50_Weights.DEFAULT))
model = models.resnet50(pretrained=True)

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace final layer for new task (e.g., 5 classes)
num_features = model.fc.in_features
model.fc = torch.nn.Linear(num_features, 5)

# Only train the new layer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)
Strategies:
- Feature extraction: freeze the pretrained backbone and train only the new head (as above)
- Fine-tuning: unfreeze some or all layers and train end-to-end with a small learning rate
- Rule of thumb: the smaller and more similar the new dataset is to the pretraining data, the fewer layers to unfreeze
Standard RNN: hₜ = tanh(Wₕhₜ₋₁ + Wₓxₜ + b)
Problem: Gradient = ∂L/∂h₁ = ∂L/∂hₜ × ∏ᵢ (∂hᵢ/∂hᵢ₋₁)
If |∂hᵢ/∂hᵢ₋₁| < 1 for many steps → gradient → 0 (vanishing)
If |∂hᵢ/∂hᵢ₋₁| > 1 for many steps → gradient → ∞ (exploding)
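A two-line numeric check makes the effect of this product concrete, assuming 100 identical factors just below or just above 1:

```python
import numpy as np

vanish = np.prod(np.full(100, 0.9))    # 0.9^100 ≈ 2.7e-5: gradient vanishes
explode = np.prod(np.full(100, 1.1))   # 1.1^100 ≈ 1.4e4: gradient explodes
print(vanish, explode)
```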
LSTM Cell has 3 gates + cell state (Cₜ):
Forget Gate: fₜ = σ(Wf[hₜ₋₁, xₜ] + bf)
Input Gate: iₜ = σ(Wi[hₜ₋₁, xₜ] + bi)
Cell Candidate: C̃ₜ = tanh(Wc[hₜ₋₁, xₜ] + bc)
Cell Update: Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ
Output Gate: oₜ = σ(Wo[hₜ₋₁, xₜ] + bo)
Hidden State: hₜ = oₜ⊙tanh(Cₜ)
σ = sigmoid (0 to 1)
⊙ = element-wise multiplication
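The six equations translate directly into a NumPy step function (lstm_step is an illustrative name; each gate matrix acts on the concatenation [hₜ₋₁, xₜ]):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step; each W_* has shape (hidden, hidden + input)."""
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    c_tilde = np.tanh(W_c @ z + b_c)       # cell candidate
    c_t = f_t * c_prev + i_t * c_tilde     # cell update
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(c_t)               # hidden state
    return h_t, c_t
```

A quick sanity check: with all-zero weights every gate outputs σ(0) = 0.5, so the cell state simply halves at each step.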
GRU Cell (simplified LSTM):
Reset Gate: rₜ = σ(Wr[hₜ₋₁, xₜ])
Update Gate: zₜ = σ(Wz[hₜ₋₁, xₜ])
Candidate: h̃ₜ = tanh(W[rₜ⊙hₜ₋₁, xₜ])
Output: hₜ = (1-zₜ)⊙hₜ₋₁ + zₜ⊙h̃ₜ
GRU vs LSTM:
- GRU: 2 gates, no separate cell state → fewer parameters, faster training
- LSTM: 3 gates plus cell state → more expressive, often better on long sequences
- Performance is similar on most tasks; prefer GRU when efficiency matters
# Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QKᵀ / √d_k) × V
Where:
Q = Query matrix (what we're looking for)
K = Key matrix (what each position has)
V = Value matrix (what to retrieve)
d_k = dimension of keys (scaling factor)
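A direct NumPy implementation of the formula (single head, no masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V, softmax over each row."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilize exp
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

Each output row is a convex combination of the rows of V: the attention weights of every row sum to 1.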
# Multi-Head Attention
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) × W^O
where headᵢ = Attention(QW^Q_i, KW^K_i, VW^V_i)
Encoder:
Input → Embedding → Positional Encoding
→ [Multi-Head Self-Attention → Add&Norm
→ Feed Forward → Add&Norm] × N layers
→ Encoder Output
Decoder:
Output → Embedding → Positional Encoding
→ [Masked Multi-Head Self-Attention → Add&Norm
→ Multi-Head Cross-Attention (K,V from encoder) → Add&Norm
→ Feed Forward → Add&Norm] × N layers
→ Linear → Softmax → Output probabilities
Positional Encoding:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
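The two formulas produce the whole encoding matrix in a few NumPy lines (assumes an even d_model):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]          # even dims: 0, 2, 4, ...
    angle = pos / np.power(10000.0, two_i / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe
```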
Pre-training Tasks:
- Masked Language Modeling (MLM): randomly mask ~15% of input tokens and predict them
- Next Sentence Prediction (NSP): classify whether sentence B actually follows sentence A
Fine-tuning for downstream tasks:
from transformers import BertForSequenceClassification, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits # shape: (batch_size, num_labels)
BERT Variants:
| Model | Parameters | Key Difference |
|-------|------------|----------------|
| BERT-Base | 110M | 12 layers, 12 heads |
| BERT-Large | 340M | 24 layers, 16 heads |
| DistilBERT | 66M | 40% smaller, 97% performance |
| RoBERTa | 125M | Better pre-training (no NSP, more data) |
| ALBERT | 12M | Parameter sharing, very small |
| GPT-3/4 | 175B–1T+ | Decoder-only, generative |
GAN = Generator (G) + Discriminator (D)
Training Objective (minimax game):
min_G max_D V(D,G) = E[log D(x)] + E[log(1 - D(G(z)))]
Generator: G(z) → fake data (z = random noise)
Discriminator: D(x) → probability x is real
G tries to fool D
D tries to distinguish real from fake
| Problem | Symptom | Solution |
|---------|---------|----------|
| Mode Collapse | G produces limited varieties | Minibatch discrimination, WGAN |
| Training Instability | Loss oscillates wildly | WGAN, spectral normalization |
| Vanishing Gradients | G gets no feedback | WGAN-GP, LS-GAN |
| Non-convergence | D always wins or G always wins | Balance training steps, TTUR |
WGAN (Wasserstein GAN):
Instead of log loss, use Wasserstein distance:
D_loss = -E[D(real)] + E[D(fake)]
G_loss = -E[D(fake)]
Require: Lipschitz constraint (clip weights to [-c, c] or gradient penalty)
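The two WGAN losses are just sign-flipped means of raw critic scores; a tiny sketch (wgan_losses is an illustrative helper operating on precomputed critic outputs):

```python
import numpy as np

def wgan_losses(d_real, d_fake):
    """Critic wants real scores high and fake scores low; G wants the opposite."""
    d_loss = -np.mean(d_real) + np.mean(d_fake)
    g_loss = -np.mean(d_fake)
    return d_loss, g_loss
```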
| GAN Type | Key Innovation | Application |
|----------|----------------|-------------|
| DCGAN | Convolutional layers | Image generation |
| CGAN | Class label conditioning | Class-specific generation |
| CycleGAN | Cycle consistency loss | Image-to-image translation |
| StyleGAN2 | Style-based generator | Photorealistic faces |
| Pix2Pix | Paired image translation | Sketch → Photo |
| BigGAN | Large-scale, class-conditional | High resolution images |
MDP = (S, A, P, R, γ)
S = State space
A = Action space
P = Transition probability P(s'|s,a)
R = Reward function R(s,a)
γ = Discount factor (0 < γ ≤ 1)
Goal: Find policy π*(a|s) that maximizes:
G_t = Σ γᵏ R_{t+k+1} (sum of discounted future rewards)
State-Value Function: V^π(s) = E_π[G_t | S_t = s]
Action-Value Function (Q-function): Q^π(s,a) = E_π[G_t | S_t = s, A_t = a]
Bellman Equations:
V^π(s) = Σ_a π(a|s) Σ_s' P(s'|s,a) [R(s,a,s') + γV^π(s')]
Q^π(s,a) = Σ_s' P(s'|s,a) [R(s,a,s') + γ Σ_a' π(a'|s') Q^π(s',a')]
Q-Learning Update (off-policy, model-free):
Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
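The update rule alone is enough to solve a toy problem. Here is a tabular sketch on a hypothetical 5-state chain (action 1 = right, reward 1 at the right end; the environment, seed, and hyperparameters are all illustrative):

```python
import numpy as np

# Hypothetical 5-state chain MDP: action 1 moves right, action 0 moves left;
# reaching the rightmost state gives reward 1 and ends the episode.
n_states, n_actions = 5, 2
alpha, gamma = 0.5, 0.9
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def env_step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    done = s_next == n_states - 1
    return s_next, reward, done

for _ in range(300):
    s, done = 0, False
    while not done:
        a = int(rng.integers(n_actions))   # fully random exploration (off-policy)
        s_next, r, done = env_step(s, a)
        # Q-learning update from above
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
```

Although the behavior policy is purely random, the learned greedy policy chooses "right" in every non-terminal state, and Q(3, right) converges to the true value 1.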
DQN Innovations (DeepMind, 2015):
- Experience Replay: store transitions in a buffer and train on random minibatches, breaking temporal correlation
- Target Network: a periodically synced copy of the Q-network provides stable TD targets
- ε-greedy exploration with ε annealed over training
# DQN pseudocode
step = 0
for episode in range(num_episodes):
    s = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q_network, s)  # explore vs exploit
        s_next, r, done = env.step(a)
        replay_buffer.push(s, a, r, s_next, done)
        s = s_next
        step += 1
        if len(replay_buffer) > batch_size:
            batch = replay_buffer.sample(batch_size)
            # target = r + γ * max_a' Q_target(s', a')
            # loss = MSE(Q_network(s, a), target)
            optimizer.step()
        if step % target_update_freq == 0:
            Q_target.load_state_dict(Q_network.state_dict())
Classification:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
AUC-ROC = Area under ROC curve (1.0 = perfect)
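The three classification formulas in runnable form (counts come from a confusion matrix; classification_metrics is an illustrative helper):

```python
def classification_metrics(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 8 true positives, 2 false positives, 2 false negatives:
print(classification_metrics(8, 2, 2))  # (0.8, 0.8, 0.8)
```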
Regression:
MAE = (1/n) Σ |y - ŷ|
MSE = (1/n) Σ (y - ŷ)²
RMSE = √MSE
R² = 1 - SS_res/SS_tot (1.0 = perfect)
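The four regression formulas as one helper (regression_metrics is an illustrative name):

```python
import numpy as np

def regression_metrics(y, y_hat):
    """MAE, MSE, RMSE, and R² exactly as defined above."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    mae = np.mean(np.abs(y - y_hat))
    mse = np.mean((y - y_hat) ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return mae, mse, rmse, 1 - ss_res / ss_tot
```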
from sklearn.model_selection import KFold, StratifiedKFold

# K-Fold (regression)
kf = KFold(n_splits=5, shuffle=True)
for train_idx, val_idx in kf.split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    # train and evaluate model

# Stratified K-Fold (classification — preserves class ratio)
skf = StratifiedKFold(n_splits=5)
for train_idx, val_idx in skf.split(X, y):
    # ...
Training → Evaluation → Export → Serve → Monitor
Tools:
Training: PyTorch / TensorFlow / scikit-learn
Export: ONNX, TorchScript, SavedModel, Pickle
Serving: FastAPI + uvicorn, TorchServe, TF Serving
Container: Docker → Kubernetes
Monitoring: MLflow, Weights & Biases, Prometheus + Grafana
# FastAPI Model Serving
from fastapi import FastAPI
import torch

app = FastAPI()
model = torch.load('model.pt')
model.eval()

@app.post("/predict")
async def predict(data: dict):
    input_tensor = torch.tensor(data['features']).float()
    with torch.no_grad():
        output = model(input_tensor)
    return {"prediction": output.numpy().tolist()}
Advanced ML and Deep Learning for MTech/MCA students. Covers CNNs, RNNs, Transformers, Attention mechanism, GAN, Reinforcement Learning, NLP with BERT, and model deployment.
52 pages · 2.1 MB · Updated 2026-03-11
In deep networks, gradients become extremely small during backpropagation through many layers, making early layers train very slowly or stop learning. Solutions: ReLU activation, Batch Normalization, ResNet skip connections, gradient clipping, LSTM/GRU for sequential data.
LSTM has 3 gates (input, forget, output) and separate cell state and hidden state — more expressive but slower. GRU has 2 gates (reset, update) and single hidden state — faster and similar performance on most tasks. GRU is preferred when computational efficiency matters.
Attention computes a weighted sum of values based on query-key similarity. Self-attention: each word attends to all other words. Formula: Attention(Q,K,V) = softmax(QK^T/√d_k)V. Multi-head attention runs multiple attention heads in parallel for richer representations.