AI Learning Series · Part 1 of 4

ML Paradigms & Neural Architectures

Mental models before math. Supervised, unsupervised, reinforcement learning — and every major neural architecture explained from first principles.

Supervised Learning · Neural Nets · Transformers · LLMs · Agents · Inference

01 The Big Picture

Before a single equation: what is the game we're actually playing?

Traditional software is a set of explicit rules. You, the programmer, encode logic: if temperature > 100, alert fire. Machine learning flips this contract. Instead of writing rules, you provide examples — input-output pairs — and let an algorithm discover the rules by itself.

This matters because some problems have rules too complex to write by hand. Recognizing a face, translating a sentence, predicting whether a tumor is malignant — the rules exist in the world, but they're buried in patterns across millions of examples. ML is a machine for extracting those patterns.

The central insight of ML:

Every ML algorithm is, at its core, an automated search through a space of possible functions, guided by a measure of how wrong the current function is on known examples.
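To make the "search through a space of functions" idea concrete, here is a minimal sketch. The temperature readings, labels, and candidate thresholds are invented for illustration; the point is only that the rule is chosen by measuring error on examples rather than written by hand.

```python
# Hypothetical toy data: (temperature, is_fire) pairs, invented for illustration.
examples = [(20, 0), (45, 0), (80, 0), (95, 0), (105, 1), (120, 1), (150, 1), (300, 1)]

def error_count(threshold, data):
    """Count examples that the rule `temperature > threshold` gets wrong."""
    return sum(int(temp > threshold) != label for temp, label in data)

def learn_threshold(data, candidates):
    """Search the candidate rules and keep the one with the fewest mistakes."""
    return min(candidates, key=lambda t: error_count(t, data))

# The "space of possible functions": one rule per candidate threshold.
candidates = range(0, 301, 5)
best = learn_threshold(examples, candidates)
print("learned rule: temperature >", best)
```

Real algorithms search far larger function spaces and use gradients instead of brute force, but the contract is the same: candidates in, error measure, best candidate out.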

This document traces the story from that simple idea all the way to a neural network that can learn arbitrary functions — building mental models at every step, before any math.

Traditional Programming vs. Machine Learning

[Figure: Traditional programming takes data (input) plus rules (code) and produces output; you write the rules. Machine learning takes data (input) plus output (labels) and produces the rules; the algorithm discovers them.]

03 Supervised Learning

Learning from labeled examples — the backbone of almost everything.

Supervised learning is ML where every training example has a known correct answer (label). You show the model (image → cat), (image → dog), 1 million times. The model learns a mapping. Then you give it a new image it has never seen — and it predicts.

The word "supervised" means a teacher (the labels) is guiding the learning. Contrast with unsupervised learning (no labels, find structure) and reinforcement learning (labels come as rewards from the environment, after actions).

Classification

Output is a category. "Is this email spam?" → Yes/No. "What digit is this?" → 0–9. The model draws decision boundaries in the input space.

Binary · Multi-class · Multi-label

Regression

Output is a continuous number. "What will this house sell for?" → $420,000. The model fits a curve through the data points.

Linear · Polynomial · Neural

The Supervised Training Loop (every gradient-trained model follows this)

1. Forward pass: feed data through the model to get a prediction.
2. Loss: compare the prediction to the true label.
3. Backward pass: compute the gradient of the loss (backprop).
4. Update: adjust the parameters via the optimizer.

Repeat for N epochs over the training dataset, until the loss stops decreasing.

This loop is identical whether you're training a linear regression by gradient descent, a small neural network, or a 70B parameter language model. The details of each step differ, but the structure never changes. (Some supervised models, such as random forests, are fit by a different procedure and have no backward pass.)
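The four steps can be sketched on the smallest possible model, a one-feature linear regression. The data, learning rate, and epoch count below are assumptions chosen so the example converges; the gradients are written out by hand rather than computed by a framework.

```python
# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

w, b = 0.0, 0.0   # parameters, starting from zero
lr = 0.05         # learning rate (hypothetical value)
losses = []

for epoch in range(2000):
    # 1. Forward pass: data through the model gives predictions.
    preds = [w * x + b for x in xs]
    # 2. Loss: mean squared error against the true labels.
    errors = [p - y for p, y in zip(preds, ys)]
    loss = sum(e * e for e in errors) / len(xs)
    losses.append(loss)
    # 3. Backward pass: gradient of the loss with respect to w and b.
    grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = sum(2 * e for e in errors) / len(xs)
    # 4. Update: step against the gradient.
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 3), round(b, 3))
```

Swapping in a bigger model changes steps 1 and 3 (more layers, automatic differentiation) but not the shape of the loop.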

The goal of supervised learning is not to memorize the training data — it's to generalize: to make accurate predictions on examples never seen during training. This distinction is everything.

Overfitting

The model memorizes training data including its noise. Performs perfectly on training, terribly on new data. Like a student who memorizes exam answers without understanding the subject.

Underfitting

The model is too simple to capture the real pattern. High error on both training and test data. Like trying to fit a line through data that is clearly a curve.

⚖️
Bias-Variance Tradeoff: Simple models have high bias (wrong assumptions) but low variance (consistent). Complex models have low bias but high variance (sensitive to training noise). Good ML is finding the sweet spot — not always using the most complex model available.

04 Unsupervised Learning

No labels, no teacher — find structure the data hides from you.

Supervised learning requires labeled data — expensive, slow to produce, and sometimes impossible (what's the "correct" label for a news article?). Unsupervised learning asks a different question: what patterns exist in the data itself, without any labels? The signal comes from the data's own structure.

Core intuition:

If similar things tend to appear together or have similar structure, a model can discover those similarities without being told what they are. Compression is a useful lens: a model that can compress data well has implicitly learned its structure.

Clustering groups similar data points together without predefined categories. The canonical algorithm is k-means: pick k centroids, assign each point to the nearest centroid, recompute centroids, repeat until stable.

k-Means: Find Natural Groups in Data

[Figure: Before, unlabeled points. After k-means with k=3, three clusters are discovered, with centroids μ₁, μ₂, μ₃.]
🧲 Model: Gravity Wells

Each centroid is a gravity well pulling nearby points in. Initialization matters hugely: bad starting centroids lead to bad clusters. k-means++ mitigates this by choosing initial centroids that tend to be far apart.

k-means assumes spherical clusters of similar size. Real-world clusters can be elongated, nested, or unevenly dense — DBSCAN handles these better.
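The assign-then-recompute loop of k-means fits in a few lines. This is a sketch on invented 2D points with hand-picked starting centroids (k-means++ would choose them more carefully); `assign` and `kmeans` are illustrative names, not a library API.

```python
import math

def assign(points, centroids):
    """Put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, centroids, max_iters=100):
    for _ in range(max_iters):
        clusters = assign(points, centroids)
        # Recompute each centroid as the mean of its cluster (keep it if the cluster is empty).
        updated = [tuple(sum(axis) / len(c) for axis in zip(*c)) if c else old
                   for c, old in zip(clusters, centroids)]
        if updated == centroids:   # stable: no centroid moved
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical data: three loose groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.0), (8.5, 9.5),
          (1.0, 9.0), (0.5, 8.0), (1.5, 8.5)]
centroids, clusters = kmeans(points, [points[0], points[3], points[6]])
print(centroids)
```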

High-dimensional data (images, genomics, embeddings) is hard to visualize and compute over. Dimensionality reduction compresses it into fewer dimensions while preserving important structure. This is not just preprocessing — it reveals hidden geometry.

PCA (Principal Component Analysis)

Linear. Finds directions of maximum variance. Projects data onto a lower-dimensional hyperplane. Fast, interpretable, but can't capture non-linear structure.

t-SNE / UMAP

Non-linear. Both preserve local neighborhood structure. Excellent for visualization (2D/3D), and used everywhere in ML to visualize embeddings. Not meant for general-purpose compression. Of the two, UMAP is faster and scales better to large datasets.

Autoencoders are neural networks trained to compress data into a bottleneck (latent space), then reconstruct it. The loss is reconstruction error — no labels needed. The bottleneck forces the network to learn a compact, meaningful representation.

Autoencoder: Compress → Reconstruct

Input x (high-dim) → Encoder (compress) → Latent z (bottleneck) → Decoder (reconstruct) → Output x̂ ≈ x

Loss = ||x − x̂||² (reconstruction error)

The latent space z is the learned representation — useful for anomaly detection (high reconstruction error = anomaly), denoising, and as features for downstream tasks. VAEs (Variational Autoencoders) make z a probability distribution, enabling generation of new samples.

Self-supervised learning is the bridge between unsupervised and supervised. You create artificial supervised tasks from unlabeled data — the labels come from the data itself.

Masked Language Modeling (BERT)

Randomly mask 15% of tokens, predict the masked words. The label is the original token — free from the text itself. This forces the model to understand context in both directions.
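A rough sketch of how masked training pairs come "for free" from raw text. It is deliberately simplified: whitespace tokens instead of subwords, and every selected position is replaced by the mask token (BERT itself sometimes keeps or randomizes the selected token). The function name and sentence are made up for illustration.

```python
import random

def make_mlm_example(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Return (masked inputs, {position: original token}) built from plain text."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    inputs = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]   # the label is simply the token we hid
        inputs[pos] = mask_token
    return inputs, labels

tokens = "the cat sat on the mat because it was tired and warm today".split()
inputs, labels = make_mlm_example(tokens)
print(inputs)
print(labels)
```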

Contrastive Learning (SimCLR, CLIP)

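Create two augmented views of the same image. Train the model so their embeddings are close, while embeddings of different images stay far apart. This learns rich visual representations with no human labels at all (SimCLR). CLIP applies the same contrastive idea to image-caption pairs, using the caption as the free supervision signal.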

🔑
Why self-supervised matters for LLMs: GPT's next-token prediction is self-supervised. BERT's masked token prediction is self-supervised. The internet provides trillions of self-supervised training examples for free. This is why LLMs can be trained on such vast data — no expensive human labeling needed.

05 Reinforcement Learning

Learning from consequences — not from a teacher, but from the world.

In supervised learning, you have a dataset with correct answers. In reinforcement learning, you have an agent in an environment — no dataset, no correct answers. The agent tries actions, the environment responds with a reward signal, and the agent learns to maximize cumulative reward over time.

This is the most general learning framework — and also the closest to how humans learn complex skills like playing chess, riding a bike, or writing code.

The RL Loop: Agent ↔ Environment

Agent (policy π: state → action) sends action aₜ to the Environment (world dynamics). The Environment returns reward rₜ and next state sₜ₊₁ to the Agent.

Objective: maximize E[Σ γᵗ rₜ]

Key Concepts

Policy (π)

The agent's strategy — a mapping from state to action. What should I do given what I see? The policy is what we're training. It can be deterministic (given state → fixed action) or stochastic (given state → probability over actions).

Value Function (V)

How good is it to be in a given state? V(s) = expected cumulative reward from state s following the current policy. The agent uses value estimates to plan: prefer states that lead to higher future rewards.

Discount Factor (γ)

A reward now is worth more than the same reward later. γ ∈ [0,1] controls how much the agent values future rewards. γ=0: only care about immediate reward. γ→1: plan far ahead. Think of it as patience.
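The effect of γ is easy to see numerically. The sketch below computes the discounted sum Σ γᵗ rₜ for a made-up episode in which the only reward arrives at the fourth step.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

rewards = [0, 0, 0, 10]   # hypothetical episode: reward only at the end

for gamma in (0.0, 0.9, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With γ=0 the delayed reward is invisible to the agent, with γ=1 it counts in full, and with γ=0.9 it is worth 10 · 0.9³ = 7.29.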

Exploration vs. Exploitation

The fundamental tension. Exploit: do what you know works. Explore: try new things that might be better. Too much exploitation → stuck in suboptimal behavior. Too much exploration → never learn to be good. ε-greedy: explore randomly with probability ε.

🎮 Model: Learning a Video Game

Imagine playing a new game with no manual. You try button combinations. Sometimes you score (positive reward), sometimes you die (negative reward). Over thousands of attempts, you learn which actions in which situations tend to lead to high scores. You never see the "correct" move — you discover it by trial and error. That's RL.

Games have clean reward signals (score). Real-world RL is harder because rewards are sparse (you only find out if something worked much later) and shaped by complex dynamics.

RL Algorithms — The Landscape

Q-Learning learns a Q-function: Q(state, action) = expected future reward if you take this action in this state and then act optimally afterward. The famous Bellman equation makes this recursive — the value of a state-action pair depends on the best next state-action pair.

Q(s,a) ← Q(s,a) + α · [r + γ · maxₐ′ Q(s′,a′) − Q(s,a)]

α = learning rate  ·  r = reward received  ·  γ · maxₐ′ Q(s′,a′) = discounted best future value over next actions a′  ·  The bracketed term is the TD error: how surprised were we?
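Here is a minimal tabular sketch of that update rule, using ε-greedy exploration. The environment is an invented five-state corridor where only reaching the rightmost state pays a reward; the values of α, γ, and ε are arbitrary choices for the demo.

```python
import random

N_STATES, ACTIONS = 5, (0, 1)   # action 0 = left, 1 = right; state 4 is the goal
alpha, gamma, epsilon = 0.5, 0.9, 0.2

def step(state, action):
    """Hypothetical corridor: move left or right, reward 1 on reaching the goal."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy: explore at random sometimes (and on ties), else exploit.
        if rng.random() < epsilon or Q[s][0] == Q[s][1]:
            a = rng.choice(ACTIONS)
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1
        s2, r, done = step(s, a)
        target = r if done else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])   # alpha times the TD error
        s = s2

policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

The learned values for "go right" decay by a factor of γ per step away from the goal, which is the Bellman recursion made visible.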

DQN (Deep Q-Network) replaces the Q-table with a neural network. DeepMind used this to beat Atari games from raw pixels — the input was screen pixels, Q-network learned the action values.

Policy Gradient methods directly optimize the policy without a value function. Instead of learning "how good is each state," you learn "which actions to take more or less often" based on whether they led to high rewards.

REINFORCE: if an action led to a high-reward trajectory, increase its probability. If it led to low reward, decrease it. Simple but high variance — you need many samples to get stable gradient estimates.

Actor-Critic: combine both — a policy network (actor) that picks actions, and a value network (critic) that estimates how good the current state is. The critic reduces variance by providing a baseline. PPO is a modern actor-critic algorithm.

RLHF uses PPO (Proximal Policy Optimization) to fine-tune LLMs. The LLM is the policy (state = conversation so far, action = next token), the reward model gives the reward signal, and PPO constrains updates so the model doesn't drift too far from the supervised baseline.

🔑
Why PPO for LLMs? Standard policy gradient methods can take large updates that catastrophically erase the base model's capabilities. PPO's "proximal" constraint (a clipped objective, or a KL penalty in some variants) keeps each updated policy close to the previous one, and RLHF setups typically add a KL penalty against the supervised model as well. Together these prevent the model from collapsing into repetitive reward-hacking behavior like always outputting "I am helpful!"

DPO (Direct Preference Optimization) bypasses the RL loop entirely by showing that the RLHF objective has an equivalent supervised form. You train directly on preference pairs (response A preferred over response B) with a logistic (binary cross-entropy) loss on how much more the policy favors A over B than a frozen reference model does. No separate reward model and no PPO loop are needed, which makes it much simpler, and it is now widely used.
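As a sketch of what "training directly on preference pairs" means, the function below computes the DPO loss for a single pair from four sequence log-probabilities. The log-probability values and β = 0.1 are invented; in practice they would come from the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)) for one (chosen, rejected) preference pair."""
    chosen_gain = policy_logp_chosen - ref_logp_chosen        # how much the policy upweights A
    rejected_gain = policy_logp_rejected - ref_logp_rejected  # how much it upweights B
    margin = beta * (chosen_gain - rejected_gain)
    return math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))

# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy has shifted probability toward the preferred response: lower loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```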

🗺️
The three paradigms compared: Supervised learning = learn from a teacher's answers. Unsupervised learning = find structure without a teacher. Reinforcement learning = learn from trial-and-error consequences. Modern LLMs blend all three: pretrained self-supervised (unsupervised), fine-tuned supervised (SFT), and aligned via RL (RLHF/DPO).

06 Loss Functions & Gradient Descent

How the model knows it's wrong, and how it fixes itself.

The Loss Function

A loss function takes (prediction, true label) and returns a single number: how wrong are we? The higher the number, the worse. Training is the process of minimizing this number by changing the model's parameters.

Mean Squared Error (regression)
L = (1/n) · Σ (ŷᵢ − yᵢ)²
Penalizes large errors heavily (squared).

Cross-Entropy (classification)
L = −Σ yᵢ · log(ŷᵢ)
Penalizes confident wrong predictions catastrophically.
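Both formulas are short enough to write out directly. This sketch assumes the cross-entropy inputs are already probabilities that sum to 1 and that the label is one-hot; the example numbers are made up.

```python
import math

def mse(preds, targets):
    """Mean squared error: average of squared differences."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def cross_entropy(probs, one_hot):
    """-sum(y_i * log(p_i)); only the true class contributes for a one-hot label."""
    return -sum(y * math.log(p) for p, y in zip(probs, one_hot) if y > 0)

print(mse([3.0, 5.0], [2.0, 7.0]))                    # errors of 1 and 2
print(cross_entropy([0.70, 0.20, 0.10], [1, 0, 0]))   # fairly confident and right
print(cross_entropy([0.01, 0.98, 0.01], [1, 0, 0]))   # very confident and wrong
```

The last call shows the "catastrophic" penalty: putting 1% on the true class costs about thirteen times more than putting 70% on it.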

Gradient Descent — The Core Algorithm

Imagine you're blindfolded on a hilly landscape and want to reach the lowest valley. You can only feel the slope under your feet. The strategy: always take a small step in the downhill direction.

That's gradient descent. The "landscape" is the loss surface over parameter space. The "slope under your feet" is the gradient — a vector that points in the direction of steepest increase in loss. We step in the opposite direction.

[Figure: Loss landscape. The ball rolls downhill toward minimum loss; that's gradient descent. Multiple valleys illustrate the local minima problem.]

⛰️ Model: The Loss Landscape

Parameters are coordinates on a landscape. Loss is altitude. Training is rolling a ball downhill. The learning rate controls how large each step is — too big and you bounce over valleys, too small and you take forever.

This picture is 2D, but real parameter spaces have millions of dimensions. Local minima turn out to be much less of a problem in high dimensions than the 2D picture suggests.
Learning Rate (η)

Step size per iteration. Too large → oscillates, diverges. Too small → extremely slow convergence. One of the most important hyperparameters in all of ML.
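The three learning-rate regimes can be reproduced on the simplest possible loss, f(w) = w², whose gradient is 2w. The starting point, step count, and the three rates are assumptions picked to show each behavior.

```python
def gradient_descent(lr, steps=50, w0=5.0):
    """Minimise f(w) = w**2 by repeatedly stepping against the gradient 2*w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(gradient_descent(0.1))     # well-chosen: ends very close to the minimum at 0
print(gradient_descent(0.001))   # too small: has barely moved from 5.0
print(gradient_descent(1.1))     # too large: overshoots further every step and diverges
```

Each step multiplies w by (1 − 2·lr), so the iterate shrinks toward 0 only when that factor has magnitude below 1.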

Batch Size

SGD: gradient from 1 example — noisy but fast. Mini-batch: gradient from ~32-256 examples — best of both worlds. Full batch: all examples — precise but slow.

07 Neural Networks (ANN)

Why we need them, and what they actually are (hint: not really brains).

Linear models can only draw straight lines (hyperplanes) as decision boundaries. Real-world data is rarely linearly separable. A neural network solves this by composing many simple transformations: each layer learns a new representation of the data, until the final layer can separate the classes with a simple boundary.

🔑
The key idea: A neural network is a function composed of simpler functions. Each layer transforms the representation of the data into a new space where the problem becomes easier to solve. Deep networks learn a hierarchy: early layers learn edges, middle layers learn shapes, late layers learn objects.

[Figure: A forward pass through the network.]

Data enters at the input layer; each layer transforms it and passes it forward. The activations sweep from left to right, and this is all a neural network does at inference time: one directed sweep of multiply-add-activate.

Neural Network: Anatomy

[Figure: Input layer (raw features x₁…x₄) → Hidden layer 1 (learned features) → Hidden layer 2 (higher-level features) → Output layer (predictions y₁…y₃). Each connection carries a weight w. Inside one neuron: Σ(wᵢxᵢ) + b → activation.]

Each connection = 1 learnable weight parameter

The Single Neuron: What's Actually Happening

Each neuron does two things: (1) takes a weighted sum of its inputs, (2) applies an activation function. That's it. The magic is in what happens when you stack millions of them.

output = activation(w₁·x₁ + w₂·x₂ + ... + wₙ·xₙ + b)

w = weights (what we train)  ·  x = inputs  ·  b = bias (shift term)  ·  activation = non-linearity
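Written as code, one neuron is two lines: a weighted sum and an activation. The weights, bias, and inputs below are arbitrary numbers for illustration, and ReLU is used as the activation.

```python
def relu(z):
    return max(0.0, z)

def neuron(inputs, weights, bias, activation=relu):
    """activation(w1*x1 + ... + wn*xn + b)"""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

print(neuron([1.0, 2.0], [0.5, 1.0], 0.25))    # weighted sum is 2.75, ReLU passes it through
print(neuron([1.0, 2.0], [0.5, -1.0], 0.25))   # weighted sum is -1.25, ReLU clips it to 0
```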

Why Non-Linearity (Activation Functions) Is Critical

Without an activation function, stacking layers is useless — a network of 100 linear layers collapses into a single linear layer mathematically. The activation function (ReLU, Sigmoid, Tanh) introduces non-linearity, which is what allows the network to learn curved decision boundaries.
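The collapse is easy to verify with scalars. In this sketch (weights chosen arbitrarily), two stacked linear layers are exactly reproduced by one linear layer, while inserting a ReLU between them produces a function that no single linear layer can match.

```python
def linear(w, b):
    """Return the function x -> w*x + b."""
    return lambda x: w * x + b

layer1, layer2 = linear(3.0, 1.0), linear(-2.0, 4.0)

def stacked(x):
    return layer2(layer1(x))

# One layer with w = w2*w1 and b = w2*b1 + b2 does the same job.
collapsed = linear(-2.0 * 3.0, -2.0 * 1.0 + 4.0)

def stacked_relu(x):
    return layer2(max(0.0, layer1(x)))   # same layers with a ReLU in between

for x in (-2.0, 0.0, 0.5, 3.0):
    print(x, stacked(x), collapsed(x), stacked_relu(x))
```

A straight line always satisfies f(0) = (f(−1) + f(1)) / 2; the ReLU version breaks that identity, which is the "bend" that lets networks fit curves.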

ReLU: f(x) = max(0, x)

Zero for negative inputs, linear for positive. Cheap to compute, and its gradient is exactly 1 for positive inputs, which mitigates vanishing gradients. Default choice for hidden layers today.

Sigmoid: f(x) = 1/(1+e⁻ˣ)

Squashes to (0,1). Used in output for binary classification. Causes vanishing gradients in deep networks — don't use in hidden layers.

🧠
Universal Approximation Theorem: A neural network with a single hidden layer, a suitable non-linear activation, and enough neurons can approximate any continuous function on a bounded input region to arbitrary precision. This is the theoretical justification for why neural nets are so expressive. But "enough neurons" can be impractically many, and the theorem says nothing about whether training will actually find those weights, which is why depth (many layers) is often more practical than width (one huge layer).

08 Specialized Architectures: CNN, RNN, LSTM, GRU

Vanilla ANNs treat all inputs as interchangeable. These architectures exploit structure in the data.

A fully-connected ANN (the vanilla net from section 07) connects every input to every neuron, with no built-in awareness of spatial proximity, temporal order, or sequence structure. Real data has structure: images have local patterns, text has sequential dependencies, audio has temporal rhythms. Specialized architectures bake these structural assumptions in as inductive biases, making them far more efficient on their target data types.

CNN — Convolutional Neural Network

Problem it solves: An image is a grid of pixels. Nearby pixels are related (they form edges, textures, objects). A fully-connected layer treating each pixel independently ignores this and has too many parameters to train.

Key idea: Slide a small filter (kernel) across the image, computing a dot product at each position. This detects local patterns — edges, corners, textures — regardless of where they appear in the image (translation equivariance).

Convolution: A Filter Sliding Across an Image

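[Figure: Input (6×6) combined with a 3×3 filter (learned) gives a feature map (4×4).]

Example filter (vertical edge detector):
 1  0  -1
 1  0  -1
 1  0  -1

Each cell of the feature map is one dot product between the filter and the 3×3 patch under it.

Why this works: the same filter is reused at every position. Weight sharing means far fewer parameters than a fully-connected layer.
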
CNN Layer Stack

Conv → ReLU → Pooling, repeated. Early layers: edges and colors. Middle layers: textures and shapes. Deep layers: objects and parts. Final layers: fully-connected for classification.

Pooling

Max-pooling: take the maximum value in each region. Reduces spatial size and provides a degree of translation invariance ("I saw an edge here, approximately"). A 2×2 pool with stride 2 halves width and height each time it is applied.
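The sliding-filter and pooling steps can be sketched with plain lists. The 6×6 image below is invented (bright left half, dark right half) so that the vertical edge detector has something to find; the helper names are illustrative, and stride 1 with no padding is assumed.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding), one dot product per position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool2x2(fmap):
    """Keep the maximum of each non-overlapping 2x2 block."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

image = [[1, 1, 1, 0, 0, 0] for _ in range(6)]   # bright left half, dark right half
kernel = [[1, 0, -1]] * 3                        # vertical edge detector

feature_map = conv2d(image, kernel)   # 4x4: responds only where the edge is
pooled = max_pool2x2(feature_map)     # 2x2: width and height halved
for row in feature_map:
    print(row)
print(pooled)
```

The same nine weights are applied at all sixteen positions, which is the weight sharing that keeps the parameter count small.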

🔬 Model: Image Microscope

Each conv layer is like examining an image at a different magnification. First pass: individual pixel patterns (edges). Second pass: combinations of those patterns (curves, corners). Third pass: combinations of curves (eyes, wheels). CNNs build a feature hierarchy bottom-up.

CNNs assume translational patterns — the same features appear anywhere in the image. They struggle with rotation and scale unless you augment training data. Vision Transformers (ViT) have largely replaced CNNs for large-scale image tasks.

RNN — Recurrent Neural Network

Problem it solves: A sequence (text, time-series, audio) has temporal dependencies — the meaning of "not" depends on what comes after it. A vanilla ANN processes each position independently; it has no memory of previous inputs.

Key idea: At each time step, the network takes the current input and a hidden state from the previous step. The hidden state acts as a running memory — a compressed summary of everything seen so far.

RNN: Hidden State Propagates Through Time

[Figure: x₁ → RNN cell (t=1) → h₁ → RNN cell (t=2, input x₂) → h₂ → RNN cell (t=3, input x₃) → h₃ → ··· → RNN cell (t=T, input x_T) → h_T → output. Problem: the gradient vanishes through long chains.]

The hidden state update is: hₜ = tanh(Wₓ·xₜ + Wₕ·hₜ₋₁ + b). The same weights Wₓ and Wₕ are reused at every time step, so the network learns a single rule for "how to update its memory" that does not depend on the position in the sequence.
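A scalar version of that update shows the hidden state acting as memory. Real RNNs use weight matrices and vector states; the scalar weights and the input sequence here are assumptions made to keep the arithmetic visible.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """h_t = tanh(w_x * x_t + w_h * h_prev + b), scalar version."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(xs, w_x=0.5, w_h=0.8, b=0.0):
    h, states = 0.0, []
    for x_t in xs:                 # the same weights are reused at every step
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

# A single input pulse followed by silence: the state still "remembers" it.
states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

The state fades a little at every step, which is a small-scale preview of why information (and gradients) from early time steps is hard to preserve.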

⚠️
Vanishing gradient problem: Backprop through time multiplies gradients through many tanh derivatives (each ≤ 1). For long sequences (100+ tokens), gradients from the loss vanish to near-zero before reaching early time steps. The network can't learn long-range dependencies. This motivated LSTM.

LSTM — Long Short-Term Memory

LSTM (1997, Hochreiter & Schmidhuber) solves vanishing gradients by adding a cell state — a separate memory track that flows through time with additive updates rather than multiplicative ones. Gradients can flow through the cell state without shrinking.

LSTM Cell — Three Gates Control Information Flow

  • Cell state Cₜ: the "conveyor belt" (long-term memory).
  • Forget gate: sigmoid(Wf·[hₜ₋₁,xₜ]) → output ∈ (0,1). 0 = forget everything, 1 = remember everything.
  • Input gate: sigmoid × tanh. What new info to add to the cell state? The result is added (⊕) to Cₜ.
  • Output gate: sigmoid(Wo·[hₜ₋₁,xₜ]) × tanh(Cₜ). What to output as the hidden state hₜ (short-term memory).
  • hₜ₋₁ and xₜ feed all three gates.

🚂 Model: The Conveyor Belt

The cell state is a conveyor belt running alongside the sequence. The forget gate decides what to wipe off (multiply by ~0). The input gate decides what to stamp on (add new info). The output gate decides what to read off for the current step. Gradients flow backwards along the belt through additions, scaled only by the forget gate (which can stay close to 1) rather than squashed through a long chain of tanh derivatives, so they are far less prone to vanishing.

LSTMs are still sequential (can't parallelize training over the sequence). Transformers replaced them for NLP by making the entire sequence parallel.
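The three gates and the additive cell update can be sketched for a scalar cell. This is a toy under stated assumptions: real LSTMs use weight matrices over the concatenated [hₜ₋₁, xₜ], and the parameter dictionary and its values below are invented to show a cell that simply holds on to its memory.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One scalar LSTM step. params maps a gate name to (w_h, w_x, bias)."""
    def pre(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x_t + b
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    g = math.tanh(pre("candidate"))   # the new information itself
    o = sigmoid(pre("output"))        # how much of the cell to expose
    c_t = f * c_prev + i * g          # additive update of the conveyor belt
    h_t = o * math.tanh(c_t)
    return h_t, c_t

# Hypothetical parameters: forget gate held open, input gate held shut.
params = {"forget": (0.0, 0.0, 10.0), "input": (0.0, 0.0, -10.0),
          "candidate": (0.0, 1.0, 0.0), "output": (0.0, 0.0, 0.0)}

h, c = 0.0, 2.0
for x_t in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x_t, h, c, params)
print(c)   # still very close to the initial 2.0
```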

GRU — Gated Recurrent Unit

GRU (2014, Cho et al.) is a simplified LSTM that merges the forget and input gates into a single update gate, and merges the cell state and hidden state. Fewer parameters, similar performance on most tasks, faster to train.

GRU Gates

Update gate (z): How much of the previous hidden state to keep vs. overwrite. Reset gate (r): How much of the previous state to forget when computing the new candidate state. Simpler than LSTM — often preferred for smaller datasets.

LSTM vs GRU

LSTM: separate cell state + hidden state, 3 gates, more expressive. GRU: single hidden state, 2 gates, fewer params. Rule of thumb: GRU for smaller datasets/simpler tasks, LSTM when you have data to learn the extra expressivity.

When to Use Which Architecture

Architecture Selection Guide

Architecture | Data Type         | Key Advantage            | Use Case                    | Replaced by
ANN (MLP)    | Tabular / fixed   | Universal approximator   | Classification, regression  | Still standard
CNN          | Images, 2D grids  | Local pattern detection  | Vision, audio               | ViT (Vision Transformer)
RNN          | Sequences         | Sequential memory        | Short sequences             | LSTM / Transformer
LSTM         | Long sequences    | Long-range dependencies  | NLP (pre-2017), time-series | Transformer
GRU          | Sequences         | LSTM but faster/simpler  | Small datasets, speed       | Transformer

📌
The historical arc: ANNs (1980s) → CNNs dominated computer vision (2012, AlexNet) → LSTMs/GRUs dominated NLP (2014–2017) → Transformers took over everything (2017–present). Understanding RNNs and LSTMs is still critical — they appear in production systems, time-series forecasting, and as building blocks inside larger architectures. And the concepts (gating, memory, sequential inductive bias) still matter.

09 Backpropagation

The algorithm that made deep learning possible — it's just the chain rule, applied cleverly.

The central problem of training a network: we have a loss value, and we need to know how to adjust every weight in the network to reduce that loss. With millions of weights across many layers, computing this naively would be impossibly expensive.

Backpropagation solves this by applying the chain rule of calculus backwards through the network — from loss back to the first layer. It's not magic. It's efficient bookkeeping of derivatives.

Forward Pass & Backward Pass

Forward pass (compute predictions): x → h₁ (layer 1, weights W₁) → h₂ (layer 2, weights W₂) → ŷ (output, weights W₃) → L (loss)

Backward pass (propagate gradients via the chain rule): L → ∂L/∂W₃ → ∂L/∂W₂ → ∂L/∂W₁

The Chain Rule — One Sentence

If loss L depends on weight W through intermediate values, then ∂L/∂W = (∂L/∂output) × (∂output/∂W). Backprop applies this repeatedly from the output layer backward to the input, reusing intermediate computations to avoid redundant work.
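A hand-worked example makes the bookkeeping concrete. The tiny network here (h = tanh(w₁·x), ŷ = w₂·h, L = (ŷ − target)²) and its numbers are assumptions for illustration; the backward pass applies the chain rule step by step, and a finite-difference estimate is used as an independent check.

```python
import math

def forward(w1, w2, x, target):
    h = math.tanh(w1 * x)
    y_hat = w2 * h
    loss = (y_hat - target) ** 2
    return h, y_hat, loss

def backward(w1, w2, x, target):
    """Chain rule from the loss back to each weight, reusing intermediate values."""
    h, y_hat, _ = forward(w1, w2, x, target)
    dL_dy = 2 * (y_hat - target)          # dL/dy_hat
    dL_dw2 = dL_dy * h                    # y_hat = w2 * h
    dL_dh = dL_dy * w2                    # pass the blame back to h
    dL_dw1 = dL_dh * (1 - h ** 2) * x     # tanh'(z) = 1 - tanh(z)**2, and z = w1 * x
    return dL_dw1, dL_dw2

def numeric_grad(f, w, eps=1e-6):
    """Central finite difference, used only to check the analytic gradients."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
g1, g2 = backward(w1, w2, x, target)
n1 = numeric_grad(lambda w: forward(w, w2, x, target)[2], w1)
n2 = numeric_grad(lambda w: forward(w1, w, x, target)[2], w2)
print(g1, n1)
print(g2, n2)
```

Note that `dL_dy` is computed once and reused for both weights; that reuse is what makes one backward pass enough for all gradients.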

📦 Model: Gradient as Blame Assignment

Think of each weight as an employee. After every prediction, you run an audit: how much did each employee (weight) contribute to the error? Weights that contributed more to a wrong answer get adjusted more. Backprop is the audit process — it assigns blame proportionally.

Blame isn't always assigned correctly — weights in early layers often get tiny gradients (vanishing gradient problem), making them hard to train.
Why backprop was a breakthrough: Before 1986, people knew networks could theoretically work but couldn't train deep ones. Backprop gave an efficient O(N) algorithm to compute all N gradients in one backward pass. Without it, estimating the gradients numerically would take on the order of N separate forward passes (one per perturbed weight), which is completely impractical for millions of parameters.

10 Common Misconceptions

"Neural networks are modeled after how the brain works"
They were loosely inspired by neurons in 1943, but modern neural nets bear little resemblance to biological neuroscience. They're mathematical function approximators. The "neural" name is historical baggage — don't let it mislead your intuition.
"More data always helps"
More data helps if the additional data is from the same distribution as what you need to predict. Garbage data makes models worse. Mislabeled data can be catastrophic. And past a certain scale, returns diminish unless you also scale the model.
"Training loss going down means the model is getting better"
Training loss going down means the model is fitting the training data better. It can overfit catastrophically while training loss plummets. Always watch validation loss — loss on data the model never trained on.
"Deep learning needs to reach global minimum to work well"
In practice, well-trained deep networks often sit in local minima or saddle points — and that's fine. The loss landscape of large networks has many equivalent good solutions. Finding a good minimum is enough; finding the global minimum is computationally intractable.
"Backprop teaches the network what to learn"
Backprop is just a mathematical procedure to compute gradients. It doesn't have any intelligence about what should be learned. What gets learned is determined entirely by the loss function and the training data. Backprop is the engine; the loss function is the steering wheel.

What's Next — Part 2

Now that you understand supervised learning, loss, gradient descent, and backprop — the next document covers how we get from neural networks to Transformers (the architecture behind every major LLM), how LLMs are trained at scale, fine-tuning with RLHF, open weights, prompting, and the full inference pipeline.

11 Training Tricks That Make Deep Learning Work

Theoretical understanding of backprop is necessary but not sufficient. These are the engineering practices that make deep networks actually trainable.

Optimizers — Beyond Vanilla SGD

Vanilla SGD uses the same learning rate for every parameter. Real loss landscapes are anisotropic: some dimensions are steep, some are flat. Better optimizers smooth the update direction and adapt the step size per parameter.

Momentum accumulates a velocity vector in the direction of persistent gradient, damping oscillations and accelerating through long, shallow valleys. Think of a ball rolling downhill — it builds up speed in a consistent direction.

v ← β·v + (1−β)·∇L
w ← w − η·v

β ≈ 0.9 is typical. At each step, 90% of the old velocity is kept and 10% of the new gradient is blended in. Gradient components that flip sign from step to step cancel out, while the consistent downhill direction accumulates. Nesterov momentum evaluates the gradient at the "lookahead" position, which is slightly more accurate.
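The two-line momentum update translates directly into code. This sketch minimizes a made-up 1D loss, (w − 3)², with arbitrary values for the learning rate and step count.

```python
def sgd_momentum(grad_fn, w, lr=0.1, beta=0.9, steps=200):
    """v <- beta*v + (1-beta)*grad ; w <- w - lr*v"""
    v = 0.0
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad_fn(w)   # blend the new gradient into the velocity
        w = w - lr * v                           # step along the velocity
    return w

def grad(w):
    return 2 * (w - 3.0)   # gradient of the hypothetical loss (w - 3)**2

w_final = sgd_momentum(grad, w=0.0)
print(w_final)   # close to the minimum at 3
```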

Adam (Adaptive Moment Estimation) maintains per-parameter estimates of both the first moment (mean of gradients) and second moment (uncentred variance). This gives every parameter its own adaptive learning rate: larger for parameters whose gradients have been small or rare, smaller for parameters whose gradients have been consistently large.

m ← β₁·m + (1−β₁)·g  (1st moment: direction)
v ← β₂·v + (1−β₂)·g²  (2nd moment: gradient magnitude)
w ← w − η·m̂/(√v̂ + ε)  (parameter update)

β₁=0.9, β₂=0.999, ε=1e-8 in practice. Bias correction: m̂ = m/(1−β₁ᵗ), v̂ = v/(1−β₂ᵗ). AdamW decouples weight decay from the gradient update — prevents weight decay from interacting incorrectly with adaptive learning rates. AdamW is now the default for training LLMs.
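Putting the moment estimates, the bias correction, and the parameter update together gives the following scalar sketch. The loss (w − 3)², the learning rate, and the step count are assumptions for the demo; weight decay (the "W" in AdamW) is left out.

```python
import math

def adam(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # first moment: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # second moment: running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

def grad(w):
    return 2 * (w - 3.0)   # gradient of the hypothetical loss (w - 3)**2

print(adam(grad, w=0.0))            # ends near the minimum at 3
print(adam(grad, w=0.0, steps=1))   # first step has size ~lr, whatever the gradient's scale
```

The one-step call illustrates the normalisation: on the first step m̂/√v̂ is just the sign of the gradient, so the move is about lr regardless of how large the gradient is.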

Optimizer Landscape — When to Use What

SGD + Momentum
  ✓ Best final accuracy
  ✓ Computer vision
  ✓ ImageNet training
  ✗ Needs careful LR tuning
  ✗ Slow to start

Adam / AdamW
  ✓ Fast convergence
  ✓ NLP, Transformers
  ✓ LLM training standard
  ✗ More memory (m, v)
  ✗ Can generalise worse

RMSProp
  ✓ RNNs, noisy gradients
  ✓ Adaptive like Adam
  ✓ No bias correction
  ✗ Largely superseded by Adam in practice

Learning Rate Schedule
Warmup: ramp LR from 0 to peak over the first N steps, then cosine decay to 0. Commonly used in LLM training, whichever optimizer is chosen.
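The warmup-then-cosine schedule is a small pure function of the step number. The peak rate, warmup length, and total step count below are placeholder values, not recommendations.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=100, total_steps=1000):
    """Linear warmup from 0 to peak_lr, then cosine decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step))
```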

Batch Normalization & Layer Normalization

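A classic explanation for why deep networks are hard to train is internal covariate shift: the distribution of each layer's activations shifts as earlier layers update, forcing later layers to constantly readapt. Normalization layers address this by standardising activations, which in practice also smooths optimisation and allows higher learning rates.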

Batch Norm (BatchNorm)

Normalise across the batch dimension — compute mean/variance per feature over N samples. Works brilliantly for CNNs. Problem: depends on batch size; breaks with batch=1; can't be used at inference without tracked running statistics. γ and β (scale and shift) are learned.

Layer Norm

Normalise across the feature dimension: compute mean/variance per sample, over all features. Works with any batch size, including batch=1. Standard for Transformers and LLMs, because per-sample statistics behave identically for variable-length sequences and auto-regressive generation, where batch statistics are unreliable or unavailable.
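The only difference between the two is the axis the statistics are computed over, which a small sketch makes visible. It omits the learned scale and shift (γ, β) and BatchNorm's running statistics; the batch values are invented.

```python
import math

def normalise(values, eps=1e-5):
    """Shift to zero mean and scale to (roughly) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(batch):
    """Normalise each feature (column) across the samples in the batch."""
    columns = [normalise(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    """Normalise each sample (row) across its own features."""
    return [normalise(row) for row in batch]

batch = [[1.0, 2.0, 3.0],
         [4.0, 6.0, 8.0]]   # 2 samples, 3 features

print(batch_norm(batch))
print(layer_norm(batch))
print(batch_norm([batch[0]]))   # batch of one: every feature collapses to 0
print(layer_norm([batch[0]]))   # unchanged by batch size
```

The last two lines show the batch=1 failure mode of batch statistics and why per-sample normalisation does not have it.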

Dropout — Regularisation by Forgetting

During training, randomly zero out each neuron with probability p at each forward pass (typically p = 0.1–0.5). Each mini-batch therefore trains a different sub-network. At test time the ensemble of all 2ⁿ sub-networks is approximately averaged by using the full network with activations scaled by 1−p (or, equivalently, by dividing the kept activations by 1−p at train time, known as "inverted dropout").
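Inverted dropout is only a few lines. In this sketch the random generator is passed in explicitly so the result is reproducible; the activation values and p = 0.5 are arbitrary.

```python
import random

def dropout(activations, p, training, rng):
    """Inverted dropout: zero each unit with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return list(activations)   # inference: the layer does nothing
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
activations = [1.0] * 10000

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)

print(sum(train_out) / len(train_out))   # close to 1.0: the expected activation is preserved
print(sum(eval_out) / len(eval_out))     # exactly 1.0
```

Dividing by 1−p during training keeps the expected activation the same in both modes, so nothing needs rescaling at inference.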

🎲 Model: Ensemble Learning

Dropout forces every neuron to be useful on its own; it can't rely on a fixed set of partner neurons. This approximates averaging predictions over exponentially many subnetworks, which reduces variance and prevents co-adaptation (neurons memorising together).

Dropout slows training (noisy gradients) and can interact badly with BatchNorm (their statistics conflict). In Transformers it is applied only at low rates, with Layer Norm and weight decay doing most of the regularisation.
Where Dropout Is Used
  • Dense layers in classic CNNs (after pooling)
  • Attention dropout in Transformers (low rate, ~0.1)
  • Embedding dropout in RNNs
  • NOT on batch norm layers (incompatible)
  • NOT at inference — only active during training

12 Generative Architectures — GANs & Diffusion

The two families of models that learned to create, not just classify — and why one replaced the other.

GANs — Generative Adversarial Networks

Proposed by Goodfellow et al. (2014). Two networks trained simultaneously in a minimax game: a Generator tries to produce fake data that looks real; a Discriminator tries to distinguish real from fake. Each makes the other better.

GAN Training Loop

  • Random noise z → Generator G → fake data G(z). G wants D to say "real".
  • Real data and G(z) → Discriminator D → P(real | input). D wants to catch fakes.
  • Adversarial loss: gradients flow back to G ("fool D more") and back to D ("detect better").

⚠️
GAN training is notoriously unstable. Mode collapse: G learns to produce one type of output that always fools D, ignoring the rest of the data distribution. Vanishing gradients: when D becomes too good, G gets near-zero gradient. Training a GAN requires careful tuning and stabilisation tricks such as spectral normalisation, gradient penalty (WGAN-GP), and progressive growing (ProGAN). This instability is a key reason diffusion models largely replaced GANs for image generation.

Diffusion Models — The Current State of the Art

Diffusion models (DDPM, 2020; Score Matching) take a completely different approach: instead of adversarial training, they learn to reverse a noise process. Forward process: gradually add Gaussian noise to an image until it's pure noise. Reverse process: learn to denoise step-by-step. Generation = start from pure noise and run the learned reverse process.
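The forward (noising) half needs no learning and can be sketched directly. The linear β schedule from 1e-4 to 0.02 over 1000 steps follows the commonly cited DDPM setup but should be read as an assumption here, as should the closed-form jump to step t and the toy three-value "image".

```python
import math
import random

T = 1000
# Per-step noise amounts: a linear schedule (assumed for illustration).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar_t = product of (1 - beta) up to step t: the fraction of signal that survives.
alpha_bar, alpha_bars = 1.0, []
for beta in betas:
    alpha_bar *= 1.0 - beta
    alpha_bars.append(alpha_bar)

def noisy_sample(x0, t, rng):
    """Jump straight to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bars[t]
    x_t = [math.sqrt(a) * x + math.sqrt(1 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps   # eps is what the denoising network is trained to predict

rng = random.Random(0)
x0 = [0.8, -0.3, 0.5]                 # hypothetical "image" of three pixel values
print(noisy_sample(x0, 0, rng)[0])    # almost the clean image
print(noisy_sample(x0, T - 1, rng)[0])  # almost pure Gaussian noise
print(alpha_bars[0], alpha_bars[-1])
```

Training then reduces to a regression problem: given x_t and t, predict eps with a mean squared error loss.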

Diffusion: Forward (destroy) vs Reverse (create)

  • Forward process (fixed, no learning): clean image x₀ → x₁ (+noise) → x₂ (+noise) → ··· → x_T (pure noise). Gaussian noise is added at each step.
  • Reverse process: a denoising network ε_θ (U-Net / Transformer), trained to predict ε, predicts the noise to remove at each step.

Why diffusion beat GANs
  • Stable training — simple MSE loss on noise prediction, no adversarial dynamic
  • Better coverage — generates diverse outputs, no mode collapse
  • Easy to condition — add text/class embeddings to the denoising net (Stable Diffusion, DALL-E 2)
  • Composable — combine multiple conditioning signals
Diffusion's weakness
  • Slow sampling — original DDPM needs 1000 steps; DDIM, PLMS reduce to ~20–50
  • High compute — each step is a full forward pass through U-Net
  • Latent diffusion (LDM / Stable Diffusion) compresses to latent space first, then diffuses — much cheaper

13 Transfer Learning

The paradigm that makes deep learning practical — don't train from scratch when someone already trained from scratch for you.

Training a deep network from random weights requires massive data and compute. Transfer learning reuses a model trained on a large task as the starting point for a smaller related task. The pretrained weights encode general-purpose representations; fine-tuning adapts them to your specific problem.

The Transfer Learning Pipeline

Stage 1: Pretrain. Large dataset (ImageNet, internet text, etc.). General representations are learned in the weights.
Stage 2: Transfer. Copy the pretrained weights. Replace the task head (last layer) with a new task output.
Stage 3: Fine-tune. Small dataset, small LR. Train all layers (full) or freeze early layers (feature extraction).
Result: task model ✓

Feature Extraction

Freeze all pretrained layers. Only train the new task head. Use when your dataset is very small (<1K examples) — training deep layers would overfit.

Full Fine-tuning

Update all weights, but start from pretrained values with a very small learning rate (~1/10 of original). Best results when you have enough data (10K+ examples).

LoRA / PEFT

Freeze the original weights. Add small trainable low-rank matrices alongside the frozen weight matrices and train only those. This can cut the number of trainable parameters by 99% or more. The modern standard for fine-tuning LLMs. Covered in depth in Part 2.
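The parameter saving follows from simple arithmetic: a d_out × d_in weight matrix is left frozen, and the trainable update is factored into two thin matrices of rank r. The layer size and rank below are illustrative assumptions, not values taken from any particular model.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for one weight matrix: full fine-tuning vs a rank-r LoRA update."""
    full = d_out * d_in              # every entry of W is trainable
    lora = rank * (d_out + d_in)     # B is d_out x rank, A is rank x d_in
    return full, lora

full, lora = lora_param_counts(4096, 4096, rank=8)
reduction = 1 - lora / full
print(full, lora, f"{reduction:.2%} fewer trainable parameters")
```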

🔑
Why transfer learning is the foundation of modern AI: GPT pretraining on internet text is transfer learning. BERT fine-tuning on NER is transfer learning. Stable Diffusion fine-tuned on your art style is transfer learning. The entire modern AI stack is built on the insight that general representations, learned at scale, transfer to specific tasks at a fraction of the cost.