01 The Big Picture
Before a single equation: what is the game we're actually playing?
Traditional software is a set of explicit rules. You, the programmer, encode logic: if temperature > 100, alert fire. Machine learning flips this contract. Instead of writing rules, you provide examples — input-output pairs — and let an algorithm discover the rules by itself.
This matters because some problems have rules too complex to write by hand. Recognizing a face, translating a sentence, predicting whether a tumor is malignant — the rules exist in the world, but they're buried in patterns across millions of examples. ML is a machine for extracting those patterns.
Every ML algorithm is, at its core, an automated search through a space of possible functions, guided by a measure of how wrong the current function is on known examples.
This document traces the story from that simple idea all the way to a neural network that can learn arbitrary functions — building mental models at every step, before any math.
Traditional Programming vs. Machine Learning
02 ML as Function Search
The single most important mental model in all of ML.
Imagine a universe of all possible functions. A function that maps every image to a cat/dog label. A function that maps a chess position to the best move. The space is infinite — there are uncountably many such functions.
ML is a structured search through this space. You don't search randomly. You define a family of functions (the model), and then search within that family by adjusting parameters (numbers) to minimize error on your examples.
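A minimal sketch of this idea, assuming a toy function family f(x) = w·x + b and a coarse grid of candidate parameters. The data, the grid, and all names here are illustrative assumptions, not part of the original text:

```python
# Toy "function search": the family is f(x) = w*x + b, and each (w, b) pair
# picks out one specific function. We keep the candidate with the lowest error.
examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated by y = 2x + 1

def mean_squared_error(w, b):
    """Average squared gap between the candidate's predictions and the examples."""
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)

# Hypothetical search grid: w and b each range over -5.0 .. 5.0 in steps of 0.5.
candidates = [(i * 0.5, j * 0.5) for i in range(-10, 11) for j in range(-10, 11)]
best_w, best_b = min(candidates, key=lambda params: mean_squared_error(*params))
```

Real training replaces the exhaustive grid with gradient-based updates, because a grid becomes infeasible once there are more than a handful of parameters.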
Every set of parameters defines one specific function. Training = walking through parameter space toward lower error. The "landscape" has hills and valleys — valleys are good functions.
Imagine a mixer with millions of dials. Each dial is a parameter. Training adjusts all dials simultaneously, a tiny bit at a time, to make the output more correct. No human turns the dials — the error signal does.
The Function Family Concept
03 Supervised Learning
Learning from labeled examples — the backbone of almost everything.
Supervised learning is ML where every training example has a known correct answer (label). You show the model (image → cat), (image → dog), a million times over. The model learns a mapping. Then you give it a new image it has never seen, and it predicts.
The word "supervised" means a teacher (the labels) is guiding the learning. Contrast with unsupervised learning (no labels, find structure) and reinforcement learning (labels come as rewards from the environment, after actions).
Classification
Output is a category. "Is this email spam?" → Yes/No. "What digit is this?" → 0–9. The model draws decision boundaries in the input space.
Regression
Output is a continuous number. "What will this house sell for?" → $420,000. The model fits a curve through the data points.
The Supervised Training Loop (Every ML model follows this)
This loop is identical whether you're training a linear regression, a random forest, or a 70B-parameter language model. The details of each step differ, but the structure never changes.
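As a rough illustration of the loop (predict, measure error, compute gradients, update, repeat), here is a sketch for a two-parameter linear model trained with mean squared error. The toy data, learning rate, and epoch count are assumptions chosen for the example:

```python
# Supervised training loop for the model y_hat = w*x + b.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]  # toy labels from y = 2x + 1
w, b = 0.0, 0.0
learning_rate = 0.05  # assumed value; small enough to be stable on this data

def average_loss(w, b):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

initial_loss = average_loss(w, b)
for epoch in range(2000):
    grad_w = grad_b = 0.0
    for x, y in data:
        error = (w * x + b) - y                # 1. predict and compare with the label
        grad_w += 2 * error * x / len(data)    # 2. accumulate the gradient of the loss
        grad_b += 2 * error / len(data)
    w -= learning_rate * grad_w                # 3. step the parameters downhill
    b -= learning_rate * grad_b
final_loss = average_loss(w, b)
```

After training, w and b should sit very close to 2 and 1, the values that generated the toy labels.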
The goal of supervised learning is not to memorize the training data — it's to generalize: to make accurate predictions on examples never seen during training. This distinction is everything.
Overfitting
The model memorizes the training data, including its noise. It performs near-perfectly on the training set and poorly on new data. Like a student who memorizes exam answers without understanding the subject.
Underfitting
The model is too simple to capture the real pattern. High error on both training and test data. Like trying to fit a line through data that is clearly a curve.
04 Unsupervised Learning
No labels, no teacher — find structure the data hides from you.
Supervised learning requires labeled data — expensive, slow to produce, and sometimes impossible (what's the "correct" label for a news article?). Unsupervised learning asks a different question: what patterns exist in the data itself, without any labels? The signal comes from the data's own structure.
If similar things tend to appear together or have similar structure, a model can discover those similarities without being told what they are. Compression is a useful lens: a model that can compress data well has implicitly learned its structure.
Clustering groups similar data points together without predefined categories. The canonical algorithm is k-means: pick k centroids, assign each point to the nearest centroid, recompute centroids, repeat until stable.
k-Means: Find Natural Groups in Data
Each centroid is a gravity well pulling nearby points in. Initialization matters hugely: bad starting centroids can lead to bad clusters. k-means++ mitigates this by choosing initial centroids that tend to be spread far apart.
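The assign/recompute cycle can be sketched in a few lines. This version uses plain random initialization (not k-means++), squared Euclidean distance, and a made-up six-point dataset; the function name and defaults are assumptions for illustration:

```python
import random

def k_means(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # plain random init, not k-means++
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - c) ** 2 for a, c in zip(p, centroid))
                         for centroid in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:     # stable: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = k_means(points, k=2)
```

On well-separated blobs like these, the two centroids settle on the blob means within a few iterations.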
High-dimensional data (images, genomics, embeddings) is hard to visualize and compute over. Dimensionality reduction compresses it into fewer dimensions while preserving important structure. This is not just preprocessing — it reveals hidden geometry.
PCA (Principal Component Analysis): Linear. Finds directions of maximum variance. Projects data onto a lower-dimensional hyperplane. Fast, interpretable, but can't capture non-linear structure.
t-SNE: Non-linear. Preserves local neighborhood structure. Excellent for visualization (2D/3D) and widely used in ML to visualize embeddings. Not suited to general compression; UMAP is typically faster and scales better to large data.
Autoencoders are neural networks trained to compress data into a bottleneck (latent space), then reconstruct it. The loss is reconstruction error — no labels needed. The bottleneck forces the network to learn a compact, meaningful representation.
Autoencoder: Compress → Reconstruct
The latent space z is the learned representation — useful for anomaly detection (high reconstruction error = anomaly), denoising, and as features for downstream tasks. VAEs (Variational Autoencoders) make z a probability distribution, enabling generation of new samples.
Self-supervised learning is the bridge between unsupervised and supervised. You create artificial supervised tasks from unlabeled data — the labels come from the data itself.
Masked language modeling (as in BERT): Randomly mask 15% of tokens, then predict the masked words. The label is the original token, which comes for free from the text itself. This forces the model to use context in both directions.
Contrastive learning (as in SimCLR): Create two augmented views of the same image. Train the model so their embeddings are close, while embeddings of different images stay far apart. This learns rich visual representations with no labels at all.
05 Reinforcement Learning
Learning from consequences — not from a teacher, but from the world.
In supervised learning, you have a dataset with correct answers. In reinforcement learning, you have an agent in an environment — no dataset, no correct answers. The agent tries actions, the environment responds with a reward signal, and the agent learns to maximize cumulative reward over time.
This is the most general learning framework — and also the closest to how humans learn complex skills like playing chess, riding a bike, or writing code.
The RL Loop: Agent ↔ Environment
Key Concepts
Policy (π): The agent's strategy, a mapping from state to action. What should I do given what I see? The policy is what we're training. It can be deterministic (given state → fixed action) or stochastic (given state → probability distribution over actions).
Value function: How good is it to be in a given state? V(s) = expected cumulative (discounted) reward from state s when following the current policy. The agent uses value estimates to plan: prefer states that lead to higher future rewards.
Discount factor (γ): A reward now is worth more than the same reward later. γ ∈ [0,1] controls how much the agent values future rewards. γ=0: only care about immediate reward. γ→1: plan far ahead. Think of it as patience.
Exploration vs. exploitation: The fundamental tension. Exploit: do what you know works. Explore: try new things that might be better. Too much exploitation → stuck in suboptimal behavior. Too much exploration → never settle on what works. ε-greedy: explore randomly with probability ε, otherwise take the best-known action.
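The ε-greedy rule mentioned above can be sketched as follows, assuming the agent keeps a list of estimated values, one per action. The function name and the `rng` parameter are illustrative choices:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore: random action
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit: best-known action
```

With epsilon = 0 the agent always exploits; with epsilon = 1 it always explores. In practice ε is often decayed over training, though that schedule is a design choice not covered here.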
Imagine playing a new game with no manual. You try button combinations. Sometimes you score (positive reward), sometimes you die (negative reward). Over thousands of attempts, you learn which actions in which situations tend to lead to high scores. You never see the "correct" move — you discover it by trial and error. That's RL.
RL Algorithms — The Landscape
Q-Learning learns a Q-function: Q(state, action) = expected future reward if you take this action in this state and then act optimally afterward. The famous Bellman equation makes this recursive — the value of a state-action pair depends on the best next state-action pair.
Q(s,a) ← Q(s,a) + α · [r + γ · max Q(s',a') − Q(s,a)]
α = learning rate · r = reward received · γ · max Q(s',a') = best future value · The bracketed term is the TD error — how surprised were we?
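One application of this update rule, worked through in code. The state and action names, the table layout, and the values α = 0.5 and γ = 0.9 are assumptions for the example:

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the TD error."""
    best_next = max(Q[(next_state, a)] for a in actions)          # max over a' of Q(s', a')
    td_error = reward + gamma * best_next - Q[(state, action)]    # the bracketed term
    Q[(state, action)] += alpha * td_error
    return td_error

Q = defaultdict(float)            # unseen (state, action) pairs start at 0
Q[("s1", "right")] = 2.0          # pretend we already learned this value
td = q_update(Q, "s0", "right", reward=1.0, next_state="s1", actions=["left", "right"])
# td = 1.0 + 0.9 * 2.0 - 0.0 = 2.8, so Q[("s0", "right")] moves from 0.0 to 1.4
```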
DQN (Deep Q-Network) replaces the Q-table with a neural network. DeepMind used this to learn Atari games from raw pixels, reaching human-level play on many of them: the input was screen pixels, and the Q-network output an estimated value for each action.
Policy Gradient methods directly optimize the policy without a value function. Instead of learning "how good is each state," you learn "which actions to take more or less often" based on whether they led to high rewards.
REINFORCE: if an action led to a high-reward trajectory, increase its probability. If it led to low reward, decrease it. Simple but high variance — you need many samples to get stable gradient estimates.
Actor-Critic: combine both — a policy network (actor) that picks actions, and a value network (critic) that estimates how good the current state is. The critic reduces variance by providing a baseline. PPO is a modern actor-critic algorithm.
RLHF uses PPO (Proximal Policy Optimization) to fine-tune LLMs. The LLM is the policy (state = conversation so far, action = next token), the reward model gives the reward signal, and a KL penalty together with PPO's clipped updates keeps the model from drifting too far from the supervised baseline.
DPO (Direct Preference Optimization) bypasses the RL loop entirely by showing that the RLHF objective can be rewritten as a supervised loss. Train directly on preference pairs (a preferred response and a rejected response to the same prompt) using a binary cross-entropy loss on the policy's log-probabilities relative to a frozen reference model. No separate reward model is needed and there is no PPO loop, which makes it much simpler and a widely used alternative.
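For a single preference pair, the DPO loss can be sketched as below. The inputs are assumed to be summed log-probabilities of each full response under the policy and under a frozen reference model; the argument names and β = 0.1 are illustrative assumptions:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin))."""
    # How much more the policy favours the chosen response than the reference does,
    # minus the same quantity for the rejected response.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))
```

When the policy equals the reference the margin is zero and the loss is log 2. The loss falls as the policy shifts probability toward the preferred response.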
06 Loss Functions & Gradient Descent
How the model knows it's wrong, and how it fixes itself.
The Loss Function
A loss function takes (prediction, true label) and returns a single number: how wrong are we? The higher the number, the worse. Training is the process of minimizing this number by changing the model's parameters.
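Two common choices, sketched under the assumption that regression uses mean squared error and binary classification uses cross-entropy. The original text does not prescribe specific losses at this point, so treat these as representative examples:

```python
import math

def mean_squared_error(predictions, targets):
    """Regression loss: average squared difference."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def binary_cross_entropy(probabilities, labels, eps=1e-12):
    """Classification loss: penalizes confident wrong probabilities heavily."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)
```

Both return a single number that is zero (or near zero) for perfect predictions and grows as predictions get worse.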
Gradient Descent — The Core Algorithm
Imagine you're blindfolded on a hilly landscape and want to reach the lowest valley. You can only feel the slope under your feet. The strategy: always take a small step in the downhill direction.
That's gradient descent. The "landscape" is the loss surface over parameter space. The "slope under your feet" is the gradient — a vector that points in the direction of steepest increase in loss. We step in the opposite direction.
Loss Landscape
The ball rolls downhill toward minimum loss — that's gradient descent. Multiple valleys = local minima problem.
Parameters are coordinates on a landscape. Loss is altitude. Training is rolling a ball downhill. The learning rate controls how large each step is — too big and you bounce over valleys, too small and you take forever.
Learning rate: Step size per iteration. Too large → oscillates or diverges. Too small → extremely slow convergence. One of the most important hyperparameters in all of ML.
Batch size: SGD uses the gradient from 1 example, which is noisy but cheap per step. Mini-batch uses the gradient from ~32-256 examples, a practical middle ground. Full batch uses all examples, which is precise but slow per step.
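To make the effect of the learning rate concrete, here is a sketch of gradient descent on the one-dimensional loss L(w) = (w − 3)², whose minimum is at w = 3. The loss function and the three learning rates are assumptions picked to show the three regimes:

```python
def gradient_descent(learning_rate, steps=50, w=0.0):
    """Minimize L(w) = (w - 3)^2 using its gradient dL/dw = 2 * (w - 3)."""
    for _ in range(steps):
        gradient = 2 * (w - 3)
        w -= learning_rate * gradient  # step against the gradient
    return w

well_tuned = gradient_descent(0.1)    # converges close to 3
too_large = gradient_descent(1.1)     # each step overshoots further: diverges
too_small = gradient_descent(0.001)   # barely moves from 0 in 50 steps
```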
07 Neural Networks (ANN)
Why we need them, and what they actually are (hint: not really brains).
Linear models can only draw straight lines (hyperplanes) as decision boundaries. Real-world data is rarely linearly separable. A neural network solves this by composing many simple transformations: each layer learns a new representation of the data, until the final layer can separate the classes with a simple boundary.
Data enters at the input layer; each layer transforms it and passes it forward. This is all a neural network does at inference time: one directed sweep of multiply-add-activate, from input to output.
Neural Network: Anatomy
Each connection = 1 learnable weight parameter
The Single Neuron: What's Actually Happening
Each neuron does two things: (1) takes a weighted sum of its inputs, (2) applies an activation function. That's it. The magic is in what happens when you stack millions of them.
output = activation(w₁·x₁ + w₂·x₂ + ... + wₙ·xₙ + b)
w = weights (what we train) · x = inputs · b = bias (shift term) · activation = non-linearity
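The formula above translates almost directly into code. This is a sketch with made-up weights and inputs; `relu` and `sigmoid` are two standard activation choices:

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# 0.5*1.0 + (-1.0)*2.0 + 0.25 = -1.25, which ReLU clips to 0.0
clipped = neuron([1.0, 2.0], [0.5, -1.0], 0.25, relu)
# 0.5*1.0 + 0.5*2.0 + 0.0 = 1.5, which ReLU passes through unchanged
passed = neuron([1.0, 2.0], [0.5, 0.5], 0.0, relu)
```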
Why Non-Linearity (Activation Functions) Are Critical
Without an activation function, stacking layers is useless — a network of 100 linear layers collapses into a single linear layer mathematically. The activation function (ReLU, Sigmoid, Tanh) introduces non-linearity, which is what allows the network to learn curved decision boundaries.
ReLU: Zero for negative inputs, linear for positive. Cheap to compute and mitigates vanishing gradients. Default choice for hidden layers today.
Sigmoid: Squashes to (0,1). Used in the output layer for binary classification. Causes vanishing gradients in deep networks, so it is generally avoided in hidden layers.
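The collapse of stacked linear layers can be checked directly. This sketch uses one-dimensional "layers" with arbitrary coefficients (all values are assumptions for illustration): two linear layers compose into one, while inserting a ReLU between them breaks linearity:

```python
def linear(a, b):
    """A one-dimensional linear layer: x -> a*x + b."""
    return lambda x: a * x + b

layer1, layer2 = linear(2.0, 1.0), linear(-3.0, 4.0)

def stacked(x):
    return layer2(layer1(x))          # -3 * (2x + 1) + 4

collapsed = linear(-3.0 * 2.0, -3.0 * 1.0 + 4.0)   # a single layer: -6x + 1

def relu(z):
    return max(0.0, z)

def with_relu(x):
    return layer2(relu(layer1(x)))    # no single linear layer reproduces this

# The slope of with_relu changes across the kink at x = -0.5:
slope_left = (with_relu(0.0) - with_relu(-2.0)) / 2.0    # -1.5
slope_right = (with_relu(2.0) - with_relu(0.0)) / 2.0    # -6.0
```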
08 Specialized Architectures: CNN, RNN, LSTM, GRU
Vanilla ANNs treat all inputs as interchangeable. These architectures exploit structure in the data.
A fully-connected ANN (the vanilla net from section 07) applies the same transformation to everything with no awareness of spatial proximity, temporal order, or sequence structure. Real data has structure — images have local patterns, text has sequential dependencies, audio has temporal rhythms. Specialized architectures bake these structural assumptions in as inductive biases, making them far more efficient on their target data types.
CNN — Convolutional Neural Network
Problem it solves: An image is a grid of pixels. Nearby pixels are related (they form edges, textures, objects). A fully-connected layer treating each pixel independently ignores this and has too many parameters to train.
Key idea: Slide a small filter (kernel) across the image, computing a dot product at each position. This detects local patterns — edges, corners, textures — regardless of where they appear in the image (translation equivariance).
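A sketch of the sliding-filter operation with no padding and stride 1. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation. The image and the vertical-edge kernel are made-up examples:

```python
def conv2d(image, kernel):
    """Slide the kernel over the image, taking a dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]          # dark left half, bright right half
vertical_edge = [[-1, 1],
                 [-1, 1]]       # responds where brightness rises left to right
feature_map = conv2d(image, vertical_edge)
# feature_map == [[0, 2, 0], [0, 2, 0]]: the filter fires only at the edge
```

The same kernel weights are reused at every position, which is why a convolutional layer needs far fewer parameters than a fully-connected one.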
Convolution: A Filter Sliding Across an Image
Conv → ReLU → Pooling, repeated. Early layers: edges and colors. Middle layers: textures and shapes. Deep layers: objects and parts. Final layers: fully-connected for classification.
Max-pooling: take the maximum value in each region. Reduces spatial size and provides a degree of translation invariance ("I saw an edge here, approximately"). With the common 2×2 window and stride 2, it halves width and height each time it is applied.
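A sketch of 2×2 max-pooling with stride 2, assuming the input has even height and width. The 4×4 feature map is a made-up example:

```python
def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    return [[max(feature_map[i][j], feature_map[i][j + 1],
                 feature_map[i + 1][j], feature_map[i + 1][j + 1])
             for j in range(0, len(feature_map[0]) - 1, 2)]
            for i in range(0, len(feature_map) - 1, 2)]

pooled = max_pool_2x2([[1, 3, 2, 0],
                       [4, 2, 1, 1],
                       [0, 0, 5, 6],
                       [1, 2, 7, 8]])
# pooled == [[4, 2], [2, 8]]: a 4x4 map becomes 2x2
```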
Each conv layer is like examining an image at a different magnification. First pass: individual pixel patterns (edges). Second pass: combinations of those patterns (curves, corners). Third pass: combinations of curves (eyes, wheels). CNNs build a feature hierarchy bottom-up.
RNN — Recurrent Neural Network
Problem it solves: A sequence (text, time-series, audio) has temporal dependencies — the meaning of "not" depends on what comes after it. A vanilla ANN processes each position independently; it has no memory of previous inputs.
Key idea: At each time step, the network takes the current input and a hidden state from the previous step. The hidden state acts as a running memory — a compressed summary of everything seen so far.
RNN: Hidden State Propagates Through Time
The hidden state update is: hₜ = tanh(Wₓ·xₜ + Wₕ·hₜ₋₁ + b). The same weights W are reused at every time step, so the network learns "how to update its memory" in a position-independent way.
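The update can be sketched with scalar inputs and a scalar hidden state, so the weight matrices shrink to single numbers. The weight values here are arbitrary assumptions; real RNNs use vectors and matrices:

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One time step: combine the current input with the previous hidden state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    h = 0.0                      # initial hidden state
    states = []
    for x_t in sequence:         # the same weights are reused at every step
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

states = run_rnn([1.0, 0.0, 0.0])  # a single "1" followed by silence
```

The later states stay non-zero even though the later inputs are zero: the hidden state carries a fading memory of the first input.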
LSTM — Long Short-Term Memory
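Vanilla RNNs struggle with long sequences: backpropagating through many time steps multiplies the gradient by the recurrent weights over and over, so it tends to shrink toward zero (the vanishing gradient problem). LSTM (1997, Hochreiter & Schmidhuber) addresses this by adding a cell state, a separate memory track that flows through time with additive updates rather than multiplicative ones. Gradients can flow through the cell state with far less shrinkage.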
LSTM Cell — Three Gates Control Information Flow
The cell state is a conveyor belt running alongside the sequence. The forget gate decides what to wipe off (multiply by ~0). The input gate decides what to stamp on (add new info). The output gate decides what to read off for the current step. Gradients flow backwards through the belt's additions, scaled only by the forget gate rather than pushed through repeated weight multiplications and squashing non-linearities, so they vanish far more slowly.
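A scalar sketch of one LSTM step using the standard gate equations. The parameter dictionary `p` and its key names are hypothetical; real implementations use weight matrices and vector states:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step with scalar input, hidden state, and cell state."""
    f = sigmoid(p["wf"] * x_t + p["uf"] * h_prev + p["bf"])    # forget gate: what to wipe
    i = sigmoid(p["wi"] * x_t + p["ui"] * h_prev + p["bi"])    # input gate: what to stamp on
    o = sigmoid(p["wo"] * x_t + p["uo"] * h_prev + p["bo"])    # output gate: what to read off
    g = math.tanh(p["wg"] * x_t + p["ug"] * h_prev + p["bg"])  # candidate new content
    c_t = f * c_prev + i * g        # additive update of the conveyor belt
    h_t = o * math.tanh(c_t)
    return h_t, c_t

# Forget gate held open (bias +10) and input gate held shut (bias -10):
# the cell state passes through almost unchanged.
params = dict(wf=0.0, uf=0.0, bf=10.0, wi=0.0, ui=0.0, bi=-10.0,
              wo=0.0, uo=0.0, bo=0.0, wg=0.0, ug=0.0, bg=0.0)
h_t, c_t = lstm_step(1.0, 0.0, 0.7, params)
```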
GRU — Gated Recurrent Unit
GRU (2014, Cho et al.) is a simplified LSTM that merges the forget and input gates into a single update gate, and merges the cell state and hidden state. Fewer parameters, similar performance on most tasks, faster to train.
Update gate (z): How much of the previous hidden state to keep vs. overwrite. Reset gate (r): How much of the previous state to forget when computing the new candidate state. Simpler than LSTM — often preferred for smaller datasets.
LSTM: separate cell state + hidden state, 3 gates, more expressive. GRU: single hidden state, 2 gates, fewer params. Rule of thumb: GRU for smaller datasets/simpler tasks, LSTM when you have data to learn the extra expressivity.
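For comparison, a scalar sketch of one GRU step. Conventions differ on whether z or (1 − z) weights the old state; this sketch assumes z = 1 means "overwrite". The parameter names in `p` are hypothetical:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step with scalar input and hidden state."""
    z = sigmoid(p["wz"] * x_t + p["uz"] * h_prev + p["bz"])   # update gate
    r = sigmoid(p["wr"] * x_t + p["ur"] * h_prev + p["br"])   # reset gate
    candidate = math.tanh(p["wh"] * x_t + p["uh"] * (r * h_prev) + p["bh"])
    return (1 - z) * h_prev + z * candidate   # blend old state and candidate

base = dict(wz=0.0, uz=0.0, wr=0.0, ur=0.0, br=0.0, wh=1.0, uh=0.0, bh=0.0)
kept = gru_step(1.0, 0.5, dict(base, bz=-20.0))        # z near 0: keep the old state
overwritten = gru_step(1.0, 0.5, dict(base, bz=20.0))  # z near 1: take the candidate
```

There is one state and two gates here, against the LSTM's two states and three gates, which is where the parameter savings come from.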
When to Use Which Architecture
Architecture Selection Guide
09 Backpropagation
The algorithm that made deep learning possible — it's just the chain rule, applied cleverly.
The central problem of training a network: we have a loss value, and we need to know how to adjust every weight in the network to reduce that loss. With millions of weights across many layers, computing this naively would be impossibly expensive.
Backpropagation solves this by applying the chain rule of calculus backwards through the network — from loss back to the first layer. It's not magic. It's efficient bookkeeping of derivatives.
Forward Pass & Backward Pass
The Chain Rule — One Sentence
If loss L depends on weight W through intermediate values, then ∂L/∂W = (∂L/∂output) × (∂output/∂W). Backprop applies this repeatedly from the output layer backward to the input, reusing intermediate computations to avoid redundant work.
Think of each weight as an employee. After every prediction, you run an audit: how much did each employee (weight) contribute to the error? Weights that contributed more to a wrong answer get adjusted more. Backprop is the audit process — it assigns blame proportionally.
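A sketch of the forward and backward pass for a tiny two-weight network, y = w₂·ReLU(w₁·x) with squared-error loss, plus a finite-difference check. The network shape and the numbers are assumptions chosen to keep the arithmetic easy to follow:

```python
def forward_backward(x, target, w1, w2):
    # Forward pass: keep the intermediates so the backward pass can reuse them.
    z = w1 * x
    h = max(0.0, z)                  # ReLU
    y = w2 * h
    loss = (y - target) ** 2
    # Backward pass: chain rule, from the loss back toward the first weight.
    dL_dy = 2 * (y - target)
    dL_dw2 = dL_dy * h               # y = w2 * h
    dL_dh = dL_dy * w2
    dL_dz = dL_dh * (1.0 if z > 0 else 0.0)   # ReLU passes gradient only where z > 0
    dL_dw1 = dL_dz * x               # z = w1 * x
    return loss, dL_dw1, dL_dw2

def numerical_gradient(f, w, eps=1e-6):
    """Central finite difference, used only to sanity-check backprop."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

loss, grad_w1, grad_w2 = forward_backward(2.0, 1.0, 0.5, 3.0)
# loss = 4.0, grad_w1 = 24.0, grad_w2 = 4.0
check_w1 = numerical_gradient(lambda w: forward_backward(2.0, 1.0, w, 3.0)[0], 0.5)
```

Note how dL_dy is computed once and reused for both weights: that reuse is the "efficient bookkeeping" that makes backprop cheap.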
10 Common Misconceptions
Now that you understand supervised learning, loss, gradient descent, and backprop — the next document covers how we get from neural networks to Transformers (the architecture behind every major LLM), how LLMs are trained at scale, fine-tuning with RLHF, open weights, prompting, and the full inference pipeline.
11 Training Tricks That Make Deep Learning Work
Theoretical understanding of backprop is necessary but not sufficient. These are the engineering practices that make deep networks actually trainable.
Optimizers — Beyond Vanilla SGD
Vanilla SGD applies the same learning rate to every parameter. Real loss landscapes are anisotropic: some dimensions are steep, some are flat. Better optimizers adapt the step size per parameter.
Momentum accumulates a velocity vector in the direction of persistent gradient, damping oscillations and accelerating through long, shallow valleys. Think of a ball rolling downhill — it builds up speed in a consistent direction.
v ← β·v + (1−β)·∇L
w ← w − η·v
β ≈ 0.9 is typical. At each step, 90% of the old velocity is kept and 10% of the new gradient is blended in. Gradient components that flip sign from step to step cancel out; the consistent downhill direction accumulates. Nesterov momentum evaluates the gradient at the "lookahead" position, which is often slightly more accurate.
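The two momentum update lines can be sketched as a loop, here minimizing the assumed toy loss L(w) = (w − 3)². The learning rate and step count are illustrative values:

```python
def sgd_momentum(grad_fn, w, learning_rate=0.1, beta=0.9, steps=200):
    """Gradient descent with an exponentially averaged velocity."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad_fn(w)   # blend the new gradient into the velocity
        w = w - learning_rate * v                # step along the velocity
    return w

w_final = sgd_momentum(lambda w: 2 * (w - 3), w=0.0)  # approaches the minimum at w = 3
```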
Adam (Adaptive Moment Estimation) maintains per-parameter estimates of both the first moment (mean of gradients) and the second moment (uncentred variance). This gives every parameter its own adaptive learning rate: larger effective steps for parameters whose gradients have been small or infrequent, smaller steps for parameters whose gradients have been consistently large.
m ← β₁·m + (1−β₁)·g (1st moment: direction)
v ← β₂·v + (1−β₂)·g² (2nd moment: gradient scale)
w ← w − η·m̂/(√v̂ + ε)
β₁=0.9, β₂=0.999, ε=1e-8 in practice. Bias correction: m̂ = m/(1−β₁ᵗ), v̂ = v/(1−β₂ᵗ), which compensates for m and v starting at zero. AdamW decouples weight decay from the gradient update, which prevents weight decay from interacting incorrectly with the adaptive learning rates. AdamW is now the usual default for training LLMs.
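A sketch of the Adam update for a single scalar parameter, using the standard rule w ← w − η·m̂/(√v̂ + ε) with bias-corrected moments. The toy gradients and η = 0.1 are assumptions for the example:

```python
import math

def adam_minimize(grad_fn, w, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g          # first moment (direction)
        v = beta2 * v + (1 - beta2) * g * g      # second moment (gradient scale)
        m_hat = m / (1 - beta1 ** t)             # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w

# The first step has size ~learning_rate regardless of how large the gradient is:
gentle = adam_minimize(lambda w: 2 * (w - 3), 0.0, steps=1)
steep = adam_minimize(lambda w: 200 * (w - 3), 0.0, steps=1)
w_final = adam_minimize(lambda w: 2 * (w - 3), 0.0)
```

Dividing by √v̂ is what makes the step size insensitive to the raw gradient scale, which is the per-parameter adaptivity described above.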
Optimizer Landscape — When to Use What
Batch Normalization & Layer Normalization
The original motivation for normalization was internal covariate shift: the distribution of each layer's activations shifts as earlier layers update, forcing later layers to constantly readapt. Normalization layers counter this by standardising activations, which in practice makes deep networks train faster and more stably.
Batch Normalization: Normalise across the batch dimension, computing mean/variance per feature over N samples. Works brilliantly for CNNs. Problem: it depends on batch size, breaks down with batch=1 during training, and needs tracked running statistics at inference. γ and β (scale and shift) are learned.
Layer Normalization: Normalise across the feature dimension, computing mean/variance per sample over all features. Works with any batch size, including batch=1. Standard for Transformers and LLMs. Unlike batch norm, it behaves identically for variable-length sequences and auto-regressive generation.
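The difference between the two comes down to which axis the statistics are computed over. A sketch on a batch stored as a list of rows (one row per sample), omitting the learned scale and shift (γ, β) and the running statistics for brevity:

```python
def normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

def batch_norm(batch):
    # Normalise each feature (column) across the samples in the batch.
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    # Normalise each sample (row) across its own features.
    return [normalize(row) for row in batch]

batch = [[1.0, 10.0],
         [3.0, 30.0]]
bn = batch_norm(batch)   # roughly [[-1, -1], [1, 1]]
ln = layer_norm(batch)   # roughly [[-1, 1], [-1, 1]]

# With a single sample, batch statistics are degenerate; layer norm is unaffected.
bn_single = batch_norm([[1.0, 10.0]])   # [[0.0, 0.0]]: the information is wiped out
ln_single = layer_norm([[1.0, 10.0]])   # roughly [[-1, 1]]
```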
Dropout — Regularisation by Forgetting
During training, each neuron is zeroed independently with probability p at each forward pass (typically p=0.1–0.5). Each mini-batch therefore trains a different sub-network. The ensemble of all 2ⁿ sub-networks is approximately averaged at test time (by scaling activations by 1−p at test time, or equivalently dividing by 1−p at train time, which is known as "inverted dropout").
Dropout forces every neuron to be useful on its own, because it can't rely on a fixed set of partner neurons. This is approximately equivalent to averaging predictions over exponentially many subnetworks, which reduces variance and prevents co-adaptation (neurons memorising together).
- Dense layers in classic CNNs (after pooling)
- Attention dropout in Transformers (low rate, ~0.1)
- Embedding dropout in RNNs
- Usually NOT right next to batch norm layers (the two tend to interact poorly)
- NOT at inference — only active during training
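A sketch of inverted dropout on a list of activations. The function signature is an assumption for illustration; deep learning frameworks provide their own dropout layers:

```python
import random

def dropout(activations, p, training, rng):
    """Inverted dropout: zero each unit with probability p, rescale the survivors."""
    if not training or p == 0.0:
        return list(activations)        # inference: pass through unchanged
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
train_out = dropout([1.0] * 10000, p=0.5, training=True, rng=rng)
eval_out = dropout([1.0, 2.0], p=0.5, training=False, rng=rng)
```

Dividing the surviving activations by 1−p keeps the expected activation the same in training and inference, so nothing needs rescaling at test time.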
12 Generative Architectures — GANs & Diffusion
The two families of models that learned to create, not just classify — and why one replaced the other.
GANs — Generative Adversarial Networks
Proposed by Goodfellow et al. (2014). Two networks trained simultaneously in a minimax game: a Generator tries to produce fake data that looks real; a Discriminator tries to distinguish real from fake. Each makes the other better.
GAN Training Loop
Diffusion Models — The Current State of the Art
Diffusion models (DDPM, 2020; Score Matching) take a completely different approach: instead of adversarial training, they learn to reverse a noise process. Forward process: gradually add Gaussian noise to an image until it's pure noise. Reverse process: learn to denoise step-by-step. Generation = start from pure noise and run the learned reverse process.
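The forward (noising) process has a closed form: x_t = √ᾱ_t · x₀ + √(1−ᾱ_t) · ε, where ᾱ_t is the running product of (1−β_s). A sketch assuming the linear β schedule from the DDPM paper (1e-4 to 0.02 over 1000 steps), with data represented as a flat list of floats; the function names are illustrative:

```python
import math
import random

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alpha_bar, product = [], 1.0
    for t in range(num_steps):
        beta_t = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        product *= 1.0 - beta_t
        alpha_bar.append(product)
    return alpha_bar

def add_noise(x0, t, alpha_bar, rng):
    """Jump straight to step t: mix the clean signal with Gaussian noise."""
    signal = math.sqrt(alpha_bar[t])
    noise_scale = math.sqrt(1.0 - alpha_bar[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    noisy = [signal * x + noise_scale * n for x, n in zip(x0, noise)]
    return noisy, noise   # the denoising net is trained to predict `noise` from `noisy`

alpha_bar = alpha_bar_schedule()
noisy_early, _ = add_noise([1.0, -1.0], 0, alpha_bar, random.Random(0))    # almost clean
noisy_late, _ = add_noise([1.0, -1.0], 999, alpha_bar, random.Random(0))   # almost pure noise
```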
Diffusion: Forward (destroy) vs Reverse (create)
- Stable training — simple MSE loss on noise prediction, no adversarial dynamic
- Better coverage — generates diverse outputs, no mode collapse
- Easy to condition — add text/class embeddings to the denoising net (Stable Diffusion, DALL-E 2)
- Composable — combine multiple conditioning signals
- Slow sampling — original DDPM needs 1000 steps; DDIM, PLMS reduce to ~20–50
- High compute — each step is a full forward pass through U-Net
- Latent diffusion (LDM / Stable Diffusion) compresses to latent space first, then diffuses — much cheaper
13 Transfer Learning
The paradigm that makes deep learning practical — don't train from scratch when someone already trained from scratch for you.
Training a deep network from random weights requires massive data and compute. Transfer learning reuses a model trained on a large task as the starting point for a smaller related task. The pretrained weights encode general-purpose representations; fine-tuning adapts them to your specific problem.
The Transfer Learning Pipeline
Feature extraction: Freeze all pretrained layers. Only train the new task head. Use when your dataset is very small (<1K examples), where training deep layers would overfit.
Full fine-tuning: Update all weights, but start from pretrained values with a very small learning rate (~1/10 of the original). Gives the best results when you have enough data (10K+ examples).
LoRA (Low-Rank Adaptation): Freeze the original weights. Add small trainable low-rank matrices alongside each layer. Train only those. This typically means 99% or more fewer trainable parameters. A standard approach for fine-tuning LLMs. Covered in depth in Part 2.
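To see where the parameter savings come from, here is a sketch that counts trainable parameters for one weight matrix under LoRA, where the frozen W (d_out × d_in) is augmented by a trainable product B·A with A of shape (rank × d_in) and B of shape (d_out × rank). The layer size and rank are assumed example values:

```python
def lora_parameter_counts(d_out, d_in, rank):
    """Return (frozen full-matrix params, trainable adapter params)."""
    full = d_out * d_in                      # frozen pretrained weight W
    adapter = rank * d_in + d_out * rank     # trainable A and B
    return full, adapter

full, adapter = lora_parameter_counts(4096, 4096, 8)
trainable_fraction = adapter / full
# full = 16,777,216; adapter = 65,536; trainable_fraction = 0.00390625 (about 0.4%)
```

For this layer the adapter trains roughly 0.4% of the parameters that full fine-tuning would update, which is where the "99% fewer" figure comes from.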