Every core algorithm from first principles — the exact equations that run inside every model, derived, worked through numerically, and animated. Not just stated.
00 Mathematical Foundations
Three pillars hold up everything in AI: linear algebra (the language of data), probability (the language of uncertainty), and calculus (the language of optimization). Master the handful of equations below and every later section becomes a combination of things you already know.
Linear Algebra — The Language of Data
All ML operates on vectors and matrices. A dataset of \(N\) samples, each with \(D\) features, is a matrix \(X \in \mathbb{R}^{N \times D}\). A neural network layer is a matrix multiplication. Understanding matrix shapes — and why they must align — prevents 90% of implementation bugs.
$$ A\mathbf{v} = \lambda \mathbf{v} $$
A matrix transforms its eigenvector \(\mathbf{v}\) by scaling only (no rotation). \(\lambda\) = eigenvalue. Used in PCA: principal components are eigenvectors of the covariance matrix, ordered by eigenvalue magnitude.
Animated · Matrix multiplication, cell by cell
Every output cell \(C_{ij}\) is a dot product: row \(i\) of \(A\) against column \(j\) of \(B\). This exact operation, tiled billions of times, is what a GPU spends its life doing.
C[1,1] = ?
Probability — The Language of Uncertainty
Bayes' Theorem
$$ P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)} $$
posterior = likelihood × prior / evidence. Central to all Bayesian ML: update beliefs about parameters \(\theta\) given data \(D\).
Information Theory
$$ H(p) = -\sum_x p(x) \log p(x) $$
$$ H(p, q) = -\sum_x p(x) \log q(x), \qquad D_{\mathrm{KL}}(p \,\|\, q) = H(p,q) - H(p) $$
Shannon entropy: average bits to encode a sample from \(p\); maximal when all outcomes are equally likely. Cross-entropy \(H(p,q)\) measures how well \(q\) approximates \(p\) — it is the loss function of every classifier and every LLM. KL divergence is the gap between them: the "extra bits" paid for using the wrong distribution.
Worked example — why cross-entropy punishes confident mistakes
True label: class 2 of 3 (one-hot \(p = [0, 1, 0]\)). Two models predict:
Model A (hedging): \(q_A = [0.2,\, 0.6,\, 0.2] \Rightarrow H(p, q_A) = -\log 0.6 = 0.51\) nats.
Model B (confidently wrong): \(q_B = [0.85,\, 0.10,\, 0.05] \Rightarrow H(p, q_B) = -\log 0.10 = 2.30\) nats.
B's loss is ~4.5× higher. The \(-\log\) curve explodes as the true-class probability \(\to 0\): being confidently wrong is catastrophically expensive, being uncertain is mildly expensive. This asymmetry is what trains calibrated models.
Calculus — The Language of Optimization
The Chain Rule (Backbone of All Deep Learning)
$$ z = f(y),\; y = g(x) \;\Longrightarrow\; \frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} $$
$$ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial a_n} \cdot \frac{\partial a_n}{\partial a_{n-1}} \cdots \frac{\partial a_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial w} $$
For neural nets — many composed functions — the chain rule becomes a product of local derivatives. Backprop is the chain rule applied efficiently via dynamic programming: compute each factor once, reuse it for every parameter upstream.
The gradient \(\nabla_\mathbf{x} f \in \mathbb{R}^n\) of a scalar w.r.t. a vector points in the direction of steepest increase. The Jacobian \(J \in \mathbb{R}^{m \times n}\) generalizes this for vector-valued functions: \(J_{ij} = \partial f_i / \partial x_j\).
01 Supervised Learning — Mathematics
Linear Regression
Find weights \(\mathbf{w}\) that minimize the mean squared error between predictions \(\hat{y} = X\mathbf{w}\) and targets \(\mathbf{y}\).
$$ L(\mathbf{w}) = \frac{1}{N} \lVert X\mathbf{w} - \mathbf{y} \rVert^2 = \frac{1}{N} (X\mathbf{w} - \mathbf{y})^\top (X\mathbf{w} - \mathbf{y}) $$
Expanded: \(\frac{1}{N}(\mathbf{w}^\top X^\top X \mathbf{w} - 2\mathbf{w}^\top X^\top \mathbf{y} + \mathbf{y}^\top \mathbf{y})\) — a quadratic bowl in \(\mathbf{w}\), hence one global minimum.
Closed-form solution: set the gradient to zero.
$$ \frac{\partial L}{\partial \mathbf{w}} = \frac{2}{N}\left( X^\top X \mathbf{w} - X^\top \mathbf{y} \right) = 0 $$derivative of the quadratic, term by term
$$ \mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y} $$the Normal Equation
Why not always use this? \(X^\top X\) is \(D \times D\) — inverting it is \(O(D^3)\). For \(D > 10\text{K}\) features, gradient descent is faster.
💡
Geometric interpretation: the optimal \(\mathbf{w}\) projects \(\mathbf{y}\) onto the column space of \(X\). The residual \(\mathbf{y} - X\mathbf{w}^*\) is orthogonal to every column of \(X\) — which is exactly what \(\partial L / \partial \mathbf{w} = 0\) enforces: \(X^\top(\mathbf{y} - X\mathbf{w}) = 0\).
Animated · Gradient descent on a loss bowl
The ball is \(w\); the bowl is \(L(w)\). Each step: \(w \leftarrow w - \eta \, L'(w)\). Drag the learning-rate slider — watch convergence become oscillation become divergence. This single picture explains most training failures.
η = 0.22
Logistic Regression (Binary Classification)
Output a probability using the sigmoid function, then use cross-entropy loss — not MSE. MSE on top of a sigmoid gives a non-convex loss with flat regions; cross-entropy stays convex and keeps gradients alive.
$$ \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \mathbf{w}^\top \mathbf{x} + b $$
\(\sigma\) maps any real number to \((0,1)\). Derivative: \(\sigma'(z) = \sigma(z)\,(1 - \sigma(z))\) — elegantly reuses the forward-pass value.
$$ L(\mathbf{w}) = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \Big] $$
Binary cross-entropy. When \(y_i = 1\): only the first term survives (penalize low \(\hat{y}_i\)). When \(y_i = 0\): only the second (penalize high \(\hat{y}_i\)). This is exactly the negative log-likelihood of a Bernoulli distribution.
Gradient of cross-entropy loss w.r.t. weights — the sigmoid's derivative cancels perfectly:
$$ \frac{\partial L}{\partial \mathbf{w}} = \frac{1}{N} X^\top (\hat{\mathbf{y}} - \mathbf{y}) $$same form as linear regression's gradient!
$$ \mathbf{w} \leftarrow \mathbf{w} - \eta \cdot \frac{1}{N} X^\top (\hat{\mathbf{y}} - \mathbf{y}) $$\((\hat{y} - y)\) is just the prediction error
Multiclass — Softmax & Cross-Entropy
$$ \mathrm{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_j e^{z_j}} $$
Converts raw logits \(\mathbf{z} \in \mathbb{R}^K\) to a probability distribution over \(K\) classes. Numerically stable version: subtract \(\max(\mathbf{z})\) before exponentiation — identical output, no overflow.
$$ L = -\sum_k y_k \log \hat{y}_k = -\log \hat{y}_{\text{true class}}, \qquad \frac{\partial L}{\partial \mathbf{z}} = \hat{\mathbf{y}} - \mathbf{y} $$
Since \(\mathbf{y}\) is one-hot, only the log-probability of the true class matters. And the gradient through softmax + cross-entropy collapses to "prediction minus label" — the cleanest gradient in all of ML. This same \(\hat{y} - y\) appears at the output of every LLM during pretraining.
Regularization Mathematics
L2 (Ridge / Weight Decay)
$$ L_{\mathrm{reg}} = L + \frac{\lambda}{2} \lVert \mathbf{w} \rVert_2^2 $$
Gradient adds \(\lambda \mathbf{w}\) to every update: \(\mathbf{w} \leftarrow \mathbf{w}(1 - \eta\lambda) - \eta \nabla L\). Shrinks all weights smoothly. Equivalent to a Gaussian prior on weights.
L1 (Lasso / Sparsity)
$$ L_{\mathrm{reg}} = L + \lambda \lVert \mathbf{w} \rVert_1 $$
Gradient adds \(\lambda \cdot \mathrm{sign}(\mathbf{w})\) — a constant-magnitude pull toward zero that drives small weights exactly to zero. Sparse solutions; equivalent to a Laplace prior; useful for feature selection.
Update: \(w \leftarrow 0 - 0.1 \cdot (-8) = 0.8\). Loss falls from \(6.5\) to \(2.42\). Repeat ~20 times and \(w \to 1.6\), the least-squares optimum. Every LLM training step is this exact loop with \(10^{11}\) parameters instead of one.
02 Unsupervised Learning — Mathematics
K-Means Clustering
Minimize the within-cluster sum of squared distances. This is alternating minimization over assignments and centroids.
$$ J = \sum_{k=1}^{K} \sum_{\mathbf{x} \in C_k} \lVert \mathbf{x} - \boldsymbol{\mu}_k \rVert^2 $$
Two sub-problems: (1) fix \(\boldsymbol{\mu}_k\), assign each point to its nearest centroid; (2) fix assignments, set \(\boldsymbol{\mu}_k\) = mean of cluster \(k\). Neither step increases \(J\) → convergence guaranteed — but only to a local minimum.
Convergence proof sketch — why the centroid must be the mean:
Find directions of maximum variance. These are the eigenvectors of the sample covariance matrix, ordered by eigenvalue (variance explained).
$$ \Sigma = \frac{1}{N} X^\top X \;(\text{zero-mean } X), \qquad \Sigma = V \Lambda V^\top $$
\(V\) = eigenvectors (principal components), \(\Lambda\) = diagonal eigenvalues (variance along each PC). Project: \(Z = XV\) gives a decorrelated, variance-sorted representation.
🔗
PCA via SVD: \(X = U S V^\top\). Principal components are the right singular vectors \(V\); variance explained by each PC is \(s_i^2 / N\). SVD is numerically preferred over forming \(X^\top X\) explicitly — it avoids squaring the condition number.
Autoencoders
$$ L(\mathbf{x}) = \lVert \mathbf{x} - g(f(\mathbf{x})) \rVert^2 $$
\(f\): encoder, \(\mathbb{R}^d \to \mathbb{R}^k\) with \(k \ll d\). \(g\): decoder back to \(\mathbb{R}^d\). Minimizing reconstruction error forces the bottleneck \(\mathbf{z}\) to learn a compressed representation of the data's essential structure.
Variational Autoencoder (VAE) — ELBO Loss
$$ L = \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x})}\big[\log p(\mathbf{x}|\mathbf{z})\big] - D_{\mathrm{KL}}\big(q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\big) $$
Term 1 — reconstruction: how well \(\mathbf{z}\) decodes back to \(\mathbf{x}\). Term 2 — regularization: KL forces the latent posterior toward the prior \(p(\mathbf{z}) = \mathcal{N}(0, I)\), making the latent space continuous and interpolatable — which is what enables generation.
Reparameterization trick — the key insight enabling backprop through sampling:
$$ \mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(0, I) $$sampling \(z \sim \mathcal{N}(\mu, \sigma^2)\) directly is non-differentiable; this is
$$ \frac{\partial \mathbf{z}}{\partial \boldsymbol{\mu}} = 1, \qquad \frac{\partial \mathbf{z}}{\partial \boldsymbol{\sigma}} = \boldsymbol{\varepsilon} $$deterministic in \(\mu, \sigma\) → gradients flow; randomness is outsourced to \(\varepsilon\)
Self-Supervised: Contrastive Learning (NT-Xent)
$$ L = -\log \frac{\exp\big(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_j)/\tau\big)}{\sum_{k \neq i} \exp\big(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_k)/\tau\big)} $$
\(\mathrm{sim}\) = cosine similarity, \(\tau\) = temperature. \((i,j)\) = positive pair (two augmentations of the same input); all other \(k\) = negatives. Minimizing pulls positives together and pushes negatives apart in embedding space. Basis of CLIP, SimCLR, and the embedding models behind RAG.
03 Reinforcement Learning — Mathematics
Markov Decision Process (MDP) — Formal Definition
An MDP is a tuple \((S, A, P, R, \gamma)\):
$$ \begin{aligned} S &= \text{state space} \\ A &= \text{action space} \\ P(s' \mid s, a) &= \text{transition probability (world dynamics)} \\ R(s, a, s') &= \text{reward function} \\ \gamma \in [0, 1) &= \text{discount factor} \end{aligned} $$
Value Functions & Bellman Equations
$$ V^\pi(s) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, S_0 = s \right] $$
State-value: expected discounted return starting from \(s\) under policy \(\pi\). \(\gamma < 1\) guarantees the infinite sum converges and encodes preference for sooner rewards.
$$ Q^\pi(s, a) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, S_0 = s, A_0 = a \right], \qquad V^\pi(s) = \sum_a \pi(a|s)\, Q^\pi(s, a) $$
Action-value: expected return taking action \(a\) first, then following \(\pi\).
$$ Q^*(s, a) = \mathbb{E}\Big[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \Big] $$
$$ Q(s,a) \leftarrow Q(s,a) + \alpha \underbrace{\Big[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \Big]}_{\text{TD error}} $$
The optimal Q-function satisfies the recursion; Q-learning converges to it by iterating the update. The bracketed Temporal-Difference error is the "surprise" of the observed reward versus the current estimate.
Policy Gradient Theorem
$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot Q^{\pi_\theta}(s_t, a_t) \right] $$
Increase the log-probability of actions that led to high value; decrease it for low value. \(\nabla_\theta \log \pi_\theta(a|s)\) is the score function — the direction to nudge \(\theta\) to make action \(a\) more likely in state \(s\).
REINFORCE (Monte-Carlo policy gradient), then variance reduction:
$$ \nabla_\theta J \approx \frac{1}{N} \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R_t $$unbiased but high variance
$$ \nabla_\theta J \approx \frac{1}{N} \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \underbrace{\big(Q^\pi(s_t,a_t) - V^\pi(s_t)\big)}_{A(s_t, a_t)} $$subtracting baseline \(V(s)\): no bias, much less variance
\(A(s,a) = Q(s,a) - V(s)\) is the advantage — how much better is \(a\) than average? Actor-critic: critic estimates \(V(s)\), actor updates \(\pi\) using the advantage.
PPO — The Math Behind LLM Alignment
$$ L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[ \min\big( r_t(\theta) \hat{A}_t,\; \mathrm{clip}(r_t(\theta), 1{-}\epsilon, 1{+}\epsilon) \hat{A}_t \big) \Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} $$
\(r_t\) = probability ratio of new vs old policy; \(\hat{A}_t\) = advantage estimate. The clip (\(\epsilon \approx 0.2\)) prevents the ratio straying far from 1 — no catastrophic updates. The \(\min\) ensures the objective never benefits from leaving the clip range.
🔗
RLHF objective: \(L = L^{\mathrm{CLIP}} - \beta \cdot D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{SFT}})\). The KL penalty keeps the aligned model close to the supervised fine-tuned baseline, preventing reward hacking — the model finding degenerate ways to maximize reward while collapsing in quality.
Given the loss \(L\) at the output, propagate gradients backward through every layer using the chain rule. Define the error signal \(\boldsymbol{\delta}^{[l]} = \partial L / \partial \mathbf{z}^{[l]}\).
Output layer:
$$ \boldsymbol{\delta}^{[L]} = \frac{\partial L}{\partial \mathbf{a}^{[L]}} \odot f'(\mathbf{z}^{[L]}) $$for softmax + cross-entropy this collapses to \(\hat{\mathbf{y}} - \mathbf{y}\)
Hidden layer \(l\) — the backward recursion:
$$ \boldsymbol{\delta}^{[l]} = \big( W^{[l+1]\top} \boldsymbol{\delta}^{[l+1]} \big) \odot f'(\mathbf{z}^{[l]}) $$\(W^\top\) routes error backward; \(f'\) scales by local slope
Backward: \(\delta_{\text{out}} = \hat{y} - y = -2\). Then \(\frac{\partial L}{\partial w_2} = \delta_{\text{out}} \cdot h = -4\), and \(\delta_h = \delta_{\text{out}} \cdot w_2 \cdot \mathrm{ReLU}'(2) = -2 \cdot 2 \cdot 1 = -4\), so \(\frac{\partial L}{\partial w_1} = \delta_h \cdot x = -8\).
With \(\eta = 0.05\): \(w_2 \to 2.2\), \(w_1 \to 1.4\). New forward pass: \(\hat{y} = 2.2 \times \mathrm{ReLU}(2.8) = 6.16\), loss \(\to 0.013\). One step, loss down 99% — and you just executed the same algorithm that trains GPT.
Vanishing & Exploding Gradients
$$ \lVert \boldsymbol{\delta}^{[l]} \rVert \approx \lVert \boldsymbol{\delta}^{[L]} \rVert \cdot \prod_{k=l+1}^{L} \lVert W^{[k]} \rVert \cdot |f'(\mathbf{z}^{[k]})| $$
If \(\lVert W \rVert \cdot |f'| < 1\) at every layer the gradient shrinks exponentially (vanishing); if \(> 1\) it explodes (NaN). For tanh and sigmoid \(f' \le 0.25\); for ReLU \(f' \in \{0, 1\}\) — much better. Fixes: BatchNorm (stabilizes activations), residual connections (gradient highway), Xavier/He initialization.
Activation Functions — Derivatives
ReLU
$$ f(z) = \max(0, z), \quad f'(z) = \begin{cases} 1 & z > 0 \\ 0 & z \le 0 \end{cases} $$
No saturation for \(z > 0\). Dead neurons: if \(z\) always \(\le 0\), gradient is 0 forever. Leaky ReLU patches this with \(f'(z) = 0.01\) for \(z \le 0\).
Sigmoid
$$ f(z) = \frac{1}{1+e^{-z}}, \quad f'(z) = f(z)(1 - f(z)) $$
Max derivative 0.25 (at \(z=0\)); saturates at both ends → vanishing gradients. Use only in the output layer for binary classification.
GELU (used in GPT)
$$ f(z) = z \cdot \Phi(z) $$
\(\Phi\) = CDF of \(\mathcal{N}(0,1)\). A smooth, probabilistic ReLU — gates input by the probability it's positive. Smooth gradient everywhere; the default in Transformers.
Xavier / He Initialization
$$ \text{Xavier: } W \sim \mathcal{N}\!\Big(0, \tfrac{2}{n_{\text{in}} + n_{\text{out}}}\Big) \quad [\tanh/\sigma], \qquad \text{He: } W \sim \mathcal{N}\!\Big(0, \tfrac{2}{n_{\text{in}}}\Big) \quad [\mathrm{ReLU}] $$
Goal: keep signal variance constant across layers at init. Too large → saturation; too small → the signal dies. Xavier sets \(\mathrm{Var(out)} = \mathrm{Var(in)}\) for linear activations; He doubles the variance because ReLU zeroes ~half the neurons.
05 CNN — Mathematics
Discrete Convolution
$$ (I * K)[i, j] = \sum_m \sum_n I[i+m,\, j+n] \cdot K[m, n] $$
\(I\) = input feature map, \(K\) = kernel. Each output position is the dot product of the kernel with a local patch. Multiple kernels = multiple output channels, each detecting a different pattern. (In practice frameworks compute cross-correlation — no kernel flip — same math, different convention.)
Output Dimensions
$$ H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} + 2P - F}{S} \right\rfloor + 1 $$
\(F\) = filter size, \(P\) = padding, \(S\) = stride. Parameters per conv layer: \(F^2 \cdot C_{\text{in}} \cdot C_{\text{out}} + C_{\text{out}}\). Weight sharing is the win: a 3×3 conv mapping 64→128 channels needs \(3 \cdot 3 \cdot 64 \cdot 128 + 128 = 73{,}856\) parameters vs millions for fully-connected.
Max Pooling
$$ y[i, j] = \max_{(m, n) \in \text{window}} x[iS + m,\, jS + n] $$
Gradient flows only to the maximum element of each window; all others get 0. This is why pooling buys translation invariance — the exact location of a feature within the window doesn't matter.
Receptive Field
$$ \mathrm{RF}_l = \mathrm{RF}_{l-1} + (F_l - 1) \prod_{k=1}^{l-1} S_k $$
The region of the original input a neuron "sees." After \(L\) 3×3 convs (stride 1): \(\mathrm{RF} = 2L + 1\); each stride-2 pool doubles it. Deep nets have huge receptive fields — late-layer neurons integrate whole-image context.
06 RNN, LSTM, GRU — Mathematics
Vanilla RNN
$$ \mathbf{h}_t = \tanh(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b}), \qquad \hat{\mathbf{y}}_t = \mathrm{softmax}(W_y \mathbf{h}_t) $$
\(W_h \in \mathbb{R}^{d \times d}\), \(W_x \in \mathbb{R}^{d \times m}\) — the same weights at every time step (weight tying). Parameter count is independent of sequence length \(T\): the model doesn't grow with longer sequences.
Backpropagation Through Time (BPTT)
Gradient at step \(t\) flows back to step \(k\) through a product of Jacobians:
Fix for exploding: gradient clipping — if \(\lVert g \rVert > \tau\), rescale \(g \leftarrow g \cdot \tau / \lVert g \rVert\). Fix for vanishing: change the architecture → LSTM.
LSTM — Full Mathematics
All gates computed from concatenated \([\mathbf{h}_{t-1}, \mathbf{x}_t]\):
$$ \mathbf{f}_t = \sigma(W_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f) $$forget gate — what to erase from the cell
$$ \mathbf{i}_t = \sigma(W_i [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i), \qquad \tilde{\mathbf{g}}_t = \tanh(W_g [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_g) $$input gate — what to write, and the candidate values
$$ \mathbf{o}_t = \sigma(W_o [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o) $$output gate — what to expose
$$ \mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{C}_t) $$hidden state
Key: the cell-state gradient flows through addition — \(\partial \mathbf{C}_t / \partial \mathbf{C}_{t-1} = \mathbf{f}_t\), no matrix multiply in the path. Gradients survive many steps: the "constant error carousel" (Hochreiter & Schmidhuber, 1997).
Watch the three stages: raw scores \(QK^\top\) fill in cell by cell → divide by \(\sqrt{d_k}\) and apply the causal mask (future = \(-\infty\)) → softmax turns each row into a probability distribution. Brighter = higher weight.
stage: raw scores
Worked example — attention for two tokens, every number shown
Let \(d_k = 2\), tokens "the cat". Say \(\mathbf{q}_2 = [1, 0]\), \(\mathbf{k}_1 = [0.5, 0.5]\), \(\mathbf{k}_2 = [1, -1]\), \(\mathbf{v}_1 = [1, 0]\), \(\mathbf{v}_2 = [0, 1]\).
Scores for token 2: \(\mathbf{q}_2 \cdot \mathbf{k}_1 = 0.5\), \(\quad \mathbf{q}_2 \cdot \mathbf{k}_2 = 1\). Scaled by \(\sqrt{2}\): \([0.354, 0.707]\).
Output: \(0.413 \cdot \mathbf{v}_1 + 0.587 \cdot \mathbf{v}_2 = [0.413, 0.587]\). The new representation of "cat" is a weighted blend of value vectors, weighted by query-key agreement. That's all attention is.
Multi-Head Attention
$$ \mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W_O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_Q^i, KW_K^i, VW_V^i) $$
\(H\) heads run in parallel with separate projections; \(W_O \in \mathbb{R}^{H d_k \times d_{\text{model}}}\) merges them. Total parameters \(\approx 4 d_{\text{model}}^2\) (for \(d_k = d_{\text{model}}/H\)). Each head learns a different relationship pattern — syntax, coreference, position.
Feed-Forward Network
$$ \mathrm{FFN}(\mathbf{x}) = \max(0,\, \mathbf{x}W_1 + \mathbf{b}_1) W_2 + \mathbf{b}_2 $$
\(W_1 \in \mathbb{R}^{d \times 4d}\), \(W_2 \in \mathbb{R}^{4d \times d}\) — the 4× expansion is empirical. FFN operates per-position, independently. Attention routes information between positions; the FFN processes information at each position. Complementary roles.
Positional Encoding
$$ \mathrm{PE}(pos, 2i) = \sin\!\frac{pos}{10000^{2i/d}}, \qquad \mathrm{PE}(pos, 2i{+}1) = \cos\!\frac{pos}{10000^{2i/d}} $$
Sinusoids at geometrically spaced frequencies give each position a unique, bounded vector; relative position is a linear function of absolute encodings: \(\mathrm{PE}(pos{+}k)\) is a rotation of \(\mathrm{PE}(pos)\). Modern models use RoPE instead (see the Agentic section).
Layer Normalization
$$ \mathrm{LayerNorm}(\mathbf{x}) = \gamma \cdot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta $$
\(\mu, \sigma\) computed across features (not the batch), \(\gamma, \beta\) learned. Works for variable-length sequences and batch size 1. Pre-LN placement (before attention/FFN) trains more stably than the original Post-LN.
Attention Complexity
$$ \text{Time: } O(T^2 d), \qquad \text{Memory: } O(T^2) $$
The \(QK^\top\) matrix is \(T \times T\). For \(T = 4096\): 16M entries. For \(T = 128\mathrm{K}\): 16B entries — cannot be materialized. Solutions: FlashAttention (tile in SRAM, never materialize), sliding-window attention, linear-attention approximations.
Causal Masking
$$ \mathrm{score}(i, j) \leftarrow \begin{cases} \mathrm{score}(i, j) & j \le i \\ -\infty & j > i \end{cases}, \qquad \mathrm{softmax}(-\infty) = 0 $$
Each token attends only to the past. Setting future positions to \(-\infty\) before softmax drives their weight to exactly 0 — enabling parallel training on full sequences (teacher forcing) while preserving the autoregressive property.
08 LLM Training & Inference — Mathematics
Pretraining Objective — Negative Log-Likelihood
$$ L = -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1};\, \theta) $$
Each token is a classification over the vocabulary \(V\) (\(|V| \approx 50\mathrm{K}\)): a 50K-way softmax at every position of every training step. Cross-entropy = NLL; minimizing it is maximum likelihood estimation of \(\theta\).
Perplexity
$$ \mathrm{PPL} = \exp(L) = \exp\!\left( -\frac{1}{T} \sum_t \log P(x_t \mid x_{<t}) \right) $$
Geometric mean of inverse probabilities: "how many equally-likely options does the model believe exist at each step?" PPL = 100 ⟺ as confused as choosing uniformly among 100 tokens. GPT-2: ≈35 on Wikipedia; GPT-3: ≈20; frontier models: <10.
Scaling Laws (Kaplan 2020 → Chinchilla 2022)
$$ L(N) = \left( \frac{N_c}{N} \right)^{\alpha_N}, \qquad L(D) = \left( \frac{D_c}{D} \right)^{\alpha_D}, \qquad L(C) = \left( \frac{C_c}{C} \right)^{\alpha_C} $$
Loss falls as a power law in parameters \(N\), tokens \(D\), compute \(C\); \(\alpha_N \approx 0.076\), \(\alpha_D \approx 0.095\). Chinchilla's revision: compute-optimal training wants \(D \approx 20N\) — a 70B model should see ~1.4T tokens. Modern models intentionally "overtrain" beyond this for inference efficiency.
Inference: Sampling Mathematics
Temperature scaling:
$$ P_T(v \mid \text{ctx}) = \frac{\exp(z_v / T)}{\sum_w \exp(z_w / T)} $$\(T \to 0\): argmax (greedy) · \(T = 1\): base distribution · \(T > 1\): flatter, more creative
Top-p (nucleus) sampling:
$$ V_p = \text{smallest set s.t.} \sum_{v \in V_p} P(v) \ge p $$renormalize over \(V_p\), sample — set size adapts to model confidence
Top-k: keep the \(k\) most probable tokens. Min-p (2024): keep tokens with \(P(v) \ge p_{\min} \cdot \max_v P(v)\) — often better than top-p.
KV Cache — Memory Cost
$$ \mathrm{Mem}_{KV} = 2 \cdot n_{\text{layers}} \cdot n_{\text{kv\_heads}} \cdot d_{\text{head}} \cdot T \cdot \text{bytes} $$
Factor 2 = keys + values. Llama-3-70B: 80 layers, 8 KV heads (GQA), \(d_{\text{head}} = 128\), \(T = 8192\), fp16 → \(2 \times 80 \times 8 \times 128 \times 8192 \times 2 \approx 3.4\) GB per sequence, just for cache. Context length is expensive because this grows linearly in \(T\) and competes for HBM with the weights.
Grouped Query Attention (GQA)
$$ \text{MHA: } H \text{ KV heads} \quad\to\quad \text{GQA: } G \text{ KV heads shared by } H/G \text{ query heads} \quad (G \ll H) $$
MQA (\(G = 1\)) is cheapest but loses quality; GQA is the sweet spot. Llama-3-70B uses \(G = 8\) vs \(H = 64\) → 8× smaller KV cache, negligible quality loss. This is a pure memory-bandwidth optimization — the math of attention is unchanged.
09 RAG — Mathematics
Dense Retrieval — Embedding & Similarity
$$ \mathbf{e}_q = \mathrm{Enc}(\text{query}) \in \mathbb{R}^d, \qquad \mathbf{e}_d = \mathrm{Enc}(\text{doc}) \in \mathbb{R}^d $$
Encoders are BERT-style transformers trained contrastively (positives = relevant query-doc pairs; negatives = random or hard negatives). Semantic meaning is encoded in the geometry of \(\mathbb{R}^d\).
$$ \mathrm{sim}(q, d) = \cos(\mathbf{e}_q, \mathbf{e}_d) = \frac{\mathbf{e}_q \cdot \mathbf{e}_d}{\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_d \rVert} $$
Angle between vectors, invariant to magnitude, range \([-1, 1]\). For L2-normalized vectors cosine = dot product, so retrieval = maximum inner product search (MIPS).
Approximate Nearest Neighbor (ANN)
Exact search over \(N\) embeddings is \(O(Nd)\) per query — too slow at scale. ANN trades a little recall for orders of magnitude of speed:
$$ P(y \mid x) = \sum_{d \in \mathrm{top\text{-}}k(x)} P(d \mid x) \cdot P(y \mid x, d;\, \theta) $$
Marginalize over retrieved documents: \(P(d|x) \propto \exp(\mathbf{e}_x \cdot \mathbf{e}_d)\) (softmax over the retrieved set), \(P(y|x,d;\theta)\) = the LLM's answer probability given the document. In practice: concatenate top-\(k\) docs, or FiD — encode each separately, fuse in the decoder.
BM25 (Sparse Retrieval)
$$ \mathrm{BM25}(q, d) = \sum_{w \in q} \mathrm{IDF}(w) \cdot \frac{\mathrm{TF}(w, d) \cdot (k_1 + 1)}{\mathrm{TF}(w, d) + k_1 \big(1 - b + b \frac{|d|}{\mathrm{avgdl}}\big)} $$
\(\mathrm{IDF}(w) = \log\frac{N - df_w + 0.5}{df_w + 0.5}\); \(k_1 \approx 1.2\), \(b \approx 0.75\). Rewards term frequency but saturates (\(k_1\)) and normalizes by document length (\(b\)). Hybrid RAG = BM25 (exact keywords) + dense (semantics) — better than either alone.
10 Agentic Systems & Context Engineering — Inner Mathematics
Agentic Loops as Partially Observable MDPs
An agent driving an LLM is a POMDP: it never observes the full world state — only what fits in its context window.
$$ \text{POMDP} = (S, A, O, T, R, \Omega, \gamma) $$
\(S\) = true world state, \(A\) = actions (tool calls, text), \(O\) = observations (tool results, messages), \(T(s'|s,a)\) = dynamics, \(\Omega(o|s',a)\) = observation model. The agent maintains a belief state \(b(s)\) — a distribution over possible world states, updated by Bayes' rule on each observation. The context window IS the (approximate) belief state.
Context Window as Belief State
The context at step \(t\): \(c_t = [\text{system}, \text{examples}, \text{history}_{0..t-1}, \text{obs}_t]\)
$$ P(a_t \mid c_t) = \mathrm{LLM}_{\text{policy}}(a_t \mid c_t;\, \theta) $$
The LLM is a policy mapping belief state → distribution over next actions. Each appended tool result is a Bayesian update narrowing the model's beliefs. Context engineering = designing \(c_t\) so the inferred belief state is accurate and sufficient for the task.
Attention Over Long Contexts — Position Bias
$$ \text{weight}(i, j) \propto \exp\!\left( \frac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d_k}} + \mathrm{bias}(i, j) \right) $$
"Lost in the middle": effective attention is highest for local tokens (recency) and tokens near position 0 (primacy). Middle tokens in long contexts receive measurably less attention. Implication: put critical facts at the start or end of the prompt, never buried in the middle.
RoPE — Rotary Position Embedding
$$ \mathrm{RoPE}(\mathbf{x}, pos) = \mathbf{x} \odot \cos(pos \cdot \boldsymbol{\theta}) + \mathrm{rot90}(\mathbf{x}) \odot \sin(pos \cdot \boldsymbol{\theta}), \qquad \theta_d = 10000^{-2d/D} $$
Encodes position by rotating query/key vectors. Key property: \(\langle \mathrm{RoPE}(\mathbf{q}, m), \mathrm{RoPE}(\mathbf{k}, n) \rangle\) depends only on \(m - n\) — relative position, not absolute. Long-context models extend beyond training length via RoPE scaling (NTK, YaRN).
FlashAttention — Enabling Long Contexts
Standard attention materializes the \(O(T^2)\) matrix in HBM (slow GPU main memory). FlashAttention tiles the computation in SRAM (fast on-chip memory) and never materializes it:
$$ \text{Standard: } O(T^2) \text{ memory}, \qquad \text{Flash: } O(T) \text{ memory}, \; \sim\!\frac{T^2 d}{M} \text{ HBM reads} $$
\(M\) = SRAM size. Identical mathematical output — only the memory access pattern changes, and that's worth ~16× fewer HBM reads at \(T = 4096\). This is the doc-08 lesson in one equation: the bottleneck is memory bandwidth, not FLOPs.
Structured Outputs — Constrained Decoding
$$ P_{\text{constrained}}(v \mid \text{ctx}) = \begin{cases} P_{\mathrm{LLM}}(v \mid \text{ctx}) / Z & v \in \mathrm{valid}(\text{state}) \\ 0 & \text{otherwise} \end{cases} $$
A finite-state machine tracks which tokens are valid under the JSON schema / regex; invalid logits are masked to \(-\infty\) before sampling. Structure is enforced exactly, with zero prompt tricks. Libraries: Outlines, llama.cpp grammars, SGLang.
Speculative Decoding — Inference Speedup
$$ \text{Speedup} \approx \frac{1}{1 - \alpha (1 - \beta)} $$
\(\alpha\) = draft-token acceptance rate, \(\beta\) = draft/target cost ratio. A small draft model proposes \(K\) tokens; the big model verifies all of them in one forward pass. Accepted tokens are provably from the target distribution. Typical: 2–4× speedup at \(\alpha \approx 0.7\text{–}0.9\).
The Mathematical Picture — Unified
Every component connects back to the same core: maximum likelihood estimation via gradient descent. Supervised learning maximizes \(P(y|x)\). Language modeling maximizes \(P(x_t | x_{<t})\). RLHF maximizes preference probability as estimated by a reward model. RAG is Bayesian marginalization over retrieved evidence. Agents are POMDPs whose belief state is the context window. FlashAttention is the same computation rearranged for the memory hierarchy. The math never changes — only the distribution being modeled and what it's conditioned on.
11 Optimizers — Mathematics
Momentum SGD
$$ \mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1 - \beta) \nabla L(\mathbf{w}_t), \qquad \mathbf{w}_{t+1} = \mathbf{w}_t - \eta \mathbf{v}_t $$
\(\mathbf{v}\) = velocity, an exponential moving average of gradients (\(\beta \approx 0.9\)). Accumulates signal in consistent directions, cancels oscillation. Nesterov variant evaluates the gradient at the look-ahead point \(\mathbf{w}_t - \eta\beta\mathbf{v}_{t-1}\).
Adam — Full Derivation
$$ \mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \mathbf{g}_t $$1st moment — EMA of the gradient
$$ \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2 $$2nd moment — EMA of the squared gradient
$$ \hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t} $$bias correction — without it, zero-initialized moments make early steps tiny
$$ \textbf{AdamW: } \mathbf{w} \leftarrow \mathbf{w}(1 - \eta\lambda) - \eta \frac{\hat{\mathbf{m}}}{\sqrt{\hat{\mathbf{v}}} + \epsilon} $$
Decouples weight decay from the adaptive denominator. In vanilla Adam, L2 regularization gets divided by \(\sqrt{\hat{v}}\), so decay strength varies per parameter with gradient history. AdamW applies decay uniformly — and generalizes measurably better for Transformers. It's the default optimizer for every modern LLM.
Learning Rate Warmup + Cosine Decay
$$ \eta(t) = \begin{cases} \eta_{\max} \cdot \dfrac{t}{T_w} & t \le T_w \\[8pt] \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos \pi \dfrac{t - T_w}{T - T_w}\right) & t > T_w \end{cases} $$
Warmup (\(T_w \approx\) 1–5% of steps) prevents large updates while weights are random and Adam's moment estimates are unreliable. Cosine decay avoids the loss spikes of sudden LR drops; \(\eta_{\min} \approx 0.1 \eta_{\max}\). This schedule is near-universal for LLM training.
12 LoRA — Low-Rank Adaptation Mathematics
Why Low-Rank Works — The Intrinsic Rank Hypothesis
Empirically, the weight-change matrix \(\Delta W\) from fine-tuning has low numerical rank — most singular values are near zero. The update lives in a small subspace, so a low-rank factorization captures it almost losslessly.
$$ W' = W_0 + \Delta W = W_0 + \frac{\alpha}{r} A B $$
\(W_0 \in \mathbb{R}^{d \times k}\) frozen; \(A \in \mathbb{R}^{d \times r}\) initialized \(\mathcal{N}(0, \sigma^2)\); \(B \in \mathbb{R}^{r \times k}\) initialized 0 — so \(\Delta W = 0\) at start and the pretrained model is exactly preserved. Rank \(r\) typically 4–64; \(\alpha\) typically \(r\) or \(2r\). Trainable: \(r(d + k)\) vs \(dk\).
Parameter count for one attention projection (\(d = k = 4096\), \(r = 16\)):
$$ 4 \text{ proj} \times 32 \text{ layers} \times 131\mathrm{K} \approx 16.8\mathrm{M} \;\; (0.24\% \text{ of a 7B model}) $$the entire fine-tune fits in one GPU's spare memory
SVD Connection
$$ \Delta W = U \Sigma V^\top \;\Rightarrow\; A = U_r \sqrt{\Sigma_r},\; B = \sqrt{\Sigma_r} V_r^\top \text{ is the optimal rank-}r\text{ approximation} $$
Eckart–Young theorem: truncated SVD minimizes \(\lVert \Delta W - AB \rVert_F\) over all rank-\(r\) factorizations. LoRA can't use SVD at init (you'd need to know \(\Delta W\) before training) — but this is why a rank-\(r\) decomposition can faithfully capture the update if the intrinsic-rank hypothesis holds.
QLoRA: 4-bit Quantization + LoRA
$$ W_0^{\text{stored}} = Q_4(W_0), \qquad W' = \mathrm{dequant}(W_0^{\text{stored}}) + \frac{\alpha}{r} AB $$
NF4 (NormalFloat-4): a quantization scheme matched to normally-distributed weights — more levels near 0 where weights concentrate. Double quantization compresses the quantization constants themselves. Net: 70B weights in ~35 GB instead of 140 GB fp16, with full-precision LoRA adapters on top.
13 GAN — Mathematics
The Minimax Objective
$$ \min_G \max_D V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_z}[\log(1 - D(G(\mathbf{z})))] $$
\(D\) wants \(D(\text{real}) \to 1\), \(D(\text{fake}) \to 0\); \(G\) wants to fool \(D\). Practical fix: \(G\) maximizes \(\log D(G(\mathbf{z}))\) instead of minimizing \(\log(1 - D(G(\mathbf{z})))\) — avoids vanishing gradients early in training when \(D\) is confident (non-saturating GAN).
Optimal Discriminator
For fixed \(G\), maximize pointwise:
$$ D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_G(\mathbf{x})} $$take \(\partial/\partial D\) of the integrand, set to 0
$$ p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\, \Sigma_\theta(\mathbf{x}_t, t)\big), \qquad \boldsymbol{\mu}_\theta = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t) \right) $$
DDPM's simplification: instead of predicting \(\boldsymbol{\mu}\) directly, predict the noise \(\boldsymbol{\varepsilon}_\theta\) that was added — empirically better than predicting \(\mathbf{x}_0\) or \(\boldsymbol{\mu}\).
Training Objective — Simplified DDPM Loss
$$ L_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\varepsilon}} \Big[ \big\lVert \boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}_\theta\big( \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\varepsilon},\; t \big) \big\rVert^2 \Big] $$
Sample \(t\) uniformly, sample noise, corrupt, predict the noise back. Plain MSE — no adversarial training, no mode collapse. The ELBO derivation justifies this as a reweighted variational bound.
Classifier-Free Guidance (CFG)
$$ \tilde{\boldsymbol{\varepsilon}}_\theta(\mathbf{x}_t, c) = \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, \varnothing) + w \big( \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, c) - \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, \varnothing) \big) $$
\(c\) = text prompt, \(\varnothing\) = null prompt, \(w\) = guidance scale (7–15). Training randomly drops conditioning (\(p \approx 0.1\)) so one network learns both modes; inference extrapolates in the conditioning direction. Higher \(w\) = stronger prompt adherence, less diversity. This is how Stable Diffusion follows your prompt.