01 Why Transformers? (The Problem They Solved)
To understand attention, you must first feel the pain that motivated it.
Before 2017, sequence problems (translation, text, speech) used RNNs (Recurrent Neural Networks). An RNN processes a sequence one token at a time, left to right. At each step it maintains a "hidden state" — a vector that's supposed to summarize everything seen so far.
RNNs have two structural problems. The first is vanishing gradients: gradients flow back through every time step, and over long sequences (100+ tokens) they tend to shrink exponentially. Early words barely influence the model's weights, so the network effectively forgets the beginning of a long sentence by the time it reads the end.
The second is sequential computation: step 2 can't begin until step 1 is done, so you cannot parallelize RNN training across the sequence. Modern GPUs with thousands of cores sit idle, waiting. This fundamentally limited how much data you could train on.
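Both problems show up even in a toy model. The sketch below is a scalar tanh RNN with hypothetical weights (w_h = 0.5, w_x = 1.0), not a real architecture: the loop cannot be parallelized over time, and the gradient of the last state with respect to the first is a product of per-step factors that are each below 1.

```python
import math

def rnn_forward(inputs, w_h=0.5, w_x=1.0):
    """Scalar tanh RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h = 0.0
    states = []
    for x in inputs:                      # step t needs h from step t-1
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def grad_last_wrt_first(states, w_h=0.5):
    """Chain rule: d h_T / d h_1 = product over t = 2..T of w_h * (1 - h_t^2)."""
    grad = 1.0
    for h in states[1:]:
        grad *= w_h * (1.0 - h * h)
    return grad

short_states = rnn_forward([0.1] * 5)
long_states = rnn_forward([0.1] * 100)
print("5 steps:  ", grad_last_wrt_first(short_states))
print("100 steps:", grad_last_wrt_first(long_states))
```

With these weights every factor is at most 0.5, so the 100-step gradient is vanishingly small compared with the 5-step one.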
02 Attention — The Core Idea
The mechanism that lets every word ask: "which other words should I look at?"
Imagine you're translating "The animal didn't cross the street because it was too tired." What does "it" refer to? You need to look at the word "animal," not "street." Attention lets the model do this dynamically, learned from data.
Query, Key, Value — The Mental Model
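Think of it like a database lookup, but soft (probabilistic) rather than exact: the Query is what a token is looking for, each Key describes what a token offers, and each Value is the content returned. Every value contributes to the result, weighted by how well its key matches the query.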
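The soft lookup can be sketched in a few lines of plain Python. This is a minimal illustration of scaled dot-product attention on hand-picked toy vectors, without the learned projections a real model would apply first; the function names are chosen for this example.

```python
import math

def softmax(xs):
    m = max(xs)                                   # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (one vector per token)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)                 # how much this query looks at each key
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]    # weighted average of the values
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
outputs, weights = attention([[5.0, 0.0]], keys, values)
print(weights[0])    # most of the weight lands on the first key
print(outputs[0])    # so the output is close to the first value
```

An exact lookup would return only the first value; the soft version returns a blend dominated by it.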
Self-Attention: Every Token Attends to Every Other Token
Multi-Head Attention
A single attention head learns one type of relationship. Multi-head attention runs H attention heads in parallel, each learning different relationships — syntactic, semantic, coreference, positional. Their outputs are concatenated and projected back.
Think of each head as a different question you can ask about token relationships: "Who acts on whom?" "What qualifies what?" "What happened before what?" Multi-head attention lets the model ask all these questions simultaneously, independently.
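A rough sketch of the mechanics, assuming random (untrained) projection matrices and toy sizes: each head gets its own Q, K, V projections, runs attention independently, and the head outputs are concatenated and projected back to the model width. All names and dimensions here are illustrative.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def head_attention(q, k, v):
    d_k = len(k[0])
    out = []
    for qi in q:
        w = softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k])
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return out

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, heads, w_o):
    """x: T x d_model. heads: list of (W_q, W_k, W_v), each d_model x d_head.
    w_o: (H * d_head) x d_model output projection."""
    per_head = [head_attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))
                for wq, wk, wv in heads]
    # Concatenate the heads along the feature axis, token by token.
    concat = [sum((h[t] for h in per_head), []) for t in range(len(x))]
    return matmul(concat, w_o)

rng = random.Random(0)
T, d_model, H = 4, 8, 2
d_head = d_model // H
x = rand_matrix(T, d_model, rng)
heads = [tuple(rand_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(H)]
w_o = rand_matrix(H * d_head, d_model, rng)
y = multi_head_attention(x, heads, w_o)
print(len(y), len(y[0]))   # 4 8
```

Because d_head = d_model / H, running H heads costs roughly the same as one full-width head.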
Attention Weights — Where Does "it" Look?
[Figure: heatmap of softmax(QKᵀ/√d_k) for the query token "it"; darker cells mean higher weight.] "animal" dominates, which is how the model resolves the pronoun's referent.
03 The Transformer Architecture
All the pieces assembled — and why each one exists.
Transformer Decoder Block (what LLMs use — repeated N times)
Key Design Decisions — Why Each Piece Exists
Positional encodings: Attention has no built-in notion of order. "cat sat on mat" and "mat on sat cat" look the same. Positional encodings inject position information as additive vectors, and the model learns to use them.
Residual connections: Output = input + transformed_input. This lets gradients flow straight back through the network without vanishing, which makes training very deep (100+ layer) networks stable.
Layer normalization: Normalizes each token's activations within a layer to zero mean and unit variance. This stabilizes training. Without it, activations can grow or shrink wildly from layer to layer, causing training to explode or collapse.
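These three pieces are small enough to sketch directly. The version below makes two simplifying assumptions: layer norm omits the learned gain and bias, and the residual block uses the pre-norm ordering (normalize, transform, then add), which is one common arrangement rather than the only one.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token's feature vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Output = input + transformed_input, with the transform applied to normalized input."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]

def add_positions(token_vectors, position_vectors):
    """Additive positional encoding: one position vector per slot in the sequence."""
    return [[t + p for t, p in zip(tok, pos)]
            for tok, pos in zip(token_vectors, position_vectors)]

cat, mat = [1.0, 0.0], [0.0, 1.0]
positions = [[0.1, 0.1], [0.2, 0.2]]          # hypothetical position vectors
print(add_positions([cat, mat], positions))   # "cat" in slot 0
print(add_positions([mat, cat], positions))   # "cat" in slot 1 gets a different vector

normed = layer_norm([1.0, 2.0, 3.0, 4.0])
print(residual_block([1.0, 2.0, 3.0, 4.0], lambda v: [0.0] * len(v)))  # zero sublayer: identity
```

The last line shows why residuals help: if the sublayer contributes nothing, the block passes its input through unchanged, so there is always a direct path for the signal and its gradient.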
04 Training an LLM — The Full Story
How you go from random weights to a model that speaks English, writes code, and reasons.
Stage 1: Pretraining — Next Token Prediction
The core task is deceptively simple: given the previous tokens, predict the next one. That's it. Feed the model the internet — trillions of tokens of text — and have it predict, at every position, what comes next.
To predict the next word well, you have to understand context, grammar, facts, reasoning, and world knowledge. "The capital of France is ___" — the model must know geography. "To sort a list in Python, use ___" — the model must know programming. The label (next token) is free — it's just the existing text. You get a rich supervision signal from unlabeled data.
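To make "the label is free" concrete, here is a small sketch with made-up token ids and a hypothetical vocabulary of 8: every prefix of the sequence becomes a training example whose target is simply the next token, scored with cross-entropy.

```python
import math

def next_token_pairs(token_ids):
    """Every prefix predicts the token that follows it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

def cross_entropy(probs, target_id):
    """Loss for one position: negative log-probability assigned to the true next token."""
    return -math.log(probs[target_id])

tokens = [3, 1, 4, 1, 5]
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)

# A model that knows nothing assigns uniform probability over the vocabulary.
vocab_size = 8
uniform = [1.0 / vocab_size] * vocab_size
loss = sum(cross_entropy(uniform, target) for _, target in pairs) / len(pairs)
print(loss)   # equals log(vocab_size); training pushes the loss below this baseline
```

A five-token sequence yields four supervised examples without any human labeling.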
Next Token Prediction — The Pretraining Signal
Tokenization — Before Any Training
Text is broken into tokens (subword units) using algorithms like Byte-Pair Encoding (BPE). Common words become single tokens; rare words are split into subword pieces. This gives a fixed vocabulary (50K–200K tokens) that can represent any text.
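The core of BPE training is a loop that repeatedly merges the most frequent adjacent pair of symbols. The sketch below is a simplified character-level version on a made-up four-word corpus; production tokenizers add byte-level handling, word-boundary markers, and pre-tokenization rules that are omitted here.

```python
from collections import Counter

def most_frequent_pair(words):
    """words: dict mapping a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_counts, num_merges):
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

merges, words = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=4)
print(merges)
print(words)
```

After four merges the frequent word "low" has become a single symbol, while the rarer "lower" is still split into "low" plus leftover characters.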
Example tokenization (GPT-style):
"Un" "believ" "ab" "ly" → one word split into 4 tokens because "Unbelievably" is rare (illustrative split; the exact pieces depend on the tokenizer)
Scale: What Changed at 100B+ Parameters
In 2020, OpenAI's Scaling Laws paper showed that test loss improves predictably with compute, data, and parameters, following power laws. Larger models trained on more data with more compute were consistently better. This justified building GPT-3 (175B parameters) and led to the modern race to scale.
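For reference, the power-law form reported in that paper can be written as below, where L is test loss, N is parameter count, D is dataset size, C is training compute, and the constants with subscript c and the exponents α are fitted empirically. Each relation is stated under the assumption that the other two factors are not the bottleneck.

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```

Taking logs gives \log L = \alpha_N(\log N_c - \log N), a straight line on a log-log plot, which is what makes the trend easy to extrapolate.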
At certain scale thresholds, abilities appear that weren't present in smaller models — few-shot learning, chain-of-thought reasoning, arithmetic. These "emergent" behaviors are a major active research area. Why do they appear? Not fully understood.
Training LLMs requires thousands of GPUs for months. Three parallelism strategies: data parallel (same model, different data shards), tensor parallel (split weight matrices), pipeline parallel (split layers across GPUs).
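Two of these strategies can be simulated on a single machine. In this sketch the "devices" are just list slices, and the function names are invented for illustration: data parallelism shards the batch, while tensor parallelism splits a weight matrix by columns so each device computes part of the output.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def data_parallel_shards(batch, workers):
    """Same model everywhere; each worker gets a different slice of the batch."""
    return [batch[i::workers] for i in range(workers)]

def split_columns(w, parts):
    """Split a weight matrix into column blocks (assumes the width divides evenly)."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

def tensor_parallel_matmul(x, w, parts):
    shards = split_columns(w, parts)
    partial = [matmul(x, shard) for shard in shards]      # each device computes its slice
    # Gather: concatenate the partial outputs along the column axis.
    return [sum((p[r] for p in partial), []) for r in range(len(x))]

x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
print(tensor_parallel_matmul(x, w, parts=2) == matmul(x, w))   # True
print(data_parallel_shards(list(range(6)), workers=2))          # [[0, 2, 4], [1, 3, 5]]
```

Pipeline parallelism is not shown; it would assign whole layers to different devices and pass activations between them.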
05 Fine-Tuning & RLHF
Pretraining gives raw capability. Fine-tuning shapes it into useful, safe behavior.
A pretrained LLM is a raw text predictor — it will complete "How do I make a bomb?" with the statistically likely continuation from its training corpus. Not useful. Instruction fine-tuning and RLHF align the model to human intent.
The Three-Stage Pipeline: Pretraining → SFT → RLHF
How RLHF Works (Step by Step)
06 LoRA & QLoRA — Fine-Tuning Without Full Training
How to specialise a 70B parameter model on a consumer GPU — the technique that democratised LLM fine-tuning.
Full fine-tuning of a 70B model requires ~140GB of GPU VRAM just for the float16 weights, plus more for gradients and optimiser states (Adam keeps two extra values per parameter, so at least 2× the weight memory). That's 4–8 A100s. LoRA (Low-Rank Adaptation) makes fine-tuning accessible by exploiting the observation that the update to pretrained weights during fine-tuning has low intrinsic rank, so you don't need to update all 140GB.
Freeze the original weight matrix W₀. Instead of computing the full update ΔW (same shape as W₀), decompose it: ΔW = A·B where A ∈ ℝ^{d×r} and B ∈ ℝ^{r×k}, with rank r ≪ min(d,k). Only train A and B. For a 4096×4096 weight matrix with r=16: 16.8M parameters → 131K parameters, a 128× reduction.
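The arithmetic and the parallel update path can be checked with a short sketch. It uses the document's notation (ΔW = A·B) and tiny hand-written matrices; the helper names are made up for this example, and a real implementation would typically also scale the update and initialise one factor to zero.

```python
def lora_param_counts(d, k, r):
    """Trainable parameters: full update (d*k) versus the two low-rank factors."""
    return d * k, d * r + r * k

def vecmat(x, w):
    """Row vector x (length d) times matrix w (d x k)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]

def lora_forward(x, w0, a, b):
    """y = x·W0 + (x·A)·B, with W0 frozen and only A and B trained."""
    base = vecmat(x, w0)
    update = vecmat(vecmat(x, a), b)
    return [p + q for p, q in zip(base, update)]

def merge(w0, a, b):
    """Fold the adapter into the base weights: W0 + A·B."""
    return [[w0[i][j] + sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(w0[0]))] for i in range(len(w0))]

full, lora = lora_param_counts(4096, 4096, 16)
print(full, lora, full // lora)   # 16777216 131072 128

x = [1.0, 2.0]
w0 = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0], [1.0]]                # d x r with r = 1
b = [[1.0, 0.0]]                  # r x k
print(lora_forward(x, w0, a, b))          # [4.0, 2.0]
print(vecmat(x, merge(w0, a, b)))         # [4.0, 2.0], same result from merged weights
```

The last two lines agree, which is why a merged adapter adds no inference cost.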
LoRA: Parallel Low-Rank Update Path
QLoRA combines LoRA with 4-bit quantisation of the frozen base model. The base weights are stored in NF4 (Normal Float 4-bit), 4× smaller than float16. LoRA adapters remain in float16. The weights of a 70B model then take ~35GB of VRAM instead of 140GB, which is how a 70B model runs on a single A100 (80GB). The QLoRA authors report quality close to full 16-bit fine-tuning at a fraction of the memory cost.
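The memory figures follow from bits per parameter. This back-of-the-envelope helper counts weights only; it ignores quantisation constants, adapter weights, activations, and optimiser state, so real usage is higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

params = 70e9
print(weight_memory_gb(params, 16))   # 140.0  (float16)
print(weight_memory_gb(params, 4))    # 35.0   (4-bit)
```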
LoRA is typically applied to the attention projection matrices (Q, K, V, O) and sometimes the FFN layers. Rank r=4–64 is common. After training, the LoRA weights can be merged back into W₀ (W₀ + AB) for zero-overhead inference: no extra compute at inference time.
07 Mixture of Experts (MoE)
How to build a 47B parameter model that only activates ~13B parameters per token, making scale affordable.
A standard Transformer has a single FFN per layer that every token passes through. MoE replaces the FFN with N expert FFNs plus a lightweight router. For each token, the router selects the top-k experts (typically k=2). Only those k experts compute. The rest are idle. You get the capacity of a large model at the cost of a small one — per token.
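A minimal sketch of top-k routing for a single token follows. The experts are stand-in functions that just scale their input, the router weights are hand-picked, and renormalising the gates with a softmax over the selected logits is one common choice rather than the only one.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(token, router_weights, experts, k=2):
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = [sum(t * w for t, w in zip(token, row)) for row in router_weights]
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])     # renormalise over the chosen experts
    output = [0.0] * len(token)
    for gate, i in zip(gates, top):
        expert_out = experts[i](token)            # only the k selected experts run
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, top

def make_expert(scale, log):
    def expert(vector):
        log.append(scale)                         # record that this expert actually ran
        return [scale * x for x in vector]
    return expert

ran = []
experts = [make_expert(s, ran) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.1, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]   # one row per expert
output, chosen = moe_layer([1.0, 0.0], router_weights, experts, k=2)
print(chosen, ran)   # [1, 2] [2.0, 3.0]
print(output)
```

Four experts exist, but the log shows only two were evaluated for this token.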
MoE Layer: Each Token Routes to 2 of N Experts
Mixtral 8×7B has 47B total parameters but only activates ~13B per token (2 of 8 experts). It matches or beats Llama-2-70B on most benchmarks while using ~5× fewer active parameters per token. Experts do specialise, but the specialisation is emergent, not engineered, and it is not a clean split by topic: the Mixtral authors' routing analysis found patterns that track syntax more than domains such as code or languages.
Load balancing: Without regularisation, the router can collapse to always routing to the same 2 experts, wasting capacity. A load-balancing auxiliary loss penalises uneven routing.
Memory: All N experts must live in VRAM even though only 2 are active, so the 47B total weights must all be loaded.
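One widely used form of the auxiliary loss comes from the Switch Transformer line of work: multiply, per expert, the fraction of tokens routed to it by the mean router probability it receives, sum over experts, and scale by the number of experts. The sketch below assumes top-1 assignments for simplicity and a hypothetical coefficient alpha = 0.01.

```python
def load_balancing_loss(router_probs, assignments, num_experts, alpha=0.01):
    """router_probs: one probability list per token.
    assignments: index of the expert each token was sent to."""
    num_tokens = len(router_probs)
    frac_tokens = [assignments.count(i) / num_tokens for i in range(num_experts)]
    mean_prob = [sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)]
    return alpha * num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))

balanced = load_balancing_loss([[0.25] * 4] * 4, [0, 1, 2, 3], num_experts=4)
collapsed = load_balancing_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0, 0, 0, 0], num_experts=4)
print(balanced, collapsed)   # the collapsed router pays a penalty 4x larger
```

The loss is smallest when tokens are spread evenly, which nudges the router away from collapse.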
08 Multimodal Models — Vision Language Models
How LLMs learn to see: connecting image encoders to text decoders.
A Vision Language Model (VLM) combines a visual encoder (typically a Vision Transformer or CNN) with a text decoder (a standard LLM). The key challenge: image patches live in a different representation space than text tokens. A projection layer maps visual features into the LLM's token embedding space, where they can be processed identically to text tokens.
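As a rough illustration, assume the projection is a single linear map and that image tokens are placed before the text tokens (both are assumptions; some models use a small MLP or interleave the tokens). The dimensions and values are toy placeholders.

```python
def project(features, w):
    """Map each visual feature vector (width d_vision) to the LLM width d_model."""
    return [[sum(f * wi for f, wi in zip(feat, col)) for col in zip(*w)]
            for feat in features]

def build_context(image_features, text_embeddings, w_proj):
    image_tokens = project(image_features, w_proj)
    return image_tokens + text_embeddings     # one sequence the LLM processes uniformly

d_vision, d_model = 4, 2
image_features = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
w_proj = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0], [0.0, 1.0]]    # d_vision x d_model
text_embeddings = [[0.1, 0.2], [0.3, 0.4]]

context = build_context(image_features, text_embeddings, w_proj)
print(len(context), [len(tok) for tok in context])   # 5 tokens, all of width d_model
```

After projection the LLM cannot tell image tokens from text tokens by shape; they are all vectors of width d_model.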
VLM Architecture — Image Tokens Join the LLM Context
Training typically follows two stages. Stage 1: Freeze both encoder and LLM; train only the projection layer on image-caption pairs (aligns representations). Stage 2: Unfreeze the LLM; fine-tune on instruction-following vision tasks ("describe this image", "answer questions about this chart"). LLaVA, InternVL, and Qwen-VL use variants of this staged approach.
CLIP (OpenAI, 2021) trained a vision encoder and text encoder contrastively, matching images to captions at web scale. Its visual representations are already semantically aligned with language, which makes projection to the LLM embedding space much easier. CLIP-style visual encoders are a common choice in open VLMs.
09 Flash Attention — How Long Contexts Actually Work
Standard attention is O(T²) in memory. Flash Attention computes the same result in O(T) memory by restructuring the computation around GPU hardware.
The naive attention algorithm computes the full T×T score matrix, stores it in HBM (GPU high-bandwidth memory: large but slow, ~2TB/s bandwidth), then multiplies by V. For T=32K tokens, the score matrix is 32K×32K×2 bytes = 2GB per attention head. Just storing it dominates your memory budget, and the repeated HBM reads dominate your runtime.
Flash Attention splits Q, K, V into tiles that fit in SRAM (on-chip memory: tiny but fast, ~19TB/s bandwidth). For each tile pair, it computes attention locally and accumulates the output using a running softmax normaliser. It never materialises the full T×T matrix in HBM.
- Memory: O(T²) → O(T) — enables 128K+ contexts
- Speed: 2–4× faster than standard attention on A100
- Mathematically exact — not an approximation
- Backward pass recomputes attention scores from tiles instead of storing the T×T matrix
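The running softmax normaliser is the key trick, and it can be demonstrated without a GPU. The sketch below handles one query row in plain Python and tiles only K and V; it shows the arithmetic, not the fused kernel or memory layout of the real implementation.

```python
import math

def naive_attention_row(q, keys, values):
    """Reference: materialise all scores for this query, then softmax and average."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / z for j in range(len(values[0]))]

def tiled_attention_row(q, keys, values, tile=2):
    """Same result, visiting K/V one tile at a time with a running max and normaliser."""
    d = len(q)
    running_max = float("-inf")
    normaliser = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_tile]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)   # rescale everything accumulated so far
        exps = [math.exp(s - new_max) for s in scores]
        normaliser = normaliser * scale + sum(exps)
        acc = [a * scale + sum(e * v[j] for e, v in zip(exps, v_tile))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / normaliser for a in acc]

q = [0.3, -0.2]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention_row(q, keys, values))
print(tiled_attention_row(q, keys, values, tile=2))   # matches up to rounding
```

The tiled version never holds more than one tile of scores at a time, yet produces the same output, which is why the method is exact rather than approximate.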
Performance of modern ML workloads is often limited by memory bandwidth, not FLOPs. The GPU has fast SRAM (~20MB) and slow HBM (~80GB). Every read or write to HBM costs time. Flash Attention restructures the algorithm to maximise SRAM reuse: the same computation, far fewer HBM accesses.
Flash Attention 2 (2023) adds work partitioning across thread blocks for further speedup. Flash Attention 3 (2024) targets Hopper GPUs with tensor memory accelerators. This is the difference between theory and systems engineering.
10 Open Weights
What it means for a model to be "open," and what you actually get.
When a lab releases "open weights," they publish the trained parameter tensors — the actual numbers that define the model. You can download them and run inference locally. This is not the same as open source — the training code, training data, and methodology may all be proprietary.
What you typically get:
- Model weight files (often .safetensors)
- Tokenizer vocabulary & config
- Architecture config (layer count, dims)
- Inference code (usually)
What you typically don't get:
- Training data
- Full training code at scale
- Detailed training methodology
- RLHF preference data
Weight Files — What They Are
The weights are tensors — multi-dimensional arrays of floating-point numbers. A 7B model has ~7 billion numbers, typically stored as float16 (2 bytes each) = ~14GB. The file structure maps to named layers: model.layers.0.self_attn.q_proj.weight, etc. Loading these into memory and running forward passes is "inference."
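A checkpoint can be thought of as a mapping from layer names to tensors. The toy mapping below uses the naming pattern quoted above with hypothetical shapes (the k_proj entry and the 4096×4096 sizes are assumptions for illustration) and adds up the storage at 2 bytes per value.

```python
from math import prod

def checkpoint_size_bytes(tensor_shapes, bytes_per_value=2):
    """Total storage for a name -> shape mapping, at float16 by default."""
    return sum(prod(shape) for shape in tensor_shapes.values()) * bytes_per_value

toy_checkpoint = {
    "model.layers.0.self_attn.q_proj.weight": (4096, 4096),
    "model.layers.0.self_attn.k_proj.weight": (4096, 4096),
}
size = checkpoint_size_bytes(toy_checkpoint)
print(size, "bytes for", len(toy_checkpoint), "tensors")
```

Summing the same way over every tensor in a real 7B checkpoint is where the ~14GB figure comes from.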
11 Prompting
Prompting is not asking a search engine. It's programming in natural language.
A prompt is the input to an LLM. The model doesn't "understand" your intent — it predicts the most likely continuation of the prompt+response template it was fine-tuned on. Good prompting means providing the context that makes the correct output the statistically most likely continuation.
Zero-shot: Just ask. No examples. Works for tasks the model has seen a lot during training.
"Translate to French: Hello"
Few-shot: Give 2–5 examples before your actual query. This tells the model the format and task via demonstration.
"EN: Hello → FR: Bonjour\nEN: Bye → FR:"
System prompt: A special prefix (injected before user messages) that sets persona, rules, and context. Usually invisible to users in production apps.
Chain-of-Thought (CoT) prompting — "Let's think step by step" — dramatically improves reasoning on math, logic, and multi-step problems. Why? Because by generating intermediate steps, the model gets more computation (more tokens) to allocate to the problem. Each token generated is one "step of thought."
Chat models are fine-tuned with a specific message format. The raw text passed to the model looks like:
<|system|>
You are a helpful assistant.
<|end|>
<|user|>
What is 2+2?
<|end|>
<|assistant|>
← model generates from here
Different models have different templates (ChatML, Llama, Mistral, etc.). Running a model with the wrong template produces garbage output — a common mistake when running open-weight models locally.
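Rendering a conversation into raw text is plain string formatting. The helper below follows the tag layout shown above; the `<|system|>` tag and the function name are assumptions for this sketch, and a real deployment should use the template shipped with the specific model.

```python
def render_chat(messages, end_tag="<|end|>"):
    """Render a list of {"role", "content"} dicts into one raw prompt string."""
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}\n{end_tag}\n")
    parts.append("<|assistant|>\n")       # generation prompt: the model continues from here
    return "".join(parts)

prompt = render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
])
print(prompt)
```

Swapping in another model's tags means changing only this function, which is why template mismatches are easy to introduce and easy to fix.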
12 Inference — The Complete Pipeline
What actually happens when you send a message and get a response back.
End-to-End Inference Pipeline
The KV Cache — Why It Matters for Performance
In auto-regressive generation, each new token requires a forward pass through every layer. But the attention Keys and Values for all previous tokens are the same every time, so recomputing them is wasteful. The KV cache stores these tensors in memory, so each new token only needs to compute its own K and V, plus attend over cached ones. This is one of the most important inference optimizations.
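The saving is easy to count in a toy setting. This sketch uses a single attention layer, uses each token vector as its own query, and replaces the learned K and V projections with fixed stand-ins; it compares how many projections are computed with and without a cache.

```python
import math

calls = {"kv_projections": 0}

def project_kv(x):
    """Stand-in for x·W_k and x·W_v; counts how often it is called."""
    calls["kv_projections"] += 1
    return [2.0 * xi for xi in x], [0.5 * xi for xi in x]

def attend(q, keys, values):
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / z for j in range(len(values[0]))]

def generate_without_cache(tokens):
    outputs = []
    for t in range(len(tokens)):
        kv = [project_kv(x) for x in tokens[:t + 1]]      # recompute K, V for the whole prefix
        outputs.append(attend(tokens[t], [k for k, _ in kv], [v for _, v in kv]))
    return outputs

def generate_with_cache(tokens):
    keys, values, outputs = [], [], []
    for x in tokens:
        k, v = project_kv(x)                              # only the new token is projected
        keys.append(k)
        values.append(v)
        outputs.append(attend(x, keys, values))
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow = generate_without_cache(tokens)
uncached_calls = calls["kv_projections"]
calls["kv_projections"] = 0
fast = generate_with_cache(tokens)
cached_calls = calls["kv_projections"]
print(uncached_calls, cached_calls)   # 10 4
```

Without the cache the projection count grows quadratically with sequence length; with it, linearly. The price is the memory needed to hold the cached keys and values.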
Sampling Parameters
Temperature: Divides logits before softmax. T=1.0: normal. T→0: greedy (always pick most likely). T>1: more random. The creativity dial.
Top-p (nucleus sampling): Sample from the smallest set of tokens whose cumulative probability exceeds p. Cuts off the long tail of unlikely tokens. Often preferred over Top-k because the cutoff adapts to the shape of the distribution.
Top-k: Only sample from the k most probable tokens. Simple, but the right k varies by context, so Top-p is generally preferred. k=1 is greedy decoding.
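The three controls compose into a short sampling routine. This is a simplified sketch with invented function names: temperature must be greater than zero here (greedy decoding is the limit as it approaches zero), and the nucleus cutoff stops as soon as the cumulative probability reaches p.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def renormalise(probs, keep):
    keep = set(keep)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalise(probs, order[:k])

def top_p_filter(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:               # smallest prefix whose mass reaches p
            break
    return renormalise(probs, keep)

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    probs = softmax([x / temperature for x in logits])   # temperature scales the logits
    if top_k is not None:
        probs = top_k_filter(probs, top_k)
    if top_p is not None:
        probs = top_p_filter(probs, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 2))     # only the two most likely tokens survive
print(top_p_filter(probs, 0.9))   # the 0.05 tail token is cut
print(sample([1.0, 3.0, 2.0], top_k=1))   # 1: k=1 is greedy decoding
```

Note how the number of tokens Top-p keeps depends on the distribution, whereas Top-k always keeps exactly k.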
Supervised learning gave us the training loop. Neural networks gave us function approximation. Transformers gave us parallel sequence processing via attention. Pretraining on internet text gave LLMs world knowledge. RLHF shaped that knowledge into helpful behavior. Open weights make that behavior downloadable. Prompting is the interface. Inference is the runtime. Agentic loops connect inference to the world.
Next: Part 3 will go deep on the mathematics — attention score computation, softmax, loss functions, gradient derivations, and scaling laws.