01 The Big Picture
Training gets the headlines; inference pays the bills. Every token you read from a model was produced by the machinery in this document.
When you send a prompt, you are not calling a function — you are submitting a job to a GPU factory. That job has two phases with opposite performance personalities: a parallel, compute-hungry burst (prefill) followed by a serial, memory-hungry drip (decode). Almost every practical fact about LLMs — why input tokens are cheaper than output tokens, why responses stream word by word, why long contexts slow things down, why "time to first token" and "tokens per second" are separate metrics — falls out of this two-phase structure.
02 What Is Inference? (Precisely)
Inference is running a trained model forward — weights frozen, no gradients, no learning — to map an input sequence to output tokens, one at a time. Formally: given tokens t₁…tₙ, the model computes a probability distribution over the vocabulary for tₙ₊₁, a sampler picks one, and the process repeats with the new token appended.
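The loop described above can be sketched in a few lines. This is a minimal illustration, not a real model: `VOCAB`, `toy_logits`, and the cyclic scoring rule are hypothetical stand-ins for a frozen network, and only the shape of the loop (distribution, sample, append, repeat) is the point.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "down", "<eos>"]  # hypothetical 5-token vocabulary

def toy_logits(tokens):
    """Stand-in for a frozen model: one score per vocabulary entry."""
    # Toy rule (an assumption for illustration): favor the id after the last token.
    favored = (tokens[-1] + 1) % len(VOCAB)
    return [3.0 if i == favored else 0.0 for i in range(len(VOCAB))]

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, max_new_tokens, rng):
    """Autoregressive loop: predict a distribution, sample, append, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens))  # distribution for the next token
        nxt = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]  # sampler
        tokens.append(nxt)  # the new token becomes part of the input
        if VOCAB[nxt] == "<eos>":
            break
    return tokens

if __name__ == "__main__":
    ids = generate([0], max_new_tokens=10, rng=random.Random(0))
    print(" ".join(VOCAB[i] for i in ids))
```

Nothing in the loop changes the weights or looks anything up: each step is a pure function of the tokens so far plus the sampler's randomness.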
What inference is not: it is not retrieval (nothing is "looked up" from a database of texts), not learning (your conversation does not update the weights), and not planning ahead (the model has no committed plan for token 50 while emitting token 3 — though its internal representations do encode predictive structure beyond the next token).
The two named phases
Prefill (also: prompt processing): every prompt token is pushed through the network simultaneously. It is one pass over heavy matrices, and its output is the first generated token plus a filled KV cache.
Decode (also: generation): one token per forward pass, sequentially, each pass reading the ever-growing KV cache. This loop runs until an end-of-sequence token or a length limit.
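The division of labor between the two phases can be sketched as two functions. Everything here is a toy under stated assumptions: `toy_kv` and `toy_next_token` are hypothetical placeholders for real per-layer tensors and a real forward pass, and the countdown rule exists only so that generation reaches the end-of-sequence token.

```python
EOS = 0  # hypothetical end-of-sequence token id

def toy_kv(token):
    """Stand-in for one token's (K, V) entry; real caches hold vectors per layer."""
    return (token, token * 2)

def toy_next_token(cache):
    """Stand-in for a forward pass that reads the cache and picks a token."""
    # Toy rule (assumption): count down from the most recent key toward EOS.
    return max(cache[-1][0] - 1, EOS)

def prefill(prompt):
    """Process every prompt token in one pass; return first token and cache."""
    cache = [toy_kv(t) for t in prompt]
    return toy_next_token(cache), cache

def decode(first_token, cache, max_new_tokens):
    """One token per step until EOS or the length limit is reached."""
    out = [first_token]
    while out[-1] != EOS and len(out) < max_new_tokens:
        cache.append(toy_kv(out[-1]))      # cache grows by one entry per step
        out.append(toy_next_token(cache))  # one forward pass, one new token
    return out

if __name__ == "__main__":
    first, cache = prefill([5, 3])
    print(decode(first, cache, max_new_tokens=10))  # [2, 1, 0]
```

Both stopping conditions from the text appear in the `while` test: the loop ends on `EOS` or when `max_new_tokens` is reached, whichever comes first.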
03 Why Two Phases? (The Motivation)
Nobody designed inference as two phases for elegance — it is forced by one mathematical fact and one hardware fact colliding:
The math fact
Attention for token i needs the Keys and Values of all previous tokens. Prompt tokens already exist, so their attention can be computed in parallel — one big batch. Generated tokens don't exist yet, so generation is irreducibly sequential: you cannot attend to a token that hasn't been sampled.
The hardware fact
GPUs are throughput machines (see CPU vs GPU). They are happiest multiplying big matrices. Processing 2,000 prompt tokens at once is a big matrix — great. Processing 1 new token is a skinny vector — the GPU spends its time waiting on memory, not computing.
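A rough way to put a number on "big matrix versus skinny vector" is arithmetic intensity: floating-point operations performed per byte of weights read. The sketch below assumes a single square weight matrix with a hypothetical width of 4096 and 2-byte weights, and it ignores activation traffic, so treat the results as an order-of-magnitude illustration only.

```python
def arithmetic_intensity(num_tokens, d_model, bytes_per_weight=2):
    """FLOPs per byte of weights read for one (d_model x d_model) matmul."""
    # One multiply and one add per weight, per token.
    flops = 2 * num_tokens * d_model * d_model
    # The weight matrix is read from memory once, however many tokens use it.
    weight_bytes = d_model * d_model * bytes_per_weight
    return flops / weight_bytes

# d_model = 4096 is an assumed, illustrative width.
prefill_intensity = arithmetic_intensity(num_tokens=2000, d_model=4096)
decode_intensity = arithmetic_intensity(num_tokens=1, d_model=4096)

if __name__ == "__main__":
    print(prefill_intensity)  # 2000.0 FLOPs per weight byte
    print(decode_intensity)   # 1.0 FLOPs per weight byte
```

Under these assumptions the 2,000-token prefill does 2,000 times more arithmetic per byte fetched than a single decode step, which is why the first keeps the cores busy and the second leaves them waiting on memory.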
The thought experiment that makes the KV cache obvious: suppose there were no cache. To generate token 1,001 you would recompute attention keys and values for all 1,000 previous tokens, work you already did. Token 1,002 repeats it again. Generating n tokens would mean n recomputations of an ever-longer history, or O(n²) per-token K/V computations in total instead of O(n). The KV cache is nothing more than memoization: store each token's K and V the first time, never recompute. This trades memory for compute, and as we'll see, that memory becomes the new bottleneck.
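The thought experiment can be checked by simply counting how many per-token K/V computations each strategy performs. The function names below are hypothetical, and the count covers only K/V projections, not the attention arithmetic itself.

```python
def kv_computations_without_cache(prompt_len, new_tokens):
    """Every step recomputes K and V for all tokens currently in context."""
    total = 0
    for step in range(new_tokens):
        total += prompt_len + step  # the context grows by one token per step
    return total

def kv_computations_with_cache(prompt_len, new_tokens):
    """Each token's K and V are computed once and then reused."""
    # Prefill covers the prompt; each later step adds only the previous new token.
    return prompt_len + (new_tokens - 1)

if __name__ == "__main__":
    print(kv_computations_without_cache(1000, 1000))  # 1499500
    print(kv_computations_with_cache(1000, 1000))     # 1999
```

For a 1,000-token prompt and 1,000 generated tokens, memoization cuts the K/V work from roughly 1.5 million token computations to about 2,000, at the price of keeping every entry in memory.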
04 How It Works — The Full Request Lifecycle
Step through what happens to one request, from your keyboard to streamed tokens.
The two metrics that matter
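Assuming the two metrics meant here are the ones named in the introduction, time to first token (TTFT) and tokens per second, both can be measured from the arrival times of a streamed response. The sketch below is a hypothetical client-side helper; the timestamps in the example are made up.

```python
def latency_metrics(request_sent_at, token_times):
    """Return (ttft_seconds, decode_tokens_per_second) for one streamed response.

    token_times holds the arrival time in seconds of each token, in order.
    """
    if not token_times:
        raise ValueError("no tokens received")
    # TTFT covers queueing plus prefill: everything before the first token.
    ttft = token_times[0] - request_sent_at
    if len(token_times) < 2:
        return ttft, None  # a decode rate needs at least two tokens
    # Decode rate counts the gaps between tokens after the first one.
    decode_span = token_times[-1] - token_times[0]
    return ttft, (len(token_times) - 1) / decode_span

if __name__ == "__main__":
    # Made-up timestamps: first token after 0.5 s, then one token every 0.25 s.
    print(latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5]))  # (0.5, 4.0)
```

Keeping the two numbers separate matters because they are set by different phases: TTFT reflects prefill (and any queueing), while the decode rate reflects the per-token loop.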
05 KV Cache Math — Why Long Contexts Hurt
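The KV cache is not an abstraction; it's real bytes. Per token, per layer, you store one K and one V vector for each KV head. The total: cache bytes = 2 (K and V) × n_layers × n_kv_heads × head_dim × sequence length × bytes per value.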
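As a worked check of the cache size, the sketch below multiplies out the standard per-token accounting. The model shape is an assumption, not something this document states: 32 layers, 32 KV heads, head dimension 128, and 16-bit values, which is a typical 7B-class configuration and matches the 14 GB weight figure quoted next.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache K and V for one sequence."""
    # Leading 2 = one K vector and one V vector per token, layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed 7B-class shape (hypothetical): 32 layers, 32 KV heads, head_dim 128, FP16.
per_token = kv_cache_bytes(1, n_layers=32, n_kv_heads=32, head_dim=128)
per_conversation = kv_cache_bytes(4096, n_layers=32, n_kv_heads=32, head_dim=128)

if __name__ == "__main__":
    print(per_token)                 # 524288 bytes, i.e. 0.5 MiB per token
    print(per_conversation / 2**30)  # 2.0 GiB for a 4096-token context
```

Every factor is a plain multiplier, so doubling the context length or the number of layers doubles the cache.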
Two gigabytes of cache for one 4K conversation, on top of the 14 GB of weights. Now imagine the provider wants to serve 50 concurrent users on one 80 GB GPU: 50 × 2 GB of cache plus 14 GB of weights comes to 114 GB. The arithmetic stops working. This single calculation explains an entire branch of the industry:
| Technique | What it does | What it trades |
|---|---|---|
| GQA / MQA | Share K,V across query heads (n_kv_heads < n_heads) → cache shrinks by n_heads / n_kv_heads (typically 4–8× for GQA; MQA keeps a single KV head) | Slight quality loss; GQA is now standard in Llama-3, Mistral |
| PagedAttention (vLLM) | Store cache in non-contiguous pages, like OS virtual memory → almost no fragmentation (waste is limited to each sequence's last page) | Indirection overhead; complexity |
| Quantized cache | Store K,V in 8-bit or 4-bit instead of 16 | Accuracy at long range |
| Sliding window | Only cache the last W tokens (Mistral: 4K window) | Model can't directly attend further back |
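
To see how much room these techniques buy, the sketch below reruns the cache arithmetic with three of them applied. The model shape (32 layers, 32 query heads, head dimension 128), the choice of 8 KV heads for GQA, and the 32K-token conversation are all assumptions for illustration. The 80 GB and 14 GB figures are treated as GiB for simplicity, and PagedAttention is not modeled because it changes allocation rather than size.

```python
GIB = 2**30

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache K and V for one sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def max_conversations(gpu_gib, weights_gib, cache_bytes_each):
    """How many sequences fit in the memory left over after the weights."""
    return ((gpu_gib - weights_gib) * GIB) // cache_bytes_each

# Assumed shape: 32 layers, head_dim 128, 4096-token conversations.
baseline = kv_cache_bytes(4096, 32, n_kv_heads=32, head_dim=128)  # 16-bit values
gqa = kv_cache_bytes(4096, 32, n_kv_heads=8, head_dim=128)        # 4x fewer KV heads
gqa_int8 = kv_cache_bytes(4096, 32, 8, 128, bytes_per_value=1)    # plus 8-bit cache

# Sliding window: a 32K-token conversation caches only the last W = 4096 tokens.
long_full = kv_cache_bytes(32768, 32, 32, 128)
long_windowed = kv_cache_bytes(min(32768, 4096), 32, 32, 128)

if __name__ == "__main__":
    for name, size in [("baseline", baseline), ("GQA", gqa), ("GQA + int8", gqa_int8)]:
        print(name, size / GIB, "GiB ->", max_conversations(80, 14, size), "conversations")
    print("32K context:", long_full / GIB, "GiB full,", long_windowed / GIB, "GiB windowed")
```

Under these assumptions the same GPU goes from 33 concurrent 4K conversations to 132 with GQA and 264 with GQA plus an 8-bit cache, before counting activations and other runtime overhead.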
06 Batching — How One GPU Serves Many Users
Decode wastes the GPU: one token's worth of vectors cannot feed thousands of cores. The fix is to decode many users' sequences at once — the weights are read from memory once per step and amortized across the whole batch. This is why providers can sell tokens cheaply at all.
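A back-of-the-envelope estimate shows the size of this amortization. The sketch assumes decode is limited purely by reading the weights once per step, uses the 14 GB weight figure quoted in this document, and picks 2 TB/s as an illustrative memory bandwidth (an assumption, not a measurement). It ignores KV cache reads and compute limits, so the batch figure is an upper bound.

```python
def decode_step_seconds(weight_bytes, bandwidth_bytes_per_s):
    """Lower bound on one decode step: every weight is read from memory once."""
    return weight_bytes / bandwidth_bytes_per_s

def aggregate_tokens_per_second(batch_size, weight_bytes, bandwidth_bytes_per_s):
    """Each step yields one token per sequence in the batch."""
    return batch_size / decode_step_seconds(weight_bytes, bandwidth_bytes_per_s)

WEIGHT_BYTES = 14e9  # 14 GB of weights
BANDWIDTH = 2e12     # ASSUMPTION: 2 TB/s, an illustrative round number

if __name__ == "__main__":
    print(decode_step_seconds(WEIGHT_BYTES, BANDWIDTH))              # about 0.007 s
    print(aggregate_tokens_per_second(1, WEIGHT_BYTES, BANDWIDTH))   # about 143 tok/s
    print(aggregate_tokens_per_second(50, WEIGHT_BYTES, BANDWIDTH))  # about 7143 tok/s
```

In this simplified model the step time does not depend on batch size, so 50 users share the cost of one pass over the weights and aggregate throughput rises 50-fold while each user's stream stays at the same speed.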
Static batching (old way)
Collect N requests, run them together, wait until all finish. One user generating a 2,000-token essay holds nine users who needed 20 tokens hostage. The batch runs at the speed of its slowest request.
Continuous batching (modern way)
The scheduler operates per step, not per request. The moment any sequence finishes, a queued request takes its slot in the very next decode step. Like a restaurant seating parties as tables free up, not waiting for the whole room to leave.
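The difference between the two schedulers shows up clearly in a small simulation. This is a toy model under stated assumptions: requests are represented only by their output length, each decode step takes one time unit, prefill cost is ignored, and the workload (one 2,000-token request plus nineteen 20-token requests on 10 slots) is made up.

```python
from collections import deque

def static_batching(lengths, batch_size):
    """Completion step of each request when a batch waits for its slowest member."""
    finish, clock = [], 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        clock += max(batch)  # the whole batch is held until the longest is done
        finish.extend([clock] * len(batch))
    return finish

def continuous_batching(lengths, batch_size):
    """Completion step of each request when freed slots are refilled every step."""
    queue = deque(enumerate(lengths))
    active = {}  # request index -> tokens still to generate
    finish = [0] * len(lengths)
    step = 0
    while queue or active:
        while queue and len(active) < batch_size:  # refill free slots
            idx, remaining = queue.popleft()
            active[idx] = remaining
        step += 1  # one decode step for every active sequence
        for idx in list(active):
            active[idx] -= 1
            if active[idx] == 0:  # finished: its slot is free next step
                finish[idx] = step
                del active[idx]
    return finish

if __name__ == "__main__":
    lengths = [2000] + [20] * 19
    for scheduler in (static_batching, continuous_batching):
        done = scheduler(lengths, batch_size=10)
        print(scheduler.__name__, "mean completion step:", sum(done) / len(done))
```

In this toy workload the mean completion time drops from 2,010 steps to 130, even though the long request still takes 2,000 steps either way: the short requests simply stop waiting for it.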
This is also where your latency variance comes from: mid-generation, the scheduler may pause decode steps to run a prefill burst for newly arrived requests (or mix the two, known as "chunked prefill"). When a model's output visibly stutters, you are often watching someone else's prompt being prefilled.
07 Mental Models
The model "reads" your whole prompt in one gulp, then "writes" its answer one word at a time, re-reading its notes (KV cache) before each word. Lets you reason about: why time to first token (TTFT) grows with input length while tokens per second (tok/s) falls as total context grows.
Weights are the textbook (shared, read-only); the KV cache is each student's scratchpad (private, growing). Lets you reason about: memory limits on concurrency, why context length costs VRAM, why "restarting" a conversation is cheap for you but expensive for the provider.
Prefill is the wide conveyor belt; decode is a single-file checkout. Batching opens more checkout lanes that share one cashier brain (the weights). Lets you reason about: provider economics, why batch size trades your latency for their throughput.
08 Common Misconceptions
"The model gets slower because it's thinking harder about difficult questions." No: per-token cost is constant for a given context length. Hard questions produce more tokens (including hidden reasoning tokens), and longer context makes each token slightly slower. The model never "tries harder" on one forward pass.
"My whole conversation is re-sent and re-read every turn, so the model re-thinks it all." Half true. The conversation is re-sent (the API is stateless), but with prefix caching the provider can often restore the KV cache for the unchanged prefix instead of recomputing it. That is the subject of the next doc.
"Streaming is a UX gimmick." Streaming is the honest representation of what the hardware is doing: tokens genuinely exist one at a time. Buffering the full answer would only add latency.
"Bigger GPU = faster single response." Mostly false for decode: a single sequence is memory-bandwidth-bound, so each step takes about as long as reading the weights and that sequence's KV cache from memory. Extra compute cores and extra memory capacity do not shorten that read; only higher memory bandwidth does. Bigger GPUs mostly buy concurrency, not single-stream speed.