AI Learning Series · Part 6

Inference Anatomy

What actually happens between hitting Send and seeing the first word — and why the two phases of inference have opposite hardware personalities.

01 The Big Picture

Training gets the headlines; inference pays the bills. Every token you read from a model was produced by the machinery in this document.

When you send a prompt, you are not calling a function — you are submitting a job to a GPU factory. That job has two phases with opposite performance personalities: a parallel, compute-hungry burst (prefill) followed by a serial, memory-hungry drip (decode). Almost every practical fact about LLMs — why input tokens are cheaper than output tokens, why responses stream word by word, why long contexts slow things down, why "time to first token" and "tokens per second" are separate metrics — falls out of this two-phase structure.

🔑
The one-sentence summary: Prefill is a matrix-matrix problem that saturates compute; decode is a matrix-vector problem that saturates memory bandwidth. Everything else is engineering around that asymmetry.

02 What Is Inference? (Precisely)

Inference is running a trained model forward — weights frozen, no gradients, no learning — to map an input sequence to output tokens, one at a time. Formally: given tokens t₁…tₙ, the model computes a probability distribution over the vocabulary for tₙ₊₁, a sampler picks one, and the process repeats with the new token appended.
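The loop described above can be sketched in a few lines. This is a toy illustration only: `VOCAB`, `toy_model`, and `generate` are hypothetical names, and the "model" is a hand-written function standing in for a real forward pass with frozen weights.

```python
import random

# Hypothetical four-word vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["the", "cat", "sat", "<eos>"]

def toy_model(tokens):
    """Stand-in for a forward pass: return next-token probabilities over VOCAB.

    It is a pure function of the input sequence (no weights change, no gradients).
    """
    favored = (tokens[-1] + 1) % len(VOCAB)  # favor the "next" word cyclically
    probs = [0.1] * len(VOCAB)
    probs[favored] = 0.7                     # 0.7 + 3 * 0.1 = 1.0
    return probs

def generate(prompt, max_new_tokens, rng):
    """Autoregressive loop: predict, sample, append, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = toy_model(tokens)                                   # distribution for the next token
        next_token = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]  # the sampler picks one
        tokens.append(next_token)                                   # new token becomes part of the input
        if VOCAB[next_token] == "<eos>":                            # stop on end-of-sequence
            break
    return tokens

out = generate([0], max_new_tokens=8, rng=random.Random(0))
print([VOCAB[t] for t in out])
```

Note that nothing in `generate` looks further ahead than one token: each iteration only ever asks for the distribution of the next position.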

What inference is not: it is not retrieval (nothing is "looked up" from a database of texts), not learning (your conversation does not update the weights), and not planning ahead (the model has no committed plan for token 50 while emitting token 3 — though its internal representations do encode predictive structure beyond the next token).

The two named phases

Prefill (also: prompt processing) — every prompt token is pushed through the network simultaneously. One pass, heavy matrices, output is the first generated token plus a filled KV cache.

Decode (also: generation) — one token per forward pass, sequentially, each pass reading the ever-growing KV cache. This loop runs until an end-of-sequence token or a length limit.

03 Why Two Phases? (The Motivation)

Nobody designed inference as two phases for elegance — it is forced by one mathematical fact and one hardware fact colliding:

The math fact

Attention for token i needs the Keys and Values of all previous tokens. Prompt tokens already exist, so their attention can be computed in parallel — one big batch. Generated tokens don't exist yet, so generation is irreducibly sequential: you cannot attend to a token that hasn't been sampled.

The hardware fact

GPUs are throughput machines (see CPU vs GPU). They are happiest multiplying big matrices. Processing 2,000 prompt tokens at once is a big matrix — great. Processing 1 new token is a skinny vector — the GPU spends its time waiting on memory, not computing.
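To put a rough number on "big matrix versus skinny vector", the sketch below estimates arithmetic intensity (FLOPs per byte moved) for a single square weight matrix. The function name and the hidden size of 4,096 are assumptions for illustration, and the model ignores attention, caches, and everything else in a real layer.

```python
def arithmetic_intensity(n_tokens, d_model, bytes_per_param=2):
    """FLOPs per byte for multiplying n_tokens activations by one d_model x d_model weight matrix."""
    flops = 2 * n_tokens * d_model * d_model                     # one multiply + one add per weight per token
    weight_bytes = d_model * d_model * bytes_per_param           # weights are read once, whatever n_tokens is
    activation_bytes = 2 * n_tokens * d_model * bytes_per_param  # read the inputs, write the outputs
    return flops / (weight_bytes + activation_bytes)

prefill = arithmetic_intensity(n_tokens=2000, d_model=4096)  # 2,000 prompt tokens at once
decode = arithmetic_intensity(n_tokens=1, d_model=4096)      # one new token
print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.2f} FLOPs/byte")
```

Under these assumptions prefill does on the order of a thousand operations per byte fetched, while decode does about one. The same weights are read in both cases; only the amount of work done per read changes.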

The thought experiment that makes the KV cache obvious: suppose there were no cache. To generate token 1,001 you would recompute attention keys and values for all 1,000 previous tokens, work you already did. Token 1,002 repeats it again. Generating n tokens would mean n recomputations of an ever-longer history, which adds up to O(n²) per-token K/V computations in total. The KV cache is nothing more than memoization: store each token's K and V the first time, never recompute. This trades memory for compute, and as we'll see, that memory becomes the new bottleneck.
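The thought experiment can be made concrete by counting how many per-token K/V computations each approach performs. This is a counting sketch only, with hypothetical function names; it does not model the cost of attention itself.

```python
def kv_computations_without_cache(prompt_len, new_tokens):
    """Every generation step recomputes K and V for the entire sequence so far."""
    total = 0
    for step in range(new_tokens):
        total += prompt_len + step  # sequence length at this step
    return total

def kv_computations_with_cache(prompt_len, new_tokens):
    """Prefill computes K and V once per prompt token; each later step adds one token."""
    # The first output token comes from the prefill pass itself, so only
    # new_tokens - 1 further single-token passes are needed.
    return prompt_len + (new_tokens - 1)

no_cache = kv_computations_without_cache(prompt_len=1000, new_tokens=1000)
cached = kv_computations_with_cache(prompt_len=1000, new_tokens=1000)
print(no_cache, cached)
```

For a 1,000-token prompt and a 1,000-token response, the uncached count is about 1.5 million against roughly 2,000 with the cache: quadratic versus linear growth.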

💡
Why input tokens are billed cheaper than output tokens: a 1,000-token prompt costs one parallel pass. A 1,000-token response costs 1,000 sequential passes, each dragging the full weights and a growing cache through memory. The price difference on every API pricing page is this hardware asymmetry, passed on to you.

04 How It Works — The Full Request Lifecycle

Step through what happens to one request, from your keyboard to streamed tokens.

[Diagram: the request lifecycle]
Tokenizer: "Explain DNS" → [9132, 16332]
Scheduler: queued, batched with others
Prefill: all prompt tokens in parallel; compute-bound; sets TTFT
KV Cache: K,V per token, per layer, per head; filled by prefill, then read and appended to by the decode loop
Decode loop: 1 token per pass; sequential; memory-bound; sets tok/s
Sample → Stream: temperature, top-p → your screen
Stop: EOS token, max length, stop word

The two metrics that matter

TTFT
Time To First Token — dominated by prefill (+ queueing). Grows with prompt length.
TPOT / tok·s⁻¹
Time Per Output Token — dominated by decode. Grows with context length (bigger cache to read).
Goodput
Useful tokens per GPU-second across all users — what the provider optimizes, often against your latency.
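The first two metrics are easy to measure from the client side if you record when each streamed token arrives. The sketch below assumes you already have those timestamps in seconds; the function name and the sample numbers are hypothetical.

```python
def latency_metrics(sent_at, token_times):
    """Compute TTFT, TPOT and tokens/second from wall-clock timestamps (seconds)."""
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - sent_at            # queueing + prefill
    n_gaps = len(token_times) - 1              # intervals between consecutive tokens
    tpot = (token_times[-1] - token_times[0]) / n_gaps if n_gaps else None
    tok_per_s = 1.0 / tpot if tpot else None   # decode speed is the inverse of TPOT
    return ttft, tpot, tok_per_s

# Hypothetical stream: request sent at t=0, first token after 0.5 s, then one every 20 ms.
ttft, tpot, tok_per_s = latency_metrics(0.0, [0.50, 0.52, 0.54, 0.56, 0.58])
print(f"TTFT={ttft:.2f}s  TPOT={tpot * 1000:.0f}ms  speed={tok_per_s:.0f} tok/s")
```

Keeping the first token out of the TPOT average is deliberate: its delay belongs to prefill, not to the decode loop.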

05 KV Cache Math — Why Long Contexts Hurt

The KV cache is not an abstraction; it's real bytes. Per token, per layer, you store one K and one V vector. The total:

// KV cache size for one sequence
bytes = 2                        // K and V
      × n_layers
      × n_kv_heads × head_dim    // = KV hidden size
      × seq_len
      × bytes_per_param          // 2 for fp16

// Llama-2-7B, fp16, 4,096-token context:
2 × 32 × (32 × 128) × 4096 × 2 ≈ 2.1 GB   // for ONE sequence

Two gigabytes of cache for one 4K conversation — on top of the 14 GB of weights. Now imagine the provider wants to serve 50 concurrent users on one 80 GB GPU. The arithmetic stops working. This single calculation explains an entire branch of the industry:

Technique | What it does | What it trades
GQA / MQA | Share K,V across query heads (n_kv_heads < n_heads) → cache shrinks 4–8× | Slight quality loss; now standard in Llama-3, Mistral
PagedAttention (vLLM) | Store cache in non-contiguous pages, like OS virtual memory → no fragmentation | Indirection overhead; complexity
Quantized cache | Store K,V in 8-bit or 4-bit instead of 16 | Accuracy at long range
Sliding window | Only cache the last W tokens (Mistral: 4K window) | Model can't directly attend further back
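The size formula and the first rows of the table can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the function names are hypothetical, the 8-KV-head variant is an illustrative what-if (not the real Llama-2-7B configuration), and the budget calculation ignores activations and runtime overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_param=2):
    """KV cache size for one sequence: K and V, per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_param

def max_sequences(gpu_bytes, weight_bytes, cache_bytes):
    """How many full-length sequences fit in the memory left after the weights."""
    return (gpu_bytes - weight_bytes) // cache_bytes

# Llama-2-7B shape from the text: 32 layers, 32 KV heads, head_dim 128, fp16, 4,096 tokens.
baseline = kv_cache_bytes(32, 32, 128, 4096)
# What-if: the same shape with 8 shared KV heads (GQA-style), an assumption for illustration.
gqa = kv_cache_bytes(32, 8, 128, 4096)
# What-if: 8-bit cache entries instead of fp16.
int8 = kv_cache_bytes(32, 32, 128, 4096, bytes_per_param=1)

GPU_BYTES = 80 * 10**9      # 80 GB card, as in the text
WEIGHT_BYTES = 14 * 10**9   # 14 GB of fp16 weights, as in the text

print(f"baseline: {baseline / 1e9:.2f} GB, fits {max_sequences(GPU_BYTES, WEIGHT_BYTES, baseline)}")
print(f"GQA (8 KV heads): {gqa / 1e9:.2f} GB, fits {max_sequences(GPU_BYTES, WEIGHT_BYTES, gqa)}")
print(f"8-bit cache: {int8 / 1e9:.2f} GB")
```

Under these assumptions about 30 full-context sequences fit with the baseline cache, well short of 50, and sharing K,V heads four ways roughly quadruples that.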
🧠
DSA connection: the KV cache is memoization (dynamic programming's core trick), and PagedAttention is literally OS paging applied to tensors — a page table mapping logical token positions to physical GPU memory blocks. Same data structure, sixty years apart.
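The page-table idea can be sketched with a toy allocator. This is not vLLM's implementation or API: the class name, method names, and block sizes are hypothetical, and the sketch tracks only which physical block each token slot lives in, not the tensors themselves.

```python
class PagedKVAllocator:
    """Toy block table: maps each sequence's logical blocks to physical block ids."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}                      # seq_id -> list of physical block ids
        self.seq_lens = {}                          # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:                # first token, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())    # any free block will do
        self.seq_lens[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=2)
alloc.append_token("a")   # a gets its first block
alloc.append_token("b")   # b gets a different block
alloc.append_token("a")   # fits in a's existing block
alloc.append_token("a")   # a needs a second block, not adjacent to its first
print(alloc.block_tables)
```

Sequence "a" ends up in two physical blocks that are not neighbors, which is the point: memory is handed out one small block at a time, so no sequence needs a large contiguous reservation up front.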

06 Batching — How One GPU Serves Many Users

Decode wastes the GPU: one token's worth of vectors cannot feed thousands of cores. The fix is to decode many users' sequences at once — the weights are read from memory once per step and amortized across the whole batch. This is why providers can sell tokens cheaply at all.
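The amortization argument can be put into a back-of-the-envelope model. The sketch assumes decode is purely memory-bound, so a step takes as long as it takes to read the weights once plus every sequence's cache. The function name, the 0.5 GB cache per sequence, and the 2 TB/s bandwidth are illustrative assumptions; the 14 GB of weights comes from the text.

```python
def decode_step_seconds(weight_bytes, cache_bytes_per_seq, batch_size, bandwidth):
    """Memory-bound estimate: one read of the weights plus each sequence's KV cache."""
    return (weight_bytes + batch_size * cache_bytes_per_seq) / bandwidth

WEIGHT_BYTES = 14e9   # fp16 weights, as in the text
CACHE_BYTES = 0.5e9   # assumed KV cache per sequence
BANDWIDTH = 2e12      # assumed memory bandwidth, bytes per second

for batch in (1, 32):
    step = decode_step_seconds(WEIGHT_BYTES, CACHE_BYTES, batch, BANDWIDTH)
    per_user = 1 / step          # each sequence gets one token per step
    aggregate = batch / step     # the GPU emits `batch` tokens per step
    print(f"batch={batch:2d}  per-user={per_user:6.1f} tok/s  aggregate={aggregate:7.1f} tok/s")
```

In this model, going from a batch of 1 to a batch of 32 roughly halves each user's speed while multiplying total output by about fifteen. That trade between individual latency and total throughput is the economic core of serving.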

Static batching (old way)

Collect N requests, run them together, wait until all finish. One user generating a 2,000-token essay holds hostage nine users who needed 20 tokens. GPU runs at the speed of the slowest request.

Continuous batching (modern way)

The scheduler operates per step, not per request. The moment any sequence finishes, a queued request takes its slot in the very next decode step. Like a restaurant seating parties as tables free up, not waiting for the whole room to leave.
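A small simulation shows the difference between the two policies. It is a sketch with simplifying assumptions: every decode step costs the same, prefill is ignored, all requests are queued at time zero, and the function names and workload are hypothetical.

```python
from collections import deque

def static_finish_steps(lengths, slots):
    """Static batching: each batch runs until its longest member ends, then all are released."""
    finish, clock = [], 0
    for i in range(0, len(lengths), slots):
        batch = lengths[i:i + slots]
        clock += max(batch)
        finish.extend([clock] * len(batch))
    return finish

def continuous_finish_steps(lengths, slots):
    """Continuous batching: a freed slot is refilled before the very next decode step."""
    queue = deque(enumerate(lengths))
    active = {}                         # request index -> tokens still to generate
    finish = [0] * len(lengths)
    step = 0
    while queue or active:
        while queue and len(active) < slots:
            idx, n = queue.popleft()    # admit waiting requests into free slots
            active[idx] = n
        step += 1                       # one decode step: every active sequence emits one token
        for idx in list(active):
            active[idx] -= 1
            if active[idx] == 0:        # finished sequences leave immediately
                finish[idx] = step
                del active[idx]
    return finish

# One 2,000-token essay and seven 20-token answers sharing four slots.
lengths = [2000, 20, 20, 20, 20, 20, 20, 20]
print("static:    ", static_finish_steps(lengths, slots=4))
print("continuous:", continuous_finish_steps(lengths, slots=4))
```

Under static batching every short request waits for the essay, either in the same batch or in the batch behind it. Under continuous batching the short requests finish within a few dozen steps and the essay is the only one that takes 2,000.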

This is also where your latency variance comes from: mid-generation, the scheduler may pause decode steps to run a prefill burst for newly arrived requests (or mix the two — "chunked prefill"). When a model's output visibly stutters, you are usually watching someone else's prompt being prefilled.

07 Mental Models

Prefill = reading, Decode = writing

The model "reads" your whole prompt in one gulp, then "writes" its answer one word at a time, re-reading its notes (KV cache) before each word. Lets you reason about: why TTFT scales with input length but tok/s scales with total context.

Caveat: the model doesn't literally comprehend-then-compose; prefill and decode run the same network.

The KV cache is a per-conversation scratchpad in VRAM

Weights are the textbook (shared, read-only); the KV cache is each student's scratchpad (private, growing). Lets you reason about: memory limits on concurrency, why context length costs VRAM, why "restarting" a conversation is cheap for you but expensive for the provider.

Caveat: unlike a scratchpad, the cache is never summarized or compacted; it grows linearly until the sequence ends.

Token factory with one slow lane

Prefill is the wide conveyor belt; decode is a single-file checkout. Batching opens more checkout lanes that share one cashier brain (the weights). Lets you reason about: provider economics, why batch size trades your latency for their throughput.

Caveat: lanes aren't independent. Every lane's cart (cache) competes for the same VRAM.

08 Common Misconceptions

"The model gets slower because it's thinking harder about difficult questions." No — per-token cost is constant for a given context length. Hard questions produce more tokens (including hidden reasoning tokens), and longer context makes each token slightly slower. The model never "tries harder" on one forward pass.

"My whole conversation is re-sent and re-read every turn, so the model re-thinks it all." Half true. The conversation is re-sent (the API is stateless), but with prefix caching the provider usually restores the KV cache for the unchanged prefix instead of recomputing it — that's the subject of the next doc.

"Streaming is a UX gimmick." Streaming is the honest representation of what the hardware is doing — tokens genuinely exist one at a time. Buffering the full answer would only add latency.

"Bigger GPU = faster single response." Mostly false for decode: a single sequence is memory-bandwidth-bound, and one user can't use more bandwidth than the cache read requires. Bigger GPUs mostly buy concurrency, not single-stream speed.

🗺️
Where to go next: the KV cache you now understand is the foundation for prompt caching economics (doc 07) and the memory-bandwidth wall it lives behind (doc 08).