The Inference Optimization Stack

01 The Big Picture

Docs 06–08 diagnosed the disease: inference is memory-bound, the KV cache grows without mercy, and decode starves the GPU. This doc is the pharmacy — every major treatment, organized by where in the system it intervenes.

The industry's optimizations look like a zoo: MoE, SSMs, GQA, quantization, PagedAttention, kernel fusion, tensor parallelism, Splitwise… But they snap into a clean four-layer stack, ordered by when the decision is made — from model design time down to datacenter deployment time. Higher layers change what must be computed and stored; lower layers change how efficiently it happens. Master the stack and any new technique you meet has an obvious slot — and an obvious question: which bytes does it eliminate, shrink, schedule, or distribute?

02 The Stack, Animated

03 Layer 1 · Model Structure

DON'T CREATE THE BYTES

The cheapest byte is the one that never exists. These techniques redesign the network so less must be computed or cached at all.

MoE — only wake the experts you need

Replace each FFN with many "expert" FFNs plus a router that activates 1–2 per token. An 8×7B model holds ~47B parameters (less than 8 × 7 = 56B, because the attention layers are shared across experts) but touches only ~13B per token, so decode reads fewer weight-bytes through the wall (doc 08's budget) while keeping big-model capacity. Cost: all experts must still fit in VRAM, and routing causes load imbalance. Covered in depth in doc 02.
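The routing step can be sketched in a few lines. This is a toy illustration rather than any particular model's router: the experts are stand-in functions, the logits are made up, and the names top_k_route and moe_forward are hypothetical.

```python
import math

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are a softmax over just the selected logits, so they sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Run only the routed experts and mix their outputs by router weight."""
    routed = top_k_route(router_logits, k)
    y = sum(w * experts[i](x) for i, w in routed)
    return y, [i for i, _ in routed]

# Eight toy "experts": each just scales its input by a different constant.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
# Hypothetical router scores for one token.
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

output, used = moe_forward(1.0, experts, logits, k=2)
print(used, output)  # which 2 of the 8 experts ran, and the mixed result
```

All eight experts exist in memory, but only the two selected ones are evaluated for this token. That is the gap between the ~47B resident and ~13B touched parameters.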

GQA — stop caching what heads can share

In vanilla multi-head attention, 32 query heads each carry their own K and V — 32 sets of cache. The insight of Grouped-Query Attention: queries need diversity (they ask different questions), but keys/values are more redundant (the "address book" being queried can be shared). So: keep 32 Q heads, share K,V across groups — e.g. 8 KV heads, each serving 4 Q heads. KV cache shrinks 4× with minimal quality loss; the extreme (1 KV head, MQA) shrinks 32× but hurts. Plug it into doc 06's formula: n_kv_heads is the lever. This is why Llama-3 and Mistral serve long contexts affordably — the choice was made at training time and you inherit it.
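To see the lever move, here is a small calculation using the standard KV-cache size formula. Doc 06's formula is not reproduced in this doc, so treat the formula's exact form and the model shape below (32 layers, head_dim 128, 4096-token context, fp16) as assumptions chosen for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes of K and V cached for one sequence.

    The leading 2 counts K and V; bytes_per_value=2 corresponds to fp16/bf16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed shape for illustration only.
shape = dict(n_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(n_kv_heads=32, **shape)  # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **shape)   # 8 KV heads, each shared by 4 query heads
mqa = kv_cache_bytes(n_kv_heads=1, **shape)   # a single shared KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 1e9:.2f} GB ({mha // size}x smaller than MHA)")
```

Nothing else in the formula changes between the three cases, which is why n_kv_heads translates directly into cache size.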

SSMs — abolish the KV cache entirely

State Space Models (Mamba family) are the radical option. A transformer remembers by keeping everything — the KV cache IS its memory, growing O(n). An SSM instead carries a fixed-size hidden state updated recurrently each token, like an RNN with better math (selective state updates that train in parallel). Consequences:

What you win

O(1) memory per sequence regardless of length; constant tokens/sec at any context; no cache to page, compress, or evict. The KV-cache problems of layers 2–3 vanish by construction.

What you pay

A fixed-size state is lossy — perfect recall of an arbitrary token 100K back isn't guaranteed (the transformer's superpower). Practice converged on hybrids: mostly-SSM with a few attention layers (Jamba), recovering recall where it matters.

🧠

DSA lens: a transformer's memory is an ever-growing array with random access (O(n) space, exact); an SSM's is a fixed-size sketch/rolling hash (O(1) space, lossy). The hybrid is the classic trade: keep a small exact index over a compressed stream.
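The contrast in memory growth can be shown with a toy model. The recurrence below is a plain linear update with fixed decay rates, not Mamba's selective update; the function names and the decay values are hypothetical and only illustrate the constant-size, lossy state.

```python
def transformer_memory(tokens):
    """Toy KV cache: one entry is appended per token, so size grows with length."""
    cache = []
    for t in tokens:
        cache.append(t)  # stands in for that token's K and V vectors
    return cache

def ssm_memory(tokens, decays=(0.5, 0.9, 0.99, 0.999)):
    """Toy recurrent state: a fixed-size vector overwritten at every token.

    Each channel applies h <- decay * h + x, so old inputs fade instead of
    being stored exactly.
    """
    state = [0.0] * len(decays)
    for t in tokens:
        state = [d * h + t for d, h in zip(decays, state)]
    return state

short_seq = list(range(10))
long_seq = list(range(10_000))

print(len(transformer_memory(short_seq)), len(transformer_memory(long_seq)))
print(len(ssm_memory(short_seq)), len(ssm_memory(long_seq)))
```

The cache length tracks the sequence length, while the state stays at four numbers no matter how many tokens pass through it.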

04 Layer 2 · Storage / Memory

SHRINK THE BYTES

Given a fixed architecture, store each number in fewer bits. Since decode speed ≈ bandwidth ÷ bytes (doc 08), shrinking bytes is a direct speed multiplier.

Quantization — fewer bits per weight

Training typically uses fp16/bf16 (16 bits). But inference weights are read-only, so they can be compressed once, offline. Map each weight to an 8-bit or 4-bit integer plus a per-group scale factor: w ≈ scale × q. The matmul dequantizes on the fly. That costs extra compute, but in memory-bound decode compute is effectively free (doc 08's roofline) while bytes are precious:

Format	7B model size	Effect
fp16	14 GB	baseline — ~240 tok/s ceiling (doc 08)
int8	7 GB	~2× decode ceiling, near-lossless
int4 (GPTQ/AWQ class)	3.5 GB	~4× ceiling; clever methods keep "salient" weights precise; small quality tax
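The table's numbers follow from bandwidth ÷ bytes. A quick check, assuming the 3.35 TB/s single-GPU bandwidth quoted elsewhere in this doc, batch size 1, and ignoring scale-factor overhead and KV-cache reads (the constant and function names are illustrative):

```python
HBM_BANDWIDTH = 3.35e12  # bytes/s, assumed single-GPU figure
PARAMS = 7e9             # 7B parameters

def decode_ceiling(bits_per_weight):
    """Upper bound on tokens/s when every decode step streams all weights once."""
    model_bytes = PARAMS * bits_per_weight / 8
    return model_bytes, HBM_BANDWIDTH / model_bytes

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    size, tok_s = decode_ceiling(bits)
    print(f"{name}: {size / 1e9:.1f} GB, ceiling ~{tok_s:.0f} tok/s")
```

Halving the bits halves the bytes streamed per token, so the ceiling doubles at each step down.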

Why it works at all: trained weights are redundant and roughly normally distributed — most of the information survives coarse rounding if outlier values are handled carefully (that's the difference between naive rounding and GPTQ/AWQ).
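As a baseline, here is the naive round-to-nearest scheme for one weight group. This is explicitly not GPTQ or AWQ, which add calibration and outlier handling on top; the weights are made up and the function names are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Symmetric quantization of one group: integers q plus a single float scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for int4, 127 for int8
    scale = (max(abs(w) for w in weights) / qmax) or 1.0  # avoid a zero scale
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    """Recover approximate weights: w ~ scale * q."""
    return [scale * v for v in q]

group = [0.12, -0.05, 0.31, -0.27, 0.02, 0.18, -0.35, 0.07]  # made-up weights
q, scale = quantize_group(group, bits=4)
restored = dequantize_group(q, scale)
worst = max(abs(a - b) for a, b in zip(group, restored))
print(q, scale, worst)
```

The rounding error per weight is bounded by half the scale, and the scale is set by the largest value in the group. A single outlier therefore coarsens every other weight in its group, which is the problem the smarter methods attack.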

KV cache compression — fewer bits per remembered token

The same logic applied to the cache, which at long context and high concurrency can outweigh the weights themselves (doc 06's 2.1 GB per sequence adds up fast). The toolbox: quantize K,V to 8/4-bit (each token's cache entry is also write-once-read-many); sliding windows, which keep only the last W tokens' K,V (Mistral); eviction policies, which drop K,V of tokens that stopped receiving attention (H2O's "heavy hitters": measure attention weights, evict the ignored, essentially LFU cache eviction applied to tokens); and attention sinks, the odd discovery that the first few tokens must never be evicted or generation destabilizes.
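Two of these policies, the sliding window and attention sinks, combine into a very small retention rule. A minimal sketch, where positions_to_keep and the default of 4 sink tokens are illustrative choices rather than any library's API:

```python
def positions_to_keep(seq_len, window, n_sinks=4):
    """Cache positions retained under a sliding window plus attention sinks.

    Keeps the first n_sinks tokens and the most recent `window` tokens;
    everything in between is evicted.
    """
    sinks = range(min(n_sinks, seq_len))
    recent = range(max(0, seq_len - window), seq_len)
    return sorted(set(sinks) | set(recent))

kept = positions_to_keep(seq_len=20, window=6, n_sinks=2)
print(kept)
print(len(positions_to_keep(seq_len=100_000, window=6, n_sinks=2)))
```

However long the sequence gets, the retained set never exceeds window + n_sinks entries, which puts a hard cap on cache size per sequence.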

🔑

Layer interplay: GQA (layer 1) decides how many KV vectors exist; this layer decides their byte-size; PagedAttention (layer 3) decides where they live. Multiply the savings: 4× (GQA) × 4× (fp16 → int4 cache) = a 16× smaller cache per sequence, so roughly 16× more concurrent sequences fit in whatever VRAM the weights leave free.

05 Layer 3 · Runtime Execution

STOP WASTING THE BYTES

The bytes that remain must be placed, scheduled, and moved without waste. This is the serving engine's job (vLLM, TensorRT-LLM, SGLang).

PagedAttention — virtual memory for the KV cache

Pre-vLLM servers allocated each request's cache as one contiguous region sized for max_tokens, like malloc'ing the worst case up front. Result: 60–80% of the memory reserved for KV cache was actually padding and fragmentation. PagedAttention copies the OS playbook wholesale: chop the cache into fixed-size blocks (~16 tokens), keep a per-sequence block table (page table) mapping logical position → physical block, and allocate on demand. Waste collapses to under 4%. Bonus: two requests sharing a prefix can map the same physical blocks (copy-on-write), which is how prefix caching (doc 07) is implemented under the hood.
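The block-table idea can be sketched as a toy allocator. This is not vLLM's implementation: PagedKVCache and its methods are hypothetical names, blocks hold no actual tensors, and prefix sharing with copy-on-write is left out.

```python
BLOCK_SIZE = 16  # tokens per block, matching the ~16 quoted above

class PagedKVCache:
    """Toy block allocator: logical token positions map to physical blocks."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # physical block ids not yet in use
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a block only when needed."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n == len(table) * BLOCK_SIZE:     # existing blocks are full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def locate(self, seq_id, pos):
        """Translate a logical position into (physical block, offset)."""
        return self.tables[seq_id][pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id))
        del self.lengths[seq_id]

cache = PagedKVCache(num_blocks=8)
for _ in range(40):
    cache.append_token("req-a")
for _ in range(5):
    cache.append_token("req-b")
print(cache.tables, cache.locate("req-a", 37))
```

A 40-token request occupies three blocks and a 5-token request one, so the only slack is the unfilled tail of each sequence's last block, never a worst-case reservation.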

Continuous batching — no empty seats

The scheduler admits and retires requests at every decode step instead of waiting for a whole batch to finish. Covered in doc 06; it lives at this layer because it's pure scheduling: same model, same bytes, just never an idle slot.
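A small simulation makes the difference concrete. It counts decode steps only and assumes every step costs the same regardless of batch occupancy; the request lengths are made up and the function names are hypothetical.

```python
def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes; then the next batch starts."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching_steps(lengths, slots):
    """A freed slot is refilled from the queue before the very next decode step."""
    queue = list(lengths)
    active = []   # remaining tokens for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.pop(0))            # admit waiting requests
        steps += 1                                  # one decode step for the whole batch
        active = [r - 1 for r in active if r > 1]   # retire requests that just finished
    return steps

lengths = [100, 10, 10, 10, 100, 10, 10, 10]  # made-up output lengths in tokens
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
print(static_steps, continuous_steps)
```

With static batching the short requests finish early and their slots sit empty until the long one completes. With per-step admission the second long request starts as soon as a slot frees up and overlaps with the first.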

Kernel fusion — don't commute back to HBM between errands

A naive implementation of LayerNorm → matmul → bias → GeLU launches four kernels; each writes its output to HBM and the next reads it back. Three round trips of intermediate tensors through the slow lane, plus per-launch overhead. Fusion compiles them into one kernel: values stay in registers/SRAM from start to finish, touching HBM once on the way in and once on the way out. FlashAttention (doc 08) is exactly this idea executed on attention — fusion is the general principle, FlashAttention its most famous instance. This is what TensorRT, torch.compile, and Triton spend their lives doing.
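The traffic saving can be counted with a toy model. It is simplified to three elementwise ops (bias add, GELU, scale) so that every tensor has the same size; real LayerNorm and matmul fusion is more involved. The HBM class is a hypothetical stand-in that only counts elements moved.

```python
import math

class HBM:
    """Counts elements moved to and from slow memory."""

    def __init__(self):
        self.traffic = 0

    def load(self, tensor):
        self.traffic += len(tensor)
        return list(tensor)

    def store(self, tensor):
        self.traffic += len(tensor)
        return list(tensor)

def add_bias(v, b=0.5):
    return v + b

def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def scale(v, s=2.0):
    return v * s

OPS = [add_bias, gelu, scale]

def run_unfused(x, hbm):
    """One kernel per op: every intermediate is written out and read back."""
    for op in OPS:
        x = hbm.store([op(v) for v in hbm.load(x)])
    return x

def run_fused(x, hbm):
    """One kernel: each element is loaded once, passed through all ops, stored once."""
    out = []
    for v in hbm.load(x):
        for op in OPS:
            v = op(v)   # stays in a local variable, standing in for a register
        out.append(v)
    return hbm.store(out)

x = [i / 10 for i in range(-20, 20)]
unfused_hbm, fused_hbm = HBM(), HBM()
same_result = run_unfused(x, unfused_hbm) == run_fused(x, fused_hbm)
print(same_result, unfused_hbm.traffic, fused_hbm.traffic)
```

Both paths apply the same arithmetic in the same order, so the outputs match; only the number of trips through slow memory differs.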

🧠

One mental test for layer 3: "did any byte travel that didn't need to, or any cycle idle that could have worked?" Paging fixes wasted space, batching fixes wasted slots, fusion fixes wasted trips.

06 Layer 4 · Hardware Cluster

MULTIPLY THE MACHINES

When one GPU's 80 GB and 3.35 TB/s aren't enough, distribute — but every split buys capacity with communication.

Tensor parallelism — split the matrices themselves

A 70B fp16 model (140 GB) cannot fit on one 80 GB GPU. Tensor parallelism (TP) splits each weight matrix column- or row-wise across N GPUs: each computes its shard of every matmul, then an all-reduce stitches the results together, twice per transformer layer. Now the model fits, and N GPUs' bandwidth works in parallel (decode ceiling roughly ×N, before communication cost). The catch: those all-reduces happen every layer, every token, which is why TP lives inside an NVLink island (~900 GB/s); over PCIe or Ethernet it dies. Contrast with its sibling, pipeline parallelism (PP): split by layers (GPU1 holds layers 1–40…), cheap point-to-point comms, but each token's decode hops sequentially through every stage. Rule of thumb: TP for latency, PP for capacity across slow links.
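The column/row split can be checked on tiny matrices. This sketch simulates two GPUs in plain Python with made-up integer weights and no nonlinearity between the two projections; all_reduce_sum is a hypothetical stand-in for the real collective.

```python
def matmul(a, b):
    """Plain row-by-column matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(m, n):
    """Cut a matrix into n equal vertical shards."""
    width = len(m[0]) // n
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(n)]

def split_rows(m, n):
    """Cut a matrix into n equal horizontal shards."""
    height = len(m) // n
    return [m[i * height:(i + 1) * height] for i in range(n)]

def all_reduce_sum(partials):
    """Stand-in for the collective: element-wise sum of every GPU's partial result."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

x = [[1, 2], [3, 4]]                    # activations: 2 tokens, hidden size 2
w1 = [[1, 0, 2, 1], [0, 1, 1, 3]]       # up-projection, 2 -> 4
w2 = [[1, 2], [0, 1], [2, 0], [1, 1]]   # down-projection, 4 -> 2

N = 2  # number of simulated GPUs
partials = []
for w1_shard, w2_shard in zip(split_columns(w1, N), split_rows(w2, N)):
    hidden = matmul(x, w1_shard)               # each GPU holds a column slice of w1
    partials.append(matmul(hidden, w2_shard))  # and the matching row slice of w2

tp_result = all_reduce_sum(partials)
reference = matmul(matmul(x, w1), w2)
print(tp_result, tp_result == reference)
```

Pairing a column split with a row split means the two matmuls need only one all-reduce between them. In the commonly described arrangement, the attention block and the FFN block each pay one, which gives the two per layer.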

Splitwise — give each phase its own machines

The elegant endgame of doc 06. Prefill is compute-bound; decode is memory-bound — so why run them on the same GPU? Disaggregated serving (Splitwise, DistServe, and today's production stacks) routes each request to a prefill fleet (compute-heavy GPUs, maximally batched prompts), then ships the finished KV cache over the interconnect to a decode fleet (bandwidth/capacity-optimized GPUs, huge concurrent batches). Each fleet scales independently with its own bottleneck; the interference you saw in doc 06 — decode stuttering while someone's prompt prefills — disappears, because the phases no longer share silicon. The KV-cache transfer is the toll, paid once per request.

💡

Full-circle moment: the two-phase asymmetry introduced in doc 06 as a nuisance becomes, at cluster scale, an architecture. Diagnose a bottleneck precisely enough and it tells you how to build the datacenter. (Same shape as doc 09's subagents: specialized workers, distilled hand-off.)

07 Mental Models

Four questions, one per layer

Any technique slots by which question it answers: Can we avoid creating the bytes? (model) · Can we shrink them? (memory) · Can we stop wasting them? (runtime) · Can we throw more machines at them? (cluster). Lets you reason about: any new acronym — find its question, predict its trade-off.

Some techniques straddle layers (FlashAttention is runtime fusion enabling memory savings); the stack is a map, not a partition.

Multiplicative, not alternative

Layers compose: MoE × GQA × int4 × paging × fusion × TP is what a production stack actually runs. Lets you reason about: why serving costs fell ~100× in three years even though most of that work still ran on the same silicon generation.

Savings interact — after 4-bit quantization, further bandwidth tricks have less left to save (Amdahl's law applies).

Higher layers move slower

Cluster choices change per deployment; runtime per engine release; memory per model load; model structure only per training run. Urgency flows up: today's lower-layer hack (cache eviction) becomes tomorrow's architecture (SSM hybrids). Lets you reason about: where the research frontier will move next.

Not strictly true — speculative decoding and similar tricks blur design-time vs run-time.

08 Common Misconceptions

"Quantization makes the model dumber, so serious deployments avoid it." Backwards — int8 is effectively standard and 4-bit widespread; well-calibrated quantization costs ~1% on benchmarks while doubling-to-quadrupling throughput. The dumb move is paying 2× latency for precision the weights don't contain.

"MoE with 47B params runs like a 13B model." Only in compute per token. All 47B must sit in VRAM, and different tokens hit different experts, so memory capacity and routing balance — not FLOPs — are MoE's real constraints. It trades layer-2/4 pain for layer-1 gain.

"Tensor parallelism over 8 GPUs = 8× faster." All-reduce twice per layer eats into the gain; TP across slow interconnects can be slower than one GPU. Parallelism multiplies bandwidth only when communication is an order of magnitude cheaper than the work it coordinates.

"SSMs failed — everyone still uses transformers." Pure SSMs lost on recall; hybrid SSM-attention models ship today, and every frontier lab pursues sub-quadratic memory. The KV cache's O(n) growth is the stack's deepest unsolved tax — layer 1 is where it will eventually be repealed.

"These are vendor implementation details, irrelevant to me as an application engineer." They set your price card (doc 07), your latency profile (doc 06), and your context limits (doc 09's whole reason to exist). Reading a model card's "GQA, 4-bit, vLLM, TP=2" line and knowing what it implies for cost and tokens/sec is the systems literacy this series builds.

🎓

The series, complete: mental models (01–05) → the two-phase engine (06) → the token economy (07) → the silicon wall (08) → software's answer (09) → and now the full optimization stack the industry built on those exact constraints (10). Every layer fights the same enemy: bytes through the wall.