01 The Big Picture
An H100 can do ~1,000 trillion fp16 operations per second, but can only pull ~3.35 TB/s from its main memory. Divide those numbers and you get the most important ratio in AI systems.
That ratio (~300 FLOPs available per byte moved) means: unless your algorithm does hundreds of math operations for every byte it touches, the math units sit idle, starving, waiting on memory. Most of inference — especially decode — does far fewer. The GPU you pay for is, much of the time, an expensive memory pump. Understanding the memory hierarchy is therefore not a hardware curiosity: it is the explanation underneath docs 06 and 07, and the reason the software techniques in doc 09 work at all.
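The division can be checked in a few lines. This is a minimal sketch using the rounded H100 figures quoted above; the constant and function names are illustrative and not taken from any library.

```python
# Rounded H100 figures quoted in the text (assumed, not measured here).
PEAK_FP16_FLOPS = 1_000e12       # ~1,000 trillion fp16 operations per second
HBM_BANDWIDTH_BYTES = 3.35e12    # ~3.35 TB/s from main memory


def machine_balance(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """FLOPs the chip can perform in the time it takes to move one byte."""
    return peak_flops / bandwidth_bytes_per_s


ratio = machine_balance(PEAK_FP16_FLOPS, HBM_BANDWIDTH_BYTES)
print(f"~{ratio:.0f} FLOPs available per byte moved")
```

Any kernel that performs fewer operations per byte than this ratio leaves the math units waiting on memory.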
02 What: The Hierarchy, Top to Bottom
Like a CPU, a GPU has a pyramid of memories — each level roughly 10× faster and 100× smaller than the one below. The numbers (H100-class, rounded for memorability):
The crucial difference from a CPU: the SRAM level is explicitly programmed, not an automatic cache. A CUDA kernel chooses what to stage there. That choice — what to keep close — is the entire game, and it is exactly the same game context engineering plays one level up.
03 Why the Wall Exists
Compute and memory scale on different physics. Packing more multiply units onto a die is "easy" — transistor counts kept growing. But HBM bandwidth is limited by pins, signal integrity, and the physical distance bits travel; it grows slowly and costs enormous power. Result: over the last decade, GPU FLOPs grew ~60×, HBM bandwidth ~8×. The gap widens every generation.
Why decode hits the wall (the doc-06 claim, proven)
One decode step for a 7B fp16 model must read all 14 GB of weights from HBM (plus the KV cache) to produce one token. Math per byte: roughly 2 FLOPs per parameter read, and each fp16 parameter is 2 bytes, so the arithmetic intensity is ~1 FLOP per byte, versus the ~300 the hardware needs to stay busy. The math units finish their work in microseconds, then spend the rest of the roughly 4 ms step waiting for weights to arrive. Decode speed ≈ bandwidth ÷ bytes-to-read: 3,350 GB/s ÷ 14 GB ≈ 240 tokens/s, the theoretical ceiling for batch-1 before KV cache reads are counted. No amount of extra FLOPs raises it.
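The estimate above can be reproduced as a small calculation. This is a sketch under the same simplifications as the text: every weight is read from HBM exactly once per token, and KV cache traffic is ignored. The function name and its parameters are illustrative.

```python
def decode_step_times(params, bytes_per_param, peak_flops,
                      bandwidth_bytes_per_s, flops_per_param=2.0):
    """Return (compute_seconds, memory_seconds) for one batch-1 decode step.

    Assumes each weight is read from HBM once per token and ignores
    KV cache reads, so the real step is somewhat slower.
    """
    compute_s = params * flops_per_param / peak_flops
    memory_s = params * bytes_per_param / bandwidth_bytes_per_s
    return compute_s, memory_s


compute_s, memory_s = decode_step_times(
    params=7e9, bytes_per_param=2,
    peak_flops=1_000e12, bandwidth_bytes_per_s=3.35e12,
)
# The step cannot finish before the weights have arrived.
ceiling_tokens_per_s = 1.0 / memory_s

print(f"compute: {compute_s * 1e6:.0f} us, memory: {memory_s * 1e3:.2f} ms")
print(f"batch-1 ceiling: ~{ceiling_tokens_per_s:.0f} tokens/s")
```

The compute time comes out in microseconds and the memory time in milliseconds, which is the gap the section describes.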
This is also why batching (doc 06) works: 32 users' decode steps read the weights from HBM once and reuse them 32 times, so arithmetic intensity is multiplied by the batch size. And it's why quantization speeds up inference even when compute doesn't change: 4-bit weights push a quarter as many bytes through the wall as fp16 weights.
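Both effects fall out of the same ratio. The sketch below assumes, as the text does, roughly 2 FLOPs per parameter per token and counts only weight traffic (no KV cache or activations); the helper names are hypothetical.

```python
def decode_intensity(batch_size, bits_per_param, flops_per_param_per_token=2.0):
    """FLOPs per weight byte for one decode step.

    Weights are read once per step and reused for every sequence in the
    batch. KV cache and activation traffic are ignored.
    """
    bytes_per_param = bits_per_param / 8
    return batch_size * flops_per_param_per_token / bytes_per_param


def weight_bytes(params, bits_per_param):
    """Total bytes of weights that cross the memory bus per decode step."""
    return params * bits_per_param / 8


fp16_batch1 = decode_intensity(batch_size=1, bits_per_param=16)
fp16_batch32 = decode_intensity(batch_size=32, bits_per_param=16)
shrink = weight_bytes(7e9, 16) / weight_bytes(7e9, 4)

print(f"fp16 batch 1:  {fp16_batch1:.0f} FLOPs/byte")
print(f"fp16 batch 32: {fp16_batch32:.0f} FLOPs/byte")
print(f"4-bit weights move {shrink:.0f}x fewer bytes than fp16")
```

Even at batch 32 the intensity stays well below ~300, so the step is still memory-bound; batching simply gets more tokens out of each pass over the weights.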
04 Arithmetic Intensity — The One Number to Compute
For any kernel: AI = FLOPs / bytes moved. Compare it against the hardware's ratio (~300 for H100 fp16). Below → memory-bound; above → compute-bound. This is the roofline model, and it sorts all of inference instantly:
| Operation | Arithmetic intensity | Bound by |
|---|---|---|
| Prefill (long prompt, big matmuls) | O(sequence length) — high | Compute |
| Decode (batch 1) | ~1–2 | Memory |
| Decode (batch 32) | ~32–64 | Memory, less badly |
| Naive attention (long context) | Low — writes/reads n×n matrix to HBM | Memory |
| FlashAttention | High — n×n never leaves SRAM | Compute |
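
The sorting in the table can be expressed as a tiny roofline calculator. This is a sketch using the rounded H100 numbers from earlier. The intensity values passed in are illustrative; in particular, 4096 for prefill is only a stand-in for "O(sequence length)", not a measured figure.

```python
def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_flops / bandwidth_bytes_per_s


def attainable_flops(intensity, peak_flops, bandwidth_bytes_per_s):
    """Roofline: performance is capped by the lower of the two roofs."""
    return min(peak_flops, intensity * bandwidth_bytes_per_s)


def bound_by(intensity, peak_flops, bandwidth_bytes_per_s):
    if intensity >= ridge_point(peak_flops, bandwidth_bytes_per_s):
        return "compute"
    return "memory"


H100 = dict(peak_flops=1_000e12, bandwidth_bytes_per_s=3.35e12)

# Intensities below are illustrative assumptions, in FLOPs per byte.
for name, intensity in [("decode, batch 1", 1),
                        ("decode, batch 32", 32),
                        ("prefill, long prompt", 4096)]:
    tflops = attainable_flops(intensity, **H100) / 1e12
    print(f"{name}: {bound_by(intensity, **H100)}-bound, "
          f"at most {tflops:.1f} TFLOP/s")
```

For a memory-bound kernel the attainable rate scales linearly with intensity, which is why raising intensity (batching, tiling) helps and adding peak FLOPs does not.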
05 FlashAttention — The Hierarchy Exploited
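The most famous "memory trick" in AI. Naive attention writes the full n×n score matrix out to HBM and reads it back for the softmax; FlashAttention computes the same scores one tile at a time in SRAM, so the n×n matrix never reaches HBM. Same math, same result, 2–4× faster, purely by choosing where intermediate values live.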
The deep lesson: FlashAttention does more FLOPs than naive attention (it recomputes softmax corrections), yet runs much faster. On modern hardware, recomputing can be cheaper than remembering — compute is the abundant currency, bytes are the scarce one. The same inversion shows up in training as gradient checkpointing, and in context engineering as "re-derive it from a tool call instead of carrying it in context."
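The "softmax corrections" can be seen in a toy version. This is a pure-Python sketch of the online-softmax idea for a single query vector, not the actual FlashAttention kernel. It leaves out the 1/sqrt(d) scaling, multiple heads, and all GPU details, and the function names are illustrative. It keeps only a running max, a running denominator, and a running weighted sum, so the full row of scores is never stored.

```python
import math


def naive_attention(q, keys, values):
    """Materialize every score, then softmax, then the weighted sum."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, tile_size=2):
    """Same result, but only one tile of scores exists at any time."""
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Softmax correction: rescale what was accumulated under the old max.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [2.0, -1.0, 0.5],
        [-1.0, 0.5, 1.5], [0.3, 0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

reference = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, tile_size=2)
print(max(abs(a - b) for a, b in zip(reference, tiled)))  # rounding-level difference
```

The rescaling by `correction` is the extra arithmetic the tiled version pays; in exchange, its working set is one tile rather than the whole score row.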
06 Mental Models
Registers = your hands; SRAM = the counter; HBM = the pantry; host RAM = the supermarket. A good cook (kernel) plans so ingredients reach the counter once and get fully used. Lets you reason about: why "tiling" appears in every fast kernel; why PCIe transfers (supermarket runs) are catastrophic mid-recipe.
Estimate any inference change by bytes moved, not operations performed. Quantization: fewer bytes → faster. Bigger batch: same bytes, more useful work → faster. Longer context: more KV bytes per step → slower. Lets you reason about: nearly every performance question in this series without benchmarks.
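As a rough illustration of estimating by bytes moved, the sketch below models one decode step as (weight bytes + KV cache bytes) ÷ bandwidth. The model shape (32 layers, 32 KV heads, head dimension 128, fp16 cache) is an assumed 7B-class configuration chosen for illustration, and the helper names are hypothetical. This is a back-of-envelope model, not a benchmark.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    # Factor of 2: one key vector and one value vector per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def decode_tokens_per_s(weight_bytes, context_len, batch_size,
                        kv_per_token, bandwidth_bytes_per_s):
    """Tokens per second if the step is purely bandwidth-limited.

    Weights are read once per step; each sequence reads its own KV cache.
    """
    kv_bytes = batch_size * context_len * kv_per_token
    step_seconds = (weight_bytes + kv_bytes) / bandwidth_bytes_per_s
    return batch_size / step_seconds


BANDWIDTH = 3.35e12
# Assumed 7B-class shape with an fp16 KV cache (illustrative only).
PER_TOKEN = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128,
                               bytes_per_value=2)

baseline = decode_tokens_per_s(14e9, 4_096, 1, PER_TOKEN, BANDWIDTH)
quantized = decode_tokens_per_s(3.5e9, 4_096, 1, PER_TOKEN, BANDWIDTH)
batched = decode_tokens_per_s(14e9, 4_096, 8, PER_TOKEN, BANDWIDTH)
long_context = decode_tokens_per_s(14e9, 32_768, 1, PER_TOKEN, BANDWIDTH)

for name, rate in [("baseline", baseline), ("4-bit weights", quantized),
                   ("batch 8", batched), ("32k context", long_context)]:
    print(f"{name}: ~{rate:.0f} tokens/s")
```

The ordering matches the mental model: quantization and batching raise throughput, and longer context lowers it, with no FLOP count appearing anywhere in the estimate.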
Model weights : HBM :: context window : SRAM :: external storage (files, memory banks) : host RAM. Context engineering is kernel optimization for attention — stage exactly what the next "operation" needs into the fast, scarce tier. Lets you reason about: doc 09 in advance.
07 Common Misconceptions
"More TFLOPs = faster inference." For decode, almost irrelevant past a point — the ceiling is bandwidth ÷ model bytes. This is why an H100 doesn't decode batch-1 dramatically faster than an A100 despite ~3× the FLOPs.
"The GPU is busy when utilization shows 100%." "Utilization" often means "a kernel was resident," not "math units were fed." A memory-bound kernel shows 100% while ALUs starve. MFU (model FLOPs utilization) is the honest metric — and for decode it's often under 10%.
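To see why decode MFU is so low, plug the batch-1 ceiling from section 03 into the definition. This sketch assumes roughly 2 FLOPs per parameter per generated token and the rounded H100 peak used throughout; the function name is illustrative.

```python
def model_flops_utilization(tokens_per_s, params, peak_flops,
                            flops_per_param_per_token=2.0):
    """Fraction of peak FLOPs spent on the model's own math."""
    achieved_flops = tokens_per_s * params * flops_per_param_per_token
    return achieved_flops / peak_flops


# 7B model decoding at the ~240 tokens/s batch-1 ceiling on an H100.
mfu = model_flops_utilization(tokens_per_s=240, params=7e9,
                              peak_flops=1_000e12)
print(f"MFU: {mfu:.2%}")
```

Even at the theoretical bandwidth ceiling, the math units are busy for well under 1% of the time in this configuration, while a utilization counter could still read 100%.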
"80 GB VRAM means I can run an 80 GB model fine." Weights must fit plus KV cache, activations, and framework overhead — and fitting says nothing about speed. A model that barely fits decodes slowly (all 80 GB through the wall, every token) and leaves no room for batching.
"FlashAttention approximates attention." It is numerically exact. Only the schedule of computation changed — which is precisely why it's the cleanest proof that data movement, not math, was the cost.