GPU Memory Hierarchy — The Bandwidth Wall

01 The Big Picture

An H100 can do ~1,000 trillion fp16 operations per second, but can only pull ~3.35 TB/s from its main memory. Divide those numbers and you get the most important ratio in AI systems.

That ratio (~300 FLOPs available per byte moved) means: unless your algorithm does hundreds of math operations for every byte it touches, the math units sit idle, starving, waiting on memory. Most of inference — especially decode — does far fewer. The GPU you pay for is, much of the time, an expensive memory pump. Understanding the memory hierarchy is therefore not a hardware curiosity: it is the explanation underneath docs 06 and 07, and the reason the software techniques in doc 09 work at all.
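As a quick check of that ratio, here is a minimal sketch using the rounded H100 figures quoted above. The constant and function names are illustrative, not from any library.

```python
# Rounded H100 figures from the text above; names are illustrative.
PEAK_FP16_FLOPS = 1_000e12      # ~1,000 trillion fp16 operations per second
HBM_BYTES_PER_S = 3.35e12       # ~3.35 TB/s from main memory


def machine_balance(peak_flops: float, bytes_per_s: float) -> float:
    """FLOPs the chip can execute in the time it takes to move one byte."""
    return peak_flops / bytes_per_s


balance = machine_balance(PEAK_FP16_FLOPS, HBM_BYTES_PER_S)
print(f"{balance:.0f} FLOPs per byte")
```

Any kernel that does fewer FLOPs per byte than this figure leaves the math units waiting on memory.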

02 What: The Hierarchy, Top to Bottom

Like a CPU, a GPU has a pyramid of memories: each level is faster, and generally much smaller, than the one below. The numbers (H100-class, rounded for memorability):

Registers

~256 KB/SM · effectively instant · private to a thread

SRAM / Shared Mem

~228 KB/SM (~30 MB total across 132 SMs) · ~19 TB/s class · shared by a thread block, programmer-managed

L2 Cache

~50 MB · ~10 TB/s class · shared by all SMs

HBM (VRAM)

80 GB · ~3.35 TB/s · where weights + KV cache live — the wall

Host RAM ↔ PCIe / NVLink

TBs · PCIe ~64 GB/s, NVLink ~900 GB/s · crossing this boundary over PCIe is ~50× slower than HBM (~4× over NVLink)

The crucial difference from a CPU: the SRAM level is explicitly programmed, not an automatic cache. A CUDA kernel chooses what to stage there. That choice — what to keep close — is the entire game, and it is exactly the same game context engineering plays one level up.
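To make the gaps between levels concrete, the sketch below computes how long it takes to move 1 GB at each bandwidth listed above. It assumes every level sustains its rounded peak figure, which real workloads rarely do; the dictionary and function names are illustrative.

```python
# Bandwidths in GB/s, taken from the rounded figures listed above.
BANDWIDTH_GB_S = {
    "SRAM / shared memory": 19_000,
    "L2 cache": 10_000,
    "HBM": 3_350,
    "NVLink": 900,
    "PCIe": 64,
}


def stream_time_ms(gigabytes: float, gb_per_s: float) -> float:
    """Milliseconds needed to move `gigabytes` at a sustained `gb_per_s`."""
    return gigabytes / gb_per_s * 1_000


hbm = BANDWIDTH_GB_S["HBM"]
for level, bw in BANDWIDTH_GB_S.items():
    # relative > 1 means slower than HBM, < 1 means faster.
    relative = hbm / bw
    print(f"{level:22s} {stream_time_ms(1.0, bw):8.3f} ms per GB   {relative:5.1f}x HBM time")
```

The PCIe row is the one to remember: anything that has to cross it mid-computation pays roughly fifty times the HBM cost per byte.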

03 Why the Wall Exists

Compute and memory scale on different physics. Packing more multiply units onto a die is "easy" — transistor counts kept growing. But HBM bandwidth is limited by pins, signal integrity, and the physical distance bits travel; it grows slowly and costs enormous power. Result: over the last decade, GPU FLOPs grew ~60×, HBM bandwidth ~8×. The gap widens every generation.

Why decode hits the wall (the doc-06 claim, proven)

One decode step for a 7B fp16 model must read all 14 GB of weights from HBM (plus the KV cache) to produce one token. Each 2-byte parameter contributes roughly 2 FLOPs (one multiply, one add), so the arithmetic intensity is ~1 FLOP per byte, versus the ~300 the hardware needs to stay busy. The math units finish their work in microseconds, then wait. Decode speed ≈ bandwidth ÷ bytes-to-read: 3,350 GB/s ÷ 14 GB ≈ 240 tokens/s, the theoretical ceiling for batch-1. No amount of extra FLOPs raises it.
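The same arithmetic as a small sketch. It is a back-of-envelope model only: it ignores KV-cache reads and kernel overheads, so real throughput lands below this bound. Function names are illustrative.

```python
def decode_ceiling_tokens_per_s(params: float, bytes_per_param: float,
                                hbm_gb_per_s: float) -> float:
    """Upper bound on batch-1 decode speed when every weight is read once per token.

    Ignores KV-cache reads and kernel overheads, so real throughput is lower.
    """
    weight_gb = params * bytes_per_param / 1e9
    return hbm_gb_per_s / weight_gb


def arithmetic_intensity(flops_per_param: float, bytes_per_param: float) -> float:
    """FLOPs performed per byte of weights read."""
    return flops_per_param / bytes_per_param


ceiling = decode_ceiling_tokens_per_s(7e9, 2, 3_350)
print(f"weights: {7e9 * 2 / 1e9:.0f} GB")
print(f"intensity: {arithmetic_intensity(2, 2):.0f} FLOP/byte")
print(f"ceiling: {ceiling:.0f} tokens/s")
```

Note that peak FLOPs never appears in the ceiling calculation, which is the whole point.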

This is also why batching (doc 06) works: 32 users' decode steps read the weights from HBM once and reuse them 32 times, so arithmetic intensity is multiplied by the batch size. And it's why quantization speeds up inference even when compute doesn't change: 4-bit weights are 4× fewer bytes through the wall than fp16.
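A rough sketch of both effects, under simplifying assumptions: only weight traffic is counted (each sequence's KV cache is not shared across the batch and is left out), and quantized weights are assumed to be read at exactly their stored size. Function names are illustrative.

```python
def batched_intensity(batch_size: int, bits_per_weight: int = 16) -> float:
    """FLOPs per weight byte when one weight read serves `batch_size` sequences.

    Each sequence does ~2 FLOPs per parameter (multiply + add); the weight
    bytes are read from HBM once for the whole batch.
    """
    bytes_per_param = bits_per_weight / 8
    return 2 * batch_size / bytes_per_param


def weight_gb(params: float, bits_per_weight: int) -> float:
    """Size of the weights in GB at a given precision."""
    return params * bits_per_weight / 8 / 1e9


for batch in (1, 32):
    print(f"batch {batch:2d}: ~{batched_intensity(batch):.0f} FLOP/byte")

for bits in (16, 8, 4):
    gb = weight_gb(7e9, bits)
    print(f"{bits:2d}-bit: {gb:5.1f} GB, batch-1 ceiling ~{3_350 / gb:.0f} tokens/s")
```

Batching raises the numerator (useful work per byte); quantization shrinks the denominator (bytes per step). Both attack the same ratio.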

04 Arithmetic Intensity — The One Number to Compute

For any kernel: AI = FLOPs / bytes moved. Compare it against the hardware's ratio (~300 for H100 fp16). Below → memory-bound; above → compute-bound. This is the roofline model, and it sorts all of inference instantly:

Operation	Arithmetic intensity	Bound by
Prefill (long prompt, big matmuls)	High: O(sequence length)	Compute
Decode (batch 1)	~1–2	Memory
Decode (batch 32)	~32–64	Memory (less severely)
Naive attention (long context)	Low: writes and reads the n×n score matrix to HBM	Memory
FlashAttention	High: n×n scores are computed tile by tile in SRAM and never written to HBM	Compute
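The roofline comparison described above fits in a few lines. This is a minimal sketch using the rounded H100 figures as defaults; the example intensities are illustrative, and the prefill value in particular is a made-up number chosen only to sit above the ridge.

```python
def attainable_tflops(intensity: float, peak_tflops: float = 1_000.0,
                      bandwidth_tb_s: float = 3.35) -> float:
    """Roofline: throughput is capped by compute or by bandwidth x intensity."""
    return min(peak_tflops, intensity * bandwidth_tb_s)


def bound_by(intensity: float, peak_tflops: float = 1_000.0,
             bandwidth_tb_s: float = 3.35) -> str:
    """Classify a kernel by comparing its intensity with the machine ratio."""
    ridge = peak_tflops / bandwidth_tb_s   # ~300 FLOPs/byte for these inputs
    return "compute" if intensity >= ridge else "memory"


# Illustrative intensities only; the prefill value is a made-up example.
for name, ai in [("decode, batch 1", 1), ("decode, batch 32", 32), ("prefill", 2_000)]:
    print(f"{name:18s} AI={ai:5d}  {attainable_tflops(ai):7.1f} TFLOP/s  {bound_by(ai)}-bound")
```

Below the ridge, attainable throughput scales linearly with intensity, which is why going from batch 1 to batch 32 helps so much while still leaving the kernel memory-bound.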

🧠

DSA connection: this is cache-aware algorithm design. Blocked matrix multiply, B-trees vs binary trees, external-sort runs — all the same principle: restructure the computation so each datum, once fetched from slow storage, is used as many times as possible before eviction. The roofline is the I/O model of computation wearing a GPU costume.
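Blocked matrix multiply is the easiest of these to see in code. The sketch below is a toy model in plain Python, not a GPU kernel: small local lists stand in for SRAM, and a counter tracks how many elements are fetched from "slow" memory. It assumes square matrices and a tile size that divides n.

```python
def matmul_tiled(a, b, tile):
    """Blocked n x n matrix multiply that counts loads from 'slow' memory.

    Each tile of A and B is copied into a small local buffer (standing in
    for SRAM) once, then reused for tile * tile * tile multiply-adds.
    Assumes square matrices and that `tile` divides n.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    slow_loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Stage one tile of A and one tile of B in the fast buffer.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                slow_loads += 2 * tile * tile
                # All arithmetic below touches only the staged tiles.
                for i in range(tile):
                    for j in range(tile):
                        acc = 0.0
                        for k in range(tile):
                            acc += a_tile[i][k] * b_tile[k][j]
                        c[i0 + i][j0 + j] += acc
    return c, slow_loads


n = 8
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i * j % 5) for j in range(n)] for i in range(n)]

# tile=1 degenerates to the naive algorithm: every operand is fetched per use.
c_naive, loads_naive = matmul_tiled(a, b, tile=1)
c_tiled, loads_tiled = matmul_tiled(a, b, tile=4)
print(loads_naive, loads_tiled)
```

The FLOP count is identical in both runs; only the slow-memory traffic changes, falling by a factor equal to the tile size. That is the arithmetic-intensity gain tiling buys.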

05 FlashAttention — The Hierarchy Exploited

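The most famous "memory trick" in AI: same math, same result, 2–4× faster, purely by choosing where intermediate values live. Naive attention writes the full n×n score matrix to HBM and reads it back for the softmax and again for the weighted sum. FlashAttention instead loads tiles of Q, K, and V into SRAM, computes each tile's scores and a running softmax there, and writes only the final output to HBM.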

The deep lesson: FlashAttention does more FLOPs than naive attention (it recomputes softmax corrections), yet runs much faster. On modern hardware, recomputing can be cheaper than remembering — compute is the abundant currency, bytes are the scarce one. The same inversion shows up in training as gradient checkpointing, and in context engineering as "re-derive it from a tool call instead of carrying it in context."
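The running-softmax idea can be sketched in plain Python. This is a toy illustration of the schedule only, not the actual FlashAttention kernel: it handles a single head, processes one query row at a time, and says nothing about speed. It shows that visiting keys and values block by block, with a rescaling correction, reproduces the naive result to floating-point rounding. All function names are illustrative.

```python
import math


def naive_attention(q, k, v):
    """Reference attention that materializes a full row of scores per query."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, block=2):
    """Same result, but keys/values are visited in blocks with a running softmax.

    Only one block of scores exists at a time; earlier partial sums are
    rescaled whenever a new running maximum appears (the softmax correction).
    """
    d = len(q[0])
    dv = len(v[0])
    out = []
    for qi in q:
        running_max = float("-inf")
        running_sum = 0.0
        acc = [0.0] * dv
        for start in range(0, len(k), block):
            k_blk = k[start:start + block]
            v_blk = v[start:start + block]
            scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k_blk]
            new_max = max(running_max, max(scores))
            correction = math.exp(running_max - new_max)  # 0.0 on the first block
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [a * correction + sum(w * vj[c] for w, vj in zip(weights, v_blk))
                   for c, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out


q = [[0.1, 0.2], [0.3, -0.4], [1.0, 0.5], [-0.7, 0.9]]
k = [[0.5, -0.1], [0.2, 0.8], [-0.3, 0.4], [0.9, 0.9]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

ref = naive_attention(q, k, v)
fast = tiled_attention(q, k, v, block=2)
max_diff = max(abs(x - y) for r, f in zip(ref, fast) for x, y in zip(r, f))
print(f"max difference: {max_diff:.2e}")
```

The tiled version does extra multiplications for the corrections, yet never needs more than one block of scores at a time. On a GPU that block is what stays in SRAM.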

06 Mental Models

The kitchen pyramid

Registers = your hands; SRAM = the counter; HBM = the pantry; host RAM = the supermarket. A good cook (kernel) plans so ingredients reach the counter once and get fully used. Lets you reason about: why "tiling" appears in every fast kernel; why PCIe transfers (supermarket runs) are catastrophic mid-recipe.

Where it breaks: a counter holds ingredients passively; SRAM must be explicitly loaded and freed by the program.

Bandwidth is a budget, FLOPs are free

Estimate any inference change by bytes moved, not operations performed. Quantization: fewer bytes → faster. Bigger batch: same bytes, more useful work → faster. Longer context: more KV bytes per step → slower. Lets you reason about: nearly every performance question in this series without benchmarks.

Breaks for genuinely compute-bound phases (long-prompt prefill, training) where FLOPs do bind.
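Here is the "longer context is slower" estimate done by bytes alone. The model shape is a hypothetical 7B-class configuration (32 layers, 32 KV heads of dimension 128, fp16) chosen for illustration, not taken from this series, and the estimate assumes each decode step reads all weights plus the entire KV cache once.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of K and V stored for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def decode_ceiling(weight_bytes: float, kv_bytes: float,
                   hbm_bytes_per_s: float = 3.35e12) -> float:
    """Batch-1 tokens/s if each step reads all weights plus the whole KV cache."""
    return hbm_bytes_per_s / (weight_bytes + kv_bytes)


# Hypothetical 7B-class shape (not taken from the text): 32 layers,
# 32 KV heads of dimension 128, fp16 values.
per_token = kv_cache_bytes_per_token(layers=32, kv_heads=32, head_dim=128)
weights = 7e9 * 2

for context in (1_000, 8_000, 32_000):
    kv = per_token * context
    print(f"context {context:6d}: KV {kv / 1e9:5.2f} GB, "
          f"ceiling ~{decode_ceiling(weights, kv):.0f} tokens/s")
```

Under these assumptions the KV cache overtakes the weights as the dominant traffic somewhere past 25,000 tokens of context, with no change in the model itself.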

Same pyramid, one level up

Model weights map to HBM, the context window to SRAM, and external storage (files, memory banks) to host RAM. Context engineering is kernel optimization for attention: stage exactly what the next "operation" needs into the fast, scarce tier. Lets you reason about: doc 09 in advance.

An analogy of structure, not mechanism — the context window's scarcity is attention quality and cost, not silicon pins.

07 Common Misconceptions

"More TFLOPs = faster inference." For decode, almost irrelevant past a point — the ceiling is bandwidth ÷ model bytes. This is why an H100 doesn't decode batch-1 dramatically faster than an A100 despite ~3× the FLOPs.

"The GPU is busy when utilization shows 100%." "Utilization" often means "a kernel was resident," not "math units were fed." A memory-bound kernel shows 100% while ALUs starve. MFU (model FLOPs utilization) is the honest metric — and for decode it's often under 10%.

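A quick estimate of decode MFU, assuming roughly 2 FLOPs per parameter per generated token and the rounded H100 peak used throughout. The batch-32 throughput is a hypothetical figure for illustration, not a measurement.

```python
def decode_mfu(params: float, tokens_per_s: float, peak_flops: float = 1_000e12) -> float:
    """Model FLOPs utilization, assuming ~2 FLOPs per parameter per token."""
    return 2 * params * tokens_per_s / peak_flops


# Batch-1 decode of a 7B model at the ~240 tokens/s bandwidth ceiling.
mfu_batch1 = decode_mfu(7e9, 240)
# Hypothetical: 32 sequences each advancing at 200 tokens/s.
mfu_batch32 = decode_mfu(7e9, 32 * 200)
print(f"batch 1:  {mfu_batch1:.2%}")
print(f"batch 32: {mfu_batch32:.2%}")
```

Even at the theoretical bandwidth ceiling, batch-1 decode uses well under 1% of the available math, whatever the utilization gauge reports.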
"80 GB VRAM means I can run an 80 GB model fine." Weights must fit plus KV cache, activations, and framework overhead — and fitting says nothing about speed. A model that barely fits decodes slowly (all 80 GB through the wall, every token) and leaves no room for batching.

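A back-of-envelope check of fit versus speed on an 80 GB card. The per-sequence KV-cache size and the overhead figure are hypothetical placeholders; real values depend on the model, context length, and framework.

```python
def max_concurrent_sequences(vram_gb: float, weight_gb: float,
                             kv_gb_per_sequence: float, overhead_gb: float) -> int:
    """Sequences whose KV cache fits in what is left after weights and overhead."""
    free_gb = vram_gb - weight_gb - overhead_gb
    if free_gb <= 0:
        return 0
    return int(free_gb // kv_gb_per_sequence)


def batch1_ceiling(weight_gb: float, hbm_gb_per_s: float = 3_350) -> float:
    """Tokens/s upper bound when all weights are read once per token."""
    return hbm_gb_per_s / weight_gb


# Hypothetical workloads: 2 GB of KV cache per sequence, 4 GB of overhead.
for label, weight_gb in [("14 GB model", 14.0), ("76 GB model", 76.0)]:
    slots = max_concurrent_sequences(80.0, weight_gb, 2.0, 4.0)
    print(f"{label}: room for {slots} sequences, "
          f"batch-1 ceiling ~{batch1_ceiling(weight_gb):.0f} tokens/s")
```

With these placeholder numbers the larger model leaves no room for even one sequence's KV cache, and its per-token ceiling is several times lower before batching is even considered.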
"FlashAttention approximates attention." It is numerically exact. Only the schedule of computation changed — which is precisely why it's the cleanest proof that data movement, not math, was the cost.

🗺️

Next: you now have the full hardware picture, made up of two-phase inference (06), the token economy it creates (07), and the bandwidth wall underneath (08). Doc 09 closes the loop: how skills, rules, memory banks, and subagents are software's answer to exactly these constraints.