From sequential instruction execution to massively parallel matrix operations — why AI needed an entirely different kind of processor.
Why does training a neural network on a CPU, whose cores each run roughly 10x faster than a single GPU thread, still take on the order of 100x longer? The answer is about parallelism geometry, not raw speed.
Intelligence at scale is fundamentally a linear algebra problem. Billions of multiply-accumulate operations that are embarrassingly parallel. CPUs were built for serial thinking. GPUs were built for parallel geometry — accidentally perfect for AI.
Instruction coding (CPU era): one instruction executes, then the next. Control flow, branching, sequential logic. A brilliant single-threaded mind.
Intelligence coding (GPU era): thousands of identical operations execute simultaneously on different data. No branching. Brute-force mathematical parallelism. A massive parallel workforce doing identical work.
Every CPU instruction follows the same cycle: fetch, decode, execute, write back. This is the atomic unit of computation.
CPU dynamically reorders instructions to avoid stalls. If instruction 5 doesn't depend on instruction 4, execute them in parallel. A complex scheduler tracks ~200 in-flight instructions. Massive silicon cost for serial-looking code.
Speculative execution: guess the branch outcome, execute ahead. Modern CPUs hit ~99% prediction accuracy. On miss: flush pipeline, ~15 cycle penalty. Essential for code with lots of if/else.
L1 (~32KB, 4 cycle), L2 (~256KB, 12 cycle), L3 (~32MB, 40 cycle), RAM (200 cycle). Huge transistor budget on caches because a single thread must find its data fast.
AVX-512: 512-bit wide registers, 16 floats in one instruction. CPUs added vector units reactively — GPU had vectors as the primary design from day 1.
A matrix multiply of size [1024×1024] × [1024×1024] requires about 1 billion multiply-add operations (1024³), or roughly 2 billion FLOPs. Each of the ~1 million output elements can be computed independently of the others, so the work is almost perfectly parallel. A CPU with 16 cores and 8-wide AVX runs 128 FLOPs/cycle. At 5GHz, that's 640 GFLOPS. An H100 GPU delivers 67 TFLOPS FP32, about 100× more. The CPU's OOO, branch prediction, and cache, all that brilliant complexity, contribute almost nothing to this workload.
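The arithmetic behind these figures can be checked in a few lines. This is a back-of-envelope sketch that counts one multiply plus one add as two FLOPs and treats the quoted rates as theoretical peaks; real kernels reach only a fraction of peak.

```python
# Back-of-envelope arithmetic for a square matrix multiply.
n = 1024

# Each of the n*n outputs is a dot product of length n.
multiply_adds = n * n * n
flops = 2 * multiply_adds  # one multiply and one add per step

# Peak rates quoted in the text (theoretical upper bounds).
cpu_flops_per_cycle = 16 * 8            # 16 cores x 8-wide AVX
cpu_peak = cpu_flops_per_cycle * 5e9    # at 5 GHz
gpu_peak = 67e12                        # H100 FP32

print(f"multiply-adds:    {multiply_adds:,}")
print(f"FLOPs:            {flops:,}")
print(f"CPU peak:         {cpu_peak / 1e9:.0f} GFLOPS")
print(f"GPU/CPU ratio:    {gpu_peak / cpu_peak:.0f}x")
print(f"CPU time at peak: {flops / cpu_peak * 1e3:.2f} ms")
print(f"GPU time at peak: {flops / gpu_peak * 1e6:.1f} us")
```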
As a rough rule of thumb, a CPU dedicates ~80% of its transistors to control logic and caches (OOO, branch prediction, cache hierarchy), while a GPU dedicates ~80% of its transistors to arithmetic units. One philosophy: be great at any single task. The other: be great at doing the same simple task 10,000 times simultaneously.
SIMD (CPU): one instruction, multiple data items, one thread controls it.
SIMT (GPU): one instruction, multiple data items, each data item has its own thread with its own registers. This lets GPU threads diverge and reconverge, at the cost of serializing the divergent paths within a warp.
Think of a warp as 32 soldiers who must all do the same move at the same time. If 16 need to go left and 16 need to go right, they first all go left (16 wait idle), then all go right (16 wait idle). This is why GPUs hate branchy code.
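The soldier analogy can be turned into a toy cost model. The sketch below is not how a GPU scheduler is implemented; it assumes both sides of the branch cost one pass each and only counts how many lockstep passes a 32-lane warp needs. `run_warp` is a hypothetical helper name.

```python
WARP_SIZE = 32

def run_warp(take_left):
    """Count lockstep passes for one warp executing an if/else.

    take_left[i] is True if lane i takes the 'left' branch.
    Returns (passes, utilization).
    """
    assert len(take_left) == WARP_SIZE
    left = sum(take_left)
    right = WARP_SIZE - left
    # Each branch side with at least one active lane costs a full pass.
    passes = (1 if left else 0) + (1 if right else 0)
    # Every lane does useful work in exactly one pass.
    utilization = WARP_SIZE / (passes * WARP_SIZE)
    return passes, utilization

uniform = [True] * WARP_SIZE                 # all lanes agree
split = [i < 16 for i in range(WARP_SIZE)]   # 16 left, 16 right

print(run_warp(uniform))
print(run_warp(split))
```

Under these assumptions a uniform warp needs one pass at full utilization, while the 16/16 split needs two passes with half the lanes idle in each.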
GPU memory latency is ~500 cycles. A CPU thread would stall; a GPU switches to another ready warp essentially for free, because every resident warp keeps its registers on chip.
A CUDA core is essentially a pipelined FPU. It executes floating-point arithmetic at high throughput by accepting a new operand pair every clock cycle (fully pipelined, even if latency is multiple cycles).
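GPT-3 training: ~314 ZettaFLOPs total (3.14 × 10²³ FLOPs). CUDA cores at FP32 on H100: 67 TFLOPS. Tensor Cores at BF16: up to 1,979 TFLOPS, ~30× faster. That figure is the peak with structured sparsity; dense BF16 is roughly half of it, still ~15× the FP32 rate. CUDA cores handle general work; Tensor Cores are why LLM training is economically feasible.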
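To make the scale concrete, the quoted totals can be converted into single-GPU years. This is a rough sketch using the peak rates from the text; the `utilization` parameter is an assumption added for illustration, since real training runs sustain well below peak.

```python
TOTAL_FLOPS = 314e21                 # ~314 zettaFLOPs, from the text
SECONDS_PER_YEAR = 365 * 24 * 3600

def gpu_years(peak_flops_per_s, utilization=1.0):
    """Years one GPU would need at the given sustained fraction of peak."""
    sustained = peak_flops_per_s * utilization
    return TOTAL_FLOPS / sustained / SECONDS_PER_YEAR

fp32_years = gpu_years(67e12)      # CUDA cores, FP32 peak
bf16_years = gpu_years(1979e12)    # Tensor Cores, BF16 peak

print(f"FP32 CUDA cores:   {fp32_years:6.1f} GPU-years")
print(f"BF16 Tensor Cores: {bf16_years:6.1f} GPU-years")
```

At peak rates this works out to roughly a century and a half on FP32 CUDA cores versus about five years on BF16 Tensor Cores for a single GPU, which is why training is spread over thousands of them.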
Tensor Cores are the most important hardware innovation for AI. They implement fused matrix-multiply-accumulate (MMA) — the exact operation at the heart of every neural network layer.
Every linear layer is: output = weight_matrix × input + bias
A transformer attention head: Attention(Q,K,V) = softmax(QK^T/√d)V
Both reduce to batched matrix multiplications. Tensor Cores execute these in hardware, not software loops.
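A minimal pure-Python sketch shows how both formulas reduce to matrix products. It uses nested lists and a triple-loop `matmul` purely for readability; on a GPU these same products are what Tensor Cores execute. All function names are illustrative.

```python
import math

def matmul(a, b):
    """Plain triple-loop matrix product of nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        mx = max(row)
        e = [math.exp(x - mx) for x in row]
        s = sum(e)
        out.append([x / s for x in e])
    return out

def linear(w, x, bias):
    """output = W x + b, with x and bias given as flat lists."""
    col = [[v] for v in x]
    return [r[0] + b for r, b in zip(matmul(w, col), bias)]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V: two matmuls around a softmax."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)

print(linear([[1, 2], [3, 4]], [1, 1], [0.5, -0.5]))
print(attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 4.0]]))
```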
| Precision | Peak TFLOPS (H100 Tensor Cores; FP8, FP16/BF16 and TF32 with sparsity) | Use Case |
|---|---|---|
| FP8 | 3,958 | Fast inference |
| FP16/BF16 | 1,979 | Training |
| TF32 | 989 | Accurate training |
| FP64 | 67 | Scientific |
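Neural networks are surprisingly tolerant of reduced numerical precision. Training in BF16 instead of FP32 raises peak Tensor Core throughput by up to ~30× on the figures above, typically with very small accuracy loss (often quoted as <0.1%). Real end-to-end speedups are smaller, because memory bandwidth and communication do not speed up with the arithmetic. This works because gradients are noisy by nature: you're doing stochastic gradient descent, not exact computation. Hardware precision and algorithm tolerance co-evolved.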
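What reduced precision means can be seen by rounding ordinary floats to BF16. The sketch below emulates BF16 with the standard library by keeping the top 16 bits of a float32 (1 sign, 8 exponent, 7 mantissa bits) with round-to-nearest-even. `to_bf16` is a hypothetical helper, and NaN and infinity handling is deliberately omitted.

```python
import struct

def to_bf16(x):
    """Round a Python float to the nearest bfloat16 value, returned as a float.

    bfloat16 is the top 16 bits of an IEEE-754 float32.
    NaN and infinity inputs are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, then drop the low 16 bits.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

for value in (3.14159265, 0.1, 1234.5678, 1e-20, 1e38):
    rounded = to_bf16(value)
    rel_err = abs(rounded - value) / abs(value)
    print(f"{value:>14.8g} -> {rounded:<14.8g} rel. error {rel_err:.2e}")

# A small update to a weight of 1.0 can vanish entirely in BF16.
print(to_bf16(1.0 + 2 ** -9))
```

BF16 keeps the full FP32 exponent range but only about two to three decimal digits, which is one reason mixed-precision training commonly keeps a higher-precision copy of the weights for accumulating small updates.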
RT Cores implement hardware-accelerated ray tracing — specifically, the computationally expensive Bounding Volume Hierarchy (BVH) traversal that determines which triangle a ray hits.
RT renders photorealistic training data for computer vision models. Cheaper and more varied than real-world data collection.
RT Cores accelerate NeRF rendering — AI-generated 3D scenes represented as continuous volumetric radiance functions.
Data moves at very different speeds across the memory hierarchy. The on-chip SRAM↔core path is the fastest (~19 TB/s). The HBM↔SM path is slower (~3 TB/s). The PCIe path from the CPU is a trickle (~64 GB/s), which is why you batch work and keep weights resident on the GPU instead of shuttling them back and forth.
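A quick calculation shows why weights should stay on the GPU. The sketch assumes a hypothetical 7-billion-parameter model stored in BF16 and uses the approximate bandwidths quoted above; it ignores latency and protocol overhead.

```python
GB = 1e9
TB = 1e12

# Hypothetical model: 7 billion parameters at 2 bytes each (BF16).
weight_bytes = 7e9 * 2

# Approximate bandwidths from the text, in bytes per second.
paths = {
    "PCIe (CPU -> GPU)": 64 * GB,
    "HBM -> SM": 3 * TB,
}

def transfer_seconds(num_bytes, bytes_per_second):
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return num_bytes / bytes_per_second

for name, bandwidth in paths.items():
    ms = transfer_seconds(weight_bytes, bandwidth) * 1e3
    print(f"{name:18s} {ms:8.2f} ms to move all weights once")
```

Under these assumptions, streaming the weights over PCIe takes a couple of hundred milliseconds, while reading them from HBM takes a few milliseconds.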
Every AI kernel is either compute-bound (math is the bottleneck) or memory-bandwidth bound (data movement is the bottleneck).
Arithmetic Intensity = FLOPs / Bytes loaded from memory
Matrix multiply (large): ~125 FLOPs/byte → compute-bound → Tensor Cores help massively
Attention softmax (small batch): ~1 FLOP/byte → memory-bound → Flash Attention addresses this by keeping intermediate scores in SMEM and never writing the full attention matrix to HBM
This is why Flash Attention is not about new math — it's about data movement optimization.
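The compute-bound versus memory-bound distinction is usually formalised as a roofline model. The sketch below is a simplified version: it uses the FP32 peak and HBM bandwidth quoted earlier, and `matmul_intensity` assumes ideal reuse where each matrix crosses the memory bus exactly once. With a higher Tensor Core peak the crossover point moves to a higher intensity.

```python
def attainable_tflops(intensity, peak_tflops, bandwidth_tbs):
    """Roofline model: the lower of the compute roof and the memory roof.

    intensity is in FLOPs per byte, bandwidth in TB/s, result in TFLOPS.
    """
    return min(peak_tflops, intensity * bandwidth_tbs)

def matmul_intensity(n, bytes_per_elem):
    """FLOPs per byte for an n x n x n matmul under ideal data reuse."""
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * bytes_per_elem  # read A, read B, write C
    return flops / bytes_moved

PEAK = 67.0   # TFLOPS, H100 FP32 figure from the text
BW = 3.0      # TB/s, approximate HBM bandwidth from the text
ridge = PEAK / BW  # intensity where the two roofs meet

for name, intensity in [("large matmul", 125.0), ("softmax", 1.0)]:
    bound = "compute" if intensity >= ridge else "memory"
    rate = attainable_tflops(intensity, PEAK, BW)
    print(f"{name:12s} {intensity:6.1f} FLOPs/byte -> {rate:5.1f} TFLOPS ({bound}-bound)")

print(f"ridge point: {ridge:.1f} FLOPs/byte")
print(f"1024^3 FP32 matmul intensity: {matmul_intensity(1024, 4):.1f} FLOPs/byte")
```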
CPU: optimized to minimize latency for any single task. One thread of execution runs as fast as physically possible.
GPU: optimized to maximize throughput across thousands of identical tasks. Any single thread is slow, but 10,000 run in parallel.
Neural network training doesn't have a "single task" — it has billions of identical floating-point multiplications with no inter-dependencies. GPU wins by design.
CPU is a Ferrari: fastest possible single-passenger journey, adaptive routing, handles any road. GPU is 10,000 buses: each bus is slow, but they all depart simultaneously. If you're moving one person, take the Ferrari. If you're moving 300,000 people doing the same trip — you need buses.
CPU is the architect: designs the building, makes complex decisions, handles dependencies. GPU is the construction crew of 10,000 workers all doing the same task (laying bricks) simultaneously. The architect (CPU) orchestrates; the crew (GPU) executes in parallel.
CPU thinks deep: chain of reasoning, step A → B → C → D, each step depends on the previous. GPU thinks wide: A₁, A₂, A₃...A₁₀,₀₀₀ all simultaneously, no interdependence. Intelligence requires deep thinking (CPU) AND wide pattern recognition (GPU).
Instruction coding: you write the rules. CPU executes them deterministically. Intelligence coding: you show examples. GPU finds the rules via optimization over data. The shift isn't just hardware — it's a different theory of what computation is.
CPU and GPU aren't competitors — they're co-evolved specialists. Every AI system uses both: CPU for the complex, serial, branchy orchestration (the "thinking about what to compute"), GPU for the massive parallel math (the "actually computing it"). The ratio shifts as AI models scale: a larger fraction of wall-clock time is GPU compute, which is why GPU spending dominates AI infrastructure cost.
The KV cache is the single most important inference optimization. Without it, generating each token would require recomputing keys, values and attention over the entire context from scratch: O(L²) work per token instead of O(L).
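A toy single-head decoder makes the saving visible. This is a sketch, not a real model: `project` stands in for learned K and V projections, the query is the token vector itself, and the counters only track how many token positions get projected.

```python
import math

def project(x, scale):
    """Stand-in for a learned K or V projection (hypothetical)."""
    return [scale * v for v in x]

def attend(q, keys, values):
    """Softmax attention for one query over all cached keys and values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(len(values[0]))]

def decode_without_cache(tokens):
    """Re-project K and V for the whole context at every step."""
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        context = tokens[:t]
        keys = [project(x, 0.5) for x in context]
        values = [project(x, 2.0) for x in context]
        projections += len(context)
        outputs.append(attend(context[-1], keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    """Project K and V once per token and append them to the cache."""
    keys, values = [], []
    outputs, projections = [], 0
    for x in tokens:
        keys.append(project(x, 0.5))
        values.append(project(x, 2.0))
        projections += 1
        outputs.append(attend(x, keys, values))
    return outputs, projections

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow, slow_cost = decode_without_cache(tokens)
fast, fast_cost = decode_with_cache(tokens)
print(slow == fast, slow_cost, fast_cost)
```

Both paths produce the same outputs; the cached version trades memory for not repeating work, which is why KV cache capacity becomes the limiting resource at long context lengths.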
Anthropic's prompt caching maps conceptually onto KV cache reuse. When you send the same system prompt prefix repeatedly, the provider can keep the KV pairs for that prefix and skip recomputing attention for those tokens on subsequent calls, giving up to ~90% cost reduction and ~85% latency reduction for the cached portion. Where the cache physically lives (for example in GPU HBM on the same server) is a provider implementation detail. This is why long, reused system prompts are economical despite their token count.
Standard attention is memory-bandwidth bound, not compute-bound. Flash Attention is not new math — it's the same result computed via a tiling strategy that keeps data in SRAM and never writes the attention matrix to HBM.
Flash Attention is the same answer as standard attention — just computed in a different order. It's a hardware-aware algorithm: the math was redesigned around the memory hierarchy of the GPU, not around mathematical convenience. This is the paradigm for GPU kernel optimization: write algorithms that match data access patterns to the memory tier that serves them cheapest.
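The reordering rests on an online softmax: a running maximum and running normaliser let the result be accumulated block by block. The sketch below shows that idea for a single query in pure Python. It is not the Flash Attention kernel itself, it omits the 1/√d scaling for brevity, and the function names are illustrative.

```python
import math

def attention_reference(q, keys, values):
    """Standard attention for one query: materialises every score."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    d = len(values[0])
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z for j in range(d)]

def attention_tiled(q, keys, values, block=2):
    """Same result, visiting keys and values one block at a time.

    Only a running max, a running normaliser and a running output
    vector are kept, so the full score row is never stored.
    """
    d = len(values[0])
    run_max = -math.inf
    run_sum = 0.0
    acc = [0.0] * d
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(run_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        correction = math.exp(run_max - new_max)
        run_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            run_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        run_max = new_max
    return [a / run_sum for a in acc]

q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
print(attention_reference(q, keys, values))
print(attention_tiled(q, keys, values, block=2))
```

In a real kernel each block would be sized to fit in on-chip SRAM, so the intermediate scores never travel to HBM.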
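Autoregressive generation is inherently serial: token N requires token N-1. This wastes GPU parallelism, because each decoding step does too little math to keep the Tensor Cores busy. Speculative decoding breaks this serialization: a small draft model proposes several tokens cheaply, and the large model verifies them all in a single parallel pass, keeping the ones it agrees with.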
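A toy draft-and-verify loop illustrates the control flow. Everything here is an assumption for illustration: `target_next` and `draft_next` are hypothetical deterministic stand-ins for a large and a small model, and verification uses greedy exact-match acceptance rather than the rejection sampling used with real probabilistic models.

```python
def target_next(seq):
    """Hypothetical 'large model': expensive but authoritative."""
    return (seq[-1] * 3 + 1) % 7

def draft_next(seq):
    """Hypothetical 'small model': cheap, wrong whenever the last token is 4."""
    return 0 if seq[-1] == 4 else (seq[-1] * 3 + 1) % 7

def greedy_decode(prompt, n_new):
    """Baseline: one target pass per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq

def speculative_decode(prompt, n_new, k=4):
    """Draft k tokens, verify them in one target pass, keep the agreed prefix."""
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens serially (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. One target pass checks every drafted position; on a GPU
        #    these checks run in parallel, here they are a loop.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(seq + draft[:i])
            if draft[i] != expected:
                accepted.append(expected)  # take the target's correction
                break
            accepted.append(draft[i])
        else:
            accepted.append(target_next(seq + draft))  # free bonus token
        seq.extend(accepted)
    return seq[:len(prompt) + n_new], target_passes

baseline = greedy_decode([1], 12)
spec, passes = speculative_decode([1], 12, k=4)
print(spec == baseline, passes)
```

The output matches plain greedy decoding with the target model, but the target runs far fewer times than the number of generated tokens whenever the draft is usually right.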
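Naive inference servers reserve GPU memory for the maximum possible KV cache per request at the start. This causes severe internal fragmentation: GPU memory is wasted on reserved-but-unused space. PagedAttention instead allocates the KV cache in small fixed-size blocks on demand, much like virtual-memory pages.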
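The waste is easy to quantify with a toy allocator comparison. The request lengths, the 4096-token maximum and the 16-token block size below are hypothetical values chosen for illustration, and the functions count token slots rather than bytes.

```python
import math

def reserved_upfront(actual_lengths, max_len):
    """Token slots reserved when every request gets max_len slots at start."""
    return len(actual_lengths) * max_len

def reserved_paged(actual_lengths, block_size):
    """Token slots reserved when slots are handed out in fixed-size blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in actual_lengths)

lengths = [120, 900, 35, 2048, 410]   # hypothetical tokens actually used
used = sum(lengths)

upfront = reserved_upfront(lengths, max_len=4096)
paged = reserved_paged(lengths, block_size=16)

print(f"used:    {used} slots")
print(f"upfront: {upfront} slots, {1 - used / upfront:.1%} wasted")
print(f"paged:   {paged} slots, {1 - used / paged:.1%} wasted")
```

With block allocation the only waste is the partially filled last block of each request, so far more concurrent requests fit in the same HBM.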
Data parallelism: each GPU holds a full copy of the model. Different batches of data are processed in parallel. After each step, gradients are synchronized via all-reduce.
Tensor parallelism: a single weight matrix is split across GPUs. Each GPU computes a shard of the matmul; results are combined. This requires tight synchronization on every layer, so in practice it is usually kept within a node (NVLink bandwidth and latency).
Pipeline parallelism: model layers are split across GPUs. GPU 0 holds layers 1–24, GPU 1 holds layers 25–48, etc. Each stage processes one micro-batch while the next micro-batch fills the earlier stages.
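Splitting a single weight matrix across devices can be sketched with a column split. This is a simulation on nested lists, not a multi-GPU program: each shard's matmul stands for the work one GPU would do, the final concatenation stands for the gather step, and the helper names are illustrative. It assumes the column count divides evenly by the number of shards.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def split_columns(w, parts):
    """Cut w into `parts` equal column blocks, one per simulated GPU."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

def tensor_parallel_matmul(x, w, parts):
    """Compute x @ w with w sharded by columns across `parts` devices."""
    shards = split_columns(w, parts)
    # Each device multiplies the full input by its own column block.
    partials = [matmul(x, shard) for shard in shards]
    # Gather: concatenate the column blocks back into full output rows.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(matmul(x, w))
print(tensor_parallel_matmul(x, w, parts=2))
```

Each device stores only its slice of the weights, at the price of a gather on every layer, which is why the interconnect matters so much for this form of parallelism.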
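Every gradient synchronization step must leave every GPU holding the average of all GPUs' gradients. This is an AllReduce collective. Efficient implementations such as ring AllReduce pass chunks between neighbors rather than sending everything to everyone, but it is still the communication pattern that can dominate training time at scale.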
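One common way to implement this collective is a ring: a reduce-scatter phase followed by an all-gather phase. The simulation below is a single-process sketch, not a communication library, and it simplifies by making the vector length equal to the number of workers so that each element acts as one chunk.

```python
def ring_allreduce(grads):
    """Average equal-length gradient vectors with a simulated ring all-reduce.

    grads[r] is worker r's vector; its length must equal the number of
    workers (one element per chunk, a simplification). Returns the
    per-worker results and the number of chunk sends.
    """
    n = len(grads)
    data = [list(g) for g in grads]
    sends = 0
    # Phase 1, reduce-scatter: each worker passes one chunk to its right
    # neighbour, which adds it to its own copy.
    for step in range(n - 1):
        outgoing = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, chunk, value in outgoing:
            data[(r + 1) % n][chunk] += value
            sends += 1
    # Worker r now holds the complete sum for chunk (r + 1) % n.
    # Phase 2, all-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        outgoing = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, chunk, value in outgoing:
            data[(r + 1) % n][chunk] = value
            sends += 1
    return [[v / n for v in row] for row in data], sends

grads = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
averaged, sends = ring_allreduce(grads)
print(averaged[0], sends)
```

Each worker sends 2(N-1) chunks of size 1/N, so per-worker traffic stays close to twice the gradient size regardless of how many workers join the ring.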
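Intermediate activations from the forward pass must be kept until backprop uses them to compute gradients. For a 70B model this can be ~70GB of activations per batch, which together with weights, gradients and optimizer state is more than one GPU's HBM. Gradient checkpointing stores only a subset of activations and recomputes the rest during the backward pass.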
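The memory-for-compute trade of gradient checkpointing can be sketched with a coarse counting model. It assumes every layer's activation is the same size, that one checkpoint is kept per segment, and that one segment at a time is recomputed and held in full during the backward pass. The 80-layer depth is a hypothetical value.

```python
import math

def peak_activations(num_layers, segment):
    """Layer activations held at once when checkpointing every `segment` layers.

    Coarse model: one stored checkpoint per segment, plus one fully
    recomputed segment during the backward pass.
    """
    checkpoints = math.ceil(num_layers / segment)
    return checkpoints + segment

layers = 80  # hypothetical model depth

baseline = layers  # no checkpointing: keep every layer's activation
best_segment = min(range(1, layers + 1), key=lambda s: peak_activations(layers, s))
best = peak_activations(layers, best_segment)

print(f"no checkpointing:   {baseline} layer activations")
print(f"checkpoint every {best_segment}: {best} layer activations")
print(f"sqrt(layers) = {math.sqrt(layers):.1f}")
```

In this model the best segment length sits near the square root of the depth, and the price is roughly one extra forward pass of recomputation.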
Every major framework feature maps to a hardware constraint:
Flash Attention → HBM bandwidth is the bottleneck for attention (not Tensor Core compute)
KV Cache / PagedAttention → HBM capacity is finite; fragmentation kills throughput
Tensor Parallelism → single matmul too big for one GPU's HBM; split it
NVLink vs PCIe → AllReduce communication volume determines whether multi-GPU training is viable
Gradient Checkpointing → activation memory exceeds HBM; recompute is cheaper than spilling to CPU RAM
The pattern: every optimization is the software adapting an algorithm to match a hardware constraint in the memory hierarchy.