AI Learning Series · Lens 1

CPU vs GPU
The Architecture of Intelligence

From sequential instruction execution to massively parallel matrix operations — why AI needed an entirely different kind of processor.

🎯 Why This Lens Matters

Understanding why AI runs on GPUs isn't about hardware trivia — it reveals the fundamental shape of intelligence computation.

The Core Question

Why does training a neural network on a CPU that's 10x faster per-clock than a GPU still take 100x longer? The answer is about parallelism geometry, not raw speed.

The Insight

Intelligence at scale is fundamentally a linear algebra problem. Billions of multiply-accumulate operations that are embarrassingly parallel. CPUs were built for serial thinking. GPUs were built for parallel geometry — accidentally perfect for AI.

🔑 The Paradigm Shift

Instruction coding (CPU era): one instruction executes, then the next. Control flow, branching, sequential logic. A brilliant single-threaded mind.

Intelligence coding (GPU era): thousands of identical operations execute simultaneously on different data. No branching. Brute-force mathematical parallelism. A massive parallel workforce doing identical work.

📜 Historical Arc: From Instructions to Intelligence

How computation evolved from executing commands to learning from data.
1940s–1950s · The von Neumann Era
Instruction Execution as the Paradigm
The stored-program computer: fetch instruction → decode → execute → writeback. One instruction at a time. The CPU was born as a serial instruction executor. Programs were lists of steps. Intelligence was the programmer's, not the machine's.
1970s–1980s · Pipelining & Superscalar
CPUs Get Clever About Serial Work
Instruction pipelines overlapped fetch/decode/execute stages. Superscalar CPUs issued multiple instructions per cycle. Branch prediction guessed future control flow. All optimizations for sequential programs. The assumption: code has dependencies, so you can't just parallelize everything.
1990s · GPU Born for Graphics
A Different Problem: Pixels Are Embarrassingly Parallel
Rendering a 1080p frame requires transforming ~2 million independent pixels. No pixel depends on another. This is the opposite of serial code — pure data parallelism. GPUs were built to do the same operation (vertex transform, pixel shading) on thousands of data points simultaneously. No branch prediction needed; no dependency tracking. Brute force parallelism.
2006 · CUDA Changes Everything
General-Purpose GPU Computing (GPGPU)
NVIDIA releases CUDA — a programming model letting you run arbitrary code on GPU cores. Researchers immediately notice: matrix multiplication (the core of neural networks) has the same embarrassingly parallel structure as pixel shading. A GPU isn't just a graphics chip — it's a massively parallel math coprocessor.
2012 · AlexNet — The Inflection Point
Deep Learning Proves GPUs Are the Right Tool
AlexNet wins ImageNet by training on two GTX 580 GPUs. The result: a 10.8% accuracy gap over prior art. Training on CPU would have taken weeks; on GPU, days. The AI community permanently migrated. The GPU was no longer accidentally useful — it became the primary compute substrate for intelligence.
2017–Present · Purpose-Built AI Hardware
Tensor Cores, TPUs, and Intelligence-First Silicon
NVIDIA adds Tensor Cores in Volta (2017) — hardware units that execute a full 4×4 matrix multiply in one cycle. Google builds TPUs (Tensor Processing Units). The design philosophy inverts: instead of CPUs with occasional AI acceleration, we now have AI engines with occasional CPU-like control logic.

🔷 CPU Architecture: The Serial Genius

Built for latency minimization, branch prediction, and complex control flow — a single powerful thread of execution.

The Von Neumann Execution Model

Every CPU instruction follows this cycle. This is the atomic unit of computation.

Stage 1Fetch
PC → instruction
Stage 2Decode
opcode + operands
Stage 3Execute
ALU / FPU / branch
Stage 4Memory
load / store
Stage 5Writeback
result → register

What CPUs Are Optimized For

Out-of-Order Execution (OOO)

CPU dynamically reorders instructions to avoid stalls. If instruction 5 doesn't depend on instruction 4, execute them in parallel. A complex scheduler tracks ~200 in-flight instructions. Massive silicon cost for serial-looking code.

Branch Prediction

Speculative execution: guess the branch outcome, execute ahead. Modern CPUs hit ~99% prediction accuracy. On miss: flush pipeline, ~15 cycle penalty. Essential for code with lots of if/else.

Deep Cache Hierarchy

L1 (~32KB, 4 cycle), L2 (~256KB, 12 cycle), L3 (~32MB, 40 cycle), RAM (200 cycle). Huge transistor budget on caches because a single thread must find its data fast.

SIMD Extensions (SSE/AVX)

AVX-512: 512-bit wide registers, 16 floats in one instruction. CPUs added vector units reactively — GPU had vectors as the primary design from day 1.

CPU Anatomy (Modern Core)

1 Physical Core
Front End
Branch predictor · Instruction fetch · Decode (4-wide) · Micro-op cache
Out-of-Order Engine
Rename/allocate · Reorder buffer (~512 entries) · Reservation stations · Scheduler
Execution Units
4× ALU · 2× FPU/SIMD · 2× Load · 1× Store
Memory Subsystem
L1-D (48KB) · L1-I (32KB) · L2 (1.25MB) · Shared L3
Modern desktop: 16–32 cores · ~5 GHz · ~10M transistors/core

⚡ Why CPU Struggles with Neural Networks

A matrix multiply of size [1024×1024] × [1024×1024] requires ~2 billion multiply-add operations. These have zero dependencies between them — they're perfectly parallel. A CPU with 16 cores and 8-wide AVX runs 128 FLOPs/cycle. At 5GHz, that's 640 GFLOPS. An H100 GPU delivers 67 TFLOPS FP32 — 100× more. The CPU's OOO, branch prediction, and cache — all that brilliant complexity — contributes nothing to this workload.

🟩 GPU Architecture: The Parallel Army

Thousands of simple cores executing the same instruction on different data simultaneously — SIMT at massive scale.

The Fundamental Design Philosophy

A CPU dedicates ~80% of transistors to control logic (OOO, branch prediction, caches). A GPU dedicates ~80% of transistors to arithmetic units. One philosophy: be great at any single task. The other: be great at doing the same simple task 10,000 times simultaneously.


GPU Hierarchy: From Chip to Thread

GPU (Device)
The full chip. H100: 132 SMs, 528 Tensor Cores, 80GB HBM3
H100: 132 SMs
↓ contains multiple
Streaming Multiprocessor (SM)
The autonomous execution unit. Has its own scheduler, register file, shared memory, L1. H100 SM: 128 CUDA cores, 4 Tensor Cores
128 CUDA / SM
↓ executes
Thread Block (CTA)
Programmer-defined group. All threads share SMEM. Can synchronize. Max 1024 threads/block. Assigned to one SM.
max 1024 threads
↓ scheduled as
Warp
32 threads that execute exactly the same instruction simultaneously (SIMT). The atomic scheduling unit. If threads diverge on a branch, they serialize — this is warp divergence.
32 threads / warp
↓ individual
CUDA Thread
The individual worker. Has its own registers (no shared state). Identified by threadIdx, blockIdx. Executes one stream of instructions.
unique (x,y,z) ID

🧠 SIMT Mental Model

SIMD (CPU): one instruction, multiple data items, one thread controls it.
SIMT (GPU): one instruction, multiple data items, each data item has its own thread with its own registers. This lets GPU threads diverge and reconverge — at the cost of serializing divergent warps.

Think of a warp as 32 soldiers who must all do the same move at the same time. If 16 need to go left and 16 need to go right, they first all go left (16 wait idle), then all go right (16 wait idle). This is why GPUs hate branchy code.


SM Internal Architecture

H100 SM Internals

128 FP32 CUDA Cores
General float arithmetic — adds, multiplies, divisions
4 Tensor Cores (Gen 4)
4×4 matrix multiply-accumulate per clock in FP8/FP16/BF16/TF32
64 INT32 Cores
Integer ops — indexing, control, address computation
256KB L1 / Shared Memory
Configurable split: L1 cache vs programmer-managed SMEM
4 Warp Schedulers
Each issues 1 instruction/cycle. 4 warps active simultaneously. Latency hiding via warp switching.

Latency Hiding: The Key Trick

GPU memory latency is ~500 cycles. CPU would stall; GPU switches to another warp instantly.

Warp A issues memory load → waiting (500 cycle latency)
Warp B executes math ops ← runs while A waits
Warp C executes math ops ← runs while A waits
Warp D executes math ops ← runs while A waits
Warp A data arrives → resumes immediately
An H100 SM holds 64 active warps × 32 threads = 2048 threads in flight simultaneously. Zero stall time if workload is compute-bound.

The 3 Core Types: CUDA, Tensor, Ray Tracing

Modern GPUs have three distinct compute engines, each solving a different class of problem.
General Purpose
CUDA Cores
The original GPU compute worker. A single-precision (FP32) floating point and integer ALU. Executes one operation per clock cycle. Massively replicated — 16,896 in H100.
1 op/cycle FP32 + INT32 16,896 in H100
AI Accelerator
Tensor Cores
Purpose-built for matrix multiply-accumulate (MMA). Computes a full 4×4 or 8×4×16 matrix multiply in one clock. The reason modern LLM training is possible. 528 in H100.
256 ops/cycle FP8/FP16/BF16 528 in H100
Ray Tracing
RT Cores
Hardware BVH traversal — accelerates ray-triangle intersection tests for physically-accurate rendering. Not used for AI training, but relevant for simulation and synthetic data generation.
BVH traversal Ray-triangle test 132 in H100

🔵 CUDA Cores — Deep Dive

A CUDA core is essentially a pipelined FPU. It executes floating-point arithmetic at high throughput by accepting a new operand pair every clock cycle (fully pipelined, even if latency is multiple cycles).

What it contains:
// CUDA core functional units
FP32 FMA: a * b + c // fused multiply-add
INT32 ALU: add, sub, and, or, shift
Shared: special functions (sin, sqrt, rcp)

// One CUDA thread = one stream through these
// One warp = 32 CUDA threads in lockstep
The SIMT execution model:
// Kernel: add two vectors
__global__ void vecAdd(float *A, float *B, float *C) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  C[i] = A[i] + B[i]; // each thread handles one element
}

// Launch: 1M threads, all execute same code
// GPU schedules as 32k warps across all SMs

Why CUDA Cores Alone Are Insufficient for LLMs

GPT-3 training: ~314 ZettaFLOPs total. CUDA cores at FP32 on H100: 67 TFLOPS. Tensor Cores at BF16: 1,979 TFLOPS — ~30× faster. CUDA cores handle general work; Tensor Cores are why LLM training is economically feasible.

🔴 Tensor Cores — Deep Dive

Tensor Cores are the most important hardware innovation for AI. They implement fused matrix-multiply-accumulate (MMA) — the exact operation at the heart of every neural network layer.

The Mathematical Operation

Tensor Core computes (in one clock): D = A × B + C

Where A, B, C, D are matrices. For H100 Gen4 Tensor Core:
A: 8×16 FP8 matrix × B: 16×16 FP8 matrix + C: 8×16 FP32 accumulator
= D: 8×16 FP32 result in 1 clock cycle

That's 8×16×16 = 2048 multiply-adds per Tensor Core per cycle
528 Tensor Cores × 2048 × 1.83 GHz = ~1,979 TFLOPS BF16
Why This Maps Perfectly to Neural Networks

Every linear layer is: output = weight_matrix × input + bias

A transformer attention head: Attention(Q,K,V) = softmax(QK^T/√d)V

Both reduce to batched matrix multiplications. Tensor Cores execute these in hardware, not software loops.

Precision Modes (H100 Tensor Core)
PrecisionTFLOPSUse Case
FP83,958Fast inference
FP16/BF161,979Training
TF32989Accurate training
FP6467Scientific

🎯 The Precision Insight

Neural networks are surprisingly tolerant of reduced numerical precision. Training in BF16 instead of FP32 gives 30× speedup with <0.1% accuracy loss. This is because gradients are noisy by nature — you're doing stochastic gradient descent, not exact computation. Hardware precision and algorithm tolerance co-evolved.

🟠 RT Cores — Deep Dive

RT Cores implement hardware-accelerated ray tracing — specifically, the computationally expensive Bounding Volume Hierarchy (BVH) traversal that determines which triangle a ray hits.

What Ray Tracing Computes
For each pixel, cast a ray R(t) = O + t·D Find smallest t where R(t) intersects scene geometry
Test against all triangles → O(N) too slow

BVH: hierarchical bounding boxes
→ Ray-box test is cheap
→ Only test triangles in intersected boxes
→ O(log N) instead of O(N)
AI Connection
Synthetic Data Generation

RT renders photorealistic training data for computer vision models. Cheaper and more varied than real-world data collection.

Neural Radiance Fields (NeRF)

RT Cores accelerate NeRF rendering — AI-generated 3D scenes represented as continuous volumetric radiance functions.

💾 Memory Architecture: The Real Bottleneck

In AI, memory bandwidth often matters more than compute. Moving data is the bottleneck, not doing math on it.
▶ Live: where the time actually goes

Watch the data traffic. The on-chip SRAM↔core path moves dots fast and densely (~19 TB/s). The HBM↔SM path is slower (~3 TB/s). The PCIe path from the CPU is a trickle (~64 GB/s) — this is why you batch work and keep weights resident on the GPU instead of shuttling them back and forth.

CPU host RAM HBM 80 GB · ~3 TB/s SM · SRAM ~20 MB · ~19 TB/s Cores tensor / CUDA PCIe ~64 GB/s HBM bus on-chip

CPU Memory Hierarchy

L1 Cache
Size32–48 KB/core
Latency~4 cycles
Bandwidth~3 TB/s
Per CorePrivate
L2 Cache
Size256KB–2MB/core
Latency~12 cycles
Bandwidth~1 TB/s
Per CorePrivate/Shared
L3 Cache
Size16–128 MB
Latency~40 cycles
Bandwidth~400 GB/s
Per CoreShared (all cores)
DRAM (DDR5)
Size16 GB – 1 TB+
Latency~80 ns / ~400 cycles
Bandwidth~100 GB/s
Cost/GBCheap

GPU Memory Hierarchy

Registers
Size256KB / SM (65,536 regs)
Latency0 cycles
ScopePer-thread
NoteZero-overhead access
Shared Mem (SMEM)
SizeUp to 228KB/SM
Latency~1–5 cycles
Bandwidth~19 TB/s (H100)
ScopePer thread block
L2 Cache
Size50 MB (H100)
Latency~200 cycles
Bandwidth~12 TB/s
ScopeAll SMs
HBM3 (GPU DRAM)
Size80 GB (H100 SXM)
Latency~500 cycles
Bandwidth3.35 TB/s
vs CPU RAM33× faster BW
NVLink / PCIe
NVLink BW900 GB/s (H100)
PCIe 5.0 BW~128 GB/s
NoteGPU↔GPU or GPU↔CPU
BottleneckDistributed training

⚠️ The Roofline Model — Memory vs Compute Bound

Every AI kernel is either compute-bound (math is the bottleneck) or memory-bandwidth bound (data movement is the bottleneck).

Arithmetic Intensity = FLOPs / Bytes loaded from memory

Matrix multiply (large): ~125 FLOPs/byte → compute-bound → Tensor Cores help massively
Attention softmax (small batch): ~1 FLOPs/byte → memory-bound → Flash Attention solves this by keeping activations in SMEM, never writing to HBM

This is why Flash Attention is not about new math — it's about data movement optimization.

📐 Mathematical Model of Neural Network Compute

The math that drives GPU design choices.

Forward Pass of a Linear Layer

Given: input X ∈ ℝ^(B×d_in), weights W ∈ ℝ^(d_in×d_out), bias b ∈ ℝ^d_out Y = X · W + b

FLOPs = 2 × B × d_in × d_out (factor 2: multiply + add)
Memory reads = (B×d_in + d_in×d_out) × sizeof(dtype)

For GPT-3 layer: B=512, d_in=12288, d_out=12288
FLOPs = 2 × 512 × 12288 × 12288 ≈ 155 GFLOPs per layer

Transformer Attention — The Dominant Operation

Multi-Head Attention: Q = X·W_Q, K = X·W_K, V = X·W_V

Attention(Q,K,V) = softmax( Q·K^T / √d_k ) · V

QK^T FLOPs = 2 × B × h × L × L × d_k (quadratic in sequence length L!)

This is why long-context LLMs are expensive:
L=4096: L² = 16M operations per head per batch
L=128k: L² = 16 BILLION operations per head per batch

Gradient Descent: The Optimization Engine

Adam optimizer update (most common in LLM training): m_t = β₁·m_{t-1} + (1-β₁)·∇L(θ_t) (1st moment / gradient)
v_t = β₂·v_{t-1} + (1-β₂)·∇L(θ_t)² (2nd moment / variance)

m̂_t = m_t / (1-β₁^t) (bias-corrected)
v̂_t = v_t / (1-β₂^t)

θ_{t+1} = θ_t - α · m̂_t / (√v̂_t + ε)

Memory cost: 3× model size (params + m + v)
GPT-3 (175B params) in FP32: 175B × 4B × 3 ≈ 2.1 TB of optimizer state alone

Compute Requirements: Training vs Inference

Training FLOPs estimate (Chinchilla scaling): FLOPs_train ≈ 6 × N × D
N = number of parameters, D = number of training tokens

GPT-3: N=175B, D=300B tokens
FLOPs_train ≈ 6 × 175×10⁹ × 300×10⁹ = 3.14 × 10²³ FLOPs

Inference FLOPs per token (autoregressive): FLOPs_inference ≈ 2 × N (2 FLOPs per parameter, forward pass only)
GPT-3: ≈ 350 GFLOPs per token

Training is ~(D/1 token) × 3× costlier (need backward pass + optimizer)
GPT-3 training ≈ generating 300B × 3 = 900B tokens worth of inference compute

⚖️ Full Comparative Study

CPU vs GPU across every architectural dimension.
Dimension CPU GPU
Core Count8–128 complex cores10,000–16,000 simple cores
Clock Speed3–5 GHz (fast)1.3–2.0 GHz (slower)
Execution ModelSequential, OOO, speculativeMassively parallel SIMT
Cache SizeLarge (32–128 MB L3)Smaller L2 (50MB H100)
DRAM BW~100 GB/s (DDR5)3,350 GB/s (HBM3, H100)
DRAM CapacityUp to 1 TB+ (cheap DIMM)Up to 80–141 GB (expensive HBM)
Branch HandlingExcellent (branch predictor)Poor (warp divergence)
Latency per opLow (~1–4 cycles)Higher (~20 cycles, hidden)
Peak FP32~1–2 TFLOPS67 TFLOPS (H100)
Peak AI (BF16)~2 TFLOPS (with AMX)1,979 TFLOPS (Tensor Cores)
Programming ModelThreads, shared memory, OSCUDA kernels, warps, SMEM
Best forBranchy, serial, OS, networkingMatmul, parallel data processing
Power (TDP)~65–350W~300–700W
Cost$200–$5,000$2,000–$40,000+
Transistors~50B (AMD Epyc)~80B (H100)

🎯 The Fundamental Trade-off

CPU: optimized to minimize latency for any single task. One thread of execution runs as fast as physically possible.

GPU: optimized to maximize throughput across thousands of identical tasks. Any single thread is slow, but 10,000 run in parallel.

Neural network training doesn't have a "single task" — it has billions of identical floating-point multiplications with no inter-dependencies. GPU wins by design.

🔄 The AI Training Pipeline on GPU

How CPU and GPU divide the work — neither alone is sufficient.

Training Loop: CPU Orchestrates, GPU Computes

CPUData Loading
fetch batch, preprocess, augment
CPU→GPUH2D Transfer
PCIe / NVLink copy to VRAM
GPU (Tensor)Forward Pass
matmuls via Tensor Cores
GPU (CUDA)Loss + Backward
gradient computation
GPU (CUDA)Optimizer Step
Adam update on all params
CPULogging / LR Schedule
metrics, checkpointing

CPU's Role in AI Training

  • Data pipeline (decoding, tokenizing, batching)
  • Training orchestration (PyTorch runtime, scheduler)
  • Checkpointing to disk (NVMe I/O)
  • Inter-GPU communication coordination
  • Learning rate scheduling, loss tracking
  • Evaluating on validation set (via GPU dispatch)

GPU's Role in AI Training

  • Forward pass: all matmul via Tensor Cores
  • Activation functions: CUDA cores (elementwise)
  • Loss computation: CUDA reduction kernels
  • Backward pass: gradient via chain rule
  • Optimizer: Adam update on weights in HBM
  • All-reduce across GPUs (distributed training)

💡 Mental Models

Intuitions for thinking about CPU vs GPU without drowning in specs.
🏎️

Ferrari vs Highway Bus

CPU is a Ferrari: fastest possible single-passenger journey, adaptive routing, handles any road. GPU is 10,000 buses: each bus is slow, but they all depart simultaneously. If you're moving one person, take the Ferrari. If you're moving 300,000 people doing the same trip — you need buses.

🏗️

Architect vs Construction Army

CPU is the architect: designs the building, makes complex decisions, handles dependencies. GPU is the construction crew of 10,000 workers all doing the same task (laying bricks) simultaneously. The architect (CPU) orchestrates; the crew (GPU) executes in parallel.

🧮

Deep vs Wide Thinking

CPU thinks deep: chain of reasoning, step A → B → C → D, each step depends on the previous. GPU thinks wide: A₁, A₂, A₃...A₁₀,₀₀₀ all simultaneously, no interdependence. Intelligence requires deep thinking (CPU) AND wide pattern recognition (GPU).

📚

The Instruction Coding → Intelligence Coding Shift

Instruction coding: you write the rules. CPU executes them deterministically. Intelligence coding: you show examples. GPU finds the rules via optimization over data. The shift isn't just hardware — it's a different theory of what computation is.

🔑 The Unifying Insight

CPU and GPU aren't competitors — they're co-evolved specialists. Every AI system uses both: CPU for the complex, serial, branchy orchestration (the "thinking about what to compute"), GPU for the massive parallel math (the "actually computing it"). The ratio shifts as AI models scale: a larger fraction of wall-clock time is GPU compute, which is why GPU spending dominates AI infrastructure cost.

Inference Optimizations: Making GPUs Work Smarter

Training is a solved scaling problem — throw more GPUs. Inference is the hard engineering problem: serve millions of users cheaply and fast.

KV Cache — Context Caching at the Hardware Level

The single most important inference optimization. Without it, generating each token would require recomputing attention over the entire context from scratch — O(L²) work per token.

What Gets Cached
Attention for token t requires: Q_t = X_t · W_Q (new each token)
K_t = X_t · W_K ← CACHED
V_t = X_t · W_V ← CACHED

score_t = softmax(Q_t · [K_1...K_t]^T / √d)
output_t = score_t · [V_1...V_t]

Without cache: recompute K,V for all prior tokens every step
With cache: only compute K_t, V_t for the NEW token
Memory Cost of KV Cache
KV cache size per token: bytes = 2 × n_layers × n_heads × d_head × dtype_bytes

Llama-3 70B (BF16):
2 × 80 × 8 × 128 × 2 = 327 KB per token

At 128k context:
327 KB × 128,000 = ~40 GB just for KV cache

→ Why long-context models need H100 SXM (80GB HBM)
→ Why quantizing KV cache to INT8 matters

🔑 Prompt Caching (API Level)

Anthropic's prompt caching maps directly to KV cache reuse. When you send the same system prompt prefix repeatedly, the provider stores the KV pairs for that prefix on GPU HBM. Subsequent calls with that prefix skip recomputing attention for those tokens — typically ~90% cost reduction and ~85% latency reduction for the cached portion. The cache lives in HBM between requests on the same server. This is why long, reused system prompts are economical despite their token count.

Flash Attention — Solving the Memory Bandwidth Bottleneck

Standard attention is memory-bandwidth bound, not compute-bound. Flash Attention is not new math — it's the same result computed via a tiling strategy that keeps data in SRAM and never writes the attention matrix to HBM.

Standard Attention — Memory Round-trips
Step 1: Load Q, K from HBM → SRAM. Compute S = QK^T. Write S back to HBM.
Step 2: Load S from HBM → SRAM. Compute P = softmax(S). Write P back to HBM.
Step 3: Load P, V from HBM → SRAM. Compute O = PV. Write O to HBM.
S matrix size: L×L floats. At L=8192: 268 MB per head per layer. Pure memory bandwidth waste.
Flash Attention — Tiled SRAM Computation
Tile: Split Q, K, V into blocks that fit in SRAM (256KB per SM). Process one tile at a time.
Online softmax: Compute running max and sum to normalize softmax incrementally. Never materialize the full S matrix.
Result: HBM reads/writes: O(L) instead of O(L²). Identical mathematical output. 2–4× faster in practice.
Flash Attention 3 (H100): uses async WGMMA + pipelining to overlap SRAM loads with Tensor Core math. Near-theoretical peak utilization.

🧠 The Mental Model

Flash Attention is the same answer as standard attention — just computed in a different order. It's a hardware-aware algorithm: the math was redesigned around the memory hierarchy of the GPU, not around mathematical convenience. This is the paradigm for GPU kernel optimization: write algorithms that match data access patterns to the memory tier that serves them cheapest.

Speculative Decoding — Parallelizing Autoregressive Generation

Autoregressive generation is inherently serial: token N requires token N-1. This wastes GPU parallelism — Tensor Cores sit idle while the model waits for the previous token. Speculative decoding breaks this serialization.

How It Works
1. Draft
A small, fast draft model (e.g. 7B) generates k=4 candidate tokens speculatively in parallel. Takes ~1/10th the compute of the large model.
2. Verify
The large model (e.g. 70B) verifies all k tokens in ONE forward pass — same cost as generating 1 token. Uses its full parallel capacity.
3. Accept/Reject
Accept tokens where draft matches verifier distribution. On first mismatch, resample from verifier. Expected acceptance: 3–4 tokens per verification pass.
The Math
Speedup analysis: Standard: T_large per token
Speculative: (T_draft × k + T_large) per ~α·k tokens

α = acceptance rate (typically 0.7–0.9)
k = draft tokens per round (typically 4–8)

Speedup ≈ α·k / (α·k · T_draft/T_large + 1)

Real world: 2–3× throughput improvement
Output quality: mathematically identical to large model alone

Continuous Batching & PagedAttention — vLLM's Core Innovation

Naive inference servers reserve GPU memory for the maximum possible KV cache per request at the start. This causes severe internal fragmentation — GPU memory is wasted on reserved-but-unused space.

The Problem: Static KV Cache Allocation
Request A reserved 4096 tokens of KV cache at start.
It only used 512 tokens before finishing.
3584 token-slots wasted while other requests queue.
With static batching: entire batch must finish before new requests enter. Long requests block short ones. GPU utilization: 30–40%.
PagedAttention: OS Virtual Memory for KV Cache
KV cache divided into fixed-size pages (e.g. 16 tokens/page). Each request gets pages on-demand, returned when done. Logical → physical page mapping like OS virtual memory.
Continuous batching: new requests join the batch mid-flight. Short requests finish and free pages immediately. GPU utilization: up to 90%+.
Result on real workloads (vLLM vs naive serving, Llama-2 70B): Throughput: 24× higher requests/second
GPU memory waste: <4% fragmentation (vs 60–80% naive)
Plus: KV cache sharing across requests with identical prefixes (e.g. same system prompt) → further memory savings

🌐 Distributed Training: Scaling Beyond One GPU

A single H100 has 80GB HBM. GPT-4 has ~1.8 trillion parameters at 2 bytes each = 3.6 TB. Distributed training is not optional — it's the only way large models exist.

Three Parallelism Dimensions

Dimension 1
Data Parallelism

Each GPU holds a full copy of the model. Different batches of data are processed in parallel. After each step, gradients are synchronized via all-reduce.

GPU 0: batch[0:64] → grad_0
GPU 1: batch[64:128] → grad_1
AllReduce: avg(grad_0, grad_1)
Both update same weights
Bottleneck: AllReduce communication. At 175B params × 2 bytes × 2 (send+recv) = 700GB per step. NVLink 900GB/s handles this; PCIe cannot.
Dimension 2
Tensor Parallelism

A single weight matrix is split across GPUs. Each GPU computes a shard of the matmul; results are combined. Requires tight synchronization — only works within a node (NVLink latency).

W [d×4d] split into 4 shards:
GPU 0: W[:, 0:d] → partial out
GPU 1: W[:, d:2d] → partial out
AllReduce: sum partials → Y
Bottleneck: 2 AllReduces per transformer layer (one after attention, one after MLP). Latency-sensitive — must use NVLink, not ethernet.
Dimension 3
Pipeline Parallelism

Model layers split across GPUs. GPU 0 holds layers 1–24, GPU 1 holds layers 25–48, etc. Each processes a micro-batch while the next micro-batch fills the earlier stages.

GPU 0: layers 1-24 (micro-batch 1)
GPU 1: layers 25-48 (micro-batch 1)
GPU 0: layers 1-24 (micro-batch 2)
→ pipeline fills, minimizes bubble
Bottleneck: Pipeline bubble (idle time when pipeline drains). 1F1B scheduling (one forward, one backward) reduces bubble to 1/(num_stages).
Real cluster configuration (Llama-3 70B training — 512 H100s): Tensor Parallel (TP) = 8 (within one 8-GPU node, NVLink)
Pipeline Parallel (PP) = 8 (across 8 nodes, InfiniBand)
Data Parallel (DP) = 8 (across DP replicas, InfiniBand)
Total = TP × PP × DP = 8 × 8 × 8 = 512 GPUs

The AllReduce Bottleneck — Why NVLink Exists

Every gradient synchronization step requires every GPU to send its gradients to every other GPU and receive the average. This is an AllReduce collective — the communication pattern that dominates training time at scale.

Ring-AllReduce communication volume: bytes = 2 × (N-1)/N × model_params × dtype

175B params, BF16, 8 GPUs:
= 2 × 7/8 × 175B × 2 = ~612 GB per step

NVLink 4.0 (H100): 900 GB/s bidirectional
Time: ~0.68 seconds per AllReduce

PCIe 5.0: 128 GB/s bidirectional
Time: ~4.8 seconds — 7× slower, training unusable
Gradient Checkpointing

During backprop, intermediate activations must be stored to compute gradients. For a 70B model, this is ~70GB of activations per batch — more than one GPU's HBM.

No checkpointing: store all activations. Fast backprop but OOM on large models.
Gradient checkpointing: only store activations at checkpoint layers (e.g. every 4 layers). Recompute intermediate activations during backprop. Trades 33% more compute for 10× less memory.
ZeRO (DeepSpeed): shard optimizer states, gradients, and parameters across data-parallel ranks. Reduces per-GPU memory from O(model) to O(model/N).

🔑 Connecting Software to Hardware Bottlenecks

Every major framework feature maps to a hardware constraint:

Flash Attention → HBM bandwidth is the bottleneck for attention (not Tensor Core compute)
KV Cache / PagedAttention → HBM capacity is finite; fragmentation kills throughput
Tensor Parallelism → single matmul too big for one GPU's HBM; split it
NVLink vs PCIe → AllReduce communication volume determines whether multi-GPU training is viable
Gradient Checkpointing → activation memory exceeds HBM; recompute is cheaper than spilling to CPU RAM

The pattern: every optimization is the software adapting an algorithm to match a hardware constraint in the memory hierarchy.