01 The Big Picture
All the machinery of docs 06–08 (prefill, decode, the bandwidth wall) produces one thing per step: a list of ~100,000 numbers, one per vocabulary token. The sampler turns that list into a single token. It is the smallest component in the stack, yet it controls creativity, determinism, JSON validity, and (via speculation) speed.
This doc ties up loose threads from across the series: doc 02 mentioned temperature/top-p in one line; doc 12 promised an explanation of how tool calls are guaranteed to be valid JSON; doc 13 kept saying "stochastic". Here is exactly where the randomness enters, and every knob that shapes it.
02 What: Logits → Probabilities
The final layer projects the model's hidden state onto the vocabulary: one raw score (logit) per token. Softmax turns scores into a probability distribution. Crucially, the model's output is always the whole distribution — "the model said X" really means "the sampler picked X from what the model believed."
Temperature divides logits before softmax: low T exaggerates gaps (confident tokens dominate), high T compresses them (underdogs get real chances). T doesn't add knowledge or remove it — it reshapes how boldly the model commits to what it already believes.
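The effect of temperature can be checked with a few lines of arithmetic. The following is a minimal sketch using only the standard library; the three logits are made-up values for illustration, not output from any real model.

```python
import math

def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize into probabilities."""
    if temperature <= 0:
        # T = 0 is conventionally handled as argmax, not as a division by zero.
        raise ValueError("temperature must be > 0; treat T=0 as argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

The ranking of the tokens is the same at every temperature; only the share of probability held by the top token changes, rising as T falls and shrinking as T grows.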
03 How: The Sampling Funnel
Production samplers chain filters: each knob in the table below reshapes or prunes the candidate set before the final random draw.
| Knob | Mechanism | Use |
|---|---|---|
| temperature = 0 (greedy) | Always argmax | Extraction, classification, math — but beware: greedy ≠ globally best sentence, and even T=0 isn't bit-exact across batches/hardware (float non-associativity) |
| top-k | Keep k highest, renormalize | Blunt tail-cutting; k fixed regardless of confidence |
| top-p (nucleus) | Keep smallest set with cumulative prob ≥ p | Adaptive: confident step → few candidates; uncertain step → many. The sane default (~0.9) |
| min-p | Keep tokens ≥ p × P(best) | Newer; behaves well at high temperatures |
| frequency/presence penalty | Down-weight already-used tokens | Fights repetition loops, the classic greedy-decoding pathology |
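The knobs in the table can be sketched as small functions over a token-to-probability map. This is an illustrative sketch, not any particular library's implementation: the function names, the toy distribution, the order in which the filters are chained, and the additive form of the penalties are all assumptions, and real samplers differ on each of these.

```python
import random

def renormalize(probs):
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}

def by_prob_desc(probs):
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

def top_k(probs, k):
    """Keep the k most likely tokens."""
    return renormalize(dict(by_prob_desc(probs)[:k]))

def top_p(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    kept, cumulative = {}, 0.0
    for tok, prob in by_prob_desc(probs):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)

def min_p(probs, p):
    """Keep tokens at least p times as likely as the single best token."""
    threshold = p * max(probs.values())
    return renormalize({t: q for t, q in probs.items() if q >= threshold})

def apply_penalties(logits, counts, frequency=0.0, presence=0.0):
    """Assumed additive form: frequency scales with the count, presence is a flat fee."""
    return {
        tok: x - frequency * counts.get(tok, 0) - presence * (counts.get(tok, 0) > 0)
        for tok, x in logits.items()
    }

def sample(probs, rng):
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Toy distribution over five tokens (made-up numbers).
probs = {"the": 0.50, "a": 0.30, "cat": 0.15, "zebra": 0.04, "qux": 0.01}

funnel = top_p(top_k(probs, 4), 0.9)   # top-k first, then nucleus
token = sample(funnel, random.Random(0))
print(sorted(funnel), token)
```

With these numbers, top-k removes only `qux`, and top-p then stops after `the`, `a`, and `cat` because their renormalized mass already exceeds 0.9. A flatter distribution would let more candidates through the same top-p setting, which is the adaptivity the table describes.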
04 Structured Output — Determinism Where It Counts
Doc 12's tool calls must be valid JSON every single time. Prompting "please respond in JSON" gets you most of the way (say 98%), and the remaining 2% become production incidents. The real mechanism is constrained decoding: compile the JSON schema (or any grammar) into an automaton; at each step, mask the logits of every token that would violate it by setting them to −∞ before sampling. The model literally cannot emit an illegal token. After {"name": a string must follow, so every token that does not open a string (whitespace aside) is masked, and sampling proceeds freely among the legal options only.
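The masking step can be illustrated with a toy example. This is a sketch under heavy simplifying assumptions: the eight-token vocabulary, the hand-written automaton for the schema `{"name": <string>}`, and `fake_model_logits` (a stand-in for a real forward pass that stubbornly prefers an illegal token) are all hypothetical. Real systems compile the automaton from the schema and work over the full tokenizer vocabulary.

```python
import json
import math

# Hypothetical tiny vocabulary; real vocabularies have ~100,000 entries.
VOCAB = ["{", "}", ":", '"name"', '"Ada"', '"Bob"', "hello", "42"]

# Hand-written automaton for {"name": <string>}:
# state -> {legal token: next state}
TRANSITIONS = {
    0: {"{": 1},
    1: {'"name"': 2},
    2: {":": 3},
    3: {'"Ada"': 4, '"Bob"': 4},
    4: {"}": 5},
}
ACCEPT = 5

def fake_model_logits():
    """Stand-in for a forward pass: this 'model' always wants to say hello."""
    return [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 5.0, 1.0]

def mask_logits(logits, state):
    """Set every token the automaton forbids in this state to -inf."""
    legal = TRANSITIONS[state]
    return [x if tok in legal else -math.inf for tok, x in zip(VOCAB, logits)]

def constrained_greedy():
    state, out = 0, []
    while state != ACCEPT:
        masked = mask_logits(fake_model_logits(), state)
        best = max(range(len(VOCAB)), key=lambda i: masked[i])  # greedy pick
        token = VOCAB[best]
        out.append(token)
        state = TRANSITIONS[state][token]  # automaton advances in lockstep
    return "".join(out)

text = constrained_greedy()
print(text)              # {"name":"Bob"}
print(json.loads(text))  # parses without error
```

The model's preference still matters wherever the grammar leaves a choice: in state 3 it picks `"Bob"` over `"Ada"` because that logit is higher. The mask guarantees the shape of the output, not that the chosen value is the right one.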
05 Speculative Decoding — Spending FLOPs to Buy Bandwidth
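The last unexplained trick from the doc-10 world. Decode is memory-bound (doc 08): each token costs a full read of the weights while the compute units sit idle. Speculative decoding exploits the idle compute: a small, fast draft model guesses the next γ tokens (say 5); the big model then scores all 5 in one parallel forward pass, which costs roughly the same single trip through memory as generating one token. Drafted tokens are accepted left to right until the first rejection; the rejected position is resampled from the big model's corrected distribution, so every pass yields at least one token and at most γ + 1.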
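The verification step can be sketched as follows. This assumes the standard speculative sampling rule: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the big model's distribution and q the draft model's; on rejection, resample from the normalized residual max(0, p − q). The function names and the two-token toy distributions are hypothetical, and the per-position distributions are passed in as plain dicts rather than computed by real models.

```python
import random

def sample(dist, rng):
    """Draw one token from a token -> weight map (weights need not sum to 1)."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

def verify_draft(draft_tokens, q_dists, p_dists, rng):
    """Accept or reject drafted tokens left to right.

    q_dists[i] / p_dists[i] are the draft / target distributions at position i.
    p_dists carries one extra entry, used for a bonus token when every
    drafted token is accepted.
    """
    output = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            output.append(tok)  # target model agrees often enough: keep it
            continue
        # First rejection: resample from what the draft under-covered, then stop.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        output.append(sample(residual, rng))
        return output
    # All drafts accepted: the same pass also yields one extra token.
    output.append(sample(p_dists[len(draft_tokens)], rng))
    return output

# Empirical check with one drafted token: outputs should follow p, not q.
rng = random.Random(0)
p = {"a": 0.7, "b": 0.3}   # target model (toy numbers)
q = {"a": 0.4, "b": 0.6}   # draft model (toy numbers)
trials = 20000
hits = 0
for _ in range(trials):
    draft = [sample(q, rng)]
    out = verify_draft(draft, [q], [p, p], rng)
    hits += out[0] == "a"
freq_a = hits / trials
print(f"frequency of 'a': {freq_a:.3f} (target 0.7, draft 0.4)")
```

Even though the draft model prefers `b`, the first emitted token follows the target's 70/30 split up to sampling noise. A poor draft model lowers the acceptance rate, and with it the speedup, but not the quality of the output.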
Note the symmetry with doc 08's FlashAttention lesson: there, recomputing beat remembering; here, guess-and-verify beats compute-when-asked. Both spend abundant FLOPs to save scarce bandwidth. Branch prediction did the same thing for CPU pipelines forty years ago: speculation is what mature architectures do when one resource outgrows another.
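Why guess-and-verify costs nothing in quality can be shown in a few lines. The derivation below assumes the standard speculative sampling rule, with symbols introduced here for illustration: p is the big model's distribution at a position, q is the draft model's, a drafted token x is accepted with probability min(1, p(x)/q(x)), and a rejected position is resampled from the normalized residual max(0, p − q).

$$
\begin{aligned}
\beta &= \sum_{x'} q(x')\left(1 - \min\!\left(1, \tfrac{p(x')}{q(x')}\right)\right)
       = 1 - \sum_{x'} \min\big(p(x'), q(x')\big)
       && \text{(probability of rejection)} \\
\sum_{x'} \max\big(0,\, p(x') - q(x')\big)
      &= \sum_{x'} \Big(p(x') - \min\big(p(x'), q(x')\big)\Big) = \beta
       && \text{(residual normalizer)} \\
P(\text{emit } x)
      &= \underbrace{q(x)\,\min\!\left(1, \tfrac{p(x)}{q(x)}\right)}_{\text{drafted and accepted}}
       + \underbrace{\beta \cdot \frac{p(x) - \min\big(p(x), q(x)\big)}{\beta}}_{\text{rejected, then resampled}} \\
      &= \min\big(p(x), q(x)\big) + p(x) - \min\big(p(x), q(x)\big) = p(x)
\end{aligned}
$$

The draft distribution q cancels out of the final line: it affects how often a rejection happens (β), and therefore the speedup, but not which tokens come out.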
06 Mental Models
Forward pass = belief (distribution); sampler = decision (token). Lets you reason about: why "the model is non-deterministic" is imprecise — the math is deterministic; the draw is configured by you. Blame the right component.
Low T: take the safe bet every time. High T: give long-shots a chance. The knowledge is identical. Lets you reason about: T≈0 for facts/extraction, T≈0.7–1.0 for prose/brainstorms, never "high T to make it smarter."
The schema is a type; masking makes ill-typed continuations unrepresentable — correctness by construction, not by validation-and-retry. Lets you reason about: when to use structured output (any machine-consumed text) and what it can't fix (legal JSON with wrong values).
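One caveat to the first model's "the math is deterministic", already flagged in the temperature = 0 row of the sampling table: floating-point addition is not associative, so the same sum accumulated in a different order can differ in its last bit. The sketch below shows this with ordinary Python floats and a contrived near-tie between two hypothetical tokens; on real hardware the reordering comes from batching and kernel choices rather than from parentheses.

```python
# The same three numbers, summed in two different orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

def greedy(scores):
    """Argmax over a token -> score map; ties go to the first token listed."""
    return max(scores, key=scores.get)

# Contrived near-tie: token_b's logit is the sum above, token_a's is fixed.
pick_one_order = greedy({"token_a": 0.6, "token_b": left})
pick_other_order = greedy({"token_a": 0.6, "token_b": right})
print(pick_one_order, pick_other_order)
```

The two greedy picks differ even though no randomness was involved. Such flips only occur when two logits are nearly tied, which is why T=0 runs usually agree and only occasionally diverge.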
07 Common Misconceptions
"Temperature 0 makes the model deterministic and correct." Deterministic-ish (hardware caveats) — but argmax of a wrong belief is a confidently wrong answer. T=0 removes variance, not error. Hallucinations survive T=0 happily.
"Higher temperature = more creative = better writing." High T means riskier token draws, which reads as surprising word choice — and just as often as incoherence. Genuine "creativity" lives in the model and the prompt; T only loosens the commitment policy.
"Asking for JSON in the prompt is how tool calls work." Production tool calling is grammar-constrained decoding — masked logits, automaton in lockstep. Prompt-only JSON is the 98%-reliable imitation of a 100%-reliable mechanism.
"Speculative decoding trades quality for speed." The rejection-sampling step makes the output distribution provably identical to the target model's. It trades idle compute for speed — quality is untouched. (Quantization trades quality; speculation doesn't. Different rows of doc 10's stack.)