Sampling & Decoding — The Dice Roll After the Math

01 The Big Picture

All the machinery of docs 06–08 (prefill, decode, the bandwidth wall) produces one thing per step: a list of roughly 100,000 numbers, one score per vocabulary token. The sampler turns that list into a single token. It is the smallest component in the stack, yet it controls creativity, determinism, JSON validity, and (via speculation) speed.

This doc ties up loose threads from across the series: doc 02 mentioned temperature/top-p in one line; doc 12 promised an explanation of how tool calls are guaranteed valid JSON; doc 13 kept saying "stochastic". Here is exactly where the randomness enters, and every knob that shapes it.

02 What: Logits → Probabilities

The final layer projects the model's hidden state onto the vocabulary: one raw score (logit) per token. Softmax turns scores into a probability distribution. Crucially, the model's output is always the whole distribution — "the model said X" really means "the sampler picked X from what the model believed."

// after the forward pass, for context "The capital of France is"
logits = { " Paris": 9.1, " the": 5.2, " located": 4.8, " a": 4.1, ... 100K more }
probs  = softmax(logits / T)   // T = temperature

// Illustrative values; the exact percentages depend on the ~100K-token tail.
// T = 1.0 :  Paris 92%, the 2%, located 1.2%, ...
// T = 0.3 :  Paris >99.99%        (sharpened, near-deterministic)
// T = 1.8 :  Paris 38%, the 4%, located 3.5%  (flattened, adventurous)

Temperature divides logits before softmax: low T exaggerates gaps (confident tokens dominate), high T compresses them (underdogs get real chances). T doesn't add knowledge or remove it — it reshapes how boldly the model commits to what it already believes.
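A minimal sketch of that reshaping, using only the four named logits from the example above. The function name `softmax_with_temperature` is ours, and because the toy vocabulary omits the ~100K-token tail, the printed percentages will not match the ones quoted above; only the direction of the effect carries over.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return token -> probability after dividing every logit by T."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat T = 0 as argmax instead")
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Toy four-token vocabulary (an assumption: the real tail is omitted).
logits = {" Paris": 9.1, " the": 5.2, " located": 4.8, " a": 4.1}

for t in (0.3, 1.0, 1.8):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 4) for tok, p in probs.items()})
```

The ranking of tokens never changes with T; only the gaps between them do, which is the sense in which temperature adds no knowledge.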

03 How: The Sampling Funnel

Production samplers chain filters: each knob below narrows or reshapes the candidate set before the final draw.

Knob	Mechanism	Use
temperature = 0 (greedy)	Always argmax	Extraction, classification, math — but beware: greedy ≠ globally best sentence, and even T=0 isn't bit-exact across batches/hardware (float non-associativity)
top-k	Keep k highest, renormalize	Blunt tail-cutting; k fixed regardless of confidence
top-p (nucleus)	Keep smallest set with cumulative prob ≥ p	Adaptive: confident step → few candidates; uncertain step → many. The sane default (~0.9)
min-p	Keep tokens ≥ p × P(best)	Newer; behaves well at high temperatures
frequency/presence penalty	Down-weight already-used tokens	Fights repetition loops, the classic greedy-decoding pathology
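The three truncation filters in the table can be sketched in a few lines each. This is a simplified illustration over plain dicts of probabilities, not any particular library's implementation; the two example distributions are made up to contrast a confident step with an uncertain one.

```python
def renormalize(probs):
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}

def by_probability(probs):
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

def top_k(probs, k):
    """Keep the k most probable tokens, whatever their mass."""
    return renormalize(dict(by_probability(probs)[:k]))

def top_p(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    kept, cumulative = {}, 0.0
    for tok, prob in by_probability(probs):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)

def min_p(probs, p):
    """Keep tokens whose probability is at least p times the top probability."""
    threshold = p * max(probs.values())
    return renormalize({tok: q for tok, q in probs.items() if q >= threshold})

# Hypothetical distributions for illustration only.
confident = {" Paris": 0.92, " the": 0.04, " located": 0.02, " a": 0.02}
uncertain = {" red": 0.28, " blue": 0.24, " green": 0.20, " gold": 0.16, " grey": 0.12}

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    print(name,
          "top_k(3):", len(top_k(dist, 3)),
          "top_p(0.9):", len(top_p(dist, 0.9)),
          "min_p(0.5):", len(min_p(dist, 0.5)))
```

On these toy inputs, top-p keeps a single candidate on the confident step and all five on the uncertain one, while top-k keeps three either way. That is the "adaptive versus fixed" distinction from the table.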

Why sampling at all — why not always take the best token? Because the best next token is not the best sequence — greedy is a locally-optimal path through an exponential tree (the same reason greedy algorithms fail generally). Worse, argmax chains drift into degenerate repetition ("the the the…"). Controlled randomness is regularization against the search problem's myopia — and the direct cause of doc 13's "run every eval case N times."
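The "best next token is not the best sequence" claim can be checked on a tree small enough to search exhaustively. `TOY_MODEL` below is a made-up two-step distribution, not output from any real model, and the sketch illustrates only the myopia of argmax, not how sampling mitigates it.

```python
# Hypothetical two-step "model": next-token probabilities given the prefix.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"z": 0.9, "w": 0.1},
}

def greedy_decode(model, steps):
    """Take the argmax token at every step and track the sequence probability."""
    prefix, prob = (), 1.0
    for _ in range(steps):
        dist = model[prefix]
        token = max(dist, key=dist.get)
        prob *= dist[token]
        prefix += (token,)
    return prefix, prob

def best_sequence(model, steps, prefix=(), prob=1.0):
    """Exhaustive search; feasible only because the toy tree is tiny."""
    if steps == 0:
        return prefix, prob
    candidates = (
        best_sequence(model, steps - 1, prefix + (tok,), prob * p)
        for tok, p in model[prefix].items()
    )
    return max(candidates, key=lambda pair: pair[1])

greedy = greedy_decode(TOY_MODEL, 2)
best = best_sequence(TOY_MODEL, 2)
print("greedy:", greedy)
print("best:  ", best)
```

Greedy commits to "A" because 0.6 beats 0.4, then finds no strong continuation (about 0.33 overall). The path through "B" starts worse but ends at 0.36. With a 100K-way branch at every step, the exhaustive search used here is not an option, which is why practical decoders settle for local decisions.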

04 Structured Output — Determinism Where It Counts

Doc 12's tool calls must be valid JSON every single time. Prompting "please respond in JSON" gets you perhaps 98%, and the other 2% are production incidents. The real mechanism is constrained decoding: compile the JSON schema (or any grammar) into an automaton; at each step, mask the logits of every token that would violate it by setting them to −∞ before sampling. The model literally cannot emit an illegal character. After {"name": a string must follow, so every token that cannot begin a string (anything other than optional whitespace or an opening quote) is masked; sampling proceeds freely among the legal options only.
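A toy version of the masking loop, under heavy simplifying assumptions: the nine-token `VOCAB`, the hand-written `TRANSITIONS` automaton, and the `fake_logits` stand-in are all hypothetical. Real engines operate over subword tokens and compile the automaton from a schema rather than writing it by hand.

```python
import json
import math
import random

VOCAB = ["{", '"name"', ":", '"', "Paris", "Lyon", "}", "Sure", "!"]

# Hand-written automaton for the toy grammar {"name":"Paris"} or {"name":"Lyon"}.
# state -> {legal token: next state}
TRANSITIONS = {
    "start": {"{": "key"},
    "key": {'"name"': "colon"},
    "colon": {":": "open_quote"},
    "open_quote": {'"': "value"},
    "value": {"Paris": "close_quote", "Lyon": "close_quote"},
    "close_quote": {'"': "close_brace"},
    "close_brace": {"}": "done"},
}

def fake_logits(rng):
    """Stand-in for a model: random scores, biased toward chatty tokens."""
    return {tok: rng.gauss(0, 1) + (3.0 if tok in ("Sure", "!") else 0.0)
            for tok in VOCAB}

def masked_sample(logits, legal, rng):
    """Set illegal logits to -inf, softmax the rest, and draw one token."""
    masked = {tok: (score if tok in legal else -math.inf)
              for tok, score in logits.items()}
    peak = max(masked.values())
    weights = [math.exp(masked[tok] - peak) for tok in VOCAB]  # exp(-inf) is 0.0
    return rng.choices(VOCAB, weights=weights, k=1)[0]

def generate(seed):
    rng = random.Random(seed)
    state, out = "start", []
    while state != "done":
        legal = TRANSITIONS[state]
        token = masked_sample(fake_logits(rng), legal, rng)
        out.append(token)
        state = legal[token]  # advance the automaton in lockstep
    return "".join(out)

text = generate(seed=0)
print(text, json.loads(text))
```

Even though the stand-in model strongly prefers "Sure" and "!", those tokens have zero weight at every step, so the output always parses. The only step with real freedom is the value, where the model's own scores decide between the two legal options.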

DSA connection: this is a DFA/pushdown-automaton product construction running in lockstep with generation, i.e. compiler theory (lexers, parser states) applied as a logit filter. Creativity inside the grammar, zero freedom outside it. It's also why structured output can have near-zero per-token overhead: applying the mask is just a vector add before softmax, provided the set of legal tokens for each automaton state is precomputed or cached.

05 Speculative Decoding — Spending FLOPs to Buy Bandwidth

The last unexplained trick from the doc-10 world. Decode is memory-bound (doc 08): each token costs a full read of the weights, while the compute units idle. Speculative decoding exploits the idle compute: a small, fast draft model guesses the next γ tokens (say 5); the big model then verifies all 5 in one parallel forward pass, which costs roughly the same single trip through memory as generating one token would.

Draft: the small model (or extra "Medusa" heads, or simple n-gram lookup) cheaply proposes "… the capital of France is Paris".

Verify in parallel: the big model scores all proposed positions at once — prefill-style parallelism applied to decode (doc 06's two personalities, hybridized).

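Accept the longest prefix the big model agrees with, using a rejection-sampling correction that keeps the output distribution mathematically identical to the big model's. The first rejected position is replaced with a token drawn from a corrected distribution, so every pass yields at least one token. Easy text → 4–5 tokens per memory-trip; hard text → fall back to ~1. Commonly reported end-to-end: 2–3× faster, zero quality change.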
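The accept step can be sketched as follows, assuming the standard speculative-sampling rule: accept a drafted token with probability min(1, p/q), and on rejection resample from the normalized positive part of p − q. Here p is the big model's distribution and q the draft's; all function names and the two toy distributions are illustrative.

```python
import random

def sample(dist, rng):
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

def residual(target, draft):
    """Normalized max(0, p - q): what to resample from after a rejection."""
    diff = {tok: max(0.0, target[tok] - draft.get(tok, 0.0)) for tok in target}
    total = sum(diff.values())
    return {tok: d / total for tok, d in diff.items()}

def verify(drafted, draft_dists, target_dists, rng):
    """Accept a prefix of `drafted`, then emit one corrected or bonus token.

    draft_dists[i] is q at position i; target_dists has one extra entry
    for the bonus position that follows a fully accepted draft.
    """
    out = []
    for i, tok in enumerate(drafted):
        p, q = target_dists[i], draft_dists[i]
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            out.append(tok)                      # big model agrees often enough
        else:
            out.append(sample(residual(p, q), rng))
            return out                           # discard the rest of the draft
    out.append(sample(target_dists[len(drafted)], rng))
    return out

# Monte Carlo check on one position: the emitted token should follow p, not q.
rng = random.Random(0)
q = {"a": 0.9, "b": 0.1}   # draft model is overconfident in "a"
p = {"a": 0.5, "b": 0.5}   # target model is evenly split
counts = {"a": 0, "b": 0}
for _ in range(20000):
    drafted = [sample(q, rng)]
    counts[verify(drafted, [q], [p, p], rng)[0]] += 1
print(counts)
```

The counts should come out close to an even split even though the draft proposes "a" nine times out of ten. That is the sense in which the correction preserves the big model's distribution. In a real system the target distributions for all drafted positions come from one batched forward pass, which this sketch takes as given.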

Note the symmetry with doc 08's FlashAttention lesson: there, recompute beat remembering; here, guess-and-verify beats compute-when-asked. Both convert abundant FLOPs into scarce bandwidth. Branch prediction did the same thing to CPU pipelines forty years ago — speculation is what mature architectures do when one resource outgrows another.

06 Mental Models

The model proposes, the sampler disposes

Forward pass = belief (distribution); sampler = decision (token). Lets you reason about: why "the model is non-deterministic" is imprecise — the math is deterministic; the draw is configured by you. Blame the right component.

At scale, batching and float order add hardware-level wobble even at T=0.
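The float-order wobble is easy to reproduce on a laptop. The sketch below sums the same three numbers in two orders, then feeds the results into a contrived near-tie. The tokens, the rival logit of 0.6, and the tie-breaking rule are assumptions chosen to make the flip visible, not a trace of real hardware behaviour.

```python
contributions = [0.1, 0.2, 0.3]

def accumulate(values):
    """Plain left-to-right floating-point sum."""
    total = 0.0
    for v in values:
        total += v
    return total

forward = accumulate(contributions)             # (0.1 + 0.2) + 0.3
backward = accumulate(reversed(contributions))  # (0.3 + 0.2) + 0.1
print(forward == backward)
print(repr(forward), repr(backward))

def greedy_pick(logits):
    # Argmax; on an exact tie the token listed first wins.
    return max(logits, key=logits.get)

rival = 0.6  # hypothetical logit of a competing token
pick_a = greedy_pick({" no": rival, " yes": forward})
pick_b = greedy_pick({" no": rival, " yes": backward})
print(pick_a, pick_b)
```

The two sums differ by a single unit in the last place, yet that is enough to change which token argmax returns here. Real logits are rarely this close, which is why the effect shows up as occasional wobble rather than constant disagreement.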

Temperature is a risk dial, not an IQ dial

Low T: take the safe bet every time. High T: give long-shots a chance. The knowledge is identical. Lets you reason about: T≈0 for facts/extraction, T≈0.7–1.0 for prose/brainstorms, never "high T to make it smarter."

At extreme T the distribution flattens toward uniform — beyond risk-taking into incoherence.
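One way to see the flattening is to track the entropy of the distribution as T grows. The sketch pads the four example logits with 96 zero-logit tokens to make a toy 100-token vocabulary; that padding is an assumption, not data from a real model.

```python
import math

def softmax(logits, temperature):
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; 0 means certain, log2(V) means uniform."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [9.1, 5.2, 4.8, 4.1] + [0.0] * 96   # toy 100-token vocabulary
uniform_bits = math.log2(len(logits))
temperatures = (0.3, 1.0, 1.8, 5.0, 50.0)
entropies = [entropy_bits(softmax(logits, t)) for t in temperatures]

for t, h in zip(temperatures, entropies):
    print(f"T={t:<5} entropy={h:.2f} bits (uniform would be {uniform_bits:.2f})")
```

Entropy rises monotonically with T and approaches the uniform ceiling. Well before that ceiling, the model's preferences have stopped mattering and the draw is close to picking tokens at random.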

Grammar masking = type checking at generation time

The schema is a type; masking makes ill-typed continuations unrepresentable — correctness by construction, not by validation-and-retry. Lets you reason about: when to use structured output (any machine-consumed text) and what it can't fix (legal JSON with wrong values).

Over-tight grammars can force the model down low-probability paths — valid syntax, degraded content.

07 Common Misconceptions

"Temperature 0 makes the model deterministic and correct." Deterministic-ish (hardware caveats) — but argmax of a wrong belief is a confidently wrong answer. T=0 removes variance, not error. Hallucinations survive T=0 happily.

"Higher temperature = more creative = better writing." High T means riskier token draws, which reads as surprising word choice — and just as often as incoherence. Genuine "creativity" lives in the model and the prompt; T only loosens the commitment policy.

"Asking for JSON in the prompt is how tool calls work." Production tool calling is grammar-constrained decoding — masked logits, automaton in lockstep. Prompt-only JSON is the 98%-reliable imitation of a 100%-reliable mechanism.

"Speculative decoding trades quality for speed." The rejection-sampling step makes the output distribution provably identical to the target model's. It trades idle compute for speed — quality is untouched. (Quantization trades quality; speculation doesn't. Different rows of doc 10's stack.)

Next: everything so far emitted words. Doc 15: how models came to see images — and how generating an image works on completely different mathematics than generating a sentence.