Evaluation — Measuring the Unmeasurable

01 The Big Picture

Doc 12 ended on an uncomfortable note: the agent decides for itself when it's done, and that judgment is fallible. Classical software answers "does it work?" with tests. LLM systems can't, at least not directly: the same input yields different outputs, many different answers can be valid at once, and failures look fluent.

Evaluation is the discipline that replaces "it seemed good when I tried it" with measurement. It is also, quietly, the highest-leverage skill in applied AI: the teams that ship reliable LLM products are rarely the ones with secret prompting tricks — they're the ones who can detect a regression in an afternoon. Without evals, every prompt tweak, model upgrade, and RAG change is a blind bet; with them, it's an experiment.

🔑

The engineer's translation: evals are to LLM systems what tests + monitoring are to services. You wouldn't refactor without a test suite; don't change a prompt without an eval. Same muscle, fuzzier assertions.

02 What an Eval Is (Precisely)

An eval = dataset + task definition + grader + metric. Run the system over the dataset, grade each output, aggregate. The four pieces matter more than any tooling:

Piece	What it is	The trap
Dataset	Representative inputs — ideally mined from real traffic + curated hard cases	Toy examples that don't resemble production; too few items (<30 tells you almost nothing)
Task definition	What "good" means, written down precisely enough to grade	If you can't define good, you can't measure it — most "eval problems" are actually spec problems
Grader	Code, exact match, or an LLM judging each output	Trusting an ungraded grader (see §05)
Metric	Pass rate, score distribution, pass@k, cost & latency alongside	One average hiding a bimodal disaster (90% brilliant, 10% catastrophic)
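
The four pieces can be sketched in a few lines of Python. This is a minimal illustration, not a recommended framework: `Case`, `run_eval`, `toy_classifier`, and `exact_match` are hypothetical names invented for the example, the "system" is a keyword rule standing in for a model call, and the three-item dataset is far below the size the table warns about.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str    # dataset: one representative input
    expected: str  # task definition: what "good" means for this input


def run_eval(system: Callable[[str], str],
             dataset: list[Case],
             grader: Callable[[str, Case], bool]) -> dict:
    """Run the system over every case, grade each output, aggregate."""
    results = [grader(system(case.prompt), case) for case in dataset]
    passed = sum(results)
    return {
        "n": len(results),
        "passed": passed,
        "pass_rate": passed / len(results) if results else 0.0,
        # Keep the failing inputs: the aggregate alone hides what broke.
        "failures": [c.prompt for c, ok in zip(dataset, results) if not ok],
    }


# Hypothetical stand-in for the system under test (a real one would call a model).
def toy_classifier(ticket: str) -> str:
    return "billing" if "invoice" in ticket.lower() else "other"


# Grader: plain exact match against the expected label.
def exact_match(output: str, case: Case) -> bool:
    return output == case.expected


dataset = [
    Case("Invoice #12 is wrong", "billing"),
    Case("App crashes on login", "other"),
    Case("I was charged twice", "billing"),
]

report = run_eval(toy_classifier, dataset, exact_match)
print(report)
```

The third case fails because the toy rule only knows the word "invoice". Returning the failing inputs next to the pass rate is a deliberate choice: the number tells you that something regressed, the list tells you what.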

What an eval is not: a benchmark you read about (that's someone else's eval, of someone else's task), and not a vibe check in a playground — single samples from a stochastic system are anecdotes, not data.
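
To put a number on "anecdotes, not data", here is a small worked calculation. It assumes each sample is independent and that the system has a fixed true pass rate; both are simplifications, and `prob_all_pass` is a name made up for this sketch.

```python
def prob_all_pass(true_pass_rate: float, n_samples: int) -> float:
    """Chance that n independent samples all pass, given a fixed true pass rate."""
    return true_pass_rate ** n_samples


# How often does a flawed system look perfect in a five-sample playground check?
for rate in (0.5, 0.7, 0.9):
    print(f"true pass rate {rate:.0%}: "
          f"all 5 samples pass {prob_all_pass(rate, 5):.1%} of the time")
```

Under these assumptions, a system that fails three times in ten still produces five clean samples about 17% of the time, and one that fails one time in ten does so about 59% of the time. A clean playground session cannot tell those systems apart from a perfect one.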

03 Why Public Benchmarks Mislead

MMLU, HumanEval, SWE-bench, LMArena — useful for tracking the field, dangerous for your decisions. The failure modes are structural:

Contamination

Benchmarks are published text; published text is training data. A model can "ace" questions it has effectively memorized. Scores ratchet up while the benchmark's power to separate real capability from memorization quietly shrinks. Every popular benchmark decays this way: Goodhart's law with a GPU budget.

Distribution mismatch

MMLU is multiple-choice trivia; your workload is "refactor this C++ module without breaking the build." Ranking on the former predicts the latter weakly. The benchmark that matters is the one whose distribution matches your traffic — which only you can build.

Saturation & selection

Once top models cluster at 88–92%, the remaining differences are mostly sampling noise and mislabeled or ambiguous questions. And vendors choose which benchmarks to report. A scoreboard is also an advertisement.

Arena ≠ accuracy

Preference leaderboards measure what raters prefer — which rewards confidence, length, and formatting as much as correctness. Models can win arenas by being charming. Your invoice parser doesn't need charm.

Right use of benchmarks: shortlist 2–3 candidate models. Then decide with your own eval — the subject of the next section.

04 How to Build an Eval — The Loop

The realistic workflow, including the steps everyone skips. Example task: support-ticket summarizer.

For agents (doc 12), grade outcomes, not transcripts: did the test pass, the file get created, the booking exist? Define a checkable end-state per case and let the grader inspect the world, not the prose. Where multiple paths are valid, outcome-grading is the only honest option — and report pass@k (any success in k attempts) separately from pass rate; users retry, so both matter.
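
The difference between pass rate and pass@k is easy to see on a small table of outcome-graded attempts. The sketch below uses the definition given above (any success in k attempts) computed directly from recorded runs; the case names and results are invented for illustration, and this is the plain empirical count rather than a statistically corrected estimator.

```python
def pass_rate(attempts: dict[str, list[bool]]) -> float:
    """Fraction of all individual attempts that reached the checked end-state."""
    flat = [ok for runs in attempts.values() for ok in runs]
    return sum(flat) / len(flat)


def pass_at_k(attempts: dict[str, list[bool]], k: int) -> float:
    """Fraction of cases with at least one success in the first k attempts."""
    return sum(any(runs[:k]) for runs in attempts.values()) / len(attempts)


# Hypothetical results: each case was run three times, and each run was graded
# by inspecting the world (file exists, test passes, booking exists).
attempts = {
    "create_file":      [True, True, True],
    "fix_failing_test": [False, True, False],
    "book_meeting":     [False, False, False],
    "rename_module":    [True, False, True],
}

print(f"pass rate: {pass_rate(attempts):.2f}")
print(f"pass@1:    {pass_at_k(attempts, 1):.2f}")
print(f"pass@3:    {pass_at_k(attempts, 3):.2f}")
```

Here half of all attempts succeed, but three of four cases succeed at least once in three tries. The gap between the two numbers is a rough measure of how much a retry buys the user; the case that never succeeds shows up in both.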

05 Graders — Including the Judge Problem

Grader	Use for	Caveat
Exact / regex match	Classification, extraction, structured output	Brittle to harmless variation — normalize first
Code assertions	Agents & codegen: run the tests, check the artifact	The gold standard when possible — arrange your task so it is
Embedding similarity	Soft "is this close to reference?"	Similar ≠ correct; use only as a weak signal (doc 11's lossy geometry)
LLM-as-judge	Fluency, faithfulness, coverage — anything fuzzy	The judge is itself an unevaluated model. See below.
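
The "normalize first" caveat for exact matching can be sketched as follows. The specific normalization steps (Unicode folding, case, whitespace, trailing punctuation) are assumptions chosen for illustration; which variations count as harmless depends on the task, and `normalize` and `normalized_match` are hypothetical helper names.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Collapse variation that should not count as a wrong answer."""
    text = unicodedata.normalize("NFKC", text)  # fold Unicode look-alikes
    text = text.strip().lower()                 # ignore case and outer whitespace
    text = re.sub(r"\s+", " ", text)            # collapse internal whitespace runs
    return text.rstrip(".!").strip()            # drop trailing sentence punctuation


def normalized_match(output: str, expected: str) -> bool:
    """Exact match after both sides go through the same normalization."""
    return normalize(output) == normalize(expected)


print(normalized_match("  Billing.\n", "billing"))
print(normalized_match("Refund  request", "refund request"))
print(normalized_match("billing", "shipping"))
```

Both sides pass through the same function, so the grader stays symmetric. Anything stripped here is a decision that the variation is harmless, which makes the normalizer part of the task definition and worth reviewing like one.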

The judge problem: LLM judges have measurable biases — they favor longer answers, the first option shown, confident tone, and their own model family's style. The fix is to eval the evaluator: take ~50 outputs, grade them yourself, and measure judge–human agreement before trusting it. Mitigations: rubric-anchored prompts ("score 1–5 against THESE criteria"), randomized pair order, forced citations of evidence, a stronger model as judge than the one being judged. An unvalidated judge doesn't remove vibes — it launders them through an API.
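
Measuring judge–human agreement takes very little code. The sketch below computes raw agreement plus Cohen's kappa, a standard chance-corrected agreement statistic that is an addition here and not something the text above prescribes. The labels are made up, and ten items are used only to keep the example readable (the text suggests about 50).

```python
from collections import Counter


def agreement(human: list[str], judge: list[str]) -> float:
    """Fraction of items where the judge's label equals the human's."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for what two raters would reach by chance."""
    n = len(human)
    observed = agreement(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so report perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels: your own grades vs. the LLM judge's grades.
human = ["pass"] * 6 + ["fail"] * 4
judge = ["pass"] * 5 + ["fail"] + ["fail"] * 3 + ["pass"]

print(f"raw agreement: {agreement(human, judge):.2f}")
print(f"cohen's kappa: {cohens_kappa(human, judge):.2f}")
```

Raw agreement of 0.80 looks comfortable, but with a 60/40 label split two raters would agree about half the time by chance, and kappa comes out near 0.58. Reading the disagreements one by one is usually more informative than either number.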

06 Mental Models

Evals are unit tests with confidence intervals

Same role in the dev loop (run before merge, block on regression), but assertions are statistical: pass rates over distributions, not booleans over cases. Lets you reason about: continuous integration for prompts, sample sizes (30 cases ≈ smoke test, 300 ≈ signal), flakiness as data rather than annoyance.

Unlike unit tests, 100% is suspicious — it usually means your dataset is too easy, not your system perfect.
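
The "30 ≈ smoke test, 300 ≈ signal" rule of thumb can be made concrete with a confidence interval on the pass rate. The sketch below uses the Wilson score interval at roughly 95% confidence (z = 1.96); that choice of interval is an assumption made for this example, and it treats cases as independent draws from the traffic you care about.

```python
import math


def wilson_interval(passed: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 is about 95% confidence)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passed / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Same observed pass rate (80%), very different certainty.
for passed, n in [(24, 30), (240, 300)]:
    lo, hi = wilson_interval(passed, n)
    print(f"{passed}/{n}: pass rate {passed / n:.0%}, interval {lo:.1%} to {hi:.1%}")
```

At 30 cases an observed 80% is compatible with a true rate anywhere from roughly 63% to 90%, so a five-point "improvement" is indistinguishable from noise. At 300 cases the interval narrows to roughly 75% to 84%, which is tight enough to block a merge on.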

Goodhart's ratchet

Any published metric becomes a target; any target gets gamed — by training data, by tuning, by selection. Benchmarks decay the moment they matter. Lets you reason about: why private, refreshed, traffic-derived datasets hold value while public ones inflate.

Decay rates differ — outcome-graded, contamination-resistant benchmarks (live codebases, fresh problems) age slower.

The eval IS the spec

Writing the grader forces you to define what good means — and most teams discover they never had. A failing eval case is a requirement document in executable form. Lets you reason about: why eval-writing improves systems even before anything is measured.

Specs drift; datasets must be re-mined from real traffic or they describe last quarter's product.

07 Common Misconceptions

"Model X tops the leaderboard, so it's the best for us." Best on that distribution. Models that differ by 2 points on MMLU routinely differ by 20 on a specific task — in either direction. Shortlist by benchmark, decide by your eval.

"I tried it on five examples and it works." Five samples of a stochastic system at temperature > 0 is a coin flipped five times. And the examples you tried are the ones you thought of — production sends the ones you didn't.

"We'll add evals after we ship." Backwards economics: pre-ship, an eval costs a day and catches regressions in minutes; post-ship, every regression is discovered by users and triaged from anecdotes. The eval suite is cheapest exactly when it feels premature.

"LLM-as-judge solves grading." It moves grading to a model with biases you haven't measured. Validate the judge against human labels once; then it's a tool. Skip that step and your metric is an opinion with decimals.

🗺️

Next: evals measure outputs — but what makes the same prompt produce different outputs at all? Doc 14: the dice roll at the end of every forward pass, and how to control it.