01 The Big Picture
Doc 12 ended on an uncomfortable note: the agent decides for itself when it's done, and that judgment is fallible. Classical software answers "does it work?" with tests. LLM systems can't rely on fixed assertions: the same input yields different outputs, there are infinitely many valid answers, and failures look fluent.
Evaluation is the discipline that replaces "it seemed good when I tried it" with measurement. It is also, quietly, the highest-leverage skill in applied AI: the teams that ship reliable LLM products are rarely the ones with secret prompting tricks — they're the ones who can detect a regression in an afternoon. Without evals, every prompt tweak, model upgrade, and RAG change is a blind bet; with them, it's an experiment.
02 What an Eval Is (Precisely)
An eval = dataset + task definition + grader + metric. Run the system over the dataset, grade each output, aggregate. The four pieces matter more than any tooling:
| Piece | What it is | The trap |
|---|---|---|
| Dataset | Representative inputs — ideally mined from real traffic + curated hard cases | Toy examples that don't resemble production; too few items (<30 tells you almost nothing) |
| Task definition | What "good" means, written down precisely enough to grade | If you can't define good, you can't measure it — most "eval problems" are actually spec problems |
| Grader | Code, exact match, or an LLM judging each output | Trusting an ungraded grader (see §05) |
| Metric | Pass rate, score distribution, pass@k, cost & latency alongside | One average hiding a bimodal disaster (90% brilliant, 10% catastrophic) |
What an eval is not: a benchmark you read about (that's someone else's eval, of someone else's task), and not a vibe check in a playground — single samples from a stochastic system are anecdotes, not data.
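The four pieces above can be sketched in a few lines of Python. This is a minimal illustration, not a reference harness: `Case`, `run_eval`, `toy_system`, `exact_grader`, and both thresholds are hypothetical names and values chosen for the example. It reports a catastrophic-failure rate alongside the average so that a bimodal result is not hidden behind one number.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Case:
    """One dataset item: an input plus whatever the grader needs."""
    case_id: str
    input_text: str
    expected: str


# Hypothetical signatures: a system maps input text to output text;
# a grader maps (output, case) to a score in [0, 1].
System = Callable[[str], str]
Grader = Callable[[str, Case], float]


def run_eval(system: System, dataset: List[Case], grader: Grader,
             pass_threshold: float = 0.5,
             bad_threshold: float = 0.2) -> Dict[str, float]:
    """Run the system over the dataset, grade each output, aggregate."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    scores = [grader(system(case.input_text), case) for case in dataset]
    return {
        "n": len(scores),
        "mean_score": mean(scores),
        "pass_rate": mean([float(s >= pass_threshold) for s in scores]),
        # Share of very low scores, reported next to the average.
        "catastrophic_rate": mean([float(s <= bad_threshold) for s in scores]),
    }


def toy_system(text: str) -> str:
    """Stand-in for the real system under test."""
    return text.strip().lower()


def exact_grader(output: str, case: Case) -> float:
    """Simplest possible grader: 1.0 on exact match, else 0.0."""
    return 1.0 if output == case.expected else 0.0


dataset = [
    Case("t1", "  Refund ", "refund"),
    Case("t2", "BILLING", "billing"),
    Case("t3", "login", "account"),  # deliberately failing case
]
report = run_eval(toy_system, dataset, exact_grader)
print(report)
```

Swapping in a real system or a different grader changes only the two callables; the dataset and the aggregation stay put, which is what makes runs comparable over time.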
03 Why Public Benchmarks Mislead
MMLU, HumanEval, SWE-bench, LMArena — useful for tracking the field, dangerous for your decisions. The failure modes are structural:
Contamination
Benchmarks are published text; published text is training data. A model can "ace" questions it has effectively memorized. Scores ratchet up while the link between score and real capability quietly weakens. Every popular benchmark decays this way: Goodhart's law with a GPU budget.
Distribution mismatch
MMLU is multiple-choice trivia; your workload is "refactor this C++ module without breaking the build." Ranking on the former predicts the latter weakly. The benchmark that matters is the one whose distribution matches your traffic — which only you can build.
Saturation & selection
Once top models cluster at 88–92%, the remaining differences are mostly sampling noise and flawed questions. And vendors choose which benchmarks to report. A scoreboard is also an advertisement.
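A rough back-of-the-envelope check of the noise claim, assuming each question is an independent pass/fail trial (a simplification) and using a normal approximation. The function names and the benchmark sizes of 500 and 5000 questions are hypothetical, picked only to show the effect of size.

```python
from math import sqrt


def accuracy_margin(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an accuracy measured on n questions."""
    if n_questions <= 0 or not 0.0 <= accuracy <= 1.0:
        raise ValueError("need n_questions > 0 and accuracy in [0, 1]")
    return z * sqrt(accuracy * (1.0 - accuracy) / n_questions)


def gap_is_within_noise(acc_a: float, acc_b: float, n_questions: int,
                        z: float = 1.96) -> bool:
    """True if the gap between two scores is inside the approximate noise band.

    Assumption: the two scores are treated as independent samples. Both
    models actually answer the same questions, so this is only a rough check.
    """
    var_a = acc_a * (1.0 - acc_a) / n_questions
    var_b = acc_b * (1.0 - acc_b) / n_questions
    return abs(acc_a - acc_b) <= z * sqrt(var_a + var_b)


# A 90% score on 500 questions carries a margin of roughly +/- 2.6 points.
print(round(accuracy_margin(0.90, 500), 4))

# 90% vs 92%: indistinguishable on 500 questions, distinguishable on 5000.
print(gap_is_within_noise(0.90, 0.92, 500))
print(gap_is_within_noise(0.90, 0.92, 5000))
```

Under these assumptions a two-point lead only becomes meaningful once the benchmark is far larger than a few hundred items.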
Arena ≠ accuracy
Preference leaderboards measure what raters prefer — which rewards confidence, length, and formatting as much as correctness. Models can win arenas by being charming. Your invoice parser doesn't need charm.
Right use of benchmarks: shortlist 2–3 candidate models. Then decide with your own eval — the subject of the next section.
04 How to Build an Eval — The Loop
The realistic workflow, including the steps everyone skips. Example task: support-ticket summarizer.
For agents (doc 12), grade outcomes, not transcripts: did the test pass, was the file created, does the booking exist? Define a checkable end-state per case and let the grader inspect the world, not the prose. Where multiple paths are valid, outcome-grading is the only honest option. Report pass@k (at least one success in k attempts) separately from pass rate; users retry, so both matter.
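One way to compute pass@k next to plain pass rate, sketched under the assumption that each case was attempted `n` times with `c` successes. The formula is the combinatorial estimator commonly used for code-generation benchmarks, 1 − C(n−c, k) / C(n, k); the function names and the sample numbers are illustrative.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k attempts succeeds,
    given n recorded attempts of which c succeeded."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures recorded: any k attempts include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def suite_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over cases; each entry is (attempts, successes)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# One case, 10 attempts, 3 successes.
print(pass_at_k(10, 3, 1))  # equals the plain pass rate, 0.3 up to float rounding
print(pass_at_k(10, 3, 5))  # about 0.917: retries hide a 70% per-attempt failure rate
```

Reading the two numbers together shows why they should be reported separately: k=1 describes what a single attempt is worth, larger k describes what a user who retries eventually gets.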
05 Graders — Including the Judge Problem
| Grader | Use for | Caveat |
|---|---|---|
| Exact / regex match | Classification, extraction, structured output | Brittle to harmless variation — normalize first |
| Code assertions | Agents & codegen: run the tests, check the artifact | The gold standard when possible — arrange your task so it is |
| Embedding similarity | Soft "is this close to reference?" | Similar ≠ correct; use only as a weak signal (doc 11's lossy geometry) |
| LLM-as-judge | Fluency, faithfulness, coverage — anything fuzzy | The judge is itself an unevaluated model. See below. |
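The "normalize first" caveat for exact match can look like the sketch below. Which variations count as harmless depends on the task (case matters for code identifiers, for example), so the rules in the hypothetical `normalize` helper are assumptions for a label-classification setting, not a general recipe.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Remove variation assumed to be harmless for this task."""
    text = unicodedata.normalize("NFKC", text)  # unify Unicode lookalikes
    text = text.strip().lower()                 # ignore case and outer whitespace
    text = re.sub(r"\s+", " ", text)            # collapse internal whitespace
    return text.rstrip(".!?")                   # drop trailing sentence punctuation


def exact_match(output: str, expected: str) -> bool:
    """Exact-match grader applied after normalization."""
    return normalize(output) == normalize(expected)


print(exact_match("  Billing  Issue.\n", "billing issue"))  # True
print(exact_match("refund", "billing issue"))               # False
```

Keeping normalization in one function makes the grader's tolerance explicit and reviewable instead of scattering ad hoc string fixes across the eval.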
The judge problem: LLM judges have measurable biases — they favor longer answers, the first option shown, confident tone, and their own model family's style. The fix is to eval the evaluator: take ~50 outputs, grade them yourself, and measure judge–human agreement before trusting it. Mitigations: rubric-anchored prompts ("score 1–5 against THESE criteria"), randomized pair order, forced citations of evidence, a stronger model as judge than the one being judged. An unvalidated judge doesn't remove vibes — it launders them through an API.
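Measuring judge–human agreement can be as simple as the sketch below, which assumes you already have one human label and one judge label per output. Raw agreement is easy to read; Cohen's kappa corrects it for agreement expected by chance. The labels here are made-up illustration data, and what counts as "good enough" agreement is a decision this sketch does not make.

```python
from collections import Counter
from typing import List


def agreement_rate(human: List[str], judge: List[str]) -> float:
    """Fraction of items where judge and human gave the same label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: List[str], judge: List[str]) -> float:
    """Agreement corrected for chance, based on each rater's label frequencies."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(human_counts[label] * judge_counts[label]
                   for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up labels for eight outputs.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "pass", "pass", "fail", "fail", "pass", "fail"]

print(agreement_rate(human_labels, judge_labels))          # 0.75
print(round(cohens_kappa(human_labels, judge_labels), 3))  # 0.467
```

In this toy data the judge agrees with the human 75% of the time, yet kappa is under 0.5 because most of that agreement would occur by chance given how often both say "pass".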
06 Mental Models
Evals as statistical unit tests. Evals play the same role in the dev loop as tests do (run before merge, block on regression), but the assertions are statistical: pass rates over distributions, not booleans over cases. Lets you reason about: CI for prompts, sample sizes (30 cases ≈ smoke test, 300 ≈ signal), flakiness as data rather than annoyance.
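The sample-size rule of thumb (30 cases ≈ smoke test, 300 ≈ signal) can be checked with a confidence interval on the pass rate. The sketch below uses the Wilson score interval, one standard choice for binomial proportions; the 80% pass rate is an arbitrary example value.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate of passes/n."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1.0 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Same observed 80% pass rate, two dataset sizes.
low_30, high_30 = wilson_interval(24, 30)
low_300, high_300 = wilson_interval(240, 300)

print(round(high_30 - low_30, 2))    # roughly 0.28: a 28-point-wide interval
print(round(high_300 - low_300, 2))  # roughly 0.09: a 9-point-wide interval
```

With 30 cases only very large regressions are visible; with 300 a drop of several points starts to stand out from noise, which is the practical meaning of "smoke test" versus "signal".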
Goodhart's law. Any published metric becomes a target, and any target gets gamed: by training data, by tuning, by selection. Benchmarks decay the moment they matter. Lets you reason about: why private, refreshed, traffic-derived datasets hold value while public ones inflate.
The eval as spec. Writing the grader forces you to define what good means, and most teams discover they never had a definition. A failing eval case is a requirement document in executable form. Lets you reason about: why eval-writing improves systems even before anything is measured.
07 Common Misconceptions
"Model X tops the leaderboard, so it's the best for us." Best on that distribution. Models that differ by 2 points on MMLU routinely differ by 20 on a specific task — in either direction. Shortlist by benchmark, decide by your eval.
"I tried it on five examples and it works." Five samples of a stochastic system at temperature > 0 is a coin flipped five times. And the examples you tried are the ones you thought of — production sends the ones you didn't.
"We'll add evals after we ship." Backwards economics: pre-ship, an eval costs a day and catches regressions in minutes; post-ship, every regression is discovered by users and triaged from anecdotes. The eval suite is cheapest exactly when it feels premature.
"LLM-as-judge solves grading." It moves grading to a model with biases you haven't measured. Validate the judge against human labels once; then it's a tool. Skip that step and your metric is an opinion with decimals.