AI Learning Series · Part 12 · Practice Track

Agents End to End

A model that can only talk becomes a system that can act — through one deceptively simple loop, a protocol, and a lot of context discipline.


01 The Big Picture

Every doc so far studied the model. This one studies the system around it — because the paradigm shift your CLAUDE.md describes (static code → static weights + dynamic context) is completed by agents.

An LLM alone is a brilliant brain in a jar: it can reason about your failing build but can't read the log file. The agent loop hands it eyes and hands — tools — and a goal, then lets it iterate. That single change converts "text generator" into "system that does work": writes code, runs tests, fixes its own mistakes, files tickets. Everything in docs 06–11 becomes load-bearing here: each loop iteration is a fresh inference call (06) over a mostly-stable prefix (07) whose context must be managed ruthlessly (09), often calling retrieval as one tool among many (11).

02 What Is an Agent? (Precisely)

An agent = LLM + tools + a loop + a goal. The model runs until the goal is met, choosing actions each iteration based on what it observed last. The precise boundary worth drawing:

System | Control flow | Example
Not an agent: workflow | Fixed by the programmer; the LLM fills in steps | "Summarize → translate → email" chain
Barely an agent: router | LLM picks one branch, once | Classify ticket → route to template
Agent | LLM decides what to do next, how many times, and when it's done | "Fix this failing test" → reads, edits, runs, repeats

The defining property is that the model owns the control flow. That's also exactly what makes agents powerful and hard to predict — you've traded a deterministic program for a stochastic policy. (Doc 13 deals with the consequences.)
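The three rows of the table can be contrasted in a few lines. This is a minimal sketch, assuming a hypothetical `llm` callable that maps a prompt to text; the function names, prompts, and the "done" stop signal are illustrative, not any particular framework's API.

```python
from typing import Callable

# Hypothetical stand-in for a model call: maps a prompt to text.
LLM = Callable[[str], str]

def workflow(llm: LLM, text: str) -> str:
    """Programmer owns control flow: the steps and their order are fixed."""
    summary = llm(f"summarize: {text}")
    return llm(f"translate: {summary}")

def router(llm: LLM, ticket: str) -> str:
    """The LLM picks one branch, once; everything else is fixed."""
    templates = {"billing": "Billing reply", "bug": "Bug reply"}
    label = llm(f"classify: {ticket}").strip()
    return templates.get(label, "General reply")

def agent(llm: LLM, goal: str, tools: dict[str, Callable[[], str]],
          max_steps: int = 10) -> list[str]:
    """The LLM owns control flow: it picks each action and decides when to stop."""
    transcript = [goal]
    for _ in range(max_steps):  # safety cap, since the model may never say "done"
        choice = llm("\n".join(transcript)).strip()
        if choice == "done":
            break
        if choice in tools:
            observation = tools[choice]()
        else:
            observation = f"unknown tool: {choice}"
        transcript.append(f"{choice} -> {observation}")
    return transcript
```

Only `agent` has a loop whose length and path depend on model output, which is the property the table isolates.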

03 Why Tools? (The Honest Division of Labor)

Tools aren't a workaround — they're the correct architecture, because weights and software are good at opposite things:

Weights are good at

Fuzzy understanding, planning, code synthesis, judging relevance, recovering from surprises. Expensive, stochastic, frozen at training time.

Tools are good at

Exact arithmetic, current facts, side effects (writing files, calling APIs), search at scale, anything deterministic. Cheap, exact, always current.

Thought experiment: ask a bare model "what's 847 × 392?" and it predicts digits — plausibly, sometimes wrongly. Give it a calculator tool and the multiplication is exact, while the model does what it's actually good at: deciding that a calculator is needed. Every tool call is the model delegating to a better specialist — the same instinct as doc 08's "recompute via tool instead of memorize in weights" and doc 09's subagents.
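The thought experiment can be sketched as follows. The JSON shape of the tool request and the `calculator` tool are assumptions made for illustration; real harnesses define their own formats.

```python
import json
import operator

# Hypothetical tool: exact integer arithmetic done by software, not by weights.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def calculator(op: str, a: int, b: int) -> int:
    return OPS[op](a, b)

def handle_tool_call(raw: str) -> int:
    """Parse the structured request the model emitted and run the tool."""
    call = json.loads(raw)
    if call["tool"] != "calculator":
        raise ValueError(f"unknown tool: {call['tool']}")
    return calculator(**call["args"])

# What a model might emit instead of guessing digits (format is an assumption).
model_output = '{"tool": "calculator", "args": {"op": "mul", "a": 847, "b": 392}}'
result = handle_tool_call(model_output)
print(result)  # 332024
```

The model's contribution is the decision to emit the request; the arithmetic itself never touches the weights.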

04 How It Works — One Loop Iteration at a Time

Task: "Why is checkout failing in prod?" Watch the loop run.

[Figure: the agent loop. Task + tool catalog ("Why is checkout failing?"; tools: logs, metrics, code, deploy) → LLM reasons over the full transcript and emits text OR a tool call → tool call as structured JSON, e.g. search_logs(service="checkout", level=ERROR) → harness executes (the model never runs anything itself) → observation appended to transcript: "TypeError in retry.js:42 since deploy #1187" → LOOP: the transcript grows each iteration, so doc 09's discipline applies → final answer: "Deploy #1187 broke retry logic; here's the fix + PR".]

Three structural facts worth internalizing: (1) The "loop" is just repeated inference calls. Each iteration re-sends the whole transcript, and because the prefix is stable, cache hits (doc 07) are what make agents affordable. (2) The model emits a request to call a tool; the harness validates and executes it. The security boundary lives in the harness, never in the model's good intentions (doc 16). (3) Termination is the model's judgment, which is why agents sometimes stop early, loop forever, or declare victory falsely; doc 13's evals exist largely to catch this.
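A minimal sketch of that loop, under stated assumptions: the model is replaced by a scripted function that replays the checkout example, and `Step`, `run_agent`, and the `search_logs` tool are hypothetical names. A real harness would call an inference API where `model(transcript)` appears.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Step:
    """One model turn: either a tool request or a final answer."""
    tool: Optional[str] = None
    args: dict = field(default_factory=dict)
    answer: Optional[str] = None

def run_agent(model: Callable[[list[str]], Step],
              tools: dict[str, Callable[..., str]],
              task: str, max_iters: int = 8) -> tuple[str, list[str]]:
    transcript = [f"task: {task}"]
    for _ in range(max_iters):       # hard cap, because termination is the model's call
        step = model(transcript)     # fresh inference over the whole transcript
        if step.answer is not None:
            return step.answer, transcript
        if step.tool not in tools:   # the harness validates before executing
            transcript.append(f"error: unknown tool {step.tool!r}")
            continue
        observation = tools[step.tool](**step.args)
        transcript.append(f"{step.tool}({step.args}) -> {observation}")
    return "stopped: iteration budget exhausted", transcript

# Scripted stand-in for the model, replaying the checkout example.
def scripted_model(transcript: list[str]) -> Step:
    if len(transcript) == 1:
        return Step(tool="search_logs", args={"service": "checkout", "level": "ERROR"})
    return Step(answer="Deploy #1187 broke retry logic")

tools = {"search_logs": lambda service, level: "TypeError in retry.js:42 since deploy #1187"}
answer, transcript = run_agent(scripted_model, tools, "Why is checkout failing in prod?")
```

The `max_iters` cap is a design choice rather than part of the loop's definition: since the model decides when it is done, the harness needs its own bound.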

05 MCP — USB for Tools

Before MCP (Model Context Protocol, Anthropic 2024, since adopted industry-wide), every app hand-wired every tool: M harnesses × N tools = M×N integrations. MCP is a standard client–server protocol in between: a tool provider ships one MCP server exposing tools/resources/prompts with typed schemas; any MCP client (Claude, IDEs, your own harness) connects and discovers them at runtime. M+N instead of M×N — the same shape as USB, JDBC, or LSP, and the same economic effect: an ecosystem.
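To make the discovery idea concrete, here is a toy in-process sketch. It is not the MCP wire protocol or any SDK: `ToyToolServer` and its methods are hypothetical, and the descriptor fields (`name`, `description`, `inputSchema`) are modeled on MCP's tool definitions and should be checked against the specification.

```python
from typing import Callable

class ToyToolServer:
    """In-process stand-in for a tool server; not the real MCP protocol or SDK."""

    def __init__(self) -> None:
        self._tools: dict[str, tuple[dict, Callable[..., str]]] = {}

    def register(self, name: str, description: str, input_schema: dict,
                 fn: Callable[..., str]) -> None:
        descriptor = {"name": name, "description": description,
                      "inputSchema": input_schema}
        self._tools[name] = (descriptor, fn)

    def list_tools(self) -> list[dict]:
        """What a client reads at runtime to learn which tools exist."""
        return [descriptor for descriptor, _ in self._tools.values()]

    def call_tool(self, name: str, arguments: dict) -> str:
        _, fn = self._tools[name]
        return fn(**arguments)

def integrations(m_harnesses: int, n_tools: int) -> tuple[int, int]:
    """Return (hand-wired count, shared-protocol count)."""
    return m_harnesses * n_tools, m_harnesses + n_tools

server = ToyToolServer()
server.register(
    "search_logs", "Search service logs by level",
    {"type": "object",
     "properties": {"service": {"type": "string"}, "level": {"type": "string"}},
     "required": ["service"]},
    lambda service, level="ERROR": f"0 {level} lines for {service}",
)
discovered = [t["name"] for t in server.list_tools()]  # client learns tools at runtime
output = server.call_tool("search_logs", {"service": "checkout"})
print(integrations(5, 20))  # (100, 25)
```

With 5 harnesses and 20 tools, hand-wiring needs 100 integrations while a shared protocol needs 25 implementations.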

🧠
Connecting to doc 09: tool schemas live in the context window. They are tokens: a fat MCP server can eat 10K+ tokens of every request, though because the schemas are stable they are cacheable (doc 07). Tool results arrive as observations. MCP standardizes the pipe; context discipline still decides what flows through it.
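A rough way to see the schema tax, assuming the common rule of thumb of about 4 characters per token. That heuristic is only an approximation; a real measurement needs the actual tokenizer for the model in use. The function names are hypothetical.

```python
import json

def approx_tokens(obj) -> int:
    """Very rough estimate at ~4 characters per token (an assumption, not a tokenizer)."""
    return len(json.dumps(obj)) // 4

def schema_overhead(tool_descriptors: list[dict], requests: int) -> tuple[int, int]:
    """Return (estimated schema tokens per request, total across all requests)."""
    per_request = sum(approx_tokens(d) for d in tool_descriptors)
    return per_request, per_request * requests
```

Because every loop iteration re-sends the schemas, the total grows with the number of iterations, not just the number of tools.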

06 The Design Patterns That Survive Contact With Reality

Pattern | Idea | When
Plan-then-execute | Make the model write a plan/todo list first; check items off as it works. The plan in context acts as a program counter and fights goal drift. | Multi-step tasks; anything > ~5 tool calls
Verifier loop | After acting, check via a tool: run the test, lint the file, screenshot the page. Tools turn "I think I'm done" into evidence. | Always, when a cheap ground-truth check exists. This is the single biggest reliability lever
Subagent fan-out | Delegate exploration to fresh-context workers; only distilled results return (doc 09). | Search-heavy subtasks that would pollute the main transcript
Human-in-the-loop gates | Irreversible actions (deploy, send, delete) require explicit approval; the harness enforces it. | Any side effect you can't undo
Keep it a workflow if you can | If the steps are known in advance, hard-code them and use the LLM per-step. Cheaper, testable, predictable. | The under-rated default. Agents are for when you can't enumerate the steps
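The verifier loop is the easiest of these patterns to show in code. A minimal sketch, assuming hypothetical callables: `propose_fix` stands in for the model, `apply_fix` for the harness's side effect, and `check` for any cheap ground-truth tool such as a test run.

```python
from typing import Callable

def act_then_verify(propose_fix: Callable[[str], str],
                    apply_fix: Callable[[str], None],
                    check: Callable[[], tuple[bool, str]],
                    max_attempts: int = 3) -> tuple[bool, int]:
    """Repeat act -> check until the check passes or attempts run out."""
    ok, feedback = check()
    attempts = 0
    while not ok and attempts < max_attempts:
        apply_fix(propose_fix(feedback))  # model proposes, harness applies
        attempts += 1
        ok, feedback = check()            # evidence, not the model's self-report
    return ok, attempts

# Toy demo: the "test" passes once the counter reaches 2.
state = {"value": 0}

def toy_check() -> tuple[bool, str]:
    return state["value"] == 2, f"value is {state['value']}, want 2"

def toy_apply(_patch: str) -> None:
    state["value"] += 1

verified, attempts_used = act_then_verify(lambda feedback: "increment", toy_apply, toy_check)
```

The loop exits on what the check reports, so a model that announces success early gains nothing by doing so.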

07 Mental Models

A REPL with a mind

The agent loop is read–eval–print: model proposes (read), harness executes (eval), observation returns (print), repeat. Lets you reason about: debugging agents like debugging REPL sessions — inspect the transcript, find the iteration where reasoning went wrong.

A REPL's evaluator is deterministic; here the proposer is stochastic, so the same task can take different paths.

A new hire with amnesia

Brilliant, eager, forgets everything between sessions (doc 09's volatility), takes instructions literally, and will confidently do the wrong thing if under-specified. Lets you reason about: why onboarding docs (rules), checklists (workflows), and supervised first weeks (HITL gates) map one-to-one onto agent infrastructure.

A hire learns permanently from feedback; the agent only "learns" what you write into memory banks or prompts.

Policy, not program

A program defines transitions; an agent defines a policy over states — closer to RL than to software. Lets you reason about: why testing needs distributions, not single cases (doc 13), and why "it worked yesterday" carries little information.

Unlike RL policies, nothing updates between episodes unless you change the context.
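The "distributions, not single cases" point can be sketched by running the same task many times and reporting a rate. Here `flaky_agent` is a stand-in for one stochastic agent run, and its 0.8 success rate is invented purely for illustration.

```python
import random
from typing import Callable

def flaky_agent(rng: random.Random, success_rate: float = 0.8) -> bool:
    """Stand-in for one stochastic agent run; the rate is made up for the demo."""
    return rng.random() < success_rate

def pass_rate(run: Callable[[random.Random], bool], trials: int, seed: int = 0) -> float:
    """Fraction of successful runs over repeated trials of the same task."""
    rng = random.Random(seed)
    return sum(run(rng) for _ in range(trials)) / trials

rate = pass_rate(flaky_agent, trials=1000)
```

A single passing run of `flaky_agent` says little; the rate over many trials is the quantity worth tracking.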

08 Common Misconceptions

"The model executes the tools." Never. It emits structured text; the harness decides whether and how to execute. All authority — sandboxes, allowlists, approval gates — lives in the harness. This distinction is the foundation of agent security (doc 16).
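A sketch of where that authority sits. The tool names, the allowlist, and the `approve` callback are hypothetical; the point is only that the decision is ordinary harness code the model cannot talk its way around.

```python
from typing import Callable

ALLOWED = {"read_file", "run_tests", "deploy"}  # hypothetical tool names
IRREVERSIBLE = {"deploy"}                       # these require human approval

def authorize(tool: str, approve: Callable[[str], bool]) -> tuple[bool, str]:
    """Decide in the harness whether a requested call may run."""
    if tool not in ALLOWED:
        return False, "denied: not on allowlist"
    if tool in IRREVERSIBLE and not approve(tool):
        return False, "denied: approval required"
    return True, "ok"
```

In use, the harness would call `authorize` on every tool request before executing it and append the denial message to the transcript as the observation.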

"More tools = more capable agent." Every schema taxes every request and adds a wrong-choice opportunity. Past a few dozen tools, selection (load relevant tools per task — doc 09's skill pattern) beats accumulation.
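Per-task selection can be sketched with a deliberately naive relevance score. Keyword overlap is a stand-in for whatever selection method a real system uses, and the catalog and its tags are hypothetical.

```python
def select_tools(task: str, catalog: dict[str, set[str]], limit: int = 5) -> list[str]:
    """Rank tools by keyword overlap with the task; drop tools with no overlap."""
    words = set(task.lower().split())
    scored = [(len(words & tags), name) for name, tags in catalog.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:limit]]

catalog = {
    "search_logs": {"logs", "error", "failing"},
    "query_metrics": {"latency", "metrics"},
    "send_email": {"email", "notify"},
}
selected = select_tools("why is checkout failing with error logs", catalog)
```

Only the schemas of the selected tools would then be placed in the request, which shrinks both the token cost and the number of wrong choices available.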

"Agents will replace workflows." The right choice tracks predictability: known steps → workflow (cheap, testable); unknown steps → agent. Mature systems are workflows with agentic nodes, not one giant loop.

"Multi-agent = better." Multi-agent systems shine when contexts must be isolated (doc 09) or roles genuinely differ; otherwise they add coordination overhead and compounding error rates. One competent agent with good tools beats five chatting ones surprisingly often.
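The compounding effect is easy to put numbers on, under a simplifying assumption that each step or handoff succeeds independently with the same probability. Real failures are often correlated, so treat the figures as illustrative only.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps

for steps in (1, 5, 10):
    print(steps, round(chain_success(0.95, steps), 3))
```

At 95% per-step reliability, a five-step chain succeeds roughly 77% of the time under this assumption, which is why every added handoff needs to earn its place.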

🗺️
Next: an agent that owns its control flow can fail in ways a unit test never catches. Doc 13: how to measure any of this.