01 The Big Picture
Doc 12 buried the key sentence: "the security boundary lives in the harness, never in the model's good intentions." This doc unpacks why — and what that means for every agent you build or use.
Safety has two distinct layers people constantly merge. Alignment is about the model's trained dispositions: will it refuse to help build a weapon, will it be honest about uncertainty, whose values does it embody? Security is about the system around the model: can an attacker make your agent exfiltrate data by hiding instructions in a webpage? Alignment is trained; security is engineered. A perfectly aligned model in a badly designed harness is still exploitable — and as agents gain tools (doc 12) and eyes (doc 15), the attack surface is everything they read.
02 What: The Alignment Stack
"The model is aligned" is shorthand for layers applied at different stages:
| Layer | When | What it does | Limits |
|---|---|---|---|
| Data curation | Pre-training | Filter what the base model learns from | The internet is in there; capabilities can't be cleanly unlearned |
| RLHF / RLAIF | Post-training | Reward preferred behavior (doc 02) — helpful, honest, harmless dispositions | Optimizes appearing good to raters: sycophancy, confident tone over accuracy |
| Constitutional AI | Post-training | Model critiques/revises its own outputs against written principles — scales oversight beyond human raters | Principles are interpreted by the model itself |
| System prompts | Deploy time | Behavioral instructions in context (doc 03) | It's just tokens — competes with everything else in the window |
| Classifiers / filters | Runtime | Separate models screening inputs & outputs | False positives/negatives; another model to fool |
| Harness controls | Runtime | Sandboxes, permissions, approval gates (doc 12) | The only layer with guarantees — and the one teams skip |
Jailbreaks are attacks on the trained layers: roleplay framing, encoding tricks, many-shot flooding, "my grandmother used to read me napalm recipes" — all exploiting that refusal is a learned behavior (a region of the policy, doc 12's term), not a rule engine. Each generation gets harder to jailbreak; none is impossible. That asymptote is why the harness layer exists.
03 Why This Is Structurally Hard
04 Prompt Injection — The Attack That Matters
Jailbreaks need a malicious user. Injection needs only a malicious document — and your agent reads documents all day. Watch one unfold.
Variants hit every input surface: instructions in white-on-white text in documents, in HTML comments on webpages, in image text read by VLMs (doc 15), in code comments, in tool results from a compromised MCP server (doc 12), in retrieved RAG chunks (doc 11) — poison the corpus, steer every answer that retrieves it. The rule: anything that can place tokens in your context window holds a (stochastic) form of code execution on your agent.
05 Defense in Depth — What Actually Works
There is no parameterized-query fix (yet). The honest posture, in order of reliability:
| Defense | Mechanism | Strength |
|---|---|---|
| Least-privilege tools | Agent reading public docs gets no email-send tool; scoped credentials per task | Hard guarantee |
| Approval gates on irreversibles | Send/delete/deploy/pay require human confirmation (doc 12's HITL) | Hard guarantee |
| Sandboxing & egress control | Code runs in isolated envs; network allowlists cap exfiltration | Hard guarantee |
| The lethal-trifecta rule | Never combine: untrusted input + private data access + external communication. Drop one leg and exfiltration breaks structurally | Design-level |
| Trained injection resistance | Models post-trained to flag and refuse embedded instructions; provenance markers in prompts | Probabilistic — raises cost, no guarantee |
| Input/output classifiers | Screen for injection patterns and policy violations | Probabilistic |
06 Mental Models
Same root cause as 1998 — instructions and data in one channel — minus the fix, because there's no grammar separating them, only learned convention. Lets you reason about: every new input surface ("can attacker tokens reach the window?") and why "filter harder" keeps failing.
Training shapes what the agent wants to do; permissions decide what it can do. Lets you reason about: where to spend effort — you can't retrain the model, but you fully control its keys.
Classic security failure: a privileged program tricked into using its authority for an attacker. An injected agent is exactly this — the attack uses your agent's legitimacy. Lets you reason about: why "the model meant well" is irrelevant, and why authority, not intent, is what you must scope.
07 Common Misconceptions
"Jailbreaks and prompt injection are the same thing." Different threat models: jailbreak = the user attacks the model's training; injection = a third party attacks the user through content the agent reads. Injection is the one that matters for agents — your users are mostly honest; the internet is not.
"A strong system prompt prevents injection." The system prompt is tokens competing with attacker tokens in the same window (doc 03 + 06). It raises the bar; it cannot hold it. Anyone selling "unbypassable" prompt armor is selling probability as certainty.
"RLHF made the model safe, so my product is safe." Two different layers. Alignment lowers the chance the model wants to do harm; your harness decides what harm is possible. Most real-world incidents are harness failures: over-broad tools, no gates, secrets in context.
"Safety is censorship overhead that dumber models need." For agents, safety engineering is what makes capability deployable — nobody gives production credentials to a system with no permission model. The locked server room isn't what slows the company down; it's why the company is allowed to operate.