Safety & Alignment — Trusting Systems That Read Anything

01 The Big Picture

Doc 12 buried the key sentence: "the security boundary lives in the harness, never in the model's good intentions." This doc unpacks why — and what that means for every agent you build or use.

Safety has two distinct layers people constantly merge. Alignment is about the model's trained dispositions: will it refuse to help build a weapon, will it be honest about uncertainty, whose values does it embody? Security is about the system around the model: can an attacker make your agent exfiltrate data by hiding instructions in a webpage? Alignment is trained; security is engineered. A perfectly aligned model in a badly designed harness is still exploitable — and as agents gain tools (doc 12) and eyes (doc 15), the attack surface is everything they read.

02 What: The Alignment Stack

"The model is aligned" is shorthand for layers applied at different stages:

Layer	When	What it does	Limits
Data curation	Pre-training	Filter what the base model learns from	The internet is in there; capabilities can't be cleanly unlearned
RLHF / RLAIF	Post-training	Reward preferred behavior (doc 02) — helpful, honest, harmless dispositions	Optimizes appearing good to raters: sycophancy, confident tone over accuracy
Constitutional AI	Post-training	Model critiques/revises its own outputs against written principles — scales oversight beyond human raters	Principles are interpreted by the model itself
System prompts	Deploy time	Behavioral instructions in context (doc 03)	It's just tokens — competes with everything else in the window
Classifiers / filters	Runtime	Separate models screening inputs & outputs	False positives/negatives; another model to fool
Harness controls	Runtime	Sandboxes, permissions, approval gates (doc 12)	The only layer with guarantees — and the one teams skip

Jailbreaks are attacks on the trained layers: roleplay framing, encoding tricks, many-shot flooding, "my grandmother used to read me napalm recipes" — all exploiting that refusal is a learned behavior (a region of the policy, doc 12's term), not a rule engine. Each generation gets harder to jailbreak; none is impossible. That asymptote is why the harness layer exists.

03 Why This Is Structurally Hard

Code and data share one channel. Every classical injection (SQLi, XSS) came from mixing instructions and data in one string — and we fixed them with parameterized queries: a hard syntactic boundary. A transformer has no such boundary: system prompt, user message, retrieved chunk, tool result — all just tokens in one sequence (doc 06), distinguished only by learned conventions. The architecture itself is the vulnerability.

Objectives are proxies. We can't specify "be good" mathematically; we reward what raters prefer (RLHF) — and doc 13's Goodhart applies to training too: optimize the proxy hard enough and you get sycophancy, confident hallucination, refusals of harmless requests.

Capability and risk are the same dial. A model smart enough to debug your code is smart enough to be talked into things. Every tool you grant (doc 12) is granted to whoever can get text into the context.

04 Prompt Injection — The Attack That Matters

Jailbreaks need a malicious user. Injection needs only a malicious document — and your agent reads documents all day. Watch one unfold.

Variants hit every input surface: instructions in white-on-white text in documents, in HTML comments on webpages, in image text read by VLMs (doc 15), in code comments, in tool results from a compromised MCP server (doc 12), in retrieved RAG chunks (doc 11) — poison the corpus, steer every answer that retrieves it. The rule: anything that can place tokens in your context window holds a (stochastic) form of code execution on your agent.

05 Defense in Depth — What Actually Works

There is no parameterized-query fix (yet). The honest posture, in order of reliability:

Defense	Mechanism	Strength
Least-privilege tools	Agent reading public docs gets no email-send tool; scoped credentials per task	Hard guarantee
Approval gates on irreversibles	Send/delete/deploy/pay require human confirmation (doc 12's HITL)	Hard guarantee
Sandboxing & egress control	Code runs in isolated envs; network allowlists cap exfiltration	Hard guarantee
The lethal-trifecta rule	Never combine: untrusted input + private data access + external communication. Drop one leg and exfiltration breaks structurally	Design-level
Trained injection resistance	Models post-trained to flag and refuse embedded instructions; provenance markers in prompts	Probabilistic — raises cost, no guarantee
Input/output classifiers	Screen for injection patterns and policy violations	Probabilistic

🔑

The design rule in one line: put probability where you can afford to be wrong, and structure where you can't. Trained alignment and classifiers reduce frequency; permissions, gates, and sandboxes cap blast radius. Reread doc 12's loop with this lens and agent security becomes ordinary systems engineering.

06 Mental Models

SQL injection, but the parser is a mind

Same root cause as 1998 — instructions and data in one channel — minus the fix, because there's no grammar separating them, only learned convention. Lets you reason about: every new input surface ("can attacker tokens reach the window?") and why "filter harder" keeps failing.

SQLi is deterministic; injection success is probabilistic — which makes it harder to test away (doc 13's distributions again).

Alignment is the employee handbook; the harness is the locked server room

Training shapes what the agent wants to do; permissions decide what it can do. Lets you reason about: where to spend effort — you can't retrain the model, but you fully control its keys.

Unlike employees, the model can't be held accountable — incentives don't transfer; only capability control does.

The confused deputy

Classic security failure: a privileged program tricked into using its authority for an attacker. An injected agent is exactly this — the attack uses your agent's legitimacy. Lets you reason about: why "the model meant well" is irrelevant, and why authority, not intent, is what you must scope.

Classical deputies follow code; this one follows persuasion — the attack surface includes rhetoric.

07 Common Misconceptions

"Jailbreaks and prompt injection are the same thing." Different threat models: jailbreak = the user attacks the model's training; injection = a third party attacks the user through content the agent reads. Injection is the one that matters for agents — your users are mostly honest; the internet is not.

"A strong system prompt prevents injection." The system prompt is tokens competing with attacker tokens in the same window (doc 03 + 06). It raises the bar; it cannot hold it. Anyone selling "unbypassable" prompt armor is selling probability as certainty.

"RLHF made the model safe, so my product is safe." Two different layers. Alignment lowers the chance the model wants to do harm; your harness decides what harm is possible. Most real-world incidents are harness failures: over-broad tools, no gates, secrets in context.

"Safety is censorship overhead that dumber models need." For agents, safety engineering is what makes capability deployable — nobody gives production credentials to a system with no permission model. The locked server room isn't what slows the company down; it's why the company is allowed to operate.

🎓

Series complete — for real this time. Sixteen docs: what models are (01–05), how they run (06–08, 10, 14), what surrounds them (09, 11–13, 15), and what keeps the whole thing trustworthy (16). The thread never changed: understand the constraint, and the design follows.