01 The Big Picture
Code used to be static: you wrote it, it ran the same way forever. With LLMs, behavior is static weights plus dynamic context — and managing that dynamic half has become an engineering discipline of its own.
The previous three docs established the constraints: the context window is expensive to fill (prefill, doc 06), expensive to carry (KV cache per decode step, docs 06/08), cheap only when stable (caching, doc 07), and qualitatively fragile when overstuffed (attention dilution). Skills, rules, workflows, memory banks, and subagents are not productivity gimmicks — they are the software industry independently rediscovering memory hierarchy management, the same discipline GPU kernel authors practice one level down.
02 What Problem Are We Solving, Exactly?
"The context problem" is really four coupled problems:
Capacity
Windows are finite (and effective capacity is smaller than advertised — recall degrades long before the hard limit).
Cost & latency
Every token in context is paid for repeatedly: once in prefill, then on every decode step as KV-cache bytes through the bandwidth wall.
Attention quality
Irrelevant context isn't neutral: it competes in softmax with relevant context. Stuffing degrades accuracy, buries facts placed mid-window (the "lost in the middle" effect), and invites hallucination.
Persistence
The window evaporates when the conversation ends. The model wakes up amnesiac, every time.
Note what these four mirror: capacity = SRAM size; cost = bandwidth; attention quality = cache pollution; persistence = volatility. The mapping is not poetic — it predicts which solutions work.
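The "paid for repeatedly" claim can be made concrete with rough arithmetic. The sketch below assumes a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values); these numbers are illustrative and not taken from any specific model. It counts how many KV-cache bytes are read while decoding, under the simplifying assumption that each decode step reads the cache for the original context plus every token generated so far.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer dimensions, chosen only for illustration."""
    n_layers: int = 32
    n_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_value: int = 2  # fp16 / bf16


def kv_bytes_per_token(shape: ModelShape) -> int:
    # One key vector and one value vector per layer, per KV head.
    return 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * shape.bytes_per_value


def kv_bytes_read_during_decode(shape: ModelShape, context_tokens: int, output_tokens: int) -> int:
    """Approximate KV-cache bytes read while generating `output_tokens` tokens.

    Decode step i reads the cache for the original context plus the i tokens
    generated before it (a simplification that ignores batching and caching tricks).
    """
    positions_read = sum(context_tokens + i for i in range(output_tokens))
    return kv_bytes_per_token(shape) * positions_read


if __name__ == "__main__":
    shape = ModelShape()
    lean = kv_bytes_read_during_decode(shape, context_tokens=2_000, output_tokens=500)
    stuffed = kv_bytes_read_during_decode(shape, context_tokens=100_000, output_tokens=500)
    print(f"per token: {kv_bytes_per_token(shape)} bytes")
    print(f"stuffed / lean traffic ratio: {stuffed / lean:.1f}x")
```

Under these assumptions every resident token costs 128 KiB of cache, and that cost is re-read on each generated token, which is why trimming context pays off on every decode step rather than once.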
03 Why Solve It in Software?
Hardware answers exist (longer windows, more HBM, better attention kernels), and they keep coming. But software wins for the same reason caches beat "just buy more RAM".
04 The Five Patterns
An agent harness assembles context just-in-time, pattern by pattern.
Each pattern, precisely
| Pattern | What it is | When it earns its place |
|---|---|---|
| Rules / system prompts | Small, always-loaded behavioral constraints (CLAUDE.md, .cursorrules) | Things that must hold on every request. Keep tiny — they tax every single call. |
| Skills | Named capability bundles; an index of one-line descriptions stays resident, full instructions load only when triggered | Expertise needed sometimes: how to build a PPTX, how to review a PR. Demand paging for knowledge. |
| Workflows | Multi-step procedures externalized so only the current step (plus state) occupies context | Long processes that would otherwise carry all past steps as dead weight in the window. |
| Memory banks | Facts persisted to files/DB across sessions; a light index resident, bodies recalled by relevance | Anything that must survive the conversation: preferences, project decisions, learned corrections. |
| Subagents | Delegate a subtask to a fresh context; only the distilled result returns | Exploration that generates bulk garbage (searching 100 files) — quadratic context growth becomes additive. |
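The rules/skills split in the table can be sketched as a small context assembler. Everything here is hypothetical: the `Skill` type, the `build_context` function, and the keyword-based trigger are illustrative assumptions, not the API of any real harness (a real one may let the model itself decide which skill to load). The point is only the shape: rules and the one-line skill index are always resident, while a skill body enters the context only when triggered.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    description: str           # one line, always resident in the index
    body: str                  # full instructions, loaded only on demand
    triggers: tuple[str, ...]  # hypothetical keyword triggers


def build_context(rules: str, skills: list[Skill], request: str) -> str:
    """Assemble a prompt: stable parts first, on-demand parts last."""
    index = "\n".join(f"- {s.name}: {s.description}" for s in skills)
    lowered = request.lower()
    # Naive trigger (an assumption): load a body if any keyword appears in the request.
    loaded = [s.body for s in skills if any(t in lowered for t in s.triggers)]
    parts = [rules, "Available skills:\n" + index, *loaded, "Request: " + request]
    return "\n\n".join(parts)


if __name__ == "__main__":
    skills = [
        Skill("pptx", "Build slide decks", "<full PPTX instructions>", ("pptx", "slides")),
        Skill("pr-review", "Review a pull request", "<full review checklist>", ("review",)),
    ]
    print(build_context("Always answer in English.", skills, "Make slides for the Q3 update"))
```

Only the PPTX body is paged in for this request; the review checklist costs one index line and nothing more.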
05 The Hardware Mapping — Connecting All the Dots
Here is the series' full circle. Each software pattern has an exact counterpart in the memory hierarchy of doc 08, and relieves a specific cost from docs 06–07:
| Software pattern | Hardware counterpart | Bottleneck it relieves |
|---|---|---|
| Rules (small, stable, first) | Pinned cache lines / firmware in ROM | Prefill cost — stable prefix maximizes cache hits (doc 07) |
| Skills (index resident, body on demand) | Demand paging; page table in memory, pages on disk | Capacity + attention pollution — pay only when used |
| Workflows (one step in flight) | Instruction streaming; loop tiling | KV-cache growth per decode step (docs 06, 08) |
| Memory banks (persist + recall) | Disk + buffer cache; write-back policy | Volatility — context evaporates, files don't |
| Subagents (fresh window, distilled return) | Scratch buffers; map-reduce; FlashAttention's discard-the-intermediate trick | O(n²) context accumulation → O(n); garbage never enters the parent's "SRAM" |
| Summarization / compaction | Lossy compression; cache eviction with a victim summary | Capacity — trade fidelity for space, in controlled batches (doc 07's cache-invalidation caveat) |
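The subagent row's O(n²) → O(n) claim can be checked with a toy token count. The model below is an assumption made for illustration: each exploration step produces a fixed number of raw tokens, and every call re-processes whatever is currently in the window. The function names and numbers are hypothetical.

```python
def inline_tokens_processed(steps: int, tokens_per_step: int) -> int:
    """Exploration done in the parent's window: each call carries all prior raw output."""
    context = 0
    processed = 0
    for _ in range(steps):
        context += tokens_per_step  # raw output of this step stays in the window
        processed += context        # the call processes the whole window
    return processed


def delegated_tokens_processed(steps: int, tokens_per_step: int, summary_tokens: int) -> int:
    """Each step runs in a fresh subagent; only a short summary returns to the parent."""
    parent_context = 0
    processed = 0
    for _ in range(steps):
        processed += tokens_per_step      # subagent reads its bulk once, then is discarded
        parent_context += summary_tokens  # parent keeps only the distilled result
        processed += parent_context       # parent call carries the accumulated summaries
    return processed


if __name__ == "__main__":
    print(inline_tokens_processed(100, 2_000))         # bulk term grows quadratically
    print(delegated_tokens_processed(100, 2_000, 50))  # bulk term grows linearly
```

In this toy model the parent still accumulates summaries, so a quadratic term remains, but it scales with the summary size rather than with the raw output. The bulk of the work becomes additive.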
06 Mental Models
The context window is small, fast, expensive, and volatile. Manage it like a kernel author manages shared memory: stage what's needed, evict aggressively, keep the layout cache-friendly. Lets you reason about: whether any new agent feature helps (does it reduce bytes in the precious tier, or pollute it?).
Think of the harness as an operating system: rules = kernel config; skills = demand-paged libraries; memory = filesystem; subagents = processes with isolated address spaces; the context window = physical RAM the OS allocates. Lets you reason about: why harness design matters more than model choice for many tasks, and what's still missing (schedulers, quotas, protection rings).
Weights are the compiled binary; context is the runtime configuration + heap. "Programming" an agent = curating what reaches its heap each call. Lets you reason about: why prompt/context engineering is a durable skill, not a passing trick — heaps need managing no matter how good binaries get.
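The "memory = filesystem" model can be sketched as a tiny file-backed memory bank. This is a minimal illustration under stated assumptions: the `MemoryBank` class, its JSON file layout, and the word-overlap relevance score are all hypothetical stand-ins (a real system would likely use embeddings or a database). It shows the three moves the pattern needs: a write path the agent controls, a light index that stays resident, and bodies recalled by relevance.

```python
from __future__ import annotations

import json
from pathlib import Path


class MemoryBank:
    """Hypothetical file-backed memory: light index resident, bodies recalled on demand."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Entries survive across sessions because they live in a file, not the window.
        self.entries: list[dict] = json.loads(path.read_text()) if path.exists() else []

    def write(self, title: str, body: str) -> None:
        # Write path: the agent persists a curated conclusion, not a raw document.
        self.entries.append({"title": title, "body": body})
        self.path.write_text(json.dumps(self.entries, indent=2))

    def index(self) -> str:
        # Only the titles stay resident in context.
        return "\n".join(f"- {e['title']}" for e in self.entries)

    def recall(self, query: str, limit: int = 2) -> list[str]:
        # Toy relevance score (an assumption): lowercase words shared with the query.
        query_words = set(query.lower().split())
        scored = []
        for entry in self.entries:
            entry_words = set(f"{entry['title']} {entry['body']}".lower().split())
            overlap = len(query_words & entry_words)
            if overlap > 0:
                scored.append((overlap, entry["body"]))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [body for _, body in scored[:limit]]
```

A new session would construct `MemoryBank` on the same path, place `index()` in the context, and call `recall()` only when a request looks related. Deciding what deserves a `write()` is the part this sketch leaves open.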
07 Common Misconceptions
"Million-token windows make all this obsolete." The same was said of RAM killing caches and SSDs killing buffer pools. Bigger windows raise the ceiling; they don't change that filled windows cost linearly per request (doc 07), decode slower per token (doc 06), and dilute attention. Hierarchy management survives every capacity jump — it has for sixty years.
"Memory banks are RAG with extra steps." They overlap, but differ in provenance and write path: RAG retrieves from a corpus you indexed; a memory bank is written by the agent itself — curated conclusions, not raw documents. The write policy (what's worth persisting) is the hard part, just as cache write-back policy is harder than read caching.
"More instructions in the system prompt = more reliable agent." Past a point, strictly worse: every rule taxes every request (prefill + decode), and rules compete with the task for attention. The skill/rule split exists precisely to keep the always-loaded set minimal — pin only what's truly global.
"Subagents are about parallelism." Sometimes, but their primary value is context isolation — the parent's window stays clean. A sequential subagent that returns a tight summary is usually a bigger win than three parallel ones returning verbose dumps.