AI Learning Series · Part 9 · Capstone

Software Context Solutions

Skills, rules, workflows, memory banks, subagents — five software patterns, one hardware problem. The doc where the whole series connects.

01 The Big Picture

Code used to be static: you wrote it, it ran the same way forever. With LLMs, behavior is static weights plus dynamic context — and managing that dynamic half has become an engineering discipline of its own.

The previous three docs established the constraints: the context window is expensive to fill (prefill, doc 06), expensive to carry (KV cache per decode step, docs 06/08), cheap only when stable (caching, doc 07), and qualitatively fragile when overstuffed (attention dilution). Skills, rules, workflows, memory banks, and subagents are not productivity gimmicks — they are the software industry independently rediscovering memory hierarchy management, the same discipline GPU kernel authors practice one level down.

🔑
The capstone thesis: the context window is the new SRAM — small, fast, precious. Everything else (files, databases, other agents) is the new HBM and disk. Every pattern in this doc is an eviction policy, a prefetcher, or a cache-coherence protocol for that hierarchy.

02 What Problem Are We Solving, Exactly?

"The context problem" is really four coupled problems:

Capacity

Windows are finite (and effective capacity is smaller than advertised — recall degrades long before the hard limit).

Cost & latency

Every token in context is paid for repeatedly: once in prefill, then again on every decode step, when its KV-cache bytes must be read through the memory-bandwidth bottleneck.

Attention quality

Irrelevant context isn't neutral: it competes in softmax with relevant context. Stuffing degrades accuracy and invites hallucination, and material buried mid-window is recalled worst ("lost in the middle").

Persistence

The window evaporates when the conversation ends. The model wakes up amnesiac, every time.

Note what these four mirror: capacity = SRAM size; cost = bandwidth; attention quality = cache pollution; persistence = volatility. The mapping is not poetic — it predicts which solutions work.
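
The cost-and-latency problem can be made concrete by counting how many cached tokens decode has to re-read. The sketch below assumes a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and ignores batching and kernel-level optimizations; the numbers and function names are illustrative, not taken from the series.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per head per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def decode_token_reads(prompt_tokens: int, generated_tokens: int) -> int:
    """Total cached tokens read across all decode steps.

    Step i (0-based) attends over the prompt plus the i tokens generated so far.
    """
    return sum(prompt_tokens + i for i in range(generated_tokens))


# Hypothetical model shape, chosen only for illustration.
PER_TOKEN = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)

for prompt in (2_000, 200_000):
    reads = decode_token_reads(prompt, generated_tokens=500)
    gib = reads * PER_TOKEN / 2**30
    print(f"prompt={prompt:>7,} tokens -> {gib:,.1f} GiB of KV reads for 500 output tokens")
```

Under these assumptions the traffic is dominated by the prompt term, so a 100× larger prompt means close to 100× more bytes through the memory system for the same 500 output tokens.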

03 Why Solve It in Software?

Hardware answers exist — longer windows, more HBM, better attention kernels — and they keep coming. But software wins for the same reason caches beat "just buy more RAM":

Demand always outgrows supply. Windows grew roughly 1,000× in four years; codebases, document stores, and agent histories grew faster. A hierarchy with smart placement beats a bigger flat memory at every scale of hardware. This has been true in computing since the 1960s.
Cost scales with usage, not capability. A 1M-token window you actually fill costs you 1M tokens per request. Loading 2K relevant tokens from a well-organized store costs 2K. The 500× difference lands on your bill, every call (doc 07's table).
Selection is signal. Choosing what enters context is itself intelligence — curation tells the model what matters. A model handed exactly the right 2K tokens outperforms one left to find them inside 200K. Hardware can't make that choice; software can.
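
"Selection is signal" can be sketched as a budgeted picker: score candidate chunks against the task, then keep the best ones that fit. This is a toy under stated assumptions: the word-overlap score and the whitespace token count stand in for a real retriever and tokenizer, and all names here are hypothetical.

```python
def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption, not a real tokenizer)."""
    return len(text.split())


def relevance(query: str, chunk: str) -> int:
    """Toy relevance score: number of distinct query words that appear in the chunk."""
    chunk_words = set(chunk.lower().split())
    return sum(1 for word in set(query.lower().split()) if word in chunk_words)


def select_context(query: str, chunks: list[str], budget_tokens: int) -> list[str]:
    """Greedily keep the most relevant chunks that fit within the token budget."""
    ranked = sorted(chunks, key=lambda c: relevance(query, c), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        cost = count_tokens(chunk)
        # Skip chunks with no overlap at all: irrelevant text is not neutral.
        if relevance(query, chunk) > 0 and used + cost <= budget_tokens:
            selected.append(chunk)
            used += cost
    return selected


chunks = [
    "the retry policy uses exponential backoff with jitter",
    "office plants are watered on fridays",
    "retry limits are configured per client in settings",
]
picked = select_context("how is retry backoff configured", chunks, budget_tokens=20)
print(picked)
```

The irrelevant chunk never enters the window, even though the budget could have held it.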

04 The Five Patterns

Watch how an agent harness assembles context just-in-time, pattern by pattern.

[Figure: context window assembly. The context window is scarce, paid per token, and volatile.]

Inside the window, in order:
RULES: always loaded; cached prefix
SKILL: pdf-handling, loaded on demand
MEMORY: 3 relevant notes, recalled
WORKFLOW: step 2 of 5, only this step
USER TASK: volatile; last position

Outside the window:
Skill library (disk): names always visible; bodies fetched when triggered
Memory bank (files/DB): persists across sessions; index loaded, bodies on demand
SUBAGENT: own fresh window; burns 50K tokens searching, returns a 200-token answer

Result: ~15K curated tokens in the window instead of 300K dumped. Cheaper, faster, and more accurate.
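
A minimal sketch of that just-in-time assembly, assuming a hypothetical `Harness` class and naive substring triggers (a real harness would let the model or a retriever decide what to load):

```python
from dataclasses import dataclass, field


@dataclass
class Harness:
    rules: str                                  # always loaded, kept first so the prefix stays stable
    skills: dict[str, tuple[str, str]]          # lowercase name -> (one-line description, full body)
    memory: dict[str, str] = field(default_factory=dict)   # lowercase note title -> note body

    def assemble(self, task: str, workflow_step: str) -> str:
        """Build the context for one request, most stable content first."""
        wanted = task.lower()
        parts = [self.rules]
        # Skill index: names and descriptions are always visible.
        parts += [f"skill {name}: {desc}" for name, (desc, _) in sorted(self.skills.items())]
        # Skill bodies: loaded only when the task mentions the skill's name.
        parts += [body for name, (_, body) in sorted(self.skills.items()) if name in wanted]
        # Memory: recall only notes whose title appears in the task.
        parts += [body for title, body in sorted(self.memory.items()) if title in wanted]
        parts.append(workflow_step)   # only the current step, not the whole procedure
        parts.append(task)            # volatile content goes last
        return "\n".join(parts)


harness = Harness(
    rules="RULES: answer concisely.",
    skills={"pdf": ("handle pdf files", "PDF BODY"), "pptx": ("build slides", "PPTX BODY")},
    memory={"invoice": "Invoices use ISO dates."},
)
context = harness.assemble("extract tables from this pdf invoice", "Step 2 of 5: extract tables")
print(context)
```

The ordering is the design choice worth noticing: everything that rarely changes sits at the front, and the user task, which changes every call, sits at the end.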

Each pattern, precisely

| Pattern | What it is | When it earns its place |
| --- | --- | --- |
| Rules / system prompts | Small, always-loaded behavioral constraints (CLAUDE.md, .cursorrules) | Things that must hold on every request. Keep tiny: they tax every single call. |
| Skills | Named capability bundles; an index of one-line descriptions stays resident, full instructions load only when triggered | Expertise needed sometimes: how to build a PPTX, how to review a PR. Demand paging for knowledge. |
| Workflows | Multi-step procedures externalized so only the current step (plus state) occupies context | Long processes that would otherwise carry all past steps as dead weight in the window. |
| Memory banks | Facts persisted to files/DB across sessions; a light index resident, bodies recalled by relevance | Anything that must survive the conversation: preferences, project decisions, learned corrections. |
| Subagents | Delegate a subtask to a fresh context; only the distilled result returns | Exploration that generates bulk garbage (searching 100 files); quadratic context growth becomes additive. |
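
The workflow pattern is the easiest of the five to sketch. Assuming a hypothetical `Workflow` class that lives outside the window, only the step in flight and a compact state dictionary are ever rendered into context:

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    steps: list[str]                       # full procedure, stored outside the context window
    current: int = 0                       # index of the step in flight
    state: dict[str, str] = field(default_factory=dict)   # compact results carried forward

    def render(self) -> str:
        """Return only what the model needs now: the current step plus carried state."""
        header = f"Step {self.current + 1} of {len(self.steps)}: {self.steps[self.current]}"
        carried = [f"{key} = {value}" for key, value in self.state.items()]
        return "\n".join([header, *carried])

    def advance(self, key: str, result: str) -> None:
        """Record a compact result for the finished step and move to the next one."""
        self.state[key] = result
        self.current += 1


wf = Workflow(steps=["collect requirements", "draft schema", "write migration"])
wf.advance("requirements", "3 tables, soft deletes")
print(wf.render())
```

The transcript of step 1 is gone; what survives is the one line of state that step 2 actually needs.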

05 The Hardware Mapping — Connecting All the Dots

Here is the series' full circle. Each software pattern has an exact counterpart in the memory hierarchy of doc 08, and relieves a specific cost from docs 06–07:

| Software pattern | Hardware counterpart | Bottleneck it relieves |
| --- | --- | --- |
| Rules (small, stable, first) | Pinned cache lines / firmware in ROM | Prefill cost: a stable prefix maximizes cache hits (doc 07) |
| Skills (index resident, body on demand) | Demand paging; page table in memory, pages on disk | Capacity + attention pollution: pay only when used |
| Workflows (one step in flight) | Instruction streaming; loop tiling | KV-cache growth per decode step (docs 06, 08) |
| Memory banks (persist + recall) | Disk + buffer cache; write-back policy | Volatility: context evaporates, files don't |
| Subagents (fresh window, distilled return) | Scratch buffers; map-reduce; FlashAttention's discard-the-intermediate trick | O(n²) context accumulation → O(n); garbage never enters the parent's "SRAM" |
| Summarization / compaction | Lossy compression; cache eviction with a victim summary | Capacity: trade fidelity for space, in controlled batches (doc 07's cache-invalidation caveat) |
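
The compaction row can be sketched as follows, assuming whitespace word counts as a stand-in for tokens and a hypothetical `toy_summarize` function in place of a model call. Compaction runs only when the budget is exceeded, because rewriting old turns changes the prefix and so gives up whatever was cached for it.

```python
from typing import Callable


def compact(
    turns: list[str],
    budget_tokens: int,
    summarize: Callable[[list[str]], str],
    keep_recent: int = 2,
) -> list[str]:
    """Replace the oldest turns with one summary when the history exceeds the budget.

    Token counts are approximated by whitespace-separated words.
    """
    total = sum(len(turn.split()) for turn in turns)
    if total <= budget_tokens or len(turns) <= keep_recent:
        return turns
    split = len(turns) - keep_recent
    old, recent = turns[:split], turns[split:]
    return [summarize(old), *recent]


def toy_summarize(old_turns: list[str]) -> str:
    """Stand-in for a model call: keeps only the first three words of each turn."""
    return "SUMMARY: " + " | ".join(" ".join(t.split()[:3]) for t in old_turns)


history = ["a b c d e", "f g h i j", "k l", "m n"]
print(compact(history, budget_tokens=5, summarize=toy_summarize))
```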
🧠
The FlashAttention parallel deserves emphasis: a subagent burning 50K tokens and returning 200 is the same move as computing tiles of the n×n score matrix in SRAM and writing back only the output. Do bulky intermediate work in a cheap scratch space; never let it transit the expensive tier. One principle, silicon to agent.
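
As a rough illustration of that move, the sketch below uses a plain function as the "subagent": its scratch list stands in for a fresh context window, and only the short return value reaches the parent. A real subagent would be a separate model call; the file contents and names here are invented for the example.

```python
def run_subagent(question: str, files: dict[str, str]) -> str:
    """Search every file in a private scratch list and return a short answer.

    The scratch list plays the role of the subagent's own window: it grows with
    every file read and is discarded when the function returns.
    """
    scratch: list[str] = []
    hits: list[str] = []
    for name, text in sorted(files.items()):
        scratch.append(text)                      # bulky intermediate material
        if question.lower() in text.lower():
            hits.append(name)
    return f"'{question}' appears in: {', '.join(hits) or 'no files'}"


# Hypothetical codebase: 100 files of filler, one of which defines the target.
files = {f"module_{i}.py": "def helper(): pass\n" * 200 for i in range(100)}
files["module_42.py"] += "def parse_invoice(): pass\n"

parent_context = ["user: where is parse_invoice defined?"]
parent_context.append(run_subagent("parse_invoice", files))

scanned_chars = sum(len(text) for text in files.values())
returned_chars = len(parent_context[-1])
print(f"scanned {scanned_chars:,} chars, returned {returned_chars} chars to the parent")
```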
💱
And the economics close the loop: every row above also moves spend down doc 07's price table — from output tokens (3–5×) and fresh input (1×) toward cached input (0.1×) and disk (free). Token efficiency and hallucination reduction are the same optimization: a window containing only what matters.
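
A worked version of that price ladder, in units of one fresh input token. The multipliers are the ones quoted above, with 4× taken as an assumed midpoint of the 3–5× output range; the two workloads are hypothetical and sized to echo the 300K-dumped versus ~15K-curated comparison from section 04.

```python
# Relative prices per token, in units of "one fresh input token".
# 4.0 is an assumed midpoint of the 3-5x output range.
PRICE = {"fresh_input": 1.0, "cached_input": 0.1, "output": 4.0}


def request_cost(fresh_input: int, cached_input: int, output: int) -> float:
    """Cost of one request in fresh-input-token equivalents."""
    return (
        fresh_input * PRICE["fresh_input"]
        + cached_input * PRICE["cached_input"]
        + output * PRICE["output"]
    )


# Same task, two ways of filling the window (hypothetical token counts).
dumped = request_cost(fresh_input=300_000, cached_input=0, output=1_000)
curated = request_cost(fresh_input=5_000, cached_input=10_000, output=1_000)
print(f"dumped={dumped:,.0f}  curated={curated:,.0f}  ratio={dumped / curated:.1f}x")
```

Note that in the curated case the output tokens are now a large share of the bill, which is what a well-managed window should look like.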

06 Mental Models

The context window is SRAM

Small, fast, expensive, volatile — manage it like a kernel author manages shared memory: stage what's needed, evict aggressively, keep the layout cache-friendly. Lets you reason about: whether any new agent feature helps (does it reduce bytes in the precious tier, or pollute it?).

Where the analogy breaks: SRAM access is uniform; context attention is not. Position and ordering affect recall quality.

The agent harness is an operating system

Rules = kernel config; skills = demand-paged libraries; memory = filesystem; subagents = processes with isolated address spaces; the context window = physical RAM the OS allocates. Lets you reason about: why harness design matters more than model choice for many tasks, and what's still missing (schedulers, quotas, protection rings).

Where the analogy breaks: an OS enforces isolation in hardware; harness "isolation" is by convention, so injected text can still leak across boundaries (prompt injection).

Static + dynamic = the new program

Weights are the compiled binary; context is the runtime configuration + heap. "Programming" an agent = curating what reaches its heap each call. Lets you reason about: why prompt/context engineering is a durable skill, not a passing trick — heaps need managing no matter how good binaries get.

Where the analogy breaks: unlike a heap, context cannot be mutated in place, only appended to or rebuilt (which is exactly why caching rules exist).
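
The append-or-rebuild constraint is easy to see if a prefix cache is modeled as "reuse whatever leading tokens match the previous request". That is a simplification (real caches typically match in blocks or at declared breakpoints), and the token lists below are hypothetical placeholders.

```python
def cached_prefix_len(previous: list[str], current: list[str]) -> int:
    """Number of leading tokens the two requests share, i.e. the reusable cached prefix."""
    shared = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        shared += 1
    return shared


base = ["RULES", "SKILL_INDEX", "turn1", "turn2"]
appended = base + ["turn3"]                                       # append-only: old request is a prefix
edited = ["RULES_v2", "SKILL_INDEX", "turn1", "turn2", "turn3"]   # early edit: nothing reusable

print(cached_prefix_len(base, appended))
print(cached_prefix_len(base, edited))
```

Appending reuses all four earlier tokens; changing the first one reuses none, even though most of the request is textually identical.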

07 Common Misconceptions

"Million-token windows make all this obsolete." The same was said of RAM killing caches and SSDs killing buffer pools. Bigger windows raise the ceiling; they don't change that filled windows cost linearly per request (doc 07), decode slower per token (doc 06), and dilute attention. Hierarchy management survives every capacity jump — it has for sixty years.

"Memory banks are RAG with extra steps." They overlap, but differ in provenance and write path: RAG retrieves from a corpus you indexed; a memory bank is written by the agent itself — curated conclusions, not raw documents. The write policy (what's worth persisting) is the hard part, just as cache write-back policy is harder than read caching.

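To make the write-path distinction concrete, here is a minimal file-backed memory bank. It is a sketch under assumptions: the `MemoryBank` class, the JSON file layout, and the `worth_keeping` flag (standing in for whatever policy decides a conclusion deserves persisting) are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


class MemoryBank:
    """File-backed notes: a light index of titles stays resident, bodies are read on recall."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.notes: dict[str, str] = json.loads(path.read_text()) if path.exists() else {}

    def index(self) -> list[str]:
        """Titles only; this is the part cheap enough to keep in context."""
        return sorted(self.notes)

    def write(self, title: str, body: str, worth_keeping: bool) -> bool:
        """Persist a conclusion only if the write policy approves it."""
        if not worth_keeping:
            return False
        self.notes[title] = body
        self.path.write_text(json.dumps(self.notes, indent=2))
        return True

    def recall(self, query: str) -> list[str]:
        """Return bodies of notes whose title shares a word with the query."""
        words = set(query.lower().split())
        return [body for title, body in sorted(self.notes.items()) if words & set(title.lower().split())]


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "memory.json"
    bank = MemoryBank(store)
    bank.write("date format", "User prefers ISO 8601 dates.", worth_keeping=True)
    bank.write("scratch", "Tried three regexes; none mattered.", worth_keeping=False)
    reloaded = MemoryBank(store)          # a later session sees only what was persisted
    titles = reloaded.index()
    recalled = reloaded.recall("which date style?")

print(titles, recalled)
```

The read side is ordinary retrieval; the part with no RAG equivalent is the `write` call that rejected the scratch note.
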
"More instructions in the system prompt = more reliable agent." Past a point, strictly worse: every rule taxes every request (prefill + decode), and rules compete with the task for attention. The skill/rule split exists precisely to keep the always-loaded set minimal — pin only what's truly global.

"Subagents are about parallelism." Sometimes, but their primary value is context isolation — the parent's window stays clean. A sequential subagent that returns a tight summary is usually a bigger win than three parallel ones returning verbose dumps.

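The isolation argument can be checked with a token count. The sketch assumes no caching and that each turn re-reads the entire window, so the figures are an upper bound rather than a bill; the step sizes reuse the 50K-in, 200-out example from earlier, and the function names are illustrative.

```python
def inline_tokens_processed(steps: int, tokens_per_step: int) -> int:
    """Tokens read across all turns when every step's raw output stays in the parent window.

    Turn k re-reads the output of all k steps so far, so the total is
    tokens_per_step * (1 + 2 + ... + steps).
    """
    return tokens_per_step * steps * (steps + 1) // 2


def delegated_tokens_processed(steps: int, tokens_per_step: int, summary_tokens: int) -> int:
    """Tokens read when each step runs in a fresh subagent and the parent keeps only summaries.

    Each subagent reads its own raw material once; the parent accumulates
    the short summaries instead of the raw output.
    """
    subagent_work = steps * tokens_per_step
    parent_work = summary_tokens * steps * (steps + 1) // 2
    return subagent_work + parent_work


inline = inline_tokens_processed(steps=10, tokens_per_step=50_000)
delegated = delegated_tokens_processed(steps=10, tokens_per_step=50_000, summary_tokens=200)
print(f"inline={inline:,}  delegated={delegated:,}")
```

Nothing in this count depends on the subagents running in parallel: the saving comes entirely from what the parent never has to carry.
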
🎓
Series complete. You can now trace a single thread from a softmax in SRAM (08) through token prices on an invoice (07), the two-phase pipeline that sets your latency (06), to the design of the agent harness you use every day (09). That thread — scarce fast memory, managed deliberately — is the oldest idea in computer architecture, wearing its newest costume.