AI Learning Series · Part 7

Context Caching & Cost

Tokens are the new currency of compute. This is the exchange-rate mechanism: what a token actually costs the hardware, and how caching changes the price.

01 The Big Picture

Most LLM API pricing pages now carry a strange line item: cached input tokens, billed at a fraction of the normal input rate (often around 10× cheaper). That discount is not marketing. It is hardware physics surfacing in a price.

In the last doc you saw that prefill — reading your prompt — is real compute: every token, through every layer, every turn. But conversations and agent loops re-send almost identical prompts over and over. The system prompt, the tool definitions, the codebase you pasted — unchanged, turn after turn. Recomputing their KV cache each time is pure waste. Context caching (prompt caching, prefix caching — same idea) keeps the KV cache for a stable prefix alive between requests, so the model only prefills what actually changed.

For an engineer this matters twice: it is the difference between an agent loop that costs ₹50 and one that costs ₹500, and it dictates how you should structure prompts — the central skill of context engineering.

02 What It Is (and Is Not)

Context caching is the reuse of an already-computed KV cache for the longest prefix of a new request that exactly matches a previous request. The match is at the token level, position by position, starting from the first token. One changed token at position k invalidates everything from k onward, because each token's K and V are computed from its position and from hidden states that have already attended to every earlier token.
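The matching rule above can be sketched in a few lines. This is only an illustration of the rule, not any provider's implementation: the token IDs are made up, and `shared_prefix_len` is a hypothetical helper name.

```python
from typing import Sequence


def shared_prefix_len(prev: Sequence[int], new: Sequence[int]) -> int:
    """Count leading tokens that match exactly, position by position."""
    n = 0
    for a, b in zip(prev, new):
        if a != b:
            break  # the first mismatch ends the reusable prefix
        n += 1
    return n


# Hypothetical token IDs, for illustration only.
previous = [101, 7, 7, 42, 9, 13]
same_opening = [101, 7, 7, 42, 9, 13, 55, 56]  # old prompt plus two new tokens
edited_early = [101, 7, 8, 42, 9, 13, 55, 56]  # token at index 2 changed

reusable = shared_prefix_len(previous, same_opening)
reusable_after_edit = shared_prefix_len(previous, edited_early)
print(reusable, reusable_after_edit)
```

With the unchanged opening, all six earlier tokens are reusable. After the edit at index 2, only the first two are, even though the four tokens after the edit are identical to before.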

What it is not:

Not semantic caching

"What's the capital of France?" and "Capital of France?" share no usable prefix. Context caching matches tokens, not meaning. (Semantic response caching exists, but it's an application-layer trick, not a model mechanism.)

Not memory

The cache stores K,V tensors, not facts. It typically expires within minutes (the exact lifetime is provider-specific). It does not make the model "remember you"; that is the job of memory systems built in software (doc 09).

03 Why It Exists

Three converging pressures made caching unavoidable:

Conversations are append-only. Turn N's prompt = turn N−1's prompt + a little more. Without caching, a 50-turn chat re-prefills the same opening tokens 50 times — O(n²) total prompt compute for a linear conversation.
Agents made prompts huge and repetitive. An agent harness sends the same system prompt + tool schemas (often 10–20K tokens) on every single tool-call round trip. Ten tool calls = ten re-reads of an unchanged manual.
Prefill compute is the provider's cost. Recomputed prefixes burn GPU-seconds that produce nothing new. Caching converts that compute into a much cheaper memory-storage problem — and the provider shares the savings with you to incentivize cache-friendly traffic.
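To make the quadratic growth from the first point concrete, here is a small sketch under idealized assumptions: every turn appends the same number of tokens, and the cache (when enabled) never expires and always hits. The function name and the numbers are hypothetical.

```python
def prefill_tokens(turns: int, prefix: int, per_turn: int, cached: bool) -> int:
    """Total prompt tokens freshly prefilled over a whole conversation.

    Turn t sends prefix + t * per_turn tokens. Without caching, all of them
    are recomputed. With an ideal prefix cache, only the newly appended
    tokens are (plus the whole prompt once, on the first turn).
    """
    total = 0
    for t in range(1, turns + 1):
        prompt_len = prefix + t * per_turn
        if cached and t > 1:
            total += per_turn  # only the new tail is prefilled
        else:
            total += prompt_len  # everything is prefilled
    return total


uncached_total = prefill_tokens(turns=50, prefix=2_000, per_turn=200, cached=False)
cached_total = prefill_tokens(turns=50, prefix=2_000, per_turn=200, cached=True)
print(uncached_total, cached_total)
```

For these example numbers the uncached chat prefills 355,000 tokens in total, while the ideally cached one prefills 12,000: the first grows with the square of the turn count, the second linearly.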
🔑
Thought experiment: without caching, an agent that makes 20 tool calls over a 30K-token context pays ~600K prompt tokens. With a stable prefix, it pays ~30K full-price + ~570K at the cached rate. Same conversation, ~5–8× cheaper and with much lower time-to-first-token. Caching is the single highest-leverage cost optimization available to you.
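The arithmetic behind the thought experiment, written out as a sketch. It assumes a cached-token rate of 0.1× the fresh input price and, like the thought experiment itself, treats the context as a fixed 30K tokens per call. It also ignores any surcharge a provider may apply when writing to the cache.

```python
def agent_loop_cost(calls: int, context_tokens: int, cached_rate: float) -> tuple[float, float]:
    """Return (cost without caching, cost with caching) in full-price token units."""
    no_cache = float(calls * context_tokens)
    full_price = context_tokens  # first call populates the cache
    cached = (calls - 1) * context_tokens  # later calls re-read the prefix
    with_cache = full_price + cached_rate * cached
    return no_cache, with_cache


no_cache, with_cache = agent_loop_cost(calls=20, context_tokens=30_000, cached_rate=0.1)
savings = no_cache / with_cache
print(f"{no_cache:.0f} vs {with_cache:.0f} token-equivalents, {savings:.1f}x cheaper")
```

At the assumed 0.1× rate this comes to about 87K full-price token-equivalents instead of 600K, which lands inside the 5–8× range quoted above. A less generous cached rate shrinks the ratio accordingly.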

04 How It Works — Hit vs Miss

Step through two consecutive requests and watch what the prefix match does.

Request 1
system prompt + tools (10K tok) | user msg (1K)
FULL PREFILL: 11K tokens computed
KV cache saved, keyed by token prefix

Request 2 (next turn, same opening)
system prompt + tools (10K tok) | user msg (1K) | reply + new msg (2K)
CACHE HIT on the first 11K tokens: KV restored, zero prefill compute
PREFILL Δ ONLY: 2K new tokens computed

⚠ If even ONE token in the 10K prefix had changed (a timestamp, a reordered tool, "Hi" → "Hey"), everything after it recomputes at full price.

Under the hood the provider keys cache blocks by a hash of the token prefix. vLLM hashes fixed-size token blocks; managed APIs either let you mark explicit "cache breakpoints" in the prompt or choose them implicitly. Cached blocks live in GPU/CPU memory with a TTL of minutes. This is a working-set optimization, not storage.
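A toy version of block-keyed prefix caching might look like the sketch below. It is loosely inspired by the block-hashing idea and is not vLLM's or any provider's actual code: `BLOCK_SIZE`, `block_keys`, and `PrefixCache` are hypothetical names, the store holds expiry times instead of real K,V tensors, and refreshing the TTL on every request is an assumption.

```python
import hashlib
from typing import Dict, List, Sequence

BLOCK_SIZE = 4  # tokens per block; tiny on purpose for the demo


def block_keys(tokens: Sequence[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Chain-hash each full block so a key identifies the whole prefix up to it."""
    keys: List[str] = []
    parent = ""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(parent)
    return keys


class PrefixCache:
    """Toy stand-in for a KV block store: maps prefix hash to expiry time."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self.ttl = ttl_seconds
        self.expiry: Dict[str, float] = {}

    def serve(self, tokens: Sequence[int], now: float) -> int:
        """Return how many leading tokens hit the cache, then store all blocks."""
        keys = block_keys(tokens)
        hits = 0
        for key in keys:
            if self.expiry.get(key, 0.0) <= now:
                break  # first missing or expired block ends the reusable prefix
            hits += 1
        for key in keys:
            self.expiry[key] = now + self.ttl  # write new blocks, refresh old ones
        return hits * BLOCK_SIZE


cache = PrefixCache(ttl_seconds=300.0)
req1 = list(range(12))                # 12 tokens, three full blocks
req2 = req1 + [100, 101, 102, 103]    # same opening plus one new block
req3 = [0, 1, 99] + req1[3:]          # token at index 2 changed

first = cache.serve(req1, now=0.0)     # cold cache
second = cache.serve(req2, now=10.0)   # same opening
third = cache.serve(req3, now=20.0)    # early edit
late = cache.serve(req2, now=1000.0)   # long after the TTL
print(first, second, third, late)
```

Because each block's key folds in its parent's key, a change in the first block changes every later key, which is why the edited request reuses nothing. The last call shows the TTL at work: the same prompt that hit earlier misses once its blocks have expired.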

05 Anatomy of a Cache-Friendly Prompt

Because matching is prefix-based, order is everything: stable content first, volatile content last. This single rule is most of "prompt structure" advice, derived from hardware:

[stable, cache forever] System prompt: role, rules, output format
[stable, cache forever] Tool / function schemas (don't reorder them!)
[stable per session] Large reference material: codebase, docs, examples
[append-only] Conversation history (grows at the end, prefix preserved)
[volatile, never cached] Current user message, retrieved chunks, timestamps

✓ Do

Put instructions before data; keep tool lists in fixed order; append new turns at the end; isolate anything dynamic (dates, IDs) at the bottom.

✗ Don't

Inject "Today is 2026-06-11 14:32" into the system prompt; shuffle retrieved chunks into the prefix; rewrite history (summarizing old turns invalidates the whole suffix — do it rarely, in big batches).
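The effect of these rules can be seen by comparing two ways of assembling the same turns. This is a sketch only: whole prompt segments stand in for tokens, and every string and function name here is hypothetical.

```python
from typing import List

SYSTEM = "You are a code reviewer. Reply in JSON."
TOOLS = ["tool:read_file", "tool:run_tests"]  # fixed order, never shuffled
REFERENCE = "<pasted codebase>"


def shared_prefix(a: List[str], b: List[str]) -> int:
    """Number of leading segments that are identical in both prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cache_friendly(history: List[str], user_msg: str, timestamp: str) -> List[str]:
    # Stable content first; the timestamp and new message go last.
    return [SYSTEM, *TOOLS, REFERENCE, *history, f"now={timestamp}", user_msg]


def cache_hostile(history: List[str], user_msg: str, timestamp: str) -> List[str]:
    # The timestamp sits inside the system prompt, so the very first segment changes.
    return [f"{SYSTEM} Today is {timestamp}.", *TOOLS, REFERENCE, *history, user_msg]


turn1_history: List[str] = []
turn2_history = ["user: review main.py", 'assistant: {"ok": true}']

friendly_reuse = shared_prefix(
    cache_friendly(turn1_history, "review main.py", "14:32"),
    cache_friendly(turn2_history, "now review utils.py", "14:33"),
)
hostile_reuse = shared_prefix(
    cache_hostile(turn1_history, "review main.py", "14:32"),
    cache_hostile(turn2_history, "now review utils.py", "14:33"),
)
print(friendly_reuse, hostile_reuse)
```

In the friendly layout the system prompt, both tool schemas, and the reference material (four segments) carry over to the next turn. In the hostile layout the changing timestamp sits in the first segment, so nothing carries over.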

06 The Economics — Tokens as Currency

Each token class maps to a different hardware cost, and pricing follows:

Token class | Hardware reality | Typical relative price
Cached input | Memory lookup + restore, almost no compute | ~0.1×
Fresh input | One parallel prefill pass, compute-bound | 1× (baseline)
Output | One full sequential pass per token, memory-bound, not batchable within a single response | 3–5×

This table is the budget you optimize as a context engineer. The strategies in doc 09 — skills loaded on demand, memory banks, subagents with separate contexts — are all ways of moving spend between these rows: less fresh input, more cached input, fewer wasted output tokens. And note the compounding effect: smaller context isn't just cheaper per request — it also decodes faster (smaller KV cache to read per token, doc 06) and tends to hallucinate less (less irrelevant material competing for attention).
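One way to turn the table into a working budget is a small cost function. This is a sketch under stated assumptions: fresh input is treated as the 1× baseline, output is priced at 4× (a midpoint of the 3–5× range), and the token counts in the example are invented.

```python
from typing import Dict

# Assumed relative prices; real price sheets vary by provider and model.
RELATIVE_PRICE: Dict[str, float] = {
    "cached_input": 0.1,
    "fresh_input": 1.0,
    "output": 4.0,
}


def request_cost(cached_input: int, fresh_input: int, output: int,
                 prices: Dict[str, float] = RELATIVE_PRICE) -> Dict[str, float]:
    """Cost of one request per token class, in fresh-input-token equivalents."""
    parts = {
        "cached_input": cached_input * prices["cached_input"],
        "fresh_input": fresh_input * prices["fresh_input"],
        "output": output * prices["output"],
    }
    parts["total"] = sum(parts.values())
    return parts


# Hypothetical agent step: 28K cached prefix, 2K new input, 500 output tokens.
with_cache = request_cost(cached_input=28_000, fresh_input=2_000, output=500)
# The same step with no cache hit: all 30K input tokens are fresh.
without_cache = request_cost(cached_input=0, fresh_input=30_000, output=500)
print(round(with_cache["total"]), round(without_cache["total"]))
```

Under these assumptions the cached step costs about 6,800 token-equivalents against 32,000 uncached. Note also that once the prefix is cached, the 500 output tokens cost as much as the 2,000 fresh input tokens, which is why trimming wasted output is part of the same budget.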

💱
The currency metaphor, made precise: tokens are denominated claims on GPU time. Cached tokens are claims on memory (cheap, abundant); fresh input tokens are claims on compute (the expensive burst); output tokens are claims on memory bandwidth (the scarcest resource of all — doc 08). Spend in the cheap column.

07 Mental Models

Compiler warm cache

A stable prompt prefix is like an unchanged header file: touch nothing and the build is incremental; touch line 1 and everything rebuilds. Lets you reason about: why prompt order matters more than prompt size for cost.

Unlike build caches, prompt caches expire in minutes and are per-provider, per-model.

Restaurant mise en place

The kitchen preps stable ingredients before service (cached prefix); your order only triggers the final cooking (volatile suffix). Lets you reason about: TTFT improvements from caching — prep done before you even ordered.

Mise en place is reused across different dishes; a KV prefix only serves prompts with the identical opening.

08 Common Misconceptions

"Caching means the model might give me stale answers." No. The cache stores attention K,V for your prompt tokens, not responses. Generation still runs fresh every time; output is exactly what it would be without caching (same seed/temperature caveats aside).

"I'm being charged for the same tokens every turn — that's a scam." The API is stateless by design: re-sending history is what lets any GPU in the fleet serve your next turn. The cached-token discount is precisely the refund for the part the provider didn't recompute.

"More context is always better, it's basically free with caching." Caching makes the prefill of stable context nearly free, but every decode step still reads the whole KV cache — long context permanently taxes output speed and degrades attention quality ("lost in the middle"). Cache discounts don't repeal doc 06's math.

"Cache = the model remembers me." The cache is tensors with a minutes-long TTL keyed to exact token prefixes. Persistent memory is software you build on top (doc 09).

🗺️
Next: we said output tokens are "claims on memory bandwidth — the scarcest resource." Doc 08 goes down to the silicon to show why bandwidth, not FLOPs, is the wall everything hits.