01 The Big Picture
Every API pricing page now has a strange line item: "cached input tokens — 10× cheaper." That discount is not marketing. It is hardware physics surfacing in a price.
In the last doc you saw that prefill — reading your prompt — is real compute: every token, through every layer, every turn. But conversations and agent loops re-send almost identical prompts over and over. The system prompt, the tool definitions, the codebase you pasted — unchanged, turn after turn. Recomputing their KV cache each time is pure waste. Context caching (prompt caching, prefix caching — same idea) keeps the KV cache for a stable prefix alive between requests, so the model only prefills what actually changed.
For an engineer this matters twice: it is the difference between an agent loop that costs ₹50 and one that costs ₹500, and it dictates how you should structure prompts — the central skill of context engineering.
02 What It Is (and Is Not)
Context caching is the reuse of an already-computed KV cache for the longest prefix of a new request that exactly matches a previous request. Matching happens at the token level, position by position, starting from token 1. One changed token at position k invalidates everything from k onward, because each token's K and V depend on its position and on every token that precedes it.
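The matching rule can be sketched in a few lines. This is a minimal illustration, not any provider's implementation: the function name `cached_prefix_len` and the token IDs are made up, and real IDs would come from the model's tokenizer.

```python
from typing import Sequence


def cached_prefix_len(previous: Sequence[int], current: Sequence[int]) -> int:
    """Count the leading tokens of `current` that exactly match `previous`."""
    n = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # the first mismatch ends the reusable prefix
        n += 1
    return n


# Hypothetical token IDs, chosen only for illustration.
turn_1 = [101, 7, 7, 42, 9, 300]
turn_2 = [101, 7, 7, 42, 9, 300, 55, 56]  # same prefix, new tokens appended
turn_3 = [101, 7, 8, 42, 9, 300, 55, 56]  # one token changed at index 2

print(cached_prefix_len(turn_1, turn_2))  # 6
print(cached_prefix_len(turn_1, turn_3))  # 2
```

Appending tokens keeps the whole earlier request reusable. Changing one early token leaves only the tokens before it, even though everything after it is textually identical.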
What it is not:
Not semantic caching
"What's the capital of France?" and "Capital of France?" share no usable prefix. Context caching matches tokens, not meaning. (Semantic response caching exists, but it's an application-layer trick, not a model mechanism.)
Not memory
The cache stores K,V tensors, not facts. It expires in minutes. It does not make the model "remember you" — that's the job of memory systems built in software (doc 09).
03 Why It Exists
Three converging pressures made caching unavoidable:
04 How It Works — Hit vs Miss
Step through two consecutive requests and watch what the prefix match does.
Under the hood, the provider keys cache blocks by a hash of the token prefix (vLLM hashes fixed-size token blocks; managed APIs rely on explicit or implicit "cache breakpoints" in your prompt). Cached blocks live in GPU/CPU memory with a TTL of minutes: this is a working-set optimization, not storage.
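A toy simulation makes the hit and miss cases concrete. It assumes a block size of 4 tokens, chains each block's hash with the hash of the block before it so that a key identifies the whole prefix, and leaves out TTL and eviction entirely. The names `block_hashes` and `serve` are hypothetical, and this is a sketch of the general idea rather than vLLM's actual code.

```python
import hashlib
from typing import List, Sequence, Set, Tuple

BLOCK_SIZE = 4  # assumed for illustration; real systems choose their own size


def block_hashes(tokens: Sequence[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash every full block together with the previous block's hash."""
    hashes: List[str] = []
    parent = ""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def serve(tokens: Sequence[int], cache: Set[str]) -> Tuple[int, int]:
    """Return (cached_tokens, prefilled_tokens) and remember the new blocks."""
    hashes = block_hashes(tokens)
    hit_blocks = 0
    for h in hashes:
        if h not in cache:
            break  # blocks after the first miss cannot be reused
        hit_blocks += 1
    cache.update(hashes)
    cached = hit_blocks * BLOCK_SIZE
    return cached, len(tokens) - cached


cache: Set[str] = set()
request_1 = list(range(10))              # cold cache
request_2 = list(range(10)) + [99, 98]   # same prefix, two tokens appended
request_3 = [500] + list(range(1, 10))   # first token changed

print(serve(request_1, cache))  # (0, 10)
print(serve(request_2, cache))  # (8, 4)
print(serve(request_3, cache))  # (0, 10)
```

The second request reuses two full blocks and prefills only the remaining four tokens. The third request misses completely: because hashes are chained, a changed first token alters the key of every later block.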
05 Anatomy of a Cache-Friendly Prompt
Because matching is prefix-based, order is everything: stable content first, volatile content last. This single rule is most of "prompt structure" advice, derived from hardware:
Do: put instructions before data; keep tool lists in a fixed order; append new turns at the end; isolate anything dynamic (dates, IDs) at the bottom.
Don't: inject "Today is 2026-06-11 14:32" into the system prompt; shuffle retrieved chunks into the prefix; rewrite history (summarizing old turns invalidates the whole suffix, so do it rarely and in big batches).
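The ordering rule can be checked with a small sketch. Whole prompt segments stand in for tokens here, and `build_prompt`, `shared_prefix`, the system prompt text, and the tool names are all hypothetical. The only point is to compare where a volatile timestamp sits.

```python
from typing import List, Sequence

SYSTEM_PROMPT = "You are a careful coding assistant."  # stable across turns
TOOL_NAMES = ["search", "read_file", "run_tests"]      # stable across turns


def build_prompt(history: List[str], timestamp: str, timestamp_first: bool) -> List[str]:
    """Assemble prompt segments; sorting keeps the tool list in a fixed order."""
    tools = "Tools: " + ", ".join(sorted(TOOL_NAMES))
    clock = f"Today is {timestamp}"
    if timestamp_first:
        return [clock, SYSTEM_PROMPT, tools, *history]   # volatile content first
    return [SYSTEM_PROMPT, tools, *history, clock]        # volatile content last


def shared_prefix(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading segments two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


history_1 = ["user: fix the bug"]
history_2 = history_1 + ["assistant: done", "user: add a test"]

good = shared_prefix(
    build_prompt(history_1, "2026-06-11 14:32", timestamp_first=False),
    build_prompt(history_2, "2026-06-11 14:33", timestamp_first=False),
)
bad = shared_prefix(
    build_prompt(history_1, "2026-06-11 14:32", timestamp_first=True),
    build_prompt(history_2, "2026-06-11 14:33", timestamp_first=True),
)

print(good)  # 3
print(bad)   # 0
```

With the timestamp at the bottom, the system prompt, the tool list, and the earlier turn all stay reusable. With the timestamp at the top, a one-minute clock change leaves nothing to reuse.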
06 The Economics — Tokens as Currency
Each token class maps to a different hardware cost, and pricing follows:
| Token class | Hardware reality | Typical relative price |
|---|---|---|
| Cached input | Memory lookup + restore — almost no compute | ~0.1× |
| Fresh input | One parallel prefill pass — compute-bound | 1× |
| Output | One full sequential pass per token; memory-bound, and one request's tokens cannot be batched together | 3–5× |
This table is the budget you optimize as a context engineer. The strategies in doc 09 — skills loaded on demand, memory banks, subagents with separate contexts — are all ways of moving spend between these rows: less fresh input, more cached input, fewer wasted output tokens. And note the compounding effect: smaller context isn't just cheaper per request — it also decodes faster (smaller KV cache to read per token, doc 06) and tends to hallucinate less (less irrelevant material competing for attention).
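To make the budget tangible, here is a back-of-the-envelope sketch using the relative prices from the table, with output fixed at 4× as an assumed midpoint of the 3–5× range. The token counts are invented. The loop assumes the cache stays warm between turns and that the whole earlier context hits from the second turn onward, and it ignores any extra charge a provider may apply for writing to the cache. `RelativePrices`, `request_cost`, and `loop_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelativePrices:
    cached_input: float = 0.1  # from the table: ~0.1x
    fresh_input: float = 1.0   # baseline
    output: float = 4.0        # assumed midpoint of 3-5x


def request_cost(cached: int, fresh: int, output: int,
                 prices: RelativePrices = RelativePrices()) -> float:
    """Cost of one request in units of 'one fresh input token'."""
    return (cached * prices.cached_input
            + fresh * prices.fresh_input
            + output * prices.output)


def loop_cost(prefix: int, new_per_turn: int, output_per_turn: int,
              turns: int, use_cache: bool) -> float:
    """Total cost of a multi-turn loop that re-sends its full history each turn."""
    total = 0.0
    context = prefix  # tokens carried over from earlier in the conversation
    for turn in range(turns):
        hit = use_cache and turn > 0  # the first turn always prefills everything
        cached = context if hit else 0
        fresh = new_per_turn + (0 if hit else context)
        total += request_cost(cached, fresh, output_per_turn)
        context += new_per_turn + output_per_turn  # the reply becomes input next turn
    return total


without_cache = loop_cost(20_000, 500, 300, turns=10, use_cache=False)
with_cache = loop_cost(20_000, 500, 300, turns=10, use_cache=True)

print(f"{without_cache:.0f}")  # 253000
print(f"{with_cache:.0f}")     # 58600
```

Under these assumptions the cached loop costs a bit under a quarter of the uncached one. Most of what remains is the first cold prefill and the output tokens, which caching does not discount.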
07 Mental Models
A stable prompt prefix is like an unchanged header file: touch nothing and the build is incremental; touch line 1 and everything rebuilds. Lets you reason about: why prompt order matters more than prompt size for cost.
The kitchen preps stable ingredients before service (cached prefix); your order only triggers the final cooking (volatile suffix). Lets you reason about: TTFT improvements from caching — prep done before you even ordered.
08 Common Misconceptions
"Caching means the model might give me stale answers." No. The cache stores attention K,V for your prompt tokens, not responses. Generation still runs fresh every time; output is exactly what it would be without caching (same seed/temperature caveats aside).
"I'm being charged for the same tokens every turn — that's a scam." The API is stateless by design: re-sending history is what lets any GPU in the fleet serve your next turn. The cached-token discount is precisely the refund for the part the provider didn't recompute.
"More context is always better, it's basically free with caching." Caching makes the prefill of stable context nearly free, but every decode step still reads the whole KV cache — long context permanently taxes output speed and degrades attention quality ("lost in the middle"). Cache discounts don't repeal doc 06's math.
"Cache = the model remembers me." The cache is tensors with a minutes-long TTL keyed to exact token prefixes. Persistent memory is software you build on top (doc 09).