01 The Big Picture
Doc 09 ended with a rule: put exactly the right 2K tokens in the window, not 200K of maybe-relevant ones. It never said how to find them. This doc is the how.
Two ideas combine. Embeddings turn text into points in a high-dimensional space where distance ≈ semantic similarity — meaning becomes geometry, and "find related text" becomes "find nearest points," a problem computers are excellent at. RAG (Retrieval-Augmented Generation) uses that geometry as the selection mechanism for the context window: embed the question, fetch the nearest knowledge, paste it before generating. It is the standard answer to the question every LLM deployment hits in week one: "how do I make it answer from MY data?"
02 What Is an Embedding?
An embedding is a learned function: text in, fixed-length vector out (typically 384 to 3,072 floats). The model is trained, usually by contrastive learning (pull paired texts together, push unrelated ones apart), so that cosine similarity between vectors tracks semantic relatedness. "How do I reset my password?" and "Steps to recover account access" land close together despite sharing almost no words; that is the core capability, and it is exactly what plain keyword matching cannot do.
Clusters form by meaning, not vocabulary. Closeness is measured with cosine similarity (higher means closer): cos(a,b) = a·b / (|a||b|). It is cheap, parallel, and GPU-friendly.
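The formula above can be written out directly. This is a minimal sketch in plain Python; the three-dimensional vectors are made-up stand-ins for illustration (real embeddings have hundreds or thousands of dimensions and come from a trained model).

```python
import math

def cosine_similarity(a, b):
    """Return cos(a, b) = a.b / (|a| |b|) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, invented for this example only.
password_reset = [0.9, 0.1, 0.0]
account_recovery = [0.8, 0.3, 0.1]
pizza_recipe = [0.0, 0.2, 0.9]

# The two account-related vectors score far higher than the unrelated pair.
related = cosine_similarity(password_reset, account_recovery)
unrelated = cosine_similarity(password_reset, pizza_recipe)
```

In practice the vectors are often normalized to unit length once at indexing time, so the similarity reduces to a single dot product per comparison.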
What embeddings are not: they are not the LLM's internal states (a separate, usually smaller model produces them), and they are not exact. They compress meaning lossily, like a hash that preserves neighborhood instead of equality.
03 Why RAG Exists
Thought experiment: ask "what did we decide about the payment-retry logic in March?" Without RAG the model can only guess or decline, because the answer literally isn't in the weights. With RAG, the meeting note is fetched and the model merely summarizes it. Same model, opposite reliability.
04 How RAG Works — The Pipeline
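Two phases: an offline indexing pass (split documents into chunks, embed each chunk, store the vectors in an index), then a per-query loop (embed the question, fetch the nearest chunks, paste them into the prompt, generate).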
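Both phases fit in a few dozen lines. The sketch below is illustrative only: `toy_embed` is a hypothetical stand-in that hashes words into buckets, so it captures word overlap rather than meaning, and the "index" is a plain list scanned linearly. A real system would call a trained embedding model and a vector index; the function names and prompt wording here are assumptions.

```python
import hashlib
import math
import re

DIM = 256  # toy dimensionality, chosen arbitrarily for this sketch

def toy_embed(text):
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized.
    It measures word overlap only, NOT semantic similarity."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def chunk(text, max_words=40):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def build_index(documents):
    """Offline phase: chunk every document, embed each chunk, store (vector, text)."""
    return [(toy_embed(c), c) for doc in documents for c in chunk(doc)]

def retrieve(index, question, k=2):
    """Query phase, step 1: embed the question and return the k nearest chunks.
    Vectors are unit length, so the dot product equals cosine similarity."""
    q = toy_embed(question)
    scored = [(sum(x * y for x, y in zip(q, vec)), text) for vec, text in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

def build_prompt(index, question, k=2):
    """Query phase, step 2: paste the retrieved chunks ahead of the question."""
    context = "\n\n".join(retrieve(index, question, k))
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"

docs = [
    "The payment retry logic uses exponential backoff with three attempts.",
    "The office cafeteria serves lunch from noon until two.",
]
index = build_index(docs)
prompt = build_prompt(index, "How does the payment retry logic work?", k=1)
```

The string returned by `build_prompt` is what would be sent to the LLM; generation itself is outside this sketch.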
The decisions that actually matter
| Decision | Trade-off |
|---|---|
| Chunk size | Small chunks → precise retrieval, lost surrounding context. Large → context preserved, similarity diluted. Common fix: retrieve small, expand to parent section before pasting. |
| Hybrid search | Embeddings miss exact identifiers (error codes, function names) that keyword/BM25 nails. Production = vector + keyword, fused (e.g. reciprocal rank fusion). |
| Rerank or not | Bi-encoder (one vector per text, fast, indexable) recalls candidates; cross-encoder (reads query+chunk together, slow, accurate) re-orders the top 50. Classic two-stage: cheap filter, expensive refine. |
| k and budget | More chunks ≠ better: each costs tokens (doc 07) and dilutes attention (doc 09). Retrieve generously, paste selectively. |
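The hybrid-search row mentions reciprocal rank fusion. A minimal sketch follows, assuming each retriever returns an ordered list of document ids; the ids are invented, and `k=60` is a commonly used default rather than a requirement.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list.
    Each doc scores sum(1 / (k + rank)) over the lists it appears in (rank is 1-based)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from a vector search and a keyword (BM25-style) search.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["doc_a", "doc_c", "doc_b", "doc_d"]
```

Because the method uses only ranks, it needs no calibration between cosine scores and keyword scores, which live on unrelated scales. Documents that both retrievers return rise to the top.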
05 Where RAG Fails (and What's Replacing Parts of It)
Failure modes
Questions whose answers span many chunks ("summarize all Q1 decisions"): similarity finds pieces, not wholes. Implicit queries whose wording is semantically far from the text that answers them. Stale indexes after docs change. And the silent one: retrieval returns plausible-but-wrong chunks and the model confidently grounds on them. Garbage in, citation out.
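Of these, the stale index is the most mechanical to guard against. One possible approach, sketched here under the assumption that a content fingerprint was stored per document at indexing time, is to compare fingerprints and re-embed only what changed. The function and variable names are hypothetical.

```python
import hashlib

def fingerprint(text):
    """Content hash stored alongside each document when it is indexed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale(indexed_fingerprints, current_docs):
    """Compare stored fingerprints with the current corpus.
    Returns (changed_or_new_ids, deleted_ids), both sorted."""
    changed = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_fingerprints.get(doc_id) != fingerprint(text)
    )
    deleted = sorted(
        doc_id for doc_id in indexed_fingerprints if doc_id not in current_docs
    )
    return changed, deleted

# Hypothetical state: what the index saw versus what the corpus holds now.
indexed = {
    "retry.md": fingerprint("Retry three times."),
    "old.md": fingerprint("Deprecated notes."),
}
current = {
    "retry.md": "Retry five times.",   # edited since indexing
    "new.md": "Fresh design doc.",     # added since indexing
}
changed, deleted = find_stale(indexed, current)
```

Here `changed` lists the documents to re-chunk and re-embed, and `deleted` lists those whose vectors should be removed from the index.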
The agentic shift
Classic RAG is one-shot: retrieve once, answer. Agentic retrieval (doc 12) lets the model iterate — search, read, realize it needs something else, search again — and use non-vector tools too (grep, SQL, file reads). One-shot RAG is becoming a tool inside a loop rather than the whole architecture.
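The shape of that loop can be sketched as follows. Everything here is a stand-in: `decide` plays the role of the model choosing the next action, `tools` maps hypothetical tool names to plain functions, and the scripted policy exists only so the example runs without an LLM.

```python
def agentic_answer(question, decide, tools, max_steps=5):
    """Iterate until the policy answers or the step budget runs out.
    decide(question, notes) returns ("search", tool_name, query) or ("answer", text).
    The loop, not the policy, executes each requested tool call."""
    notes = []
    for _ in range(max_steps):
        action = decide(question, notes)
        if action[0] == "answer":
            return action[1], notes
        _, tool_name, query = action
        notes.append((tool_name, query, tools[tool_name](query)))
    return None, notes  # budget exhausted without an answer

def scripted_decide(question, notes):
    """Hypothetical stand-in for the model: vector search, then grep, then answer."""
    if not notes:
        return ("search", "vector", question)
    if len(notes) == 1:
        return ("search", "grep", "ERR_402")
    return ("answer", f"Based on {len(notes)} lookups.")

# Stub tools returning canned results; real ones would hit an index or the filesystem.
tools = {
    "vector": lambda query: ["meeting note about retries"],
    "grep": lambda query: ["retry.py: raise ERR_402"],
}

answer, notes = agentic_answer("Why do retries fail?", scripted_decide, tools)
```

The one-shot retriever from classic RAG survives here as the `"vector"` entry: one tool among several, called as many times as the policy asks.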
06 Mental Models
A hash map sends equal keys to equal slots; an embedding sends similar meanings to nearby points. Lets you reason about: what embeddings can (fuzzy match at scale) and cannot (exact lookup, logic, recency) do.
Fine-tuning = studying (knowledge in the head, fuzzy recall); RAG = open-book (knowledge on the desk, must find the page fast). Lets you reason about: fine-tune for style/skills, retrieve for facts; and why the librarian (retriever), not the student (LLM), is usually the weak link.
Doc 08's hierarchy one level up: corpus = disk, vector index = page table, retrieved chunks = the pages staged into SRAM (the window). Lets you reason about: every RAG design choice as a caching/prefetch policy question.
07 Common Misconceptions
"The model searches the database." The model never touches the DB. An external pipeline searches and pastes text into the prompt; the model just reads. (Agentic tool use changes this — but then the model requests a search; it still doesn't execute it.)
"Long context windows kill RAG." They kill small-corpus RAG. For 50 pages, pasting everything beats a pipeline. For 50 GB, selection is forever — and docs 06–07 showed full windows cost money and speed every single call. Capacity moves the threshold; it doesn't repeal selection.
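A back-of-envelope calculation shows the size of the gap. The constants below are rough assumptions for illustration (about 500 tokens per page, about 4 bytes of plain text per token, a 200K-token window); actual ratios depend on the tokenizer and the content.

```python
# Rough assumptions, not measurements.
TOKENS_PER_PAGE = 500
BYTES_PER_TOKEN = 4
WINDOW_TOKENS = 200_000

small_corpus_tokens = 50 * TOKENS_PER_PAGE        # 50 pages
large_corpus_tokens = 50 * 10**9 // BYTES_PER_TOKEN  # 50 GB of text

fits_in_window = small_corpus_tokens <= WINDOW_TOKENS
windows_needed = -(-large_corpus_tokens // WINDOW_TOKENS)  # ceiling division

print(small_corpus_tokens)   # 25000
print(large_corpus_tokens)   # 12500000000
print(windows_needed)        # 62500
```

Under these assumptions the 50-page corpus uses about an eighth of one window, while the 50 GB corpus would need tens of thousands of full windows per question, so some selection step is unavoidable.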
"Better embeddings will fix my RAG." Most RAG failures are chunking, missing keyword search, stale indexes, or unanswerable-from-corpus questions. The embedding model is rarely the binding constraint — measure retrieval recall before swapping models (doc 13's whole point).
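Measuring retrieval recall takes only a small labeled set of questions with known relevant chunks. A minimal sketch of recall@k follows; the chunk ids and the evaluation-set format are assumptions made for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk ids that appear in the top-k retrieved ids."""
    if not relevant:
        raise ValueError("need at least one relevant id per question")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_recall_at_k(eval_set, retrieve_fn, k):
    """Average recall@k over (question, relevant_ids) pairs.
    retrieve_fn(question) must return a ranked list of chunk ids."""
    total = sum(recall_at_k(retrieve_fn(q), rel, k) for q, rel in eval_set)
    return total / len(eval_set)

# Hypothetical example: the retriever surfaced one of the two relevant chunks.
ranked = ["c1", "c7", "c3"]
relevant = {"c3", "c9"}
score = recall_at_k(ranked, relevant, k=3)   # 0.5
```

If this number is low, the relevant chunk never reaches the model, and changes to the prompt or the generator cannot compensate. Running the same evaluation before and after a change to chunking, hybrid search, or the embedding model shows which one actually moved retrieval.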
"Fine-tuning teaches the model my documents." Fine-tuning on your docs teaches their style and vocabulary; reliable factual recall of specific content is exactly what a few epochs of gradient descent do worst. Facts belong in context; behaviors belong in weights.