AI Learning Series · Part 11 · Practice Track

Embeddings & RAG

How meaning becomes geometry, and how retrieval decides what earns a place in the precious context window.


01 The Big Picture

Doc 09 ended with a rule: put exactly the right 2K tokens in the window, not 200K of maybe-relevant ones. It never said how to find them. This doc is the how.

Two ideas combine. Embeddings turn text into points in a high-dimensional space where distance ≈ semantic similarity — meaning becomes geometry, and "find related text" becomes "find nearest points," a problem computers are excellent at. RAG (Retrieval-Augmented Generation) uses that geometry as the selection mechanism for the context window: embed the question, fetch the nearest knowledge, paste it before generating. It is the standard answer to the question every LLM deployment hits in week one: "how do I make it answer from MY data?"

02 What Is an Embedding?

An embedding is a learned function: text in, fixed-length vector out (typically 384–3,072 floats). The model is trained, usually by contrastive learning (pull paired texts together, push unrelated ones apart), so that cosine similarity between vectors tracks semantic relatedness. "How do I reset my password?" and "Steps to recover account access" land close together despite sharing almost no words. That is the entire magic, and it is what pure keyword matching cannot do.
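
One common way to write the "pull together, push apart" objective is an InfoNCE-style loss. This is a sketch of the general form, not the recipe of any particular embedding model; the symbols are assumptions introduced here: q is the embedded query text, p^+ its paired text, p_1 … p_N the candidates in the batch (with p^+ among them), and τ a temperature.

```latex
\mathcal{L}(q) = -\log
  \frac{\exp\!\big(\cos(q, p^{+}) / \tau\big)}
       {\sum_{j=1}^{N} \exp\!\big(\cos(q, p_{j}) / \tau\big)}
```

Minimizing it raises the cosine similarity of the true pair relative to every other candidate in the batch, which is the geometric behavior described above.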

[Figure: embedding space (2D cartoon of 1,536D). Labeled points: reset password, recover account, k8s pod eviction, OOMKilled debug, biryani recipe.]

Clusters form by meaning, not vocabulary. Distance is computed with cosine similarity: cos(a,b) = a·b / (|a||b|) — cheap, parallel, GPU-friendly.
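
The formula translates almost directly into code. The sketch below uses plain Python lists and made-up 3-dimensional vectors standing in for real embeddings; the vector values and variable names are illustrative assumptions, not output from any model.

```python
import math

def cosine_similarity(a, b):
    """Return cos(a, b) = a·b / (|a||b|) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical 3-D "embeddings" (real ones have hundreds of dimensions).
reset_password = [0.9, 0.1, 0.0]
recover_account = [0.8, 0.2, 0.1]
biryani_recipe = [0.0, 0.1, 0.9]

near = cosine_similarity(reset_password, recover_account)
far = cosine_similarity(reset_password, biryani_recipe)
print(f"related: {near:.3f}  unrelated: {far:.3f}")
```

In practice the same computation is usually batched as a matrix product over normalized vectors, which is where the "parallel, GPU-friendly" part comes from.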

What embeddings are not: they are not the LLM's internal states (a separate, smaller model produces them), and they are not exact. They compress meaning lossily, like a hash that preserves neighborhood instead of equality.

🧠
DSA connection: finding nearest neighbors among millions of vectors exactly is O(n) per query — too slow. Vector databases use ANN (approximate nearest neighbor) indexes, chiefly HNSW: a multi-layer skip-list-of-graphs where the top layers make coarse hops and lower layers refine — O(log n)-ish search by the same intuition as skip lists and B-tree levels. You trade a little recall for orders of magnitude of speed.
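
For contrast, here is roughly what the exact O(n) baseline looks like: score every stored vector, keep the best k. This is the linear scan that an ANN index such as HNSW avoids; it is not an HNSW implementation. The `toy_index` contents are invented for illustration.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def exact_top_k(query, vectors, k):
    """Exact nearest neighbors: one similarity computation per stored vector."""
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in vectors.items())
    return heapq.nlargest(k, scored)  # list of (score, doc_id), best first

# Hypothetical 3-D vectors; a real index holds millions of high-dimensional ones.
toy_index = {
    "reset password": [0.9, 0.1, 0.0],
    "recover account": [0.8, 0.2, 0.1],
    "k8s pod eviction": [0.1, 0.9, 0.1],
    "biryani recipe": [0.0, 0.1, 0.9],
}

hits = exact_top_k([1.0, 0.0, 0.0], toy_index, k=2)
top_ids = [doc_id for _, doc_id in hits]
print(top_ids)
```

The scan is fine for a few thousand vectors. The cost grows with every vector added, which is the point at which an approximate index starts to pay for its small loss in recall.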

03 Why RAG Exists

Weights are frozen and dated. The model knows nothing after its training cutoff and nothing private to you. Fine-tuning to inject facts is slow, expensive, hard to update, and bad at reliable recall — weights store skills and patterns well, individual facts poorly.
The window can't hold your corpus. Even a 1M-token window can't take your company's wiki — and docs 06–07 showed that filling it costs linearly per request and degrades attention. You need selection, not capacity.
Grounding fights hallucination. A model answering from retrieved text in-context can cite it; a model answering from parametric memory plausibly confabulates. Retrieval converts "recall" into "reading comprehension" — a task LLMs are far more reliable at.

Thought experiment: ask "what did we decide about the payment-retry logic in March?" Without RAG the model can only guess or decline, because the answer literally isn't in the weights. With RAG, the meeting note is fetched and the model merely summarizes it. Same model, opposite reliability.

04 How RAG Works — The Pipeline

Two phases: an offline indexing pass, then a per-query loop. Step through both.

[Figure: the RAG pipeline]

OFFLINE (once)
Docs / wiki / codebase
→ Chunker: ~300–800 tok pieces
→ Embedding model: chunk → vector
→ Vector DB: HNSW index + metadata

PER QUERY
User question: embedded with same model
→ ANN search: top-k nearest chunks
→ Reranker: cross-encoder re-scores, k→3
→ Context assembly: chunks at the END (doc 07!)
→ LLM: grounded answer + citations

The LLM never "searches"; it just reads what the pipeline chose. Retrieval quality caps answer quality.
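
The two phases can be sketched end to end in a few dozen lines. Everything here is a stand-in: `toy_embed` is a hashed bag-of-words that matches shared vocabulary only (a real embedding model would go in its place), a sorted list replaces the vector DB and ANN search, the reranker is omitted, and the documents and function names are hypothetical.

```python
import math
import re
import zlib

DIM = 1024  # toy dimensionality, an arbitrary choice for this sketch

def toy_embed(text):
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalized.
    It captures word overlap, NOT meaning; swap in a real model in practice."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(text, max_words=40):
    """Fixed-size word windows, a crude proxy for token-based chunking."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def build_index(docs):
    """Offline phase: chunk each document, store (vector, chunk text, source)."""
    return [(toy_embed(c), c, name) for name, text in docs.items() for c in chunk(text)]

def retrieve(index, question, k=2):
    """Per-query phase: embed with the SAME function, rank by dot product."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda row: sum(a * b for a, b in zip(q, row[0])),
                    reverse=True)
    return ranked[:k]

def assemble_prompt(question, hits):
    """Put the instruction and question first and the retrieved chunks last."""
    context = "\n".join(f"[{src}] {text}" for _, text, src in hits)
    return ("Answer using only the context below and cite sources.\n\n"
            f"Question: {question}\n\nContext:\n{context}")

# Hypothetical two-document corpus.
docs = {
    "meeting-notes": "In March we decided the payment retry logic will use "
                     "exponential backoff with five attempts.",
    "cooking": "Biryani needs basmati rice, saffron, and slow cooking over low heat.",
}
question = "What did we decide about the payment retry logic?"
hits = retrieve(build_index(docs), question, k=1)
prompt = assemble_prompt(question, hits)
print(prompt)
```

The string returned by `assemble_prompt` is all the LLM ever sees, which is why the retrieval steps before it set the ceiling on answer quality.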

The decisions that actually matter

| Decision | Trade-off |
| --- | --- |
| Chunk size | Small chunks → precise retrieval, lost surrounding context. Large → context preserved, similarity diluted. Common fix: retrieve small, expand to parent section before pasting. |
| Hybrid search | Embeddings miss exact identifiers (error codes, function names) that keyword/BM25 nails. Production = vector + keyword, fused (e.g. reciprocal rank fusion). |
| Rerank or not | Bi-encoder (one vector per text, fast, indexable) recalls candidates; cross-encoder (reads query+chunk together, slow, accurate) re-orders the top 50. Classic two-stage: cheap filter, expensive refine. |
| k and budget | More chunks ≠ better: each costs tokens (doc 07) and dilutes attention (doc 09). Retrieve generously, paste selectively. |
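
Reciprocal rank fusion, mentioned in the hybrid-search row, is short enough to show in full. Each chunk scores 1 / (k + rank) in every list it appears in, and the scores are summed. The constant k = 60 is the conventional default and is treated as an assumption here; the chunk ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists: score(d) = sum of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results from the two retrievers for one query.
vector_hits = ["chunk_a", "chunk_b", "chunk_c"]
keyword_hits = ["chunk_c", "chunk_a", "chunk_d"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

RRF uses only ranks, never raw scores, so cosine similarities and BM25 scores do not need to be put on a common scale before fusing.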

05 Where RAG Fails (and What's Replacing Parts of It)

Failure modes

Questions whose answers span many chunks ("summarize all Q1 decisions") — similarity finds pieces, not wholes. Implicit queries that share no semantics with their answer. Stale indexes after docs change. And the silent one: retrieval returns plausible-but-wrong chunks and the model confidently grounds on them — garbage in, citation out.
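
Of these, the stale index is the easiest to detect mechanically. One possible approach, sketched under the assumption that the index stores a content hash per document at indexing time (the function and variable names are hypothetical), is to compare stored hashes with the current text and re-embed only what differs:

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's current text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale(indexed_hashes, current_docs):
    """Return (ids to re-embed, ids to delete from the index)."""
    changed = sorted(doc_id for doc_id, text in current_docs.items()
                     if indexed_hashes.get(doc_id) != content_hash(text))
    deleted = sorted(doc_id for doc_id in indexed_hashes
                     if doc_id not in current_docs)
    return changed, deleted

# Hypothetical state: what was indexed earlier vs. what exists now.
indexed = {
    "retry-policy": content_hash("retry three times"),
    "onboarding": content_hash("welcome"),
    "old-runbook": content_hash("deprecated steps"),
}
current = {
    "retry-policy": "retry five times",   # edited since indexing
    "onboarding": "welcome",              # unchanged
    "new-faq": "frequently asked",        # never indexed
}

to_reembed, to_delete = find_stale(indexed, current)
print(to_reembed, to_delete)
```

New and edited documents both land in the re-embed list, because neither has a matching stored hash.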

The agentic shift

Classic RAG is one-shot: retrieve once, answer. Agentic retrieval (doc 12) lets the model iterate — search, read, realize it needs something else, search again — and use non-vector tools too (grep, SQL, file reads). One-shot RAG is becoming a tool inside a loop rather than the whole architecture.
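
The shape of that loop can be sketched without a model. In the code below, `search` and `next_query` are hypothetical callables: the first stands in for any retrieval tool, the second for the model deciding whether it needs another lookup. The corpus and the "PR-7" identifier are invented for illustration.

```python
def agentic_retrieve(question, search, next_query, max_steps=3):
    """Search, read, and let a policy request follow-up searches until satisfied."""
    gathered, query = [], question
    for _ in range(max_steps):
        gathered.extend(search(query))
        query = next_query(question, gathered)  # a new query, or None when done
        if query is None:
            break
    return gathered

# Hypothetical corpus keyed by query string, standing in for a real search tool.
corpus = {
    "payment retry decision": ["March notes: retries follow policy PR-7."],
    "policy PR-7": ["PR-7: exponential backoff, five attempts."],
}

def toy_search(query):
    return corpus.get(query, [])

def toy_policy(question, gathered):
    """Stand-in for the model: follow the PR-7 reference if it is unresolved."""
    text = " ".join(gathered)
    if "PR-7" in text and "backoff" not in text:
        return "policy PR-7"
    return None

result = agentic_retrieve("payment retry decision", toy_search, toy_policy)
print(result)
```

One-shot RAG is the special case `max_steps=1`: the first search result is all the model gets, and the dangling reference is never followed.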

06 Mental Models

Embeddings are a locality-preserving hash

A hash map sends equal keys to equal slots; an embedding sends similar meanings to nearby points. Lets you reason about: what embeddings can (fuzzy match at scale) and cannot (exact lookup, logic, recency) do.

"Similarity" is whatever the embedding model was trained on — domain mismatch silently degrades the geometry.

RAG is an open-book exam

Fine-tuning = studying (knowledge in the head, fuzzy recall); RAG = open-book (knowledge on the desk, must find the page fast). Lets you reason about: fine-tune for style/skills, retrieve for facts; and why the librarian (retriever), not the student (LLM), is usually the weak link.

In real open-book exams you know which book; RAG must also decide that.

The prefetcher for the context window

Doc 08's hierarchy one level up: corpus = disk, vector index = page table, retrieved chunks = the pages staged into SRAM (the window). Lets you reason about: every RAG design choice as a caching/prefetch policy question.

Prefetchers exploit access patterns; retrieval must guess relevance from a single query.

07 Common Misconceptions

"The model searches the database." The model never touches the DB. An external pipeline searches and pastes text into the prompt; the model just reads. (Agentic tool use changes this — but then the model requests a search; it still doesn't execute it.)

"Long context windows kill RAG." They kill small-corpus RAG. For 50 pages, pasting everything beats a pipeline. For 50 GB, selection is forever — and docs 06–07 showed full windows cost money and speed every single call. Capacity moves the threshold; it doesn't repeal selection.

"Better embeddings will fix my RAG." Most RAG failures are chunking, missing keyword search, stale indexes, or unanswerable-from-corpus questions. The embedding model is rarely the binding constraint — measure retrieval recall before swapping models (doc 13's whole point).

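Measuring retrieval recall needs little more than a small labeled set: for each test question, the chunk ids that should have been retrieved. A minimal sketch of recall@k follows; the evaluation data and ids are made up, and a real set would come from your own corpus.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear among the top-k retrieved ids."""
    if not relevant:
        raise ValueError("need at least one relevant id per question")
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mean_recall_at_k(eval_set, k):
    """Average recall@k over (retrieved ids, relevant ids) pairs."""
    return sum(recall_at_k(r, rel, k) for r, rel in eval_set) / len(eval_set)

# Hypothetical evaluation set: retriever output vs. hand-labeled answers.
eval_set = [
    (["c1", "c7", "c3"], {"c1", "c3"}),  # both relevant chunks retrieved
    (["c9", "c2", "c4"], {"c5"}),        # the relevant chunk was missed
]

score = mean_recall_at_k(eval_set, k=3)
print(score)
```

If this number is low, the problem sits upstream of the LLM. If it is high and answers are still wrong, look at reranking, prompt assembly, or the corpus itself before replacing the embedding model.
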
"Fine-tuning teaches the model my documents." Fine-tuning on your docs teaches their style and vocabulary; reliable factual recall of specific content is exactly what gradient descent on a few epochs does worst. Facts belong in context; behaviors belong in weights.

🗺️
Next: RAG retrieves once. Doc 12 puts retrieval — and every other tool — inside a loop with a goal: the agent.