Mental Models: Transformers → LLMs → Agents → Inference

01 Why Transformers? (The Problem They Solved)

To understand attention, you must first feel the pain that motivated it.

Before 2017, sequence problems (translation, text, speech) used RNNs (Recurrent Neural Networks). An RNN processes a sequence one token at a time, left to right. At each step it maintains a "hidden state" — a vector that's supposed to summarize everything seen so far.

Problem 1: The Vanishing Gradient

Gradients flow back through every time step. In long sequences (100+ tokens), they shrink exponentially. Early words barely influence the model's weights. The network effectively forgets the beginning of a long sentence by the time it reads the end.

Problem 2: Sequential = Slow

Step 2 can't begin until step 1 is done. You cannot parallelize RNN training across the sequence. Modern GPUs with thousands of cores sit idle, waiting. This fundamentally limited how much data you could train on.

💡

The Transformer's answer: Throw away sequential processing entirely. Process all tokens simultaneously, and instead of a single hidden state carrying all context, let every token directly attend to every other token. One operation, fully parallelizable, no information bottleneck. This is the self-attention mechanism.

02 Attention — The Core Idea

The mechanism that lets every word ask: "which other words should I look at?"

Imagine you're translating "The animal didn't cross the street because it was too tired." What does "it" refer to? You need to look at the word "animal," not "street." Attention lets the model do this dynamically, learned from data.

Query, Key, Value — The Mental Model

Think of it like a database lookup, but soft (probabilistic) rather than exact:

Query (Q) — "What am I looking for?" Each token asks a question about what context it needs. Like a search query.

Key (K) — "What do I offer?" Each token describes what information it contains. Like a database index key.

Value (V) — "What do I actually give?" If a key matches, you return this vector — the actual information content.

Score → Softmax → Weighted Sum — Q·Kᵀ gives compatibility scores. Softmax normalizes to weights. Then sum Values weighted by those scores. Result: a new representation of each token, enriched with context from all others.

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

where d_k = dimension of keys (scaling prevents vanishing gradients from large dot products)
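The formula can be traced by hand on a tiny example. The sketch below is a minimal pure-Python version, assuming Q, K and V are given as lists of row vectors; the toy numbers are made up for illustration and real implementations use batched tensor operations instead.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors.

    Q: T_q rows of length d_k, K: T_k rows of length d_k, V: T_k rows of length d_v.
    Returns (outputs, weights).
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Compatibility of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy inputs (illustrative only): 3 tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, w = attention(Q, K, V)
```

Each row of `w` sums to 1, and each row of `out` is a blend of the rows of `V` in those proportions.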

Self-Attention: Every Token Attends to Every Other Token

Multi-Head Attention

A single attention head learns one type of relationship. Multi-head attention runs H attention heads in parallel, each learning different relationships — syntactic, semantic, coreference, positional. Their outputs are concatenated and projected back.

🔍 Model: Multiple Lenses

Think of each head as a different question you can ask about token relationships: "Who acts on whom?" "What qualifies what?" "What happened before what?" Multi-head attention lets the model ask all these questions simultaneously, independently.

The learned roles of heads are not cleanly interpretable. Some heads seem to specialize, but "head 3 does syntax" is an oversimplification.

Attention Weights — Where Does "it" Look?

The: 0.05
animal: 0.61
cross: 0.04
was: 0.04
tired: 0.26

softmax(QKᵀ/√d_k) for the query token "it" — darker cells = higher weight. "animal" dominates, which is how the model resolves the pronoun's referent.

03 The Transformer Architecture

All the pieces assembled — and why each one exists.

Transformer Decoder Block (what LLMs use — repeated N times)

Key Design Decisions — Why Each Piece Exists

Positional Encoding

Attention has no built-in notion of order. "cat sat on mat" and "mat on sat cat" look the same. Positional encodings inject position information as additive vectors — the model learns to use them.

Residual Connections

Output = input + transformed_input. This ensures gradients can flow straight back through the network without vanishing. Makes training very deep (100+ layer) networks stable.

Layer Normalization

Normalizes each layer's activations to zero mean, unit variance. Stabilizes training. Without it, early layers' activations would grow or shrink wildly, causing training to explode or collapse.
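Residual connections and layer normalization can be sketched together in a few lines. This is a simplified illustration: it assumes the pre-norm ordering (normalize, apply the sublayer, then add the input back) and leaves out the learned gain and bias that real layer norm includes.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token's feature vector to roughly zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Output = input + sublayer(normalized input). Pre-norm ordering is an assumption."""
    y = sublayer(layer_norm(x))
    return [xi + yi for xi, yi in zip(x, y)]

x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)
# A sublayer that outputs zeros leaves the input untouched: the identity path
# that lets gradients flow straight through.
passthrough = residual_block(x, lambda h: [0.0] * len(h))
```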

🏗️

Encoder vs Decoder: The original Transformer had both. Encoders (BERT-style) read the entire sequence bidirectionally — great for classification, understanding. Decoders (GPT-style) generate left-to-right, seeing only past tokens — great for generation. Modern LLMs are decoder-only. The "masked" self-attention enforces that each token can only look backward, not forward into the future.
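The masking rule is easy to see on a small score matrix. The sketch below assumes raw attention scores are already computed and shows only the masking step: position i takes a softmax over positions 0..i, and every later position gets weight zero.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a T x T score matrix with future positions masked out."""
    T = len(scores)
    weights = []
    for i in range(T):
        visible = scores[i][: i + 1]          # position i may look at 0..i only
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (T - i - 1))
    return weights

scores = [[0.0] * 4 for _ in range(4)]        # uniform raw scores, for illustration
W = causal_attention_weights(scores)
```

With uniform scores, the first token attends only to itself, the second splits its weight over two tokens, and so on.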

04 Training an LLM — The Full Story

How you go from random weights to a model that speaks English, writes code, and reasons.

Stage 1: Pretraining — Next Token Prediction

The core task is deceptively simple: given the previous tokens, predict the next one. That's it. Feed the model the internet — trillions of tokens of text — and have it predict, at every position, what comes next.

Why next-token prediction works as a pretraining objective:

To predict the next word well, you have to understand context, grammar, facts, reasoning, and world knowledge. "The capital of France is ___" — the model must know geography. "To sort a list in Python, use ___" — the model must know programming. The label (next token) is free — it's just the existing text. You get a rich supervision signal from unlabeled data.
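The "free label" idea can be made concrete: shift the token sequence by one position and score the model's prediction at each position with cross-entropy. The sketch below uses made-up token ids and hand-written logits in place of a real model.

```python
import math

def next_token_loss(logits, target_ids):
    """Average cross-entropy between per-position logits and the actual next tokens."""
    total = 0.0
    for row, target in zip(logits, target_ids):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]          # equals -log softmax(row)[target]
    return total / len(target_ids)

tokens = [3, 1, 4, 1, 5]                      # hypothetical token ids
inputs, targets = tokens[:-1], tokens[1:]     # the label at each position is the next token
vocab_size = 6

# A model that knows nothing assigns equal logits to every token.
uniform_logits = [[0.0] * vocab_size for _ in inputs]
loss_uniform = next_token_loss(uniform_logits, targets)
expected_uniform = math.log(vocab_size)

# A model that strongly favours the correct next token gets a much lower loss.
confident_logits = [[10.0 if j == t else 0.0 for j in range(vocab_size)] for t in targets]
loss_confident = next_token_loss(confident_logits, targets)
```

The uniform model's loss equals log(vocab_size), which is the baseline that pretraining drives down.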

Next Token Prediction — The Pretraining Signal

Tokenization — Before Any Training

Text is broken into tokens (subword units) using algorithms like Byte-Pair Encoding (BPE). Common words become single tokens; rare words are split into subword pieces. This gives a fixed vocabulary (typically ~30K–200K tokens) that can represent any text.
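The core of BPE training is a loop: count adjacent symbol pairs, merge the most frequent pair into a new symbol, and repeat. The sketch below runs two merges on a tiny made-up corpus; production tokenizers work on bytes, handle word boundaries, and run tens of thousands of merges.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to its frequency in the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the adjacent pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (illustrative): word split into characters -> frequency.
vocab = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
merges = []
for _ in range(2):
    pair = most_frequent_pair(vocab)
    merges.append(pair)
    vocab = merge_pair(vocab, pair)
```

Frequent fragments such as "wer" become single symbols after a couple of merges, while rarer sequences stay split into smaller pieces.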

Example tokenization (GPT-style, illustrative):

"Un" "believ" "ab" "ly" → 1 word split into 4 tokens because "Unbelievably" is rare (the exact pieces depend on the tokenizer's vocabulary)

Scale: What Changed at 100B+ Parameters

In 2020, OpenAI's Scaling Laws paper showed that model capability scales predictably with compute, data, and parameters — following power laws. Larger models trained on more data with more compute were consistently better. This justified building GPT-3 (175B parameters) and led to the modern race to scale.

Emergent Capabilities

At certain scale thresholds, abilities appear that weren't present in smaller models — few-shot learning, chain-of-thought reasoning, arithmetic. These "emergent" behaviors are a major active research area. Why do they appear? Not fully understood.

Compute: GPUs & Parallelism

Training LLMs requires thousands of GPUs for months. Three parallelism strategies: data parallel (same model, different data shards), tensor parallel (split weight matrices), pipeline parallel (split layers across GPUs).

05 Fine-Tuning & RLHF

Pretraining gives raw capability. Fine-tuning shapes it into useful, safe behavior.

A pretrained LLM is a raw text predictor: it will complete "How do I make a bomb?" with the statistically likely continuation from its training corpus. Not useful. Instruction fine-tuning (supervised fine-tuning, SFT) and RLHF (reinforcement learning from human feedback) align the model to human intent.

The Three-Stage Pipeline: Pretraining → SFT → RLHF

How RLHF Works (Step by Step)

Generate responses. For a given prompt, sample multiple completions from the SFT model.

Human ranking. Human raters compare pairs of completions and indicate which is better. This is cheaper than writing ideal responses from scratch.

Train a Reward Model (RM). Supervised learning on the preference data. The RM learns to predict "how much would a human prefer this completion?" — outputs a scalar reward.

RL fine-tuning. Use PPO (Proximal Policy Optimization) to fine-tune the LLM to maximize reward from the RM while staying close to the SFT model (a KL penalty against the SFT model limits reward hacking).
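Two pieces of the steps above reduce to short formulas: the pairwise loss used to train the reward model, and the KL-penalised reward used during RL. The sketch below uses the common Bradley-Terry style formulation, which is an assumption here since the text does not give the exact loss; the names `beta`, `logp_policy` and `logp_sft` are hypothetical.

```python
import math

def softplus(z):
    """log(1 + exp(z)), computed without overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def preference_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected): small when the RM ranks the pair correctly."""
    return softplus(-(reward_chosen - reward_rejected))

def kl_shaped_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Reward for the RL step: RM score minus a penalty for drifting from the SFT model.

    logp_policy - logp_sft is a per-sample estimate of the KL term.
    """
    return rm_score - beta * (logp_policy - logp_sft)

loss_tied = preference_loss(0.0, 0.0)      # RM cannot tell the pair apart
loss_right = preference_loss(2.0, 0.0)     # RM prefers the human-chosen answer
loss_wrong = preference_loss(0.0, 2.0)     # RM prefers the rejected answer
unchanged = kl_shaped_reward(1.0, logp_policy=-5.0, logp_sft=-5.0)
drifted = kl_shaped_reward(1.0, logp_policy=-3.0, logp_sft=-5.0)
```

The penalty only bites when the policy assigns its sample a noticeably different log-probability than the SFT model does.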

⚡

Modern shortcut (DPO): RLHF with PPO is complex and can be unstable. Direct Preference Optimization (DPO, 2023) reformulates the RLHF objective as a supervised learning problem: no separate reward model, no RL training loop. You train directly on preference pairs. It is much simpler and often gives comparable results, and DPO and its variants are now widely used for preference tuning.
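The DPO objective fits in one function. The sketch below assumes you already have the summed log-probabilities of the chosen and rejected completions under both the model being trained and a frozen reference model (typically the SFT model); `beta` and the example numbers are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one preference pair."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Stable evaluation of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Before training, the policy equals the reference, so the margin is zero.
at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# After an update that raises the chosen completion and lowers the rejected one.
after_update = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

The loss falls as the policy shifts probability toward the preferred completion relative to the reference, which is the implicit reward signal that replaces the separate reward model.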

06 LoRA & QLoRA — Fine-Tuning Without Full Training

How to specialise a 70B parameter model on a consumer GPU — the technique that democratised LLM fine-tuning.

Full fine-tuning of a 70B model requires ~140GB of GPU VRAM just for the float16 weights, plus more for gradients and optimiser states (Adam keeps two moment estimates per parameter, so at least 2× the weight memory). That's 4–8 A100s or more. LoRA (Low-Rank Adaptation) makes fine-tuning accessible by observing that the update to pretrained weights during fine-tuning is intrinsically low-rank, so you don't need to update all 140GB.

The core insight:

Freeze the original weight matrix W₀. Instead of computing the full update ΔW (same shape as W₀), decompose it: ΔW = A·B where A ∈ ℝ^{d×r} and B ∈ ℝ^{r×k}, with rank r ≪ min(d,k). Only train A and B. For a 4096×4096 weight matrix with r=16: 16.8M parameters → 131K parameters (2 × 4096 × 16), a 128× reduction.
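The parameter counts follow directly from the shapes of A and B, so they are easy to check. The helper below is a hypothetical name used only for this arithmetic.

```python
def lora_param_counts(d, k, r):
    """Trainable parameters for a full d x k update versus a rank-r factorisation."""
    full = d * k              # every entry of delta-W
    lora = d * r + r * k      # entries of A (d x r) plus entries of B (r x k)
    return full, lora, full / lora

full, lora, ratio = lora_param_counts(4096, 4096, 16)
# full = 16,777,216   lora = 131,072   ratio = 128.0
```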

LoRA: Parallel Low-Rank Update Path

QLoRA (Quantised LoRA)

Combines LoRA with 4-bit quantisation of the frozen base model. The base weights are stored in NF4 (Normal Float 4-bit), 4× smaller than float16. LoRA adapters remain in 16-bit precision. A 70B model fits in ~35GB VRAM instead of 140GB. This is how a 70B model can be fine-tuned on a single A100 (80GB). The QLoRA authors report quality on par with 16-bit fine-tuning at a fraction of the hardware cost.

Where LoRA is Applied

Typically applied to the attention projection matrices (Q, K, V, O) and sometimes the FFN layers. Rank r=4–64 is common. After training, the LoRA weights can be merged back into W₀ (W₀ + AB) for zero-overhead inference — no extra compute at inference time.
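The "merge for zero-overhead inference" claim is a consequence of linearity: running the adapter as a parallel path gives the same output as adding AB into W₀ once. The sketch below checks this on tiny made-up matrices, assuming the input is a row vector multiplied on the left (y = x·W) and omitting the scaling factor that LoRA implementations usually apply to AB.

```python
def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

W0 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # frozen weights, d=3 by k=2
A = [[1.0], [0.0], [2.0]]                   # trainable, d by r (r=1)
B = [[0.5, -1.0]]                           # trainable, r by k
x = [[1.0, 1.0, 1.0]]                       # one input row of length d

# During training: frozen path plus low-rank adapter path.
adapter_path = add(matmul(x, W0), matmul(matmul(x, A), B))
# After training: fold AB into the weights once, then use a single matmul.
merged_path = matmul(x, add(W0, matmul(A, B)))
```

Both paths produce the same output, so the merged model has exactly the same inference cost as the base model.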

07 Mixture of Experts (MoE)

How to build a 141B parameter model that activates only ~39B parameters per token (the Mixtral 8×22B configuration), making scale affordable.

A standard Transformer has a single FFN per layer that every token passes through. MoE replaces the FFN with N expert FFNs plus a lightweight router. For each token, the router selects the top-k experts (typically k=2). Only those k experts compute. The rest are idle. You get the capacity of a large model at the cost of a small one — per token.

MoE Layer: Each Token Routes to 2 of N Experts

Why MoE is powerful

Mixtral 8×7B has 47B total parameters but only activates ~13B per token (2 of 8 experts). It matches or beats Llama-2-70B on most benchmarks while using ~5× fewer active parameters per token. Any specialisation among experts is emergent, not engineered, and it is less tidy than "one expert for code, one for reasoning, one for languages": the Mixtral authors' routing analysis found expert choice tracking syntax more than topic.

MoE challenges

Load balancing: Without regularisation, the router can collapse to routing almost every token to the same few experts, wasting capacity. A load-balancing auxiliary loss penalises uneven routing.

Memory: All N experts must live in VRAM even though only 2 are active per token, so the full 47B weights must be loaded.
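Top-k routing can be sketched without any real model. In the example below the router logits are supplied directly (in practice a small learned linear layer computes them from the token), the gate weights are a softmax over the selected experts' logits (one common choice, assumed here), and the "experts" are toy functions that record when they are called.

```python
import math

def top_k_gates(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax their logits into gate weights."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def moe_forward(x, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = top_k_gates(router_logits, k)
    out = [0.0] * len(x)
    for i, g in gates.items():
        y = experts[i](x)                       # unselected experts never run
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, gates

calls = []                                      # records which experts actually computed

def make_expert(i):
    def expert(x):
        calls.append(i)
        return [(i + 1) * xi for xi in x]       # toy stand-in for an expert FFN
    return expert

experts = [make_expert(i) for i in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]   # illustrative values
out, gates = moe_forward([1.0, 1.0], router_logits, experts)
```

All eight experts exist in memory, but only two appear in `calls`, which is the capacity-versus-compute trade described above.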

08 Multimodal Models — Vision Language Models

How LLMs learn to see: connecting image encoders to text decoders.

A Vision Language Model (VLM) combines a visual encoder (typically a Vision Transformer or CNN) with a text decoder (a standard LLM). The key challenge: image patches live in a different representation space than text tokens. A projection layer maps visual features into the LLM's token embedding space, where they can be processed identically to text tokens.

VLM Architecture — Image Tokens Join the LLM Context

Training Strategy

Stage 1: Freeze both encoder and LLM; train only the projection layer on image-caption pairs (aligns representations). Stage 2: Unfreeze the LLM; fine-tune on instruction-following vision tasks ("describe this image", "answer questions about this chart"). LLaVA popularised this two-stage recipe; InternVL and Qwen-VL follow similar staged approaches.

Why CLIP as Encoder

CLIP (OpenAI, 2021) trained a vision encoder and text encoder contrastively, matching images to captions at web scale. The visual representations are already semantically aligned with language, making projection to LLM embedding space much easier. CLIP-style visual encoders are a common default in open VLMs.

09 Flash Attention — How Long Contexts Actually Work

Standard attention is O(T²) in memory. Flash Attention computes the same result in O(T) memory by restructuring the computation around GPU hardware.

The naive attention algorithm computes the full T×T score matrix, stores it in HBM (GPU high-bandwidth memory: large but slow, ~2TB/s bandwidth), then multiplies by V. For T=32K tokens: the score matrix is 32K×32K×2 bytes = 2GB per attention head, per sequence. Just storing it dominates your memory budget, and the repeated HBM reads dominate your runtime.

Flash Attention's key idea: tiling

Split Q, K, V into tiles that fit in SRAM (on-chip memory — tiny but fast, ~19TB/s bandwidth). For each tile pair, compute attention locally, accumulate the output using a running softmax normaliser. Never materialise the full T×T matrix in HBM.

Memory: O(T²) → O(T) — enables 128K+ contexts
Speed: 2–4× faster than standard attention on A100
Mathematically exact — not an approximation
Backward pass recomputes attention tiles on the fly instead of storing the T×T attention matrix
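The running softmax normaliser is the part that makes tiling exact. The sketch below processes one query against K/V in tiles, keeping only a running max, a running denominator and a running output, and compares the result with the naive computation. It illustrates the arithmetic only: the actual speedup comes from doing this inside a fused GPU kernel, which pure Python cannot show.

```python
import math
import random

def naive_attention_row(q, K, V):
    """Reference: materialise all scores for one query, then softmax and mix values."""
    d = len(q)
    s = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(s)
    e = [math.exp(x - m) for x in s]
    z = sum(e)
    return [sum(ej * v[c] for ej, v in zip(e, V)) / z for c in range(len(V[0]))]

def tiled_attention_row(q, K, V, tile=2):
    """Same result, visiting K/V one tile at a time with a running softmax."""
    d = len(q)
    m = float("-inf")              # running max of scores seen so far
    denom = 0.0                    # running softmax denominator
    acc = [0.0] * len(V[0])        # running unnormalised output
    for start in range(0, len(K), tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        s = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in Kt]
        m_new = max(m, max(s))
        scale = math.exp(m - m_new)            # rescales old sums; 0.0 on the first tile
        p = [math.exp(x - m_new) for x in s]
        denom = denom * scale + sum(p)
        acc = [a * scale + sum(pj * v[c] for pj, v in zip(p, Vt)) for c, a in enumerate(acc)]
        m = m_new
    return [a / denom for a in acc]

rng = random.Random(0)             # fixed seed, illustrative data
T, d = 7, 4
q = [rng.gauss(0, 1) for _ in range(d)]
K = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
reference = naive_attention_row(q, K, V)
tiled = tiled_attention_row(q, K, V, tile=3)
```

The tiled version never holds more than one tile of scores at a time, yet matches the reference up to floating-point rounding.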

Why Hardware Awareness Matters

Modern ML performance is dominated by memory bandwidth, not FLOPs. The GPU has fast SRAM (~20MB) and slow HBM (~80GB). Every time you read/write HBM, you pay a latency penalty. Flash Attention restructures the algorithm to maximise SRAM reuse — the same computation, far fewer HBM accesses.

Flash Attention 2 (2023) adds work partitioning across thread blocks for further speedup. Flash Attention 3 (2024) targets Hopper GPUs with tensor memory accelerators. This is the difference between theory and systems engineering.

🔗

Ring Attention extends Flash Attention to multi-GPU long-context: each GPU holds a shard of the K/V sequence; GPUs pass K/V shards around in a ring topology while computing attention locally. This enables contexts of millions of tokens distributed across many GPUs. It is the kind of technique that makes context windows like Gemini 1.5's 1M tokens feasible, though Google has not published how Gemini implements long context.

10 Open Weights

What it means for a model to be "open," and what you actually get.

When a lab releases "open weights," they publish the trained parameter tensors — the actual numbers that define the model. You can download them and run inference locally. This is not the same as open source — the training code, training data, and methodology may all be proprietary.

What you get

Model weight files (often .safetensors)
Tokenizer vocabulary & config
Architecture config (layer count, dims)
Inference code (usually)

What you usually don't get

Training data
Full training code at scale
Detailed training methodology
RLHF preference data

Weight Files — What They Are

The weights are tensors — multi-dimensional arrays of floating-point numbers. A 7B model has ~7 billion numbers, typically stored as float16 (2 bytes each) = ~14GB. The file structure maps to named layers: model.layers.0.self_attn.q_proj.weight, etc. Loading these into memory and running forward passes is "inference."

💾

Quantization: To run a 70B model on consumer hardware, you quantize weights from 16-bit to 4-bit (4x compression). You lose some precision, but the measured quality drop is often under 5%, depending on the model and the quantization scheme. GGUF format (used by llama.cpp) enables this, letting 70B models run on a 48GB Mac Studio.
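The sizes quoted in this section are all the same arithmetic: parameter count times bits per weight. The helper below (a hypothetical name) covers the weights alone; real quantized files are slightly larger because they also store scaling factors, and inference needs extra memory for activations and the KV cache.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

fp16_7b = weight_memory_gb(7e9, 16)     # 7B model in float16
fp16_70b = weight_memory_gb(70e9, 16)   # 70B model in float16
int4_70b = weight_memory_gb(70e9, 4)    # 70B model at 4 bits per weight
```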

11 Prompting

Prompting is not asking a search engine. It's programming in natural language.

A prompt is the input to an LLM. The model doesn't "understand" your intent — it predicts the most likely continuation of the prompt+response template it was fine-tuned on. Good prompting means providing the context that makes the correct output the statistically most likely continuation.

Zero-Shot

Just ask. No examples. Works for tasks the model has seen a lot during training.

"Translate to French: Hello"

Few-Shot

Give 2–5 examples before your actual query. Tells the model the format and task via demonstration.

"EN: Hello → FR: Bonjour\nEN: Bye → FR:"

System Prompt

A special prefix (injected before user messages) that sets persona, rules, context. Invisible to users in production apps.

Chain-of-Thought (CoT) prompting — "Let's think step by step" — dramatically improves reasoning on math, logic, and multi-step problems. Why? Because by generating intermediate steps, the model gets more computation (more tokens) to allocate to the problem. Each token generated is one "step of thought."

🧮

The deep insight: A Transformer does a fixed amount of computation per token generated. Complex problems need more steps. CoT externalizes the model's "working memory" into the context window — forcing it to show its work before concluding. This is why models reason better when they generate scratchpad text.

Chat models are fine-tuned with a specific message format. The raw text passed to the model looks like:

        <|system|>
        You are a helpful assistant.
        <|end|>
        <|user|>
        What is 2+2?
        <|end|>
        <|assistant|>
        ← model generates from here

Different models have different templates (ChatML, Llama, Mistral, etc.). Running a model with the wrong template produces garbage output — a common mistake when running open-weight models locally.
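Rendering a template is plain string assembly, which is why a mismatch is easy to introduce. The sketch below reproduces the illustrative `<|system|>` / `<|user|>` / `<|end|>` format shown above; those tags are this article's example, not any particular model's real template, and `render_chat` is a hypothetical helper.

```python
def render_chat(messages, add_generation_prompt=True):
    """Flatten role-tagged messages into the raw text the model actually sees."""
    parts = [f"<|{m['role']}|>\n{m['content']}\n<|end|>\n" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|assistant|>\n")
    return "".join(parts)

prompt = render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
])
```

In practice the safest route is to use the template shipped with the model's tokenizer configuration instead of writing one by hand.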

12 Inference — The Complete Pipeline

What actually happens when you send a message and get a response back.

End-to-End Inference Pipeline

The KV Cache — Why It Matters for Performance

In auto-regressive generation, each new token requires a forward pass through every layer. But the attention Keys and Values for all previous tokens are the same every time — recomputing them is wasteful. The KV cache stores these tensors in memory, so each new token only needs to compute its own K and V, plus attend over cached ones. This is the single biggest inference optimization.
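The saving can be counted directly. The sketch below generates four steps twice, once recomputing K and V for the whole prefix at every step and once appending to a cache, and tallies how many K/V projections each approach performs. The projection and query functions are toy stand-ins for learned weight matrices, and a single attention layer stands in for the full model.

```python
import math

proj_calls = {"count": 0}

def project_kv(x):
    """Stand-in for the learned K and V projections (hypothetical fixed maps)."""
    proj_calls["count"] += 1
    return [2.0 * xi for xi in x], [xi + 1.0 for xi in x]

def attend(q, keys, values):
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e * v[c] for e, v in zip(exps, values)) / z for c in range(len(values[0]))]

def generate_without_cache(xs):
    outs = []
    for t in range(1, len(xs) + 1):
        kv = [project_kv(x) for x in xs[:t]]          # recompute K, V for the whole prefix
        outs.append(attend(xs[t - 1], [k for k, _ in kv], [v for _, v in kv]))
    return outs

def generate_with_cache(xs):
    keys, values, outs = [], [], []
    for x in xs:
        k, v = project_kv(x)                          # project only the new token
        keys.append(k)
        values.append(v)
        outs.append(attend(x, keys, values))          # attend over the cached entries
    return outs

xs = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [-0.2, 0.4]]   # toy token vectors
proj_calls["count"] = 0
slow = generate_without_cache(xs)
slow_calls = proj_calls["count"]
proj_calls["count"] = 0
fast = generate_with_cache(xs)
fast_calls = proj_calls["count"]
```

For T tokens the uncached loop does T(T+1)/2 projections and the cached loop does T, with identical outputs. The price is memory: the cache grows linearly with context length.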

Sampling Parameters

Temperature

Divides logits before softmax. T=1.0: normal. T→0: greedy (always pick most likely). T>1: more random. The creativity dial.

Top-p (Nucleus)

Sample from the smallest set of tokens whose cumulative probability exceeds p. Cuts off the long tail of unlikely tokens. Often gives better quality than Top-k because the size of the candidate set adapts to how confident the model is.

Top-k

Only sample from the k most probable tokens. Simple, but the right k varies by context — Top-p is generally preferred. k=1 is greedy decoding.
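All three controls are small transformations of the next-token distribution. The sketch below implements each one on made-up numbers; real inference stacks apply them to the full vocabulary logits and often combine them in a fixed order that varies by library. Temperature must be greater than zero here; T→0 is handled as greedy decoding in practice.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    z = sum(probs[i] for i in keep)
    return {i: probs[i] / z for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    z = sum(probs[i] for i in keep)
    return {i: probs[i] / z for i in keep}

def sample(candidates, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.0]                                  # illustrative
sharp = softmax_with_temperature(logits, 0.5)
base = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)

probs = [0.5, 0.3, 0.15, 0.05]                            # illustrative
nucleus = top_p_filter(probs, 0.7)                        # keeps tokens 0 and 1
greedy = top_k_filter(probs, 1)                           # keeps token 0 only
token = sample(nucleus, random.Random(0))
```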

🔄

Agentic Inference: In agent settings, the model doesn't just generate one response — it generates actions (tool calls), receives results, then continues generation. The "context window" accumulates the conversation + all tool results. The model is in a loop: generate → act → observe → generate. The same inference pipeline runs repeatedly, with a growing context each time.
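The generate → act → observe loop is short when written out. Everything in the sketch below is hypothetical: the "model" is a scripted function standing in for an LLM call, the message and tool-call shapes are invented for illustration, and real agent frameworks add parsing, error handling and safety checks.

```python
def run_agent(model, tools, user_message, max_steps=5):
    """Loop until the model returns a final answer, feeding tool results back as context."""
    context = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        step = model(context)                                      # generate
        context.append({"role": "assistant", "content": step})
        if step["type"] == "final":
            return step["text"], context
        result = tools[step["tool"]](**step["args"])               # act
        context.append({"role": "tool", "content": str(result)})   # observe
    raise RuntimeError("agent did not finish within max_steps")

def scripted_model(context):
    """Stand-in for an LLM: call the tool once, then answer using its result."""
    tool_results = [m["content"] for m in context if m["role"] == "tool"]
    if not tool_results:
        return {"type": "tool_call", "tool": "add", "args": {"a": 2, "b": 2}}
    return {"type": "final", "text": f"2+2 = {tool_results[-1]}"}

tools = {"add": lambda a, b: a + b}
answer, context = run_agent(scripted_model, tools, "What is 2+2?")
roles = [m["role"] for m in context]
```

The context grows by two entries per tool call, which is why long agent runs put pressure on the context window and the KV cache.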

The Complete Mental Model

Supervised learning gave us the training loop. Neural networks gave us function approximation. Transformers gave us parallel sequence processing via attention. Pretraining on internet text gave LLMs world knowledge. RLHF shaped that knowledge into helpful behavior. Open weights make that behavior downloadable. Prompting is the interface. Inference is the runtime. Agentic loops connect inference to the world.

Next: Part 3 will go deep on the mathematics — attention score computation, softmax, loss functions, gradient derivations, and scaling laws.