01 The Big Picture
"Multimodal AI" bundles two unrelated technologies. Understanding that they're unrelated is most of the understanding.
Vision-language models (Claude reading your screenshot) extend doc 02's transformer: images become tokens, attention does the rest — one architecture, new input type. Diffusion models (Stable Diffusion, DALL·E-class, and the video generators) generate images by an entirely different process: learning to reverse the destruction of data by noise. Different math, different training, different failure modes. An engineer who conflates them predicts both wrongly.
02 Seeing — Vision-Language Models
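The transformer never knew tokens were "words." It processes vectors with positions (doc 02). So the recipe for vision is almost cheeky: cut the image into small patches, turn each patch into a vector, and feed the resulting sequence to the same attention stack.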
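A minimal sketch of the patch step, assuming a ViT-style front end. The 14-pixel patch size, the `patchify` name, and the nested-list image format are illustrative assumptions; real models also apply a learned linear projection and position embeddings, which are omitted here.

```python
def patchify(image, patch=14):
    """Split an H x W x C image (nested lists) into flattened patch vectors.

    Returns (H // patch) * (W // patch) vectors in row-major order,
    each of length patch * patch * C.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])  # append this pixel's channels
            tokens.append(vec)
    return tokens


# Hypothetical 28 x 42 RGB image filled with zeros.
image = [[[0, 0, 0] for _ in range(42)] for _ in range(28)]
tokens = patchify(image)
print(len(tokens), len(tokens[0]))  # 6 patches, each 14 * 14 * 3 = 588 values
```

The token count grows with image area, which is why resolution translates directly into context cost.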
03 Why Diffusion? (Why Not Just Autoregress Pixels?)
Doc 06's recipe — predict the next item, append, repeat — works for images in principle (predict pixel 1, then 2…). In practice it fails three ways: a 1024² image is a million-step sequential decode (the doc 06 slow lane, times a thousand); images aren't ordered left-to-right the way language is — global structure (a face's symmetry) is everywhere at once; and one early wrong pixel poisons everything after it.
Diffusion's insight inverts the problem: destroying an image with noise is trivial, so learn to undo the destruction. Add a little Gaussian noise step by step until only pure static remains (easy, mechanical); train a network to estimate, at every noise level, what noise was added (supervised, every noise level trainable independently and in parallel, no ordering needed). Generation then starts from pure static and runs the destruction backwards. Every step refines the whole image at once: global structure first, details later, no fatal early commitments.
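The destruction half is short enough to sketch. This assumes the common DDPM-style formulation with a linear variance schedule; the schedule values, the step count, and the function names are illustrative assumptions rather than anything this series specifies. The closed form jumps straight to any noise level, which is equivalent to adding the small noise steps one at a time and is what lets training pick noise levels independently.

```python
import math
import random


def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction abar_t for an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Noise x0 to a given level: x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar) * v + math.sqrt(1.0 - alpha_bar) * e
           for v, e in zip(x0, eps)]
    return x_t, eps  # eps is the regression target for the network


rng = random.Random(0)
alpha_bars = alpha_bar_schedule()
x0 = [0.5, -0.25, 1.0, 0.0]          # stand-in for a flattened image
t = rng.randrange(len(alpha_bars))   # training samples a random noise level
x_t, eps = add_noise(x0, alpha_bars[t], rng)
# A training step would minimise mean((model(x_t, t) - eps) ** 2).
```

By the last step almost no signal is left (`alpha_bars[-1]` is far below 0.001 under these assumed values), which is the "pure static" the generation process starts from.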
04 How Denoising Generation Works
Step through one generation, static to picture.
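One way to make the walk concrete is a toy sampler. The sketch below assumes a deterministic DDIM-style update and a simple linear schedule, and it replaces the trained network with a hypothetical oracle, `oracle_noise`, that already knows the single "image" in its toy data distribution. A real model would be a learned network, and its output would be a fresh sample rather than a fixed target.

```python
import math
import random

TARGET = [0.8, -0.3, 0.1, 0.5]  # the only "image" in this toy data distribution


def oracle_noise(x_t, alpha_bar):
    """Stand-in for the trained network: exact noise, because TARGET is known."""
    return [(x - math.sqrt(alpha_bar) * v) / math.sqrt(1.0 - alpha_bar)
            for x, v in zip(x_t, TARGET)]


def sample(num_steps=50, seed=0):
    """Walk from near-pure static (abar ~ 0) back to a clean sample (abar = 1)."""
    rng = random.Random(seed)
    # Assumed schedule: index 0 is clean (abar = 1), index num_steps is static.
    alpha_bars = [1.0 - 0.999 * i / num_steps for i in range(num_steps + 1)]
    x = [rng.gauss(0.0, 1.0) for _ in TARGET]  # start from static
    for i in range(num_steps, 0, -1):
        ab_now, ab_next = alpha_bars[i], alpha_bars[i - 1]
        # 1. Estimate the noise present in the current canvas.
        eps = oracle_noise(x, ab_now)
        # 2. Work out the clean image that estimate implies.
        x0_hat = [(v - math.sqrt(1.0 - ab_now) * e) / math.sqrt(ab_now)
                  for v, e in zip(x, eps)]
        # 3. Re-noise that clean estimate to the next, slightly lower level.
        x = [math.sqrt(ab_next) * c + math.sqrt(1.0 - ab_next) * e
             for c, e in zip(x0_hat, eps)]
    return x


result = sample()
```

Every iteration touches all values of the canvas at once, and the number of iterations is set by `num_steps`, not by the size or content of the image.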
05 The Trick That Made It Affordable — Latent Diffusion
Denoising a 1024×1024×3 pixel tensor 50 times is brutal compute. Latent diffusion (the "LDM" lineage behind Stable Diffusion) first trains an autoencoder that compresses images ~48× into a small latent tensor, runs the entire diffusion process in that compressed space, then decodes the final latent back to pixels once. The diffusion loop therefore runs over 128×128×4 values instead of 1024×1024×3. This is the doc 08 move again: do the expensive iterative work in the small fast representation; only touch the big expensive one at the boundary. SRAM tiles, subagent scratchpads, latent spaces: the same shape, third appearance in this series.
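The ~48× figure follows directly from the two tensor shapes quoted above. A quick check, counting stored values only (actual compute savings depend on the network and are not modeled here):

```python
from math import prod

pixel_shape = (1024, 1024, 3)   # height, width, RGB channels
latent_shape = (128, 128, 4)    # 8x smaller per side, 4 latent channels

pixel_values = prod(pixel_shape)      # 3,145,728
latent_values = prod(latent_shape)    # 65,536
ratio = pixel_values / latent_values  # 48.0

print(pixel_values, latent_values, ratio)
```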
| Dimension | Autoregressive LLM (docs 02–14) | Diffusion model |
|---|---|---|
| Generation order | Left to right, one token at a time | Whole canvas at once, coarse → fine |
| Steps to generate | = output length (thousands) | 20–50, fixed, regardless of "complexity" |
| Randomness enters | Sampler's dice roll per token (doc 14) | The initial noise (the "seed image") + per-step noise |
| Mid-course correction | None — committed tokens are final | Constant — every step revises everything |
| Conditioning | Prompt as prefix context | Prompt via cross-attention at every step (+ guidance scale) |
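The "guidance scale" in the last row usually refers to classifier-free guidance, which this series does not spell out. As a hedged sketch: the sampler runs the noise predictor twice per step, once with the prompt and once without, then extrapolates from the unconditional estimate toward the conditional one. The function name and the sample numbers below are made up for illustration.

```python
def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: move the estimate along (cond - uncond).

    guidance_scale = 0 ignores the prompt, 1 gives the plain conditional
    estimate, and values above 1 exaggerate the prompt's influence.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.10, -0.20, 0.30]   # made-up predictor output without the prompt
eps_cond = [0.30, -0.10, 0.20]     # made-up predictor output with the prompt

plain = guided_noise(eps_uncond, eps_cond, 1.0)
strong = guided_noise(eps_uncond, eps_cond, 7.5)
```

Higher scales trade sample diversity for closer adherence to the prompt, which is why the setting is exposed as a user-facing knob.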
The boundary is blurring — video models run diffusion over space-time latents with transformer backbones (DiT), and "thinking in drafts then refining" diffusion-style is being explored for text too. But the two mental models remain the honest foundation.
06 Mental Models
Vision is "just" a new front-end: patches in, attention as usual. Lets you reason about: image token costs, resolution limits, why VLM hallucination feels exactly like text hallucination — it is.
A writer commits word by word; a sculptor roughs the whole form, then refines everywhere at once. Lets you reason about: why diffusion nails global composition but fumbles fine sequential detail (text in images, finger counts), and the reverse for LLMs.
Adding noise is trivial; removing it is generation. When a task is hard, find the easy destructive direction and learn its reverse — the deepest reusable idea here (also: compression→decompression, encryption→cryptanalysis asymmetries).
07 Common Misconceptions
"Claude/GPT generate the images they show me." Chat assistants that return images typically call a separate diffusion model as a tool (doc 12's pattern). The LLM writes the prompt; the diffusion model paints. Two systems, one chat window.
"Diffusion models stitch together pieces of training images." The network never stores images; it learns a noise-prediction function, which is equivalent up to scaling to the score (the gradient of the log-density of the noised data distribution, the quantity score matching estimates). Outputs are samples from a learned distribution, not collages. (Memorization of near-duplicated training images can occur; that's a failure mode, not the mechanism.)
"VLMs read text in images like OCR." They read it through coarse patches (14 px on a side in common ViT-style encoders), so fine print, dense tables, and low-contrast text degrade into hallucinated plausibilities. For exact transcription of small text, dedicated OCR as a tool still wins.
"Why can't image models spell?" Now you can answer it: text in an image is a long, exact, sequential constraint — precisely what refine-everywhere generation is worst at, and what autoregression is built for. The failure mode is the architecture, visible to the naked eye.