Multimodal & Diffusion — When Models Got Eyes and Brushes

01 The Big Picture

"Multimodal AI" bundles two unrelated technologies. Seeing that they are unrelated gets you most of the way to understanding both.

Vision-language models (Claude reading your screenshot) extend doc 02's transformer: images become tokens, attention does the rest — one architecture, new input type. Diffusion models (Stable Diffusion, DALL·E-class, and the video generators) generate images by an entirely different process: learning to reverse the destruction of data by noise. Different math, different training, different failure modes. An engineer who conflates them predicts both wrongly.

02 Seeing — Vision-Language Models

The transformer never knew tokens were "words." It processes vectors with positions (doc 02). So the recipe for vision is almost cheeky:

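Patchify. Slice the image into a grid of patches (e.g. 14×14 px each). Each patch is flattened into a vector, a "visual word." At 14 px, a 1024×1024 screenshot is a grid of about 73×73, which is over 5,000 raw patches; systems commonly downscale the image or merge neighboring patches to bring that closer to ~1–2K patch tokens.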

Encode. A vision encoder (a ViT: a transformer over patches, with attention instead of convolution) turns patches into embeddings that carry meaning. It is commonly trained CLIP-style: pull matching image/caption pairs together in a shared space and push mismatched pairs apart. That is doc 11's contrastive trick, applied across modalities.

Project & concatenate. A small adapter maps visual embeddings into the LLM's embedding space; they're inserted into the sequence alongside text tokens. From the LLM's perspective, the image is just... more context. Attention attends across both; "what's the error in this screenshot?" works because text tokens can attend to patch tokens directly.
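
The three steps above can be sketched as shape bookkeeping in NumPy. This is a minimal illustration, not any particular model's implementation: the sizes (`D_VISION`, `D_MODEL`, a 224×224 input) are made up, and both the "encoder" and the "adapter" are single random matrices standing in for trained networks.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns shape (num_patches, patch * patch * C). In this sketch H and W
    must be multiples of `patch`; a real pipeline would pad or resize first.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "pad or resize first"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    # Put the two grid axes first, then flatten each patch into one vector.
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, patch * patch * c)

rng = np.random.default_rng(0)
PATCH, D_VISION, D_MODEL = 14, 64, 32     # illustrative sizes only

image = rng.random((224, 224, 3))         # stand-in for a screenshot
patches = patchify(image, PATCH)          # (256, 588): 16x16 grid of "visual words"

# Stand-in for the vision encoder: one random linear map.
# A real ViT runs attention layers over the patches instead.
w_encoder = rng.normal(size=(patches.shape[1], D_VISION))
visual_embeddings = patches @ w_encoder   # (256, 64)

# The adapter: project visual embeddings to the LLM's embedding width.
w_adapter = rng.normal(size=(D_VISION, D_MODEL))
image_tokens = visual_embeddings @ w_adapter   # (256, 32)

# Stand-in for nine already-embedded text tokens.
text_tokens = rng.normal(size=(9, D_MODEL))

# From here on the LLM sees one sequence; attention spans both parts.
sequence = np.concatenate([image_tokens, text_tokens], axis=0)
print(sequence.shape)  # (265, 32)
```

The point of the sketch is the last line: after projection, image tokens and text tokens are rows of the same matrix, so nothing downstream needs to know which rows came from pixels.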

💡

Practical consequences you can derive: images consume context window (patch tokens are tokens — doc 07's bill applies); fine detail dies with patch resolution (tiny text in screenshots → blurry patches → misreads); and VLMs inherit every LLM behavior — including hallucinating plausible readings of images they can't actually resolve.
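
The token bill and the resolution limit are two sides of the same arithmetic. The sketch below counts raw patches for a few image sizes and then inverts the question: given a token budget, how large can a square image be? The helper names and the 1,500-token budget are hypothetical, and real systems apply their own resizing and patch-merging rules on top of this.

```python
import math

def patch_token_count(width: int, height: int, patch: int = 14) -> int:
    """Raw patch count, rounding partial patches up (as if the image were padded)."""
    return math.ceil(width / patch) * math.ceil(height / patch)

def side_for_budget(max_tokens: int, patch: int = 14) -> int:
    """Largest square side in pixels whose patch grid fits within max_tokens."""
    return math.isqrt(max_tokens) * patch

for side in (224, 512, 1024):
    print(side, patch_token_count(side, side))
# 224 256
# 512 1369
# 1024 5476

# A hypothetical 1,500-token image budget forces a downscale to about 532 px,
# so every glyph in a 1024 px screenshot shrinks to roughly half its size.
print(side_for_budget(1500))  # 532
```

This is why tiny text suffers first: fitting the budget means fewer pixels per character before the encoder ever sees the image.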

03 Why Diffusion? (Why Not Just Autoregress Pixels?)

Doc 06's recipe (predict the next item, append, repeat) works for images in principle: predict pixel 1, then pixel 2, and so on. In practice, pixel-by-pixel generation struggles in three ways. First, a 1024² image is a million-step sequential decode, before even counting color channels (the doc 06 slow lane, times a thousand). Second, images aren't ordered left-to-right the way language is; global structure (a face's symmetry) is everywhere at once. Third, one early wrong pixel poisons everything after it.

Diffusion's insight inverts the problem: destroying an image with noise is trivial — so learn to undo the destruction. Add a little Gaussian noise step by step until pure static (easy, mechanical); train a network to estimate, at every noise level, what noise was added (supervised, parallel over steps, no ordering needed). Generation = start from pure static and run the destruction backwards. Every step refines the whole image at once — global structure first, details later, no fatal early commitments.
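
The "easy, mechanical" half and the training target fit in a few lines. The sketch below assumes the standard DDPM-style formulation, where a noised sample at level t can be drawn in one shot as x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The linear schedule values are a commonly used choice, not something the text above specifies, and `predict_noise` is a placeholder for the network being trained.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule. Returns (betas, alpha_bars)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)   # fraction of signal left at each step
    return betas, alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to noise level t without simulating the steps before it."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def training_loss(predict_noise, x0, alpha_bars, rng):
    """One training example: pick a random noise level, score the noise guess."""
    t = int(rng.integers(len(alpha_bars)))
    x_t, eps = add_noise(x0, t, alpha_bars, rng)
    return float(np.mean((predict_noise(x_t, t) - eps) ** 2))

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))   # toy "image"

# Almost all signal survives the first step; almost none survives the last.
print(alpha_bars[0], alpha_bars[-1])

# Untrained stand-in network that always guesses "no noise".
loss = training_loss(lambda x_t, t: np.zeros_like(x_t), x0, alpha_bars, rng)
print(loss)
```

Because `add_noise` can reach any level directly, every training example picks its own noise level independently. That is what "parallel over steps, no ordering needed" buys.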

04 How Denoising Generation Works

Step through one generation, static to picture.
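
One generation can be sketched as a short loop. The code below is a minimal DDPM-style ancestral sampler in NumPy, under loud assumptions: the 50-step schedule is illustrative, and the noise predictor is a closed-form oracle for a toy "dataset" containing exactly one 8×8 image. A real model replaces `oracle` with a trained network and has no access to `target`.

```python
import numpy as np

def sample(predict_noise, shape, betas, rng):
    """Start from pure static and denoise one step at a time."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                       # pure static
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = predict_noise(x, t)
        # Subtract this step's share of the predicted noise, then rescale.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Re-inject a little fresh noise on every step except the last.
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

rng = np.random.default_rng(0)
betas = np.linspace(1e-3, 0.35, 50)      # illustrative 50-step schedule
alpha_bars = np.cumprod(1.0 - betas)

# Toy data distribution: a single known image.
target = rng.uniform(-1.0, 1.0, size=(8, 8))

def oracle(x_t, t):
    """Exact noise for the one-image toy dataset (stand-in for a trained net)."""
    return (x_t - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

result = sample(oracle, target.shape, betas, rng)
print(np.allclose(result, target))  # True
```

Note that every iteration rewrites all 64 values at once. There is no position in the image that gets "committed" early, which is the property the next sections keep returning to.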

05 The Trick That Made It Affordable — Latent Diffusion

Denoising a 1024×1024×3 pixel tensor 50 times is brutal compute. Latent diffusion (the "LD" in Stable Diffusion's lineage) first trains an autoencoder that compresses images ~48× into a small latent tensor, runs the entire diffusion process in that compressed space, then decodes the final latent back to pixels once. Diffusion over 128×128×4 instead of 1024×1024×3 — the doc 08 move again: do the expensive iterative work in the small fast representation; only touch the big expensive one at the boundary. SRAM tiles, subagent scratchpads, latent spaces — the same shape, third appearance in this series.
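
The numbers above are easy to check, and the control flow is the whole trick: iterate in the small space, decode once. In the sketch below, `fake_denoise_step` and `fake_decode` are hypothetical stand-ins (a real system uses a trained denoiser and a trained autoencoder's decoder); only the shapes and call counts are meant to be taken literally.

```python
import math
import numpy as np

PIXEL_SHAPE = (1024, 1024, 3)
LATENT_SHAPE = (128, 128, 4)

compression = math.prod(PIXEL_SHAPE) / math.prod(LATENT_SHAPE)
print(compression)  # 48.0

def generate(denoise_step, decode, steps, rng):
    """Run every denoising step on the latent; touch pixels once at the end."""
    latent = rng.standard_normal(LATENT_SHAPE)
    for t in range(steps - 1, -1, -1):
        latent = denoise_step(latent, t)
    return decode(latent)

calls = {"denoise": 0, "decode": 0}

def fake_denoise_step(latent, t):
    """Placeholder for one denoiser call on the latent tensor."""
    calls["denoise"] += 1
    return 0.9 * latent

def fake_decode(latent):
    """Placeholder decoder: 8x nearest-neighbour upsample of three channels."""
    calls["decode"] += 1
    return np.repeat(np.repeat(latent[..., :3], 8, axis=0), 8, axis=1)

image = generate(fake_denoise_step, fake_decode, steps=50, rng=np.random.default_rng(0))
print(image.shape, calls)  # (1024, 1024, 3) {'denoise': 50, 'decode': 1}
```

Fifty passes over 65,536 values plus one pass over about 3.1 million is a very different bill from fifty passes over 3.1 million.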

Dimension	Autoregressive LLM (docs 02–14)	Diffusion model
Generation order	Left to right, one token at a time	Whole canvas at once, coarse → fine
Steps to generate	One per output token (often thousands)	Typically 20–50, fixed in advance, regardless of "complexity"
Randomness enters	Sampler's dice roll per token (doc 14)	The initial noise (the "seed image") + per-step noise
Mid-course correction	None — committed tokens are final	Constant — every step revises everything
Conditioning	Prompt as prefix context	Prompt via cross-attention at every step (+ guidance scale)
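
The "guidance scale" in the Conditioning row commonly refers to classifier-free guidance: at each step the denoiser is run twice, once with the prompt and once without, and the two noise predictions are combined. A minimal sketch of that combination follows; the function name and the toy vectors are illustrative, and 7.5 is just a frequently seen default rather than a recommended value.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Move the prediction from 'no prompt' toward 'with prompt', and beyond.

    scale = 0 ignores the prompt, scale = 1 is the plain conditional
    prediction, scale > 1 extrapolates past it for stronger prompt adherence.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])   # toy prediction without the prompt
eps_cond = np.array([1.0, 1.0])     # toy prediction with the prompt

print(guided_noise(eps_uncond, eps_cond, 0.0))  # [0. 1.]
print(guided_noise(eps_uncond, eps_cond, 1.0))  # [1. 1.]
print(guided_noise(eps_uncond, eps_cond, 7.5))  # [7.5 1. ]
```

Only the component where the prompt changes the prediction gets amplified, which is why raising the scale trades variety for prompt adherence.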

The boundary is blurring — video models run diffusion over space-time latents with transformer backbones (DiT), and "thinking in drafts then refining" diffusion-style is being explored for text too. But the two mental models remain the honest foundation.

06 Mental Models

VLM = transformer that learned a second tokenizer

Vision is "just" a new front-end: patches in, attention as usual. Lets you reason about: image token costs, resolution limits, why VLM hallucination feels exactly like text hallucination — it is.

Limit: the vision encoder's training data bounds what can be seen at all; no prompt recovers what the encoder discarded.

Diffusion = sculpting, autoregression = writing

A writer commits word by word; a sculptor roughs the whole form, then refines everywhere at once. Lets you reason about: why diffusion nails global composition but fumbles fine sequential detail (text in images, finger counts), and the reverse for LLMs.

Limit: sculptors choose where to refine; diffusion refines everywhere uniformly at each step.

Learning the inverse of an easy function

Adding noise is trivial; removing it is generation. When a task is hard, find the easy destructive direction and learn its reverse — the deepest reusable idea here (also: compression→decompression, encryption→cryptanalysis asymmetries).

Limit: this only works when the forward process is gradual and information-preserving enough to invert step-wise.

07 Common Misconceptions

"Claude/GPT generate the images they show me." Chat assistants that return images typically call a separate diffusion model as a tool (doc 12's pattern). The LLM writes the prompt; the diffusion model paints. Two systems, one chat window.

"Diffusion models stitch together pieces of training images." The network does not keep a library of images to paste from. It learns a noise-prediction function, which is equivalent up to scaling to estimating the gradient of the log-density of the noised data (the "score" in score matching). Outputs are samples from a learned distribution, not collages. (Memorization of near-duplicated training images can occur; that's a failure mode, not the mechanism.)

"VLMs read text in images like OCR." They read it through 14px patches — fine print, dense tables, and low-contrast text degrade into hallucinated plausibilities. For exact transcription at small scale, dedicated OCR as a tool still wins.

"Why can't image models spell?" Now you can answer it: text in an image is a long, exact, sequential constraint — precisely what refine-everywhere generation is worst at, and what autoregression is built for. The failure mode is the architecture, visible to the naked eye.

🗺️

Next, and last: models that see screens and operate tools raise the stakes considerably. Doc 16: alignment, jailbreaks, and the injection attack every agent engineer must understand.