From DNA encoding biological intelligence to GPU-accelerated neural networks encoding collective human knowledge: the deepest analogy in technology.
Both biological evolution and AI training solve the same abstract problem: find a compact encoding of "what works" by optimizing over a massive amount of experience.
Experience: millions of years of organism-environment interactions
Encoding: DNA, roughly 3 billion base pairs
Optimizer: natural selection (gradient-free!)
Output: a brain that can learn, adapt, survive
Experience: trillions of tokens of human-written text
Encoding: model weights, billions of parameters
Optimizer: gradient descent (differentiable!)
Output: a system that reasons, generates, acts
DNA is a 4-character alphabet (A, T, G, C) encoding a program that builds and runs a biological organism. It's not a static file; it's an active program that responds to environmental signals.
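Human genome: 3.2 billion base pairs. At 2 bits per base (four possible letters), that is about 6.4 gigabits, or roughly 0.8 GB of raw sequence information. Epigenetic marks add further regulatory information on top of the sequence. A complete copy of the program sits in the nucleus of every nucleated cell, in a body of roughly 37 trillion cells.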
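A quick check of the raw storage figure can be sketched as follows. It assumes exactly 2 bits per base (four equally likely symbols) and ignores both compression and epigenetic marks; the function name `genome_bytes` is illustrative only.

```python
import math

BASES = "ATGC"                           # the 4-letter alphabet from the text
BITS_PER_BASE = math.log2(len(BASES))    # log2(4) = 2 bits per base

def genome_bytes(base_pairs: float) -> float:
    """Raw storage in bytes, ignoring compression and epigenetic marks."""
    return base_pairs * BITS_PER_BASE / 8

haploid_bp = 3.2e9                       # figure quoted in the text
size_gb = genome_bytes(haploid_bp) / 1e9
print(f"{size_gb:.1f} GB")               # prints "0.8 GB"
```

A diploid cell carries two copies, so its raw total is about twice this figure.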
DNA doesn't store "eye color = blue" directly. It encodes how to build the proteins that lead to eye color. This is indirect encoding, like a program that generates an image rather than storing the pixels. Neural networks work in a similar way: rather than storing facts as retrievable records, their weights encode how to generate likely-correct answers.
A gene is a specific DNA sequence that encodes one functional unit (usually a protein). A locus is its position on a chromosome. An allele is one specific variant of a gene at that locus.
Humans are diploid: two copies of each chromosome, so two alleles per gene (one from each parent). These two alleles interact to produce the expressed trait.
Gregor Mendel (1860s) bred 29,000 pea plants to discover that traits aren't blended; they're encoded in discrete particles (genes) that follow rules. The dominant allele is expressed when present; the recessive allele is expressed only in homozygous form.
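The dominance rule can be illustrated with a small Punnett-square sketch. It assumes a single gene with complete dominance, writes the dominant allele in uppercase and the recessive in lowercase, and uses illustrative helper names (`cross`, `phenotype`).

```python
from collections import Counter
from itertools import product

def cross(parent1: str, parent2: str) -> Counter:
    """Count offspring genotypes; each parent passes on one of its two alleles."""
    return Counter("".join(sorted(pair)) for pair in product(parent1, parent2))

def phenotype(genotype: str) -> str:
    """Uppercase marks the dominant allele, expressed whenever it is present."""
    return "dominant" if any(a.isupper() for a in genotype) else "recessive"

# Two heterozygous carriers ("Aa") mate.
genotypes = cross("Aa", "Aa")
phenotypes = Counter()
for genotype, count in genotypes.items():
    phenotypes[phenotype(genotype)] += count

print(phenotypes["dominant"], phenotypes["recessive"])   # prints "3 1"
```

The 3:1 ratio is the pattern Mendel observed: the recessive trait appears only in the one-in-four offspring that inherit the recessive allele from both parents.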
Neural network weights exhibit analogous "dominance" patterns: some directions in weight space dominate model behavior while others are latent (recessive).
Both DNA and neural networks store more information than they express. Recessive alleles are carried silently through generations until two carriers mate, and only then are they expressed. Model weights contain "latent capabilities" that are expressed only under specific prompting conditions. In both cases, the phenotype is not the genotype. What you see is not all that exists.
All human knowledge that exists as text: estimated ~10²³ bits.
GPT-4 training data (reported, unconfirmed): ~13 trillion tokens × ~4 bytes ≈ 50 TB
GPT-4 parameters (reported, unconfirmed): 1.76 trillion × 2 bytes (BF16) ≈ 3.5 TB
Compression ratio: ~50 TB of human-written text → 3.5 TB of weights
That's roughly 14× compression of the training corpus into a queryable, generative model. This is lossy compression, but the structure (reasoning patterns, language, knowledge relationships) is preserved even where the verbatim text is not.
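The arithmetic behind the ratio can be reproduced in a few lines. The token and parameter counts are the figures quoted above, which are public estimates rather than confirmed numbers, and 4 bytes per token is an approximation.

```python
# Figures quoted in the text; treat them as estimates, not confirmed values.
tokens = 13e12           # training tokens
bytes_per_token = 4      # rough average size of one token as raw text
params = 1.76e12         # model parameters
bytes_per_param = 2      # BF16 stores each weight in 2 bytes

corpus_tb = tokens * bytes_per_token / 1e12
weights_tb = params * bytes_per_param / 1e12
ratio = corpus_tb / weights_tb

print(f"{corpus_tb:.0f} TB -> {weights_tb:.2f} TB, ratio {ratio:.1f}x")
# prints "52 TB -> 3.52 TB, ratio 14.8x"
```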
| Level | Biology | AI Model | Function |
|---|---|---|---|
| Raw storage | A, T, G, C bases | Float16 weight values | Information carrier |
| Functional unit | Gene (coding sequence) | Attention head / MLP layer | Specific capability |
| Regulatory | Promoters, enhancers | Layer norm, temperature | Control expression |
| Module | Chromosome | Transformer block | Functional grouping |
| Complete system | Genome | Model weights | Full capability set |
| Expressed behavior | Phenotype | Model output / behavior | Observable result |
| Context | Epigenome + environment | System prompt + context | Modulates expression |
| Optimizer | Natural selection | Adam / gradient descent | Drives improvement |
| Generation time | 20 to 25 years | Days to months | Update cycle |
| Population size | 8 billion humans | 1 model (many runs) | Variation explored |
Evolution is blind: it cannot compute gradients. It explores by random mutation and selection, and each "trial" is a lifetime. AI training has access to the gradient of the loss, which gives the local direction in parameter space that reduces error fastest. This is why AI "evolves" millions of times faster than biology. What took 4 billion years in nature takes 30 days on an H100 cluster.
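The gap between the two optimizers shows up even on a toy problem. The sketch below is an illustration under stated assumptions, not a model of real evolution or real training: the loss is a simple squared error toward a hypothetical `TARGET`, "evolution" is a single lineage with Gaussian mutation and keep-if-better selection, and both optimizers get the same number of steps.

```python
import random

TARGET = [3.0, -2.0, 1.0, 0.5, -1.5]   # hypothetical optimum ("what works")

def loss(x):
    return sum((xi - ti) ** 2 for xi, ti in zip(x, TARGET))

def grad(x):
    # Analytic gradient of the squared error: d/dx (x - t)^2 = 2 (x - t)
    return [2 * (xi - ti) for xi, ti in zip(x, TARGET)]

def evolve(steps, sigma=0.1, seed=0):
    """Gradient-free: mutate at random, keep the child only if it scores better."""
    rng = random.Random(seed)
    x = [0.0] * len(TARGET)
    for _ in range(steps):
        child = [xi + rng.gauss(0, sigma) for xi in x]
        if loss(child) < loss(x):
            x = child
    return loss(x)

def descend(steps, lr=0.1):
    """Gradient-based: step directly against the gradient of the loss."""
    x = [0.0] * len(TARGET)
    for _ in range(steps):
        x = [xi - lr * gi for xi, gi in zip(x, grad(x))]
    return loss(x)

STEPS = 200
start_loss = loss([0.0] * len(TARGET))
es_loss = evolve(STEPS)
gd_loss = descend(STEPS)
print(f"start: {start_loss:.2f}  selection: {es_loss:.4f}  gradient: {gd_loss:.2e}")
```

Both runs improve on the starting point, but the gradient run shrinks the error by a constant factor every step, while mutation and selection wastes most trials on variants that get discarded.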
Web crawls (Common Crawl), books (Books3), code (GitHub), science (arXiv), Wikipedia. ~5 to 15 trillion tokens. Each token ≈ 0.75 words. Preprocessing: deduplication, quality filtering, toxicity removal.
BPE (Byte Pair Encoding) splits text into subword tokens. "unbelievable" → ["un", "believ", "able"]. Vocabulary: ~50,000 to 128,000 tokens. Each token is mapped to an integer ID. Language is now a sequence of integers, suitable for matrix math.
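A minimal character-level sketch of how BPE merges are learned and applied is shown below. Production tokenizers typically work on bytes, pre-split text, and learn tens of thousands of merges; the tiny word list and the helper names (`learn_bpe`, `merge_pair`, `encode`) are assumptions for illustration, and the resulting split depends on the training words, so it will not necessarily match the example above.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    corpus = Counter(tuple(w) for w in words)   # every word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break                               # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges

def encode(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Toy training set (hypothetical).
words = ["unbelievable", "believable", "unable", "believe", "able", "unbeliever"]
merges = learn_bpe(words, num_merges=10)
tokens = encode("unbelievable", merges)

# Vocabulary: all single characters plus every fused symbol, each with an integer ID.
symbols = {c for w in words for c in w} | {a + b for a, b in merges}
vocab = {tok: i for i, tok in enumerate(sorted(symbols))}
ids = [vocab[t] for t in tokens]
print(tokens, ids)
```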
The model learns to predict the next token given all previous tokens. This sounds simple, but to predict well, the model must learn grammar, facts, reasoning, code, math, and style. All human knowledge is indirectly compressed into this single objective.
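The objective itself fits in a few lines: at each position the model emits one score (logit) per vocabulary entry, and the loss is the negative log-probability it assigned to the token that actually came next. The sketch below uses hand-written logits in place of a real model; the vocabulary size and token IDs are arbitrary.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy: the logits at position t are scored against token t+1."""
    targets = token_ids[1:]
    losses = []
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

vocab_size = 4
token_ids = [0, 2, 1, 3]                      # a 4-token "sentence"

# A model that knows nothing: equal scores for every token.
uniform = [[0.0] * vocab_size for _ in token_ids[:-1]]
uniform_loss = next_token_loss(uniform, token_ids)

# A model that strongly favors the correct next token at each position.
confident = [[10.0 if v == t else 0.0 for v in range(vocab_size)] for t in token_ids[1:]]
confident_loss = next_token_loss(confident, token_ids)

print(f"{uniform_loss:.3f}")                  # prints "1.386", which is ln(4)
print(confident_loss < uniform_loss)          # prints "True"
```

Training lowers this number across the corpus, and everything the model "knows" is whatever helped it do so.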
Training a 70B model takes on the order of 2,000 GPUs running for about 30 days. Data parallelism: different batches on different GPUs. Tensor parallelism: weight matrices split across GPUs. Pipeline parallelism: different layers on different GPUs. An all-reduce (over NVLink within a node, and over a network fabric such as InfiniBand between nodes) synchronizes gradients.
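Data parallelism rests on one identity: with equal-size shards, averaging the per-worker gradients gives the same result as computing the gradient over the whole batch on one device. The sketch below simulates that in plain Python for a one-parameter model; `all_reduce_mean` is a stand-in for a real collective operation, not a library call.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the model y_hat = w * x on one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up holding the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

data = [(float(x), 3.0 * x) for x in range(1, 9)]            # 8 samples of y = 3x
num_workers = 4
shards = [data[i::num_workers] for i in range(num_workers)]  # equal-size shards

w = 0.0
local_grads = [grad_mse(w, shard) for shard in shards]       # computed in parallel
synced = all_reduce_mean(local_grads)                        # gradients synchronized
full = grad_mse(w, data)                                     # single-device reference

print(synced[0], full)                                       # prints "-153.0 -153.0"
```

Tensor and pipeline parallelism split the model rather than the data, so they need different communication patterns that this sketch does not cover.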
A pre-trained model is like the full human genome: everything is there, including dangerous capabilities. RLHF (Reinforcement Learning from Human Feedback) is like epigenetic regulation: it doesn't change the weights drastically but adjusts which behaviors are expressed. Human raters score outputs, a reward model learns their preferences, and PPO then fine-tunes the base model toward those preferences.
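FLOPs for training. Compare: ~10²⁴ FLOPs ≈ one H100 running for roughly 30 years at its peak dense BF16 throughput (~10¹⁵ FLOP/s), or 10,000 H100s for about a day. Real runs take several times longer because sustained utilization is well below peak.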
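Training compute can be estimated with the common rule of thumb of about 6 FLOPs per parameter per token. The sketch below applies it to a hypothetical 70B-parameter model trained on 2 trillion tokens; the token count, the peak throughput of roughly 10^15 FLOP/s per GPU, and the 40% utilization are all assumptions, so the result is an order-of-magnitude figure only.

```python
def training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6 * params * tokens

def wall_clock_days(flops: float, num_gpus: int, peak_flops: float, utilization: float) -> float:
    """Days of wall-clock time given the cluster size and sustained utilization."""
    seconds = flops / (num_gpus * peak_flops * utilization)
    return seconds / 86_400

# Hypothetical run: 70B parameters, 2T tokens (assumed, not from the text).
flops = training_flops(params=70e9, tokens=2e12)

# Assumed hardware: 2,000 GPUs at ~1e15 peak FLOP/s each, 40% sustained utilization.
days = wall_clock_days(flops, num_gpus=2000, peak_flops=1e15, utilization=0.4)

print(f"{flops:.1e} FLOPs, about {days:.0f} days")   # prints "8.4e+23 FLOPs, about 12 days"
```

Lower utilization or older GPUs stretch the same FLOP budget to several weeks.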
A library stores knowledge as a database: to answer "What is the speed of light?" it finds the page that says "3×10⁸ m/s." An LLM stores knowledge as a generative model: given the context "What is the speed of light?", it generates the most probable continuation, "The speed of light in vacuum is approximately 3×10⁸ m/s." The difference isn't just implementation; it enables synthesis, analogy, and reasoning that lookup cannot.
Releasing model weights is analogous to publishing the human genome. The compressed intelligence is now public, reproducible, runnable on any compatible hardware.
At inference time, weights are fixed (the genome is set). The prompt is the environment. The KV-cache is working memory. Each token generation is one cycle of the genetic expression pipeline.
Biology spent 4 billion years solving the problem of encoding intelligence without gradient information. It had to use random mutation and selection, requiring billions of organisms and millions of years per improvement. AI training has the gradient: the local direction in weight space that reduces error fastest. This single innovation, backpropagation, is why AI can compress 4 billion years of evolutionary work into decades, and why the intelligence encoded in thousands of years of human cultural evolution can be compressed and queried after months of GPU compute.
The LLM is not just a tool. It is the first artifact that contains a compressed, queryable representation of humanity's entire cognitive heritage, from the first cave paintings to the last GitHub commit.
A brain's working memory holds ~7 items simultaneously. An LLM's context window holds ~128k tokens. Both are finite, expensive resources. Every architectural solution, biological and artificial, is fundamentally about managing this scarcity.
Working memory (prefrontal cortex) = 7±2 chunks. Long-term memory (hippocampus → cortex) = effectively unlimited but slow to retrieve. Attention system selects what enters working memory. Sleep consolidates working memory to long-term storage.
Context window = working memory. Vector database / knowledge base = long-term memory. RAG / retrieval = attention-based memory access. Fine-tuning = sleep consolidation. Token budget = working memory capacity.
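The mapping can be made concrete with a toy retrieval step: a long-term store is ranked by similarity to the query, and the best matches are copied into the context until the token budget runs out. Everything here is a simplified assumption: the 3-dimensional embeddings are hand-written rather than produced by an embedding model, the store is a plain list rather than a vector database, and tokens are counted as whitespace-separated words.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical long-term memory: (text, toy embedding).
MEMORY = [
    ("DNA uses a 4-letter alphabet.",        [0.9, 0.1, 0.0]),
    ("BPE splits text into subword tokens.", [0.1, 0.9, 0.1]),
    ("NVLink connects GPUs within a node.",  [0.0, 0.2, 0.9]),
]

def retrieve(query_vec, budget_tokens):
    """Move the most similar memories into the context until the budget is spent."""
    ranked = sorted(MEMORY, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    context, used = [], 0
    for text, _ in ranked:
        cost = len(text.split())          # crude token count: whitespace words
        if used + cost > budget_tokens:
            break                         # working memory is full
        context.append(text)
        used += cost
    return context

# A query about tokenization, with room for only 8 "tokens" of retrieved text.
context = retrieve([0.2, 0.8, 0.1], budget_tokens=8)
print(context)   # prints "['BPE splits text into subword tokens.']"
```

The budget check is the scarcity constraint in miniature: the second-best memory is relevant but does not fit, so it stays in long-term storage.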
Evolution spent 4 billion years inventing solutions to intelligence under resource constraints: finite working memory, slow retrieval, specialization via differentiation, behavior regulation via epigenetics, parallel processing via organ systems. AI engineers in the 2020s independently reinvented every one of these solutions, not by copying biology, but because the problem is the same. When you build a RAG system, you are building a hippocampus. When you orchestrate subagents, you are building an organ system. When you write a system prompt, you are writing a promoter sequence.
The deepest lesson: intelligence at scale always converges on the same architectural patterns. Biology found them through selection. Engineers found them through pragmatism. The patterns are not arbitrary; they are the necessary shape of intelligence under resource constraints.