Pre-ML Era — Classical NLP & Smaller Skill Models

00 The Pre-Deep Learning Era

~1990 to ~2015: when clever feature engineering and probabilistic models ruled NLP.

1950s–70s

Symbolic / Rule-Based AI

Hand-crafted rules, hand-built decision trees, pattern matching. ELIZA (1966) faked conversation with keyword-matching scripts. SHRDLU (1970) parsed natural language into symbolic logic. Worked in narrow domains; failed to generalize.

1980s–90s

Statistical NLP Begins

Hidden Markov Models for speech and POS tagging. N-gram language models. IBM's statistical machine translation. Shift: replace linguistic rules with probabilities estimated from data. "Every time I fire a linguist, the performance of the speech recognizer goes up." (attributed to Frederick Jelinek).

2000–2010

Feature Engineering + Discriminative Models

Support Vector Machines, Maximum Entropy (logistic regression), Conditional Random Fields. Engineers hand-craft features: POS tags, dependency parses, character n-grams, capitalization patterns. These features feed powerful classifiers. State of the art for intent detection, NER, and sentiment.

2013

Word2Vec — Embeddings Arrive

Word representations learned from co-occurrence statistics. "king − man + woman ≈ queen." Distributed representations replace sparse one-hot vectors. This bridged classical and deep learning NLP.

2014–2017

Deep Learning Takes Over

BiLSTM-CRF becomes NER standard. CNN for text classification. Seq2Seq for translation. Then: Transformers (2017). The feature engineering era ends almost overnight.

🔑

Why study this? (1) Production systems still use classical models where latency, interpretability, or data size make LLMs overkill. (2) Features and heuristics from this era (TF-IDF, BM25, regex pre-filters) still run inside RAG pipelines and search systems. (3) Understanding what LLMs replaced makes you appreciate what they do. (4) Interview questions — HMMs, n-grams, Naive Bayes still appear.

01 Representing Text — The Core Problem

ML models operate on numbers. Text is symbols. Every classical NLP method starts with this conversion.

Bag of Words (BoW)

Represent each document as a vector of word counts. Vocabulary size = V (typically 10K–100K). Each document → sparse vector in $\mathbb{R}^V$. Order is discarded: "cat ate dog" = "dog ate cat".
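A minimal sketch of the transformation, assuming whitespace tokenization and a tiny made-up corpus (`build_vocab` and `bow_vector` are illustrative names, not from any library):

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def bow_vector(doc, vocab):
    """Dense count vector in vocabulary order; real systems store it sparsely."""
    counts = Counter(doc.lower().split())
    return [counts.get(tok, 0) for tok in vocab]

docs = ["cat ate dog", "dog ate cat", "the cat sat"]
vocab = build_vocab(docs)            # {'ate': 0, 'cat': 1, 'dog': 2, 'sat': 3, 'the': 4}
vectors = [bow_vector(d, vocab) for d in docs]
```

The first two documents map to the same vector, which is exactly the loss of word order described above.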

Bag of Words Transformation

TF-IDF — Weighting by Importance

Common words ("the", "is") appear in every document — they carry no discriminative signal. TF-IDF downweights frequent words and upweights rare, informative ones.

$$\text{TF}(t,d) = \frac{\text{count}(t,d)}{|d|} \qquad \text{IDF}(t) = \log\frac{N}{\text{df}(t)} \qquad \text{TF-IDF}(t,d) = \text{TF}(t,d)\cdot\text{IDF}(t)$$ TF = term frequency in document $d$. IDF = inverse document frequency across a corpus of $N$ docs; $\text{df}(t)$ = number of docs containing term $t$. High TF-IDF: a word appears often in this doc but rarely elsewhere → distinctive. BM25 is a refined version still used in Elasticsearch, Solr, and as the sparse component in hybrid RAG.
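The formulas above can be sketched directly in a few lines. This is an illustrative implementation under the same definitions (TF = count/|d|, IDF = log(N/df)), with whitespace tokenization and a toy corpus as assumptions; library implementations usually add smoothing and normalization on top.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # df[t] = number of documents that contain term t at least once
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        weights.append({
            term: (count / len(toks)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the cat sat", "the dog barked", "the cat purred"]
w = tf_idf(docs)
```

In this toy corpus "the" occurs in every document, so its IDF is log(1) = 0 and it drops out, while "sat" (one document) outweighs "cat" (two documents).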

N-grams — Capturing Local Word Order

BoW loses word order. N-grams add contiguous sequences of N words as features. "New York" as a bigram is different from "new" + "york" separately.

Unigrams	the · cat · sat · on · the · mat
Bigrams	the cat · cat sat · sat on · on the · the mat
Trigrams	the cat sat · cat sat on · sat on the · on the mat
Char 3-grams	"cat" → <ca · cat · at> (with < > marking word boundaries; robust to typos / morphology)

The catch — vocabulary explosion: 50K unigrams imply up to 2.5 billion possible bigrams. The fix is to keep only the top-K most frequent n-grams and prune the rest.
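A small sketch of n-gram extraction and top-K pruning, assuming whitespace tokens and `<`/`>` as word-boundary markers for the character variant (function names are illustrative):

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Contiguous word n-grams, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(word, n):
    """Character n-grams of the word wrapped in boundary markers."""
    s = f"<{word}>"
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def top_k_ngrams(docs, n, k):
    """Keep only the k most frequent word n-grams across the corpus."""
    counts = Counter(g for doc in docs for g in word_ngrams(doc.lower().split(), n))
    return [gram for gram, _ in counts.most_common(k)]

tokens = "the cat sat on the mat".split()
bigrams = word_ngrams(tokens, 2)
kept = top_k_ngrams(["the cat sat on the mat", "the cat ran"], n=2, k=1)
```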

N-gram Language Model

$$P(w_1,\dots,w_n) = \prod_{t=1}^{n} P(w_t \mid w_{t-1},\dots,w_1) \;\approx\; \prod_{t=1}^{n} P(w_t \mid w_{t-(k-1)},\dots,w_{t-1})$$ Markov assumption: only the last $k-1$ words matter ($k=2$: bigram, $k=3$: trigram). Probabilities are estimated from corpus counts, with smoothing (Kneser-Ney, Good-Turing) to handle unseen n-grams. Used in speech recognition, early machine translation, autocomplete. Limitation: no context beyond the previous $k-1$ words.
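A bigram ($k=2$) model can be sketched with two counters. Note the swap: this sketch uses add-one (Laplace) smoothing because it fits in one line, not Kneser-Ney or Good-Turing; the `<s>`/`</s>` sentence markers and the toy corpus are assumptions.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Return (prob, vocab) where prob(prev, word) = P(word | prev), add-one smoothed."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        toks = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(toks[1:])               # every token that can be predicted
        context_counts.update(toks[:-1])     # every token that serves as context
        bigram_counts.update(zip(toks, toks[1:]))

    def prob(prev, word):
        return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

    return prob, vocab

def sentence_prob(prob, sentence):
    """Chain the bigram probabilities over the whole sentence."""
    toks = ["<s>"] + sentence.lower().split() + ["</s>"]
    p = 1.0
    for prev, word in zip(toks, toks[1:]):
        p *= prob(prev, word)
    return p

prob, vocab = train_bigram_lm(["the cat sat", "the cat ran", "the dog sat"])
```

With this corpus, `prob("the", "cat")` is (2 + 1) / (3 + 6) = 1/3, and an unseen pair such as `prob("dog", "cat")` still gets a small non-zero probability.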

02 Naive Bayes — The Classic Text Classifier

Surprisingly competitive despite its extreme simplifying assumption. Still used in spam filters and as a fast baseline.

Naive Bayes applies Bayes' theorem to text classification, making the "naive" assumption that all word features are conditionally independent given the class. This is false (word order and co-occurrence clearly matter), yet the model works remarkably well in practice.

$$P(c \mid d) \;\propto\; P(c)\prod_{t \in d} P(w_t \mid c) \qquad\Longrightarrow\qquad \hat{c} = \arg\max_{c}\; P(c)\prod_{t \in d} P(w_t \mid c)$$ $P(c)$ = prior probability of class $c$ (from training-set frequencies). $P(w_t \mid c)$ = probability of word $t$ in documents of this class $= \text{count}(t,c)\,/\,\text{total words in } c$. The $\propto$ means we normalize at the end; for the $\arg\max$ the normalizer is irrelevant.

Train · prior	$P(c) = \dfrac{\#\text{docs in }c}{\#\text{docs}}$
Train · likelihood	$P(w\mid c) = \dfrac{\text{count}(w,c) + 1}{\text{words}(c) + V}$ the +1 is Laplace smoothing — no word gets zero probability
Predict · score	$\text{score}(c) = \log P(c) + \sum_{w\in d}\log P(w\mid c)$ log-space sum avoids underflow
Predict · choose	$\hat{c} = \arg\max_{c}\ \text{score}(c)$
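The four steps in the table can be sketched as follows. This is a minimal multinomial Naive Bayes with whitespace tokenization; the training examples are made up, and skipping words never seen in training is an assumption of this sketch.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Collect the counts needed for priors and smoothed likelihoods."""
    class_docs = Counter()               # number of documents per class
    word_counts = defaultdict(Counter)   # word counts per class
    vocab = set()
    for text, label in labeled_docs:
        toks = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(toks)
        vocab.update(toks)
    return class_docs, word_counts, vocab

def predict_nb(model, text):
    """Return (best_class, {class: log-score})."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    scores = {}
    for c in class_docs:
        total_words = sum(word_counts[c].values())
        score = math.log(class_docs[c] / n_docs)            # log prior
        for w in text.lower().split():
            if w in vocab:                                   # skip unknown words
                # Laplace-smoothed log likelihood
                score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get), scores

model = train_nb([
    ("win cash now", "spam"),
    ("free cash prize", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
])
label, scores = predict_nb(model, "free cash")
```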

Strengths

Extremely fast: training is a single counting pass over the corpus; prediction is O(|doc| · #classes)
Works well with little data
Naturally handles new features (just add new words)
Strong baseline — often within 5% of complex models
Interpretable: you can inspect P(w|class) directly

Weaknesses

Independence assumption: "not good" treated as "not" + "good"
Probability calibration is poor (argmax is fine; the probabilities themselves aren't)
Long documents: domination by high-frequency features
No feature interactions

03 SVM — The Pre-Deep Learning Champion

From roughly 2000 to 2012, SVMs were the dominant model for text classification, sentiment, and many other NLP tasks. Understanding why reveals something deep about generalization.

The Core Idea: Maximum Margin

Instead of finding any hyperplane that separates classes, find the one with the maximum margin — maximum distance from the hyperplane to the nearest training points (support vectors). This margin-maximization is theoretically linked to better generalization.

$$\max_{\mathbf{w},b}\; \frac{2}{\lVert\mathbf{w}\rVert} \;\;\text{s.t.}\;\; y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 \qquad\Longleftrightarrow\qquad \min_{\mathbf{w},b}\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 \;\;\text{s.t. constraints}$$ This is a convex quadratic program — one global optimum, efficiently solvable. The solution depends only on the support vectors (points nearest the boundary). Most training points are irrelevant — they don't affect the hyperplane. This "sparsity" is why SVMs generalize well even in high dimensions.

$$\min_{\mathbf{w},b,\boldsymbol{\xi}}\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_i \xi_i \qquad \text{s.t.}\quad y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1-\xi_i,\;\; \xi_i \ge 0$$ $\xi_i$ = slack variable: how much point $i$ violates the margin. $C$ = penalty on margin violations, i.e. the inverse of regularization strength: large $C$ punishes violations heavily, giving a narrower margin that fits the training data harder; small $C$ tolerates violations, giving a wider margin and more regularization. This is the bias–variance tradeoff in SVM form.
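A worked step that may help connect the soft-margin program to the loss most linear SVM solvers minimize: at the optimum each slack variable takes the smallest value its two constraints allow, so it can be eliminated, which leaves an unconstrained problem with the hinge loss.

$$\xi_i = \max\!\big(0,\; 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big) \qquad\Longrightarrow\qquad \min_{\mathbf{w},b}\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_i \max\!\big(0,\; 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big)$$

Points with $y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1$ contribute zero loss, which is the same statement as "only points on or inside the margin affect the solution".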

The Kernel Trick — Non-Linear SVMs

The SVM decision function only involves dot products between training points: $\mathbf{w}^\top\mathbf{x} = \sum_i \alpha_i y_i\,(\mathbf{x}_i \cdot \mathbf{x})$ . Replace the dot product with a kernel function $K(\mathbf{x}_i, \mathbf{x})$ — implicitly mapping to a high-dimensional space without ever computing the coordinates there.

Common Kernels

$$\begin{aligned} \text{Linear:}\quad & K(\mathbf{x},\mathbf{x}') = \mathbf{x}^\top\mathbf{x}' \\ \text{RBF:}\quad & K(\mathbf{x},\mathbf{x}') = \exp\!\big(-\gamma\lVert\mathbf{x}-\mathbf{x}'\rVert^2\big) \\ \text{Poly:}\quad & K(\mathbf{x},\mathbf{x}') = (\mathbf{x}^\top\mathbf{x}' + c)^d \\ \text{String:}\quad & K(s,s') = \#\,\text{common substrings} \end{aligned}$$
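A quick numerical check of the kernel trick, as a sketch in plain Python: for the polynomial kernel with $c=0$, $d=2$, the kernel value equals an ordinary dot product in the explicit feature space of all pairwise products $x_i x_j$. The helper names and the sample vectors are illustrative.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def rbf_kernel(x, y, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def poly_kernel(x, y, c=0, d=2):
    return (dot(x, y) + c) ** d

def phi_quadratic(x):
    """Explicit feature map for (x . y)^2: every pairwise product x_i * x_j."""
    return [a * b for a in x for b in x]

x, y = [1, 2, 3], [4, 5, 6]
implicit = poly_kernel(x, y)                          # computed in 3 dimensions
explicit = dot(phi_quadratic(x), phi_quadratic(y))    # computed in 9 dimensions
```

Both routes give the same number, but the kernel never builds the 9-dimensional vectors; for text with tens of thousands of dimensions the explicit map would be infeasible.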

Why SVMs Dominated NLP (2000–2012)

Text in TF-IDF space is already high-dimensional and nearly linearly separable, so linear-kernel SVMs work extremely well. String kernels could capture n-gram features without explicitly computing them. SVMs came with solid theoretical guarantees, and given enough data they clearly outperformed Naive Bayes.

04 Sequence Labeling — HMM & CRF

Text isn't just a bag of words — it's a sequence where each element's label depends on its neighbors. HMM and CRF are the two key models for this.

Hidden Markov Model (HMM)

A generative model: assume the observed words are produced by a hidden sequence of states (e.g., POS tags). Learn the model that most likely generated the observations.

HMM: Hidden States Generate Observations

$$P(w_1{:}w_n,\, t_1{:}t_n) = \prod_{i=1}^{n} \underbrace{P(w_i \mid t_i)}_{\text{emission}}\cdot \underbrace{P(t_i \mid t_{i-1})}_{\text{transition}}$$ Emission $P(w \mid t)$: how likely word $w$ is given tag $t$ (learned from counts). Transition $P(t \mid t')$: how likely tag $t$ is after $t'$, with $t_0$ a fixed start-of-sentence tag. Decoding (the Viterbi algorithm): find the most likely tag sequence in $O(n\cdot|S|^2)$ time via dynamic programming, where $n$ is the sentence length and $|S|$ the number of tags.
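A compact sketch of Viterbi decoding in log space. The probability tables are hand-made toy values (not estimated from any corpus), and flooring missing probabilities at a small constant is an illustrative stand-in for proper smoothing.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most likely tag sequence for `words` under an HMM."""
    def lp(table, key):
        return math.log(max(table.get(key, 0.0), floor))

    # best[i][t]: log-prob of the best tag path that ends in tag t at position i
    best = [{t: lp(start_p, t) + lp(emit_p[t], words[0]) for t in tags}]
    back = []  # back[i-1][t]: previous tag on that best path
    for i in range(1, len(words)):
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + lp(trans_p[p], t))
            scores[t] = best[i - 1][prev] + lp(trans_p[prev], t) + lp(emit_p[t], words[i])
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)

    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.1, "VERB": 0.8},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"barks": 0.6, "dog": 0.05},
}
tagged = viterbi(["the", "dog", "barks"], TAGS, START, TRANS, EMIT)
```

The two nested loops over tags inside the loop over positions are where the quadratic-in-tags, linear-in-length cost comes from.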

CRF — Conditional Random Field

HMMs are generative (model P(words, tags)). CRFs are discriminative — directly model P(tags|words). This means they can use arbitrary overlapping features of the observation, not just the current word. This is the key advantage.

$$P(\mathbf{y}\mid\mathbf{x}) = \frac{1}{Z(\mathbf{x})}\exp\!\Big(\sum_{t}\sum_{k} \lambda_k\, f_k(y_t, y_{t-1}, \mathbf{x}, t)\Big)$$ $f_k$ = feature functions (arbitrary: each one can look at any part of the input $\mathbf{x}$, not just position $t$). $\lambda_k$ = learned weights. $Z(\mathbf{x})$ = partition function (normalizer over all tag sequences). A feature can be "current word is capitalized AND previous tag is O", which is impossible in an HMM, where each word depends only on its own tag through the emission probability.

CRF Feature Examples (for NER)

word is capitalized
previous word = "Dr."
ends in "-ington" (location?)
in company gazetteer
previous tag = B-PER
word shape = Xxxxx
is a number?
POS = NNP

HMM vs CRF

Aspect	HMM	CRF
Model type	Generative	Discriminative
Features	Current word only	Any features of input
Speed	Fast training (counting); Viterbi decoding	Slower training (partition fn, iterative optimization); same Viterbi decoding
Accuracy (NER)	~88–90 F1	~91–94 F1

⚡

Still relevant: BiLSTM-CRF (2015), the deep learning NER standard before Transformers, combines a BiLSTM for context-aware embeddings with a CRF output layer for globally consistent tag sequences. The CRF layer enforces constraints such as "I-ORG can't follow B-PER". Many production NER systems still use this architecture today.

05 Intent Detection — A Complete System

The bedrock of voice assistants and chatbots. How it was built before LLMs.

Intent detection answers: "What is the user trying to do?" Given "Book me a flight to Paris", the intent is book_flight. Given "What's the weather like?", the intent is get_weather. This is fundamentally a text classification problem.

The Classical Intent Pipeline

Raw Text → Preprocessing → Feature Extraction → Classifier → Intent + Confidence

1 · Lowercase	"Book ME a FLIGHT" → "book me a flight"
2 · Tokenize	"don't book it" → [do, n't, book, it]
3 · Stopwords	drop me, a, to → [book, flight, paris]
4 · Lemmatize	booking → book · flights → flight
5 · Entity norm.	"to Paris" → "to CITY" · "for 3pm" → "for TIME"

Stopword removal is optional (it sometimes hurts intent detection). Entity normalization is the critical one — replacing names with type tags stops the classifier from memorizing specific cities.

Lexical Features

TF-IDF unigrams + bigrams
First/last word indicators
Word length buckets
Presence of question words ("what","how","when")
Punctuation pattern

Syntactic Features

POS tag sequence
Dependency parse head words
Sentence length
Verb type (action vs state)
Root verb of the sentence

Semantic Features

WordNet hypernyms (flight → travel)
Named entity types (PERSON, DATE)
Domain vocabulary matches (gazetteer)
Word2Vec sentence vector (average)

Classifier	How	Pros	Cons
Logistic Regression	Softmax over TF-IDF features	Fast, interpretable, strong baseline	Linear, no feature interactions
SVM (linear)	Max margin in TF-IDF space	Best on high-dim sparse text	No probability output; multi-class awkward
MaxEnt (LogReg)	Maximize entropy subject to feature constraints	Handles overlapping features well	Same as logistic regression mathematically
Random Forest	Ensemble of decision trees	Non-linear, handles noise	Slower, less effective on sparse text
Gradient Boosting	XGBoost/LightGBM on hand-crafted features	Often best classical accuracy	Feature engineering still needed

Confidence is critical in production: when the model is unsure, don't act — fall back to clarification or a human.

conf > 0.85	execute the intent directly
0.60 – 0.85	confirm first — "Do you want to book a flight?"
conf < 0.60	ask a clarifying question or hand off to a human
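The three bands can be sketched as a small routing function. The thresholds are the ones in the table; the function name, the return shape, and treating exactly 0.60 as "confirm" are assumptions of this sketch.

```python
def route(intent, confidence, execute_at=0.85, confirm_at=0.60):
    """Decide what to do with a classifier prediction based on its confidence."""
    if confidence > execute_at:
        return ("execute", intent)      # act directly
    if confidence >= confirm_at:
        return ("confirm", intent)      # e.g. "Do you want to book a flight?"
    return ("clarify", None)            # ask again or hand off to a human

decisions = [route("book_flight", c) for c in (0.93, 0.70, 0.41)]
```

In practice the thresholds would be tuned on held-out data, and they only make sense if the classifier's confidence scores are reasonably calibrated.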

Multi-intent: "Book a flight AND a hotel" → [book_flight, book_hotel] via multi-label output. Out-of-scope: reject when max confidence is below threshold, or train a dedicated in-scope / out-of-scope gate first.

06 Named Entity Recognition

Identify and classify proper nouns: people, organizations, locations, dates, products. The backbone of information extraction.

IOB Tagging Scheme

Every token gets one tag, so the model can mark entity spans of any length:

B-TYPE = begins an entity
I-TYPE = inside / continues it
O = not an entity

Barack	Obama	visited	New	York	City	on	Monday
B-PER	I-PER	O	B-LOC	I-LOC	I-LOC	O	B-DATE

The B/I split is what separates adjacent mentions: "New York New York" → B-LOC I-LOC B-LOC I-LOC = two distinct locations, not one four-word blob.
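Turning a tag sequence back into entity spans can be sketched like this. The function name is illustrative, and treating a stray I- tag (one with no matching open entity) as the start of a new entity is a lenient assumption of this sketch.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, entity_type) spans."""
    spans, current, current_type = [], [], None

    def close():
        if current:
            spans.append((" ".join(current), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            close()                                  # B- always opens a new entity
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)                    # continue the open entity
        else:                                        # "O"
            close()
            current, current_type = [], None
    close()
    return spans

tokens = "Barack Obama visited New York City on Monday".split()
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC", "O", "B-DATE"]
entities = iob_to_spans(tokens, tags)
```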

Classical NER Feature Engineering

This is where the craft was — every percentage point of F1 required careful feature design for each domain:

The feature vector for one token — say "Obama" in "Barack Obama visited…":

The word itself

lower	obama	surface form
istitle	True	starts capital
isdigit	False	not a number
suffix[-3:]	ama	morphology
prefix[:3]	Oba	morphology
shape	Xxxxx	caps pattern
pos	NNP	proper noun

Context & lookups

prev_word	barack	left context
next_word	visited	right context
prev_label	B-PER	CRF: last tag
in_gazetteer	True	known-name list

Every feature here was hand-designed. A CRF combines them; prev_label is what makes the output a globally consistent sequence rather than independent guesses.
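The feature table above can be sketched as a function that returns one feature dict per token, the input shape most CRF toolkits accept. The function names, the `<BOS>`/`<EOS>` padding markers, and the tiny gazetteer are assumptions; `prev_label` is not computed here because the CRF supplies it through its transition features during training and decoding.

```python
def word_shape(word):
    """Map characters to X (upper), x (lower), d (digit); keep everything else."""
    return "".join(
        "X" if ch.isupper() else "x" if ch.islower() else "d" if ch.isdigit() else ch
        for ch in word
    )

def token_features(tokens, pos_tags, i, gazetteer):
    """Hand-designed features for the token at position i."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "istitle": word.istitle(),
        "isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "prefix3": word[:3],
        "shape": word_shape(word),
        "pos": pos_tags[i],
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        "in_gazetteer": word.lower() in gazetteer,
    }

tokens = ["Barack", "Obama", "visited"]
pos_tags = ["NNP", "NNP", "VBD"]
feats = token_features(tokens, pos_tags, 1, gazetteer={"obama", "merkel"})
```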

Gazetteers — The Secret Weapon

A gazetteer is a lookup table of known entities (lists of cities, company names, people names). Matching against gazetteers was one of the highest-value features. The challenge: ambiguity ("Apple" = company or fruit? "Jordan" = person or country?) and coverage (new entities constantly appear).

🏆

Why NER was hard: reported F1 climbed from roughly 85% (HMM) to ~91% (CRF, 2003) to ~93% (BiLSTM-CRF, 2016) to ~96% (fine-tuned BERT, 2019). Treat these as ballpark figures, since exact scores depend on the dataset and evaluation setup. Each jump required a fundamentally different model family. The CRF → BiLSTM jump happened because deep learning automatically learned features from raw tokens, eliminating years of manual feature engineering. The BERT jump came from pretraining on massive text.

07 Word Vectors — The Bridge to Deep Learning

The single innovation that made classical NLP people take neural networks seriously.

The Problem with One-Hot Vectors

"cat" and "kitten" are similar, but one-hot vectors for them are orthogonal — dot product = 0, distance = √2. The representation encodes no semantic relationships. Every model had to learn similarity from scratch for every task.

Word2Vec (2013, Mikolov et al.)

Train a shallow neural network to predict neighboring words. The hidden layer weights become word embeddings. Two architectures:

CBOW — Continuous Bag of Words

Predict center word from context words. Input: average of context word embeddings → softmax → predict center word. Faster, better for frequent words.

$$P(w \mid \text{ctx}) = \mathrm{softmax}\!\Big(W_2 \cdot \mathrm{mean}\big(\{\mathbf{e}_v : v \in \text{ctx}\}\big)\Big)$$ $\mathbf{e}_v$ = the row of $W_1$ for context word $v$. $W_1$ = embedding matrix (what we keep). $W_2$ = output projection (discarded after training).

Skip-gram

Predict context words from center word. For each training word, predict each word within a window of ±k. Slower, better for rare words and analogies.

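$$\max\; \sum_{t=1}^{T}\;\sum_{\substack{-k\le j\le k \\ j\neq 0}} \log P(w_{t+j} \mid w_t)$$ The objective: maximize the log-probability of every context word within ±$k$ of each center word $w_t$, summed over all $T$ positions in the corpus. With negative sampling: for each positive pair, sample $K$ negative (random) words and pose a binary task: is this a real neighbor or noise? Much faster than the full softmax over the whole vocabulary.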
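How the training examples are produced can be sketched without any neural network code. Two simplifications are assumptions of this sketch: negatives are drawn uniformly (word2vec draws them from the unigram distribution raised to the 0.75 power), and a sampled negative is not checked against the true context words.

```python
import random

def skipgram_pairs(tokens, window):
    """All (center, context) pairs within +/- window positions."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

def with_negatives(pairs, vocab, num_negatives, rng):
    """Label real pairs 1 and add randomly sampled noise pairs labeled 0."""
    examples = []
    for center, context in pairs:
        examples.append((center, context, 1))
        for _ in range(num_negatives):
            examples.append((center, rng.choice(vocab), 0))
    return examples

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)
examples = with_negatives(pairs, sorted(set(tokens)), num_negatives=2, rng=random.Random(0))
```

Each example then becomes one binary logistic-regression update on the dot product of the two word vectors, which is why the cost per step no longer depends on vocabulary size.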

The analogy test that shocked the NLP world — vector arithmetic on word meanings:

king − man + woman ≈ queen
Paris − France + Germany ≈ Berlin
walk + (ing direction) ≈ walking
big + (est direction) ≈ biggest

It works because the space encodes each relationship (gender, capital-of, tense) as a consistent geometric direction. Subtract one word and add another, and you slide along that direction.

Figure (interactive in the original): vector arithmetic, king − man + woman ≈ queen. A 2D sketch of embedding space. The male→female direction is the same constant offset everywhere, so starting at king, subtracting man and adding woman lands almost exactly on queen.

GloVe (2014) — Global Vectors

$$J = \sum_{i,j} f(X_{ij})\,\big(\mathbf{w}_i^\top\mathbf{w}_j + b_i + b_j - \log X_{ij}\big)^2$$ $X_{ij}$ = co-occurrence count of words $i$ and $j$ in a window. $f(x) = \min\!\big(1, (x/x_{\max})^\alpha\big)$ is a weighting function that caps the influence of very frequent pairs. The objective asks the dot product of two word vectors to match the log of their co-occurrence count — explicitly using global corpus statistics, unlike Word2Vec's local window sampling.

🌉

The bridge: Word2Vec and GloVe gave classical NLP engineers a way to use neural representations without training neural networks end-to-end. Plug the pre-trained 300-dim vectors into your CRF or SVM as features — instant improvement. This broke down the resistance to "neural methods" and set the stage for LSTM and Transformer adoption.

08 Classical Dialogue Systems

Building task-oriented chatbots before LLMs — finite state machines, slot filling, and the frame-based architecture.

The Spoken Dialogue System Architecture

ASR (Speech → Text) → NLU (Intent + Slots) → Dialogue Manager → NLG (Response) → TTS (Text → Speech)

Slot Filling — Beyond Intent

Intent detection tells you what the user wants. Slot filling fills in the parameters. For book_flight, the slots are: origin, destination, date, passengers. The dialogue manager tracks which slots are filled and asks for missing ones.

The dialogue manager holds a frame for book_flight and tracks each slot's state:

origin	Delhi	✓ filled
destination	Paris	✓ filled
date	— empty —	✗ ask the user next
class	economy (default)	optional

Seeing date empty, the manager asks "When would you like to fly?", runs NLU on the reply, extracts the date, and updates the frame. Once every required slot is filled, it fires the booking API call.
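The frame-tracking loop can be sketched as two small functions. Slot names follow the frame above; the prompts for origin and destination, the date string, and the function names are hypothetical, and the NLU step is represented only by its output dict.

```python
REQUIRED_SLOTS = ["origin", "destination", "date"]
DEFAULTS = {"class": "economy"}
PROMPTS = {
    "origin": "Where are you flying from?",
    "destination": "Where would you like to fly?",
    "date": "When would you like to fly?",
}

def next_action(frame):
    """Ask for the first missing required slot, or book once all are filled."""
    for slot in REQUIRED_SLOTS:
        if not frame.get(slot):
            return ("ask", slot, PROMPTS[slot])
    filled = {k: v for k, v in frame.items() if v}
    return ("book", {**DEFAULTS, **filled})   # defaults apply to unfilled optional slots

def update(frame, extracted_slots):
    """Merge the slots NLU extracted from the latest user turn."""
    return {**frame, **{k: v for k, v in extracted_slots.items() if v}}

frame = {"origin": "Delhi", "destination": "Paris", "date": None}
first = next_action(frame)                        # asks for the date
frame = update(frame, {"date": "2024-05-01"})     # hypothetical NLU output
second = next_action(frame)                       # ready to call the booking API
```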

Finite State vs. Frame-Based vs. Information-State

Finite State Machine

Hard-coded graph of states and transitions. "If in state GREET and user says YES → move to state COLLECT_ORIGIN." Fast, predictable. Can't handle unexpected inputs. Used for simple IVR (phone menu) systems.

Frame-Based (slot filling)

Fill slots in any order. More flexible than FSM. Can handle "Book Paris to London on Tuesday" in one shot. The standard for task-oriented bots 2000–2017 (ATIS, hotel booking).

Statistical Dialogue (POMDP)

Maintain a probability distribution over possible dialogue states. Choose the next action to maximize expected reward. More robust to ASR/NLU errors. Associated with work at the Cambridge Dialogue Systems Group (2006–2016).

Natural Language Generation (Rule-Based)

ask_destination	"Where would you like to fly?" · "What's your destination?" · "Where are you headed?"
confirm_booking	"I've booked a {class} flight from {origin} to {destination} on {date}."

Pick a template → fill the slots → rotate phrasings to avoid sounding robotic. Fancier systems added sentence planners + surface realizers (OpenCCG, FUF/SURGE). The hard limit: every variation has to be hand-authored.

09 Information Retrieval — The Inverted Index

How search engines work, why BM25 still runs inside modern RAG, and the direct lineage from TF-IDF to dense retrieval.

Before neural retrieval, finding relevant documents in a corpus of millions required an efficient data structure. The inverted index, the core data structure of Lucene and therefore of Elasticsearch and Solr, maps each term to the sorted list of documents containing it (its posting list). A multi-term AND query becomes an intersection of sorted lists rather than a scan of the entire corpus.
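A minimal in-memory sketch of the idea, assuming whitespace tokenization and integer document ids; real engines add compression, skip pointers, and on-disk segments.

```python
from collections import defaultdict

def build_index(docs):
    """term -> sorted list of ids of the documents that contain it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index[term].append(doc_id)    # ids arrive in increasing order
    return index

def intersect(a, b):
    """Merge-style intersection of two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search_and(index, query):
    """Ids of documents containing every query term."""
    postings = sorted((index.get(t, []) for t in query.lower().split()), key=len)
    if not postings:
        return []
    result = postings[0]                  # start from the rarest term
    for plist in postings[1:]:
        result = intersect(result, plist)
    return result

docs = ["the cat sat on the mat", "the dog sat", "a cat and a dog"]
index = build_index(docs)
hits = search_and(index, "cat sat")
```

Starting with the shortest posting list keeps the intermediate result small, which is a common ordering heuristic.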

Inverted Index Structure

BM25 — The Backbone of Modern Search

BM25 (Best Match 25) is the ranking function used by Elasticsearch and the sparse component of most hybrid RAG pipelines. It extends TF-IDF with two key improvements: term frequency saturation (doubling word count doesn't double relevance) and document length normalisation (penalise long documents for having more term occurrences by chance).

$$\text{BM25}(q,d) = \sum_{w \in q} \text{IDF}(w)\cdot \frac{\text{TF}(w,d)\,(k_1+1)}{\text{TF}(w,d) + k_1\big(1-b + b\,\frac{|d|}{\text{avgdl}}\big)}$$ $$\text{IDF}(w) = \log\!\left(\frac{N - \text{df}(w) + 0.5}{\text{df}(w) + 0.5} + 1\right)$$

$k_1 \approx 1.2$ controls TF saturation; $b \approx 0.75$ controls length normalisation; avgdl = average document length. As $\text{TF}\to\infty$, each term's contribution $\to \text{IDF}\cdot(k_1+1)$: it saturates, unlike raw TF-IDF, which grows without bound. BM25 dates to 1994 and is still the baseline that neural retrievers must beat.
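The scoring formula can be sketched directly. This follows the two equations above term by term, with whitespace tokenization and a toy corpus as assumptions; it scores every document by brute force instead of walking an inverted index.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        length_norm = k1 * (1 - b + b * len(toks) / avgdl)
        score = 0.0
        for w in query.lower().split():
            if tf[w] == 0:
                continue                   # term absent: contributes nothing
            idf = math.log((n_docs - df[w] + 0.5) / (df[w] + 0.5) + 1)
            score += idf * tf[w] * (k1 + 1) / (tf[w] + length_norm)
        scores.append(score)
    return scores

docs = ["cat dog", "cat cat dog", "dog dog"]
# b=0 switches off length normalisation so only TF saturation is visible.
scores = bm25_scores("cat", docs, b=0.0)
```

With `b=0.0`, doubling the count of "cat" raises the score by a factor of 1.375 instead of 2.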

Figure (interactive in the original): TF saturation, BM25 vs raw TF-IDF. Both curves plot score against term frequency. Raw TF-IDF (dashed) grows without bound: a doc that spams a word 50× scores 50×. BM25 (solid) saturates: the 2nd occurrence matters far more than the 20th. Low values of $k_1$ saturate almost instantly (one mention is enough); high values behave more like linear TF-IDF. The value shown is $k_1 = 1.2$.

BM25 Strengths (still relevant)

Exact keyword match — never misses a literal term
No GPU, no embeddings, runs on commodity hardware
Interpretable: you know exactly why a document ranked
Millisecond latency at billion-document scale

Why hybrid RAG uses BM25 + dense

Dense retrieval understands semantics ("automobile" ≈ "car") but can miss exact matches (product codes, names, rare terms). BM25 never misses exact terms but doesn't understand semantics. Hybrid = BM25 + dense, merged with Reciprocal Rank Fusion (RRF). Consistently outperforms either alone.
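Reciprocal Rank Fusion needs only the rank positions from each retriever, not their scores. A minimal sketch, with k = 60 as the commonly used constant and made-up document ids:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_ranking = ["d1", "d2", "d3"]     # sparse retriever output (illustrative)
dense_ranking = ["d3", "d1", "d4"]    # dense retriever output (illustrative)
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because only ranks are used, there is no need to put BM25 scores and cosine similarities on a common scale.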

10 FastText — Subword Embeddings

Word2Vec's successor that handles morphologically rich languages, typos, and rare words — and is still fast enough to train in minutes.

Word2Vec gives a single embedding per word. Unknown words (unseen during training) get no embedding at all. For morphologically rich languages like Finnish, Turkish, or Hindi — where a word can have thousands of inflections — this is crippling. FastText (Facebook, 2017) extends Word2Vec by representing each word as a bag of character n-grams.

FastText: word → bag of character n-grams

"where" decomposed, with < > marking word boundaries:

n=3: <wh · whe · her · ere · re>
n=4: <whe · wher · here · ere>
n=5: <wher · where · here>
full: <where>

Word vector = the sum of its n-gram vectors. A typo like "wheree" shares almost every n-gram with "where" → a near-identical vector. That's how out-of-vocabulary words still get meaningful embeddings.
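The decomposition can be sketched in a few lines. The n-gram range 3 to 5 matches the example above (an assumption; the library's defaults may differ), and the hashing of n-grams into a fixed number of buckets is left out.

```python
def subword_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of '<word>' plus the whole bracketed word itself."""
    s = f"<{word}>"
    grams = {
        s[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(s) - n + 1)
    }
    grams.add(s)
    return grams

where = subword_ngrams("where")
typo = subword_ngrams("wheree")
shared = where & typo
```

Here 9 of the 13 n-grams of "where" also occur in "wheree", so the summed vectors end up close; an unseen word still gets a vector as long as some of its n-grams were seen in training.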

Advantages over Word2Vec

Out-of-vocabulary words get meaningful embeddings from shared n-grams
Robust to spelling variations and typos
Handles morphology: "running", "runs", "runner" share the n-grams of "run"
Excellent for agglutinative languages (Finnish, Turkish, Japanese)
Still very fast: n-gram hashing keeps vocabulary manageable

FastText Classification

FastText also includes a text classifier: average word n-gram vectors, apply a linear classifier. Trains in seconds on millions of documents. Still a strong baseline for intent detection and text classification — outperformed by BERT but at 1/1000 the inference cost. Used in production at Facebook/Meta for language identification (176 languages).

11 ELMo, BERT & the Bridge to Modern LLMs

The 2018–2019 revolution: contextualised embeddings and pretraining that ended the classical NLP era overnight.

The Problem Word2Vec Couldn't Solve: Polysemy

Word2Vec gives a single static vector per word. "Bank" (financial institution) and "bank" (river bank) have the same embedding. "Play" (theatre play) and "play" (sport play) are identical. Context is completely ignored. Every downstream task had to figure out word sense disambiguation itself.

ELMo (2018) — Contextual Embeddings from LSTMs

ELMo (Embeddings from Language Models) trained a deep bidirectional LSTM to predict the next word (forward) and previous word (backward) on a large corpus. The key insight: instead of using the word's static embedding, use the internal activations of the LSTM — which change depending on the surrounding context — as the word representation.

💡

The ELMo insight: The same word has different representations in different contexts. "bank" in "I went to the bank" activates different LSTM hidden states than "bank" in "the river bank eroded." These contextual representations drastically improved NER, QA, and sentiment — just by swapping the embeddings. ELMo proved that context matters and set the stage for BERT.

BERT (2018) — Bidirectional Transformer Pretraining

BERT (Bidirectional Encoder Representations from Transformers) replaced ELMo's BiLSTM with a Transformer encoder, and replaced next-word prediction with two objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).

Masked Language Modeling (MLM)

Randomly mask 15% of tokens. Train the model to predict the masked tokens from the bidirectional context. Unlike GPT (predict next token, left-to-right), MLM lets every token attend to tokens on both sides — giving much richer contextual representations. The input "The [MASK] sat on the mat" forces the model to use all surrounding context to predict "cat".
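A sketch of how MLM training inputs can be built. It follows the corruption rule from the BERT paper: of the selected tokens, 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. Selecting each token independently with probability 0.15 is a simplification, and the tiny vocabulary is made up.

```python
import random

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (inputs, labels); labels[i] is None where no prediction is required."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # token not selected for prediction
        labels[i] = token                  # the model must recover the original
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # random replacement
        # else: leave the original token in place
    return inputs, labels

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
inputs, labels = mask_tokens(tokens, vocab, random.Random(0))
```

The loss is computed only at positions where `labels[i]` is set, so most of each sequence serves purely as context.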

BERT Fine-tuning (The Key Shift)

Add a task-specific head on top of BERT's [CLS] token output (classification) or per-token output (NER, QA). Fine-tune all weights jointly. Results: BERT set a new state of the art on 11 NLP tasks at once. NER went from ~93% (BiLSTM-CRF) to ~96% F1. The feature engineering era ended in weeks.

Embedding Evolution: Word2Vec → ELMo → BERT → LLMs

🏁

Why BERT matters for your understanding: BERT is the hinge point of the entire story. Classical NLP built features → ML classified them. ELMo learned features → ML classified them. BERT learned features AND the classifier together, end-to-end, from pretraining. GPT scaled this further — no classifier needed at all. Understanding BERT makes the LLM training pipeline feel inevitable rather than magical.

12 The Transition to Deep Learning

Why classical methods gave way — and what the tipping points were.

What Classical Methods Couldn't Solve

Feature Engineering Bottleneck

Every task needed a new set of manually designed features. POS tagger features differ from NER features differ from sentiment features. Expertise was non-transferable. Neural networks learn features automatically from raw text.

No Transfer Learning

A TF-IDF+SVM model for spam detection couldn't help an intent classifier — no shared representation. Deep learning enabled pretraining on large corpora and fine-tuning on specific tasks — the modern paradigm.

Long-Range Dependencies

N-gram models are local. CRFs see a small window. "The bank that the customer who the employee that the department hired…" requires tracking nested dependencies across many tokens. LSTMs and Transformers handle this; classical models don't.

Semantic Compositionality

"Not bad" ≠ bad. "The best way to ruin a meal" (sarcasm). Composing word meanings into sentence meaning required complex hand-crafted logic in classical systems. Neural networks learn composition implicitly.

What Classical Methods Still Do Better

Interpretability

You can inspect every feature weight in a logistic regression or SVM. "This prediction is primarily driven by the word 'cancel'." Try explaining a 70B parameter attention pattern.

Low-Data Regimes

With 50–200 labeled examples, Naive Bayes and logistic regression often outperform fine-tuned LLMs. Classical models don't need 10K+ examples to work. In highly specialized domains with little data, they remain competitive.

Latency & Cost

TF-IDF + logistic regression classifies in microseconds on CPU at negligible cost. A BERT inference costs roughly $0.0001 and takes around 50 ms. An LLM API call can cost 100× more. At scale, classical models run inside other models' preprocessing pipelines.

Predictability

Rule-based and classical models behave deterministically. "If input contains regex /\bcancel\b/, fire cancel intent" — 100% reproducible. LLMs introduce stochasticity that can be hard to debug in production.

Where They Coexist Today

A modern production query passes through cheap-to-expensive layers — each runs only when the ones above it can't resolve the query:

1 · Rule pre-filter	regex (profanity, greetings)	µs · free	block or return a canned reply
2 · Keyword retrieval	BM25 inverted index	~ms	fetch top-20 candidate docs (RAG)
3 · Intent router	BERT-small classifier	~10 ms	simple_faq & conf>0.9 → answer, no LLM
4 · Re-rank	cross-encoder	~50 ms	re-score BM25 docs, keep top-5
5 · Generate	LLM	~sec · $$$	only when cheaper layers punt

Each layer is the cheapest tool that can settle the query. Most requests are answered before they ever reach the LLM — the classical components do the filtering, retrieval, and routing.
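The layering can be sketched as a function that tries each stage in the order listed above and returns as soon as one settles the query. Every component here is a stub: `retrieve`, `classify`, `rerank`, and `generate` are hypothetical callables standing in for a BM25 index, an intent classifier, a cross-encoder, and an LLM.

```python
import re

GREETING = re.compile(r"^\s*(hi|hello|hey)\b", re.IGNORECASE)

def answer(query, retrieve, classify, rerank, generate):
    """Return (layer_name, response) from the cheapest layer that can answer."""
    if GREETING.search(query):                        # 1. rule pre-filter
        return ("rule", "Hello! How can I help?")
    candidates = retrieve(query, top_k=20)            # 2. keyword retrieval
    intent, conf = classify(query)                    # 3. intent router
    if intent == "simple_faq" and conf > 0.9 and candidates:
        return ("router", candidates[0])
    top_docs = rerank(query, candidates)[:5]          # 4. re-rank
    return ("llm", generate(query, top_docs))         # 5. generate

# Stub components, purely illustrative.
def stub_retrieve(query, top_k=20):
    return [f"doc about {query}"]

def stub_classify(query):
    return ("simple_faq", 0.95) if "opening hours" in query else ("other", 0.4)

def stub_rerank(query, docs):
    return docs

def stub_generate(query, docs):
    return f"generated answer using {len(docs)} docs"

stubs = (stub_retrieve, stub_classify, stub_rerank, stub_generate)
results = [answer(q, *stubs)[0] for q in ("hello there", "opening hours?", "compare plans A and B")]
```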

The Full Picture

Classical NLP taught us that text representation, feature design, and probabilistic sequence modeling matter. Deep learning automated the feature engineering. Transformers automated the sequence modeling. LLMs automated the task-specific fine-tuning. But the problems are the same: intent, slots, entities, context, disambiguation. The tools evolved; the problem structure didn't. Understanding the classical approaches makes you a better engineer of the modern ones — you know what the neural networks are implicitly learning to do, and when a simpler classical tool is the right one.

