00 The Pre-Deep Learning Era
~1990 to ~2015: when clever feature engineering and probabilistic models ruled NLP.
Symbolic / Rule-Based AI
Hand-crafted rules, decision trees by hand, pattern matching. ELIZA (1966) faked conversation with keyword pattern matching and scripted response templates. SHRDLU (1970) parsed natural language into symbolic logic. Both worked in narrow domains and failed to generalize.
Statistical NLP Begins
Hidden Markov Models for speech and POS tagging. N-gram language models. IBM's statistical machine translation. The shift: replace linguistic rules with probabilities estimated from data. "Every time I fire a linguist, the performance of the speech recognizer goes up." (attributed to Frederick Jelinek)
Feature Engineering + Discriminative Models
Support Vector Machines, Maximum Entropy (logistic regression), Conditional Random Fields. Engineers hand-craft features: POS tags, dependency parses, character n-grams, capitalization patterns. These features feed powerful classifiers. State of the art for intent, NER, sentiment.
Word2Vec — Embeddings Arrive
Word representations learned from co-occurrence statistics. "king − man + woman ≈ queen." Distributed representations replace sparse one-hot vectors. This bridged classical and deep learning NLP.
Deep Learning Takes Over
BiLSTM-CRF becomes NER standard. CNN for text classification. Seq2Seq for translation. Then: Transformers (2017). The feature engineering era ends almost overnight.
01 Representing Text — The Core Problem
ML models operate on numbers. Text is symbols. Every classical NLP method starts with this conversion.
Bag of Words (BoW)
Represent each document as a vector of word counts. Vocabulary size = V (typically 10K–100K). Each document → sparse vector in ℝ^V. Order is discarded: "cat ate dog" and "dog ate cat" map to the same vector.
Bag of Words Transformation
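A minimal sketch of the transformation, assuming whitespace tokenization and lowercasing (real pipelines use a proper tokenizer and sparse storage). The helper names `build_vocab` and `bow_vector` are illustrative and not taken from any library.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct lowercase token to a column index (sorted for a stable order)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def bow_vector(doc, vocab):
    """Return a dense count vector; production systems store this sparsely."""
    counts = Counter(doc.lower().split())
    return [counts.get(tok, 0) for tok in vocab]

docs = ["cat ate dog", "dog ate cat", "cat ate fish"]
vocab = build_vocab(docs)          # columns: ate, cat, dog, fish
vectors = [bow_vector(d, vocab) for d in docs]
```

The first two documents produce identical vectors, which is exactly the loss of word order described above.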
TF-IDF — Weighting by Importance
Common words ("the", "is") appear in nearly every document, so they carry almost no discriminative signal. TF-IDF downweights words that occur in many documents and upweights rare, informative ones.
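One common formulation is weight(t, d) = tf(t, d) · log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The sketch below assumes that unsmoothed variant and whitespace tokenization; libraries typically add smoothing and normalisation, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = ["the cat sat", "the dog barked", "the cat purred"]
w = tfidf(docs)
```

Here "the" occurs in all three documents and gets weight 0, while "sat" (one document) outweighs "cat" (two documents).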
N-grams — Capturing Local Word Order
BoW loses word order. N-grams add contiguous sequences of N words as features. "New York" as a bigram is different from "new" + "york" separately.
N-gram Language Model
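A bigram language model estimates P(word | previous word) from counts. The sketch below is a toy version under stated assumptions: whitespace tokenization, sentence boundary markers `<s>` and `</s>`, and add-one (Laplace) smoothing, which is the simplest choice rather than the best one.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over sentences padded with <s> and </s>."""
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.lower().split() + ["</s>"]
        contexts.update(toks[:-1])              # every token that has a successor
        bigrams.update(zip(toks, toks[1:]))
    # Words that can be predicted: all observed words plus the end marker.
    vocab = {t for s in sentences for t in s.lower().split()} | {"</s>"}
    return contexts, bigrams, vocab

def prob(word, prev, contexts, bigrams, vocab):
    """P(word | prev) with add-one smoothing."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

corpus = ["i like new york", "i like new ideas", "new york is big"]
contexts, bigrams, vocab = train_bigram(corpus)
p_york = prob("york", "new", contexts, bigrams, vocab)    # (2 + 1) / (3 + 8)
p_ideas = prob("ideas", "new", contexts, bigrams, vocab)  # (1 + 1) / (3 + 8)
```

Smoothing gives unseen continuations such as "new big" a small non-zero probability instead of zero.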
02 Naive Bayes — The Classic Text Classifier
Surprisingly competitive despite its extreme simplifying assumption. Still used in spam filters and as a fast baseline.
Naive Bayes applies Bayes' theorem to text classification, making the "naive" assumption that all word features are conditionally independent given the class. This is false (word order and co-occurrence clearly matter), yet the model works remarkably well in practice.
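To make the independence assumption concrete: the score for a class is log P(class) plus the sum of log P(word | class) over the words in the document. The following is a small multinomial Naive Bayes sketch with add-one smoothing; the training examples and function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (text, label). Training is just counting."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        toks = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(toks)
        vocab.update(toks)
    return class_docs, word_counts, vocab

def predict_nb(text, class_docs, word_counts, vocab):
    """Return the class with the highest log posterior (up to a constant)."""
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)       # log prior
        for tok in text.lower().split():
            if tok in vocab:                                    # skip unseen words
                count = word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

train = [
    ("win cash prize now", "spam"),
    ("cheap cash offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
]
model = train_nb(train)
```

Calling `predict_nb("cash prize offer", *model)` picks "spam" because each of those words is more frequent in the spam counts; no interaction between words is ever considered.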
Strengths
- Extremely fast: training is a single counting pass over the corpus, and prediction is O(|doc|) per class
- Works well with little data
- Naturally handles new features (just add new words)
- Strong baseline — often within 5% of complex models
- Interpretable: you can inspect P(w|class) directly
Weaknesses
- Independence assumption: "not good" treated as "not" + "good"
- Probability calibration is poor (argmax is fine; the probabilities themselves aren't)
- Long documents: domination by high-frequency features
- No feature interactions
03 SVM — The Pre-Deep Learning Champion
From roughly 2000 to 2012, SVMs were the dominant model for text classification, sentiment, and many other NLP tasks. Understanding why reveals something deep about generalization.
The Core Idea: Maximum Margin
Instead of finding any hyperplane that separates classes, find the one with the maximum margin — maximum distance from the hyperplane to the nearest training points (support vectors). This margin-maximization is theoretically linked to better generalization.
The Kernel Trick — Non-Linear SVMs
The SVM decision function involves the training points only through dot products with the input: wᵀx = Σᵢ αᵢyᵢ (xᵢ · x). Replace the dot product with a kernel function K(xᵢ, x) and the model implicitly operates in a high-dimensional feature space without ever computing the mapping explicitly.
Common Kernels
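The kernels most often listed are linear, polynomial, and RBF (Gaussian). The sketch below writes them as plain functions and plugs one into the kernelised decision function Σᵢ αᵢyᵢ K(xᵢ, x) + b. The hyperparameter defaults, the bias term b, and the hand-set αᵢ values are illustrative assumptions; a real SVM obtains αᵢ and b by solving the training optimisation problem.

```python
import math

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, c=1.0):
    return (linear_kernel(x, z) + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, alphas, labels, bias, kernel):
    """Kernelised decision function: sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(
        a * y * kernel(sv, x)
        for sv, a, y in zip(support_vectors, alphas, labels)
    ) + bias

# Hand-set toy values, not the result of training.
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
alphas, labels, bias = [0.5, 0.5], [1, -1], 0.0
value = decision_value([2.0, 2.0], support_vectors, alphas, labels, bias, linear_kernel)
```

Swapping `linear_kernel` for `rbf_kernel` changes the decision boundary without touching the rest of the code, which is the practical appeal of the kernel trick.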
Why SVMs Dominated NLP (2000–2012)
Text in TF-IDF space is already high-dimensional and nearly linearly separable — linear kernel SVMs work extremely well. String kernels could capture n-gram features without explicitly computing them. They had solid theoretical guarantees. And they were much better than Naive Bayes on enough data.
04 Sequence Labeling — HMM & CRF
Text isn't just a bag of words — it's a sequence where each element's label depends on its neighbors. HMM and CRF are the two key models for this.
Hidden Markov Model (HMM)
A generative model: assume the observed words are produced by a hidden sequence of states (e.g., POS tags). Learn the model that most likely generated the observations.
HMM: Hidden States Generate Observations
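Given start, transition, and emission probabilities, the most likely tag sequence is usually found with the Viterbi algorithm. The sketch below uses a two-tag toy model whose probabilities are invented for illustration; a real tagger would estimate them from a tagged corpus and work in log space to avoid underflow.

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for `words` under the HMM."""
    # best[t][s] = (probability of the best path ending in s at position t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(word, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.7, "VERB": 0.3}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {
    "NOUN": {"dogs": 0.5, "chase": 0.1, "cats": 0.4},
    "VERB": {"dogs": 0.05, "chase": 0.9, "cats": 0.05},
}
tags = viterbi(["dogs", "chase", "cats"], states, start_p, trans_p, emit_p)
```

For this toy model the decoded sequence is NOUN, VERB, NOUN.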
CRF — Conditional Random Field
HMMs are generative (model P(words, tags)). CRFs are discriminative — directly model P(tags|words). This means they can use arbitrary overlapping features of the observation, not just the current word. This is the key advantage.
CRF Feature Examples (for NER)
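A sketch of the kind of per-token feature function a linear-chain CRF for NER would consume. The specific feature names are illustrative choices, not a fixed standard; the point is that features may overlap and may look at neighbouring words, which an HMM's emission model cannot do.

```python
def token_features(tokens, i):
    """Feature dict for token i, using the current word and its neighbours."""
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # capitalisation pattern
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "prefix2": word[:2].lower(),      # character n-gram features
        "suffix3": word[-3:].lower(),
    }
    if i > 0:
        feats["prev.lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True               # beginning of sentence
    if i < len(tokens) - 1:
        feats["next.lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True               # end of sentence
    return feats

tokens = ["Tim", "Cook", "visited", "Paris", "."]
paris_feats = token_features(tokens, 3)
```

In a full system each token's dict would be extended with POS tags and gazetteer matches before being passed to the CRF trainer.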
HMM vs CRF
| Aspect | HMM | CRF |
|---|---|---|
| Model type | Generative | Discriminative |
| Features | Current word only | Any features of input |
| Speed | Fast inference | Slower (partition fn) |
| Accuracy (NER) | ~88–90 F1 | ~91–94 F1 |
05 Intent Detection — A Complete System
The bedrock of voice assistants and chatbots. How it was built before LLMs.
Intent detection answers: "What is the user trying to do?" Given "Book me a flight to Paris", the intent is book_flight. Given "What's the weather like?", the intent is get_weather. This is fundamentally a text classification problem.
The Classical Intent Pipeline
Lexical Features
- TF-IDF unigrams + bigrams
- First/last word indicators
- Word length buckets
- Presence of question words ("what","how","when")
- Punctuation pattern
Syntactic Features
- POS tag sequence
- Dependency parse head words
- Sentence length
- Verb type (action vs state)
- Root verb of the sentence
Semantic Features
- WordNet hypernyms (flight → travel)
- Named entity types (PERSON, DATE)
- Domain vocabulary matches (gazetteer)
- Word2Vec sentence vector (average)
| Classifier | How | Pros | Cons |
|---|---|---|---|
| Logistic Regression | Softmax over TF-IDF features | Fast, interpretable, strong baseline | Linear, no feature interactions |
| SVM (linear) | Max margin in TF-IDF space | Best on high-dim sparse text | No native probability output (needs calibration, e.g. Platt scaling); multi-class awkward |
| MaxEnt (LogReg) | Maximize entropy subject to feature constraints | Handles overlapping features well | Same as logistic regression mathematically |
| Random Forest | Ensemble of decision trees | Non-linear, handles noise | Slower, less effective on sparse text |
| Gradient Boosting | XGBoost/LightGBM on hand-crafted features | Often best classical accuracy | Feature engineering still needed |
A confidence score is critical for production: when the model is uncertain, don't act on the prediction. Fall back to clarification or human handoff instead.
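The fallback rule can be sketched as a threshold on the top class probability. The 0.7 threshold, the `clarify` action name, and the raw scores below are assumptions for illustration; in practice the threshold is tuned on held-out data and the probabilities should be calibrated first.

```python
import math

def softmax(scores):
    """Convert raw classifier scores into probabilities."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

def route(scores, threshold=0.7):
    """Return (action, confidence): the intent, or 'clarify' when unsure."""
    probs = softmax(scores)
    intent = max(probs, key=probs.get)
    if probs[intent] < threshold:
        return "clarify", probs[intent]
    return intent, probs[intent]

confident = route({"book_flight": 4.0, "get_weather": 0.5, "cancel": 0.1})
uncertain = route({"book_flight": 1.0, "get_weather": 0.9, "cancel": 0.8})
```

The first call acts on `book_flight`; the second has three nearly tied intents and is routed to clarification.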
06 Named Entity Recognition
Identify and classify proper nouns: people, organizations, locations, dates, products. The backbone of information extraction.
IOB Tagging Scheme
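In the IOB scheme each token is tagged B-TYPE (begins an entity), I-TYPE (inside an entity), or O (outside any entity). The sketch below converts entity spans into IOB tags; the span format (start, end, type) with an exclusive end index and the PER/LOC labels are assumptions made for the example.

```python
def spans_to_iob(tokens, spans):
    """spans: list of (start, end, type) token spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = ["Tim", "Cook", "visited", "New", "York", "yesterday"]
spans = [(0, 2, "PER"), (3, 5, "LOC")]
tags = spans_to_iob(tokens, spans)
# tags: B-PER I-PER O B-LOC I-LOC O
```

Framing NER this way turns entity extraction into per-token sequence labeling, which is what HMMs and CRFs are built for.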
Classical NER Feature Engineering
This is where the craft was: every percentage point of F1 required careful feature design for each domain.
Gazetteers — The Secret Weapon
A gazetteer is a lookup table of known entities (lists of cities, company names, people names). Matching against gazetteers was one of the highest-value features. The challenge: ambiguity ("Apple" = company or fruit? "Jordan" = person or country?) and coverage (new entities constantly appear).
07 Word Vectors — The Bridge to Deep Learning
The single innovation that made classical NLP people take neural networks seriously.
The Problem with One-Hot Vectors
"cat" and "kitten" are similar, but one-hot vectors for them are orthogonal — dot product = 0, distance = √2. The representation encodes no semantic relationships. Every model had to learn similarity from scratch for every task.
Word2Vec (2013, Mikolov et al.)
Train a shallow neural network to predict neighboring words. The hidden layer weights become word embeddings. Two architectures:
CBOW — Continuous Bag of Words
Predict center word from context words. Input: average of context word embeddings → softmax → predict center word. Faster, better for frequent words.
Skip-gram
Predict context words from center word. For each training word, predict each word within a window of ±k. Slower, better for rare words and analogies.
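The difference between the two architectures shows up in how training examples are built from a sentence. A small sketch, assuming whitespace tokens and a symmetric window; the function names are illustrative, and the actual model training (softmax or negative sampling) is omitted.

```python
def skipgram_pairs(tokens, window=2):
    """One (center, context) pair per context word within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """One (context_words, center) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, center))
    return examples

tokens = "the quick brown fox".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
```

Skip-gram yields several training pairs per position while CBOW yields one averaged example, which is one reason CBOW trains faster and skip-gram gives rare words more updates.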
GloVe (2014) — Global Vectors
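For reference, the GloVe objective as stated in the original paper (Pennington et al., 2014) fits word vectors to the logarithm of global co-occurrence counts with a weighted least-squares loss. Here X_ij is the number of times word j appears in the context of word i, w and w̃ are the word and context vectors, b and b̃ are biases, and f is a weighting function that caps the influence of very frequent pairs; the paper reports α = 3/4 and x_max = 100.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```

Where Word2Vec streams over local windows, this objective is trained on the aggregated co-occurrence matrix.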
08 Classical Dialogue Systems
Building task-oriented chatbots before LLMs — finite state machines, slot filling, and the frame-based architecture.
The Spoken Dialogue System Architecture
Slot Filling — Beyond Intent
Intent detection tells you what the user wants. Slot filling fills in the parameters. For book_flight, the slots are: origin, destination, date, passengers. The dialogue manager tracks which slots are filled and asks for missing ones.
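The tracking logic can be sketched as a frame of required slots plus a rule that asks for the first missing one. The slot list for book_flight comes from the text above; the function names, the action tuples, and the NLU output format are hypothetical.

```python
REQUIRED_SLOTS = {
    "book_flight": ["origin", "destination", "date", "passengers"],
}

def update_frame(frame, nlu_slots):
    """Merge newly extracted slot values into the frame, ignoring empty ones."""
    merged = dict(frame)
    merged.update({k: v for k, v in nlu_slots.items() if v is not None})
    return merged

def next_action(intent, frame):
    """Ask for the first missing slot, or execute once the frame is complete."""
    for slot in REQUIRED_SLOTS[intent]:
        if slot not in frame:
            return ("ask", slot)
    return ("execute", intent)

# Turn 1: "Book me a flight to Paris"
frame = update_frame({}, {"destination": "Paris", "date": None})
first = next_action("book_flight", frame)

# Turn 2: "From London, on Tuesday, for two people"
frame = update_frame(frame, {"origin": "London", "date": "Tuesday", "passengers": 2})
second = next_action("book_flight", frame)
```

After the first turn the manager asks for the origin; after the second every slot is filled and the booking can be executed.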
Finite State vs. Frame-Based vs. Information-State
Finite State Machine
Hard-coded graph of states and transitions. "If in state GREET and user says YES → move to state COLLECT_ORIGIN." Fast, predictable. Can't handle unexpected inputs. Used for simple IVR (phone menu) systems.
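A finite state dialogue manager is little more than a transition table. In the sketch below only GREET, YES, and COLLECT_ORIGIN come from the example above; the remaining states and user acts are invented to complete the toy graph.

```python
# (current state, user act) -> next state
TRANSITIONS = {
    ("GREET", "yes"): "COLLECT_ORIGIN",
    ("GREET", "no"): "GOODBYE",
    ("COLLECT_ORIGIN", "city"): "COLLECT_DESTINATION",
    ("COLLECT_DESTINATION", "city"): "CONFIRM",
}

def step(state, user_act):
    """Follow a known transition; stay in place (and re-prompt) otherwise."""
    return TRANSITIONS.get((state, user_act), state)

state = "GREET"
for act in ["yes", "city", "city"]:
    state = step(state, act)
```

Any input that is not in the table leaves the system stuck in its current state, which is the brittleness noted above.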
Frame-Based (slot filling)
Fill slots in any order. More flexible than FSM. Can handle "Book Paris to London on Tuesday" in one shot. The standard for task-oriented bots 2000–2017 (ATIS, hotel booking).
Statistical Dialogue (POMDP)
Maintain a probability distribution over possible dialogue states. Choose next action to maximize expected reward. More robust to ASR/NLU errors. Cambridge Dialogue Systems Group (2006–2016).
Natural Language Generation (Rule-Based)
09 Information Retrieval — The Inverted Index
How search engines work, why BM25 still runs inside modern RAG, and the direct lineage from TF-IDF to dense retrieval.
Before neural retrieval, finding relevant documents from a corpus of millions required an efficient data structure. The inverted index, the core data structure of Lucene (which in turn powers Elasticsearch and Solr), maps each term to the list of documents containing it. Evaluating a conjunctive query becomes an intersection of sorted lists rather than a scan of the entire corpus.
Inverted Index Structure
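A minimal in-memory sketch of the structure: each term maps to a sorted posting list of document IDs, and an AND query is answered by merge-intersecting the lists. Whitespace tokenization and the function names are assumptions; real engines add compression, skip pointers, and positional data.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a sorted list of IDs of the documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index[term].append(doc_id)   # doc IDs arrive in increasing order
    return index

def intersect(a, b):
    """Merge-intersect two sorted posting lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search_and(index, query):
    """Return IDs of documents containing every query term."""
    postings = [index.get(t, []) for t in query.lower().split()]
    if not postings:
        return []
    postings.sort(key=len)               # start with the rarest term
    result = postings[0]
    for p in postings[1:]:
        result = intersect(result, p)
    return result

docs = [
    "cheap flights to paris",
    "paris weather today",
    "cheap hotels in paris",
    "weather in london",
]
index = build_index(docs)
```

For example, `search_and(index, "cheap paris")` touches only the two posting lists involved and never reads the text of the other documents.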
BM25 — The Backbone of Modern Search
BM25 (Best Match 25) is the ranking function used by Elasticsearch and the sparse component of most hybrid RAG pipelines. It extends TF-IDF with two key improvements: term frequency saturation (doubling word count doesn't double relevance) and document length normalisation (penalise long documents for having more term occurrences by chance).
k₁ ≈ 1.2 controls TF saturation. b ≈ 0.75 controls length normalisation. avgdl = average document length. As TF→∞, the per-term score approaches IDF·(k₁+1): it saturates, unlike raw TF-IDF, which grows without bound. BM25 has been the baseline that neural models must beat since 1994.
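In its usual form, the score of document d for query q is the sum over query terms t of IDF(t) · tf·(k₁+1) / (tf + k₁·(1 − b + b·|d|/avgdl)). The sketch below implements that formula directly, assuming whitespace tokenization and the IDF variant ln(1 + (N − df + 0.5)/(df + 0.5)); other IDF variants exist, so absolute scores differ between implementations.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        # Length normalisation: longer-than-average documents are penalised.
        norm = k1 * (1 - b + b * len(toks) / avgdl)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

docs = ["paris flights cheap", "paris paris paris hotels", "london weather"]
scores = bm25_scores("paris", docs)
```

The second document mentions "paris" three times but scores well under three times the first, which is the saturation effect in action.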
- Exact keyword match — never misses a literal term
- No GPU, no embeddings, runs on commodity hardware
- Interpretable: you know exactly why a document ranked
- Millisecond latency at billion-document scale
Dense retrieval understands semantics ("automobile" ≈ "car") but can miss exact matches (product codes, names, rare terms). BM25 never misses exact terms but doesn't understand semantics. Hybrid = BM25 + dense, merged with Reciprocal Rank Fusion (RRF). Consistently outperforms either alone.
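Reciprocal Rank Fusion needs only the rank positions from each retriever: every document receives the sum of 1/(k + rank) over the lists it appears in. A sketch follows, assuming the commonly used constant k = 60 and made-up document IDs.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_ranking = ["d1", "d2", "d3"]    # hypothetical sparse results
dense_ranking = ["d3", "d1", "d4"]   # hypothetical dense results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because only ranks are used, BM25 scores and cosine similarities never have to be put on a common scale.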
10 FastText — Subword Embeddings
Word2Vec's successor that handles morphologically rich languages, typos, and rare words — and is still fast enough to train in minutes.
Word2Vec gives a single embedding per word. Unknown words (unseen during training) get no embedding at all. For morphologically rich languages like Finnish, Turkish, or Hindi — where a word can have thousands of inflections — this is crippling. FastText (Facebook, 2017) extends Word2Vec by representing each word as a bag of character n-grams.
- Out-of-vocabulary words get meaningful embeddings from shared n-grams
- Robust to spelling variations and typos
- Handles regular morphology: "running", "runs", "runner" share n-grams such as "run" (irregular forms like "ran" share little or nothing)
- Excellent for agglutinative languages (Finnish, Turkish, Japanese)
- Still very fast: n-gram hashing keeps vocabulary manageable
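The subword decomposition can be sketched as follows: wrap the word in boundary markers, extract all character n-grams in a length range, and keep the whole word as one extra unit. The 3 to 6 range matches the defaults described in the FastText paper; the function name is illustrative, and the real library hashes n-grams into a fixed number of buckets rather than storing them as strings.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams FastText-style, plus the full word."""
    wrapped = f"<{word}>"
    grams = {wrapped}                      # the whole word is its own unit
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

trigrams = char_ngrams("where", n_min=3, n_max=3)
shared = char_ngrams("running") & char_ngrams("runs")
```

A word's vector is the sum of the vectors of its n-grams, so an unseen word such as "runnable" still gets a representation from pieces like "<run" that were seen during training.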
FastText also includes a text classifier: average word n-gram vectors, apply a linear classifier. Trains in seconds on millions of documents. Still a strong baseline for intent detection and text classification — outperformed by BERT but at 1/1000 the inference cost. Used in production at Facebook/Meta for language identification (176 languages).
11 ELMo, BERT & the Bridge to Modern LLMs
The 2018–2019 revolution: contextualised embeddings and pretraining that ended the classical NLP era overnight.
The Problem Word2Vec Couldn't Solve: Polysemy
Word2Vec gives a single static vector per word. "Bank" (financial institution) and "bank" (river bank) have the same embedding. "Play" (theatre play) and "play" (sport play) are identical. Context is completely ignored. Every downstream task had to figure out word sense disambiguation itself.
ELMo (2018) — Contextual Embeddings from LSTMs
ELMo (Embeddings from Language Models) trained a deep bidirectional LSTM to predict the next word (forward) and previous word (backward) on a large corpus. The key insight: instead of using the word's static embedding, use the internal activations of the LSTM — which change depending on the surrounding context — as the word representation.
BERT (2018) — Bidirectional Transformer Pretraining
BERT (Bidirectional Encoder Representations from Transformers) replaced ELMo's BiLSTM with a Transformer encoder, and replaced next-word prediction with two objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
Randomly select 15% of tokens for prediction (in the original recipe, 80% of those are replaced with [MASK], 10% with a random token, and 10% are left unchanged). Train the model to predict the original tokens from the bidirectional context. Unlike GPT (predict next token, left-to-right), MLM lets every token attend to tokens on both sides, giving much richer contextual representations. The input "The [MASK] sat on the mat" forces the model to use all surrounding context to predict "cat".
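The data side of MLM can be sketched in a few lines. This is a deliberately simplified version: every selected token is replaced with [MASK], whitespace tokens stand in for WordPiece tokens, and the function name and seed parameter are illustrative.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)        # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)       # position is ignored by the loss
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens)
all_masked, all_labels = mask_tokens(tokens, mask_prob=1.0)
```

The loss is computed only at positions with a label, so the model sees the full sentence on both sides of each gap while predicting it.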
Add a task-specific head on top of BERT's [CLS] token output (classification) or per-token output (NER, QA). Fine-tune all weights jointly. Results: BERT set a new state of the art on 11 NLP tasks at once. On CoNLL-2003 NER, test F1 moved from roughly 91 (BiLSTM-CRF) to about 92.8 (BERT-large), with no hand-built features. The feature engineering era ended in weeks.
Embedding Evolution: Word2Vec → ELMo → BERT → LLMs
12 The Transition to Deep Learning
Why classical methods gave way — and what the tipping points were.
What Classical Methods Couldn't Solve
Feature Engineering Bottleneck
Every task needed a new set of manually designed features. POS tagger features differ from NER features differ from sentiment features. Expertise was non-transferable. Neural networks learn features automatically from raw text.
No Transfer Learning
A TF-IDF+SVM model for spam detection couldn't help an intent classifier — no shared representation. Deep learning enabled pretraining on large corpora and fine-tuning on specific tasks — the modern paradigm.
Long-Range Dependencies
N-gram models are local. CRFs see a small window. "The bank that the customer who the employee that the department hired…" requires tracking nested dependencies across many tokens. LSTMs and Transformers handle this; classical models don't.
Semantic Compositionality
"Not bad" ≠ bad. "The best way to ruin a meal" (sarcasm). Composing word meanings into sentence meaning required complex hand-crafted logic in classical systems. Neural networks learn composition implicitly.
What Classical Methods Still Do Better
Interpretability
You can inspect every feature weight in a logistic regression or SVM. "This prediction is primarily driven by the word 'cancel'." Try explaining a 70B parameter attention pattern.
Low-Data Regimes
With 50–200 labeled examples, Naive Bayes and logistic regression can match or outperform fine-tuned LLMs. Classical models don't need 10K+ examples to work, so in highly specialized domains with little data they remain competitive.
Latency & Cost
TF-IDF + logistic regression classifies in microseconds on CPU at negligible cost. As a rough order of magnitude, a BERT inference costs around $0.0001 and takes tens of milliseconds, and an LLM API call can cost 100× more. At scale, classical models run inside other models' preprocessing pipelines.
Predictability
Rule-based and classical models behave deterministically. "If input contains regex /\bcancel\b/, fire cancel intent" — 100% reproducible. LLMs introduce stochasticity that can be hard to debug in production.
Where They Coexist Today
Classical NLP taught us that text representation, feature design, and probabilistic sequence modeling matter. Deep learning automated the feature engineering. Transformers automated the sequence modeling. LLMs automated the task-specific fine-tuning. But the problems are the same: intent, slots, entities, context, disambiguation. The tools evolved; the problem structure didn't. Understanding the classical approaches makes you a better engineer of the modern ones — you know what the neural networks are implicitly learning to do, and when a simpler classical tool is the right one.