Historical Foundation · Pre-Deep Learning Era

Before Neural Networks
Classical NLP & Skill Models

How engineers built working intent detectors, entity recognizers, dialogue systems, and search engines before "deep learning" existed — and what still matters today.

00 The Pre-Deep Learning Era

~1990 to ~2015: when clever feature engineering and probabilistic models ruled NLP.

1950s–70s

Symbolic / Rule-Based AI

Hand-crafted rules, hand-built decision trees, pattern matching. ELIZA (1966) simulated conversation with keyword patterns and canned response templates. SHRDLU (1970) parsed natural language into symbolic logic. Both worked in narrow domains and failed to generalize.

1980s–90s

Statistical NLP Begins

Hidden Markov Models for speech and POS tagging. N-gram language models. IBM's statistical machine translation. Shift: replace linguistic rules with probabilities estimated from data. "Every time I fire a linguist, the performance of the speech recognizer goes up." (attributed to Frederick Jelinek; the exact wording varies by source).

2000–2010

Feature Engineering + Discriminative Models

Support Vector Machines, Maximum Entropy (logistic regression), Conditional Random Fields. Engineers hand-craft features: POS tags, dependency parses, character n-grams, capitalization patterns. These features feed powerful classifiers. State of the art for intent, NER, and sentiment.

2013

Word2Vec — Embeddings Arrive

Word representations learned from co-occurrence statistics. "king − man + woman ≈ queen." Distributed representations replace sparse one-hot vectors. This bridged classical and deep learning NLP.

2014–2017

Deep Learning Takes Over

BiLSTM-CRF becomes NER standard. CNN for text classification. Seq2Seq for translation. Then: Transformers (2017). The feature engineering era ends almost overnight.

🔑
Why study this? (1) Production systems still use classical models where latency, interpretability, or data size make LLMs overkill. (2) Features and heuristics from this era (TF-IDF, BM25, regex pre-filters) still run inside RAG pipelines and search systems. (3) Understanding what LLMs replaced makes you appreciate what they do. (4) Interview questions — HMMs, n-grams, Naive Bayes still appear.

01 Representing Text — The Core Problem

ML models operate on numbers. Text is symbols. Every classical NLP method starts with this conversion.

Bag of Words (BoW)

Represent each document as a vector of word counts. Vocabulary size = V (typically 10K–100K). Each document → sparse vector in ℝᵛ. Order is discarded — "cat ate dog" = "dog ate cat".

Bag of Words Transformation

[Figure: "the cat sat on the mat" → tokenize and count words → sparse vector with V dimensions: cat:1, mat:1, on:1, sat:1, the:2, all other entries 0. The vector is about 99.9% zeros, and function words such as "the", "a", and "and" dominate the counts, which motivates TF-IDF.]

TF-IDF — Weighting by Importance

Common words ("the", "is") appear in every document — they carry no discriminative signal. TF-IDF downweights frequent words and upweights rare, informative ones.

TF(t,d) = count(t,d) / |d|
IDF(t) = log(N / df(t))
TF-IDF(t,d) = TF(t,d) × IDF(t)

TF = term frequency in document d. IDF = inverse document frequency across a corpus of N docs; df(t) = number of docs containing term t. High TF-IDF: the word appears often in this doc but rarely elsewhere, so it is distinctive. BM25 is a refined version still used in Elasticsearch, Solr, and as the sparse component in hybrid RAG.
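The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming whitespace tokenization and a three-document toy corpus; the function name `tf_idf` and the corpus are illustrative choices, not part of the original text.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Uses TF = count / |d| and IDF = log(N / df), as defined above.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # df(t): number of documents containing term t at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "cat and dog are pets",
]
weights = tf_idf(docs)
top_term_doc1 = max(weights[0], key=weights[0].get)  # "mat": appears in one doc only
```

In this tiny corpus "mat" gets the highest weight in the first document because it occurs nowhere else. A term that occurred in all three documents would get weight log(3/3) = 0.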

N-grams — Capturing Local Word Order

BoW loses word order. N-grams add contiguous sequences of N words as features. "New York" as a bigram is different from "new" + "york" separately.

// Unigrams: ["the","cat","sat","on","the","mat"]
// Bigrams: ["the cat","cat sat","sat on","on the","the mat"]
// Trigrams: ["the cat sat","cat sat on","sat on the","on the mat"]
// Character n-grams (n=3, with boundary markers < and >): "cat" → ["<ca","cat","at>"]; robust to typos and morphology
// Language model (trigram): P("the cat sat on the mat") = P("the")·P("cat"|"the")·P("sat"|"the cat")·P("on"|"cat sat")···
// Problem: vocabulary explosion. V=50K unigrams → V²=2.5B possible bigrams.
// Solution: keep only the top-K most frequent n-grams. Prune the rest.
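A small sketch of word n-gram extraction with top-K pruning, as described above. It assumes whitespace tokenization; the helper names `word_ngrams` and `top_k_ngrams` and the toy corpus are hypothetical.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return the contiguous n-word sequences in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_k_ngrams(docs, n, k):
    """Count n-grams over a corpus and keep only the k most frequent."""
    counts = Counter()
    for doc in docs:
        counts.update(word_ngrams(doc.lower().split(), n))
    return [ngram for ngram, _ in counts.most_common(k)]

docs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "cat and dog are pets",
]
bigrams = word_ngrams("the cat sat on the mat".split(), 2)
kept = top_k_ngrams(docs, n=2, k=2)  # only "sat on" and "on the" occur twice
```

In a real feature pipeline the kept n-grams would become the columns of the sparse feature matrix; everything else is dropped.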

N-gram Language Model

P(w₁,w₂,...,wₙ) = Π_{t=1}^{n} P(wₜ | wₜ₋₁,...,w₁) ≈ Π_{t=1}^{n} P(wₜ | wₜ₋(k−1),...,wₜ₋₁)

Markov assumption: only the last k−1 words matter (k=2: bigram, k=3: trigram). Probabilities are estimated from corpus counts with smoothing (Kneser-Ney, Good-Turing) to handle unseen n-grams. Used in speech recognition, early machine translation, autocomplete. Limitation: no context beyond the previous k−1 words.
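As a concrete instance, here is a bigram model (k=2) estimated from counts. This sketch swaps in add-one (Laplace) smoothing because it fits in a few lines; it is not Kneser-Ney or Good-Turing. The `<s>` and `</s>` sentence markers and the two training sentences are assumptions made for the example.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Collect context counts, bigram counts, and the vocabulary."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that precedes another
        bigrams.update(zip(tokens, tokens[1:]))  # (previous, current) pairs
        vocab.update(sentence.split())
    return contexts, bigrams, vocab

def bigram_prob(prev, word, contexts, bigrams, vocab):
    """P(word | prev) with add-one smoothing over the vocabulary."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

def sentence_logprob(sentence, contexts, bigrams, vocab):
    """Sum of log P(w_t | w_{t-1}) over the sentence, including the end marker."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log(bigram_prob(p, w, contexts, bigrams, vocab))
        for p, w in zip(tokens, tokens[1:])
    )

contexts, bigrams, vocab = train_bigram_lm(["the cat sat", "the dog sat"])
p_cat_given_the = bigram_prob("the", "cat", contexts, bigrams, vocab)  # (1+1)/(2+5)
```

Smoothing gives the unseen pair ("cat", "dog") a small non-zero probability instead of zeroing out the whole sentence.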

02 Naive Bayes — The Classic Text Classifier

Surprisingly competitive despite its extreme simplifying assumption. Still used in spam filters and as a fast baseline.

Naive Bayes applies Bayes' theorem to text classification, making the "naive" assumption that all word features are conditionally independent given the class. This is false (word order and co-occurrence clearly matter), yet the model works remarkably well in practice.

P(class|doc) ∝ P(class) · Π_{t ∈ doc} P(wₜ|class)

P(class) = prior probability of each class (from training set frequencies). P(wₜ|class) = probability of word t in documents of this class = count(t, class) / total_words_in_class. The "∝" means we normalize at the end. Classification: argmax_c P(c) · Π P(wₜ|c).

// Training:
for each class c:
    P(c) = num_docs_in_class_c / total_docs
    for each word w:
        P(w|c) = (count(w in class c) + 1) / (total_words_in_c + V)   // +1 Laplace smoothing

// Prediction (log space to avoid underflow from many small probabilities):
score(c) = log P(c) + Σ_{w ∈ doc} log P(w|c)
predicted_class = argmax_c score(c)
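The training and prediction steps above translate almost line for line into Python. This is a sketch under stated assumptions: whitespace tokenization, a four-message toy training set invented for illustration, and words never seen in training are skipped at prediction time.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Count documents per class and words per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict_nb(text, class_docs, word_counts, vocab):
    """Return (best_class, {class: log score}) using Laplace smoothing."""
    total_docs = sum(class_docs.values())
    scores = {}
    for c in class_docs:
        total_words = sum(word_counts[c].values())
        score = math.log(class_docs[c] / total_docs)      # log prior
        for w in text.lower().split():
            if w in vocab:                                # skip unseen words
                score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get), scores

train = [
    ("win cash prize now", "spam"),
    ("cheap prize offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train_nb(train)
label, scores = predict_nb("cash prize", *model)
```

The scores are unnormalized log probabilities, which is enough for the argmax. Exponentiating and normalizing them would give the (poorly calibrated) class probabilities.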

Strengths

  • Extremely fast: training is a single counting pass over the corpus; prediction is O(|doc|) per class
  • Works well with little data
  • Naturally handles new features (just add new words)
  • Strong baseline — often within 5% of complex models
  • Interpretable: you can inspect P(w|class) directly

Weaknesses

  • Independence assumption: "not good" treated as "not" + "good"
  • Probability calibration is poor (argmax is fine; the probabilities themselves aren't)
  • Long documents: domination by high-frequency features
  • No feature interactions

03 SVM — The Pre-Deep Learning Champion

From 2000–2012, SVMs were the dominant model for text classification, sentiment, and many NLP tasks. Understanding why reveals something deep about generalization.

The Core Idea: Maximum Margin

Instead of finding any hyperplane that separates classes, find the one with the maximum margin — maximum distance from the hyperplane to the nearest training points (support vectors). This margin-maximization is theoretically linked to better generalization.

Maximize margin: 2/||w||   subject to: yᵢ(wᵀxᵢ + b) ≥ 1 for all i
Equivalently: minimize (1/2)||w||² subject to the same constraints

This is a convex quadratic program: one global optimum, efficiently solvable. The solution depends only on the support vectors (the points nearest the boundary). Most training points don't affect the hyperplane at all. This "sparsity" is one reason SVMs generalize well even in high dimensions.

Soft-margin SVM: minimize (1/2)||w||² + C·Σᵢ ξᵢ   subject to: yᵢ(wᵀxᵢ + b) ≥ 1−ξᵢ, ξᵢ ≥ 0

ξᵢ = slack variable: how much point i violates the margin. C sets the penalty on violations: a large C punishes them heavily, so the model fits the training data more closely with a narrower margin; a small C tolerates violations in exchange for a wider margin (more regularization). This is the bias-variance tradeoff in SVM form.
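To make the role of the slack variables and C concrete, the sketch below evaluates the soft-margin objective for a fixed, hand-picked hyperplane. It does not train an SVM. The weight vector, bias, and four 2-D points are invented for illustration.

```python
def soft_margin_objective(w, b, X, y, C):
    """Return ((1/2)||w||^2 + C * sum of slacks, list of slacks)."""
    slacks = []
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        slacks.append(max(0.0, 1.0 - margin))   # xi_i = 0 when the constraint holds
    objective = 0.5 * sum(wj * wj for wj in w) + C * sum(slacks)
    return objective, slacks

w, b = [1.0, 0.0], 0.0
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [0.5, 1.0]]
y = [1, 1, -1, -1]

obj_small_c, slacks = soft_margin_objective(w, b, X, y, C=1.0)
obj_large_c, _ = soft_margin_objective(w, b, X, y, C=10.0)
```

The second point sits inside the margin (slack 0.5) and the fourth is misclassified (slack 1.5). Raising C from 1 to 10 multiplies the cost of those two violations, which is what pushes a trained SVM toward fitting them.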

The Kernel Trick — Non-Linear SVMs

The SVM decision function only involves dot products between training points: wᵀx = Σᵢ αᵢyᵢ (xᵢ · x). Replace the dot product with a kernel function K(xᵢ, x) — implicitly mapping to a high-dimensional space without computing it explicitly.

Common Kernels

Linear: K(x,x') = xᵀx'
RBF: K(x,x') = exp(−γ||x−x'||²)
Poly: K(x,x') = (xᵀx' + c)ᵈ
String: K(s,s') = # common substrings
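The first three kernels, plus the kernelized decision function Σᵢ αᵢyᵢ K(xᵢ, x) + b, fit in a short sketch. The support vectors, α values, and bias below are hand-picked placeholders rather than the output of a trained SVM.

```python
import math

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def poly_kernel(x, z, c=1.0, d=2):
    return (linear_kernel(x, z) + c) ** d

def decision(x, support_vectors, alphas, labels, b, kernel):
    """Kernelized decision value: sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(
        a * y * kernel(sv, x)
        for a, y, sv in zip(alphas, labels, support_vectors)
    ) + b

# Hypothetical support vectors with one point per class
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
labels = [1, -1]
alphas = [1.0, 1.0]

pos = decision([0.9, 0.8], support_vectors, alphas, labels, 0.0, rbf_kernel)
neg = decision([-1.0, -0.5], support_vectors, alphas, labels, 0.0, rbf_kernel)
```

Swapping `rbf_kernel` for `linear_kernel` changes the decision surface without touching `decision` itself, which is the practical meaning of the kernel trick.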

Why SVMs Dominated NLP (2000–2012)

Text in TF-IDF space is already high-dimensional and nearly linearly separable, so linear kernel SVMs work extremely well. String kernels could capture n-gram features without explicitly computing them. SVMs came with solid theoretical guarantees, and given enough training data they clearly outperformed Naive Bayes.

04 Sequence Labeling — HMM & CRF

Text isn't just a bag of words — it's a sequence where each element's label depends on its neighbors. HMM and CRF are the two key models for this.

Hidden Markov Model (HMM)

A generative model: assume the observed words are produced by a hidden sequence of states (e.g., POS tags). Learn the model that most likely generated the observations.

HMM: Hidden States Generate Observations

[Figure: hidden states (POS tags) DET → NOUN → VERB → NOUN, linked by transition probabilities P(NOUN|DET), P(VERB|NOUN), P(NOUN|VERB), emit the observed words "The", "cat", "sat", "mat" with emission probabilities such as P(cat|NOUN) and P(sat|VERB).]

P(w₁..wₙ, t₁..tₙ) = Π_{i=1}^n P(wᵢ|tᵢ) · P(tᵢ|tᵢ₋₁)

Emission probability P(w|t): how likely is word w given tag t (learned from counts). Transition probability P(t|t'): how likely is tag t after t' (learned from counts). Decoding (Viterbi algorithm): find the most likely tag sequence in O(T·|S|²) via dynamic programming, where T is the sentence length and |S| is the number of tags.
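A compact Viterbi decoder shows where the O(T·|S|²) cost comes from: for each of T positions, every state considers every previous state. The start, transition, and emission probabilities below are made-up toy values, not estimates from a corpus.

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observed words."""
    # best[t][s] = (probability of the best path ending in s at position t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
    for t in range(1, len(words)):
        column = {}
        for s in states:
            column[s] = max(
                (best[t - 1][p][0] * trans_p[p][s] * emit_p[s].get(words[t], 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the best final state
    path = [max(states, key=lambda s: best[-1][s][0])]
    for t in range(len(words) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return list(reversed(path))

states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9},
    "NOUN": {"cat": 0.5, "mat": 0.4, "sat": 0.01},
    "VERB": {"sat": 0.8},
}
tags = viterbi(["the", "cat", "sat"], states, start_p, trans_p, emit_p)
```

Production implementations work in log space to avoid underflow on long sentences; plain probabilities are used here only to keep the arithmetic readable.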

CRF — Conditional Random Field

HMMs are generative (model P(words, tags)). CRFs are discriminative — directly model P(tags|words). This means they can use arbitrary overlapping features of the observation, not just the current word. This is the key advantage.

P(y|x) = (1/Z(x)) · exp(Σ_t Σ_k λₖ fₖ(yₜ₋₁, yₜ, x, t))

fₖ = feature functions (arbitrary; each can look at any part of the input x around position t). λₖ = learned weights. Z(x) = partition function (normalizer over all possible tag sequences). Features can be: "Is current word capitalized AND previous tag is O?" This is impossible in an HMM, which only conditions on the current word emission.
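The following sketch computes P(y|x) for a tiny linear-chain CRF by brute force, enumerating every tag sequence to obtain Z(x). That is exponential in sentence length and only viable for toy inputs; real CRF implementations compute Z(x) with the forward algorithm. The two feature types and all weights are hypothetical.

```python
import math
from itertools import product

def sequence_score(words, tags, weights):
    """Sum of lambda_k * f_k over all positions for one candidate tag sequence."""
    score, prev = 0.0, "START"
    for word, tag in zip(words, tags):
        if word[0].isupper():                              # observation feature
            score += weights.get(("cap", tag), 0.0)
        score += weights.get(("trans", prev, tag), 0.0)    # transition feature
        prev = tag
    return score

def crf_probability(words, tags, tagset, weights):
    """P(tags | words) with the partition function computed by enumeration."""
    z = sum(
        math.exp(sequence_score(words, cand, weights))
        for cand in product(tagset, repeat=len(words))
    )
    return math.exp(sequence_score(words, tags, weights)) / z

tagset = ["PER", "O"]
weights = {("cap", "PER"): 2.0, ("trans", "PER", "O"): 1.0, ("trans", "O", "O"): 0.5}
words = ["Obama", "spoke"]

best = max(product(tagset, repeat=len(words)),
           key=lambda cand: sequence_score(words, cand, weights))
p_best = crf_probability(words, best, tagset, weights)
```

Because the score is a sum over arbitrary feature functions, adding a new overlapping feature (a suffix, a gazetteer hit) only means adding another weighted term inside `sequence_score`.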

CRF Feature Examples (for NER)

  • "Current word is capitalized"
  • "Previous word is 'Dr.'"
  • "Word ends in '-ington' (likely location)"
  • "Current word in gazetteer of company names"
  • "Previous tag is B-PER"
  • "Word shape: Xxxxx (capitalized, 5 letters)"
  • "Is current word a number?"
  • "POS tag of current word is NNP"

HMM vs CRF

Aspect | HMM | CRF
Model type | Generative | Discriminative
Features | Current word only | Any features of input
Speed | Fast inference | Slower (partition fn)
Accuracy (NER) | ~88–90 F1 | ~91–94 F1
Still relevant: BiLSTM-CRF (2015), the deep learning NER standard before Transformers, combines a BiLSTM for context-aware embeddings with a CRF output layer for globally consistent tag sequences. The CRF layer learns transition scores that rule out invalid sequences such as I-ORG immediately after B-PER. Many production NER systems still use this architecture today.

05 Intent Detection — A Complete System

The bedrock of voice assistants and chatbots. How it was built before LLMs.

Intent detection answers: "What is the user trying to do?" Given "Book me a flight to Paris", the intent is book_flight. Given "What's the weather like?", the intent is get_weather. This is fundamentally a text classification problem.

The Classical Intent Pipeline

Raw Text → Preprocessing → Feature Extraction → Classifier → Intent + Confidence

Worked example:
INPUT: "Book me a flightt to Paris pls!!"
PREPROCESS (lowercase · strip punct · spell-fix): [book, me, a, flight, to, paris, pls]
STOPWORDS + LEMMATIZE: [book, flight, paris]
TF-IDF VECTOR (sparse, |V| ≈ 10K dims): [0, 0, …, book: 2.31, …, flight: 3.84, …, paris: 4.12, …, 0]
SVM SCORES (one-vs-rest): book_flight 0.91 · get_weather 0.04 · cancel_booking 0.03
OUTPUT: intent = book_flight (0.91); slots: destination = "Paris" (NER)
…but say "I need to be in Paris by Friday" and it breaks.
// 1. Lowercasing
"Book ME a FLIGHT" → "book me a flight"

// 2. Tokenization (careful! contractions, punctuation)
"don't book it" → ["do", "n't", "book", "it"]

// 3. Stopword removal (optional; sometimes hurts intent detection)
["book", "flight", "paris"]   // removed: "me", "a", "to"

// 4. Stemming / Lemmatization
"booking" → "book", "flights" → "flight"   (conflate morphological variants)

// 5. Entity normalization (critical for generalization)
"Book a flight to Paris" → "Book a flight to CITY"
"Set alarm for 3pm" → "Set alarm for TIME"
// Replace named entities with type tags so the classifier isn't city-specific
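A minimal sketch of steps 1, 2, 3, and 5 chained together. It assumes a regex tokenizer, a four-word stopword list, and a three-city gazetteer, all of which are stand-ins chosen for the example; spell-fixing and lemmatization are left out because they need external resources.

```python
import re

STOPWORDS = {"me", "a", "to", "the"}        # illustrative, not a standard list
CITIES = {"paris", "london", "delhi"}       # hypothetical gazetteer

def preprocess(text):
    """Lowercase, tokenize, drop stopwords, and replace known cities with CITY."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())       # also strips punctuation
    tokens = [t for t in tokens if t not in STOPWORDS]
    return ["CITY" if t in CITIES else t for t in tokens]

tokens = preprocess("Book me a flight to Paris pls!!")
# tokens == ["book", "flight", "CITY", "pls"]
```

After normalization, "flight to Paris" and "flight to London" produce identical features, so the classifier learns the intent rather than memorizing destinations.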

Lexical Features

  • TF-IDF unigrams + bigrams
  • First/last word indicators
  • Word length buckets
  • Presence of question words ("what","how","when")
  • Punctuation pattern

Syntactic Features

  • POS tag sequence
  • Dependency parse head words
  • Sentence length
  • Verb type (action vs state)
  • Root verb of the sentence

Semantic Features

  • WordNet hypernyms (flight → travel)
  • Named entity types (PERSON, DATE)
  • Domain vocabulary matches (gazetteer)
  • Word2Vec sentence vector (average)
Classifier | How | Pros | Cons
Logistic Regression | Softmax over TF-IDF features | Fast, interpretable, strong baseline | Linear, no feature interactions
SVM (linear) | Max margin in TF-IDF space | Best on high-dim sparse text | No probability output; multi-class awkward
MaxEnt (LogReg) | Maximize entropy subject to feature constraints | Handles overlapping features well | Same as logistic regression mathematically
Random Forest | Ensemble of decision trees | Non-linear, handles noise | Slower, less effective on sparse text
Gradient Boosting | XGBoost/LightGBM on hand-crafted features | Often best classical accuracy | Feature engineering still needed

Confidence score is critical for production: when the model is uncertain, don't act on the prediction — fall back to clarification or human handoff.

// Confidence thresholds (typical production system):
if confidence > 0.85: execute intent directly
else if confidence > 0.60: confirm with user ("Do you want to book_flight?")
else: ask clarifying question or hand off to human

// Multi-intent detection: sometimes a single utterance has multiple intents
// "Book a flight AND a hotel" → [book_flight, book_hotel]
// Solution: hierarchical classification or multi-label output

// Out-of-scope detection: utterance doesn't match any trained intent
// Approach 1: extra "other" class trained on diverse out-of-scope examples
// Approach 2: reject if max confidence < threshold regardless of class
// Approach 3: train a binary in-scope/out-of-scope classifier first

06 Named Entity Recognition

Identify and classify proper nouns: people, organizations, locations, dates, products. The backbone of information extraction.

IOB Tagging Scheme

// BIO (or IOB) labeling: every token gets a tag
B-TYPE: beginning of an entity span
I-TYPE: inside (continuation) of an entity
O: outside (not an entity)

// Example:
Barack  Obama  visited  New    York   City   on  Monday
B-PER   I-PER  O        B-LOC  I-LOC  I-LOC  O   B-DATE

// Why the B/I distinction matters: "New York New York" needs
// B-LOC I-LOC B-LOC I-LOC to identify two separate mentions
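Downstream code usually wants entity spans rather than per-token tags. The sketch below converts a BIO sequence into (text, type) spans. The function name `bio_to_spans` is hypothetical, and treating a stray I- tag of a different type as the start of a new span is a design assumption, not a universal convention.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)            # continue the open span
        else:                                # "O" closes any open span
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans

tokens = "Barack Obama visited New York City on Monday".split()
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC", "O", "B-DATE"]
spans = bio_to_spans(tokens, tags)
```

Running it on "New York New York" tagged B-LOC I-LOC B-LOC I-LOC yields two separate LOC spans, which is exactly what the B tag exists to make possible.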

Classical NER Feature Engineering

This is where the craft was — every percentage point of F1 required careful feature design for each domain:

# For the token "Obama" at position i:
features = {
    "word.lower()": "obama",
    "word.isupper()": False,
    "word.istitle()": True,            # starts with capital
    "word.isdigit()": False,
    "word[-3:]": "ama",                # last 3 chars (suffix pattern)
    "word[:3]": "Oba",                 # first 3 chars (prefix pattern)
    "prev_word.lower()": "barack",     # context window
    "next_word.lower()": "visited",
    "in_person_gazetteer": True,       # lookup table of known names
    "word_shape": "Xxxxx",             # capitalization pattern
    "pos_tag": "NNP",                  # proper noun
    "prev_label": "B-PER",             # CRF: previous prediction
}
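A feature dictionary like the one above is typically produced by a small function applied at every token position. The sketch below covers the purely lexical features. The gazetteer, POS tag, and previous-label entries are omitted because they need external resources, and the `<BOS>`/`<EOS>` padding markers are an assumption.

```python
def word_shape(word):
    """Map uppercase to X, lowercase to x, digits to d; keep other characters."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)

def token_features(tokens, i):
    """Lexical and context-window features for the token at position i."""
    word = tokens[i]
    return {
        "word.lower()": word.lower(),
        "word.isupper()": word.isupper(),
        "word.istitle()": word.istitle(),
        "word.isdigit()": word.isdigit(),
        "word[-3:]": word[-3:],
        "word[:3]": word[:3],
        "prev_word.lower()": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word.lower()": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        "word_shape": word_shape(word),
    }

features = token_features(["Barack", "Obama", "visited"], 1)
```

A CRF toolkit would receive one such dictionary per token, with the label-transition features handled by the model itself.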

Gazetteers — The Secret Weapon

A gazetteer is a lookup table of known entities (lists of cities, company names, people names). Matching against gazetteers was one of the highest-value features. The challenge: ambiguity ("Apple" = company or fruit? "Jordan" = person or country?) and coverage (new entities constantly appear).

🏆
Why NER was hard: progress came in small steps, and exact figures depend on the benchmark. As rough F1 figures, taking CoNLL-2003 English as the reference: ~85% (HMM) to ~89% (best feature-engineered systems in the 2003 shared task) to ~91% (BiLSTM-CRF, 2016) to ~93% (BERT fine-tuned, 2018). Each step required a different model family. The CRF → BiLSTM step happened because deep learning automatically learned features from raw tokens, eliminating years of manual feature engineering. The BERT step came from pretraining on massive text.

07 Word Vectors — The Bridge to Deep Learning

The single innovation that made classical NLP people take neural networks seriously.

The Problem with One-Hot Vectors

"cat" and "kitten" are similar, but one-hot vectors for them are orthogonal — dot product = 0, distance = √2. The representation encodes no semantic relationships. Every model had to learn similarity from scratch for every task.

Word2Vec (2013, Mikolov et al.)

Train a shallow neural network to predict neighboring words. The hidden layer weights become word embeddings. Two architectures:

CBOW — Continuous Bag of Words

Predict center word from context words. Input: average of context word embeddings → softmax → predict center word. Faster, better for frequent words.

P(w|context) = softmax(W₂ · mean({eᵥ : v ∈ context}))

eᵥ = the row of W₁ for context word v. W₁ = embedding matrix (what we keep). W₂ = output projection (discarded).

Skip-gram

Predict context words from center word. For each training word, predict each word within a window of ±k. Slower, better for rare words and analogies.

Objective (maximize, summed over all positions t): Σ_{-k≤j≤k, j≠0} log P(wₜ₊ⱼ | wₜ)

With negative sampling: for each positive pair, sample K negative (random) words and train a binary classifier: is this a real neighbor or noise? Much faster than a full softmax.
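The skip-gram objective sums over (center, context) pairs. The sketch below only generates those positive training pairs for a window of ±k. It does not train embeddings or perform negative sampling, and the function name is hypothetical.

```python
def skipgram_pairs(tokens, k):
    """Return (center, context) pairs for every word within +/- k positions."""
    pairs = []
    for t, center in enumerate(tokens):
        lo, hi = max(0, t - k), min(len(tokens), t + k + 1)
        for j in range(lo, hi):
            if j != t:                      # j != 0 offset in the formula
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs_k1 = skipgram_pairs(tokens, 1)
pairs_k2 = skipgram_pairs(tokens, 2)
```

Each pair becomes one positive example; with negative sampling, K randomly drawn words would be paired with the same center word as negatives.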
// The famous analogy test that shocked the NLP world:
king − man + woman ≈ queen
vec("king") − vec("man") + vec("woman") ≈ vec("queen")

// More examples:
Paris − France + Germany ≈ Berlin
walking − walk + run ≈ running
biggest − big + small ≈ smallest

// Why it works: the embedding space encodes semantic relationships
// as consistent geometric directions (gender, capital-of, tense, etc.)
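The analogy test is a nearest-neighbor search by cosine similarity around the vector a − b + c. The sketch below uses hand-built 2-D vectors (axes loosely meaning "royalty" and "gender") purely to show the mechanics. They are not learned embeddings, and real Word2Vec vectors only satisfy such analogies approximately.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy, hand-constructed vectors: [royalty, gender]
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [0.1, 0.0],
}
answer = analogy("king", "man", "woman", vectors)
```

Excluding the three input words from the candidates mirrors common practice in analogy evaluations; without it, the nearest neighbor is often one of the inputs.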

GloVe (2014) — Global Vectors

J = Σ_{i,j} f(Xᵢⱼ) (wᵢᵀwⱼ + bᵢ + bⱼ − log Xᵢⱼ)²

Xᵢⱼ = co-occurrence count of words i and j in a window. f(x) = min(1, (x/xmax)^α) = weighting function (caps the influence of very frequent pairs). Minimizing J pushes the dot product of two word vectors, plus their bias terms, toward the log of their co-occurrence count. GloVe explicitly uses global corpus statistics, in contrast to Word2Vec's local window sampling.
🌉
The bridge: Word2Vec and GloVe gave classical NLP engineers a way to use neural representations without training neural networks end-to-end. Plug the pre-trained 300-dim vectors into your CRF or SVM as features — instant improvement. This broke down the resistance to "neural methods" and set the stage for LSTM and Transformer adoption.

08 Classical Dialogue Systems

Building task-oriented chatbots before LLMs — finite state machines, slot filling, and the frame-based architecture.

The Spoken Dialogue System Architecture

ASR (Speech → Text)
NLU (Intent + Slots)
Dialogue Manager
NLG (Response)
TTS (Text → Speech)

Slot Filling — Beyond Intent

Intent detection tells you what the user wants. Slot filling fills in the parameters. For book_flight, the slots are: origin, destination, date, passengers. The dialogue manager tracks which slots are filled and asks for missing ones.

# Frame for book_flight intent:
frame = {
    "intent": "book_flight",
    "slots": {
        "origin":      {"value": "Delhi", "filled": True},
        "destination": {"value": "Paris", "filled": True},
        "date":        {"value": None, "filled": False},                        # ← ask for this
        "class":       {"value": None, "filled": False, "default": "economy"},  # ← has default
    },
}
# Dialogue manager asks: "When would you like to fly?"
# After user responds: run NLU on response, extract date slot, update frame
# When all required slots filled: execute booking API call
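The dialogue manager's core loop over such a frame is short. This is a minimal sketch in which the required-slot list, the helper names, and the value "Tuesday" are assumptions for illustration; NLU and the booking API call are out of scope.

```python
def next_missing_slot(frame, required):
    """Return the first required slot that is still unfilled, or None."""
    for name in required:
        if not frame["slots"][name]["filled"]:
            return name
    return None

def fill_slot(frame, name, value):
    """Record a value extracted by NLU and mark the slot as filled."""
    frame["slots"][name] = {**frame["slots"][name], "value": value, "filled": True}

frame = {
    "intent": "book_flight",
    "slots": {
        "origin": {"value": "Delhi", "filled": True},
        "destination": {"value": "Paris", "filled": True},
        "date": {"value": None, "filled": False},
        "class": {"value": None, "filled": False, "default": "economy"},
    },
}
REQUIRED = ["origin", "destination", "date"]   # "class" falls back to its default

first_missing = next_missing_slot(frame, REQUIRED)   # the manager asks about this slot
fill_slot(frame, "date", "Tuesday")                  # value extracted from the user's reply
ready_to_book = next_missing_slot(frame, REQUIRED) is None
```

Because the loop checks which slots are missing rather than following a fixed script, the user can supply slots in any order or several at once.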

Finite State vs. Frame-Based vs. Information-State

Finite State Machine

Hard-coded graph of states and transitions. "If in state GREET and user says YES → move to state COLLECT_ORIGIN." Fast, predictable. Can't handle unexpected inputs. Used for simple IVR (phone menu) systems.

Frame-Based (slot filling)

Fill slots in any order. More flexible than FSM. Can handle "Book Paris to London on Tuesday" in one shot. The standard for task-oriented bots 2000–2017 (ATIS, hotel booking).

Statistical Dialogue (POMDP)

Maintain a probability distribution over possible dialogue states. Choose the next action to maximize expected reward. More robust to ASR/NLU errors. Closely associated with the Cambridge Dialogue Systems Group (2006–2016).

Natural Language Generation (Rule-Based)

# Template-based NLG:
templates = {
    "ask_destination": [
        "Where would you like to fly?",
        "What's your destination?",
        "And where are you headed?",
    ],
    "confirm_booking": "I've booked a {class} flight from {origin} to {destination} on {date}.",
}
# Select template → fill slots → optionally vary phrasing to avoid repetition
# More advanced: sentence planners + surface realizers (OpenCCG, FUF/SURGE)
# Limitation: every response variation must be hand-authored

09 Information Retrieval — The Inverted Index

How search engines work, why BM25 still runs inside modern RAG, and the direct lineage from TF-IDF to dense retrieval.

Before neural retrieval, finding relevant documents from a corpus of millions required an efficient data structure. The inverted index — the engine behind Elasticsearch, Solr, and Lucene — maps each term to the list of documents containing it. Query evaluation becomes an intersection of sorted lists rather than a scan of the entire corpus.

Inverted Index Structure

Documents
Doc 1: "the cat sat on the mat"
Doc 2: "the dog sat on the rug"
Doc 3: "cat and dog are pets"

Inverted Index (term → posting list)
"cat" → [Doc1(tf=1), Doc3(tf=1)]
"sat" → [Doc1(tf=1,pos=3), Doc2(tf=1,pos=3)]
"dog" → [Doc2(tf=1), Doc3(tf=1)]
"the" → [Doc1(tf=2), Doc2(tf=2)]   ← high df → low IDF

Query "cat sat": intersect [Doc1,Doc3] ∩ [Doc1,Doc2] = [Doc1] → instant!
Time complexity: O(|posting list|), not O(|corpus|). This is what enables billion-doc search.
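The same three documents can be indexed in a few lines. This sketch stores term frequencies per document and answers AND queries with set intersection. Production engines instead walk sorted, compressed posting lists with a merge, so treat the data structures here as a simplification.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search_and(index, query):
    """Return sorted ids of documents containing every query term."""
    postings = [set(index.get(term, {})) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "the cat sat on the mat",
    2: "the dog sat on the rug",
    3: "cat and dog are pets",
}
index = build_index(docs)
hits = search_and(index, "cat sat")   # only Doc 1 contains both terms
```

Only the posting lists for "cat" and "sat" are touched; the documents themselves are never scanned at query time.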

BM25 — The Backbone of Modern Search

BM25 (Best Match 25) is the ranking function used by Elasticsearch and the sparse component of most hybrid RAG pipelines. It extends TF-IDF with two key improvements: term frequency saturation (doubling word count doesn't double relevance) and document length normalisation (penalise long documents for having more term occurrences by chance).

BM25(q,d) = Σ_{w ∈ q} IDF(w) · [TF(w,d)·(k₁+1)] / [TF(w,d) + k₁·(1−b + b·|d|/avgdl)]
IDF(w) = log[(N − df(w) + 0.5) / (df(w) + 0.5) + 1]

Here TF(w,d) is the raw count of w in d. k₁ ≈ 1.2 controls TF saturation. b ≈ 0.75 controls length normalisation. avgdl = average document length. As TF→∞, each term's contribution approaches IDF·(k₁+1): it saturates, unlike raw-count TF-IDF, which grows without bound. BM25 has been the baseline that neural models must beat since 1994.
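The formula above can be implemented directly. This sketch assumes whitespace tokenization and recomputes document frequencies on every call, which is fine for a toy corpus but not how an index-backed engine would do it. The corpus is illustrative.

```python
import math

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with the BM25 formula above."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for other in tokenized if term in other)
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            tf = tokens.count(term)                       # raw count in this doc
            length_norm = 1 - b + b * len(tokens) / avgdl
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "cat and dog are pets",
]
scores = bm25_scores("cat sat", docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Doc 1 ranks first because it matches both query terms. Doc 3 edges out Doc 2 for a single match because it is shorter than average, which is the length normalisation at work.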

BM25 Strengths (still relevant)
  • Exact keyword match — never misses a literal term
  • No GPU, no embeddings, runs on commodity hardware
  • Interpretable: you know exactly why a document ranked
  • Millisecond latency at billion-document scale
Why hybrid RAG uses BM25 + dense

Dense retrieval understands semantics ("automobile" ≈ "car") but can miss exact matches (product codes, names, rare terms). BM25 never misses exact terms but doesn't understand semantics. Hybrid = BM25 + dense, merged with Reciprocal Rank Fusion (RRF). Consistently outperforms either alone.
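Reciprocal Rank Fusion needs only the two ranked lists, not their raw scores: each document earns 1/(k + rank) from every list it appears in. The sketch below uses k = 60, a commonly used default, and two made-up rankings with hypothetical document ids.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d2", "d3"]     # hypothetical sparse results
dense_ranking = ["d3", "d1", "d4"]    # hypothetical dense results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Working on ranks rather than scores sidesteps the fact that BM25 scores and cosine similarities live on incomparable scales.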

10 FastText — Subword Embeddings

Word2Vec's successor that handles morphologically rich languages, typos, and rare words — and is still fast enough to train in minutes.

Word2Vec gives a single embedding per word. Unknown words (unseen during training) get no embedding at all. For morphologically rich languages like Finnish, Turkish, or Hindi, where a word can have thousands of inflections, this is crippling. FastText (Facebook AI Research, 2016) extends Word2Vec by representing each word as a bag of character n-grams.

FastText: from word → character n-grams
// Word "where" with n=3,4,5 (plus boundary markers <, >):
character 3-grams: <wh, whe, her, ere, re>
character 4-grams: <whe, wher, here, ere>
character 5-grams: <wher, where, here>
plus the full word: <where>

// Word vector = sum of all its n-gram vectors
// Unknown word "wheree" (typo): shares most n-grams with "where" → similar vector
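The n-gram overlap behind the typo example can be checked directly. This sketch only extracts and compares n-gram sets for n = 3 to 5. It does not hash n-grams into buckets or sum vectors as the actual fastText library does, and the function name is hypothetical.

```python
def subword_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of <word> for n in [n_min, n_max], with boundary markers."""
    padded = f"<{word}>"
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }

known = subword_ngrams("where")
typo = subword_ngrams("wheree")
shared = known & typo
```

Here 9 of the 12 n-grams of "where" also occur in "wheree", so a vector built by summing n-gram vectors for the typo lands close to the vector of the correctly spelled word.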
Advantages over Word2Vec
  • Out-of-vocabulary words get meaningful embeddings from shared n-grams
  • Robust to spelling variations and typos
  • Handles regular morphology: "running", "runs", "runner" share n-gram vectors (irregular forms such as "ran" share almost none)
  • Excellent for agglutinative languages (Finnish, Turkish, Japanese)
  • Still very fast: n-gram hashing keeps vocabulary manageable
FastText Classification

FastText also includes a text classifier: average the word and n-gram vectors, then apply a linear classifier. It trains quickly on CPU, even on millions of documents. Still a strong baseline for intent detection and text classification: usually outperformed by BERT, but at a small fraction of the inference cost. Facebook also released a fastText language identification model that covers 176 languages.

11 ELMo, BERT & the Bridge to Modern LLMs

The 2018–2019 revolution: contextualised embeddings and pretraining that ended the classical NLP era overnight.

The Problem Word2Vec Couldn't Solve: Polysemy

Word2Vec gives a single static vector per word. "Bank" (financial institution) and "bank" (river bank) have the same embedding. "Play" (theatre play) and "play" (sport play) are identical. Context is completely ignored. Every downstream task had to figure out word sense disambiguation itself.

ELMo (2018) — Contextual Embeddings from LSTMs

ELMo (Embeddings from Language Models) trained a deep bidirectional LSTM to predict the next word (forward) and previous word (backward) on a large corpus. The key insight: instead of using the word's static embedding, use the internal activations of the LSTM — which change depending on the surrounding context — as the word representation.

💡
The ELMo insight: The same word has different representations in different contexts. "bank" in "I went to the bank" activates different LSTM hidden states than "bank" in "the river bank eroded." These contextual representations drastically improved NER, QA, and sentiment — just by swapping the embeddings. ELMo proved that context matters and set the stage for BERT.

BERT (2018) — Bidirectional Transformer Pretraining

BERT (Bidirectional Encoder Representations from Transformers) replaced ELMo's BiLSTM with a Transformer encoder, and replaced next-word prediction with two objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).

Masked Language Modeling (MLM)

Randomly select 15% of tokens for prediction: 80% of those are replaced with [MASK], 10% with a random token, and 10% are left unchanged. Train the model to predict the original tokens from the bidirectional context. Unlike GPT (predict next token, left-to-right), MLM lets every token attend to tokens on both sides, giving much richer contextual representations. The input "The [MASK] sat on the mat" forces the model to use all surrounding context to predict "cat".

BERT Fine-tuning (The Key Shift)

Add a task-specific head on top of BERT's [CLS] token output (classification) or per-token output (NER, QA). Fine-tune all weights jointly. Results: BERT set a new state of the art on 11 NLP tasks at once. On CoNLL-2003 NER, F1 went from about 91 (BiLSTM-CRF) to about 93. The feature engineering era ended in weeks.

Embedding Evolution: Word2Vec → ELMo → BERT → LLMs

[Figure: embedding evolution]
  • Word2Vec/GloVe: static embeddings, context-free
  • ELMo (2018): context-dependent BiLSTM activations, plug-in to any model
  • BERT (2018): Transformer encoder, bidirectional MLM, fine-tune whole model, SOTA on 11 benchmarks
  • GPT → LLMs (2019+): decoder-only, autoregressive, scale to billions of params, few-shot in-context learning, no task-specific fine-tuning needed
🏁
Why BERT matters for your understanding: BERT is the hinge point of the entire story. Classical NLP built features → ML classified them. ELMo learned features → ML classified them. BERT learned features AND the classifier together, end-to-end, from pretraining. GPT scaled this further — no classifier needed at all. Understanding BERT makes the LLM training pipeline feel inevitable rather than magical.

12 The Transition to Deep Learning

Why classical methods gave way — and what the tipping points were.

What Classical Methods Couldn't Solve

Feature Engineering Bottleneck

Every task needed a new set of manually designed features. POS tagger features differ from NER features differ from sentiment features. Expertise was non-transferable. Neural networks learn features automatically from raw text.

No Transfer Learning

A TF-IDF+SVM model for spam detection couldn't help an intent classifier — no shared representation. Deep learning enabled pretraining on large corpora and fine-tuning on specific tasks — the modern paradigm.

Long-Range Dependencies

N-gram models are local. CRFs see a small window. "The bank that the customer who the employee that the department hired…" requires tracking nested dependencies across many tokens. LSTMs and Transformers handle this; classical models don't.

Semantic Compositionality

"Not bad" ≠ bad. "The best way to ruin a meal" (sarcasm). Composing word meanings into sentence meaning required complex hand-crafted logic in classical systems. Neural networks learn composition implicitly.

What Classical Methods Still Do Better

Interpretability

You can inspect every feature weight in a logistic regression or linear SVM. "This prediction is primarily driven by the word 'cancel'." Try explaining a 70B parameter attention pattern.

Low-Data Regimes

With 50–200 labeled examples, Naive Bayes and logistic regression often outperform fine-tuned LLMs. Classical models don't need 10K+ examples to work. In highly specialized domains with little data, they remain competitive.

Latency & Cost

TF-IDF + logistic regression classifies in microseconds on CPU at negligible cost. A BERT-sized model typically needs tens of milliseconds per query, and an LLM API call costs orders of magnitude more; exact figures depend on hardware and provider. At scale, classical models often run as pre-filters and routers inside larger models' pipelines.

Predictability

Rule-based and classical models behave deterministically. "If input contains regex /\bcancel\b/, fire cancel intent" — 100% reproducible. LLMs introduce stochasticity that can be hard to debug in production.

Where They Coexist Today

# Modern production NLP pipeline: many layers, each using the best tool for the job.
# Helper objects (bm25_index, intent_classifier, cross_encoder, llm) are placeholders.
def process_query(user_input):
    # Layer 1: fast rule-based pre-filter (microseconds, ~free)
    if regex_matches_profanity(user_input):
        return block()
    if regex_matches_greeting(user_input):
        return standard_greeting()

    # Layer 2: BM25 keyword retrieval for RAG (fast, interpretable)
    bm25_docs = bm25_index.search(user_input, top_k=20)

    # Layer 3: small classifier to route (BERT-small, 10ms)
    intent, confidence = intent_classifier.predict(user_input)
    if intent == "simple_faq" and confidence > 0.9:
        return lookup_faq(user_input)  # no LLM needed!

    # Layer 4: dense re-ranking of BM25 results (cross-encoder, 50ms)
    top_docs = cross_encoder.rerank(user_input, bm25_docs, top_k=5)

    # Layer 5: LLM for complex generation (only when needed)
    return llm.generate(user_input, context=top_docs)

The Full Picture

Classical NLP taught us that text representation, feature design, and probabilistic sequence modeling matter. Deep learning automated the feature engineering. Transformers automated the sequence modeling. LLMs automated the task-specific fine-tuning. But the problems are the same: intent, slots, entities, context, disambiguation. The tools evolved; the problem structure didn't. Understanding the classical approaches makes you a better engineer of the modern ones — you know what the neural networks are implicitly learning to do, and when a simpler classical tool is the right one.