Large language models feel like magic, but they’re built on surprisingly elegant mathematics. This post walks through the key concepts at a high level - enough to understand what’s actually happening without drowning in tensor calculus.
Everything is Numbers
The first insight: computers can’t read words. They need numbers. So we convert text into numbers through a process called tokenization.
A tokenizer breaks text into chunks (tokens) and maps each to an integer:
"The cat sat" → [464, 3797, 3332]
Tokens aren’t always whole words - “understanding” might become [“under”, “stand”, “ing”]. Modern tokenizers, typically built with BPE (Byte Pair Encoding), learn these splits from data, optimizing for common patterns.
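To see this concretely, here’s a minimal sketch using the open-source tiktoken library (my choice for illustration - any BPE tokenizer behaves the same way). The exact IDs depend entirely on which vocabulary you load:

import tiktoken  # pip install tiktoken; not part of the model itself

enc = tiktoken.get_encoding("gpt2")   # load a pre-trained BPE vocabulary

ids = enc.encode("The cat sat")
print(ids)                            # token IDs, e.g. [464, 3797, 3332]
print(enc.decode(ids))                # "The cat sat" - the mapping is reversible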
Embeddings: Words as Vectors
A token ID like 3797 is meaningless on its own. We need to capture meaning. Enter embeddings - vectors that place words in a high-dimensional space where similar concepts cluster together.
Each token maps to a vector of, say, 4096 numbers:
"cat" → [0.2, -0.5, 0.8, 0.1, ...] (4096 dimensions)
"dog" → [0.3, -0.4, 0.7, 0.2, ...] (nearby in vector space)
"democracy" → [-0.9, 0.1, -0.3, 0.6, ...] (far away)
The famous example: king - man + woman ≈ queen. Vector arithmetic on embeddings captures semantic relationships.
These embeddings are learned during training - the model discovers that “cat” and “dog” should be near each other because they appear in similar contexts.
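Here’s a toy sketch of what “nearby in vector space” means, using made-up 4-dimensional vectors (real models learn thousands of dimensions) and a cosine-similarity check:

import numpy as np

# Hypothetical embeddings for illustration only - real values are learned
embeddings = {
    "cat":       np.array([0.2, -0.5, 0.8, 0.1]),
    "dog":       np.array([0.3, -0.4, 0.7, 0.2]),
    "democracy": np.array([-0.9, 0.1, -0.3, 0.6]),
}

def cosine_similarity(a, b):
    # 1.0 means pointing the same way, 0 unrelated, -1 opposite
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))        # high
print(cosine_similarity(embeddings["cat"], embeddings["democracy"]))  # low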
Attention: The Core Innovation
Here’s where transformers diverge from older approaches. Previous models - recurrent networks - processed text sequentially, word by word, carrying a hidden state. This made it hard to connect distant words (“The cat that the dog chased ran away” - what ran away?).
Attention lets every token look at every other token simultaneously and decide what’s relevant.
For each token, we compute three vectors:
- Query (Q): “What am I looking for?”
- Key (K): “What do I contain?”
- Value (V): “What information do I provide?”
Putting these together, the attention output is:
Attention(Q, K, V) = softmax(QKᵀ / √d) × V
In plain English:
- Compare each query with all keys (dot product)
- Scale down by √d so large dot products don’t saturate the softmax and flatten its gradients
- Softmax to get probabilities (weights sum to 1)
- Use these weights to blend the values
The result: each token gets a context-aware representation that incorporates relevant information from the entire sequence.
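Here’s that formula as a bare-bones NumPy sketch - no masking, no batching, no learned projections, just the core operation:

import numpy as np

def softmax(x):
    # Subtract the max for numerical stability, then normalize rows to sum to 1
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]                   # dimension of the query/key vectors
    scores = Q @ K.T / np.sqrt(d)     # compare every query with every key
    weights = softmax(scores)         # each row sums to 1
    return weights @ V                # blend the values by those weights

# 3 tokens, 4-dimensional vectors (tiny, made-up numbers)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(attention(Q, K, V).shape)       # (3, 4): one context-aware vector per token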
Multi-Head Attention
One attention pattern isn’t enough. “Bank” relates differently to “river” (geography) and “money” (finance). Multi-head attention runs several attention mechanisms in parallel:
Head 1: tracks subject-verb relationships
Head 2: tracks adjective-noun relationships
Head 3: tracks coreference (pronouns → nouns)
...
Each head learns different patterns. Their outputs are concatenated and projected back to the model dimension.
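Sketched in code, reusing attention() from the previous snippet and using random stand-in projection matrices:

import numpy as np

n_tokens, d_model, n_heads = 3, 8, 2
d_head = d_model // n_heads            # each head works in a smaller subspace

rng = np.random.default_rng(1)
x = rng.normal(size=(n_tokens, d_model))                  # token representations
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

Q, K, V = x @ W_q, x @ W_k, x @ W_v
heads = []
for h in range(n_heads):
    cols = slice(h * d_head, (h + 1) * d_head)            # this head's slice
    heads.append(attention(Q[:, cols], K[:, cols], V[:, cols]))

out = np.concatenate(heads, axis=-1) @ W_o                # concatenate and project
print(out.shape)                                          # (3, 8): same shape as the input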
The Transformer Block
A transformer stacks identical blocks, each containing:
- Multi-head self-attention (with residual connection)
- Layer normalization
- Feed-forward network (two linear layers with activation)
- Another residual connection and layer norm
The residual connections (adding input to output) are crucial - they let gradients flow backward through many layers without vanishing.
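As a rough NumPy sketch of that layout - random stand-in weights, post-layer-norm ordering as described above, and attend standing in for whatever self-attention function you plug in (the multi-head sketch above would do):

import numpy as np

d_model, d_ff = 8, 32
rng = np.random.default_rng(2)
W1 = rng.normal(size=(d_model, d_ff))     # stand-in feed-forward weights
W2 = rng.normal(size=(d_ff, d_model))

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector to zero mean and unit variance
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def feed_forward(x):
    # Two linear layers with a ReLU in between
    return np.maximum(x @ W1, 0) @ W2

def transformer_block(x, attend):
    x = layer_norm(x + attend(x))          # attention + residual + layer norm
    x = layer_norm(x + feed_forward(x))    # feed-forward + residual + layer norm
    return x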
GPT-4 reportedly has ~120 layers. Each layer refines the representations, building increasingly abstract features.
Predicting the Next Token
After all those layers, we have a rich representation for each position. The final step: predict what comes next.
A linear layer projects to vocabulary size (say, 50,000 tokens), then softmax converts to probabilities:
P("the") = 0.02
P("cat") = 0.15
P("dog") = 0.12
P("quantum") = 0.0001
...
During training, we know the actual next token. The loss function measures how wrong we were:
Loss = -log(P(correct token))
This is cross-entropy loss. If we assigned probability 0.15 to the correct token, loss = -log(0.15) ≈ 1.9. If we assigned 0.95, loss ≈ 0.05. Lower is better.
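In code, the whole logits → probabilities → loss step is only a few lines (toy numbers, tiny vocabulary):

import numpy as np

vocab = ["the", "cat", "dog", "quantum"]          # a 4-word toy vocabulary
logits = np.array([1.2, 3.1, 2.9, -4.0])          # raw scores from the final layer

probs = np.exp(logits - logits.max())
probs /= probs.sum()                              # softmax: now they sum to 1

target = vocab.index("cat")                       # suppose "cat" actually came next
loss = -np.log(probs[target])                     # cross-entropy for this position
print(dict(zip(vocab, probs.round(3))), round(loss, 3))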
Training: Gradient Descent
Training adjusts billions of parameters to minimize loss across trillions of tokens. The algorithm:
- Forward pass: run input through the model, compute loss
- Backward pass: compute gradients (how each parameter affects loss)
- Update: nudge parameters opposite to gradients
- Repeat
The update rule (simplified):
θ_new = θ_old - η × ∇Loss
Where η (eta) is the learning rate (typically ~0.0001). Too high and training explodes. Too low and it takes forever.
Modern training uses the Adam optimizer, which adapts the learning rate for each parameter based on its gradient history. It also incorporates momentum, smoothing updates through noisy gradients and flat stretches of the loss surface.
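Here’s the bare update rule on a toy one-parameter problem with loss(θ) = (θ - 3)² - nothing like a real training loop, but the mechanics are identical:

theta = 0.0                        # the single "parameter" of a toy model
eta = 0.1                          # learning rate

for step in range(50):
    grad = 2 * (theta - 3)         # dLoss/dθ, worked out by hand here;
    theta -= eta * grad            # real frameworks compute gradients automatically

print(theta)                       # converges towards 3, the minimum of the loss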
Scaling Laws
A remarkable discovery: LLM performance follows predictable scaling laws. Loss decreases as a power law with:
- More parameters (N)
- More training data (D)
- More compute (C)
L(N, D) ≈ (N_c / N)^(α_N) + (D_c / D)^(α_D)
This lets researchers predict performance before training, allocating compute optimally. It’s why the field has pursued scale so aggressively - the returns are predictable.
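To see the shape of the curve, here’s that formula with purely illustrative constants (not the fitted values from the paper):

def scaling_loss(N, D, N_c=1e13, D_c=1e13, alpha_N=0.08, alpha_D=0.10):
    # Constants are placeholders for illustration, not empirical fits
    return (N_c / N) ** alpha_N + (D_c / D) ** alpha_D

for N in (1e9, 1e10, 1e11):                  # growing parameter counts
    print(N, scaling_loss(N, D=1e12))        # loss keeps falling, but ever more slowly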
What About Understanding?
Here’s the philosophical question: does any of this constitute “understanding”?
The mathematics is clear - it’s pattern matching and probability over sequences. But the emergent behaviours (reasoning, analogy, creativity) arise from simple operations at massive scale.
Whether that’s “real” understanding or a very good approximation is a debate for philosophers. The maths doesn’t care.
Further Reading
- Attention Is All You Need - the original transformer paper
- The Illustrated Transformer - visual explanations
- Scaling Laws for Neural Language Models - the empirical scaling paper
- 3Blue1Brown’s Neural Network Series - excellent visualizations
The mathematics is beautiful in its simplicity. A few operations - matrix multiplication, softmax, layer norm - repeated billions of times, trained on human text, produce something that can write poetry and debug code. That’s the real magic.