Large language models feel like magic, but they’re built on surprisingly elegant mathematics. This post walks through the key concepts at a high level - enough to understand what’s actually happening without drowning in tensor calculus.
Everything is Numbers
The first insight: computers can’t read words. They need numbers. So we convert text into numbers through a process called tokenization.
A tokenizer breaks text into chunks (tokens) and maps each to an integer:
"The cat sat" → [464, 3797, 3332]
Tokens aren’t always whole words - “understanding” might become [“under”, “stand”, “ing”]. Modern tokenizers, typically built with BPE (Byte Pair Encoding), learn these splits from data, optimizing for common patterns.
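To see this concretely, here’s a minimal sketch using the open-source tiktoken library (my choice for illustration - any BPE tokenizer behaves the same way). The exact IDs depend entirely on which vocabulary you load:

import tiktoken  # pip install tiktoken; not part of the model itself

enc = tiktoken.get_encoding("gpt2")   # load a pre-trained BPE vocabulary

ids = enc.encode("The cat sat")
print(ids)                            # token IDs, e.g. [464, 3797, 3332]
print(enc.decode(ids))                # "The cat sat" - the mapping is reversible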
Embeddings: Words as Vectors
A token ID like 3797 is meaningless on its own. We need to capture meaning. Enter embeddings - vectors that place words in a high-dimensional space where similar concepts cluster together.
Each token maps to a vector of, say, 4096 numbers:
"cat" → [0.2, -0.5, 0.8, 0.1, ...] (4096 dimensions)
"dog" → [0.3, -0.4, 0.7, 0.2, ...] (nearby in vector space)
"democracy" → [-0.9, 0.1, -0.3, 0.6, ...] (far away)
The famous example: king - man + woman ≈ queen. Vector arithmetic on embeddings captures semantic relationships.
These embeddings are learned during training - the model discovers that “cat” and “dog” should be near each other because they appear in similar contexts.
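Here’s a toy sketch of what “nearby in vector space” means, using made-up 4-dimensional vectors (real models learn thousands of dimensions) and a cosine-similarity check:

import numpy as np

# Hypothetical embeddings for illustration only - real values are learned
embeddings = {
    "cat":       np.array([0.2, -0.5, 0.8, 0.1]),
    "dog":       np.array([0.3, -0.4, 0.7, 0.2]),
    "democracy": np.array([-0.9, 0.1, -0.3, 0.6]),
}

def cosine_similarity(a, b):
    # 1.0 means pointing the same way, 0 unrelated, -1 opposite
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))        # high
print(cosine_similarity(embeddings["cat"], embeddings["democracy"]))  # low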
Attention: The Core Innovation
Here’s where transformers diverge from older approaches. Previous models - recurrent networks - processed text sequentially, word by word, carrying a hidden state. This made it hard to connect distant words (“The cat that the dog chased ran away” - what ran away?).
Attention lets every token look at every other token simultaneously and decide what’s relevant.
For each token, we compute three vectors:
- Query (Q): “What am I looking for?”
- Key (K): “What do I contain?”
- Value (V): “What information do I provide?”
Putting these together, the attention output is:
Attention(Q, K, V) = softmax(QKᵀ / √d) × V
In plain English:
- Compare each query with all keys (dot product)
- Scale down by √d so large dot products don’t saturate the softmax and flatten its gradients
- Softmax to get probabilities (weights sum to 1)
- Use these weights to blend the values
The result: each token gets a context-aware representation that incorporates relevant information from the entire sequence.
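Here’s that formula as a bare-bones NumPy sketch - no masking, no batching, no learned projections, just the core operation:

import numpy as np

def softmax(x):
    # Subtract the max for numerical stability, then normalize rows to sum to 1
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]                   # dimension of the query/key vectors
    scores = Q @ K.T / np.sqrt(d)     # compare every query with every key
    weights = softmax(scores)         # each row sums to 1
    return weights @ V                # blend the values by those weights

# 3 tokens, 4-dimensional vectors (tiny, made-up numbers)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(attention(Q, K, V).shape)       # (3, 4): one context-aware vector per token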
Multi-Head Attention
One attention pattern isn’t enough. “Bank” relates differently to “river” (geography) and “money” (finance). Multi-head attention runs several attention mechanisms in parallel:
Head 1: tracks subject-verb relationships
Head 2: tracks adjective-noun relationships
Head 3: tracks coreference (pronouns → nouns)
...
Each head learns different patterns. Their outputs are concatenated and projected back to the model dimension.
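Sketched in code, reusing attention() from the previous snippet and using random stand-in projection matrices:

import numpy as np

n_tokens, d_model, n_heads = 3, 8, 2
d_head = d_model // n_heads            # each head works in a smaller subspace

rng = np.random.default_rng(1)
x = rng.normal(size=(n_tokens, d_model))                  # token representations
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

Q, K, V = x @ W_q, x @ W_k, x @ W_v
heads = []
for h in range(n_heads):
    cols = slice(h * d_head, (h + 1) * d_head)            # this head's slice
    heads.append(attention(Q[:, cols], K[:, cols], V[:, cols]))

out = np.concatenate(heads, axis=-1) @ W_o                # concatenate and project
print(out.shape)                                          # (3, 8): same shape as the input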
The Transformer Block
A transformer stacks identical blocks, each containing:
- Multi-head self-attention (with residual connection)
- Layer normalization
- Feed-forward network (two linear layers with activation)
- Another residual connection and layer norm
The residual connections (adding input to output) are crucial - they let gradients flow backward through many layers without vanishing.
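As a rough NumPy sketch of that layout - random stand-in weights, post-layer-norm ordering as described above, and attend standing in for whatever self-attention function you plug in (the multi-head sketch above would do):

import numpy as np

d_model, d_ff = 8, 32
rng = np.random.default_rng(2)
W1 = rng.normal(size=(d_model, d_ff))     # stand-in feed-forward weights
W2 = rng.normal(size=(d_ff, d_model))

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector to zero mean and unit variance
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def feed_forward(x):
    # Two linear layers with a ReLU in between
    return np.maximum(x @ W1, 0) @ W2

def transformer_block(x, attend):
    x = layer_norm(x + attend(x))          # attention + residual + layer norm
    x = layer_norm(x + feed_forward(x))    # feed-forward + residual + layer norm
    return x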
GPT-4 reportedly has ~120 layers. Each layer refines the representations, building increasingly abstract features.
Predicting the Next Token
After all those layers, we have a rich representation for each position. The final step: predict what comes next.
A linear layer projects to vocabulary size (say, 50,000 tokens), then softmax converts to probabilities:
P("the") = 0.02
P("cat") = 0.15
P("dog") = 0.12
P("quantum") = 0.0001
...
During training, we know the actual next token. The loss function measures how wrong we were:
Loss = -log(P(correct token))
This is cross-entropy loss. If we assigned probability 0.15 to the correct token, loss = -log(0.15) ≈ 1.9. If we assigned 0.95, loss ≈ 0.05. Lower is better.
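In code, the whole logits → probabilities → loss step is only a few lines (toy numbers, tiny vocabulary):

import numpy as np

vocab = ["the", "cat", "dog", "quantum"]          # a 4-word toy vocabulary
logits = np.array([1.2, 3.1, 2.9, -4.0])          # raw scores from the final layer

probs = np.exp(logits - logits.max())
probs /= probs.sum()                              # softmax: now they sum to 1

target = vocab.index("cat")                       # suppose "cat" actually came next
loss = -np.log(probs[target])                     # cross-entropy for this position
print(dict(zip(vocab, probs.round(3))), round(loss, 3))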
Training: Gradient Descent
Training adjusts billions of parameters to minimize loss across trillions of tokens. The algorithm:
- Forward pass: run input through the model, compute loss
- Backward pass: compute gradients (how each parameter affects loss)
- Update: nudge parameters opposite to gradients
- Repeat
The update rule (simplified):
θ_new = θ_old - η × ∇Loss
Where η (eta) is the learning rate (typically ~0.0001). Too high and training explodes. Too low and it takes forever.
Modern training uses the Adam optimizer, which adapts the learning rate for each parameter based on its gradient history. It also incorporates momentum, smoothing updates through noisy gradients and flat stretches of the loss surface.
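Here’s the bare update rule on a toy one-parameter problem with loss(θ) = (θ - 3)² - nothing like a real training loop, but the mechanics are identical:

theta = 0.0                        # the single "parameter" of a toy model
eta = 0.1                          # learning rate

for step in range(50):
    grad = 2 * (theta - 3)         # dLoss/dθ, worked out by hand here;
    theta -= eta * grad            # real frameworks compute gradients automatically

print(theta)                       # converges towards 3, the minimum of the loss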
Scaling Laws
A remarkable discovery: LLM performance follows predictable scaling laws. Loss decreases as a power law with:
- More parameters (N)
- More training data (D)
- More compute (C)
L(N, D) ≈ (N_c / N)^(α_N) + (D_c / D)^(α_D)
This lets researchers predict performance before training, allocating compute optimally. It’s why the field has pursued scale so aggressively - the returns are predictable.
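To see the shape of the curve, here’s that formula with purely illustrative constants (not the fitted values from the paper):

def scaling_loss(N, D, N_c=1e13, D_c=1e13, alpha_N=0.08, alpha_D=0.10):
    # Constants are placeholders for illustration, not empirical fits
    return (N_c / N) ** alpha_N + (D_c / D) ** alpha_D

for N in (1e9, 1e10, 1e11):                  # growing parameter counts
    print(N, scaling_loss(N, D=1e12))        # loss keeps falling, but ever more slowly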
What About Understanding?
Here’s the philosophical question: does any of this constitute “understanding”?
The mathematics is clear - it’s pattern matching and probability over sequences. But the emergent behaviours (reasoning, analogy, creativity) arise from simple operations at massive scale.
Whether that’s “real” understanding or a very good approximation is a debate for philosophers. The maths doesn’t care.
Further Reading
- Attention Is All You Need - the original transformer paper
- The Illustrated Transformer - visual explanations
- Scaling Laws for Neural Language Models - the empirical scaling paper
- 3Blue1Brown’s Neural Network Series - excellent visualizations
The mathematics is beautiful in its simplicity. A few operations - matrix multiplication, softmax, layer norm - repeated billions of times, trained on human text, produce something that can write poetry and debug code. That’s the real magic.