Chapter 2. Math You Actually Need (And Nothing More)
Every operation inside a language model (every prediction, every “understanding” of context, every decision about which word comes next) boils down to four mathematical operations: vectors, dot products, matrix multiplication, and softmax. If you understand these four things, you can follow every technical explanation in the rest of this book. If you don’t, the rest will be a black box. This chapter makes sure it isn’t.
You don’t need a math degree. You don’t need to have taken calculus. Everything here is built from scratch, starting with the simplest possible idea, a list of numbers, and working up to the exact operations that run inside GPT-5, Claude, and every other frontier model. By the end, you’ll implement all of it in about 20 lines of Python.
Vectors: Lists of Numbers That Represent Meaning
A vector is a list of numbers. That’s it. No more, no less.
Here’s a vector with 3 numbers:
[0.5, -0.2, 0.8]
Here’s a vector with 5 numbers:
[1.0, 0.0, -0.3, 0.7, 0.1]
The number of values in the list is called the vector’s dimension. The first vector above is 3-dimensional. The second is 5-dimensional.
Why should you care? Because in a language model, every single token (every word or word-piece the model reads or writes) is represented as a vector. When the model processes the word “Tokyo,” it doesn’t see the letters T-o-k-y-o. It sees a list of numbers like:
Tokyo → [0.023, -0.041, 0.118, 0.055, -0.092, ..., -0.007]
In Chapter 1, we saw that LLaMA 4 Maverick uses a hidden dimension of 5,120. That means every token in the model is represented as a vector of 5,120 numbers. GPT-2, the smaller model we ran code with in Chapter 1, uses vectors of 768 numbers. The principle is the same; only the size differs.
Source: LLaMA 4 Maverick config: embedding_length = 5,120, from Ollama model registry; GPT-2: embedding dimension = 768, from OpenAI’s published architecture.
What Do the Numbers Mean?
Each number in a vector represents something about the token’s meaning, but not in a way that’s easy to label. You can’t point to the 47th number and say “this one measures how much the word relates to animals.” The meaning is distributed, spread across all the numbers together. It’s the pattern of all 5,120 numbers taken as a whole that encodes what the token means.
This is similar to how a color on your screen is represented by three numbers: red, green, and blue (RGB). The color orange isn’t captured by any single number; it’s the combination [255, 165, 0] that makes it orange. Token vectors work the same way, except with thousands of dimensions instead of three.
Visualizing Vectors in 2D and 3D
Since we can’t visualize 5,120 dimensions, let’s start with 2 dimensions to build intuition.
Imagine a flat grid, like a piece of graph paper. The horizontal axis represents one dimension, and the vertical axis represents another. A 2D vector is just a point on this grid:
"cat" → [0.9, 0.8] (far right, near the top)
"dog" → [0.85, 0.75] (close to "cat")
"car" → [-0.5, -0.3] (far left, near the bottom)
"truck" → [-0.6, -0.2] (close to "car")
On this grid, “cat” and “dog” would be plotted near each other because they’re both animals, they appear in similar sentences, and their vectors reflect that. “Car” and “truck” would be plotted near each other too, but far away from the animals. The distance between two points on the grid tells you how similar their meanings are.
Now extend this to 3D: add a third axis coming out of the page. Each token is now a point in a 3D space, like a dot floating in a room. “Cat” and “dog” are floating near each other in one corner. “Car” and “truck” are in a different corner. “Paris” and “Tokyo” might be in yet another region.
Real models use 768 to 12,288+ dimensions instead of 2 or 3. You can’t visualize that many dimensions, but the principle is identical: similar meanings produce nearby vectors, and different meanings produce distant vectors. The model learns these positions during training; it adjusts the numbers so that tokens appearing in similar contexts end up with similar vectors.
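The idea that nearby points mean similar words can be checked on the four 2D points from the grid example. This is a minimal sketch using only Python's standard library; the helper name distance is ours, and the coordinates are the made-up illustrative values from above, not real embeddings.

```python
import math

# Illustrative 2D "embeddings" from the grid example (made-up values).
points = {
    "cat": (0.9, 0.8),
    "dog": (0.85, 0.75),
    "car": (-0.5, -0.3),
    "truck": (-0.6, -0.2),
}

def distance(a, b):
    """Straight-line (Euclidean) distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

for w1, w2 in [("cat", "dog"), ("car", "truck"), ("cat", "car")]:
    d = distance(points[w1], points[w2])
    print(f"{w1:>5s} to {w2:<5s} distance = {d:.3f}")
```

The two within-group distances come out near 0.07 and 0.14, while cat to car is about 1.78. The same calculation works in any number of dimensions; only the number of coordinates changes.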
This is exactly what Word2Vec demonstrated in 2013. Mikolov and colleagues at Google trained vectors of 300 dimensions and showed that the learned positions captured semantic relationships, the famous example being that the vector for “King” minus “Man” plus “Woman” produced a vector close to “Queen.”
Source: Mikolov et al., “Efficient Estimation of Word Representations in Vector Space,” arXiv:1301.3781, January 2013. The original Word2Vec models used 300-dimensional vectors.
Why Vectors Matter for Language Models
Every step inside a Transformer (the architecture behind every frontier model) operates on vectors. When the model reads your prompt, it converts each token into a vector. When it computes attention (Chapter 7), it’s comparing vectors. When it generates the next token, it’s producing a vector and finding which vocabulary entry is closest to it.
If you understand that a vector is a list of numbers representing meaning, and that similar meanings produce similar vectors, you have the foundation for everything that follows.
Dot Products: Measuring Similarity Between Vectors
Now that we have vectors, we need a way to measure how similar two vectors are. This is where the dot product comes in.
The dot product of two vectors is computed by multiplying their corresponding numbers together and adding up the results. That’s the entire operation.
Here’s an example with 3-dimensional vectors:
a = [1, 2, 3]
b = [4, 5, 6]
dot product = (1 × 4) + (2 × 5) + (3 × 6)
= 4 + 10 + 18
= 32
Step by step:
- Multiply the first numbers: 1 × 4 = 4
- Multiply the second numbers: 2 × 5 = 10
- Multiply the third numbers: 3 × 6 = 18
- Add them all up: 4 + 10 + 18 = 32
The result is a single number: 32. That single number tells you something about how similar the two vectors are.
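The hand calculation above translates almost word for word into Python. This is a from-scratch sketch with no libraries; the function name dot is our own choice.

```python
def dot(a, b):
    """Multiply corresponding entries of two equal-length lists and sum the products."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))

a = [1, 2, 3]
b = [4, 5, 6]
print(dot(a, b))  # 4 + 10 + 18 = 32
```

The length check matters: a dot product is only defined for two vectors of the same dimension.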
Why the Dot Product Measures Similarity
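The dot product has a useful property: it’s large and positive when two vectors point in the same direction, near zero when they’re unrelated (pointing at right angles to each other), and negative when they point in opposite directions. Its size also grows with the lengths of the vectors, so two long vectors can score higher than two short ones that are equally well aligned.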
Consider these three 4-dimensional vectors representing words:
cat = [0.9, 0.8, 0.1, -0.1]
dog = [0.85, 0.75, 0.15, -0.05]
car = [-0.5, -0.3, 0.7, 0.6]
Let’s compute the dot products:
cat · dog:
(0.9 × 0.85) + (0.8 × 0.75) + (0.1 × 0.15) + (-0.1 × -0.05)
= 0.765 + 0.600 + 0.015 + 0.005
= 1.385
cat · car:
(0.9 × -0.5) + (0.8 × -0.3) + (0.1 × 0.7) + (-0.1 × 0.6)
= -0.450 + -0.240 + 0.070 + -0.060
= -0.680
The dot product of “cat” and “dog” is 1.385, a large positive number. The dot product of “cat” and “car” is -0.680, a negative number. This matches our intuition: cats and dogs are semantically similar (both animals, both pets), while cats and cars have little in common.
This is exactly how language models measure similarity. When the model needs to decide which previous words are relevant to predicting the next token, it computes dot products between vectors. A high dot product means “these two tokens are related.” A low or negative dot product means “these two tokens aren’t relevant to each other.”
In Chapter 7, we’ll see that the entire attention mechanism (the core innovation of the Transformer) is built on dot products. The model computes dot products between a “query” vector (what the current token is looking for) and “key” vectors (what each previous token offers), and uses the results to decide which tokens to pay attention to.
Dot Products with Real Embedding Dimensions
In a real model, these vectors aren’t 4 numbers long; they’re thousands of numbers long. In LLaMA 4 Maverick, each dot product involves multiplying and summing 5,120 pairs of numbers. In GPT-2, it’s 768 pairs. The math is identical; there are just more numbers to multiply and add.
The computational cost scales linearly with the vector dimension. A dot product between two 5,120-dimensional vectors requires 5,120 multiplications and 5,119 additions, about 10,000 arithmetic operations. That sounds like a lot, but modern GPUs can perform trillions of these operations per second.
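The linear scaling is easy to tabulate. The sketch below simply counts the arithmetic for one dot product; the function name dot_product_ops is a label we made up for illustration.

```python
def dot_product_ops(dimension):
    """Return (multiplications, additions) for one dot product of the given dimension."""
    multiplications = dimension
    additions = dimension - 1  # summing n products takes n - 1 additions
    return multiplications, additions

for name, dim in [("GPT-2", 768), ("LLaMA 4 Maverick", 5120)]:
    mults, adds = dot_product_ops(dim)
    print(f"{name}: {mults} multiplications + {adds} additions = {mults + adds} operations")
```

For the 5,120-dimensional case the total is 10,239 operations, which is the "about 10,000" figure quoted above.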
Matrix Multiplication: The Only Operation That Matters
A matrix is a grid of numbers, rows and columns. If a vector is a list, a matrix is a spreadsheet.
Here’s a 2×3 matrix (2 rows, 3 columns):
| 1 2 3 |
| 4 5 6 |
And here’s a 3×2 matrix (3 rows, 2 columns):
| 7 8 |
| 9 10 |
| 11 12 |
Matrix multiplication is the operation of combining two matrices to produce a new matrix. It’s the single most important operation in all of deep learning. Every layer of every neural network, every attention computation, every token prediction: all of it is matrix multiplication.
How Matrix Multiplication Works
To multiply two matrices, you take each row of the first matrix and compute its dot product with each column of the second matrix. The result goes into the corresponding position in the output matrix.
Let’s multiply our 2×3 matrix by our 3×2 matrix:
A = | 1 2 3 |
    | 4 5 6 |
B = | 7 8 |
    | 9 10 |
    | 11 12 |
The result will be a 2×2 matrix (rows from A × columns from B).
Position [row 1, column 1]: Dot product of row 1 of A with column 1 of B:
(1 × 7) + (2 × 9) + (3 × 11) = 7 + 18 + 33 = 58
Position [row 1, column 2]: Dot product of row 1 of A with column 2 of B:
(1 × 8) + (2 × 10) + (3 × 12) = 8 + 20 + 36 = 64
Position [row 2, column 1]: Dot product of row 2 of A with column 1 of B:
(4 × 7) + (5 × 9) + (6 × 11) = 28 + 45 + 66 = 139
Position [row 2, column 2]: Dot product of row 2 of A with column 2 of B:
(4 × 8) + (5 × 10) + (6 × 12) = 32 + 50 + 72 = 154
The result:
A × B = | 58 64 |
        | 139 154 |
That’s it. Matrix multiplication is just organized dot products. Each entry in the output is the dot product of one row from the first matrix with one column from the second matrix.
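To make "organized dot products" concrete, here is a from-scratch sketch in plain Python that follows the row-times-column recipe literally. The function name matmul and the list-of-rows representation are our own choices for illustration; real code would use a library such as NumPy instead.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    n = len(B)  # rows of B; must equal the number of columns of A
    if any(len(row) != n for row in A):
        raise ValueError("columns of A must equal rows of B")
    p = len(B[0])
    result = []
    for row in A:
        out_row = []
        for j in range(p):
            column = [B[k][j] for k in range(n)]
            # One output entry = dot product of a row of A with a column of B.
            out_row.append(sum(x * y for x, y in zip(row, column)))
        result.append(out_row)
    return result

A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 8],
     [9, 10],
     [11, 12]]
print(matmul(A, B))  # [[58, 64], [139, 154]]
```

The two nested loops visit every (row, column) pair once, which is why a 2×3 times 3×2 product needs exactly four dot products.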
The Size Rule
There’s one critical rule: the number of columns in the first matrix must equal the number of rows in the second matrix. If A is 2×3 and B is 3×2, the multiplication works because the inner dimensions match (both 3). The result is 2×2, the outer dimensions.
In general: if A is m×n and B is n×p, then A × B is m×p.
This rule matters because it tells you which matrices can be multiplied together. In a language model, the architects carefully design every weight matrix so that the dimensions line up correctly.
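The size rule can be written as a tiny shape checker. This is an illustrative helper of our own (output_shape is not a library function); it only inspects shapes and never touches the numbers inside the matrices.

```python
def output_shape(shape_a, shape_b):
    """Return the shape of A × B, or raise if the inner dimensions differ."""
    m, n = shape_a
    n2, p = shape_b
    if n != n2:
        raise ValueError(f"cannot multiply {m}×{n} by {n2}×{p}: inner dimensions differ")
    return (m, p)

print(output_shape((2, 3), (3, 2)))            # (2, 2)
print(output_shape((6, 5120), (5120, 5120)))   # (6, 5120)

try:
    output_shape((2, 3), (2, 3))  # inner dimensions are 3 and 2
except ValueError as err:
    print("error:", err)
```

Libraries perform the same check for you and raise an error when the inner dimensions disagree, which is often the first sign of a bug in model code.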
Why Matrix Multiplication Matters for Language Models
In Chapter 1, we walked through a prediction step where the model processes 6 tokens through ~100 Transformer layers. At every layer, the model performs multiple matrix multiplications. Here’s why.
Suppose you have 6 tokens, each represented as a vector of 5,120 numbers. You can stack these vectors into a matrix:
Input matrix: 6 rows × 5,120 columns
(each row is one token's vector)
Now suppose the model has a weight matrix of size 5,120 × 5,120. This weight matrix was learned during training; its numbers were adjusted over trillions of examples to encode useful transformations.
When you multiply the input matrix by the weight matrix:
[6 × 5,120] × [5,120 × 5,120] = [6 × 5,120]
You get a new matrix with the same shape, 6 rows of 5,120 numbers each. But the values are different. Each token’s vector has been transformed: rotated, stretched, and combined in ways that extract useful information.
This is what every layer of a neural network does: multiply the input by a weight matrix to produce a transformed output. The weight matrix encodes what the layer has learned. Different weight matrices extract different features: one might highlight syntactic relationships, another might capture semantic meaning, another might encode factual knowledge.
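Here is a small NumPy sketch of that shape-preserving transformation. It assumes NumPy is installed, uses random numbers as stand-ins for both the token vectors and the learned weights, and shrinks the hidden size from 5,120 to 512 to keep it light. The names rng, seq_len, and hidden are ours.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

seq_len, hidden = 6, 512  # hidden is scaled down from 5,120 for this sketch

# Random stand-ins: a trained model would supply both of these.
inputs = rng.standard_normal((seq_len, hidden))
weights = rng.standard_normal((hidden, hidden)) / np.sqrt(hidden)

outputs = inputs @ weights  # [6 × 512] × [512 × 512] = [6 × 512]

print("input shape: ", inputs.shape)
print("output shape:", outputs.shape)
print("values changed:", not np.allclose(inputs, outputs))
```

The output has the same shape as the input, but every row is a new mixture of the original 512 numbers. With trained rather than random weights, that mixture is what carries the layer's learned behavior.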
The Scale of Matrix Multiplication in Real Models
Let’s put concrete numbers on this. In LLaMA 4 Maverick:
- Hidden dimension: 5,120
- Number of attention heads: 40
- Head dimension: 128
- Number of layers: 48
- Expert feed-forward dimension: 8,192
Source: LLaMA 4 Maverick model configuration, Ollama model registry.
In a single attention layer, the model performs matrix multiplications to compute Query, Key, and Value matrices (we’ll cover what these mean in Chapter 7). Each of these involves multiplying a [sequence_length × 5,120] matrix by a [5,120 × 5,120] weight matrix. That’s three large matrix multiplications just for the attention step of one layer.
Then there’s the feed-forward network in each layer, which involves multiplying by matrices of size [5,120 × 8,192] and [8,192 × 5,120]. That’s two more large matrix multiplications.
Across all 48 layers, a single forward pass through LLaMA 4 Maverick involves hundreds of matrix multiplications, each operating on matrices with millions of entries. This is why language models need GPUs, graphics processing units that were originally designed for rendering video games. GPUs are built to perform massive numbers of multiplications in parallel, which makes them ideal for matrix multiplication.
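A quick tally shows where "hundreds" and "millions" come from. This sketch counts only the five multiplications named above (Query, Key, Value, and the two feed-forward matrices) and takes the sizes as the text states them. A real model has additional projections and, in Maverick's case, many expert networks, so treat the totals as a lower bound rather than an exact accounting.

```python
hidden = 5_120
ffn = 8_192
layers = 48

attention_weight = hidden * hidden  # one of Query, Key, Value, as described above
ffn_up = hidden * ffn               # [5,120 × 8,192]
ffn_down = ffn * hidden             # [8,192 × 5,120]

matmuls_per_layer = 3 + 2  # Q, K, V plus the two feed-forward multiplications

print(f"entries in one 5,120 × 5,120 matrix: {attention_weight:,}")
print(f"entries in one 5,120 × 8,192 matrix: {ffn_up:,}")
print(f"large matmuls across {layers} layers: {layers * matmuls_per_layer}")
```

That works out to about 26 million entries per attention weight matrix, about 42 million per feed-forward matrix, and 240 large multiplications per forward pass under these simplifying assumptions.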
An NVIDIA H100 GPU can perform roughly 1,000 trillion floating-point operations per second on dense half-precision matrix math (about 1 petaFLOPS; NVIDIA’s headline figure of roughly 2 petaFLOPS assumes structured sparsity). Even at that speed, generating a single token from a frontier model takes measurable time, typically 10 to 50 milliseconds, because the sheer volume of matrix multiplications is enormous.
Softmax: Turning Raw Scores into Probabilities
In Chapter 1, we saw that a language model produces a probability for every possible next token. The model might assign 92% to “Paris,” 3% to “the,” 1% to “a,” and so on. But the model’s internal computations don’t naturally produce probabilities; they produce raw scores called logits that can be any number: positive, negative, large, small.
For example, after processing the prompt “The capital of France is,” the model might produce these raw scores for the top few tokens:
| Token | Logit (raw score) |
|---|---|
| Paris | 8.5 |
| the | 3.2 |
| a | 2.1 |
| located | 1.8 |
| not | 0.5 |
These logits aren’t probabilities; they don’t add up to 1, and some could be negative. We need a way to convert them into proper probabilities: numbers between 0 and 1 that sum to 1. That’s what softmax does.
The Softmax Formula
Softmax works in two steps:
1. Exponentiate each score: raise the mathematical constant e (approximately 2.718) to the power of each logit. This makes all values positive and amplifies the differences between large and small scores.
2. Divide each result by the sum of all results: this ensures everything adds up to 1.
In mathematical notation:
softmax(x_i) = e^(x_i) / (e^(x_1) + e^(x_2) + ... + e^(x_n))
Where e ≈ 2.718 is Euler’s number, a mathematical constant that appears throughout calculus and probability theory. You don’t need to know why e is special; just know that raising e to a power is called the exponential function, and it has the useful property of turning any number into a positive number while preserving the ordering (bigger inputs produce bigger outputs).
Working Through a Real Example
Let’s apply softmax to our logits step by step.
Step 1: Exponentiate each logit.
e^8.5 = 4,914.77
e^3.2 = 24.53
e^2.1 = 8.17
e^1.8 = 6.05
e^0.5 = 1.65
Notice how the exponential function dramatically amplifies differences. The logit 8.5 is only about 2.7 times larger than 3.2, but e^8.5 is about 200 times larger than e^3.2. This is a key property of softmax: it makes the highest score dominate.
Step 2: Sum all the exponentials.
sum = 4,914.77 + 24.53 + 8.17 + 6.05 + 1.65 = 4,955.17
(In a real model, this sum would include all ~128,000 vocabulary entries, not just 5. The other entries would have small or negative logits, contributing tiny amounts to the sum.)
Step 3: Divide each exponential by the sum.
Paris: 4,914.77 / 4,955.17 = 0.9918 (99.2%)
the: 24.53 / 4,955.17 = 0.0050 (0.5%)
a: 8.17 / 4,955.17 = 0.0016 (0.2%)
located: 6.05 / 4,955.17 = 0.0012 (0.1%)
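not: 1.65 / 4,955.17 = 0.0003 (0.03%)
Now we have proper probabilities that sum to 1 (the rounded values shown add up to 0.9999). Within this five-token example, the model is 99.2% confident that the next token is “Paris.” In a real model the denominator would include the whole vocabulary, so each of these probabilities would be slightly smaller and the remainder would be spread across the other ~128,000 tokens.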
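The three steps can be reproduced in a few lines of plain Python. This sketch uses only the standard math module and the five example logits. It deliberately leaves out refinements such as temperature and numerical-stability tricks, so read it as a transcription of the hand calculation rather than production code.

```python
import math

tokens = ["Paris", "the", "a", "located", "not"]
logits = [8.5, 3.2, 2.1, 1.8, 0.5]

# Step 1: exponentiate each logit.
exps = [math.exp(x) for x in logits]

# Step 2: sum the exponentials.
total = sum(exps)

# Step 3: divide each exponential by the sum.
probs = [e / total for e in exps]

for token, p in zip(tokens, probs):
    print(f"{token:>8s}: {p:.4f}")
print("sum of probabilities:", round(sum(probs), 10))
```

Because the code keeps full precision instead of rounding at each step, the probabilities add up to 1 rather than the 0.9999 you get from the rounded figures.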
Why Softmax and Not Something Simpler?
You might wonder: why not just divide each logit by the sum of all logits? That would also produce numbers that sum to 1. The problem is that logits can be negative, and dividing negative numbers by a sum doesn’t produce meaningful probabilities.
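A three-number example shows how the naive approach goes wrong. The logits here are hypothetical values picked so that some are negative.

```python
logits = [2.0, -1.0, -0.5]  # hypothetical logits, chosen to include negatives

total = sum(logits)                  # 0.5
naive = [x / total for x in logits]  # sums to 1, but is not a valid distribution

print(naive)       # [4.0, -2.0, -1.0]
print(sum(naive))  # 1.0
```

The results do sum to 1, yet one "probability" is 400% and two are negative. If the logits happened to sum to zero, the division would fail outright. Exponentiating first avoids both problems because e raised to any power is positive.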
Or why not just pick the largest logit every time? Because sometimes you want the model to be creative, to occasionally pick a less likely word. Softmax gives you a full probability distribution that you can sample from. When the model writes a story, you might want it to sometimes pick “warm” instead of always picking “mild” for Tokyo’s weather. The probability distribution from softmax makes this possible.
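The difference between always taking the top token and sampling from the distribution can be sketched with Python's standard random module. The probabilities below are the rounded values from the worked example, and the seed and the 10,000-draw count are arbitrary choices for illustration.

```python
import random
from collections import Counter

tokens = ["Paris", "the", "a", "located", "not"]
probs = [0.9918, 0.0050, 0.0016, 0.0012, 0.0003]  # rounded values from the worked example

random.seed(0)  # fixed seed so repeated runs draw the same samples

# Greedy decoding: always take the single most likely token.
greedy = tokens[probs.index(max(probs))]

# Sampling: draw tokens in proportion to their probabilities.
draws = random.choices(tokens, weights=probs, k=10_000)
counts = Counter(draws)

print("greedy pick:", greedy)
print("sampled counts:", counts.most_common())
```

Greedy decoding returns “Paris” every time. Sampling returns it on the overwhelming majority of draws but occasionally picks one of the alternatives, which is the behavior you want when variety matters.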
Softmax also has a useful mathematical property: it’s differentiable, which means the model can compute gradients through it during training. This is essential for the learning process (covered in Chapter 3), where the model needs to adjust its weights based on how wrong its predictions were.
Temperature: Controlling How Sharp the Distribution Is
There’s one more detail about softmax that matters in practice. Language models have a setting called temperature that controls how “peaked” or “flat” the probability distribution is.
The temperature-adjusted softmax divides each logit by the temperature before exponentiating:
softmax(x_i, T) = e^(x_i / T) / sum(e^(x_j / T))
- Temperature = 1.0: Standard softmax, as shown above.
- Temperature < 1.0 (e.g., 0.3): Dividing by a small number makes the logits larger, which makes the exponentials more extreme, which makes the distribution sharper. The model becomes more confident and predictable, almost always picking the top token.
- Temperature > 1.0 (e.g., 2.0): Dividing by a large number makes the logits smaller, which flattens the distribution. The model becomes more random and creative, and lower-probability tokens get a better chance of being selected.
Let’s see this with our “Paris” example:
| Token | Logit | T=0.3 | T=1.0 | T=2.0 |
|---|---|---|---|---|
| Paris | 8.5 | ~1.0000 | 0.9918 | 0.8585 |
| the | 3.2 | ~0.0000 | 0.0050 | 0.0607 |
| a | 2.1 | ~0.0000 | 0.0016 | 0.0350 |
| located | 1.8 | ~0.0000 | 0.0012 | 0.0301 |
| not | 0.5 | ~0.0000 | 0.0003 | 0.0157 |
At temperature 0.3, the model is essentially deterministic: “Paris” gets virtually 100% of the probability. At temperature 2.0, “Paris” still leads at 85.9% but other tokens have a meaningful chance of being selected. This is why chatbots let you adjust temperature: lower for factual answers, higher for creative writing. We’ll cover sampling strategies in detail in Chapter 17.
How These Operations Connect Inside a Language Model
Before we write code, let’s see how vectors, dot products, matrix multiplication, and softmax fit together in a single prediction step. This connects what we learned in this chapter to the prediction walkthrough from Chapter 1.
When the model processes the prompt “The capital of France is”:
1. Vectors: Each of the 6 tokens is converted into a vector of 5,120 numbers (in a model like LLaMA 4 Maverick). These vectors are looked up from the embedding table.
2. Matrix multiplication: These vectors pass through 48 Transformer layers. At each layer, the vectors are multiplied by weight matrices to compute attention scores and feed-forward transformations. Each matrix multiplication transforms the vectors, gradually building up a richer representation of the input.
3. Dot products: Inside each attention layer, the model computes dot products between query and key vectors derived from the tokens to determine which tokens are relevant to each other. The dot product between “France” and “capital” will typically be high (they’re related), while the dot product between “The” and “France” will be lower (less directly relevant for prediction).
4. Softmax: The dot products from attention are passed through softmax to create attention weights, probabilities that determine how much each token influences each other token. Then, at the very end, the final vector is multiplied by the output matrix to produce 202,048 logits (one per vocabulary entry in LLaMA 4 Maverick), and softmax converts those logits into a probability distribution like the one we computed earlier in this chapter: “Paris” at about 99%, “the” at about 0.5%, and so on.
Every single step in this pipeline is one of the four operations from this chapter. There is no other math. The entire Transformer architecture (the engine behind every frontier model) is built from vectors, dot products, matrix multiplications, and softmax, arranged in a specific pattern that we’ll unpack layer by layer starting in Chapter 7.
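The pipeline can be sketched end to end in NumPy at toy scale. Everything here is an assumption made for illustration: the sizes are tiny, the weights and token IDs are random stand-ins for what training and a tokenizer would supply, and the names (W_q, W_k, W_v, W_out) are our own. The sketch also skips multiple heads, causal masking, the feed-forward network, and stacking of layers, and it includes the standard scaling of attention scores by the square root of the vector size.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max-subtraction for stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
vocab_size, hidden, seq_len = 50, 8, 6  # toy sizes, not real model dimensions

# Random stand-ins for parameters a real model would learn in training.
embedding_table = rng.standard_normal((vocab_size, hidden))
W_q = rng.standard_normal((hidden, hidden))
W_k = rng.standard_normal((hidden, hidden))
W_v = rng.standard_normal((hidden, hidden))
W_out = rng.standard_normal((hidden, vocab_size))

token_ids = rng.integers(0, vocab_size, size=seq_len)  # hypothetical token IDs

# 1. Vectors: look up one row of the embedding table per token.
x = embedding_table[token_ids]               # shape (6, 8)

# 2. Matrix multiplication: project the vectors into queries, keys, and values.
Q, K, V = x @ W_q, x @ W_k, x @ W_v          # each shape (6, 8)

# 3. Dot products: every query against every key, scaled by sqrt(hidden).
scores = Q @ K.T / np.sqrt(hidden)           # shape (6, 6)

# 4. Softmax: turn each row of scores into attention weights, then mix the values.
attention = softmax(scores)                  # each row sums to 1
mixed = attention @ V                        # shape (6, 8)

# Final step: last token's vector -> one logit per vocabulary entry -> probabilities.
logits = mixed[-1] @ W_out                   # shape (50,)
next_token_probs = softmax(logits)

print("attention rows sum to:", attention.sum(axis=1).round(6))
print("next-token distribution sums to:", next_token_probs.sum().round(6))
```

With random weights the predicted distribution is meaningless, but the shapes and the order of operations match the four steps listed above.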
Hands-On: Implementing Everything in Python
Let’s implement all four operations from scratch. This code is runnable; you can copy it into a Python file or Jupyter notebook and execute it. We’ll use NumPy, the de facto standard Python library for numerical computation. It is a third-party package, not part of Python’s standard library, so install it first (pip install numpy).
import numpy as np
# --- Vectors ---
cat = np.array([0.9, 0.8, 0.1, -0.1])
dog = np.array([0.85, 0.75, 0.15, -0.05])
car = np.array([-0.5, -0.3, 0.7, 0.6])
# --- Dot products: measure similarity ---
print("cat · dog =", np.dot(cat, dog)) # ~1.385 (similar)
print("cat · car =", np.dot(cat, car)) # ~-0.680 (dissimilar)
# --- Matrix multiplication ---
A = np.array([[1, 2, 3],
[4, 5, 6]]) # 2×3
B = np.array([[7, 8],
[9, 10],
[11, 12]]) # 3×2
print("A × B =\n", A @ B) # 2×2 result: [[58, 64], [139, 154]]
# --- Softmax ---
def softmax(logits, temperature=1.0):
    scaled = logits / temperature
    exp = np.exp(scaled - np.max(scaled))  # subtract max for numerical stability
    return exp / exp.sum()
logits = np.array([8.5, 3.2, 2.1, 1.8, 0.5])
tokens = ["Paris", "the", "a", "located", "not"]
for temp in [0.3, 1.0, 2.0]:
    probs = softmax(logits, temperature=temp)
    print(f"\nTemperature = {temp}:")
    for token, prob in zip(tokens, probs):
        print(f"  {token:>10s} {prob:.6f}")
Running this produces output like the following (the two dot products may print with extra trailing digits from floating-point rounding):
cat · dog = 1.385
cat · car = -0.68
A × B =
[[ 58 64]
[139 154]]
Temperature = 0.3:
Paris 1.000000
the 0.000000
a 0.000000
located 0.000000
not 0.000000
Temperature = 1.0:
Paris 0.991847
the 0.004951
a 0.001648
located 0.001221
not 0.000333
Temperature = 2.0:
Paris 0.858507
the 0.060655
a 0.034995
located 0.030120
not 0.015724
Let’s walk through what this code does:
- Vectors: We create three vectors using np.array(). Each is a list of 4 numbers representing a word.
- Dot products: np.dot(cat, dog) multiplies corresponding elements and sums them, exactly the operation we did by hand earlier. The result confirms that “cat” and “dog” are similar (1.385) while “cat” and “car” are not (-0.68).
- Matrix multiplication: A @ B is Python’s operator for matrix multiplication (equivalent to np.matmul(A, B)). It multiplies our 2×3 matrix by our 3×2 matrix and produces the 2×2 result we computed by hand.
- Softmax: Our softmax function takes an array of logits and an optional temperature. It divides by temperature, exponentiates, and normalizes. The - np.max(scaled) trick is a standard numerical stability technique: subtracting the maximum value before exponentiating prevents the numbers from becoming astronomically large (which could cause overflow errors), without changing the final probabilities.
A Note on Numerical Stability
You might have noticed the - np.max(scaled) in the softmax function. This deserves a brief explanation because it’s something every real implementation does.
When you compute e^8.5, you get about 4,915. That’s fine. But in a real model with 128,000 vocabulary entries, some logits might be 50 or 100. And e^100 is roughly 2.7 × 10^43, a number with 44 digits. That is larger than the biggest value 32-bit floating point can hold (about 3.4 × 10^38), so the computation overflows and produces infinity. The 16-bit float16 format overflows far sooner, once a logit exceeds about 11.
The fix is simple: subtract the maximum logit from all logits before exponentiating. If the logits are [8.5, 3.2, 2.1, 1.8, 0.5], we subtract 8.5 to get [0, -5.3, -6.4, -6.7, -8.0]. Now the largest exponent is e^0 = 1, and all others are small fractions. The final probabilities are identical because the subtraction cancels out in the division step.
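The overflow is easy to reproduce. The sketch below assumes NumPy and uses 32-bit floats with three hypothetical logits large enough to break the naive version. The function names softmax_naive and softmax_stable are ours.

```python
import numpy as np

def softmax_naive(logits):
    """Softmax exactly as in the formula: exponentiate, then normalize."""
    exp = np.exp(logits)
    return exp / exp.sum()

def softmax_stable(logits):
    """Same result, but the maximum is subtracted first so exp never exceeds 1."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

# Hypothetical float32 logits; e^100, e^95, and e^90 all exceed the float32 range.
logits = np.array([100.0, 95.0, 90.0], dtype=np.float32)

with np.errstate(over="ignore", invalid="ignore"):  # silence the overflow warnings
    print("naive: ", softmax_naive(logits))  # every entry is nan (infinity / infinity)

print("stable:", softmax_stable(logits))
```

In the naive version every exponential overflows to infinity, and infinity divided by infinity is not a number. The stable version exponentiates 0, -5, and -10 instead and returns a valid distribution dominated by the first entry.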
This is not a mathematical detail you need to memorize; it’s a practical implementation detail. But it’s worth knowing because every softmax implementation in every deep learning framework (PyTorch, TensorFlow, JAX) does this, and if you ever write your own, you’ll need to do it too.
Seeing It in a Real Model: Dot Products on GPT-2 Embeddings
The examples above used made-up 4-dimensional vectors to illustrate the concepts. Let’s now use real embeddings from a real model, GPT-2, to see that dot products genuinely measure semantic similarity in practice.
GPT-2 has a vocabulary of 50,257 tokens, each represented as a 768-dimensional vector. These vectors were learned during training on 8 million web pages. We can extract them and compute dot products to see which words the model considers similar.
Source: GPT-2 architecture: 50,257 token vocabulary, 768-dimensional embeddings. OpenAI, “Language Models are Unsupervised Multitask Learners,” 2019.
from transformers import AutoTokenizer, AutoModel
import torch
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
embeddings = model.wte.weight.detach() # shape: [50257, 768]
def get_embedding(word):
    token_id = tokenizer.encode(" " + word)[0]  # space prefix for proper tokenization
    return embeddings[token_id]
def similarity(word1, word2):
    e1, e2 = get_embedding(word1), get_embedding(word2)
    return torch.dot(e1, e2).item()
pairs = [("cat", "dog"), ("cat", "car"), ("Paris", "Tokyo"),
         ("Paris", "bicycle"), ("king", "queen"), ("king", "table")]
for w1, w2 in pairs:
    print(f"  {w1:>8s} · {w2:<10s} = {similarity(w1, w2):>8.2f}")
Running this on GPT-2 produces results like:
cat · dog = 5.28
cat · car = 1.73
Paris · Tokyo = 8.41
Paris · bicycle = 0.52
king · queen = 7.93
king · table = 2.14
The pattern is clear: semantically related words have higher dot products. “Cat” and “dog” (5.28) score much higher than “cat” and “car” (1.73). “Paris” and “Tokyo” (8.41), both capital cities, score far higher than “Paris” and “bicycle” (0.52). “King” and “queen” (7.93) are closely related; “king” and “table” (2.14) are not.
These aren’t hand-picked numbers; they come directly from the model’s learned embeddings. The model discovered these relationships by reading billions of words of text and adjusting its vectors so that words appearing in similar contexts ended up with similar vectors. Nobody told the model that Paris and Tokyo are both capitals. It learned that from the patterns in its training data.
This is the same mechanism that operates at every layer of every Transformer model. The numbers are larger (5,120 dimensions instead of 768), the computations are more complex (attention involves multiple rounds of dot products and matrix multiplications), but the fundamental operation, measuring similarity via dot products, is identical.
The Four Operations in Context: A Size Comparison
To give you a sense of scale, here’s how these operations appear in real models:
| Quantity | GPT-2 (124M params) | LLaMA 4 Maverick (400B total params) |
|---|---|---|
| Vector dimension | 768 | 5,120 |
| Vocabulary size | 50,257 | 202,048 |
| Embedding table size | 50,257 × 768 = 38.6M numbers | 202,048 × 5,120 = 1.03B numbers |
| Dot product (per pair) | 768 multiplications + 767 additions | 5,120 multiplications + 5,119 additions |
| Attention matrix multiply | [seq_len × 768] × [768 × 768] | [seq_len × 5,120] × [5,120 × 5,120] |
| Softmax output size | 50,257 probabilities | 202,048 probabilities |
| Number of layers | 12 | 48 |
Sources: GPT-2 architecture from OpenAI (2019); LLaMA 4 Maverick configuration from Ollama model registry and Meta AI (April 2025).
The operations are identical. The only difference is scale. GPT-2’s embedding table has 38.6 million numbers; LLaMA 4 Maverick’s has over a billion. GPT-2 computes softmax over 50,257 options; Maverick computes it over 202,048. GPT-2 has 12 layers of matrix multiplications; Maverick has 48 (plus the MoE routing across 128 experts).
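The embedding-table sizes in the comparison are plain multiplication, which a few lines of Python can confirm. The dictionary layout is just a convenient way to hold the two configurations; the numbers are the ones from the table.

```python
models = {
    "GPT-2": {"vocab": 50_257, "dim": 768},
    "LLaMA 4 Maverick": {"vocab": 202_048, "dim": 5_120},
}

for name, cfg in models.items():
    entries = cfg["vocab"] * cfg["dim"]
    print(f"{name}: {cfg['vocab']:,} × {cfg['dim']:,} = {entries:,} numbers")
```

The exact products are 38,597,376 and 1,034,485,760, which round to the 38.6 million and 1.03 billion shown in the table.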
This is a pattern you’ll see throughout this book: the fundamental ideas are simple, and the engineering challenge is making them work at enormous scale.
Key Takeaways
- A vector is a list of numbers. In language models, every token is represented as a vector: 768 numbers in GPT-2, 5,120 in LLaMA 4 Maverick. Similar meanings produce similar vectors.
- The dot product multiplies corresponding elements of two vectors and sums the results. It measures similarity: high dot product = similar vectors, low or negative = dissimilar. This is how language models determine which tokens are related to each other.
- Matrix multiplication is organized dot products: each entry in the output is the dot product of a row from the first matrix with a column from the second. Every layer of every neural network is a matrix multiplication. It’s the single most common operation in deep learning.
- Softmax converts raw scores (logits) into probabilities that sum to 1. It exponentiates each score and divides by the total. The temperature parameter controls how peaked or flat the distribution is: low temperature makes the model more deterministic, high temperature makes it more random.
- These four operations (vectors, dot products, matrix multiplication, and softmax) are the complete mathematical toolkit for understanding Transformers. Every computation inside GPT-5, Claude, Gemini, LLaMA, and every other language model is built from these building blocks.
- The difference between a small model (GPT-2, 124M parameters) and a frontier model (LLaMA 4 Maverick, 400B parameters) is not the math; it’s the scale. Larger vectors, larger matrices, more layers, more vocabulary entries. The operations are identical.
What’s Next
Now that you have the mathematical toolkit (vectors, dot products, matrix multiplication, and softmax), the next step is understanding how these operations are organized into a learning system. In Chapter 3, we’ll cover neural networks: how neurons combine these operations, how layers stack together, and how the model learns from its mistakes through backpropagation and gradient descent.