Appendix A. Full Attention Math Derivation (with Dimensions Tracked)
Chapter 7 introduced the attention formula and walked through a worked example with small numbers. Chapter 8 extended it to multi-head attention and Grouped Query Attention. This appendix goes deeper. It derives every step of the attention computation with full dimension annotations, works through the backward pass (how gradients flow through attention during training), counts the exact number of floating-point operations, and provides a complete, runnable NumPy implementation that you can step through line by line.
If you want to truly understand what happens inside a Transformer at the mathematical level, this is the reference.
A.1 Notation and Dimensions
Before diving in, let us fix the notation. Every variable in this appendix carries its shape in square brackets so you can track dimensions through every operation.
| Symbol | Meaning | Shape |
|---|---|---|
| n | Sequence length (number of tokens) | scalar |
| d | Model hidden dimension (d_model) | scalar |
| h | Number of query attention heads | scalar |
| h_kv | Number of key/value heads (for GQA) | scalar |
| d_k | Dimension of each query/key head | scalar |
| d_v | Dimension of each value head | scalar |
| X | Input to the attention layer | [n x d] |
| W_Q | Query projection weights | [d x (h * d_k)] |
| W_K | Key projection weights | [d x (h_kv * d_k)] |
| W_V | Value projection weights | [d x (h_kv * d_v)] |
| W_O | Output projection weights | [(h * d_v) x d] |
| Q | All query vectors | [n x h x d_k] |
| K | All key vectors | [n x h_kv x d_k] |
| V | All value vectors | [n x h_kv x d_v] |
| S | Raw attention scores (per head) | [n x n] |
| A | Attention weights after softmax (per head) | [n x n] |
| O | Attention output (per head) | [n x d_v] |
For concrete numbers, we will use two reference models throughout this appendix:
Original Transformer (Vaswani et al., “Attention Is All You Need,” NeurIPS 2017):
- d = 512, h = 8, h_kv = 8, d_k = d_v = 64
LLaMA 4 Maverick (Meta, April 2025):
- d = 5,120, h = 40, h_kv = 8, d_k = d_v = 128, 48 layers
Source: Vaswani et al., 2017, Section 3.2.2. LLaMA 4 Maverick config from HuggingFace Transformers Llama4TextConfig defaults: hidden_size=5120, num_attention_heads=40, num_key_value_heads=8, head_dim=128, num_hidden_layers=48, intermediate_size_mlp=16384, intermediate_size=8192 (MoE expert FFN), use_qk_norm=True, attention_bias=False (confirmed from huggingface.co/docs/transformers/main/model_doc/llama4). The Maverick model overrides interleave_moe_layer_step=2 (alternating dense and MoE layers; the TextConfig default is 1, used by Scout where all layers are MoE), confirmed from the HuggingFace blog: “Llama Maverick uses 128 experts, but MoE and dense layers alternate. Therefore, experts are applied in half of the layers” (github.com/huggingface/blog/blob/main/llama4-release.md).
A.2 Forward Pass: Step-by-Step with Dimensions
Step 1: Linear Projections (X to Q, K, V)
The input X has shape [n x d]. We project it into queries, keys, and values using three weight matrices:
Q_flat = X @ W_Q [n x d] @ [d x (h * d_k)] = [n x (h * d_k)]
K_flat = X @ W_K [n x d] @ [d x (h_kv * d_k)] = [n x (h_kv * d_k)]
V_flat = X @ W_V [n x d] @ [d x (h_kv * d_v)] = [n x (h_kv * d_v)]
Then reshape into per-head views:
Q = reshape(Q_flat) [n x h x d_k]
K = reshape(K_flat) [n x h_kv x d_k]
V = reshape(V_flat) [n x h_kv x d_v]
Concrete dimensions for LLaMA 4 Maverick:
Q_flat = X @ W_Q [n x 5120] @ [5120 x 5120] = [n x 5120]
K_flat = X @ W_K [n x 5120] @ [5120 x 1024] = [n x 1024]
V_flat = X @ W_V [n x 5120] @ [5120 x 1024] = [n x 1024]
Q = reshape(Q_flat) [n x 40 x 128]
K = reshape(K_flat) [n x 8 x 128]
V = reshape(V_flat) [n x 8 x 128]
Notice the asymmetry: W_Q is [5120 x 5120] (26.2 million parameters), while W_K and W_V are each [5120 x 1024] (5.2 million parameters each). This is Grouped Query Attention (Chapter 8): 40 query heads share 8 key/value heads, with each KV head serving a group of 5 query heads. The total parameter count for the three projection matrices is 26.2M + 5.2M + 5.2M = 36.7 million per layer.
For the original Transformer, all three matrices are [512 x 512], so 262,144 parameters each, totaling 786,432 for the three projections.
Note: LLaMA 4 Maverick sets attention_bias=False, so there are no bias vectors in the Q, K, V, or O projections. Vaswani et al. (2017) also write these projections as plain matrix products; several early implementations (for example BERT and GPT-2) added bias terms, but most modern models omit them. The parameter counts above reflect weights only.
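The parameter counts in this step are plain products of the configuration values, so they are easy to reproduce. The following is a minimal sketch; the helper name `attention_projection_params` is hypothetical and only serves this illustration.

```python
def attention_projection_params(d, h, h_kv, d_k, d_v):
    """Weight-only parameter counts for the Q, K, V projections (no biases)."""
    return {
        "W_Q": d * h * d_k,
        "W_K": d * h_kv * d_k,
        "W_V": d * h_kv * d_v,
    }

# Reference configurations from Section A.1
maverick = attention_projection_params(d=5120, h=40, h_kv=8, d_k=128, d_v=128)
original = attention_projection_params(d=512, h=8, h_kv=8, d_k=64, d_v=64)

for name, counts in [("LLaMA 4 Maverick", maverick), ("Original Transformer", original)]:
    total = sum(counts.values())
    print(f"{name}: {counts} -> total {total:,}")
```

The exact totals are 36,700,160 and 786,432, which round to the 36.7 million and 786,432 quoted above.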
Step 2: Grouped Query Attention (Expanding K, V)
In standard Multi-Head Attention (MHA), h_kv = h, so each query head has its own key and value head. In GQA, h_kv < h, so we need to broadcast (repeat) the K and V tensors so that each query head can find its corresponding KV head.
The group size is g = h / h_kv. For LLaMA 4 Maverick, g = 40 / 8 = 5.
K_expanded = repeat(K, groups=g) [n x h_kv x d_k] -> [n x h x d_k]
V_expanded = repeat(V, groups=g) [n x h_kv x d_v] -> [n x h x d_v]
For LLaMA 4 Maverick:
K_expanded [n x 8 x 128] -> [n x 40 x 128]
V_expanded [n x 8 x 128] -> [n x 40 x 128]
This repeat operation does not create new parameters. It simply tells each group of 5 query heads to use the same K and V vectors. In practice, efficient implementations avoid the explicit copy by indexing into the original K and V tensors.
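To make the "copy versus no copy" distinction concrete, the sketch below computes attention scores twice: once with an explicit `np.repeat`, and once by grouping the query heads and letting NumPy broadcasting share each KV head. It assumes a heads-first layout ([heads x n x dim], the layout used in Section A.7) and random data; it is an illustration, not how any particular library implements GQA.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, h_kv, d_k = 6, 40, 8, 128
g = h // h_kv  # group size: 5 query heads per KV head

Q = rng.standard_normal((h, n, d_k))      # [h x n x d_k]
K = rng.standard_normal((h_kv, n, d_k))   # [h_kv x n x d_k]

# Explicit copy: KV head j serves query heads j*g ... j*g + g - 1
K_exp = np.repeat(K, g, axis=0)           # [h x n x d_k]
S_copy = Q @ K_exp.transpose(0, 2, 1)     # [h x n x n]

# No copy: view Q as [h_kv x g x n x d_k] and broadcast K over the group axis
Q_grouped = Q.reshape(h_kv, g, n, d_k)
K_t = K[:, None].transpose(0, 1, 3, 2)    # [h_kv x 1 x d_k x n]
S_bcast = (Q_grouped @ K_t).reshape(h, n, n)

print("scores match:", np.allclose(S_copy, S_bcast))
```

Both paths produce the same scores; the second never materializes the [h x n x d_k] expanded key tensor.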
Step 3: Compute Attention Scores (Q * K^T)
For each head independently, compute the dot product between every query and every key:
S = Q @ K^T [n x d_k] @ [d_k x n] = [n x n]
The entry S[i, j] is the dot product of query i with key j. It measures how much token i should attend to token j.
With all heads in a single batched operation:
S = Q @ K_expanded^T [h x n x d_k] @ [h x d_k x n] = [h x n x n]
For LLaMA 4 Maverick with a 4,096-token sequence:
S = Q @ K_expanded^T [40 x 4096 x 128] @ [40 x 128 x 4096] = [40 x 4096 x 4096]
That is 40 score matrices, each with 4,096 x 4,096 = 16.8 million entries. Total: 671 million entries across all heads.
Step 4: Scale
Divide every score by the square root of d_k to prevent softmax saturation (Chapter 7):
S_scaled = S / sqrt(d_k) [h x n x n]
For d_k = 128: sqrt(128) = 11.3137…
Why this specific scaling factor? If the elements of Q and K are independent random variables with mean 0 and variance 1, then the dot product Q[i] . K[j] is a sum of d_k products of independent random variables. Each product has mean 0 and variance Var(q) * Var(k) = 1, and the variances of independent terms add:
Var(Q[i] . K[j]) = d_k * Var(q) * Var(k) = d_k * 1 * 1 = d_k
So the standard deviation of the raw dot product is sqrt(d_k). Dividing by sqrt(d_k) normalizes the variance back to approximately 1, keeping the softmax inputs in a well-behaved range.
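The variance argument can be checked empirically. This sketch draws standard normal queries and keys (one distribution that satisfies the mean-0, variance-1 assumption; the sample count is an arbitrary choice) and measures the variance of their dot products before and after scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 128
num_samples = 20_000

# Independent entries with mean 0 and variance 1, as the argument assumes
q = rng.standard_normal((num_samples, d_k))
k = rng.standard_normal((num_samples, d_k))

raw = np.sum(q * k, axis=1)        # one dot product per sample
scaled = raw / np.sqrt(d_k)

print(f"Var(raw)    = {raw.var():.1f}  (theory: {d_k})")
print(f"Var(scaled) = {scaled.var():.3f}  (theory: 1)")
```

The measured values should land close to d_k and 1 respectively, up to sampling noise.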
QK Normalization: LLaMA 4 Maverick goes further with use_qk_norm=True, applying RMSNorm (Chapter 10) to the Q and K vectors before computing the dot product. This provides more stable training at scale by ensuring the query and key vectors have consistent magnitudes regardless of layer depth. The scaling by 1/sqrt(d_k) still applies after normalization.
Step 5: Causal Mask
For decoder-only models (GPT, LLaMA, Claude, DeepSeek, Gemini, Mistral), each token can only attend to itself and earlier tokens. We enforce this by setting future positions to negative infinity:
for i in range(n):
    for j in range(n):
        if j > i:
            S_scaled[i, j] = -infinity
In matrix form, this applies a lower-triangular mask:
M[i, j] = 0 if j <= i, else -infinity
S_masked = S_scaled + M [h x n x n]
When these negative-infinity values pass through softmax, they become exactly zero.
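As a small illustration of the mask, the sketch below builds M with `np.triu` for n = 4 and applies it to a toy score matrix whose entries are all equal (an assumption chosen so the result is easy to read: token i should spread its attention uniformly over tokens 0..i).

```python
import numpy as np

n = 4
# Strict upper triangle (j > i) marks the future positions
future = np.triu(np.ones((n, n), dtype=bool), k=1)
M = np.where(future, -np.inf, 0.0)          # additive mask, [n x n]

S_scaled = np.zeros((n, n))                 # toy scores: all equal
S_masked = S_scaled + M

# Row-wise softmax; exp(-inf) evaluates to exactly 0.0
E = np.exp(S_masked - S_masked.max(axis=-1, keepdims=True))
A = E / E.sum(axis=-1, keepdims=True)
print(A)
```

Row 0 puts all weight on token 0, row 1 splits it evenly between tokens 0 and 1, and so on; every entry above the diagonal is exactly zero.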
Step 6: Softmax (Row-wise)
Apply softmax independently to each row of the masked score matrix:
A[i, j] = exp(S_masked[i, j]) / sum_over_k(exp(S_masked[i, k]))
Shape: [h x n x n]. Each row of A sums to 1. Each entry A[i, j] is the attention weight: the fraction of token j’s value that gets mixed into token i’s output.
Numerical stability: In practice, we subtract the maximum value in each row before exponentiating to prevent overflow:
S_stable = S_masked - max(S_masked, axis=-1, keepdims=True)
A = exp(S_stable) / sum(exp(S_stable), axis=-1, keepdims=True)
This does not change the result (the max cancels out in the ratio) but prevents the exponentials from producing infinity for large positive scores.
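The effect of the max subtraction is easiest to see with deliberately large logits. In this sketch the function names `softmax_naive` and `softmax_stable` are illustrative, and the logits near 1000 are an artificial choice that forces float64 overflow.

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)
    return e / e.sum()

def softmax_stable(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1000.0, 999.0, 998.0])

# exp(1000) overflows to inf, and inf / inf evaluates to nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(z)

stable = softmax_stable(z)
print("naive: ", naive)
print("stable:", stable)
```

The stable version returns the same distribution as softmax([2, 1, 0]), which is what translation invariance predicts.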
Step 7: Weighted Sum of Values
Multiply the attention weights by the value vectors to produce the output for each head:
O = A @ V_expanded [h x n x n] @ [h x n x d_v] = [h x n x d_v]
For LLaMA 4 Maverick with n = 4,096:
O = A @ V_expanded [40 x 4096 x 4096] @ [40 x 4096 x 128] = [40 x 4096 x 128]
Each row of O is a weighted combination of all value vectors, where the weights come from the attention distribution. Token i’s output is:
O[i] = sum over j of A[i, j] * V[j] shape: [d_v]
Step 8: Concatenate Heads and Output Projection
Concatenate the outputs from all heads along the head dimension:
O_concat = concat(O_head_0, O_head_1, ..., O_head_{h-1})
         = reshape(O) [n x (h * d_v)]
Then project back to the model dimension:
output = O_concat @ W_O [n x (h * d_v)] @ [(h * d_v) x d] = [n x d]
For LLaMA 4 Maverick:
O_concat = reshape(O) [4096 x 5120]
output = O_concat @ W_O [4096 x 5120] @ [5120 x 5120] = [4096 x 5120]
W_O has 5120 x 5120 = 26.2 million parameters. The total parameter count for the full multi-head attention block (W_Q + W_K + W_V + W_O) is:
LLaMA 4 Maverick: 26.2M + 5.2M + 5.2M + 26.2M = 62.9 million per layer
Original Transformer: 262K + 262K + 262K + 262K = 1.05 million per layer
The Complete Formula (One Line)
Putting all steps together:
MultiHead(X) = concat(head_0, ..., head_{h-1}) @ W_O
where head_i = softmax((X @ W_Q_i) @ (X @ W_K_g)^T / sqrt(d_k) + M) @ (X @ W_V_g)
Here g = floor(i / group_size) is the KV group index for query head i (heads are indexed from 0), and M is the causal mask.
Or in the compact notation from Vaswani et al. (2017), for standard MHA:
Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V
MultiHead(Q, K, V) = concat(head_1, ..., head_h) @ W_O
where head_i = Attention(X @ W_Q_i, X @ W_K_i, X @ W_V_i)
Source: Vaswani et al., 2017, Equations 1 and 2.
A.3 The Softmax Function: Properties and Derivatives
The softmax function is the only nonlinearity inside the attention mechanism (the rest is linear algebra). Understanding its derivative is essential for understanding how gradients flow backward through attention during training.
Definition
Given a vector z of length n, softmax produces a probability distribution:
softmax(z)_i = exp(z_i) / sum_j(exp(z_j))
Properties:
- Every output is strictly positive: softmax(z)_i > 0
- Outputs sum to 1: sum_i(softmax(z)_i) = 1
- Preserves rank order: if z_i > z_j, then softmax(z)_i > softmax(z)_j
- Translation invariant: softmax(z + c) = softmax(z) for any constant c
The Jacobian Matrix
The softmax function maps a vector to a vector, so its derivative is a Jacobian matrix J of shape [n x n], where:
J[i, j] = d(softmax(z)_i) / d(z_j)
There are two cases:
Case 1: i = j (diagonal entries)
d(s_i) / d(z_i) = s_i * (1 - s_i)
where s_i = softmax(z)_i. This is the familiar sigmoid-like derivative. When s_i is close to 0 or 1, the gradient is near zero (saturation). When s_i = 0.5, the gradient is at its maximum of 0.25.
Case 2: i != j (off-diagonal entries)
d(s_i) / d(z_j) = -s_i * s_j
This is always negative: increasing z_j decreases s_i (because the denominator grows, diluting s_i’s share).
Derivation
Let us derive both cases. Start with the definition:
s_i = exp(z_i) / sum_k(exp(z_k))
Let D = sum_k(exp(z_k)) be the denominator. Then s_i = exp(z_i) / D.
Case 1 (i = j): Apply the quotient rule:
d(s_i)/d(z_i) = [exp(z_i) * D - exp(z_i) * exp(z_i)] / D^2
              = [exp(z_i) / D] * [1 - exp(z_i) / D]
              = s_i * (1 - s_i)
Case 2 (i != j): The numerator exp(z_i) does not depend on z_j, so:
d(s_i)/d(z_j) = exp(z_i) * d(1/D)/d(z_j)
              = exp(z_i) * [-exp(z_j) / D^2]
              = -[exp(z_i) / D] * [exp(z_j) / D]
              = -s_i * s_j
Compact Matrix Form
Both cases can be written as a single expression using the Kronecker delta (delta_ij = 1 if i=j, 0 otherwise):
J[i, j] = s_i * (delta_ij - s_j)
Or equivalently, in matrix notation:
J = diag(s) - s @ s^T
where diag(s) is a diagonal matrix with the softmax outputs on the diagonal, and s @ s^T is the outer product of the softmax output vector with itself.
For a concrete example, if s = [0.7, 0.2, 0.1], the Jacobian is:
J = diag([0.7, 0.2, 0.1]) - [0.7, 0.2, 0.1]^T @ [0.7, 0.2, 0.1]
  = [[0.7, 0,   0  ]     [[0.49, 0.14, 0.07]
     [0,   0.2, 0  ]  -   [0.14, 0.04, 0.02]
     [0,   0,   0.1]]     [0.07, 0.02, 0.01]]
  = [[ 0.21, -0.14, -0.07]
     [-0.14,  0.16, -0.02]
     [-0.07, -0.02,  0.09]]
You can verify that each row sums to zero: adding the same constant to every logit leaves the softmax unchanged (translation invariance), so the derivatives along each row must cancel. Each column also sums to zero: increasing one logit by epsilon must decrease the total probability of the other classes by exactly as much as it raises its own, since probabilities always sum to 1.
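The closed-form Jacobian can be checked against central finite differences. This is a minimal sketch: the logits are recovered as log(s), which works here because s already sums to 1, and the step size `eps` is an arbitrary small value.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(s):
    """J = diag(s) - s s^T for a softmax output vector s."""
    return np.diag(s) - np.outer(s, s)

s = np.array([0.7, 0.2, 0.1])
z = np.log(s)                      # logits that reproduce s
J = softmax_jacobian(softmax(z))

# Central finite differences: column j holds d(softmax)/d(z_j)
eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    J_num[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.round(J, 2))
print("max abs difference:", np.abs(J - J_num).max())
```

The analytic matrix reproduces the worked example, and the finite-difference estimate agrees with it to within numerical noise.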
Why This Matters for Attention
In the attention mechanism, softmax is applied row-wise to the scaled score matrix S. During backpropagation, the gradient of the loss with respect to the scores must pass through this Jacobian. For each row i of the attention weight matrix A:
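dL/dS[i, :] = (J_i) @ (dL/dA[i, :])
where J_i is the Jacobian of softmax for row i. Taken literally, this is an [n x n] matrix-vector product for each of the n rows, giving O(n^2) work per row and O(n^3) total for the full backward pass through softmax. Section A.4 shows that the structure of J_i reduces this to O(n) per row, so implementations never build the Jacobian explicitly. Even then, a standard implementation must keep the [n x n] attention matrix for the backward pass. For long sequences, this is a significant cost, which is one reason FlashAttention (Chapter 20) fuses the softmax computation with the surrounding matrix multiplications to avoid materializing the full attention matrix.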
A.4 Backward Pass: Gradients Through Attention
Training a Transformer requires computing gradients of the loss with respect to every parameter. Here we trace the gradient flow backward through the attention mechanism, step by step. We assume we have already computed dL/d(output), the gradient of the loss with respect to the attention block’s output, with shape [n x d].
Step 8 (backward): Output Projection
Forward: output = O_concat @ W_O
Gradients:
dL/dO_concat = dL/d(output) @ W_O^T [n x d] @ [d x (h*d_v)] = [n x (h*d_v)]
dL/dW_O = O_concat^T @ dL/d(output) [(h*d_v) x n] @ [n x d] = [(h*d_v) x d]
Then reshape dL/dO_concat back to [h x n x d_v] to get per-head gradients.
Step 7 (backward): Weighted Sum (A @ V)
Forward: O = A @ V (per head)
Gradients (per head):
dL/dA = dL/dO @ V^T [n x d_v] @ [d_v x n] = [n x n]
dL/dV = A^T @ dL/dO [n x n] @ [n x d_v] = [n x d_v]
The gradient with respect to A tells us how the loss changes when we change the attention weights. The gradient with respect to V tells us how the loss changes when we change the value vectors.
Step 6 (backward): Softmax
Forward: A = softmax(S_masked, axis=-1)
This is the trickiest step. For each row i, using the Jacobian derived in Section A.3:
dL/dS_masked[i, :] = A[i, :] * (dL/dA[i, :] - sum_j(A[i, j] * dL/dA[i, j]))
This is a more efficient form than the full Jacobian multiplication. Let us derive it. Define:
g = dL/dA[i, :] (the incoming gradient for row i)
a = A[i, :] (the softmax output for row i)
Then:
dL/dS[i, :] = J @ g
            = (diag(a) - a @ a^T) @ g
            = diag(a) @ g - a @ (a^T @ g)
            = a * g - a * (a . g)
            = a * (g - sum(a * g))
where (a . g) = sum_j(a_j * g_j) is the dot product. This avoids constructing the full [n x n] Jacobian matrix and reduces the computation to O(n) per row, or O(n^2) total across all rows (compared to O(n^3) for the naive Jacobian approach).
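The equivalence of the two forms can be confirmed numerically. The sketch below uses random scores and a random stand-in for dL/dA (both assumptions for illustration), builds the per-row Jacobians the slow way, and compares against the vectorized shortcut.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
S = rng.standard_normal((n, n))
dA = rng.standard_normal((n, n))   # stand-in for dL/dA

E = np.exp(S - S.max(axis=-1, keepdims=True))
A = E / E.sum(axis=-1, keepdims=True)

# Naive: build each row's [n x n] Jacobian, then multiply
dS_naive = np.stack([
    (np.diag(A[i]) - np.outer(A[i], A[i])) @ dA[i] for i in range(n)
])

# Efficient: a * (g - sum(a * g)), vectorized over all rows
dS_fast = A * (dA - np.sum(A * dA, axis=-1, keepdims=True))

print("forms agree:", np.allclose(dS_naive, dS_fast))
```

A useful side check is that every row of the result sums to zero, which follows from a . (g - a . g) = 0 when the entries of a sum to 1.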
Step 5 (backward): Causal Mask
The causal mask sets certain positions to negative infinity. In the backward pass, the gradient at those masked positions is zero (since exp(-infinity) = 0, and the softmax output at those positions is exactly zero, so the gradient contribution is zero). No additional computation is needed; the softmax backward step already handles this correctly because A[i, j] = 0 for masked positions.
Step 4 (backward): Scaling
Forward: S_scaled = S / sqrt(d_k)
Gradient:
dL/dS = dL/dS_scaled / sqrt(d_k) [h x n x n]
The scaling factor simply divides the gradient by the same constant.
Step 3 (backward): Score Computation (Q @ K^T)
Forward: S = Q @ K^T (per head)
Gradients (per head):
dL/dQ = dL/dS @ K [n x n] @ [n x d_k] = [n x d_k]
dL/dK = dL/dS^T @ Q [n x n] @ [n x d_k] = [n x d_k]
Note the transpose on dL/dS for the key gradient: this is because K appears transposed in the forward pass.
Steps 2 and 1 (backward): GQA Reduction and Linear Projections
For GQA, the gradients for K and V from each query head in a group are summed to produce the gradient for the shared KV head:
dL/dK_group_g = sum over heads i in group g of dL/dK_head_i
Then the projection gradients:
dL/dX (from Q path) = dL/dQ_flat @ W_Q^T [n x (h*d_k)] @ [(h*d_k) x d] = [n x d]
dL/dW_Q = X^T @ dL/dQ_flat [d x n] @ [n x (h*d_k)] = [d x (h*d_k)]
And similarly for K and V paths. The total gradient on X is the sum of contributions from all three paths:
dL/dX = dL/dX_from_Q + dL/dX_from_K + dL/dX_from_V
This is the gradient that flows to the previous layer (or to the residual connection).
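The backward formulas for Steps 7 through 3 can be validated with a numerical gradient check. The sketch below covers a single head only (no projections, no GQA reduction) and assumes a hypothetical scalar loss L = sum(O * G) with a random G, so that dL/dO = G. The function names are illustrative.

```python
import numpy as np

def forward(Q, K, V):
    """Single-head causal attention; returns the output O and weights A."""
    n, d_k = Q.shape
    S = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    S = np.where(future, -np.inf, S)
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    A = E / E.sum(axis=-1, keepdims=True)
    return A @ V, A

def backward(dO, Q, K, V, A):
    """Analytic gradients, following Steps 7, 6/5, 4, and 3 above."""
    d_k = Q.shape[1]
    dA = dO @ V.T                                                  # Step 7
    dV = A.T @ dO
    dS_scaled = A * (dA - np.sum(A * dA, axis=-1, keepdims=True))  # Steps 6 and 5
    dS = dS_scaled / np.sqrt(d_k)                                  # Step 4
    return dS @ K, dS.T @ Q, dV                                    # Step 3: dQ, dK

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 4, 3
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
G = rng.standard_normal((n, d_v))   # defines the loss L = sum(O * G)

O, A = forward(Q, K, V)
dQ, dK, dV = backward(G, Q, K, V, A)

def numerical_grad(param, eps=1e-6):
    """Central differences of L with respect to one input, perturbed in place."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        plus = np.sum(forward(Q, K, V)[0] * G)
        param[idx] = old - eps
        minus = np.sum(forward(Q, K, V)[0] * G)
        param[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

for name, analytic, param in [("dQ", dQ, Q), ("dK", dK, K), ("dV", dV, V)]:
    err = np.abs(analytic - numerical_grad(param)).max()
    print(f"{name}: max abs error = {err:.2e}")
```

The masked positions need no special handling in `backward`: their attention weights are exactly zero, so the softmax step already assigns them zero gradient.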
A.5 FLOPs Count: Exact Arithmetic
Understanding the computational cost of attention is essential for estimating training time, choosing hardware, and understanding why certain architectural decisions (GQA, MQA, MLA) exist. Here we count the exact number of floating-point operations (FLOPs) for each step.
Convention: one matrix multiplication of [a x b] @ [b x c] requires 2abc FLOPs (b multiplications and b-1 additions per output element, approximated as 2b operations per element, times a*c output elements).
Per-Head FLOPs (Single Head, Sequence Length n)
| Step | Operation | Shape | FLOPs |
|---|---|---|---|
| Q projection | X @ W_Q_head | [n x d] @ [d x d_k] | 2 * n * d * d_k |
| K projection | X @ W_K_head | [n x d] @ [d x d_k] | 2 * n * d * d_k |
| V projection | X @ W_V_head | [n x d] @ [d x d_v] | 2 * n * d * d_v |
| Q @ K^T | score computation | [n x d_k] @ [d_k x n] | 2 * n^2 * d_k |
| Scale | element-wise divide | [n x n] | n^2 |
| Softmax | exp, sum, divide | [n x n] | ~5 * n^2 |
| A @ V | weighted sum | [n x n] @ [n x d_v] | 2 * n^2 * d_v |
| Output proj | O_concat @ W_O_head | [n x d_v] @ [d_v x d] | 2 * n * d * d_v |
Total FLOPs for Full Multi-Head Attention
For h query heads and h_kv KV heads (GQA), the projections are computed once for all heads:
Q projection: 2 * n * d * (h * d_k) = 2 * n * d^2 (when h * d_k = d)
K projection: 2 * n * d * (h_kv * d_k)
V projection: 2 * n * d * (h_kv * d_v)
O projection: 2 * n * d * (h * d_v) = 2 * n * d^2 (when h * d_v = d)
The attention core (scores + softmax + weighted sum) runs per head:
Per head: 2 * n^2 * d_k + ~5 * n^2 + n^2 + 2 * n^2 * d_v
          = n^2 * (2*d_k + 2*d_v + 6)
          ≈ n^2 * (4 * d_k + 6) (when d_k = d_v)
All h heads: h * n^2 * (4 * d_k + 6)
For the standard case where h * d_k = h * d_v = d and h_kv = h (MHA):
Projections: 4 * (2 * n * d^2) = 8 * n * d^2
Attention core: h * n^2 * (4*d_k + 6)
                ≈ 4 * n^2 * d + 6 * h * n^2 (since h * d_k = d)
Total ≈ 8 * n * d^2 + 4 * n^2 * d
The first term (8nd^2) dominates when n < 2d (short sequences relative to model width). The second term (4n^2d) dominates when n > 2d (long sequences). For an MHA model with d = 5,120, the crossover point is n = 2 * 5,120 = 10,240 tokens. LLaMA 4 Maverick uses GQA, so its projections cost only 2 * n * d * (5120 + 1024 + 1024 + 5120) = 4.8 * n * d^2, and its crossover moves down to n = 1.2 * d = 6,144 tokens. Below the crossover, the linear projections dominate. Above it, the quadratic attention core dominates.
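The crossover can be located numerically by evaluating both terms for a few sequence lengths. This is a sketch using the counting convention above; the helper `attention_flops` is hypothetical, and the GQA row simply plugs h_kv = 8 into the same formulas to show how smaller K/V projections shift the balance.

```python
def attention_flops(n, d, h, h_kv, d_k, d_v):
    """Forward FLOPs for one attention block, counting 2abc per matmul."""
    projections = 2 * n * d * (h * d_k + h_kv * d_k + h_kv * d_v + h * d_v)
    core = h * n * n * (2 * d_k + 2 * d_v + 6)  # scores + scale/softmax + weighted sum
    return projections, core

d, h, d_k = 5120, 40, 128
for label, h_kv in [("MHA", 40), ("GQA", 8)]:
    for n in (4096, 6144, 10240, 32768):
        proj, core = attention_flops(n, d, h, h_kv, d_k, d_k)
        print(f"{label} n={n:>6}: core / projections = {core / proj:.2f}")
```

A ratio near 1.00 marks the crossover: it appears at n = 10,240 for the MHA configuration and at n = 6,144 once the K and V projections shrink to 8 heads.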
GQA Savings
With GQA (h_kv < h), the K and V projections shrink:
MHA K projection: 2 * n * d * d = 2 * n * d^2
GQA K projection: 2 * n * d * (h_kv * d_k)
For LLaMA 4 Maverick (h_kv = 8, d_k = 128):
MHA K projection: 2 * n * 5120 * 5120 = 52.4M * n FLOPs
GQA K projection: 2 * n * 5120 * 1024 = 10.5M * n FLOPs
That is a 5x reduction in K and V projection FLOPs. The attention core computation is unchanged (it still runs per query head), but the KV cache memory is reduced by 5x, which is the primary motivation for GQA (Chapter 8, Chapter 18).
Full Layer FLOPs (Attention + FFN)
For a complete Transformer layer with SwiGLU FFN (Chapter 9), the total FLOPs per layer are approximately:
Attention projections: ~8 * n * d^2 (for MHA; less for GQA)
Attention core: ~4 * n^2 * d (quadratic in sequence length)
FFN (SwiGLU): ~16 * n * d^2 (gate, up, down projections)
Total per layer ≈ 24 * n * d^2 + 4 * n^2 * d
The FFN accounts for roughly 2/3 of the per-layer FLOPs (16/24 = 67%), and the attention projections account for roughly 1/3 (8/24 = 33%). The quadratic attention core is negligible for short sequences but dominates for long ones.
Source: FLOPs breakdown analysis from Kipply’s “Transformer Inference Arithmetic” (kipp.ly/blog/transformer-inference-arithmetic, March 2022) and Finbarr Timbers’ “Where do LLMs spend their FLOPS?” (artfintel.com/p/where-do-llms-spend-their-flops, January 2024), both confirming the 24d^2 per-layer approximation for standard decoder models.
Concrete Example: LLaMA 4 Maverick, Single Dense Layer, n = 4,096
LLaMA 4 Maverick alternates between dense layers (using intermediate_size_mlp=16384) and MoE layers (using intermediate_size=8192 per expert, with 128 routed experts plus one shared expert). The example below shows a dense layer, which has the larger FFN.
Attention projections (GQA):
W_Q: 2 * 4096 * 5120 * 5120 = 214.7 billion FLOPs
W_K: 2 * 4096 * 5120 * 1024 = 42.9 billion FLOPs
W_V: 2 * 4096 * 5120 * 1024 = 42.9 billion FLOPs
W_O: 2 * 4096 * 5120 * 5120 = 214.7 billion FLOPs
Subtotal: 515.4 billion FLOPs
Attention core (40 heads):
Q @ K^T: 40 * 2 * 4096^2 * 128 = 171.8 billion FLOPs
A @ V: 40 * 2 * 4096^2 * 128 = 171.8 billion FLOPs
Softmax + scale: ~40 * 6 * 4096^2 = 4.0 billion FLOPs
Subtotal: 347.6 billion FLOPs
FFN (SwiGLU, intermediate_size_mlp = 16384):
Gate: 2 * 4096 * 5120 * 16384 = 687.2 billion FLOPs
Up: 2 * 4096 * 5120 * 16384 = 687.2 billion FLOPs
Down: 2 * 4096 * 16384 * 5120 = 687.2 billion FLOPs
Subtotal: 2,061.6 billion FLOPs
Total per layer: ~2,925 billion FLOPs (2.9 TFLOPs)
The FFN dominates at 70.5% of the layer’s compute. The attention projections account for 17.6%, and the quadratic attention core accounts for 11.9%. At this sequence length (4,096 tokens), the quadratic cost is still manageable. At 131,072 tokens (LLaMA 4 Maverick’s max_position_embeddings), the attention core would grow by (131072/4096)^2 = 1,024x, making it the dominant cost.
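The breakdown above is straightforward to reproduce. The sketch below hard-codes the dense-layer configuration quoted in this section and applies the 2abc convention; variable names such as `d_ff` are shorthand chosen for this example.

```python
n, d = 4096, 5120
h, h_kv, d_k, d_v = 40, 8, 128, 128
d_ff = 16384                        # intermediate_size_mlp (dense layer)

proj = 2 * n * d * (h * d_k + h_kv * d_k + h_kv * d_v + h * d_v)
core = h * n * n * (2 * d_k + 2 * d_v + 6)   # Q @ K^T, A @ V, softmax + scale
ffn = 3 * 2 * n * d * d_ff                   # gate, up, and down projections
total = proj + core + ffn

for name, flops in [("projections", proj), ("attention core", core), ("FFN", ffn)]:
    share = 100 * flops / total
    print(f"{name:>15}: {flops / 1e9:8.1f} billion FLOPs ({share:.1f}%)")
print(f"{'total':>15}: {total / 1e9:8.1f} billion FLOPs")
```

Changing `n` in this snippet shows how quickly the attention core overtakes the other two terms as the sequence grows.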
A.6 Dimension Tracking Through a Complete Layer
To make the dimension flow completely concrete, here is every tensor shape through one full attention layer of LLaMA 4 Maverick, assuming a batch size of 1 and sequence length of 4,096:
Input:
X [4096 x 5120]
Q/K/V Projections:
X @ W_Q [4096 x 5120] @ [5120 x 5120] = [4096 x 5120]
reshape to Q [4096 x 40 x 128]
transpose to [40 x 4096 x 128]
X @ W_K [4096 x 5120] @ [5120 x 1024] = [4096 x 1024]
reshape to K [4096 x 8 x 128]
transpose to [8 x 4096 x 128]
X @ W_V [4096 x 5120] @ [5120 x 1024] = [4096 x 1024]
reshape to V [4096 x 8 x 128]
transpose to [8 x 4096 x 128]
GQA Expansion (broadcast, no copy):
K_expanded [40 x 4096 x 128] (each group of 5 Q heads shares 1 KV head)
V_expanded [40 x 4096 x 128]
RoPE (Chapter 6):
Apply rotary embeddings to Q and K (shape unchanged)
Q_rotated [40 x 4096 x 128]
K_rotated [40 x 4096 x 128]
QK Normalization (use_qk_norm=True):
Apply RMSNorm to Q and K per head (shape unchanged)
Q_normed [40 x 4096 x 128]
K_normed [40 x 4096 x 128]
Score Computation:
S = Q @ K^T [40 x 4096 x 128] @ [40 x 128 x 4096] = [40 x 4096 x 4096]
Scaling:
S_scaled = S / 11.3137 [40 x 4096 x 4096]
Causal Mask:
S_masked [40 x 4096 x 4096] (upper triangle set to -inf)
Softmax:
A = softmax(S_masked) [40 x 4096 x 4096] (each row sums to 1)
Weighted Sum:
O = A @ V_expanded [40 x 4096 x 4096] @ [40 x 4096 x 128] = [40 x 4096 x 128]
Concatenate Heads:
O_concat [4096 x 5120] (reshape from [40 x 4096 x 128])
Output Projection:
output = O_concat @ W_O [4096 x 5120] @ [5120 x 5120] = [4096 x 5120]
Residual Connection (Chapter 10):
final = X + output [4096 x 5120]
Every intermediate tensor’s shape is fully determined by the model’s configuration. No shape is ambiguous. This is one of the strengths of the Transformer architecture: the entire computation is a fixed sequence of matrix multiplications and element-wise operations with predictable shapes.
A.7 Complete NumPy Implementation
Here is a complete, runnable implementation of multi-head attention with GQA. It covers the forward pass and a few sanity checks (tensor shapes, causal masking, row sums, and parameter counts). Every line is annotated with the tensor shape.
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x_max = np.max(x, axis=axis, keepdims=True)             # [... x 1]
    exp_x = np.exp(x - x_max)                               # [... x n]
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)  # [... x n]


def multi_head_attention_gqa(
    X,            # [n x d]             Input embeddings
    W_Q,          # [d x (h*d_k)]       Query projection
    W_K,          # [d x (h_kv*d_k)]    Key projection
    W_V,          # [d x (h_kv*d_v)]    Value projection
    W_O,          # [(h*d_v) x d]       Output projection
    h,            # int                 Number of query heads
    h_kv,         # int                 Number of KV heads
    d_k,          # int                 Head dimension (query/key)
    d_v,          # int                 Head dimension (value)
    causal=True   # bool                Apply causal mask
):
    """
    Full multi-head attention with Grouped Query Attention.
    Returns the output and intermediate tensors for inspection.
    """
    n, d = X.shape

    # ── Step 1: Linear projections ──────────────────────
    Q_flat = X @ W_Q  # [n x d] @ [d x (h*d_k)] = [n x (h*d_k)]
    K_flat = X @ W_K  # [n x d] @ [d x (h_kv*d_k)] = [n x (h_kv*d_k)]
    V_flat = X @ W_V  # [n x d] @ [d x (h_kv*d_v)] = [n x (h_kv*d_v)]

    # Reshape to per-head views
    Q = Q_flat.reshape(n, h, d_k)     # [n x h x d_k]
    K = K_flat.reshape(n, h_kv, d_k)  # [n x h_kv x d_k]
    V = V_flat.reshape(n, h_kv, d_v)  # [n x h_kv x d_v]

    # Transpose to [heads x n x dim] for batched matmul
    Q = Q.transpose(1, 0, 2)  # [h x n x d_k]
    K = K.transpose(1, 0, 2)  # [h_kv x n x d_k]
    V = V.transpose(1, 0, 2)  # [h_kv x n x d_v]

    # ── Step 2: GQA expansion ──────────────────────────
    group_size = h // h_kv
    K_exp = np.repeat(K, group_size, axis=0)  # [h x n x d_k]
    V_exp = np.repeat(V, group_size, axis=0)  # [h x n x d_v]

    # ── Step 3: Attention scores ───────────────────────
    # Q @ K^T: [h x n x d_k] @ [h x d_k x n] = [h x n x n]
    S = np.matmul(Q, K_exp.transpose(0, 2, 1))  # [h x n x n]

    # ── Step 4: Scale ──────────────────────────────────
    S_scaled = S / np.sqrt(d_k)  # [h x n x n]

    # ── Step 5: Causal mask ────────────────────────────
    if causal:
        mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # [n x n]
        S_scaled[:, mask] = -1e9                          # [h x n x n]

    # ── Step 6: Softmax ────────────────────────────────
    A = softmax(S_scaled, axis=-1)  # [h x n x n]

    # ── Step 7: Weighted sum ───────────────────────────
    # A @ V: [h x n x n] @ [h x n x d_v] = [h x n x d_v]
    O = np.matmul(A, V_exp)  # [h x n x d_v]

    # ── Step 8: Concatenate and project ────────────────
    O_concat = O.transpose(1, 0, 2).reshape(n, h * d_v)  # [n x (h*d_v)]
    output = O_concat @ W_O  # [n x (h*d_v)] @ [(h*d_v) x d] = [n x d]

    return output, {"Q": Q, "K": K, "V": V, "S": S_scaled, "A": A, "O": O}


# ── Demo: Run attention with LLaMA 4 Maverick-like dimensions ──
np.random.seed(42)

# Use small sequence for demonstration (same math, smaller tensors)
n = 8        # 8 tokens
d = 5120     # LLaMA 4 Maverick hidden size
h = 40       # 40 query heads
h_kv = 8     # 8 KV heads (GQA with group size 5)
d_k = 128    # head dimension
d_v = 128    # value dimension

# Random initialization (in real models, these are learned)
X = np.random.randn(n, d).astype(np.float32) * 0.02
W_Q = np.random.randn(d, h * d_k).astype(np.float32) * 0.02
W_K = np.random.randn(d, h_kv * d_k).astype(np.float32) * 0.02
W_V = np.random.randn(d, h_kv * d_v).astype(np.float32) * 0.02
W_O = np.random.randn(h * d_v, d).astype(np.float32) * 0.02

output, intermediates = multi_head_attention_gqa(
    X, W_Q, W_K, W_V, W_O, h, h_kv, d_k, d_v, causal=True
)

print(f"Input shape: {X.shape}")                   # (8, 5120)
print(f"Output shape: {output.shape}")             # (8, 5120)
print(f"Q shape: {intermediates['Q'].shape}")      # (40, 8, 128)
print(f"K shape: {intermediates['K'].shape}")      # (8, 8, 128)
print(f"A shape: {intermediates['A'].shape}")      # (40, 8, 8)

# Verify causal masking: attention weights for token 0 should only attend to itself
print(f"\nAttention weights for token 0, head 0:")
print(f" {intermediates['A'][0, 0, :]}")
# Should be [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Verify each row sums to 1
row_sums = intermediates['A'].sum(axis=-1)  # [h x n]
print(f"\nAll attention rows sum to 1: {np.allclose(row_sums, 1.0)}")

# Parameter count
total_params = (
    W_Q.size +  # d * h * d_k
    W_K.size +  # d * h_kv * d_k
    W_V.size +  # d * h_kv * d_v
    W_O.size    # h * d_v * d
)
print(f"\nTotal attention parameters: {total_params:,}")
print(f" W_Q: {W_Q.size:,} ({W_Q.shape})")
print(f" W_K: {W_K.size:,} ({W_K.shape})")
print(f" W_V: {W_V.size:,} ({W_V.shape})")
print(f" W_O: {W_O.size:,} ({W_O.shape})")
Running this code produces:
Input shape: (8, 5120)
Output shape: (8, 5120)
Q shape: (40, 8, 128)
K shape: (8, 8, 128)
A shape: (40, 8, 8)
Attention weights for token 0, head 0:
[1. 0. 0. 0. 0. 0. 0. 0.]
All attention rows sum to 1: True
Total attention parameters: 62,914,560
W_Q: 26,214,400 (5120, 5120)
W_K: 5,242,880 (5120, 1024)
W_V: 5,242,880 (5120, 1024)
W_O: 26,214,400 (5120, 5120)
The 62.9 million parameters per attention layer matches our calculation from Section A.2. Across all 48 layers, the attention parameters alone total 62.9M * 48 = 3.02 billion, which is a substantial fraction of the 17 billion active parameters.
A.8 Attention Variants: MHA, MQA, GQA, MLA
Chapter 8 introduced these variants conceptually. Here is the precise mathematical difference, expressed as changes to the dimension table:
| Variant | h_kv | K/V Projection Size | KV Cache per Token per Layer | Used By |
|---|---|---|---|---|
| MHA | h | [d x d] | 2 * h * d_k values | Original Transformer |
| MQA | 1 | [d x d_k] | 2 * d_k values | PaLM, Falcon-7B |
| GQA | h/g | [d x (h/g)*d_k] | 2 * (h/g) * d_k values | LLaMA 3/4, Mistral, Qwen3, Falcon-40B/180B |
| MLA | n/a | [d x d_c] + [d x d_r] | d_c + d_r values | DeepSeek-V2/V3 |
Where g is the group size (number of query heads per KV head), d_c is the compressed KV dimension, and d_r is the RoPE dimension.
MHA (Multi-Head Attention): Every query head has its own key and value head. h_kv = h. This is the original design from Vaswani et al. (2017).
MQA (Multi-Query Attention): All query heads share a single key head and a single value head. h_kv = 1. This reduces the KV cache by a factor of h but can slightly reduce model quality.
GQA (Grouped Query Attention): A middle ground. Query heads are divided into groups, and each group shares one KV head. h_kv = h/g. LLaMA 4 Maverick uses g = 5 (40 query heads, 8 KV heads). Qwen3-8B uses g = 4 (32 query heads, 8 KV heads). Falcon-40B and Falcon-180B also use GQA with 8 KV heads (matching their tensor parallel degree of 8), as described in the Falcon paper’s “multigroup” attention scheme.
Source: Qwen3-8B config.json: 8.2B parameters, 36 layers, hidden_size=4096, 32 query heads, 8 KV heads, head_dim=128, intermediate_size=12288 (confirmed from huggingface.co/Qwen/Qwen3-8B/blob/main/config.json and huggingface.co/Qwen/Qwen3-8B model card).
MLA (Multi-head Latent Attention): Used by DeepSeek-V2 and DeepSeek-V3. Instead of storing separate K and V vectors, MLA compresses them into a single low-rank latent vector of dimension d_c (kv_lora_rank). A separate small vector of dimension d_r (qk_rope_head_dim) carries the rotary position information. The KV cache stores only d_c + d_r values per token per layer, instead of 2 * h * d_k for MHA. For DeepSeek-V3: d_c = 512, d_r = 64, so the cache is 576 values per token per layer. Compare that to standard MHA with DeepSeek-V3’s 128 heads and d_k = 128: 2 * 128 * 128 = 32,768 values per token per layer. MLA achieves a 57x reduction. Queries are also compressed through a low-rank bottleneck (q_lora_rank = 1,536), which does not affect the KV cache but reduces activation memory during training.
Source: DeepSeek-V3 Technical Report, arXiv:2412.19437, Section 2.1.2. Config: 61 layers, kv_lora_rank=512, q_lora_rank=1536, qk_rope_head_dim=64, num_attention_heads=128, v_head_dim=128, qk_nope_head_dim=128 (confirmed from huggingface.co/deepseek-ai/DeepSeek-V3 configuration_deepseek.py and arxiv.org/html/2412.19437v1). PaLM uses MQA (confirmed from Chowdhery et al., arXiv:2204.02311 and tinkerd.net/blog/machine-learning/multi-query-attention). Falcon-7B uses MQA (h_kv=1); Falcon-40B and Falcon-180B use GQA with 8 KV heads (n_kv=TP=8, confirmed from the Falcon paper arXiv:2311.16867, Table 16 and Section 4.3.1, and fireworks.ai/blog/multi-query-attention-is-all-you-need).
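The mathematical change for each variant is minimal. The attention formula itself (softmax(QK^T/sqrt(d_k))V) is identical. For MQA and GQA, the only difference is how many distinct K and V projection matrices exist and how they are shared across query heads. MLA additionally changes how K and V are produced, deriving them from the shared low-rank latent described above.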
A.9 Memory Cost of the Attention Score Matrix
Stacked across all h heads, the attention score matrix S has shape [h x n x n]. For long sequences, this tensor is enormous:
| Sequence Length (n) | Heads (h) | Score Matrix Size | Memory (float16) |
|---|---|---|---|
| 4,096 | 40 | 40 * 4,096^2 = 671M entries | 1.25 GB |
| 32,768 | 40 | 40 * 32,768^2 = 42.9B entries | 80 GB |
| 131,072 | 40 | 40 * 131,072^2 = 687B entries | 1.25 TB |
| 1,048,576 | 40 | 40 * 1,048,576^2 = 44T entries | 80 TB |
Memory is computed in binary units: entries * 2 bytes (float16) / 2^30 for GB, / 2^40 for TB. At 131,072 tokens (LLaMA 4 Maverick’s max_position_embeddings), the score matrix alone would require 1.25 TB of memory in float16. This is far more than any single GPU can hold (an NVIDIA H100 has 80 GB of HBM3, a B200 has 192 GB of HBM3e, and a B300 has 288 GB of HBM3e).
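The table can be regenerated with a short script. The helper name score_matrix_bytes is hypothetical, and the calculation assumes a single sequence (batch size 1) with every head's full [n x n] score matrix materialized at once.

```python
def score_matrix_bytes(n, h, bytes_per_entry=2):
    """Bytes needed to hold the full [h x n x n] score tensor (float16 by default)."""
    return h * n * n * bytes_per_entry

GIB = 2**30
TIB = 2**40

for n in (4_096, 32_768, 131_072, 1_048_576):
    entries = 40 * n * n
    size = score_matrix_bytes(n, h=40)
    print(f"n={n:>9,}  entries={entries:.3e}  "
          f"{size / GIB:>12,.2f} GiB  ({size / TIB:.4f} TiB)")
```

Because memory grows with n^2, each 8x increase in sequence length multiplies the requirement by 64, which is the jump from one row of the table to the next.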
This is exactly why FlashAttention (Chapter 20) exists. FlashAttention never materializes the full [n x n] score matrix. Instead, it computes attention in tiles, loading small blocks of Q, K, and V into fast on-chip SRAM, computing partial softmax results, and accumulating the output incrementally. The peak memory usage drops from O(n^2) to O(n), at the cost of more complex kernel code. The mathematical result is identical; only the memory access pattern changes.
Source: FlashAttention-4 (arXiv:2603.05451, March 2026) achieves 1,613 TFLOPs/s BF16 on NVIDIA B200, 71% hardware utilization. (Note: Tri Dao’s blog post at tridao.me/blog/2026/flash4 reports 1,605 TFLOPs/s for the same configuration; this appendix uses the arXiv figure as the primary source.) FlashAttention-3 (NeurIPS 2024) achieves 840 TFLOPs/s BF16 on H100, 85% utilization. (Note: the arXiv preprint v1 reported 740 TFLOPs/s FP16 at 75% utilization; the NeurIPS camera-ready version improved to 840 TFLOPs/s BF16 at 85% utilization.) Both confirmed from arxiv.org and neurips.cc.
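The tiling idea can be illustrated in NumPy for a single head. This is a simplified sketch of the online-softmax accumulation only, not the FlashAttention kernel: it tiles over keys and values but not queries, omits masking, and runs on ordinary arrays rather than on-chip SRAM. The function names and the block size are assumptions chosen for the example.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference: materializes the full [n x n] score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    S = S - S.max(axis=-1, keepdims=True)
    A = np.exp(S)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

def tiled_attention(Q, K, V, block=4):
    """Processes K and V in blocks, holding only an [n x block] score tile."""
    n, d_k = Q.shape
    scale = 1.0 / np.sqrt(d_k)
    O = np.zeros((n, V.shape[-1]))   # unnormalized output accumulator
    m = np.full((n, 1), -np.inf)     # running row-wise max of the scores
    l = np.zeros((n, 1))             # running softmax denominator

    for start in range(0, K.shape[0], block):
        Kb = K[start:start + block]
        Vb = V[start:start + block]
        Sb = (Q @ Kb.T) * scale                       # [n, block] tile
        m_new = np.maximum(m, Sb.max(axis=-1, keepdims=True))
        correction = np.exp(m - m_new)                # rescales earlier partial sums
        Pb = np.exp(Sb - m_new)
        l = l * correction + Pb.sum(axis=-1, keepdims=True)
        O = O * correction + Pb @ Vb
        m = m_new

    return O / l                                      # normalize once at the end

rng = np.random.default_rng(0)
n, d_k = 10, 8                                        # n is deliberately not a multiple of block
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_k))
out_naive = naive_attention(Q, K, V)
out_tiled = tiled_attention(Q, K, V, block=4)
print(np.allclose(out_naive, out_tiled))
```

The running maximum m and denominator l are what allow the softmax to be computed without ever seeing a full row of scores; whenever a new block raises the maximum, the earlier partial sums are rescaled by exp(m - m_new).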
A.10 Key Takeaways
The attention mechanism is a sequence of five matrix operations: three linear projections (Q, K, V), one batched matrix multiply (Q @ K^T), and one batched matrix multiply (A @ V). Scaling, masking, and softmax sit between the two batched multiplies, and a final output projection follows them.
Every tensor shape is fully determined by four numbers: the sequence length n, the model dimension d, the number of query heads h, and the head dimension d_k. For GQA, add h_kv (number of KV heads).
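The softmax Jacobian for one row is J = diag(s) - s @ s^T, where s is that row's softmax output. The diagonal entries are s_i(1 - s_i) and the off-diagonal entries are -s_i * s_j. The efficient backward formula is dL/dS = a * (g - sum(a * g)), where a is the same softmax output row and g is the upstream gradient for that row. It avoids the O(n^3) cost of explicit Jacobian multiplication across all n rows.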
FLOPs per layer are approximately 24nd^2 + 4n^2d for standard MHA. Of the linear 24nd^2 term, the FFN accounts for ~67% (16nd^2) and the attention projections for ~33% (8nd^2). The quadratic attention core (4n^2d) is negligible for short sequences but dominates for long ones. The crossover point, where the attention core matches the attention projections (4n^2d = 8nd^2), is n = 2d.
GQA reduces the K and V projection parameters and KV cache memory by a factor of h/h_kv (5x for LLaMA 4 Maverick, 4x for Qwen3-8B) without changing the attention core computation. MLA (DeepSeek-V3) goes further, compressing the KV cache to just d_c + d_r = 576 values per token per layer, a 57x reduction compared to standard MHA with 128 heads.
The attention score matrix requires O(h * n^2) memory, which exceeds GPU capacity for long sequences. FlashAttention solves this by tiling the computation, reducing peak memory from O(n^2) to O(n) while producing mathematically identical results.
The total attention parameters per layer for LLaMA 4 Maverick are 62.9 million (W_Q: 26.2M, W_K: 5.2M, W_V: 5.2M, W_O: 26.2M). Across 48 layers, that is 3.02 billion parameters dedicated to attention alone.
Modern models like LLaMA 4 Maverick omit bias terms in attention projections (attention_bias=False) and add QK normalization (use_qk_norm=True), applying RMSNorm to query and key vectors before the dot product for more stable training at scale.
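Two of the takeaways above can be checked numerically in a few lines. This is a sanity-check sketch under stated assumptions: the helper names are hypothetical, the softmax check uses one random row of length 6, and the parameter count uses the LLaMA 4 Maverick dimensions quoted in this appendix (d = 5120, h = 40, h_kv = 8, d_k = d_v = 128) with no bias terms.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Softmax backward: explicit Jacobian versus the efficient formula.
rng = np.random.default_rng(0)
scores = rng.standard_normal(6)    # one row of raw scores
g = rng.standard_normal(6)         # upstream gradient for that row
a = softmax(scores)

J = np.diag(a) - np.outer(a, a)    # [6 x 6] Jacobian; symmetric, so J.T @ g == J @ g
grad_explicit = J @ g
grad_fast = a * (g - np.sum(a * g))
print(np.allclose(grad_explicit, grad_fast))

# Attention parameter count per layer (no biases).
def attention_params(d, h, h_kv, d_k, d_v):
    return {
        "W_Q": d * h * d_k,
        "W_K": d * h_kv * d_k,
        "W_V": d * h_kv * d_v,
        "W_O": h * d_v * d,
    }

params = attention_params(d=5120, h=40, h_kv=8, d_k=128, d_v=128)
total = sum(params.values())
print({k: f"{v / 1e6:.1f}M" for k, v in params.items()})
print(f"per layer: {total / 1e6:.1f}M, 48 layers: {48 * total / 1e9:.2f}B")
```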
Appendix B takes the parameter counts and memory formulas from this appendix and turns them into a practical GPU memory calculator for any model configuration.
Sources: Vaswani et al., “Attention Is All You Need,” NeurIPS 2017 (arxiv.org/abs/1706.03762). LLaMA 4 Maverick config from HuggingFace Transformers Llama4TextConfig defaults: hidden_size=5120, num_attention_heads=40, num_key_value_heads=8, head_dim=128, num_hidden_layers=48, intermediate_size_mlp=16384, use_qk_norm=True, attention_bias=False (huggingface.co/docs/transformers/main/model_doc/llama4); Maverick overrides interleave_moe_layer_step=2 for alternating dense/MoE layers (confirmed from github.com/huggingface/blog/blob/main/llama4-release.md: “MoE and dense layers alternate. Therefore, experts are applied in half of the layers”). Qwen3-8B config.json: hidden_size=4096, num_attention_heads=32, num_key_value_heads=8, head_dim=128, num_hidden_layers=36, intermediate_size=12288, vocab_size=151936 (huggingface.co/Qwen/Qwen3-8B/blob/main/config.json). DeepSeek-V3 Technical Report, arXiv:2412.19437 (arxiv.org/html/2412.19437v1); config: 61 layers, kv_lora_rank=512, q_lora_rank=1536, qk_rope_head_dim=64, num_attention_heads=128, v_head_dim=128, qk_nope_head_dim=128 (huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/configuration_deepseek.py). PaLM uses MQA (confirmed from Chowdhery et al., arXiv:2204.02311 and tinkerd.net/blog/machine-learning/multi-query-attention). Falcon-7B uses MQA; Falcon-40B/180B use GQA with 8 KV heads (confirmed from the Falcon paper arXiv:2311.16867, Table 16 and Section 4.3.1, tinkerd.net and fireworks.ai/blog/multi-query-attention-is-all-you-need). FLOPs analysis from Kipply’s “Transformer Inference Arithmetic” (kipp.ly/blog/transformer-inference-arithmetic, March 2022) and Finbarr Timbers’ “Where do LLMs spend their FLOPS?” (artfintel.com/p/where-do-llms-spend-their-flops, January 2024). FlashAttention-4, arXiv:2603.05451: 1,613 TFLOPs/s BF16 on B200, 71% utilization (arxiv.org/abs/2603.05451; Tri Dao’s blog at tridao.me/blog/2026/flash4 reports 1,605 TFLOPs/s for the same configuration). 
FlashAttention-3, NeurIPS 2024: 840 TFLOPs/s BF16 on H100, 85% utilization; arXiv preprint v1 reported 740 TFLOPs/s FP16 75% (neurips.cc/virtual/2024/poster/93328, arxiv.org/abs/2407.08608). NVIDIA B200: 192 GB HBM3e (techpowerup.com/gpu-specs/b200-sxm-192-gb.c4210). NVIDIA B300 (Blackwell Ultra): 288 GB HBM3e, shipped January 2026; dense FP4 performance varies by variant: HGX B300 13 PFLOPS per glennklockwood.com/garden/processors/B300 citing NVIDIA HGX platform specs, Spheron reports 14 PFLOPS (spheron.network/blog/nvidia-b300-blackwell-ultra-guide, February 2026), other sources report 15 PFLOPS for higher-power configurations (introl.com/blog/nvidia-blackwell-ultra-b300-infrastructure-requirements-2025, server-parts.eu/post/nvidia-b300-gpu-blackwell-ultra-architecture). CoreWeave HGX B300 generally available March 2026 (coreweave.com/news/coreweave-advances-ai-native-cloud-platform-for-the-next-phase-of-production-scale-ai).