Chapter 9. Feed-Forward Networks, The Thinking Step

Attention gathers information from across the sequence, letting each token see what other tokens are relevant. But gathering information is not the same as processing it. After attention has mixed contextual signals into each token’s vector, the model needs a way to transform that information: to combine features, recognize patterns, and store the factual knowledge that makes a language model useful. This is the job of the feed-forward network (FFN), the other major component of every Transformer layer, and the one that contains the majority of the model’s parameters.


What Does the FFN Actually Do?

In Chapter 8, you learned how multi-head attention lets each token pull in information from other tokens in the sequence. After the attention step, the vector for each token is a context-aware blend of information from across the sequence. But this blended vector is still just a weighted sum of value vectors. It has not been deeply processed.

The feed-forward network takes each token’s vector independently and transforms it through a series of matrix multiplications and nonlinear activations. Unlike attention, which mixes information between tokens, the FFN processes each token in isolation. The same FFN weights are applied to every token position, but each token’s vector is transformed independently.

This division of labor is fundamental to how Transformers work:

  • Attention handles inter-token communication: “What information from other tokens is relevant to me?”
  • FFN handles per-token computation: “Given all the information I have gathered, what should I compute?”

Think of it this way: attention is like reading a document and highlighting the relevant passages. The FFN is like sitting down with those highlighted passages and reasoning about what they mean together. Both steps are necessary. Attention without FFN would just shuffle information around without deeply processing it. FFN without attention would process each token in isolation, unable to use context.


The Basic FFN Architecture

The feed-forward network in the original Transformer (Vaswani et al., 2017) is remarkably simple. It consists of two linear transformations with a nonlinear activation function in between:

FFN(x) = activation(x * W_1 + b_1) * W_2 + b_2

Where:

  • x is the input vector for a single token, with shape [hidden_size]
  • W_1 is the “up-projection” matrix, with shape [hidden_size x d_ff]
  • W_2 is the “down-projection” matrix, with shape [d_ff x hidden_size]
  • b_1 and b_2 are bias vectors
  • activation is a nonlinear function (ReLU in the original Transformer)

The key design choice is the expansion ratio: d_ff is larger than hidden_size. In the original Transformer, d_ff = 2,048 and hidden_size (d_model) = 512, giving an expansion ratio of 4x. The FFN expands the representation into a higher-dimensional space, applies a nonlinear transformation, and then projects it back down to the original dimension.

Source: Vaswani et al., “Attention Is All You Need,” NeurIPS 2017, Section 3.3. d_model = 512, d_ff = 2,048, with ReLU activation.

Why Expand and Contract?

The expansion step is critical. When the FFN projects a 512-dimensional vector into a 2,048-dimensional space, it creates room for the model to represent more complex patterns. In the higher-dimensional space, the nonlinear activation can selectively activate or suppress different features. The contraction step then compresses this processed representation back to the original dimension so it can be passed to the next layer.

Here is a concrete way to think about it. The input vector x has 512 dimensions. After multiplying by W_1, the result has 2,048 dimensions. Each of these 2,048 dimensions can be thought of as a “feature detector” that checks for a specific pattern in the input. The activation function then decides which of these features are active (nonzero) and which are suppressed (zero or near-zero). The down-projection W_2 combines the active features back into a 512-dimensional output.

This expand-activate-contract pattern is sometimes called an “inverted bottleneck”: the narrow point is the input/output dimension, and the wide point is the expanded middle. It gives the FFN far more representational capacity than a single linear transformation of the same input and output dimensions would have.

Step-by-Step Walkthrough

Let’s trace through the FFN computation for a single token in the original Transformer:

  1. Input: A token vector x of shape [512], coming from the attention output.

  2. Up-projection: Multiply by W_1 (shape [512 x 2,048]) to get a vector of shape [2,048].

    h = x * W_1 + b_1    shape: [2,048]

  3. Activation: Apply ReLU to each element of h.

    h_activated = ReLU(h)    shape: [2,048]

    ReLU(z) = max(0, z). Any negative value becomes zero; positive values pass through unchanged.

  4. Down-projection: Multiply by W_2 (shape [2,048 x 512]) to get back to shape [512].

    output = h_activated * W_2 + b_2    shape: [512]

  5. Output: A transformed vector of shape [512], the same dimension as the input.

The output has the same shape as the input, which is essential because it needs to be added back via the residual connection (covered in Chapter 10) and passed to the next Transformer layer.


Activation Functions: From ReLU to SwiGLU

The activation function is the source of nonlinearity in the FFN. Without it, the two linear transformations would collapse into a single linear transformation (since the product of two matrices is just another matrix), and the FFN would have no more representational power than a single linear layer. The activation function is what gives the FFN the ability to learn complex, nonlinear patterns.
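The collapse argument can be checked numerically. The following is a minimal sketch with toy sizes and random weights (both chosen only for illustration): without an activation, two projections give the same result as one merged matrix, while inserting ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, d_ff = 8, 32  # toy sizes, chosen only for illustration

x = rng.standard_normal(hidden_size)
W1 = rng.standard_normal((hidden_size, d_ff))
W2 = rng.standard_normal((d_ff, hidden_size))

# Without an activation, two projections equal one merged projection.
two_step = (x @ W1) @ W2
W_merged = W1 @ W2                      # shape [hidden_size, hidden_size]
one_step = x @ W_merged
linear_collapses = np.allclose(two_step, one_step)

# With ReLU in between, the merged matrix no longer reproduces the output.
with_relu = np.maximum(0, x @ W1) @ W2
relu_differs = not np.allclose(with_relu, one_step)

print("two linear layers == one merged layer:", linear_collapses)
print("ReLU FFN differs from merged layer:   ", relu_differs)
```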

ReLU: The Original Choice

The original Transformer used ReLU (Rectified Linear Unit), one of the simplest activation functions in deep learning:

ReLU(x) = max(0, x)

If the input is positive, ReLU passes it through unchanged. If the input is negative, ReLU outputs zero. This creates a sparse activation pattern: for any given input, every neuron whose pre-activation is negative outputs exactly zero. If the pre-activations are roughly symmetric around zero, as they are at random initialization, that is about half of the 2,048 neurons in the expanded layer; the other half output their positive values.

ReLU was popularized in deep learning around 2010-2012 and became the default activation function for most neural networks. It is computationally cheap (just a comparison and a max operation) and avoids the “vanishing gradient” problem that plagued earlier activation functions like sigmoid and tanh.

Source: ReLU was first used in neural networks by Hahnloser et al. (2000) and popularized for deep learning by Nair and Hinton (2010). It became the standard activation after its use in AlexNet (Krizhevsky et al., 2012).

GELU: A Smoother Alternative

GELU (Gaussian Error Linear Unit) was introduced by Hendrycks and Gimpel (2016, arXiv:1606.08415) and became popular in models like BERT and GPT-2. Unlike ReLU, which has a hard cutoff at zero, GELU provides a smooth transition:

GELU(x) = x * Φ(x)

Where Φ(x) is the cumulative distribution function of the standard normal distribution. In practice, GELU is often approximated as:

GELU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))

GELU allows small negative values to pass through (slightly attenuated) rather than being completely zeroed out. This smoother behavior can help with training stability and gradient flow.

Source: Hendrycks and Gimpel, “Gaussian Error Linear Units (GELUs),” arXiv:1606.08415, June 2016.
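To see how close the tanh approximation is to the exact definition, the two formulas above can be compared directly. This is a small sketch: the function names and the evaluation grid are illustrative choices, and Φ is written in terms of the standard library's error function.

```python
import math
import numpy as np

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi expressed through the error function."""
    phi = 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))
    return x * phi

def gelu_tanh(x):
    """The tanh approximation quoted in the text."""
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

xs = np.linspace(-4.0, 4.0, 801)  # illustrative evaluation grid
max_abs_diff = float(np.max(np.abs(gelu_exact(xs) - gelu_tanh(xs))))
small_negative = float(gelu_exact(np.array([-0.5]))[0])

print(f"max |exact - approx| on [-4, 4]: {max_abs_diff:.2e}")
print(f"GELU(-0.5) = {small_negative:.4f} (ReLU would output 0)")
```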

Swish/SiLU: The Bridge to SwiGLU

Swish (also called SiLU, Sigmoid Linear Unit) was proposed by Ramachandran et al. (2017, arXiv:1710.05941):

Swish(x) = x * sigmoid(x) = x * (1 / (1 + exp(-x)))

Like GELU, Swish is smooth and allows small negative values through. It was found to outperform ReLU in many settings, particularly in deeper networks.

Gated Linear Units (GLU): Adding a Gate

The key innovation that led to modern FFN designs is the Gated Linear Unit (GLU), introduced by Dauphin et al. (2017) in the context of convolutional language models. A GLU splits the computation into two parallel paths and uses one path to “gate” (control) the other:

GLU(x) = (x * W_1) ⊙ sigmoid(x * W_gate)

Where ⊙ means element-wise multiplication. The first path (x * W_1) computes a set of candidate values. The second path (sigmoid(x * W_gate)) computes gate values between 0 and 1. The element-wise product means the gate controls how much of each candidate value passes through. A gate value near 1 lets the candidate through; a gate value near 0 blocks it.

This gating mechanism gives the network more control over information flow than a simple pointwise activation like ReLU. Instead of just zeroing out negative values, the gate can selectively amplify, attenuate, or block any feature based on the input.

Source: Dauphin et al., “Language Modeling with Gated Convolutional Networks,” ICML 2017 (arXiv:1612.08083).
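A minimal sketch of the GLU formula above, using toy shapes and random weights (the sizes and the weight scale are assumptions made for illustration). It returns the candidates and gates alongside the output so the gating effect is visible.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, W_1, W_gate):
    """GLU(x) = (x @ W_1) * sigmoid(x @ W_gate), element-wise product."""
    candidates = x @ W_1            # values that might pass through
    gates = sigmoid(x @ W_gate)     # per-feature gate in [0, 1]
    return candidates * gates, candidates, gates

rng = np.random.default_rng(1)
x = rng.standard_normal(4)
W_1 = rng.standard_normal((4, 6))
W_gate = rng.standard_normal((4, 6)) * 3.0  # larger scale pushes gates toward 0 or 1

out, candidates, gates = glu(x, W_1, W_gate)
for c, g, o in zip(candidates, gates, out):
    print(f"candidate {c:+.3f}  gate {g:.3f}  output {o:+.3f}")
```

Because a sigmoid gate never exceeds 1, each output here is at most as large in magnitude as its candidate; the gate can only attenuate or pass.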

SwiGLU: The Modern Standard

SwiGLU combines the Swish activation with the GLU gating mechanism. It was proposed by Noam Shazeer in 2020 and has become the dominant FFN activation in modern LLMs. The SwiGLU FFN replaces the standard two-matrix FFN with a three-matrix design:

SwiGLU_FFN(x) = (Swish(x * W_gate) ⊙ (x * W_up)) * W_down

Where:

  • W_gate (sometimes called W_1) projects the input to the expanded dimension and applies Swish
  • W_up (sometimes called W_3) projects the input to the expanded dimension (no activation)
  • The element-wise product of these two paths creates the gated output
  • W_down (sometimes called W_2) projects back to the hidden dimension

The critical difference from the standard FFN: SwiGLU uses three weight matrices instead of two. The gate path and the up-projection path are separate linear transformations of the same input, and their element-wise product creates a richer, more expressive transformation than a single activation function could achieve.

Source: Shazeer, “GLU Variants Improve Transformer,” arXiv:2002.05202, February 2020. Tested multiple GLU variants (ReGLU, GEGLU, SwiGLU) in Transformer FFN layers and found that the gated variants improved on the ungated baselines, with GEGLU and SwiGLU giving the best perplexities.

Why SwiGLU Won

Shazeer (2020) tested several GLU variants in Transformer feed-forward layers:

Variant              Gate Activation      Performance
Standard FFN (ReLU)  None (no gate)       Baseline
ReGLU                ReLU                 Better than baseline
GEGLU                GELU                 Among the best
SwiGLU               Swish                Among the best

In those experiments the gated variants beat the ungated baselines, and GEGLU and SwiGLU produced the best perplexities, with only small differences between the two. A plausible reading is that combining Swish’s smooth, non-monotonic behavior with the GLU gating mechanism offers a good balance of expressiveness and training stability, although the paper itself does not offer an explanation for the improvement.

Today, SwiGLU is used in virtually every major open-weight LLM: LLaMA 2, LLaMA 3, LLaMA 4, Mistral, DeepSeek-V3, and many others. It has become the de facto standard for Transformer FFN layers.

Source: LLaMA models use SwiGLU per Meta’s technical reports. Mistral 7B uses SwiGLU (hidden_act = “silu” in config). DeepSeek-V3 uses SwiGLU per its technical report (arXiv:2412.19437).


The Three-Matrix SwiGLU FFN in Detail

Since SwiGLU is what modern models actually use, let’s examine its architecture carefully. The standard (ReLU) FFN has two weight matrices. The SwiGLU FFN has three:

Standard FFN:
  h = ReLU(x * W_1)     up-projection + activation
  output = h * W_2       down-projection

SwiGLU FFN:
  gate = Swish(x * W_gate)    gate path with Swish activation
  up = x * W_up               value path (no activation)
  h = gate ⊙ up               element-wise gating
  output = h * W_down          down-projection

Dimension Accounting

Because SwiGLU uses three matrices instead of two, the expanded dimension (intermediate_size) is adjusted to keep the total parameter count roughly comparable to a standard FFN with the same expansion ratio.

In a standard FFN with 4x expansion:

  • W_1: [hidden_size x 4*hidden_size] parameters
  • W_2: [4*hidden_size x hidden_size] parameters
  • Total: 2 * hidden_size * 4 * hidden_size = 8 * hidden_size^2

In a SwiGLU FFN, to keep the parameter count similar, the expansion factor is reduced from 4x to approximately 8/3 ≈ 2.67x, because there are three matrices:

  • W_gate: [hidden_size x intermediate_size]
  • W_up: [hidden_size x intermediate_size]
  • W_down: [intermediate_size x hidden_size]
  • Total: 3 * hidden_size * intermediate_size

Setting 3 * hidden_size * intermediate_size = 8 * hidden_size^2 gives intermediate_size = (8/3) * hidden_size ≈ 2.67 * hidden_size.
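The parity calculation can be written as a few lines of arithmetic. In this sketch, parity_intermediate_size is a hypothetical helper and the rounding multiple of 256 is an assumption chosen for illustration; real models pick their own rounding.

```python
import math

def parity_intermediate_size(hidden_size, multiple_of=256):
    """Hypothetical helper: SwiGLU width that roughly matches a 4x standard FFN.

    Solves 3 * hidden * intermediate = 8 * hidden^2, then rounds up to a
    hardware-friendly multiple (the multiple is an assumption, not a standard).
    """
    target = 8 * hidden_size / 3
    return multiple_of * math.ceil(target / multiple_of)

hidden_size = 4096
size = parity_intermediate_size(hidden_size)

standard_params = 2 * hidden_size * (4 * hidden_size)   # W_1 and W_2, no biases
swiglu_params = 3 * hidden_size * size                  # W_gate, W_up, W_down

print(f"exact 8/3 target:  {8 * hidden_size / 3:.1f}")
print(f"rounded size:      {size}")
print(f"standard FFN:      {standard_params:,} parameters")
print(f"SwiGLU FFN:        {swiglu_params:,} parameters")
print(f"ratio:             {swiglu_params / standard_params:.4f}")
```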

In practice, the intermediate_size is often rounded to a convenient multiple. Here are the actual values from real models:

Model                       hidden_size  intermediate_size  Ratio  FFN Type         Year
Original Transformer        512          2,048              4.0x   Standard (ReLU)  2017
GPT-2 (small)               768          3,072              4.0x   Standard (GELU)  2019
LLaMA 3 8B                  4,096        14,336             3.5x   SwiGLU           2024
Mistral 7B                  4,096        14,336             3.5x   SwiGLU           2023
DeepSeek-V3 (dense layers)  7,168        18,432             2.57x  SwiGLU           2024
DeepSeek-V3 (MoE experts)   7,168        2,048              0.29x  SwiGLU           2024

DeepSeek-V3 has two different FFN sizes. The first 3 layers use standard dense FFN with intermediate_size = 18,432. The remaining 58 layers use MoE, where each individual expert has a much smaller moe_intermediate_size = 2,048. The small per-expert size is compensated by having 256 routed experts plus 1 shared expert per MoE layer, with 8 routed experts activated per token.

Sources: Original Transformer from Vaswani et al. (2017), d_model=512, d_ff=2,048. GPT-2 from OpenAI (2019), hidden_size=768, intermediate_size=3,072. LLaMA 3 8B from Meta (April 18, 2024), hidden_size=4,096, intermediate_size=14,336. Mistral 7B from Mistral AI (September 27, 2023), hidden_size=4,096, intermediate_size=14,336. DeepSeek-V3 from technical report (arXiv:2412.19437) and HuggingFace configuration (deepseek-ai/DeepSeek-V3-Base): hidden_size=7,168, intermediate_size=18,432 (dense), moe_intermediate_size=2,048 (per expert), 256 routed experts, 1 shared expert, 8 experts activated per token. Model released December 26, 2024.

Notice that the SwiGLU models use expansion ratios below the 4x of standard FFN models for their dense FFN layers: 3.5x for LLaMA 3 8B and Mistral 7B, and 2.57x for DeepSeek-V3. The smaller ratio offsets the extra weight matrix. DeepSeek-V3’s dense layers sit close to the parameter-matched 8/3 ratio, while the 3.5x models still spend somewhat more on the FFN than a 4x standard FFN would (3 * 3.5 = 10.5 versus 8 times hidden_size^2). DeepSeek-V3’s MoE expert FFNs use a much smaller expansion ratio (0.29x) because the total capacity comes from having 256 experts rather than from a large per-expert dimension.


Why the FFN Has Most of the Parameters

One of the most important facts about Transformer architecture is that the FFN contains the majority of each layer’s parameters. In a standard Transformer layer, the FFN typically accounts for roughly two-thirds of the total parameters.

Let’s verify this with real numbers from LLaMA 3 8B (a dense model, which makes the accounting straightforward):

Attention Parameters (per layer)

LLaMA 3 8B uses GQA with 32 query heads, 8 KV heads, and head_dim = 128:

  • W_Q: [4,096 x 4,096] = 16,777,216 parameters
  • W_K: [4,096 x 1,024] = 4,194,304 parameters (8 KV heads * 128)
  • W_V: [4,096 x 1,024] = 4,194,304 parameters
  • W_O: [4,096 x 4,096] = 16,777,216 parameters

Total attention per layer: 41,943,040 parameters (approximately 41.9 million)

FFN Parameters (per layer)

LLaMA 3 8B uses SwiGLU with intermediate_size = 14,336:

  • W_gate: [4,096 x 14,336] = 58,720,256 parameters
  • W_up: [4,096 x 14,336] = 58,720,256 parameters
  • W_down: [14,336 x 4,096] = 58,720,256 parameters

Total FFN per layer: 176,160,768 parameters (approximately 176.2 million)

The Ratio

FFN parameters per layer: 176,160,768
Attention parameters per layer: 41,943,040
Total per layer: 218,103,808

FFN fraction: 176,160,768 / 218,103,808 = 80.8%

The FFN accounts for over 80% of the parameters in each Transformer layer of LLaMA 3 8B. This is even higher than the commonly cited “two-thirds” figure, because GQA reduces the attention parameter count (by sharing KV heads) while the SwiGLU FFN with its three matrices is relatively large.

For comparison, in the original Transformer with standard MHA and a standard 2-matrix FFN:

  • Attention: 4 * 512 * 512 = 1,048,576 parameters (W_Q, W_K, W_V, W_O)
  • FFN: 2 * 512 * 2,048 = 2,097,152 parameters (W_1, W_2)
  • FFN fraction: 2,097,152 / 3,145,728 = 66.7%, exactly two-thirds

Source: LLaMA 3 8B architecture from Meta (April 18, 2024): hidden_size=4,096, intermediate_size=14,336, num_attention_heads=32, num_key_value_heads=8, head_dim=128, num_hidden_layers=32. Original Transformer from Vaswani et al. (2017): d_model=512, d_ff=2,048, h=8.
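The per-layer accounting above can be reproduced in a few lines. This sketch counts only the projection matrices listed in this section (no biases or normalization weights); the function names are illustrative.

```python
def attention_params(hidden_size, num_heads, num_kv_heads, head_dim):
    """Projection weights for W_Q, W_K, W_V, W_O (no biases)."""
    q = hidden_size * num_heads * head_dim
    k = hidden_size * num_kv_heads * head_dim
    v = hidden_size * num_kv_heads * head_dim
    o = num_heads * head_dim * hidden_size
    return q + k + v + o

def swiglu_ffn_params(hidden_size, intermediate_size):
    """W_gate, W_up, W_down."""
    return 3 * hidden_size * intermediate_size

def standard_ffn_params(hidden_size, d_ff):
    """W_1 and W_2 only, matching the comparison in the text (biases omitted)."""
    return 2 * hidden_size * d_ff

# LLaMA 3 8B: GQA attention plus SwiGLU FFN
llama_attn = attention_params(4096, num_heads=32, num_kv_heads=8, head_dim=128)
llama_ffn = swiglu_ffn_params(4096, 14336)
llama_fraction = llama_ffn / (llama_attn + llama_ffn)

# Original Transformer: standard MHA (8 heads of 64) plus 2-matrix FFN
orig_attn = attention_params(512, num_heads=8, num_kv_heads=8, head_dim=64)
orig_ffn = standard_ffn_params(512, 2048)
orig_fraction = orig_ffn / (orig_attn + orig_ffn)

print(f"LLaMA 3 8B: attention {llama_attn:,}  FFN {llama_ffn:,}  FFN share {llama_fraction:.1%}")
print(f"Original:   attention {orig_attn:,}  FFN {orig_ffn:,}  FFN share {orig_fraction:.1%}")
```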

The takeaway: when you hear that a model has billions of parameters, the vast majority of those parameters live in the FFN layers, not in the attention layers. This has important implications for model compression, fine-tuning, and understanding where knowledge is stored.

Computational Cost: Parameters vs. FLOPs

The FFN dominates not only in parameter count but also in floating-point operations (FLOPs) per token. For a single token passing through one layer of LLaMA 3 8B:

  • FFN FLOPs: The SwiGLU FFN performs three matrix multiplications. Each matrix multiply of shape [1 x M] @ [M x N] costs approximately 2MN FLOPs (one multiply and one add per output element). So the FFN costs roughly 2 * 4,096 * 14,336 * 3 = 352,321,536 FLOPs per token per layer (approximately 352 million FLOPs).

  • Attention FLOPs: The attention computation includes Q/K/V projections, the attention score computation, and the output projection. The projections cost roughly 2 * 4,096 * (4,096 + 1,024 + 1,024 + 4,096) = 83,886,080 FLOPs. The attention score computation (QK^T and scoreV) depends on sequence length, but for a typical sequence of 1,024 tokens, it adds roughly 2 * 1,024 * 4,096 * 2 = 16,777,216 FLOPs. Total: approximately 101 million FLOPs per token per layer.

The FFN uses roughly 3.5x more FLOPs than attention per token per layer (for a 1,024-token sequence). For longer sequences, the attention-score cost keeps growing (linearly per token, and therefore quadratically over the whole sequence) and eventually dominates, but for typical sequence lengths, the FFN is the computational bottleneck as well as the parameter bottleneck.
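The same estimate can be turned into a function of sequence length to see where attention catches up. This sketch uses only the simplified cost model from this section (2MN FLOPs per matrix multiply; softmax, normalization, and other small terms ignored), so the crossover it reports is a rough estimate rather than a measurement.

```python
def ffn_flops(hidden_size, intermediate_size):
    """Three matrix multiplies at 2 * M * N FLOPs each."""
    return 2 * hidden_size * intermediate_size * 3

def attention_flops(hidden_size, kv_dim, seq_len):
    """Q/K/V/O projections plus QK^T and score-times-V for one token."""
    projections = 2 * hidden_size * (hidden_size + kv_dim + kv_dim + hidden_size)
    scores = 2 * seq_len * hidden_size * 2
    return projections + scores

HIDDEN, KV_DIM, INTERMEDIATE = 4096, 1024, 14336  # LLaMA 3 8B values from the text

ffn = ffn_flops(HIDDEN, INTERMEDIATE)
for seq_len in (1_024, 4_096, 16_384, 65_536):
    attn = attention_flops(HIDDEN, KV_DIM, seq_len)
    print(f"seq_len {seq_len:>6}: attention {attn:>13,}  FFN {ffn:,}  FFN/attention {ffn / attn:.2f}")

# Smallest power-of-two context at which attention matches or exceeds the FFN.
crossover = next(n for n in (2**k for k in range(8, 24))
                 if attention_flops(HIDDEN, KV_DIM, n) >= ffn)
print("crossover context length (this cost model):", crossover)
```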


What FFN Layers Actually Store

The FFN layers are not just generic computation blocks. Research has shown that they serve as the primary storage mechanism for factual knowledge in Transformer models. This is one of the most important findings in the field of mechanistic interpretability.

FFN Layers as Key-Value Memories

Geva et al. (2021) published a landmark paper titled “Transformer Feed-Forward Layers Are Key-Value Memories.” Their key insight is that the two-layer structure of the FFN (up-projection followed by down-projection) naturally implements a key-value memory system:

  • The first layer (W_1 or W_gate/W_up in SwiGLU) acts as a set of “keys.” With the shape convention used in this chapter (W_1 is [hidden_size x d_ff] and the input is multiplied on the left), each column of the weight matrix defines a pattern that the layer checks for in the input. When the input matches a particular pattern (high dot product with that column), the corresponding neuron activates.

  • The second layer (W_2 or W_down) acts as a set of “values.” Each row of the weight matrix (shape [d_ff x hidden_size]) defines what information to add to the output when the corresponding neuron is active.

Together, the FFN implements a pattern-matching system: “If the input matches pattern X (key), then add information Y (value) to the output.” This is structurally similar to the attention mechanism’s key-value lookup, but operating on learned patterns rather than on other tokens in the sequence.

Source: Geva et al., “Transformer Feed-Forward Layers Are Key-Value Memories,” EMNLP 2021 (arXiv:2012.14913).
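The key-value reading is an exact rewriting of the standard FFN, which a short sketch can confirm. It assumes the x @ W shape convention used in this chapter's code, where neuron i's key is W1[:, i] and its value is W2[i, :]; libraries that store weights transposed swap the roles of rows and columns. Sizes and weights are toy values, and biases are omitted.

```python
import numpy as np

rng = np.random.default_rng(2)
hidden_size, d_ff = 4, 6  # toy sizes for illustration

x = rng.standard_normal(hidden_size)
W1 = rng.standard_normal((hidden_size, d_ff))
W2 = rng.standard_normal((d_ff, hidden_size))

# Matrix form: the usual FFN without biases.
out_matrix = np.maximum(0, x @ W1) @ W2

# Memory form: each neuron matches its key, then adds its value scaled by the match.
out_memory = np.zeros(hidden_size)
for i in range(d_ff):
    key_i = W1[:, i]                         # pattern the neuron looks for
    value_i = W2[i, :]                       # what it writes when active
    coeff = max(0.0, float(x @ key_i))       # ReLU of the key match
    out_memory += coeff * value_i
    print(f"neuron {i}: match {coeff:.3f}")

print("same output:", np.allclose(out_matrix, out_memory))
```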

Factual Knowledge Lives in the FFN

Meng et al. (2022) took this further with their paper “Locating and Editing Factual Associations in GPT.” They developed a method called causal tracing to identify exactly where in the model factual knowledge is stored. Their findings were striking:

  • Factual associations (like “The Eiffel Tower is located in Paris”) are primarily mediated by feed-forward modules in the middle layers of the Transformer.
  • When processing a factual query, specific neurons in the FFN layers activate in response to the subject entity (“Eiffel Tower”) and contribute the factual information (“Paris”) to the output.
  • These factual associations can be directly edited by modifying the FFN weights, using a technique called ROME (Rank-One Model Editing). Applying a rank-one update to the weight matrix of a specific FFN layer can update the model’s factual knowledge (for example, making it believe the Eiffel Tower is in Rome instead of Paris).

Source: Meng et al., “Locating and Editing Factual Associations in GPT,” NeurIPS 2022 (arXiv:2202.05262).
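To make the idea of a rank-one edit concrete, here is a deliberately simplified sketch. It is not the ROME algorithm itself: the published method additionally uses statistics of other keys to choose the update, which is omitted here. The vectors k and v_new are hypothetical stand-ins for a subject's activation pattern and the desired output, and all sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(3)
d_ff, hidden_size = 6, 4  # toy sizes for illustration

W_down = rng.standard_normal((d_ff, hidden_size))
k = rng.standard_normal(d_ff)             # hypothetical "key": activations for one subject
v_new = rng.standard_normal(hidden_size)  # hypothetical "value": the output we want instead

# Smallest rank-one change that makes k @ W_edited equal v_new.
residual = v_new - k @ W_down
W_edited = W_down + np.outer(k, residual) / (k @ k)

# A key orthogonal to k is unaffected by the edit.
k_other = rng.standard_normal(d_ff)
k_other -= (k_other @ k) / (k @ k) * k

print("edited key now maps to v_new:", np.allclose(k @ W_edited, v_new))
print("orthogonal key unchanged:    ", np.allclose(k_other @ W_edited, k_other @ W_down))
print("rank of the update:          ", np.linalg.matrix_rank(W_edited - W_down))
```

Keys that are neither identical nor orthogonal to k are partially affected, which is one reason practical editing methods take more care in choosing the update.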

What This Means

The FFN layers are where the model stores its “world knowledge.” When a language model correctly answers “What is the capital of France?” it is not performing a database lookup. Instead, during the forward pass, the FFN layers in the middle of the network activate specific neurons that encode the association between “France” and “Paris,” and these neurons contribute the correct answer to the output.

This explains why larger FFN layers (more neurons, more parameters) generally lead to more knowledgeable models. More neurons means more key-value pairs, which means more factual associations can be stored. It also explains why the FFN is the largest component of each layer: storing the vast amount of knowledge needed for a general-purpose language model requires an enormous number of parameters.

Dai et al. (2022) further developed this line of research by introducing the concept of knowledge neurons: specific neurons in the FFN layers that are responsible for expressing particular factual knowledge. They showed that suppressing these neurons causes the model to “forget” the corresponding fact, while amplifying them strengthens the model’s confidence in that fact. This provides additional evidence that factual knowledge is not distributed uniformly across the network but is concentrated in identifiable neurons within the FFN layers.

Source: Dai et al., “Knowledge Neurons in Pretrained Transformers,” ACL 2022.


Hands-On: Implementing the FFN

Let’s implement both the standard FFN and the SwiGLU FFN from scratch, so you can see exactly how they work:

import numpy as np

def relu(x):
    """ReLU activation: max(0, x)."""
    return np.maximum(0, x)

def swish(x):
    """Swish/SiLU activation: x * sigmoid(x)."""
    return x * (1 / (1 + np.exp(-np.clip(x, -500, 500))))

def standard_ffn(x, W1, W2, b1, b2):
    """Standard FFN as in the original Transformer.
    
    x: input vector, shape [hidden_size]
    W1: up-projection, shape [hidden_size, d_ff]
    W2: down-projection, shape [d_ff, hidden_size]
    """
    h = relu(x @ W1 + b1)
    return h @ W2 + b2

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU FFN as used in LLaMA, Mistral, etc.
    
    x: input vector, shape [hidden_size]
    W_gate: gate projection, shape [hidden_size, intermediate_size]
    W_up: up projection, shape [hidden_size, intermediate_size]
    W_down: down projection, shape [intermediate_size, hidden_size]
    """
    gate = swish(x @ W_gate)
    up = x @ W_up
    h = gate * up  # element-wise gating
    return h @ W_down


# Compare both FFN types
np.random.seed(42)
hidden_size = 16
d_ff = 64  # 4x expansion for standard FFN
intermediate_size = 43  # ~2.67x expansion for SwiGLU (to match param count)

x = np.random.randn(hidden_size) * 0.5

# Standard FFN
W1 = np.random.randn(hidden_size, d_ff) * 0.1
W2 = np.random.randn(d_ff, hidden_size) * 0.1
b1 = np.zeros(d_ff)
b2 = np.zeros(hidden_size)

out_standard = standard_ffn(x, W1, W2, b1, b2)

# SwiGLU FFN
W_gate = np.random.randn(hidden_size, intermediate_size) * 0.1
W_up = np.random.randn(hidden_size, intermediate_size) * 0.1
W_down = np.random.randn(intermediate_size, hidden_size) * 0.1

out_swiglu = swiglu_ffn(x, W_gate, W_up, W_down)

print(f"Input shape:  {x.shape}")
print(f"Standard FFN output shape: {out_standard.shape}")
print(f"SwiGLU FFN output shape:   {out_swiglu.shape}")
print()

# Parameter counts
std_params = hidden_size * d_ff + d_ff * hidden_size + d_ff + hidden_size
swiglu_params = hidden_size * intermediate_size * 3
print(f"Standard FFN parameters: {std_params:,}")
print(f"  W1: {hidden_size}x{d_ff} = {hidden_size*d_ff:,}")
print(f"  W2: {d_ff}x{hidden_size} = {d_ff*hidden_size:,}")
print()
print(f"SwiGLU FFN parameters: {swiglu_params:,}")
print(f"  W_gate: {hidden_size}x{intermediate_size} = {hidden_size*intermediate_size:,}")
print(f"  W_up:   {hidden_size}x{intermediate_size} = {hidden_size*intermediate_size:,}")
print(f"  W_down: {intermediate_size}x{hidden_size} = {intermediate_size*hidden_size:,}")
print()

# Show the gating effect
gate_values = swish(x @ W_gate)
up_values = x @ W_up
gated = gate_values * up_values

print("Gating effect (first 10 neurons):")
print(f"  Gate values:  [{', '.join(f'{v:.3f}' for v in gate_values[:10])}]")
print(f"  Up values:    [{', '.join(f'{v:.3f}' for v in up_values[:10])}]")
print(f"  Gated output: [{', '.join(f'{v:.3f}' for v in gated[:10])}]")

When you run this, you will see that both FFN types produce output vectors of the same shape as the input. The parameter counts are roughly comparable (the SwiGLU intermediate_size of 43 is chosen to approximately match the standard FFN’s parameter count with d_ff=64). The gating effect shows how the gate values modulate the up-projection values: some neurons are amplified, others are suppressed, and some have their signs flipped.


Visualizing the Gating Mechanism

To build intuition for how SwiGLU’s gating works, let’s plot the three activation functions discussed in this chapter: ReLU, GELU, and Swish (the activation used on SwiGLU’s gate path):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-4, 4, 200)

# ReLU
relu_y = np.maximum(0, x)

# Swish/SiLU
swish_y = x * (1 / (1 + np.exp(-x)))

# GELU (approximate)
gelu_y = 0.5 * x * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].plot(x, relu_y, 'b-', linewidth=2)
axes[0].set_title('ReLU: max(0, x)', fontsize=12)
axes[0].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[0].axvline(x=0, color='gray', linestyle='--', alpha=0.5)
axes[0].set_xlabel('Input')
axes[0].set_ylabel('Output')
axes[0].grid(True, alpha=0.3)

axes[1].plot(x, gelu_y, 'r-', linewidth=2)
axes[1].set_title('GELU: x * Φ(x)', fontsize=12)
axes[1].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[1].axvline(x=0, color='gray', linestyle='--', alpha=0.5)
axes[1].set_xlabel('Input')
axes[1].grid(True, alpha=0.3)

axes[2].plot(x, swish_y, 'g-', linewidth=2)
axes[2].set_title('Swish: x * sigmoid(x)', fontsize=12)
axes[2].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[2].axvline(x=0, color='gray', linestyle='--', alpha=0.5)
axes[2].set_xlabel('Input')
axes[2].grid(True, alpha=0.3)

plt.suptitle('Activation Functions Used in Transformer FFN Layers', fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig('activation_functions.png', dpi=150, bbox_inches='tight')
plt.show()
print("Plot saved to activation_functions.png")

The key visual difference: ReLU has a hard corner at zero, completely killing all negative inputs. GELU and Swish are smooth curves that allow small negative values through. This smoothness helps with gradient flow during training, because the gradient is nonzero for almost every input: each curve has a single turning point in the negative region where the slope passes through zero, whereas ReLU’s gradient is exactly zero for all negative inputs.

In SwiGLU, the Swish activation is applied to the gate path, and the result is multiplied element-wise with the ungated up-projection. This means the gate can smoothly scale features from fully suppressed (gate near 0) to fully passed (gate near 1) to amplified (gate greater than 1, which Swish allows for large positive inputs). This is more expressive than ReLU’s binary on/off behavior.


The FFN in the Full Transformer Pipeline

Let’s update the model pipeline from Chapter 8 to show exactly where the FFN fits, using LLaMA 3 8B as our example (a dense model, since LLaMA 4 Maverick uses Mixture-of-Experts, which we will cover in Chapter 12):

Step 1: Tokenization (Chapter 4)
  "The capital of France is"
  --> [791, 6864, 315, 9822, 374]
  --> 5 tokens

Step 2: Embedding Lookup (Chapter 5)
  Each token ID --> row in embedding table (128,256 x 4,096)
  --> Matrix of shape [5 x 4,096]

Step 3: For each of the 32 Transformer layers:

  a) RMSNorm (Chapter 10)
     Normalize each token's vector

  b) Multi-Head Attention with GQA (Chapter 8)
     - 32 query heads, 8 KV heads, head_dim = 128
     - Each token attends to all previous tokens
     - Output: [5 x 4,096]

  c) Residual Connection (Chapter 10)
     Add attention output to input

  d) RMSNorm (Chapter 10)

  e) Feed-Forward Network with SwiGLU (THIS CHAPTER)
     For EACH token independently:
     - Gate path: [4,096] @ W_gate [4,096 x 14,336] --> Swish --> [14,336]
     - Up path:   [4,096] @ W_up   [4,096 x 14,336] --> [14,336]
     - Gating:    gate ⊙ up --> [14,336]
     - Down:      [14,336] @ W_down [14,336 x 4,096] --> [4,096]
     Output: [5 x 4,096]

  f) Residual Connection (Chapter 10)

Step 4: Final RMSNorm + Output Projection
  --> 128,256 probabilities for the next token

Notice that the FFN processes each token independently. The input to the FFN is a matrix of shape [5 x 4,096] (5 tokens, each with a 4,096-dimensional vector), and the FFN applies the same three matrix multiplications to each row independently. There is no interaction between tokens in the FFN step. All inter-token communication happens in the attention step.

This independence is important for two reasons:

  1. Parallelism: Since each token is processed independently, the FFN computation can be fully parallelized across tokens. On a GPU, all 5 tokens (or thousands of tokens in a real batch) are processed simultaneously.

  2. Position invariance: The FFN applies the same transformation regardless of where a token appears in the sequence. The same factual knowledge is accessible to every token position. Position-dependent behavior comes from the attention mechanism and positional encodings, not from the FFN.
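Both properties can be verified directly. The sketch below uses toy sizes and random weights (assumptions for illustration) to show that running the FFN on the whole [tokens x hidden] matrix gives the same rows as running it token by token, and that shuffling the tokens simply shuffles the outputs.

```python
import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(X, W_gate, W_up, W_down):
    """Works on a single vector [hidden] or a matrix [tokens, hidden]."""
    return (swish(X @ W_gate) * (X @ W_up)) @ W_down

rng = np.random.default_rng(4)
num_tokens, hidden_size, intermediate_size = 5, 8, 20  # toy sizes

X = rng.standard_normal((num_tokens, hidden_size))
W_gate = rng.standard_normal((hidden_size, intermediate_size)) * 0.1
W_up = rng.standard_normal((hidden_size, intermediate_size)) * 0.1
W_down = rng.standard_normal((intermediate_size, hidden_size)) * 0.1

# All tokens at once versus one token at a time.
batched = swiglu_ffn(X, W_gate, W_up, W_down)
per_token = np.stack([swiglu_ffn(X[i], W_gate, W_up, W_down) for i in range(num_tokens)])

# Reordering the tokens reorders the outputs and changes nothing else.
perm = rng.permutation(num_tokens)
shuffled = swiglu_ffn(X[perm], W_gate, W_up, W_down)

print("batched == per-token:        ", np.allclose(batched, per_token))
print("shuffle in == shuffle out:   ", np.allclose(shuffled, batched[perm]))
```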


Real Numbers: FFN Parameters in Production Models

Let’s compute the exact FFN parameter counts for several real models to understand the scale:

LLaMA 3 8B (Dense Model)

  • hidden_size: 4,096
  • intermediate_size: 14,336
  • num_hidden_layers: 32
  • FFN type: SwiGLU (3 matrices)

Parameters per FFN layer:

  • W_gate: 4,096 * 14,336 = 58,720,256
  • W_up: 4,096 * 14,336 = 58,720,256
  • W_down: 14,336 * 4,096 = 58,720,256
  • Total per layer: 176,160,768 (176.2M)

Total FFN parameters across all 32 layers:

  • 176,160,768 * 32 = 5,637,144,576 (5.64 billion)

Total model parameters (approximate): ~8 billion

FFN fraction of total model: 5.64B / 8B ≈ 70.5%

Mistral 7B

  • hidden_size: 4,096
  • intermediate_size: 14,336
  • num_hidden_layers: 32
  • FFN type: SwiGLU (3 matrices)

The FFN dimensions are identical to LLaMA 3 8B, so the FFN parameter count is the same: 5.64 billion across all 32 layers.

Source: Mistral 7B from Mistral AI (September 27, 2023): hidden_size=4,096, intermediate_size=14,336, 32 layers, 32 query heads, 8 KV heads.

DeepSeek-V3

  • hidden_size: 7,168
  • num_hidden_layers: 61
  • FFN type: SwiGLU (3 matrices per expert)
  • Architecture: Mixture-of-Experts (256 routed experts + 1 shared expert per MoE layer)
  • First 3 layers: dense FFN with intermediate_size = 18,432
  • Remaining 58 layers: MoE with moe_intermediate_size = 2,048 per expert
  • 8 routed experts activated per token

For a single MoE expert (moe_intermediate_size = 2,048):

  • W_gate: 7,168 * 2,048 = 14,680,064
  • W_up: 7,168 * 2,048 = 14,680,064
  • W_down: 2,048 * 7,168 = 14,680,064
  • Total per expert: 44,040,192 (44.0M)

Each MoE layer has 256 routed experts plus 1 shared expert (257 total), so the total FFN parameters per MoE layer are:

  • 257 * 44,040,192 = 11,318,329,344 (11.3 billion per layer)

But only 8 routed experts + 1 shared expert = 9 experts are activated per token, so the active FFN parameters per MoE layer per token are:

  • 9 * 44,040,192 = 396,361,728 (396.4M per token)

With 58 MoE layers and 3 dense layers, FFN parameters dominate the model’s 671 billion total: the 58 MoE layers alone hold about 656 billion (58 * 11,318,329,344). The vast majority of those parameters sit in the 256 routed expert FFN blocks per MoE layer, most of which are inactive for any given token.

Source: DeepSeek-V3 Technical Report, arXiv:2412.19437, and HuggingFace configuration (deepseek-ai/DeepSeek-V3-Base). hidden_size=7,168, intermediate_size=18,432 (dense layers), moe_intermediate_size=2,048 (per expert), 61 layers, first_k_dense_replace=3, 256 routed experts, 1 shared expert, num_experts_per_tok=8, 671B total parameters, 37B active parameters. Model released December 26, 2024.
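As a rough check on these figures, the following sketch totals the FFN weights across all 61 layers using only the configuration values listed above. It counts expert and dense FFN matrices only; router weights, attention, embeddings, and any other components are deliberately left out, so the result is an FFN-only estimate rather than a reconstruction of the full 671B count.

```python
def swiglu_params(d_model, d_ff):
    """Parameters in one bias-free SwiGLU FFN (three weight matrices)."""
    return 3 * d_model * d_ff

hidden_size = 7168
dense_intermediate = 18432   # first 3 layers
moe_intermediate = 2048      # per expert
n_dense_layers, n_moe_layers = 3, 58
n_routed, n_shared, n_active_routed = 256, 1, 8

expert = swiglu_params(hidden_size, moe_intermediate)         # 44,040,192
dense_layer = swiglu_params(hidden_size, dense_intermediate)  # 396,361,728

# All FFN weights stored in the model (router weights not counted)
total_ffn = (n_dense_layers * dense_layer
             + n_moe_layers * (n_routed + n_shared) * expert)

# FFN weights actually used for a single token
active_ffn = (n_dense_layers * dense_layer
              + n_moe_layers * (n_active_routed + n_shared) * expert)

print(f"Total FFN parameters:  {total_ffn:,} ({total_ffn / 1e9:.1f}B)")
print(f"Active FFN per token:  {active_ffn:,} ({active_ffn / 1e9:.1f}B)")
print(f"Fraction active:       {active_ffn / total_ffn:.1%}")
```

Note that a dense layer (intermediate_size = 18,432) has exactly the same FFN parameter count as the 9 active experts in an MoE layer (9 * 2,048 = 18,432), so every layer applies the same amount of FFN computation per token.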

The Original Transformer (for comparison)

  • d_model: 512
  • d_ff: 2,048
  • num_layers: 6 encoder + 6 decoder = 12 total
  • FFN type: Standard (2 matrices + biases)

Parameters per FFN layer:

  • W_1: 512 * 2,048 = 1,048,576
  • W_2: 2,048 * 512 = 1,048,576
  • b_1: 2,048
  • b_2: 512
  • Total per layer: 2,099,712 (2.1M)

Total FFN across all 12 layers (6 encoder + 6 decoder): 25,196,544 (25.2M)

The contrast is staggering. A single FFN layer in LLaMA 3 8B (176.2M parameters) has more parameters than the entire original Transformer model (approximately 65M total parameters for the base configuration). This growth reflects both the increase in hidden dimensions and the addition of the third matrix for SwiGLU.
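The comparison can be verified with a short calculation. This sketch assumes the standard two-matrix FFN with biases described above; `standard_ffn_params` is an illustrative helper name.

```python
def standard_ffn_params(d_model, d_ff):
    """Two-matrix FFN with biases: W_1, b_1, W_2, b_2."""
    return d_model * d_ff + d_ff + d_ff * d_model + d_model

# Original Transformer (base): 6 encoder + 6 decoder layers
original_per_layer = standard_ffn_params(512, 2048)
original_total = original_per_layer * 12

# One bias-free SwiGLU FFN layer in LLaMA 3 8B
llama_per_layer = 3 * 4096 * 14336

print(f"Original, per layer: {original_per_layer:,}")  # 2,099,712
print(f"Original, 12 layers: {original_total:,}")      # 25,196,544
print(f"LLaMA 3 8B, per layer: {llama_per_layer:,}")   # 176,160,768
print(f"Ratio per layer: {llama_per_layer / original_per_layer:.1f}x")
```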


Worked Example: Tracing Through a SwiGLU FFN

Let’s trace through a complete SwiGLU FFN computation with small, concrete numbers. We will use hidden_size = 4 and intermediate_size = 6 for readability.

Setup

Input vector x (after attention and normalization), shape [4]:

x = [0.5, -0.3, 0.8, 0.1]

Weight matrices (randomly initialized for illustration):

W_gate = [[ 0.2,  0.1, -0.3,  0.4,  0.0, -0.2],
          [-0.1,  0.3,  0.2, -0.1,  0.5,  0.1],
          [ 0.4, -0.2,  0.1,  0.3, -0.1,  0.2],
          [ 0.0,  0.1, -0.1,  0.2,  0.3, -0.3]]

W_up   = [[ 0.3, -0.1,  0.2,  0.0,  0.4, -0.1],
          [ 0.1,  0.2, -0.3,  0.5, -0.2,  0.3],
          [-0.2,  0.4,  0.1, -0.1,  0.3,  0.0],
          [ 0.2, -0.3,  0.0,  0.1,  0.1,  0.2]]

W_down = [[ 0.1, -0.2,  0.3,  0.0],
          [ 0.2,  0.1, -0.1,  0.4],
          [-0.3,  0.2,  0.0,  0.1],
          [ 0.1,  0.0,  0.2, -0.3],
          [ 0.0,  0.3, -0.2,  0.1],
          [-0.1,  0.1,  0.1,  0.2]]

Step 1: Gate Path

Compute x @ W_gate:

gate_pre = [0.5*0.2 + (-0.3)*(-0.1) + 0.8*0.4 + 0.1*0.0,    = 0.45
            0.5*0.1 + (-0.3)*0.3 + 0.8*(-0.2) + 0.1*0.1,     = -0.19
            0.5*(-0.3) + (-0.3)*0.2 + 0.8*0.1 + 0.1*(-0.1),  = -0.14
            0.5*0.4 + (-0.3)*(-0.1) + 0.8*0.3 + 0.1*0.2,     = 0.49
            0.5*0.0 + (-0.3)*0.5 + 0.8*(-0.1) + 0.1*0.3,     = -0.20
            0.5*(-0.2) + (-0.3)*0.1 + 0.8*0.2 + 0.1*(-0.3)]  = 0.00

Apply Swish: Swish(z) = z * sigmoid(z)

gate = [0.45 * sigmoid(0.45),   = 0.45 * 0.611 = 0.275
        -0.19 * sigmoid(-0.19), = -0.19 * 0.453 = -0.086
        -0.14 * sigmoid(-0.14), = -0.14 * 0.465 = -0.065
        0.49 * sigmoid(0.49),   = 0.49 * 0.620 = 0.304
        -0.20 * sigmoid(-0.20), = -0.20 * 0.450 = -0.090
        0.00 * sigmoid(0.00)]   = 0.00 * 0.500 = 0.000

Step 2: Up Path

Compute x @ W_up:

up = [0.5*0.3 + (-0.3)*0.1 + 0.8*(-0.2) + 0.1*0.2,     = -0.02
      0.5*(-0.1) + (-0.3)*0.2 + 0.8*0.4 + 0.1*(-0.3),   = 0.18
      0.5*0.2 + (-0.3)*(-0.3) + 0.8*0.1 + 0.1*0.0,      = 0.27
      0.5*0.0 + (-0.3)*0.5 + 0.8*(-0.1) + 0.1*0.1,      = -0.22
      0.5*0.4 + (-0.3)*(-0.2) + 0.8*0.3 + 0.1*0.1,      = 0.51
      0.5*(-0.1) + (-0.3)*0.3 + 0.8*0.0 + 0.1*0.2]      = -0.12

Step 3: Element-wise Gating

h = gate ⊙ up = [0.275 * (-0.02),  = -0.005
                  -0.086 * 0.18,    = -0.015
                  -0.065 * 0.27,    = -0.018
                  0.304 * (-0.22),  = -0.067
                  -0.090 * 0.51,    = -0.046
                  0.000 * (-0.12)]  = 0.000

Notice how the gating works: neuron 5 (gate = 0.000) is completely suppressed regardless of its up-projection value. Neuron 0 has a positive gate but a negative up value, resulting in a small negative output. Neuron 3 has the largest magnitude because both the gate and up values are relatively large.

Step 4: Down-projection

output = h @ W_down
       = [-0.005*0.1 + (-0.015)*0.2 + (-0.018)*(-0.3) + (-0.067)*0.1 + (-0.046)*0.0 + 0.000*(-0.1),
          -0.005*(-0.2) + (-0.015)*0.1 + (-0.018)*0.2 + (-0.067)*0.0 + (-0.046)*0.3 + 0.000*0.1,
          -0.005*0.3 + (-0.015)*(-0.1) + (-0.018)*0.0 + (-0.067)*0.2 + (-0.046)*(-0.2) + 0.000*0.1,
          -0.005*0.0 + (-0.015)*0.4 + (-0.018)*0.1 + (-0.067)*(-0.3) + (-0.046)*0.1 + 0.000*0.2]
       = [-0.005, -0.018, -0.004, 0.008]

The final output is a 4-dimensional vector, the same shape as the input. This output will be added to the input via the residual connection (Chapter 10) before being passed to the next layer.
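The hand computation can be checked with a few lines of NumPy. The sketch below uses the same x, W_gate, W_up, and W_down as the Setup section. Because it carries full floating-point precision through every step, the result should match the hand-computed values only up to rounding in the third decimal place.

```python
import numpy as np

x = np.array([0.5, -0.3, 0.8, 0.1])

W_gate = np.array([[ 0.2,  0.1, -0.3,  0.4,  0.0, -0.2],
                   [-0.1,  0.3,  0.2, -0.1,  0.5,  0.1],
                   [ 0.4, -0.2,  0.1,  0.3, -0.1,  0.2],
                   [ 0.0,  0.1, -0.1,  0.2,  0.3, -0.3]])

W_up = np.array([[ 0.3, -0.1,  0.2,  0.0,  0.4, -0.1],
                 [ 0.1,  0.2, -0.3,  0.5, -0.2,  0.3],
                 [-0.2,  0.4,  0.1, -0.1,  0.3,  0.0],
                 [ 0.2, -0.3,  0.0,  0.1,  0.1,  0.2]])

W_down = np.array([[ 0.1, -0.2,  0.3,  0.0],
                   [ 0.2,  0.1, -0.1,  0.4],
                   [-0.3,  0.2,  0.0,  0.1],
                   [ 0.1,  0.0,  0.2, -0.3],
                   [ 0.0,  0.3, -0.2,  0.1],
                   [-0.1,  0.1,  0.1,  0.2]])

# Step 1: gate path, then Swish(z) = z * sigmoid(z)
gate_pre = x @ W_gate
gate = gate_pre / (1.0 + np.exp(-gate_pre))

# Step 2: up path (linear only)
up = x @ W_up

# Step 3: element-wise gating
h = gate * up

# Step 4: down-projection back to hidden_size
output = h @ W_down

print("gate_pre:", np.round(gate_pre, 2))
print("gate:    ", np.round(gate, 3))
print("up:      ", np.round(up, 2))
print("h:       ", np.round(h, 3))
print("output:  ", np.round(output, 3))
```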


The FFN and Mixture-of-Experts

In Chapter 12, we will cover Mixture-of-Experts (MoE) in detail. But it is worth noting here that MoE architectures replace the single FFN in each Transformer layer with multiple “expert” FFN blocks and a router that selects which experts to use for each token.

LLaMA 4 Maverick alternates between dense layers and MoE layers. Each MoE layer has 1 shared expert and 128 routed experts. Each expert is a complete SwiGLU FFN with its own W_gate, W_up, and W_down matrices. For each token, the router uses Top-1 selection: the shared expert plus 1 of the 128 routed experts. This means each token is processed by 2 FFN blocks per MoE layer, while the model has 129 FFN blocks available per MoE layer.

This is why LLaMA 4 Maverick has 400 billion total parameters but only 17 billion active parameters: the vast majority of the parameters sit in the 128 routed expert FFN blocks, and only 1 of those 128 is activated for any given token. The attention layers, the dense FFN layers, and the shared expert are always active.

The FFN is the natural place for MoE because it is the largest component of each layer and it processes tokens independently. Different experts can specialize in different types of knowledge or patterns, and the router learns to direct each token to the most appropriate expert.

Source: LLaMA 4 Maverick from Meta (April 5, 2025): 17B active parameters, 400B total parameters, 128 routed experts + 1 shared expert per MoE layer, Top-1 routing, SwiGLU activation, alternating dense and MoE layers. PyTorch blog, “MetaShuffling: Accelerating Llama 4 MoE Inference,” May 12, 2025.
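The routing pattern described above (one shared expert plus Top-1 selection among routed experts) can be sketched in a few lines. This is a deliberately simplified toy, not Meta's implementation: the sizes are tiny, the router is a single random matrix named `W_router` (an assumption for illustration), and the sketch omits the router-score weighting and load balancing that production MoE layers use.

```python
import numpy as np

def swiglu_ffn(X, W_gate, W_up, W_down):
    """Illustrative bias-free SwiGLU FFN applied row by row."""
    g = X @ W_gate
    return ((g / (1.0 + np.exp(-g))) * (X @ W_up)) @ W_down

def init_expert(rng, hidden, inter):
    """Random (W_gate, W_up, W_down) for one toy expert."""
    return (rng.normal(size=(hidden, inter)) * 0.1,
            rng.normal(size=(hidden, inter)) * 0.1,
            rng.normal(size=(inter, hidden)) * 0.1)

rng = np.random.default_rng(0)
hidden, inter, n_routed, seq_len = 8, 16, 4, 6  # toy sizes, not Maverick's

shared = init_expert(rng, hidden, inter)
experts = [init_expert(rng, hidden, inter) for _ in range(n_routed)]
W_router = rng.normal(size=(hidden, n_routed)) * 0.1  # hypothetical router

X = rng.normal(size=(seq_len, hidden))

scores = X @ W_router            # one score per (token, routed expert)
choice = scores.argmax(axis=1)   # Top-1: a single routed expert per token

output = swiglu_ffn(X, *shared)  # the shared expert processes every token
for e in range(n_routed):
    mask = choice == e           # tokens routed to expert e
    if mask.any():
        output[mask] += swiglu_ffn(X[mask], *experts[e])

print("Routed expert per token:", choice)
print("Output shape:", output.shape)
```

Each token passes through exactly 2 of the 5 FFN blocks here, mirroring the 2-of-129 pattern described for Maverick's MoE layers.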


No Biases in Modern FFN Layers

One detail worth noting: modern LLMs typically do not use bias terms in their FFN layers. The original Transformer included biases (b_1 and b_2 in the FFN formula), but LLaMA, Mistral, DeepSeek, and most other recent models omit bias parameters from the FFN entirely.

The SwiGLU FFN in these models is:

output = (Swish(x @ W_gate) ⊙ (x @ W_up)) @ W_down

No bias terms anywhere. This simplifies the implementation, reduces the parameter count slightly, and has been found to have negligible impact on model quality. The weight matrices alone provide sufficient representational capacity.

Source: LLaMA models use mlp_bias=False per HuggingFace Transformers configuration. Mistral 7B similarly uses no MLP bias.
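To see how slight the parameter reduction is, the sketch below counts the biases a LLaMA 3 8B FFN layer would carry if each of its three projections had one. This is a hypothetical configuration used only for comparison; the actual model has no such parameters.

```python
hidden_size, intermediate_size = 4096, 14336  # LLaMA 3 8B FFN dimensions

# Weights actually present in one SwiGLU FFN layer
weights = 3 * hidden_size * intermediate_size

# Hypothetical biases: one per output unit of W_gate, W_up, and W_down
biases = intermediate_size + intermediate_size + hidden_size

share = biases / (weights + biases)

print(f"Weights per layer:           {weights:,}")
print(f"Hypothetical biases:         {biases:,}")
print(f"Bias share of layer params:  {share:.4%}")
```

Biases would add only a few hundredths of a percent to each layer, which is why dropping them changes the parameter count so little.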


Hands-On: Full SwiGLU FFN for a Sequence

Let’s implement a complete SwiGLU FFN that processes an entire sequence of tokens, matching the structure used in real models:

import numpy as np

def swiglu_ffn_sequence(X, W_gate, W_up, W_down):
    """SwiGLU FFN applied to a full sequence.
    
    X: input matrix, shape [seq_len, hidden_size]
    W_gate: shape [hidden_size, intermediate_size]
    W_up: shape [hidden_size, intermediate_size]
    W_down: shape [intermediate_size, hidden_size]
    
    Returns: output matrix, shape [seq_len, hidden_size]
    """
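    # Gate path: linear projection followed by Swish(z) = z * sigmoid(z).
    # Clipping the sigmoid argument avoids overflow warnings in np.exp.
    gate = X @ W_gate
    gate = gate * (1 / (1 + np.exp(-np.clip(gate, -500, 500))))
    # Up path: a plain linear projection with no activation
    up = X @ W_up
    # Element-wise gating, then projection back down to hidden_size
    h = gate * up
    return h @ W_down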


# Simulate LLaMA 3 8B dimensions (scaled down for speed)
np.random.seed(42)
seq_len = 10
hidden_size = 128
intermediate_size = 448  # ~3.5x, matching LLaMA 3's ratio

X = np.random.randn(seq_len, hidden_size) * 0.5
W_gate = np.random.randn(hidden_size, intermediate_size) * (2 / hidden_size)**0.5
W_up = np.random.randn(hidden_size, intermediate_size) * (2 / hidden_size)**0.5
W_down = np.random.randn(intermediate_size, hidden_size) * (2 / intermediate_size)**0.5

output = swiglu_ffn_sequence(X, W_gate, W_up, W_down)

print(f"Input shape:  {X.shape}")
print(f"Output shape: {output.shape}")
print(f"Expansion ratio: {intermediate_size / hidden_size:.1f}x")
print()

# Parameter count
total_params = 3 * hidden_size * intermediate_size
print(f"FFN parameters: {total_params:,}")
print(f"  W_gate: {hidden_size} x {intermediate_size} = {hidden_size * intermediate_size:,}")
print(f"  W_up:   {hidden_size} x {intermediate_size} = {hidden_size * intermediate_size:,}")
print(f"  W_down: {intermediate_size} x {hidden_size} = {intermediate_size * hidden_size:,}")
print()

# Verify each token is processed independently
output_token0 = swiglu_ffn_sequence(X[0:1], W_gate, W_up, W_down)
print(f"Token 0 output (from full sequence): {output[0, :5].round(4)}")
print(f"Token 0 output (processed alone):    {output_token0[0, :5].round(4)}")
print(f"Identical: {np.allclose(output[0], output_token0[0])}")

The last check is important: it verifies that the FFN processes each token independently. The output for token 0 is identical whether we process the full 10-token sequence or just token 0 alone. This confirms that the FFN has no inter-token interaction; all inter-token communication happens in the attention layers.


Sparsity in FFN Activations

An interesting property of FFN layers is that their activations are often sparse: for any given input, many of the neurons in the expanded layer produce zero or near-zero outputs. This sparsity is especially pronounced with ReLU (which explicitly zeros out negative values) but also occurs with SwiGLU (where the gating mechanism suppresses many neurons).

Li et al. (2022) documented this as the “lazy neuron phenomenon”: in trained Transformers using ReLU, fewer than 10% of FFN neurons are activated per token. For example, T5-Base showed only 3.0% nonzero entries in its FFN activation maps, and ViT-B16 showed 6.3%. Models using SwiGLU do not produce exact zeros (since Swish is smooth and never exactly zero for nonzero inputs), but the gating mechanism still produces near-zero activations for many neurons, creating effective sparsity.

Source: Li et al., “The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers,” ICLR 2023 (arXiv:2210.06313, October 2022). Showed that fewer than 10% of FFN neurons produce nonzero activations per token in trained ReLU Transformers.

This natural sparsity is one of the motivations behind Mixture-of-Experts architectures (Chapter 12): if only a fraction of neurons are active anyway, why not formalize this by having separate expert networks and routing each token to the most relevant ones? MoE takes the implicit sparsity of dense FFN layers and makes it explicit and structured.
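The sketch below shows one way activation sparsity could be measured. It uses untrained random weights, so it does not reproduce the lazy neuron phenomenon: with random weights roughly half of the ReLU activations are nonzero, whereas the sub-10% figures reported by Li et al. emerge only after training. The 1% near-zero threshold for SwiGLU is an arbitrary choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, hidden, inter = 256, 64, 256  # toy sizes, assumed for this sketch

X = rng.normal(size=(n_tokens, hidden))
W_1 = rng.normal(size=(hidden, inter)) / np.sqrt(hidden)
W_gate = rng.normal(size=(hidden, inter)) / np.sqrt(hidden)
W_up = rng.normal(size=(hidden, inter)) / np.sqrt(hidden)

# ReLU FFN: intermediate activations contain exact zeros
relu_act = np.maximum(X @ W_1, 0.0)
relu_nonzero = np.mean(relu_act > 0)

# SwiGLU FFN: intermediate activations are smooth, so zeros are not exact
g = X @ W_gate
swiglu_act = (g / (1.0 + np.exp(-g))) * (X @ W_up)
swiglu_exact_zero = np.mean(swiglu_act == 0)

# "Near zero" here means below 1% of the largest magnitude (arbitrary cutoff)
threshold = 0.01 * np.abs(swiglu_act).max()
swiglu_near_zero = np.mean(np.abs(swiglu_act) < threshold)

print(f"ReLU nonzero fraction:      {relu_nonzero:.1%}")
print(f"SwiGLU exact-zero fraction: {swiglu_exact_zero:.1%}")
print(f"SwiGLU near-zero fraction:  {swiglu_near_zero:.1%}")
```

Running the same measurement on the intermediate activations of a trained model is what reveals the much lower nonzero fractions discussed above.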


Key Takeaways

  • The feed-forward network (FFN) is the other major component of each Transformer layer, alongside attention. Attention gathers contextual information from other tokens; the FFN processes each token’s information independently through matrix multiplications and nonlinear activations.

  • The basic FFN architecture is expand, activate, contract: project the input to a higher-dimensional space (up-projection), apply a nonlinear activation, then project back to the original dimension (down-projection). The original Transformer used a 4x expansion ratio with ReLU activation.

  • Modern LLMs use SwiGLU, a gated activation that combines the Swish function with a Gated Linear Unit. SwiGLU uses three weight matrices (W_gate, W_up, W_down) instead of two, providing more expressive transformations. It was proposed by Shazeer (2020) and is used in LLaMA, Mistral, DeepSeek, and virtually every other major open-weight model.

  • The FFN contains the majority of each layer’s parameters. In LLaMA 3 8B, the FFN accounts for 80.7% of per-layer parameters (176.2M out of 218.2M). In the original Transformer, the FFN was exactly two-thirds (66.7%). Across all 32 layers of LLaMA 3 8B, the FFN contributes 5.64 billion of the model’s approximately 8 billion total parameters.

  • Research has shown that FFN layers function as key-value memories that store factual knowledge (Geva et al., EMNLP 2021). Factual associations are primarily mediated by FFN modules in the middle layers of the Transformer (Meng et al., NeurIPS 2022), and these associations can be directly edited by modifying FFN weights. Specific “knowledge neurons” responsible for individual facts have been identified (Dai et al., ACL 2022).

  • The FFN processes each token independently. There is no interaction between tokens in the FFN; all inter-token communication happens in the attention layers. This independence enables full parallelization and is a key reason MoE architectures replace the FFN (not the attention) with multiple expert networks.

  • In Mixture-of-Experts models like LLaMA 4 Maverick, the single FFN per layer is replaced by multiple expert FFN blocks (128 routed experts + 1 shared expert). Each expert is a complete SwiGLU FFN. A router selects which experts process each token, enabling the model to have 400 billion total parameters while only activating 17 billion per token.

  • Most modern LLMs use no bias terms in their FFN layers. The weight matrices alone provide sufficient representational capacity, and removing biases simplifies implementation with negligible quality impact.


What’s Next

You now understand how the feed-forward network transforms each token’s representation through an expand-activate-contract pattern, how SwiGLU’s gating mechanism provides more expressive transformations than simple ReLU, and why the FFN contains the majority of the model’s parameters and factual knowledge. In Chapter 10, we will put the attention and FFN components together with layer normalization and residual connections to form the complete Transformer block, and see how stacking dozens or hundreds of these blocks creates the deep networks that power modern language models.