Chapter 17. Token-by-Token Generation

Every response you have ever received from ChatGPT, Claude, Gemini, or any other language model was produced one token at a time, from left to right, with each new token depending on every token that came before it. This process, called autoregressive generation, is the fundamental mechanism that turns a trained model into something you can actually talk to. This chapter explains exactly how it works: how the model picks each token, how parameters like temperature and top-p control the randomness of those choices, how penalties prevent repetitive loops, and how stop tokens tell the model when to be quiet.

Autoregressive Generation: One Token at a Time

In Chapter 14, you learned that language models are trained to predict the next token given all previous tokens. During pre-training, the model sees billions of sequences and learns to minimize the cross-entropy loss between its predictions and the actual next tokens. At inference time (when you actually use the model), this same next-token prediction runs in a loop: the model predicts one token, appends it to the sequence, and then predicts the next token based on the updated sequence.

This is autoregressive generation. The word “autoregressive” means “self-feeding”: each output becomes part of the input for the next step. Here is the process, step by step:

1. You provide a prompt (your input text), which gets tokenized into a sequence of token IDs.
2. The model processes the entire prompt through all its transformer layers (attention, feed-forward networks, layer normalization, as covered in Chapters 7 through 10).
3. The final layer produces a logits vector: a list of raw scores, one for each token in the vocabulary. For LLaMA 3, this vector has 128,256 entries. For GPT-4o, it has approximately 200,000 entries (using the o200k_base tokenizer).
4. These logits are converted into probabilities using the softmax function (covered in Chapter 2).
5. A token is selected from this probability distribution using a sampling strategy (which we will cover in detail below).
6. The selected token is appended to the sequence.
7. Steps 2 through 6 repeat until a stop condition is met (the model generates a stop token, or a maximum length is reached).

Each iteration of this loop produces exactly one token. A 500-token response requires 500 passes through the model. A 10,000-token response requires 10,000 passes. This is why longer responses take longer to generate: the model is doing real computation for every single token.

import numpy as np

def autoregressive_generate(model, prompt_tokens, max_new_tokens=50):
    """
    Simplified autoregressive generation loop.
    Each iteration: run model, sample one token, append to sequence.
    """
    sequence = list(prompt_tokens)
    
    for _ in range(max_new_tokens):
        # Step 1: Run the model on the full sequence so far
        logits = model.forward(sequence)  # Shape: (vocab_size,)
        
        # Step 2: Convert logits to probabilities
        probabilities = softmax(logits)
        
        # Step 3: Sample one token from the distribution
        next_token = np.random.choice(len(probabilities), p=probabilities)
        
        # Step 4: Append to sequence
        sequence.append(next_token)
        
        # Step 5: Check for stop token
        if next_token == model.eos_token_id:
            break
    
    return sequence

def softmax(logits):
    """Convert raw logits to probabilities."""
    exp_logits = np.exp(logits - np.max(logits))  # Subtract max for numerical stability
    return exp_logits / exp_logits.sum()

This loop is deceptively simple, but it is the core of every language model interaction you have ever had. The complexity lies in two places: how the model computes the logits (covered in previous chapters), and how we select a token from the resulting probability distribution (covered in this chapter).

Why Not Generate All Tokens at Once?

A natural question: if the model knows so much about language, why can it not just produce the entire response in one shot? The answer is that each token depends on the tokens before it. The probability of the fifth word in a sentence depends on what the first four words were. Since the model has not yet decided what those first four words are, it cannot compute the probability of the fifth word until it has generated the first four.

This sequential dependency is fundamental to how autoregressive models work. It is also their primary speed bottleneck: you cannot parallelize the generation of tokens within a single sequence because each token depends on the previous one. (During training, this is not a problem because the model sees the entire correct sequence and can compute all positions in parallel using masked attention. But at inference time, the model must generate tokens one at a time.)

This bottleneck is one motivation for non-autoregressive approaches. At Google I/O on May 21, 2025, Google DeepMind announced Gemini Diffusion: a research model that uses diffusion (the same technique behind image generators like Stable Diffusion) instead of autoregressive generation. Diffusion models can generate entire blocks of text in parallel by iteratively refining noise into coherent output. Google DeepMind reported a benchmark speed of 1,479 tokens per second for general tasks, with speeds reaching up to 2,000 tokens per second on programming tasks (according to DeepMind researcher Brendan O’Donoghue). Practical demos showed speeds closer to 857 tokens per second depending on the task. As of March 2026, Gemini Diffusion remains an experimental research demo, but it represents a potential future alternative to the autoregressive paradigm described in this chapter.

Two other techniques also address this sequential bottleneck without abandoning autoregressive generation entirely. Multi-token prediction (Gloeckle et al., arXiv:2404.19737, ICML 2024) trains the model with multiple output heads that predict several future tokens simultaneously, enabling 2 to 3x inference speedups through self-speculative decoding. This technique has been adopted in production models including DeepSeek-V3 and Qwen 3.5. Speculative decoding uses a small, fast “draft” model to propose multiple candidate tokens, which the larger “target” model then verifies in a single parallel forward pass. We will cover speculative decoding in detail in Chapter 24.
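The draft-and-verify idea behind speculative decoding can be sketched with two toy "models". This is a simplified greedy variant under stated assumptions, not the full algorithm: target_next and draft_next are hypothetical stand-ins invented for illustration, and the verification loop calls the target once per position, whereas a real system scores all drafted positions in one parallel forward pass.

```python
def target_next(seq):
    """Hypothetical 'large' model: a deterministic next token for a prefix."""
    return (sum(seq) * 7 + len(seq)) % 10

def draft_next(seq):
    """Hypothetical 'small' model: agrees with the target except when
    the prefix length is a multiple of 4."""
    guess = target_next(seq)
    return guess if len(seq) % 4 else (guess + 1) % 10

def target_only_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq

def speculative_decode(prompt, n_new, k=4):
    """Draft k tokens, keep the prefix the target agrees with,
    then take the target's own token at the first disagreement."""
    seq = list(prompt)
    final_len = len(prompt) + n_new
    rounds = 0
    while len(seq) < final_len:
        rounds += 1
        # The draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # Verification (conceptually a single parallel target pass).
        for i in range(k):
            expected = target_next(seq + draft[:i])
            if draft[i] != expected:
                seq.extend(draft[:i])   # accepted draft tokens
                seq.append(expected)    # target's correction
                break
        else:
            seq.extend(draft)           # every drafted token was accepted
    return seq[:final_len], rounds

tokens, rounds = speculative_decode([1, 2, 3], n_new=20)
print("Same output as target-only decoding:", tokens == target_only_decode([1, 2, 3], 20))
print("Verification rounds used:", rounds)
```

In this greedy form the output is identical to what the target model would have produced on its own; the saving comes from needing fewer verification rounds than generated tokens whenever the draft model guesses well.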

Source: Google DeepMind, “Gemini Diffusion,” announced May 21, 2025 at Google I/O (deepmind.google/models/gemini-diffusion). Simon Willison, “Gemini Diffusion,” May 21, 2025 (simonwillison.net). 1,479 tokens per second reported by Google DeepMind; up to 2,000 tokens per second on programming tasks per Brendan O’Donoghue (the-decoder.com, May 22, 2025); 857 tokens per second observed in live demo (gigazine.net, May 22, 2025). Gloeckle et al., “Better & Faster Large Language Models via Multi-token Prediction,” arXiv:2404.19737, April 2024. ICML 2024. Meta FAIR.

The Logits Vector: Raw Predictions

Before we can discuss how tokens are selected, we need to understand what the model actually outputs at each step. The final layer of the transformer produces a logits vector: one number for every token in the vocabulary. These numbers are raw, unnormalized scores that indicate how likely the model thinks each token is to come next.

For a model with a vocabulary of 128,256 tokens (like LLaMA 3), the logits vector has 128,256 entries. Most of these entries will be large negative numbers (tokens that are extremely unlikely), a handful will be near zero, and a few will be positive (the tokens the model considers most likely).

Here is a concrete example. Suppose the model has processed the prompt “The capital of France is” and produced logits for the next token. The logits might look something like this (showing only the top entries out of 128,256):

import numpy as np

# Simulated logits for "The capital of France is ___"
# In reality, these come from the model's final linear layer
# Only showing top tokens out of ~128,256 total

token_logits = {
    " Paris":     8.2,
    " the":       3.1,
    " a":         2.8,
    " located":   2.5,
    " known":     2.0,
    " one":       1.5,
    " called":    1.2,
    " not":       0.8,
    " definitely": 0.3,
    # ... (the rest of the vocabulary is omitted; a few low-scoring examples follow)
    " pizza":    -5.2,
    " quantum":  -8.7,
    " xylophone": -12.3,
}

# Convert to arrays for computation
tokens = list(token_logits.keys())
logits = np.array(list(token_logits.values()))

# Apply softmax to get probabilities
probs = np.exp(logits - np.max(logits)) / np.sum(np.exp(logits - np.max(logits)))

print("Token Probabilities (after softmax):")
print(f"{'Token':<15} {'Logit':>8} {'Probability':>12}")
print("-" * 38)
for token, logit, prob in zip(tokens, logits, probs):
    print(f"{token:<15} {logit:>8.1f} {prob:>12.4%}")

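print(f"\nTotal probability shown: {sum(probs):.4%}")
print("(100% here because softmax ran over only the 12 tokens listed;")
print(" over the full vocabulary, the other ~128,244 tokens would take a share.)")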

The key insight is that the logits are not probabilities. They can be any real number: positive, negative, or zero. The softmax function converts them into a valid probability distribution where all values are between 0 and 1 and they sum to 1. After softmax over the full vocabulary, " Paris" might have a probability of 0.85, " the" might have 0.05, and the remaining 128,254 tokens share the remaining 0.10. (In the twelve-token example above, softmax is taken over only the tokens shown, so " Paris" comes out near 98%.)

The question is: given this probability distribution, how do we pick the next token?

Greedy Decoding: Always Pick the Most Likely Token

The simplest strategy is greedy decoding: always pick the token with the highest probability. If " Paris" has probability 0.85, pick " Paris". If the next step has " ," at 0.60, pick " ,". And so on.

def greedy_decode(logits):
    """Always select the token with the highest logit."""
    return np.argmax(logits)

Greedy decoding is deterministic: given the same prompt, it always produces the same output. This sounds like a good thing, but it has serious problems for open-ended text generation:

Repetitive output: Greedy decoding tends to get stuck in loops. The model generates “The cat sat on the mat. The cat sat on the mat. The cat sat on the mat.” because at each step, the most likely continuation is to repeat what just happened.
Bland, generic text: By always picking the most probable token, greedy decoding produces text that is safe and predictable but lacks variety or creativity. It gravitates toward the most common phrases in the training data.
Locally optimal, globally suboptimal: Picking the best token at each step does not guarantee the best overall sequence. Sometimes a slightly less likely token at step 5 leads to a much better sequence overall.

Greedy decoding is still used in specific situations where determinism and precision matter: code generation, structured data extraction, and factual question answering. But for general text generation, we need something better.
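The "locally optimal, globally suboptimal" problem is easiest to see on a tiny example. The sketch below uses a hypothetical two-step model whose probabilities are invented for illustration: greedy decoding commits to the most likely first token, while an exhaustive search over both steps finds a more probable sequence overall.

```python
from itertools import product

# Hypothetical two-step "model"; all probabilities are made up.
FIRST = {"A": 0.6, "B": 0.4}
SECOND = {
    "A": {"x": 0.4, "y": 0.3, "z": 0.3},    # after "A", nothing stands out
    "B": {"x": 0.9, "y": 0.05, "z": 0.05},  # after "B", "x" is near certain
}

def sequence_prob(first, second):
    """Joint probability of a two-token sequence."""
    return FIRST[first] * SECOND[first][second]

def greedy_sequence():
    """Pick the most likely token at each step, never looking ahead."""
    first = max(FIRST, key=FIRST.get)
    second = max(SECOND[first], key=SECOND[first].get)
    return first, second

def best_sequence():
    """Exhaustive search over all two-token sequences."""
    candidates = product(FIRST, ["x", "y", "z"])
    return max(candidates, key=lambda s: sequence_prob(*s))

greedy = greedy_sequence()
best = best_sequence()
print(f"Greedy: {greedy}  probability {sequence_prob(*greedy):.2f}")
print(f"Best:   {best}  probability {sequence_prob(*best):.2f}")
```

Exhaustive search is only feasible here because the toy vocabulary is tiny; with a real vocabulary the number of candidate sequences grows exponentially with length.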

Temperature: Controlling Randomness

Temperature is the most important parameter for controlling how “creative” or “random” a model’s output is. It works by rescaling the logits before applying softmax, which changes the shape of the probability distribution.

The temperature-modified softmax is:

p_i = exp(z_i / T) / sum(exp(z_j / T))

Where z_i is the logit for token i, T is the temperature, and the sum is over all tokens in the vocabulary.

Here is what different temperature values do:

T = 1.0 (default): The probabilities are exactly what the model learned during training. No modification.
T < 1.0 (e.g., 0.2): The distribution becomes sharper. High-probability tokens get even higher probability, and low-probability tokens get pushed closer to zero. The model becomes more deterministic and predictable.
T > 1.0 (e.g., 1.5): The distribution becomes flatter. The gap between high-probability and low-probability tokens shrinks. The model becomes more random and “creative,” but also more likely to produce nonsensical output.
T approaching 0: The distribution collapses to a spike on the highest-probability token. This is equivalent to greedy decoding.
T approaching infinity: The distribution becomes uniform. Every token in the vocabulary is equally likely. This produces random gibberish.

import numpy as np

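def softmax_with_temperature(logits, temperature=1.0):
    """Apply temperature scaling before softmax (temperature must be > 0)."""
    if temperature <= 0:
        # T = 0 is the greedy limit; use argmax instead of dividing by zero
        raise ValueError("temperature must be positive; use argmax for T = 0")
    scaled_logits = logits / temperature
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))
    return exp_logits / exp_logits.sum()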

# Example: logits for "The capital of France is ___"
logits = np.array([8.2, 3.1, 2.8, 2.5, 2.0, 1.5])
tokens = ["Paris", "the", "a", "located", "known", "one"]

print("Effect of Temperature on Token Probabilities")
print(f"{'Token':<10}", end="")
for t in [0.1, 0.5, 1.0, 1.5, 2.0]:
    print(f"{'T='+str(t):>10}", end="")
print()
print("-" * 60)

for i, token in enumerate(tokens):
    print(f"{token:<10}", end="")
    for t in [0.1, 0.5, 1.0, 1.5, 2.0]:
        probs = softmax_with_temperature(logits, temperature=t)
        print(f"{probs[i]:>10.4f}", end="")
    print()

print("\nAt T=0.1, 'Paris' gets ~100% of the probability (near-greedy).")
print("At T=2.0, the distribution is much flatter, giving other tokens a real chance.")

Practical Temperature Guidelines

Different tasks call for different temperatures:

Temperature	Behavior	Good For
0.0	Greedy (deterministic)	Code generation, factual Q&A, structured output
0.1 - 0.3	Very focused, minimal variation	Technical writing, data extraction, translation
0.5 - 0.7	Balanced creativity and coherence	General conversation, summarization
0.8 - 1.0	More creative, some surprises	Creative writing, brainstorming
1.0 - 1.5	High creativity, risk of incoherence	Poetry, experimental writing
> 1.5	Increasingly random	Rarely useful in practice

Defaults and ranges vary by provider. OpenAI’s API defaults to temperature 1.0 and allows values between 0 and 2, while Anthropic’s API accepts values between 0 and 1. OpenAI’s documentation recommends adjusting either temperature or top-p, but not both at the same time, because they both control randomness and their effects can interact unpredictably.

The Math Behind Temperature

To understand why temperature works, consider two tokens with logits 8.0 and 3.0. At temperature 1.0:

p(token_A) = exp(8.0) / (exp(8.0) + exp(3.0))
           = 2981 / (2981 + 20.1)
           = 0.993  (99.3%)

p(token_B) = exp(3.0) / (exp(8.0) + exp(3.0))
           = 20.1 / (2981 + 20.1)
           = 0.007  (0.7%)

Token A is 148 times more likely than token B. Now at temperature 0.5 (dividing logits by 0.5, which doubles them):

p(token_A) = exp(16.0) / (exp(16.0) + exp(6.0))
           = 8.9M / (8.9M + 403)
           = 0.99995  (99.995%)

p(token_B) = exp(6.0) / (exp(16.0) + exp(6.0))
           = 403 / (8.9M + 403)
           = 0.00005  (0.005%)

Token A is now about 22,000 times more likely. The lower temperature amplified the gap between the two tokens. At temperature 2.0 (dividing logits by 2.0, which halves them):

p(token_A) = exp(4.0) / (exp(4.0) + exp(1.5))
           = 54.6 / (54.6 + 4.48)
           = 0.924  (92.4%)

p(token_B) = exp(1.5) / (exp(4.0) + exp(1.5))
           = 4.48 / (54.6 + 4.48)
           = 0.076  (7.6%)

Token A is now only 12 times more likely. The higher temperature compressed the gap, giving token B a much better chance of being selected.
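The three worked cases follow one rule: with two tokens, the ratio p(token_A) / p(token_B) equals exp((z_A - z_B) / T), because the softmax denominator cancels. The short check below is a sketch that recomputes the ratios both ways; the helper names are illustrative and not part of any library.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax (temperature must be positive)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    exp_scaled = np.exp(scaled - scaled.max())
    return exp_scaled / exp_scaled.sum()

def odds_ratio(logit_a, logit_b, temperature):
    """Closed form for p_A / p_B: the denominator cancels out."""
    return float(np.exp((logit_a - logit_b) / temperature))

for t in (0.5, 1.0, 2.0):
    p_a, p_b = softmax_with_temperature([8.0, 3.0], t)
    print(f"T={t}: p_A/p_B from softmax = {p_a / p_b:,.1f}, "
          f"exp(5/T) = {odds_ratio(8.0, 3.0, t):,.1f}")
```

Because the ratio depends only on the logit difference divided by T, halving the temperature squares the odds ratio and doubling it takes the square root.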

Source: OpenAI API documentation, temperature parameter (platform.openai.com/docs). The temperature parameter ranges from 0 to 2, with a default of 1. OpenAI recommends altering temperature or top_p, but not both.

Top-k Sampling: Limiting the Candidate Pool

Top-k sampling restricts the model’s choices to the k most likely tokens, then samples from only those tokens (after renormalizing their probabilities to sum to 1). All other tokens, no matter how many there are, get zero probability.

The technique was popularized by Fan et al. in “Hierarchical Neural Story Generation” (ACL 2018), where it was used to generate more coherent long-form stories. The idea is simple: if the vocabulary has 128,256 tokens but only the top 50 are remotely plausible, why risk sampling from the other 128,206?

import numpy as np

def top_k_sampling(logits, k=50, temperature=1.0):
    """
    Sample from only the top-k most likely tokens.
    """
    # Apply temperature
    scaled_logits = logits / temperature
    
    # Find the top-k token indices
    top_k_indices = np.argsort(scaled_logits)[-k:]
    
    # Mask everything except top-k: a logit of -inf becomes probability 0 after softmax
    filtered_logits = np.full_like(scaled_logits, -np.inf)
    filtered_logits[top_k_indices] = scaled_logits[top_k_indices]
    
    # Convert to probabilities
    probs = np.exp(filtered_logits - np.max(filtered_logits))
    probs = probs / probs.sum()
    
    # Sample
    return np.random.choice(len(probs), p=probs)

# Example: top-k with different k values
logits = np.array([8.2, 3.1, 2.8, 2.5, 2.0, 1.5, 0.8, 0.3, -0.5, -2.0])
tokens = ["Paris", "the", "a", "located", "known", "one", "not", "definitely", "pizza", "quantum"]

print("Top-k Sampling: Which tokens are candidates?")
print(f"{'k':<5} {'Candidates':<50} {'Top token prob':>15}")
print("-" * 72)
for k in [1, 3, 5, 10]:
    top_indices = np.argsort(logits)[-k:]
    candidates = [tokens[i] for i in sorted(top_indices, key=lambda x: -logits[x])]
    
    # Compute renormalized probability of top token
    top_logits = logits[top_indices]
    top_probs = np.exp(top_logits - np.max(top_logits))
    top_probs = top_probs / top_probs.sum()
    max_prob = top_probs.max()
    
    print(f"{k:<5} {', '.join(candidates):<50} {max_prob:>14.2%}")

The Problem with Top-k

Top-k has a fundamental limitation: the right value of k depends on the context. Sometimes the model is very confident and only 2 or 3 tokens are reasonable. Other times, dozens of tokens are plausible. A fixed k cannot adapt to both situations:

If k is too small (e.g., k=5), the model cannot express uncertainty when many tokens are plausible. It is forced to choose among a narrow set even when the distribution is flat.
If k is too large (e.g., k=100), the model might sample from tokens that are implausible when the distribution is peaked. A token with 0.001% probability might get selected, producing nonsensical output.

This limitation motivated the development of a more adaptive approach: top-p sampling.
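To make both failure modes concrete, the sketch below applies the same k=50 to two synthetic 1,000-token distributions. The logits are invented for illustration: one distribution is sharply peaked, the other is nearly flat.

```python
import numpy as np

def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

def top_k_summary(probs, k):
    """Return (probability mass kept by top-k, probability of the weakest kept token)."""
    kept = np.sort(probs)[::-1][:k]
    return float(kept.sum()), float(kept[-1])

vocab_size = 1000
# Peaked: one dominant token, everything else equally unlikely.
peaked_probs = softmax(np.concatenate(([12.0], np.zeros(vocab_size - 1))))
# Flat: logits decline slowly, so hundreds of tokens are similarly plausible.
flat_probs = softmax(np.linspace(2.0, 0.0, vocab_size))

peaked_mass, peaked_weakest = top_k_summary(peaked_probs, k=50)
flat_mass, flat_weakest = top_k_summary(flat_probs, k=50)

print(f"Peaked: top-50 keeps {peaked_mass:.2%} of the mass, "
      f"but admits tokens as unlikely as {peaked_weakest:.6%}")
print(f"Flat:   top-50 keeps only {flat_mass:.2%} of the mass, "
      f"discarding most plausible tokens")
```

In the peaked case, 49 of the 50 candidates are tokens the model considers almost impossible; in the flat case, the same k throws away most of the distribution.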

Source: Fan et al., “Hierarchical Neural Story Generation,” arXiv:1805.04833, May 2018. ACL 2018. The paper introduced top-k sampling for neural text generation to improve story coherence.

Top-p (Nucleus Sampling): Adaptive Token Selection

Top-p sampling, also called nucleus sampling, was introduced by Holtzman et al. in “The Curious Case of Neural Text Degeneration” (arXiv:1904.09751, ICLR 2020). It solves the fixed-k problem by dynamically adjusting the number of candidate tokens based on the shape of the probability distribution.

Instead of picking a fixed number of tokens, top-p picks the smallest set of tokens whose cumulative probability exceeds a threshold p. If the model is very confident (one token has 95% probability), the nucleus might contain just 1 or 2 tokens. If the model is uncertain (many tokens have similar probabilities), the nucleus might contain 50 or 100 tokens.

import numpy as np

def top_p_sampling(logits, p=0.9, temperature=1.0):
    """
    Nucleus sampling: sample from the smallest set of tokens
    whose cumulative probability exceeds p.
    """
    # Apply temperature
    scaled_logits = logits / temperature
    
    # Convert to probabilities
    probs = np.exp(scaled_logits - np.max(scaled_logits))
    probs = probs / probs.sum()
    
    # Sort tokens by probability (descending)
    sorted_indices = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_indices]
    
    # Find the nucleus: smallest set with cumulative prob >= p
    cumulative_probs = np.cumsum(sorted_probs)
    nucleus_size = np.searchsorted(cumulative_probs, p) + 1
    
    # Keep only nucleus tokens
    nucleus_indices = sorted_indices[:nucleus_size]
    nucleus_probs = probs[nucleus_indices]
    nucleus_probs = nucleus_probs / nucleus_probs.sum()  # Renormalize
    
    # Sample from nucleus
    chosen_idx = np.random.choice(nucleus_indices, p=nucleus_probs)
    return chosen_idx

# Demonstrate how nucleus size adapts to confidence
print("Top-p Sampling: Nucleus size adapts to model confidence\n")

# Scenario 1: Model is very confident
confident_logits = np.array([10.0, 2.0, 1.5, 1.0, 0.5, 0.0, -1.0, -2.0, -3.0, -5.0])
confident_probs = np.exp(confident_logits - np.max(confident_logits))
confident_probs = confident_probs / confident_probs.sum()

# Scenario 2: Model is uncertain
uncertain_logits = np.array([3.0, 2.8, 2.6, 2.4, 2.2, 2.0, 1.8, 1.6, 1.4, 1.2])
uncertain_probs = np.exp(uncertain_logits - np.max(uncertain_logits))
uncertain_probs = uncertain_probs / uncertain_probs.sum()

for name, probs in [("Confident", confident_probs), ("Uncertain", uncertain_probs)]:
    sorted_probs = np.sort(probs)[::-1]
    cumulative = np.cumsum(sorted_probs)
    nucleus_size = np.searchsorted(cumulative, 0.9) + 1
    print(f"{name} distribution:")
    print(f"  Top token probability: {sorted_probs[0]:.2%}")
    print(f"  Nucleus size (p=0.9): {nucleus_size} tokens")
    print(f"  Top 3 cumulative:     {cumulative[2]:.2%}")
    print()

Why Top-p Is Better Than Top-k

The adaptive nature of top-p makes it superior to top-k in most situations:

Scenario	Top-k (k=50)	Top-p (p=0.9)
Model is confident (1 token at 95%)	Includes 49 unnecessary tokens	Nucleus has ~1-2 tokens
Model is uncertain (flat distribution)	Might exclude plausible tokens	Nucleus expands to include all plausible tokens
After “The capital of France is”	50 tokens, most irrelevant	1-3 tokens (Paris, the, a)
After “I like to eat”	50 tokens, some odd	20-40 tokens (many foods are plausible)

Top-p is exposed as a sampling parameter by most language model APIs. OpenAI defaults to top_p=1.0 (which means no filtering; all tokens are candidates). Anthropic, Google, and other providers offer similar parameters. When top-p filtering is enabled, common values in practice range from 0.8 to 0.95.

The Holtzman et al. paper demonstrated that nucleus sampling produces text that is significantly more human-like than greedy decoding, beam search, or pure sampling. They showed that human-written text has a characteristic pattern: at each position, the probability mass is concentrated in a relatively small “nucleus” of tokens, and the tail of the distribution contains tokens that humans would almost never choose. Nucleus sampling respects this pattern by truncating the unreliable tail.

Source: Holtzman et al., “The Curious Case of Neural Text Degeneration,” arXiv:1904.09751, April 2019. ICLR 2020. The paper introduced nucleus sampling (top-p) and demonstrated that it produces more human-like text than greedy decoding, beam search, or pure sampling by truncating the unreliable tail of the probability distribution.

Min-p Sampling: A Newer Alternative

While top-p has been the dominant sampling method since 2020, a newer approach called min-p sampling has gained traction in the open-source community. Introduced by Nguyen et al. in “Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs” (arXiv:2407.01082, July 2024), min-p addresses a subtle problem with top-p at high temperatures.

The issue is this: when you increase the temperature to get more creative output, top-p’s nucleus can expand to include tokens that are genuinely implausible, because the flattened distribution spreads probability mass more thinly and more tokens are needed to reach the cumulative threshold. Min-p solves this by setting a dynamic floor: a token is only included if its probability is at least some fraction of the top token’s probability.

import numpy as np

def min_p_sampling(logits, min_p=0.1, temperature=1.0):
    """
    Min-p sampling: include a token only if its probability
    is at least min_p times the top token's probability.
    """
    # Apply temperature
    scaled_logits = logits / temperature
    
    # Convert to probabilities
    probs = np.exp(scaled_logits - np.max(scaled_logits))
    probs = probs / probs.sum()
    
    # Find the threshold: min_p * max probability
    threshold = min_p * np.max(probs)
    
    # Keep only tokens above the threshold
    mask = probs >= threshold
    filtered_probs = probs * mask
    filtered_probs = filtered_probs / filtered_probs.sum()
    
    return np.random.choice(len(filtered_probs), p=filtered_probs)

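# Compare top-p and min-p at high temperature
# Two strong candidates followed by a long tail of weak ones, so the two
# methods select different candidate sets at T=1.5
logits = np.array([6.0, 5.0, 2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8, 0.6])
tokens = ["Paris", "the", "a", "located", "known", "one", "not", "pizza", "quantum", "xylophone"]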

print("Comparing top-p and min-p at high temperature (T=1.5)")
print("=" * 55)

# At high temperature, probabilities flatten
high_temp_probs = np.exp(logits / 1.5 - np.max(logits / 1.5))
high_temp_probs = high_temp_probs / high_temp_probs.sum()

# Top-p nucleus at p=0.9
sorted_idx = np.argsort(high_temp_probs)[::-1]
cumulative = np.cumsum(high_temp_probs[sorted_idx])
nucleus_size_top_p = np.searchsorted(cumulative, 0.9) + 1

# Min-p candidates at min_p=0.1
threshold = 0.1 * np.max(high_temp_probs)
min_p_candidates = sum(high_temp_probs >= threshold)

print(f"Top-p (p=0.9) nucleus size: {nucleus_size_top_p} tokens")
print(f"Min-p (min_p=0.1) candidates: {min_p_candidates} tokens")
print(f"\nMin-p adapts better at high temperature by excluding")
print(f"tokens that are far less likely than the top choice.")

Min-p has been adopted by popular open-source inference frameworks including Hugging Face Transformers, vLLM, and llama.cpp. It is particularly popular for local LLM deployments where users want creative output without the incoherence that top-p can produce at high temperatures.

Another recent approach is top-nσ sampling (Tang et al., “Top-nσ: Not All Logits Are You Need,” arXiv:2411.07641, November 2024; ACL 2025). Instead of working with probabilities (post-softmax), top-nσ operates directly on the raw logits (pre-softmax). It computes the standard deviation of all logit scores and keeps only tokens whose logits are within n standard deviations of the maximum logit. This statistical threshold is inherently temperature-independent: dividing every logit by T scales both the distance from the maximum and the standard deviation by the same factor, so the set of candidate tokens stays the same regardless of the temperature setting. Top-nσ has been adopted by llama.cpp and other open-source frameworks, and some practitioners consider it the best general-purpose sampler as of early 2026.
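The rule just described can be sketched in a few lines. This is a minimal illustration based on the description above rather than a reference implementation: the function names, the default n=1.0, and the example logits are assumptions made for the example.

```python
import numpy as np

def top_n_sigma_mask(logits, n=1.0):
    """Keep tokens whose logit is within n standard deviations of the maximum."""
    logits = np.asarray(logits, dtype=float)
    threshold = logits.max() - n * logits.std()
    return logits >= threshold

def top_n_sigma_sample(logits, n=1.0, temperature=1.0, rng=None):
    """Filter in logit space, then apply temperature and sample."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)
    keep = top_n_sigma_mask(logits, n)
    # Excluded tokens get -inf, which becomes probability 0 after softmax.
    scaled = np.where(keep, logits / temperature, -np.inf)
    probs = np.exp(scaled - scaled.max())
    probs = probs / probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([8.0, 3.0, 2.5, 2.0, 1.0, 0.0, -1.0, -3.0, -5.0, -8.0])

# The candidate set is the same whether or not the logits are temperature-scaled.
for t in (0.5, 1.0, 2.0):
    kept = int(top_n_sigma_mask(logits / t, n=2.0).sum())
    print(f"T={t}: {kept} candidate tokens")
```

Temperature still changes how probability is shared among the surviving candidates; it just cannot pull additional tail tokens into the candidate set.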

Source: Nguyen et al., “Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs,” arXiv:2407.01082, July 2024. ICLR 2025 Oral. Min-p has been adopted by Hugging Face Transformers, vLLM, and other open-source frameworks. Tang et al., “Top-nσ: Not All Logits Are You Need,” arXiv:2411.07641, November 2024. ACL 2025. Operates on pre-softmax logits using a statistical threshold, providing temperature-independent candidate selection.

Combining Temperature with Sampling Methods

In practice, temperature and sampling methods (top-k, top-p, min-p) are used together. The typical pipeline is:

The model produces raw logits.
Temperature scaling is applied (dividing logits by T).
A sampling method filters the candidates (top-k, top-p, or min-p).
Probabilities are renormalized over the remaining candidates.
A token is sampled from the filtered distribution.

import numpy as np

def generate_token(logits, temperature=1.0, top_p=0.9, top_k=None):
    """
    Full sampling pipeline: temperature -> top-k -> top-p -> sample.
    """
    # Step 1: Temperature scaling (T = 0 means greedy; avoid dividing by zero)
    if temperature == 0:
        return int(np.argmax(logits))
    if temperature != 1.0:
        logits = logits / temperature
    
    # Step 2: Convert to probabilities
    probs = np.exp(logits - np.max(logits))
    probs = probs / probs.sum()
    
    # Step 3: Top-k filtering (if specified)
    if top_k is not None and top_k < len(probs):
        top_k_indices = np.argsort(probs)[-top_k:]
        mask = np.zeros_like(probs)
        mask[top_k_indices] = 1
        probs = probs * mask
        # Renormalize so the top-p step sees a distribution that sums to 1
        probs = probs / probs.sum()
    
    # Step 4: Top-p filtering
    if top_p < 1.0:
        sorted_indices = np.argsort(probs)[::-1]
        sorted_probs = probs[sorted_indices]
        cumulative = np.cumsum(sorted_probs)
        cutoff_idx = np.searchsorted(cumulative, top_p) + 1
        keep_indices = sorted_indices[:cutoff_idx]
        mask = np.zeros_like(probs)
        mask[keep_indices] = 1
        probs = probs * mask
    
    # Step 5: Renormalize and sample
    probs = probs / probs.sum()
    return np.random.choice(len(probs), p=probs)

Here is how different parameter combinations affect the output:

Settings	Behavior	Use Case
T=0.0, top_p=1.0	Greedy (deterministic)	Code, structured output
T=0.3, top_p=0.9	Focused with slight variation	Technical writing
T=0.7, top_p=0.9	Balanced	General conversation
T=1.0, top_p=0.95	Creative but coherent	Creative writing
T=1.0, top_k=50	Creative, fixed candidate pool	Story generation
T=1.2, top_p=0.95	Very creative, some risk	Brainstorming

Repetition Penalties: Avoiding Loops

Even with temperature and top-p sampling, language models can fall into repetitive patterns. They might repeat the same phrase, circle back to the same idea, or get stuck in a loop where the same sequence of tokens repeats indefinitely. This happens because the model’s training data contains repetitive patterns (lists, refrains, repeated structures), and the autoregressive loop can amplify these patterns.

Three types of penalties have been developed to combat this problem. They all work by modifying the logits before sampling, making previously generated tokens less likely to be selected again.

Repetition Penalty (Keskar et al., 2019)

The repetition penalty was introduced in the CTRL paper by Keskar et al. (arXiv:1909.05858, September 2019). As commonly implemented (for example, in Hugging Face Transformers), it divides the logits of previously generated tokens by a penalty factor if the logit is positive and multiplies them by the penalty factor if the logit is negative, so the adjustment always lowers the score. This reduces the probability of any token that has already appeared in the generated text, regardless of how many times it appeared.

import numpy as np

def apply_repetition_penalty(logits, generated_token_ids, penalty=1.2):
    """
    Repetition penalty from Keskar et al. (CTRL, 2019).
    Reduces the logit of any token that has already been generated.
    
    penalty > 1.0: discourage repetition (typical: 1.1 to 1.5)
    penalty = 1.0: no effect
    penalty < 1.0: encourage repetition (rarely used)
    """
    penalized_logits = logits.copy()
    
    for token_id in set(generated_token_ids):
        if penalized_logits[token_id] > 0:
            penalized_logits[token_id] /= penalty
        else:
            penalized_logits[token_id] *= penalty
    
    return penalized_logits

# Example: the model wants to repeat "the" (token_id=1)
logits = np.array([8.2, 5.5, 2.8, 2.5, 2.0])
tokens = ["Paris", "the", "a", "located", "known"]
generated = [1, 1, 1]  # "the" has appeared 3 times

print("Repetition Penalty Effect")
print(f"{'Token':<10} {'Original':>10} {'Penalized':>10} {'Change':>10}")
print("-" * 42)

penalized = apply_repetition_penalty(logits, generated, penalty=1.2)
for i, token in enumerate(tokens):
    change = penalized[i] - logits[i]
    marker = " *" if i in set(generated) else ""
    print(f"{token:<10} {logits[i]:>10.2f} {penalized[i]:>10.2f} {change:>+10.2f}{marker}")

print("\n* = token appeared in generated text")
print("Note: only 'the' is penalized, regardless of how many times it appeared.")

The repetition penalty is a binary penalty: a token is either penalized or not, based on whether it has appeared at all. It does not distinguish between a token that appeared once and a token that appeared ten times. This sets it apart from the frequency penalty, which grows with the count. The presence penalty is also binary, but it subtracts a fixed amount from the logit instead of rescaling it.

Frequency Penalty

The frequency penalty (used by OpenAI’s API) applies a penalty proportional to how many times each token has appeared in the generated text. The more often a token has been used, the stronger the penalty. The formula modifies the logits directly:

logit[j] = logit[j] - count[j] * frequency_penalty

Where count[j] is the number of times token j has appeared in the generated text so far, and frequency_penalty is a value between -2.0 and 2.0 (positive values discourage repetition).

import numpy as np

def apply_frequency_penalty(logits, token_counts, penalty=0.5):
    """
    Frequency penalty: penalize tokens proportional to their count.
    
    penalty > 0: discourage repetition (typical: 0.1 to 1.0)
    penalty = 0: no effect
    penalty < 0: encourage repetition
    """
    penalized_logits = logits.copy()
    for token_id, count in token_counts.items():
        penalized_logits[token_id] -= count * penalty
    return penalized_logits

# Example: "the" appeared 5 times, "is" appeared 2 times
logits = np.array([8.2, 5.5, 4.0, 2.8, 2.5])
tokens = ["Paris", "the", "is", "a", "located"]
counts = {1: 5, 2: 2}  # "the": 5 times, "is": 2 times

print("Frequency Penalty Effect (penalty=0.5)")
print(f"{'Token':<10} {'Count':>6} {'Original':>10} {'Penalized':>10} {'Change':>10}")
print("-" * 48)

penalized = apply_frequency_penalty(logits, counts, penalty=0.5)
for i, token in enumerate(tokens):
    count = counts.get(i, 0)
    change = penalized[i] - logits[i]
    print(f"{token:<10} {count:>6} {logits[i]:>10.2f} {penalized[i]:>10.2f} {change:>+10.2f}")

print("\n'the' (5 occurrences) gets a much larger penalty than 'is' (2 occurrences).")

Presence Penalty

The presence penalty (also used by OpenAI’s API) applies a flat penalty to any token that has appeared at all, regardless of how many times. It is a binary signal: “has this token been used before?” The formula is:

logit[j] = logit[j] - (1 if count[j] > 0 else 0) * presence_penalty

import numpy as np

def apply_presence_penalty(logits, token_counts, penalty=0.6):
    """
    Presence penalty: flat penalty for any token that has appeared.
    
    penalty > 0: discourage reuse of any previously used token
    penalty = 0: no effect
    penalty < 0: encourage reuse
    """
    penalized_logits = logits.copy()
    for token_id, count in token_counts.items():
        if count > 0:
            penalized_logits[token_id] -= penalty
    return penalized_logits

# Same example
logits = np.array([8.2, 5.5, 4.0, 2.8, 2.5])
tokens = ["Paris", "the", "is", "a", "located"]
counts = {1: 5, 2: 2}

print("Presence Penalty Effect (penalty=0.6)")
print(f"{'Token':<10} {'Count':>6} {'Original':>10} {'Penalized':>10} {'Change':>10}")
print("-" * 48)

penalized = apply_presence_penalty(logits, counts, penalty=0.6)
for i, token in enumerate(tokens):
    count = counts.get(i, 0)
    change = penalized[i] - logits[i]
    print(f"{token:<10} {count:>6} {logits[i]:>10.2f} {penalized[i]:>10.2f} {change:>+10.2f}")

print("\nBoth 'the' and 'is' get the SAME penalty, regardless of count.")
print("Presence penalty encourages topic diversity (new words).")
print("Frequency penalty discourages verbatim repetition (same words).")

When to Use Each Penalty

Penalty Type	Effect	Best For
Repetition penalty (1.1-1.3)	Discourages any repeated token	General-purpose anti-repetition
Frequency penalty (0.1-0.8)	Stronger penalty for more-repeated tokens	Preventing verbatim repetition
Presence penalty (0.1-0.6)	Flat penalty for any used token	Encouraging topic diversity
Combined frequency + presence	Both effects together	Maximum variety

In practice, the frequency penalty is a common first choice for suppressing repetitive output, because it targets exactly the tokens that keep recurring. The presence penalty is more useful when you want the model to explore new topics rather than circling back to the same ideas.
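As a minimal sketch of the combined row in the table above, both additive penalties can be applied in a single vectorized pass. The function name apply_combined_penalties and the example values are illustrative assumptions, not part of any API; the arithmetic simply stacks the two formulas given earlier.

```python
import numpy as np

def apply_combined_penalties(logits, token_counts,
                             frequency_penalty=0.5, presence_penalty=0.6):
    """Hypothetical helper: subtract count * frequency_penalty, plus a flat
    presence_penalty for every token whose count is above zero."""
    counts = np.zeros_like(logits, dtype=float)
    for token_id, count in token_counts.items():
        counts[token_id] = count
    return logits - counts * frequency_penalty - (counts > 0) * presence_penalty

logits = np.array([8.2, 5.5, 4.0, 2.8, 2.5])
tokens = ["Paris", "the", "is", "a", "located"]
counts = {1: 5, 2: 2}  # "the": 5 times, "is": 2 times

combined = apply_combined_penalties(logits, counts)
for token, before, after in zip(tokens, logits, combined):
    print(f"{token:<10} {before:>6.2f} -> {after:>6.2f}")
```

With these particular values, "the" loses 2.5 from the frequency term and 0.6 from the presence term, while "is" loses 1.0 and 0.6. Tokens that never appeared are untouched.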

Source: Keskar et al., “CTRL: A Conditional Transformer Language Model for Controllable Generation,” arXiv:1909.05858, September 2019. Salesforce Research. Introduced the repetition penalty for autoregressive generation. OpenAI API documentation describes frequency_penalty and presence_penalty as additive modifications to logits, with values between -2.0 and 2.0.

Stop Tokens: How the Model Knows When to Stop

A language model, left to its own devices, would generate tokens forever. It has no built-in concept of “I am done talking.” The mechanism that tells the model to stop is the stop token (also called the end-of-sequence token or EOS token): a special token in the vocabulary that signals “generation is complete.”

How Stop Tokens Work

During training, every sequence in the training data ends with a special EOS token. The model learns that this token should appear at the end of a coherent response. At inference time, when the model generates the EOS token, the generation loop stops and the response is returned to the user.

Different models use different stop tokens:

Model Family	Stop Token(s)	Token ID(s)
GPT-2	<|endoftext|>	50256
GPT-3.5/GPT-4 (cl100k_base)	<|endoftext|>	100257
GPT-4o (o200k_base)	<|endoftext|>	199999
LLaMA 3	<|end_of_text|>, <|eot_id|>	128001, 128009

LLaMA 3 uses two different stop tokens for different purposes. <|end_of_text|> (token ID 128001) signals the absolute end of text generation, used by base models. <|eot_id|> (token ID 128009) signals the end of a conversational turn, used by instruct-tuned models in multi-turn conversations. The instruct model generates <|eot_id|> at the end of its response, which tells the system to stop generation and wait for the next user message.

# How stop tokens work in the generation loop
def generate_with_stop_tokens(model, prompt_tokens, stop_token_ids, max_tokens=4096):
    """
    Generate tokens until a stop token is produced or max length is reached.
    """
    sequence = list(prompt_tokens)
    generated_tokens = []
    
    for step in range(max_tokens):
        logits = model.forward(sequence)
        next_token = sample_token(logits)  # Using temperature, top-p, etc.
        
        # Check if this is a stop token
        if next_token in stop_token_ids:
            break  # Stop generation
        
        generated_tokens.append(next_token)
        sequence.append(next_token)
    
    return generated_tokens

# LLaMA 3 instruct model stop tokens
llama3_stop_tokens = {
    128001,  # <|end_of_text|>  (absolute end)
    128009,  # <|eot_id|>       (end of turn)
}

Custom Stop Sequences

Beyond the built-in EOS token, most APIs allow you to specify custom stop sequences: strings that, when generated, cause the model to stop. This is useful for structured output where you want the model to stop at a specific delimiter.

# OpenAI API: custom stop sequences
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "List three fruits:"}],
    stop=["\n\n", "4."]  # Stop at double newline or "4."
)
# Generation halts at the first double newline or just before "4.",
# whichever comes first; the matched stop string is not included in the output

Custom stop sequences work by checking the generated text after each token. If the text ends with any of the specified stop sequences, generation halts. This is a string-level check, not a token-level check, so it works even when the stop sequence spans multiple tokens.
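The string-level check can be sketched without a real model. In the example below, the list of token pieces is a hypothetical decoding in which the stop string "4." arrives split across two tokens; generate_until_stop is an illustrative name, not an API function. The sketch searches the accumulated text rather than only its tail, so it also catches a stop string that ends partway through a token.

```python
def generate_until_stop(token_pieces, stop_sequences):
    """Consume decoded token pieces one at a time; halt once the accumulated
    text contains a stop sequence, and return the text before the match."""
    text = ""
    for piece in token_pieces:
        text += piece
        for seq in stop_sequences:
            cut = text.find(seq)
            if cut != -1:
                return text[:cut], seq
    return text, None

# Hypothetical token pieces: "4." is produced as "4" followed by "."
pieces = ["1.", " Apple", "\n", "2.", " Pear", "\n",
          "3.", " Plum", "\n", "4", ".", " Kiwi"]

text, matched = generate_until_stop(pieces, stop_sequences=["\n\n", "4."])
print(repr(text))
print(repr(matched))
```

A token-level check for a single "4." token would have missed this case, because no single piece equals the stop string.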

Maximum Token Limits

The final safety net is the maximum token limit (max_tokens in most APIs). This sets an absolute upper bound on how many tokens the model can generate, regardless of whether it produces a stop token. If the model reaches this limit without generating a stop token, generation is truncated.

This is important because:

Cost control: Without a limit, a model could generate thousands of tokens on a simple question, and you would be billed for all of them.
Runaway generation: If the model fails to generate a stop token (which can happen, especially with fine-tuned models where the EOS token was not properly handled during training), the max token limit prevents infinite generation.
Latency control: In real-time applications, you may want to cap response length to ensure fast responses.

# The three stopping conditions, in order of priority
def should_stop(next_token, generated_text, num_generated_tokens, config):
    """Check all stopping conditions."""
    # 1. Built-in stop token
    if next_token in config.stop_token_ids:
        return True, "stop_token"
    
    # 2. Custom stop sequence (string-level check on the decoded text)
    for seq in config.stop_sequences:
        if generated_text.endswith(seq):
            return True, "stop_sequence"
    
    # 3. Maximum token limit (counted in tokens, not characters)
    if num_generated_tokens >= config.max_tokens:
        return True, "max_tokens"
    
    return False, None

Source: Meta, LLaMA 3 prompt format documentation (github.com/meta-llama/llama-models). LLaMA 3 uses <|end_of_text|> (ID 128001) for absolute end and <|eot_id|> (ID 128009) for end of conversational turn. OpenAI API documentation describes the stop parameter for custom stop sequences. OpenAI special token IDs: GPT-2 <|endoftext|> at ID 50256; cl100k_base <|endoftext|> at ID 100257; o200k_base <|endoftext|> at ID 199999 (Salman Quazi, “The Grammar of LLM Special Tokens,” salmanq.com, February 24, 2026; OpenAI community forums, tiktoken special tokens discussion).

Beam Search: An Alternative to Sampling

Before sampling methods like top-p became dominant, beam search was the standard decoding strategy for neural text generation, especially in machine translation. Beam search does not sample randomly. Instead, it maintains multiple candidate sequences (called “beams”) in parallel and selects the one with the highest overall probability.

Here is how beam search works with a beam width of 3:

Start with the prompt.
Generate the top 3 most likely first tokens. You now have 3 candidate sequences.
For each candidate, generate the top 3 most likely next tokens. You now have 9 candidates (3 sequences times 3 extensions each).
Keep only the top 3 candidates by total sequence probability.
Repeat the previous two steps (expand every beam, then prune back to the top 3) until all beams produce a stop token or reach the maximum length.
Return the beam with the highest total probability.

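import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logits array."""
    exp_l = np.exp(logits - np.max(logits))
    return exp_l / exp_l.sum()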

def beam_search(model, prompt_tokens, beam_width=3, max_length=50):
    """
    Simplified beam search: maintain top-k candidate sequences.
    """
    # Each beam: (sequence, cumulative_log_prob)
    beams = [(list(prompt_tokens), 0.0)]
    completed = []
    
    for step in range(max_length):
        all_candidates = []
        
        for sequence, score in beams:
            logits = model.forward(sequence)
            log_probs = np.log(softmax(logits))
            
            # Expand each beam with top-k tokens
            top_indices = np.argsort(log_probs)[-beam_width:]
            for idx in top_indices:
                new_seq = sequence + [idx]
                new_score = score + log_probs[idx]
                
                if idx == model.eos_token_id:
                    completed.append((new_seq, new_score))
                else:
                    all_candidates.append((new_seq, new_score))
        
        # Keep only top beam_width candidates
        all_candidates.sort(key=lambda x: x[1], reverse=True)
        beams = all_candidates[:beam_width]
        
        if not beams:
            break
    
    # Return the best completed sequence (or best beam if none completed)
    all_results = completed + beams
    all_results.sort(key=lambda x: x[1], reverse=True)
    return all_results[0][0]
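To see why keeping several beams can matter, here is a small self-contained sketch. The two-step "model" is a hypothetical lookup table of next-token probabilities (NEXT_PROBS is invented for illustration), chosen so that the most likely first token does not lead to the most likely full sequence.

```python
import math

# Hypothetical two-step "model": next-token probabilities keyed by prefix.
NEXT_PROBS = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def greedy(steps=2):
    """Always extend the single sequence with its most likely next token."""
    seq, score = (), 0.0
    for _ in range(steps):
        token, p = max(NEXT_PROBS[seq].items(), key=lambda kv: kv[1])
        seq, score = seq + (token,), score + math.log(p)
    return seq, score

def beam(beam_width=2, steps=2):
    """Expand every beam, then keep the beam_width best by total log-prob."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + (token,), score + math.log(p))
            for seq, score in beams
            for token, p in NEXT_PROBS[seq].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

greedy_seq, greedy_score = greedy()
beam_seq, beam_score = beam()
print("greedy:", greedy_seq, round(math.exp(greedy_score), 2))
print("beam:  ", beam_seq, round(math.exp(beam_score), 2))
```

Greedy decoding commits to "A" because it is the likelier first token, while the beam keeps "B" alive long enough to discover that "B x" has the higher total probability. Scores are summed in log space, matching the cumulative_log_prob convention in the listing above.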

Why Sampling Won Over Beam Search

Beam search produces high-probability sequences, but for open-ended text generation, this is actually a problem. The highest-probability sequence is often repetitive, generic, and boring. Holtzman et al. (2020) showed that human-written text does not follow the highest-probability path; humans make surprising, creative choices that beam search would never select.

Beam search remains useful for tasks where the range of acceptable outputs is narrow:

Machine translation (a faithful translation leaves little room for variation)
Summarization (the summary should be accurate, not creative)
Structured data generation (JSON, SQL, etc.)

But for conversational AI, creative writing, and general-purpose language models, sampling with temperature and top-p produces more natural, engaging text.

Tracing a Real Generation: Step by Step

Let us trace through the generation of a short response to see all of these concepts working together. We will use realistic (simulated) probabilities to show what happens at each step.

Prompt: “Explain gravity in one sentence.”

Settings: temperature=0.7, top_p=0.9, frequency_penalty=0.3

import numpy as np

np.random.seed(42)  # For reproducibility

def softmax(logits):
    exp_l = np.exp(logits - np.max(logits))
    return exp_l / exp_l.sum()

def generate_step(logits, temperature, top_p, token_counts, freq_penalty):
    """One step of the generation pipeline."""
    # 1. Apply frequency penalty
    penalized = logits.copy()
    for tid, count in token_counts.items():
        penalized[tid] -= count * freq_penalty
    
    # 2. Apply temperature
    scaled = penalized / temperature
    
    # 3. Softmax
    probs = softmax(scaled)
    
    # 4. Top-p filtering
    sorted_idx = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_idx]
    cumulative = np.cumsum(sorted_probs)
    cutoff = np.searchsorted(cumulative, top_p) + 1
    nucleus_idx = sorted_idx[:cutoff]
    
    # 5. Renormalize
    nucleus_probs = probs[nucleus_idx]
    nucleus_probs = nucleus_probs / nucleus_probs.sum()
    
    # 6. Sample
    chosen = np.random.choice(nucleus_idx, p=nucleus_probs)
    return chosen, probs, cutoff

# Simulated vocabulary (small for illustration)
vocab = {
    0: "Gravity",   1: "is",      2: "the",     3: "force",
    4: "that",      5: "pulls",   6: "attracts", 7: "objects",
    8: "toward",    9: "each",   10: "other",   11: ".",
    12: "mass",    13: "between", 14: "with",   15: "all",
    16: "a",       17: "which",  18: "bodies",  19: "together",
}

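# Simulated logits at each step (designed to produce a coherent sentence).
# The rows are fixed in advance, so the "After ..." comments describe the
# most likely path, not necessarily the tokens that were actually sampled.
# Each row: logits for the 20 tokens above, at that generation step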
step_logits = [
    # Step 1: After prompt, model starts response
    [9.0, 1.0, 2.0, 1.5, 0.5, 0.3, 0.2, 0.1, -1.0, -1.5,
     -2.0, -3.0, 0.8, -1.0, -2.0, 0.5, 1.0, 0.3, -1.0, -2.0],
    # Step 2: After "Gravity"
    [0.0, 8.5, 0.5, 0.3, 0.2, 0.1, 0.0, -1.0, -2.0, -1.5,
     -2.0, -3.0, 0.5, -1.0, -2.0, 0.3, 0.8, 0.2, -1.0, -2.0],
    # Step 3: After "Gravity is"
    [0.0, 0.5, 7.5, 1.0, 0.3, 0.2, 0.1, -1.0, -2.0, -1.5,
     -2.0, -3.0, 0.5, -1.0, -2.0, 0.3, 3.5, 0.2, -1.0, -2.0],
    # Step 4: After "Gravity is the"
    [0.0, 0.5, 0.3, 8.0, 0.2, 0.1, 0.0, -1.0, -2.0, -1.5,
     -2.0, -3.0, 0.5, -1.0, -2.0, 0.3, 0.2, 0.2, -1.0, -2.0],
    # Step 5: After "Gravity is the force"
    [0.0, 0.5, 0.3, 0.2, 7.5, 0.1, 0.0, -1.0, -2.0, -1.5,
     -2.0, -3.0, 0.5, -1.0, -2.0, 0.3, 0.2, 3.0, -1.0, -2.0],
    # Step 6: After "Gravity is the force that"
    [0.0, 0.5, 0.3, 0.2, 0.1, 5.5, 6.0, -1.0, -2.0, -1.5,
     -2.0, -3.0, 0.5, -1.0, -2.0, 0.3, 0.2, 0.2, -1.0, -2.0],
    # Step 7: After "Gravity is the force that attracts"
    [0.0, 0.5, 0.3, 0.2, 0.1, 0.0, 0.0, 7.0, -2.0, -1.5,
     -2.0, -3.0, 1.5, -1.0, -2.0, 1.0, 0.2, 0.2, 3.0, -2.0],
    # Step 8: After "Gravity is the force that attracts objects"
    [0.0, 0.5, 0.3, 0.2, 0.1, 0.0, 0.0, 0.0, 5.0, -1.5,
     -2.0, -3.0, 0.5, 4.0, 6.5, 0.3, 0.2, 0.2, -1.0, 2.0],
    # Step 9: After "Gravity is the force that attracts objects with"
    [0.0, 0.5, 0.3, 0.2, 0.1, 0.0, 0.0, 0.0, -2.0, -1.5,
     -2.0, -3.0, 7.5, -1.0, -2.0, 0.3, 0.2, 0.2, -1.0, -2.0],
    # Step 10: After "Gravity is the force that attracts objects with mass"
    [0.0, 0.5, 0.3, 0.2, 0.1, 0.0, 0.0, 0.0, 3.0, -1.5,
     -2.0, -3.0, 0.5, -1.0, -2.0, 0.3, 0.2, 0.2, -1.0, 4.5],
    # Step 11: After "... mass together"
    [0.0, 0.5, 0.3, 0.2, 0.1, 0.0, 0.0, 0.0, -2.0, -1.5,
     -2.0, 8.0, 0.5, -1.0, -2.0, 0.3, 0.2, 0.2, -1.0, -2.0],
]

# Run generation
print("Tracing Token-by-Token Generation")
print(f"Prompt: 'Explain gravity in one sentence.'")
print(f"Settings: temperature=0.7, top_p=0.9, frequency_penalty=0.3")
print(f"\n{'Step':>4} {'Token':>12} {'Prob':>8} {'Nucleus':>8} {'Generated so far'}")
print("-" * 70)

generated = []
token_counts = {}

for step, logits in enumerate(step_logits):
    logits = np.array(logits, dtype=float)
    chosen, probs, nucleus_size = generate_step(
        logits, temperature=0.7, top_p=0.9,
        token_counts=token_counts, freq_penalty=0.3
    )
    
    token_text = vocab[chosen]
    generated.append(token_text)
    token_counts[chosen] = token_counts.get(chosen, 0) + 1
    
    text_so_far = " ".join(generated)
    print(f"{step+1:>4} {token_text:>12} {probs[chosen]:>8.2%} {nucleus_size:>8} {text_so_far}")
    
    if chosen == 11:  # Period = end of sentence
        break

print(f"\nFinal output: {' '.join(generated)}")
print(f"Total tokens generated: {len(generated)}")
print(f"Unique tokens: {len(set(generated))}")

This trace shows several important dynamics:

High-confidence steps (like “Gravity” at the start, or “.” at the end) have small nuclei because the model is very sure about what comes next.
Ambiguous steps (like choosing between “pulls” and “attracts”) have larger nuclei because multiple tokens are plausible.
Frequency penalty gradually reduces the probability of reusing tokens, encouraging the model to use diverse vocabulary.
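Temperature 0.7 keeps the distribution relatively focused while still allowing some variation. At step 6, for example, “pulls” retains roughly a one-in-three chance against “attracts” given these simulated logits, and raising the temperature to 1.0 would narrow the gap further.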
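The effect of temperature on the ambiguous step can be checked directly. The sketch below reuses the step 6 logits from the trace and ignores the frequency penalty, on the assumption that neither “pulls” nor “attracts” has been generated before that step, so the penalty would not touch them.

```python
import numpy as np

def softmax(logits):
    exp_l = np.exp(logits - np.max(logits))
    return exp_l / exp_l.sum()

# Step 6 logits from the trace; index 5 is "pulls", index 6 is "attracts".
step6 = np.array([0.0, 0.5, 0.3, 0.2, 0.1, 5.5, 6.0, -1.0, -2.0, -1.5,
                  -2.0, -3.0, 0.5, -1.0, -2.0, 0.3, 0.2, 0.2, -1.0, -2.0])

for temperature in (0.3, 0.7, 1.0, 1.5):
    probs = softmax(step6 / temperature)
    print(f"T={temperature}: pulls={probs[5]:.3f}, attracts={probs[6]:.3f}")

# Values used for comparison: probability of "pulls" at two temperatures
p_pulls_07 = softmax(step6 / 0.7)[5]
p_pulls_10 = softmax(step6 / 1.0)[5]
```

Lower temperatures widen the gap between the two candidates and higher temperatures shrink it, while also shifting a little probability toward the long tail of unlikely tokens.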

Reasoning Models and the Death of Temperature

An important development in 2025 has changed the relationship between users and sampling parameters. When OpenAI released GPT-5 on August 7, 2025, developers discovered that the model no longer accepts custom temperature or top-p values. The API only supports temperature=1 for GPT-5 and its variants (GPT-5 mini, GPT-5 nano). Attempting to set any other value returns an error.

This was not a bug. OpenAI intentionally removed these controls for reasoning models, and the same restriction applies to o3, o3-mini, and o4-mini. A commonly cited explanation (see the sources below) is that reasoning models rely on an internal generation process that explores multiple candidate reasoning paths, checks them, and settles on one (as described in Chapter 16). Under that account, allowing users to set temperature=0 would collapse the exploration into a single greedy path, defeating its purpose, and adjusting top-p could destabilize a sampling configuration the model was tuned around.

Instead of temperature and top-p, reasoning models offer alternative controls:

reasoning_effort (OpenAI): Controls how much thinking the model does (minimal, low, medium, high), which affects the depth and breadth of internal reasoning. The default is medium.
Adaptive thinking effort (Anthropic): Claude 3.7 Sonnet (February 2025) introduced extended thinking with a budget_tokens parameter that set the maximum number of thinking tokens (minimum 1,024, maximum 128,000). Claude Opus 4.6 (February 5, 2026) replaced this with adaptive thinking, where the model dynamically decides when and how much to think based on an effort level: low, medium, high (default), or max. The manual budget_tokens configuration still works on Claude 4.6 but is deprecated and will be removed in a future release.

Non-reasoning models like GPT-4o and GPT-4.1 (released April 14, 2025) continue to support temperature and top-p. The split reflects a fundamental architectural difference: traditional autoregressive models expose the raw sampling process to users, while reasoning models encapsulate it behind higher-level controls.
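For client code that has to talk to both families, one option is to filter the parameters before building a request. The sketch below is a hypothetical helper, not part of any SDK: build_request_params and the REASONING_MODELS set are assumptions for illustration, and the model identifiers are guesses at API-style spellings of the names mentioned above. It makes no network call.

```python
# Illustrative, non-exhaustive set based on the models named in the text.
REASONING_MODELS = {"gpt-5", "gpt-5-mini", "gpt-5-nano", "o3", "o3-mini", "o4-mini"}

def build_request_params(model, temperature=None, top_p=None, reasoning_effort=None):
    """Hypothetical helper: keep only the controls that the given model
    family is described as accepting."""
    params = {"model": model}
    if model in REASONING_MODELS:
        # Sampling knobs are dropped; effort is the exposed control instead.
        if reasoning_effort is not None:
            params["reasoning_effort"] = reasoning_effort
    else:
        if temperature is not None:
            params["temperature"] = temperature
        if top_p is not None:
            params["top_p"] = top_p
    return params

reasoning_params = build_request_params("gpt-5", temperature=0.2, reasoning_effort="low")
classic_params = build_request_params("gpt-4.1", temperature=0.2, top_p=0.9)
print(reasoning_params)
print(classic_params)
```

Silently dropping a parameter is a design choice; raising an error instead would surface the mismatch earlier, at the cost of making shared configuration harder to reuse across models.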

Source: OpenAI community forums, “GPT-5 removed parameters: logprob, Top-p, temperature” and “Temperature in GPT-5 models,” June-August 2025. Shion Honda, “Why You Can’t Set Temperature on GPT-5/o3,” September 20, 2025 (hippocampus-garden.com). GPT-4.1 released April 14, 2025 (openai.com/index/introducing-gpt-4-1). Anthropic, “Building with extended thinking” (docs.claude.com). Claude 3.7 Sonnet introduced budget_tokens (min 1,024, max 128,000) on February 24, 2025 (simonwillison.net). Claude Opus 4.6 released February 5, 2026, introduced adaptive thinking with effort levels (low, medium, high, max), deprecating manual budget_tokens (anthropic.com/research/claude-opus-4-6; laravel-news.com; digitalapplied.com).

The Full Generation Pipeline

Let us put everything together into a complete picture of what happens when you send a prompt to a language model and receive a response.

import numpy as np

def full_generation_pipeline(
    model,
    prompt_text,
    tokenizer,
    temperature=0.7,
    top_p=0.9,
    top_k=None,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    repetition_penalty=1.0,
    max_tokens=4096,
    stop_sequences=None,
):
    """
    Complete token-by-token generation pipeline.
    This is a simplified version of what runs inside every LLM API.
    """
    # Step 1: Tokenize the prompt
    prompt_tokens = tokenizer.encode(prompt_text)
    sequence = list(prompt_tokens)
    generated_tokens = []
    token_counts = {}
    
    for step in range(max_tokens):
        # Step 2: Forward pass through the model
        # (In practice, this uses the KV cache from Chapter 18)
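        # Copy into a float array so the in-place penalties below
        # never modify the model's own output buffer
        logits = np.array(model.forward(sequence), dtype=float)  # Shape: (vocab_size,)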
        
        # Step 3: Apply repetition penalty (Keskar et al.)
        if repetition_penalty != 1.0:
            for tid in set(generated_tokens):
                if logits[tid] > 0:
                    logits[tid] /= repetition_penalty
                else:
                    logits[tid] *= repetition_penalty
        
        # Step 4: Apply frequency and presence penalties
        for tid, count in token_counts.items():
            logits[tid] -= count * frequency_penalty
            if count > 0:
                logits[tid] -= presence_penalty
        
        # Step 5: Apply temperature
        if temperature != 1.0 and temperature > 0:
            logits = logits / temperature
        
        # Step 6: Convert to probabilities
        probs = np.exp(logits - np.max(logits))
        probs = probs / probs.sum()
        
        # Step 7: Top-k filtering
        if top_k is not None and top_k < len(probs):
            top_k_idx = np.argsort(probs)[-top_k:]
            mask = np.zeros_like(probs)
            mask[top_k_idx] = 1
            probs = probs * mask
            probs = probs / probs.sum()
        
        # Step 8: Top-p (nucleus) filtering
        if top_p < 1.0:
            sorted_idx = np.argsort(probs)[::-1]
            cumulative = np.cumsum(probs[sorted_idx])
            cutoff = np.searchsorted(cumulative, top_p) + 1
            keep = sorted_idx[:cutoff]
            mask = np.zeros_like(probs)
            mask[keep] = 1
            probs = probs * mask
            probs = probs / probs.sum()
        
        # Step 9: Sample
        if temperature == 0:
            next_token = np.argmax(probs)
        else:
            next_token = np.random.choice(len(probs), p=probs)
        
        # Step 10: Check stop conditions
        if next_token in model.stop_token_ids:
            break
        
        generated_tokens.append(next_token)
        sequence.append(next_token)
        token_counts[next_token] = token_counts.get(next_token, 0) + 1
        
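        # Check custom stop sequences (string-level, so a match may span tokens).
        # APIs such as OpenAI's exclude the matched stop sequence from the output,
        # so return only the text that precedes it.
        generated_text = tokenizer.decode(generated_tokens)
        for s in stop_sequences or []:
            cut = generated_text.find(s)
            if cut != -1:
                return generated_text[:cut]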
    
    return tokenizer.decode(generated_tokens)

This pipeline runs for every single token in every response you have ever received from a language model. A 1,000-token response means this loop executed 1,000 times, each time running the full transformer forward pass, applying penalties, filtering candidates, and sampling one token.

The computational cost is dominated by the forward pass (Step 2), which involves matrix multiplications through dozens of transformer layers. The sampling steps (Steps 3 through 9) are comparatively cheap. This is why the KV cache (covered in Chapter 18) is so important: it avoids redundantly recomputing attention for all previous tokens at each step.
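A rough count makes the redundancy concrete. The sketch below tallies how many token positions are fed through the model over one generation, with and without a cache. The function name and the example sizes are assumptions for illustration, and the count deliberately ignores the fact that attention over a longer context still costs more per step.

```python
def positions_processed(prompt_len, new_tokens, use_cache):
    """Count token positions fed through the model over a whole generation.

    Without a cache, every step reprocesses the full sequence so far.
    With a cache, the prompt is processed once and each later step
    processes only the single newest token.
    """
    if new_tokens <= 0:
        return 0
    if use_cache:
        return prompt_len + (new_tokens - 1)
    return sum(prompt_len + step for step in range(new_tokens))

naive = positions_processed(prompt_len=100, new_tokens=1000, use_cache=False)
cached = positions_processed(prompt_len=100, new_tokens=1000, use_cache=True)
print(f"without cache: {naive:,} positions")
print(f"with cache:    {cached:,} positions")
```

For a 100-token prompt and a 1,000-token response, the naive loop processes several hundred thousand positions, while the cached loop processes about as many positions as there are tokens in the conversation.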

Key Takeaways

Autoregressive generation produces text one token at a time. Each token depends on all previous tokens, making the process inherently sequential. A 500-token response requires 500 forward passes through the model.
Logits are the raw, unnormalized scores the model produces for every token in the vocabulary (128,256 entries for LLaMA 3, approximately 200,000 for GPT-4o). The softmax function converts these into probabilities.
Greedy decoding always picks the highest-probability token. It is deterministic but produces repetitive, generic text. It is useful for code generation and structured output, but poor for open-ended text.
Temperature controls randomness by rescaling logits before softmax. Lower temperatures (0.1 to 0.3) make the model more deterministic; higher temperatures (0.8 to 1.5) make it more creative. Temperature 0 is equivalent to greedy decoding; temperature approaching infinity produces uniform random output.
Top-k sampling (Fan et al., ACL 2018) restricts choices to the k most likely tokens. It is simple but cannot adapt to varying levels of model confidence.
Top-p (nucleus) sampling (Holtzman et al., arXiv:1904.09751, ICLR 2020) dynamically selects the smallest set of tokens whose cumulative probability exceeds p. It adapts to model confidence: small nuclei when the model is sure, large nuclei when it is uncertain. This is the default sampling method for most APIs.
Min-p sampling (Nguyen et al., arXiv:2407.01082, ICLR 2025 Oral) sets a dynamic floor based on the top token’s probability. It handles high temperatures better than top-p and has been adopted by Hugging Face Transformers, vLLM, and other open-source frameworks. Top-nσ sampling (Tang et al., arXiv:2411.07641, ACL 2025) operates on pre-softmax logits using a statistical threshold (n standard deviations from the maximum), providing temperature-independent candidate selection.
Repetition penalty (Keskar et al., arXiv:1909.05858, 2019) reduces the logit of any previously generated token. Frequency penalty scales the reduction by how many times the token appeared. Presence penalty applies a flat reduction to any token that appeared at all. These prevent the model from getting stuck in repetitive loops.
Stop tokens (EOS tokens) signal the end of generation. LLaMA 3 uses <|end_of_text|> (ID 128001) for absolute end and <|eot_id|> (ID 128009) for end of conversational turn. Custom stop sequences and maximum token limits provide additional stopping controls.
Beam search maintains multiple candidate sequences and selects the highest-probability one. It is useful for translation and structured output but produces bland text for open-ended generation, which is why sampling methods have largely replaced it.
Reasoning models (GPT-5, o3, o3-mini, o4-mini) have disabled user-facing temperature and top-p controls; the usual explanation is that their internal reasoning process depends on carefully calibrated sampling. They offer alternative controls: OpenAI’s reasoning_effort (minimal, low, medium, high) and Anthropic’s adaptive thinking effort levels (low, medium, high, max on Claude Opus 4.6, released February 5, 2026, replacing the earlier budget_tokens parameter).
Gemini Diffusion (announced May 21, 2025 at Google I/O) represents a potential alternative to autoregressive generation, using diffusion to generate text blocks in parallel at a reported benchmark speed of 1,479 tokens per second for general tasks and up to 2,000 tokens per second on programming tasks (practical demos showed ~857 tokens per second). As of March 2026, it remains an experimental research demo.
Multi-token prediction (Gloeckle et al., arXiv:2404.19737, ICML 2024) trains models to predict several future tokens simultaneously, enabling 2 to 3x inference speedups. It has been adopted in production models including DeepSeek-V3 and Qwen 3.5. Speculative decoding uses a small draft model to propose candidate tokens that a larger model verifies in parallel, achieving similar speedups without retraining (covered in Chapter 24).

What’s Next

You now understand how language models generate text one token at a time, and how parameters like temperature, top-p, and repetition penalties shape the output. But there is a major efficiency problem hiding in this process: in the naive loop shown in this chapter, the model recomputes attention over the entire sequence at every step, including all the tokens it has already processed. In Chapter 18, we will explore the KV cache, the optimization that makes generation fast by caching intermediate computations, and why it is one of the most important practical considerations for deploying language models at scale.

Chapter 18. The KV Cache, Why Generation Is Fast