Chapter 16. Extended Thinking: Reasoning at Inference Time

The models you have learned about so far generate responses in a single pass: they read your prompt, process it through dozens of transformer layers, and produce output tokens one at a time until they are done. This works remarkably well for most tasks, but it has a fundamental limitation. The model spends the same amount of computation on each token it generates, whether you ask it “What is 2 + 2?” or “Prove that there are infinitely many prime numbers.” In September 2024, OpenAI released a model that changed this paradigm entirely. The o1 model introduced extended thinking: the ability to spend more time reasoning through difficult problems before answering, using additional “thinking tokens” that dramatically improve performance on complex tasks. This chapter explains how extended thinking works, why it matters, and how it has become the defining feature of frontier reasoning models in 2025 and 2026.

The Problem with Single-Pass Generation

To understand why extended thinking matters, consider how a standard language model approaches a math problem. When you ask GPT-4o to solve a problem from the American Invitational Mathematics Examination (AIME), a prestigious competition for top high school math students in the United States, the model generates its response in one continuous stream. It reads the problem, starts producing tokens, and hopes that the reasoning it generates along the way leads to the correct answer.

This approach has a critical flaw: the model cannot go back and reconsider. If it makes a wrong assumption in the third sentence, it is stuck with that assumption for the rest of the response. It cannot pause, reflect, try a different approach, or verify its work. The computation budget is fixed by the number of tokens in the response, not by the difficulty of the problem.

The results speak for themselves. On the AIME 2024 benchmark, GPT-4o solved approximately 12% of problems correctly. The o1 model, released just months later with extended thinking capabilities, achieved 83.3% accuracy on the same benchmark. That is not a small improvement; it is a transformation from “barely functional” to “competitive with top human students.”

Model	AIME 2024 Accuracy	Release Date
GPT-4o	~12%	May 2024
o1-preview	56.7%	September 12, 2024
o1 (full)	83.3%	December 5, 2024
o3	96.7%	April 16, 2025

Source: OpenAI, “Learning to reason with LLMs,” September 12, 2024 (openai.com/index/learning-to-reason-with-llms). OpenAI, “Introducing OpenAI o3 and o4-mini,” April 16, 2025 (openai.com/index/introducing-o3-and-o4-mini). The o1-preview scored 56.7% on AIME 2024; the full o1 model (released December 5, 2024) achieved 83.3% (rankedagi.com, turtlesai.com). GPT-4o’s ~12% from helicone.ai and cometapi.com.

The jump from 12% to 83% represents a qualitative shift in what language models can do. Even the o1-preview, at 56.7%, was a massive leap over GPT-4o’s 12%. The full o1 model pushed this further to 83.3%, and o3 reached 96.7%. Problems that were essentially impossible for GPT-4o became routine for reasoning models. And the key difference was not more parameters or more training data; it was giving the model time to think.

Chain-of-Thought: The Foundation

Extended thinking did not appear out of nowhere. It builds on a technique called chain-of-thought (CoT) prompting, introduced by Wei et al. at Google Research in January 2022 (arXiv:2201.11903, later presented at NeurIPS 2022). The core insight was simple but powerful: if you show a language model examples of step-by-step reasoning, it will produce step-by-step reasoning in its own responses, and this dramatically improves accuracy on complex tasks.

The original paper demonstrated this with PaLM 540B, Google’s largest model at the time. On the GSM8K benchmark of grade-school math word problems, standard prompting (showing the model input-output pairs without intermediate steps) achieved low accuracy. Chain-of-thought prompting (showing the model examples that included the reasoning steps) boosted PaLM 540B to approximately 58% accuracy on GSM8K, achieving state-of-the-art results that surpassed even fine-tuned GPT-3 with a verifier.

The technique was remarkably simple. Instead of prompting the model with:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. 
   Each can has 3 tennis balls. How many tennis balls does he have now?
A: 11

You prompt it with:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. 
   Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 
   2 * 3 = 6 tennis balls. 5 + 6 = 11. The answer is 11.

By including the reasoning steps in the examples, the model learns to generate similar reasoning steps in its own responses. This “showing your work” approach mirrors how humans solve complex problems: we break them down into smaller steps, verify each step, and build toward the final answer.
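The difference between the two prompts is easy to see when you build them programmatically. The sketch below is a minimal illustration: build_few_shot_prompt is a hypothetical helper, and the new question about the baker is invented for this example. No model is called; the code only constructs the prompt strings.

```python
# Exemplar taken from the Roger example above.
EXEMPLAR_QUESTION = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)
STANDARD_ANSWER = "11"
COT_ANSWER = (
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "2 * 3 = 6 tennis balls. 5 + 6 = 11. The answer is 11."
)


def build_few_shot_prompt(exemplars, new_question):
    """Join (question, answer) exemplars and leave the final answer open."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)


# Hypothetical new question, invented for this illustration.
new_question = "A baker has 3 trays of 12 rolls and sells 20. How many rolls are left?"

standard_prompt = build_few_shot_prompt([(EXEMPLAR_QUESTION, STANDARD_ANSWER)], new_question)
cot_prompt = build_few_shot_prompt([(EXEMPLAR_QUESTION, COT_ANSWER)], new_question)

print(cot_prompt)
```

The only change between the two prompts is the exemplar answer. The model is expected to imitate whichever answer style it sees.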

Zero-Shot Chain-of-Thought

Later in 2022, Kojima et al. discovered something even simpler (arXiv:2205.11916, NeurIPS 2022). You do not need to provide examples of chain-of-thought reasoning at all. Simply appending the phrase “Let’s think step by step” to a prompt causes the model to generate step-by-step reasoning on its own. This zero-shot chain-of-thought approach produced dramatic improvements across multiple benchmarks: on MultiArith, accuracy jumped from 17.7% to 78.7% with InstructGPT (text-davinci-002), and on GSM8K from 10.4% to 40.7%, with similar magnitudes of improvement on PaLM 540B. All of this without any task-specific examples.

This finding was profound. It suggested that the reasoning capability was already latent in large language models; it just needed to be elicited. The model “knew” how to reason step-by-step, but it would not do so unless prompted.
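A minimal sketch of zero-shot chain-of-thought follows. It assumes a two-stage flow: first elicit the reasoning, then ask for the final answer with a second trigger phrase. The answer-extraction wording and the helper names are assumptions for illustration, and fake_complete is a scripted stand-in for a real model call.

```python
REASONING_TRIGGER = "Let's think step by step."
# Assumed answer-extraction phrase; adjust it to the task's answer format.
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"


def zero_shot_cot(question, complete):
    """Two-stage zero-shot CoT. `complete` is any text-completion callable."""
    reasoning_prompt = f"Q: {question}\nA: {REASONING_TRIGGER}"
    reasoning = complete(reasoning_prompt)
    # Feed the model's own reasoning back in and ask for the final answer.
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{ANSWER_TRIGGER}"
    answer = complete(answer_prompt)
    return reasoning.strip(), answer.strip()


def fake_complete(prompt):
    """Hypothetical stand-in for a model, scripted for one question."""
    if prompt.endswith(ANSWER_TRIGGER):
        return " 11."
    return " Roger starts with 5 balls. 2 cans of 3 balls is 6. 5 + 6 = 11."


reasoning, answer = zero_shot_cot(
    "Roger has 5 tennis balls. He buys 2 more cans of 3. How many now?",
    fake_complete,
)
print("Reasoning:", reasoning)
print("Answer:", answer)
```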

Source: Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” arXiv:2201.11903, January 2022. NeurIPS 2022. Google Research. PaLM 540B achieved state-of-the-art accuracy on GSM8K with just eight chain-of-thought exemplars. Kojima et al., “Large Language Models are Zero-Shot Reasoners,” arXiv:2205.11916, May 2022. NeurIPS 2022. Zero-shot CoT improved MultiArith accuracy from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with InstructGPT (text-davinci-002), with similar improvements on PaLM 540B.

From Prompting to Training: The o1 Breakthrough

Chain-of-thought prompting was a major advance, but it had limitations. The reasoning was visible in the output, which meant users had to read through potentially lengthy explanations. The model could not easily revise or backtrack on its reasoning. And the quality of the reasoning depended heavily on the prompt and the examples provided.

OpenAI’s o1 model, released on September 12, 2024, took a fundamentally different approach. Instead of relying on prompting to elicit chain-of-thought reasoning, o1 was trained using reinforcement learning to generate an internal chain of thought before producing its final answer. This internal reasoning happens in what OpenAI calls “thinking tokens” or “reasoning tokens,” which are generated, processed, and then hidden from the user.

How o1 Works

When you send a prompt to o1, the model does not immediately start generating the response you see. Instead, it first generates a sequence of internal reasoning tokens. These tokens represent the model “thinking through” the problem: exploring different approaches, checking its work, reconsidering assumptions, and refining its answer. Only after this internal deliberation does the model produce the final response that you see.

The key innovation is that o1 was trained specifically to use this thinking time effectively. Through reinforcement learning, the model learned to:

Break down complex problems into smaller, manageable steps
Try different approaches when the first attempt does not work
Verify its reasoning by checking intermediate results
Recognize and correct mistakes before committing to a final answer
Allocate more thinking time to harder problems

This is fundamentally different from chain-of-thought prompting. With prompting, you are asking a model that was trained for single-pass generation to produce reasoning steps. With o1, reinforcement learning trained the model specifically to reason internally before answering.

The Hidden Chain of Thought

One notable aspect of o1 is that the internal reasoning tokens are not shown to users. You see a summary of the thinking process (e.g., “Thought for 47 seconds”), but not the actual tokens. OpenAI has stated that this is partly for competitive reasons (the reasoning process is a key differentiator) and partly because the raw thinking tokens can be messy, repetitive, or confusing to read.

This hidden reasoning has implications for interpretability and trust. Users cannot verify exactly how the model arrived at its answer, which is a tradeoff compared to visible chain-of-thought approaches. However, the dramatic performance improvements have made this tradeoff acceptable for many use cases.

Source: OpenAI, “Learning to reason with LLMs,” September 12, 2024 (openai.com/index/learning-to-reason-with-llms). The o1 model uses reinforcement learning to train a “private chain of thought” that improves reasoning on complex tasks.

Test-Time Compute Scaling: A New Paradigm

The o1 model introduced a new scaling paradigm that has become central to AI research: test-time compute scaling. Traditional scaling laws (covered in Chapter 13) focus on training-time compute: how model performance improves as you increase parameters, training data, and training compute. Test-time compute scaling asks a different question: how does performance improve as you give the model more compute at inference time?

The Snell et al. Paper

A foundational paper on this topic is “Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters” by Snell et al. at Google DeepMind (arXiv:2408.03314, August 2024). The paper systematically studied how to allocate additional compute at inference time and found striking results:

Test-time compute can substitute for model size: On problems where a smaller model achieves non-trivial success rates, allocating more test-time compute can allow it to outperform a 14x larger model using the same total FLOPs.
Compute-optimal allocation matters: The effectiveness of test-time compute depends heavily on how it is allocated. A “compute-optimal” strategy that adapts to problem difficulty can improve efficiency by more than 4x compared to naive approaches like best-of-N sampling.
Difficulty-dependent scaling: Easy problems benefit little from additional test-time compute (the model already gets them right). Hard problems benefit enormously. The optimal strategy is to allocate compute adaptively based on problem difficulty.

This research provided systematic empirical support for what o1 demonstrated in practice: spending more compute on harder problems at inference time is a powerful way to improve model capabilities, and it can be more efficient than simply training larger models.

import numpy as np

# Illustrating the test-time compute scaling concept.
# The curve is a toy saturating function chosen for illustration; it is
# motivated by, not fitted to, the findings in Snell et al. (arXiv:2408.03314).

def simulate_accuracy_vs_compute(base_accuracy, max_compute_multiplier=16, rate=0.3):
    """
    Simulate how accuracy improves with test-time compute.
    Returns (compute_multipliers, accuracies) for plotting.
    """
    # Powers of two from 1x up to max_compute_multiplier
    compute_multipliers = []
    mult = 1
    while mult <= max_compute_multiplier:
        compute_multipliers.append(mult)
        mult *= 2

    accuracies = []
    for mult in compute_multipliers:
        # Diminishing returns: accuracy approaches 1.0 asymptotically
        # (capped at 0.99), so each doubling of compute helps less
        improvement = (1 - base_accuracy) * (1 - np.exp(-rate * (mult - 1)))
        accuracies.append(min(base_accuracy + improvement, 0.99))

    return compute_multipliers, accuracies

# Example: a model with 40% base accuracy on hard math problems
base_acc = 0.40
multipliers, accs = simulate_accuracy_vs_compute(base_acc)

print("Test-Time Compute Scaling (Illustrative)")
print(f"Base accuracy: {base_acc:.0%}")
print(f"\n{'Compute':>10} {'Accuracy':>10} {'Improvement':>12}")
print("-" * 35)
for mult, acc in zip(multipliers, accs):
    improvement = (acc - base_acc) / base_acc * 100
    print(f"{mult:>10}x {acc:>10.1%} {improvement:>+11.1f}%")

print(f"\nIn this toy model, {multipliers[-1]}x more test-time compute raises accuracy")
print(f"from {base_acc:.0%} to {accs[-1]:.0%}, without changing the model at all.")

Source: Snell et al., “Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters,” arXiv:2408.03314, August 2024. Google DeepMind. The paper found that test-time compute can allow a smaller model to outperform a 14x larger model on problems where the smaller model has non-trivial success rates.
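The difficulty-dependent finding can be made concrete with a small calculation. The sketch below assumes an idealized best-of-N setup: samples are independent, and a perfect verifier always picks a correct sample when one exists. Under those assumptions the success rate is 1 - (1 - p)^N. Real verifiers are imperfect, so treat this as an upper bound. The three difficulty levels are invented for illustration.

```python
def best_of_n_success(p_single, n):
    """P(at least one of n independent samples is correct) = 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p_single) ** n


# Hypothetical single-sample success rates for three difficulty levels.
difficulties = {"easy": 0.90, "medium": 0.40, "hard": 0.05}
sample_counts = (1, 4, 16, 64)

for name, p in difficulties.items():
    row = "  ".join(f"N={n}: {best_of_n_success(p, n):.2f}" for n in sample_counts)
    print(f"{name:<7} {row}")

# Absolute gain from 16 samples instead of 1, per difficulty level.
gains = {name: best_of_n_success(p, 16) - p for name, p in difficulties.items()}
print(gains)
```

The easy problems have almost no headroom, while the hard ones gain the most from extra samples, which is why a fixed budget is better spent adaptively.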

Process Reward Models: Verifying Each Step

A key enabler of effective test-time compute scaling is the process reward model (PRM), which evaluates the quality of each reasoning step rather than just the final answer. This concept was formalized by Lightman et al. at OpenAI in “Let’s Verify Step by Step” (arXiv:2305.20050, May 2023).

Outcome vs. Process Supervision

Traditional reward models (as described in Chapter 15) use outcome supervision: they score the final answer as correct or incorrect. This works well for simple tasks, but it has a problem for complex reasoning: a model might get the right answer through flawed reasoning (lucky guess) or the wrong answer despite mostly correct reasoning (one small error).

Process supervision addresses this by providing feedback on each intermediate step. A process reward model is trained to evaluate whether each step in a chain of reasoning is correct, not just whether the final answer is right.

Lightman et al. found that process supervision significantly outperforms outcome supervision for training models to solve math problems. Their process-supervised model solved 78.2% of problems from a representative subset of the MATH test set, ahead of the same setup trained with outcome supervision alone.
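The contrast between the two kinds of supervision can be sketched in a few lines. In the example below, both solutions reach the correct final answer, so an outcome score cannot tell them apart, while a process score exposes the flawed middle step. The per-step probabilities are invented, and aggregating them by taking their product (the probability that every step is correct, assuming independence) is one common choice that should be treated as an assumption here.

```python
import math

# Hypothetical per-step correctness probabilities, as a PRM might output them.
lucky_guess = {"final_answer_correct": True, "step_scores": [0.95, 0.20, 0.90]}
sound_solution = {"final_answer_correct": True, "step_scores": [0.95, 0.92, 0.97]}


def outcome_score(solution):
    """Outcome supervision: only the final answer matters."""
    return 1.0 if solution["final_answer_correct"] else 0.0


def process_score(solution):
    """Process supervision: product of per-step correctness probabilities."""
    return math.prod(solution["step_scores"])


for name, sol in [("lucky guess", lucky_guess), ("sound solution", sound_solution)]:
    print(f"{name:<15} outcome={outcome_score(sol):.2f} process={process_score(sol):.3f}")
```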

How PRMs Enable Better Reasoning

Process reward models are particularly valuable for test-time compute scaling because they enable:

Step-by-step verification: The model can check each reasoning step as it generates it, catching errors early.
Search over reasoning paths: Instead of committing to a single chain of thought, the model can explore multiple paths and use the PRM to identify the most promising ones.
Backtracking: If the PRM scores a step poorly, the model can backtrack and try a different approach.
Best-of-N with step-level scoring: When generating multiple candidate solutions, the PRM can identify the best one based on the quality of reasoning, not just whether the final answer matches.

The combination of extended thinking (generating more reasoning tokens) and process reward models (evaluating those tokens step-by-step) is what makes modern reasoning models so effective.
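Two of the uses listed above, best-of-N reranking and backtracking, can be sketched with a toy example. Everything here is hypothetical: the candidate solutions, their step scores, the product aggregation, and the 0.5 threshold are chosen only to show the mechanics, not to describe any particular system.

```python
import math


def solution_score(step_scores):
    """Aggregate step scores into one number (product, by assumption)."""
    return math.prod(step_scores)


def rerank(candidates):
    """Best-of-N: pick the candidate whose reasoning scores highest."""
    return max(candidates, key=lambda c: solution_score(c["step_scores"]))


def first_weak_step(step_scores, threshold=0.5):
    """Index of the first step scoring below threshold, or None.

    A search procedure could discard everything from this step onward
    and resample from the last trusted step.
    """
    for index, score in enumerate(step_scores):
        if score < threshold:
            return index
    return None


# Hypothetical candidates: two agree on "42", but one of them has a weak step.
candidates = [
    {"answer": "42", "step_scores": [0.90, 0.30, 0.80]},
    {"answer": "41", "step_scores": [0.90, 0.90, 0.85]},
    {"answer": "42", "step_scores": [0.60, 0.70, 0.90]},
]

best = rerank(candidates)
print("Selected answer:", best["answer"])
print("Backtrack point in candidate 0:", first_weak_step(candidates[0]["step_scores"]))
```

Note that the selected answer is not the most frequent one. Step-level scoring ranks candidates by the quality of their reasoning, which can disagree with a simple majority vote.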

Source: Lightman et al., “Let’s Verify Step by Step,” arXiv:2305.20050, May 2023. OpenAI. Process supervision significantly outperforms outcome supervision, achieving 78.2% on a MATH benchmark subset.

DeepSeek-R1: Open-Source Reasoning Through Pure RL

While OpenAI’s o1 demonstrated the power of extended thinking, its training methodology remained proprietary. In January 2025, DeepSeek published a paper that opened the field wide: “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning” (arXiv:2501.12948, released January 20, 2025). The paper described not just one model, but a series of experiments that revealed how reasoning capabilities emerge through reinforcement learning.

DeepSeek-R1-Zero: Reasoning from Pure RL

The most striking result was DeepSeek-R1-Zero, a model trained using GRPO (Group Relative Policy Optimization, covered in Chapter 15) with no supervised fine-tuning at all. Starting from the pre-trained DeepSeek-V3 base model, the team applied RL with rule-based rewards (correct/incorrect for math and code problems) and observed something remarkable: the model spontaneously developed chain-of-thought reasoning, self-verification, and reflection behaviors without ever being shown examples of these behaviors.

During training, the model’s average response length grew from a few hundred tokens to thousands of tokens as it learned that longer, more detailed reasoning led to higher rewards. The model even exhibited what the DeepSeek team described as “aha moments,” where it would re-evaluate its approach mid-reasoning and discover a better solution path.

However, R1-Zero had significant limitations. Its outputs suffered from poor readability, language mixing (switching between English and Chinese mid-response), and repetitive patterns. These issues made it impractical for real-world use, but the experiment proved a crucial point: reasoning capabilities can emerge purely from RL, without any human-written demonstrations of reasoning.
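To make the training signal concrete, here is a minimal sketch of a rule-based reward and the group-relative advantage at the heart of GRPO. The think/answer output layout follows the template described in the DeepSeek-R1 paper. The reward weights (1.0 for accuracy, 0.1 for format), exact string matching, the use of the population standard deviation, and the sample completions are all assumptions chosen for illustration.

```python
import re
import statistics

# Expected layout: <think>reasoning</think> <answer>final answer</answer>
PATTERN = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def rule_based_reward(completion, reference):
    """Score a completion with fixed rules; no learned reward model involved."""
    match = PATTERN.fullmatch(completion.strip())
    if match is None:
        return 0.0  # malformed output earns nothing
    accuracy = 1.0 if match.group(1).strip() == reference else 0.0
    return accuracy + 0.1  # 0.1 is a hypothetical bonus for correct format


def group_relative_advantages(rewards):
    """Normalize each reward against its own group: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mean) / std for r in rewards]


# Four hypothetical samples for the prompt "What is 2 + 3?"
group = [
    "<think>2 + 3 = 5</think> <answer>5</answer>",
    "<think>3 + 2 is 5</think> <answer>5</answer>",
    "<think>2 * 3 = 6</think> <answer>6</answer>",
    "The answer is 5",
]
rewards = [rule_based_reward(c, "5") for c in group]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.2f}")
```

Because each sample is compared only with the other samples for the same prompt, no separate value network is needed to estimate a baseline.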

DeepSeek-R1: The Full Pipeline

The production DeepSeek-R1 model addressed R1-Zero’s limitations with a four-stage training pipeline:

Cold-start SFT: A small amount of supervised fine-tuning on thousands of carefully curated chain-of-thought examples to establish good formatting and readability habits.
Reasoning-focused RL: Large-scale GRPO training on math, code, and logic problems with rule-based rewards, similar to R1-Zero but starting from the SFT checkpoint instead of the raw base model.
Rejection sampling and SFT: Generate many responses from the RL-trained model, filter for quality, and use the best ones (along with non-reasoning data for writing, Q&A, etc.) to create a comprehensive SFT dataset. Fine-tune the base model on this combined dataset.
Final RL: A second round of RL training to further refine both reasoning and general helpfulness.
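The rejection sampling step in stage three can be sketched as a simple filter. This is a toy version under stated assumptions: correctness is checked by exact match against a reference answer, and is_readable is a crude, hypothetical stand-in for the quality filters (it only rejects non-ASCII text as a proxy for language mixing, and overly long responses). The sample data is invented.

```python
def is_readable(response, max_chars=2000):
    """Crude stand-in filter: ASCII-only and not overly long."""
    return response.isascii() and len(response) <= max_chars


def rejection_sample(samples_by_prompt, references):
    """Keep only responses with a correct answer that pass the filter."""
    kept = []
    for prompt, samples in samples_by_prompt.items():
        for sample in samples:
            correct = sample["answer"] == references[prompt]
            if correct and is_readable(sample["response"]):
                kept.append({"prompt": prompt, "response": sample["response"]})
    return kept


# Hypothetical samples from an RL-trained checkpoint.
samples_by_prompt = {
    "What is 7 * 8?": [
        {"response": "7 * 8 = 56. The answer is 56.", "answer": "56"},
        {"response": "7 * 8 = 54. The answer is 54.", "answer": "54"},
        {"response": "7 * 8 = 56。答案是 56.", "answer": "56"},
    ],
}
references = {"What is 7 * 8?": "56"}

sft_examples = rejection_sample(samples_by_prompt, references)
print(len(sft_examples), "of 3 samples kept")
```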

The result was a 671 billion parameter MoE model (37 billion active per token) that matched OpenAI’s o1 on math, coding, and reasoning benchmarks while being fully open-source under the MIT license. On AIME 2024, DeepSeek-R1 achieved 79.8% accuracy (pass@1), comparable to o1’s 83.3%.

Distillation: Reasoning for Smaller Models

Perhaps the most impactful contribution of the DeepSeek-R1 paper was its demonstration that reasoning capabilities can be distilled into much smaller models. The team generated 800,000 high-quality reasoning traces from the full R1 model and used them to fine-tune a range of smaller open-source models:

Distilled Model	Base Model	AIME 2024	MATH-500
DeepSeek-R1-Distill-Qwen-1.5B	Qwen 2.5 1.5B	28.9%	83.9%
DeepSeek-R1-Distill-Qwen-7B	Qwen 2.5 7B	55.5%	92.8%
DeepSeek-R1-Distill-Qwen-14B	Qwen 2.5 14B	69.7%	93.9%
DeepSeek-R1-Distill-Qwen-32B	Qwen 2.5 32B	72.6%	94.3%
DeepSeek-R1-Distill-Llama-70B	Llama 3 70B	70.0%	94.5%
OpenAI o1-mini (for comparison)	Proprietary	63.6%	90.0%

Source: DeepSeek-R1 paper, arXiv:2501.12948, Table 5. Benchmark results for distilled models.

The 32B distilled model outperformed OpenAI’s o1-mini (72.6% vs. 63.6% on AIME 2024) despite being a fraction of the size and fully open-source. Even the 7B model achieved 55.5% on AIME 2024, a score that would have been state-of-the-art for any model just a year earlier. This demonstrated that the reasoning patterns learned through RL can be effectively transferred to smaller models through supervised fine-tuning on reasoning traces.

Source: DeepSeek, “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv:2501.12948, January 20, 2025. Released under MIT license. 671B MoE model (37B active). Achieved 79.8% on AIME 2024 (pass@1). Distilled models released in sizes from 1.5B to 70B.

The Extended Thinking Landscape: March 2026

Following o1 and DeepSeek-R1, every major AI lab has released models with extended thinking capabilities. Here is the landscape as of March 2026:

OpenAI: o1, o3, and o3-mini

OpenAI’s reasoning model line has evolved rapidly:

o1 (previewed September 12, 2024; full release December 5, 2024) was the first commercial reasoning model. It introduced hidden thinking tokens, and the full model achieved 83.3% on AIME 2024 and 78% on GPQA Diamond (a benchmark of PhD-level science questions where human experts achieve approximately 65%).

o3 (April 16, 2025) was a major upgrade. It achieved 96.7% on AIME 2024 (solving 14.5 out of 15 problems) and 87.7% on GPQA Diamond. o3 also introduced multimodal reasoning, allowing it to reason about images, use tools, and search the web as part of its thinking process. In June 2025, OpenAI reduced o3’s pricing by 80%, from $10 to $2 per million input tokens and from $40 to $8 per million output tokens, making reasoning models accessible for everyday use.

o3-mini (January 31, 2025) offered a smaller, faster alternative with three configurable reasoning effort levels: low, medium, and high. This allowed developers to balance speed and cost against reasoning quality.
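As a rough sketch of how the effort levels are selected, the helper below builds the request fields for a chat completion call. build_o3_mini_request is a hypothetical helper; the reasoning_effort field name and its three values reflect the OpenAI API at the time of writing and should be checked against the current API reference. The code only constructs a dictionary and does not contact any service.

```python
VALID_EFFORTS = ("low", "medium", "high")


def build_o3_mini_request(prompt, effort="medium"):
    """Build keyword arguments for a chat completion request (not sent here)."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
    return {
        "model": "o3-mini",
        "reasoning_effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }


request = build_o3_mini_request("How many primes are below 100?", effort="high")
print(request)
```

With the official Python SDK, a dictionary like this would typically be passed as keyword arguments to the chat completions call. Higher effort generally means more hidden reasoning tokens, and therefore higher latency and cost.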

Anthropic: Claude 3.7 Sonnet and Claude Sonnet 4

Claude 3.7 Sonnet (February 24, 2025) was Anthropic’s first model with extended thinking. It introduced a “hybrid” approach: the same model can operate in standard mode (fast, single-pass generation) or extended thinking mode (with a configurable thinking budget). In extended thinking mode, Claude generates visible thinking tokens in a <thinking> block before producing its final answer.

A key differentiator of Claude’s approach is the configurable thinking budget. Through the API, developers can set a budget_tokens parameter that controls how many tokens Claude can spend thinking, from a minimum of 1,024 to a maximum of 128,000 tokens. This gives developers fine-grained control over the speed/quality tradeoff.
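A minimal sketch of a request that sets a thinking budget is shown below. build_thinking_request is a hypothetical helper that only assembles and validates the request fields. The shape of the thinking field, the model identifier, and the rule that the budget must be smaller than max_tokens reflect the Anthropic Messages API as understood at the time of writing; verify them against the current documentation before relying on them.

```python
MIN_BUDGET_TOKENS = 1024  # minimum thinking budget stated above


def build_thinking_request(prompt, budget_tokens, max_tokens,
                           model="claude-3-7-sonnet-20250219"):
    """Build the fields for a Messages API call with extended thinking."""
    if budget_tokens < MIN_BUDGET_TOKENS:
        raise ValueError(f"budget_tokens must be at least {MIN_BUDGET_TOKENS}")
    if budget_tokens >= max_tokens:
        # Thinking tokens count toward max_tokens, so leave room for the answer.
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {
        "model": model,  # assumed model identifier
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }


request = build_thinking_request(
    "Prove that the square root of 2 is irrational.",
    budget_tokens=8000,
    max_tokens=12000,
)
print(request["thinking"])
```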

Claude Sonnet 4 (May 22, 2025) and Claude Opus 4 (May 22, 2025) continued this approach with improved reasoning capabilities and support for multi-hour extended thinking sessions.

Google: Gemini 2.5 Pro and Deep Think

Gemini 2.5 Pro (March 25, 2025) was Google’s entry into the reasoning model space. Announced as a “thinking model,” it natively integrates reasoning into the Gemini architecture, with all models in the 2.5 series capable of step-by-step reasoning.

Gemini 2.5 Deep Think (August 1, 2025) took a different architectural approach. Instead of a single model thinking sequentially, Deep Think uses a multi-agent architecture that spawns multiple reasoning agents to explore different solution paths in parallel, then synthesizes the best ideas into a final answer. It achieved 87.6% on LiveCodeBench (competition-level coding) and Bronze-level performance on the 2025 International Mathematical Olympiad benchmark in its publicly released form. Notably, an advanced research version of Deep Think scored 35 out of 42 points on the IMO, earning gold-medal status, the first time an AI model achieved this feat. The publicly available version trades some of that peak performance for faster response times. Deep Think is available through Google’s AI Ultra subscription at $250 per month.

xAI: Grok 3

Grok 3 (February 2025) introduced reasoning capabilities through two modes: “Think,” which displays step-by-step reasoning, and “Big Brain,” which allocates more compute for harder problems. Trained on xAI’s Colossus supercluster with approximately 200,000 GPUs, Grok 3 uses visible chain-of-thought reasoning similar to DeepSeek-R1. It was one of the first models to offer a “Think” button that lets users toggle reasoning on and off for individual queries.

Alibaba: QwQ-32B and Qwen3

QwQ-32B (March 6, 2025) was a significant milestone for open-source reasoning. With only 32 billion parameters, it achieved performance comparable to DeepSeek-R1 (671 billion parameters, 37 billion active) on math, coding, and reasoning benchmarks. Released under the Apache 2.0 license, QwQ-32B demonstrated that reinforcement learning applied to a strong foundation model can produce competitive reasoning capabilities at a fraction of the size. It uses visible chain-of-thought reasoning with <think> tags.

Qwen3 (April 29, 2025) introduced a hybrid thinking architecture across its entire model family, from 0.6B to 235B parameters. Every Qwen3 model can seamlessly switch between “Thinking Mode” (step-by-step reasoning for complex tasks) and “Non-Thinking Mode” (fast, direct responses for simple queries). Developers toggle this behavior using an enable_thinking parameter or /think and /no_think tags within prompts. This made Qwen3 one of the first model families to offer reasoning as a built-in, toggleable capability rather than a separate model.
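The in-prompt tags can be applied with a few lines of code. The helper below, with_thinking_switch, is hypothetical: it appends /think or /no_think to the most recent user message and leaves the original list untouched. How the tags interact with the enable_thinking parameter depends on the chat template, so check the Qwen3 documentation for the exact precedence rules.

```python
def with_thinking_switch(messages, think):
    """Return a copy of messages with a thinking tag on the last user turn."""
    tag = "/think" if think else "/no_think"
    updated = [dict(message) for message in messages]  # shallow copy per message
    for message in reversed(updated):
        if message["role"] == "user":
            message["content"] = f'{message["content"]} {tag}'
            break
    return updated


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

fast = with_thinking_switch(conversation, think=False)
slow = with_thinking_switch(conversation, think=True)
print(fast[-1]["content"])
print(slow[-1]["content"])
```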

Source: Qwen Team, “QwQ-32B: Embracing the Power of Reinforcement Learning,” March 6, 2025 (qwenlm.github.io/blog/qwq-32b). Qwen3 Technical Report, arXiv:2505.09388. xAI, Grok 3 announcement, February 2025 (neowin.net, macrumors.com).

OpenAI: GPT-5

GPT-5 (August 7, 2025) represented a convergence of the reasoning and standard model lines. Rather than maintaining separate “o-series” reasoning models and “GPT-series” standard models, GPT-5 unified both into a single system with a built-in router. The router automatically detects prompt complexity and decides when to engage extended reasoning (“GPT-5 Thinking” mode) versus providing a fast, direct response. On AIME 2025, GPT-5 achieved 94.6% accuracy without tools (pass@1). GPT-5 also introduced configurable reasoning depth through a reasoning_effort parameter, continuing the pattern established by o3-mini.

Source: OpenAI, “Introducing GPT-5,” August 7, 2025 (openai.com/index/introducing-gpt-5). GPT-5 Wikipedia article confirms August 7, 2025 release date and unified architecture.

Summary Table

Model	Lab	Release	Thinking Style	AIME 2024
o1-preview	OpenAI	Sep 2024	Hidden tokens	56.7%
o1 (full)	OpenAI	Dec 2024	Hidden tokens	83.3%
DeepSeek-R1	DeepSeek	Jan 2025	Visible <think> tags	79.8%
o3-mini	OpenAI	Jan 2025	Hidden, 3 effort levels	Varies by effort
Grok 3	xAI	Feb 2025	Visible, Think/Big Brain modes	N/A
Claude 3.7 Sonnet	Anthropic	Feb 2025	Visible, configurable budget	N/A
QwQ-32B	Alibaba/Qwen	Mar 2025	Visible <think> tags	Comparable to R1
Gemini 2.5 Pro	Google	Mar 2025	Native thinking	N/A
o3	OpenAI	Apr 2025	Hidden, multimodal	96.7%
Qwen3 (family)	Alibaba/Qwen	Apr 2025	Hybrid think/non-think toggle	N/A
Claude Sonnet 4	Anthropic	May 2025	Visible, multi-hour	N/A
GPT-5	OpenAI	Aug 2025	Unified router, auto-reasoning	94.6% (AIME 2025)
Gemini 2.5 Deep Think	Google	Aug 2025	Multi-agent parallel	N/A

Source: OpenAI announcements (openai.com). Anthropic, “Claude 3.7 Sonnet and Claude Code,” February 24, 2025 (anthropic.com/news/claude-3-7-sonnet). Anthropic, “Introducing Claude 4,” May 22, 2025 (anthropic.com/news/claude-4). Google, “Our newest Gemini model with thinking,” March 25, 2025 (blog.google). 9to5Google, “Gemini 2.5 Deep Think rolling out now for Google AI Ultra,” August 1, 2025. Qwen Team, QwQ-32B blog post, March 6, 2025. xAI, Grok 3 announcement, February 2025. OpenAI, “Introducing GPT-5,” August 7, 2025.

How Extended Thinking Works in Practice

Let us walk through what actually happens when you use a reasoning model, step by step.

The Thinking Phase

When you send a prompt to a reasoning model, the model enters a thinking phase before generating its visible response. During this phase, the model generates tokens that represent its internal reasoning. These tokens are processed through the same transformer architecture described in earlier chapters (attention, feed-forward networks, layer normalization), but they serve a different purpose: they are “scratch paper” for the model to work through the problem.

Here is a simplified illustration of what happens:

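# Simplified illustration of extended thinking vs. standard generation.
# This is conceptual, not actual model internals. The "model" here is a
# hypothetical stand-in that replays a fixed script of tokens.

END_THINKING = "<end_thinking>"
EOS = "<eos>"

def make_toy_model(script):
    """Return a fake next-token function that replays `script` in order."""
    remaining = iter(script)
    def generate_next_token(context):
        # A real model would condition on `context`; the toy ignores it.
        return next(remaining, EOS)
    return generate_next_token

def generate_tokens(generate_next_token, context, max_tokens=256):
    """Decode until the end-of-sequence token or the length limit."""
    output = []
    for _ in range(max_tokens):
        token = generate_next_token(context + output)
        if token == EOS:
            break
        output.append(token)
    return output

def standard_generation(generate_next_token, prompt):
    """Standard model: directly generates the answer."""
    # The model processes the prompt and immediately starts
    # generating the visible response
    return generate_tokens(generate_next_token, prompt)

def extended_thinking(generate_next_token, prompt, thinking_budget=4096):
    """Reasoning model: thinks first, then answers."""
    # Phase 1: Generate thinking tokens (internal reasoning)
    thinking_tokens = []
    for _ in range(thinking_budget):
        # Each thinking token attends to the prompt AND
        # all previous thinking tokens
        next_token = generate_next_token(prompt + thinking_tokens)
        # The model may decide to stop thinking early
        if next_token in (END_THINKING, EOS):
            break
        thinking_tokens.append(next_token)

    # Phase 2: Generate the visible response
    # The response tokens attend to the prompt AND
    # all thinking tokens, giving the model access to
    # its own reasoning
    response_tokens = generate_tokens(generate_next_token, prompt + thinking_tokens)

    return thinking_tokens, response_tokens

prompt = ["What", "is", "2", "+", "2", "?"]
model = make_toy_model(
    ["2", "+", "2", "=", "4", END_THINKING, "The", "answer", "is", "4", EOS]
)
thinking, response = extended_thinking(model, prompt)
print("Thinking:", thinking)
print("Response:", response)

# The key insight: thinking tokens expand the model's
# effective context, allowing it to "remember" intermediate
# reasoning steps when generating the final answer.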

The thinking tokens are regular tokens generated by the same model. They are not a separate system or a different architecture. The model simply generates additional tokens before the response, and these tokens become part of the context that the model attends to when generating the final answer. This is why extended thinking works: it gives the model more “working memory” to hold intermediate results, explore alternatives, and verify its reasoning.

What Thinking Tokens Look Like

For models that expose their thinking (like DeepSeek-R1 and Claude), you can see the actual reasoning process. Here is a real example of what thinking tokens look like when a reasoning model solves a math problem:

<think>
Let me work through this step by step.

The problem asks: Find all integers n such that n^2 + 3n + 5 is 
divisible by 121.

First, 121 = 11^2. So I need n^2 + 3n + 5 ≡ 0 (mod 121).

Let me start by solving mod 11 first.
n^2 + 3n + 5 ≡ 0 (mod 11)

Using the quadratic formula approach:
n = (-3 ± sqrt(9 - 20)) / 2 (mod 11)
n = (-3 ± sqrt(-11)) / 2 (mod 11)
n = (-3 ± sqrt(0)) / 2 (mod 11)

Wait, -11 ≡ 0 (mod 11), so sqrt(-11) ≡ 0 (mod 11).
So n ≡ -3/2 (mod 11).

I need the inverse of 2 mod 11. Since 2 * 6 = 12 ≡ 1 (mod 11),
the inverse of 2 is 6.

So n ≡ -3 * 6 ≡ -18 ≡ -18 + 22 ≡ 4 (mod 11).

Let me verify: 4^2 + 3(4) + 5 = 16 + 12 + 5 = 33 = 3 * 11. ✓
33 is divisible by 11 but not by 121.

Now I need to lift this to mod 121 using Hensel's lemma...
</think>

Notice several important features of this reasoning:

Step-by-step decomposition: The model breaks the problem into smaller parts (first solve mod 11, then lift to mod 121).
Self-correction: The model catches a potential error (“Wait, -11 ≡ 0 (mod 11)”) and adjusts.
Verification: The model checks its intermediate result (“Let me verify: 4^2 + 3(4) + 5 = 33”).
Strategy selection: The model identifies the appropriate technique (Hensel’s lemma) for the next step.

These behaviors were not explicitly programmed. They emerged through reinforcement learning, where the model learned that these strategies lead to correct answers and higher rewards.
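The arithmetic in the trace is easy to check mechanically, which is a useful habit when reading model reasoning. The brute-force sketch below (roots_mod is a hypothetical helper) enumerates every residue and confirms that n ≡ 4 is the only root modulo 11. It also shows what the truncated trace has not yet reached: the congruence has no roots modulo 121 at all. The root at 4 is a double root, since the derivative 2n + 3 equals 11 there, so the basic form of Hensel's lemma does not lift it.

```python
def roots_mod(modulus):
    """All residues n in [0, modulus) with n^2 + 3n + 5 ≡ 0 (mod modulus)."""
    return [n for n in range(modulus) if (n * n + 3 * n + 5) % modulus == 0]


print("Roots mod 11: ", roots_mod(11))
print("Roots mod 121:", roots_mod(121))

# Every candidate n = 4 + 11k leaves the same remainder modulo 121.
remainders = {((4 + 11 * k) ** 2 + 3 * (4 + 11 * k) + 5) % 121 for k in range(11)}
print("Remainders for n = 4 + 11k:", remainders)
```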

The Cost of Thinking

Extended thinking is not free. Every thinking token consumes the same compute as a regular output token: it passes through all the transformer layers, uses the KV cache (covered in Chapter 18), and takes time to generate. For reasoning models, the thinking tokens often vastly outnumber the response tokens.

# Illustrative token usage for reasoning models
# Rough figures based on typical usage patterns reported by developers

examples = [
    ("Simple question", "What is the capital of France?",
     50, 20, "Paris. The capital of France is Paris."),
    ("Medium math", "Solve 3x + 7 = 22",
     500, 30, "x = 5"),
    ("Hard math (AIME-level)", "Find all integers n where n^2+3n+5 is divisible by 121",
     8000, 200, "No such integer n exists"),
    ("Competition coding", "Implement an O(n log n) solution for...",
     15000, 500, "[complete solution with explanation]"),
    ("Research-level math", "Prove that for all primes p > 3...",
     50000, 1000, "[detailed proof]"),
]

print("Token Usage: Thinking vs. Response")
print(f"{'Task':<25} {'Thinking':>10} {'Response':>10} {'Ratio':>8} {'% Thinking':>12}")
print("-" * 70)
for task, _, think, resp, _ in examples:
    total = think + resp
    ratio = think / resp
    pct = think / total * 100
    print(f"{task:<25} {think:>10,} {resp:>10,} {ratio:>7.0f}x {pct:>11.1f}%")

print(f"\nFor hard problems, 90-98% of generated tokens are thinking tokens.")
print(f"These are billed as output tokens, making reasoning models")
print(f"significantly more expensive per query on complex tasks.")

# Cost comparison (using OpenAI o3 pricing as of March 2026)
thinking_tokens = 15_000
response_tokens = 500
input_tokens = 200
output_tokens = thinking_tokens + response_tokens  # thinking is billed as output

print("\n\nCost Example: o3 on a hard problem")
print(f"Thinking tokens: {thinking_tokens:,}")
print(f"Response tokens: {response_tokens:,}")
print(f"Total output tokens: {output_tokens:,}")
print(f"Input tokens (prompt): {input_tokens:,}")
input_cost = input_tokens / 1_000_000 * 2     # $2 per million input tokens
output_cost = output_tokens / 1_000_000 * 8   # $8 per million output tokens
total_cost = input_cost + output_cost
print(f"Input cost:  ${input_cost:.6f}")
print(f"Output cost: ${output_cost:.6f}")
print(f"Total cost:  ${total_cost:.6f}")

# Assumption: GPT-4o answers the same prompt in about 100 output tokens
print("\nFor comparison, the same prompt with GPT-4o (no thinking):")
gpt4o_input = input_tokens / 1_000_000 * 2.5   # $2.50 per million input
gpt4o_output = 100 / 1_000_000 * 10            # $10 per million output
gpt4o_total = gpt4o_input + gpt4o_output
print(f"Total cost:  ${gpt4o_total:.6f}")
print(f"Reasoning model is {total_cost/gpt4o_total:.1f}x more expensive,")
print("but is far more likely to get a hard problem right.")

Source: OpenAI API pricing as of March 2026. o3: $2/M input, $8/M output tokens (after June 2025 price reduction). GPT-4o: $2.50/M input, $10/M output tokens.

Controlling Reasoning Effort

One of the most important practical considerations with reasoning models is controlling how much thinking they do. More thinking generally leads to better answers, but it also means higher latency and cost. Different tasks require different amounts of reasoning, and the optimal amount varies widely.

OpenAI’s Reasoning Effort Parameter

OpenAI’s o1 and o3 models expose a reasoning_effort parameter in the API that controls how much the model thinks. The parameter accepts three values: low, medium, and high.

# Example: Using reasoning_effort with OpenAI's o3 model
# (Illustrative API call structure)

from openai import OpenAI
client = OpenAI()

# Low effort: fast, fewer thinking tokens
# Good for: simple questions, quick lookups, basic math
response_low = client.chat.completions.create(
    model="o3",
    messages=[{"role": "user", "content": "What is 15% of 80?"}],
    reasoning_effort="low"
)

# Medium effort: balanced (default)
# Good for: moderate complexity, most everyday tasks
response_medium = client.chat.completions.create(
    model="o3",
    messages=[{"role": "user", "content": "Explain the tradeoffs between SQL and NoSQL databases."}],
    reasoning_effort="medium"
)

# High effort: thorough, many thinking tokens
# Good for: complex math, competition coding, research problems
response_high = client.chat.completions.create(
    model="o3",
    messages=[{"role": "user", "content": "Prove that there are infinitely many primes of the form 4k+3."}],
    reasoning_effort="high"
)

# The reasoning_effort parameter affects:
# 1. Number of thinking tokens generated (more = higher effort)
# 2. Latency (more tokens = longer wait)
# 3. Cost (thinking tokens are billed as output tokens)
# 4. Accuracy on hard problems (higher effort = better accuracy)

The impact of reasoning effort is substantial. On o3-mini, the difference between low and high effort can mean 10x more thinking tokens and significantly higher accuracy on difficult problems. OpenAI recommends starting with low effort, evaluating results, and increasing effort only if needed.
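The start-low-and-escalate workflow described above can be sketched as a small loop. This is a minimal sketch under stated assumptions, not an official pattern: `ask` is a hypothetical wrapper you would write around an API call that passes `reasoning_effort=effort`, and `is_acceptable` is whatever task-specific check you have (unit tests, a numeric verifier, a rubric). A stand-in function replaces the API call here so the example runs offline.

```python
from typing import Callable, Tuple

EFFORT_LEVELS = ("low", "medium", "high")

def solve_with_escalation(
    ask: Callable[[str], str],
    is_acceptable: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try each effort level in order and return (answer, effort_used).

    `ask(effort)` is a hypothetical wrapper around an API call;
    `is_acceptable(answer)` is a task-specific quality check.
    """
    answer = ""
    for effort in EFFORT_LEVELS:
        answer = ask(effort)
        if is_acceptable(answer):
            return answer, effort
    # Nothing passed the check: fall back to the highest-effort attempt
    return answer, EFFORT_LEVELS[-1]


# Stand-in for a real API call: pretend the model only reaches
# the right answer at "high" effort.
fake_answers = {"low": "41", "medium": "41", "high": "42"}
calls = []

def fake_ask(effort: str) -> str:
    calls.append(effort)
    return fake_answers[effort]

answer, effort_used = solve_with_escalation(fake_ask, lambda a: a == "42")
print(answer, effort_used, calls)
```

The design only pays for higher effort when a cheaper attempt fails the check, so it is worthwhile mainly when checking an answer is much cheaper than producing one.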

Anthropic’s Thinking Budget

Anthropic takes a more granular approach with Claude’s extended thinking. Instead of discrete effort levels, developers specify a budget_tokens parameter that sets the maximum number of tokens Claude can spend thinking.

# Example: Using budget_tokens with Claude's extended thinking
# (Illustrative API call structure)

import anthropic
client = anthropic.Anthropic()

# Minimal thinking: 1,024 tokens (minimum allowed)
response_minimal = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 1024  # Minimum thinking budget
    },
    messages=[{"role": "user", "content": "What is the derivative of x^3?"}]
)

# Moderate thinking: 8,000 tokens
response_moderate = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 8000
    },
    messages=[{"role": "user", "content": "Analyze the time complexity of quicksort."}]
)

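# Large thinking budget: 32,000 tokens
# budget_tokens must be smaller than max_tokens (max_tokens covers thinking
# plus the final answer), so max_tokens is raised here as well. A request
# this long is streamed rather than sent as a single blocking call.
with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=40000,
    thinking={
        "type": "enabled",
        "budget_tokens": 32000
    },
    messages=[{"role": "user", "content": "Solve this IMO problem: ..."}]
) as stream:
    response_large = stream.get_final_message()

# The budget_tokens parameter:
# - Minimum: 1,024 tokens
# - Must be smaller than max_tokens, which is capped by the model's output limit
#   (up to 128,000 tokens on Claude 3.7 Sonnet with its extended-output beta)
# - Claude may use fewer tokens if it finishes thinking early
# - Larger budgets enable deeper reasoning on harder problems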
The continuous budget approach gives developers precise control over the cost/quality tradeoff. For a production application, you might use 2,000 tokens for routine queries and 50,000 tokens for complex analysis, optimizing costs while ensuring quality where it matters.
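The tiered setup in that paragraph can be captured in a small helper that builds the request arguments. This is a sketch under stated assumptions: the tier names, the `answer_tokens` allowance, and the helper itself are hypothetical, and only the `thinking` and `max_tokens` fields come from the API calls shown earlier. The helper keeps `budget_tokens` below `max_tokens` by construction; you would still need to check that the resulting `max_tokens` fits within your model's output limit.

```python
MIN_BUDGET = 1_024  # minimum thinking budget noted above

# Hypothetical tiers for one application; tune them against your own evals.
BUDGET_BY_TIER = {"routine": 2_000, "complex": 50_000}

def thinking_kwargs(tier: str, answer_tokens: int = 4_000) -> dict:
    """Build keyword arguments for a messages request with extended thinking.

    max_tokens covers thinking plus the visible answer, so it is set to
    budget + answer_tokens, which keeps budget_tokens below max_tokens.
    """
    budget = BUDGET_BY_TIER[tier]
    if budget < MIN_BUDGET:
        raise ValueError(f"budget_tokens must be at least {MIN_BUDGET}")
    if answer_tokens <= 0:
        raise ValueError("answer_tokens must be positive")
    return {
        "max_tokens": budget + answer_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget},
    }

for tier in BUDGET_BY_TIER:
    print(tier, thinking_kwargs(tier))
```

A call would then look like `client.messages.create(model=..., messages=..., **thinking_kwargs("routine"))`, with the larger tier sent through the streaming interface.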

Source: OpenAI API documentation, “Reasoning effort” parameter for o1 and o3 models. Anthropic API documentation, “Extended thinking” with budget_tokens parameter. Simon Willison, “Claude 3.7 Sonnet, extended thinking and long output,” February 25, 2025 (simonwillison.net).

The Overthinking Problem

Extended thinking is not without drawbacks. One significant issue that has emerged is overthinking: reasoning models sometimes generate far more thinking tokens than necessary, especially on simple problems. This wastes compute, increases latency, and drives up costs without improving accuracy.

Symptoms of Overthinking

Research in 2025 has documented several patterns of overthinking:

Redundant reasoning: The model repeats the same reasoning steps multiple times, perhaps with slight variations, without making progress.
Unnecessary verification: The model checks and re-checks results that are obviously correct, spending tokens on verification that adds no value.
Over-decomposition: The model breaks simple problems into unnecessarily small steps. For “What is 2 + 3?”, it might reason through the definition of addition, the properties of integers, and multiple verification steps.
Exploration without commitment: The model explores many solution paths but struggles to commit to one, generating tokens that explore alternatives even after finding a correct solution.

A May 2025 paper, “Do NOT Think That Much for 2+3=? On the Overthinking of Long Reasoning Models” (OpenReview), found that reasoning models often generate redundant, homogeneous solutions that have little impact on accuracy but greatly increase inference cost. The paper documented cases where models spent thousands of tokens on problems that could be solved in dozens.
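If you log thinking traces, even a crude measurement can flag the redundant-reasoning pattern. The sketch below is a rough heuristic of our own, not the metric used in the paper: it splits a trace into steps at periods and newlines, normalizes case and whitespace, and reports the fraction of steps that repeat an earlier step verbatim. Paraphrased repetition would need a similarity measure instead of exact matching, and decimal numbers will be split at the period.

```python
import re

def redundancy_ratio(trace: str) -> float:
    """Fraction of reasoning steps that repeat an earlier step verbatim,
    after lowercasing and collapsing whitespace. 0.0 means no repeats."""
    raw_steps = re.split(r"[.\n]+", trace)
    steps = [re.sub(r"\s+", " ", s).strip().lower() for s in raw_steps]
    steps = [s for s in steps if s]
    if not steps:
        return 0.0
    seen = set()
    repeats = 0
    for step in steps:
        if step in seen:
            repeats += 1
        seen.add(step)
    return repeats / len(steps)

trace = """2 + 3 = 5.
Let me double-check: 2 + 3 = 5.
Let me verify again. 2 + 3 = 5.
So the answer is 5."""
print(f"Redundancy ratio: {redundancy_ratio(trace):.2f}")
```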

Mitigating Overthinking

Several approaches have been proposed to address overthinking:

Adaptive token budgets: Instead of fixed thinking budgets, use a classifier to estimate problem difficulty and allocate tokens accordingly. Easy problems get minimal thinking; hard problems get extensive thinking.

Early stopping: Train models to recognize when they have reached a confident answer and stop thinking, rather than continuing until the budget is exhausted.

Thinking token compression: Research on “Chain of Draft” (arXiv:2502.18600, February 2025) proposes generating concise, dense reasoning steps instead of verbose explanations, reducing token count while preserving reasoning quality.

Calibration training: Train models to better calibrate their confidence, so they know when additional thinking is unlikely to help.

The overthinking problem illustrates a broader challenge with test-time compute scaling: more is not always better. The optimal amount of thinking depends on the problem, and current models are not always good at estimating how much thinking they need.
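The adaptive token budget idea can be illustrated with a toy example. Everything here is an assumption made for illustration: the keyword list, the weights, and the 32,000-token ceiling are invented, and the keyword heuristic is only a stand-in for the trained difficulty classifier the approach calls for. The point is the shape of the pipeline: score the prompt first, then map the score to a thinking budget.

```python
MIN_BUDGET = 1_024
MAX_BUDGET = 32_000  # hypothetical ceiling for this application

# Hypothetical hint words; a real system would use a trained classifier.
HARD_HINTS = ("prove", "optimal", "complexity", "all integers", "edge case")

def estimate_difficulty(prompt: str) -> float:
    """Toy difficulty score in [0, 1] from hint words and prompt length."""
    text = prompt.lower()
    hits = sum(hint in text for hint in HARD_HINTS)
    length_score = min(len(text.split()) / 200, 1.0)
    return min(1.0, 0.4 * hits + 0.2 * length_score)

def thinking_budget(prompt: str) -> int:
    """Interpolate linearly between the minimum and maximum budget."""
    score = estimate_difficulty(prompt)
    return int(MIN_BUDGET + score * (MAX_BUDGET - MIN_BUDGET))

easy = "What is 2 + 3?"
hard = "Prove that the algorithm is optimal and analyze its time complexity."
for prompt in (easy, hard):
    print(f"{thinking_budget(prompt):>6,} tokens  <-  {prompt}")
```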

Source: “Do NOT Think That Much for 2+3=? On the Overthinking of Long Reasoning Models,” OpenReview, May 2025. “ThoughtTerminator: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models,” OpenReview, 2025. “Chain of Draft: Thinking Faster by Writing Less,” arXiv:2502.18600, February 2025.

When to Use Extended Thinking

Extended thinking is powerful, but it is not the right choice for every task. Here is a practical guide to when reasoning models shine and when standard models are better:

Use Extended Thinking For:

Complex mathematics: Problems requiring multiple steps, proofs, or non-obvious insights. AIME-level competition math, calculus proofs, number theory.

Algorithmic coding: Problems where the algorithm design matters more than the implementation. Competition programming, optimization problems, complex data structure design.

Multi-step reasoning: Tasks that require holding multiple pieces of information in mind and combining them. Legal analysis, scientific hypothesis evaluation, strategic planning.

Verification-critical tasks: When getting the wrong answer is costly and you need the model to check its work. Financial calculations, medical reasoning, safety-critical systems.

Research and exploration: Open-ended problems where you want the model to explore multiple approaches. Brainstorming solutions, analyzing tradeoffs, investigating edge cases.

Use Standard Models For:

Simple queries: Questions with straightforward answers. “What is the capital of France?” does not benefit from extended thinking.

Creative writing: Extended thinking can actually hurt creative tasks by making the model over-analyze instead of flowing naturally.

Real-time applications: When latency matters more than accuracy. Chatbots, autocomplete, interactive applications.

High-volume, low-stakes tasks: When you are processing many queries and the cost of extended thinking would be prohibitive. Bulk classification, simple extraction, routine summarization.

Tasks where the model already excels: If a standard model achieves 95%+ accuracy on a task, extended thinking may not provide meaningful improvement.

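# Decision framework for choosing between standard and reasoning models

from dataclasses import dataclass

@dataclass
class Task:
    """Illustrative task profile; every field is a judgment call you supply."""
    name: str
    requires_multi_step_logic: bool = False
    has_verifiable_answer: bool = False
    involves_math_or_code: bool = False
    benefits_from_exploration: bool = False
    is_high_stakes: bool = False
    requires_low_latency: bool = False
    is_creative_or_open_ended: bool = False
    is_simple_lookup: bool = False
    is_high_volume: bool = False
    standard_model_accuracy: float = 0.9  # measured or estimated baseline

def should_use_reasoning_model(task):
    """
    Simple heuristic for model selection.
    Returns True if a reasoning model is likely beneficial.
    """
    # Factors that favor reasoning models
    reasoning_indicators = [
        task.requires_multi_step_logic,
        task.has_verifiable_answer,
        task.involves_math_or_code,
        task.benefits_from_exploration,
        task.is_high_stakes,
        task.standard_model_accuracy < 0.8,
    ]

    # Factors that favor standard models
    standard_indicators = [
        task.requires_low_latency,
        task.is_creative_or_open_ended,
        task.is_simple_lookup,
        task.is_high_volume,
        task.standard_model_accuracy > 0.95,
    ]

    reasoning_score = sum(reasoning_indicators)
    standard_score = sum(standard_indicators)

    # Ties go to the standard model, the cheaper default
    return reasoning_score > standard_score

# Example task assessments (flags and accuracies are illustrative, not measured)
tasks = [
    Task("AIME math problem", requires_multi_step_logic=True,
         has_verifiable_answer=True, involves_math_or_code=True,
         standard_model_accuracy=0.12),
    Task("Write a poem", is_creative_or_open_ended=True),
    Task("Debug this code", requires_multi_step_logic=True,
         has_verifiable_answer=True, involves_math_or_code=True),
    Task("Translate to French", standard_model_accuracy=0.97),
    Task("Prove this theorem", requires_multi_step_logic=True,
         has_verifiable_answer=True, involves_math_or_code=True),
    Task("Summarize this article", is_high_volume=True),
    Task("Design an algorithm", requires_multi_step_logic=True,
         involves_math_or_code=True, benefits_from_exploration=True),
    Task("Answer FAQ", is_simple_lookup=True, is_high_volume=True),
]

print("Task Assessment: Reasoning Model Recommended?")
print("-" * 50)
for task in tasks:
    rec_str = "Yes" if should_use_reasoning_model(task) else "No"
    print(f"{task.name:<30} {rec_str}")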

The Future of Extended Thinking

Extended thinking represents a fundamental shift in how we think about AI capabilities. Instead of building ever-larger models and hoping they become smarter, we can give existing models more time to think and achieve better results. This has profound implications for the future of AI development.

Test-Time Compute as a Scaling Axis

The traditional scaling paradigm focused on three axes: model size (parameters), training data (tokens), and training compute (FLOPs). Extended thinking adds a fourth axis: inference compute. This creates new tradeoffs and opportunities:

A smaller model with extensive thinking time might outperform a larger model with minimal thinking
Inference costs become more variable and task-dependent
The optimal model size depends on the expected distribution of task difficulties
Hardware requirements shift toward supporting longer inference runs
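A simple way to see inference compute acting as a scaling axis is majority voting over repeated samples, one of the most basic ways to spend more compute at test time. To be clear, this is not how o1-style models reason internally; it is a stand-in that makes the tradeoff easy to compute. The sketch assumes each sample is independently correct with probability p and, pessimistically, that all wrong samples agree on the same wrong answer.

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct
    when each sample is correct with probability p.

    Simplifying assumption: all wrong samples agree on one wrong answer,
    a pessimistic case for voting. Use odd n to avoid ties.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

for p in (0.60, 0.40):
    for n in (1, 5, 25, 101):
        acc = majority_vote_accuracy(p, n)
        print(f"p={p:.2f}  samples={n:>3}  accuracy={acc:.3f}")
```

Under these assumptions, extra samples help only when the per-sample success rate is above one half and make things worse below it, which echoes the point that test-time compute pays off on problems where the model already has a non-trivial chance of success.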

Toward Adaptive Reasoning

Current reasoning models use relatively simple strategies for allocating thinking time. Future models will likely become more sophisticated:

Difficulty estimation: Models that can accurately estimate how hard a problem is before starting to think, allocating compute accordingly.

Dynamic reallocation: Models that can recognize when they are stuck and either try a different approach or give up and ask for clarification.

Hierarchical reasoning: Models that can decompose problems into sub-problems, solve each with appropriate compute, and combine the results.

Collaborative reasoning: Multiple models or agents working together, each contributing different perspectives or expertise.

The Convergence of Training and Inference

Extended thinking blurs the line between training and inference. When a model generates thousands of thinking tokens to solve a problem, it is effectively doing a form of in-context learning, adapting its approach based on the specific problem. Future architectures may further blur this line, with models that can update their weights during inference or maintain persistent memory across sessions.

Key Takeaways

Extended thinking (also called test-time compute scaling) allows language models to spend more computation on harder problems by generating internal “thinking tokens” before producing their final answer. This dramatically improves performance on complex reasoning tasks.
Chain-of-thought prompting (Wei et al., arXiv:2201.11903, NeurIPS 2022) was the foundation for extended thinking. It showed that prompting models to “show their work” improves accuracy on math and reasoning tasks. On GSM8K, chain-of-thought prompting enabled PaLM 540B to achieve approximately 58% accuracy, surpassing even fine-tuned GPT-3 with a verifier.
OpenAI’s o1 (September 12, 2024) was the first commercial reasoning model. It uses reinforcement learning to train an internal chain of thought. OpenAI reported 83.3% on AIME 2024 with consensus voting over 64 samples (about 74% with a single sample), compared to GPT-4o’s approximately 12%. The thinking tokens are hidden from users.
Test-time compute scaling can be more efficient than model scaling. Snell et al. (arXiv:2408.03314, August 2024) showed that on problems where a smaller model has non-trivial success rates, allocating more test-time compute can allow it to outperform a 14x larger model.
Process reward models (Lightman et al., arXiv:2305.20050, May 2023) evaluate each reasoning step rather than just the final answer. This enables better search over reasoning paths and step-by-step verification, achieving 78.2% on a MATH benchmark subset.
DeepSeek-R1 (arXiv:2501.12948, January 20, 2025) demonstrated that reasoning capabilities can emerge purely from reinforcement learning without supervised fine-tuning (R1-Zero), and that these capabilities can be distilled into smaller models. The 32B distilled model achieved 72.6% on AIME 2024, outperforming OpenAI’s o1-mini (63.6%).
OpenAI’s o3 (previewed in December 2024, released April 16, 2025) was reported at 96.7% on AIME 2024 in its preview announcement and introduced multimodal reasoning. After an 80% price reduction in June 2025, o3 costs $2 per million input tokens and $8 per million output tokens.
Claude’s extended thinking (starting with Claude 3.7 Sonnet, February 24, 2025) uses visible thinking tokens with a configurable budget from 1,024 to 128,000 tokens, giving developers fine-grained control over the speed/quality tradeoff.
Gemini 2.5 Deep Think (August 1, 2025) uses a multi-agent architecture that explores multiple solution paths in parallel, achieving Bronze-level performance on the 2025 IMO benchmark.
Reasoning effort can be controlled through API parameters: OpenAI uses discrete levels (low/medium/high), while Anthropic uses a continuous token budget. Higher effort means more thinking tokens, higher latency, higher cost, and better accuracy on hard problems.
Overthinking is a documented problem where reasoning models generate far more thinking tokens than necessary, especially on simple problems. Research is ongoing into adaptive budgets, early stopping, and calibration training to mitigate this.
When to use extended thinking: Complex math, algorithmic coding, multi-step reasoning, verification-critical tasks. When not to use it: simple queries, creative writing, real-time applications, high-volume low-stakes tasks.

What’s Next

You now understand how extended thinking allows models to reason more effectively by spending more compute at inference time. In Chapter 17, we will explore the mechanics of token-by-token generation: how models actually produce output one token at a time, how temperature and sampling parameters control randomness, and how stop tokens signal completion. This will complete your understanding of how a prompt becomes a response.
