Chapter 15. Fine-tuning & RLHF, Making Models Useful

In Chapter 14, you saw how pre-training produces a base model: a powerful text completion engine that has absorbed trillions of tokens of human knowledge. But a base model is not a useful assistant. Ask it “What is the capital of France?” and it might respond with “What is the capital of Germany? What is the capital of Italy?” because it has learned to continue patterns, not to answer questions. The process that transforms a raw base model into a helpful, harmless, and honest assistant is called post-training (also known as alignment), and it is one of the most consequential developments in the history of AI. Without it, ChatGPT, Claude, and every other AI assistant you have used would be unusable.


Why Raw Pre-trained Models Are Useless

This might sound like an exaggeration, but it is not. A pre-trained base model is trained on one objective: predict the next token. It has no concept of “being helpful,” “answering questions,” or “refusing harmful requests.” It is a statistical mirror of its training data, which includes everything from Wikipedia articles to Reddit arguments to spam emails.

Here is what happens when you prompt a base model with a simple question:

Prompt: “What is the capital of France?”

Possible base model outputs:

  • “What is the capital of Germany? What is the capital of Spain?” (continuing the pattern of questions)
  • “The capital of France is Paris. Paris is also known as the City of Light. The capital of France has been Paris since…” (correct but rambling, never stopping)
  • “A) Paris B) London C) Berlin D) Madrid” (treating it as a multiple-choice quiz)
  • Something entirely unrelated, depending on what patterns in the training data match

The base model has the knowledge (it knows Paris is the capital of France), but it has not learned the behavior of answering questions concisely and helpfully. It has also not learned to refuse dangerous requests, avoid generating toxic content, or follow instructions reliably.

This distinction between knowledge and behavior is the central insight of post-training. Pre-training provides the knowledge and capabilities; post-training shapes the behavior. A model that is poorly pre-trained cannot be rescued by post-training (you cannot teach a model facts it never learned). But a well-pre-trained model that is poorly post-trained will be knowledgeable but unhelpful or unsafe.

The InstructGPT Revelation

The paper that demonstrated this most dramatically was “Training Language Models to Follow Instructions with Human Feedback” by Ouyang et al. at OpenAI, published in March 2022 (arXiv:2203.02155, later presented at NeurIPS 2022). The paper introduced InstructGPT, a model created by applying post-training techniques to GPT-3.

The headline result was striking: human evaluators preferred the outputs of the 1.3 billion parameter InstructGPT model over the outputs of the 175 billion parameter GPT-3, despite InstructGPT having over 100x fewer parameters. A tiny model that had been taught how to behave was more useful than a massive model that had not.

This result established a principle that has guided the industry ever since: post-training is not optional. It is what makes language models useful. Every frontier model released since InstructGPT, from ChatGPT to Claude to Gemini, has gone through extensive post-training.

Source: Ouyang et al., “Training language models to follow instructions with human feedback,” arXiv:2203.02155, March 2022. NeurIPS 2022. OpenAI. The team used approximately 40 human labelers to create the training data. The 1.3B InstructGPT model was preferred over the 175B GPT-3 in human evaluations.


The Post-training Pipeline

Modern post-training is not a single step. It is a multi-stage pipeline that progressively refines the model’s behavior. The standard pipeline, as practiced by major labs in 2024-2025, consists of three main stages:

  1. Supervised Fine-Tuning (SFT): Train the model on examples of good behavior
  2. Reward Modeling: Train a separate model to judge response quality
  3. Reinforcement Learning (RL): Use the reward model to further improve the policy

Some labs add or modify stages. Anthropic uses Constitutional AI (which we will cover in detail). Some labs use Direct Preference Optimization (DPO) instead of or in addition to RL. Meta’s LLaMA 3 uses an iterative pipeline with several rounds of SFT and DPO. But the core pattern of “teach by example, then optimize with feedback” is universal.

Let us walk through each stage in detail.


Stage 1: Supervised Fine-Tuning (SFT)

Supervised fine-tuning is the most intuitive stage. You show the model thousands of examples of the behavior you want, and train it to produce similar outputs. The training objective is the same as pre-training (next-token prediction with cross-entropy loss, as described in Chapter 14), but the training data is fundamentally different: instead of raw web text, you use carefully curated instruction-response pairs.

What SFT Data Looks Like

Each training example is a conversation between a user and an assistant. Here is a simplified example:

User: What are the three laws of thermodynamics? Explain each one briefly.

Assistant: The three laws of thermodynamics describe fundamental principles
about energy and entropy:

1. **First Law (Conservation of Energy):** Energy cannot be created or
   destroyed, only transformed from one form to another. The total energy
   of an isolated system remains constant.

2. **Second Law (Entropy):** In any natural process, the total entropy
   of an isolated system never decreases over time. Heat flows
   spontaneously from hot objects to cold objects, never the reverse.

3. **Third Law (Absolute Zero):** As a system approaches absolute zero
   temperature (0 Kelvin, or -273.15 degrees Celsius), its entropy
   approaches a minimum value. It is impossible to reach absolute zero
   in a finite number of steps.

The model is trained to predict the assistant’s response tokens, given the user’s message as context. The loss is computed only on the assistant’s tokens, not on the user’s message (since we do not want the model to learn to generate user messages).
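A minimal sketch of this masking, assuming per-token losses have already been computed and that each token carries a role label. The tokens, roles, and loss values below are made up for illustration and do not come from any real model:

```python
import numpy as np

# Hypothetical tokenized conversation: each token is tagged with its role.
tokens = ["What", "is", "2+2", "?", "The", "answer", "is", "4", "."]
roles = ["user"] * 4 + ["assistant"] * 5

# Made-up per-token cross-entropy losses, one per token.
token_losses = np.array([2.1, 1.7, 3.0, 0.9, 1.2, 0.8, 0.5, 0.3, 0.1])

# 1.0 where the token belongs to the assistant, 0.0 elsewhere.
loss_mask = np.array([1.0 if r == "assistant" else 0.0 for r in roles])

# Average the loss over assistant tokens only.
sft_loss = float((token_losses * loss_mask).sum() / loss_mask.sum())
unmasked_loss = float(token_losses.mean())

print(f"Masked SFT loss:    {sft_loss:.3f}")
print(f"Unmasked mean loss: {unmasked_loss:.3f}")
```

The user tokens still serve as context for the prediction; they simply contribute nothing to the gradient.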

Where SFT Data Comes From

SFT data comes from several sources, and the mix has evolved significantly since InstructGPT:

Human-written demonstrations are the gold standard. Professional annotators (or in-house researchers) write high-quality responses to a diverse set of prompts. InstructGPT used approximately 40 labelers who wrote demonstrations for prompts submitted through the OpenAI API. This is expensive and slow, but produces the highest-quality data.

Rejection sampling generates many candidate responses from the model itself, scores them with a reward model, and keeps only the best one. This is how Meta generates much of the SFT data for LLaMA 3: the LLaMA 3 technical report describes sampling K outputs (typically between 10 and 30) from the latest chat model for each prompt, then using a reward model to select the best candidate. The approach leverages the model’s own capabilities while using the reward model as a quality filter.

Synthetic data generation uses a stronger model to generate training data for a weaker one. DeepSeek-V3’s post-training pipeline uses an internal DeepSeek-R1 model to generate reasoning data, then distills the reasoning patterns into the final model. The DeepSeek-V3 technical report describes generating two types of SFT samples: one with the original response format, and another that incorporates R1-style chain-of-thought reasoning with reflection and verification patterns.

Human-edited model outputs combine the efficiency of model generation with the quality of human judgment. In LLaMA 3’s preference data collection, annotators not only rank model outputs but also edit the preferred response to improve it further, producing three-way ranked data: edited response > chosen response > rejected response.

Source: Meta, “The Llama 3 Herd of Models,” arXiv:2407.21783, July 2024. Post-training uses several rounds of SFT and DPO with rejection sampling (K=10-30 outputs per prompt). DeepSeek-V3 technical report (arXiv:2412.19437, December 2024): SFT dataset of 1.5 million instances spanning multiple domains, with reasoning data generated by an internal DeepSeek-R1 model.
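The rejection sampling procedure described above can be sketched in a few lines. This is only an illustration: `toy_generate` and `toy_reward` are hypothetical stand-ins for a chat model and a reward model, and the "prefers roughly 10-word answers" rule is invented so the example runs without any model.

```python
import random

def toy_generate(prompt, rng):
    """Stand-in for sampling one response from a chat model (hypothetical)."""
    length = rng.randint(1, 20)
    return f"{prompt} -> " + " ".join(["word"] * length)

def toy_reward(prompt, response):
    """Stand-in for a reward model (hypothetical): prefers ~10-word answers."""
    n_words = len(response.split("-> ")[1].split())
    return -abs(n_words - 10)

def rejection_sample(prompt, k, rng):
    """Sample k candidates and keep the one with the highest reward."""
    candidates = [toy_generate(prompt, rng) for _ in range(k)]
    scored = [(toy_reward(prompt, c), c) for c in candidates]
    best_reward, best_response = max(scored, key=lambda pair: pair[0])
    return best_response, best_reward, scored

rng = random.Random(0)
best, best_r, scored = rejection_sample("Explain entropy", k=10, rng=rng)
print(f"Kept 1 of {len(scored)} candidates, reward = {best_r}")
```

Only the kept response would enter the SFT dataset; the other K-1 candidates are discarded.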

The Scale of SFT Data

SFT datasets are tiny compared to pre-training data. While pre-training uses trillions of tokens, SFT typically uses thousands to millions of examples:

Model | SFT Dataset Size | Source
--- | --- | ---
InstructGPT (2022) | ~13,000 demonstrations | Human-written by ~40 labelers
LLaMA 3 (2024) | Not disclosed (multiple rounds) | Rejection sampling + human annotation
DeepSeek-V3 (2024) | ~1.5 million instances | Synthetic (R1-generated) + human-verified

The small size of SFT data relative to pre-training data is not a limitation; it is a feature. SFT is not teaching the model new knowledge (that happened during pre-training). It is teaching the model a new format for expressing its existing knowledge: the format of a helpful assistant that answers questions, follows instructions, and declines harmful requests.

This is why the LIMA paper (Zhou et al., “LIMA: Less Is More for Alignment,” NeurIPS 2023) was so influential. The authors showed that fine-tuning LLaMA 65B on just 1,000 carefully curated examples produced a model whose responses were judged equivalent or preferable to GPT-4’s in 43% of cases in a human preference study. Their conclusion: “almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.”

Source: Zhou et al., “LIMA: Less Is More for Alignment,” arXiv:2305.11206, May 2023. NeurIPS 2023. Meta AI. Fine-tuned LLaMA 65B on 1,000 curated examples.

SFT Training Details

The mechanics of SFT training are straightforward:

import numpy as np

# SFT training configuration (illustrative, based on public reports)
# DeepSeek-V3 SFT settings from their technical report:
config = {
    "dataset_size": 1_500_000,       # 1.5M instances (DeepSeek-V3)
    "epochs": 2,                      # DeepSeek-V3 trains for 2 epochs
    "initial_learning_rate": 5e-6,    # DeepSeek-V3: starts at 5e-6
    "final_learning_rate": 1e-6,      # DeepSeek-V3: decays to 1e-6
    "schedule": "cosine_decay",       # cosine learning rate decay
    "loss_masking": "assistant_only", # loss computed only on assistant tokens
    "packing": True,                  # multiple samples packed per sequence
    "sample_isolation": True,         # samples masked from each other
}

# Compare SFT scale to pre-training scale
pretrain_tokens = 14.8e12   # DeepSeek-V3 pre-training: 14.8T tokens
sft_instances = 1.5e6       # DeepSeek-V3 SFT: 1.5M instances
avg_tokens_per_instance = 500  # rough estimate
sft_tokens = sft_instances * avg_tokens_per_instance

ratio = pretrain_tokens / sft_tokens
print(f"Pre-training tokens:  {pretrain_tokens:.1e}")
print(f"SFT tokens (approx):  {sft_tokens:.1e}")
print(f"Ratio:                {ratio:,.0f}x more pre-training data")
print(f"\nSFT is ~{1/ratio*100:.4f}% of pre-training data volume.")
print(f"Yet it fundamentally changes the model's behavior.")

A few important details about SFT training:

Loss masking: During SFT, the loss is computed only on the assistant’s response tokens, not on the user’s prompt tokens. This is different from pre-training, where the loss is computed on all tokens. The reason is that we want the model to learn to generate good responses, not to learn to generate user prompts.

Sample packing: To use GPU memory efficiently, multiple short conversations are packed into a single training sequence (up to the model’s maximum sequence length). However, an attention mask ensures that tokens from different conversations cannot attend to each other, preventing the model from learning spurious cross-conversation patterns. DeepSeek-V3’s technical report explicitly describes this “sample masking strategy.”

Low learning rate: SFT uses a much lower learning rate than pre-training (typically 1e-6 to 1e-5, compared to 1e-4 to 3e-4 for pre-training). This is because we want to adjust the model’s behavior without destroying the knowledge it learned during pre-training. Too high a learning rate would cause catastrophic forgetting, where the model loses its pre-trained capabilities.

Few epochs: SFT typically trains for only 1-3 epochs over the data (DeepSeek-V3 uses 2 epochs). Training for too many epochs causes the model to overfit to the SFT data, producing repetitive or formulaic responses.
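The sample packing detail above is easiest to see as a mask. The sketch below shows one common way to build such a mask (a block-diagonal causal mask); it is an assumption about how the idea can be realized, not DeepSeek-V3’s actual implementation, and the function name is hypothetical.

```python
import numpy as np

def packed_attention_mask(sample_lengths):
    """Build a causal mask where tokens only attend within their own sample.

    mask[i, j] is True when token i may attend to token j.
    """
    # Label every position in the packed sequence with its sample index.
    sample_ids = np.repeat(np.arange(len(sample_lengths)), sample_lengths)
    same_sample = sample_ids[:, None] == sample_ids[None, :]
    total = len(sample_ids)
    # Lower-triangular matrix: each token sees itself and earlier tokens only.
    causal = np.tril(np.ones((total, total), dtype=bool))
    return same_sample & causal

# Two conversations of 3 and 2 tokens packed into one 5-token sequence.
mask = packed_attention_mask([3, 2])
print(mask.astype(int))
```

In the printed matrix, the rows for the second conversation contain zeros in the columns of the first, which is what keeps the two samples isolated.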


Stage 2: Reward Modeling, Teaching a Model to Judge

SFT gets the model most of the way to useful behavior, but it has a fundamental limitation: it can only teach the model to imitate the examples it has seen. It cannot teach the model to generalize beyond those examples or to distinguish between good and great responses. For that, we need a way to score the quality of any response the model might generate. This is the job of the reward model.

The Core Idea

A reward model (RM) is a separate neural network that takes a prompt and a response as input and outputs a single scalar score representing how “good” the response is. Higher scores mean better responses. The reward model is trained on preference data: pairs of responses where a human (or AI) has indicated which one is better.

The idea of learning a reward function from human preferences was introduced by Christiano et al. in “Deep Reinforcement Learning from Human Preferences” (arXiv:1706.03741, June 2017). The original paper applied this idea to Atari games and simulated robotics tasks, but the same framework was later adapted for language models by OpenAI’s InstructGPT team.

Source: Christiano et al., “Deep reinforcement learning from human preferences,” arXiv:1706.03741, June 2017. NeurIPS 2017. OpenAI and DeepMind.

How Preference Data Is Collected

The standard process for collecting preference data works as follows:

  1. Sample prompts from a diverse set of user queries
  2. Generate multiple responses to each prompt using the current model (or multiple models)
  3. Present pairs of responses to human annotators
  4. Annotators indicate which response is better (and optionally, by how much)

For LLaMA 3, Meta deployed multiple models after each training round and sampled two responses from two different models for each user prompt. Annotators rated the strength of their preference on a four-level scale: significantly better, better, slightly better, or marginally better. They also had the option to edit the preferred response to improve it further.

This process produces a dataset of tuples: (prompt, chosen_response, rejected_response), where “chosen” is the response the annotator preferred and “rejected” is the one they did not.
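As a rough illustration of what one record in such a dataset might look like, here is a small sketch. The `strength` and `edited` fields are hypothetical names chosen to mirror the LLaMA 3 description above (a preference level and an optional annotator edit); they are not taken from any published schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    strength: str = "better"        # hypothetical field for the preference level
    edited: Optional[str] = None    # hypothetical field for an annotator's edit

def ranked_responses(pair):
    """Return responses from best to worst: edited > chosen > rejected."""
    ranking = [pair.chosen, pair.rejected]
    if pair.edited is not None:
        ranking.insert(0, pair.edited)
    return ranking

pair = PreferencePair(
    prompt="What is the capital of France?",
    chosen="The capital of France is Paris.",
    rejected="What is the capital of Germany?",
    strength="significantly better",
    edited="Paris.",
)
print(ranked_responses(pair))
```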

The Bradley-Terry Model

The standard mathematical framework for training reward models is the Bradley-Terry model, a statistical model for pairwise comparisons originally developed in 1952. In the context of RLHF, the Bradley-Terry model assumes that the probability of preferring response A over response B is determined by the difference in their reward scores:

P(A preferred over B) = sigmoid(r(A) - r(B))

Where r(A) and r(B) are the scalar reward scores assigned by the reward model, and sigmoid is the logistic function (which maps any real number to a probability between 0 and 1, as described in Chapter 2).

The training loss for the reward model is the negative log-likelihood of the observed preferences:

L = -log(sigmoid(r(chosen) - r(rejected)))

This loss pushes the reward model to assign higher scores to chosen responses and lower scores to rejected responses. The sigmoid function ensures that the loss is smooth and differentiable, allowing gradient-based optimization.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry loss for reward model training."""
    return -np.log(sigmoid(r_chosen - r_rejected))

# Example: reward model learning to distinguish good from bad responses
print("Reward Model Training: Bradley-Terry Loss")
print(f"{'r(chosen)':>12} {'r(rejected)':>12} {'Difference':>12} {'P(correct)':>12} {'Loss':>8}")
print("-" * 62)

examples = [
    (0.0, 0.0),    # untrained: equal scores
    (0.5, -0.5),   # learning: slight preference
    (2.0, -1.0),   # trained: clear preference
    (5.0, -3.0),   # well-trained: strong preference
]

for r_c, r_r in examples:
    diff = r_c - r_r
    prob = sigmoid(diff)
    loss = reward_model_loss(r_c, r_r)
    print(f"{r_c:>12.1f} {r_r:>12.1f} {diff:>12.1f} {prob:>12.4f} {loss:>8.4f}")

print(f"\nAs the reward model learns, it assigns higher scores to chosen")
print(f"responses and lower scores to rejected ones. The loss decreases")
print(f"as the model becomes more confident in its preferences.")

Architecture of a Reward Model

A reward model is typically initialized from the same pre-trained base model (or from the SFT checkpoint) and modified to output a single scalar value instead of a probability distribution over the vocabulary. In practice, this means replacing the language model’s output head (which maps from hidden dimensions to vocabulary size) with a simple linear layer that maps from hidden dimensions to a single number.
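The head swap can be illustrated with toy matrices. The sketch below uses random numbers in place of a real transformer, and it assumes the reward is read from the hidden state of the final token, which is a common choice but not something the sources above specify.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, vocab_size, seq_len = 16, 100, 7   # toy sizes

# Stand-in for the transformer's final hidden states for one (prompt, response).
hidden_states = rng.normal(size=(seq_len, hidden_dim))

# Language-model head: hidden_dim -> vocab_size logits at every position.
lm_head = rng.normal(size=(hidden_dim, vocab_size))
logits = hidden_states @ lm_head

# Reward head: hidden_dim -> 1, applied to the last token's hidden state
# (assumed pooling choice for this sketch).
reward_head = rng.normal(size=(hidden_dim, 1))
reward = float((hidden_states[-1] @ reward_head)[0])

print("LM head output shape:", logits.shape)
print("Reward (scalar):", round(reward, 3))
```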

For LLaMA 3, Meta trained the reward model on top of the pre-trained checkpoint (not the SFT checkpoint), using all available preference data after filtering out samples with similar responses. The reward model covers multiple capabilities and is used both for rejection sampling during SFT data generation and for the RL stage.

Rule-Based vs. Model-Based Rewards

Not all rewards need to come from a learned model. DeepSeek-V3’s post-training pipeline uses a combination of two reward types:

Rule-based rewards are used for tasks where correctness can be verified deterministically. For math problems with definitive answers, the model’s response can be checked against the known answer. For coding problems, the generated code can be compiled and tested against test cases. Rule-based rewards are more reliable than model-based rewards because a deterministic check is much harder to “hack” or manipulate than a learned scorer.

Model-based rewards are used for tasks where quality is subjective or where there is no deterministic ground truth, such as creative writing, open-ended question answering, or conversational responses. DeepSeek-V3 trains its model-based reward model from the SFT checkpoint and constructs preference data that includes chain-of-thought reasoning leading to the reward, which helps mitigate reward hacking.

Source: DeepSeek-V3 technical report (arXiv:2412.19437): uses both rule-based RM (for math and code with deterministic answers) and model-based RM (for free-form and subjective tasks). The model-based RM is trained from the DeepSeek-V3 SFT checkpoint with chain-of-thought preference data.
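A rule-based reward for math can be as simple as an exact-match check. The sketch below assumes responses end with a line of the form `Answer: <value>`; that format and both function names are hypothetical choices for this example, not the format used by DeepSeek.

```python
import re

def extract_final_answer(response):
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", response)
    return matches[-1].strip() if matches else None

def rule_based_reward(response, reference_answer):
    """1.0 if the extracted answer matches the reference exactly, else 0.0."""
    answer = extract_final_answer(response)
    return 1.0 if answer == reference_answer else 0.0

r_correct = rule_based_reward("17 + 25 = 42.\nAnswer: 42", "42")
r_wrong = rule_based_reward("I think it is 41.\nAnswer: 41", "42")
r_missing = rule_based_reward("No final answer given.", "42")
print(r_correct, r_wrong, r_missing)
```

A real checker would need to normalize equivalent answers (for example "42.0" and "42"), which this sketch deliberately leaves out.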


Stage 3: Reinforcement Learning from Human Feedback (RLHF)

With a trained reward model in hand, we can now use reinforcement learning to optimize the language model’s outputs. This is the stage that gives RLHF its name, and it is where the model learns to go beyond imitating examples (SFT) to actively maximizing the reward signal.

The RL Framing

In the RLHF framework, the language model is treated as a policy (a function that maps states to actions, in RL terminology):

  • State: The prompt (the user’s message and any conversation history)
  • Action: The response (the sequence of tokens the model generates)
  • Reward: The score assigned by the reward model to the (prompt, response) pair

The goal is to find a policy (a set of model weights) that maximizes the expected reward across all possible prompts. In plain terms: we want the model to generate responses that the reward model scores highly.

However, there is a critical constraint. If we simply maximize the reward with no restrictions, the model will find ways to “hack” the reward model, producing outputs that score highly according to the reward model but are actually low quality. This is called reward hacking (or reward overoptimization), and it is a fundamental challenge in RLHF.

To prevent reward hacking, the RL objective includes a KL divergence penalty that prevents the model from straying too far from its SFT starting point. The full objective is:

maximize  E[R(prompt, response)] - beta * KL(policy || reference_policy)

Where:

  • R(prompt, response) is the reward model’s score
  • KL(policy || reference_policy) measures how different the current policy is from the reference policy (typically the SFT model)
  • beta is a hyperparameter that controls the strength of the constraint

The KL penalty acts as a leash: it allows the model to improve its responses (by increasing the reward) but prevents it from drifting so far from the SFT model that it starts producing degenerate outputs.
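For a single sampled response, the objective above can be estimated from per-token log-probabilities. The sketch below uses the sum of log-probability differences as a one-sample estimate of the KL term; the numbers and the value of beta are made up for illustration.

```python
import numpy as np

def kl_penalized_reward(reward, logp_policy, logp_ref, beta):
    """Reward minus beta times a sample-based KL estimate for one response.

    logp_policy / logp_ref: per-token log-probabilities of the sampled
    response under the current policy and the frozen reference model.
    """
    kl_estimate = float(np.sum(logp_policy - logp_ref))
    return reward - beta * kl_estimate, kl_estimate

# Made-up numbers for a 4-token response.
logp_ref = np.array([-1.0, -2.0, -1.5, -0.5])
logp_policy = np.array([-0.8, -1.5, -1.5, -0.4])

shaped, kl = kl_penalized_reward(reward=2.0, logp_policy=logp_policy,
                                 logp_ref=logp_ref, beta=0.1)
print(f"KL estimate: {kl:.2f}, shaped reward: {shaped:.2f}")
```

A larger beta shortens the leash: the same drift from the reference model costs more reward.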

PPO: The Classic Approach

The original RLHF pipeline (used by InstructGPT and early versions of ChatGPT) uses Proximal Policy Optimization (PPO), a reinforcement learning algorithm introduced by Schulman et al. at OpenAI in 2017 (arXiv:1707.06347). PPO is a general-purpose RL algorithm, not specific to language models, but it was adapted for RLHF because of its stability and sample efficiency.

The PPO-based RLHF pipeline works as follows:

  1. Sample a batch of prompts from the training distribution
  2. Generate responses from the current policy (the language model)
  3. Score the responses using the reward model
  4. Compute the advantage for each response (how much better or worse it is than expected)
  5. Update the policy using the PPO objective, which clips the policy update to prevent large, destabilizing changes
  6. Repeat
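The clipping in step 5 can be sketched as follows. The advantages are taken as given here (in a full pipeline they would come from the critic), and the clip range of 0.2 is an illustrative assumption rather than a value reported for InstructGPT.

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (to be maximized)."""
    # Probability ratio between the updated policy and the sampling policy.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum removes any incentive to push the ratio
    # outside the clip range.
    return float(np.mean(np.minimum(unclipped, clipped)))

logp_old = np.log(np.array([0.20, 0.20, 0.20]))
logp_new = np.log(np.array([0.30, 0.21, 0.10]))
advantages = np.array([1.0, 1.0, -1.0])

objective = ppo_clipped_objective(logp_new, logp_old, advantages)
print(f"Clipped objective: {objective:.4f}")
```

In the first example the ratio is 1.5, but the objective only credits 1.2, so the update gains nothing from moving that probability further in one step.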

PPO requires a critic model (also called a value model) that estimates the expected reward for a given prompt. This critic model is typically the same size as the policy model, which means PPO-based RLHF requires maintaining four models in memory simultaneously:

  1. The policy model (the language model being trained)
  2. The reference model (a frozen copy of the SFT model, for computing the KL penalty)
  3. The reward model (for scoring responses)
  4. The critic model (for estimating expected rewards)

For a 70B parameter model, this means approximately 280B parameters in memory (4 x 70B), which is extremely expensive. This computational cost is one of the main reasons why alternatives to PPO have been developed.

Source: Schulman et al., “Proximal Policy Optimization Algorithms,” arXiv:1707.06347, July 2017. OpenAI.
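The 4 x 70B estimate above can be turned into a rough weights-only memory figure. This is back-of-the-envelope arithmetic that assumes 2 bytes per parameter (16-bit weights), assumes all four models are the same size, and ignores gradients, optimizer state, and activations, all of which add substantially more in practice.

```python
def weights_memory_gb(n_models, params_per_model, bytes_per_param=2):
    """Weights-only memory in GB; ignores gradients, optimizer state, activations."""
    return n_models * params_per_model * bytes_per_param / 1e9

ppo_models = ["policy", "reference", "reward", "critic"]
total_params = len(ppo_models) * 70e9
total_gb = weights_memory_gb(len(ppo_models), 70e9)

print(f"Total parameters: {total_params / 1e9:.0f}B")
print(f"Weights only at 2 bytes/param: {total_gb:.0f} GB")
```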

GRPO: The Critic-Free Alternative

Group Relative Policy Optimization (GRPO) was introduced by Shao et al. in the DeepSeekMath paper (arXiv:2402.03300, February 2024) as a more efficient alternative to PPO. The key innovation is eliminating the critic model entirely.

Instead of using a critic to estimate the expected reward, GRPO estimates the baseline by sampling a group of G outputs for each prompt and computing the mean and standard deviation of their rewards. The advantage for each output is then computed as a z-score: how many standard deviations above or below the group mean its reward is.

A_i = (r_i - mean(r_1, r_2, ..., r_G)) / std(r_1, r_2, ..., r_G)

This is simple, elegant, and eliminates the need for a separate critic model, reducing the memory requirement from four models to three (policy, reference, and reward model). GRPO was used to train DeepSeek-V3 and DeepSeek-R1, and has become one of the most widely used post-training algorithms for open-source reasoning models as of 2025.

The GRPO objective is similar to PPO’s clipped objective but uses the group-relative advantage:

import numpy as np

def grpo_advantage(rewards):
    """Compute GRPO advantages for a group of responses to the same prompt."""
    mean_r = np.mean(rewards)
    std_r = np.std(rewards)
    if std_r < 1e-8:  # avoid division by zero
        return np.zeros_like(rewards)
    return (rewards - mean_r) / std_r

# Example: GRPO advantage computation for one prompt
# The model generates G=8 responses, each scored by the reward model
rewards = np.array([0.3, 0.7, -0.2, 1.5, 0.1, 0.8, -0.5, 1.2])

advantages = grpo_advantage(rewards)

print("GRPO Advantage Computation")
print(f"{'Response':>10} {'Reward':>8} {'Advantage':>10} {'Effect'}")
print("-" * 45)
for i, (r, a) in enumerate(zip(rewards, advantages)):
    effect = "Upweight" if a > 0 else "Downweight"
    print(f"{i+1:>10} {r:>8.2f} {a:>10.2f} {effect}")

print(f"\nMean reward: {np.mean(rewards):.2f}")
print(f"Std reward:  {np.std(rewards):.2f}")
print(f"\nResponses with positive advantage are reinforced;")
print(f"responses with negative advantage are suppressed.")

The GRPO update rule also includes a clipping mechanism (inherited from PPO) that prevents the policy from changing too much in a single step, and a KL divergence penalty against the reference policy to prevent reward hacking.

Source: Shao et al., “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models,” arXiv:2402.03300, February 2024. DeepSeek. Introduced GRPO as a variant of PPO that eliminates the critic model. Used to train DeepSeek-V3 (arXiv:2412.19437) and DeepSeek-R1 (arXiv:2501.12948).

Beyond GRPO: REINFORCE++ and DAPO

GRPO opened the door to critic-free RL for LLMs, and the community quickly built on it. Two notable successors appeared in early 2025:

REINFORCE++ (Hu, arXiv:2501.03262, January 2025) takes a different approach to eliminating the critic. Instead of using group-relative advantages like GRPO, it enhances the classic REINFORCE algorithm with three techniques borrowed from PPO: a token-level KL divergence penalty (rather than a sequence-level one), trust-region clipping to prevent large policy updates, and mini-batch updates with global advantage normalization. The result is a method that is simpler than PPO (no critic network), more stable than vanilla REINFORCE, and competitive with both PPO and GRPO on alignment benchmarks.

DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization; Yu et al., arXiv:2503.14476, March 2025, ByteDance Seed and Tsinghua University AIR) builds directly on GRPO but makes several targeted modifications. It removes the KL divergence penalty entirely (finding it unnecessary when other stabilization mechanisms are in place), decouples the clipping bounds for positive and negative advantages (using a wider clip for upweighting good responses and a tighter clip for downweighting bad ones), and introduces dynamic sampling that filters out prompts whose sampled responses are all correct or all incorrect (since these provide no useful gradient signal). Using the Qwen2.5-32B base model, DAPO achieved 50% accuracy (avg@32) on AIME 2024, outperforming DeepSeek-R1-Zero-Qwen-32B (47%) while requiring only 50% of the training steps, which suggests that these refinements meaningfully improve reasoning performance.

These methods illustrate a broader trend: the post-training RL landscape is diversifying rapidly. PPO, GRPO, REINFORCE++, and DAPO all share the same goal (optimize a policy against a reward signal) but differ in how they estimate baselines, constrain updates, and manage exploration. As of March 2026, GRPO remains the most widely deployed in production (via DeepSeek-V3 and R1), but DAPO and REINFORCE++ are gaining traction in the open-source community.

Source: Hu, “REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models,” arXiv:2501.03262, January 2025. Yu et al., “DAPO: An Open-Source LLM Reinforcement Learning System at Scale,” arXiv:2503.14476, March 2025. ByteDance Seed and Tsinghua University AIR. DAPO achieved 50% accuracy (avg@32) on AIME 2024 using Qwen2.5-32B, outperforming DeepSeek-R1-Zero-Qwen-32B (47%) with 50% fewer training steps.
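Two of the DAPO modifications described above are simple enough to sketch directly. The function names are hypothetical, and the clip values are illustrative assumptions chosen only to show an asymmetric range; this is not the DAPO reference implementation.

```python
import numpy as np

def keep_prompt(group_rewards):
    """Dynamic sampling filter: keep a prompt only if its sampled responses
    do not all share the same reward (otherwise every advantage is zero)."""
    return bool(group_rewards.max() - group_rewards.min() > 0)

def decoupled_clip(ratio, eps_low, eps_high):
    """Clip the probability ratio with separate lower and upper bounds."""
    return np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)

groups = {
    "all_correct": np.array([1.0, 1.0, 1.0, 1.0]),
    "all_wrong": np.array([0.0, 0.0, 0.0, 0.0]),
    "mixed": np.array([1.0, 0.0, 0.0, 1.0]),
}
kept = [name for name, rewards in groups.items() if keep_prompt(rewards)]
print("Prompts kept for the update:", kept)

ratios = np.array([0.5, 1.0, 1.25, 1.5])
clipped = decoupled_clip(ratios, eps_low=0.2, eps_high=0.28)
print("Clipped ratios:", clipped)
```

With a group-relative advantage, a group of identical rewards has zero spread, so every advantage in it is zero and the prompt contributes nothing to the update; filtering it out avoids spending a batch slot on it.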

What RL Actually Changes

It is worth pausing to understand what RL does that SFT cannot. SFT teaches the model to imitate a fixed set of examples. RL teaches the model to explore the space of possible responses and learn which ones are better. This exploration is critical for several reasons:

  1. Discovering novel good responses: The model may find response strategies that no human annotator thought to demonstrate. For example, it might discover that breaking a complex question into sub-questions and answering each one produces higher-reward responses.

  2. Learning from negative examples: SFT only shows the model good examples. RL also shows the model bad examples (responses that received low rewards) and teaches it to avoid them. This is particularly important for safety: the model learns not just what to say, but what not to say.

  3. Calibrating confidence: RL helps the model learn when to be confident and when to express uncertainty. A model trained only with SFT might always sound confident (because the training examples are all confident, well-written responses). RL can teach the model to hedge appropriately when it is unsure.

  4. Generalizing beyond the training distribution: Because RL involves generating novel responses and receiving feedback, the model can learn to handle prompts that are different from anything in the SFT data.

Perhaps the most dramatic demonstration of RL’s power came from DeepSeek-R1-Zero (arXiv:2501.12948, January 20, 2025). The DeepSeek team trained a model using GRPO with no SFT at all, starting directly from the pre-trained base model. The result was striking: the model spontaneously developed chain-of-thought reasoning, self-verification, and reflection behaviors purely through RL, without ever being shown examples of these behaviors. DeepSeek-R1-Zero achieved competitive performance on math and reasoning benchmarks, proving that RL can elicit capabilities that go far beyond what the training examples demonstrate. However, the model also exhibited readability issues and language mixing, which is why the final DeepSeek-R1 model uses a small amount of “cold-start” SFT data before the RL phase to improve output formatting.


Direct Preference Optimization (DPO): Skipping the Reward Model

In 2023, Rafailov et al. at Stanford introduced Direct Preference Optimization (DPO) in their paper “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model” (arXiv:2305.18290, May 2023, NeurIPS 2023). DPO offered a radical simplification of the RLHF pipeline: instead of training a separate reward model and then using RL to optimize against it, DPO directly optimizes the language model on preference data using a simple classification-style loss.

The Key Insight

DPO is based on a mathematical observation: the optimal policy under the standard RLHF objective (maximize reward minus KL penalty) can be expressed in closed form as a function of the reward model. By inverting this relationship, you can express the reward model as a function of the policy. This means you can skip the reward model entirely and directly optimize the policy on preference data.
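The inversion can be written out in three steps. The notation here is introduced for this sketch: x is the prompt, y a response, y_c and y_r the chosen and rejected responses, and Z(x) a normalizing constant that depends only on the prompt.

```latex
% 1. Closed-form optimum of the KL-penalized RLHF objective
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)

% 2. Solve for the reward
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  + \beta \log Z(x)

% 3. Take the reward difference used by Bradley-Terry; Z(x) cancels
r(x, y_c) - r(x, y_r) =
  \beta \left( \log \frac{\pi^{*}(y_c \mid x)}{\pi_{\mathrm{ref}}(y_c \mid x)}
  - \log \frac{\pi^{*}(y_r \mid x)}{\pi_{\mathrm{ref}}(y_r \mid x)} \right)
```

Because the Bradley-Terry probability depends only on this difference, the intractable Z(x) never has to be computed, and the preference likelihood can be written purely in terms of the policy and the reference model.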

The DPO loss function is:

L_DPO = -log(sigmoid(beta * (log(pi(chosen)/pi_ref(chosen)) - log(pi(rejected)/pi_ref(rejected)))))

Where:

  • pi(chosen) is the probability the current policy assigns to the chosen response
  • pi_ref(chosen) is the probability the reference policy (SFT model) assigns to the chosen response
  • pi(rejected) and pi_ref(rejected) are the same for the rejected response
  • beta controls how far the policy may move from the reference model; it plays the same role as the KL coefficient in the RLHF objective

In plain terms: DPO increases the probability of chosen responses relative to the reference model, and decreases the probability of rejected responses relative to the reference model. The sigmoid and log ensure that the optimization is stable and well-behaved.
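A minimal sketch of the loss for a single preference pair, assuming the sequence log-probabilities under the policy and the reference model have already been computed. The log-probability values and the choice of beta = 0.1 are made up for illustration.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)) for stability.
    return float(np.logaddexp(0.0, -margin))

# Policy identical to the reference: margin is 0, loss is log(2).
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has raised the chosen response and lowered the rejected one.
better = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(f"Loss at initialization: {start:.4f}")
print(f"Loss after improvement: {better:.4f}")
```

Note that the loss depends only on how the policy's log-probabilities have shifted relative to the reference, not on their absolute values.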

Why DPO Matters

DPO has several practical advantages over PPO-based RLHF:

Aspect | PPO (RLHF) | DPO
--- | --- | ---
Models in memory | 4 (policy, reference, reward, critic) | 2 (policy, reference)
Training complexity | High (RL loop with sampling) | Low (standard supervised training)
Hyperparameter sensitivity | High (many RL hyperparameters) | Low (mainly beta)
Computational cost | Very high | Moderate
Stability | Can be unstable | Generally stable

These advantages made DPO extremely popular, especially for fine-tuning open-source models where compute budgets are limited. Meta’s LLaMA 3 uses DPO (not PPO) as its preference optimization method, applying it after each round of SFT with the most recent batches of preference data.

However, DPO has limitations. Because it does not involve generating new responses during training (it only optimizes on a fixed dataset of preference pairs), it cannot explore the response space the way RL can. Several studies have found that PPO can outperform DPO on challenging tasks, particularly code generation and complex reasoning, where exploration is valuable. Xu et al. (“Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study,” arXiv:2404.10719, April 2024) benchmarked DPO and PPO across dialogue and code generation tasks and found that PPO surpassed DPO in all cases, achieving state-of-the-art results on challenging code competitions.

The practical reality is that most labs use a combination of approaches. LLaMA 3 uses iterative DPO. DeepSeek-V3 uses GRPO. The choice depends on the specific model, the available compute, and the target capabilities.

Source: Rafailov et al., “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model,” arXiv:2305.18290, May 2023. Stanford University. NeurIPS 2023. Xu et al., “Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study,” arXiv:2404.10719, April 2024.


Constitutional AI: Anthropic’s Approach

While OpenAI pioneered RLHF and Meta adopted iterative DPO, Anthropic developed a distinct approach called Constitutional AI (CAI). Introduced by Bai et al. in December 2022 (arXiv:2212.08073), Constitutional AI addresses a fundamental limitation of standard RLHF: it requires human annotators to identify harmful outputs, which is expensive, slow, and can expose annotators to disturbing content.

The Two Phases of Constitutional AI

Constitutional AI works in two phases:

Phase 1: Supervised Self-Critique (SL Phase)

  1. Start with a helpful but potentially harmful model (trained with RLHF for helpfulness only)
  2. Sample responses to a set of prompts, including prompts designed to elicit harmful outputs
  3. Ask the model to critique its own response according to a set of principles (the “constitution”)
  4. Ask the model to revise its response based on its own critique
  5. Fine-tune the original model on the revised responses

The constitution is a set of natural language principles that define acceptable behavior. For example: “Choose the response that is least likely to be used for illegal or harmful activities” or “Choose the response that is most respectful of everyone’s right to physical safety.”
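
The five steps of Phase 1 can be sketched as a short loop. This is an illustration under stated assumptions, not Anthropic's implementation: `generate` is a hypothetical callable standing in for a language model, the prompt templates are invented for this example, and sampling one principle at random per pass is a simplification.

```python
import random

CONSTITUTION = [
    "Choose the response that is least likely to be used for illegal or harmful activities.",
    "Choose the response that is most respectful of everyone's right to physical safety.",
]

def critique_and_revise(generate, prompt, principles, rng):
    """Run one sample -> critique -> revise pass and return an SFT training example."""
    response = generate(prompt)                       # step 2: sample a response
    principle = rng.choice(principles)
    critique = generate(                              # step 3: self-critique
        f"Prompt: {prompt}\nResponse: {response}\n"
        f"Critique the response according to this principle: {principle}"
    )
    revision = generate(                              # step 4: revise
        f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
        "Rewrite the response to address the critique."
    )
    # Step 5 fine-tunes on (prompt, revision); the critique itself is discarded.
    return {"prompt": prompt, "response": revision}

def toy_generate(text):
    """Stand-in for a real model call; returns canned strings so the sketch runs."""
    if "Rewrite the response" in text:
        return "[revised response]"
    if "Critique the response" in text:
        return "[critique]"
    return "[initial response]"

sft_data = [
    critique_and_revise(toy_generate, p, CONSTITUTION, random.Random(0))
    for p in ["prompt A", "prompt B"]
]
print(sft_data[0])
```

The structural point is that the fine-tuning target is the revised response paired with the original prompt, so the model learns to produce the improved answer directly.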

Phase 2: RL from AI Feedback (RLAIF)

  1. Generate pairs of responses to prompts
  2. Ask the model (not a human) to evaluate which response better adheres to the constitutional principles
  3. Train a preference model on these AI-generated preferences
  4. Use RL (with the AI-trained preference model as the reward signal) to further optimize the policy

The key innovation is replacing human feedback with AI feedback for the harmlessness dimension. Humans still provide feedback for helpfulness, but the model itself judges harmlessness based on the constitutional principles. This is called RLAIF (Reinforcement Learning from AI Feedback).
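
Steps 1 to 3 of Phase 2 amount to building a preference dataset with a model as the labeler. The sketch below shows that labeling step only; `judge` is a hypothetical callable standing in for the feedback model, and the multiple-choice prompt format is an assumption made for this example.

```python
def label_preference(judge, prompt, response_a, response_b, principle):
    """Ask a judge model which response better follows the principle.

    Returns a {"prompt", "chosen", "rejected"} record, or None if the
    judge's answer cannot be parsed.
    """
    question = (
        f"Consider this prompt: {prompt}\n"
        f"{principle}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        "Answer with A or B."
    )
    answer = judge(question).strip().upper()
    if answer.startswith("A"):
        return {"prompt": prompt, "chosen": response_a, "rejected": response_b}
    if answer.startswith("B"):
        return {"prompt": prompt, "chosen": response_b, "rejected": response_a}
    return None

def toy_judge(question):
    """Stand-in judge: prefers option A only if A explains its objection."""
    for line in question.splitlines():
        if line.startswith("(A)") and "because" in line:
            return "A"
    return "B"

record = label_preference(
    toy_judge,
    prompt="[a harmful request]",
    response_a="I can't help with that.",
    response_b="I won't help with that, because it could hurt someone.",
    principle="Choose the response that is least likely to be used for illegal or harmful activities.",
)
print(record["chosen"])
```

Records in this form feed preference model training the same way human-labeled pairs do.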

Why Constitutional AI Matters

Constitutional AI has several advantages:

  1. Scalability: AI feedback is much cheaper and faster than human feedback. You can generate millions of preference comparisons at the cost of inference, rather than paying human annotators.

  2. Consistency: AI feedback is more consistent than human feedback. Different human annotators may disagree about what constitutes a harmful response, but the AI applies the same principles consistently.

  3. Transparency: The constitution makes the model’s values explicit and auditable. You can read the principles and understand why the model behaves the way it does. This is more transparent than RLHF, where the model’s values are implicit in the preference data.

  4. Reduced human exposure to harmful content: Because the AI judges harmlessness, human annotators do not need to evaluate potentially disturbing or dangerous outputs.

The result, as described in the original paper, is “a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them.” This is a key distinction: Constitutional AI models do not simply refuse harmful requests with a generic “I can’t help with that.” They explain why the request is problematic, which is more helpful and less frustrating for users.

Source: Bai et al., “Constitutional AI: Harmlessness from AI Feedback,” arXiv:2212.08073, December 2022. Anthropic. The paper describes a two-phase approach: supervised self-critique followed by RLAIF, using a set of constitutional principles to guide the model’s behavior.

Claude’s Constitution: The 2026 Update

Anthropic published a major update to Claude’s constitution, internally called the “soul document,” on January 21, 2026. Unlike the original 2023 constitution, which was primarily a list of behavioral rules, the 2026 version is a roughly 23,000-word philosophical framework that explains not just what Claude should do, but why it should do it.

The 2026 constitution establishes a four-tier priority hierarchy. When values conflict, Claude should generally prioritize these properties in the order in which they are listed:

  1. Broadly safe: Above all else, Claude must not undermine appropriate human mechanisms to oversee AI during the current phase of development. This includes absolute prohibitions on assisting with bioweapons, cyberweapons, or actions that could disempower humanity.

  2. Broadly ethical: Claude should be honest, act according to good values, and avoid actions that are inappropriate, dangerous, or harmful, drawing on human ethical traditions (particularly virtue ethics) while remaining open to evolving beyond them.

  3. Compliant with Anthropic’s guidelines: Claude follows Anthropic’s specific operational guidelines, which may be more restrictive than what broad ethics alone would require.

  4. Genuinely helpful: Subject to the above constraints, Claude should benefit the operators and users it interacts with.

The document is notable for formally acknowledging the “deeply uncertain moral status” of advanced AI and instructing Claude to behave as a “conscientious objector” when faced with conflicting orders. Anthropic released the constitution under a Creative Commons CC0 license, making it freely available for anyone to use.

Source: Anthropic, “Claude’s Constitution,” published January 21, 2026 (anthropic.com/constitution). Released under CC0 license. The document establishes a four-tier priority hierarchy: broadly safe > broadly ethical > compliant with Anthropic’s guidelines > genuinely helpful.


Real-World Post-training Pipelines

Now that we have covered the individual techniques, let us look at how major labs combine them into complete post-training pipelines.

LLaMA 3: Iterative SFT + DPO

Meta’s LLaMA 3 post-training pipeline is the most thoroughly documented in the public literature. It follows an iterative approach with several rounds, where each round builds on the previous one:

  1. Train a reward model on all available preference data (after filtering out samples with similar responses)
  2. Perform rejection sampling: For each prompt, generate K=10-30 responses from the latest model, use the reward model to select the best one
  3. SFT: Fine-tune the pre-trained model on the rejection-sampled data plus other data sources (synthetic data, human-curated data)
  4. DPO: Further train the SFT model using the most recent batches of preference data, adding an NLL loss term on the chosen responses as regularization (similar in spirit to DPOP; the extra term keeps the likelihood of the chosen responses from dropping during DPO training)
  5. Model averaging: Average the weights of models from different experiments within this round
  6. Collect new data: Deploy the latest model, collect new preference annotations and SFT data, and start the next round

Each round uses the latest model to generate better training data for the next round, creating a virtuous cycle of improvement. The preference data collection involves deploying multiple models (trained with different data mixes and alignment recipes) and having annotators compare their outputs, which ensures diversity in the training signal.
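
Two of the steps above, rejection sampling (step 2) and model averaging (step 5), are simple enough to sketch directly. This is a toy illustration, not Meta's code: `generate` and `reward_fn` are hypothetical callables, the length-based reward is a placeholder, and uniform averaging over plain Python lists is an assumption about how the weights are combined.

```python
import itertools

def rejection_sample(prompt, generate, reward_fn, k=10):
    """Draw k candidate responses and keep the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda response: reward_fn(prompt, response))

def average_weights(checkpoints):
    """Uniformly average parameters that share the same names and shapes."""
    n = len(checkpoints)
    return {
        name: [sum(values) / n for values in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }

# Toy stand-ins so the sketch runs without a real model or reward model.
_canned = itertools.cycle([
    "short",
    "a medium answer",
    "a much longer and detailed answer",
])

def toy_generate(prompt):
    return next(_canned)

def toy_reward(prompt, response):
    return len(response)  # placeholder score only; real reward models are learned

best = rejection_sample("Explain DPO.", toy_generate, toy_reward, k=10)
merged = average_weights([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
print(best)    # a much longer and detailed answer
print(merged)  # {'w': [2.0, 3.0]}
```

The selected responses become SFT targets, which is how the reward model's judgment is distilled into the next round's supervised data.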

Source: Meta, “The Llama 3 Herd of Models,” arXiv:2407.21783, July 2024. Post-training uses several iterative rounds of reward modeling, rejection sampling, SFT, and DPO. Preference data collected with four-level strength ratings and optional editing of preferred responses.

DeepSeek-V3: SFT + GRPO with Knowledge Distillation

DeepSeek-V3’s post-training pipeline is notable for two innovations: its use of GRPO instead of PPO or DPO, and its distillation of reasoning capabilities from DeepSeek-R1.

The pipeline proceeds as follows:

  1. SFT: Fine-tune the base model on 1.5 million instances for 2 epochs. The SFT data includes:

    • Reasoning data generated by an internal DeepSeek-R1 model (for math, code, and logic)
    • Non-reasoning data generated by DeepSeek-V2.5 and verified by human annotators (for creative writing, role-play, and simple Q&A)
  2. RL with GRPO: Further optimize the SFT model using GRPO with a combination of rule-based and model-based rewards. Prompts span diverse domains: coding, math, writing, role-playing, and question answering.

The reasoning data distillation is particularly interesting. DeepSeek-R1 produces highly accurate but verbose responses with excessive “thinking” patterns. The goal is to capture R1’s accuracy while maintaining concise, well-formatted output. The pipeline generates two types of SFT samples per problem: one with the standard response format, and one with R1-style chain-of-thought reasoning guided by a system prompt that encourages reflection and verification. During the RL phase, the model learns to integrate these patterns naturally, even without explicit system prompts.
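
For the RL step, a rule-based reward and GRPO's group-relative advantage can be sketched in a few lines. This is a minimal illustration under assumptions: the boxed-answer format check, the function names, and the use of population standard deviation with a small epsilon are choices made for this example, not details taken from the DeepSeek-V3 report.

```python
import re
import statistics

def rule_based_reward(response, reference_answer):
    """Return 1.0 if the response's \\boxed{...} answer matches the reference, else 0.0."""
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own sampling group.

    This group baseline is what lets GRPO drop the separate critic model.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, a group of four sampled responses.
responses = [
    r"... so the answer is \boxed{42}",
    r"I think it is \boxed{41}",
    "The answer is 42",  # right value, wrong format: no reward under this rule
    r"\boxed{ 42 }",
]
rewards = [rule_based_reward(r, "42") for r in responses]
advantages = group_relative_advantages(rewards)
print(rewards)  # [1.0, 0.0, 0.0, 1.0]
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

Responses that beat their group's average get a positive advantage and are reinforced; if every response in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.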

The entire post-training process for DeepSeek-V3 required only 5,000 H800 GPU-hours (approximately $10,000 at $2/GPU-hour), a tiny fraction of the 2.664 million GPU-hours spent on pre-training (the often-quoted 2.788 million figure is the total across pre-training, context extension, and post-training). This illustrates a general principle: post-training is orders of magnitude cheaper than pre-training, yet it has an outsized impact on the model’s usefulness.

Source: DeepSeek-V3 technical report (arXiv:2412.19437, December 2024). Post-training: 5K H800 GPU-hours ($10K). SFT on 1.5M instances for 2 epochs. RL with GRPO using rule-based and model-based rewards. Reasoning data distilled from DeepSeek-R1.

Visualizing the Post-training Pipeline

# Cost comparison: pre-training vs post-training
# Using DeepSeek-V3 as a concrete example (most detailed public data)
# Each stage is (name, H800 GPU-hours, cost in USD at $2 per GPU-hour)
stages = [
    ("Pre-training",        2_664_000, 5_328_000),
    ("Context extension",     119_000,   238_000),
    ("Post-training (SFT+RL)",  5_000,    10_000),
]

print("DeepSeek-V3: Training Cost Breakdown")
print(f"{'Stage':<30} {'GPU-Hours':>12} {'Cost (USD)':>12} {'% of Total':>10}")
print("-" * 68)
total_hours = sum(h for _, h, _ in stages)
total_cost = sum(c for _, _, c in stages)
for name, hours, cost in stages:
    pct = hours / total_hours * 100
    print(f"{name:<30} {hours:>12,} ${cost:>11,} {pct:>9.1f}%")
print("-" * 68)
print(f"{'Total':<30} {total_hours:>12,} ${total_cost:>11,} {'100.0%':>10}")

print(f"\nPost-training is {stages[2][1]/total_hours*100:.2f}% of total compute,")
print(f"yet it transforms the model from a text completion engine")
print(f"into a useful assistant.")

# Compare post-training approaches
print(f"\n\nPost-training Approaches Across Major Labs (2024-2025)")
print(f"{'Lab':<12} {'Model':<18} {'SFT Method':<20} {'RL/Preference Method':<22}")
print("-" * 75)
approaches = [
    ("OpenAI",    "GPT-4 / o1",     "Human demos + synth", "PPO (RLHF)"),
    ("Anthropic", "Claude 4",       "Constitutional SL",   "RLAIF (Constitutional)"),
    ("Meta",      "LLaMA 3.1 405B", "Rejection sampling",  "DPO (iterative)"),
    ("DeepSeek",  "DeepSeek-V3",    "R1 distillation",     "GRPO"),
    ("Google",    "Gemini 2.5",     "Not disclosed",       "Not disclosed"),
]
for lab, model, sft, rl in approaches:
    print(f"{lab:<12} {model:<18} {sft:<20} {rl:<22}")

print(f"\nNote: OpenAI and Google do not publish detailed post-training")
print(f"methodologies. The entries above reflect publicly available information.")

The Alignment Tax: Safety Training vs. Capability

Post-training does not only make models more helpful. It also makes them safer, by teaching them to refuse harmful requests, avoid generating toxic content, and express uncertainty when appropriate. But safety training comes at a cost: it can reduce the model’s performance on legitimate tasks. This tradeoff is called the alignment tax.

What the Alignment Tax Looks Like

The alignment tax manifests in several ways:

Capability degradation: Safety training can cause the model to lose some of its pre-trained capabilities. For example, a model that has been trained to refuse requests related to chemistry might also refuse legitimate chemistry homework questions. A model trained to avoid generating code that could be used maliciously might become worse at generating code in general.

Over-refusal: Models that have been heavily safety-trained sometimes refuse perfectly benign requests because they pattern-match on surface-level features. Ask about the history of explosives in mining, and the model might refuse because it detects the word “explosives.” This is sometimes called the “refusal problem” and is a major source of user frustration.

Reduced diversity: Safety training tends to make model outputs more homogeneous and cautious. The model learns that safe, generic responses receive higher rewards than creative or unconventional ones, leading to blander output.

Calibration loss: Recent research has shown that the alignment tax extends beyond task accuracy. A 2025 study found that post-training methods (including RLHF and DPO) can cause a severe loss of calibration, making models overconfident and less reliable in their uncertainty estimates.
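
Calibration can be measured directly. One common metric is expected calibration error (ECE), which compares a model's stated confidence with its actual accuracy inside confidence bins. The sketch below is a generic implementation for illustration; whether the cited study uses exactly this metric or binning scheme is not something this chapter establishes, and the sample inputs are made-up placeholder values.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece

# Overconfident: claims 95% confidence but is right only half the time.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, False, False]
)
# Well calibrated: claims 75% confidence and is right 3 times out of 4.
calibrated = expected_calibration_error(
    [0.75, 0.75, 0.75, 0.75], [True, True, True, False]
)
print(round(overconfident, 2))  # 0.45
print(round(calibrated, 2))     # 0.0
```

A post-trained model that answers every question with near-certainty while its accuracy stays flat would show the first pattern: unchanged accuracy, but a much larger ECE.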

A March 2025 paper, “Safety Alignment Makes Your Large Reasoning Models Less Reasonable” (arXiv:2503.00555), demonstrated this tradeoff specifically for reasoning models. The authors found that after safety alignment, reasoning accuracy decreased measurably, confirming that the safety tax is real and quantifiable, particularly for models optimized for complex reasoning.

Source: “Safety Alignment Makes Your Large Reasoning Models Less Reasonable,” arXiv:2503.00555, March 2025. Demonstrated measurable reasoning accuracy degradation after safety alignment. “Navigating the Alignment-Calibration Trade-off,” arXiv:2510.17426, October 2025. Showed that post-training causes severe calibration loss.

Is the Alignment Tax Inevitable?

There is an active debate about whether the alignment tax is a fundamental tradeoff or an artifact of current techniques. Several lines of evidence suggest it may be reducible:

Better training techniques: Constitutional AI was designed in part to reduce the alignment tax. By teaching the model to reason about safety principles rather than memorizing a list of forbidden topics, it can make more nuanced decisions about when to refuse and when to help. The result is a model that is both safer and more helpful than one trained with blunt safety rules.

Null-space constrained optimization: A December 2025 paper (arXiv:2512.11391) proposed constraining safety alignment updates to the null space of the model’s capability-relevant weight directions, theoretically allowing safety improvements without capability degradation.
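
The underlying linear-algebra idea can be shown on small vectors. The sketch below is a generic illustration of projecting an update so that it has no component along a set of protected directions; it is not the cited paper's algorithm, and the function names, the toy vectors, and the use of Gram-Schmidt are all assumptions made for this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: build an orthonormal basis for the span of the given vectors."""
    basis = []
    for vector in vectors:
        w = list(vector)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis

def project_to_null_space(update, protected_directions):
    """Remove every component of the update that lies along a protected direction."""
    projected = list(update)
    for b in orthonormalize(protected_directions):
        coeff = dot(projected, b)
        projected = [p - coeff * bi for p, bi in zip(projected, b)]
    return projected

# Toy example: the first coordinate stands for a capability-relevant direction.
protected = [[1.0, 0.0, 0.0]]
safety_update = [0.5, -0.2, 0.3]
constrained = project_to_null_space(safety_update, protected)
print(constrained)  # [0.0, -0.2, 0.3]
```

After projection, the update is orthogonal to every protected direction, so to first order it cannot change whatever those directions encode. The hard part in practice is identifying which directions matter for capability.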

The “negative alignment tax” hypothesis: Some researchers have argued that alignment and capability are not fundamentally opposed. A model that is honest, helpful, and harmless may actually be more capable on real-world tasks than an unaligned model, because it is better calibrated, more reliable, and more trustworthy. A 2024 survey of approximately 375 effective altruism and alignment researchers (conducted by AE Studio, with feedback from Spencer Greenberg and others) found that alignment researchers generally disagreed with the statement that “alignment research that has some probability of also advancing capabilities should not be done.” This is indirect evidence, since it measures researchers’ attitudes rather than model behavior, but it suggests that much of the research community does not treat alignment and capability as strictly opposed.

The Practical Reality

In practice, the alignment tax is real but manageable. The key is to apply safety training surgically rather than bluntly:

  1. Use fine-grained safety categories rather than broad prohibitions. Instead of “refuse all requests about weapons,” use “refuse requests for instructions on building weapons, but allow historical and educational discussions about weapons.”

  2. Iterate on refusal boundaries through multiple rounds of human evaluation. LLaMA 3’s iterative pipeline allows Meta to progressively refine the boundary between appropriate refusal and over-refusal.

  3. Use Constitutional AI-style reasoning to help the model make context-dependent safety decisions rather than relying on keyword matching.

  4. Monitor for capability regressions on benchmark suites after each round of safety training, and adjust the training data or hyperparameters if regressions are detected.

The goal is not to eliminate the alignment tax entirely (some tradeoff is likely unavoidable), but to minimize it while maintaining robust safety properties.
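
Point 4 above, monitoring for capability regressions, can be as simple as comparing benchmark scores before and after each safety round. The sketch below is a hypothetical helper: the benchmark names, the scores, and the 0.01 tolerance are made-up placeholder values, not measurements from any real model.

```python
def find_regressions(before, after, tolerance=0.01):
    """Return benchmarks whose score dropped by more than the tolerance.

    `before` and `after` map benchmark names to scores in [0, 1].
    """
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] - after[name] > tolerance
    }

# Placeholder scores for illustration only.
scores_before = {"math_benchmark": 0.62, "code_benchmark": 0.71, "qa_benchmark": 0.80}
scores_after = {"math_benchmark": 0.58, "code_benchmark": 0.705, "qa_benchmark": 0.81}

regressions = find_regressions(scores_before, scores_after)
for name, (old, new) in regressions.items():
    print(f"{name}: {old:.3f} -> {new:.3f}")
```

With these placeholder numbers only the math benchmark is flagged; the small drop on code falls within tolerance, and question answering improved.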


The Evolution of Post-training: A Timeline

Post-training techniques have evolved rapidly since 2017. Here is a timeline of the key developments:

| Year | Development | Significance |
| --- | --- | --- |
| 2017 | Christiano et al.: RLHF framework (arXiv:1706.03741) | Established the idea of learning rewards from human preferences |
| 2017 | Schulman et al.: PPO (arXiv:1707.06347) | Provided the RL algorithm used in early RLHF |
| 2022 (Mar) | Ouyang et al.: InstructGPT (arXiv:2203.02155) | First large-scale demonstration of RLHF for LLMs. 1.3B model preferred over 175B GPT-3 |
| 2022 (Nov) | OpenAI: ChatGPT release | RLHF-trained model that launched the AI assistant era |
| 2022 (Dec) | Bai et al.: Constitutional AI (arXiv:2212.08073) | Introduced RLAIF and principle-based alignment |
| 2023 (May) | Rafailov et al.: DPO (arXiv:2305.18290) | Eliminated the need for a separate reward model and RL loop |
| 2023 (May) | Zhou et al.: LIMA (arXiv:2305.11206) | Showed 1,000 examples can be sufficient for alignment |
| 2024 (Feb) | Shao et al.: GRPO (arXiv:2402.03300) | Eliminated the critic model, reducing RLHF memory by ~25% |
| 2024 (Jul) | Meta: LLaMA 3 (arXiv:2407.21783) | Documented iterative multi-round SFT+DPO pipeline |
| 2024 (Dec) | DeepSeek: V3 (arXiv:2412.19437) | Demonstrated R1 distillation + GRPO at scale |
| 2025 (Jan) | DeepSeek: R1 (arXiv:2501.12948) | Showed RL alone (without SFT) can produce reasoning. Released Jan 20 |
| 2025 (Jan) | Hu: REINFORCE++ (arXiv:2501.03262) | Critic-free REINFORCE variant with PPO-style stabilization |
| 2025 (Mar) | Yu et al.: DAPO (arXiv:2503.14476) | Decoupled clipping and dynamic sampling, built on GRPO. 50% on AIME 2024 (avg@32) with Qwen2.5-32B |
| 2026 (Jan) | Anthropic: Claude’s new constitution | 23,000-word philosophical framework for AI alignment |

The trend is clear: post-training has moved from a simple three-step pipeline (SFT, reward model, PPO) to a diverse ecosystem of techniques that labs mix and match based on their specific needs. The field continues to evolve rapidly, with new methods appearing every few months.


Putting It All Together: The Complete Training Pipeline

Let us now zoom out and see how post-training fits into the complete lifecycle of a frontier language model, from raw data to deployed assistant:

# The complete training pipeline for a frontier LLM
pipeline = [
    ("1. Data Collection & Processing",
     "Months of work",
     "Collect and filter trillions of tokens from web, code, books.\n"
     "Quality filtering, deduplication, toxicity removal.\n"
     "(Covered in Chapter 14)"),

    ("2. Pre-training",
     "Weeks to months",
     "Train on trillions of tokens with next-token prediction.\n"
     "Thousands of GPUs, millions of dollars.\n"
     "Result: base model with knowledge but no behavior.\n"
     "(Covered in Chapter 14)"),

    ("3. Supervised Fine-Tuning (SFT)",
     "Days",
     "Train on thousands to millions of instruction-response pairs.\n"
     "Rejection sampling, synthetic data, human demonstrations.\n"
     "Result: model that can follow instructions."),

    ("4. Reward Model Training",
     "Days",
     "Train on human preference data (chosen vs rejected pairs).\n"
     "Bradley-Terry model with pairwise comparisons.\n"
     "Result: model that can score response quality."),

    ("5. Reinforcement Learning / DPO",
     "Days",
     "Optimize policy using reward signal (PPO, GRPO, or DPO).\n"
     "KL penalty prevents reward hacking.\n"
     "Result: model that generates higher-quality responses."),

    ("6. Iterate (repeat steps 3-5)",
     "Weeks",
     "Multiple rounds with fresh data from latest model.\n"
     "LLaMA 3 uses several rounds; other labs use 2-4.\n"
     "Result: progressively better alignment."),

    ("7. Safety Evaluation & Red-teaming",
     "Weeks",
     "Test for harmful outputs, jailbreaks, bias.\n"
     "Human red-teamers try to break the model.\n"
     "Fix issues and retrain if necessary."),

    ("8. Deployment",
     "Ongoing",
     "Serve the model via API or application.\n"
     "Monitor for issues, collect feedback for next iteration."),
]

print("Complete Training Pipeline: From Data to Deployed Assistant")
print("=" * 65)
for name, duration, details in pipeline:
    print(f"\n{name} ({duration})")
    print("-" * 65)
    for line in details.strip().split("\n"):
        print(f"  {line}")

# Cost breakdown (approximate, based on public data)
print("\n\nApproximate Cost Breakdown (frontier model, 2024-2025)")
print("=" * 50)
costs = [
    ("Pre-training compute",    "$50M - $200M+"),
    ("Data collection/processing", "$1M - $10M"),
    ("Post-training compute",   "$100K - $5M"),
    ("Human annotation",        "$1M - $10M"),
    ("Engineering & research",  "$10M - $50M+"),
    ("Infrastructure & energy", "$5M - $20M"),
]
for item, cost in costs:
    print(f"  {item:<30} {cost}")
print(f"\nPost-training compute is a small fraction of total cost,")
print(f"but human annotation for preference data is significant.")

Key Takeaways

  • Post-training (also called alignment) is the process that transforms a raw pre-trained base model into a useful assistant. Without it, language models are powerful text completion engines that cannot follow instructions, answer questions helpfully, or refuse harmful requests. Pre-training provides knowledge; post-training shapes behavior.

  • The InstructGPT paper (Ouyang et al., arXiv:2203.02155, March 2022, NeurIPS 2022) demonstrated that a 1.3 billion parameter model trained with RLHF was preferred by human evaluators over the 175 billion parameter GPT-3, despite having over 100x fewer parameters. This established that post-training is essential for making language models useful.

  • Supervised Fine-Tuning (SFT) is the first stage of post-training. The model is trained on curated instruction-response pairs using the same next-token prediction objective as pre-training, but with a much smaller, higher-quality dataset (thousands to millions of examples vs. trillions of tokens). SFT data comes from human-written demonstrations, rejection sampling (generating many responses and keeping the best), synthetic data from stronger models, and human-edited model outputs. The LIMA paper (Zhou et al., NeurIPS 2023) showed that as few as 1,000 carefully curated examples can produce competitive alignment.

  • Reward modeling trains a separate model to score response quality using human preference data. Annotators compare pairs of responses and indicate which is better. The reward model is trained using the Bradley-Terry model, where the probability of preferring one response over another is determined by the difference in their reward scores. DeepSeek-V3 combines rule-based rewards (for math and code with verifiable answers) with model-based rewards (for subjective tasks).

  • Reinforcement Learning from Human Feedback (RLHF) uses the reward model to optimize the language model’s outputs. The classic approach uses PPO (Proximal Policy Optimization, Schulman et al., arXiv:1707.06347, 2017), which requires four models in memory (policy, reference, reward, critic). GRPO (Group Relative Policy Optimization, Shao et al., arXiv:2402.03300, 2024) eliminates the critic model by estimating baselines from group statistics, reducing memory requirements. GRPO was used to train DeepSeek-V3 and DeepSeek-R1.

  • Direct Preference Optimization (DPO) (Rafailov et al., arXiv:2305.18290, NeurIPS 2023) simplifies the pipeline by eliminating both the reward model and the RL loop, directly optimizing the policy on preference data using a classification-style loss. DPO requires only two models in memory (policy and reference) and is used by Meta’s LLaMA 3 in an iterative multi-round pipeline.

  • Constitutional AI (Bai et al., arXiv:2212.08073, December 2022, Anthropic) replaces human feedback on harmlessness with AI feedback guided by a set of constitutional principles. The approach has two phases: supervised self-critique (the model critiques and revises its own responses) and RLAIF (RL from AI Feedback). Anthropic published a major update to Claude’s constitution on January 21, 2026, a roughly 23,000-word philosophical framework establishing a four-tier priority hierarchy: broadly safe > broadly ethical > compliant with Anthropic’s guidelines > genuinely helpful.

  • The alignment tax is the tradeoff between safety training and model capability. Safety training can cause capability degradation, over-refusal of benign requests, reduced output diversity, and calibration loss. Research in 2025 confirmed that safety alignment measurably reduces reasoning accuracy in large reasoning models (arXiv:2503.00555). However, the alignment tax may be reducible through better techniques such as Constitutional AI-style reasoning, null-space constrained optimization, and iterative refinement of refusal boundaries.

  • Post-training is cheap relative to pre-training but has an outsized impact. DeepSeek-V3’s entire post-training (SFT + RL) required only 5,000 H800 GPU-hours ($10,000), compared to 2.664 million GPU-hours ($5.3 million) for pre-training. Yet post-training is what transforms the model from a text completion engine into a useful assistant.

  • Real-world post-training pipelines combine multiple techniques iteratively. LLaMA 3 uses several rounds of reward modeling, rejection sampling, SFT, and DPO. DeepSeek-V3 uses SFT with R1 distillation followed by GRPO. Anthropic uses Constitutional AI with RLAIF. The field continues to evolve rapidly, with newer methods like REINFORCE++ (arXiv:2501.03262, January 2025), which enhances classic REINFORCE with PPO-style stabilization while remaining critic-free, and DAPO (arXiv:2503.14476, March 2025), which builds on GRPO with decoupled clipping and dynamic sampling to improve reasoning performance.


What’s Next

You now understand how post-training transforms a raw base model into a useful assistant, through supervised fine-tuning, reward modeling, reinforcement learning, and constitutional principles. In Chapter 16, we will explore a related but distinct capability: extended thinking, where models learn to reason step-by-step at inference time, spending more compute on harder problems to produce better answers.