Reinforcement learning is a way to fine-tune a pretrained transformer so it learns to prefer better responses. Instead of only imitating data, the model practices generating answers, gets feedback on how good those answers are, and then updates itself to make higher-scoring answers more likely in the future.

At a high level, RL involves asking the model a question, letting it generate multiple responses, deciding which responses are good, and then updating the model based on that feedback. RL uses the term rollout for a single prompt-response pair; when you ask ChatGPT a question and get an answer, that is one rollout.

The process of reinforcement learning looks like this:

  1. Collect feedback
    • The pretrained transformer generates multiple rollouts (responses) for a given prompt.
    • Those rollouts are judged by humans (RLHF), by automated tools (math solvers, checking whether the code runs), or by other techniques.
  2. Train a reward model (sometimes)
    • If humans provided rankings, those judgments can be used to train a separate reward model: a transformer that takes in a response and outputs a score (see the sketch after this list).
    • The reward model automates feedback so humans don’t have to evaluate every rollout.
    • Some methods, like DPO (covered below), skip the reward model step entirely.
  3. Fine-tune the transformer
    • The pretrained transformer generates new rollouts.
    • Each rollout is scored (by the reward model, automated checks, or comparison methods).
    • An optimization algorithm updates the model’s parameters so higher-scoring rollouts become more likely next time.
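
To make step 2 concrete, here is a minimal sketch of how a reward model can be trained from rankings, using the pairwise (Bradley-Terry style) loss common in RLHF. The tiny network and the random “embeddings” below are stand-ins for a real transformer and real encoded responses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a reward model: in practice this would be a transformer
# whose final hidden state is projected down to a single scalar score.
reward_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Fake "embeddings" of chosen/rejected responses; real training would encode
# the prompt + response text with the transformer itself.
chosen = torch.randn(8, 16)    # responses the humans preferred
rejected = torch.randn(8, 16)  # responses the humans ranked lower

for _ in range(100):
    score_chosen = reward_model(chosen)
    score_rejected = reward_model(rejected)
    # Pairwise ranking loss: push the preferred response's score
    # above the rejected response's score.
    loss = -F.logsigmoid(score_chosen - score_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```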

Each cycle of rollout, scoring, and gradient update in Step 3 is often called an RL step. Over many RL steps, the model gradually learns to produce responses that are more helpful, more correct, and better aligned with what people want.
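
Here is a heavily simplified sketch of that loop. The “policy” is just a categorical distribution over a toy vocabulary instead of a real transformer, and the reward function is a made-up placeholder, but the shape of an RL step is the same: generate rollouts, score them, and update so higher-scoring rollouts become more likely.

```python
import torch

vocab_size = 10  # toy "vocabulary"; a real model would generate token sequences
logits = torch.zeros(vocab_size, requires_grad=True)  # stand-in for model parameters
optimizer = torch.optim.Adam([logits], lr=0.1)

def reward_fn(token: int) -> float:
    # Placeholder scorer: pretend higher token ids are "better" responses.
    # In practice this is a reward model, a math checker, unit tests, etc.
    return float(token)

for step in range(200):                         # many RL steps
    dist = torch.distributions.Categorical(logits=logits)
    rollouts = dist.sample((16,))               # generate a batch of rollouts
    rewards = torch.tensor([reward_fn(t.item()) for t in rollouts])
    advantages = rewards - rewards.mean()       # baseline reduces variance
    # REINFORCE-style objective: raise the log-prob of higher-scoring rollouts.
    loss = -(dist.log_prob(rollouts) * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Probability mass shifts toward the high-reward "responses".
print(torch.softmax(logits, dim=-1))
```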

Optimization algorithms

Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) is a “first-wave” RL algorithm that balances improvement with stability: each update is clipped so the new policy can’t drift too far from the policy that generated the rollouts. In the RLHF setting it typically relies on a reward model for scores and a separate value model (critic) for advantage estimates, which is part of why later methods try to simplify it.
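
As a rough sketch of PPO’s core idea, here is the clipped surrogate loss, assuming log-probabilities and advantage estimates have already been computed for a batch of rollouts:

```python
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped surrogate objective (the heart of PPO).

    new_logprobs: log-probs of the sampled actions under the current policy
    old_logprobs: log-probs under the policy that generated the rollouts
    advantages:   how much better each rollout was than expected
    """
    ratio = torch.exp(new_logprobs - old_logprobs)   # how much the policy changed
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic minimum so large policy jumps get no extra credit.
    return -torch.min(unclipped, clipped).mean()

# Example usage with dummy tensors:
new_lp = torch.randn(32, requires_grad=True)
old_lp = new_lp.detach() + 0.1 * torch.randn(32)
adv = torch.randn(32)
loss = ppo_clip_loss(new_lp, old_lp, adv)
loss.backward()
```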

Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO) optimizes the model directly on human preferences. There is no reward model; instead, feedback comes as paired preference data (which of two responses is better?). It isn’t as generalizable as PPO because there is no reward model to score new, unseen rollouts, but it is much simpler when you have lots of high-quality preference data. It also reduces the risk of unintended behavior, since the model directly follows the feedback it is given, without a reward model in the middle that might get creative.
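
A minimal sketch of the DPO loss itself, assuming we already have summed log-probabilities of the chosen and rejected responses under both the model being trained and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss on a batch of preference pairs.

    Each argument is the summed log-probability of a whole response
    (chosen = preferred by the labeler, rejected = the other one).
    """
    chosen_margin = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_margin = beta * (policy_rejected_lp - ref_rejected_lp)
    # Make the model prefer the chosen response more than the reference does.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Example usage with dummy log-probabilities:
policy_chosen = torch.randn(4, requires_grad=True)
policy_rejected = torch.randn(4, requires_grad=True)
ref_chosen = torch.randn(4)
ref_rejected = torch.randn(4)
loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()
```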

Llama-3 was fine-tuned using DPO, likely because Meta had lots of direct feedback from all of its social platforms.

Group Relative Policy Optimization (GRPO)

Group Relative Policy Optimization (GRPO) samples a group of responses to the same prompt and compares them against each other: each response’s reward is measured relative to the group average. This lets the model learn from relative comparisons, without the separate value (critic) model that PPO uses to estimate absolute expected reward.
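
The distinguishing piece of GRPO is how advantages are computed: score a group of responses to the same prompt, then normalize each reward against the group’s mean and standard deviation. A minimal sketch of that calculation:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """GRPO-style advantages for one group of rollouts.

    rewards: shape (group_size,), one score per response to the same prompt.
    Each response is judged relative to its siblings, not against a learned
    value estimate as in PPO.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 6 responses to the same math problem, scored 1.0 if correct else 0.0.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
print(group_relative_advantages(rewards))
# Correct answers get positive advantages, incorrect ones negative;
# the policy update then makes the positive ones more likely.
```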

DeepSeek-R1 was fine-tuned using GRPO.

Seminal papers

I stole this reading list from No Hype DeepSeek-R1 Reading List.