Reinforcement learning is a way to fine-tune a pretrained transformer so it learns to prefer better responses. Instead of only imitating data, the model practices generating answers, gets feedback on how good those answers are, and then updates itself to make higher-scoring answers more likely in the future.

At a high level, RL involves asking the model a question, letting it generate multiple responses, deciding which responses are good, and then updating the model based on that feedback. RL uses the term rollout for a single prompt-response pair; when you ask ChatGPT a question and get an answer, that is one rollout.

The process of reinforcement learning looks like this:

  1. Collect feedback
    • The pretrained transformer generates multiple rollouts (responses) for a given prompt.
    • Those rollouts are judged by humans (RLHF), by automated tools (math solvers, checking whether the code runs), or by other techniques.
  2. Train a reward model (sometimes)
    • If humans provided rankings, those judgments can be used to train a separate reward model: a transformer that takes in a response and outputs a score (see the sketch after this list).
    • The reward model automates feedback so humans don’t have to evaluate every rollout.
    • Some methods, like DPO (covered below), skip the reward model step entirely.
  3. Fine-tune the transformer
    • The pretrained transformer generates new rollouts.
    • Each rollout is scored (by the reward model, automated checks, or comparison methods).
    • An optimization algorithm updates the model’s parameters so higher-scoring rollouts become more likely next time.
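
To make step 2 concrete, here is a minimal sketch of how a reward model can be trained from rankings, using the pairwise (Bradley-Terry style) loss common in RLHF. The tiny network and the random “embeddings” below are stand-ins for a real transformer and real encoded responses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a reward model: in practice this would be a transformer
# whose final hidden state is projected down to a single scalar score.
reward_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Fake "embeddings" of chosen/rejected responses; real training would encode
# the prompt + response text with the transformer itself.
chosen = torch.randn(8, 16)    # responses the humans preferred
rejected = torch.randn(8, 16)  # responses the humans ranked lower

for _ in range(100):
    score_chosen = reward_model(chosen)
    score_rejected = reward_model(rejected)
    # Pairwise ranking loss: push the preferred response's score
    # above the rejected response's score.
    loss = -F.logsigmoid(score_chosen - score_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```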

Each cycle of rollout, scoring, and gradient update in Step 3 is often called an RL step. Over many RL steps, the model gradually learns to produce responses that are more helpful, more correct, and better aligned with what people want.
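
Here is a heavily simplified sketch of that loop. The “policy” is just a categorical distribution over a toy vocabulary instead of a real transformer, and the reward function is a made-up placeholder, but the shape of an RL step is the same: generate rollouts, score them, and update so higher-scoring rollouts become more likely.

```python
import torch

vocab_size = 10  # toy "vocabulary"; a real model would generate token sequences
logits = torch.zeros(vocab_size, requires_grad=True)  # stand-in for model parameters
optimizer = torch.optim.Adam([logits], lr=0.1)

def reward_fn(token: int) -> float:
    # Placeholder scorer: pretend higher token ids are "better" responses.
    # In practice this is a reward model, a math checker, unit tests, etc.
    return float(token)

for step in range(200):                         # many RL steps
    dist = torch.distributions.Categorical(logits=logits)
    rollouts = dist.sample((16,))               # generate a batch of rollouts
    rewards = torch.tensor([reward_fn(t.item()) for t in rollouts])
    advantages = rewards - rewards.mean()       # baseline reduces variance
    # REINFORCE-style objective: raise the log-prob of higher-scoring rollouts.
    loss = -(dist.log_prob(rollouts) * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Probability mass shifts toward the high-reward "responses".
print(torch.softmax(logits, dim=-1))
```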

Optimization algorithms

Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) is a “first-wave” RL algorithm that balances improvement with stability: each update is clipped so the new policy can’t drift too far from the policy that generated the rollouts. In the RLHF setting it typically relies on a reward model for scores and a separate value model (critic) for advantage estimates, which is part of why later methods try to simplify it.
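
As a rough sketch of PPO’s core idea, here is the clipped surrogate loss, assuming log-probabilities and advantage estimates have already been computed for a batch of rollouts:

```python
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped surrogate objective (the heart of PPO).

    new_logprobs: log-probs of the sampled actions under the current policy
    old_logprobs: log-probs under the policy that generated the rollouts
    advantages:   how much better each rollout was than expected
    """
    ratio = torch.exp(new_logprobs - old_logprobs)   # how much the policy changed
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic minimum so large policy jumps get no extra credit.
    return -torch.min(unclipped, clipped).mean()

# Example usage with dummy tensors:
new_lp = torch.randn(32, requires_grad=True)
old_lp = new_lp.detach() + 0.1 * torch.randn(32)
adv = torch.randn(32)
loss = ppo_clip_loss(new_lp, old_lp, adv)
loss.backward()
```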

Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO) optimizes the model directly on human preferences. There is no reward model; instead, feedback comes as paired preference data (which of two responses is better?). It isn’t as generalizable as PPO because there is no reward model to score new, unseen rollouts, but it is much simpler when you have lots of high-quality preference data. It also reduces the risk of unintended behavior, since the model directly follows the feedback it is given, without a reward model in the middle that might get creative.
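
A minimal sketch of the DPO loss itself, assuming we already have summed log-probabilities of the chosen and rejected responses under both the model being trained and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss on a batch of preference pairs.

    Each argument is the summed log-probability of a whole response
    (chosen = preferred by the labeler, rejected = the other one).
    """
    chosen_margin = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_margin = beta * (policy_rejected_lp - ref_rejected_lp)
    # Make the model prefer the chosen response more than the reference does.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Example usage with dummy log-probabilities:
policy_chosen = torch.randn(4, requires_grad=True)
policy_rejected = torch.randn(4, requires_grad=True)
ref_chosen = torch.randn(4)
ref_rejected = torch.randn(4)
loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()
```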

Llama-3 was fine-tuned using DPO, likely because Meta had lots of direct feedback from all of its social platforms.

Group Relative Policy Optimization (GRPO)

Group Relative Policy Optimization (GRPO) samples a group of responses to the same prompt and compares them against each other: each response’s reward is measured relative to the group average. This lets the model learn from relative comparisons, without the separate value (critic) model that PPO uses to estimate absolute expected reward.
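
The distinguishing piece of GRPO is how advantages are computed: score a group of responses to the same prompt, then normalize each reward against the group’s mean and standard deviation. A minimal sketch of that calculation:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """GRPO-style advantages for one group of rollouts.

    rewards: shape (group_size,), one score per response to the same prompt.
    Each response is judged relative to its siblings, not against a learned
    value estimate as in PPO.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 6 responses to the same math problem, scored 1.0 if correct else 0.0.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
print(group_relative_advantages(rewards))
# Correct answers get positive advantages, incorrect ones negative;
# the policy update then makes the positive ones more likely.
```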

DeepSeek-R1 was fine-tuned using GRPO.

Seminal papers

I stole this reading list from No Hype DeepSeek-R1 Reading List.