
The Evolution of GRPO: DAPO
Author(s): tangbasky
Originally published on Towards AI.
Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) is a reinforcement learning optimization algorithm for LLM post-training. To understand DAPO thoroughly, we will build it up step by step: PPO -> GRPO -> DAPO.
Proximal Policy Optimization (PPO)
The core of PPO is to limit the difference between the new and old policies during policy updates, preventing training from collapsing due to an overly large single update. Its main formula is:
J_{PPO}(θ) = E_t [ min( r_t(θ) \hat{A}_t, clip( r_t(θ), 1-ε, 1+ε ) \hat{A}_t ) ], where r_t(θ) = π_{θ}(o_t | q, o_{<t}) / π_{θ_{old}}(o_t | q, o_{<t})
where
- π_{θ} is the output probability of the new policy model.
- π_{θ_{old}} is the output probability of the old policy model.
- π_{θ}/π_{θ_{old}} is the importance sampling ratio, which keeps the new policy's distribution from drifting too far from the old one.
- ε is the clipping parameter for the importance ratio; it bounds how much the ratio, and therefore the policy, can change in a single update.
- \hat{A}_t is the advantage estimate, which is computed from the reward model and the value model (typically via Generalized Advantage Estimation, GAE).
- R_t is the score from the reward model.
- V is the estimate from the value model (the critic).
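To make the clipped objective concrete, here is a minimal PyTorch sketch of the PPO surrogate loss. The tensor names and the assumption that advantages are already computed (e.g., via GAE) are illustrative, not tied to any particular library API.

```python
import torch

def ppo_clip_loss(log_probs_new: torch.Tensor,
                  log_probs_old: torch.Tensor,
                  advantages: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate loss (to be minimized)."""
    # Importance sampling ratio pi_theta / pi_theta_old, computed in log space.
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Unclipped and clipped surrogate terms.
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # PPO maximizes the minimum of the two terms, so we minimize its negative.
    return -torch.min(surr1, surr2).mean()
```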

Figure 1 above compares model performance (evaluated on the AIME dataset) and generation entropy with and without the clipping mechanism. We can clearly see that with clipping added, both the model's performance and its entropy improve significantly.
Group Relative Policy Optimization (GRPO)
GRPO removes the value model and estimates the advantage in a group-relative manner, which greatly reduces the cost of training. The formula is as follows:
J_{GRPO}(θ) = E_{q, {o_i}_{i=1}^{G} ~ π_{θ_{old}}(·|q)} [ (1/G) Σ_{i=1}^{G} (1/|o_i|) Σ_{t=1}^{|o_i|} min( r_{i,t}(θ) \hat{A}_{i,t}, clip( r_{i,t}(θ), 1-ε, 1+ε ) \hat{A}_{i,t} ) - β D_{KL}( π_{θ} || π_{ref} ) ]

where r_{i,t}(θ) = π_{θ}(o_{i,t} | q, o_{i,<t}) / π_{θ_{old}}(o_{i,t} | q, o_{i,<t}) and \hat{A}_{i,t} = (R_i - mean({R_1, ..., R_G})) / std({R_1, ..., R_G}).
The main changes it makes are:
- Sampling each prompt multiple times to form a group, and using the group-normalized rewards as the advantage.
- Introducing a KL divergence term (against a reference policy) as regularization.
- Since GRPO training mainly targets mathematical or logical reasoning problems, its reward is rule-based rather than a learned model. For example:
R(\hat{y}, y) = 1 if is_equivalent(\hat{y}, y), and R(\hat{y}, y) = -1 otherwise,
where
- y is the standard answer.
- \hat{y} is the predicted answer.
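Below is a minimal sketch of the two ingredients above: a rule-based reward and the group-normalized advantage. The function names and the exact-match check are illustrative assumptions; a real verifier would parse and normalize answers rather than compare strings.

```python
import torch

def rule_based_reward(pred: str, answer: str) -> float:
    # Rule-based reward: 1 if the predicted answer matches the reference, -1 otherwise.
    return 1.0 if pred.strip() == answer.strip() else -1.0

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: shape (G,), one scalar reward per sampled response in the group."""
    # Normalize within the group: subtract the group mean, divide by the group std.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 responses to the same prompt, 2 correct and 2 wrong.
rewards = torch.tensor([1.0, -1.0, 1.0, -1.0])
print(group_relative_advantages(rewards))  # positive for correct, negative for wrong
```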
DAPO
The full DAPO objective is:

J_{DAPO}(θ) = E_{(q, a), {o_i}_{i=1}^{G} ~ π_{θ_{old}}(·|q)} [ (1 / Σ_{i=1}^{G} |o_i|) Σ_{i=1}^{G} Σ_{t=1}^{|o_i|} min( r_{i,t}(θ) \hat{A}_{i,t}, clip( r_{i,t}(θ), 1-ε_{low}, 1+ε_{high} ) \hat{A}_{i,t} ) ]

subject to 0 < |{ o_i | is_equivalent(a, o_i) }| < G.
Clip-Higher
From the formula above, we can clearly see that the single clipping parameter has been split into a low clip and a high clip. This is because:
- The high clip controls the exploration of the model, allowing the model to explore more tokens.
- The low clip ensures that the probability of a token does not decrease rapidly, maintaining the stability of high-probability tokens.
Here is an example. When ε = 0.2, suppose one action has probability π_{θ_{old}}(o_1|q) = 0.01 and another has π_{θ_{old}}(o_2|q) = 0.9. Since updates are clipped at π_{θ} ≤ π_{θ_{old}}(1+ε), the maximum probability π_{θ}(o_1|q) can reach in one update is (1+0.2) * 0.01 = 0.012, while for π_{θ}(o_2|q) it is (1+0.2) * 0.9 = 1.08, which is effectively no constraint at all. This means low-probability tokens have far less room to grow than high-probability tokens. Moreover, in the DAPO experiments, researchers found that the tokens that actually get clipped all have probabilities below π_{θ_{old}}(o_i|q) < 0.2. This confirms that the upper clip mainly limits the growth of low-probability tokens, thereby restricting the diversity of the model, as shown in Figure 2 below:

To address this issue, DAPO proposes high-low clipping. Specifically:
- ε_{low}: Used to limit the decrease in the probability of high-probability tokens, preventing the probability of a token from dropping too much. This value can be set smaller.
- ε_{high}: Used to limit the increase in the probability of low-probability tokens, allowing a larger exploration space. This value can be set larger.
In DAPO, ε_{low} < ε_{high}. Therefore:
- When \hat{A} > 0, the binding clipping boundary is 1 + ε_{high}. A larger ε_{high} keeps low-probability tokens from being clipped too early, so their probabilities can still be raised and the model can explore.
- When \hat{A} < 0, the binding clipping boundary is 1 - ε_{low}. Because ε_{low} is kept small, a token's probability cannot be pushed down too sharply in a single update, which keeps high-probability tokens stable.
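A minimal sketch of this decoupled clipping is given below, assuming per-token log-probabilities and advantages are precomputed; the default ε values follow those reported in the DAPO paper.

```python
import torch

def clip_higher_loss(log_probs_new: torch.Tensor,
                     log_probs_old: torch.Tensor,
                     advantages: torch.Tensor,
                     eps_low: float = 0.2,
                     eps_high: float = 0.28) -> torch.Tensor:
    """Asymmetric ('Clip-Higher') surrogate loss with clip range [1 - eps_low, 1 + eps_high]."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    surr1 = ratio * advantages
    # A larger eps_high leaves low-probability tokens more room to grow (exploration);
    # a smaller eps_low prevents probabilities from being pushed down too fast (stability).
    surr2 = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    return -torch.min(surr1, surr2).mean()
```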
Dynamic Sampling
In GRPO-style RL algorithms, each prompt is sampled multiple times to form a group. If all of the sampled responses are correct (group accuracy 1) or all of them are wrong (group accuracy 0), every response receives the same reward, so the group-normalized advantage \hat{A} is 0 for the whole group. When \hat{A} is 0, the group produces no gradient update, which reduces sample efficiency. As shown in Figure 3 below, as training progresses, the number of effective samples in a batch gradually decreases.

To address this issue, DAPO proposes Dynamic Sampling: groups whose accuracy is exactly 1 or 0 are filtered out, so that every sample in a batch contributes a non-zero gradient while the batch size stays constant. Before each training step, sampling continues until the batch is filled with groups whose accuracy is neither 0 nor 1.
Moreover, as the number of training steps increases, the model becomes more accurate, so more groups are filtered out and more resampling is required. Even so, the overall training time does not increase significantly; the improved sample efficiency instead accelerates the convergence of the model.
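A sketch of the filtering and resampling loop is shown below, under the assumption that sample_group(prompt) returns the per-response rewards of one group and that the prompt stream is effectively unlimited; the helper names are my own, not DAPO's actual implementation.

```python
def keep_group(rewards: list[float]) -> bool:
    """Keep a group only if its responses are neither all correct nor all wrong."""
    accuracy = sum(1.0 for r in rewards if r > 0) / len(rewards)
    return 0.0 < accuracy < 1.0

def fill_batch(sample_group, prompts, batch_size: int):
    """Keep sampling groups until the batch holds `batch_size` informative groups."""
    batch = []
    prompt_iter = iter(prompts)  # assumed to yield more prompts than we will ever need
    while len(batch) < batch_size:
        rewards = sample_group(next(prompt_iter))
        if keep_group(rewards):    # groups with accuracy 0 or 1 give zero advantage
            batch.append(rewards)  # -> no gradient, so they are filtered out
    return batch
```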
Token-Level Policy Gradient Loss
In the original GRPO, the loss is calculated at the sample level: token losses are first averaged within each sample, and the per-sample losses are then averaged across samples. As a result, tokens in very long samples carry a smaller effective weight and may be learned poorly. For example:
The token loss for a long output sample is:
L_{long} = (1/N_1) Σ_{i=1}^{N_1} l_i
The token loss for a short output sample is:
L_{short} = (1/N_2) Σ_{j=1}^{N_2} l_j
When the total loss is computed as the average (L_{long} + L_{short}) / 2, the two samples contribute equally even though N_1 > N_2, so each token of the long sample carries less weight and the long sample is not learned as well. Moreover, experiments in DAPO found that overly long outputs easily contain many repetitive or nonsensical tokens, which need more attention in the loss rather than less. Therefore, to weight every token equally, the total loss is changed to the following token-level form:
L_{token} = (1 / Σ_{i=1}^{G} |o_i|) Σ_{i=1}^{G} Σ_{t=1}^{|o_i|} l_{i,t}
The loss form of the above example becomes:
L = ( Σ_{i=1}^{N_1} l_i + Σ_{j=1}^{N_2} l_j ) / (N_1 + N_2)
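The difference between the two averaging schemes is easy to see numerically; the per-token losses below are arbitrary illustrative values.

```python
import torch

# Per-token losses for one long and one short response (illustrative values).
long_tokens = torch.randn(1000).abs()   # N1 = 1000 tokens
short_tokens = torch.randn(50).abs()    # N2 = 50 tokens

# GRPO (sample-level): average within each sample, then across samples.
sample_level = (long_tokens.mean() + short_tokens.mean()) / 2

# DAPO (token-level): average over all tokens at once, so every token has equal weight.
token_level = torch.cat([long_tokens, short_tokens]).mean()

print(sample_level, token_level)
# Under the sample-level loss, each token of the long response carries a weight of
# 1/(2*1000) versus 1/(2*50) for the short response; under the token-level loss,
# every token carries the same weight of 1/1050.
```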
Overlong Reward Shaping
During the training of large language models (LLMs), a max_token limit is usually set to cap the maximum generation length, and samples exceeding this length are truncated. However, if the reward for truncated samples is designed improperly, it can introduce reward noise that seriously interferes with training.
In previous methods, RL would directly penalize such truncated samples. This can introduce noise during training, because an otherwise reasonable answer may be penalized simply for being too long, which seriously interferes with the model's effective generation. DAPO instead introduces a penalty transition interval, defined by the following formula:
R_{length}(y) = 0, if |y| ≤ L_{max} - L_{cache}
R_{length}(y) = ((L_{max} - L_{cache}) - |y|) / L_{cache}, if L_{max} - L_{cache} < |y| ≤ L_{max}
R_{length}(y) = -1, if L_{max} < |y|
where
- L_{cache} is the penalty transition interval.
- L_{max} is the maximum allowed length.
- |y| is the actual length of the text.
When |y| + L_{cache} ≤ L_{max}, the output is within the soft limit and no penalty is applied. When |y| + L_{cache} > L_{max} and |y| ≤ L_{max}, a linearly increasing penalty is applied. When |y| > L_{max}, the maximum penalty of -1 is applied. Figure 4 below shows the effect of applying overlong filtering.
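The penalty is a simple piecewise-linear function; here is a sketch with variable names mirroring the formula above (the example lengths and limits are illustrative assumptions, not values from the paper).

```python
def overlong_penalty(length: int, L_max: int, L_cache: int) -> float:
    """Soft length penalty added to the reward of a response with `length` tokens."""
    if length <= L_max - L_cache:
        return 0.0                                    # within the soft limit: no penalty
    elif length <= L_max:
        # Linear ramp from 0 down to -1 across the transition interval of width L_cache.
        return ((L_max - L_cache) - length) / L_cache
    else:
        return -1.0                                   # truncated: maximum penalty

# Illustrative example with L_max = 20480 and L_cache = 4096.
for n in (16000, 18432, 20480, 25000):
    print(n, overlong_penalty(n, L_max=20480, L_cache=4096))
```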

Reflection and Backtracking
During DAPO training, researchers also observed reflection and backtracking behaviors emerge, even though such patterns were not present in the training data. This is consistent with the DeepSeek-R1 report. The root cause has not yet been identified, but it points the way for future optimization.
