
The Evolution of GRPO: DAPO
Author(s): tangbasky
Originally published on Towards AI.
Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) is a reinforcement learning optimization algorithm for LLM post-training. To understand DAPO thoroughly, we will build it up step by step: PPO -> GRPO -> DAPO.
Proximal Policy Optimization (PPO)
The core of PPO is to limit the difference between the new and old policies during policy updates, preventing training from collapsing due to an overly large single update. Its main formula is:
J_{PPO}(θ) = E_t [ min( r_t(θ) \hat{A}_t, clip( r_t(θ), 1-ε, 1+ε ) \hat{A}_t ) ], where r_t(θ) = π_{θ}(o_t | q, o_{<t}) / π_{θ_{old}}(o_t | q, o_{<t})
where
- π_{θ} is the output probability of the new policy model.
- π_{θ_{old}} is the output probability of the old policy model.
- π_{θ}/π_{θ_{old}} is the importance sampling ratio, which keeps the new policy's distribution from drifting too far from the old one.
- ε is the clipping parameter for the importance ratio; it bounds how much the ratio, and therefore the policy, can change in a single update.
- \hat{A}_t is the advantage estimate, which is computed from the reward model and the value model (typically via Generalized Advantage Estimation, GAE).
- R_t is the score from the reward model.
- V is the estimate from the value model (the critic).
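To make the clipped objective concrete, here is a minimal PyTorch sketch of the PPO surrogate loss. The tensor names and the assumption that advantages are already computed (e.g., via GAE) are illustrative, not tied to any particular library API.

```python
import torch

def ppo_clip_loss(log_probs_new: torch.Tensor,
                  log_probs_old: torch.Tensor,
                  advantages: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate loss (to be minimized)."""
    # Importance sampling ratio pi_theta / pi_theta_old, computed in log space.
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Unclipped and clipped surrogate terms.
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # PPO maximizes the minimum of the two terms, so we minimize its negative.
    return -torch.min(surr1, surr2).mean()
```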

Figure 1 above compares model performance (evaluated on the AIME dataset) and generation entropy with and without the clipping mechanism. We can clearly see that with clipping added, both the model's performance and its entropy improve significantly.
Group Relative Policy Optimization (GRPO)
GRPO removes the value model and estimates the advantage in a group-relative manner, which greatly reduces the cost of training. The formula is as follows:
J_{GRPO}(θ) = E_{q, {o_i}_{i=1}^{G} ~ π_{θ_{old}}(·|q)} [ (1/G) Σ_{i=1}^{G} (1/|o_i|) Σ_{t=1}^{|o_i|} min( r_{i,t}(θ) \hat{A}_{i,t}, clip( r_{i,t}(θ), 1-ε, 1+ε ) \hat{A}_{i,t} ) - β D_{KL}( π_{θ} || π_{ref} ) ]

where r_{i,t}(θ) = π_{θ}(o_{i,t} | q, o_{i,<t}) / π_{θ_{old}}(o_{i,t} | q, o_{i,<t}) and \hat{A}_{i,t} = (R_i - mean({R_1, ..., R_G})) / std({R_1, ..., R_G}).
The main changes it makes are:
- Sampling each prompt multiple times to form a group, and using the group-normalized rewards as the advantage.
- Introducing a KL divergence term (against a reference policy) as regularization.
- Since GRPO training mainly targets mathematical or logical reasoning problems, its reward is rule-based rather than a learned model. For example:
R(\hat{y}, y) = 1 if is_equivalent(\hat{y}, y), and R(\hat{y}, y) = -1 otherwise,
where
- y is the standard answer.
- \hat{y} is the predicted answer.
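Below is a minimal sketch of the two ingredients above: a rule-based reward and the group-normalized advantage. The function names and the exact-match check are illustrative assumptions; a real verifier would parse and normalize answers rather than compare strings.

```python
import torch

def rule_based_reward(pred: str, answer: str) -> float:
    # Rule-based reward: 1 if the predicted answer matches the reference, -1 otherwise.
    return 1.0 if pred.strip() == answer.strip() else -1.0

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: shape (G,), one scalar reward per sampled response in the group."""
    # Normalize within the group: subtract the group mean, divide by the group std.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 responses to the same prompt, 2 correct and 2 wrong.
rewards = torch.tensor([1.0, -1.0, 1.0, -1.0])
print(group_relative_advantages(rewards))  # positive for correct, negative for wrong
```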
DAPO
The full DAPO objective is:

J_{DAPO}(θ) = E_{(q, a), {o_i}_{i=1}^{G} ~ π_{θ_{old}}(·|q)} [ (1 / Σ_{i=1}^{G} |o_i|) Σ_{i=1}^{G} Σ_{t=1}^{|o_i|} min( r_{i,t}(θ) \hat{A}_{i,t}, clip( r_{i,t}(θ), 1-ε_{low}, 1+ε_{high} ) \hat{A}_{i,t} ) ]

subject to 0 < |{ o_i | is_equivalent(a, o_i) }| < G.
Clip-Higher
From the formula above, we can clearly see that the single clipping parameter has been split into a low clip and a high clip. This is because:
- The high clip controls the exploration of the model, allowing the model to explore more tokens.
- The low clip ensures that the probability of a token does not decrease rapidly, maintaining the stability of high-probability tokens.
Here is an example. When ε = 0.2, suppose one action has probability π_{θ_{old}}(o_1|q) = 0.01 and another has π_{θ_{old}}(o_2|q) = 0.9. Since updates are clipped at π_{θ} ≤ π_{θ_{old}}(1+ε), the maximum probability π_{θ}(o_1|q) can reach in one update is (1+0.2) * 0.01 = 0.012, while for π_{θ}(o_2|q) it is (1+0.2) * 0.9 = 1.08, which is effectively no constraint at all. This means low-probability tokens have far less room to grow than high-probability tokens. Moreover, in the DAPO experiments, researchers found that the tokens that actually get clipped all have probabilities below π_{θ_{old}}(o_i|q) < 0.2. This confirms that the upper clip mainly limits the growth of low-probability tokens, thereby restricting the diversity of the model, as shown in Figure 2 below:

To address this issue, DAPO proposes high-low clipping. Specifically:
- ε_{low}: Used to limit the decrease in the probability of high-probability tokens, preventing the probability of a token from dropping too much. This value can be set smaller.
- ε_{high}: Used to limit the increase in the probability of low-probability tokens, allowing a larger exploration space. This value can be set larger.
In DAPO, ε_{low} < ε_{high}. Therefore:
- When \hat{A} > 0, the binding clipping boundary is 1 + ε_{high}. A larger ε_{high} keeps low-probability tokens from being clipped too early, so their probabilities can still be raised and the model can explore.
- When \hat{A} < 0, the binding clipping boundary is 1 - ε_{low}. Because ε_{low} is kept small, a token's probability cannot be pushed down too sharply in a single update, which keeps high-probability tokens stable.
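A minimal sketch of this decoupled clipping is given below, assuming per-token log-probabilities and advantages are precomputed; the default ε values follow those reported in the DAPO paper.

```python
import torch

def clip_higher_loss(log_probs_new: torch.Tensor,
                     log_probs_old: torch.Tensor,
                     advantages: torch.Tensor,
                     eps_low: float = 0.2,
                     eps_high: float = 0.28) -> torch.Tensor:
    """Asymmetric ('Clip-Higher') surrogate loss with clip range [1 - eps_low, 1 + eps_high]."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    surr1 = ratio * advantages
    # A larger eps_high leaves low-probability tokens more room to grow (exploration);
    # a smaller eps_low prevents probabilities from being pushed down too fast (stability).
    surr2 = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    return -torch.min(surr1, surr2).mean()
```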
Dynamic Sampling
In GRPO-style RL algorithms, each prompt is sampled multiple times to form a group. If all of the sampled responses are correct (group accuracy 1) or all of them are wrong (group accuracy 0), every response receives the same reward, so the group-normalized advantage \hat{A} is 0 for the whole group. When \hat{A} is 0, the group produces no gradient update, which reduces sample efficiency. As shown in Figure 3 below, as training progresses, the number of effective samples in a batch gradually decreases.

To address this issue, DAPO proposes Dynamic Sampling: groups whose accuracy is exactly 1 or 0 are filtered out, so that every sample in a batch contributes a non-zero gradient while the batch size stays constant. Before each training step, sampling continues until the batch is filled with groups whose accuracy is neither 0 nor 1.
Moreover, as the number of training steps increases, the model becomes more accurate, so more groups are filtered out and more resampling is required. Even so, the overall training time does not increase significantly; the improved sample efficiency instead accelerates the convergence of the model.
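A sketch of the filtering and resampling loop is shown below, under the assumption that sample_group(prompt) returns the per-response rewards of one group and that the prompt stream is effectively unlimited; the helper names are my own, not DAPO's actual implementation.

```python
def keep_group(rewards: list[float]) -> bool:
    """Keep a group only if its responses are neither all correct nor all wrong."""
    accuracy = sum(1.0 for r in rewards if r > 0) / len(rewards)
    return 0.0 < accuracy < 1.0

def fill_batch(sample_group, prompts, batch_size: int):
    """Keep sampling groups until the batch holds `batch_size` informative groups."""
    batch = []
    prompt_iter = iter(prompts)  # assumed to yield more prompts than we will ever need
    while len(batch) < batch_size:
        rewards = sample_group(next(prompt_iter))
        if keep_group(rewards):    # groups with accuracy 0 or 1 give zero advantage
            batch.append(rewards)  # -> no gradient, so they are filtered out
    return batch
```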
Token-Level Policy Gradient Loss
In the original GRPO, the loss is calculated at the sample level: token losses are first averaged within each sample, and the per-sample losses are then averaged across samples. As a result, tokens in very long samples carry a smaller effective weight and may be learned poorly. For example:
The token loss for a long output sample is:
L_{long} = (1/N_1) Σ_{i=1}^{N_1} l_i
The token loss for a short output sample is:
L_{short} = (1/N_2) Σ_{j=1}^{N_2} l_j
When the total loss is computed as the average (L_{long} + L_{short}) / 2, the two samples contribute equally even though N_1 > N_2, so each token of the long sample carries less weight and the long sample is not learned as well. Moreover, experiments in DAPO found that overly long outputs easily contain many repetitive or nonsensical tokens, which need more attention in the loss rather than less. Therefore, to weight every token equally, the total loss is changed to the following token-level form:
L_{token} = (1 / Σ_{i=1}^{G} |o_i|) Σ_{i=1}^{G} Σ_{t=1}^{|o_i|} l_{i,t}
The loss form of the above example becomes:
L = ( Σ_{i=1}^{N_1} l_i + Σ_{j=1}^{N_2} l_j ) / (N_1 + N_2)
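The difference between the two averaging schemes is easy to see numerically; the per-token losses below are arbitrary illustrative values.

```python
import torch

# Per-token losses for one long and one short response (illustrative values).
long_tokens = torch.randn(1000).abs()   # N1 = 1000 tokens
short_tokens = torch.randn(50).abs()    # N2 = 50 tokens

# GRPO (sample-level): average within each sample, then across samples.
sample_level = (long_tokens.mean() + short_tokens.mean()) / 2

# DAPO (token-level): average over all tokens at once, so every token has equal weight.
token_level = torch.cat([long_tokens, short_tokens]).mean()

print(sample_level, token_level)
# Under the sample-level loss, each token of the long response carries a weight of
# 1/(2*1000) versus 1/(2*50) for the short response; under the token-level loss,
# every token carries the same weight of 1/1050.
```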
Overlong Reward Shaping
During the training of large language models (LLMs), a max_token limit is usually set to cap the maximum generation length, and samples exceeding this length are truncated. However, if the reward for truncated samples is designed improperly, it can introduce reward noise that seriously interferes with training.
In previous methods, RL would directly penalize such truncated samples. This can introduce noise during training, because an otherwise reasonable answer may be penalized simply for being too long, which seriously interferes with the model's effective generation. DAPO instead introduces a penalty transition interval, defined by the following formula:
R_{length}(y) = 0, if |y| ≤ L_{max} - L_{cache}
R_{length}(y) = ((L_{max} - L_{cache}) - |y|) / L_{cache}, if L_{max} - L_{cache} < |y| ≤ L_{max}
R_{length}(y) = -1, if L_{max} < |y|
where
- L_{cache} is the penalty transition interval.
- L_{max} is the maximum allowed length.
- |y| is the actual length of the text.
When |y| + L_{cache} ≤ L_{max}, the output is within the soft limit and no penalty is applied. When |y| + L_{cache} > L_{max} and |y| ≤ L_{max}, a linearly increasing penalty is applied. When |y| > L_{max}, the maximum penalty of -1 is applied. Figure 4 below shows the effect of applying overlong filtering.
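The penalty is a simple piecewise-linear function; here is a sketch with variable names mirroring the formula above (the example lengths and limits are illustrative assumptions, not values from the paper).

```python
def overlong_penalty(length: int, L_max: int, L_cache: int) -> float:
    """Soft length penalty added to the reward of a response with `length` tokens."""
    if length <= L_max - L_cache:
        return 0.0                                    # within the soft limit: no penalty
    elif length <= L_max:
        # Linear ramp from 0 down to -1 across the transition interval of width L_cache.
        return ((L_max - L_cache) - length) / L_cache
    else:
        return -1.0                                   # truncated: maximum penalty

# Illustrative example with L_max = 20480 and L_cache = 4096.
for n in (16000, 18432, 20480, 25000):
    print(n, overlong_penalty(n, L_max=20480, L_cache=4096))
```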

Reflection and Backtracking
During DAPO training, researchers also observed reflection and backtracking behaviors emerge, even though such patterns were not present in the training data. This is consistent with the DeepSeek-R1 report. The root cause has not yet been identified, but it points the way for future optimization.
