LLM - R1 Reinforcement Learning and GRPO Policy Optimization: A Tutorial on the DAPO and Dr. GRPO Algorithms

Follow me on CSDN: https://spike.blog.csdn.net/
Article URL: https://spike.blog.csdn.net/article/details/146533892


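Among reinforcement learning algorithms, DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) improves a model's reasoning ability by decoupling the clipping bounds and by sampling dynamically. Compared with GRPO (Group Relative Policy Optimization), DAPO removes the KL divergence penalty so that the model can explore freely on long reasoning tasks. It also sets the lower and upper clipping ranges separately, which gives low-probability tokens more room to be explored and mitigates entropy collapse. In addition, DAPO introduces dynamic sampling, which filters out prompts whose sampled accuracy is 0 or 1, so that every sample in a batch carries a useful gradient signal. This improves training efficiency and convergence speed.

Dr. GRPO (GRPO Done Right) addresses the optimization bias in GRPO. It removes the length normalization term and the standard-deviation normalization term, which fixes GRPO's tendency to make incorrect responses progressively longer. In the masked-mean function, Dr. GRPO also replaces mask.sum(axis=dim) with a fixed constant MAX_TOKENS so that the optimization objective is unbiased. Together these changes mitigate the optimization bias and markedly shorten incorrect responses while preserving reasoning performance.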

DAPO stands for Decoupled Clip and Dynamic sAmpling Policy Optimization.

Paper: DAPO: an Open-Source LLM Reinforcement Learning System at Scale

Dr. GRPO stands for GRPO Done Right.

Paper: Understanding R1-Zero-Like Training: A Critical Perspective

Project pages:

DAPO: https://dapo-sia.github.io/
Dr. GRPO: https://github.com/sail-sg/understand-r1-zero

The standard GRPO objective is:
$\begin{aligned} &\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_{i}|}\sum_{t=1}^{|o_{i}|} \Big\{ \min \Big[ \frac{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,<t})}\hat{A}_{i,t},\ \text{clip}\Big(\frac{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,<t})}, 1-\epsilon, 1+\epsilon\Big)\hat{A}_{i,t} \Big] -\beta \mathbb{D}_{KL}(\pi_{\theta}||\pi_{ref}) \Big\} \\ &\text{where} \ \hat{A}_{i,t}=\frac{R(q,o_{i}) - \text{mean}(\{R(q,o_{1}),...,R(q,o_{G})\})}{\text{std}(\{R(q,o_{1}),...,R(q,o_{G})\})} \end{aligned}$
DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) makes three changes: (1) it raises the upper clipping bound, (2) it drops prompts whose samples are all correct or all wrong, and (3) it averages over all tokens in the group (token-level) instead of averaging within each sample first (sample-level):
$\begin{aligned} &\frac{1}{\sum_{i=1}^{G}|o_{i}|} \sum_{i=1}^{G} \sum_{t=1}^{|o_{i}|} \Big\{ \min \Big[ \frac{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,<t})}\hat{A}_{i,t},\ \text{clip}\Big(\frac{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,<t})}, 1-\epsilon_{low}, 1+\epsilon_{high}\Big)\hat{A}_{i,t} \Big] \Big\} \\ &\text{where} \ \hat{A}_{i,t}=\frac{R(q,o_{i}) - \text{mean}(\{R(q,o_{1}),...,R(q,o_{G})\})}{\text{std}(\{R(q,o_{1}),...,R(q,o_{G})\})} \\ &\text{s.t.} \ 0 < \big| \{o_{i} \mid \text{is\_equivalent}(a,o_{i}) \} \big| < G \end{aligned}$
Dr. GRPO (GRPO Done Right) (1) removes the sequence-length factor $\frac{1}{|o_{i}|}$ and (2) removes the standard deviation $\text{std}$ from the advantage:
$\begin{aligned} &\frac{1}{G}\sum_{i=1}^{G} \sum_{t=1}^{|o_{i}|} \Big\{ \min \Big[ \frac{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,<t})}\hat{A}_{i,t},\ \text{clip}\Big(\frac{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,<t})}, 1-\epsilon, 1+\epsilon\Big)\hat{A}_{i,t} \Big] \Big\} \\ &\text{where} \ \hat{A}_{i,t}=R(q,o_{i}) - \text{mean}(\{R(q,o_{1}),...,R(q,o_{G})\}) \end{aligned}$
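The difference between the two advantage definitions can be made concrete with a minimal sketch in plain Python. The function name `group_advantages`, the `eps` term, and the use of the sample standard deviation are assumptions for illustration; the formulas themselves only say "mean" and "std".

```python
import statistics

def group_advantages(rewards, scale_by_std=True, eps=1e-4):
    """Advantages for one prompt's group of G sampled responses.

    scale_by_std=True  -> GRPO / DAPO style: (R - mean) / std
    scale_by_std=False -> Dr. GRPO style:     R - mean
    eps is an assumed small constant that avoids division by zero.
    """
    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]
    if not scale_by_std:
        return centered
    std = statistics.stdev(rewards)  # sample standard deviation (assumption)
    return [c / (std + eps) for c in centered]

# One group of G = 4 responses with rule-based 0/1 rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
grpo_adv = group_advantages(rewards, scale_by_std=True)
dr_grpo_adv = group_advantages(rewards, scale_by_std=False)
print("GRPO advantages:    ", grpo_adv)
print("Dr. GRPO advantages:", dr_grpo_adv)
```

Both variants give every token of response $o_{i}$ the same advantage; they differ only in whether the group's spread rescales it.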

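DAPO removes the KL divergence (Removing KL Divergence): the KL penalty regulates the divergence between the online policy and the frozen reference policy.

In reinforcement learning from human feedback (RLHF), the goal of RL is to align the model's behavior without drifting too far from the initial model.
When training a long chain-of-thought (long-CoT) reasoning model, the model distribution can diverge significantly from the initial model, so the KL penalty is unnecessary.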

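Dr. GRPO likewise removes the KL divergence, for the following reasons:

The KL regularization term is typically used in reinforcement learning from human feedback, where $r$ is a reward model learned from data collected from $\pi_{ref}$. The regularization helps keep $\pi_{\theta}$ from drifting too far from the distribution on which the reward model is accurate.
RL fine-tuning of reasoning models usually uses a rule-based verifier as $r$, which removes the concern about distribution shift, so the KL term can be dropped.
Dropping it saves the GPU memory and compute that $\pi_{ref}$ would require during training, and it may also lead to better RL training performance.
The coefficient of the $\beta \mathbb{D}_{KL}$ term is therefore assumed to be $\beta=0$ .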

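In TRL's GRPO implementation, the KL weight is set as follows:

$\beta = 0.04$ is the default in GRPOConfig.
In math training, $\beta=0.001$ is used, which likewise lowers the weight of the KL divergence.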
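How $\beta$ scales the KL penalty can be sketched in plain Python. The estimator below follows the per_token_kl expression used in TRL's GRPO code; the function names `per_token_kl` and `penalized_loss` and the scalar (one token at a time) interface are assumptions for illustration, not TRL's API.

```python
import math

def per_token_kl(ref_logp, logp):
    """KL estimate for one token: exp(d) - d - 1 with d = ref_logp - logp.

    The value is always >= 0 and equals 0 when both policies agree.
    """
    diff = ref_logp - logp
    return math.exp(diff) - diff - 1.0

def penalized_loss(per_token_loss, ref_logp, logp, beta):
    """Add beta * KL to a per-token loss; beta = 0 disables the penalty."""
    if beta == 0.0:
        return per_token_loss
    return per_token_loss + beta * per_token_kl(ref_logp, logp)

# Same token, three KL weights: 0.04, 0.001, and 0 (KL removed).
for beta in (0.04, 0.001, 0.0):
    print(beta, penalized_loss(0.3, ref_logp=-1.0, logp=-2.0, beta=beta))
```

With $\beta=0$ the reference log-probabilities are never used, which is why the reference model does not need to be kept in memory.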

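1. DAPO Strategy

DAPO drops the KL divergence term and uses Rule-based Reward Modeling.

The core components of DAPO are:

Clip-Higher (raising the upper clip bound): promotes diversity and avoids Entropy Collapse by raising the upper clipping bound (Upper Clip) of the importance sampling ratio in the Policy Gradient Loss.
- Raising the upper clip bound: for positive samples (A>0), low-probability tokens can gain more absolute probability, which frees up exploration along low-probability paths and slows the rapid decline of policy entropy.
- Keeping the lower clip bound: avoids a sharp contraction of the policy.
- For example, $\epsilon_{low}=0.2,\ \epsilon_{high}=0.28$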
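The effect of Clip-Higher can be sketched for a single token in plain Python. The function names are hypothetical; the values $\epsilon_{low}=0.2$ and $\epsilon_{high}=0.28$ are the ones quoted above, and 0.01 / 0.9 are illustrative token probabilities.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped objective with decoupled lower and upper bounds."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)

def max_rewarded_prob(old_prob, eps_high):
    """Largest new probability that still receives gradient when A > 0.

    Beyond old_prob * (1 + eps_high) the objective is clipped to a constant.
    """
    return min(1.0, old_prob * (1.0 + eps_high))

# A rare token (p = 0.01) gains headroom from the higher bound,
# while a likely token (p = 0.9) is capped at probability 1 either way.
for old_prob in (0.01, 0.9):
    symmetric = max_rewarded_prob(old_prob, eps_high=0.2)
    higher = max_rewarded_prob(old_prob, eps_high=0.28)
    print(f"old={old_prob}: eps_high=0.2 -> {symmetric:.4f}, eps_high=0.28 -> {higher:.4f}")
```

Only the upper bound changes; for negative advantages the lower bound $1-\epsilon_{low}$ behaves as in standard clipping.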
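Dynamic Sampling: improves training efficiency and stability. Prompt Groups whose accuracy is 1 or 0 are filtered out and sampling continues, so that every batch contains the same number of prompts with an effective gradient.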
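A minimal sketch of the filtering step, assuming each prompt arrives with a list of 0/1 correctness flags for its G sampled responses. The names `has_gradient_signal` and `dynamic_sample` and the stream format are hypothetical, not taken from the DAPO code base.

```python
def has_gradient_signal(correct_flags):
    """True when the group is neither all wrong nor all correct."""
    n_correct = sum(correct_flags)
    return 0 < n_correct < len(correct_flags)

def dynamic_sample(prompt_stream, batch_size):
    """Keep drawing prompt groups until batch_size useful groups are found."""
    batch = []
    for prompt, correct_flags in prompt_stream:
        if has_gradient_signal(correct_flags):
            batch.append((prompt, correct_flags))
        if len(batch) == batch_size:
            break
    return batch

# Toy stream: q1 is all correct and q3 is all wrong, so both are skipped.
stream = [
    ("q1", [1, 1, 1, 1]),
    ("q2", [1, 0, 0, 1]),
    ("q3", [0, 0, 0, 0]),
    ("q4", [0, 1, 0, 0]),
    ("q5", [1, 1, 0, 1]),
]
batch = dynamic_sample(iter(stream), batch_size=2)
print([prompt for prompt, _ in batch])
```

Groups with accuracy 0 or 1 have identical rewards, so every advantage in the group is zero and the group contributes no gradient.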
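Token-level Policy Gradient Loss: in long chain-of-thought RL (long-CoT RL), a sample-level loss under-penalizes meaningless patterns inside long responses. Computing the loss at the token level avoids this, and the improvement is significant.
- GRPO: first average within each response (Generation Level), then average across the group (Group Level).
- DAPO: average over all tokens of all generations in the group at once.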
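The two averaging schemes can be compared on a toy group in plain Python. The function names and the per-token loss values are made up for illustration.

```python
def sample_level_mean(per_token_losses):
    """GRPO style: mean over tokens inside each response, then mean over responses."""
    per_sample = [sum(tokens) / len(tokens) for tokens in per_token_losses]
    return sum(per_sample) / len(per_sample)

def token_level_mean(per_token_losses):
    """DAPO style: one mean over every token of every response in the group."""
    total = sum(sum(tokens) for tokens in per_token_losses)
    count = sum(len(tokens) for tokens in per_token_losses)
    return total / count

# A short response (2 tokens) and a long one (8 tokens) with a larger per-token loss.
group = [[1.0, 1.0], [4.0] * 8]
print("sample-level:", sample_level_mean(group))
print("token-level: ", token_level_mean(group))
```

In the sample-level mean each token of the long response carries weight 1/(2·8), versus 1/(2·2) for the short one; in the token-level mean every token carries the same weight 1/10, so the long response's tokens are no longer down-weighted.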
Overlong Reward Shaping: the Overlong Filtering strategy masks the loss of Truncated Samples, which markedly stabilizes training, improves performance, and reduces Reward Noise.

Soft Overlong Punishment is defined as:
$R_{length}(y) = \begin{cases} 0, &|y|\le L_{max} - L_{cache} \\ \frac{L_{max} - L_{cache} - |y|}{L_{cache}}, &L_{max} - L_{cache} < |y| \le L_{max} \\ -1, &L_{max} < |y| \end{cases}$
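The piecewise definition translates directly into a small function. The values l_max=100 and l_cache=20 below are illustrative only and are not the settings used in the DAPO paper.

```python
def soft_overlong_penalty(length, l_max, l_cache):
    """Length penalty R_length(y) for a response of `length` tokens.

    0 inside the safe zone, a linear ramp from 0 down to -1 inside the
    last l_cache tokens before l_max, and -1 beyond l_max.
    """
    if length <= l_max - l_cache:
        return 0.0
    if length <= l_max:
        return (l_max - l_cache - length) / l_cache
    return -1.0

for length in (50, 80, 90, 100, 120):
    print(length, soft_overlong_penalty(length, l_max=100, l_cache=20))
```

The ramp is continuous at both ends, so a response that is slightly too long is penalized only slightly instead of dropping straight to -1.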

DAPO workflow:

[Figure: DAPO algorithm workflow]

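2. Dr. GRPO Strategy

Dr. GRPO fixes two biases in GRPO:

Response-level length bias: caused by dividing by $|o_{i}|$ :
- Positive advantage ( $\hat{A}_{i,t}>0$ ), correct response: short responses receive larger gradient updates and long responses receive smaller ones.
- Negative advantage ( $\hat{A}_{i,t}<0$ ), incorrect response: long responses are penalized less and short responses are penalized more.
- Removing the division by $|o_{i}|$ eliminates the effect of response length, so only the reward value matters.
Question-level difficulty bias: caused by dividing the advantage $\hat{A}_{i,t}$ by $\text{std}(\{R(q,o_{1}),...,R(q,o_{G})\})$
- Questions with a lower standard deviation are given a higher weight during the policy update.
- Normalizing across a batch is reasonable, but normalizing per question gives different questions different weights in the objective.
- Training then leans toward questions whose answers are highly consistent, which reduces exploration.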
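The length-bias fix can be sketched as a change of normalizer in the masked mean: divide by a fixed MAX_TOKENS constant instead of the number of unmasked tokens. The function names and the toy values below are assumptions for illustration, not code from the Dr. GRPO repository.

```python
def masked_mean(values, mask):
    """GRPO style: divide by the number of unmasked tokens, i.e. by |o_i|."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / sum(mask)

def masked_constant_norm(values, mask, max_tokens):
    """Dr. GRPO style: divide by a fixed constant, independent of |o_i|."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / max_tokens

MAX_TOKENS = 8  # illustrative generation budget
short_vals, short_mask = [1.0] * 2 + [0.0] * 6, [1] * 2 + [0] * 6
long_vals, long_mask = [1.0] * 8, [1] * 8

# Per-response mean: both responses score 1.0, so each short-response token
# weighs 1/2 while each long-response token weighs only 1/8.
print(masked_mean(short_vals, short_mask), masked_mean(long_vals, long_mask))
# Constant normalizer: every token weighs 1/MAX_TOKENS regardless of length.
print(masked_constant_norm(short_vals, short_mask, MAX_TOKENS),
      masked_constant_norm(long_vals, long_mask, MAX_TOKENS))
```

With the constant normalizer, an incorrect response can no longer dilute its per-token penalty by growing longer.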

3. GRPO Implementation in TRL

The GRPO computation in the TRL source code:

# 1. Computing the advantages
# Gather the reward per function: this part is crucial, because the rewards are normalized per group and the
# completions may be distributed across processes
rewards_per_func = gather(rewards_per_func)
# Apply weights to each reward function's output and sum
rewards = (rewards_per_func * self.reward_weights.to(device).unsqueeze(0)).nansum(dim=1)
# Compute grouped-wise rewards
mean_grouped_rewards = rewards.view(-1, self.num_generations).mean(dim=1)
std_grouped_rewards = rewards.view(-1, self.num_generations).std(dim=1)
# Normalize the rewards to compute the advantages
mean_grouped_rewards = mean_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
std_grouped_rewards = std_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
advantages = rewards - mean_grouped_rewards
if self.args.scale_rewards:
    advantages = advantages / (std_grouped_rewards + 1e-4)

# KL divergence
per_token_kl = torch.exp(ref_per_token_logps - per_token_logps) - (ref_per_token_logps - per_token_logps) - 1

# 2. Computing the loss
# Compute the loss
advantages = inputs["advantages"]
# When using num_iterations == 1, old_per_token_logps == per_token_logps, so we can skip its computation (see
# _generate_and_score_completions) and use per_token_logps.detach() instead.
old_per_token_logps = inputs["old_per_token_logps"] if self.num_iterations > 1 else per_token_logps.detach()
coef_1 = torch.exp(per_token_logps - old_per_token_logps)
coef_2 = torch.clamp(coef_1, 1 - self.epsilon_low, 1 + self.epsilon_high)
per_token_loss1 = coef_1 * advantages.unsqueeze(1)
per_token_loss2 = coef_2 * advantages.unsqueeze(1)
per_token_loss = -torch.min(per_token_loss1, per_token_loss2)
if self.beta != 0.0:
    per_token_loss = per_token_loss + self.beta * per_token_kl
loss = (per_token_loss * completion_mask).sum() / completion_mask.sum()

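4. A loss of 0 does not imply a gradient of 0

A loss of 0 can still be backpropagated and update the parameters:

The loss is 0, but the gradient may be nonzero:

$\begin{aligned} loss(w) &= (w-1)^{2} - 1 \\ \frac{\partial{loss}}{\partial{w}} &= 2w - 2 \end{aligned}$

The loss is 0 at w=0 (w=2 is the other root), where the gradient is -2. The update step is the learning rate times the gradient; assume a learning rate of $\eta=0.1$ .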

$w_{new} = w - \eta \cdot g = 0 - (0.1\times(-2)) = 0.2$
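The arithmetic of this update step can be checked numerically with a few lines of plain Python (the function names are illustrative):

```python
def loss(w):
    return (w - 1.0) ** 2 - 1.0

def grad(w):
    return 2.0 * w - 2.0

w = 0.0
lr = 0.1  # learning rate eta

print("loss at w=0:", loss(w))   # the loss is zero here
print("grad at w=0:", grad(w))   # but the gradient is not

# One gradient-descent step moves w toward the minimum at w = 1.
w_new = w - lr * grad(w)
print("w_new:", w_new)
print("loss after the step:", loss(w_new))
```

The loss falls below 0 after the step, which shows that the value 0 was not a minimum of this function.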

If the gradient is 0, the parameter cannot be optimized.

Test:

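import torch

# Case 1: x - x depends on x through both terms, so the gradients cancel: 1 - 1 = 0.
x = torch.tensor([3.0], requires_grad=True)
y1 = x - x
y1.backward()
print(f"Grad for x-x: {x.grad.item()}")  # 0.0

# Reset the accumulated gradient before the second case.
x.grad.zero_()

# Case 2: x.detach() is treated as a constant, so the value is still 0 but the gradient is 1.
y2 = x - x.detach()
y2.backward()
print(f"Grad for x - x.detach(): {x.grad.item()}")  # 1.0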