🎯 REINFORCE Policy Gradient Derivation (Complete)
1. Objective Function Definition

We want to maximize the expected return of the policy:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right]$$

where:

- $\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T)$: a trajectory
- $R(\tau) = \sum_{t=0}^{T} r_t$: the total return of the trajectory
- $\pi_\theta(a_t \mid s_t)$: the policy; for a continuous action space this is a probability density value, for a discrete action space it is a probability (e.g., a softmax output)
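As a concrete reference for the discrete case, below is a minimal sketch of a softmax policy and the trajectory return $R(\tau)$; the network architecture and the `SoftmaxPolicy` name are illustrative assumptions, not part of the derivation.

```python
import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """Minimal discrete policy pi_theta(a|s): an MLP whose softmax output gives action probabilities."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        logits = self.net(state)
        # Categorical(logits=...) applies the softmax internally.
        return torch.distributions.Categorical(logits=logits)

def trajectory_return(rewards):
    """R(tau) = sum_{t=0}^{T} r_t: the total (undiscounted) return of one trajectory."""
    return sum(rewards)
```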
2. Trajectory Probability

The probability of a trajectory is:

$$P(\tau) = \rho(s_0) \cdot \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t) \cdot P(s_{t+1} \mid s_t, a_t)$$

where:

- $\rho(s_0)$: the initial state distribution
- $P(s_{t+1} \mid s_t, a_t)$: the state transition probability (independent of $\theta$). Which action is taken is random and described by the policy's probability; which state the environment lands in after that action is also random, and is described by this transition probability.
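To make the factorization concrete, here is a small tabular sketch that evaluates $\log P(\tau)$ term by term; it assumes the initial-state distribution `rho` and the transition table `P` are known, which is true only in toy settings (a real environment usually exposes neither).

```python
import numpy as np

def log_trajectory_prob(rho, P, log_pi, states, actions):
    """log P(tau) = log rho(s_0) + sum_t [ log pi_theta(a_t|s_t) + log P(s_{t+1}|s_t, a_t) ].

    rho:     dict mapping s_0 -> initial-state probability
    P:       dict mapping (s_t, a_t, s_{t+1}) -> transition probability (theta-independent)
    log_pi:  callable (s, a) -> log pi_theta(a|s)
    states:  [s_0, ..., s_{T+1}] (one element more than actions); actions: [a_0, ..., a_T]
    """
    logp = np.log(rho[states[0]])
    for t, a in enumerate(actions):
        logp += log_pi(states[t], a)
        logp += np.log(P[(states[t], a, states[t + 1])])
    return logp
```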
3. Differentiating the Objective

We want to update the policy parameters $\theta$ by gradient ascent:

$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right]$$

Problem: how do we compute this gradient? The sampling distribution $\pi_\theta$ itself depends on $\theta$, so the gradient cannot simply be pushed inside the expectation.

We use the likelihood ratio trick, derived as follows:

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta(x)}[f(x)] = \nabla_\theta \int f(x)\, p_\theta(x)\, dx = \int f(x)\, \nabla_\theta p_\theta(x)\, dx$$

We do not differentiate $f(x)$ here because, in reinforcement learning, $f(x)$ is the reward: a scalar obtained by interacting with the environment, with no direct dependence on $\theta$.

Using the chain rule (the log-derivative identity):

$$\nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x)$$

Substituting this back:

$$= \int f(x)\, p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, dx = \mathbb{E}_{x \sim p_\theta(x)}\left[ f(x)\, \nabla_\theta \log p_\theta(x) \right]$$
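This identity can be verified numerically on a toy problem of my own choosing: with $x \sim \mathcal{N}(\theta, 1)$ and $f(x) = x^2$ we have $\mathbb{E}[f(x)] = \theta^2 + 1$, so the true gradient is $2\theta$, and the score-function estimate should match it up to Monte Carlo noise.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(loc=theta, scale=1.0, size=1_000_000)  # x ~ p_theta = N(theta, 1)

f = x ** 2            # f(x); E[f(x)] = theta^2 + 1, so the true gradient is 2 * theta
score = x - theta     # grad_theta log N(x; theta, 1) = (x - theta) / 1^2

grad_estimate = np.mean(f * score)   # Monte Carlo estimate of E[ f(x) * grad_theta log p_theta(x) ]
print(grad_estimate, 2 * theta)      # approximately 3.0 vs exactly 3.0
```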
4. Deriving the Log-Probability Term

Note that:

$$\log P(\tau) = \log \rho(s_0) + \sum_{t=0}^{T} \left[ \log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1} \mid s_t, a_t) \right]$$

Since $\rho(s_0)$ and $P(s_{t+1} \mid s_t, a_t)$ do not depend on $\theta$:

$$\nabla_\theta \log P(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
5. The Policy Gradient Expression

Applying the likelihood ratio trick with $x = \tau$, $p_\theta = P(\tau)$, and $f(\tau) = R(\tau)$, and substituting the result of step 4, we obtain the final gradient expression:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R(\tau) \right]$$
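In code, the expectation is replaced by sampled trajectories and autograd differentiates a surrogate loss whose gradient equals the expression above; a minimal sketch (the function name and calling convention are my own):

```python
import torch

def reinforce_loss(log_probs, total_return):
    """Surrogate loss for one sampled trajectory: -(sum_t log pi_theta(a_t|s_t)) * R(tau).

    Minimizing this loss with an optimizer performs gradient *ascent* on J(theta), because
    its gradient is -sum_t grad_theta log pi_theta(a_t|s_t) * R(tau).

    log_probs:    list of scalar tensors, log pi_theta(a_t|s_t) for t = 0..T
    total_return: float R(tau), treated as a constant (no gradient flows through it)
    """
    return -torch.stack(log_probs).sum() * total_return
```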
6. Replacing $R(\tau)$ with the Per-Step Discounted Return $G_t$

To assign credit to each action more accurately (an action taken at time $t$ cannot influence rewards received before $t$), introduce:

$$G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$$

and rewrite the gradient as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t \right]$$
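The returns $G_t$ for a whole trajectory are typically computed with one backward pass over the rewards; a short sketch:

```python
def discounted_returns(rewards, gamma=0.99):
    """G_t = sum_{k=t}^{T} gamma^(k-t) * r_k, computed backwards in O(T)."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # G_t = r_t + gamma * G_{t+1}
        returns[t] = g
    return returns

# Example: discounted_returns([1, 1, 1], gamma=0.9) -> [2.71, 1.9, 1.0]
```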
7. Introducing a Baseline to Reduce Variance

Subtracting a baseline $b(s_t)$ that does not depend on the action keeps the gradient estimate unbiased while reducing its variance:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \left( G_t - b(s_t) \right) \right]$$

A common choice of baseline is the state value function:

$$b(s_t) = V^\pi(s_t) \quad \Rightarrow \quad A_t = G_t - V(s_t)$$

which gives the advantage form:

$$\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A_t \right]$$
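A common concrete choice is to learn $V(s_t)$ with a separate value network trained by regression toward $G_t$, and to use $A_t = G_t - V(s_t)$ in the policy loss; below is a minimal sketch (the value-loss weight of 0.5 and the function name are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def reinforce_with_baseline_loss(log_probs, returns, values):
    """Combined loss: advantage-weighted policy loss plus value regression.

    log_probs: tensor of shape (T,), log pi_theta(a_t|s_t)
    returns:   tensor of shape (T,), Monte Carlo returns G_t (constants, no gradient)
    values:    tensor of shape (T,), V_phi(s_t) predicted by a separate value network
    """
    advantages = returns - values.detach()           # baseline must not receive the policy gradient
    policy_loss = -(log_probs * advantages).sum()    # gradient matches the advantage form above
    value_loss = F.mse_loss(values, returns)         # fit V_phi(s_t) to the observed G_t
    return policy_loss + 0.5 * value_loss
```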
Summary of Common Policy Gradient Forms

| Name | Expression |
|---|---|
| REINFORCE | $\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t \right]$ |
| With baseline | $\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot (G_t - b(s_t)) \right]$ |
| Advantage form | $\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A_t \right]$ |
📌 Appendix: Gradient of a Gaussian Policy for Continuous Actions

Assume the policy is:

$$\pi_\theta(a \mid s) = \mathcal{N}\left( \mu_\theta(s), \sigma^2 \right)$$

Then:

$$\log \pi_\theta(a \mid s) = -\frac{(a - \mu_\theta(s))^2}{2\sigma^2} + \text{const}$$

and the gradient with respect to the policy parameters is:

$$\nabla_\theta \log \pi_\theta(a \mid s) = \frac{a - \mu_\theta(s)}{\sigma^2} \cdot \nabla_\theta \mu_\theta(s)$$
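This formula can be sanity-checked against autograd on a tiny example where the mean is linear, $\mu_\theta(s) = \theta^\top s$ (an illustrative choice of my own):

```python
import torch

torch.manual_seed(0)
s = torch.randn(4)                          # a state
theta = torch.randn(4, requires_grad=True)  # policy parameters
sigma = 0.5
a = torch.tensor(0.3)                       # a sampled action, held fixed

mu = theta @ s                                   # mu_theta(s) = theta . s
log_pi = -((a - mu) ** 2) / (2 * sigma ** 2)     # log N(a; mu, sigma^2) up to a constant
log_pi.backward()                                # autograd gradient w.r.t. theta

analytic = ((a - mu) / sigma ** 2).detach() * s  # (a - mu)/sigma^2 * grad_theta mu_theta(s)
print(torch.allclose(theta.grad, analytic, atol=1e-6))  # True
```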