DPO
- Core idea: optimize the policy directly on preference data, dispensing with the reward model and the separate RL policy-optimization stage.
- Background: given a prompt $x$, sample two answers $(y_1, y_2) \sim \pi^{SFT}(y|x)$ and have human annotators compare them, yielding a preference $y_w \succ y_l \mid x$, where $w$ and $l$ denote the winning and losing response. Introducing a reward model $r^*$, the probability that $y_1$ is preferred over $y_2$ can be written as

$$p(y_1 > y_2) = \frac{r^*(x,y_1)}{r^*(x,y_1)+ r^*(x,y_2)}$$
To ensure the reward terms are all positive, the Bradley-Terry model is introduced.

- Bradley-Terry:

$$p^{*}(y_w \succ y_l \mid x) = \frac{\exp(r^*(x,y_w))}{\exp(r^*(x,y_w)) + \exp(r^*(x,y_l))}$$
Cross-entropy: let $a_x = \exp(r^*(x,y_w))$ and $a_y = \exp(r^*(x,y_l))$. Then

$$
\begin{aligned}
Loss &= -E_{(a_x,a_y)\sim D}\left[\ln\frac{a_x}{a_x+a_y}\right] \\
&= - E_{(x,y_w,y_l)\sim D}\left[\ln\frac{\exp(r^*(x,y_w))}{\exp(r^*(x,y_w))+\exp(r^*(x,y_l))}\right] \\
&= - E_{(x,y_w,y_l)\sim D}\left[\ln\frac{1}{1+\exp\big(r^*(x,y_l)-r^*(x,y_w)\big)}\right] \\
&= - E_{(x,y_w,y_l)\sim D}\left[\ln \sigma\big(r^*(x,y_w) - r^*(x,y_l)\big)\right]
\end{aligned}
$$
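As a quick illustration, here is a minimal PyTorch sketch of this pairwise cross-entropy over reward-model scores. The tensor names `rewards_w` and `rewards_l` are hypothetical scalar rewards for the chosen and rejected responses; this is a sketch under those assumptions, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(rewards_w: torch.Tensor, rewards_l: torch.Tensor) -> torch.Tensor:
    """Pairwise cross-entropy: -E[log sigmoid(r(x, y_w) - r(x, y_l))]."""
    # logsigmoid is numerically stabler than log(sigmoid(...))
    return -F.logsigmoid(rewards_w - rewards_l).mean()

# toy usage: scalar rewards for 3 preference pairs
rw = torch.tensor([1.2, 0.3, 2.0], requires_grad=True)
rl = torch.tensor([0.5, 0.9, 1.0], requires_grad=True)
loss = bradley_terry_loss(rw, rl)
loss.backward()
```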
- KL divergence:

$$KL(P\,\|\,Q) = \sum_{x\in X}P(x)\log\frac{P(x)}{Q(x)}$$

where $P(x)$ and $Q(x)$ are the true data distribution and the model's predicted distribution, respectively.
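A minimal numerical sketch of this definition for categorical distributions, with made-up probability vectors:

```python
import torch

def kl_divergence(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """KL(P || Q) for categorical distributions given as probability vectors."""
    return torch.sum(p * torch.log(p / q))

p = torch.tensor([0.4, 0.4, 0.2])  # "true" distribution
q = torch.tensor([0.3, 0.5, 0.2])  # model distribution
print(kl_divergence(p, q))  # small positive value; zero iff p == q
```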
- DPO objective: collect as much reward as possible while staying as close as possible to the reference model.

$$
\begin{aligned}
&\max_{\pi}\; E_{x\in X, y \in \pi}[r(x,y)] - \beta\,\mathbb{D}_{KL}\big[\pi(y|x)\,\|\,\pi_{ref}(y|x)\big] \\
&= \max_{\pi}\; E_{x\in X, y \in \pi}[r(x,y)] - E_{x\in X, y \in \pi}\left[\beta\log \frac{\pi(y|x)}{\pi_{ref}(y|x)}\right] \\
&= \max_{\pi}\; E_{x\in X, y \in \pi}\left[r(x,y) - \beta\log \frac{\pi(y|x)}{\pi_{ref}(y|x)}\right] \\
&= \min_{\pi}\; E_{x\in X, y \in \pi}\left[\log \frac{\pi(y|x)}{\pi_{ref}(y|x)} - \frac{1}{\beta}r(x,y)\right] \\
&= \min_{\pi}\; E_{x\in X, y \in \pi}\left[\log \frac{\pi(y|x)}{\pi_{ref}(y|x)} - \log \exp\!\Big(\frac{1}{\beta}r(x,y)\Big)\right] \\
&= \min_{\pi}\; E_{x\in X, y \in \pi}\left[\log \frac{\pi(y|x)}{\pi_{ref}(y|x)\cdot\exp\!\big(\frac{1}{\beta}r(x,y)\big)}\right] \\
&= \min_{\pi}\; E_{x\in X, y \in \pi}\left[\log \frac{\pi(y|x)}{\frac{1}{Z(x)}\pi_{ref}(y|x)\cdot\exp\!\big(\frac{1}{\beta}r(x,y)\big)} - \log Z(x)\right]
\end{aligned}
$$

(Dividing by $-\beta$ in the fourth step turns the maximization into a minimization.)
Define the partition function $Z(x)$ as:

$$Z(x) = \sum_{y} \pi_{ref}(y|x)\exp\!\Big(\frac{1}{\beta}r(x,y)\Big)$$

and let:

$$\frac{1}{Z(x)}\pi_{ref}(y|x)\cdot\exp\!\Big(\frac{1}{\beta}r(x,y)\Big) = \frac{\pi_{ref}(y|x)\cdot\exp\!\big(\frac{1}{\beta}r(x,y)\big)}{\sum_{y} \pi_{ref}(y|x)\exp\!\big(\frac{1}{\beta}r(x,y)\big)} = \pi^*(y|x)$$
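To see what this closed form does, here is a tiny numerical sketch of $\pi^*$ as a reweighting of $\pi_{ref}$ over a toy discrete set of candidate responses; the probabilities and rewards are made up for illustration.

```python
import torch

beta = 1.0                                    # larger beta keeps pi* closer to pi_ref
pi_ref = torch.tensor([0.5, 0.3, 0.2])        # reference policy over 3 candidate responses
rewards = torch.tensor([1.0, 2.0, 0.5])       # made-up rewards r(x, y)

unnorm = pi_ref * torch.exp(rewards / beta)   # pi_ref(y|x) * exp(r(x,y)/beta)
pi_star = unnorm / unnorm.sum()               # divide by Z(x)
print(pi_star)                                # mass shifts toward the high-reward response
```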
Continuing to simplify the DPO objective:

$$\min_{\pi}\; E_{x\in X, y \in \pi}\left[\log \frac{\pi(y|x)}{\frac{1}{Z(x)}\pi_{ref}(y|x)\cdot\exp\!\big(\frac{1}{\beta}r(x,y)\big)} - \log Z(x)\right] = \min_{\pi}\; E_{x\in X, y \in \pi}\left[\log \frac{\pi(y|x)}{\pi^*(y|x)} - \log Z(x)\right]$$

Since $Z(x)$ does not depend on $\pi$, it can be dropped from the optimization:

$$\min_{\pi}\; E_{x\in X, y \in \pi}\left[\log \frac{\pi(y|x)}{\pi^*(y|x)} - \log Z(x)\right] = \min_{\pi}\; E_{x\in X, y \in \pi}\left[\log \frac{\pi(y|x)}{\pi^*(y|x)}\right] = \min_{\pi}\; E_{x \sim D}\Big[\mathbb{D}_{KL}\big(\pi(y|x)\,\|\,\pi^*(y|x)\big)\Big]$$
The objective, i.e. the KL divergence $\mathbb{D}_{KL}$, is minimized exactly when the two distributions coincide:

$$\pi(y|x) = \pi^*(y|x) = \frac{1}{Z(x)}\pi_{ref}(y|x)\cdot\exp\!\Big(\frac{1}{\beta}r(x,y)\Big)$$

Solving this back for the reward function $r(x,y)$:

$$r(x,y) = \beta \log\frac{\pi(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x)$$
Having obtained this implicit expression for the reward, substitute it into the Bradley-Terry cross-entropy loss:

$$Loss = - E_{(x,y_w,y_l)\sim D}\left[\ln \sigma\big(r^*(x,y_w) - r^*(x,y_l)\big)\right] = - E_{(x,y_w,y_l)\sim D}\left[\ln \sigma\!\left(\beta \log\frac{\pi(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right]$$

(The $\beta \log Z(x)$ terms cancel in the difference.) With that, the entire mathematical derivation is complete. One has to say: seriously impressive.
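To make the final objective concrete, here is a minimal PyTorch sketch of the DPO loss computed from per-sequence log-probabilities. The tensor names (`policy_logp_w`, `ref_logp_w`, etc.) are illustrative assumptions, not taken from any reference implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-E[log sigmoid(beta*(log pi/pi_ref)_w - beta*(log pi/pi_ref)_l)].

    Each argument is a tensor of summed token log-probabilities log p(y|x)
    for the chosen (w) / rejected (l) responses under the policy or the
    frozen reference model.
    """
    ratio_w = policy_logp_w - ref_logp_w   # log pi(y_w|x) - log pi_ref(y_w|x)
    ratio_l = policy_logp_l - ref_logp_l   # log pi(y_l|x) - log pi_ref(y_l|x)
    logits = beta * (ratio_w - ratio_l)    # implicit reward difference
    return -F.logsigmoid(logits).mean()

# toy usage with made-up log-probabilities for a batch of 2 pairs
pw = torch.tensor([-12.3, -20.1], requires_grad=True)
pl = torch.tensor([-14.0, -19.5], requires_grad=True)
rw = torch.tensor([-12.8, -20.4])
rl = torch.tensor([-13.5, -19.8])
print(dpo_loss(pw, pl, rw, rl))
```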
- Gradient analysis: differentiating the loss above with respect to $\theta$ gives

$$\nabla_\theta Loss(\pi_{\theta};\pi_{ref}) = - E_{(x,y_w,y_l)\sim D}\left[\beta\, \sigma\!\left(\beta \log \frac{\pi(y_l|x)}{\pi_{ref}(y_l|x)} - \beta \log\frac{\pi(y_w|x)}{\pi_{ref}(y_w|x)}\right) \big[\nabla_{\theta}\log \pi(y_w|x) - \nabla_{\theta}\log \pi(y_l|x)\big]\right]$$
Now let:

$$\hat{r}(x,y) = \beta \log\frac{\pi_{\theta}(y|x)}{\pi_{ref}(y|x)}$$

Final form:

$$\nabla_\theta Loss(\pi_{\theta};\pi_{ref}) = -\beta\, E_{(x,y_w,y_l)\sim D}\Bigg[\underbrace{\sigma\big(\hat{r}(x,y_l) - \hat{r}(x,y_w)\big)}_{\text{higher weight when reward estimate is wrong}}\Big[\underbrace{\nabla_{\theta}\log \pi(y_w|x)}_{\text{increase likelihood of } y_w} - \underbrace{\nabla_{\theta}\log \pi(y_l|x)}_{\text{decrease likelihood of } y_l}\Big]\Bigg]$$
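As a sanity check on the weighting term, this small sketch compares the per-pair weight $\sigma(\hat{r}(x,y_l) - \hat{r}(x,y_w))$ for a correctly ranked pair and a mis-ranked one; the implicit-reward values are made up.

```python
import torch

# implicit rewards r_hat = beta * log(pi/pi_ref), made-up numbers:
# pair 0 is ranked correctly (r_hat_w > r_hat_l), pair 1 is not
r_hat_w = torch.tensor([0.5, -0.2])
r_hat_l = torch.tensor([0.1,  0.3])

weight = torch.sigmoid(r_hat_l - r_hat_w)
print(weight)  # ~[0.40, 0.62]: the mis-ranked pair gets the larger update weight
```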
- Improvement: ODPO. The main shortcoming of DPO is that the Bradley-Terry model only gives the probability that one response is better than another, without saying how much better it is.
- ODPO core idea: bringing this "how much better" information into the preference model should yield gains, i.e. add a margin to the DPO loss. This amounts to requiring the score of the preferred response to exceed the score of the dispreferred response by at least an offset value, which increases the penalty on pairs whose responses are close in quality.

$$Loss^{odpo}= - E_{(x,y_w,y_l)\sim D}\left[\ln \sigma\big(r^*(x,y_w) - r^*(x,y_l) - \delta_r\big)\right], \qquad \delta_r = \alpha \log\big(r(y_w)- r(y_l)\big)$$
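A minimal sketch of this margin variant, reusing the hypothetical log-probability inputs from the DPO sketch above; `offset` stands for $\delta_r$, computed from whatever scoring function supplies the quality gap.

```python
import torch
import torch.nn.functional as F

def odpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l,
              offset, beta=0.1):
    """DPO loss with an additive margin: -E[log sigmoid(logits - offset)]."""
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits - offset).mean()

# toy usage: a larger offset demands a larger implicit-reward gap
pw = torch.tensor([-12.3])
pl = torch.tensor([-14.0])
rw = torch.tensor([-12.8])
rl = torch.tensor([-13.5])
offset = torch.tensor([0.5])   # e.g. alpha * log(score(y_w) - score(y_l))
print(odpo_loss(pw, pl, rw, rl, offset))
```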
- Related improvements: IPO and KTO, both of which likewise require no reward model.