1 KL Divergence
The purpose of the KL divergence is to describe the difference between two distributions. First we need a way to measure a single distribution, which is done with entropy.
1.1 Entropy
Before introducing entropy, we first quantify the information content of a single event:
$$
I(x) = -\log P(x)
$$
The expected information content of the whole distribution is then
$$
\begin{aligned}
H(P) &= E_{x\sim P}[-\log P(x)] \\
&= -\sum_x P(x)\log P(x)
\end{aligned}
$$
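As a quick numerical illustration, a minimal NumPy sketch, assuming the distribution is given as a probability vector:

```python
import numpy as np

def entropy(p):
    """H(P) = -sum_x P(x) log P(x), in nats; 0 * log 0 is treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # skip zero-probability outcomes
    return float(-np.sum(nz * np.log(nz)))

# A uniform distribution over 4 outcomes has entropy log(4) ≈ 1.386 nats;
# a deterministic distribution has entropy 0.
print(entropy([0.25, 0.25, 0.25, 0.25]))
print(entropy([1.0, 0.0, 0.0, 0.0]))
```

The uniform distribution maximizes entropy over a fixed number of outcomes, while a deterministic one carries no information at all.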
1.2 KL Divergence
Suppose the true distribution of the data is P(x), but we mistakenly model it as Q(x). An event that should carry -log P(x) units of information is then encoded as -log Q(x) instead. The KL divergence is the expected extra cost, taken under the true distribution P:
$$
D_{KL}(P||Q) = E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \sum_x P(x)\log\frac{P(x)}{Q(x)}
$$
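The sum form translates directly into code. A small sketch for discrete distributions (the probability vectors are made-up examples), which also illustrates that the KL divergence is non-negative, is zero only when P = Q, and is not symmetric:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P||Q) = sum_x P(x) log(P(x)/Q(x)); terms with P(x)=0 contribute 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.3, 0.2]  # made-up example distributions
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # positive: the distributions differ
print(kl_divergence(p, p))  # 0.0: a distribution has no divergence from itself
print(kl_divergence(q, p))  # differs from D_KL(P||Q): KL is not symmetric
```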
1.3 Applications
- KL divergence in softmax classification

For each sample, the true distribution P puts probability 1 on the correct class k, and Q is the softmax output:

$$
\begin{aligned}
P(x_k) &= 1, \quad Q(x_k) = \frac{e^{x_k}}{e^{x_1}+e^{x_2}+\dots+e^{x_n}} \\
D_{KL}(P||Q) &= -\log Q(x_k) = -\log\frac{e^{x_k}}{e^{x_1}+e^{x_2}+\dots+e^{x_n}}
\end{aligned}
$$
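Because P is one-hot, the sum collapses to the single term -log Q(x_k), which is exactly the cross-entropy loss used to train softmax classifiers. A minimal sketch (the logits are made up for illustration):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def kl_one_hot(logits, k):
    """D_KL(P||Q) when P is one-hot at class k: only -log Q(x_k) survives."""
    return float(-np.log(softmax(logits)[k]))

logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-class problem
print(kl_one_hot(logits, 0))  # the cross-entropy loss for true class 0
```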
- KL divergence for a Gaussian distribution

Take P to be the standard normal N(0, 1) and Q to be N(μ, σ²):

$$
\begin{aligned}
P(x) &= \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} \\
\log P(x) &= -\frac{1}{2}\log(2\pi) - \frac{x^2}{2} \\
Q(x) &= \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \\
\log Q(x) &= -\frac{1}{2}\log(2\pi) - \frac{(x-\mu)^2}{2\sigma^2} - \log\sigma \\
D_{KL}(P||Q) &= E_p[\log P(x) - \log Q(x)] = E_p\left[\log\sigma + \frac{(x-\mu)^2}{2\sigma^2} - \frac{x^2}{2}\right] \\
&= \log\sigma + \frac{1}{2\sigma^2}E_p[(x-\mu)^2] - \frac{1}{2}E_p[x^2] \\
&= \log\sigma + \frac{1+\mu^2}{2\sigma^2} - \frac{1}{2}
\end{aligned}
$$

where E_p[x²] = 1 for the standard normal.
Here the intuition for E_p[(x-μ)²] is that the total squared distance equals the jitter (variance) plus the squared offset:
$$
\begin{aligned}
E_p[(x-\mu)^2] &= E_p[(x - E(x) + E(x) - \mu)^2] \\
&= E_p[(x-E(x))^2] + 2E_p[x-E(x)][E(x)-\mu] + E_p[(E(x)-\mu)^2] \\
&= \mathrm{var}(x) + \mu^2
\end{aligned}
$$

The cross term vanishes because E_p[x - E(x)] = 0, and under the standard normal E(x) = 0 and var(x) = 1, which gives the 1 + μ² above.
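As a sanity check, the closed form can be compared against a direct numerical estimate of E_P[log P(x) - log Q(x)]; a sketch using a Riemann sum on a wide grid (the grid width and the example μ, σ are arbitrary choices):

```python
import numpy as np

def gaussian_kl_closed_form(mu, sigma):
    """D_KL(N(0,1) || N(mu, sigma^2)) from the closed form derived above."""
    return float(np.log(sigma) + (1 + mu**2) / (2 * sigma**2) - 0.5)

def gaussian_kl_numeric(mu, sigma, dx=1e-4):
    """Riemann-sum estimate of the integral of P(x) * (log P(x) - log Q(x))."""
    x = np.arange(-10.0, 10.0, dx)
    log_p = -0.5 * np.log(2 * np.pi) - x**2 / 2
    log_q = -0.5 * np.log(2 * np.pi) - (x - mu)**2 / (2 * sigma**2) - np.log(sigma)
    return float(np.sum(np.exp(log_p) * (log_p - log_q)) * dx)

mu, sigma = 1.5, 0.8  # arbitrary example parameters
print(gaussian_kl_closed_form(mu, sigma))  # matches the numeric estimate below
print(gaussian_kl_numeric(mu, sigma))
```

Truncating the grid at ±10 is safe here because the standard-normal density is vanishingly small that far from the mean.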