Paper Reading Notes: Denoising Diffusion Probabilistic Models (3)

Paper Reading Notes: Denoising Diffusion Probabilistic Models (1)
Paper Reading Notes: Denoising Diffusion Probabilistic Models (2)
Paper Reading Notes: Denoising Diffusion Probabilistic Models (3)

4. Term-by-Term Analysis of the Loss Function

As shown in the previous part, the loss $L$ decomposes into three terms. First, consider $L_1$:
$$
\begin{aligned}
L_1 &= E_{x_{1:T}\sim q(x_{1:T}|x_0)}\left(\log\Big[\frac{q(x_T|x_0)}{p(x_T)}\Big]\right)\\
&= \int dx_{1:T}\, q(x_{1:T}|x_0)\,\log\Big[\frac{q(x_T|x_0)}{p(x_T)}\Big]\\
&= \int dx_{1:T}\,\frac{q(x_{1:T}|x_0)}{q(x_T|x_0)}\, q(x_T|x_0)\,\log\Big[\frac{q(x_T|x_0)}{p(x_T)}\Big]\\
&= \int dx_{1:T}\,\underbrace{q(x_{1:T-1}|x_0,x_T)}_{q(x_{1:T}|x_0)=q(x_T|x_0)\, q(x_{1:T-1}|x_0,x_T)}\; q(x_T|x_0)\,\log\Big[\frac{q(x_T|x_0)}{p(x_T)}\Big]\\
&= \int\Bigg(\underbrace{\int q(x_{1:T-1}|x_0,x_T)\prod_{k=1}^{T-1}dx_k}_{\text{the inner integral over }x_{1:T-1}\text{ equals }1}\Bigg)\, q(x_T|x_0)\,\log\Big[\frac{q(x_T|x_0)}{p(x_T)}\Big]\, dx_T\\
&= \int q(x_T|x_0)\,\log\Big[\frac{q(x_T|x_0)}{p(x_T)}\Big]\, dx_T\\
&= E_{x_T\sim q(x_T|x_0)}\log\Big[\frac{q(x_T|x_0)}{p(x_T)}\Big]\\
&= KL\Big(q(x_T|x_0)\,\|\,p(x_T)\Big)
\end{aligned}
$$

So $L_1$ is the KL divergence between $q(x_T|x_0)$ and $p(x_T)$. $q(x_T|x_0)$ is the endpoint of the forward noising process and tends to a standard normal distribution, while $p(x_T)$ is a Gaussian, as stated in Section 2 (Background) of "Denoising Diffusion Probabilistic Models". The KL divergence between two Gaussians has a closed form, so $L_1$ can be computed directly; more importantly, it contains no learnable parameters, so it is a constant and can be dropped from the loss.
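
The claim that $L_1$ is negligible is easy to check numerically. The sketch below assumes the linear noise schedule used in the paper ($\beta_1=10^{-4}$ to $\beta_T=0.02$, $T=1000$); under it, $q(x_T|x_0)=\mathcal{N}(\sqrt{\bar{\alpha}_T}\,x_0,\,(1-\bar{\alpha}_T)I)$ is almost exactly $\mathcal{N}(0, I)$, so its KL against $p(x_T)=\mathcal{N}(0, I)$ is tiny:

```python
# A minimal numerical check, assuming the paper's linear beta schedule (1e-4 to 0.02, T=1000).
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

mean_scale = np.sqrt(alphas_cumprod[-1])  # coefficient of x_0 in the mean of q(x_T|x_0)
variance = 1.0 - alphas_cumprod[-1]       # variance of q(x_T|x_0)
print(mean_scale, variance)               # ~6e-3 and ~0.99996: essentially N(0, I)

# per-dimension KL(N(mu, var) || N(0, 1)) in the worst case |x_0| = 1
mu = mean_scale * 1.0
kl = 0.5 * (mu**2 + variance - 1.0 - np.log(variance))
print(kl)                                 # ~2e-5, so dropping L_1 changes nothing in practice
```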

Next, consider the second term $L_2$:

$$
\begin{aligned}
L_2 &= E_{x_{1:T}\sim q(x_{1:T}|x_0)}\left(\sum_{t=2}^{T}\log\Big[\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}\Big]\right)\\
&= \sum_{t=2}^{T} E_{x_{1:T}\sim q(x_{1:T}|x_0)}\left(\log\Big[\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}\Big]\right)\\
&= \sum_{t=2}^{T}\left(\int dx_{1:T}\, q(x_{1:T}|x_0)\,\log\Big[\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}\Big]\right)\\
&= \sum_{t=2}^{T}\left(\int dx_{1:T}\,\frac{q(x_{1:T}|x_0)}{q(x_{t-1}|x_t,x_0)}\, q(x_{t-1}|x_t,x_0)\,\log\Big[\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}\Big]\right)\\
&= \sum_{t=2}^{T}\left(\int dx_{1:T}\,\underbrace{\frac{q(x_{0:T})}{q(x_0)}}_{q(x_{0:T})=q(x_0)\, q(x_{1:T}|x_0)}\cdot\underbrace{\frac{q(x_t,x_0)}{q(x_t,x_{t-1},x_0)}}_{q(x_t,x_{t-1},x_0)=q(x_t,x_0)\, q(x_{t-1}|x_t,x_0)}\cdot q(x_{t-1}|x_t,x_0)\,\log\Big[\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}\Big]\right)\\
&= \sum_{t=2}^{T}\left(\int dx_{1:T}\,\frac{q(x_{0:T})}{q(x_0)}\cdot\frac{q(x_t,x_0)}{q(x_{t-1},x_0)\, q(x_t|x_{t-1},x_0)}\cdot q(x_{t-1}|x_t,x_0)\,\log\Big[\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}\Big]\right)\\
&= \sum_{t=2}^{T}\left(\int\bigg[\int\frac{q(x_{0:T})}{q(x_0)}\cdot\frac{q(x_t,x_0)}{q(x_{t-1},x_0)\, q(x_t|x_{t-1},x_0)}\prod_{k\geq1,\,k\neq t-1}dx_k\bigg]\, q(x_{t-1}|x_t,x_0)\,\log\Big[\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}\Big]\, dx_{t-1}\right)\\
&= \sum_{t=2}^{T}\left(\int\bigg[\int\frac{q(x_{0:T})}{q(x_{t-1},x_0)}\cdot\frac{q(x_t,x_0)}{q(x_0)\, q(x_t|x_{t-1},x_0)}\prod_{k\geq1,\,k\neq t-1}dx_k\bigg]\, q(x_{t-1}|x_t,x_0)\,\log\Big[\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}\Big]\, dx_{t-1}\right)\\
&= \sum_{t=2}^{T}\left(\int\bigg[\int\underbrace{q(x_{k:k\geq1,k\neq t-1}|x_{t-1},x_0)}_{q(x_{0:T})=q(x_{t-1},x_0)\, q(x_{k:k\geq1,k\neq t-1}|x_{t-1},x_0)}\cdot\underbrace{\frac{q(x_t|x_0)}{q(x_t|x_{t-1},x_0)}}_{q(x_t,x_0)=q(x_0)\, q(x_t|x_0)}\prod_{k\geq1,\,k\neq t-1}dx_k\bigg]\, q(x_{t-1}|x_t,x_0)\,\log\Big[\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}\Big]\, dx_{t-1}\right)\\
&= \sum_{t=2}^{T}\left(\int\bigg[\int q(x_{k:k\geq1,k\neq t-1}|x_{t-1},x_0)\cdot\underbrace{\frac{q(x_t|x_0)}{q(x_t|x_{t-1},x_0)}}_{=1}\prod_{k\geq1,\,k\neq t-1}dx_k\bigg]\, q(x_{t-1}|x_t,x_0)\,\log\Big[\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}\Big]\, dx_{t-1}\right)\\
&= \sum_{t=2}^{T}\left(\int\bigg[\underbrace{\int q(x_{k:k\geq1,k\neq t-1}|x_{t-1},x_0)\prod_{k\geq1,\,k\neq t-1}dx_k}_{=1}\bigg]\, q(x_{t-1}|x_t,x_0)\,\log\Big[\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}\Big]\, dx_{t-1}\right)\\
&= \sum_{t=2}^{T}\left(\int q(x_{t-1}|x_t,x_0)\,\log\Big[\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}\Big]\, dx_{t-1}\right)\\
&= \sum_{t=2}^{T}\left(E_{x_{t-1}\sim q(x_{t-1}|x_t,x_0)}\log\Big[\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}\Big]\right)\\
&= \sum_{t=2}^{T} KL\Big(q(x_{t-1}|x_t,x_0)\,\|\,p_{\theta}(x_{t-1}|x_t)\Big)
\end{aligned}
$$
Finally, consider $L_3$. In fact, the paper "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" notes that, to avoid edge effects, $p(x_0|x_1)$ is forced to equal $q(x_1|x_0)$, so this term is also a constant.

From the analysis above, the loss function can be written as Equation (3).
$$
\begin{aligned}
L &:= L_1 + L_2 + L_3\\
&= KL\Big(q(x_T|x_0)\,\|\,p(x_T)\Big) + \sum_{t=2}^{T} KL\Big(q(x_{t-1}|x_t,x_0)\,\|\,p_{\theta}(x_{t-1}|x_t)\Big) - \log\Big[p_{\theta}(x_0|x_1)\Big]
\end{aligned}
\tag{3}
$$

Ignoring $L_1$ and $L_3$, the loss function can be written as Equation (4).
$$
L := \sum_{t=2}^{T} KL\Big(q(x_{t-1}|x_t,x_0)\,\|\,p_{\theta}(x_{t-1}|x_t)\Big)
\tag{4}
$$

Each term of the loss $L$ is therefore a KL divergence between two Gaussian distributions, $q(x_{t-1}|x_t,x_0)$ and $p_{\theta}(x_{t-1}|x_t)$. From Paper Reading Notes: Denoising Diffusion Probabilistic Models (1), the standard deviation and mean of $q(x_{t-1}|x_t,x_0)$ are

$$
\begin{aligned}
\sigma_1 &= \sqrt{\frac{\beta_t\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}}\\
\mu_1 &= \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, z_t\right)\\
\text{or}\quad \mu_1 &= \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t + \frac{\beta_t\,\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_t}\, x_0
\end{aligned}
$$
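
As a quick sanity check, the two expressions for $\mu_1$ can be compared numerically. The sketch below uses the paper's linear schedule and an arbitrary timestep; both choices are assumptions made only for this check:

```python
# Minimal check that the epsilon-form and the (x_t, x_0)-form of mu_1 coincide.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)

t = 500                                     # arbitrary timestep with 2 <= t <= T
beta_t, alpha_t = betas[t - 1], alphas[t - 1]
abar_t, abar_prev = alphas_cumprod[t - 1], alphas_cumprod[t - 2]

x0 = rng.uniform(-1.0, 1.0, size=8)         # a toy "image" with 8 pixels
z = rng.standard_normal(8)                  # the noise used to build x_t
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * z

mu_eps = (x_t - beta_t / np.sqrt(1.0 - abar_t) * z) / np.sqrt(alpha_t)
mu_x0 = (np.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t) * x_t
         + beta_t * np.sqrt(abar_prev) / (1.0 - abar_t) * x0)
print(np.allclose(mu_eps, mu_x0))           # True: the two parameterizations agree
```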

The mean and standard deviation of $p_{\theta}(x_{t-1}|x_t)$ are estimated by a model (a deep network or otherwise) and are denoted $\mu_2$ and $\sigma_2$, respectively.
The per-timestep loss $L$ can therefore be written explicitly as the KL divergence between two Gaussians:
$$
L := \log\Big[\frac{\sigma_2}{\sigma_1}\Big] + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}
$$
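
This closed form is the same Gaussian KL that the `normal_kl` helper in the official code evaluates (there parameterized by log-variances). A small sketch with hypothetical means and standard deviations, comparing the formula against a Monte Carlo estimate:

```python
# Check the closed-form 1-D Gaussian KL against a Monte Carlo estimate (hypothetical parameters).
import numpy as np

mu1, sigma1 = 0.3, 0.8   # plays the role of q(x_{t-1}|x_t, x_0)
mu2, sigma2 = 0.1, 1.1   # plays the role of p_theta(x_{t-1}|x_t)

kl_closed = np.log(sigma2 / sigma1) + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5

rng = np.random.default_rng(0)
x = rng.normal(mu1, sigma1, size=1_000_000)
log_q = -0.5 * ((x - mu1) / sigma1) ** 2 - np.log(sigma1) - 0.5 * np.log(2 * np.pi)
log_p = -0.5 * ((x - mu2) / sigma2) ** 2 - np.log(sigma2) - 0.5 * np.log(2 * np.pi)
kl_mc = (log_q - log_p).mean()

print(kl_closed, kl_mc)  # the two agree up to Monte Carlo error
```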

5. Code Walkthrough

Finally, let's walk through the official code (diffusion - https://github.com/hojonathanho/diffusion) to understand the training and inference procedures.
First, the training code:

```python
import functools

import numpy as np
import tensorflow as tf

from diffusion_tf import nn, utils  # helper modules from the original repo


class GaussianDiffusion2:
  """
  Contains utilities for the diffusion model.

  Arguments:
  - what the network predicts (x_{t-1}, x_0, or epsilon)
  - which loss function (kl or unweighted MSE)
  - what is the variance of p(x_{t-1}|x_t) (learned, fixed to beta, or fixed to weighted beta)
  - what type of decoder, and how to weight its loss? is its variance learned too?
  """

  # definitions of the noise schedule and the constants derived from it
  def __init__(self, *, betas, model_mean_type, model_var_type, loss_type):
    self.model_mean_type = model_mean_type  # xprev, xstart, eps
    self.model_var_type = model_var_type  # learned, fixedsmall, fixedlarge
    self.loss_type = loss_type  # kl, mse

    assert isinstance(betas, np.ndarray)
    self.betas = betas = betas.astype(np.float64)  # computations here in float64 for accuracy
    assert (betas > 0).all() and (betas <= 1).all()
    timesteps, = betas.shape
    self.num_timesteps = int(timesteps)

    alphas = 1. - betas
    self.alphas_cumprod = np.cumprod(alphas, axis=0)
    self.alphas_cumprod_prev = np.append(1., self.alphas_cumprod[:-1])
    assert self.alphas_cumprod_prev.shape == (timesteps,)

    # calculations for diffusion q(x_t | x_{t-1}) and others
    self.sqrt_alphas_cumprod = np.sqrt(self.alphas_cumprod)
    self.sqrt_one_minus_alphas_cumprod = np.sqrt(1. - self.alphas_cumprod)
    self.log_one_minus_alphas_cumprod = np.log(1. - self.alphas_cumprod)
    self.sqrt_recip_alphas_cumprod = np.sqrt(1. / self.alphas_cumprod)
    self.sqrt_recipm1_alphas_cumprod = np.sqrt(1. / self.alphas_cumprod - 1)

    # calculations for posterior q(x_{t-1} | x_t, x_0)
    self.posterior_variance = betas * (1. - self.alphas_cumprod_prev) / (1. - self.alphas_cumprod)
    # below: log calculation clipped because the posterior variance is 0 at the beginning of the diffusion chain
    self.posterior_log_variance_clipped = np.log(
        np.append(self.posterior_variance[1], self.posterior_variance[1:]))
    self.posterior_mean_coef1 = betas * np.sqrt(self.alphas_cumprod_prev) / (1. - self.alphas_cumprod)
    self.posterior_mean_coef2 = (1. - self.alphas_cumprod_prev) * np.sqrt(alphas) / (1. - self.alphas_cumprod)

  # a method of the Model class in the training script (not of GaussianDiffusion2), shown here for context
  def train_fn(self, x, y):
    B, H, W, C = x.shape
    if self.randflip:
      x = tf.image.random_flip_left_right(x)
      assert x.shape == [B, H, W, C]
    # randomly draw the timestep t
    t = tf.random_uniform([B], 0, self.diffusion.num_timesteps, dtype=tf.int32)
    # compute the loss for timestep t
    losses = self.diffusion.training_losses(
        denoise_fn=functools.partial(self._denoise, y=y, dropout=self.dropout), x_start=x, t=t)
    assert losses.shape == t.shape == [B]
    return {'loss': tf.reduce_mean(losses)}

  # sample the noisy image at step t directly from x_start
  def q_sample(self, x_start, t, noise=None):
    """
    Diffuse the data (t == 0 means diffused for 1 step)
    """
    if noise is None:
      noise = tf.random_normal(shape=x_start.shape)
    assert noise.shape == x_start.shape
    return (
        self._extract(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start +
        self._extract(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise)

  # compute the mean and variance of the posterior q(x_{t-1} | x_t, x_0)
  def q_posterior_mean_variance(self, x_start, x_t, t):
    """
    Compute the mean and variance of the diffusion posterior q(x_{t-1} | x_t, x_0)
    """
    assert x_start.shape == x_t.shape
    posterior_mean = (
        self._extract(self.posterior_mean_coef1, t, x_t.shape) * x_start +
        self._extract(self.posterior_mean_coef2, t, x_t.shape) * x_t)
    posterior_variance = self._extract(self.posterior_variance, t, x_t.shape)
    posterior_log_variance_clipped = self._extract(self.posterior_log_variance_clipped, t, x_t.shape)
    assert (posterior_mean.shape[0] == posterior_variance.shape[0] ==
            posterior_log_variance_clipped.shape[0] == x_start.shape[0])
    return posterior_mean, posterior_variance, posterior_log_variance_clipped

  # estimate the mean and variance of p(x_{t-1} | x_t) with the deep model (UNet)
  def p_mean_variance(self, denoise_fn, *, x, t, clip_denoised: bool, return_pred_xstart: bool):
    B, H, W, C = x.shape
    assert t.shape == [B]
    model_output = denoise_fn(x, t)

    # Learned or fixed variance?
    if self.model_var_type == 'learned':
      assert model_output.shape == [B, H, W, C * 2]
      model_output, model_log_variance = tf.split(model_output, 2, axis=-1)
      model_variance = tf.exp(model_log_variance)
    elif self.model_var_type in ['fixedsmall', 'fixedlarge']:
      # below: only log_variance is used in the KL computations
      model_variance, model_log_variance = {
          # for fixedlarge, we set the initial (log-)variance like so to get a better decoder log likelihood
          'fixedlarge': (self.betas, np.log(np.append(self.posterior_variance[1], self.betas[1:]))),
          'fixedsmall': (self.posterior_variance, self.posterior_log_variance_clipped),
      }[self.model_var_type]
      model_variance = self._extract(model_variance, t, x.shape) * tf.ones(x.shape.as_list())
      model_log_variance = self._extract(model_log_variance, t, x.shape) * tf.ones(x.shape.as_list())
    else:
      raise NotImplementedError(self.model_var_type)

    # Mean parameterization
    _maybe_clip = lambda x_: (tf.clip_by_value(x_, -1., 1.) if clip_denoised else x_)
    if self.model_mean_type == 'xprev':  # the model predicts x_{t-1}
      pred_xstart = _maybe_clip(self._predict_xstart_from_xprev(x_t=x, t=t, xprev=model_output))
      model_mean = model_output
    elif self.model_mean_type == 'xstart':  # the model predicts x_0
      pred_xstart = _maybe_clip(model_output)
      model_mean, _, _ = self.q_posterior_mean_variance(x_start=pred_xstart, x_t=x, t=t)
    elif self.model_mean_type == 'eps':  # the model predicts epsilon
      pred_xstart = _maybe_clip(self._predict_xstart_from_eps(x_t=x, t=t, eps=model_output))
      model_mean, _, _ = self.q_posterior_mean_variance(x_start=pred_xstart, x_t=x, t=t)
    else:
      raise NotImplementedError(self.model_mean_type)

    assert model_mean.shape == model_log_variance.shape == pred_xstart.shape == x.shape
    if return_pred_xstart:
      return model_mean, model_variance, model_log_variance, pred_xstart
    else:
      return model_mean, model_variance, model_log_variance

  # computation of the training loss
  def training_losses(self, denoise_fn, x_start, t, noise=None):
    assert t.shape == [x_start.shape[0]]
    # draw a random noise sample
    if noise is None:
      noise = tf.random_normal(shape=x_start.shape, dtype=x_start.dtype)
    assert noise.shape == x_start.shape and noise.dtype == x_start.dtype
    # add the noise to x_start to obtain the noisy image at step t
    x_t = self.q_sample(x_start=x_start, t=t, noise=noise)
    # two loss types are supported, 'kl' and 'mse'; in practice they behave very similarly
    if self.loss_type == 'kl':  # the variational bound
      losses = self._vb_terms_bpd(
          denoise_fn=denoise_fn, x_start=x_start, x_t=x_t, t=t,
          clip_denoised=False, return_pred_xstart=False)
    elif self.loss_type == 'mse':  # unweighted MSE
      assert self.model_var_type != 'learned'
      target = {
          'xprev': self.q_posterior_mean_variance(x_start=x_start, x_t=x_t, t=t)[0],
          'xstart': x_start,
          'eps': noise
      }[self.model_mean_type]
      model_output = denoise_fn(x_t, t)
      assert model_output.shape == target.shape == x_start.shape
      losses = nn.meanflat(tf.squared_difference(target, model_output))
    else:
      raise NotImplementedError(self.loss_type)

    assert losses.shape == t.shape
    return losses

  # compute the loss with the 'kl' option (the variational bound)
  def _vb_terms_bpd(self, denoise_fn, x_start, x_t, t, *, clip_denoised: bool, return_pred_xstart: bool):
    true_mean, _, true_log_variance_clipped = self.q_posterior_mean_variance(x_start=x_start, x_t=x_t, t=t)
    model_mean, _, model_log_variance, pred_xstart = self.p_mean_variance(
        denoise_fn, x=x_t, t=t, clip_denoised=clip_denoised, return_pred_xstart=True)
    kl = normal_kl(true_mean, true_log_variance_clipped, model_mean, model_log_variance)
    kl = nn.meanflat(kl) / np.log(2.)
    decoder_nll = -utils.discretized_gaussian_log_likelihood(
        x_start, means=model_mean, log_scales=0.5 * model_log_variance)
    assert decoder_nll.shape == x_start.shape
    decoder_nll = nn.meanflat(decoder_nll) / np.log(2.)
    # At the first timestep return the decoder NLL,
    # otherwise return KL(q(x_{t-1}|x_t,x_0) || p(x_{t-1}|x_t))
    assert kl.shape == decoder_nll.shape == t.shape == [x_start.shape[0]]
    output = tf.where(tf.equal(t, 0), decoder_nll, kl)
    return (output, pred_xstart) if return_pred_xstart else output


# KL divergence between two Gaussians; logvar1 and logvar2 are the logarithms of the variances
# (a module-level helper in the original code, used by _vb_terms_bpd above)
def normal_kl(mean1, logvar1, mean2, logvar2):
  """
  KL divergence between normal distributions parameterized by mean and log-variance.
  """
  return 0.5 * (-1.0 + logvar2 - logvar1 + tf.exp(logvar1 - logvar2)
                + tf.squared_difference(mean1, mean2) * tf.exp(-logvar2))
```
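
To make the data flow of `training_losses` concrete without the TensorFlow graph machinery, here is a framework-agnostic NumPy sketch of one training step for the `'eps'`/`'mse'` configuration. The `denoise_fn` below is a stand-in for the real UNet and the schedule is again the paper's linear one; both are assumptions for illustration only:

```python
# NumPy sketch of one training step (eps prediction + unweighted MSE), with a dummy denoiser.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

def denoise_fn(x_t, t):
    # placeholder for the UNet: it should predict the noise that was added at step t
    return np.zeros_like(x_t)

def train_step(x_start):
    B = x_start.shape[0]
    t = rng.integers(0, T, size=B)                         # random timestep per sample
    noise = rng.standard_normal(x_start.shape)             # epsilon
    coef1 = np.sqrt(alphas_cumprod[t])[:, None]
    coef2 = np.sqrt(1.0 - alphas_cumprod[t])[:, None]
    x_t = coef1 * x_start + coef2 * noise                  # the role of q_sample
    model_output = denoise_fn(x_t, t)                      # predicted epsilon
    losses = np.mean((noise - model_output) ** 2, axis=1)  # unweighted MSE per sample
    return losses.mean()

print(train_step(rng.uniform(-1, 1, size=(4, 32 * 32 * 3))))
```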

Next, the inference (sampling) code.

```python
# The two methods below also belong to GaussianDiffusion2.

def p_sample(self, denoise_fn, *, x, t, noise_fn, clip_denoised=True, return_pred_xstart: bool):
  """
  Sample from the model
  """
  # use the deep model to estimate the mean and (log-)variance of x_{t-1} from x_t and t
  model_mean, _, model_log_variance, pred_xstart = self.p_mean_variance(
      denoise_fn, x=x, t=t, clip_denoised=clip_denoised, return_pred_xstart=True)
  noise = noise_fn(shape=x.shape, dtype=x.dtype)
  assert noise.shape == x.shape
  # no noise when t == 0
  nonzero_mask = tf.reshape(1 - tf.cast(tf.equal(t, 0), tf.float32),
                            [x.shape[0]] + [1] * (len(x.shape) - 1))
  # when t > 0, Gaussian noise is added to the estimated mean because the loop continues;
  # when t == 0 the loop stops, so no noise is added and the final result is returned
  sample = model_mean + nonzero_mask * tf.exp(0.5 * model_log_variance) * noise
  assert sample.shape == pred_xstart.shape
  return (sample, pred_xstart) if return_pred_xstart else sample


def p_sample_loop(self, denoise_fn, *, shape, noise_fn=tf.random_normal):
  """
  Generate samples
  """
  assert isinstance(shape, (tuple, list))
  # start the loop counter at the last step, i.e. T - 1
  i_0 = tf.constant(self.num_timesteps - 1, dtype=tf.int32)
  # draw a random noise image as x_T ~ p(x_T)
  img_0 = noise_fn(shape=shape, dtype=tf.float32)
  # loop T times to obtain the final image
  _, img_final = tf.while_loop(
      cond=lambda i_, _: tf.greater_equal(i_, 0),
      body=lambda i_, img_: [
          i_ - 1,
          self.p_sample(denoise_fn=denoise_fn, x=img_, t=tf.fill([shape[0]], i_),
                        noise_fn=noise_fn, return_pred_xstart=False)
      ],
      loop_vars=[i_0, img_0],
      shape_invariants=[i_0.shape, img_0.shape],
      back_prop=False)
  assert img_final.shape == shape
  return img_final
```
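
The same loop can be sketched in NumPy to show what `p_sample_loop` does step by step. Again the denoiser is a stand-in for the trained UNet, and the variance is the `'fixedsmall'` (posterior-variance) choice; both are illustrative assumptions:

```python
# NumPy sketch of the reverse (sampling) loop, mirroring p_sample / p_sample_loop above.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)
alphas_cumprod_prev = np.append(1.0, alphas_cumprod[:-1])
posterior_variance = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)

def denoise_fn(x_t, t):
    # placeholder for the trained UNet that predicts epsilon
    return np.zeros_like(x_t)

def p_sample_loop(shape):
    x = rng.standard_normal(shape)           # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = denoise_fn(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alphas_cumprod[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                            # add noise at every step except the last one
            x = mean + np.sqrt(posterior_variance[t]) * rng.standard_normal(shape)
        else:
            x = mean
    return x

print(p_sample_loop((1, 8)).shape)
```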

