1 Title
    Hierarchical Text-Conditional Image Generation with CLIP Latents (Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen)
2 Conclusion
    Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, this study proposes a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding.
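    A minimal sketch of this two-stage dataflow, assuming hypothetical Prior and Decoder stubs in place of the paper's diffusion prior and diffusion decoder; only the interface between the stages follows the paper, and the 512-dim embedding width is an assumption:

import torch
import torch.nn as nn

EMB_DIM = 512  # assumption: width of a CLIP ViT-B/32 embedding

class Prior(nn.Module):
    """Stub for the prior: maps a text embedding to a predicted CLIP image embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(EMB_DIM, EMB_DIM), nn.GELU(),
                                 nn.Linear(EMB_DIM, EMB_DIM))

    def forward(self, text_emb):
        return self.net(text_emb)

class Decoder(nn.Module):
    """Stub for the decoder: renders an image from a CLIP image embedding."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(EMB_DIM, 3 * 64 * 64)

    def forward(self, img_emb):
        return self.proj(img_emb).view(-1, 3, 64, 64)

def generate(text_emb, prior, decoder):
    img_emb = prior(text_emb)  # stage 1: caption embedding -> CLIP image embedding
    return decoder(img_emb)    # stage 2: image embedding -> image

text_emb = torch.randn(1, EMB_DIM)  # stand-in for a CLIP text encoding
image = generate(text_emb, Prior(), Decoder())
print(image.shape)  # torch.Size([1, 3, 64, 64])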
3 Good Sentences
    1. "We use only spatial convolutions in the model (i.e., no attention layers) and at inference time directly apply the model at the target resolution, observing that it readily generalizes to the higher resolution. We found no benefit from conditioning the upsamplers on the caption, and use unconditional ADMNets [11] with no guidance." (A possible direction for improvement: attention layers could be added to the upsamplers.)
    2. "Although we train a prior to generate CLIP image embeddings from captions, the prior is not strictly necessary for caption-to-image generation. For instance, our decoder can condition on both CLIP image embeddings and captions, but the CLIP image embedding is dropped 5% of the time during training in order to enable classifier-free guidance." (The prior is not strictly necessary for text-to-image generation; see the dropout sketch after this list.)
    3. "Compared to GLIDE, we qualitatively observe that unCLIP is able to generate more diverse images while leveraging the guidance technique to improve sample quality. To understand why, consider Figure 9 where we increase guidance scale for both GLIDE and unCLIP. For GLIDE, the semantics (camera angle, color, size) converge as we increase guidance scale, whereas for unCLIP the semantic information of the scene is frozen in the CLIP image embedding and therefore does not collapse when guiding the decoder." (The advantage of unCLIP over GLIDE under guidance.)
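    To make sentence 2 concrete, here is a hedged sketch of the training-time embedding dropout and the classifier-free guidance rule it enables. The decoder signature, the zero null embedding, and the guidance scale are illustrative assumptions, not the paper's exact implementation:

import torch

def training_forward(decoder, x_t, t, clip_img_emb, drop_prob=0.05):
    # Drop the CLIP image embedding 5% of the time (replacing it with a
    # null embedding) so the decoder also learns an unconditional model.
    if torch.rand(()) < drop_prob:
        clip_img_emb = torch.zeros_like(clip_img_emb)
    return decoder(x_t, t, clip_img_emb)

def guided_forward(decoder, x_t, t, clip_img_emb, scale=3.0):
    # Classifier-free guidance at sampling time: extrapolate from the
    # unconditional prediction toward the conditional one.
    uncond = decoder(x_t, t, torch.zeros_like(clip_img_emb))
    cond = decoder(x_t, t, clip_img_emb)
    return uncond + scale * (cond - uncond)

# Toy usage with a stand-in decoder (any callable with this signature works).
decoder = lambda x_t, t, emb: x_t + emb.mean()
x_t, emb = torch.randn(1, 3, 64, 64), torch.randn(1, 512)
out = guided_forward(decoder, x_t, torch.tensor(10), emb)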
    This paper combines the zero-shot approach with diffusion models for text-conditional image generation. The work proposes a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on that image embedding.
    The first point worth noting is that CLIP can break free of predefined labels, i.e., it is zero-shot: its label set is flexible, so two labels give a binary classification task and ten labels a ten-way one, with no need to fix the number of classes in advance (see the sketch below). When guidance is applied, unCLIP, unlike GLIDE, does not suffer from collapse (that is, as guidance increases, the generated images lose diversity until they all look nearly the same). CLIP has its own weakness, however: it easily confuses attribute binding across multiple objects, and unCLIP does even worse here, with a more severe attribute-binding problem.
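    To illustrate the flexible zero-shot labeling, here is a minimal sketch using OpenAI's open-source CLIP package (https://github.com/openai/CLIP); the image path and the label prompts are placeholders, and the label list can be any length:

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["a photo of a cat", "a photo of a dog"]  # 2 labels -> binary; add more for N-way
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical path
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)   # image-text similarity logits
    probs = logits_per_image.softmax(dim=-1)   # normalize over the label set

print(dict(zip(labels, probs[0].tolist())))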