The Illustrated Stable Diffusion

1. The components of Stable Diffusion
- 1.1. Image information creator
- 1.2. Image Decoder
2. What is Diffusion anyway?
- 2.1. How does Diffusion work?
- 2.2. Painting images by removing noise
3. Speed Boost: Diffusion on compressed (latent) data instead of the pixel image
4. The Text Encoder: A Transformer language model
- 4.1. How is CLIP trained?
5. Feeding text information into the image generation process
- 5.1. Layers of the Unet Noise predictor without text
- 5.2. Layers of the Unet Noise predictor with text
6. Conclusion
7. Resources
Acknowledgements
Citation
References

https://jalammar.github.io/illustrated-stable-diffusion/

This is a gentle introduction to how Stable Diffusion works.

text-to-image, text2img, T2I

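paradise /ˈpærədaɪs/ n. heaven; an ideal or idyllic place; (in some religions) the Garden of Eden; a perfect place for a particular activity or kind of person
cosmic /ˈkɑːzmɪk/ adj. relating to the universe; vast and important
beach /biːtʃ/ n. a sandy or pebbly shore by the sea or a lake v. to come or bring ashore; to haul up onto the shore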

Stable Diffusion is versatile in that it can be used in a number of different ways. Let’s first focus on image generation from text only (text2img). The image above shows an example text input and the resulting generated image (the actual complete prompt is here). Aside from text to image, another main way of using it is to have it alter images (so the inputs are text + image).

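versatile /ˈvɜːrsətl/ adj. having many uses; having many different skills
alter /ˈɔːltər/ v. to change; to modify; to adjust (clothing) for a better fit
pirate /ˈpaɪrət/ n. a sea robber; someone who illegally copies or broadcasts protected work v. to rob at sea; to copy or use without permission adj. pirated, illegally copied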

1. The components of Stable Diffusion

Stable Diffusion is a system made up of several components and models. It is not one monolithic model.

monolithic /ˌmɑːnəˈlɪθɪk/ adj. large, single, and uniform; formed from a single block; huge and featureless n. a monolithic (single-chip) circuit

As we look under the hood, the first observation we can make is that there’s a text-understanding component that translates the text information into a numeric representation that captures the ideas in the text.

hood /hʊd/ n. a protective cover (of equipment or a machine); the hood of a garment; a cloth mask; a neighborhood; an academic hood; the folding roof (of a car) vt. to cover with a hood

We’re starting with a high-level view and we’ll get into more machine learning details later in this article. However, we can say that this text encoder is a special Transformer language model (technically: the text encoder of a CLIP model). It takes the input text and outputs a list of numbers representing each word/token in the text (a vector per token).

That information is then presented to the Image Generator, which is composed of a couple of components itself.

The image generator goes through two stages:

1.1. Image information creator

This component is the secret sauce of Stable Diffusion. It’s where a lot of the performance gain over previous models is achieved.

sauce /sɔːs/ n. a sauce or dressing; impudent talk or behavior vt. to be impudent to; to add flavor or zest to; to season with sauce

This component runs for multiple steps to generate image information. This is the steps parameter in Stable Diffusion interfaces and libraries which often defaults to 50 or 100.

The image information creator works completely in the image information space (or latent space). We’ll talk more about what that means later in the post. This property makes it faster than previous diffusion models that worked in pixel space. In technical terms, this component is made up of a UNet neural network and a scheduling algorithm.

The word “diffusion” describes what happens in this component. It is the step by step processing of information that leads to a high-quality image being generated in the end (by the next component, the image decoder).

1.2. Image Decoder

The image decoder paints a picture from the information it got from the information creator. It runs only once at the end of the process to produce the final pixel image.

With this we come to see the three main components (each with its own neural network) that make up Stable Diffusion:

ClipText for text encoding

Input: text.
Output: 77 token embedding vectors, each with 768 dimensions.

UNet + Scheduler to gradually process/diffuse information in the information (latent) space

Input: text embeddings and a starting multi-dimensional array (structured lists of numbers, also called a tensor) made up of noise.
Output: A processed information array

Autoencoder Decoder that paints the final image using the processed information array

Input: The processed information array (dimensions: (4, 64, 64))
Output: The resulting image (dimensions: (3, 512, 512) which are (red/green/blue, width, height))
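
To make the data flow between the three components concrete, here is a minimal shape-only sketch in Python with NumPy. The three functions are hypothetical stand-ins that return random or zero arrays of the sizes listed above. They are not the real networks, and the names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(prompt):
    # Hypothetical stand-in for ClipText: 77 token vectors, 768 dimensions each.
    return rng.standard_normal((77, 768))

def create_image_information(text_embeddings, steps=50):
    # Hypothetical stand-in for UNet + scheduler: start from a noise tensor.
    latents = rng.standard_normal((4, 64, 64))
    for _ in range(steps):
        pass  # the real component would refine the latents here, step by step
    return latents

def decode_image(latents):
    # Hypothetical stand-in for the autoencoder decoder: 8x larger per side.
    return np.zeros((3, latents.shape[1] * 8, latents.shape[2] * 8))

embeddings = encode_text("paradise cosmic beach")
latents = create_image_information(embeddings)
image = decode_image(latents)
print(embeddings.shape, latents.shape, image.shape)
```

The only point of the sketch is the shapes: text goes in, a (77, 768) array conditions the work on a (4, 64, 64) array, and a single decoding pass expands that to (3, 512, 512).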

diffuse /dɪˈfjuːs , dɪˈfjuːz/ adj. spread out; scattered; unclear; wordy v. (of a gas or liquid) to spread or permeate; to scatter (light); to disseminate

2. What is Diffusion anyway?

Diffusion is the process that takes place inside the pink “image information creator” component. Having the token embeddings that represent the input text, and a random starting image information array (these are also called latents), the process produces an information array that the image decoder uses to paint the final image.

This process happens in a step-by-step fashion. Each step adds more relevant information. To get an intuition of the process, we can inspect the random latents array, and see that it translates to visual noise. Visual inspection in this case is passing it through the image decoder.

intuition /ˌɪntuˈɪʃn/ n. an instinctive understanding; the ability to understand something instinctively

Diffusion happens in multiple steps. Each step operates on an input latents array and produces another latents array that better resembles the input text and all the visual information the model picked up from the images it was trained on.

resemble /rɪˈzembl/ vt. to look like; to be similar to

We can visualize a set of these latents to see what information gets added at each step.

The process is quite breathtaking to look at.

Something especially fascinating happens between steps 2 and 4 in this case. It’s as if the outline emerges from the noise.

emerge /iˈmɜːrdʒ/ v. to come out (of a hidden or dark place); to appear; to become known; to survive; to come to light

2.1. How does Diffusion work?

The central idea of generating images with diffusion models relies on the fact that we have powerful computer vision models. Given a large enough dataset, these models can learn complex operations. Diffusion models approach image generation by framing the problem as follows:

Say we have an image, we generate some noise, and add it to the image.

This can now be considered a training example. We can use this same formula to create lots of training examples to train the central component of our image generation model.

While this example shows only a few noise amounts, from the original image (amount 0, no noise) to total noise (amount 4), we can easily control how much noise to add to the image. We can therefore spread it over tens of steps, creating tens of training examples per image for all the images in a training dataset.
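
The recipe for creating training examples can be sketched in a few lines of Python with NumPy. This is an illustration under stated assumptions: the function name is hypothetical, the "image" is a tiny random array, and noise is mixed in with a simple linear blend. Real diffusion models use a variance schedule rather than this blend.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_example(image, noise_amount, max_amount=4):
    # Assumption: a plain linear blend between the image and the noise.
    noise = rng.standard_normal(image.shape)
    weight = noise_amount / max_amount
    noisy_image = (1.0 - weight) * image + weight * noise
    # Model inputs: (noisy_image, noise_amount). Training target: noise.
    return noisy_image, noise_amount, noise

image = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # tiny stand-in for a real image
examples = [make_training_example(image, amount) for amount in range(5)]
print(len(examples))
```

Each image yields one example per noise amount, so a finer range of amounts produces more examples from the same dataset.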

With this dataset, we can train the noise predictor and end up with a great noise predictor that actually creates images when run in a certain configuration. A training step should look familiar if you’ve had ML exposure:
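
As a rough illustration of such a training step, the sketch below trains a toy noise predictor with plain gradient descent in NumPy. Everything here is an assumption made for brevity: the predictor is a single linear layer rather than a UNet, it ignores the noise amount, and it is fitted to one example only.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16                           # flattened size of a tiny "image"
weights = np.zeros((dim, dim))     # hypothetical linear noise predictor

def training_step(weights, noisy, true_noise, learning_rate=0.01):
    predicted_noise = weights @ noisy               # 1. predict the noise
    error = predicted_noise - true_noise            # 2. compare with the actual noise
    loss = np.mean(error ** 2)                      #    (mean squared error)
    gradient = 2.0 * np.outer(error, noisy) / dim   # 3. gradient of the loss
    return weights - learning_rate * gradient, loss # 4. update the predictor

image = rng.uniform(-1.0, 1.0, size=dim)
noise = rng.standard_normal(dim)
noisy = image + noise

losses = []
for _ in range(50):
    weights, loss = training_step(weights, noisy, noise)
    losses.append(loss)
print(losses[0], losses[-1])
```

The loss shrinks from step to step, which is all this toy can show. A real noise predictor is trained across many images and noise amounts so that it generalizes instead of memorizing one example.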

2.2. Painting images by removing noise

Let’s now see how this can generate images.

The trained noise predictor can take a noisy image, and the number of the denoising step, and is able to predict a slice of noise.

The sampled noise is predicted so that if we subtract it from the image, we get an image that’s closer to the images the model was trained on (not the exact images themselves, but the distribution - the world of pixel arrangements where the sky is usually blue and above the ground, people have two eyes, cats look a certain way - pointy ears and clearly unimpressed).
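
The loop of predicting noise and subtracting it can be sketched as follows. This is a toy under loud assumptions: `predict_noise` is a hypothetical function that cheats by knowing a fixed clean target, whereas a real predictor is a trained network, and subtracting a constant fraction of the predicted noise stands in for what a real scheduler does.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = np.full((4, 4), 0.5)   # stand-in for what a trained model has learned

def predict_noise(noisy, step):
    # Hypothetical predictor: returns whatever separates the sample from `clean`.
    return noisy - clean

def generate(steps=10, step_size=0.3):
    sample = rng.standard_normal(clean.shape)   # start from pure noise
    for step in reversed(range(steps)):
        # Remove a slice of the predicted noise at every step.
        sample = sample - step_size * predict_noise(sample, step)
    return sample

result = generate()
print(np.abs(result - clean).max())
```

Each pass moves the sample a little closer to the kind of data the predictor was built around, which is the sense in which removing noise "paints" an image.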

unimpressed /ˌʌnɪmˈprest/ adj. not impressed; finding something unremarkable

If the training dataset was of aesthetically pleasing images (e.g., LAION Aesthetics https://laion.ai/blog/laion-aesthetics/, which Stable Diffusion was trained on), then the resulting image would tend to be aesthetically pleasing. If we train it on images of logos, we end up with a logo-generating model.

aesthetical [iːsˈθetɪkəl] adj. relating to beauty or aesthetics
please [pliːz] int. used in polite requests, or to express thanks or mild protest v. to make happy; to satisfy; to like

This concludes the description of image generation by diffusion models mostly as described in Denoising Diffusion Probabilistic Models (https://arxiv.org/abs/2006.11239). Now that you have this intuition of diffusion, you know the main components of not only Stable Diffusion, but also Dall-E 2 and Google’s Imagen.

Note that the diffusion process we described so far generates images without using any text data. So if we deploy this model, it would generate great looking images, but we’d have no way of controlling if it’s an image of a pyramid or a cat or anything else. In the next sections we’ll describe how text is incorporated in the process in order to control what type of image the model generates.

3. Speed Boost: Diffusion on compressed (latent) data instead of the pixel image

To speed up the image generation process, the Stable Diffusion paper runs the diffusion process not on the pixel images themselves, but on a compressed version of the image. The paper calls this “Departure to Latent Space” (High-Resolution Image Synthesis with Latent Diffusion Models).

This compression (and later decompression/painting) is done via an autoencoder. The autoencoder compresses the image into the latent space using its encoder, then reconstructs it from only the compressed information using its decoder.
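
A quick calculation shows how much smaller the latent representation is. It uses the array shapes quoted earlier in this post, (3, 512, 512) for the pixel image and (4, 64, 64) for the latents. The helper name is just for illustration.

```python
pixel_shape = (3, 512, 512)   # pixel image, as quoted earlier in the post
latent_shape = (4, 64, 64)    # latents, as quoted earlier in the post

def count_values(shape):
    # Multiply the dimensions together to get the number of stored values.
    total = 1
    for size in shape:
        total *= size
    return total

pixel_values = count_values(pixel_shape)     # 786432
latent_values = count_values(latent_shape)   # 16384
print(pixel_values / latent_values)          # 48.0
```

The diffusion steps therefore operate on 48 times fewer numbers than they would in pixel space, which is where the speed boost comes from.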

departure [dɪˈpɑrtʃər] n. the act of leaving or setting out; a deviation (from the usual)

Now the forward diffusion process is done on the compressed latents. The slices of noise are applied to those latents, not to the pixel image. And so the noise predictor is actually trained to predict noise in the compressed representation (the latent space).

The forward process (using the autoencoder’s encoder) is how we generate the data to train the noise predictor. Once it’s trained, we can generate images by running the reverse process (using the autoencoder’s decoder).

These two flows are what’s shown in Figure 3 of the LDM/Stable Diffusion paper:

This figure additionally shows the “conditioning” components, which in this case are the text prompts describing what image the model should generate. So let’s dig into the text components.

4. The Text Encoder: A Transformer language model

A Transformer language model is used as the language understanding component that takes the text prompt and produces token embeddings. The released Stable Diffusion model uses ClipText (a GPT-based model), while the paper used BERT.

The Illustrated GPT-2 (Visualizing Transformer Language Models)
https://jalammar.github.io/illustrated-gpt2/

The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
https://jalammar.github.io/illustrated-bert/

The choice of language model is shown by the Imagen paper to be an important one. Swapping in larger language models had more of an effect on generated image quality than larger image generation components.

Larger / better language models have a significant effect on the quality of image generation models. https://arxiv.org/abs/2205.11487

The early Stable Diffusion models just plugged in the pre-trained ClipText model released by OpenAI. It’s possible that future models may switch to the newly released and much larger OpenCLIP variants of CLIP (True enough, Stable Diffusion V2 uses OpenClip). This new batch includes text models of sizes up to 354M parameters, as opposed to the 63M parameters in ClipText.

plug [plʌɡ] n. an electrical plug; a socket; an adapter; a stopper v. to stop up; to fill a gap; to supplement

LARGE SCALE OPENCLIP: L/14, H/14 AND G/14 TRAINED ON LAION-2B
https://laion.ai/blog/large-openclip/

Stable Diffusion 2.0 Release
https://stability.ai/news/stable-diffusion-v2-release

4.1. How is CLIP trained?

CLIP is trained on a dataset of images and their captions. Think of a dataset looking like this, only with 400 million images and their captions:

In actuality, CLIP was trained on images crawled from the web along with their “alt” tags.

CLIP is a combination of an image encoder and a text encoder. Its training process can be simplified to taking an image and its caption and encoding them with the image and text encoders, respectively.

actuality [ˌæktʃuˈæləti] n. reality; the actual state of affairs
crawl [krɔl] v. to move on hands and knees; (of insects) to creep n. the front crawl (swimming stroke); a very slow pace

We then compare the resulting embeddings using cosine similarity. When we begin the training process, the similarity will be low, even if the text describes the image correctly.
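
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch follows. The three-dimensional embeddings are made-up toy values, far smaller than real CLIP embeddings.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

image_embedding = [1.0, 0.0, 1.0]       # toy values
text_embedding = [1.0, 0.0, 1.0]        # points the same way as the image
unrelated_embedding = [0.0, 1.0, 0.0]   # orthogonal to the image

matching = cosine_similarity(image_embedding, text_embedding)
mismatched = cosine_similarity(image_embedding, unrelated_embedding)
print(matching, mismatched)
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, regardless of their lengths.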

We update the two models so that the next time we embed them, the resulting embeddings are similar.

By repeating this across the dataset and with large batch sizes, we end up with the encoders being able to produce embeddings where an image of a dog and the sentence “a picture of a dog” are similar. Just like in word2vec, the training process also needs to include negative examples of images and captions that don’t match, and the model needs to assign them low similarity scores.
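
One common way to express this objective is a symmetric cross-entropy over a batch's similarity matrix, where the matching pairs sit on the diagonal and every other pairing in the batch acts as a negative example. The NumPy sketch below is an illustration, not CLIP's training code. The function names are hypothetical, the temperature of 0.07 is an assumed value, and the embeddings are toy one-hot vectors.

```python
import numpy as np

def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    # Normalize rows so that dot products are cosine similarities.
    images = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    texts = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    logits = images @ texts.T / temperature   # (batch, batch) similarity matrix

    def cross_entropy_on_diagonal(scores):
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T))

matched = np.eye(4)                        # 4 toy image/caption pairs that agree
shuffled = np.roll(matched, 1, axis=0)     # captions assigned to the wrong images
loss_matched = contrastive_loss(matched, matched)
loss_shuffled = contrastive_loss(matched, shuffled)
print(loss_matched, loss_shuffled)
```

The loss is low when each image is most similar to its own caption and high when captions are paired with the wrong images, which is the signal used to update both encoders.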

The Illustrated Word2vec
https://jalammar.github.io/illustrated-word2vec/

5. Feeding text information into the image generation process

To make text a part of the image generation process, we have to adjust our noise predictor to use the text as an input.

Our dataset now includes the encoded text. Since we’re operating in the latent space, both the input images and predicted noise are in the latent space.

To get a better sense of how the text tokens are used in the Unet, let’s look deeper inside the Unet.

5.1. Layers of the Unet Noise predictor without text

Let’s first look at a diffusion Unet that does not use text. Its inputs and outputs would look like this:

Inside, we see that:

The Unet is a series of layers that work on transforming the latents array
Each layer operates on the output of the previous layer
Some of the outputs are fed (via residual connections) into the processing later in the network
The timestep is transformed into a time step embedding vector, and that’s what gets used in the layers
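
The post does not say how the timestep becomes a vector. One common choice in diffusion models is a sinusoidal embedding, sketched below in plain Python as an assumption rather than a description of Stable Diffusion's exact code. The function name, the dimension of 8, and the `max_period` value are illustrative.

```python
import math

def timestep_embedding(timestep, dim=8, max_period=10000.0):
    # Assumption: sinusoidal embedding with geometrically spaced frequencies.
    half = dim // 2
    frequencies = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    angles = [timestep * frequency for frequency in frequencies]
    # First half holds cosines, second half holds sines.
    return [math.cos(a) for a in angles] + [math.sin(a) for a in angles]

print(timestep_embedding(10))
```

Every timestep gets a distinct vector, so the layers can tell how noisy the input is expected to be at that point in the process.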

5.2. Layers of the Unet Noise predictor with text

Let’s now look at how to alter this system to include attention to the text.

The main change we need to make to the system to support text inputs (technical term: text conditioning) is to add an attention layer between the ResNet blocks.

Note that the ResNet block doesn’t directly look at the text. Instead, the attention layers merge the text representations into the latents, and the next ResNet block can then use that incorporated text information in its processing.
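
A minimal sketch of such an attention layer, written with NumPy, is shown below. It assumes single-head cross-attention in which the queries come from the latents and the keys and values come from the text embeddings. The weight matrices are random placeholders rather than trained parameters, and every size except the 77 x 768 text embeddings is a toy value.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

def cross_attention(latents, text_embeddings, w_query, w_key, w_value):
    queries = latents @ w_query           # one query per latent position
    keys = text_embeddings @ w_key        # one key per text token
    values = text_embeddings @ w_value    # one value per text token
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    attention = softmax(scores)           # (positions, tokens), rows sum to 1
    # Residual connection: the latents plus the text information they attended to.
    return latents + attention @ values, attention

positions, channels, tokens, text_dim = 16, 8, 77, 768
latents = rng.standard_normal((positions, channels))
text_embeddings = rng.standard_normal((tokens, text_dim))
w_query = rng.standard_normal((channels, channels)) * 0.1
w_key = rng.standard_normal((text_dim, channels)) * 0.1
w_value = rng.standard_normal((text_dim, channels)) * 0.1

mixed, attention = cross_attention(latents, text_embeddings, w_query, w_key, w_value)
print(mixed.shape, attention.shape)
```

The output has the same shape as the input latents, so the layer slots between ResNet blocks without changing anything else, and the following block receives latents that already carry text information.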

6. Conclusion

I hope this gives you a good first intuition about how Stable Diffusion works. Lots of other concepts are involved, but I believe they’re easier to understand once you’re familiar with the building blocks above.

7. Resources

DreamStudio, https://beta.dreamstudio.ai/generate
The Annotated Diffusion Model, https://huggingface.co/blog/annotated-diffusion
What are Diffusion Models? https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
Stable Diffusion with 🧨 Diffusers, https://huggingface.co/blog/stable_diffusion