The Illustrated Stable Diffusion


  • 1. The components of Stable Diffusion
    • 1.1. Image information creator
    • 1.2. Image Decoder
  • 2. What is Diffusion anyway?
    • 2.1. How does Diffusion work?
    • 2.2. Painting images by removing noise
  • 3. Speed Boost: Diffusion on compressed (latent) data instead of the pixel image
  • 4. The Text Encoder: A Transformer language model
    • 4.1. How is CLIP trained?
  • 5. Feeding text information into the image generation process
    • 5.1. Layers of the Unet Noise predictor without text
    • 5.2. Layers of the Unet Noise predictor with text
  • 6. Conclusion
  • 7. Resources
  • Acknowledgements
  • Citation

https://jalammar.github.io/illustrated-stable-diffusion/

This is a gentle introduction to how Stable Diffusion works.


[Image: example text prompt and the generated image]


Stable Diffusion is versatile in that it can be used in a number of different ways. Let’s focus at first on image generation from text only (text2img). The image above shows an example text input and the resulting generated image (The actual complete prompt is here). Aside from text to image, another main way of using it is by making it alter images (so inputs are text + image).
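To make the text2img usage concrete, here is a minimal sketch using the Hugging Face diffusers library (linked in the Resources section). The model id, dtype, and device are assumptions for illustration, not details from the article.

```python
# Minimal text2img sketch with the diffusers library (assumed API and model id).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "paradise cosmic beach"                         # abbreviated version of the example prompt
image = pipe(prompt, num_inference_steps=50).images[0]   # 50 denoising steps
image.save("paradise_cosmic_beach.png")
```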


[Image]

1. The components of Stable Diffusion

Stable Diffusion is a system made up of several components and models. It is not one monolithic model.


As we look under the hood, the first observation we can make is that there’s a text-understanding component that translates the text information into a numeric representation that captures the ideas in the text.


[Image]

We’re starting with a high-level view and we’ll get into more machine learning details later in this article. However, we can say that this text encoder is a special Transformer language model (technically: the text encoder of a CLIP model). It takes the input text and outputs a list of numbers representing each word/token in the text (a vector per token).
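As a rough sketch of what that looks like in code, the CLIP text encoder from the transformers library produces exactly this "one vector per token" output. The model id and padding length below are assumptions about the released CLIP ViT-L/14 weights.

```python
# Sketch: prompt -> 77 token embedding vectors, 768 dimensions each (assumed model id).
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer("paradise cosmic beach", padding="max_length",
                   max_length=77, return_tensors="pt")
with torch.no_grad():
    embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(embeddings.shape)   # torch.Size([1, 77, 768]) -> one 768-d vector per token
```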

That information is then presented to the Image Generator, which is composed of a couple of components itself.

[Image]

The image generator goes through two stages:

1.1. Image information creator

This component is the secret sauce of Stable Diffusion. It’s where a lot of the performance gain over previous models is achieved.


This component runs for multiple steps to generate image information. This is the steps parameter in Stable Diffusion interfaces and libraries, which often defaults to 50 or 100.

The image information creator works completely in the image information space (or latent space). We’ll talk more about what that means later in the post. This property makes it faster than previous diffusion models that worked in pixel space. In technical terms, this component is made up of a UNet neural network and a scheduling algorithm.
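A conceptual sketch of that loop is shown below. The noise_predictor and scheduler interfaces here are hypothetical stand-ins used only to show the structure; concrete library classes appear later in the post.

```python
# Conceptual sketch of the image information creator's loop (hypothetical interfaces).
import torch

def create_image_information(noise_predictor, scheduler, text_embeddings, steps=50):
    latents = torch.randn(1, 4, 64, 64)                 # start from pure noise in latent space
    for t in scheduler.timesteps(steps):                # the "steps" parameter (often 50 or 100)
        predicted_noise = noise_predictor(latents, t, text_embeddings)
        latents = scheduler.remove_noise(latents, predicted_noise, t)  # one diffusion step
    return latents                                      # handed to the image decoder
```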

The word “diffusion” describes what happens in this component. It is the step by step processing of information that leads to a high-quality image being generated in the end (by the next component, the image decoder).

[Image]

1.2. Image Decoder

The image decoder paints a picture from the information it got from the information creator. It runs only once at the end of the process to produce the final pixel image.

[Image]

With this we come to see the three main components (each with its own neural network) that make up Stable Diffusion:

  • ClipText for text encoding

Input: text.
Output: 77 token embedding vectors, each in 768 dimensions.

  • UNet + Scheduler to gradually process/diffuse information in the information (latent) space

Input: text embeddings and a starting multi-dimensional array (structured lists of numbers, also called a tensor) made up of noise.
Output: A processed information array

  • Autoencoder Decoder that paints the final image using the processed information array

Input: The processed information array (dimensions: (4, 64, 64))
Output: The resulting image (dimensions: (3, 512, 512) which are (red/green/blue, width, height))
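To make the decoder's dimensions concrete, here is a minimal sketch of that last step, turning a (4, 64, 64) information array into a (3, 512, 512) image with the autoencoder's decoder from the diffusers library. The model id and the 0.18215 scaling constant are assumptions about the released v1 weights.

```python
# Sketch: processed latents (4, 64, 64) -> pixel image (3, 512, 512).
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

latents = torch.randn(1, 4, 64, 64)                # stand-in for the processed information array
with torch.no_grad():
    image = vae.decode(latents / 0.18215).sample   # shape (1, 3, 512, 512), values roughly in [-1, 1]
```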


[Image]

2. What is Diffusion anyway?

Diffusion is the process that takes place inside the pink “image information creator” component. Having the token embeddings that represent the input text, and a random starting image information array (these are also called latents), the process produces an information array that the image decoder uses to paint the final image.

[Image]

This process happens in a step-by-step fashion. Each step adds more relevant information. To get an intuition of the process, we can inspect the random latents array, and see that it translates to visual noise. Visual inspection in this case is passing it through the image decoder.


[Image]

Diffusion happens in multiple steps. Each step operates on an input latents array and produces another latents array that better resembles the input text and the visual information the model picked up from the images it was trained on.


[Image]

We can visualize a set of these latents to see what information gets added at each step.

[Image]

The process is quite breathtaking to look at.

[Images: the latents decoded at successive diffusion steps]

Something especially fascinating happens between steps 2 and 4 in this case. It’s as if the outline emerges from the noise.


[Image]

2.1. How does Diffusion work?

The central idea of generating images with diffusion models relies on the fact that we have powerful computer vision models. Given a large enough dataset, these models can learn complex operations. Diffusion models approach image generation by framing the problem as follows:

Say we have an image. We generate some noise and add it to the image.

[Image]

This can now be considered a training example. We can use this same formula to create lots of training examples to train the central component of our image generation model.

[Image]

While this example shows only a few noise levels, from no noise (amount 0) to total noise (amount 4), we can easily control how much noise to add to the image, and so we can spread it over tens of steps, creating tens of training examples per image for all the images in the training dataset.
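In code, this forward noising step can be sketched with the standard DDPM formulation. The schedule values below are illustrative assumptions; the key point is that a training example is a noisy image paired with the exact noise that was added to it.

```python
# Sketch of creating training examples by adding a controlled amount of noise.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # noise schedule (assumed values)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)    # how much of the original signal survives at step t

def add_noise(clean_image, t):
    noise = torch.randn_like(clean_image)
    noisy = alpha_bars[t].sqrt() * clean_image + (1 - alpha_bars[t]).sqrt() * noise
    return noisy, noise                           # (training input, training target)

x0 = torch.randn(1, 3, 64, 64)                    # stand-in for a training image
noisy_example, added_noise = add_noise(x0, t=200) # one of the "tens" of examples per image
```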

[Image]

With this dataset, we can train the noise predictor and end up with a great noise predictor that actually creates images when run in a certain configuration. A training step should look familiar if you’ve had ML exposure:

[Image]
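Below is a sketch of such a training step, reusing the add_noise helper and schedule from the previous snippet. Here noise_predictor is a hypothetical stand-in for the UNet.

```python
# Sketch of one noise-predictor training step (hypothetical noise_predictor model).
import torch
import torch.nn.functional as F

def training_step(noise_predictor, optimizer, clean_images):
    t = torch.randint(0, T, (1,)).item()              # pick a random noise amount
    noisy_images, true_noise = add_noise(clean_images, t)
    predicted_noise = noise_predictor(noisy_images, t)
    loss = F.mse_loss(predicted_noise, true_noise)    # compare prediction to the actual noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```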

2.2. Painting images by removing noise

Let’s now see how this can generate images.

The trained noise predictor takes a noisy image and the number of the denoising step, and predicts a slice of noise.

[Image]

The sampled noise is predicted so that if we subtract it from the image, we get an image that’s closer to the images the model was trained on (not the exact images themselves, but the distribution - the world of pixel arrangements where the sky is usually blue and above the ground, people have two eyes, cats look a certain way - pointy ears and clearly unimpressed).
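One common form of that denoising step is sketched below, again reusing the schedule from the forward-noising snippet. It is a simplified version of the DDPM sampling update, not the exact sampler Stable Diffusion ships with.

```python
# Sketch of one reverse (denoising) step: predict the noise and remove it.
import torch

def denoise_step(noise_predictor, x_t, t):
    predicted_noise = noise_predictor(x_t, t)
    alpha_t = 1.0 - betas[t]
    alpha_bar_t = alpha_bars[t]
    # Remove the predicted noise contribution for this step (DDPM mean estimate).
    x_prev = (x_t - (1 - alpha_t) / (1 - alpha_bar_t).sqrt() * predicted_noise) / alpha_t.sqrt()
    if t > 0:
        x_prev = x_prev + betas[t].sqrt() * torch.randn_like(x_t)   # re-inject a little noise
    return x_prev
```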


[Image]

If the training dataset was of aesthetically pleasing images (e.g., LAION Aesthetics, https://laion.ai/blog/laion-aesthetics/, which Stable Diffusion was trained on), then the resulting images tend to be aesthetically pleasing. If we train it on images of logos, we end up with a logo-generating model.


[Image]

This concludes the description of image generation by diffusion models mostly as described in Denoising Diffusion Probabilistic Models (https://arxiv.org/abs/2006.11239). Now that you have this intuition of diffusion, you know the main components of not only Stable Diffusion, but also Dall-E 2 and Google’s Imagen.

Note that the diffusion process we described so far generates images without using any text data. So if we deploy this model, it would generate great looking images, but we’d have no way of controlling if it’s an image of a pyramid or a cat or anything else. In the next sections we’ll describe how text is incorporated in the process in order to control what type of image the model generates.

3. Speed Boost: Diffusion on compressed (latent) data instead of the pixel image

To speed up the image generation process, the Stable Diffusion paper runs the diffusion process not on the pixel images themselves, but on a compressed version of the image. The paper calls this “Departure to Latent Space” (High-Resolution Image Synthesis with Latent Diffusion Models).

This compression (and later decompression/painting) is done via an autoencoder. The autoencoder compresses the image into the latent space using its encoder, then reconstructs it from the compressed information using its decoder.
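A minimal sketch of that compression with the diffusers autoencoder is below; the model id and scaling factor are assumptions about the released v1 weights. Note how much smaller the latents are than the pixel image.

```python
# Sketch: compressing a pixel image into the latent space with the autoencoder's encoder.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

pixel_image = torch.randn(1, 3, 512, 512)           # stand-in for a real image scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(pixel_image).latent_dist.sample() * 0.18215

print(latents.shape)   # torch.Size([1, 4, 64, 64]) -> 48x fewer numbers than the pixel image
```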


[Image]

Now the forward diffusion process is done on the compressed latents. The slices of noise are noise applied to those latents, not to the pixel image. And so the noise predictor is actually trained to predict noise in the compressed representation (the latent space).

[Image]

The forward process (using the autoencoder’s encoder) is how we generate the data to train the noise predictor. Once it’s trained, we can generate images by running the reverse process (using the autoencoder’s decoder).

[Image]

These two flows are what’s shown in Figure 3 of the LDM/Stable Diffusion paper:

[Image]

This figure additionally shows the “conditioning” components, which in this case are the text prompts describing what image the model should generate. So let’s dig into the text components.

4. The Text Encoder: A Transformer language model

A Transformer language model is used as the language understanding component that takes the text prompt and produces token embeddings. The released Stable Diffusion model uses ClipText (A GPT-based model), while the paper used BERT.

The Illustrated GPT-2 (Visualizing Transformer Language Models)
https://jalammar.github.io/illustrated-gpt2/

The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
https://jalammar.github.io/illustrated-bert/

The choice of language model is shown by the Imagen paper to be an important one. Swapping in larger language models had more of an effect on generated image quality than larger image generation components.

Larger / better language models have a significant effect on the quality of image generation models. https://arxiv.org/abs/2205.11487

[Image]

The early Stable Diffusion models just plugged in the pre-trained ClipText model released by OpenAI. It’s possible that future models may switch to the newly released and much larger OpenCLIP variants of CLIP (True enough, Stable Diffusion V2 uses OpenClip). This new batch includes text models of sizes up to 354M parameters, as opposed to the 63M parameters in ClipText.


LARGE SCALE OPENCLIP: L/14, H/14 AND G/14 TRAINED ON LAION-2B
https://laion.ai/blog/large-openclip/

Stable Diffusion 2.0 Release
https://stability.ai/news/stable-diffusion-v2-release

4.1. How is CLIP trained?

CLIP is trained on a dataset of images and their captions. Think of a dataset looking like this, only with 400 million images and their captions:

[Image]

In actuality, CLIP was trained on images crawled from the web along with their “alt” tags.

CLIP is a combination of an image encoder and a text encoder. Its training process can be simplified to thinking of taking an image and its caption. We encode them both with the image and text encoders respectively.


[Image]

We then compare the resulting embeddings using cosine similarity. When we begin the training process, the similarity will be low, even if the text describes the image correctly.

[Image]

We update the two models so that the next time we embed them, the resulting embeddings are similar.

[Image]

By repeating this across the dataset and with large batch sizes, we end up with the encoders being able to produce embeddings where an image of a dog and the sentence “a picture of a dog” are similar. Just like in word2vec, the training process also needs to include negative examples of images and captions that don’t match, and the model needs to assign them low similarity scores.

The Illustrated Word2vec
https://jalammar.github.io/illustrated-word2vec/
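A minimal sketch of that contrastive objective is below; image_encoder and text_encoder are hypothetical stand-ins, and the other pairs in each batch serve as the negative examples.

```python
# Sketch of a CLIP-style contrastive training step (hypothetical encoders).
import torch
import torch.nn.functional as F

def clip_training_step(image_encoder, text_encoder, images, captions, temperature=0.07):
    img_emb = F.normalize(image_encoder(images), dim=-1)    # (batch, dim)
    txt_emb = F.normalize(text_encoder(captions), dim=-1)   # (batch, dim)
    logits = img_emb @ txt_emb.T / temperature              # cosine similarities, scaled
    labels = torch.arange(len(images))                      # the i-th image matches the i-th caption
    loss = (F.cross_entropy(logits, labels) +               # image -> caption direction
            F.cross_entropy(logits.T, labels)) / 2          # caption -> image direction
    return loss
```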

5. Feeding text information into the image generation process

To make text a part of the image generation process, we have to adjust our noise predictor to use the text as an input.

[Image]

Our dataset now includes the encoded text. Since we’re operating in the latent space, both the input images and predicted noise are in the latent space.

[Image]
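In diffusers terms, the conditioned noise predictor looks roughly like the sketch below: the same UNet forward pass as before, but with the text embeddings passed in as an extra input. Class and argument names follow the library and are assumptions, not details from the article.

```python
# Sketch: the noise predictor now also receives the encoded text.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")

noisy_latents = torch.randn(1, 4, 64, 64)          # latent-space training example
timestep = torch.tensor([200])                     # how much noise was added
text_embeddings = torch.randn(1, 77, 768)          # stand-in for the CLIP text encoder output

with torch.no_grad():
    predicted_noise = unet(noisy_latents, timestep,
                           encoder_hidden_states=text_embeddings).sample
print(predicted_noise.shape)                       # torch.Size([1, 4, 64, 64]) -> noise in latent space
```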

To get a better sense of how the text tokens are used in the Unet, let’s look deeper inside the Unet.

5.1. Layers of the Unet Noise predictor without text

Let’s first look at a diffusion Unet that does not use text. Its inputs and outputs would look like this:

[Image]

Inside, we see that:

  • The Unet is a series of layers that work on transforming the latents array
  • Each layer operates on the output of the previous layer
  • Some of the outputs are fed (via residual connections) into the processing later in the network
  • The timestep is transformed into a timestep embedding vector, and that’s what gets used in the layers (a sketch of this embedding follows the figure below)

[Image]
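A common way to build that timestep embedding is the sinusoidal scheme sketched below; the embedding size is an illustrative assumption.

```python
# Sketch of a sinusoidal timestep embedding: step number -> vector for the Unet layers.
import math
import torch

def timestep_embedding(t, dim=320):
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)   # geometric frequencies
    args = t * freqs
    return torch.cat([torch.cos(args), torch.sin(args)])                # (dim,)

emb = timestep_embedding(torch.tensor(50.0))
print(emb.shape)   # torch.Size([320])
```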

5.2. Layers of the Unet Noise predictor with text

Let’s now look at how to alter this system to include attention to the text.

[Image]

The main change we need to make to the system to support text inputs (technical term: text conditioning) is to add an attention layer between the ResNet blocks.

[Image]

Note that the ResNet block doesn’t directly look at the text. But the attention layers merge those text representations into the latents, and the next ResNet block can then utilize that incorporated text information in its processing.
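A minimal sketch of that cross-attention is below: queries come from the image latents, keys and values come from the token embeddings, so every latent position can pull in information from the relevant words. The dimensions are illustrative assumptions.

```python
# Sketch of cross-attention between latent positions (queries) and text tokens (keys/values).
import torch
import torch.nn.functional as F

latent_features = torch.randn(1, 64 * 64, 320)    # flattened latent positions, 320 channels (assumed)
text_embeddings = torch.randn(1, 77, 768)         # one vector per token from the text encoder

to_q = torch.nn.Linear(320, 320)                  # queries from the latents
to_k = torch.nn.Linear(768, 320)                  # keys from the text
to_v = torch.nn.Linear(768, 320)                  # values from the text

q, k, v = to_q(latent_features), to_k(text_embeddings), to_v(text_embeddings)
weights = F.softmax(q @ k.transpose(1, 2) / 320 ** 0.5, dim=-1)   # (1, 4096, 77): which tokens each position attends to
out = weights @ v                                 # (1, 4096, 320): text information merged into the latents
```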

6. Conclusion

I hope this gives you a good first intuition about how Stable Diffusion works. Lots of other concepts are involved, but I believe they’re easier to understand once you’re familiar with the building blocks above.

7. Resources

DreamStudio, https://beta.dreamstudio.ai/generate
The Annotated Diffusion Model, https://huggingface.co/blog/annotated-diffusion
What are Diffusion Models? https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
Stable Diffusion with 🧨 Diffusers, https://huggingface.co/blog/stable_diffusion

Acknowledgements

Citation

https://jalammar.github.io/illustrated-stable-diffusion/

