ViT
- The embedding step differs from the standard Transformer; the rest of the model is essentially identical.
Embedding
- Image: $(H, W, C)$
- Patches: $(N, P^2 \cdot C)$, where $N = \frac{H \times W}{P^2}$ and $P$ is the patch size
- Note that, to stay consistent with BERT, the paper prepends an extra `<class>` token
Example
For example, a $(224, 224)$ image with patch size $(16, 16)$ is embedded as $(196, 768)$, where $196 = \frac{224 \times 224}{16 \times 16}$ and $768 = 16 \times 16 \times 3$; prepending the `<class>` token gives $(197, 768)$, which is then fed through the model.
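A quick shape check of the example above, written as a minimal PyTorch sketch; the tensor names and the `unfold`-based patchify are illustrative assumptions, not the paper's reference code:

```python
import torch

H = W = 224              # image resolution
P = 16                   # patch size
C = 3                    # channels
N = (H * W) // (P * P)   # 196 patches
D = P * P * C            # 768 = flattened patch dimension

img = torch.randn(C, H, W)                               # (3, 224, 224)
patches = img.unfold(1, P, P).unfold(2, P, P)            # (3, 14, 14, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4).reshape(N, D)   # (196, 768)

cls_token = torch.zeros(1, D)                            # a learned parameter in the real model
tokens = torch.cat([cls_token, patches], dim=0)          # (197, 768)
print(tokens.shape)                                      # torch.Size([197, 768])
```

In the actual model the flattened patches also pass through a learned linear projection to the model dimension, and the `<class>` token is a trainable parameter rather than zeros.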
The rest of the architecture follows the standard Transformer
- Not repeated here; a minimal end-to-end sketch is given below.
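To make the overall wiring concrete, here is a minimal sketch built on PyTorch's stock `nn.TransformerEncoder` with ViT-Base-like hyperparameters; the class name `TinyViT` and the exact settings are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=224, patch=16, dim=768, depth=12, heads=12, classes=1000):
        super().__init__()
        n = (img // patch) ** 2
        # Patchify + linear projection in one conv (kernel = stride = patch size)
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))        # <class> token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))    # 1D positional embedding
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):                                      # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)            # (B, 196, 768)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos              # (B, 197, 768)
        x = self.encoder(x)
        return self.head(x[:, 0])                              # classify on the <class> token
```

Calling `TinyViT()(torch.randn(2, 3, 224, 224))` returns a `(2, 1000)` logits tensor; `norm_first=True` mirrors the pre-norm encoder blocks used in the paper.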
Analysis of properties
Lack of inductive bias
- Inductive bias: in the paper this refers to properties such as translation invariance and translation equivariance. These are inductive properties built into CNNs and can be viewed as a form of prior knowledge that helps the model learn more effectively.
Inductive bias. We note that Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global. The two-dimensional neighborhood structure is used very sparingly: in the beginning of the model by cutting the image into patches and at fine-tuning time for adjusting the position embeddings for images of different resolution (as described below). Other than that, the position embeddings at initialization time carry no information about the 2D positions of the patches and all spatial relations between the patches have to be learned from scratch.
Positional encoding
- Images are 2D, so it is natural to ask whether the positional encoding also needs to be 2D.
- The paper shows experimentally that 1D positional encodings are sufficient: ViT can learn the spatial relationships between patches on its own, as sketched below.
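A minimal sketch of what a learnable 1D positional embedding looks like in this setting, assuming the $(197, 768)$ token sequence from the example above; the shapes and names are illustrative:

```python
import torch
import torch.nn as nn

# Learnable 1D positional embedding: one vector per token index, including <class>.
# No 2D grid structure is handed to the model; spatial relations must be learned.
N, D = 196, 768
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))

tokens = torch.randn(1, N + 1, D)   # (batch, 197, 768) patch + <class> embeddings
tokens = tokens + pos_embed         # positions are simply added before the encoder
```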
Dataset preferences
- ViT underperforms CNNs on small datasets; the paper attributes this to the lack of inductive bias.
- On large datasets, ViT outperforms CNNs.
References
@misc{dosovitskiyImageWorth16x162021,
  title = {An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  shorttitle = {An Image Is Worth 16x16 Words},
  author = {Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  year = {2021},
  number = {arXiv:2010.11929},
  eprint = {2010.11929},
  primaryclass = {cs},
  publisher = {arXiv},
  doi = {10.48550/arXiv.2010.11929},
  archiveprefix = {arXiv}
}