ViT
- The embedding step differs from the standard Transformer; the rest of the model is essentially identical.
Embedding
- Image: $(H, W, C)$
- Patches: $(N, P^2 \cdot C)$, where $N = \frac{H \times W}{P^2}$ and $P$ is the patch size
- Note that, to stay consistent with BERT, the paper prepends an extra `<class>` token
Example
For example, a $(224, 224)$ image with patch size $(16, 16)$ is embedded as $(196, 768)$, where $196 = \frac{224 \times 224}{16 \times 16}$ and $768 = 16 \times 16 \times 3$; prepending the `<class>` token gives $(197, 768)$, which is then fed through the model.
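A quick shape check of the example above, written as a minimal PyTorch sketch; the tensor names and the `unfold`-based patchify are illustrative assumptions, not the paper's reference code:

```python
import torch

H = W = 224              # image resolution
P = 16                   # patch size
C = 3                    # channels
N = (H * W) // (P * P)   # 196 patches
D = P * P * C            # 768 = flattened patch dimension

img = torch.randn(C, H, W)                               # (3, 224, 224)
patches = img.unfold(1, P, P).unfold(2, P, P)            # (3, 14, 14, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4).reshape(N, D)   # (196, 768)

cls_token = torch.zeros(1, D)                            # a learned parameter in the real model
tokens = torch.cat([cls_token, patches], dim=0)          # (197, 768)
print(tokens.shape)                                      # torch.Size([197, 768])
```

In the actual model the flattened patches also pass through a learned linear projection to the model dimension, and the `<class>` token is a trainable parameter rather than zeros.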
The rest of the architecture follows the standard Transformer
- Not repeated here; a minimal end-to-end sketch is given below.
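To make the overall wiring concrete, here is a minimal sketch built on PyTorch's stock `nn.TransformerEncoder` with ViT-Base-like hyperparameters; the class name `TinyViT` and the exact settings are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=224, patch=16, dim=768, depth=12, heads=12, classes=1000):
        super().__init__()
        n = (img // patch) ** 2
        # Patchify + linear projection in one conv (kernel = stride = patch size)
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))        # <class> token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))    # 1D positional embedding
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):                                      # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)            # (B, 196, 768)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos              # (B, 197, 768)
        x = self.encoder(x)
        return self.head(x[:, 0])                              # classify on the <class> token
```

Calling `TinyViT()(torch.randn(2, 3, 224, 224))` returns a `(2, 1000)` logits tensor; `norm_first=True` mirrors the pre-norm encoder blocks used in the paper.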
Analysis of properties
Lack of inductive bias
- Inductive bias: in the paper this refers to properties such as translation invariance and translation equivariance. These are inductive properties built into CNNs and can be viewed as a form of prior knowledge that helps the model learn more effectively.
Inductive bias. We note that Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global. The two-dimensional neighborhood structure is used very sparingly: in the beginning of the model by cutting the image into patches and at fine-tuning time for adjusting the position embeddings for images of different resolution (as described below). Other than that, the position embeddings at initialization time carry no information about the 2D positions of the patches and all spatial relations between the patches have to be learned from scratch.
Positional encoding
- Images are 2D, so it is natural to ask whether the positional encoding also needs to be 2D.
- The paper shows experimentally that 1D positional encodings are sufficient: ViT can learn the spatial relationships between patches on its own, as sketched below.
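A minimal sketch of what a learnable 1D positional embedding looks like in this setting, assuming the $(197, 768)$ token sequence from the example above; the shapes and names are illustrative:

```python
import torch
import torch.nn as nn

# Learnable 1D positional embedding: one vector per token index, including <class>.
# No 2D grid structure is handed to the model; spatial relations must be learned.
N, D = 196, 768
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))

tokens = torch.randn(1, N + 1, D)   # (batch, 197, 768) patch + <class> embeddings
tokens = tokens + pos_embed         # positions are simply added before the encoder
```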
Dataset preferences
- ViT underperforms CNNs on small datasets; the paper attributes this to the lack of inductive bias.
- On large datasets, ViT outperforms CNNs.
References
@misc{dosovitskiyImageWorth16x162021,
  title = {An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  shorttitle = {An Image Is Worth 16x16 Words},
  author = {Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  year = {2021},
  number = {arXiv:2010.11929},
  eprint = {2010.11929},
  primaryclass = {cs},
  publisher = {arXiv},
  doi = {10.48550/arXiv.2010.11929},
  archiveprefix = {arXiv}
}