In the image domain, convolutional neural networks (CNNs) have long been the standard architecture: nearly every model is built on top of a CNN backbone.
In NLP, by contrast, the Transformer has dominated essentially every subfield since its introduction.
Building on the Transformer's strength, a number of works have tried to bring it to computer vision. The paper AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE succeeds in introducing the Transformer to CV largely intact, and it achieves comparatively strong results.
Abstract
While the Transformer architecture has become the de-facto standard for natural
language processing tasks, its applications to computer vision remain limited. In
vision, attention is either applied in conjunction with convolutional networks, or
used to replace certain components of convolutional networks while keeping their
overall structure in place. We show that this reliance on CNNs is not necessary
and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.
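To make "sequences of image patches" concrete, here is a minimal PyTorch sketch of the patch-embedding step. The hyperparameters (224x224 input, 16x16 patches, 768-dim embeddings) are the common ViT-Base defaults, and the class name PatchEmbedding is our own label for illustration, not code from the paper:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch (assumed hyperparameters): split an image into 16x16 patches
    and linearly project each patch into an embedding vector."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A conv whose kernel size equals its stride is equivalent to cutting
        # the image into non-overlapping patches and applying one shared
        # linear layer to each flattened patch.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                 # x: (B, 3, 224, 224)
        x = self.proj(x)                  # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)  # (B, 196, 768): a token sequence
        return x                          # ready for a Transformer encoder

embed = PatchEmbedding()
tokens = embed(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The output is a sequence of 196 patch tokens, each a 768-dimensional vector, which can be fed to a standard Transformer encoder exactly like word embeddings in NLP.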