In the image domain, convolutional neural networks (CNNs) have long been the standard architecture: nearly every model is built on top of a CNN backbone.
In NLP, by contrast, the Transformer has dominated essentially every subfield since its introduction.
Building on the Transformer's strength, a number of works have tried to bring it to computer vision. The paper AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE succeeds in introducing the Transformer to CV largely intact, and it achieves comparatively strong results.
Abstract
While the Transformer architecture has become the de-facto standard for natural
language processing tasks, its applications to computer vision remain limited. In
vision, attention is either applied in conjunction with convolutional networks, or
used to replace certain components of convolutional networks while keeping their
overall structure in place. We show that this reliance on CNNs is not necessary
and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.
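To make "sequences of image patches" concrete, here is a minimal PyTorch sketch of the patch-embedding step. The hyperparameters (224x224 input, 16x16 patches, 768-dim embeddings) are the common ViT-Base defaults, and the class name PatchEmbedding is our own label for illustration, not code from the paper:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch (assumed hyperparameters): split an image into 16x16 patches
    and linearly project each patch into an embedding vector."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A conv whose kernel size equals its stride is equivalent to cutting
        # the image into non-overlapping patches and applying one shared
        # linear layer to each flattened patch.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                 # x: (B, 3, 224, 224)
        x = self.proj(x)                  # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)  # (B, 196, 768): a token sequence
        return x                          # ready for a Transformer encoder

embed = PatchEmbedding()
tokens = embed(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The output is a sequence of 196 patch tokens, each a 768-dimensional vector, which can be fed to a standard Transformer encoder exactly like word embeddings in NLP.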