Experiment
The opening paragraph already summarizes the whole section.
We evaluate the representation learning capabilities of ResNet, Vision Transformer (ViT), and the hybrid.
The section starts by naming the models being compared:
- ResNet, the representative model of the CNN family.
- ViT.
- The hybrid model mentioned earlier, which uses a CNN for feature extraction and a Transformer for global integration (see the sketch after this list).
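To make the hybrid idea concrete, here is a minimal PyTorch-style sketch: a small CNN backbone (standing in for a ResNet stem) produces a feature map, its spatial positions are flattened into tokens, and a Transformer encoder integrates them globally before classifying from a [CLS] token. The backbone layers, `embed_dim`, `depth`, and `num_heads` are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class HybridViT(nn.Module):
    """Toy hybrid: CNN feature extraction + Transformer global integration."""

    def __init__(self, embed_dim=256, depth=4, num_heads=8, num_classes=1000):
        super().__init__()
        # Small CNN backbone (stand-in for a ResNet stem); total stride 16,
        # so a 224x224 input yields a 14x14 feature map.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, embed_dim, kernel_size=3, stride=4, padding=1),
        )
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Positional embedding for 14*14 = 196 tokens plus the [CLS] token.
        self.pos_embed = nn.Parameter(torch.zeros(1, 197, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        feat = self.backbone(x)                    # (B, C, H', W')
        tokens = feat.flatten(2).transpose(1, 2)   # (B, H'*W', C): one token per location
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)              # global integration via self-attention
        return self.head(tokens[:, 0])             # classify from the [CLS] token

# Quick shape check with a dummy batch.
model = HybridViT()
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```

In the paper itself, the hybrid feeds the intermediate feature map of a ResNet into ViT's patch-embedding projection with a 1x1 patch size; the toy backbone above only preserves that overall flow.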
To understand the data requirements of each model, we pre-train on datasets of varying size and evaluate many benchmark tasks.
The second sentence covers the choice of datasets:
- ILSVRC-2012 ImageNet, the small-scale dataset in the paper, with 1k classes and 1.3M images.
- ImageNet-21k, which the paper treats as a medium-sized dataset, with 21k classes and 14M images.
- JFT, Google's internal large-scale image dataset, with 18k classes and 303M high-resolution images.
When considering the computational cost of pre-training the model, ViT performs very favourably, attaining state of the art on most recognition benchmarks at a lower pre-training cost.