Topic: An alternative to the "receptive field" concept in Transformers, and the factors that influence it
Question background:
I have two transformer networks: one with 3 attention heads per layer and 15 layers in total, and a second one with 5 heads per layer and 30 layers in total. Given an arbitrary set of documents (2048 tokens each), how can I find out which network is going to be better to use and less prone to overfitting?
In computer vision we have a concept called the "receptive field" that lets us understand how big or small a network we need to use. For instance, if we have a CNN with 120 layers and a CNN with 70 layers, we can calculate their receptive fields and understand which one is going to perform better on a particular dataset of images.
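For reference, the receptive-field calculation alluded to here is just a simple recurrence over kernel sizes and strides. A minimal sketch, with made-up layer configurations purely for illustration:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered from input to output."""
    rf = 1      # receptive field of one output unit, measured in input pixels
    jump = 1    # spacing between adjacent units of the current layer, in input pixels
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

deep    = [(3, 1)] * 120   # hypothetical: 120 conv layers, 3x3 kernels, stride 1
shallow = [(3, 1)] * 70    # hypothetical: 70 conv layers
print(receptive_field(deep), receptive_field(shallow))  # 241 141
```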
Do you guys have something similar in NLP? How do you understand whether one architecture is more optimal to use versus another, having a set of text documents with unique properties?
Answer:
How do you understand whether one architecture is more optimal to use versus another, having a set of text documents with unique properties?
For modern Transformer-based Language Models (LMs), there are some empirical "scaling laws", such as the Chinchilla scaling laws (Wikipedia), that essentially say that larger (deeper) models with more layers, i.e., with more parameters, tend to perform better. So far, most LMs seem to roughly follow Chinchilla scaling. There is another kind of scaling, closer to a "receptive field", that I talk about below.
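As a rough illustration of what Chinchilla scaling prescribes in practice, the compute-optimal recipe works out to roughly 20 training tokens per model parameter. A minimal sketch under that rule of thumb (the model sizes below are arbitrary examples, and the constant 20 is an approximation, not a fitted value):

```python
TOKENS_PER_PARAM = 20  # rough compute-optimal ratio; an approximation, not a fitted constant

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Roughly how many training tokens a compute-optimal run would use."""
    return TOKENS_PER_PARAM * n_params

for n_params in (1e9, 7e9, 70e9):  # arbitrary example model sizes
    print(f"{n_params / 1e9:.0f}B params -> ~{chinchilla_optimal_tokens(n_params) / 1e9:.0f}B tokens")
```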
Do you guys have something similar in NLP?
Kind of. Transformer-based LMs can be thought of as having a "receptive field" similar to that of CNN layers, because the attention mechanism in the Transformer operates on a pre-defined "context window" or "context length", which is the maximum number of tokens the layer can look at ("attend to") at any given time, similar to a CNN kernel. However, with the introduction of new positional encoding (PE) approaches, such as Rotary Positional Encoding (RoPE), and modified attention architectures, like Sliding Window Attention (SWA), this is no longer strictly accurate.
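To make the analogy concrete, here is a minimal NumPy sketch (function names and sizes are my own, for illustration) contrasting a standard causal attention mask, where each token can attend to everything earlier in the context window, with a sliding-window mask, where each token only sees a fixed local window, much like a CNN kernel:

```python
import numpy as np

def causal_mask(n_tokens):
    """Token i may attend to every token 0..i inside the context window."""
    return np.tril(np.ones((n_tokens, n_tokens), dtype=bool))

def sliding_window_mask(n_tokens, window):
    """Token i may attend only to the last `window` tokens, i-window+1..i."""
    offsets = np.arange(n_tokens)[:, None] - np.arange(n_tokens)[None, :]
    return (offsets >= 0) & (offsets < window)

n = 8
print(causal_mask(n).sum(axis=1))             # [1 2 3 4 5 6 7 8]: grows with position
print(sliding_window_mask(n, 3).sum(axis=1))  # [1 2 3 3 3 3 3 3]: capped at the window size
```

With SWA, the effective "receptive field" still grows with depth, since information can propagate by one window per layer, which is one reason the single-layer context window stops being a hard limit.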
Scaling in terms of "context length" is of much interest, but it is usually very difficult to scale Transformers this way, because attention is an $\mathcal{O}(N^2)$ operation in the sequence length $N$. So researchers usually go towards deeper architectures with more parameters ("over-parameterization") that allow the model to "memorize" as much of the large training corpus as it can ("overfitting"), so that it performs reasonably well when fine-tuned for most downstream tasks (those that have at least some representative examples in the training corpus).
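To see where the quadratic cost comes from: each of the $N$ query tokens is scored against all $N$ key tokens, so the attention-score matrix alone has $N \times N$ entries. A minimal sketch of how that single matrix grows with context length (per layer, per head, in float32; the context lengths are example values):

```python
BYTES_PER_FLOAT32 = 4

for n_tokens in (2_048, 8_192, 32_768):                    # example context lengths
    score_bytes = n_tokens * n_tokens * BYTES_PER_FLOAT32  # one N x N attention-score matrix
    print(f"N={n_tokens:>6}: {score_bytes / 2**20:8.1f} MiB")
# Doubling the context length quadruples the size of the score matrix.
```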