[NLP] 27. Language Model Training and Model Selection: From Pre-training to Downstream Tasks

Language Model Training: From Pre-training to Downstream Tasks

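This article explains in detail how large language models (LLMs) are trained. It covers the three model types (Encoder, Decoder, Encoder-Decoder) and the principles, trade-offs, and use cases of each pre-training task, to help you build a complete picture of language modeling.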

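I. Three Mainstream Language Model Architectures

Language models (LLMs) fall into three main architectures, which differ in training method, capability limits, and application scenarios: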

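| Type | Representative models | Input handling | Output form | Typical uses |
| --- | --- | --- | --- | --- |
| Encoder | BERT | Whole sentence (with masked tokens) | Token-level representation vectors | NER, classification, matching |
| Decoder | GPT | Preceding text | Autoregressively generated continuation | Generation, continuation, dialogue |
| Encoder-Decoder | T5, BART | Encode the whole sentence → decode the target | Text to text | Translation, question answering, summarization |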

**Note:** Most current mainstream large models, such as GPT-4 and Claude, use a decoder-only architecture.

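II. The Basic Idea Behind Training Language Models

Whatever the model type, training follows one core recipe:

Take raw text, modify it slightly (for example by masking, replacing, or deleting tokens), and train the model to recover or identify what was changed.

This self-supervised setup teaches the model the semantic relationships between words and also pushes it to capture context and structure.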

III. Common Training Tasks: Details and Comparison

1️⃣ Next Token Prediction

Used by: Decoder models (such as GPT)
Mechanism: given the beginning of a text, predict the next token
Modeling objective:

$\hat{y}_{t} = \arg\max_{y_t} P(y_t \mid y_{<t})$

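Example:
Input: The weather is → Output: sunny
Advantage: directly supports text generation
Disadvantage: the context is unidirectional, so the model sees only the preceding text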
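
To make the objective concrete, here is a minimal sketch of how one sentence becomes several (context, target) training pairs, and how a greedy prediction picks the arg-max token. The probability table `toy_dist` is made up for illustration and stands in for a real model's output; the helper names are hypothetical.

```python
import math

def next_token_pairs(tokens):
    """Turn one token sequence into (context, target) training pairs."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

def greedy_next(dist):
    """Pick the arg-max token from a {token: probability} distribution."""
    return max(dist, key=dist.get)

def sequence_nll(step_probs):
    """Negative log-likelihood of a sequence: -sum_t log P(y_t | y_<t)."""
    return -sum(math.log(p) for p in step_probs)

tokens = ["The", "weather", "is", "sunny"]
pairs = next_token_pairs(tokens)

# Assumed model output for the context "The weather is" (not real data)
toy_dist = {"sunny": 0.6, "cold": 0.3, "purple": 0.1}
prediction = greedy_next(toy_dist)
```

Training typically minimizes the negative log-likelihood computed by `sequence_nll`, while the arg-max is what greedy decoding uses at inference time.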

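2️⃣ Masked Token Prediction

Used by: Encoder models (such as BERT)
Mechanism: randomly mask some tokens in the input sentence; the model predicts the original token at each masked position

Example:

Input: The capital of France is [MASK]
Output: Paris

Note: unmasked positions do not contribute to the loss
Compared with GPT: the model sees both the left and right context (bidirectional), but it cannot be used directly for generation tasks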
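
The masking step and the "no loss on unmasked positions" rule can be sketched as follows. This is a simplified illustration, not BERT's exact procedure: the function name, the `mask_prob` parameter, and the use of `None` to mark ignored positions are assumptions made for this example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_input, labels); labels[i] is None where no loss is computed."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)    # the model must predict this original token
        else:
            masked.append(tok)
            labels.append(None)   # position is ignored by the loss
    return masked, labels

tokens = "The capital of France is Paris".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
```

Deep-learning frameworks usually express the "ignored" positions with a special label id rather than `None`, but the idea is the same.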

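3️⃣ Span Prediction (SpanBERT)

Difference: masks several contiguous tokens rather than isolated single tokens
Purpose: strengthens the model's handling of spans, which suits tasks such as question answering

Example:

Input: Chocolate is [MASK] [MASK] [MASK] good
Output: incredibly irresistibly tasty

The task is harder, which pushes the model toward stronger capabilities.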
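
A minimal sketch of span masking, assuming the span position and length are given explicitly (a real implementation samples them at random). The helper `mask_span` is hypothetical.

```python
def mask_span(tokens, start, length, mask="[MASK]"):
    """Mask `length` contiguous tokens beginning at index `start`."""
    if start < 0 or length < 1 or start + length > len(tokens):
        raise ValueError("span must lie inside the sequence")
    masked = tokens[:start] + [mask] * length + tokens[start + length:]
    target = tokens[start:start + length]   # what the model must recover
    return masked, target

tokens = "Chocolate is incredibly irresistibly tasty".split()
masked, target = mask_span(tokens, start=2, length=3)
```

Compared with masking isolated tokens, the model gets no neighbouring words inside the span to lean on, so it has to reconstruct the phrase from its boundaries.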

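4️⃣ Random Token Correction (BERT)

Mechanism: randomly replace tokens in the sentence; the model must work out which token is wrong (and, in BERT, predict the original token at that position)

Example:

I like MLP (originally NLP)
Output: detects that MLP is wrong

Challenge: the model has to understand the sentence as a whole rather than rely on surface word frequency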

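5️⃣ Token Edit Detection (ELECTRA)

Pipeline:
- A small generator replaces some tokens (producing fakes)
- A discriminator decides, for every token, whether it was replaced
Advantage: every token contributes to training (more sample-efficient than BERT, which learns only from the masked positions)
Training target:

$\text{Output}_i = \begin{cases} S, & \text{token is original} \\ E, & \text{token was generated} \end{cases}$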
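
The discriminator's targets can be derived by comparing the corrupted sequence with the original, position by position. In the sketch below, `replace_tokens` is a hypothetical stand-in for the generator (a real generator is a small masked language model), and the S/E labels follow the formula above.

```python
def replace_tokens(tokens, replacements):
    """Apply {position: new_token} replacements; stands in for the generator."""
    return [replacements.get(i, tok) for i, tok in enumerate(tokens)]

def detection_labels(original, corrupted):
    """Label each position 'S' if it matches the original token, else 'E'."""
    if len(original) != len(corrupted):
        raise ValueError("sequences must be aligned token by token")
    return ["S" if o == c else "E" for o, c in zip(original, corrupted)]

original = "I like NLP very much".split()
# Position 3 is "replaced" with the same token, as a generator sometimes does
corrupted = replace_tokens(original, {2: "MLP", 3: "very"})
labels = detection_labels(original, corrupted)
```

Note that a position where the generator happens to reproduce the original token is labeled S, which is also how ELECTRA treats that case. Every position receives a label, which is why all tokens contribute to the loss.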

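6️⃣ Combination (BERT extended task)

Combines several corruptions: Masked + Replacement + original text kept unchanged. In BERT's recipe, 15% of tokens are selected; of these, 80% become [MASK], 10% are replaced with a random token, and 10% are left as they are.
Effect: the model learns semantic structure more thoroughly and becomes more robust to perturbations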
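
The combined corruption can be sketched like this, assuming the 80/10/10 split from the BERT paper (mask / random replacement / keep) applied to the selected tokens. The function name and the tiny `vocab` list are illustrative only.

```python
import random

def bert_corrupt(tokens, vocab, select_prob=0.15, seed=0):
    """Return (corrupted, labels); labels[i] is None for unselected positions."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            corrupted.append(tok)      # not selected: unchanged, no loss
            labels.append(None)
            continue
        labels.append(tok)             # selected: model must predict the original
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")            # 80%: mask
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: random replacement
        else:
            corrupted.append(tok)                 # 10%: keep the original
    return corrupted, labels

vocab = ["cat", "tree", "run", "blue"]
tokens = "I like NLP because language is fun".split()
corrupted, labels = bert_corrupt(tokens, vocab, select_prob=0.5, seed=42)
```

Because some selected tokens are left unchanged or swapped for a random word, the model cannot assume that every visible token is correct, which is the point of mixing the three corruptions.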

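7️⃣ Next Sentence Prediction (BERT)

Mechanism: decide whether the second sentence directly follows the first in the original text

Example:

A: I like NLP
B: I like MLP
Output: Next Sentence or not?

Later research found that this task adds little: RoBERTa removed it and matched or slightly improved downstream performance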
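
Training pairs for this task can be built from any ordered list of sentences. The sketch below is a simplification: BERT draws negative examples from other documents in the corpus, whereas here they come from the same document to keep the example self-contained. The function name is hypothetical.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) examples from an ordered document."""
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to draw negatives")
    rng = random.Random(seed)
    examples = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the true successor
            examples.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: any sentence except A itself and its successor
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            examples.append((sentences[i], rng.choice(candidates), False))
    return examples

doc = ["I like NLP.", "It is fun.", "Models learn from text.", "Training takes time."]
examples = make_nsp_pairs(doc, seed=3)
```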

IV. Training Tasks Specific to Encoder-Decoder Models (such as T5 and BART)

1. Masked Sequence Prediction

Input:
```
I attended [MASK] at [MASK]
```
Target output:
```
I attended a workshop at Google
```

2. Deleted Sequence Prediction

Input:
```
I watched yesterday
```
Target output:
```
I watched a movie on Netflix yesterday
```
Example of a wrong answer (which the model should avoid):
```
I watched a presentation at work yesterday
```

3. Deleted Span Prediction

Input:
```
She submitted the assignment
```
Target output (span-by-span completion):
```
<X>: yesterday evening, <Y>: on Canvas
```
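
A sketch of T5-style span corruption, under the assumption that sentinel tokens also appear in the input to mark where each span was removed (in the actual T5 vocabulary the sentinels are named `<extra_id_0>`, `<extra_id_1>`, and so on; `<X>` and `<Y>` follow the notation used here). The function and the example sentence are illustrative.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (source, target)."""
    sentinels = ["<X>", "<Y>", "<Z>"]
    if len(spans) > len(sentinels):
        raise ValueError("this sketch supports at most three spans")
    source, target, cursor = [], [], 0
    for sentinel, (start, end) in zip(sentinels, sorted(spans)):
        if start < cursor or end <= start or end > len(tokens):
            raise ValueError("spans must be non-empty, in range, and non-overlapping")
        source += tokens[cursor:start] + [sentinel]   # span replaced in the input
        target += [sentinel] + tokens[start:end]      # span content moved to the target
        cursor = end
    source += tokens[cursor:]
    return source, target

tokens = "I submitted the report to my supervisor".split()
source, target = span_corrupt(tokens, [(2, 4), (5, 7)])
```

The decoder only has to produce the missing spans, not the whole sentence, which keeps target sequences short.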

4. Permuted Sequence Prediction (restore the shuffled order)

Input:
```
Netflix the on movie watched I
```
Target output:
```
I watched the movie on Netflix
```

5. Rotated Sequence Prediction

Input:
```
in the conference I presented a paper
```
Target output:
```
I presented a paper in the conference
```

6. Infilling Prediction (fill in the gaps)

Input:
```
She [MASK] the [MASK] before [MASK]
```

Target output:

She completed the report before midnight

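The Encoder-Decoder architecture handles "input → output" tasks more flexibly, which makes it a good fit for structural transformations. Its training tasks are correspondingly more varied: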

| Task type | Example model | Input | Output (target) |
| --- | --- | --- | --- |
| Masked Sequence | BART | I submitted [MASK] to [MASK] | I submitted the report to my supervisor |
| Deleted Sequence | BART | I submitted to my supervisor | I submitted the report to my supervisor |
| Span Mask | T5 | `I submitted <X> to <Y>` | `<X>: the report, <Y>: my supervisor` |
| Permuted Sequence | BART | My supervisor to the report submitted I | I submitted the report to my supervisor |
| Rotated Sequence | BART | To my supervisor I submitted the report | I submitted the report to my supervisor |
| Infilling Prediction | BART | I [MASK] the [MASK] to [MASK] | I submitted the report to my supervisor |

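These tasks strengthen the model's ability to handle inputs with irregular structure and improve its generalization on tasks such as translation and summarization.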
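
The deletion, permutation, rotation, and infilling corruptions can each be written as a small function that maps the clean target to a noisy input. This is a token-level sketch with hypothetical helper names; BART itself, for example, permutes whole sentences rather than individual tokens.

```python
import random

def delete_tokens(tokens, positions):
    """Token deletion: drop the tokens at the given positions."""
    drop = set(positions)
    return [t for i, t in enumerate(tokens) if i not in drop]

def permute(tokens, seed=0):
    """Permutation: shuffle the order (token-level here for brevity)."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def rotate(tokens, k):
    """Rotation: start the sequence at index k and wrap the prefix to the end."""
    if not tokens:
        return []
    k %= len(tokens)
    return tokens[k:] + tokens[:k]

def infill(tokens, spans, mask="[MASK]"):
    """Text infilling: replace each (start, end) span with a single mask token."""
    out, cursor = [], 0
    for start, end in sorted(spans):
        out += tokens[cursor:start] + [mask]
        cursor = end
    return out + tokens[cursor:]

target = "I submitted the report to my supervisor".split()
deleted = delete_tokens(target, [2, 3])
rotated = rotate(target, 4)
infilled = infill(target, [(1, 2), (3, 4), (5, 7)])
shuffled = permute(target, seed=1)
```

In every case the training pair is (noisy input, `target`): the encoder reads the corrupted text and the decoder regenerates the clean sentence.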

V. Pre-training vs. Fine-tuning: A Comparison

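| Dimension | Pre-training | Fine-tuning |
| --- | --- | --- |
| How often | Done once | Once per downstream task |
| Training time | Long (weeks to months) | Can be short or long |
| Compute | Usually large-scale GPU clusters | Can run on a small number of GPUs |
| Data source | Raw text (e.g. Wikipedia, BooksCorpus) | Task-specific data (classification, question answering, etc.) |
| Learning goal | General language understanding (semantics, context, relations) | Best performance on a specific task |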

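VI. How Do You Use a Language Model for a Task?

Mode 1: use the language model as a feature extractor

Extract word or sentence vectors → feed them into a downstream task model
Similar to the earlier practice of using Word2Vec or GloVe vectors
BERT is particularly well suited to this approach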
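
One common way to turn per-token vectors into a single sentence vector is mean pooling, sketched below. The vectors are made-up numbers standing in for encoder outputs, and mean pooling is only one option (using the vector of a special classification token is another).

```python
def mean_pool(token_vectors):
    """Average token vectors into one fixed-size sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n = len(token_vectors)
    # zip(*...) iterates over dimensions, so each column is averaged separately
    return [sum(column) / n for column in zip(*token_vectors)]

# Made-up 3-dimensional vectors standing in for encoder outputs
token_vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
sentence_vector = mean_pool(token_vectors)
```

The resulting vector can then be passed to any downstream classifier, in the same way Word2Vec or GloVe features were used.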

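Mode 2: express the task directly as language modeling

Recast the task as a "generation" problem
GPT-style models commonly work this way

Examples:

Question answering: Q: Who discovered gravity? A: → Isaac Newton
Translation: Translate: Hello → Bonjour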
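
In practice this mode comes down to wrapping the input in a text template and letting the model continue it. A minimal sketch, where the `TEMPLATES` dictionary and `build_prompt` helper are hypothetical and simply mirror the two examples above:

```python
TEMPLATES = {
    "qa": "Q: {text} A:",
    "translate": "Translate: {text}",
}

def build_prompt(task, text):
    """Fill the template for `task`; the model's continuation is the answer."""
    if task not in TEMPLATES:
        raise KeyError(f"no template for task {task!r}")
    return TEMPLATES[task].format(text=text)

qa_prompt = build_prompt("qa", "Who discovered gravity?")
mt_prompt = build_prompt("translate", "Hello")
```

The model is never told which task it is solving through a separate head or label set; the template alone carries that information.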

Comparison of Task Formulations Across Architectures

📘 Sentiment Analysis (simulated course evaluation)

| Architecture | Input | Output |
| --- | --- | --- |
| Encoder | The student submitted the assignment on time. Rating: [MASK] | 5 |
| Decoder | The student submitted the assignment on time. Rating: | 5 |
| Enc-Dec | `The student submitted the assignment on time. Rating: <X>` | `<X>: 5` |

📘 Named Entity Recognition (NER)

| Architecture | Input | Output |
| --- | --- | --- |
| Encoder | The student submitted the assignment on time | [O, O, O, O, B-TASK, B-TIME, I-TIME] |
| Decoder | The student submitted the assignment on time | assignment: Task, on time: Time |
| Enc-Dec | The student submitted the assignment on time | assignment → Task on time → Time |
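
The encoder's per-token labels and the generated "entity: type" output carry the same information. The sketch below converts one into the other, assuming a BIO tagging scheme in which "on time" is tagged B-TIME, I-TIME; the function name is hypothetical.

```python
def bio_to_spans(tokens, labels):
    """Collapse BIO token labels into (text, type) entity spans."""
    spans, current, current_type = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:                               # close the previous entity
                spans.append((" ".join(current), current_type))
            current, current_type = [tok], lab[2:]    # open a new entity
        elif lab.startswith("I-") and current and lab[2:] == current_type:
            current.append(tok)                       # extend the open entity
        else:
            if current:                               # "O" or stray tag closes it
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans

tokens = "The student submitted the assignment on time".split()
labels = ["O", "O", "O", "O", "B-TASK", "B-TIME", "I-TIME"]
spans = bio_to_spans(tokens, labels)
```

A decoder or Encoder-Decoder model would emit something like the `spans` list directly as text, while an encoder produces `labels` and needs this kind of post-processing.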

📘 Coreference Resolution (sentence rewritten to introduce a pronoun)

Example sentence: The student told the lecturer that he was late.

| Architecture | Input | Output |
| --- | --- | --- |
| Encoder | The student told the lecturer that he was late | he → student / lecturer (requires structured decoding) |
| Decoder | The student told the lecturer that he was late | "he" refers to the student |
| Enc-Dec | The student told the lecturer that he was late | he → student |

📘 Text Summarization (long sentence → summary)

Long sentence: The student, after days of hard work and late nights, finally submitted the assignment well before the deadline.

| Architecture | Input | Output |
| --- | --- | --- |
| Decoder | The student, after days of hard work… | The student submitted early |
| Enc-Dec | Same as above | Submitted the assignment |

📘 Translation (English → French)

| Architecture | Input | Output |
| --- | --- | --- |
| Decoder | Translate: The student submitted the assignment on time. | L’étudiant a rendu le devoir à temps. |
| Enc-Dec | Translate English to French: The student submitted… | L’étudiant a rendu le devoir à temps. |

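Summary: Matching Architectures to Tasks

| Task type | Encoder (e.g. BERT) | Decoder (e.g. GPT) | Encoder-Decoder (e.g. T5, BART) |
| --- | --- | --- | --- |
| Classification / regression | ✅ Very well suited | ✅ Can also be modeled | ✅ Flexible; suits a text → label format |
| Entity recognition | ✅ Standard approach (token classification) | ⚠️ Requires sequence generation | ✅ Can generate spans |
| Coreference resolution | ⚠️ Usually needs external processing | ⚠️ Output can be ambiguous | ✅ Can generate structured output |
| Summarization | ❌ Cannot generate | ✅ Capable | ✅ Best fit (input and output are decoupled) |
| Translation | ❌ Not feasible | ✅ Possible (prompt-based) | ✅ Best (the standard Encoder-Decoder application) |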