Your First Transformer Model: Implementing and Training a Mini ChatBot from Scratch

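Introduction: Dispelling the Mystery and Embracing the Core Ideas

The Transformer is arguably the most influential architecture in today's wave of artificial intelligence. From the GPT series to BERT, from translation to dialogue, it is everywhere. Yet for many beginners it carries labels such as "complex" and "hard to follow"; the dense papers and intricate architecture diagrams can be intimidating.

Today our goal is to tear off that veil of mystery ourselves. We will not use any high-level deep learning library (such as Hugging Face's Transformers library). Instead, relying only on the basic tensor operations and neural network modules that PyTorch provides, we will build a complete Transformer model from scratch, line by line. Finally, we will train it on a small dataset so that it becomes a mini ChatBot capable of simple conversation.

Trust me: once you have followed this article through the whole process and seen your model start generating replies, core concepts such as Self-Attention and positional encoding should suddenly click. This is not just a programming exercise; it is also an excellent way to understand the core architecture of modern AI in depth.

Let's get started!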

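Part 1: Transformer Architecture Overview

Before diving into the code, let's quickly review the core design of the Transformer. It was first proposed in the paper "Attention Is All You Need". It is built entirely on the attention mechanism and discards the traditional recurrent and convolutional structures.

A typical Transformer contains an encoder and a decoder. For our chatbot task, the encoder is responsible for understanding the input question, and the decoder is responsible for generating the output answer.

Encoder: a stack of N identical layers (6 in the original paper). Each layer contains two sub-layers:
1. Multi-Head Self-Attention
2. Position-wise Feed-Forward Network
  Each sub-layer is wrapped in a residual connection followed by layer normalization.
Decoder: also a stack of N identical layers. Each layer contains three sub-layers:
1. Masked Multi-Head Self-Attention: ensures that, during decoding, a position can only see information up to and including itself, which prevents "peeking" at future answer tokens.
2. Multi-Head Encoder-Decoder Attention: helps the decoder focus on the relevant parts of the input sequence.
3. Feed-forward network.
  Again, each sub-layer has a residual connection and layer normalization.

In addition, the model starts with input embedding layers and positional encoding, and ends with an output linear layer and a softmax.

Our implementation will follow this structure closely, and we will build it bottom-up.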

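Part 2: Implementing the Core Modules

We start with the most central and most important modules.

1. Self-Attention and Scaled Dot-Product Attention

Self-Attention is the soul of the Transformer. Its purpose is to let every token in a sequence interact with every other token in that sequence, so that context is captured more effectively.

Scaled Dot-Product Attention is computed as follows:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

where:

Q (Query): the query matrix, representing the token that is currently attending.
K (Key): the key matrix, representing all tokens in the sequence that can be queried.
V (Value): the value matrix, representing the actual information carried by all tokens in the sequence.
d_k: the dimension of the key vectors. Dividing by the scaling factor $\sqrt{d_k}$ keeps the dot products from growing too large, which would push the softmax into a saturated region where gradients vanish.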
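
To make the role of $\sqrt{d_k}$ concrete, here is a short derivation. It rests on a common simplifying assumption that this article does not state: the components of a query $q$ and a key $k$ are independent, with zero mean and unit variance.

```latex
\begin{aligned}
q \cdot k &= \sum_{i=1}^{d_k} q_i k_i, \\
\mathbb{E}[q \cdot k] &= 0, \qquad
\mathrm{Var}(q \cdot k) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i k_i) = d_k, \\
\mathrm{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) &= 1 .
\end{aligned}
```

Under that assumption the unscaled scores have a standard deviation of $\sqrt{d_k}$ (8 when $d_k = 64$), while the scaled scores stay at roughly unit scale regardless of the head dimension.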

import torch
import torch.nn as nn
import torch.nn.functional as F
import math


class ScaledDotProductAttention(nn.Module):
    """Scaled Dot-Product Attention"""

    def __init__(self, dropout_rate=0.1):
        super(ScaledDotProductAttention, self).__init__()
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, Q, K, V, attn_mask=None):
        # Shapes of Q, K, V: [batch_size, n_heads, seq_len, d_k or d_v]
        # d_k = d_model / n_heads
        d_k = K.size()[-1]
        # Attention scores: QK^T / sqrt(d_k)
        scores = torch.matmul(Q, K.transpose(-1, -2)) / math.sqrt(d_k)
        # If a mask is given, apply it: positions where the mask is 1 (True)
        # are set to a very large negative value such as -1e9
        if attn_mask is not None:
            scores = scores.masked_fill(attn_mask == 1, -1e9)
        # Softmax over the last dimension (the key positions) gives the attention weights
        attn_weights = F.softmax(scores, dim=-1)
        # Optional: apply dropout
        attn_weights = self.dropout(attn_weights)
        # Multiply the attention weights by V to get the final output
        output = torch.matmul(attn_weights, V)  # [batch_size, n_heads, seq_len, d_v]
        return output, attn_weights
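
As a quick sanity check of the formula, the following sketch runs the same computation with plain tensor operations on toy data. The sizes and the choice of which position counts as padding are assumptions made only for this illustration.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes (assumed for illustration): 1 sentence, 2 heads, 3 tokens, d_k = 4
Q = torch.randn(1, 2, 3, 4)
K = torch.randn(1, 2, 3, 4)
V = torch.randn(1, 2, 3, 4)

# Mask convention used in this article: True (or 1) marks positions to hide.
# Here the last key position is treated as padding.
mask = torch.tensor([False, False, True]).view(1, 1, 1, 3)

scores = torch.matmul(Q, K.transpose(-1, -2)) / math.sqrt(K.size(-1))
scores = scores.masked_fill(mask, -1e9)
weights = F.softmax(scores, dim=-1)
output = torch.matmul(weights, V)

print(weights.shape)  # torch.Size([1, 2, 3, 3])
print(output.shape)   # torch.Size([1, 2, 3, 4])
print(weights[0, 0])  # each row sums to 1; the last column is 0
```

Each row of `weights` is a probability distribution over the key positions, and the masked column receives no weight, so the padded token contributes nothing to `output`.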

2. Multi-Head Attention

Multi-head attention splits the model dimension into several "heads" so that each head can attend to a different aspect of the sequence (for example, some heads may focus on syntactic relations and others on semantic relations). The outputs of all heads are then merged.

class MultiHeadAttention(nn.Module):
    """Multi-Head Attention mechanism"""

    def __init__(self, d_model, n_heads, dropout_rate=0.1):
        super(MultiHeadAttention, self).__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads  # dimension of each head
        self.d_v = d_model // n_heads
        # Linear projections that produce Q, K, V
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)  # output projection
        self.attention = ScaledDotProductAttention(dropout_rate)
        self.dropout = nn.Dropout(dropout_rate)
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, Q, K, V, attn_mask=None):
        # Keep the input for the residual connection
        residual = Q
        batch_size = Q.size(0)
        # Project and split into heads:
        # (batch_size, seq_len, d_model) -> (batch_size, seq_len, n_heads, d_k) -> (batch_size, n_heads, seq_len, d_k)
        q_s = self.W_Q(Q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        k_s = self.W_K(K).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        v_s = self.W_V(V).view(batch_size, -1, self.n_heads, self.d_v).transpose(1, 2)
        # Add a head dimension only to 3-D masks [batch_size, seq_len, seq_len].
        # 4-D masks (such as those built by the mask helpers later in this article)
        # already broadcast over all heads and must not be unsqueezed again.
        if attn_mask is not None and attn_mask.dim() == 3:
            attn_mask = attn_mask.unsqueeze(1)  # [batch_size, 1, seq_len, seq_len]
        # Apply ScaledDotProductAttention
        context, attn_weights = self.attention(q_s, k_s, v_s, attn_mask=attn_mask)
        # Concatenate the heads:
        # (batch_size, n_heads, seq_len, d_v) -> (batch_size, seq_len, n_heads * d_v) = (batch_size, seq_len, d_model)
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        # Output projection
        output = self.W_O(context)
        output = self.dropout(output)
        # Residual connection and layer normalization
        output = self.layer_norm(output + residual)
        return output, attn_weights
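
The `view` and `transpose` calls used to split and merge heads are easy to get wrong, so here is a small shape walkthrough. The sizes are arbitrary toy values chosen for this sketch, not values from the article.

```python
import torch

batch_size, seq_len, d_model, n_heads = 2, 5, 8, 2
d_k = d_model // n_heads

x = torch.arange(batch_size * seq_len * d_model, dtype=torch.float)
x = x.view(batch_size, seq_len, d_model)

# Split: (batch, seq, d_model) -> (batch, seq, heads, d_k) -> (batch, heads, seq, d_k)
heads = x.view(batch_size, -1, n_heads, d_k).transpose(1, 2)

# Merge: undo the transpose, then flatten the head dimension back into d_model
merged = heads.transpose(1, 2).contiguous().view(batch_size, -1, d_model)

print(heads.shape)                              # torch.Size([2, 2, 5, 4])
print(torch.equal(merged, x))                   # True
# Head 0 sees the first d_k features of every token, head 1 the next d_k
print(torch.equal(heads[:, 0], x[:, :, :d_k]))  # True
```

The `.contiguous()` call matters: after `transpose` the tensor is no longer laid out contiguously in memory, and `view` would raise an error without it.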

3. Position-wise Feed-Forward Network

This is a simple feed-forward network that transforms the features of each position (token) independently. It typically consists of two linear layers with a ReLU activation in between.

class PositionWiseFFN(nn.Module):
    """Position-wise Feed-Forward Network"""

    def __init__(self, d_model, d_ff, dropout_rate=0.1):
        super(PositionWiseFFN, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout_rate)
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        residual = x
        x = self.w_1(x)
        x = F.relu(x)
        x = self.dropout(x)
        x = self.w_2(x)
        x = self.dropout(x)
        # Residual connection and layer normalization
        x = self.layer_norm(x + residual)
        return x
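
"Position-wise" means that the same two linear layers are applied to every token separately, with no mixing across positions. The sketch below checks this on toy data; it leaves out dropout, the residual connection and layer normalization so that only the feed-forward core is compared.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_ff = 8, 32  # toy sizes, assumed for illustration
w_1 = nn.Linear(d_model, d_ff)
w_2 = nn.Linear(d_ff, d_model)

def ffn(t):
    # Core of the feed-forward sub-layer: Linear -> ReLU -> Linear
    return w_2(F.relu(w_1(t)))

x = torch.randn(2, 5, d_model)  # [batch_size, seq_len, d_model]

whole = ffn(x)  # applied to the full sequence at once
per_position = torch.stack([ffn(x[:, i, :]) for i in range(x.size(1))], dim=1)

print(torch.allclose(whole, per_position, atol=1e-5))  # True
```

Because tokens do not interact here, all exchange of information between positions happens in the attention sub-layers.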

4. Positional Encoding

Because the Transformer has no recurrent or convolutional structure, it cannot perceive the order of a sequence on its own. We therefore have to inject position information explicitly. Here we use the sine and cosine encoding from the paper.

class PositionalEncoding(nn.Module):
    """Implement the PE function."""

    def __init__(self, d_model, max_seq_len=5000):
        super(PositionalEncoding, self).__init__()
        # Create a positional encoding matrix that is long enough
        pe = torch.zeros(max_seq_len, d_model)
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)  # [max_seq_len, 1]
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        # Even indices use sine, odd indices use cosine
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # [1, max_seq_len, d_model]
        # Register as a buffer: it is part of the model state but not a trainable parameter
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Shape of x: [batch_size, seq_len, d_model]
        x = x + self.pe[:, :x.size(1), :]
        return x
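
To see what the encoding looks like for a single position, the following plain-Python sketch evaluates the same sinusoidal formula, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The helper name `positional_encoding` is introduced only for this illustration.

```python
import math

def positional_encoding(pos, d_model):
    """Return the sinusoidal encoding vector for one position (plain Python)."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)          # even index: sine
        if i + 1 < d_model:
            pe[i + 1] = math.cos(angle)  # odd index: cosine
    return pe

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print([round(v, 4) for v in positional_encoding(1, 4)])
```

Low dimensions oscillate quickly with the position and high dimensions slowly, so every position receives a distinct pattern whose values stay within [-1, 1].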

Part 3: Assembling the Encoder and Decoder Layers

With the building blocks above, we can now assemble the encoder layer and the decoder layer.

1. Encoder Layer

An encoder layer contains a multi-head self-attention sub-layer and a feed-forward sub-layer.

class EncoderLayer(nn.Module):
    """A single layer of the encoder."""

    def __init__(self, d_model, n_heads, d_ff, dropout_rate=0.1):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads, dropout_rate)
        self.ffn = PositionWiseFFN(d_model, d_ff, dropout_rate)

    def forward(self, enc_input, enc_self_attn_mask=None):
        # Self-attention sub-layer
        enc_output, attn_weights = self.self_attn(enc_input, enc_input, enc_input, attn_mask=enc_self_attn_mask)
        # Feed-forward sub-layer
        enc_output = self.ffn(enc_output)
        return enc_output, attn_weights

2. Decoder Layer

A decoder layer contains three sub-layers: masked self-attention, encoder-decoder attention, and a feed-forward network.

class DecoderLayer(nn.Module):
    """A single layer of the decoder."""

    def __init__(self, d_model, n_heads, d_ff, dropout_rate=0.1):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads, dropout_rate)
        self.enc_dec_attn = MultiHeadAttention(d_model, n_heads, dropout_rate)
        self.ffn = PositionWiseFFN(d_model, d_ff, dropout_rate)

    def forward(self, dec_input, enc_output, dec_self_attn_mask=None, dec_enc_attn_mask=None):
        # Masked self-attention sub-layer
        dec_output, self_attn_weights = self.self_attn(dec_input, dec_input, dec_input, attn_mask=dec_self_attn_mask)
        # Encoder-decoder attention sub-layer:
        # Q comes from the decoder; K and V come from the encoder output
        dec_output, enc_dec_attn_weights = self.enc_dec_attn(dec_output, enc_output, enc_output, attn_mask=dec_enc_attn_mask)
        # Feed-forward sub-layer
        dec_output = self.ffn(dec_output)
        return dec_output, self_attn_weights, enc_dec_attn_weights
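
In encoder-decoder attention the queries and the keys/values come from sequences of different lengths, which changes the shape of the attention matrix. A minimal single-head sketch, with toy lengths assumed for illustration and the learned projections left out:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 4
src_len, tgt_len = 6, 3  # assumed toy lengths

dec_states = torch.randn(1, tgt_len, d_k)  # queries come from the decoder
enc_output = torch.randn(1, src_len, d_k)  # keys and values come from the encoder

scores = torch.matmul(dec_states, enc_output.transpose(-1, -2)) / math.sqrt(d_k)
weights = F.softmax(scores, dim=-1)
context = torch.matmul(weights, enc_output)

print(weights.shape)  # torch.Size([1, 3, 6]): one row per target token, one column per source token
print(context.shape)  # torch.Size([1, 3, 4])
```

Each target position obtains its own weighted summary of the source sentence, which is why any mask for this sub-layer has to describe the source positions rather than the target positions.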

Part 4: Building the Complete Transformer Model

Now we combine the embedding layers, the positional encoding, the encoder stack, the decoder stack and the final output layer.

class Transformer(nn.Module):
    """The complete Transformer model."""

    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, n_heads, n_layers, d_ff, max_seq_len, dropout_rate=0.1):
        super(Transformer, self).__init__()
        self.d_model = d_model
        # Input and output embedding layers. Sharing their weights often works better,
        # but we keep them separate for now.
        self.enc_embedding = nn.Embedding(src_vocab_size, d_model)
        self.dec_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_seq_len)
        # Encoder and decoder stacks
        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, n_heads, d_ff, dropout_rate) for _ in range(n_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, n_heads, d_ff, dropout_rate) for _ in range(n_layers)])
        # Final linear layer (the softmax is applied inside the loss function)
        self.linear = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, src_input, tgt_input, src_mask=None, tgt_mask=None):
        # Encoder
        enc_output = self.enc_embedding(src_input) * math.sqrt(self.d_model)
        enc_output = self.pos_encoding(enc_output)
        enc_output = self.dropout(enc_output)
        for layer in self.encoder_layers:
            enc_output, _ = layer(enc_output, enc_self_attn_mask=src_mask)
        # Decoder
        dec_output = self.dec_embedding(tgt_input) * math.sqrt(self.d_model)
        dec_output = self.pos_encoding(dec_output)
        dec_output = self.dropout(dec_output)
        for layer in self.decoder_layers:
            # src_mask is passed as well so that encoder-decoder attention ignores source padding
            dec_output, _, _ = layer(dec_output, enc_output, dec_self_attn_mask=tgt_mask, dec_enc_attn_mask=src_mask)
        # Output projection
        output = self.linear(dec_output)
        # The softmax is computed in the loss function, so we return the logits directly
        return output
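
It can help to know how large this model is before training it. The sketch below counts trainable parameters by hand for the layer layout described in this article (four `d_model x d_model` projections and one LayerNorm per attention block, two linear layers and one LayerNorm per feed-forward block). The function names and the 1,000-token vocabulary are hypothetical, chosen only for this estimate; the positional encoding is a buffer and contributes no parameters.

```python
def linear_params(n_in, n_out):
    return n_in * n_out + n_out  # weight matrix + bias

def layer_norm_params(d_model):
    return 2 * d_model           # scale + shift

def mha_params(d_model):
    # W_Q, W_K, W_V, W_O plus the LayerNorm inside the attention block
    return 4 * linear_params(d_model, d_model) + layer_norm_params(d_model)

def ffn_params(d_model, d_ff):
    return (linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
            + layer_norm_params(d_model))

def transformer_params(src_vocab, tgt_vocab, d_model, n_layers, d_ff):
    encoder = n_layers * (mha_params(d_model) + ffn_params(d_model, d_ff))
    decoder = n_layers * (2 * mha_params(d_model) + ffn_params(d_model, d_ff))
    embeddings = (src_vocab + tgt_vocab) * d_model
    output_layer = linear_params(d_model, tgt_vocab)
    return encoder + decoder + embeddings + output_layer

# Hypothetical vocabulary of 1,000 tokens on both sides
big = transformer_params(1000, 1000, d_model=512, n_layers=6, d_ff=2048)
small = transformer_params(1000, 1000, d_model=128, n_layers=2, d_ff=512)
print(f"{big:,}")    # 45,675,496
print(f"{small:,}")  # 1,310,696
```

With the paper-sized settings (`d_model=512`, 6 layers, `d_ff=2048`) the model has roughly 45 million parameters under these assumptions. For a dataset of only a few dozen sentence pairs, a much smaller configuration such as the second one may train faster and be less prone to simply memorizing the data.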

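Part 5: Data Preparation and Training

1. Choosing a Mini Dataset

To experiment quickly, we use a very small dialogue dataset. For example, we can create one by hand: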

Q: Hi
A: Hello!
Q: What's your name?
A: I'm ChatBot.
Q: How are you?
A: I'm fine, thank you.
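... (a few dozen more pairs)

Alternatively, we can use a small subset of the Cornell Movie Dialogs Corpus. Either way, we need to build a vocabulary and convert the sentences into sequences of IDs.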

2. Building the Vocabulary and the DataLoader

# Pseudocode: build the vocabulary
# sentences = [all Q and A sentences]
# vocab = {'<pad>':0, '<sos>':1, '<eos>':2, ...}  build the vocabulary dictionary
# src_ids = [[vocab[word] for word in sentence.split()] for sentence in src_sentences]

# Use PyTorch's Dataset and DataLoader
from torch.utils.data import Dataset, DataLoader


class ChatDataset(Dataset):
    def __init__(self, src_sentences, tgt_sentences, vocab, max_len):
        self.src_sentences = src_sentences
        self.tgt_sentences = tgt_sentences
        self.vocab = vocab
        self.max_len = max_len

    def __len__(self):
        return len(self.src_sentences)

    def __getitem__(self, idx):
        src_seq = self.sentence_to_ids(self.src_sentences[idx])
        tgt_seq = self.sentence_to_ids(self.tgt_sentences[idx])
        # Decoder input starts with <sos>; decoder target ends with <eos>
        tgt_input = [self.vocab['<sos>']] + tgt_seq
        tgt_output = tgt_seq + [self.vocab['<eos>']]
        # Pad to the maximum length
        src_seq = self.pad_seq(src_seq, self.max_len)
        tgt_input = self.pad_seq(tgt_input, self.max_len)
        tgt_output = self.pad_seq(tgt_output, self.max_len)
        return torch.LongTensor(src_seq), torch.LongTensor(tgt_input), torch.LongTensor(tgt_output)

    # ... (implement the sentence_to_ids and pad_seq methods)
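
The article leaves the vocabulary construction and the helper methods as an exercise. One possible sketch is shown below as free functions. Everything in it is an assumption rather than the author's implementation: whitespace tokenization, lowercasing, the extra `<unk>` token for unseen words, and truncation of sequences longer than `max_len`.

```python
def build_vocab(sentences):
    """Map each whitespace-separated token to an integer ID."""
    vocab = {'<pad>': 0, '<sos>': 1, '<eos>': 2, '<unk>': 3}  # <unk> is an added assumption
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def sentence_to_ids(sentence, vocab):
    # Unknown words fall back to <unk> instead of raising KeyError
    return [vocab.get(word, vocab['<unk>']) for word in sentence.lower().split()]

def pad_seq(seq, max_len, pad_idx=0):
    # Truncate if too long, then right-pad with pad_idx
    seq = seq[:max_len]
    return seq + [pad_idx] * (max_len - len(seq))

def ids_to_sentence(ids, inv_vocab):
    return ' '.join(inv_vocab[i] for i in ids)

pairs = [("Hi", "Hello!"), ("How are you?", "I'm fine, thank you.")]
vocab = build_vocab([s for pair in pairs for s in pair])
inv_vocab = {i: w for w, i in vocab.items()}

ids = sentence_to_ids("How are you?", vocab)
print(ids)                              # [6, 7, 8]
print(pad_seq(ids, 5))                  # [6, 7, 8, 0, 0]
print(ids_to_sentence(ids, inv_vocab))  # how are you?
```

Splitting on whitespace keeps punctuation attached to words ("you?" and "you." become different tokens), which is tolerable for a toy dataset but worth replacing with a real tokenizer for anything larger. Note also that truncation can cut off a trailing `<eos>`, so `max_len` should be chosen larger than the longest sentence plus one.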

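3. Creating the Attention Masks and the Training Loop

We need two kinds of masks:

Padding mask: hides the <pad> tokens so that the attention mechanism does not attend to these meaningless positions.
Sequence mask (look-ahead mask): used in the decoder's self-attention to prevent a position from seeing future tokens during decoding. Each position may attend only to itself and to earlier positions (the lower triangle), so the mask marks the entries strictly above the diagonal.

def create_padding_mask(seq, pad_idx):
    # seq: [batch_size, seq_len]
    return (seq == pad_idx).unsqueeze(1).unsqueeze(2)  # [batch_size, 1, 1, seq_len], convenient for broadcasting


def create_look_ahead_mask(seq_len):
    # Strictly upper-triangular matrix: True (masked) above the diagonal,
    # False (visible) on and below the diagonal
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    return mask.unsqueeze(0).unsqueeze(0)  # [1, 1, seq_len, seq_len]

The training loop follows the standard procedure: prepare the data, compute the model output, compute the loss (CrossEntropyLoss configured to ignore <pad>), backpropagate, and take an optimizer step.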
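
To see how the two masks combine for the decoder, here is a small worked example. The token IDs are toy values assumed for illustration; `True` (printed as 1) means "hidden".

```python
import torch

pad_idx = 0
tgt_in = torch.tensor([[1, 5, 6, 0]])  # <sos>, two words, one <pad> (toy IDs)

padding_mask = (tgt_in == pad_idx).unsqueeze(1).unsqueeze(2)  # [1, 1, 1, 4]
look_ahead = torch.triu(torch.ones(4, 4), diagonal=1).bool()  # [4, 4]
combined = torch.logical_or(padding_mask, look_ahead)         # broadcasts to [1, 1, 4, 4]

print(combined[0, 0].int())
# tensor([[0, 1, 1, 1],
#         [0, 0, 1, 1],
#         [0, 0, 0, 1],
#         [0, 0, 0, 1]], dtype=torch.int32)
```

Row i lists what query position i may see. The upper triangle hides future tokens, and the last column is hidden for every row because that key position is padding.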

# Initialize the model, the optimizer and the loss function.
# src_vocab_size, tgt_vocab_size, pad_idx, num_epochs and dataloader are assumed
# to have been defined in the data preparation step.
model = Transformer(src_vocab_size, tgt_vocab_size, d_model=512, n_heads=8, n_layers=6, d_ff=2048, max_seq_len=100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.98), eps=1e-9)
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)  # ignore the loss at padding positions

for epoch in range(num_epochs):
    model.train()
    for batch in dataloader:
        src, tgt_in, tgt_out = batch
        src_mask = create_padding_mask(src, pad_idx)
        # Decoder mask: padding mask combined with the look-ahead mask
        tgt_padding_mask = create_padding_mask(tgt_in, pad_idx)
        tgt_look_ahead_mask = create_look_ahead_mask(tgt_in.size(1)).to(tgt_in.device)
        tgt_mask = torch.logical_or(tgt_padding_mask, tgt_look_ahead_mask)  # combined mask
        optimizer.zero_grad()
        output = model(src, tgt_in, src_mask, tgt_mask)
        # output: [batch_size, tgt_len, tgt_vocab_size]
        # tgt_out: [batch_size, tgt_len]
        loss = criterion(output.view(-1, output.size(-1)), tgt_out.view(-1))
        loss.backward()
        optimizer.step()
    # ... after each epoch, print the loss or run validation
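
The effect of `ignore_index` can be checked in isolation. The sketch below uses random logits and toy target IDs (both assumptions for illustration) and compares the loss with padding ignored against a loss computed by hand over the real tokens only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pad_idx = 0
logits = torch.randn(1, 4, 10)                # [batch_size, tgt_len, vocab_size]
tgt_out = torch.tensor([[5, 6, 2, pad_idx]])  # the last position is padding

# Flatten to [batch_size * tgt_len, vocab_size] and [batch_size * tgt_len], as in the training loop
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
loss_all = criterion(logits.view(-1, logits.size(-1)), tgt_out.view(-1))

# Same loss computed over the three real tokens only
plain = nn.CrossEntropyLoss()
loss_real = plain(logits[0, :3], tgt_out[0, :3])

print(torch.allclose(loss_all, loss_real))  # True
```

The padded position is excluded both from the sum and from the count used for averaging, so longer padding does not dilute the loss.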

Part 6: Inference and Dialogue Generation

After training, we use greedy search to generate replies.

def predict(model, src_sentence, vocab, inv_vocab, max_len, device):
    # sentence_to_ids and ids_to_sentence are assumed to be available as standalone helpers
    model.eval()
    pad_idx = vocab['<pad>']
    with torch.no_grad():
        # Convert the source sentence to IDs
        src_ids = sentence_to_ids(src_sentence, vocab)
        src_tensor = torch.LongTensor(src_ids).unsqueeze(0).to(device)  # [1, src_len]
        # Initialize the target input with <sos>
        tgt_ids = [vocab['<sos>']]
        for i in range(max_len):
            tgt_tensor = torch.LongTensor(tgt_ids).unsqueeze(0).to(device)  # [1, current_tgt_len]
            # Create the masks
            src_mask = create_padding_mask(src_tensor, pad_idx)
            tgt_mask = create_look_ahead_mask(len(tgt_ids)).to(device)
            # Predict the next word
            output = model(src_tensor, tgt_tensor, src_mask, tgt_mask)
            next_word_logits = output[0, -1, :]  # output at the last position
            next_word_id = torch.argmax(next_word_logits, dim=-1).item()
            tgt_ids.append(next_word_id)
            if next_word_id == vocab['<eos>']:
                break
    # Convert the IDs back to a sentence, dropping <sos> and, if present, the trailing <eos>
    output_ids = tgt_ids[1:]
    if output_ids and output_ids[-1] == vocab['<eos>']:
        output_ids = output_ids[:-1]
    predicted_sentence = ids_to_sentence(output_ids, inv_vocab)
    return predicted_sentence

# Example usage
# vocab: the vocabulary; inv_vocab: the inverse vocabulary (ID to word)
# response = predict(model, "Hello there", vocab, inv_vocab, max_len=20, device='cpu')
# print(response)
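
The control flow of greedy search can be tried out without a trained model. In the sketch below, `toy_scores` is a hypothetical stand-in for the Transformer that always continues a fixed reply; only the decoding loop itself mirrors the procedure described above.

```python
def greedy_decode(next_token_scores, sos_id, eos_id, max_len):
    """Repeatedly pick the highest-scoring next token until <eos> or max_len."""
    tgt_ids = [sos_id]
    for _ in range(max_len):
        scores = next_token_scores(tgt_ids)  # one score per vocabulary ID
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tgt_ids.append(next_id)
        if next_id == eos_id:
            break
    # Drop <sos> and, only if it was actually generated, the trailing <eos>
    output = tgt_ids[1:]
    if output and output[-1] == eos_id:
        output = output[:-1]
    return output

# Hypothetical stand-in for the model: always continues the fixed reply 4, 5, <eos>
def toy_scores(prefix, reply=(4, 5, 2), vocab_size=6):
    scores = [0.0] * vocab_size
    scores[reply[min(len(prefix) - 1, len(reply) - 1)]] = 1.0
    return scores

print(greedy_decode(toy_scores, sos_id=1, eos_id=2, max_len=10))  # [4, 5]
print(greedy_decode(toy_scores, sos_id=1, eos_id=2, max_len=1))   # [4]
```

The second call shows why the trailing token should be removed only when it really is `<eos>`: if generation stops because `max_len` is reached, the last token is a real word and should be kept.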