From RNN to Transformer
Table of Contents
- Fundamentals: An Overview of Sequence Models
- RNN: Recurrent Neural Networks
- LSTM: Long Short-Term Memory Networks
- The Transformer Architecture
- Applications in Time Series Forecasting
- Applications in Computer Vision
- Applications in Large Language Models
- Practice and Optimization
- Frontier Developments
Fundamentals: An Overview of Sequence Models {#基礎篇}
What Is Sequence Data?
Sequence data is a collection of data points arranged in a specific order, where the ordering itself carries essential information:
- Time series: stock prices, temperature readings, sales figures
- Text sequences: sentences, paragraphs, documents
- Video sequences: consecutive image frames
- Audio sequences: sound wave signals
Why Do We Need Dedicated Sequence Models?
Traditional feed-forward neural networks have the following limitations:
- Fixed input size: they cannot handle variable-length sequences
- Position invariance: they ignore the ordering information in a sequence
- No memory: they cannot make use of earlier inputs
- Parameter explosion: long sequences would require an enormous number of parameters
Core Challenges of Sequence Modeling
- Long-term dependencies: capturing relationships between elements that are far apart in the sequence
- Vanishing/exploding gradients: deep networks are difficult to train
- Computational efficiency: the cost of processing long sequences
- Generalization: adapting to sequences of different lengths and types
RNN: Recurrent Neural Networks {#rnn}
Basic Concepts
An RNN introduces recurrent connections so that the network can maintain an internal state (memory), which allows it to process sequence data.
Core Idea
```
# Basic RNN equations
h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t + b_h)
y_t = W_hy @ h_t + b_y
# where:
# h_t: hidden state at time t
# x_t: input at time t
# y_t: output at time t
# W_*: weight matrices
# b_*: bias vectors
```
Structure of an RNN
1. The Basic RNN Cell
- Input layer: receives the input x_t at the current time step
- Hidden layer: combines the current input with the hidden state from the previous time step
- Output layer: produces the output for the current time step
2. The Unrolled RNN
An RNN can be unrolled into a deep network in which every time step shares the same parameters:
```
x_0 → [RNN] → h_0 → y_0
               ↓
x_1 → [RNN] → h_1 → y_1
               ↓
x_2 → [RNN] → h_2 → y_2
```
RNN Variants
1. Many-to-One
- Applications: sentiment analysis, text classification
- Characteristics: the whole sequence as input, a single output (see the sketch after this list)
2. One-to-Many
- Applications: image captioning, music generation
- Characteristics: a single input, a sequence as output
3. Many-to-Many
- Synchronized (inputs and outputs aligned): per-frame video classification, sequence labeling
- Asynchronous: sequence-to-sequence tasks such as machine translation
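As a concrete instance of the many-to-one pattern above, here is a minimal sketch of a sequence classifier (the class name, vocabulary size, and layer widths are illustrative assumptions, not part of the original text):

```python
import torch
import torch.nn as nn

class ManyToOneRNN(nn.Module):
    """Many-to-one: consume a whole token sequence, emit a single label (e.g. sentiment)."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                 # [batch, seq_len]
        x = self.embedding(token_ids)             # [batch, seq_len, embed_dim]
        _, h_n = self.rnn(x)                      # h_n: [1, batch, hidden_dim]
        return self.classifier(h_n.squeeze(0))    # one prediction per sequence

# Usage: 8 sequences of 20 tokens drawn from a vocabulary of 1000
logits = ManyToOneRNN(vocab_size=1000)(torch.randint(0, 1000, (8, 20)))
```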
Training an RNN
Backpropagation Through Time (BPTT)
```python
# BPTT pseudocode (helper functions are placeholders)
def bptt(sequences, targets, rnn_params):
    total_loss = 0
    for seq, target in zip(sequences, targets):
        # Forward pass
        hidden_states = []
        h = initial_hidden_state
        for x in seq:
            h = rnn_cell(x, h, rnn_params)
            hidden_states.append(h)
        # Compute the loss
        loss = compute_loss(hidden_states[-1], target)
        total_loss += loss
        # Backward pass through all time steps
        gradients = backward_pass(loss, hidden_states)
        # Update the parameters
        update_parameters(rnn_params, gradients)
```
Problems with RNNs
- Vanishing gradients: on long sequences the gradients decay exponentially
- Exploding gradients: the gradients become extremely large and destabilize training (a gradient-clipping sketch follows this list)
- Difficulty with long-term dependencies: long-range relationships in a sequence are hard to capture
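A standard mitigation for the exploding-gradient problem listed above is gradient clipping before the parameter update. This is a minimal sketch of a single training step; the model, data, and loss are dummy placeholders chosen only to make the snippet runnable:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

x = torch.randn(4, 50, 16)       # [batch, seq_len, input_size] dummy batch
target = torch.randn(4, 50, 32)  # dummy regression target

output, _ = model(x)
loss = criterion(output, target)
loss.backward()
# Clip the global gradient norm so a single long sequence cannot blow up the update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```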
LSTM: Long Short-Term Memory Networks {#lstm}
Motivation for LSTM
The LSTM was designed specifically to address the long-term dependency problem of RNNs by introducing gating mechanisms and a memory cell.
Core Components of the LSTM
1. Cell State
- A long-term memory channel for information
- Information can be selectively retained or forgotten
2. The Three Gates
Forget Gate
```
f_t = sigmoid(W_f @ [h_{t-1}, x_t] + b_f)
# decides what information to discard from the cell state
```
Input Gate
```
i_t = sigmoid(W_i @ [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C @ [h_{t-1}, x_t] + b_C)
# decides what new information to store in the cell state
```
Output Gate
```
o_t = sigmoid(W_o @ [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
# decides what information to output
```
The Complete LSTM Computation
```python
def lstm_cell(x_t, h_prev, C_prev, W, b):
    # 1. Forget gate: decide what to discard
    f_t = sigmoid(W_f @ concat([h_prev, x_t]) + b_f)
    # 2. Input gate: decide what to store
    i_t = sigmoid(W_i @ concat([h_prev, x_t]) + b_i)
    C_tilde = tanh(W_C @ concat([h_prev, x_t]) + b_C)
    # 3. Update the cell state
    C_t = f_t * C_prev + i_t * C_tilde
    # 4. Output gate: decide what to output
    o_t = sigmoid(W_o @ concat([h_prev, x_t]) + b_o)
    h_t = o_t * tanh(C_t)
    return h_t, C_t
```
LSTM Variants
1. GRU (Gated Recurrent Unit)
- A simplified LSTM with only two gates
- An update gate and a reset gate
- Fewer parameters, faster training
```python
def gru_cell(x_t, h_prev, W, b):
    # Reset gate
    r_t = sigmoid(W_r @ concat([h_prev, x_t]) + b_r)
    # Update gate
    z_t = sigmoid(W_z @ concat([h_prev, x_t]) + b_z)
    # Candidate hidden state
    h_tilde = tanh(W_h @ concat([r_t * h_prev, x_t]) + b_h)
    # Final hidden state
    h_t = (1 - z_t) * h_prev + z_t * h_tilde
    return h_t
```
2. Bidirectional LSTM (BiLSTM)
- Processes the sequence in both the forward and backward directions
- Captures context from both sides
- Performs very well on NLP tasks
3. Multi-layer LSTM
- Stacks several LSTM layers vertically
- Learns more abstract feature representations
- Increases the model's expressive power (a short usage sketch covering both variants follows this list)
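Both variants are exposed directly by PyTorch's `nn.LSTM` through the `num_layers` and `bidirectional` arguments; a minimal sketch with illustrative shapes:

```python
import torch
import torch.nn as nn

# Two stacked layers, each processing the sequence in both directions
bilstm = nn.LSTM(input_size=32, hidden_size=64,
                 num_layers=2, bidirectional=True, batch_first=True)

x = torch.randn(8, 100, 32)      # [batch, seq_len, input_size]
output, (h_n, c_n) = bilstm(x)
print(output.shape)              # [8, 100, 128]: forward and backward states concatenated
print(h_n.shape)                 # [4, 8, 64]: num_layers * num_directions hidden states
```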
Strengths and Limitations of the LSTM
Strengths:
- Effectively mitigates the vanishing-gradient problem
- Can capture long-term dependencies
- Performs well on a wide range of sequence tasks
Limitations:
- High computational cost
- Hard to parallelize
- Still struggles with very long sequences
The Transformer Architecture {#transformer}
The Transformer's Revolutionary Innovation
The 2017 paper "Attention Is All You Need" completely changed the paradigm of sequence modeling.
Core Concept: the Self-Attention Mechanism
1. Scaled Dot-Product Attention
```python
def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: query matrix  [batch, seq_len, d_k]
    # K: key matrix    [batch, seq_len, d_k]
    # V: value matrix  [batch, seq_len, d_v]
    d_k = K.shape[-1]
    # Attention scores
    scores = Q @ K.transpose(-2, -1) / sqrt(d_k)
    # Optional mask
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    # Softmax normalization
    attention_weights = softmax(scores, dim=-1)
    # Weighted sum of the values
    output = attention_weights @ V
    return output, attention_weights
```
2. Multi-Head Attention
```python
class MultiHeadAttention:
    def __init__(self, d_model, n_heads):
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        # Linear projection layers
        self.W_q = Linear(d_model, d_model)
        self.W_k = Linear(d_model, d_model)
        self.W_v = Linear(d_model, d_model)
        self.W_o = Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]
        # 1. Project and reshape into heads
        Q = self.W_q(query).view(batch_size, -1, self.n_heads, self.d_k)
        K = self.W_k(key).view(batch_size, -1, self.n_heads, self.d_k)
        V = self.W_v(value).view(batch_size, -1, self.n_heads, self.d_k)
        # 2. Transpose so the heads can be computed in parallel
        Q = Q.transpose(1, 2)
        K = K.transpose(1, 2)
        V = V.transpose(1, 2)
        # 3. Attention
        attn_output, _ = scaled_dot_product_attention(Q, K, V, mask)
        # 4. Concatenate the heads
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.view(batch_size, -1, self.d_model)
        # 5. Final linear projection
        output = self.W_o(attn_output)
        return output
```
Architectural Components of the Transformer
1. Encoder
```python
class TransformerEncoder:
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        # Sub-layers
        self.self_attention = MultiHeadAttention(d_model, n_heads)
        self.feed_forward = FeedForward(d_model, d_ff)
        # Layer normalization
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)
        # Dropout
        self.dropout = Dropout(dropout)

    def forward(self, x, mask=None):
        # 1. Self-attention sub-layer
        attn_output = self.self_attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        # 2. Feed-forward sub-layer
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x
```
2. Decoder
```python
class TransformerDecoder:
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        # Three sub-layers
        self.masked_self_attention = MultiHeadAttention(d_model, n_heads)
        self.cross_attention = MultiHeadAttention(d_model, n_heads)
        self.feed_forward = FeedForward(d_model, d_ff)
        # Layer normalization
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)
        self.norm3 = LayerNorm(d_model)
        self.dropout = Dropout(dropout)

    def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
        # 1. Masked self-attention
        attn1 = self.masked_self_attention(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn1))
        # 2. Cross-attention over the encoder output
        attn2 = self.cross_attention(x, encoder_output, encoder_output, src_mask)
        x = self.norm2(x + self.dropout(attn2))
        # 3. Feed-forward network
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x
```
Positional Encoding
Because the Transformer has no recurrent structure, positional information must be injected explicitly:
```python
def positional_encoding(seq_len, d_model):
    position = np.arange(seq_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pos_encoding = np.zeros((seq_len, d_model))
    pos_encoding[:, 0::2] = np.sin(position * div_term)
    pos_encoding[:, 1::2] = np.cos(position * div_term)
    return pos_encoding
```
Advantages of the Transformer
- Parallel computation: all positions can be processed simultaneously
- Long-range dependencies: relationships between any two positions are modeled directly
- Interpretability: attention weights provide a basis for visualization
- Transfer learning: pre-trained models can be adapted to many downstream tasks
Applications in Time Series Forecasting {#時間序列}
Traditional Time Series Forecasting
1. Problem Definition
- Univariate forecasting: predict future values of a single time series
- Multivariate forecasting: predict several related time series simultaneously
- Multi-step forecasting: predict several future time points at once
2. Data Preprocessing
```python
class TimeSeriesPreprocessor:
    def __init__(self, window_size, horizon):
        self.window_size = window_size
        self.horizon = horizon

    def create_sequences(self, data):
        # Slide a window over the series: inputs of length window_size, targets of length horizon
        X, y = [], []
        for i in range(len(data) - self.window_size - self.horizon + 1):
            X.append(data[i:i + self.window_size])
            y.append(data[i + self.window_size:i + self.window_size + self.horizon])
        return np.array(X), np.array(y)

    def normalize(self, data):
        self.mean = np.mean(data)
        self.std = np.std(data)
        return (data - self.mean) / self.std

    def denormalize(self, data):
        return data * self.std + self.mean
```
RNN/LSTM Models for Time Series
1. Single-Step LSTM Forecaster
```python
class LSTMForecaster(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x: [batch, seq_len, input_size]
        lstm_out, (h_n, c_n) = self.lstm(x)
        # Use the output at the last time step
        predictions = self.linear(lstm_out[:, -1, :])
        return predictions
```
2. Seq2Seq Model
```python
class Seq2SeqForecaster(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, horizon):
        super().__init__()
        self.encoder = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(output_size, hidden_size, batch_first=True)
        self.output_layer = nn.Linear(hidden_size, output_size)
        self.horizon = horizon

    def forward(self, x):
        # Encode the historical sequence
        _, (h_n, c_n) = self.encoder(x)
        # Decode the forecast one step at a time
        decoder_input = torch.zeros(x.size(0), 1, self.output_layer.out_features, device=x.device)
        predictions = []
        for _ in range(self.horizon):
            output, (h_n, c_n) = self.decoder(decoder_input, (h_n, c_n))
            prediction = self.output_layer(output)
            predictions.append(prediction)
            decoder_input = prediction
        return torch.cat(predictions, dim=1)
```
Transformer Models for Time Series
1. Temporal Fusion Transformer (TFT)
```python
class TemporalFusionTransformer:
    """A Transformer designed specifically for time series forecasting, proposed by Google."""
    def __init__(self, config):
        # Variable selection network
        self.vsn = VariableSelectionNetwork(config)
        # Gated residual network
        self.grn = GatedResidualNetwork(config)
        # Interpretable multi-head attention
        self.attention = InterpretableMultiHeadAttention(config)
        # Positional encoding
        self.positional_encoding = PositionalEncoding(config)
        # Quantile loss
        self.quantile_loss = QuantileLoss(config.quantiles)
        # (the LSTM encoder and output layer used below are assumed to be defined elsewhere)

    def forward(self, x_static, x_historical, x_future):
        # 1. Variable selection
        selected_historical = self.vsn(x_historical)
        selected_future = self.vsn(x_future)
        # 2. Static feature encoding
        static_encoding = self.grn(x_static)
        # 3. LSTM encoding of the history
        historical_features = self.lstm_encoder(selected_historical)
        # 4. Self-attention with static context
        temporal_features = self.attention(historical_features, static_context=static_encoding)
        # 5. Forecast
        predictions = self.output_layer(temporal_features)
        return predictions
```
2. Autoformer
```python
class Autoformer:
    """A Transformer variant built around an auto-correlation mechanism."""
    def __init__(self, config):
        # Series decomposition
        self.decomposition = SeriesDecomposition(config.kernel_size)
        # Auto-correlation mechanism
        self.auto_correlation = AutoCorrelation(factor=config.factor,
                                                attention_dropout=config.dropout)
        # Encoder
        self.encoder = AutoformerEncoder(config)
        # Decoder
        self.decoder = AutoformerDecoder(config)

    def forward(self, x_enc, x_mark_enc, x_dec, x_mark_dec):
        # 1. Decompose the input series
        enc_seasonal, enc_trend = self.decomposition(x_enc)
        # 2. Encoder
        enc_out = self.encoder(enc_seasonal, x_mark_enc)
        # 3. Decoder produces seasonal and trend forecasts
        seasonal_output, trend_output = self.decoder(x_dec, x_mark_dec, enc_out, enc_trend)
        # 4. Combine the two components
        predictions = seasonal_output + trend_output
        return predictions
```
Key Techniques for Time Series Forecasting
1. Feature Engineering
```python
def create_time_features(df, date_column):
    """Extract calendar features from a datetime column."""
    df['hour'] = df[date_column].dt.hour
    df['dayofweek'] = df[date_column].dt.dayofweek
    df['month'] = df[date_column].dt.month
    df['dayofyear'] = df[date_column].dt.dayofyear
    df['weekofyear'] = df[date_column].dt.isocalendar().week
    # Cyclical encoding
    df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
    df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
    return df
```
2. Handling Multi-Scale Patterns
```python
class MultiScaleBlock(nn.Module):
    def __init__(self, in_channels, out_channels, scales=[1, 4, 8]):
        super().__init__()
        self.scales = scales
        # One strided convolution per temporal scale
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels, out_channels, kernel_size=s, stride=s)
            for s in scales
        ])

    def forward(self, x):
        multi_scale_features = []
        for scale, conv in zip(self.scales, self.convs):
            features = conv(x)
            # Upsample back to the original length
            features = F.interpolate(features, size=x.size(-1))
            multi_scale_features.append(features)
        return torch.cat(multi_scale_features, dim=1)
```
Applications in Computer Vision {#視覺}
Challenges of Visual Sequence Modeling
- Spatio-temporal modeling: spatial and temporal dimensions must be handled at the same time
- Computational cost: video data is enormous
- Long-range dependencies: an action may span many frames
RNN/LSTM in Vision
1. Image Captioning
```python
class ImageCaptioningModel(nn.Module):
    def __init__(self, encoder, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        # CNN encoder
        self.encoder = encoder  # e.g. a ResNet
        self.encoder_fc = nn.Linear(encoder.output_dim, hidden_dim)
        # LSTM decoder
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + hidden_dim, hidden_dim, batch_first=True)
        self.output_layer = nn.Linear(hidden_dim, vocab_size)
        # Attention mechanism
        self.attention = AdditiveAttention(hidden_dim)

    def forward(self, images, captions=None):
        # 1. Encode the image
        features = self.encoder(images)                       # [batch, features_dim, H, W]
        features = features.view(features.size(0), features.size(1), -1)
        features = features.permute(0, 2, 1)                  # [batch, H*W, features_dim]
        # 2. Initialize the LSTM state from the global image feature
        h_0 = self.encoder_fc(features.mean(dim=1))
        c_0 = torch.zeros_like(h_0)
        if self.training and captions is not None:
            # Teacher forcing during training
            embedded = self.embedding(captions)
            outputs = []
            h_t, c_t = h_0.unsqueeze(0), c_0.unsqueeze(0)
            for t in range(embedded.size(1)):
                # Attend over the spatial features
                context, _ = self.attention(h_t.squeeze(0), features)
                # One LSTM step
                input_t = torch.cat([embedded[:, t], context], dim=1)
                output, (h_t, c_t) = self.lstm(input_t.unsqueeze(1), (h_t, c_t))
                # Predict the next word
                prediction = self.output_layer(output.squeeze(1))
                outputs.append(prediction)
            return torch.stack(outputs, dim=1)
        else:
            # Autoregressive generation at inference time
            return self.generate(features, h_0, c_0)
```
2. Video Understanding
```python
class VideoUnderstandingModel(nn.Module):
    def __init__(self, feature_extractor, hidden_dim, num_classes):
        super().__init__()
        # 3D-CNN or 2D-CNN frame feature extractor
        self.feature_extractor = feature_extractor
        # Bidirectional LSTM
        self.bilstm = nn.LSTM(feature_extractor.output_dim, hidden_dim,
                              bidirectional=True, batch_first=True)
        # Temporal attention
        self.temporal_attention = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1))
        # Classifier
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, video_frames):
        # 1. Extract per-frame features
        batch_size, num_frames = video_frames.shape[:2]
        frame_features = []
        for t in range(num_frames):
            features = self.feature_extractor(video_frames[:, t])
            frame_features.append(features)
        frame_features = torch.stack(frame_features, dim=1)
        # 2. Temporal modeling
        lstm_out, _ = self.bilstm(frame_features)
        # 3. Temporal attention pooling
        attention_weights = self.temporal_attention(lstm_out)
        attention_weights = F.softmax(attention_weights, dim=1)
        video_representation = (lstm_out * attention_weights).sum(dim=1)
        # 4. Classification
        output = self.classifier(video_representation)
        return output
```
Vision Transformer (ViT)
1. The Basic ViT Architecture
```python
class VisionTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3,
                 embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0,
                 num_classes=1000):
        super().__init__()
        # Patch embedding
        self.patch_embed = PatchEmbedding(img_size, patch_size, in_channels, embed_dim)
        num_patches = (img_size // patch_size) ** 2
        # Positional embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        # CLS token
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Transformer encoder
        self.transformer = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, mlp_ratio)
            for _ in range(depth)])
        # Classification head
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        # 1. Patch embedding
        x = self.patch_embed(x)  # [B, num_patches, embed_dim]
        # 2. Prepend the CLS token
        cls_tokens = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)
        # 3. Add the positional embedding
        x = x + self.pos_embed
        # 4. Transformer encoder
        for block in self.transformer:
            x = block(x)
        # 5. Take the CLS token for classification
        x = self.norm(x)
        cls_token_final = x[:, 0]
        # 6. Classify
        output = self.head(cls_token_final)
        return output
```
2. Patch Embedding Implementation
```python
class PatchEmbedding(nn.Module):
    def __init__(self, img_size, patch_size, in_channels, embed_dim):
        super().__init__()
        self.img_size = img_size
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution implements the patch embedding
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: [B, C, H, W]
        x = self.proj(x)       # [B, embed_dim, H/P, W/P]
        x = x.flatten(2)       # [B, embed_dim, num_patches]
        x = x.transpose(1, 2)  # [B, num_patches, embed_dim]
        return x
```
Improvements to Vision Transformers
1. Swin Transformer
```python
class SwinTransformer(nn.Module):
    """A Vision Transformer with a hierarchical structure and shifted windows."""
    def __init__(self, img_size, patch_size, embed_dim, depths, num_heads):
        super().__init__()
        # Patch partition
        self.patch_partition = PatchPartition(patch_size)
        # Multiple stages
        self.stages = nn.ModuleList()
        for i, (depth, num_head) in enumerate(zip(depths, num_heads)):
            stage = nn.ModuleList([
                SwinTransformerBlock(
                    dim=embed_dim * (2 ** i),
                    num_heads=num_head,
                    window_size=7,
                    shift_size=0 if j % 2 == 0 else 3)
                for j in range(depth)])
            self.stages.append(stage)
            # Patch merging (omitted after the last stage)
            if i < len(depths) - 1:
                self.stages.append(PatchMerging(embed_dim * (2 ** i)))

    def forward(self, x):
        x = self.patch_partition(x)
        for stage in self.stages:
            if isinstance(stage, nn.ModuleList):
                for block in stage:
                    x = block(x)
            else:
                x = stage(x)  # patch merging
        return x
```
2. DETR (Detection Transformer)
```python
class DETR(nn.Module):
    """A Transformer for object detection."""
    def __init__(self, backbone, transformer, num_classes, hidden_dim, num_queries=100):
        super().__init__()
        self.backbone = backbone
        self.conv = nn.Conv2d(backbone.num_channels, hidden_dim, 1)
        self.transformer = transformer
        # Object queries
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        # Prediction heads
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)  # +1 for "no object"
        self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)
        # (positional_encoding and flatten helpers are assumed to be defined elsewhere)

    def forward(self, images):
        # 1. CNN backbone features
        features = self.backbone(images)
        # 2. Project to the transformer dimension
        h = self.conv(features)
        # 3. Positional encoding
        pos_embed = self.positional_encoding(h)
        # 4. Transformer
        hs = self.transformer(self.flatten(h),
                              query_embed=self.query_embed.weight,
                              pos_embed=self.flatten(pos_embed))
        # 5. Predict classes and bounding boxes
        outputs_class = self.class_embed(hs)
        outputs_coord = self.bbox_embed(hs).sigmoid()
        return {'pred_logits': outputs_class, 'pred_boxes': outputs_coord}
```
Applications in Large Language Models {#語言模型}
The Evolution of Language Models
1. From N-grams to Neural Language Models
- N-gram models: statistics-based methods
- Word embeddings: Word2Vec, GloVe
- RNN language models: handle variable-length sequences
- Transformer language models: parallelism and long-range dependencies
The GPT Family (Generative Pre-Training)
1. The GPT Architecture
```python
class GPT(nn.Module):
    def __init__(self, vocab_size, n_layer, n_head, n_embd, block_size):
        super().__init__()
        # Token and position embeddings
        self.token_embedding = nn.Embedding(vocab_size, n_embd)
        self.position_embedding = nn.Embedding(block_size, n_embd)
        # Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(n_embd, n_head) for _ in range(n_layer)])
        # Final layer norm and output projection
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        self.block_size = block_size

    def forward(self, idx, targets=None):
        B, T = idx.shape
        # Token and position embeddings
        tok_emb = self.token_embedding(idx)
        pos_emb = self.position_embedding(torch.arange(T, device=idx.device))
        x = tok_emb + pos_emb
        # Transformer blocks
        for block in self.blocks:
            x = block(x)
        # Final projection
        x = self.ln_f(x)
        logits = self.lm_head(x)
        # Loss (if targets are given)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """Autoregressive generation."""
        for _ in range(max_new_tokens):
            # Crop the context to the block size
            idx_cond = idx if idx.size(1) <= self.block_size else idx[:, -self.block_size:]
            # Forward pass
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature
            # Optional top-k sampling
            if top_k is not None:
                v, _ = torch.topk(logits, top_k)
                logits[logits < v[:, [-1]]] = -float('Inf')
            # Softmax and sample
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            # Append the new token
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
```
2. GPT Training Techniques
```python
class GPTTrainer:
    def __init__(self, model, train_dataset, config):
        self.model = model
        self.train_dataset = train_dataset
        self.config = config
        # Optimizer
        self.optimizer = self.configure_optimizers()
        # Learning-rate schedule
        self.scheduler = CosineAnnealingLR(self.optimizer, T_max=config.max_iters)

    def configure_optimizers(self):
        """Configure AdamW with selective weight decay."""
        decay = set()
        no_decay = set()
        for name, param in self.model.named_parameters():
            if 'bias' in name or 'ln' in name or 'embedding' in name:
                no_decay.add(name)
            else:
                decay.add(name)
        param_groups = [
            {'params': [p for n, p in self.model.named_parameters() if n in decay],
             'weight_decay': self.config.weight_decay},
            {'params': [p for n, p in self.model.named_parameters() if n in no_decay],
             'weight_decay': 0.0}]
        optimizer = torch.optim.AdamW(param_groups,
                                      lr=self.config.learning_rate,
                                      betas=(0.9, 0.95))
        return optimizer
```
The BERT Family (Bidirectional Encoders)
1. BERT Pre-Training
```python
class BERT(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers, num_heads, max_len):
        super().__init__()
        self.vocab_size = vocab_size
        # Embedding layers
        self.token_embedding = nn.Embedding(vocab_size, hidden_size)
        self.position_embedding = nn.Embedding(max_len, hidden_size)
        self.segment_embedding = nn.Embedding(2, hidden_size)
        # Transformer encoder
        self.encoder = nn.ModuleList([
            TransformerEncoderLayer(hidden_size, num_heads)
            for _ in range(num_layers)])
        # Pre-training task heads
        self.mlm_head = nn.Linear(hidden_size, vocab_size)  # masked language modeling
        self.nsp_head = nn.Linear(hidden_size, 2)            # next-sentence prediction

    def forward(self, input_ids, segment_ids, attention_mask, mlm_labels=None, nsp_labels=None):
        # Embeddings
        seq_len = input_ids.size(1)
        pos_ids = torch.arange(seq_len, device=input_ids.device)
        embeddings = (self.token_embedding(input_ids) +
                      self.position_embedding(pos_ids) +
                      self.segment_embedding(segment_ids))
        # Encoding
        hidden_states = embeddings
        for encoder_layer in self.encoder:
            hidden_states = encoder_layer(hidden_states, attention_mask)
        # MLM predictions
        mlm_logits = self.mlm_head(hidden_states)
        # NSP prediction (from the CLS token)
        nsp_logits = self.nsp_head(hidden_states[:, 0])
        # Losses
        total_loss = 0
        if mlm_labels is not None:
            mlm_loss = F.cross_entropy(mlm_logits.view(-1, self.vocab_size),
                                       mlm_labels.view(-1), ignore_index=-100)
            total_loss += mlm_loss
        if nsp_labels is not None:
            nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
            total_loss += nsp_loss
        return {'loss': total_loss, 'mlm_logits': mlm_logits, 'nsp_logits': nsp_logits}
```
2. BERT Fine-Tuning
```python
class BERTForSequenceClassification(nn.Module):
    def __init__(self, bert_model, num_classes, dropout=0.1):
        super().__init__()
        self.bert = bert_model
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(bert_model.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask, labels=None):
        # BERT encoding
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        # Use the CLS token representation
        pooled_output = outputs.last_hidden_state[:, 0]
        pooled_output = self.dropout(pooled_output)
        # Classification
        logits = self.classifier(pooled_output)
        # Loss
        loss = None
        if labels is not None:
            loss = F.cross_entropy(logits, labels)
        return {'loss': loss, 'logits': logits}
```
T5 (Text-to-Text Transfer Transformer)
```python
class T5Model(nn.Module):
    """A unified text-to-text framework."""
    def __init__(self, config):
        super().__init__()
        # Shared embedding
        self.shared = nn.Embedding(config.vocab_size, config.d_model)
        # Encoder
        self.encoder = T5Stack(config, embed_tokens=self.shared)
        # Decoder
        self.decoder = T5Stack(config, embed_tokens=self.shared, is_decoder=True)
        # Language-model head
        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)

    def forward(self, input_ids, decoder_input_ids, labels=None):
        # Encode
        encoder_outputs = self.encoder(input_ids)
        # Decode
        decoder_outputs = self.decoder(decoder_input_ids,
                                       encoder_hidden_states=encoder_outputs.last_hidden_state)
        # Predict
        lm_logits = self.lm_head(decoder_outputs.last_hidden_state)
        # Loss
        loss = None
        if labels is not None:
            loss = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                                   labels.view(-1), ignore_index=-100)
        return {'loss': loss, 'logits': lm_logits}
```
Key Techniques in Modern LLMs
1. Efficient Attention Mechanisms
```python
class FlashAttention(nn.Module):
    """Flash-style attention: exact attention computed block by block to reduce peak memory.
    This is a simplified educational sketch of the tiled computation with an online softmax."""
    def forward(self, q, k, v, causal=False, block_size=64):
        B, H, N, D = q.shape
        scale = D ** -0.5
        O = torch.zeros_like(q)
        for i in range(0, N, block_size):
            q_blk = q[:, :, i:i + block_size] * scale
            rows = q_blk.shape[2]
            # Running statistics for the online softmax
            m = torch.full((B, H, rows, 1), float('-inf'), device=q.device)
            l = torch.zeros((B, H, rows, 1), device=q.device)
            acc = torch.zeros_like(q_blk)
            for j in range(0, N, block_size):
                # With a causal mask, key blocks after the query block contribute nothing
                if causal and j > i:
                    break
                k_blk = k[:, :, j:j + block_size]
                v_blk = v[:, :, j:j + block_size]
                scores = q_blk @ k_blk.transpose(-2, -1)
                if causal:
                    q_idx = torch.arange(i, i + rows, device=q.device)[:, None]
                    k_idx = torch.arange(j, j + k_blk.shape[2], device=q.device)[None, :]
                    scores = scores.masked_fill(k_idx > q_idx, float('-inf'))
                # Online softmax update: rescale previous partial sums by exp(m_old - m_new)
                m_new = torch.maximum(m, scores.max(dim=-1, keepdim=True).values)
                p = torch.exp(scores - m_new)
                correction = torch.exp(m - m_new)
                l = l * correction + p.sum(dim=-1, keepdim=True)
                acc = acc * correction + p @ v_blk
                m = m_new
            O[:, :, i:i + block_size] = acc / l
        return O
```
2. Parameter-Efficient Fine-Tuning (PEFT)
```python
class LoRALayer(nn.Module):
    """Low-Rank Adaptation for parameter-efficient fine-tuning."""
    def __init__(self, in_features, out_features, rank=16, alpha=32):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        # Frozen pre-trained weight
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.weight.requires_grad = False
        # LoRA parameters
        self.lora_A = nn.Parameter(torch.randn(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        # Initialization
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

    def forward(self, x):
        # Original forward pass
        result = F.linear(x, self.weight)
        # Add the low-rank update
        result += (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return result
```
3. Long-Context Handling
```python
class LongContextTransformer(nn.Module):
    """Techniques for handling very long contexts."""
    def __init__(self, config):
        super().__init__()
        # RoPE positional encoding
        self.rotary_embedding = RotaryEmbedding(config.hidden_size)
        # Sliding-window size
        self.window_size = config.window_size
        # Sparse attention pattern
        self.attention_pattern = self.create_sparse_pattern(config.max_length)

    def create_sparse_pattern(self, seq_len):
        """Build a sparse attention mask: local windows plus periodic global tokens."""
        pattern = torch.zeros(seq_len, seq_len)
        # Local window around each position
        for i in range(seq_len):
            start = max(0, i - self.window_size // 2)
            end = min(seq_len, i + self.window_size // 2)
            pattern[i, start:end] = 1
        # Global tokens at a fixed stride
        stride = seq_len // 8
        pattern[::stride, :] = 1
        pattern[:, ::stride] = 1
        return pattern.bool()
```
Practice and Optimization {#實戰}
Best Practices for Model Training
1. Mixed-Precision Training
```python
from torch.cuda.amp import autocast, GradScaler

class MixedPrecisionTrainer:
    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer
        self.scaler = GradScaler()

    def train_step(self, batch):
        self.optimizer.zero_grad()
        # Automatic mixed precision
        with autocast():
            outputs = self.model(**batch)
            loss = outputs['loss']
        # Scaled backward pass
        self.scaler.scale(loss).backward()
        # Gradient clipping
        self.scaler.unscale_(self.optimizer)
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
        # Optimizer step
        self.scaler.step(self.optimizer)
        self.scaler.update()
        return loss.item()
```
2. Distributed Training
```python
class DistributedTrainer:
    def __init__(self, model, train_dataset, rank, world_size):
        self.train_dataset = train_dataset
        # Initialize the process group
        dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)
        # Wrap the model for distributed data-parallel training
        self.model = nn.parallel.DistributedDataParallel(
            model.cuda(rank),
            device_ids=[rank],
            output_device=rank,
            find_unused_parameters=False)
        # Shard the data across processes
        self.train_sampler = DistributedSampler(
            train_dataset, num_replicas=world_size, rank=rank)

    def train_epoch(self, epoch):
        self.train_sampler.set_epoch(epoch)  # reshuffle each epoch
        for batch in DataLoader(self.train_dataset, sampler=self.train_sampler):
            loss = self.train_step(batch)
        # Synchronize all processes
        dist.barrier()
```
Inference Optimization
1. Model Quantization
```python
class QuantizedModel:
    @staticmethod
    def quantize_model(model, calibration_data):
        """Post-training INT8 quantization."""
        model.eval()
        # Prepare for quantization
        model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
        torch.quantization.prepare(model, inplace=True)
        # Calibration
        with torch.no_grad():
            for batch in calibration_data:
                model(batch)
        # Convert to a quantized model
        torch.quantization.convert(model, inplace=True)
        return model
```
2. KV Cache Optimization
```python
class OptimizedDecoder:
    def __init__(self, model, max_cache_size=1024):
        self.model = model
        self.kv_cache = {}
        self.max_cache_size = max_cache_size

    def generate_with_cache(self, input_ids, max_length):
        # Helper methods (compute_kv, compute_all_kv, attention_with_cache,
        # sample, evict_cache) are assumed to be defined on this class.
        outputs = []
        for i in range(max_length):
            if i > 0:
                # Only compute keys/values for the newly generated token
                new_token_id = input_ids[:, -1:]
                key, value = self.compute_kv(new_token_id, i)
                self.kv_cache[i] = (key, value)
            else:
                # Initial pass: compute keys/values for the whole prompt
                keys, values = self.compute_all_kv(input_ids)
                self.kv_cache = {j: (keys[:, :, j], values[:, :, j])
                                 for j in range(input_ids.size(1))}
            # Attention over the cached keys/values
            output = self.attention_with_cache(input_ids[:, -1:])
            outputs.append(output)
            # Sample the next token
            next_token = self.sample(output)
            input_ids = torch.cat([input_ids, next_token], dim=1)
            # Cache management
            if len(self.kv_cache) > self.max_cache_size:
                self.evict_cache()
        return torch.cat(outputs, dim=1)
```
Evaluation Metrics
1. Language Model Evaluation
```python
def calculate_perplexity(model, eval_dataloader):
    """Compute perplexity on an evaluation set."""
    model.eval()
    total_loss = 0
    total_tokens = 0
    with torch.no_grad():
        for batch in eval_dataloader:
            outputs = model(**batch)
            loss = outputs['loss']
            total_loss += loss.item() * batch['labels'].numel()
            total_tokens += batch['labels'].numel()
    avg_loss = total_loss / total_tokens
    perplexity = math.exp(avg_loss)
    return perplexity

def calculate_bleu(predictions, references):
    """Compute BLEU scores."""
    from nltk.translate.bleu_score import corpus_bleu
    # Tokenize
    pred_tokens = [pred.split() for pred in predictions]
    ref_tokens = [[ref.split()] for ref in references]
    # BLEU at different n-gram orders
    bleu_1 = corpus_bleu(ref_tokens, pred_tokens, weights=(1, 0, 0, 0))
    bleu_2 = corpus_bleu(ref_tokens, pred_tokens, weights=(0.5, 0.5, 0, 0))
    bleu_4 = corpus_bleu(ref_tokens, pred_tokens, weights=(0.25, 0.25, 0.25, 0.25))
    return {'bleu_1': bleu_1, 'bleu_2': bleu_2, 'bleu_4': bleu_4}
```
2. Time Series Evaluation
```python
def time_series_metrics(predictions, targets):
    """Evaluation metrics for time series forecasting."""
    # MAE
    mae = torch.mean(torch.abs(predictions - targets))
    # MSE
    mse = torch.mean((predictions - targets) ** 2)
    # RMSE
    rmse = torch.sqrt(mse)
    # MAPE
    mape = torch.mean(torch.abs((targets - predictions) / targets)) * 100
    # sMAPE
    smape = 200 * torch.mean(torch.abs(predictions - targets) /
                             (torch.abs(predictions) + torch.abs(targets)))
    return {'mae': mae.item(), 'mse': mse.item(), 'rmse': rmse.item(),
            'mape': mape.item(), 'smape': smape.item()}
```
Frontier Developments {#前沿}
Latest Research Directions
1. Mamba: State Space Models
```python
class MambaBlock(nn.Module):
    """Sequence modeling with linear complexity."""
    def __init__(self, d_model, d_state=16, d_conv=4, expand=2):
        super().__init__()
        self.d_model = d_model
        self.d_state = d_state
        self.d_conv = d_conv
        self.expand = expand
        d_inner = int(self.expand * self.d_model)
        # Input projection
        self.in_proj = nn.Linear(d_model, d_inner * 2)
        # Depthwise convolution
        self.conv1d = nn.Conv1d(d_inner, d_inner,
                                kernel_size=d_conv,
                                groups=d_inner,
                                padding=d_conv - 1)
        # SSM parameters
        self.x_proj = nn.Linear(d_inner, d_state + d_state + 1)
        self.dt_proj = nn.Linear(d_state, d_inner)
        # Output projection
        self.out_proj = nn.Linear(d_inner, d_model)

    def forward(self, x):
        """Selective state-space model (simplified)."""
        # The real Mamba block performs a selective scan over the state space;
        # those details are omitted here and delegated to a placeholder.
        return self.ssm(x)
```
2. RWKV: A Linear Transformer
```python
class RWKV(nn.Module):
    """Receptance Weighted Key Value: an RNN-style Transformer with linear complexity."""
    def __init__(self, n_embd, n_layer):
        super().__init__()
        self.blocks = nn.ModuleList([RWKVBlock(n_embd) for _ in range(n_layer)])

    def forward(self, x, state=None):
        for block in self.blocks:
            x, state = block(x, state)
        return x, state
```
3. Mixture of Experts (MoE)
```python
class MoELayer(nn.Module):
    """A sparsely activated mixture-of-experts layer."""
    def __init__(self, d_model, n_experts, n_experts_per_token=2):
        super().__init__()
        self.experts = nn.ModuleList([FeedForward(d_model) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts)
        self.n_experts_per_token = n_experts_per_token

    def forward(self, x):
        # Routing probabilities
        gate_logits = self.gate(x)
        # Top-k routing
        weights, selected_experts = torch.topk(gate_logits, self.n_experts_per_token)
        weights = F.softmax(weights, dim=-1)
        # Sparse computation
        results = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            # Run the expert only on tokens that selected it
            expert_mask = (selected_experts == i).any(dim=-1)
            if expert_mask.any():
                expert_input = x[expert_mask]
                expert_output = expert(expert_input)
                # Weighted combination of expert outputs
                expert_weight = weights[expert_mask][selected_experts[expert_mask] == i]
                results[expert_mask] += expert_output * expert_weight.unsqueeze(-1)
        return results
```
Multimodal Models
1. CLIP-Style Vision-Language Models
```python
class VisionLanguageModel(nn.Module):
    """A multimodal model trained with a contrastive objective."""
    def __init__(self, vision_encoder, text_encoder, projection_dim=512):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.text_encoder = text_encoder
        # Projection heads
        self.vision_projection = nn.Linear(vision_encoder.output_dim, projection_dim)
        self.text_projection = nn.Linear(text_encoder.output_dim, projection_dim)
        # Temperature parameter
        self.temperature = nn.Parameter(torch.ones(1) * 0.07)

    def forward(self, images, texts):
        # Encode each modality
        image_features = self.vision_encoder(images)
        text_features = self.text_encoder(texts)
        # Project and normalize
        image_embeds = F.normalize(self.vision_projection(image_features), dim=-1)
        text_embeds = F.normalize(self.text_projection(text_features), dim=-1)
        # Similarities
        logits_per_image = image_embeds @ text_embeds.T / self.temperature
        logits_per_text = text_embeds @ image_embeds.T / self.temperature
        return logits_per_image, logits_per_text
```
2. A Unified Multimodal Transformer
```python
class UnifiedMultiModalTransformer(nn.Module):
    """A unified architecture that handles multiple modalities."""
    def __init__(self, config):
        super().__init__()
        # Modality-specific encoders
        self.text_embedder = TextEmbedder(config)
        self.image_embedder = ImageEmbedder(config)
        self.audio_embedder = AudioEmbedder(config)
        # Shared Transformer
        self.transformer = nn.ModuleList([
            TransformerBlock(config) for _ in range(config.n_layers)])
        # Modality-specific decoders
        self.text_head = TextGenerationHead(config)
        self.image_head = ImageGenerationHead(config)

    def forward(self, inputs, modality_mask):
        # Multimodal embeddings
        embeddings = []
        if 'text' in inputs:
            embeddings.append(self.text_embedder(inputs['text']))
        if 'image' in inputs:
            embeddings.append(self.image_embedder(inputs['image']))
        if 'audio' in inputs:
            embeddings.append(self.audio_embedder(inputs['audio']))
        # Concatenate all modalities along the sequence dimension
        x = torch.cat(embeddings, dim=1)
        # Shared Transformer
        for block in self.transformer:
            x = block(x, modality_mask)
        return x
```
Practical Tools and Frameworks
1. Hugging Face Transformers
```python
from transformers import AutoModel, AutoTokenizer

# Load a pre-trained model
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Use it
inputs = tokenizer("Hello world!", return_tensors="pt")
outputs = model(**inputs)
```
2. A Custom Training Loop
```python
class CustomTrainer:
    def __init__(self, model, train_dataloader, eval_dataloader, config):
        self.model = model
        self.train_dataloader = train_dataloader
        self.eval_dataloader = eval_dataloader
        self.config = config
        # Optimizer and scheduler
        self.optimizer = AdamW(model.parameters(), lr=config.learning_rate)
        self.scheduler = get_linear_schedule_with_warmup(
            self.optimizer,
            num_warmup_steps=config.warmup_steps,
            num_training_steps=config.total_steps)
        # Mixed precision
        self.scaler = GradScaler() if config.fp16 else None
        # Logging
        self.writer = SummaryWriter(config.log_dir)

    def train(self):
        global_step = 0
        for epoch in range(self.config.num_epochs):
            # Training
            self.model.train()
            for batch in tqdm(self.train_dataloader, desc=f"Epoch {epoch}"):
                loss = self.training_step(batch)
                # Logging
                if global_step % self.config.log_interval == 0:
                    self.writer.add_scalar('train/loss', loss, global_step)
                # Evaluation
                if global_step % self.config.eval_interval == 0:
                    eval_metrics = self.evaluate()
                    for key, value in eval_metrics.items():
                        self.writer.add_scalar(f'eval/{key}', value, global_step)
                # Checkpointing
                if global_step % self.config.save_interval == 0:
                    self.save_checkpoint(global_step)
                global_step += 1

    def training_step(self, batch):
        self.optimizer.zero_grad()
        if self.scaler:
            with autocast():
                outputs = self.model(**batch)
                loss = outputs['loss']
            self.scaler.scale(loss).backward()
            self.scaler.unscale_(self.optimizer)
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
            self.scaler.step(self.optimizer)
            self.scaler.update()
        else:
            outputs = self.model(**batch)
            loss = outputs['loss']
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
            self.optimizer.step()
        self.scheduler.step()
        return loss.item()
```
Highlights of This Guide
1. A Step-by-Step Structure
- Starts from the basic concepts of sequence models
- Gradually goes deeper into the core principles of each architecture
- Finally extends to frontier techniques and practical applications
2. Theory Combined with Practice
- Each concept comes with detailed formulas and explanations of the underlying principles
- Numerous runnable Python/PyTorch code examples are provided
- Best practices from real projects are included
3. Full Coverage of Three Application Domains
- Time series forecasting: from classical methods to the latest Autoformer
- Computer vision: from CNN+RNN to the Vision Transformer
- Large language models: from GPT to BERT and on to modern LLM techniques
4. Strong Practical Focus
- Hands-on tips for model training, optimization, and deployment
- Evaluation metrics and debugging methods
- An introduction to mainstream frameworks and tools
5. Frontier Content
- Covers the newest architectures such as Mamba and RWKV
- Discusses hot directions such as MoE and multimodality
- Includes optimization techniques such as Flash Attention and LoRA
Summary
This guide has walked through deep learning sequence models from RNNs to Transformers, covering:
- Fundamentals: the core challenges of sequence modeling and their solutions
- Model architectures: detailed implementations of RNNs, LSTMs, and Transformers
- Application domains: time series forecasting, computer vision, and natural language processing
- Practical skills: training optimization, inference acceleration, and evaluation methods
- Frontier developments: the latest research directions and technology trends
Learning Suggestions
Beginner path:
- Master the basic concepts of RNNs and backpropagation
- Understand the LSTM gating mechanism
- Study the Transformer architecture in depth
- Practice on simple sequence tasks
Advanced path:
- Study the many variants of the attention mechanism
- Explore large-scale pre-training techniques
- Learn distributed training and optimization
- Follow the latest papers and implementations
Suggested hands-on projects:
- Implement a simple language model
- Build a time series forecasting system
- Develop an image captioning application
- Fine-tune a pre-trained model to solve a real problem
Recommended resources:
- Papers: Attention Is All You Need, BERT, the GPT series
- Courses: Stanford CS224N, Fast.ai
- Frameworks: PyTorch, TensorFlow, JAX
- Communities: Hugging Face, Papers with Code