NeuralForecast TokenEmbedding: 1D Convolution (Conv1d) and Matrix Multiplication

flyfish

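TokenEmbedding uses a one-dimensional convolution (Conv1d).

TokenEmbedding source code analysis

A usage example has been added on top of the source code.
The code is analyzed below.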

import torch
import torch.nn as nn


class TokenEmbedding(nn.Module):
    def __init__(self, c_in, hidden_size):
        super(TokenEmbedding, self).__init__()
        padding = 1 if torch.__version__ >= "1.5.0" else 2
        self.tokenConv = nn.Conv1d(
            in_channels=c_in,
            out_channels=hidden_size,
            kernel_size=3,
            padding=padding,
            padding_mode="circular",
            bias=False,
        )
        for m in self.modules():
            if isinstance(m, nn.Conv1d):
                nn.init.kaiming_normal_(
                    m.weight, mode="fan_in", nonlinearity="leaky_relu"
                )

    def forward(self, x):
        x = self.tokenConv(x.permute(0, 2, 1)).transpose(1, 2)
        return x


# Create a TokenEmbedding instance
c_in = 10  # number of input channels
hidden_size = 20  # number of output channels
token_embedding = TokenEmbedding(c_in, hidden_size)

# Create input data of shape (batch_size, sequence_length, input_features)
batch_size = 32
sequence_length = 100
input_features = 10
x = torch.randn(batch_size, sequence_length, input_features)

# Forward pass
output = token_embedding(x)

# Print the shape of the output
print("Output shape:", output.shape)
# Output shape: torch.Size([32, 100, 20])

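The TokenEmbedding class inherits from nn.Module. The call super(TokenEmbedding, self).__init__() runs nn.Module's own __init__() method, so that every TokenEmbedding instance is also initialized as a proper nn.Module when it is created.

The __init__ method:
During initialization, a 1D convolution layer self.tokenConv is defined. It has c_in input channels, hidden_size output channels, a kernel size of 3, padding_mode="circular", and bias=False. The padding is 1 when the PyTorch version is at least 1.5.0 and 2 otherwise. With a kernel size of 3 and a padding of 1, the output sequence has the same length as the input. The constructor then loops over all modules of the model and applies Kaiming initialization to the weights of every module of type nn.Conv1d.

The forward method:
The input x, of shape (batch_size, sequence_length, features), is permuted to (batch_size, features, sequence_length) because nn.Conv1d expects the channel dimension in second position. self.tokenConv then applies the 1D convolution along the time axis, and the result is transposed back to (batch_size, sequence_length, hidden_size) and returned.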
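To make the shape handling concrete, here is a minimal sketch that traces the tensor shapes through the same three steps. The sizes and the standalone conv layer are illustrative assumptions, not part of the NeuralForecast source.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions for this sketch)
batch_size, seq_len, c_in, hidden_size = 4, 16, 7, 32

conv = nn.Conv1d(c_in, hidden_size, kernel_size=3, padding=1,
                 padding_mode="circular", bias=False)

x = torch.randn(batch_size, seq_len, c_in)  # (batch, time, features)
x_cf = x.permute(0, 2, 1)                   # (batch, features, time): channels second
y_cf = conv(x_cf)                           # (batch, hidden_size, time): length kept by padding=1
y = y_cf.transpose(1, 2)                    # (batch, time, hidden_size)

print(x_cf.shape, y_cf.shape, y.shape)
```

Every time step keeps its position; only the last dimension changes, from c_in features to hidden_size embedding channels.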

Comparing different padding_mode values

import torch
import torch.nn as nn

# Define the input sequence
input_seq = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float32).view(1, 1, -1)

# Define the convolution layers
conv_zero_padding = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3, padding=1, padding_mode='zeros', bias=False)
conv_circular_padding = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3, padding=1, padding_mode='circular', bias=False)

# Manually set the kernels to a simple averaging operation
with torch.no_grad():
    conv_zero_padding.weight = nn.Parameter(torch.ones_like(conv_zero_padding.weight) / 3)
    conv_circular_padding.weight = nn.Parameter(torch.ones_like(conv_circular_padding.weight) / 3)

# Run the convolutions
output_zero_padding = conv_zero_padding(input_seq)
output_circular_padding = conv_circular_padding(input_seq)

print("Input sequence:", input_seq)
print("Zero padding output:", output_zero_padding)
print("Circular padding output:", output_circular_padding)

Input sequence: tensor([[[1., 2., 3., 4., 5.]]])
Zero padding output: tensor([[[1., 2., 3., 4., 3.]]], grad_fn=<ConvolutionBackward0>)
Circular padding output: tensor([[[2.6667, 2.0000, 3.0000, 4.0000, 3.3333]]],grad_fn=<ConvolutionBackward0>)
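The two results differ only at the first and last positions. As a rough sketch of why, the padding can be applied by hand with F.pad and the convolution run without any built-in padding; the variable names here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1., 2., 3., 4., 5.]).view(1, 1, -1)
kernel = torch.full((1, 1, 3), 1.0 / 3.0)  # averaging kernel

# Pad by hand, then convolve with no built-in padding
x_zero = F.pad(x, (1, 1), mode="constant", value=0.0)  # [0, 1, 2, 3, 4, 5, 0]
x_circ = F.pad(x, (1, 1), mode="circular")             # [5, 1, 2, 3, 4, 5, 1]

out_zero = F.conv1d(x_zero, kernel)
out_circ = F.conv1d(x_circ, kernel)

print(x_circ)
print(out_zero)
print(out_circ)
```

With circular padding the first window is [5, 1, 2] and the last is [4, 5, 1], so the boundary outputs borrow values from the opposite end of the sequence instead of zeros.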

Embedding layers: how nn.Conv1d and nn.Embedding process their input differently

TokenEmbedding using nn.Conv1d

import torch
import torch.nn as nn


class TokenEmbedding(nn.Module):
    def __init__(self, c_in, hidden_size):
        super(TokenEmbedding, self).__init__()
        self.tokenConv = nn.Conv1d(
            in_channels=c_in,
            out_channels=hidden_size,
            kernel_size=3,
            padding=1,
            padding_mode="circular",
            bias=False,
        )

    def forward(self, x):
        x = self.tokenConv(x.permute(0, 2, 1)).transpose(1, 2)
        return x


# Example input: a continuous-valued time series
batch_size = 2
sequence_length = 10
feature_dim = 3

time_series = torch.randn(batch_size, sequence_length, feature_dim)
embedding = TokenEmbedding(c_in=feature_dim, hidden_size=8)
embedded_time_series = embedding(time_series)
print(embedded_time_series.shape)  # output shape: [2, 10, 8]

Using nn.Embedding

import torch
import torch.nn as nn


class SimpleEmbedding(nn.Module):
    def __init__(self, num_embeddings, embedding_dim):
        super(SimpleEmbedding, self).__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)

    def forward(self, x):
        return self.embedding(x)


# Example input: assume we have sequences of discrete indices
batch_size = 2
sequence_length = 10
vocab_size = 20  # assume there are 20 distinct categories
embedding_dim = 8

indices = torch.randint(0, vocab_size, (batch_size, sequence_length))
embedding = SimpleEmbedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
embedded_indices = embedding(indices)
print(embedded_indices.shape)  # output shape: [2, 10, 8]
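Both layers return the same output shape, but they accept different inputs: nn.Conv1d consumes continuous-valued features and mixes neighbouring time steps, while nn.Embedding is a lookup table indexed by integer ids. As a rough sketch (the sizes and the seed are arbitrary assumptions), the lookup can be reproduced by multiplying one-hot vectors with the embedding table:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embedding_dim = 20, 8
emb = nn.Embedding(vocab_size, embedding_dim)

indices = torch.randint(0, vocab_size, (2, 10))  # integer ids, not floats
lookup = emb(indices)                            # table lookup, shape (2, 10, 8)

# Same result: one-hot rows select rows of the embedding table
one_hot = F.one_hot(indices, num_classes=vocab_size).float()
via_matmul = one_hot @ emb.weight

print(torch.allclose(lookup, via_matmul))
```

Each position is embedded independently of its neighbours here, whereas the Conv1d version with kernel_size=3 sees a window of three time steps.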

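Conv1d (1D convolution) and matrix multiplication

Conv1d is closely related to matrix multiplication: a 1D convolution can be written as a matrix product.
To do so, the sliding windows of the input vector are arranged into a structured matrix (commonly referred to as a Toeplitz or convolution matrix), which is then multiplied by the kernel. This view helps to understand what the convolution actually does and how it relates to linear algebra.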

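1. The Conv1d operation

Suppose we have an input vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]$ and a convolution kernel (filter) $\mathbf{w} = [w_1, w_2, \ldots, w_k]$. Without padding and with stride 1, the 1D convolution is defined as:

$y_i = \sum_{j=1}^{k} x_{i+j-1} \cdot w_j, \quad i = 1, \ldots, n-k+1$

For each output position $i$, the kernel $\mathbf{w}$ is dotted with the window $[x_i, \ldots, x_{i+k-1}]$ of the input vector $\mathbf{x}$. The kernel is not flipped, so strictly speaking this is a cross-correlation, which is what nn.Conv1d computes.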
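The definition can be checked with a direct loop. This is a minimal sketch under the no-padding, stride-1 assumption; conv1d_naive is a hypothetical helper written for illustration, and the small x and w are example values.

```python
import torch
import torch.nn.functional as F


def conv1d_naive(x, w):
    """y_i = sum_j x_{i+j-1} * w_j (1-based in the text, 0-based here)."""
    n, k = x.numel(), w.numel()
    y = torch.empty(n - k + 1)
    for i in range(n - k + 1):
        y[i] = torch.dot(x[i:i + k], w)  # dot product of one window with the kernel
    return y


x = torch.tensor([1., 2., 3., 4., 5.])
w = torch.tensor([1., 0., -1.])

y_naive = conv1d_naive(x, w)
y_torch = F.conv1d(x.view(1, 1, -1), w.view(1, 1, -1)).flatten()
print(y_naive, y_torch)
```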

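2. Matrix multiplication form

A 1D convolution can be carried out by turning the input vector into a specific matrix and then performing a matrix multiplication. This matrix is often called a "Toeplitz matrix" or "convolution matrix". (Strictly, a matrix whose rows are sliding windows has constant anti-diagonals, which is Hankel structure; the Toeplitz form arises when the matrix is built from the kernel instead and multiplied by the input.) For an input vector $\mathbf{x}$ and a kernel $\mathbf{w}$, we build the matrix whose $i$-th row is the $i$-th window of the input:
$\mathbf{X} = \begin{bmatrix} x_1 & x_2 & x_3 & \ldots & x_k \\ x_2 & x_3 & x_4 & \ldots & x_{k+1} \\ x_3 & x_4 & x_5 & \ldots & x_{k+2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{n-k+1} & x_{n-k+2} & x_{n-k+3} & \ldots & x_n \end{bmatrix}$
The kernel $\mathbf{w}$ is then treated as a column vector:
$\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ w_3 \\ \vdots \\ w_k \end{bmatrix}$
The output of the 1D convolution can then be written as:
$\mathbf{y} = \mathbf{X} \cdot \mathbf{w}$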

3. Example

Suppose the input vector is $\mathbf{x} = [1, 2, 3, 4, 5]$ and the kernel is $\mathbf{w} = [1, 0, -1]$. With $n = 5$ and $k = 3$ there are $n - k + 1 = 3$ windows, so the matrix is:
$\mathbf{X} = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 3 & 4 \\ 3 & 4 & 5 \end{bmatrix}$
Then we perform the matrix multiplication:
$\mathbf{y} = \mathbf{X} \cdot \mathbf{w} = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 3 & 4 \\ 3 & 4 & 5 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix} = \begin{bmatrix} 1 \cdot 1 + 2 \cdot 0 + 3 \cdot (-1) \\ 2 \cdot 1 + 3 \cdot 0 + 4 \cdot (-1) \\ 3 \cdot 1 + 4 \cdot 0 + 5 \cdot (-1) \end{bmatrix} = \begin{bmatrix} -2 \\ -2 \\ -2 \end{bmatrix}$
This is the output of the 1D convolution.
The following code demonstrates that a 1D convolution (Conv1d) and the matrix multiplication give the same result.

import torch
import torch.nn as nn

# Input sequence
x = torch.tensor([[1, 2, 3, 4, 5]], dtype=torch.float32)  # shape: [1, 5]
# Convolution kernel
w = torch.tensor([[1, 0, -1]], dtype=torch.float32).unsqueeze(0)  # shape: [1, 1, 3]

# Using nn.Conv1d
conv1d = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3, padding=0, bias=False)
with torch.no_grad():
    conv1d.weight.copy_(w)  # overwrite the random initial weights with the kernel

x_unsqueezed = x.unsqueeze(0)  # shape: [1, 1, 5]
output_conv1d = conv1d(x_unsqueezed).squeeze(0)  # shape: [1, 3]
print("Conv1d output:", output_conv1d)

# Using matrix multiplication: each row of X is one window of the input
X = torch.tensor([
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, 5],
], dtype=torch.float32)

W = torch.tensor([1, 0, -1], dtype=torch.float32).view(-1, 1)

output_matmul = X @ W
print("Matrix multiplication output:", output_matmul.squeeze())

# Conv1d output: tensor([[-2., -2., -2.]], grad_fn=<SqueezeBackward1>)
# Matrix multiplication output: tensor([-2., -2., -2.])
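The same idea carries over to the multi-channel, circularly padded convolution used in TokenEmbedding. The following is a sketch under assumed sizes, not code from NeuralForecast: the windows are extracted with Tensor.unfold, flattened over (channel, kernel position), and multiplied by the reshaped weight.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, c_in, hidden, k = 2, 10, 3, 8, 3  # illustrative sizes

conv = nn.Conv1d(c_in, hidden, kernel_size=k, padding=1,
                 padding_mode="circular", bias=False)
x = torch.randn(batch, seq_len, c_in)

with torch.no_grad():
    # Reference: the Conv1d path used in TokenEmbedding.forward
    ref = conv(x.permute(0, 2, 1)).transpose(1, 2)              # (batch, seq_len, hidden)

    # Matrix-multiplication path
    x_pad = F.pad(x.permute(0, 2, 1), (1, 1), mode="circular")  # (batch, c_in, seq_len + 2)
    windows = x_pad.unfold(2, k, 1)                             # (batch, c_in, seq_len, k)
    X = windows.permute(0, 2, 1, 3).reshape(batch, seq_len, c_in * k)
    W = conv.weight.reshape(hidden, c_in * k)                   # one row per output channel
    out = X @ W.T                                               # (batch, seq_len, hidden)

print(torch.allclose(ref, out, atol=1e-5))
```

Seen this way, each embedded time step is a linear projection of the c_in * k values in its three-step window.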

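In the code, the following part initializes the weights of the convolution layer:

for m in self.modules():
    if isinstance(m, nn.Conv1d):
        nn.init.kaiming_normal_(m.weight, mode="fan_in", nonlinearity="leaky_relu")

This code uses Kaiming initialization (also known as He initialization) for the weights of the convolution layer. To understand why mode="fan_in" and nonlinearity="leaky_relu" are used, some background is needed.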

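1. Kaiming initialization (He initialization)

Kaiming initialization is a weight initialization method for neural networks. It is designed to counter the vanishing or exploding gradients that can occur when training deep networks. It chooses the initial values according to the size of the weight tensor, so that the output of each layer keeps a suitable variance.

2. mode="fan_in" and nonlinearity="leaky_relu"

mode="fan_in": one of the modes of Kaiming initialization. It scales the weights by the number of inputs (the number of input connections of each neuron), which keeps the variance of the signal from growing or shrinking during the forward pass.
nonlinearity="leaky_relu": the type of nonlinear activation function assumed by the initialization. Different activation functions require different variance adjustments. leaky_relu is a variant of the ReLU activation that mitigates the dying-neuron problem.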

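Detailed explanation

With Kaiming initialization, the standard deviation of the initial weights is adjusted according to the activation function. For ReLU in fan_in mode the formula is:

$\text{std} = \sqrt{\frac{2}{\text{fan\_in}}}$

where fan_in is the number of inputs of each neuron. For an nn.Conv1d layer, fan_in = in_channels × kernel_size.

Different activation functions need a different scaling factor (gain) to match their behavior. For ReLU, the factor 2 is not an empirical constant: ReLU zeroes roughly half of its inputs, which halves the variance of the signal, and the factor 2 compensates for that. For Leaky ReLU with negative slope $a$, the formula generalizes to $\text{std} = \sqrt{\frac{2}{(1 + a^2)\,\text{fan\_in}}}$. Because kaiming_normal_ uses a=0 by default, the call in this code yields the same standard deviation as the ReLU case.

Specifying mode="fan_in" and nonlinearity="leaky_relu" therefore sets the variance of the initial weights to suit a Leaky ReLU-style activation, which helps keep training stable and efficient.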
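As a quick sanity check of the formula, the empirical standard deviation of freshly initialized weights can be compared with the expected value. This is a sketch with assumed sizes; hidden_size is deliberately large here so that the estimate is stable.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
c_in, hidden_size, kernel_size = 10, 512, 3  # illustrative sizes

conv = nn.Conv1d(c_in, hidden_size, kernel_size, bias=False)
nn.init.kaiming_normal_(conv.weight, mode="fan_in", nonlinearity="leaky_relu")

fan_in = c_in * kernel_size                     # inputs feeding each output value
gain = nn.init.calculate_gain("leaky_relu", 0)  # slope a=0, the default in kaiming_normal_
expected_std = gain / math.sqrt(fan_in)         # equals sqrt(2 / fan_in)
empirical_std = conv.weight.std().item()

print(f"expected {expected_std:.4f}, empirical {empirical_std:.4f}")
```

The two values should agree to within sampling noise; they will not match exactly because the weights are random draws.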

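Code example

Coming back to the code:

for m in self.modules():
    if isinstance(m, nn.Conv1d):
        nn.init.kaiming_normal_(m.weight, mode="fan_in", nonlinearity="leaky_relu")

This loop iterates over all modules (that is, network layers) and applies Kaiming initialization to the weights of every nn.Conv1d layer; in TokenEmbedding the only such layer is self.tokenConv. Specifying mode="fan_in" and nonlinearity="leaky_relu" scales the initial weights according to the characteristics of the Leaky ReLU activation.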

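Leaky ReLU

ReLU maps all negative values to zero and leaves positive values unchanged.
Leaky ReLU has a small slope in the negative region (0.1 in this example) to avoid dead neurons.
PReLU is the parameterized version of Leaky ReLU: the slope of the negative region is learnable.
ELU gradually approaches a fixed negative value (-alpha) in the negative region and behaves like ReLU in the positive region.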

import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn.functional as F

# Define the x-axis data
x = np.linspace(-10, 10, 400)
x_tensor = torch.tensor(x, dtype=torch.float32)

# Define the different activation functions
relu = F.relu(x_tensor).numpy()
leaky_relu = F.leaky_relu(x_tensor, negative_slope=0.1).numpy()
prelu = torch.nn.PReLU(num_parameters=1, init=0.1)
prelu_output = prelu(x_tensor).detach().numpy()
elu = F.elu(x_tensor, alpha=1.0).numpy()

# Plot each activation in its own panel
plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
plt.plot(x, relu, label='ReLU', color='blue')
plt.title('ReLU')
plt.grid(True)

plt.subplot(2, 2, 2)
plt.plot(x, leaky_relu, label='Leaky ReLU (0.1)', color='red')
plt.title('Leaky ReLU')
plt.grid(True)

plt.subplot(2, 2, 3)
plt.plot(x, prelu_output, label='PReLU (0.1)', color='green')
plt.title('PReLU')
plt.grid(True)

plt.subplot(2, 2, 4)
plt.plot(x, elu, label='ELU', color='purple')
plt.title('ELU')
plt.grid(True)

plt.tight_layout()
plt.show()
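As a numeric complement to the plots, the four activations can be evaluated at a handful of points. The sample values are arbitrary assumptions chosen to cover the negative, zero, and positive regions.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

relu_out = F.relu(x)                               # negatives become 0
leaky_out = F.leaky_relu(x, negative_slope=0.1)    # negatives scaled by 0.1
elu_out = F.elu(x, alpha=1.0)                      # negatives follow alpha * (exp(x) - 1)
prelu = torch.nn.PReLU(num_parameters=1, init=0.1)
prelu_out = prelu(x).detach()                      # equals leaky_out until the slope is trained

print(relu_out)
print(leaky_out)
print(prelu_out)
print(elu_out)
```

Before any training, PReLU with init=0.1 matches Leaky ReLU with negative_slope=0.1 exactly, which is why their two panels look identical.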