[NLP Beginner Series, Part 3] NLP Text Embedding (Using Embedding and EmbeddingBag as Examples)

🍨 This post is a study log from the 🔗365天深度學習訓練營 (365-Day Deep Learning Training Camp)
🍖 Original author: K同學啊

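About the blogger: a hard-working undergraduate who enrolled in 2022 🌟; exploring AI algorithms, C++ and Go; looking for light amid the confusion 🌸
Blog homepage: 羊小豬~~ - CSDN blog
Summary: NLP beginner series part 3, embedding with Embedding and EmbeddingBag.
🌸Motto🌸: Go and find your ideal "castle in the sky"
Previous post: 【NLP入門系列二】NLP分詞和字典構建 - CSDN blog ([NLP Beginner Series, Part 2] NLP Tokenization and Vocabulary Building)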

Contents

NLP Text Embedding
- Preface
- 1. Embedding
- 2. EmbeddingBag
- 3. References

NLP Text Embedding

Preface

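📄 How a large language model understands text: it treats every word as a number and then keeps doing arithmetic on those numbers, producing text as it goes.

Example:

If a single number represents each word, say 1 for "man" and 2 for "woman", the words are merely numbered; the numbers cannot express any relationship between the words.

But what if each word is represented by two numbers?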

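👀 See the Bilibili video listed in the references:

In mathematics, a vector has a direction and supports arithmetic; the same is true of word vectors.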


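This is how "treat every word as a number, do arithmetic on it, and output text" is realized.

💠 Word embedding: representing words as vectors. Principle: each word is embedded into a mathematical vector space; a word represented by two numbers lives in a two-dimensional space, and so on for higher dimensions.

In essence: discrete words are mapped into a low-dimensional continuous vector space, so that the relationships between words show up as relationships between their vectors.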
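To make the idea concrete, here is a minimal sketch with hand-picked two-dimensional vectors. The words, the coordinates and the helper names (`vectors`, `add`, `sub`, `nearest`) are all made up for illustration; real embeddings are learned during training and have far more dimensions.

```python
import math

# Hypothetical 2-D word vectors: axis 0 ~ "gender", axis 1 ~ "royalty".
vectors = {
    "man":   (1.0, 0.0),
    "woman": (-1.0, 0.0),
    "king":  (1.0, 1.0),
    "queen": (-1.0, 1.0),
}

def add(a, b):
    """Element-wise sum of two vectors."""
    return tuple(x + y for x, y in zip(a, b))

def sub(a, b):
    """Element-wise difference of two vectors."""
    return tuple(x - y for x, y in zip(a, b))

def nearest(target):
    """Return the word whose vector is closest (Euclidean distance) to target."""
    return min(vectors, key=lambda w: math.dist(vectors[w], target))

# In this toy space, king - man + woman lands exactly on queen.
result = add(sub(vectors["king"], vectors["man"]), vectors["woman"])
print(result)           # (-1.0, 1.0)
print(nearest(result))  # queen
```

With single-number IDs such as 1 and 2, no arithmetic of this kind would carry any meaning; with vectors, a direction in the space can stand for a relationship between words.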

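📘 How a large language model is trained

Training a large language model is a complex process, but the basic idea is easy to follow. At the start, the words are scattered at random positions in the two-dimensional space.

After training, words with similar meanings sit close together. Some words remain ambiguous ("neutral"), though. Take "apple": "eating an apple" and "an Apple phone" mean different things, so the word on its own is neutral, and its actual meaning has to be worked out by the model from the surrounding context.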

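Embedding and EmbeddingBag are the PyTorch modules for word embedding of text data.

1. Embedding

Embedding is the most basic word embedding operation in PyTorch.

Input: an integer tensor in which every integer is the index of a word.

Output: a floating-point tensor in which every index has been replaced by the embedding vector of the corresponding word.

Shape change:

Input shape: [batch, seqSize], where seqSize is the length of a single text (note: every sample in the same batch must have the same sequence length, seq_len);
Output shape: [batch, seqSize, embed_dim], where embed_dim is the embedding dimension.

Note: the embedding layer is defined as the first hidden layer of the network, and its weights are initialized randomly. It can be trained together with the rest of the deep learning model, or it can be loaded with pretrained word embeddings.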

Function prototype:

torch.nn.Embedding(num_embeddings, embedding_dim, padding_idx=None,
                   max_norm=None, norm_type=2.0, scale_grad_by_freq=False,
                   sparse=False, _weight=None, _freeze=False, device=None,
                   dtype=None)

Common parameters:

num_embeddings: the vocabulary size, i.e. the largest integer index + 1.
embedding_dim: the dimension of each word vector.
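As a quick check of the shapes described above, here is a small sketch. The vocabulary size of 10, the embedding dimension of 6 and the index values are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only to make the random initial weights repeatable

# Vocabulary of 10 indices (0..9), each mapped to a 6-dimensional vector.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=6)

# A batch of 2 sequences, each of length 3: shape [batch, seqSize].
indices = torch.tensor([[1, 1, 1],
                        [3, 3, 0]], dtype=torch.long)

vectors = embedding(indices)
print(indices.shape)   # torch.Size([2, 3])
print(vectors.shape)   # torch.Size([2, 3, 6])

# The layer is a lookup table: position [0, 0] holds row 1 of the weight matrix.
print(torch.equal(vectors[0, 0], embedding.weight[1]))  # True
```

Any index outside 0..num_embeddings-1 raises an error, which is why num_embeddings has to be at least the largest index + 1.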

📚 Take a binary classification task as an example:

1. Import the libraries and define a custom dataset

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Custom dataset that returns (text, label) pairs
class MyDataset(Dataset):
    def __init__(self, texts, labels):
        super().__init__()
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        return text, label

2. Define the padding function (pads every text in a batch to the same length)

def collate_batch(batch):
    # Unpack the batch into a tuple of texts and a tuple of labels
    texts, labels = zip(*batch)
    # Length of the longest text in the batch
    max_len = max(len(text) for text in texts)
    # Right-pad the shorter texts with 0
    padding_texts = [F.pad(text, (0, max_len - len(text)), value=0) for text in texts]
    # Stack into one tensor of shape (batch_size, max_len)
    padding_texts = torch.stack(padding_texts)
    # Reshape the labels from (batch_size,) to (batch_size, 1); the values are unchanged
    labels = torch.tensor(labels, dtype=torch.float).unsqueeze(1)
    return padding_texts, labels

3. Define the data

# Three samples of word indices (the last one is shorter)
samples = [
    torch.tensor([1, 1, 1], dtype=torch.long),
    torch.tensor([2, 2, 2], dtype=torch.long),
    torch.tensor([3, 3], dtype=torch.long),
]

# Labels (placeholder values that only illustrate the data flow;
# a real binary task would use 0/1 targets with BCEWithLogitsLoss)
labels = torch.tensor([1, 2, 3], dtype=torch.float)

# Build the dataset and the data loader
dataset = MyDataset(samples, labels)
data_loader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_batch)

# Inspect the batches and their shapes
for batch in data_loader:
    print(batch)
    print("shape:", batch[0].shape, batch[1].shape)

(tensor([[1, 1, 1],[3, 3, 0]]), tensor([[1.],[3.]]))
shape: torch.Size([2, 3]) torch.Size([2, 1])
(tensor([[2, 2, 2]]), tensor([[2.]]))
shape: torch.Size([1, 3]) torch.Size([1, 1])

4. Define the model

class EmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super(EmbeddingModel, self).__init__()
        # Embedding layer: vocabulary size and embedding dimension
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One output logit, assuming a binary classification task
        self.fc = nn.Linear(embed_dim, 1)

    def forward(self, text):
        print("Embedding input text: ", text)
        print("Embedding input shape: ", text.shape)
        embedding = self.embedding(text)        # [batch, seqSize, embed_dim]
        embedding_mean = embedding.mean(dim=1)  # [batch, embed_dim]
        print("embedding output shape: ", embedding_mean.shape)
        return self.fc(embedding_mean)

Note:
The statement embedding_mean = embedding.mean(dim=1) averages the embedding vectors of each sample over the sequence dimension, so the result has shape [batch, embed_dim]. Without that statement, the embedding keeps the shape [batch, seqSize, embed_dim].

5. Train the model

# Vocabulary size and embedding dimension
vocab_size = 10
embed_dim = 6

# Create the model
model = EmbeddingModel(vocab_size, embed_dim)

# Loss function and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Training loop
for epoch in range(1):
    for batch in data_loader:
        text, label = batch

        # Forward pass
        outputs = model(text)
        loss = criterion(outputs, label)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        print(f'Epoch {epoch+1}, Loss: {loss.item()}')

Embedding input text:  tensor([[2, 2, 2],[1, 1, 1]])
Embedding input shape:  torch.Size([2, 3])
embedding output shape:  torch.Size([2, 6])
Epoch 1, Loss: 0.6635428667068481
Embedding input text:  tensor([[3, 3]])
Embedding input shape:  torch.Size([1, 2])
embedding output shape:  torch.Size([1, 6])
Epoch 1, Loss: 0.5667202472686768
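One detail of the mean pooling in this model: embedding.mean(dim=1) also averages any positions that the padding function filled with 0. The sketch below shows one possible way to average only the real words. It assumes index 0 is reserved for padding, and the `masked_mean` helper is a hypothetical name introduced here, not part of PyTorch or of the original post.

```python
import torch
import torch.nn as nn

def masked_mean(embedded, indices, pad_idx=0):
    """Average embeddings over the sequence axis, skipping padded positions."""
    mask = (indices != pad_idx).unsqueeze(-1).to(embedded.dtype)  # [batch, seq, 1]
    summed = (embedded * mask).sum(dim=1)                          # [batch, embed_dim]
    counts = mask.sum(dim=1).clamp(min=1.0)                        # [batch, 1]
    return summed / counts

# padding_idx=0 keeps row 0 of the weight matrix at zero.
embedding = nn.Embedding(10, 6, padding_idx=0)
batch = torch.tensor([[1, 1, 1],
                      [3, 3, 0]], dtype=torch.long)

embedded = embedding(batch)             # [2, 3, 6]
plain = embedded.mean(dim=1)            # second row is divided by 3
masked = masked_mean(embedded, batch)   # second row is divided by 2

print(plain.shape, masked.shape)        # torch.Size([2, 6]) torch.Size([2, 6])
```

For the padded sample [3, 3, 0], the plain mean is two thirds of the vector for word 3, while the masked mean is exactly that vector. Whether this matters depends on the task; for the toy data here it changes nothing about the shapes.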

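2. EmbeddingBag

EmbeddingBag is a further optimization built on top of Embedding. Its core idea is to pool the embedding vectors of each input sequence into a single vector, namely the mean or the sum of the embeddings of all words in the sentence. It handles variable-length input sequences and reduces computation and storage overhead.

Lower cost: Embedding requires every sequence in a batch to have the same length, so shorter texts have to be padded; EmbeddingBag works directly on the unpadded sequences.

In PyTorch, EmbeddingBag takes an integer tensor and an offsets tensor. Each integer is the index of a word, and each offset marks where a sentence starts in the integer tensor. The output is a floating-point tensor with one row per sentence, holding the mean or the sum of that sentence's word embedding vectors.

Input shape: [seqsSize] (seqsSize is the total length of all texts in one batch)
Output shape: [batch, embed_dim] (embed_dim is the embedding dimension)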
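The input and output shapes can be checked with a short sketch. The sizes and index values are arbitrary, and the manual comparison at the end is only there to illustrate what mode="mean" computes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only to make the random initial weights repeatable

bag = nn.EmbeddingBag(10, 6, mode="mean")

# Two sentences, [1, 1, 1] and [3, 3], flattened into one 1-D tensor.
flat = torch.tensor([1, 1, 1, 3, 3], dtype=torch.long)
offsets = torch.tensor([0, 3], dtype=torch.long)  # start of each sentence

pooled = bag(flat, offsets)
print(flat.shape)     # torch.Size([5])
print(pooled.shape)   # torch.Size([2, 6])

# Same result as looking up each sentence separately and averaging it.
manual = torch.stack([
    bag.weight[flat[0:3]].mean(dim=0),
    bag.weight[flat[3:5]].mean(dim=0),
])
print(torch.allclose(pooled, manual))  # True
```

No padding is involved: the two sentences have different lengths, yet each one still produces exactly one row of the output.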

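📐 Suppose the original input is [[1, 1, 1, 1], [2, 2, 2], [3, 3]]

Flattened index tensor

All samples are concatenated into a one-dimensional array: [1, 1, 1, 1, 2, 2, 2, 3, 3].

Offsets

Each offset is the starting position of a sample in the flattened tensor. In this example: [0, 4, 7].

Pooling

The entries are pooled according to the offsets.
The pooling operation can be sum, mean or max; the default is mean. Taking the mean as an example, and for simplicity averaging the indices themselves (the real layer averages the embedding vectors that the indices point to):
Mean of the first sample: (1 + 1 + 1 + 1) / 4 = 1
Mean of the second sample: (2 + 2 + 2) / 3 = 2
Mean of the third sample: (3 + 3) / 2 = 3
The final result is [1, 2, 3], one entry per sample, i.e. the batch dimension.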
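The flattening, the offsets and the per-sample means above can be reproduced in a few lines of plain Python. This is only a sketch of the bookkeeping, with variable names chosen here for illustration; like the worked example, it averages the raw indices rather than embedding vectors.

```python
from itertools import accumulate

samples = [[1, 1, 1, 1], [2, 2, 2], [3, 3]]

# Flatten all samples into one list.
flat = [idx for sample in samples for idx in sample]

# Each offset is the running total of the lengths of the preceding samples.
lengths = [len(sample) for sample in samples]
offsets = [0] + list(accumulate(lengths[:-1]))

# Recover each sample from (flat, offsets) and average it.
ends = offsets[1:] + [len(flat)]
means = [sum(flat[s:e]) / (e - s) for s, e in zip(offsets, ends)]

print(flat)     # [1, 1, 1, 1, 2, 2, 2, 3, 3]
print(offsets)  # [0, 4, 7]
print(means)    # [1.0, 2.0, 3.0]
```

The last sample needs no explicit end marker: it runs from its offset to the end of the flattened list.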

📑 A simple example:

1. Import the libraries and define a custom dataset

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Custom dataset that returns (text, label) pairs
class MyDataset(Dataset):
    def __init__(self, texts, labels):
        super().__init__()
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        return text, label

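2. Define the data

# Three samples of word indices (the last one is shorter)
samples = [
    torch.tensor([1, 1, 1], dtype=torch.long),
    torch.tensor([2, 2, 2], dtype=torch.long),
    torch.tensor([3, 3], dtype=torch.long),
]

# Labels (placeholder values that only illustrate the data flow)
labels = torch.tensor([1, 2, 3], dtype=torch.float)

# Build the dataset and the data loader; the identity collate_fn keeps
# each batch as a list of (text, label) pairs, so no padding is applied
dataset = MyDataset(samples, labels)
data_loader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=lambda x: x)

# Inspect the batches
for batch in data_loader:
    print(batch)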

[(tensor([1, 1, 1]), tensor(1.)), (tensor([3, 3]), tensor(3.))]
[(tensor([2, 2, 2]), tensor(2.))]

3. Define the model

class EmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super(EmbeddingModel, self).__init__()
        # EmbeddingBag layer that averages the embeddings of each sample
        self.embedding_bag = nn.EmbeddingBag(vocab_size, embed_dim, mode='mean')
        # One output logit, assuming a binary classification task
        self.fc = nn.Linear(embed_dim, 1)

    def forward(self, text, offsets):
        print("Embedding input text: ", text)
        print("Embedding input shape: ", text.shape)
        embedding_bag = self.embedding_bag(text, offsets)  # [batch, embed_dim]
        print("embedding_bag output shape: ", embedding_bag.shape)
        return self.fc(embedding_bag)

4. Train the model

# Vocabulary size and embedding dimension
vocab_size = 10
embed_dim = 6

# Create the model
model = EmbeddingModel(vocab_size, embed_dim)

# Loss function and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Training loop
for epoch in range(1):
    for batch in data_loader:
        # Split the batch into texts and labels
        texts, labels = zip(*batch)

        # Offsets: cumulative sum of the lengths of all texts but the last
        offset = [0] + [len(text) for text in texts[:-1]]
        offset = torch.tensor(offset).cumsum(dim=0)

        texts = torch.cat(texts)                    # flatten into one 1-D tensor
        labels = torch.tensor(labels).unsqueeze(1)  # (batch_size,) -> (batch_size, 1)

        # Forward pass
        outputs = model(texts, offset)
        loss = criterion(outputs, labels)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        print(f'Epoch {epoch+1}, Loss: {loss.item()}')

Embedding input text:  tensor([2, 2, 2, 1, 1, 1])
Embedding input shape:  torch.Size([6])
embedding_bag output shape:  torch.Size([2, 6])
Epoch 1, Loss: 0.07764509320259094
Embedding input text:  tensor([3, 3])
Embedding input shape:  torch.Size([2])
embedding_bag output shape:  torch.Size([1, 6])
Epoch 1, Loss: 3.315852642059326

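3. References

【大模型靠啥理解文字？通俗解釋：詞嵌入embedding】 (How do large models understand text? A plain explanation of word embedding) https://www.bilibili.com/video/BV1bfoQYCEHC?vd_source=1fd424333dd77a7d3e2e741f7d6fd4ee
PyTorch 簡單易懂的 Embedding 和 EmbeddingBag - 解析與實踐 (PyTorch: Embedding and EmbeddingBag Made Simple, Analysis and Practice), CSDN blog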