Continuous Bag of Words (CBOW) in Natural Language Processing (NLP): Method, Advantages, Uses, and Code Examples

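- 1. Introduction to Continuous Bag of Words (CBOW)
  - 1.1 What are word embeddings?
  - 1.2 What is Continuous Bag of Words (CBOW)?
  - 1.3 Advantages of CBOW
    - (1) Efficiency
    - (2) Flexibility
    - (3) Robustness
  - 1.4 CBOW Architecture
  - 1.5 Differences between the Bag-of-Words (BoW) and Continuous Bag-of-Words (CBOW) models
- 2. What CBOW Does
  - 2.1 Generating word embeddings
    - (1) Core function
    - (2) How it works
  - 2.2 Improving semantic understanding
    - (1) Capturing semantic relationships
    - (2) Mitigating the curse of dimensionality
- 3. Application Scenarios
  - 3.1 Text classification
  - 3.2 Machine translation
  - 3.3 Sentiment analysis
  - 3.4 Information retrieval
- 4. CBOW Code Example
  - 4.1 Vectorizing the corpus
  - 4.2 Building a CBOW model
  - 4.3 Visualizing the word embeddings with the model
- 5. Summary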

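For a computer to process text, the words in that text can be represented as numeric vectors. Continuous Bag of Words (CBOW) is a word-embedding model used in natural language processing; it is one of the two model architectures of the Word2vec algorithm.
Word2vec is a neural-network-based method for generating word embeddings, that is, dense vector representations of words that capture their meaning and the relationships between them. Word2vec can be implemented in two main ways: (1) continuous bag-of-words (CBOW); (2) skip-gram.

This article focuses on continuous bag-of-words (CBOW).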

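1. Introduction to Continuous Bag of Words (CBOW)

1.1 What are word embeddings?

Word embeddings represent words as numeric vectors. They are central to most NLP tasks because the vectors encode both the meaning of words and the syntactic relationships between words in a language.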
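To make the idea concrete, here is a minimal sketch using made-up three-dimensional vectors (the numbers are hypothetical and not learned from any data). It shows how the cosine similarity between two vectors can stand in for the similarity between two words.

```python
import math

# Hypothetical 3-dimensional embeddings; a real model would learn these values.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(f"cat-dog: {sim_cat_dog:.3f}, cat-car: {sim_cat_car:.3f}")
```

With these illustrative values, "cat" scores higher against "dog" than against "car", which is the kind of structure a trained embedding is expected to show.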

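1.2 What is Continuous Bag of Words (CBOW)?

CBOW is a neural-network-based algorithm that predicts a target word from the context words around it, and it is a popular NLP technique for generating word embeddings. Its training signal comes from the text itself, so it can learn from unlabeled data without manual annotation (it is usually described as an unsupervised learning method), as illustrated in Figure 1.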

Figure 1. Example of the CBOW model

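1.3 Advantages of CBOW

(1) Efficiency

CBOW can be trained on large datasets in a relatively short time. It makes one prediction per context window, which is one reason it is typically faster to train than skip-gram.

(2) Flexibility

The learned vectors can easily be plugged into more complex NLP systems as input features.

(3) Robustness

Because each prediction combines several context words, a single misspelled or rare word in the context has limited influence, so the model still tends to produce reasonable representations. Note that a plain CBOW model has no vector at all for a word that never appears in the training vocabulary.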

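1.4 CBOW Architecture

The CBOW model uses the surrounding context words to predict the target word. Consider the example above, "He is a great man". The CBOW model converts this phrase into pairs of context words and a target word. Taking one word on each side of the target (two context words in total), the pairs are: ([He, a], is), ([is, great], a), ([a, man], great).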
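The pairing step can be sketched in a few lines of plain Python. The function name `make_cbow_pairs` and its parameter `half_window` are hypothetical names chosen for this illustration; the slicing is the same idea used in the code of section 4.2.

```python
def make_cbow_pairs(tokens, half_window):
    """Return (context_words, target_word) pairs for every position that
    has a full window of `half_window` words on each side."""
    pairs = []
    for i in range(half_window, len(tokens) - half_window):
        context = tokens[i - half_window:i] + tokens[i + 1:i + half_window + 1]
        pairs.append((context, tokens[i]))
    return pairs

tokens = "He is a great man".split()
pairs = make_cbow_pairs(tokens, half_window=1)
for context, target in pairs:
    print(context, "->", target)
# ['He', 'a'] -> is
# ['is', 'great'] -> a
# ['a', 'man'] -> great
```

In this sketch the first and last words are never targets, because they lack a full window on one side; that is why five words yield only three pairs.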


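Figure 2. CBOW model architecture

The model takes the context words and tries to predict the target word. If four context words are used to predict one target word, four 1×W input vectors are passed to the input layer, where W is the vocabulary size (each input is a one-hot vector). The hidden layer multiplies each input vector by a W×N weight matrix, where N is the embedding size. The resulting 1×N vectors then enter a summation layer, which combines them element-wise (by summing or, more commonly, averaging). The combined vector is projected back to one score per vocabulary word, a final activation (typically softmax) is applied, and the prediction is read from the output layer.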
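The forward pass described above can be sketched in plain Python. This is an illustration under stated assumptions, not a reference implementation: the sizes are toy values, the weights are random and untrained, the context vectors are averaged, and the output is a softmax over the vocabulary. The names `vocab_size` (W in the text), `embed_size` (N in the text), `W_in`, and `W_out` are introduced here for illustration only.

```python
import math
import random

random.seed(0)
vocab_size, embed_size = 6, 3   # toy values for W and N in the text

# Input-to-hidden weights (W x N) and hidden-to-output weights (N x W).
W_in = [[random.uniform(-0.5, 0.5) for _ in range(embed_size)]
        for _ in range(vocab_size)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(vocab_size)]
         for _ in range(embed_size)]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(vec, mat):
    """Multiply a row vector of length R by an R x C matrix."""
    return [sum(vec[r] * mat[r][c] for r in range(len(mat)))
            for c in range(len(mat[0]))]

def softmax(scores):
    peak = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(context_ids):
    # Each 1 x W one-hot vector times W_in gives a 1 x N hidden vector.
    hidden = [vec_mat(one_hot(i, vocab_size), W_in) for i in context_ids]
    # Combine the context vectors element-wise (here: average).
    combined = [sum(col) / len(hidden) for col in zip(*hidden)]
    # Project to one score per vocabulary word and normalise.
    return softmax(vec_mat(combined, W_out))

probs = cbow_forward([0, 1, 3, 4])   # four context word indices
print([round(p, 3) for p in probs])
```

Multiplying a one-hot vector by `W_in` simply selects one row of the matrix, which is why practical implementations use an embedding lookup instead of a full matrix product.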

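1.5 Differences between the Bag-of-Words (BoW) and Continuous Bag-of-Words (CBOW) models

Both BoW and CBOW are NLP techniques for representing text in a machine-readable form, but they differ in how they treat context.

(1) The BoW model represents the text of a given document or corpus as a collection of words and their frequencies. It ignores word order and context, so it may fail to capture the full meaning of the text. BoW is simple and easy to implement, but it is limited in how much of the meaning of language it can capture.
(2) The CBOW model, by contrast, is a neural-network-based approach that captures the context of words. It learns to predict a target word from the words that appear before and after it within a context window. By taking the surrounding words into account, CBOW captures the meaning of a word in a given context better.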
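The limitation of BoW is easy to demonstrate. The sketch below uses Python's `collections.Counter` as a stand-in for a BoW representation (the helper name `bag_of_words` is hypothetical): two sentences with opposite meanings produce identical bags, because only the counts survive.

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to word counts; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")
print(a["the"], a["dog"], a["cat"])  # 2 1 1
print(a == b)                        # True
```

CBOW, in contrast, is trained on which words occur near which other words, so the neighbourhood of each word influences the vector it ends up with.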

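2. What CBOW Does

CBOW serves two main purposes:

2.1 Generating word embeddings

(1) Core function

By predicting a target word from its context, CBOW maps each word to a dense vector in a space of fixed dimension.

(2) How it works

Given the context of a word (several neighbouring words), the model's objective is to predict the center word for that context. For example, in the sentence "I like eating tomatoes", if the context is "like" and "tomatoes", CBOW tries to predict the center word "eating".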
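In symbols, the prediction step is often written as follows. This is a sketch in the standard word2vec notation, which is introduced here and does not come from this article: $v_w$ is the input vector of word $w$, $u_w$ is its output vector, $m$ is the number of context words on each side, and $V$ is the vocabulary.

```latex
\bar{v}_t = \frac{1}{2m} \sum_{\substack{-m \le j \le m \\ j \ne 0}} v_{w_{t+j}},
\qquad
P(w_t \mid w_{t-m}, \dots, w_{t+m})
  = \frac{\exp\!\left(u_{w_t}^{\top} \bar{v}_t\right)}
         {\sum_{w \in V} \exp\!\left(u_{w}^{\top} \bar{v}_t\right)}
```

Training adjusts the vectors to minimise $-\log P(w_t \mid w_{t-m}, \dots, w_{t+m})$ summed over all positions $t$ in the corpus.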

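2.2 Improving semantic understanding

(1) Capturing semantic relationships

Through training, CBOW learns semantic relationships between words. For example, the relationship between "king" and "queen" resembles the relationship between "man" and "woman".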
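The analogy is usually tested with vector arithmetic: take the vector for "king", subtract "man", add "woman", and look for the nearest word. The sketch below uses hand-picked two-dimensional vectors (hypothetical values, arranged so that one axis loosely means "royalty" and the other "maleness"); a trained model only approximates this kind of structure.

```python
import math

# Hypothetical 2-D vectors; axis 0 ~ "royalty", axis 1 ~ "maleness".
vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "prince": [0.7, 0.9],
    "apple": [0.0, 0.5],
}

def nearest_word(query, exclude):
    """Return the word whose vector is closest (Euclidean) to `query`."""
    candidates = (w for w in vectors if w not in exclude)
    return min(candidates, key=lambda w: math.dist(vectors[w], query))

# king - man + woman, computed component by component.
query = [k - m + w for k, m, w in
         zip(vectors["king"], vectors["man"], vectors["woman"])]
answer = nearest_word(query, exclude={"king", "man", "woman"})
print(answer)  # queen
```

The input words are excluded from the search, as is common practice, since the query vector often stays close to one of them.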

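(2) Mitigating the curse of dimensionality

Compared with traditional sparse representations such as one-hot vectors, whose length equals the vocabulary size, word embeddings use far fewer dimensions, which makes downstream tasks more efficient.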
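A small sketch of the difference, assuming a toy nine-word vocabulary and a hypothetical dense vector of size 3 (the values are illustrative only): the one-hot vector grows with the vocabulary and holds a single non-zero entry, while the dense vector keeps a fixed size.

```python
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran", "in", "park"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Sparse representation: one dimension per vocabulary word."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

# Hypothetical dense embedding; its size is chosen independently of the
# vocabulary size, and the values are illustrative only.
dense_cat = [0.21, -0.47, 0.88]

sparse_cat = one_hot("cat")
print(len(sparse_cat), sum(sparse_cat))  # 9 1
print(len(dense_cat))                    # 3
```

With a realistic vocabulary of tens of thousands of words, the one-hot vector would have tens of thousands of dimensions, whereas embedding sizes are commonly in the low hundreds.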

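3. Application Scenarios

3.1 Text classification

Pretrained word vectors are fed to a classifier as input features.

3.2 Machine translation

Neural machine translation models use word vectors to improve translation quality.

3.3 Sentiment analysis

Representing text with word embeddings makes it easier to identify the sentiment of the text.

3.4 Information retrieval

Distance measures between word vectors are used to compute document similarity or to expand queries.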
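As a rough sketch of the retrieval use case, one simple approach (an assumption made for this illustration, not a method prescribed by the article) is to average the word vectors of each document and rank documents by cosine similarity to the query. The vectors below are hypothetical; in practice they would come from a trained model.

```python
import math

# Hypothetical 2-dimensional word vectors.
word_vectors = {
    "cat": [0.9, 0.1], "kitten": [0.85, 0.15], "pet": [0.8, 0.3],
    "stock": [0.1, 0.9], "market": [0.2, 0.85], "price": [0.15, 0.8],
}

def document_vector(text):
    """Average the vectors of the known words in `text`."""
    known = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not known:
        raise ValueError("no known words in text")
    return [sum(col) / len(known) for col in zip(*known)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

documents = ["kitten pet", "stock market price"]
query_vec = document_vector("cat")
ranked = sorted(documents,
                key=lambda d: cosine(query_vec, document_vector(d)),
                reverse=True)
print(ranked[0])  # kitten pet
```

The query "cat" shares no word with "kitten pet", yet that document ranks first, which is something a purely count-based match could not achieve.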

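4. CBOW Code Example

This section implements word embeddings with a CBOW model to show the similarity between words. The example defines its own small corpus; you can substitute any dataset.

4.1 Vectorizing the corpus

First, import the required libraries; next, define the corpus; then tokenize each sentence into words and convert it to a sequence of integers.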

# 1. Import the required module
from tensorflow.keras.preprocessing.text import Tokenizer

# 2. Define the corpus
corpus = [
    'The Fish swam in the water',
    'The cat sat on the mat',
    'The horse galloped on the grassland',
    'The dog ran in the park',
    'The bird sang in the tree',
]

# 3. Convert the corpus to integer sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
sequences = tokenizer.texts_to_sequences(corpus)
print("Corpus converted to integer sequences:", sequences)


4.2 Building a CBOW model

Next, build a CBOW model with a window size of 2, that is, two context words on each side of the target.

# 1. Import the required modules
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Embedding, Lambda
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

# 2. Define the corpus
corpus = [
    'The Fish swam in the water',
    'The cat sat on the mat',
    'The horse galloped on the grassland',
    'The dog ran in the park',
    'The bird sang in the tree',
]

# 3. Convert the corpus to integer sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
sequences = tokenizer.texts_to_sequences(corpus)
print("Corpus converted to integer sequences:", sequences)

# 4. Define the parameters
vocab_size = len(tokenizer.word_index) + 1  # +1 because word indices start at 1
embedding_size = 10
window_size = 2  # number of context words on each side of the target

# 5. Generate context-target pairs
contexts = []
targets = []
for sequence in sequences:
    for i in range(window_size, len(sequence) - window_size):
        context = sequence[i - window_size:i] + sequence[i + 1:i + window_size + 1]
        target = sequence[i]
        contexts.append(context)
        targets.append(target)

# 6. Convert the contexts and targets to NumPy arrays
X = np.array(contexts)
y = to_categorical(targets, num_classes=vocab_size)

# 7. Define the CBOW model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size, input_length=2 * window_size))
model.add(Lambda(lambda x: tf.reduce_mean(x, axis=1)))  # average the context embeddings
model.add(Dense(units=vocab_size, activation='softmax'))

# 8. Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# 9. Train the model
model.fit(X, y, epochs=100, verbose=0)
model.summary()  # summary() prints the table itself, so no print() is needed


4.3 Visualizing the word embeddings with the model

Finally, use the trained model to visualize the word embeddings.

# 1. Import the required modules
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn.decomposition import PCA
from tensorflow.keras.layers import Dense, Embedding, Lambda
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

# 2. Define the corpus
corpus = [
    'The Fish swam in the water',
    'The cat sat on the mat',
    'The horse galloped on the grassland',
    'The dog ran in the park',
    'The bird sang in the tree',
]

# 3. Convert the corpus to integer sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
sequences = tokenizer.texts_to_sequences(corpus)
print("Corpus converted to integer sequences:", sequences)

# 4. Define the parameters
vocab_size = len(tokenizer.word_index) + 1  # +1 because word indices start at 1
embedding_size = 10
window_size = 2  # number of context words on each side of the target

# 5. Generate context-target pairs
contexts = []
targets = []
for sequence in sequences:
    for i in range(window_size, len(sequence) - window_size):
        context = sequence[i - window_size:i] + sequence[i + 1:i + window_size + 1]
        target = sequence[i]
        contexts.append(context)
        targets.append(target)

# 6. Convert the contexts and targets to NumPy arrays
X = np.array(contexts)
y = to_categorical(targets, num_classes=vocab_size)

# 7. Define the CBOW model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size, input_length=2 * window_size))
model.add(Lambda(lambda x: tf.reduce_mean(x, axis=1)))  # average the context embeddings
model.add(Dense(units=vocab_size, activation='softmax'))

# 8. Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# 9. Train the model
model.fit(X, y, epochs=100, verbose=0)

# 10. Extract the word embeddings (the weights of the Embedding layer)
embedding_layer = model.layers[0]
embeddings = embedding_layer.get_weights()[0]

# 11. Reduce the embeddings to 2 dimensions with PCA
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embeddings)

# 12. Plot the word embeddings
plt.figure(figsize=(6, 6))
for word, idx in tokenizer.word_index.items():
    px, py = reduced_embeddings[idx]  # separate names so the labels in y are not overwritten
    plt.scatter(px, py)
    plt.annotate(word, xy=(px, py), xytext=(5, 2),
                 textcoords='offset points', ha='right', va='bottom')
plt.title("Word Embeddings Visualized")
plt.show()


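The resulting plot lets you judge the similarity of words from their embeddings: words that are similar in meaning or context should lie close together. With a corpus this small and randomly initialized weights, the layout can vary noticeably between runs.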

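5. Summary

By learning from the co-occurrence of words, the CBOW model turns discrete vocabulary items into continuous vector representations, which makes it an important tool for many applications that need deeper semantic understanding. This article introduced the concept of Continuous Bag of Words (CBOW) in NLP, its advantages, uses, and application scenarios, and illustrated its use with code examples.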