TensorFlow Deep Learning in Practice: Building Sentence Vectors with an Autoencoder

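- 0. Preface
- 1. Sentence vectors
- 2. Building sentence vectors with an autoencoder
  - 2.1 Data processing
  - 2.2 Model construction and training
- 3. Model testing
- Related links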

0. Preface

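In this section, we build and train a Long Short-Term Memory (LSTM) autoencoder that generates sentence vectors for documents in the Reuters-21578 corpus. We have already seen how a word embedding represents a word as a vector that captures its meaning in context. Here we do the same for sentences: a sentence is a sequence of words, so a sentence vector represents the meaning of the whole sentence.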

1. Sentence vectors

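The simplest way to build a sentence vector is to add up the vectors of all the words in the sentence and divide by the number of words. This treats the sentence as a bag of words and ignores word order, so "The dog bit the man" and "The man bit the dog" end up with the same sentence vector. An LSTM is designed to process sequential input and takes word order into account, which yields a better, more natural sentence representation.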
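The order-insensitivity of averaging can be checked directly. The following is a minimal sketch using NumPy; the three-dimensional vectors in toy_vectors and the helper average_sentence_vector are invented for illustration and are not real embeddings or part of the article's code.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, invented for illustration only.
toy_vectors = {
    "the": np.array([0.1, 0.0, 0.2]),
    "dog": np.array([0.9, 0.3, 0.1]),
    "bit": np.array([0.2, 0.8, 0.5]),
    "man": np.array([0.4, 0.1, 0.9]),
}

def average_sentence_vector(sentence, vectors):
    """Sum the word vectors of a sentence and divide by the word count."""
    words = sentence.lower().split()
    return np.sum([vectors[w] for w in words], axis=0) / len(words)

v1 = average_sentence_vector("The dog bit the man", toy_vectors)
v2 = average_sentence_vector("The man bit the dog", toy_vectors)

# Both sentences contain the same words, so the averages coincide
# even though the meanings differ.
print(np.allclose(v1, v2))
```

Any model that should tell these two sentences apart therefore needs to see the words in order, which is what the LSTM encoder provides.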

2. Building sentence vectors with an autoencoder

2.1 Data processing

(1) First, import the required libraries:

from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import RepeatVector
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import sequence
from scipy.stats import describe
import collections
import matplotlib.pyplot as plt
import nltk
import numpy as np
import os
from time import gmtime, strftime
from tensorflow.keras.callbacks import TensorBoard
import re
# Needed to run only once
nltk.download('punkt')
nltk.download('reuters')
from nltk.corpus import reuters

(2) Once the download finishes, unpack the Reuters corpus with the following command:

$ unzip ~/nltk_data/corpora/reuters.zip -d ~/nltk_data/corpora

(3) Next, because we use GloVe embeddings, download glove.6B.zip and unzip it:

$ unzip glove.6B.zip

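(4) Convert each block of text (document) into a list of sentences, one sentence per list element. As each word is counted, it is normalized: every number is replaced with the digit 9, and every word is lowercased. The counts are accumulated in the word frequency table word_freqs: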

def is_number(n):
    temp = re.sub("[.,-/]", "", n)
    return temp.isdigit()

# parsing sentences and building vocabulary
word_freqs = collections.Counter()
documents = reuters.fileids()
#ftext = open("text.tsv", "r")
sents = []
sent_lens = []
num_read = 0
for i in range(len(documents)):
    # periodic heartbeat report
    if num_read % 100 == 0:
        print("building features from {:d} docs".format(num_read))
    # skip docs without specified topic
    title_body = reuters.raw(documents[i]).lower()
    if len(title_body) == 0:
        continue
    num_read += 1
    # convert to list of word indexes
    title_body = re.sub("\n", "", title_body)
    for sent in nltk.sent_tokenize(title_body):
        for word in nltk.word_tokenize(sent):
            if is_number(word):
                word = "9"
            word = word.lower()
            word_freqs[word] += 1
        sents.append(sent)
        sent_lens.append(len(sent))
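To see the normalization rule in isolation, the sketch below applies it to a hand-written token list instead of the Reuters corpus. The tokens list and the helper normalize are hypothetical stand-ins (the list plays the role of nltk.word_tokenize output), and the character class is written with an escaped hyphen so that it reads unambiguously as the four characters ".", ",", "-" and "/".

```python
import collections
import re

def is_number(token):
    # Strip common numeric punctuation, then check that only digits remain.
    return re.sub(r"[.,\-/]", "", token).isdigit()

def normalize(token):
    # Collapse every number to "9" and lowercase everything else.
    return "9" if is_number(token) else token.lower()

# Hypothetical pre-tokenized sentence, standing in for nltk.word_tokenize output.
tokens = ["Shipments", "rose", "12.5", "pct", "to", "1,200", "tonnes", "in", "1986"]

toy_freqs = collections.Counter(normalize(t) for t in tokens)
print(toy_freqs["9"])          # how many tokens were treated as numbers
print(toy_freqs["shipments"])  # lowercased form of "Shipments"
```

Mapping all numbers to one token keeps thousands of distinct figures from crowding real words out of a fixed-size vocabulary.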

(5) Print some statistics about the corpus, which help in choosing suitable constants for the LSTM network:

print("Total number of sentences are: {:d} ".format(len(sents)))
print ("Sentence distribution min {:d}, max {:d} , mean {:3f}, median {:3f}".format(np.min(sent_lens), np.max(sent_lens), np.mean(sent_lens), np.median(sent_lens)))
print("Vocab size (full) {:d}".format(len(word_freqs)))

The corpus statistics are printed as follows:

Total number of sentences are: 50470 
Sentence distribution min 1, max 3688 , mean 167.072657, median 155.000000
Vocab size (full) 33743

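(6) Based on this information, set the constants for the LSTM model. VOCAB_SIZE is set to 5000, so the vocabulary holds the 5,000 most common words, which cover more than 93% of the word occurrences in the corpus. The remaining words are treated as out of vocabulary (OOV) and replaced with the token UNK. At prediction time, words the model has not seen are also replaced with UNK. SEQUENCE_LEN is set to 50, which the source describes as about half the median sentence length in the training set; note that the lengths printed above are measured in characters, because sent_lens stores len(sent) of each sentence string. Sentences shorter than SEQUENCE_LEN are padded with the PAD token, and longer sentences are truncated: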

VOCAB_SIZE = 5000
EMBED_SIZE = 50
LATENT_SIZE = 512
SEQUENCE_LEN = 50
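A coverage figure like the one quoted above can be computed from a frequency counter. The following is a small sketch on invented counts; vocab_coverage and toy_freqs are hypothetical names, and the numbers are not taken from the Reuters corpus.

```python
import collections

def vocab_coverage(freqs, vocab_size):
    """Fraction of all token occurrences covered by the top `vocab_size` words."""
    total = sum(freqs.values())
    covered = sum(count for _, count in freqs.most_common(vocab_size))
    return covered / total

# Hypothetical counts, invented for illustration.
toy_freqs = collections.Counter({"the": 50, "of": 30, "9": 15, "wheat": 4, "zinc": 1})

# Share of token occurrences covered by the 3 most common words.
print(vocab_coverage(toy_freqs, 3))
```

Applied to the real counter, a call such as vocab_coverage(word_freqs, VOCAB_SIZE) would show how much of the corpus a given vocabulary size retains, which is one way to justify the choice of 5,000.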

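(7) Because the LSTM needs numeric input, build lookup tables that convert between words and word IDs. The vocabulary is limited to 5,000 entries, two of which are the special tokens PAD and UNK, so the tables contain the 4,998 most frequent words plus PAD and UNK: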

# word2id = collections.defaultdict(lambda: 1)
word2id = {}
word2id["PAD"] = 0
word2id["UNK"] = 1
for v, (k, _) in enumerate(word_freqs.most_common(VOCAB_SIZE - 2)):
    word2id[k] = v + 2
id2word = {v: k for k, v in word2id.items()}
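The round trip through these tables can be sketched on a miniature vocabulary. Everything below is illustrative: toy_word2id, encode, and decode are hypothetical names, and dict.get with a default is used here as one simple way to fall back to UNK for unknown words.

```python
# Hypothetical miniature vocabulary; ids 0 and 1 are reserved as in the article.
toy_word2id = {"PAD": 0, "UNK": 1, "the": 2, "dog": 3, "bit": 4}
toy_id2word = {i: w for w, i in toy_word2id.items()}

def encode(sentence, word2id):
    # Unknown words fall back to the UNK id.
    return [word2id.get(w, word2id["UNK"]) for w in sentence.split()]

def decode(ids, id2word):
    return " ".join(id2word[i] for i in ids)

ids = encode("the dog bit the man", toy_word2id)
print(ids)                       # "man" is out of vocabulary, so it maps to id 1
print(decode(ids, toy_id2word))  # the original word cannot be recovered from UNK
```

The mapping is lossy by design: every out-of-vocabulary word collapses to the same ID, so decoding returns UNK rather than the original word.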

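(8) The network's input is a sequence of words, each represented by a vector. One-hot encoding would work but makes the input very large, so we encode each word with its 50-dimensional GloVe embedding instead. The embeddings form a matrix of shape (VOCAB_SIZE, EMBED_SIZE) in which each row holds the GloVe vector of one vocabulary word. The rows for PAD and UNK (IDs 0 and 1) are filled with zeros and with random values drawn uniformly from [-1, 1), respectively: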

def lookup_word2id(word):
    # fall back to the UNK id for out-of-vocabulary words
    try:
        return word2id[word]
    except KeyError:
        return word2id["UNK"]

def load_glove_vectors(glove_file, word2id, embed_size):
    embedding = np.zeros((len(word2id), embed_size))
    # each line of the GloVe file is: word followed by its vector components
    with open(glove_file, "rb") as fglove:
        for line in fglove:
            cols = line.strip().split()
            word = cols[0].decode('utf-8')
            if embed_size == 0:
                embed_size = len(cols) - 1
            if word in word2id:
                vec = np.array([float(v) for v in cols[1:]])
                embedding[lookup_word2id(word)] = vec
    # PAD is all zeros, UNK is random uniform
    embedding[word2id["PAD"]] = np.zeros((embed_size))
    embedding[word2id["UNK"]] = np.random.uniform(-1, 1, embed_size)
    return embedding
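The same parsing logic can be tried without downloading GloVe by feeding it a few lines in the same text layout. The sketch below is a simplified variant under stated assumptions: toy_glove holds invented three-dimensional vectors, load_vectors is a hypothetical helper that reads from any iterable of byte lines, and a seeded generator replaces the global NumPy random state for reproducibility.

```python
import io
import numpy as np

# Hypothetical 3-dimensional vectors in the GloVe text layout: word, then floats.
toy_glove = io.BytesIO(
    b"the 0.1 0.2 0.3\n"
    b"dog 0.4 0.5 0.6\n"
    b"cat 0.7 0.8 0.9\n"
)
toy_word2id = {"PAD": 0, "UNK": 1, "the": 2, "dog": 3, "bit": 4}

def load_vectors(lines, word2id, embed_size, seed=0):
    rng = np.random.default_rng(seed)
    embedding = np.zeros((len(word2id), embed_size))
    for line in lines:
        cols = line.strip().split()
        word = cols[0].decode("utf-8")
        if word in word2id:  # words outside the vocabulary ("cat") are skipped
            embedding[word2id[word]] = [float(v) for v in cols[1:]]
    # PAD keeps its zero row; UNK gets random uniform values.
    embedding[word2id["UNK"]] = rng.uniform(-1, 1, embed_size)
    return embedding

E = load_vectors(toy_glove, toy_word2id, embed_size=3)
print(E.shape)
print(E[toy_word2id["bit"]])  # in the vocabulary but absent from the file
```

A vocabulary word that never appears in the embedding file, such as "bit" here, keeps its all-zero row, which makes it indistinguishable from PAD at the input of the network.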

(9) Next, convert the sentences to padded word-ID sequences and load the embedding matrix:

sent_wids = [[lookup_word2id(w) for w in s.split()] for s in sents]
sent_wids = sequence.pad_sequences(sent_wids, SEQUENCE_LEN)
# load glove vectors into weight matrix
embeddings = load_glove_vectors("glove.6B/glove.6B.{:d}d.txt".format(EMBED_SIZE), word2id, EMBED_SIZE)
print(embeddings.shape)
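When pad_sequences is called with only a maximum length, as above, Keras uses its default "pre" mode for both padding and truncation: short sequences get zeros on the left, and long sequences keep their last elements. The pure-Python helper pad_pre below is a hypothetical re-implementation of that behavior for a single sequence, written only to make the effect visible; it assumes maxlen is at least 1.

```python
def pad_pre(ids, maxlen, pad_id=0):
    """Left-pad with pad_id, or keep only the last `maxlen` ids if too long."""
    ids = list(ids)[-maxlen:]
    return [pad_id] * (maxlen - len(ids)) + ids

print(pad_pre([5, 6, 7], 5))               # short sequence: zeros added on the left
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))   # long sequence: the first ids are dropped
```

With these defaults, a sentence longer than SEQUENCE_LEN loses its beginning rather than its end, which is worth keeping in mind when inspecting reconstructed sentences.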

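The autoencoder takes a sequence of GloVe word vectors and learns to produce a sequence similar to its input. The encoder LSTM compresses the sequence into a fixed-size context vector, and the decoder LSTM uses that context vector to reconstruct the original sequence: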

Figure: Model architecture

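(10) Because the input data is large, we use a generator to produce each input batch. The generator yields tensors of shape (BATCH_SIZE, SEQUENCE_LEN, EMBED_SIZE), where BATCH_SIZE is 64 and EMBED_SIZE is 50 because we use 50-dimensional GloVe vectors. The sentences are shuffled at the start of every training epoch, and each batch contains 64 sentences, each represented as a sequence of GloVe word vectors. Vocabulary words that have no GloVe embedding are represented by a zero vector. We build two generator instances, one for the training data and one for the test data, containing 70% and 30% of the original dataset respectively: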

BATCH_SIZE = 64
NUM_EPOCHS = 20
def sentence_generator(X, embeddings, batch_size):
    while True:
        # loop once per epoch
        num_recs = X.shape[0]
        indices = np.random.permutation(np.arange(num_recs))
        num_batches = num_recs // batch_size
        for bid in range(num_batches):
            sids = indices[bid * batch_size: (bid + 1) * batch_size]
            Xbatch = embeddings[X[sids, :]]
            yield Xbatch, Xbatch

# split sentences into training and test
train_size = 0.7
Xtrain, Xtest = train_test_split(sent_wids, train_size=train_size)
print("number of sentences: ", len(sent_wids))
print(Xtrain.shape, Xtest.shape)
# define training and test generators
train_gen = sentence_generator(Xtrain, embeddings, BATCH_SIZE)
test_gen = sentence_generator(Xtest, embeddings, BATCH_SIZE)
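The key step in the generator is the lookup embeddings[X[sids, :]], which uses NumPy integer-array indexing to turn a batch of word-ID rows into a batch of vector sequences. The sketch below repeats the generator on tiny random arrays to confirm the resulting shape; toy_embeddings and toy_X are invented stand-ins for the GloVe matrix and the padded ID matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(10, 4))   # 10 word ids, 4-dimensional vectors
toy_X = rng.integers(0, 10, size=(7, 5))    # 7 sentences, 5 word ids each

def sentence_generator(X, embeddings, batch_size):
    while True:  # one pass of the inner loop per epoch
        indices = np.random.permutation(X.shape[0])
        for bid in range(X.shape[0] // batch_size):
            sids = indices[bid * batch_size:(bid + 1) * batch_size]
            Xbatch = embeddings[X[sids, :]]  # (batch, seq_len) -> (batch, seq_len, embed)
            yield Xbatch, Xbatch             # the input is also the reconstruction target

gen = sentence_generator(toy_X, toy_embeddings, batch_size=3)
xb, yb = next(gen)
print(xb.shape)
```

Because the generator yields the same array as both input and target, the model is trained to reproduce its own input, which is what makes it an autoencoder.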

2.2 Model construction and training

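Define the autoencoder, which consists of an encoder LSTM and a decoder LSTM. The encoder LSTM reads a tensor of shape (BATCH_SIZE, SEQUENCE_LEN, EMBED_SIZE) representing a batch of sentences. Each sentence is a padded sequence of fixed length SEQUENCE_LEN, and each word is a 50-dimensional GloVe vector. The output dimension of the encoder LSTM is set by the hyperparameter LATENT_SIZE, which is the size of the sentence vector obtained from the encoder part of the trained autoencoder; the vector space of dimension LATENT_SIZE is the latent space that encodes the meaning of a sentence. The encoder outputs one vector of size LATENT_SIZE per sentence, so for a batch the output tensor has shape (BATCH_SIZE, LATENT_SIZE). This tensor is fed into a RepeatVector layer, which copies the vector across the whole sequence and produces a tensor of shape (BATCH_SIZE, SEQUENCE_LEN, LATENT_SIZE). That tensor is fed into the decoder LSTM, whose output dimension is EMBED_SIZE, so the final output has shape (BATCH_SIZE, SEQUENCE_LEN, EMBED_SIZE), the same shape as the input. The code wraps both LSTMs in Bidirectional with merge_mode="sum", which adds the forward and backward outputs and therefore leaves these shapes unchanged: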

# define autoencoder network
inputs = Input(shape=(SEQUENCE_LEN, EMBED_SIZE), name="input")
encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum",
                        name="encoder_lstm")(inputs)
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = Bidirectional(LSTM(EMBED_SIZE, return_sequences=True),
                        merge_mode="sum",
                        name="decoder_lstm")(decoded)
autoencoder = Model(inputs, decoded)
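The role of the repeater layer is easiest to see on a small array. The sketch below reproduces what RepeatVector does to a batch of sentence vectors using plain NumPy; the sizes are toy values chosen for readability, not the constants used in this article, and no Keras layer is involved.

```python
import numpy as np

BATCH, SEQ_LEN, LATENT = 2, 4, 3   # toy sizes, not the article's constants

# One latent vector per sentence, as produced by the encoder.
encoded = np.arange(BATCH * LATENT, dtype=float).reshape(BATCH, LATENT)

# Equivalent of RepeatVector(SEQ_LEN): copy each sentence vector SEQ_LEN times.
repeated = np.repeat(encoded[:, np.newaxis, :], SEQ_LEN, axis=1)

print(encoded.shape, "->", repeated.shape)
print(repeated[0])   # every time step holds the same latent vector
```

The decoder therefore receives the identical sentence vector at every time step and has to rely on its own recurrent state to produce a different word vector at each position.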

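Compile the model with the Adam optimizer and the mean squared error (MSE) loss. We choose MSE because we want to reconstruct a sentence with a similar meaning, that is, one that is close to the original sentence in embedding space. The loss itself is computed between the input GloVe vectors and the reconstructed vectors, both of dimension EMBED_SIZE: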

autoencoder.compile(optimizer="adam", loss="mse")
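For intuition about what the loss measures, the MSE between a target tensor and a reconstruction can be worked out by hand. The sketch below does so in NumPy on a single two-word "sentence" with two-dimensional vectors; the values are invented, and the helper mse is a plain mean over all elements, which is assumed here to match what Keras reports for unweighted, equal-sized batches.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared element-wise differences.
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([[[1.0, 0.0], [0.0, 1.0]]])   # shape (1, 2, 2)
y_pred = np.array([[[0.5, 0.0], [0.0, 0.0]]])

# Squared errors are 0.25, 0, 0 and 1, so the mean over 4 elements is 0.3125.
print(mse(y_true, y_pred))
```

A low loss means each reconstructed vector lies close to the corresponding input vector, even when it does not coincide exactly with any word in the vocabulary.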

Train the autoencoder for 20 epochs:

# train
num_train_steps = len(Xtrain) // BATCH_SIZE
num_test_steps = len(Xtest) // BATCH_SIZE
history = autoencoder.fit(train_gen,
                          steps_per_epoch=num_train_steps,
                          epochs=NUM_EPOCHS,
                          validation_data=test_gen,
                          validation_steps=num_test_steps)
# plot the training and validation loss curves
plt.plot(history.history["loss"], label = "training loss")
plt.plot(history.history["val_loss"], label = "validation loss")
plt.xlabel("epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()
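Because the generators never terminate, fit has to be told how many batches make up one epoch, which is why the step counts are computed with integer division. The short sketch below spells out that arithmetic for a made-up corpus size; steps_per_epoch is a hypothetical helper and 1000 is not a figure from the article.

```python
def steps_per_epoch(num_sentences, batch_size):
    """Full batches per epoch, plus the number of sentences left over."""
    return num_sentences // batch_size, num_sentences % batch_size

steps, leftover = steps_per_epoch(1000, 64)   # 1000 is a made-up corpus size
print(steps, leftover)
```

The leftover sentences are skipped in a given epoch, but since the generator draws a new permutation each epoch, a different set is left out each time.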

The figure below shows how the loss on the training and validation data changes; the loss decreases gradually as the model learns:

Figure: Training process

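Because the input is a matrix of embeddings, the output is a matrix of word embeddings too. The embedding space is continuous while the vocabulary is discrete, so not every output embedding corresponds to a word. The best we could do to reconstruct the original text is to find the word closest to each output embedding, so we evaluate the autoencoder in a different way.
Since the goal of the autoencoder is to produce a good latent representation, we compare the latent vector the encoder produces for the original input with the one it produces for the autoencoder's output. First, extract the encoder component: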

# collect autoencoder predictions for test set
test_inputs, test_labels = next(test_gen)
preds = autoencoder.predict(test_inputs)
# extract encoder part from autoencoder
encoder = Model(autoencoder.input,
                autoencoder.get_layer("encoder_lstm").output)

3. Model testing

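Run the autoencoder on the test set to obtain the predicted embeddings. Then pass both the input embeddings and the predicted embeddings through the encoder to produce their sentence vectors, and compare the two vectors using cosine similarity. A cosine similarity close to 1 means the two vectors are highly similar, while a value close to 0 means they have little in common.
Test on a random subset of 500 test sentences, computing the cosine similarity between the sentence vector of the source embeddings and that of the target embeddings produced by the autoencoder: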

def compute_cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x, 2) * np.linalg.norm(y, 2))

# compute difference between vector produced by original and autoencoded
k = 500
cosims = np.zeros((k))
i = 0
for bid in range(num_test_steps):
    xtest, ytest = next(test_gen)
    ytest_ = autoencoder.predict(xtest)
    Xvec = encoder.predict(xtest)
    Yvec = encoder.predict(ytest_)
    for rid in range(Xvec.shape[0]):
        if i >= k:
            break
        cosims[i] = compute_cosine_similarity(Xvec[rid], Yvec[rid])
        if i <= 10:
            print(cosims[i])
        i += 1
    if i >= k:
        break

plt.hist(cosims, bins=10, density=True)
plt.xlabel("cosine similarity")
plt.ylabel("frequency")
plt.show()

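The first cosine similarity values are shown below (the loop prints indices 0 through 10, so 11 values appear). The vectors in each pair are highly similar: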

0.9826117753982544
0.983581006526947
0.9853078126907349
0.9853724241256714
0.9793808460235596
0.9805294871330261
0.9780978560447693
0.9855653643608093
0.9836362600326538
0.9835963845252991
0.9832736253738403
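As a reference for reading these numbers, the sketch below evaluates the same cosine similarity formula on three hand-picked vector pairs. The vectors are invented for illustration, and cosine_similarity is a standalone copy of the formula that relies on the default 2-norm of np.linalg.norm.

```python
import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([1.0, 2.0, 3.0])
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

same_direction = cosine_similarity(a, 2 * a)   # scaling does not change the angle
orthogonal = cosine_similarity(e1, e2)         # perpendicular vectors
opposite = cosine_similarity(a, -a)            # vectors pointing in opposite directions

print(same_direction, orthogonal, opposite)
```

Cosine similarity ignores vector length and depends only on direction, so values around 0.98 indicate that the latent vector of each reconstruction points in nearly the same direction as that of its source sentence.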