09.C2W4.Word Embeddings with Neural Networks


Overview
Basic Word Representations
- Integers
- One-hot vectors
Word Embeddings
- Meaning as vectors
- Word embedding vectors
Word embedding process
Word Embedding Methods
- Basic word embedding methods
- Advanced word embedding methods
Continuous Bag-of-Words Model
- Center word prediction: rationale
- Creating a training example
- From corpus to training
Cleaning and Tokenization
- Cleaning and tokenization matters
- Example in Python
  - corpus
  - libraries
  - code
Sliding Window of Words in Python
Transforming Words into Vectors
- Transforming center words into vectors
- Transforming context words into vectors
- Final prepared training set
Architecture of the CBOW Model
Dimensions
- single input
- batch input
Activation Functions
- Rectified Linear Unit (ReLU)
- Softmax
- Softmax: example
Training a CBOW Model: Cost Function
- Loss
- Cross-entropy loss
Training a CBOW Model: Forward Propagation
- Forward propagation
- Cost
Training a CBOW Model: Backpropagation and Gradient Descent
- Backpropagation
- Gradient descent
Extracting Word Embedding Vectors
- option 1
- option 2
- option 3
Evaluating Word Embeddings
- Intrinsic evaluation
- Extrinsic Evaluation


Overview

Some basic applications of word embeddings:
[Figure]
Advanced applications:

Learning objectives (a working knowledge of neural networks is assumed):
●Identify the key concepts of word representations
●Generate word embeddings
●Prepare text for machine learning
●Implement the continuous bag-of-words model

Basic Word Representations

Integers

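Encode each word with a unique integer. The advantage is simplicity:
[Figure]
The drawback is that integers cannot express the semantic information of words.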

One-hot vectors

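Represent each word with a 0/1 vector whose length equals the size of the vocabulary:
[Figure]
Each word is represented by a 1 in its own column and 0 in every other column.

Integer codes and one-hot vectors can be converted into each other.

One-hot encoding is simple and implies no ordering among words,
but it still carries no semantic information.

When the vocabulary is large, the vectors also become very long:
[Figure]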
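The conversion between integer codes and one-hot vectors can be sketched as follows. This is a minimal illustration, assuming a toy five-word vocabulary; the helper names `index_to_one_hot` and `one_hot_to_index` are made up for this example.

```python
import numpy as np

# Toy vocabulary (an assumption for illustration), sorted alphabetically.
vocab = ["am", "because", "happy", "i", "learning"]
word_to_index = {word: idx for idx, word in enumerate(vocab)}

def index_to_one_hot(index, vocab_size):
    """Return a vector of zeros with a single 1 at position `index`."""
    one_hot = np.zeros(vocab_size)
    one_hot[index] = 1.0
    return one_hot

def one_hot_to_index(one_hot):
    """Recover the integer code: the position of the single 1."""
    return int(np.argmax(one_hot))

happy_vec = index_to_one_hot(word_to_index["happy"], len(vocab))
print(happy_vec.tolist())           # [0.0, 0.0, 1.0, 0.0, 0.0]
print(one_hot_to_index(happy_vec))  # 2
```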

Word Embeddings

Meaning as vectors

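Can a vector carry meaning? It can. Here is a low-dimensional illustration:
[Figure]
The figure above is a sentiment-scoring example: it shows a few words together with their sentiment scores.

There are 8 words: spider, boring, kitten, happy, anger, paper, excited, rage.
They are grouped into 4 pairs. Each word is followed by a sentiment score in parentheses that indicates how strongly the word is associated with a sentiment.
Pair 1: spider (-2.52), boring (-2.08). These scores are negative, indicating an association with negative sentiment.
Pair 2: kitten (-1.53), happy (-0.91). These scores are mildly negative, which may indicate mildly negative or neutral sentiment.
Pair 3: anger (0.03), paper (1.09). The scores run from near zero to positive, indicating neutral or positive sentiment.
Pair 4: excited (2.31), rage. No score is given for rage, but from context it is probably associated with strongly negative sentiment.
A ruler at the bottom of the figure runs from -2 to 2 and is divided into three regions: negative, 0 (neutral), and positive.
A y-axis can also be added to represent how abstract or concrete a word is, for example:
[Figure]
Such a representation does lose some precision: spider and snake, for example, end up at the same point, which is unreasonable.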

Word embedding vectors

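Word embedding vectors have two advantages:
Low dimension (compared with one-hot encoding)
Embed meaning:
[Figure]
Note:
One-hot vectors and word embedding vectors are both word vectors, but in many contexts the latter are simply called "word vectors" or word embeddings.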

Word embedding process

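The corpus matters a great deal when generating word embeddings. If you want embeddings for the vocabulary of a specific domain, include as much text from that domain as possible, because a word's meaning depends heavily on its context: apple is a fruit in an agricultural corpus and a company in a technology corpus.
The embedding method here is mainly a machine learning model trained in a self-supervised way.
The overall process looks roughly like this:
[Figure]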

Word Embedding Methods

Basic word embedding methods

●word2vec (Google, 2013)
○Continuous bag of words (CBOW)
○Continuous skip gram / Skip gram with negative sampling (SGNS)
●Global Vectors (GloVe) (Stanford, 2014)
●fastText (Facebook, 2016)
○Supports out of vocabulary (OOV) words
○Very fast to train

Advanced word embedding methods

Deep learning, contextual embeddings
●BERT (Google, 2018)
●ELMo (Allen Institute for AI, 2018)
●GPT-2 (OpenAI, 2019)
…
These are all pretrained models, which can be fine-tuned.

Continuous Bag-of-Words Model

[Figure]

Center word prediction: rationale

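Word vectors are a by-product of the CBOW task. The main task is prediction: predicting the center word from its context:
[Figure]
This works because a word is related to its context. In the example in the figure above, given a large enough corpus, the model will learn to predict that the missing word is related to dogs.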

Creating a training example

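[Figure]
Center word: in this example, the center word is “happy”.
Context words: the words surrounding the center word, which provide the context. In this example they are “I”, “am”, “because”, “I” (“I” appears twice).
Window size: the total number of words in the window, center word included. In this example the window size is 5.
Context half-size: how far the window extends to the left and to the right of the center word. In this example the context half-size is 2, so the window extends 2 words on each side of the center word.
Window: the center word together with the context words on both sides, as determined by the window size and the context half-size.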

From corpus to training

Applying the training example above to the sentence I am happy because I am learning, with a window size of 5, gives:

Context words	Center word
I am because I	happy
am happy I am	because
happy because am learning	I

[Figure]

Cleaning and Tokenization

Cleaning and tokenization matters

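Data cleaning is an important preprocessing step. Its goal is to improve the quality of the text so that it is better suited to later analysis and model training.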
●Letter case
●Punctuation
●Numbers
●Special characters
●Special words
[Figure]

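Letter case:
Cleaning may include converting all text to lowercase (or uppercase) to remove the effect of case differences.
For example, “Hello” and “hello” are both converted to “hello”, so the model does not treat them as two different words.

Punctuation:
Cleaning may involve deleting or replacing the punctuation marks in the text, because they may be unimportant for some NLP tasks or may interfere with the model's analysis.
For example, removing the exclamation mark and question mark from “Hello! How are you?” gives “Hello How are you”.

Numbers:
Cleaning usually means replacing or deleting the numbers in the text, because numbers may be meaningless for some text-analysis tasks or may add noise.
For example, deleting the “3” from “I have 3 apples” gives “I have apples”.

Special characters:
Special characters are non-alphanumeric symbols such as @, #, $, %. Removing them simplifies the text and keeps them from interfering with the model.
For example, removing “@” and “.” from “email@example.com” gives “emailexamplecom”.

Special words:
Cleaning may include removing words that are common but unhelpful for the analysis, such as stop words (“and”, “the”, etc.) or particular industry terms.
For example, stop words such as “the” and “over” are removed from “The quick brown fox jumps over the lazy dog”.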
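The cleaning steps listed above can be combined into one small function. The sketch below uses only the standard library; the stop-word set is a tiny made-up list for illustration, and `clean_text` is a hypothetical helper name. Unlike the “emailexamplecom” example, it replaces unwanted characters with spaces rather than deleting them outright.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "and", "over"}

def clean_text(text):
    """Lowercase, drop numbers and punctuation/special characters, remove stop words."""
    text = text.lower()                    # letter case
    text = re.sub(r"[0-9]+", " ", text)    # numbers
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation and special characters
    tokens = text.split()                  # simple whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]  # special words

print(clean_text("The quick brown fox jumps over the lazy dog!"))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
print(clean_text("I have 3 apples"))
# ['i', 'have', 'apples']
```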

Example in Python

corpus

[Figure]

libraries

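# pip install nltk
# pip install emoji
import re  # regular expressions, used below to normalize punctuation
import nltk
from nltk.tokenize import word_tokenize
import emoji
nltk.download('punkt')  # download the pre-trained Punkt tokenizer for English (recent NLTK releases may need 'punkt_tab' instead)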

code

corpus = 'Who ❤️ "word embeddings" in 2020? I do!!!'
data = re.sub(r'[,!?;-]+', '.', corpus)  # replace runs of interrupting punctuation with a period

Result:
Who ❤️ "word embeddings" in 2020. I do.

data = nltk.word_tokenize(data)  # tokenize the string into words

Result:
['Who', '❤️', '``', 'word', 'embeddings', "''", 'in', '2020', '.', 'I', 'do', '.']

# keep alphabetic tokens, periods and emoji, and lowercase everything
# (emoji.get_emoji_regexp() was removed in emoji 2.0; emoji.is_emoji() is used here instead)
data = [ch.lower() for ch in data
        if ch.isalpha() or ch == '.' or emoji.is_emoji(ch)]

Result:
['who', '❤️', 'word', 'embeddings', 'in', '.', 'i', 'do', '.']

Sliding Window of Words in Python

def get_windows(words, C):
    i = C
    while i < len(words) - C:
        center_word = words[i]
        context_words = words[(i - C):i] + words[(i + 1):(i + C + 1)]
        yield context_words, center_word
        i += 1

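[Figure]
As the figure shows, i is initialized to C = 2, which is the index of the first center word, happy. The loop stops once i reaches len(words) - C, so the last center word is the third-to-last word. i advances by one word per iteration.
Finally, yield turns the function into a generator, so it hands back one (context words, center word) pair per iteration.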

for x, y in get_windows(['i', 'am', 'happy', 'because', 'i', 'am', 'learning'], 2):
    print(f'{x}\t{y}')

Result:
['i', 'am', 'because', 'i']	happy
['am', 'happy', 'i', 'am']	because
['happy', 'because', 'am', 'learning']	i

Transforming Words into Vectors

With the context words and center words in hand, the next step is to turn them into vectors.

Transforming center words into vectors

Corpus: I am happy because I am learning
Vocabulary: am, because, happy, I, learning
Represent each center word as a one-hot vector:
[Figure]

Transforming context words into vectors

Represent the context by the average of the one-hot vectors of its words. For the center word happy:
[Figure]

Final prepared training set

Context words	Context words vector	Center word	Center word vector
I am because I	[0.25; 0.25; 0; 0.5; 0]	happy	[0; 0; 1; 0; 0]
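The row above can be reproduced with a few lines of NumPy. This is a minimal sketch: it assumes the lowercased, alphabetically sorted vocabulary, and `word_to_one_hot` and `context_to_vector` are illustrative helper names.

```python
import numpy as np

# Lowercased vocabulary in alphabetical order (assumed to match the table above).
vocab = ["am", "because", "happy", "i", "learning"]
word_to_index = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

def word_to_one_hot(word):
    """One-hot vector of length V for a single word."""
    vec = np.zeros(V)
    vec[word_to_index[word]] = 1.0
    return vec

def context_to_vector(context_words):
    """Average of the one-hot vectors of the context words."""
    return np.mean([word_to_one_hot(w) for w in context_words], axis=0)

x = context_to_vector(["i", "am", "because", "i"])  # context words vector
y = word_to_one_hot("happy")                        # center word vector
print(x.tolist())  # [0.25, 0.25, 0.0, 0.5, 0.0]
print(y.tolist())  # [0.0, 0.0, 1.0, 0.0, 0.0]
```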

Architecture of the CBOW Model

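[Figure]
CBOW is a shallow feed-forward neural network with an input layer, a single hidden layer, and an output layer. Each layer after the input has weights and a bias, followed by an activation function that applies a non-linear transformation.
Input layer: receives the input. In this example the text is “I am happy because I am learning”; each training example is fed in as a numeric vector, the average of the one-hot vectors of its context words.

Context words and center word: the context words are the model's input, and the center word is the target the model learns to predict.

W1, W2 (weights): the weight matrices of the network. W1 connects the input layer to the hidden layer, and W2 connects the hidden layer to the output layer.

b1, b2 (biases): the bias vectors added to each layer's linear transformation before its activation function.

Hidden layer: follows the input layer; its neurons transform the input and extract features.

Output layer: follows the hidden layer; it has one neuron per word in the vocabulary and produces the final prediction.

Vector: the input text is converted into fixed-size numeric vectors so that the network can process it.

ReLU (Rectified Linear Unit): a common activation function, used here in the hidden layer to add non-linearity and help the model learn more complex features.

softmax: the activation function of the output layer; in multi-class classification it turns the outputs into a probability distribution.

V = 5: the vocabulary size. Because one-hot encoding is used, it is also the dimension of the input vector.

X: the input, i.e. the context-word vector (or, for a batch, the matrix of such vectors).

Other hyperparameters can be configured as well, for example N, the word embedding size.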

Dimensions

single input

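[Figure]
If the inputs are row vectors rather than column vectors, transpose the matrices and reverse the order of the factors in each matrix product.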

batch input

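The example above used a single training example as input to show the dimensions of each part of CBOW. In practice, to speed things up, we usually pass in a batch of examples at a time. batch_size is a hyperparameter; the figure below shows the case batch_size = m:
[Figure]
The m column vectors are stacked side by side to form the input matrix.
The bias is written here as a capital B. The earlier b has size N×1; when it is added to a matrix, NumPy broadcasts it automatically to size N×m.

Note how the vectors in the input matrix line up with the predictions in the output matrix (the green parts):
[Figure]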

Activation Functions

Rectified Linear Unit (ReLU)

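There is not much to say about ReLU. It has many variants, for example Leaky ReLU.
The input passes through $W_1$ and $b_1$ and then through ReLU:
$z_1 = W_1 x + b_1\\ h= ReLU(z_1)$
[Figure]
The ReLU formula is:
$ReLU(x)=\max(0,x)$
Its graph is:

Below are the h values for a set of $z_1$ values: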
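A quick NumPy sketch of ReLU applied element-wise. The $z_1$ values here are made up for illustration and are not the ones from the original figure.

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(0, z)

z1 = np.array([-1.5, 0.0, 2.3, -0.2, 4.0])  # made-up values for illustration
h = relu(z1)
print(h.tolist())  # [0.0, 0.0, 2.3, 0.0, 4.0]
```

Negative entries are clipped to 0 and non-negative entries pass through unchanged.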

Softmax

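Softmax takes as input a linear transformation of the hidden layer's output:
$z = W_2 h + b_2\\ \hat y=softmax(z)$
Softmax maps a set of real numbers to a set of numbers between 0 and 1 that sum to 1, so they can be read as probabilities:
[Figure]
For the CBOW model, the output is the predicted probability of each word in the vocabulary being the center word.

The formula for $\hat y_i$ is given below. In effect it normalizes each $e^{z_i}$ so that the probabilities sum to 1.
$\hat y_i=\cfrac{e^{z_i}}{\sum_{j=1}^Ve^{z_j}}$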
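The formula can be sketched in NumPy as follows. The scores in `z` are made up for illustration. Subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result; it is an addition of this sketch, not something stated in the text.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D score vector; subtracting max(z) avoids overflow."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

vocab = ["am", "because", "happy", "i", "learning"]
z = np.array([0.5, 1.0, 3.0, 0.2, -1.0])  # made-up scores, one per vocabulary word
y_hat = softmax(z)

print(np.round(y_hat, 3))                # probabilities, one per word
predicted = vocab[int(np.argmax(y_hat))]
print(predicted)                         # happy
```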

Softmax: example

The final prediction is happy, because it has the largest probability.
[Figure]

Training a CBOW Model: Cost Function

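Loss

In machine learning, "loss" usually refers to the difference, or error, between a model's predictions and the actual values. Training aims to minimize the loss function so that the predictions get closer to the true values.

More precisely, the loss function is a measure of model performance that quantifies the gap between predictions and true values. Different tasks use different loss functions. For example:
For classification, a common choice is the cross-entropy loss.
For regression, a common choice is the mean squared error (MSE).
[Figure]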

Cross-entropy loss

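The loss function used by CBOW is:
$J=-\sum_{k=1}^Vy_k\log \hat y_k$
The true value and the prediction have the following form:
[Figure]
For the corpus:
I am happy because I am learning
the center word of the first five words is happy. Suppose its prediction and true value are as follows:

Following the formula, take the logarithm of the prediction, multiply it element-wise by the true value, and sum:

When the prediction is close to the true value, the loss is small.
Now consider the case where the model predicts am as the center word:
[Figure]
The loss computation above can be simplified to:
$J=-\log \hat y_{actual\space word}$
For example:

J=-log 0.01=4.61. Note that log here means the natural logarithm, ln.
From the simplified formula we can plot the loss as a function of the predicted probability:

The larger the predicted probability of the correct center word, the smaller the loss; conversely, the smaller that probability, the larger the loss.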
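A small numeric check of the cross-entropy formula. The two prediction vectors are made up for illustration: one puts most of the probability on the true center word, and the other gives it only 0.01.

```python
import numpy as np

def cross_entropy_loss(y, y_hat):
    """J = -sum_k y_k * log(y_hat_k), using the natural logarithm."""
    return -np.sum(y * np.log(y_hat))

y = np.array([0, 0, 1, 0, 0])                            # one-hot: true center word is index 2
good_pred = np.array([0.05, 0.05, 0.80, 0.05, 0.05])     # made-up prediction, mostly correct
bad_pred = np.array([0.90, 0.03, 0.01, 0.03, 0.03])      # made-up prediction, mostly wrong

print(round(cross_entropy_loss(y, good_pred), 2))  # 0.22
print(round(cross_entropy_loss(y, bad_pred), 2))   # 4.61
```

Because y is one-hot, only the term for the true word survives, which is why the loss reduces to minus the log of that one probability.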

Training a CBOW Model: Forward Propagation

The whole training process consists of:
●Forward propagation
●Cost
●Backpropagation and gradient descent

Forward propagation

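Forward propagation was in fact already covered in the CBOW architecture. Try to describe the figure below in your own words (note that it uses batch mode):
[Figure]
Can you write out the formulas below?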
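One way to write the batch forward pass, sketched in NumPy under the column-vector convention used in these notes (X is V×m, W1 is N×V, W2 is V×N). The sizes and random values are placeholders, and the function names are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax_columns(z):
    """Column-wise softmax: each column of z holds the scores for one example."""
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

def forward_propagation(X, W1, b1, W2, b2):
    Z1 = W1 @ X + b1             # (N, m); b1 of shape (N, 1) is broadcast across the m columns
    H = relu(Z1)                 # (N, m)
    Z2 = W2 @ H + b2             # (V, m); b2 of shape (V, 1) is broadcast the same way
    Y_hat = softmax_columns(Z2)  # (V, m); each column sums to 1
    return Z1, H, Z2, Y_hat

V, N, m = 5, 3, 4                # illustrative sizes: vocabulary, embedding, batch
rng = np.random.default_rng(0)
X = rng.random((V, m))
W1, b1 = rng.standard_normal((N, V)), np.zeros((N, 1))
W2, b2 = rng.standard_normal((V, N)), np.zeros((V, 1))

Z1, H, Z2, Y_hat = forward_propagation(X, W1, b1, W2, b2)
print(H.shape, Y_hat.shape)      # (3, 4) (5, 4)
```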

Cost

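The terms “cost” and "loss" are both used for functions that measure the difference between a model's predictions and the actual values. They are often used interchangeably, but strictly speaking they differ: the loss function applies to a single example, while the cost function applies to the whole dataset. In practice, “minimizing the loss” usually means minimizing the cost function, because that is the overall objective optimized during training.
In this section, the cost is the average loss over a batch. For a batch of m examples:
$J_{batch}=-\cfrac{1}{m}\sum_{i=1}^m\sum_{j=1}^Vy_j^{(i)}\log \hat y_j^{(i)}$
Since the inner sum, together with the minus sign, is the loss $J^{(i)}$ of example i, this simplifies to:
$J_{batch}=\cfrac{1}{m}\sum_{i=1}^mJ^{(i)}$
[Figure]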
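The batch cost can be sketched as the mean of the per-example cross-entropy losses. The matrices below are made-up values with V = 3 and m = 2, with one example per column.

```python
import numpy as np

def batch_cost(Y, Y_hat):
    """Mean cross-entropy over the m columns (examples) of Y and Y_hat."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat)) / m

# Column 0: true word is index 2; column 1: true word is index 0.
Y = np.array([[0, 1],
              [0, 0],
              [1, 0]])
Y_hat = np.array([[0.10, 0.50],
                  [0.10, 0.25],
                  [0.80, 0.25]])

per_example = -np.log(np.array([0.80, 0.50]))  # J^(1), J^(2) from the simplified formula
print(batch_cost(Y, Y_hat), per_example.mean())  # the two values agree
```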

Training a CBOW Model: Backpropagation and Gradient Descent

The goal of training is to minimize the cost. The batch cost is a function of four parameters:
$J_{batch}=f(W_1,W_2,b_1,b_2)$
Backpropagation calculates the partial derivatives of the cost with respect to the weights and biases.
Gradient descent then uses them to update the weights and biases.

Backpropagation

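$\cfrac{\partial J_{batch}}{\partial W_1}=\cfrac{1}{m}\left(\left(W_2^\intercal (\hat Y-Y)\right)\odot \mathbb{1}[Z_1>0]\right)X^\intercal$
$\cfrac{\partial J_{batch}}{\partial W_2}=\cfrac{1}{m}(\hat Y-Y)H^\intercal$
$\cfrac{\partial J_{batch}}{\partial b_1}=\cfrac{1}{m}\left(\left(W_2^\intercal (\hat Y-Y)\right)\odot \mathbb{1}[Z_1>0]\right)1_m^\intercal$
$\cfrac{\partial J_{batch}}{\partial b_2}=\cfrac{1}{m}(\hat Y-Y)1_m^\intercal$
Here $Z_1=W_1X+B_1$, $\odot$ is element-wise multiplication, and $\mathbb{1}[Z_1>0]$ (1 where $Z_1>0$, 0 elsewhere) is the derivative of ReLU. $1_m$ is a row vector of m ones; multiplying a matrix by its transpose sums each row of the matrix:
[Figure]
In practice this is done with NumPy's sum function: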

import numpy as np
# code to initialize matrix a omitted
np.sum(a, axis=1, keepdims=True)  # sum each row, keeping the result as a column

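Backpropagation applies the chain rule to compute the partial derivatives. The derivation is not worked through here; existing functions can be used to do the computation.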
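For readers who want to see the chain rule in code, here is a minimal NumPy sketch; it is not the course's reference implementation. It uses the exact derivative, in which ReLU contributes an element-wise mask `Z1 > 0`. The sizes, random data and function names are assumptions for illustration.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    Z1 = W1 @ X + b1
    H = np.maximum(0, Z1)                             # ReLU
    Z2 = W2 @ H + b2
    E = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
    Y_hat = E / E.sum(axis=0, keepdims=True)          # column-wise softmax
    return Z1, H, Y_hat

def cost(Y, Y_hat):
    return -np.sum(Y * np.log(Y_hat)) / Y.shape[1]    # mean cross-entropy

def backward(X, Y, W2, Z1, H, Y_hat):
    m = X.shape[1]
    dZ2 = (Y_hat - Y) / m                             # (V, m)
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)                     # (N, m); mask is the ReLU derivative
    grad_W1 = dZ1 @ X.T                               # (N, V)
    grad_b1 = np.sum(dZ1, axis=1, keepdims=True)      # (N, 1); row sums
    grad_W2 = dZ2 @ H.T                               # (V, N)
    grad_b2 = np.sum(dZ2, axis=1, keepdims=True)      # (V, 1); row sums
    return grad_W1, grad_b1, grad_W2, grad_b2

V, N, m = 5, 3, 4                                     # illustrative sizes
rng = np.random.default_rng(0)
X = rng.random((V, m))
Y = np.eye(V)[:, [2, 0, 4, 1]]                        # one-hot targets, one per column
W1, b1 = rng.standard_normal((N, V)), rng.standard_normal((N, 1))
W2, b2 = rng.standard_normal((V, N)), rng.standard_normal((V, 1))

Z1, H, Y_hat = forward(X, W1, b1, W2, b2)
grad_W1, grad_b1, grad_W2, grad_b2 = backward(X, Y, W2, Z1, H, Y_hat)
print(grad_W1.shape, grad_b1.shape, grad_W2.shape, grad_b2.shape)
# (3, 5) (3, 1) (5, 3) (5, 1)
```

Each gradient has the same shape as the parameter it belongs to, which is a quick sanity check before plugging the gradients into an update rule.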

Gradient descent

Hyperparameter: learning rate $\alpha$
$W_1:= W_1-\alpha\cfrac{\partial J_{batch}}{\partial W_1}$
$W_2:= W_2-\alpha\cfrac{\partial J_{batch}}{\partial W_2}$
$b_1:= b_1-\alpha\cfrac{\partial J_{batch}}{\partial b_1}$
$b_2:= b_2-\alpha\cfrac{\partial J_{batch}}{\partial b_2}$
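The four update rules share one pattern, so they can be sketched as a single helper. The parameter and gradient values below are placeholders, and `gradient_descent_step` is an illustrative name.

```python
import numpy as np

def gradient_descent_step(params, grads, alpha):
    """Return updated parameters: theta := theta - alpha * dJ/dtheta for each entry."""
    return {name: params[name] - alpha * grads[name] for name in params}

# Placeholder parameters and gradients (only two of the four, for brevity).
params = {"W1": np.ones((2, 3)), "b1": np.zeros((2, 1))}
grads = {"W1": np.full((2, 3), 0.5), "b1": np.full((2, 1), -1.0)}

params = gradient_descent_step(params, grads, alpha=0.1)
print(params["W1"][0, 0], params["b1"][0, 0])  # 0.95 0.1
```

A larger learning rate takes bigger steps per update; a positive gradient lowers the parameter and a negative gradient raises it.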

Extracting Word Embedding Vectors

There are 3 options.

option 1

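Use each column of $W_1$ as the embedding (a column vector) of one word in the vocabulary. $W_1$ has V columns, exactly one per vocabulary word, in the same word order as the input X (see the blue part):
[Figure]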

option 2

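Use each row of $W_2$ as the embedding (a row vector) of one word in the vocabulary. $W_2$ has V rows, exactly one per vocabulary word, in the same word order as the input X (see the blue part):
[Figure]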

option 3

Combine the two to obtain an N×V matrix $W_3$, and use each column as the embedding (a column vector) of one word in the vocabulary:
$W_3=0.5(W_1+W_2^\intercal)$
[Figure]
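The three options can be sketched in NumPy as below. The weight matrices are random stand-ins for trained weights, and each result is stored with one row per word (V×N) for convenient lookup, so option 3 here is the transpose of $W_3$.

```python
import numpy as np

V, N = 5, 3                                  # illustrative vocabulary and embedding sizes
rng = np.random.default_rng(1)
W1 = rng.standard_normal((N, V))             # stand-in for the trained input-to-hidden weights
W2 = rng.standard_normal((V, N))             # stand-in for the trained hidden-to-output weights

emb_option1 = W1.T                           # columns of W1, stored as rows
emb_option2 = W2                             # rows of W2
emb_option3 = 0.5 * (W1.T + W2)              # average of the two; equals (0.5 * (W1 + W2.T)).T

vocab = ["am", "because", "happy", "i", "learning"]
word_to_index = {w: i for i, w in enumerate(vocab)}
happy_embedding = emb_option3[word_to_index["happy"]]
print(happy_embedding.shape)                 # (3,)
```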

Evaluating Word Embeddings

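There are two main kinds: intrinsic evaluation and extrinsic evaluation. Intrinsic evaluation tells you how well the embeddings capture relationships between words, while extrinsic evaluation tells you how well they work in a real application. Both matter, because embeddings can look good technically (intrinsic evaluation) and still fail to support the end application (extrinsic evaluation), in which case they are probably not a success. In practice the two are usually combined to get a full picture of performance.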

Intrinsic evaluation

Analogies
Clustering
Visualization
Analogies mainly test relationships between words. There are three common cases:

Analogies	example
Semantic analogies	“France” is to “Paris” as “Italy” is to <?>
Syntactic analogies	“seen” is to “saw” as “been” is to <?>
Ambiguity	“wolf” is to “pack” as “bee” is to <?> → swarm? colony?
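An analogy test is commonly run as vector arithmetic followed by a nearest-neighbour search. The sketch below uses hand-made 2-D toy vectors, not trained embeddings, chosen so that the "capital of" relation is the same offset for every country. `solve_analogy` is an illustrative helper name.

```python
import numpy as np

# Hand-made toy vectors (NOT trained embeddings): x = which country, y = 1 for capitals.
embeddings = {
    "France": np.array([1.0, 0.0]), "Paris": np.array([1.0, 1.0]),
    "Italy": np.array([2.0, 0.0]), "Rome": np.array([2.0, 1.0]),
    "Germany": np.array([3.0, 0.0]), "Berlin": np.array([3.0, 1.0]),
}

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(a, b, c, embeddings):
    """Answer "a is to b as c is to ?" with the word closest to b - a + c, excluding a, b, c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

print(solve_analogy("France", "Paris", "Italy", embeddings))  # Rome
```

With real embeddings the nearest neighbour is not always unique or correct, which is the ambiguity problem the last table row points at.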
Clustering

[Figure]
Visualization