[NLP] Language Models and Transfer Learning

Update (10.13): a new state-of-the-art pretrained model was released recently. See:

李入魔: [NLP] Google BERT Explained (zhuanlan.zhihu.com)

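1. Introduction

Word vectors have long been the main representation technique in NLP tasks. Following a series of breakthroughs in late 2017 and early 2018, research has shown that pretrained language representations, once fine-tuned, perform better on a wide range of NLP tasks. There are currently two ways to use pretrained representations:

Feature-based: the learned representation is used as a feature for the downstream task. Word, sentence, paragraph, and document vectors all work this way. The more recent ELMo also belongs to this category, although after transfer the representation of each input has to be recomputed by the model.
Fine-tuning: borrowed mainly from CV. Task-specific layers are added on top of the pretrained model, and the last few layers are then fine-tuned. The more recent ULMFiT and OpenAI GPT belong to this category.

This article gives a brief introduction to three pretrained language models: ELMo, ULMFiT, and OpenAI GPT.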

2. ELMo

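2.1 Model principles and architecture

Paper: Deep contextualized word representations

ELMo is an embedding extracted from a bidirectional language model (biLM). Training uses a BiLSTM: given N tokens $(t_1, t_2, \ldots, t_N)$, the objective is to maximize the joint log-likelihood of the forward and backward directions:

$$\sum_{k=1}^{N} \left( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \right)$$

For each token $t_k$, an L-layer biLM computes 2L+1 representations:

$$R_k = \{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L \} = \{ h_{k,j}^{LM} \mid j = 0, \ldots, L \}$$

where $h_{k,0}^{LM}$ is the direct encoding of the token (here, its characters encoded by a CNN) and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$ is the output of each biLSTM layer.

In applications, the outputs $R_k$ of all layers are collapsed into a single vector, $ELMo_k = E(R_k; \Theta_e)$. The simplest choice is to take the top layer as the token representation, $E(R_k) = h_{k,L}^{LM}$. More generally, a few task-specific parameters combine the information from all layers:

$$ELMo_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}$$

where $s^{task}$ are softmax-normalized weights and $\gamma^{task}$ is a task-specific scale parameter that matters in practice for optimization. Because the outputs of the biLM layers follow different distributions, it can also help in some cases to apply layer normalization to each layer before weighting.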
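The layer combination can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `scalar_mix`, its arguments, and the toy shapes are all assumptions made for the example. It softmax-normalizes one weight per layer, takes the weighted sum of the layer outputs, and rescales the result by a task-specific scalar.

```python
import numpy as np

def scalar_mix(layers, s_logits, gamma):
    """Collapse L+1 layer outputs into one vector per token (hypothetical helper).

    layers:   array of shape (L+1, num_tokens, dim), one slice per biLM layer
    s_logits: array of shape (L+1,), unnormalized layer weights
    gamma:    task-specific scalar that rescales the whole vector
    """
    # Softmax-normalize the layer weights (subtract the max for stability)
    e = np.exp(s_logits - np.max(s_logits))
    s = e / e.sum()
    # Weighted sum over the layer axis, then scale by gamma
    return gamma * np.tensordot(s, layers, axes=1)

# Toy sizes: 3 layers (token layer + 2 biLSTM layers), 4 tokens, 1024 dims
layers = np.random.randn(3, 4, 1024)
elmo = scalar_mix(layers, s_logits=np.zeros(3), gamma=1.0)
print(elmo.shape)
```

With all weight logits equal, the mix reduces to a plain average of the layers; in a real task model the logits and the scale would be trained together with the rest of the network.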

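The pretrained biLM used in the paper is a modified version of the CNN-BIG-LSTM from Jozefowicz et al. The final model is a 2-layer biLSTM (4096 units, 512-dimensional projections) with a residual connection between the first and second layers. A CNN followed by two highway layers produces a context-independent, character-level encoding of each token. As a result, the model outputs three layers of vector representations for every token.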

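2.2 Training notes

- Regularization:

1. Dropout

2. Adding a penalty on the ELMo layer weights, $\lambda \lVert w \rVert_2^2$, to the loss (experiments show that ELMo works better with a small $\lambda$)

- TF source walkthrough:

1. The model architecture code lives mainly in the LanguageModel class of the training module and is built in two steps: first the word or character embedding layer (CNN + Highway), then the BiLSTM layers.

2. The class used to load a pretrained model is BidirectionalLanguageModel in the model module.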

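2.3 Using the model

Concatenate the ELMo vector $ELMo_k^{task}$ with the conventional word vector $x_k$ to form $[x_k; ELMo_k^{task}]$, and feed the result into the task-specific RNN.
Alternatively, add the ELMo vector at the output side of the model, concatenating it with the task RNN output $h_k$ to form $[h_k; ELMo_k^{task}]$.
Keras code example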

import tensorflow as tf
import tensorflow_hub as hub
from keras import backend as K
import keras.layers as layers
from keras.models import Model

# Initialize session (TensorFlow 1.x API)
sess = tf.Session()
K.set_session(sess)

# Instantiate the elmo model
elmo_model = hub.Module("https://tfhub.dev/google/elmo/1", trainable=True)
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())

# We create a function to integrate the tensorflow model with a Keras model
# This requires explicitly casting the tensor to a string, because of a Keras quirk
def ElmoEmbedding(x):
    # Squeeze only axis 1 so that a batch of size 1 keeps its batch dimension
    return elmo_model(tf.squeeze(tf.cast(x, tf.string), axis=1), signature="default", as_dict=True)["default"]

# One raw string per example; the "default" output is a 1024-dim sentence vector
input_text = layers.Input(shape=(1,), dtype=tf.string)
embedding = layers.Lambda(ElmoEmbedding, output_shape=(1024,))(input_text)
dense = layers.Dense(256, activation='relu')(embedding)
pred = layers.Dense(1, activation='sigmoid')(dense)

model = Model(inputs=[input_text], outputs=pred)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

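2.4 Strengths and weaknesses

Strengths:

Strong results: it improves on conventional models for most tasks. Experiments confirm that ELMo captures syntactic and semantic information better than word vectors do.
Conventional pretrained word vectors provide only a single layer of representation and are limited to a fixed vocabulary. ELMo builds its representations at the character level, so it has no vocabulary limit.

Weaknesses:

It is slow: the encoding of every token has to be computed by running the language model.

2.5 Applicable tasks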

Question Answering
Textual entailment
Semantic role labeling
Coreference resolution
Named entity extraction
Sentiment analysis

3. ULMFiT

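3.1 Model principles and architecture

Paper: Universal Language Model Fine-tuning for Text Classification

ULMFiT is an effective transfer learning method for NLP. Its core idea is to solve other NLP tasks by fine-tuning a pretrained language model. The language model used in the paper follows the AWD-LSTM of Merity et al. 2017a, a three-layer LSTM with no attention or shortcut connections.

ULMFiT proceeds in three steps:

1. General-domain LM pretraining

The language model is pretrained on Wikitext-103.
Requirements for the pretraining corpus: large & capture general properties of language
Pretraining is especially helpful for small datasets: afterwards, the model can generalize from only a few examples.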

2. Target task LM fine-tuning

The paper introduces two fine-tuning techniques:

Discriminative fine-tuning

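Different layers of a network capture different types of information, so they should also be fine-tuned with different learning rates. The authors assign each layer its own learning rate. Empirically, they found it works best to first choose the learning rate $\eta^L$ by fine-tuning only the last layer L, and then set the learning rate of each lower layer recursively:

$$\eta^{l-1} = \eta^{l} / 2.6$$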
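As a small worked example, the per-layer rates can be computed as follows, using the rule from the ULMFiT paper that each layer receives the learning rate of the layer above it divided by 2.6. The helper name `discriminative_lrs` and the example values (four layers, last-layer rate 0.01) are assumptions for illustration only.

```python
def discriminative_lrs(lr_last, n_layers, factor=2.6):
    """Return one learning rate per layer, ordered from first to last layer."""
    # The last layer keeps lr_last; each layer below it is divided by `factor` once more
    return [lr_last / factor ** (n_layers - 1 - l) for l in range(n_layers)]

lrs = discriminative_lrs(lr_last=0.01, n_layers=4)
for l, lr in enumerate(lrs, start=1):
    print(f"layer {l}: lr = {lr:.6f}")
```

In a framework such as PyTorch, a list like this would typically be mapped onto one optimizer parameter group per layer.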

Slanted triangular learning rates (STLR)

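To adapt the parameters to a specific task, the model should ideally converge quickly to a suitable region of parameter space at the start of training and then be refined. To achieve this, the authors propose STLR, in which the LR first increases briefly at the start of training and then decays for the rest of it, as shown in the upper right of figure b. The schedule is defined as:

$$cut = \lfloor T \cdot cut\_frac \rfloor$$

$$p = \begin{cases} t / cut, & \text{if } t < cut \\ 1 - \dfrac{t - cut}{cut \cdot (1 / cut\_frac - 1)}, & \text{otherwise} \end{cases}$$

$$\eta_t = \eta_{max} \cdot \frac{1 + p \cdot (ratio - 1)}{ratio}$$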

- T: number of training iterations
- cut_frac: fraction of iterations we increase the LR
- cut: the iteration when we switch from increasing to decreasing the LR
- p: the fraction of the number of iterations we have increased or will decrease the LR respectively
- ratio: specifies how much smaller the lowest LR is from the max LR
- $\eta_t$: the LR at iteration t

The authors use cut_frac = 0.1, ratio = 32, and a maximum learning rate $\eta_{max} = 0.01$.
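The schedule is easy to check numerically. The sketch below implements the slanted triangular learning rate from the ULMFiT paper as a plain Python function; the function name `stlr` and the choice of T = 1000 iterations are assumptions for the example, while the defaults cut_frac = 0.1, ratio = 32, and a maximum LR of 0.01 are the values reported in the paper.

```python
import math

def stlr(t, T, cut_frac=0.1, ratio=32, lr_max=0.01):
    """Slanted triangular learning rate at iteration t out of T."""
    # Iteration at which the LR stops increasing and starts decreasing
    cut = math.floor(T * cut_frac)
    if t < cut:
        # Linear warm-up: p goes from 0 to 1
        p = t / cut
    else:
        # Linear decay: p goes from 1 back to 0 over the remaining iterations
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

T = 1000
schedule = [stlr(t, T) for t in range(T + 1)]
print(schedule[0], max(schedule), schedule[-1])
```

The LR starts at lr_max / ratio, peaks at lr_max when t reaches cut, and returns to lr_max / ratio at the final iteration.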

3. Target task classifier fine-tuning

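To fine-tune the classifier, the authors add two linear blocks on top of the last layer. Each block uses batch normalization and dropout, with a ReLU activation for the intermediate layer and a softmax at the end that outputs a probability distribution over the classes. This final fine-tuning stage involves the following techniques:

Concat pooling
The input to the first linear layer is a pooling of the last layer's hidden states. Because the key information in text classification can appear anywhere in the document, using only the output of the last time step is not enough. The authors concatenate the last hidden state $h_T$ with the max-pooled and mean-pooled hidden states of as many time steps as fit in memory, $H = \{h_1, \ldots, h_T\}$, giving $h_c = [h_T, \mathrm{maxpool}(H), \mathrm{meanpool}(H)]$.
Gradual unfreezing
Overly aggressive fine-tuning makes the model forget what it learned during pretraining. The authors therefore unfreeze the layers gradually: they unfreeze and fine-tune the last layer first, then unfreeze one more layer at a time from back to front until all layers are being fine-tuned.
BPTT for Text Classification (BPT3C)
To fine-tune on large documents, the authors split each document into fixed-length batches of size b. Each batch starts from the final state of the previous batch, and the hidden states are tracked for mean and max pooling. Gradients are back-propagated to the batches whose hidden states contributed to the final prediction.
Bidirectional language model
In the authors' experiments, a forward and a backward LM are fine-tuned independently and their predictions are averaged. Combining the two improves results by about 0.5-0.7.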
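Of the techniques above, concat pooling is the easiest to show in code. The NumPy sketch below is a minimal illustration under assumed toy sizes (50 time steps, 400-dimensional hidden states); the helper name `concat_pool` is hypothetical and this is not the fastai implementation.

```python
import numpy as np

def concat_pool(H):
    """Concat pooling for one document.

    H: hidden states of the last layer, shape (T, dim), one row per time step.
    Returns a vector of shape (3 * dim,).
    """
    last = H[-1]               # hidden state at the final time step
    max_pool = H.max(axis=0)   # element-wise max over time
    mean_pool = H.mean(axis=0) # element-wise mean over time
    return np.concatenate([last, max_pool, mean_pool])

H = np.random.randn(50, 400)
h_c = concat_pool(H)
print(h_c.shape)
```

The first linear block of the classifier therefore sees an input three times as wide as the hidden state.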

3.2 Training notes

- PyTorch source walkthrough (fast.ai lesson 10)

# location: fastai/lm_rnn.py
def get_language_model(n_tok, emb_sz, n_hid, n_layers, pad_token,
                       dropout=0.4, dropouth=0.3, dropouti=0.5, dropoute=0.1, wdrop=0.5,
                       tie_weights=True, qrnn=False, bias=False):
    """Returns a SequentialRNN model.

    A RNN_Encoder layer is instantiated using the parameters provided.
    This is followed by the creation of a LinearDecoder layer.
    Also by default (i.e. tie_weights = True), the embedding matrix used in the RNN_Encoder
    is used to instantiate the weights for the LinearDecoder layer.
    The SequentialRNN layer is the native torch's Sequential wrapper that puts the RNN_Encoder and
    LinearDecoder layers sequentially in the model.

    Args:
        n_tok (int): number of unique vocabulary words (or tokens) in the source dataset
        emb_sz (int): the embedding size to use to encode each token
        n_hid (int): number of hidden activation per LSTM layer
        n_layers (int): number of LSTM layers to use in the architecture
        pad_token (int): the int value used for padding text.
        dropouth (float): dropout to apply to the activations going from one LSTM layer to another
        dropouti (float): dropout to apply to the input layer.
        dropoute (float): dropout to apply to the embedding layer.
        wdrop (float): dropout used for a LSTM's internal (or hidden) recurrent weights.
        tie_weights (bool): decide if the weights of the embedding matrix in the RNN encoder should be tied to the
            weights of the LinearDecoder layer.
        qrnn (bool): decide if the model is composed of LSTMS (False) or QRNNs (True).
        bias (bool): decide if the decoder should have a bias layer or not.

    Returns:
        A SequentialRNN model
    """
    rnn_enc = RNN_Encoder(n_tok, emb_sz, n_hid=n_hid, n_layers=n_layers, pad_token=pad_token,
                          dropouth=dropouth, dropouti=dropouti, dropoute=dropoute, wdrop=wdrop, qrnn=qrnn)
    enc = rnn_enc.encoder if tie_weights else None
    return SequentialRNN(rnn_enc, LinearDecoder(n_tok, emb_sz, dropout, tie_encoder=enc, bias=bias))


def get_rnn_classifier(bptt, max_seq, n_class, n_tok, emb_sz, n_hid, n_layers, pad_token, layers, drops,
                       bidir=False, dropouth=0.3, dropouti=0.5, dropoute=0.1, wdrop=0.5, qrnn=False):
    rnn_enc = MultiBatchRNN(bptt, max_seq, n_tok, emb_sz, n_hid, n_layers, pad_token=pad_token, bidir=bidir,
                            dropouth=dropouth, dropouti=dropouti, dropoute=dropoute, wdrop=wdrop, qrnn=qrnn)
    return SequentialRNN(rnn_enc, PoolingLinearClassifier(layers, drops))

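3.3 Strengths and weaknesses

Strengths:

Compared with other transfer learning methods (such as ELMo), it is better suited to the following settings:

- Non-English languages, where labeled training data is scarce

- New NLP tasks for which no state-of-the-art model exists

- Tasks with only a limited amount of labeled data

Weaknesses:

It transfers easily to classification and sequence labeling tasks, but more complex tasks (such as question answering) need new fine-tuning methods.

3.4 Applicable tasks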

Classification
Sequence labeling

4. OpenAI GPT

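4.1 Model principles and architecture

Paper: Improving Language Understanding by Generative Pre-Training (unpublished)

The OpenAI Transformer is a Transformer-based language model that can be transferred to many NLP tasks. Its basic idea is the same as ULMFiT's: apply a pretrained language model to a variety of tasks while changing the model architecture as little as possible. The difference is that the OpenAI Transformer uses the Transformer architecture, whereas ULMFiT uses an RNN-based language model. The network in the paper is a 12-layer decoder-only Transformer with masked self-attention.

Training proceeds in two steps: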

1. Unsupervised pre-training

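The goal of the first stage is to pretrain a language model. Given a corpus of tokens $U = \{u_1, \ldots, u_n\}$, the objective is to maximize the likelihood:

$$L_1(U) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$$

where k is the size of the context window. The model applies multi-headed self-attention followed by position-wise feed-forward layers, and finally outputs a distribution over tokens:

$$h_0 = U W_e + W_p$$

$$h_l = \mathrm{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n]$$

$$P(u) = \mathrm{softmax}(h_n W_e^T)$$

where $W_e$ is the token embedding matrix, $W_p$ is the position embedding matrix, and n is the number of layers.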
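Because a language model may only condition on preceding tokens, the self-attention in this kind of model is masked so that position i cannot see positions after i. The NumPy sketch below shows that masking for a single attention head. It is a simplified illustration, not the OpenAI code: the function name `causal_self_attention` and the weight matrices `Wq`, `Wk`, `Wv` are hypothetical, and multiple heads, residual connections, layer normalization, and the feed-forward layer are all left out.

```python
import numpy as np

def causal_self_attention(h, Wq, Wk, Wv):
    """Single-head masked self-attention over h of shape (n, d)."""
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    # Scaled dot-product scores between every pair of positions
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Mask out future positions so token i only attends to tokens <= i
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)
    # Row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 5, 8  # toy sequence length and model width
h = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out, weights = causal_self_attention(h, Wq, Wk, Wv)
print(np.round(weights, 2))
```

The printed attention matrix is lower triangular: each row sums to one and places no weight on later positions.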

2. Supervised fine-tuning

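With the pretrained language model in hand, take a labeled training set $C$ in which each example consists of an input sequence $x^1, \ldots, x^m$ and a label $y$. Passing the input through the language model yields the final Transformer block's activation $h_l^m$, which is fed to an output layer with parameters $W_y$ to predict $y$:

$$P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y)$$

The objective is then:

$$L_2(C) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)$$

The objective for the whole task adds the language modeling objective as an auxiliary term with weight $\lambda$:

$$L_3(C) = L_2(C) + \lambda \cdot L_1(C)$$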
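The way the fine-tuning objective combines the classification log-likelihood with the auxiliary language modeling log-likelihood can be sketched for a single labeled sequence as below. The function names, the toy logits, and the default weight `lam=0.5` are assumptions for this example; real training would average over batches and minimize the negated value.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def combined_objective(clf_logits, y, lm_logits, next_tokens, lam=0.5):
    """Classification log-likelihood plus lam times the LM log-likelihood.

    clf_logits:  shape (n_classes,), classifier logits for the sequence
    y:           int, gold label
    lm_logits:   shape (m, vocab), LM logits at each position
    next_tokens: shape (m,), the token that actually follows each position
    """
    L2 = log_softmax(clf_logits)[y]
    L1 = log_softmax(lm_logits)[np.arange(len(next_tokens)), next_tokens].sum()
    return L2 + lam * L1

# Toy example: 2 classes, 3 positions, vocabulary of 5 tokens, uniform logits
value = combined_objective(np.zeros(2), 0, np.zeros((3, 5)), np.array([1, 2, 3]))
print(value)
```

With uniform logits every class has probability 1/2 and every token has probability 1/5, so the value equals log(1/2) + 0.5 * 3 * log(1/5).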

4.2 Training notes

- TF source walkthrough

# location: finetune-transformer-lm/train.py
# Helpers such as dropout, embed, block, clf and shape_list are defined in the same file.
def model(X, M, Y, train=False, reuse=False):
    with tf.variable_scope('model', reuse=reuse):
        # n_special = 3: the three special tokens (_start_, _delimiter_, _classify_)
        # n_ctx is the context length; the embedding table also holds the position embeddings
        we = tf.get_variable("we", [n_vocab+n_special+n_ctx, n_embd], initializer=tf.random_normal_initializer(stddev=0.02))
        we = dropout(we, embd_pdrop, train)

        X = tf.reshape(X, [-1, n_ctx, 2])
        M = tf.reshape(M, [-1, n_ctx])

        # 1. Embedding
        h = embed(X, we)

        # 2. Transformer blocks
        for layer in range(n_layer):
            h = block(h, 'h%d'%layer, train=train, scale=True)

        # 3. Language model loss (predict token t+1 from position t, masked by M)
        lm_h = tf.reshape(h[:, :-1], [-1, n_embd])
        lm_logits = tf.matmul(lm_h, we, transpose_b=True)
        lm_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=lm_logits, labels=tf.reshape(X[:, 1:, 0], [-1]))
        lm_losses = tf.reshape(lm_losses, [shape_list(X)[0], shape_list(X)[1]-1])
        lm_losses = tf.reduce_sum(lm_losses*M[:, 1:], 1)/tf.reduce_sum(M[:, 1:], 1)

        # 4. Classifier loss (use the hidden state at the position of the classify token)
        clf_h = tf.reshape(h, [-1, n_embd])
        pool_idx = tf.cast(tf.argmax(tf.cast(tf.equal(X[:, :, 0], clf_token), tf.float32), 1), tf.int32)
        clf_h = tf.gather(clf_h, tf.range(shape_list(X)[0], dtype=tf.int32)*n_ctx+pool_idx)

        clf_h = tf.reshape(clf_h, [-1, 2, n_embd])
        if train and clf_pdrop > 0:
            shape = shape_list(clf_h)
            shape[1] = 1
            clf_h = tf.nn.dropout(clf_h, 1-clf_pdrop, shape)
        clf_h = tf.reshape(clf_h, [-1, n_embd])
        clf_logits = clf(clf_h, 1, train=train)
        clf_logits = tf.reshape(clf_logits, [-1, 2])

        clf_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=clf_logits, labels=Y)
        return clf_logits, clf_losses, lm_losses

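4.3 Strengths and weaknesses

Strengths:

Recurrent neural networks capture a relatively limited amount of context, whereas the Transformer can capture longer-range dependencies.
It is faster to compute than a recurrent neural network and easy to parallelize.
Experiments show that the Transformer outperforms ELMo and LSTM networks.

Weaknesses:

For some types of tasks, the structure of the input data has to be adjusted.

4.4 Applicable tasks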

Natural Language Inference
Question Answering and commonsense reasoning
Classification
Semantic Similarity

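5. Summary

From word embeddings to the OpenAI Transformer, transfer learning in NLP has moved from word-level vector representations with word2vec and GloVe, to ELMo, which shares the weights of the lower layers, and on to ULMFiT and the OpenAI Transformer, which fine-tune an entire pretrained model. This progression has greatly improved results on basic NLP tasks. Multiple studies also show that a language model used as the pretrained model captures not only syntactic information but also semantic information, giving the downstream layers high-level abstractions to work with. In addition, Transformer-based models have shown advantages over RNN models in some respects.

Finally, for any specific task it is still worth trying several approaches: use the methods above to build a baseline model, then adjust the network architecture to improve results.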