TensorFlow Deep Learning in Practice: Building Transformer Models with Hugging Face

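- 0. Introduction
- 1. Installing Hugging Face
- 2. Text generation
- 3. Automatic model selection and automatic tokenization
- 4. Named entity recognition
- 5. Summarization
- 6. Fine-tuning a model
- Related links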

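0. Introduction

Unless you need a custom architecture or want to understand how a Transformer works, implementing one from scratch is rarely the best choice. As with other programming work, there is usually no need to reinvent the wheel. Building from scratch makes sense only when you want to understand the internals of the Transformer architecture or modify it to create a new variant. Many excellent libraries provide high-quality Transformer implementations, and Hugging Face is one of the best known. It offers several efficient tools for building Transformers:

- Hugging Face provides a common API for working with many Transformer architectures
- Hugging Face provides not only base models but also models with different kinds of "heads" for specific tasks (for the BERT architecture, for example, it provides TFBertModel, TFBertForSequenceClassification for sentiment analysis, TFBertForTokenClassification for named entity recognition, and TFBertForQuestionAnswering for question answering)
- The pretrained weights that Hugging Face provides make it easy to build custom networks, for example with TFBertForPreTraining
- Besides the pipeline() method, a model can also be defined in the usual way, trained with fit(), and used for inference with predict(), just like an ordinary TensorFlow model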

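1. Installing Hugging Face

Like other third-party libraries, the Hugging Face library can be installed with pip:

$ pip install "transformers[tf]"

Next, verify that the installation succeeded by downloading a pretrained sentiment-analysis model:

$ python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))"

If the installation succeeded, output similar to the following is displayed: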

[{'label': 'POSITIVE', 'score': 0.9998704791069031}]
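
The score in this output is a probability. As a rough illustration of where it comes from, here is a minimal sketch that assumes the classifier produces one raw score (logit) per class and that a softmax turns those scores into probabilities. The logit values and the label order are hypothetical and chosen only for the example.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence, ordered as (NEGATIVE, POSITIVE).
labels = ["NEGATIVE", "POSITIVE"]
logits = [-4.3, 4.6]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the most likely class
result = [{"label": labels[best], "score": probs[best]}]
print(result)
```

The larger the gap between the two logits, the closer the reported score gets to 1.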

Next, we look at how to use Hugging Face to solve specific tasks.

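2. Text generation

In this section, we use GPT-2 for natural language generation, the process of producing natural-language output.

(1) Create a text-generation pipeline, which uses GPT-2:

from transformers import pipeline
generator = pipeline(task="text-generation")

(2) Once the model has been downloaded, pass some text to the generator and look at the results: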

print(generator("Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone"))
print(generator("The original theory of relativity is based upon the premise that all coordinate systems in relative uniform translatory motion to each other are equally valid and equivalent "))
print(generator("It takes a great deal of bravery to stand up to our enemies"))

[Figure: generated text]
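
GPT-2 produces text one token at a time, each new token chosen based on what came before. The sketch below shows that loop in miniature. It is a toy under stated assumptions: the score table is hypothetical, it looks only at the last token (a real model conditions on the whole prefix), and it always picks the top-scoring token, whereas the real pipeline may sample.

```python
# Hypothetical next-token scores; a real model computes these from the whole prefix.
NEXT_TOKEN_SCORES = {
    "three": {"rings": 0.9, "kings": 0.1},
    "rings": {"for": 0.8, "of": 0.2},
    "for": {"the": 1.0},
    "the": {"elven-kings": 0.6, "sky": 0.4},
}

def generate(prompt_tokens, max_new_tokens):
    """Greedy decoding: repeatedly append the highest-scoring next token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_SCORES.get(tokens[-1])
        if not candidates:  # no known continuation, so stop early
            break
        tokens.append(max(candidates, key=candidates.get))
    return tokens

generated = generate(["three"], max_new_tokens=10)
print(" ".join(generated))
```

The max_new_tokens argument here is only a parameter of the toy function; it caps how many tokens the loop may append.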

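3. Automatic model selection and automatic tokenization

Hugging Face tries to automate as many steps as possible.

(1) A model can be imported very easily from the dozens of pretrained models available: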

from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

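The model can then be trained on a downstream task so that it can be used for prediction and inference.

(2) AutoTokenizer converts words into the tokens that the model uses: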

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sequence = "The original theory of relativity is based upon the premise that all coordinate systems"
print(tokenizer(sequence))

[Figure: tokenizer output]
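
For a BERT tokenizer, the printed result is a dictionary with three lists: input_ids, token_type_ids, and attention_mask. The following sketch mimics that structure so the fields are easier to read. It is not the real tokenizer: the vocabulary and ids are hypothetical, and it splits on whitespace rather than into WordPiece subwords.

```python
# Hypothetical vocabulary; a real tokenizer ships with tens of thousands of entries.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "the": 4, "original": 5, "theory": 6, "of": 7, "relativity": 8}

def toy_tokenize(text):
    """Lowercase, split on whitespace, map words to ids, and wrap in [CLS] ... [SEP]."""
    words = text.lower().split()
    ids = [VOCAB["[CLS]"]] + [VOCAB.get(w, VOCAB["[UNK]"]) for w in words] + [VOCAB["[SEP]"]]
    return {
        "input_ids": ids,
        "token_type_ids": [0] * len(ids),  # single sentence: every token is in segment 0
        "attention_mask": [1] * len(ids),  # no padding: every position is attended to
    }

encoded = toy_tokenize("The original theory of relativity")
print(encoded)
```

Words missing from the vocabulary map to [UNK] in this toy version; a real WordPiece tokenizer would usually break them into known subword pieces instead.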

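4. Named entity recognition

Named entity recognition (NER) is a classic natural language processing task. Also known as entity identification, entity chunking, or entity extraction, it is a subtask of information extraction that locates named entities mentioned in unstructured text and classifies them into predefined categories such as person names, organizations, locations, time expressions, quantities, monetary values, and percentages. Next, we use Hugging Face to perform named entity recognition.

(1) Create an NER pipeline: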

from transformers import pipeline
ner_pipe = pipeline("ner")
sequence = """Mr. and Mrs. Dursley, of number four, Privet Drive, were
proud to say that they were perfectly normal, thank you very much."""
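for entity in ner_pipe(sequence):
    print(entity)

(2) The results are shown below, with the entities identified:
[Figure: recognition results]

The NER model distinguishes nine classes:

- O: not part of a named entity
- B-MISC: beginning of a miscellaneous entity that directly follows another miscellaneous entity
- I-MISC: miscellaneous entity
- B-PER: beginning of a person name that directly follows another person name
- I-PER: person name
- B-ORG: beginning of an organization that directly follows another organization
- I-ORG: organization
- B-LOC: beginning of a location that directly follows another location
- I-LOC: location

These classes are defined in the CoNLL-2003 dataset, and Hugging Face automatically selects a default model that uses them.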
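
The pipeline prints one prediction per token, so a name split into several pieces appears as several entries. The sketch below shows one way the tagging scheme above could be used to merge those entries into whole entities. The sample predictions are hypothetical and keep only the fields the function needs (entity, word, index); the helper group_entities is written for this example and is not part of the library.

```python
def group_entities(tokens):
    """Merge adjacent tokens that belong to the same entity into one span."""
    groups = []
    prev_index = None
    for tok in tokens:
        if tok["entity"] == "O":  # not part of any entity
            prev_index = None
            continue
        prefix, _, kind = tok["entity"].partition("-")
        adjacent = prev_index is not None and tok["index"] == prev_index + 1
        # Continue the current span only for an I- tag of the same type on the next token.
        if groups and adjacent and prefix == "I" and groups[-1]["type"] == kind:
            word = tok["word"]
            if word.startswith("##"):  # WordPiece continuation: glue without a space
                groups[-1]["text"] += word[2:]
            else:
                groups[-1]["text"] += " " + word
        else:
            groups.append({"type": kind, "text": tok["word"]})
        prev_index = tok["index"]
    return groups

# Hypothetical token-level predictions for the sentence used above.
sample = [
    {"entity": "I-PER", "word": "Du", "index": 5},
    {"entity": "I-PER", "word": "##rs", "index": 6},
    {"entity": "I-PER", "word": "##ley", "index": 7},
    {"entity": "I-LOC", "word": "Privet", "index": 13},
    {"entity": "I-LOC", "word": "Drive", "index": 14},
]
entities = group_entities(sample)
print(entities)
```

A B- tag always starts a new group, which is what keeps two adjacent entities of the same type apart.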

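5. Summarization

Summarization means expressing the most important facts or ideas about something or someone in a short, clear form. When running on TensorFlow, Hugging Face uses a T5 model as the default for this task.

(1) First, create a summarization pipeline with the default T5 small model: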

from transformers import pipeline
summarizer = pipeline("summarization")
ARTICLE = """Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense. Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

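The output is shown below:

[Figure: summarization output with the default model]

(2) To use a different model, just change the model argument:

summarizer = pipeline("summarization", model='t5-base')

The output is shown below:

[Figure: summarization output with t5-base]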

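6. Fine-tuning a model

A common usage pattern for Transformers is to start from a pretrained large language model (LLM) and then fine-tune it for a specific downstream task. Fine-tuning uses a custom dataset, whereas pretraining is done on a very large dataset. The advantage of this strategy is lower computational cost. In addition, fine-tuning lets us use state-of-the-art models without training one from scratch. Next, we show how to fine-tune a model with TensorFlow, starting from the pretrained bert-base-cased model and fine-tuning it on the Yelp Reviews dataset.
This section uses the datasets library to load the data. Provided by Hugging Face, datasets is a powerful tool for loading, processing, and sharing datasets. Install it with pip:

$ pip install datasets

(1) First, load the Yelp dataset and tokenize it: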

from datasets import load_dataset
from transformers import AutoTokenizer
dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
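
The arguments padding="max_length" and truncation=True make every review the same length, which is what allows examples to be stacked into batches. The sketch below shows the idea on plain lists of ids. It is a simplification under stated assumptions: the ids and the padding id are hypothetical, and the cut is a plain slice, whereas the real tokenizer keeps special tokens such as [SEP] when it truncates.

```python
PAD_ID = 0  # assumed id of the padding token

def pad_or_truncate(input_ids, max_length):
    """Return fixed-length ids plus a mask marking real tokens (1) and padding (0)."""
    ids = input_ids[:max_length]  # truncation: drop tokens past max_length
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * padding,  # padding: fill up to max_length
        "attention_mask": [1] * len(ids) + [0] * padding,
    }

short_example = pad_or_truncate([101, 7592, 102], max_length=5)
long_example = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short_example)
print(long_example)
```

The attention mask tells the model which positions hold real tokens, so the padding does not influence the prediction.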

(2) Next, convert the datasets to TensorFlow format:

from transformers import DefaultDataCollator
data_collator = DefaultDataCollator(return_tensors="tf")
# convert the tokenized datasets to TensorFlow datasets
tf_train_dataset = small_train_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
tf_validation_dataset = small_eval_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)

(3) Load bert-base-cased with TFAutoModelForSequenceClassification:

import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

(4) Finally, fine-tune the model the standard TensorFlow way, by compiling it and training it with fit():

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)
model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)
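
The loss above is configured with from_logits=True because the classification head outputs raw scores, not probabilities, so the loss applies the softmax itself. "Sparse" means each label is a single integer class index, not a one-hot vector. To make the quantity being minimized concrete, here is a minimal sketch of that loss for a single example in plain Python. The logit values are hypothetical, and the function is written for illustration; it is not the Keras implementation.

```python
import math

def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    top = max(logits)
    # log-sum-exp, shifted by the max for numerical stability
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[label]

# Hypothetical logits for the five classes (indices 0 to 4) and a true label of 4.
logits = [0.1, -0.3, 0.2, 1.0, 2.5]
loss = sparse_categorical_crossentropy(logits, label=4)

# A model that scores every class equally has a loss of log(5).
uniform_loss = sparse_categorical_crossentropy([0.0] * 5, label=4)
print(loss, uniform_loss)
```

With five classes, a loss well below log(5) (about 1.61) during training suggests the model is doing better than guessing uniformly.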