Transformers in Practice (18): Fine-Tuning a Transformer Language Model for Regression

- 0. Introduction
- 1. Regression Models
- 2. Data Processing
- 3. Model Construction and Training
- 4. Model Inference
- Summary
- Series Links

0. Introduction

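In natural language processing, pretrained Transformer models can handle not only discrete class prediction but also continuous-valued regression. This installment shows how to turn DistilBert into a regression model that predicts a continuous similarity score. Using the Semantic Textual Similarity Benchmark (STS-B) from GLUE as the example, we configure DistilBertConfig, load the dataset, tokenize it, build TrainingArguments, and define regression metrics such as the Pearson and Spearman correlation coefficients.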

1. Regression Models

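A regression model usually ends in a layer with a single neuron. Its output is not passed through softmax; the raw value is used directly as the predicted score. There are two ways to place a single-neuron output layer on top of the model: pass num_labels=1 directly to the model class's from_pretrained() method, or pass the same information through a config object. Here we take the second route and load the pretrained model's config with num_labels=1: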

from transformers import DistilBertConfig, DistilBertTokenizerFast, DistilBertForSequenceClassification
MODEL_PATH='distilbert-base-uncased'
config = DistilBertConfig.from_pretrained(MODEL_PATH, num_labels=1)
tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_PATH)
model = DistilBertForSequenceClassification.from_pretrained(MODEL_PATH, config=config)

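Because we set num_labels=1, the output layer added on top of the pretrained model contains a single neuron. Next, we prepare the dataset and fine-tune the model for regression.
In this section we use the Semantic Textual Similarity Benchmark (STS-B) dataset, which contains sentence pairs drawn from news headlines and other sources. Each pair carries a similarity score from 0 to 5. Our task is to fine-tune DistilBert to predict these scores and to evaluate the model with the Pearson and Spearman correlation coefficients.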
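To make the single-neuron head concrete, here is a minimal pure-Python sketch of what such a head computes and how a mean-squared-error loss scores it. The pooled vectors, weights, and labels are made-up illustrative values, not anything taken from DistilBert. The MSE objective reflects how Hugging Face sequence-classification models treat num_labels=1 by default.

```python
# Toy regression head: one linear unit on top of a pooled sentence vector.
# All numbers are hypothetical; a real model learns the weights during fine-tuning.

def regression_head(pooled, weights, bias):
    """Return one unbounded score: w . x + b (no softmax, no sigmoid)."""
    return sum(w * x for w, x in zip(weights, pooled)) + bias

def mse_loss(predictions, targets):
    """Mean squared error between predicted and gold scores."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

pooled_vectors = [[0.5, -1.0, 2.0], [1.0, 0.0, -0.5]]  # hypothetical pooled outputs
weights, bias = [1.0, 0.5, 1.0], 0.5                    # hypothetical head parameters
gold_scores = [3.0, 1.5]                                # hypothetical similarity labels

predictions = [regression_head(v, weights, bias) for v in pooled_vectors]
loss = mse_loss(predictions, gold_scores)
print(predictions, loss)
```

Because nothing squashes the output, the head can in principle produce scores outside the label range, which is worth keeping in mind at inference time.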

2. Data Processing

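(1) Load the data. The original dataset comes in three splits, but because the test split has no labels, we split the validation data into two halves and use them as our validation and test sets: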

import datasets
from datasets import load_dataset
stsb_train = load_dataset('glue', 'stsb', split="train")
stsb_validation = load_dataset('glue', 'stsb', split="validation")
# Shuffle with a fixed seed so the split is reproducible
stsb_validation = stsb_validation.shuffle(seed=42)
# First 750 rows become the validation set, the rest the test set
stsb_val = datasets.Dataset.from_dict(stsb_validation[:750])
stsb_test = datasets.Dataset.from_dict(stsb_validation[750:])
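The shuffle-then-slice idea above can be sketched without any dataset library. The helper name shuffle_and_split and the integer rows are hypothetical stand-ins, and Python's random module will not reproduce the exact permutation that datasets produces for seed 42; the sketch only illustrates that a seeded shuffle followed by slicing gives two disjoint, reproducible halves.

```python
import random

def shuffle_and_split(rows, split_at, seed=42):
    """Shuffle a copy of rows with a fixed seed, then cut it in two."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:split_at], shuffled[split_at:]

rows = list(range(1500))  # stand-in for the 1,500 labeled validation rows
val_rows, test_rows = shuffle_and_split(rows, split_at=750)
print(len(val_rows), len(test_rows))
```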

(2) Use pandas to view the stsb_train training data as a table:

import pandas as pd
pd.DataFrame(stsb_train)

The resulting training samples look like this:

[Figure: data samples]

(3) Check the shapes of the three datasets:

stsb_train.shape, stsb_val.shape, stsb_test.shape
# ((5749, 4), (750, 4), (750, 4))

(4) Tokenize the datasets:

enc_train = stsb_train.map(lambda e: tokenizer(e['sentence1'], e['sentence2'], padding=True, truncation=True), batched=True, batch_size=1000)
enc_val = stsb_val.map(lambda e: tokenizer(e['sentence1'], e['sentence2'], padding=True, truncation=True), batched=True, batch_size=1000)
enc_test = stsb_test.map(lambda e: tokenizer(e['sentence1'], e['sentence2'], padding=True, truncation=True), batched=True, batch_size=1000)

(5) The tokenizer joins the two sentences with the [SEP] separator and produces input_ids and attention_mask for each sentence pair:

pd.DataFrame(enc_train)

The output is as follows:

[Figure: tokenized output]
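As a rough illustration of the layout the tokenizer produces for a sentence pair, the toy function below builds the [CLS] ... [SEP] ... [SEP] sequence and a matching attention mask. It is only a sketch: encode_pair is a hypothetical helper, it splits on whitespace instead of using WordPiece, and it returns token strings rather than the integer input_ids a real tokenizer emits.

```python
def encode_pair(sentence1, sentence2, pad_to):
    """Lay out a sentence pair as [CLS] s1 [SEP] s2 [SEP] plus padding."""
    tokens = ["[CLS]", *sentence1.lower().split(), "[SEP]",
              *sentence2.lower().split(), "[SEP]"]
    attention_mask = [1] * len(tokens)       # 1 marks a real token
    n_pad = max(0, pad_to - len(tokens))     # pad up to the requested length
    return tokens + ["[PAD]"] * n_pad, attention_mask + [0] * n_pad

tokens, mask = encode_pair("A plane is taking off",
                           "An air plane is taking off", pad_to=16)
print(tokens)
print(mask)
```

The zeros in the mask tell the model to ignore padding positions, which is why padded batches do not change the prediction for the real tokens.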

3. Model Construction and Training

(1) Define the set of training parameters with the TrainingArguments class:

from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    # The output directory where the model predictions and checkpoints will be written
    output_dir='./stsb-model',
    do_train=True,
    do_eval=True,
    # The number of epochs, defaults to 3.0
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    # Number of steps used for a linear warmup
    warmup_steps=100,
    weight_decay=0.01,
    logging_strategy='steps',
    # TensorBoard log directory
    logging_dir='./logs',
    logging_steps=50,
    # other options : no, steps
    # (newer transformers releases name this argument eval_strategy)
    evaluation_strategy="epoch",
    save_strategy="epoch",
    # Mixed-precision training; requires a CUDA GPU
    fp16=True,
    load_best_model_at_end=True
)
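To show what warmup_steps=100 means in practice, the sketch below computes a linear warmup followed by linear decay. The assumptions are labeled in the code: the base learning rate of 5e-5 and the linear schedule are the Trainer defaults rather than values set in this article, and the step count assumes a single device with 5,749 training rows at batch size 32. The function linear_schedule_lr is a hypothetical name written for illustration.

```python
import math

def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

base_lr = 5e-5                            # assumed: Trainer default learning rate
steps_per_epoch = math.ceil(5749 / 32)    # assumed: single device, batch size 32
total_steps = steps_per_epoch * 3         # three epochs
for step in (0, 50, 100, 320, total_steps):
    print(step, linear_schedule_lr(step, base_lr, 100, total_steps))
```

With these settings the warmup covers a bit more than half of the first epoch, so the learning rate peaks early and then declines for the rest of training.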

(2) Define the compute_metrics function. The evaluation is based on the Pearson correlation coefficient and Spearman's rank correlation. In addition, the function reports mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE), which are common metrics for regression models:

from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
import numpy as np
from scipy.stats import pearsonr
from scipy.stats import spearmanr
def compute_metrics(pred):
    preds = np.squeeze(pred.predictions)
    return {
        "MSE": ((preds - pred.label_ids) ** 2).mean().item(),
        "RMSE": (np.sqrt(((preds - pred.label_ids) ** 2).mean())).item(),
        "MAE": (np.abs(preds - pred.label_ids)).mean().item(),
        "Pearson": pearsonr(preds, pred.label_ids)[0],
        "Spearman's Rank": spearmanr(preds, pred.label_ids)[0]
    }
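To see how the two correlation metrics differ, here is a small pure-Python sketch with made-up gold and predicted scores. The helpers pearson, average_ranks, and spearman are hypothetical names written for illustration and are not the scipy implementations. Pearson measures linear agreement, while Spearman is Pearson applied to ranks, so it only rewards getting the ordering right.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

gold = [0.0, 1.0, 2.0, 3.0, 5.0]   # hypothetical gold similarity scores
pred = [0.1, 0.2, 0.9, 4.0, 4.1]   # hypothetical predictions, same ordering
print(pearson(gold, pred), spearman(gold, pred))
```

In this toy case the predictions preserve the gold ordering exactly, so Spearman reaches its maximum even though the values themselves are not linearly aligned and Pearson stays below 1.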

(3) Instantiate the Trainer object:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=enc_train,
    eval_dataset=enc_val,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

(4) Run the training process:

train_result = trainer.train()
metrics = train_result.metrics

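The output is as follows:

[Figure: training output]

The best validation loss is 0.542073. Next, evaluate the model with the best weights on all three datasets: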

q=[trainer.evaluate(eval_dataset=data) for data in [enc_train, enc_val, enc_test]]
pd.DataFrame(q, index=["train","val","test"]).iloc[:,:6]

The output is as follows:

[Figure: evaluation output]

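On the test set, the Pearson and Spearman correlation scores are 87.69 and 87.64, respectively (correlation coefficients expressed on a 0 to 100 scale).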

4. Model Inference

(1) Run the model for inference. Take the following two sentences, which have the same meaning, and feed them to the model:

s1, s2 = "A plane is taking off.", "An air plane is taking off."
encoding = tokenizer(s1,s2, return_tensors='pt', padding=True, truncation=True, max_length=512)
input_ids = encoding['input_ids'].to(device)
attention_mask = encoding['attention_mask'].to(device)
outputs = model(input_ids, attention_mask=attention_mask)
outputs.logits.item()
# 4.57421875

(2) Next, feed the model a sentence pair with different meanings:

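s1, s2 = "The men are playing soccer.", "A man is riding a motorcycle."
# Pass s1 and s2 so the model scores the pair defined above
encoding = tokenizer(s1, s2, return_tensors='pt', padding=True, truncation=True, max_length=512)
input_ids = encoding['input_ids'].to(device)
attention_mask = encoding['attention_mask'].to(device)
outputs = model(input_ids, attention_mask=attention_mask)
outputs.logits.item()
# The source reports 3.1953125 here, but that run tokenized the literal strings
# "hey how are you there" and "hey how are you" instead of s1 and s2,
# so the score for this pair may differ.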
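Since the regression head is unbounded, a raw score can occasionally fall outside the 0 to 5 label range. One optional post-processing step, which is not part of the original walkthrough, is to clip the score to that range. The helper name clip_score is hypothetical.

```python
def clip_score(raw_score, low=0.0, high=5.0):
    """Clamp a raw regression output to the STS-B label range."""
    return max(low, min(high, raw_score))

for raw in (4.57421875, 5.3, -0.2):
    print(raw, clip_score(raw))
```

Clipping does not change the ranking of in-range predictions, so it leaves Spearman largely unaffected while keeping reported scores interpretable.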

(3) Finally, save the model:

model_path = "sentence-pair-regression-model"
trainer.save_model(model_path)
tokenizer.save_pretrained(model_path)

Summary

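This installment showed how to perform semantic similarity regression with a pretrained DistilBert architecture. First, we added a single-neuron regression head on top of the model, either by modifying the config or by passing a parameter. Then we built training, validation, and test sets from the STS-B dataset and applied the tokenizer to produce the model inputs. Finally, using the Trainer framework and a custom compute_metrics function, we evaluated the model on MSE, RMSE, MAE, and the Pearson and Spearman correlations, which showed that fine-tuning works well for this regression task.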

Series Links

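Transformers in Practice (1): Word Embedding Techniques Explained
Transformers in Practice (2): Recurrent Neural Networks Explained
Transformers in Practice (3): From Bag-of-Words to Transformer: The Evolution of NLP
Transformers in Practice (4): Building a Transformer from Scratch
Transformers in Practice (5): Hugging Face Environment Setup and Usage
Transformers in Practice (6): Evaluating Transformer Model Performance
Transformers in Practice (7): Core Features of the datasets Library
Transformers in Practice (8): The BERT Model Explained and Implemented
Transformers in Practice (9): Transformer Tokenization Algorithms Explained
Transformers in Practice (10): Generative Language Models (GLM)
Transformers in Practice (11): Building a GPT Model from Scratch
Transformers in Practice (12): Transformer-Based Text-to-Text Models
Transformers in Practice (13): Training a GPT-2 Language Model from Scratch
Transformers in Practice (14): Fine-Tuning a Transformer Language Model for Text Classification
Transformers in Practice (15): Fine-Tuning a Transformer Language Model with PyTorch
Transformers in Practice (16): Fine-Tuning a Transformer Language Model for Multi-Class Text Classification
Transformers in Practice (17): Fine-Tuning a Transformer Language Model for Multi-Label Text Classification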