Training BERT on a similarity task: a simple version

Task

Code

train.py

predit.py

Data

Task

Fine-tune bert-base-chinese on a similarity task, following the Zhihu article "微調BERT模型實現相似性判斷" (Fine-tuning a BERT model for similarity judgment).

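The code here follows that article, which uses the BertForNextSentencePrediction model. BertForNextSentencePrediction was originally designed for the next sentence prediction task. In BERT's original pre-training, the model receives a pair of sentences and tries to predict whether the second sentence immediately follows the first. With this model the label can therefore only be 0 or 1, which makes it a binary classification task.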
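To make the sentence-pair input concrete, here is a minimal sketch of how BERT lays out two texts in one sequence. It is a simplification: `pack_pair` is a hypothetical helper, characters stand in for the real WordPiece tokens, and it returns token strings rather than vocabulary ids. The real tokenizer call in train.py produces `input_ids`, `token_type_ids` and `attention_mask` with the same layout.

```python
def pack_pair(text1, text2, max_length=16):
    """Simplified illustration of BERT's sentence-pair input layout.

    Hypothetical helper: splits by character instead of running WordPiece
    and keeps token strings instead of vocabulary ids.
    """
    tokens_a = list(text1)
    tokens_b = list(text2)

    # Layout: [CLS] text1 [SEP] text2 [SEP]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment ids: 0 covers [CLS], text1 and the first [SEP];
    # 1 covers text2 and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Real tokens get attention mask 1, padding gets 0.
    attention_mask = [1] * len(tokens)

    pad = max_length - len(tokens)
    if pad < 0:
        # The real tokenizer would truncate; this sketch does not.
        raise ValueError("pair is longer than max_length")

    tokens += ["[PAD]"] * pad
    token_type_ids += [0] * pad
    attention_mask += [0] * pad
    return tokens, token_type_ids, attention_mask


if __name__ == "__main__":
    toks, types, mask = pack_pair("北京市", "北京", max_length=10)
    print(toks)
    print(types)
    print(mask)
```

The classification head reads the representation at the [CLS] position and outputs two logits, which is why only the labels 0 and 1 make sense with this model.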

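In a similarity task, however, every record has the form [text1\ttext2\tlabel], where label expresses the similarity. The label can be a graded score for the two texts, or it can be mapped to a classification task with 0 for dissimilar and 1 for similar. The article takes the second approach, which is quite useful for beginners.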
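A minimal sketch of reading records in this format, assuming one tab-separated record per line and an integer label. The helper names `parse_line` and `load_pairs` are illustrative and not part of the original scripts.

```python
def parse_line(line):
    """Split one tab-separated 'text1, text2, label' record into its fields."""
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(parts)}")
    text1, text2, label = parts
    return text1, text2, int(label)


def load_pairs(lines):
    """Parse every non-blank line into a (text1, text2, label) tuple."""
    pairs = []
    for line in lines:
        if line.strip():  # skip blank lines, e.g. a trailing newline
            pairs.append(parse_line(line))
    return pairs


if __name__ == "__main__":
    sample = ["北京市朝陽區\t北京朝陽\t1\n", "上海市\t廣州市\t0\n"]
    for text1, text2, label in load_pairs(sample):
        print(text1, text2, label)
```

Stripping the line ending before splitting keeps the label clean, and checking the field count turns a malformed row into a clear error instead of an unpacking failure.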

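I put together a recruitment dataset that has an office-location column. After preprocessing, each row has the form [address1\taddress2\tsimilarity].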

If any address in the two texts is similar, I treat the pair as similar and label it 1; otherwise the label is 0.
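The post does not say how the labels were actually produced, so the following is only a sketch of the rule under two assumptions of mine: each text may hold several addresses separated by ";", and "similar" is approximated by a character-level ratio from Python's `difflib` with an arbitrary threshold. The names `addresses_similar` and `pair_label` are hypothetical.

```python
from difflib import SequenceMatcher


def addresses_similar(a, b, threshold=0.8):
    """Assumed similarity test: difflib ratio at or above a threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def pair_label(text1, text2, sep=";", threshold=0.8):
    """Return 1 if any address in text1 is similar to any address in text2."""
    addrs1 = [a.strip() for a in text1.split(sep) if a.strip()]
    addrs2 = [b.strip() for b in text2.split(sep) if b.strip()]
    for a in addrs1:
        for b in addrs2:
            if addresses_similar(a, b, threshold):
                return 1
    return 0


if __name__ == "__main__":
    print(pair_label("北京市朝陽區;上海市浦東新區", "上海市浦東新區"))
    print(pair_label("北京市", "廣州"))
```

A single matching address is enough for label 1, which mirrors the "any address" wording of the rule.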

I fine-tune on this data without a validation set and only check the result on the test set at the end.

Code

train.py

import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizer, BertForNextSentencePrediction

# Use the GPU when available: uncomment the next line to enable it
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("cpu")

batch_size = 32
num_epochs = 3
auto_save_batch = 5000
learning_rate = 2e-5


# Prepare the dataset
class MyDataset(Dataset):
    def __init__(self, data_file_path):
        self.texts = []
        self.labels = []
        # Use the default tokenizer that ships with the model
        self.tokenizer = BertTokenizer.from_pretrained('../bert-base-chinese')
        # Parse the data file: one "text1\ttext2\tlabel" record per line
        with open(data_file_path, 'r', encoding='utf-8') as f:
            for line in f:
                text1, text2, label = line.split('\t')
                self.texts.append((text1, text2))
                self.labels.append(int(label))

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text1, text2 = self.texts[idx]
        label = self.labels[idx]
        encoded_text = self.tokenizer(text1, text2, padding='max_length', truncation=True,
                                      max_length=128, return_tensors='pt')
        return encoded_text, label


# Path to the training data file
train_dataset = MyDataset('../data/train.txt')

# Define the model
# The NSP head always outputs two classes, so labels must be 0 or 1.
# The original call passed num_labels=6, which this head does not use, so it is omitted.
model = BertForNextSentencePrediction.from_pretrained('../bert-base-chinese')
model.to(device)
# from_pretrained returns the model in eval mode; switch to training mode to enable dropout
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Train the model
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
trained_data = 0
batch_after_last_save = 0
total_batch = 0
total_epoch = 0

for epoch in range(num_epochs):
    trained_data = 0
    for batch in train_loader:
        inputs, labels = batch
        # return_tensors='pt' yields tensors of shape (1, 128) per sample, so the
        # DataLoader stacks them into (batch_size, 1, 128). Drop the extra dimension
        # and move everything to the target device.
        inputs = {key: value.squeeze(1).to(device) for key, value in inputs.items()}
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        trained_data += len(labels)
        trained_process = float(trained_data) / len(train_dataset)
        batch_after_last_save += 1
        total_batch += 1

        # Save the model every auto_save_batch batches
        # (save_pretrained writes a directory, despite the .pth suffix)
        if batch_after_last_save >= auto_save_batch:
            batch_after_last_save = 0
            model.save_pretrained(f'../output/cn_equal_model_{total_epoch}_{total_batch}.pth')
            print("Saved model: cn_equal_model_{}_{}.pth".format(total_epoch, total_batch))

        print("Progress: {:.2f}%, loss={:.4f}".format(trained_process * 100, loss.item()))

    # Save once more at the end of every epoch
    total_epoch += 1
    model.save_pretrained(f'../output/cn_equal_model_{total_epoch}_{total_batch}.pth')
    print("Saved model: cn_equal_model_{}_{}.pth".format(total_epoch, total_batch))

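After training, the output directory holds one folder per saved checkpoint. The last folder written is the fully trained model and the one that performs best.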
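The checkpoint name loaded in predit.py, cn_equal_model_3_171.pth, encodes the epoch count and the total batch count. Assuming that run used the settings shown in train.py (3 epochs, batch size 32, and the DataLoader default of keeping the last partial batch), a short calculation recovers the number of batches per epoch and the possible size of the training set. The function names are illustrative.

```python
import math


def batches_per_epoch(num_samples, batch_size):
    """Batches a DataLoader yields per epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)


def sample_range(total_batches, epochs, batch_size):
    """Infer batches per epoch and the min/max dataset size consistent with them."""
    per_epoch, remainder = divmod(total_batches, epochs)
    if remainder:
        raise ValueError("total_batches is not a multiple of epochs")
    low = (per_epoch - 1) * batch_size + 1
    high = per_epoch * batch_size
    return per_epoch, low, high


if __name__ == "__main__":
    per_epoch, low, high = sample_range(171, 3, 32)
    print(per_epoch, low, high)  # 57 1793 1824
```

Under those assumptions, 171 total batches is far below auto_save_batch = 5000, so the periodic save inside the loop would never fire and only the three end-of-epoch saves would be written.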

predit.py

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained('../bert-base-chinese')
# Load the fine-tuned model saved by train.py (a directory, despite the .pth suffix)
model = BertForNextSentencePrediction.from_pretrained('../output/cn_equal_model_3_171.pth')
model.eval()

with torch.no_grad():
    with open('../data/test.txt', 'r', encoding='utf8') as f:
        lines = f.readlines()

    correct = 0
    for i, line in enumerate(lines):
        text1, text2, label = line.split('\t')
        label = label.strip()
        encoded_text = tokenizer(text1, text2, padding='max_length', truncation=True,
                                 max_length=128, return_tensors='pt')
        outputs = model(**encoded_text)
        # Predicted class is the index of the larger of the two logits
        res = torch.argmax(outputs.logits, dim=1).item()
        print(text1, text2, label, res)
        if str(res) == label:
            correct += 1
        print(f'{i + 1}/{len(lines)}')

    print(f'acc:{correct / len(lines)}')

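The results show that the model learned the pattern in my data fairly well: if any address in the two texts is similar, the pair is labeled 1, otherwise 0.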
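predit.py only reports the hard 0/1 prediction from argmax. If a graded similarity score is wanted instead, one option is to apply softmax to the two logits and read off the probability of label 1. The sketch below uses plain Python with hypothetical function names; in PyTorch the same values come from `torch.softmax(outputs.logits, dim=1)`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_score(logits):
    """Probability of label 1 ('similar') given [logit_0, logit_1]."""
    return softmax(logits)[1]


if __name__ == "__main__":
    print(similarity_score([0.0, 0.0]))   # equal logits give 0.5
    print(similarity_score([-2.0, 3.0]))  # strongly favours label 1
```

Index 1 is read as "similar" because that is how the labels were assigned during fine-tuning here.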

Data

Link: https://pan.baidu.com/s/1Cpr-ZD9Neakt73naGdsVTw
Extraction code: eryw
Link: https://pan.baidu.com/s/1qHYjXC7UCeUsXVnYTQIPCg
Extraction code: o8py