Hugging Face NLP: Key Concepts and a Detailed Usage Guide

1. Install the Hugging Face dependencies

pip install transformers
pip install datasets
pip install torch

pip install tokenizers
pip install diffusers
pip install accelerate
pip install evaluate
pip install optimum

pip install pillow
pip install requests
pip install gradio

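[Figure: Transformers and related libraries]

2. Set the cache directory for Hugging Face model and dataset downloads:

Configure the cache environment variable: HF_HOME=C:\.cache\huggingface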
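As a minimal sketch, the same variable can also be set from Python instead of the system settings. This assumes it is set before `transformers` or `datasets` is imported, since the libraries read it at import time. The `hub` and `datasets` subfolder names are the library defaults as I understand them, so treat the printed paths as illustrative.

```python
import os

# Assumption: this runs before importing transformers/datasets,
# because the libraries read HF_HOME when they are first imported.
os.environ["HF_HOME"] = r"C:\.cache\huggingface"

hf_home = os.environ["HF_HOME"]
# By default, models are cached under <HF_HOME>/hub
# and datasets under <HF_HOME>/datasets.
hub_cache = os.path.join(hf_home, "hub")
datasets_cache = os.path.join(hf_home, "datasets")

print("HF_HOME:", hf_home)
print("model cache:", hub_cache)
print("dataset cache:", datasets_cache)
```

Setting the variable in the operating system, as described above, has the advantage of applying to `huggingface-cli` as well.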

3. Common Hugging Face commands:

Download a dataset:
huggingface-cli download lansinuote/ChnSentiCorp --repo-type dataset

Download a model:
huggingface-cli download bert-base-chinese

Upgrade (or downgrade) transformers to a specific version:
pip uninstall transformers
pip install transformers==4.48.0

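4. Hugging Face NLP task categories:

text-classification: assign a label to a piece of text
feature-extraction: represent a piece of text as a vector
fill-mask: mask out parts of a text and have the model fill in the blanks
ner (named entity recognition): identify named entities such as person and place names in a text
question-answering: given a passage and a question about it, extract the answer from the passage
summarization: generate a short summary of a long text
text-generation: given a piece of text, have the model continue it
translation: translate text from one language into another
conversational: a chatbot that generates replies to the user's input and holds a dialogue

[Figure: stages of natural language processing]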

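5. Key Hugging Face topics:

tokenizer (data preprocessing)
transformer
calling models
pipeline
fine-tuning

The transformer workflow is as follows:

A typical transformer model pipeline has three parts: 1. Tokenizer, 2. Model, 3. Post-processing

The tokenizer converts the input text into input IDs;
the input IDs are fed to the model, which returns its predictions;
the predictions are passed to the post-processing step, which returns output we can read.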
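The three stages can be sketched in plain Python. This is a toy illustration only: the vocabulary, the hand-picked weights and the function names `tokenize`, `model` and `post_process` are all hypothetical stand-ins, not the transformers API.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer ships with tens of thousands of entries.
VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "you": 4}
LABELS = ["NEGATIVE", "POSITIVE"]


def tokenize(text):
    """Stage 1: map each whitespace-separated word to its input ID."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]


def model(input_ids):
    """Stage 2: stand-in 'model' that returns one raw score (logit) per label."""
    # Hand-picked (negative, positive) scores per token ID, for illustration only.
    weights = {2: (-1.0, 2.0), 3: (2.0, -1.0)}
    neg = sum(weights.get(i, (0.0, 0.0))[0] for i in input_ids)
    pos = sum(weights.get(i, (0.0, 0.0))[1] for i in input_ids)
    return [neg, pos]


def post_process(logits):
    """Stage 3: softmax the logits and return a readable label with its probability."""
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


input_ids = tokenize("I love you")
logits = model(input_ids)
result = post_process(logits)
print(input_ids, logits, result)
```

In the real library the same three roles are played by a tokenizer class, a model class and the post-processing inside a pipeline.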

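6. Basic Hugging Face usage, with code walkthroughs:

Downloading a model using the configured environment variable (code)

import os
import re

from datasets import load_dataset
from sklearn.metrics import accuracy_score
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments

# Step 1. Environment setup: pip install torch transformers datasets scikit-learn

# Step 2. Load the Chinese BERT pretrained model and tokenizer
# print("HF_HOME:", os.environ['HF_HOME'])
# print('HUGGINGFACE_HUB_CACHE:', os.environ['HUGGINGFACE_HUB_CACHE'])
# print('TRANSFORMERS_CACHE:', os.environ['TRANSFORMERS_CACHE'])
# print('HF_DATASETS_CACHE:', os.environ['HF_DATASETS_CACHE'])

# Custom download location for Hugging Face models: on Windows, pretrained models
# are downloaded and cached by default under C:\Users\<username>\.cache\huggingface\hub.
# You can set the environment variable HF_HOME=C:\.cache\huggingface instead.
# The first download of a dataset or model is slow; later runs read the local copy.
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
# ChnSentiCorp is a binary sentiment dataset (labels 0 and 1), hence num_labels=2
model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=2)

# Step 3. Load the ChnSentiCorp dataset and clean it
# Dataset page: https://huggingface.co/datasets/lansinuote/ChnSentiCorp
dataset = load_dataset('lansinuote/ChnSentiCorp')

# Data-cleaning function
def clean_text(text):
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    text = text.strip()  # remove leading and trailing whitespace
    return text

# Step 4. Preprocess the data
def tokenize_function(examples):
    texts = [clean_text(t) for t in examples['text']]  # apply the cleaning step
    return tokenizer(texts, padding='max_length', truncation=True, max_length=128)

# Tokenize and encode the dataset
encoded_dataset = dataset.map(tokenize_function, batched=True)

# Step 5. Train the model
# Metric function; it must be passed to the Trainer for evaluate() to report accuracy
def compute_metrics(p):
    preds = p.predictions.argmax(-1)
    return {"accuracy": accuracy_score(p.label_ids, preds)}

# Training configuration: create a TrainingArguments object
training_args = TrainingArguments(
    output_dir='./results',  # directory for checkpoints and other output files
    num_train_epochs=1,  # number of training epochs (1 here)
    per_device_train_batch_size=1,  # training batch size per device (e.g. per GPU)
    per_device_eval_batch_size=1,  # evaluation batch size per device
    eval_strategy='epoch',  # evaluate at the end of every epoch (named evaluation_strategy in older transformers releases)
    logging_dir='./logs',  # directory for the training logs
)

# Train with the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['validation'],
    compute_metrics=compute_metrics,
)

# Start training
trainer.train()

# Step 6. Evaluate the model on the test set
print(trainer.evaluate(encoded_dataset['test'], metric_key_prefix="eval"))
# Example of the output format: {'eval_loss': 0.2, 'eval_accuracy': 0.85}
"""
eval_loss: 0.2 is the model's loss on the test set.
The loss measures the gap between the model's predictions and the true labels;
a lower loss usually means the predictions are closer to the true labels.
eval_accuracy: 0.85 is the model's accuracy on the test set.
Accuracy is the number of correctly predicted samples divided by the total number of samples;
an accuracy of 0.85 means 85% of the test samples were classified correctly.
"""

# Step 7. Export the model
# Save the model and the tokenizer
model.save_pretrained('./saved_model')
tokenizer.save_pretrained('./saved_model')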
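To make the two evaluation numbers concrete, here is a small hand computation of accuracy and cross-entropy loss in plain Python. The probabilities and labels are made-up values, and the helper names `accuracy` and `cross_entropy` are hypothetical. It sketches the definitions only, not how the Trainer computes them internally.

```python
import math


def accuracy(preds, labels):
    """Fraction of predictions that equal the true label."""
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


# Made-up predicted probabilities for four samples and two classes.
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
labels = [0, 1, 1, 1]
# The predicted class is the index of the largest probability.
preds = [max(range(len(p)), key=p.__getitem__) for p in probs]

acc = accuracy(preds, labels)
loss = cross_entropy(probs, labels)
print("accuracy:", acc)
print("loss:", loss)
```

The third sample is misclassified, so accuracy is 3 out of 4. The loss also penalizes correct but unconfident predictions, which is why the two numbers can move independently.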

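Downloading a model with the Hugging Face Hub API (code)

"""
1. Install: pip install huggingface_hub
2. Decide which files to download. Most models need at least two kinds of file:
   the model weights (e.g. a .bin or .pt file) and the configuration (e.g. a .json file).
"""
from huggingface_hub import hf_hub_download

# Download a model with the Hugging Face Hub API
# Repository name of the model
repo_name = "jonatasgrosman/wav2vec2-large-xlsr-53-english"
# Files to download. You need to know the file names in the repo, e.g. "pytorch_model.bin", "config.json"
file_names = ["pytorch_model.bin", "config.json"]
for file_name in file_names:
    # Download the file to the local cache
    file_path = hf_hub_download(repo_id=repo_name, filename=file_name)
    print(f"Downloaded {file_name} to {file_path}")

# Note: downloading this way can stall or time out on large files;
# cloning the model repo with git clone is recommended instead.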

Manually downloading a model from Hugging Face into your own project directory (code)

import os.path

from transformers import BertTokenizer, BertModel

"""
Download the model from Hugging Face by hand, place it in the project
directory HuggingfaceModels, and load it from there.
"""
PATH = r"HuggingfaceModels/"
modelPath = os.path.join(PATH, 'bert-base-chinese')

# 1. Load the pretrained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained(modelPath)
# 2. Load the pretrained BertModel
model = BertModel.from_pretrained(modelPath)
text = "雖然今天下雨了，但我拿到了心意的offer,很開心！"
# 3. Tokenize and encode the text
encode_input = tokenizer(text, return_tensors="pt")
# 4. Feed the encoded text to the pretrained BERT model
output = model(**encode_input)
print("output:", output)

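Using the Chinese word-segmentation library jieba:
pip install jieba

The code is as follows:

"""
jieba Chinese word segmentation. Accurate mode tries to cut the sentence
as precisely as possible and is suited to text analysis.
"""
import jieba

content = "無線電法國別研究"
# jieba.lcut returns a list directly
words = jieba.lcut(content, cut_all=False)  # cut_all defaults to False (accurate mode)
print(words)
# Search-engine mode: splits long words again on top of accurate mode
search_words = jieba.lcut_for_search(content)
print(search_words)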


Using the vocabulary and the tokenizer:

from transformers import BertTokenizer

"""
1. Using the vocabulary and the tokenizer
"""
# Load the pretrained vocabulary and tokenization method (i.e. load the tokenizer)
tokenizer = BertTokenizer.from_pretrained(
    pretrained_model_name_or_path='bert-base-chinese',
    cache_dir=None,
    force_download=False,
)

# Prepare the corpus
sents = [
    '選擇珠江花園的原因是方便。',
    '筆記本的鍵盤確實是好。',
    '房間太小。其他的一般。',
    '今天才知道這本書還有第6卷，真有點郁悶.',
    '機器背面似乎被撕兩張什么標簽，殘膠還在.',
]
print(tokenizer, sents)

# 1. Encode two sentences with the simple encoding function tokenizer.encode()
# out = tokenizer.encode(
#     text=sents[0],
#     text_pair=sents[1],
#     truncation=True,  # truncate when the input is longer than max_length
#     padding='max_length',  # always pad to max_length
#     add_special_tokens=True,
#     max_length=30,
#     return_tensors=None,
# )
# print('two sentences encoded, out:', out)
# print(tokenizer.decode(out))

# 2. Use the extended encoding function tokenizer.encode_plus()
# out = tokenizer.encode_plus(
#     text=sents[0],
#     text_pair=sents[1],
#     truncation=True,  # truncate when the input is longer than max_length
#     padding='max_length',  # always pad to max_length
#     max_length=30,
#     add_special_tokens=True,
#     return_tensors=None,  # may be tf, pt or np; a plain list is returned by default
#     return_token_type_ids=True,  # return token_type_ids
#     return_attention_mask=True,  # return attention_mask
#     return_special_tokens_mask=True,  # return special_tokens_mask, which flags special tokens
#     # return_offsets_mapping=True,  # start/end offsets of each token; only available with BertTokenizerFast
#     return_length=True,  # return the length
# )

# Fields in the extended encoding result:
# input_ids: the encoded tokens
# token_type_ids: 0 for the first sentence with its [CLS] and [SEP], 1 for the second sentence with its closing [SEP]
# special_tokens_mask: 1 at special-token positions, 0 elsewhere
# attention_mask: 0 at pad positions, 1 elsewhere
# length: the length of the encoded sequence
# for k, v in out.items():
#     print(k, ",", v)
#
# print(tokenizer.decode(out['input_ids']))

# 3. Encode sentences in a batch
# out = tokenizer.batch_encode_plus(
#     batch_text_or_text_pairs=[sents[0], sents[1]],
#     add_special_tokens=True,
#     truncation=True,  # truncate when a sentence is longer than max_length
#     padding='max_length',  # always pad to max_length
#     max_length=15,
#     return_tensors=None,  # may be tf, pt or np; a plain list is returned by default
#     return_token_type_ids=True,  # return token_type_ids
#     return_attention_mask=True,  # return attention_mask
#     return_special_tokens_mask=True,  # return special_tokens_mask, which flags special tokens
#     # return_offsets_mapping=True,
#     return_length=True,  # return the length
# )
#
# for k, v in out.items():
#     print(k, ",", v)
#
# print(tokenizer.decode(out['input_ids'][0]), tokenizer.decode(out['input_ids'][1]))

# 4. Encode sentence pairs in a batch
# out = tokenizer.batch_encode_plus(
#     batch_text_or_text_pairs=[(sents[0], sents[1]), (sents[2], sents[3])],
#     add_special_tokens=True,
#     truncation=True,  # truncate when a pair is longer than max_length
#     padding='max_length',  # always pad to max_length
#     max_length=15,
#     return_tensors=None,  # may be tf, pt or np; a plain list is returned by default
#     return_token_type_ids=True,  # return token_type_ids
#     return_attention_mask=True,  # return attention_mask
#     return_special_tokens_mask=True,  # return special_tokens_mask, which flags special tokens
#     # return_offsets_mapping=True,
#     return_length=True,  # return the length
# )
#
# for k, v in out.items():
#     print(k, ",", v)
#
# print(tokenizer.decode(out['input_ids'][0]))

"""
Vocabulary operations
"""
# Get the vocabulary (zidian means "dictionary")
zidian = tokenizer.get_vocab()
print(type(zidian), len(zidian), '月光' in zidian)

# Add new words
tokenizer.add_tokens(new_tokens=['月光', '希望'])
# Add a new special token
tokenizer.add_special_tokens({'eos_token': '[EOS]'})
zidian = tokenizer.get_vocab()
print(type(zidian), len(zidian), zidian['月光'], zidian['[EOS]'])

# Encode text containing the new words
out = tokenizer.encode(
    text='月光的新希望[EOS]',
    text_pair=None,
    truncation=True,
    padding='max_length',
    add_special_tokens=True,
    max_length=8,
    return_tensors=None,
)
print("out:", out)
print(tokenizer.decode(out))
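The relationship between `input_ids`, `token_type_ids`, `special_tokens_mask` and `attention_mask` for a sentence pair can be sketched without loading a model. This is an illustration only: `encode_pair` is a hypothetical helper, the sentence token IDs are made up, and the default special-token IDs (101 for [CLS], 102 for [SEP], 0 for [PAD]) are assumed to match the BERT vocabulary. In real code, read them from the tokenizer.

```python
def encode_pair(ids_a, ids_b, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Build BERT-style fields for a sentence pair from already-tokenized IDs."""
    # Layout: [CLS] sentence A [SEP] sentence B [SEP]
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS] + A + [SEP]; segment 1 covers B + [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # 1 marks special tokens, 0 marks ordinary tokens.
    special_tokens_mask = [1] + [0] * len(ids_a) + [1] + [0] * len(ids_b) + [1]
    attention_mask = [1] * len(input_ids)

    pad = max_length - len(input_ids)  # assumes the pair already fits in max_length
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "special_tokens_mask": special_tokens_mask + [1] * pad,  # padding counts as special
        "attention_mask": attention_mask + [0] * pad,  # padding is not attended to
    }


# Made-up token IDs standing in for two short sentences.
encoded = encode_pair([11, 12, 13], [21, 22], max_length=10)
for key, value in encoded.items():
    print(key, value)
```

Every field has the same length as `input_ids`, which is what lets the model consume them as aligned tensors.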


Using the GPT-2 model:

import os.path

from transformers import GPT2Tokenizer, GPT2Model, AutoTokenizer, AutoModel

"""
Using the GPT-2 model
"""
# The Auto classes are generic: they select the matching tokenizer and model class for the checkpoint
PATH = r"HuggingfaceModels/"
modelPath = os.path.join(PATH, 'gpt2')

# tokenizer = GPT2Tokenizer.from_pretrained(modelPath)
# model = GPT2Model.from_pretrained(modelPath)
tokenizer = AutoTokenizer.from_pretrained(modelPath)
model = AutoModel.from_pretrained(modelPath)
text = "i love dog,dog is cute."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)
print("output:", output)


Using Hugging Face pipelines:

from transformers import pipeline, GPT2Tokenizer

"""
Using Hugging Face pipelines
"""
# 1. Sentiment classification (text classification)
# classifier = pipeline('sentiment-analysis')
#
# result = classifier("I hate you")[0]
# print("result:", result)
#
# result = classifier("I love you")[0]
# print("result:", result)

# 2. Reading comprehension (question answering)
# question_answerer = pipeline('question-answering')
# context = r"""
# Extractive Question Answering is the task of extracting an answer from question  answering dataset is the SQuAD dataset,
# which is entirely bring a model  on a SQuAD task,you may leverage the example/pytorch/question
# """
# result = question_answerer(question="What is extractive question answering?", context=context)
# print("result:", result)
#
# result = question_answerer(question="What is a good example of a  question answering dataset?", context=context)
# print("result:", result)

# 3. Text generation
# text_generator = pipeline('text-generation', model='gpt2')
# sample = text_generator(
#     "As far as I am concerned, I will",
#     max_length=50,
#     truncation=True,
#     do_sample=False,
#     pad_token_id=text_generator.tokenizer.eos_token_id,
# )
# print("sample:", sample)

# 4. Named entity recognition
# ner_pipe = pipeline("ner")
# sequence = """
# Hugging face Inc. is a company based in New York City.therefore very close to the Manhattan Bridge which is visible from t
# """
# for entity in ner_pipe(sequence):
#     print("entity:", entity)

# 5. Summarization
# summarizer = pipeline("summarization")
# ARTICLE = """
# New York (CNN) When Liana  was 22 years old, A year later, she got married again in Westchester Country,but to a
# only 18 days after that marriage,she got hitched yet again.how many time did she  marriage?
# """
# print(summarizer(ARTICLE, max_length=40, min_length=30, do_sample=False, num_return_sequences=1))

# 6. Translation
translator = pipeline("translation_en_to_de")
sentence = "I love china,do you like china?"
out = translator(sentence, max_length=40)
print("out:", out)


Binary classification demo with a pretrained BERT model

import os.path

import torch
from torch import nn
from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel

"""
Binary classification demo with a pretrained BERT model
"""
PATH = r"HuggingfaceModels/"
modelPath = os.path.join(PATH, 'bert-base-chinese')

# 1. Load the BERT model and tokenizer
# tokenizer = BertTokenizer.from_pretrained(modelPath)
# model = BertModel.from_pretrained(modelPath)
tokenizer = AutoTokenizer.from_pretrained(modelPath)
model = AutoModel.from_pretrained(modelPath)

# 2. Define the sentence classifier
class BertSentenceClassifier(nn.Module):
    def __init__(self, bert_model, num_classes):
        super().__init__()
        self.bert = bert_model
        self.classifier = nn.Linear(bert_model.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        # Run BERT
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Pooled representation of the [CLS] token
        pooler_output = outputs.pooler_output
        # Feed it to the classification layer
        logits = self.classifier(pooler_output)
        return logits

# Example text
text = "雖然今天下雨了，但我拿到了心意的offer,太糟糕了！"
# Convert the text to BERT's input format
encode_input = tokenizer(text, return_tensors="pt")
# Create the classifier, assuming two labels (e.g. positive and negative)
classifier = BertSentenceClassifier(model, num_classes=2)
# Get the classification result
logits = classifier(encode_input['input_ids'], encode_input['attention_mask'])
# Convert the logits to probabilities
probabilities = torch.softmax(logits, dim=-1)
# Print the classification result
print("Logits:", logits)
print("Probabilities:", probabilities)

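In the author's run, the negative class received the higher probability, i.e. the sentence was judged negative rather than positive. Note that the linear classification layer above is randomly initialized and untrained, so the probabilities change from run to run and only become meaningful once the classifier has been fine-tuned.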

Hugging Face Chinese text classification demo (binary classification)

import os.path

import torch
from datasets import load_dataset
from torch.optim import AdamW
from transformers import BertTokenizer, BertModel

"""
Hugging Face Chinese text classification demo (binary classification)
"""
# 1. Define the dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, split):
        self.dataset = load_dataset(
            os.path.join(r"HuggingfaceModels/", 'ChnSentiCorp'),
            split=split,
            trust_remote_code=True,
        )

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        text = self.dataset[i]["text"]
        label = self.dataset[i]["label"]
        return text, label

dataset = Dataset('train')
print(len(dataset), dataset[0])

# 2. Load the tokenizer (vocabulary and tokenization tool)
token = BertTokenizer.from_pretrained('bert-base-chinese')
print(token)

# 3. Define the batch collate function
def collate_fn(data):
    sents = [i[0] for i in data]
    labels = [i[1] for i in data]
    # Encode
    data = token.batch_encode_plus(
        batch_text_or_text_pairs=sents,
        truncation=True,
        padding="max_length",
        max_length=500,
        return_tensors="pt",
        return_length=True,
    )
    # input_ids: the token IDs after encoding
    # attention_mask: 0 at padded positions, 1 elsewhere
    input_ids = data['input_ids']
    attention_mask = data['attention_mask']
    token_type_ids = data['token_type_ids']
    labels = torch.LongTensor(labels)
    # print(data['length'], data['length'].max())
    return input_ids, attention_mask, token_type_ids, labels

# 4. Define the data loader
loader = torch.utils.data.DataLoader(dataset=dataset, batch_size=16, collate_fn=collate_fn, shuffle=True, drop_last=True)
# Take one batch to inspect the shapes
for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader):
    break
print(len(loader))
print(input_ids.shape, attention_mask.shape, token_type_ids.shape, labels.shape)

# 5. Load the pretrained Chinese BERT model
pretrained = BertModel.from_pretrained('bert-base-chinese')
# BERT itself is not trained here, so no gradients are needed
for param in pretrained.parameters():
    param.requires_grad_(False)
# Trial run
out = pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
print(out.last_hidden_state.shape)

# 6. Define the downstream model: a simple neural network
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        with torch.no_grad():
            out = pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        out = self.fc(out.last_hidden_state[:, 0])  # features of token 0 ([CLS])
        # Return raw logits: CrossEntropyLoss applies log-softmax itself,
        # so an extra softmax here would flatten the loss
        return out

model = Model()
print(model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids).shape)

# 7. Train the downstream model
optimizer = AdamW(model.parameters(), lr=5e-4)
criterion = torch.nn.CrossEntropyLoss()
model.train()
for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader):
    out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
    loss = criterion(out, labels)
    loss.backward()
    optimizer.step()  # gradient descent update
    optimizer.zero_grad()  # clear the gradients for the next batch
    if i % 5 == 0:
        pred = out.argmax(dim=-1)
        accuracy = (pred == labels).sum().item() / len(labels)
        print(i, loss.item(), accuracy)
    if i == 3:  # stop after 4 batches to keep the demo short
        break

# 8. Test
def test():
    model.eval()
    correct = 0
    total = 0
    loader_test = torch.utils.data.DataLoader(dataset=Dataset('validation'), batch_size=32, collate_fn=collate_fn, shuffle=True, drop_last=True)
    for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader_test):
        if i == 5:  # evaluate on 5 batches only
            break
        print(i)
        with torch.no_grad():
            out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        out = out.argmax(dim=-1)
        correct += (out == labels).sum().item()
        total += len(labels)
    print(correct / total)

test()

Hugging Face Chinese fill-in-the-blank (mask) demo

import os.path

import torch
from datasets import load_dataset
from torch.optim import AdamW
from transformers import BertTokenizer, BertModel

"""
Hugging Face Chinese fill-in-the-blank (mask) demo
"""
# 1. Define the dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, split):
        path = r"HuggingfaceModels/"
        dataSetPath = os.path.join(path, 'ChnSentiCorp')
        dataset = load_dataset(dataSetPath, split=split)

        # Keep only texts longer than 30 characters
        def f(data):
            return len(data['text']) > 30

        self.dataset = dataset.filter(f)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        text = self.dataset[i]['text']
        return text

dataset = Dataset("train")
print(len(dataset), dataset[0])

# 2. Load the tokenizer
token = BertTokenizer.from_pretrained('bert-base-chinese')
print(token)

# 3. Define the batch collate function
def collate_fn(data):
    # Each item is a plain text string, because the dataset returns text only
    sents = list(data)
    # Encode
    data = token.batch_encode_plus(
        batch_text_or_text_pairs=sents,
        truncation=True,
        padding="max_length",
        max_length=500,
        return_tensors="pt",
        return_length=True,
    )
    # input_ids: the token IDs after encoding
    # attention_mask: 0 at padded positions, 1 elsewhere
    input_ids = data['input_ids']
    attention_mask = data['attention_mask']
    token_type_ids = data['token_type_ids']
    # Always replace the token at position 15 with [MASK]; the original token is the label
    labels = input_ids[:, 15].reshape(-1).clone()
    input_ids[:, 15] = token.get_vocab()[token.mask_token]
    return input_ids, attention_mask, token_type_ids, labels

# 4. Define the data loader
loader = torch.utils.data.DataLoader(dataset=dataset, batch_size=16, collate_fn=collate_fn, shuffle=True, drop_last=True)
# Take one batch to inspect it
for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader):
    break
print(len(loader))
print(token.decode(input_ids[0]))
print(token.decode(labels[0]))
print(input_ids.shape, attention_mask.shape, token_type_ids.shape, labels.shape)

# 5. Load the pretrained Chinese BERT model
pretrained = BertModel.from_pretrained('bert-base-chinese')
# BERT itself is not trained here, so no gradients are needed
for param in pretrained.parameters():
    param.requires_grad_(False)
# Trial run
out = pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
print(out.last_hidden_state.shape)

# 6. Define the downstream model: a simple neural network
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.decoder = torch.nn.Linear(768, token.vocab_size, bias=False)
        self.bias = torch.nn.Parameter(torch.zeros(token.vocab_size))
        self.decoder.bias = self.bias

    def forward(self, input_ids, attention_mask, token_type_ids):
        with torch.no_grad():
            out = pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        out = self.decoder(out.last_hidden_state[:, 15])  # features of the token at position 15
        return out

model = Model()
print(model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids).shape)

# 7. Train the downstream model
optimizer = AdamW(model.parameters(), lr=5e-4)
criterion = torch.nn.CrossEntropyLoss()
model.train()
for epoch in range(5):
    for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader):
        out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        loss = criterion(out, labels)
        loss.backward()
        optimizer.step()  # gradient descent update
        optimizer.zero_grad()  # clear the gradients for the next batch
        if i % 50 == 0:
            pred = out.argmax(dim=-1)
            accuracy = (pred == labels).sum().item() / len(labels)
            print(epoch, i, loss.item(), accuracy)

# 8. Test
def test():
    model.eval()
    correct = 0
    total = 0
    loader_test = torch.utils.data.DataLoader(dataset=Dataset('test'), batch_size=32, collate_fn=collate_fn, shuffle=True, drop_last=True)
    for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader_test):
        if i == 15:  # evaluate on 15 batches only
            break
        print(i)
        with torch.no_grad():
            out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        out = out.argmax(dim=-1)
        correct += (out == labels).sum().item()
        total += len(labels)
        print(token.decode(input_ids[0]))
        # Predicted token, then the true token
        print(token.decode(out[0]), token.decode(labels[0]))
    print(correct / total)

test()
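The masking step in the collate function above can be shown on plain lists, without tensors or a tokenizer. This is a sketch under stated assumptions: `mask_fixed_position` is a hypothetical helper, the token IDs are made up, and 103 is assumed to be the [MASK] id. In real code, look it up through the tokenizer as the demo does.

```python
MASK_ID = 103  # assumption: the [MASK] id; read it from the tokenizer in real code
MASK_POS = 15


def mask_fixed_position(batch_input_ids, pos=MASK_POS, mask_id=MASK_ID):
    """Return (masked inputs, labels); each label is the original token at `pos`."""
    # Save the original token first, otherwise the answer is lost after masking.
    labels = [row[pos] for row in batch_input_ids]
    # Build new rows so the caller's batch is left unchanged.
    masked = [row[:pos] + [mask_id] + row[pos + 1:] for row in batch_input_ids]
    return masked, labels


# Two made-up sequences of 30 token IDs each.
batch = [list(range(200, 230)), list(range(300, 330))]
masked, labels = mask_fixed_position(batch)
print(labels)
print(masked[0])
```

The tensor version in the demo uses `.clone()` for the same reason this sketch copies the rows: the labels must be captured before position 15 is overwritten.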

Hugging Face Chinese sentence-relation inference demo

import os.path
import random

import torch
from datasets import load_dataset
from torch.optim import AdamW
from transformers import BertTokenizer, BertModel

"""
Hugging Face Chinese sentence-relation inference demo
"""
# 1. Define the dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, split):
        path = r"HuggingfaceModels/"
        dataSetPath = os.path.join(path, 'ChnSentiCorp')
        dataset = load_dataset(dataSetPath, split=split)

        # Keep only texts longer than 40 characters
        def f(data):
            return len(data['text']) > 40

        self.dataset = dataset.filter(f)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        text = self.dataset[i]['text']
        # Split the text into a first half and a second half
        sentence1 = text[:20]
        sentence2 = text[20:40]
        label = 0
        # With probability 1/2, replace the second half with an unrelated one
        if random.randint(0, 1) == 0:
            j = random.randint(0, len(self.dataset) - 1)
            sentence2 = self.dataset[j]['text'][20:40]
            label = 1
        return sentence1, sentence2, label

dataset = Dataset("train")
sentence1, sentence2, label = dataset[0]
print(len(dataset), sentence1, sentence2, label)

# 2. Load the tokenizer
token = BertTokenizer.from_pretrained('bert-base-chinese')
print(token)

# 3. Define the batch collate function
def collate_fn(data):
    sents = [i[:2] for i in data]  # (sentence1, sentence2) pairs
    labels = [i[2] for i in data]  # the third element is the label
    # Encode
    data = token.batch_encode_plus(
        batch_text_or_text_pairs=sents,
        truncation=True,
        padding="max_length",
        max_length=45,
        return_tensors="pt",
        return_length=True,
        add_special_tokens=True,
    )
    # input_ids: the token IDs after encoding
    # attention_mask: 0 at padded positions, 1 elsewhere
    input_ids = data['input_ids']
    attention_mask = data['attention_mask']
    token_type_ids = data['token_type_ids']
    labels = torch.LongTensor(labels)
    return input_ids, attention_mask, token_type_ids, labels

# 4. Define the data loader
loader = torch.utils.data.DataLoader(dataset=dataset, batch_size=8, collate_fn=collate_fn, shuffle=True, drop_last=True)
# Take one batch to inspect it
for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader):
    break
print(len(loader))
print(token.decode(input_ids[0]))
print(input_ids.shape, attention_mask.shape, token_type_ids.shape, labels.shape)

# 5. Load the pretrained Chinese BERT model
pretrained = BertModel.from_pretrained('bert-base-chinese')
# BERT itself is not trained here, so no gradients are needed
for param in pretrained.parameters():
    param.requires_grad_(False)
# Trial run
out = pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
print(out.last_hidden_state.shape)

# 6. Define the downstream model: a simple neural network
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        with torch.no_grad():
            out = pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        out = self.fc(out.last_hidden_state[:, 0])  # features of token 0 ([CLS])
        # Return raw logits: CrossEntropyLoss applies log-softmax itself
        return out

model = Model()
print(model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids).shape)

# 7. Train the downstream model
optimizer = AdamW(model.parameters(), lr=5e-4)
criterion = torch.nn.CrossEntropyLoss()
model.train()
for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader):
    out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
    loss = criterion(out, labels)  # compute the loss
    loss.backward()
    optimizer.step()  # gradient descent update
    optimizer.zero_grad()  # clear the gradients for the next batch
    if i % 5 == 0:
        pred = out.argmax(dim=-1)
        accuracy = (pred == labels).sum().item() / len(labels)
        print(i, loss.item(), accuracy)
    if i == 300:
        break

# 8. Test
def test():
    model.eval()
    correct = 0
    total = 0
    loader_test = torch.utils.data.DataLoader(dataset=Dataset('test'), batch_size=32, collate_fn=collate_fn, shuffle=True, drop_last=True)
    for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader_test):
        if i == 15:  # evaluate on 15 batches only
            break
        print(i)
        with torch.no_grad():
            out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        pred = out.argmax(dim=-1)
        correct += (pred == labels).sum().item()
        total += len(labels)
    print(correct / total)

test()
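The way the dataset above manufactures sentence pairs can be tried on its own with toy strings. This is a minimal sketch: `make_pair` is a hypothetical helper mirroring `__getitem__`, the texts are synthetic, and a seeded `random.Random` is used only so that runs are repeatable.

```python
import random


def make_pair(texts, i, rng):
    """Split texts[i] into two 20-character halves; half the time swap in another second half."""
    first = texts[i][:20]
    second = texts[i][20:40]
    label = 0  # 0: the second half really follows the first
    if rng.randint(0, 1) == 0:
        j = rng.randint(0, len(texts) - 1)
        second = texts[j][20:40]
        label = 1  # 1: the second half was drawn at random
    return first, second, label


# Synthetic texts of 45 characters each: "aaa...", "bbb...", and so on.
texts = [chr(ord("a") + k) * 45 for k in range(5)]
rng = random.Random(0)
pairs = [make_pair(texts, i, rng) for i in range(len(texts))]
for first, second, label in pairs:
    print(first[:3], second[:3], label)
```

One caveat carries over from the demo: the random index `j` can equal `i`, in which case a genuinely consecutive pair is labeled 1. On a large dataset this is rare, but excluding `j == i` would make the labels clean.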