[BERT_Pretrain] Wikipedia_Bookcorpus Data Preprocessing (Part 2)

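The previous post introduced the Wikipedia and BookCorpus datasets. This post covers how to preprocess that data so it can be used for BERT's two pretraining tasks, MLM and NSP.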

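MLM is a cloze-style (fill-in-the-blank) task, while NSP judges whether two sentences are consecutive, so the two tasks need different preprocessing. The raw data in both datasets comes as paragraphs, so the first step is to split it into single sentences. The sentences are then preprocessed for each task. The approach in the original BERT paper is to preprocess the dataset for NSP first and then apply masking to obtain the MLM data.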

Splitting paragraphs into sentences

Use the nltk package to split the text into sentences. (The code continues from the previous post.)

# paragraph -> sentences
import nltk
from nltk.tokenize import sent_tokenize

try:
    sent_tokenize("Test sentence.")  # try a call so that missing resources raise an error
except LookupError:
    # download the required resources automatically
    nltk.download('punkt')
    nltk.download('punkt_tab')

# test
text = "This is the first sentence. This is the second sentence."
sentences = sent_tokenize(text)
print(sentences)
print('success!')

Split every paragraph in the dataset into sentences.

def preprocess_text(text):
    sentences = sent_tokenize(text)
    # keep only sentences longer than 10 characters after stripping whitespace
    sentences = [s.strip() for s in sentences if len(s.strip()) > 10]
    return sentences

def preprocess_examples(examples):
    processed_texts = []
    for text in examples["text"]:
        processed_texts.extend(preprocess_text(text))
    return {"sentences": processed_texts}

processed_dataset = dataset.map(
    preprocess_examples,
    batched=True,  # process in batches
    remove_columns=dataset.column_names,  # drop the original columns (keys)
    num_proc=4,  # run in parallel on 4 CPU processes
)

Print some information about the processed data.

print(type(processed_dataset))
print(processed_dataset.num_rows)
print(processed_dataset.column_names)
print(processed_dataset.shape)

NSP+MLM data preprocessing

from transformers import BertTokenizer
import random

# load the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def create_nsp_mlm_examples(examples):
    input_ids = []
    token_type_ids = []
    attention_mask = []
    next_sentence_label = []
    masked_lm_labels = []
    sentences = examples["sentences"]

    # NSP
    # note: label 1 means "real next sentence" here; Hugging Face's
    # BertForPreTraining uses the opposite convention (0 = real next sentence)
    for i in range(len(sentences) - 1):
        if random.random() > 0.5:
            text_a = sentences[i]
            text_b = sentences[i + 1]
            label = 1
        else:
            text_a = sentences[i]
            text_b = random.choice(sentences)
            while text_b == sentences[i + 1]:  # avoid drawing the real next sentence
                text_b = random.choice(sentences)
            label = 0

        # use the tokenizer to join the two sentences
        encoded = tokenizer(
            text_a,
            text_b,
            max_length=512,
            truncation=True,
            padding=False,  # instead of 'max_length'; skipping padding here saves some memory
            return_tensors='pt',
        )

        # MLM
        input_id_list = encoded['input_ids'][0].tolist()
        # positions that are not masked get label 0 (Hugging Face uses -100 here)
        mlm_labels = [0] * len(input_id_list)
        special_token_ids = {tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id}
        candidate_indices = [pos for pos, t_id in enumerate(input_id_list) if t_id not in special_token_ids]
        # randomly select 15% of the tokens in the pair (at least one)
        num_to_mask = max(1, int(len(candidate_indices) * 0.15))
        # never request more positions than there are candidates
        mask_indices = random.sample(candidate_indices, min(len(candidate_indices), num_to_mask))
        for idx in mask_indices:
            original_token = input_id_list[idx]
            mlm_labels[idx] = original_token
            prob = random.random()
            if prob < 0.8:  # 80%: replace with [MASK]
                input_id_list[idx] = tokenizer.mask_token_id
            elif prob < 0.9:  # 10%: replace with a random token
                input_id_list[idx] = random.randint(0, tokenizer.vocab_size - 1)
            # remaining 10%: keep the original token unchanged

        input_ids.append(input_id_list)
        token_type_ids.append(encoded['token_type_ids'][0].tolist())
        attention_mask.append(encoded['attention_mask'][0].tolist())
        next_sentence_label.append(label)
        masked_lm_labels.append(mlm_labels)

    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
        "next_sentence_labels": next_sentence_label,
        "masked_lm_labels": masked_lm_labels,
    }

# process the whole dataset (this takes a long time)
final_dataset = processed_dataset.map(
    create_nsp_mlm_examples,
    batched=True,
    remove_columns=processed_dataset.column_names,
    num_proc=4,
)

# save to disk
final_dataset.save_to_disk("data_processed_nsp_mlm4gb", max_shard_size="4GB")
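
Since the full run is slow, it can help to sanity-check the 80/10/10 masking rule on toy data first. The following is a minimal sketch in pure Python, not the post's actual pipeline: `mask_tokens` is a hypothetical helper, and the constants 103 (`[MASK]`), 101 (`[CLS]`), 102 (`[SEP]`) and 30522 (vocabulary size) are the values used by bert-base-uncased, hard-coded here so that no tokenizer download is needed.

```python
import random

MASK_ID = 103        # [MASK] id in bert-base-uncased (hard-coded assumption)
VOCAB_SIZE = 30522   # vocabulary size of bert-base-uncased
IGNORE = 0           # label at unmasked positions, same choice as in the post

def mask_tokens(token_ids, special_ids, rng, mask_prob=0.15):
    """Return (masked_ids, labels) using the 80/10/10 replacement rule."""
    ids = list(token_ids)
    labels = [IGNORE] * len(ids)
    # special tokens such as [CLS] and [SEP] are never masked
    candidates = [pos for pos, t in enumerate(ids) if t not in special_ids]
    if not candidates:
        return ids, labels
    num_to_mask = max(1, int(len(candidates) * mask_prob))
    for pos in rng.sample(candidates, num_to_mask):
        labels[pos] = ids[pos]                    # remember the original token
        p = rng.random()
        if p < 0.8:
            ids[pos] = MASK_ID                    # 80%: replace with [MASK]
        elif p < 0.9:
            ids[pos] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: leave the token unchanged
    return ids, labels

rng = random.Random(0)
tokens = [101] + list(range(2000, 2100)) + [102]  # [CLS] + 100 toy ids + [SEP]
masked, labels = mask_tokens(tokens, {101, 102, 0}, rng)
num_selected = sum(1 for label in labels if label != IGNORE)
print("selected positions:", num_selected)
```

Passing an explicit `random.Random` instance rather than using the module-level functions keeps the check reproducible, which is convenient when comparing runs.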

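My approach here is to preprocess the data first and save the result so it can be used directly for training. You can also preprocess on the fly during training, which helps especially when your disk is too small to store an extra copy of the dataset.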
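
As a rough illustration of the on-the-fly alternative, the sketch below builds NSP pairs lazily with a Python generator, so nothing extra is written to disk. `iter_nsp_pairs` is a hypothetical name and the toy sentences are made up. In a real training loop the tokenization and masking would also happen inside the loop, and the generator could be wrapped in something like `torch.utils.data.IterableDataset`.

```python
import random

def iter_nsp_pairs(sentences, rng):
    """Lazily yield (text_a, text_b, label) triples, one pair at a time."""
    for i in range(len(sentences) - 1):
        if rng.random() > 0.5:
            # positive pair: the real next sentence, label 1 as in the post
            yield sentences[i], sentences[i + 1], 1
        else:
            # negative pair: a random sentence that is not the real next one
            text_b = rng.choice(sentences)
            while text_b == sentences[i + 1]:
                text_b = rng.choice(sentences)
            yield sentences[i], text_b, 0

# toy input; in practice this would be a batch of sentences from the dataset
sentences = ["Sentence number %d." % n for n in range(6)]
for text_a, text_b, label in iter_nsp_pairs(sentences, random.Random(0)):
    print(label, "|", text_a, "|", text_b)
```

The trade-off is CPU time on every epoch in exchange for disk space, and the sampled pairs and masks differ from epoch to epoch unless the random seed is fixed.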

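Note that because padding=False was chosen above, wrapping this dataset in a DataLoader runs into a problem: torch.stack cannot combine tensors of different shapes, so the DataLoader's collate_fn needs to be rewritten (covered in the next chapter).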
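
To make the problem concrete, here is a minimal sketch of the padding step that such a collate_fn has to perform, written in pure Python. It is not the implementation from the next chapter: `pad_batch` is a hypothetical helper, the pad token id 0 matches bert-base-uncased, and the label padding value 0 follows the convention used in this post (it would be -100 under the Hugging Face convention).

```python
def pad_batch(batch, pad_token_id=0, label_pad_id=0):
    """Pad every example in the batch to the length of the longest one."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    out = {
        "input_ids": [],
        "token_type_ids": [],
        "attention_mask": [],
        "masked_lm_labels": [],
        "next_sentence_labels": [],
    }
    for ex in batch:
        n_pad = max_len - len(ex["input_ids"])
        out["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        out["token_type_ids"].append(ex["token_type_ids"] + [0] * n_pad)
        # padded positions get attention mask 0 so the model ignores them
        out["attention_mask"].append(ex["attention_mask"] + [0] * n_pad)
        out["masked_lm_labels"].append(ex["masked_lm_labels"] + [label_pad_id] * n_pad)
        out["next_sentence_labels"].append(ex["next_sentence_labels"])
    return out

# two toy examples of different lengths
batch = [
    {"input_ids": [101, 7592, 102], "token_type_ids": [0, 0, 0],
     "attention_mask": [1, 1, 1], "masked_lm_labels": [0, 7592, 0],
     "next_sentence_labels": 1},
    {"input_ids": [101, 7592, 102, 2088, 102], "token_type_ids": [0, 0, 0, 1, 1],
     "attention_mask": [1, 1, 1, 1, 1], "masked_lm_labels": [0, 0, 0, 2088, 0],
     "next_sentence_labels": 0},
]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Once every list in a batch has the same length, each field can be converted with `torch.tensor` and the function passed as the `collate_fn` argument of the DataLoader. Padding to the longest example in each batch, rather than to 512, keeps the memory savings that motivated padding=False.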
