[Next Token Prediction] A Detailed Walkthrough of Dataset Label Preprocessing in VLM Training

Source code: https://github.com/huggingface/nanoVLM/blob/main/data/collators.py

The annotated code follows:

import torch

#-------------------------------#
# Mainly used when building the data loaders
#-------------------------------#
class BaseCollator(object):
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

        # Acts as an "anchor": we look up its position in the fully templated text
        random_string_5_letters = "xzyvd"
        # Render the message through the chat template as a string, e.g. "<|start|>assistant\nxzyvd<|end|>".
        # This is plain text, not encoded token ids.
        random_string_chat_templated = self.tokenizer.apply_chat_template([{"role": "assistant", "content": random_string_5_letters}], tokenize=False, add_special_tokens=False)
        # Find where the anchor string we inserted appears
        random_string_location = random_string_chat_templated.find(random_string_5_letters)
        # For a reply such as <|start|>assistant\nxzyvd<|end|>, everything before "xzyvd" is the template prefix.
        # Tokenizing that slice gives the prefix length in tokens, so the loss mask can later skip the
        # template prefix exactly and supervise only the actual content of the assistant reply.
        self.prefix_len = len(self.tokenizer.encode(random_string_chat_templated[:random_string_location]))

    #----------------------------------------------------------#
    # Processes a batch of conversations and returns the token ids,
    # attention mask and loss mask the model needs:
    # 1. Convert the messages into the token format the model expects
    # 2. Use each message's role to mark which tokens contribute to the loss (loss_mask):
    #    only the assistant's actual output is supervised, never the user's content
    # 3. Pad every sample to max_len so the batch has a uniform shape
    #----------------------------------------------------------#
    def prepare_inputs_and_loss_mask(self, batched_messages, max_length=None):
        batch_token_ids: list[list[int]] = []   # token ids of each conversation
        batch_masks: list[list[int]] = []       # loss mask of each conversation: which tokens contribute to the loss
        batch_attentions: list[list[int]] = []  # attention mask of each conversation: real input vs. padding

        for messages in batched_messages:  # each conversation holds several user and assistant messages
            #---------------------------------------------------------------------------------------#
            # About the attention mask produced here:
            # when a tokenizer pads, it sets the attention mask to 0 on padding and 1 elsewhere.
            # The mask tells the model which tokens are real content and which are filler
            # added only to reach a common length. This matters most for variable-length
            # input such as natural-language dialogue.
            # NOTE: the tokenizer does no padding at this point, so attention_mask keeps
            # the variable length of each conversation
            #---------------------------------------------------------------------------------------#
            conv_ids = self.tokenizer.apply_chat_template(
                messages,
                tokenize=True,
                add_special_tokens=False,
                return_dict=True,  # return a dict that also carries the attention mask
            )
            # conv_ids is a dict for the whole conversation, holding input_ids (token ids) and attention_mask
            mask = [0] * len(conv_ids["input_ids"])  # start from an all-zero loss mask

            # Locate each assistant turn and flip its mask to 1
            cursor = 0  # number of tokens processed so far
            for msg in messages:  # walk over both user and assistant messages
                # Tokenize this single message on its own
                segment_ids = self.tokenizer.apply_chat_template([msg], tokenize=True, add_special_tokens=False)
                seg_len = len(segment_ids)  # token count of this message

                #---------------------------------------#
                # Only assistant messages are marked,
                # and only the content of the reply
                #---------------------------------------#
                if msg["role"] == "assistant":
                    start = cursor + self.prefix_len  # first token after the template prefix
                    end = cursor + seg_len            # end of this message
                    mask[start:end] = [1] * (end - start)  # attend to these tokens

                cursor += seg_len  # advance; a conversation may contain several assistant replies

            batch_token_ids.append(conv_ids["input_ids"])        # token ids
            batch_masks.append(mask)                             # which tokens are supervised
            batch_attentions.append(conv_ids["attention_mask"])  # which positions are real input

        # NOTE: handles samples whose total length exceeds max_length
        if max_length is not None:  # We need to keep the tokens to allow for the img embed replacing logic to work. Otherwise, we would need to track which images correspond to long samples.
            batch_token_ids = [ids[:max_length] for ids in batch_token_ids]  # truncate over-long samples to max_length
            # A sample longer than max_length gets an all-zero loss mask, so it contributes nothing to the loss
            batch_masks = [m if len(m) <= max_length else [0]*max_length for m in batch_masks]  # Ignore samples that are longer than max_length
            batch_attentions = [a[:max_length] for a in batch_attentions]  # truncated the same way

        # Pad samples to max length
        if max_length is not None:
            max_len = max_length
        else:
            max_len = max(map(len, batch_token_ids))

        # Left-pad every sample up to max_len.
        # NOTE: this padding is applied on top of the (unpadded) tokenizer output
        batch_token_ids = [[self.tokenizer.pad_token_id]*(max_len-len(ids)) + ids for ids in batch_token_ids]  # pad with pad_token_id
        batch_masks = [[0]*(max_len-len(m)) + m for m in batch_masks]  # pad with 0
        batch_attentions = [[0]*(max_len-len(a)) + a for a in batch_attentions]  # pad with 0

        return torch.tensor(batch_token_ids), torch.tensor(batch_attentions), torch.tensor(batch_masks).to(torch.bool)

#-------------------------------------#
# Visual Question Answering Collator
# Training and validation datasets
#-------------------------------------#
class VQACollator(BaseCollator):
    def __init__(self, tokenizer, max_length):
        self.max_length = max_length
        super().__init__(tokenizer)

    def __call__(self, batch):
        images = [item["images"] for item in batch]
        messages_batched = [item["text_data"] for item in batch]

        # Stack images
        imgs = [img for sublist in images for img in sublist]
        images = torch.stack(imgs)

        # Create inputs by concatenating special image tokens, question, and answer
        batch_input_ids, batch_attention_mask, loss_masks = self.prepare_inputs_and_loss_mask(messages_batched, max_length=self.max_length)

        #------------------------------------------------------------------------------------------#
        # Create labels where only answer tokens are predicted
        # 1. Clone input_ids and fill every position whose loss mask is 0 with -100,
        #    so that position is ignored when the loss is computed
        # 2. Shift the labels for causal language modeling, so that each time step is
        #    trained to predict the token that follows it.
        #    labels[:, :-1] selects every position except the last; labels[:, 1:] selects
        #    the second position through the last. Assigning the latter to the former moves
        #    every label one step to the left, so position t now holds token t+1.
        #    This is next token prediction.
        # 3. The last position has no target left, so it is set to -100
        # Example:
        #   batch_input_ids is [[101, 2001, 2023, 2045, 102]] and the loss mask is 0 at 2001
        #   after masking, labels is [[101, -100, 2023, 2045, 102]]
        #   after the shift, positions 0 1 2 3 hold [-100, 2023, 2045, 102]
        #   with the last position closed, the final labels are [[-100, 2023, 2045, 102, -100]]
        #------------------------------------------------------------------------------------------#
        labels = batch_input_ids.clone().masked_fill(~loss_masks, -100)  # fill -100 wherever ~loss_masks is True, i.e. the unsupervised positions
        labels[:, :-1] = labels[:, 1:]  # Shift labels for causal LM
        labels[:, -1] = -100  # Last token has no target

        return {
            "image": images,                         # images
            "input_ids": batch_input_ids,            # model input
            "attention_mask": batch_attention_mask,  # in the equal-length batch: real tokens vs. padding tokens
            "labels": labels,                        # labels
        }

#--------------------------------------------------------#
# Test dataset
# https://huggingface.co/datasets/Lin-Chen/MMStar
#--------------------------------------------------------#
class MMStarCollator(BaseCollator):
    def __init__(self, tokenizer):
        super().__init__(tokenizer)

    def __call__(self, batch):
        images = [item["image"] for item in batch]
        messages_batched = [item["text_data"] for item in batch]

        # Stack images
        images = torch.stack(images)

        # Create inputs by concatenating special image tokens, question, and answer
        batch_input_ids, batch_attention_mask, loss_masks = self.prepare_inputs_and_loss_mask(messages_batched)

        #------------------------------------------------------------------------------------------#
        # 1. Replace the positions to be predicted (loss_masks = 1) with the pad token in input_ids.
        #    They are what the model has to generate, so they are not fed to it as input
        # 2. Zero out the same positions in the attention mask, so the model never sees those
        #    tokens. This matches auto-regressive decoding at inference time
        # 3. Keep only the tokens to be predicted as labels and fill everything else with pad
        #------------------------------------------------------------------------------------------#
        """
        example:
        query: "User: What color is the sky?\nAssistant: The sky is"
        prediction: "blue."
        loss_mask then marks the span "blue.", and the collator will:
        turn "blue." in input_ids into pad (ignored as input)
        set the matching positions in attention_mask to 0 (not attended to)
        keep "blue." in labels and fill the rest with pad (only the answer "blue." is evaluated)
        """
        input_ids = batch_input_ids.masked_fill(loss_masks, self.tokenizer.pad_token_id)
        attention_mask = batch_attention_mask.masked_fill(loss_masks, 0)
        labels = batch_input_ids.clone().masked_fill(~loss_masks, self.tokenizer.pad_token_id)

        return {
            "images": images,
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels,
        }
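
To see how the loss mask, the left padding and the two label schemes fit together, here is a minimal sketch that replays the same steps on plain Python lists for a single conversation, so it runs without PyTorch or a tokenizer. The helper names (`build_loss_mask`, `left_pad`, `training_labels`, `eval_inputs`), the pad id 0, the token ids and the prefix length of 2 are all assumptions made up for this illustration; none of them are part of nanoVLM.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default
PAD_ID = 0           # assumed pad token id, for illustration only


def build_loss_mask(turns, prefix_len):
    """turns: (role, token_count) pairs in conversation order (hypothetical input format)."""
    mask = []
    for role, seg_len in turns:
        if role == "assistant":
            # Skip the chat-template prefix, supervise the rest of the turn.
            mask += [0] * prefix_len + [1] * (seg_len - prefix_len)
        else:
            mask += [0] * seg_len
    return mask


def left_pad(seq, max_len, value):
    """Pad on the left, as prepare_inputs_and_loss_mask does."""
    return [value] * (max_len - len(seq)) + seq


def training_labels(input_ids, loss_mask):
    """VQACollator-style labels: mask, shift left by one, close the last position."""
    masked = [tok if m else IGNORE_INDEX for tok, m in zip(input_ids, loss_mask)]
    return masked[1:] + [IGNORE_INDEX]


def eval_inputs(input_ids, attention_mask, loss_mask):
    """MMStarCollator-style split: hide the answer from the inputs, keep it as labels."""
    ids = [PAD_ID if m else tok for tok, m in zip(input_ids, loss_mask)]
    attn = [0 if m else a for a, m in zip(attention_mask, loss_mask)]
    labels = [tok if m else PAD_ID for tok, m in zip(input_ids, loss_mask)]
    return ids, attn, labels


# One conversation: a 4-token user turn and a 5-token assistant turn
# whose first 2 tokens are template prefix (all numbers are made up).
turns = [("user", 4), ("assistant", 5)]
max_len = 10
input_ids = left_pad(list(range(11, 20)), max_len, PAD_ID)
attention_mask = left_pad([1] * 9, max_len, 0)
loss_mask = left_pad(build_loss_mask(turns, prefix_len=2), max_len, 0)

train_labels = training_labels(input_ids, loss_mask)
eval_ids, eval_attn, eval_labels = eval_inputs(input_ids, attention_mask, loss_mask)

print(loss_mask)     # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(train_labels)  # [-100, -100, -100, -100, -100, -100, 17, 18, 19, -100]
print(eval_ids)      # [0, 11, 12, 13, 14, 15, 16, 0, 0, 0]
```

In the training labels, position 6 (the last template-prefix token, id 16) is the first supervised position and its target is 17, the first token of the reply. In the evaluation variant nothing is shifted: the reply tokens are removed from the inputs and kept only as labels.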