Problem 1: "file is not a zip file" error
The cause is that the download link has expired.
Solution: download the data manually from the link below (a proxy may be required), then extract the archive to a location of your choice.
Download link: the Wikitext-2-v1 data package on GitCode, a mirror repository that provides the standard Wikitext-2-v1 package for users who cannot download it from the Amazon URL.
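As a quick sanity check before changing any code, you can verify whether the downloaded file is actually a zip archive. When a link is dead, the saved file is often an HTML error page, which triggers exactly the "file is not a zip file" message. A minimal sketch (the file names here are illustrative):

```python
import os
import tempfile
import zipfile

tmp = tempfile.mkdtemp()

# Simulate a dead link: the "archive" is really an HTML error page.
bad = os.path.join(tmp, 'wikitext-2-v1.zip')
with open(bad, 'wb') as f:
    f.write(b'<html>404 Not Found</html>')

# A real zip archive for comparison.
good = os.path.join(tmp, 'ok.zip')
with zipfile.ZipFile(good, 'w') as zf:
    zf.writestr('wiki.train.tokens', 'hello world')

# is_zipfile inspects the magic bytes, so it distinguishes the two.
print(zipfile.is_zipfile(bad), zipfile.is_zipfile(good))
```

If `is_zipfile` returns False for your download, re-download the file from the mirror before touching the loading code.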
Then modify the data_dir path in the load_data_wiki function, as follows:
#@save
def load_data_wiki(batch_size, max_len):
    """Load the WikiText-2 dataset."""
    num_workers = d2l.get_dataloader_workers()
    # data_dir = d2l.download_extract('wikitext-2', 'wikitext-2')
    data_dir = 'D:/data/wikitext-2-v1/wikitext-2'  # forward slashes avoid escape issues
    paragraphs = _read_wiki(data_dir)
    train_set = _WikiTextDataset(paragraphs, max_len)
    train_iter = torch.utils.data.DataLoader(train_set, batch_size,
                                             shuffle=True, num_workers=0)
    return train_iter, train_set.vocab
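The comment about forward slashes matters because, in an ordinary Python string literal, backslash sequences are escape codes, so a Windows path can silently change meaning. A small illustration (the drive and folder names are just examples):

```python
from pathlib import Path

# In a normal string literal, \t and \n are escape codes, so the stored
# string is NOT the path you typed: it contains a TAB and a newline.
broken = 'D:\temp\new'
print(repr(broken))

# Three safe alternatives:
raw = r'D:\temp\new'        # raw string: backslashes kept literally
forward = 'D:/temp/new'     # forward slashes work fine in Python on Windows
portable = Path('D:/') / 'temp' / 'new'  # pathlib joins path parts portably
```

Any of the three alternatives is fine for data_dir; the tutorial above uses forward slashes.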
Problem 2: 'gbk' codec can't decode byte 0xae in position 96: illegal multibyte sequence
The cause is an encoding mismatch when reading the file: the WikiText files are UTF-8, but on Chinese-locale Windows, open() defaults to the system codec (GBK).
Solution: modify the _read_wiki(data_dir) function, adding encoding="utf-8" to the open() call.
#@save
def _read_wiki(data_dir):
    file_name = os.path.join(data_dir, 'wiki.train.tokens')
    with open(file_name, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    # Convert uppercase letters to lowercase
    paragraphs = [line.strip().lower().split(' . ')
                  for line in lines if len(line.split(' . ')) >= 2]
    random.shuffle(paragraphs)
    return paragraphs
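To see why the explicit encoding matters, here is a minimal self-contained reproduction of the error (the file name and sample text are illustrative, not taken from the dataset):

```python
import os
import tempfile

# WikiText contains non-ASCII characters; an em dash is one example whose
# UTF-8 bytes the GBK codec cannot decode.
text = 'an em dash \u2014 breaks the GBK decoder'
path = os.path.join(tempfile.mkdtemp(), 'wiki.train.tokens')
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Reading with a mismatched codec raises UnicodeDecodeError...
try:
    with open(path, 'r', encoding='gbk') as f:
        f.read()
    failed = False
except UnicodeDecodeError:
    failed = True

# ...while an explicit encoding='utf-8' reads the file back intact.
with open(path, 'r', encoding='utf-8') as f:
    roundtrip = f.read()
```

The same mismatch happens implicitly when open() is called without an encoding argument on a GBK-locale system, which is exactly the error message in Problem 2.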
Problem 3: load_data_wiki hangs and never completes
The cause is the multi-worker data loading in the load_data_wiki function above. Setting num_workers=0 inside load_data_wiki (as shown below) resolves the hang.
batch_size, max_len = 512, 64
train_iter, vocab = load_data_wiki(batch_size, max_len)
for (tokens_X, segments_X, valid_lens_x, pred_positions_X, mlm_weights_X,
     mlm_Y, nsp_y) in train_iter:
    print(tokens_X.shape, segments_X.shape, valid_lens_x.shape,
          pred_positions_X.shape, mlm_weights_X.shape, mlm_Y.shape,
          nsp_y.shape)
    break
#@save
def load_data_wiki(batch_size, max_len):
    """Load the WikiText-2 dataset."""
    num_workers = d2l.get_dataloader_workers()
    # data_dir = d2l.download_extract('wikitext-2', 'wikitext-2-v1')
    data_dir = 'D:/data/wikitext-2-v1/wikitext-2'
    paragraphs = _read_wiki(data_dir)
    train_set = _WikiTextDataset(paragraphs, max_len)
    train_iter = torch.utils.data.DataLoader(train_set, batch_size,
                                             shuffle=True, num_workers=0)
    return train_iter, train_set.vocab