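Hi everyone, I'm 小森, a guest member of the founding team of the 易編橙 lifelong growth community, a lead member of the 橙似錦 program, an Alibaba Cloud expert blogger, a Tencent Cloud content co-creator, and a CSDN quality creator in the artificial intelligence field.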
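Knowing the common ways to handle data in PyTorch is a key step toward building efficient, scalable models. Today we will use PyTorch to process data efficiently and lay a solid foundation for model training.
In the earlier linear regression model we used very little data, so we fed all of it to the model at once.
In deep learning, however, datasets are usually very large. It is not feasible to run the forward computation and backpropagation over that much data in a single pass, so we typically shuffle the whole dataset, split it into batches, and preprocess it along the way.
So next, let's learn how data loading works in PyTorch.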
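Introduction to the Dataset base class
A Dataset defines the total length of the dataset and what is returned for each sample. Template:
from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self):
        # Define the data and labels the dataset contains
        pass

    def __len__(self):
        return len(...)

    def __getitem__(self, index):
        # When the dataset is read, return a tuple containing the data and its label
        pass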
Data loading example
Data source: http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
The dataset contains 5574 text messages: 4831 normal messages (labeled "ham") and 743 spam messages (labeled "spam").
from torch.utils.data import Dataset, DataLoader
import pandas as pd

data_path = r"data/SMSSpamCollection"  # path to the data file


class SMSDataset(Dataset):
    def __init__(self):
        # The first 4 characters of each line hold the label,
        # the rest is the message text
        with open(data_path, "r", encoding="utf-8") as f:
            lines = [[i[:4].strip(), i[4:].strip()] for i in f]
        # Convert to a DataFrame
        self.df = pd.DataFrame(lines, columns=["label", "sms"])

    def __getitem__(self, index):
        single_item = self.df.iloc[index, :]
        return single_item.values[0], single_item.values[1]

    def __len__(self):
        return self.df.shape[0]
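We have now built a dataset class, SMSDataset. It loads the SMS spam dataset and wraps each message together with its label (ham or spam) in an indexable, iterable form, ready for later data loading and model training.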
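The fixed-width slice in SMSDataset relies on every label being at most four characters long. As an alternative sketch, assuming each line of the file is a label and a message separated by a tab character, the line can be split on the first tab instead. The helper name parse_sms_line and the two sample lines below are made up for illustration.

```python
def parse_sms_line(line):
    """Split one raw line into (label, message) on the first tab.

    Hypothetical helper; assumes the 'label<TAB>message' layout.
    """
    label, message = line.rstrip("\n").split("\t", 1)
    return label.strip(), message.strip()


# Made-up sample lines in the assumed layout
sample_lines = [
    "ham\tHuh y lei...\n",
    "spam\tREMINDER: reply 2 this text\n",
]
parsed = [parse_sms_line(line) for line in sample_lines]
print(parsed)
```

Splitting on the separator keeps working if a label of a different length is ever added, at the cost of raising an error on lines that contain no tab.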
d = SMSDataset()
for i in range(len(d)):
    print(i, d[i])
Output:
...
5566 ('ham', "Why don't you wait 'til at least wednesday to see if you get your .")
5567 ('ham', 'Huh y lei...')
5568 ('spam', 'REMINDER FROM O2: To get 2.50 pounds free call credit and details of great offers pls reply 2 this text with your valid name, house no and postcode')
5569 ('spam', 'This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.')
5570 ('ham', 'Will ü b going to esplanade fr home?')
5571 ('ham', 'Pity, * was in mood for that. So...any other suggestions?')
5572 ('ham', "The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free")
5573 ('ham', 'Rofl. Its true to its name')
DataLoader format
my_dataset = DataLoader(mydataset, batch_size=2, shuffle=True, num_workers=4)  # num_workers: load data with multiple processes
Example of using DataLoader:
from torch.utils.data import DataLoader

dataset = MyDataset()
data_loader = DataLoader(dataset=dataset, batch_size=10, shuffle=True, num_workers=2)

# Iterate to get the result of each batch
for index, (label, context) in enumerate(data_loader):
    print(index, label, context)
    print("*" * 100)
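- dataset: an instance of a previously defined dataset
- batch_size: the size of each batch of data; 128 and 256 are common choices
- shuffle: bool; whether to shuffle the data each time it is iterated over
- num_workers: the number of worker processes used to load data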
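To make batch_size and shuffle concrete, here is a plain-Python sketch of the index bookkeeping they imply: optionally shuffle the sample indices, then cut them into consecutive chunks. This is only a conceptual illustration, not PyTorch's actual implementation, and make_batches is a hypothetical helper name.

```python
import random


def make_batches(n_samples, batch_size, shuffle=False, seed=None):
    """Return a list of index lists, one per batch (conceptual sketch)."""
    indices = list(range(n_samples))
    if shuffle:
        # Reorder the indices; a fixed seed only makes the sketch repeatable
        random.Random(seed).shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, n_samples, batch_size)]


batches = make_batches(n_samples=5574, batch_size=10, shuffle=True, seed=0)
print(len(batches))      # number of batches
print(len(batches[-1]))  # size of the final, partial batch
```

With 5574 samples and a batch size of 10 this gives 558 batches, the last of which holds only 4 samples.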
Importing two lists into a Dataset
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self):
        # Define the data and labels the dataset contains
        self.x_data = [i for i in range(10)]
        self.y_data = [2 * i for i in range(10)]

    def __len__(self):
        return len(self.x_data)

    def __getitem__(self, index):
        # When the dataset is read, return a tuple containing the data and its label
        return self.x_data[index], self.y_data[index]


mydataset = MyDataset()
my_dataset = DataLoader(mydataset)

for x_i, y_i in my_dataset:
    print(x_i, y_i)
💬 Output:
tensor([0]) tensor([0])
tensor([1]) tensor([2])
tensor([2]) tensor([4])
tensor([3]) tensor([6])
tensor([4]) tensor([8])
tensor([5]) tensor([10])
tensor([6]) tensor([12])
tensor([7]) tensor([14])
tensor([8]) tensor([16])
tensor([9]) tensor([18])
💬 If batch_size is changed to 2, the output becomes:
tensor([0, 1]) tensor([0, 2])
tensor([2, 3]) tensor([4, 6])
tensor([4, 5]) tensor([ 8, 10])
tensor([6, 7]) tensor([12, 14])
tensor([8, 9]) tensor([16, 18])
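- As we can see, batch_size controls how many samples are returned in each batch
- num_workers controls how many worker processes are used to speed up data loading. The right value depends on the number of CPU cores on your machine; try not to exceed the core count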
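One hedged way to follow that advice is to derive num_workers from the machine's core count. The sketch below uses os.cpu_count() from the standard library; the default cap of 4 and the helper name pick_num_workers are arbitrary choices for illustration, not a PyTorch recommendation.

```python
import os


def pick_num_workers(requested=4):
    """Clamp the requested worker count to the number of CPU cores."""
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(0, min(requested, cores))


num_workers = pick_num_workers()
print(num_workers)
```

The result can then be passed as the num_workers argument of DataLoader; a value of 0 means the data is loaded in the main process.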
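We can also skip DataLoader altogether, but then there is no batching: samples can only be fetched one at a time with for i in range(len(d)), and there is no automatic shuffling and no parallel loading.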
dataset = SMSDataset()  # pass an instance, not the Dataset class itself
data_loader = DataLoader(dataset=dataset, batch_size=10, shuffle=True, num_workers=2)

# Get the result of each batch
for index, (label, context) in enumerate(data_loader):
    print(index, label, context)
    print("*" * 100)
Output:
555 ('ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam') ("I forgot 2 ask ü all smth.. There's a card on da present lei... How? ü all want 2 write smth or sign on it?", 'Am i that much dirty fellow?', 'have got * few things to do. may be in * pub later.', 'Ok lor. Anyway i thk we cant get tickets now cos like quite late already. U wan 2 go look 4 ur frens a not? Darren is wif them now...', 'When you came to hostel.', 'Well i know Z will take care of me. So no worries.', 'I REALLY NEED 2 KISS U I MISS U MY BABY FROM UR BABY 4EVA', 'Booked ticket for pongal?', 'Awww dat is sweet! We can think of something to do he he! Have a nice time tonight ill probably txt u later cos im lonely :( xxx.', 'We tried to call you re your reply to our sms for a video mobile 750 mins UNLIMITED TEXT + free camcorder Reply of call 08000930705 Now')
****************************************************************************************************
556 ('ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam') (':-( sad puppy noise', 'G.W.R', 'Otherwise had part time job na-tuition..', 'They finally came to fix the ceiling.', 'The word "Checkmate" in chess comes from the Persian phrase "Shah Maat" which means; "the king is dead.." Goodmorning.. Have a good day..:)', 'Yup', 'I am real, baby! I want to bring out your inner tigress...', 'THANX4 TODAY CER IT WAS NICE 2 CATCH UP BUT WE AVE 2 FIND MORE TIME MORE OFTEN OH WELL TAKE CARE C U SOON.C', "She said,'' do u mind if I go into the bedroom for a minute ? '' ''OK'', I sed in a sexy mood. She came out 5 minuts latr wid a cake...n My Wife,", 'Ur cash-balance is currently 500 pounds - to maximize ur cash-in now send COLLECT to 83600 only 150p/msg. CC: 08718720201 PO BOX 114/14 TCR/W1')
****************************************************************************************************
557 ('ham', 'ham', 'ham', 'ham') ('It shall be fine. I have avalarr now. Will hollalater', "Nah it's straight, if you can just bring bud or drinks or something that's actually a little more useful than straight cash", 'U sleeping now.. Or you going to take? Haha.. I got spys wat.. Me online checking n replying mails lor..', 'In other news after hassling me to get him weed for a week andres has no money. HAUGHAIGHGTUJHYGUJ')
****************************************************************************************************
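The output above shows each batch as one tuple of labels and one tuple of messages, rather than a list of (label, message) pairs. Conceptually, the samples of a batch are regrouped field by field. The plain-Python sketch below mimics that regrouping with zip; it illustrates the idea only and is not DataLoader's actual collation code, and the sample data is made up.

```python
# Three made-up (label, message) samples, as __getitem__ would return them
samples = [
    ("ham", "Huh y lei..."),
    ("spam", "Reply now to claim"),
    ("ham", "Yup"),
]

# zip(*samples) transposes the list of pairs into one tuple per field
labels, messages = zip(*samples)
print(labels)
print(messages)
```

Numeric fields are additionally stacked into tensors, which is why the two-list example printed tensor([0, 1]) tensor([0, 2]) for a batch size of 2.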
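Importing Excel data into a Dataset
💥 A dataset is just a class, so its data can be imported from an external source. We can also have the dataset perform extra operations on the data before returning it, and a sample does not have to consist of exactly two values.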
pip install pandas
pip install openpyxl
import pandas as pd
from torch.utils.data import Dataset, DataLoader


class myDataset(Dataset):
    def __init__(self, data_loc):
        # Note: openpyxl reads .xlsx files; a legacy .xls file needs xlrd instead
        data = pd.read_excel(data_loc)
        self.x1, self.x2, self.x3, self.x4, self.y = (
            data['x1'], data['x2'], data['x3'], data['x4'], data['y'])

    def __len__(self):
        return len(self.x1)

    def __getitem__(self, idx):
        return self.x1[idx], self.x2[idx], self.x3[idx], self.x4[idx], self.y[idx]


# Raw string so the backslashes in the Windows path are taken literally
mydataset = myDataset(data_loc=r'e:\pythonProject Pytorch1\data.xls')
my_dataset = DataLoader(mydataset, batch_size=2)
for x1_i, x2_i, x3_i, x4_i, y_i in my_dataset:
    print(x1_i, x2_i, x3_i, x4_i, y_i)
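💯 Loading official datasets
Some datasets are provided by PyTorch itself, in the torchvision and torchtext packages.
- torchvision provides APIs and data for image processing
  - Data location: torchvision.datasets, for example torchvision.datasets.MNIST (handwritten digit images)
- torchtext provides APIs and data for text processing
  - Data location: torchtext.datasets, for example torchtext.datasets.IMDB (movie review text)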
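Taking the MNIST handwritten digits as an example, let's see how PyTorch loads one of its built-in datasets. The call has the following form:
torchvision.datasets.MNIST(root='/files/', train=True, download=True, transform=)
- root: where the data is stored
- train: bool; whether to use the training set or the test set
- download: bool; whether to download the data into the root directory
- transform: a function that processes each image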
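The transform argument is a callable applied to each sample as it is read. The sketch below shows that pattern with a small hypothetical class, TransformedList, written in plain Python so it runs without downloading anything. It mimics the idea only and is not torchvision's code; the scaling function is likewise an assumed example.

```python
class TransformedList:
    """Hypothetical map-style dataset that applies `transform` on access."""

    def __init__(self, items, transform=None):
        self.items = items
        self.transform = transform

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        item = self.items[index]
        # Apply the processing function only if one was given
        if self.transform is not None:
            item = self.transform(item)
        return item


def scale_to_unit(pixels):
    """Map 0-255 pixel values to the range 0.0-1.0 (assumed example)."""
    return [p / 255 for p in pixels]


raw = [[0, 255], [51, 102]]  # made-up two-pixel "images"
ds = TransformedList(raw, transform=scale_to_unit)
print(ds[0])
print(ds[1])
```

Because the function runs inside __getitem__, the stored data stays untouched and a different transform can be swapped in without rebuilding the dataset.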