Competition
2024 China Collegiate Computing Contest — Big Data Challenge
Experience Share
Hello everyone, I'm the captain of the 掃地僧 (Sweeping Monk) team. I haven't entered many leaderboard-style competitions before, so I don't know many leaderboard tricks; instead, I'd like to share a few things from a research perspective.
This competition was approached in five steps: dataset construction 👉 baseline selection 👉 model optimization 👉 hyperparameter tuning 👉 model ensembling.
1. Dataset Construction
The dataset is provided by the organizers. One idea is to filter, by temperature, for stations whose climate resembles China's, though it's not certain this actually helps:
```python
import os

import numpy as np
import pandas as pd

root_path = '../dataset/global'
data_path = 'temp.npy'
data = np.load(os.path.join(root_path, data_path))
data_oneyear = data[:365 * 24, :, 0]  # first year of hourly readings
df = pd.DataFrame(data_oneyear)

# summer mean temperature above 15 °C
summer_df = df.iloc[4000:5500]
print(summer_df.shape)
summer_index = summer_df.mean(axis=0).apply(lambda x: x > 15)
summer_index = summer_index[summer_index].index.to_list()
print(len(summer_index))

# winter mean temperature below 20 °C
winter_df = df.iloc[0:500]
print(winter_df.shape)
winter_index = winter_df.mean(axis=0).apply(lambda x: x < 20)
winter_index = winter_index[winter_index].index.to_list()
print(len(winter_index))

# intersect the two station lists
index = list(set(summer_index) & set(winter_index))
print(len(index))

# filter the stations
root_path = '../dataset/global'
temp_path = 'temp.npy'
wind_path = 'wind.npy'
global_data_path = 'global_data.npy'
temp_data = np.load(os.path.join(root_path, temp_path))
wind_data = np.load(os.path.join(root_path, wind_path))
global_data = np.load(os.path.join(root_path, global_data_path))
print(temp_data.shape)
print(wind_data.shape)
print(global_data.shape)

temp_selected = temp_data[:, index, :]
wind_selected = wind_data[:, index, :]
global_selected = global_data[:, :, :, index]
print(temp_selected.shape)
print(wind_selected.shape)
print(global_selected.shape)

# split into training and validation sets
l = temp_selected.shape[0]
train_size = int(l * 0.9)
temp_selected_train = temp_selected[:train_size, :, :]
wind_selected_train = wind_selected[:train_size, :, :]
global_selected_train = global_selected[:int(train_size / 3), :, :]
temp_selected_val = temp_selected[train_size:, :, :]
wind_selected_val = wind_selected[train_size:, :, :]
global_selected_val = global_selected[int(train_size / 3):, :, :]
print("train:", temp_selected_train.shape, wind_selected_train.shape, global_selected_train.shape)
print("val:", temp_selected_val.shape, wind_selected_val.shape, global_selected_val.shape)

# save the training and validation sets
selected_path = os.path.join('../dataset', 'selected_global_train_val')
if not os.path.exists(selected_path):
    os.makedirs(selected_path)
np.save(os.path.join(selected_path, 'temp_train.npy'), temp_selected_train)
np.save(os.path.join(selected_path, 'temp_val.npy'), temp_selected_val)
np.save(os.path.join(selected_path, 'wind_train.npy'), wind_selected_train)
np.save(os.path.join(selected_path, 'wind_val.npy'), wind_selected_val)
np.save(os.path.join(selected_path, 'global_train.npy'), global_selected_train)
np.save(os.path.join(selected_path, 'global_val.npy'), global_selected_val)
```
The shapes of the filtered temperature and wind-speed arrays are shown in the figure:
2. Baseline Selection
The official baseline is iTransformer; for a walkthrough of the model, see: 【PaperInFive — Time-Series Forecasting】iTransformer: the inverted Transformer that set a new time-series forecasting SOTA (Tsinghua).
It's also worth keeping an eye on SOTA models open-sourced in the last two years. Here is a GitHub repo that collects recent SOTA papers: https://github.com/ddz16/TSFpaper
3. Model Optimization
Once you've settled on a strong baseline, you can start optimizing the model. For example, iTransformer only models cross-variate information, so you can add explicit modeling of the temporal dimension — say, convolutions over time, or self-attention along the time axis. For temporal-modeling ideas, the SOTA papers are a good reference; combining modules from different papers can sometimes work surprisingly well.
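As a sketch of what such a temporal add-on might look like (the module and its names are illustrative, not part of the official baseline), here is a small block that combines a depthwise convolution for local patterns with self-attention over the time axis for long-range dependencies:

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Hypothetical add-on that models the time dimension,
    which iTransformer's variate-wise attention does not."""
    def __init__(self, d_model, n_heads=4, kernel_size=3):
        super().__init__()
        # depthwise conv captures local temporal patterns
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        # self-attention over the time axis captures long-range dependencies
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):  # x: [batch, time, d_model]
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)  # conv expects [B, C, T]
        a, _ = self.attn(y, y, y)                         # temporal self-attention
        return self.norm(x + a)                           # residual + norm

x = torch.randn(2, 96, 64)  # batch=2, seq_len=96, d_model=64
out = TemporalBlock(64)(x)
print(out.shape)
```

A block like this could be inserted before or after the baseline's encoder layers; where exactly it helps is something to find by experiment.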
4. Hyperparameter Tuning
Once the architecture is fixed, you can tune the hyperparameters: model dimension and depth, learning rate, batch size, and so on. In my tests, increasing the model dimension improved performance up to a point, while adding more layers did not seem to help much.
For the learning rate, I started at 0.01 or 0.005 and halved it every epoch. I set the batch size to 40960.
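The decay scheme above can be sketched with a standard PyTorch scheduler (the tiny linear model here is only a stand-in for the real one):

```python
import torch

# start at 0.01 and halve every epoch: lr_epoch = 0.01 * 0.5 ** epoch
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda epoch: 0.5 ** epoch)

lrs = []
for epoch in range(4):
    # ... one epoch of training would go here ...
    lrs.append(optimizer.param_groups[0]['lr'])
    scheduler.step()
print(lrs)  # [0.01, 0.005, 0.0025, 0.00125]
```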
5. Model Ensembling
Finally, you can ensemble models built on different features: for instance, average the predictions of several models, or compute a weighted sum at training time with a Mixture-of-Experts scheme.
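A minimal sketch of the averaging idea (the predictions are random stand-ins for real model outputs, and the weights are hypothetical, e.g. derived from each model's validation loss):

```python
import numpy as np

# three stand-in prediction arrays of shape [samples, pred_len, stations]
rng = np.random.default_rng(0)
preds = [rng.normal(size=(4, 24, 10)) for _ in range(3)]

# equal-weight average of the model outputs
ensemble = np.mean(preds, axis=0)

# weighted average with hypothetical weights (e.g. from validation loss)
weights = np.array([0.5, 0.3, 0.2])
weighted = np.tensordot(weights, np.stack(preds), axes=1)
print(ensemble.shape, weighted.shape)  # (4, 24, 10) (4, 24, 10)
```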
Helper Code
1. Model Evaluation
Add this to exp_long_term_forecasting.py:
```python
def val(self, setting):
    _, _, val_data, val_loader = self._get_data()
    criterion = self._select_criterion()
    self.model.load_state_dict(
        torch.load(self.args.state_dict_path, map_location=torch.device('cuda:0')))
    self.model.eval()
    val_loss = []
    with torch.no_grad():  # no gradients needed during validation
        for i, (batch_x, batch_y) in enumerate(val_loader):
            batch_x = batch_x.float().to(self.device)
            batch_y = batch_y.float().to(self.device)
            # encoder - decoder
            if self.args.use_amp:
                with torch.cuda.amp.autocast():
                    if self.args.output_attention:
                        outputs = self.model(batch_x)[0]
                    else:
                        outputs = self.model(batch_x)
                    f_dim = -1 if self.args.features == 'MS' else 0
                    outputs = outputs[:, -self.args.pred_len:, f_dim:]
                    batch_y = batch_y[:, -self.args.pred_len:, f_dim:].to(self.device)
                    loss = criterion(outputs, batch_y)
                    print("\titers: {0} | loss: {1:.7f}".format(i + 1, loss.item()))
                    val_loss.append(loss.item())
            else:
                if self.args.output_attention:
                    outputs = self.model(batch_x)[0]
                else:
                    outputs = self.model(batch_x)
                f_dim = -1 if self.args.features == 'MS' else 0
                outputs = outputs[:, -self.args.pred_len:, f_dim:]
                batch_y = batch_y[:, -self.args.pred_len:, f_dim:].to(self.device)
                loss = criterion(outputs, batch_y)
                if (i + 1) % 50 == 0:
                    print("\titers: {0} | loss: {1:.7f}".format(i + 1, loss.item()))
                val_loss.append(loss.item())
    val_loss = np.average(val_loss)
    print("Val Loss: {0:.7f}".format(val_loss))
    return self.model
```
2. Validation-Set DataLoader
Add this to data_factory.py:
```python
def data_provider(args):
    Data = data_dict[args.data]
    shuffle_flag = True
    drop_last = False
    batch_size = args.batch_size

    train_data_set = Data(
        root_path=args.root_path,
        data_path=args.train_data_path,
        global_path=args.train_global_path,
        size=[args.seq_len, args.label_len, args.pred_len],
        features=args.features)
    train_data_loader = DataLoader(
        train_data_set,
        batch_size=batch_size,
        shuffle=shuffle_flag,
        num_workers=args.num_workers,
        drop_last=drop_last)

    val_data_set = Data(
        root_path=args.root_path,
        data_path=args.val_data_path,
        global_path=args.val_global_path,
        size=[args.seq_len, args.label_len, args.pred_len],
        features=args.features)
    # validation uses a smaller batch and no shuffling
    val_data_loader = DataLoader(
        val_data_set,
        batch_size=int(batch_size / 8),
        shuffle=False,
        num_workers=args.num_workers,
        drop_last=drop_last)

    return train_data_set, train_data_loader, val_data_set, val_data_loader
```
Closing
I hope we can all treat the competition as a way to make friends, improve together, and keep sharing useful tricks with each other.