Molecular AI Prediction Competition Notes

#AI Summer Camp #Datawhale #Summer Camp

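Task 1: Run the baseline

Follow Task 1 to get the baseline running end to end.

Register an account

Register directly, or log in with an existing Baidu account.

Fork the project

Zero-Background Introduction to AI Data Mining Competitions: Speed-Run Baseline - PaddlePaddle AI Studio (Galaxy Community)

Start the project

Choose a runtime environment and click OK. Unless you have special requirements, the default basic tier is enough.

Wait a moment for the online project to start.

Run the project code

Click "Run All Cells".

When the run finishes, it produces the file submit.csv.

This is the file you submit.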

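Task 2: In-depth analysis of the competition problem

Understand the problem and learn the general workflow of a machine learning competition.

Understanding the data fields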


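Study and analyze the following fields: Smiles, Assay (DC50/Dmax), Assay (Protac to Target, IC50), Assay (Cellular activities, IC5, Article DOI, and InChI.

Prediction target

Participants predict the degradation ability of PROTACs, that is, the value of the Label field.

Degradation ability is judged from DC50 and Dmax: if DC50 is greater than 100 nM and Dmax is less than 80%, Label is 0; if DC50 is less than or equal to 100 nM or Dmax is greater than or equal to 80%, Label is 1.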
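
The labeling rule above can be written as a small function. This is a minimal sketch for checking one's understanding of the rule; the function name protac_label and the sample values are hypothetical and are not taken from the competition data.

```python
def protac_label(dc50_nm, dmax_pct):
    """Return 1 if DC50 <= 100 nM or Dmax >= 80 %, otherwise 0."""
    if dc50_nm <= 100 or dmax_pct >= 80:
        return 1
    return 0


# Hypothetical (DC50 in nM, Dmax in %) pairs covering each branch of the rule.
examples = [(50.0, 60.0), (500.0, 90.0), (500.0, 60.0), (100.0, 80.0)]
labels = [protac_label(dc50, dmax) for dc50, dmax in examples]
print(labels)  # [1, 1, 0, 1]
```

Only a molecule that fails both thresholds gets Label 0; meeting either one is enough for Label 1.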

Zero-Background Introduction to AI (Machine Learning) Competitions - Feishu Docs
https://datawhaler.feishu.cn/wiki/Ue7swBbiJiBhsdk5SupcqfL7nLX


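Task 3: Initial parameter tuning

Original work by 溫酒相隨, teaching assistant of study group 9; edited and adjusted by teaching assistant 九月; first published on Bilibili.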

https://www.bilibili.com/read/cv35897986/?jump_opus=1

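Import libraries and load the training and test sets

# 1. Import the required libraries
# pandas for data processing and analysis
import pandas as pd
# numpy for scientific computing and multi-dimensional arrays
import numpy as np
# LGBMClassifier from the lightgbm module
from lightgbm import LGBMClassifier
# Imports used by the later steps: RDKit for SMILES handling,
# scikit-learn for TF-IDF, K-fold splitting and F1, CatBoost for the classifier
from rdkit import Chem
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score
from catboost import CatBoostClassifier

# 2. Read the training and test sets
# read_excel() loads the training data from './data/train.xlsx'
train = pd.read_excel('./data/train.xlsx')
# read_excel() loads the test data from './data/test.xlsx'
test = pd.read_excel('./data/test.xlsx')
train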

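Inspect data types

# info() prints the summary itself and returns None, so call it directly
train.info()

Some columns have very few non-null entries. They can be filtered out to reduce the risk of overfitting.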

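# Filter
train = train.iloc[:,1:]
test = test.iloc[:,1:]
# Keep all rows; keep the columns from index 1 onward (this drops the first column)
# train['lan'].value_counts()# language

List the object-typed columns


# List the object-typed columns
train.select_dtypes(include = 'object').columns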

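Check missing values

# Count missing values per column and show only the columns that have any
temp = train.isnull().sum()
temp[temp > 0]

Count unique values

# Collect the column names whose unique values will be counted
# fea = train.columns
fea = train.columns.tolist()
fea

Print the unique-value counts

# Print the number of unique values per column
for f in fea:
    print(f, train[f].nunique())  # nunique() counts the distinct values in a column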

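Filter columns

# cols is an empty list that will hold the names of columns with fewer than 10 non-null values in the test set.
cols = []
for f in test.columns:
    if test[f].notnull().sum() < 10:
        cols.append(f)
cols
# Use drop to remove these columns from both the training and test sets, so that columns made up mostly of missing values are not used in later analysis or modeling
train = train.drop(cols, axis=1)
test = test.drop(cols, axis=1)
# Use pd.concat to merge the cleaned training and test sets into one DataFrame named data, so feature engineering can be applied uniformly
data = pd.concat([train, test], axis=0, ignore_index=True)
newData = data.columns[2:]

Convert each SMILES string to a molecule object, then back to a SMILES string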

data['smiles_list'] = data['Smiles'].apply(lambda x:[Chem.MolToSmiles(mol, isomericSmiles=True) for mol in [Chem.MolFromSmiles(x)]])
data['smiles_list'] = data['smiles_list'].map(lambda x: ' '.join(x))

Compute TF-IDF with TfidfVectorizer

tfidf = TfidfVectorizer(max_df = 0.9, min_df = 1, sublinear_tf = True)
res = tfidf.fit_transform(data['smiles_list'])
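
It can help to see which tokens the vectorizer above actually counts. The following is a minimal sketch, assuming TfidfVectorizer is left at its defaults for token_pattern and lowercase (the call above does not override either). It mimics that tokenization with the standard-library re module only; the helper name default_tokens and the aspirin SMILES are illustrative and are not part of the competition data.

```python
import re

# Default token_pattern of scikit-learn's TfidfVectorizer: runs of two or
# more word characters. The vectorizer also lowercases text by default.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_tokens(text):
    """Tokenize text the way the default word analyzer would (assumed defaults)."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())


aspirin = "CC(=O)Oc1ccccc1C(=O)O"  # illustrative SMILES, not from the dataset
tokens = default_tokens(aspirin)
print(tokens)  # ['cc', 'oc1ccccc1c']
```

Under these defaults, single-character atoms and bond symbols such as "=" and "(" are dropped, so the TF-IDF features describe fairly coarse fragments. Passing analyzer='char' together with an ngram_range is one alternative that may be worth trying.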

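Convert to DataFrame format

# Convert the result to a DataFrame
tfidf_df = pd.DataFrame(res.toarray())
tfidf_df.columns = [f'smiles_tfidf_{i}' for i in range(tfidf_df.shape[1])]
# Merge into data column-wise
data = pd.concat([data, tfidf_df], axis=1)

Natural-number encoding

# Natural-number (label) encoding
def label_encode(series):
    unique = list(series.unique())
    # use len(unique) so the mapping covers every distinct value
    return series.map(dict(zip(unique, range(len(unique)))))
# Encode every object-typed column; newData holds the columns kept after filtering
# (cols holds the dropped columns, which no longer exist in data)
for col in newData:
    if data[col].dtype == 'object':
        data[col] = label_encode(data[col])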
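
The idea behind natural-number encoding can be shown without pandas. This is a plain-Python sketch of the same technique (each distinct value gets an integer in order of first appearance); the helper name label_encode_list and the category strings are hypothetical.

```python
def label_encode_list(values):
    """Map each distinct value to an integer in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return [mapping[value] for value in values], mapping


# Hypothetical categorical column.
codes, mapping = label_encode_list(["VHL", "CRBN", "VHL", "IAP", "CRBN"])
print(codes)    # [0, 1, 0, 2, 1]
print(mapping)  # {'VHL': 0, 'CRBN': 1, 'IAP': 2}
```

The integers carry no ordering information; they are arbitrary identifiers, which tree-based models such as CatBoost or LightGBM can usually handle.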

Build the training and test sets

# Rows of data whose Label is not null become the training set; reset the index
train = data[data.Label.notnull()].reset_index(drop=True)
# Rows of data whose Label is null become the test set; reset the index
test = data[data.Label.isnull()].reset_index(drop=True)
# Feature selection
features = [f for f in train.columns if f not in ['uuid','Label','smiles_list']]
# Build the training and test feature matrices
x_train = train[features]
x_test = test[features]
# Training labels
y_train = train['Label'].astype(int)

Use 5-fold cross-validation (KFold(n_splits=5))

def cv_model(clf, train_x, train_y, test_x, clf_name, seed=2022):
    # Run 5-fold cross-validation
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])
    test = np.zeros(test_x.shape[0])
    cv_scores = []
    # In each fold, the train and validation indices split the data into a training part and a validation part
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} {}************************************'.format(str(i+1), str(seed)))
        trn_x, trn_y = train_x.iloc[train_index], train_y.iloc[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y.iloc[valid_index]
        # Configure the CatBoost classifier; random_seed follows the seed argument
        params = {'learning_rate': 0.05, 'depth': 8, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
                  'random_seed': seed, 'od_type': 'Iter', 'od_wait': 100,
                  'allow_writing_files': False, 'task_type': 'CPU'}
        # Train the model with the CatBoost classifier
        model = clf(iterations=20000, **params, eval_metric='AUC')
        model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                  metric_period=100, cat_features=[], use_best_model=True, verbose=1)
        val_pred = model.predict_proba(val_x)[:, 1]
        test_pred = model.predict_proba(test_x)[:, 1]
        # Store out-of-fold predictions and accumulate the averaged test predictions
        train[valid_index] = val_pred
        test += test_pred / kf.n_splits
        cv_scores.append(f1_score(val_y, np.where(val_pred > 0.5, 1, 0)))
        print(cv_scores)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test

cat_train, cat_test = cv_model(CatBoostClassifier, x_train, y_train, x_test, "cat")

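This code defines a cross-validation function for training and evaluating a classifier. Specifically, it runs 5-fold cross-validation with a CatBoost classifier on the given training data and returns the predictions for the training set and the test set.

The function's parameters are:

clf: the classifier class, here CatBoostClassifier.
train_x, train_y: the features and labels of the training data.
test_x: the features of the test data.
clf_name: the classifier's name, used when printing results.
seed: the random seed, 2022 by default.

The main flow of the function is:

Create a 5-fold cross-validator (KFold).
Initialize the prediction arrays for the training set and the test set.
In each fold, split the data into a training part and a validation part using the train and validation indices.
Configure the CatBoost classifier's parameters and train the model on the training part.
Predict on the validation part and the test set: validation predictions are written to their own rows of the training array, and each fold's test predictions are added to the test array with weight 1/5.
Compute and store the F1 score on each fold's validation part.
Print the list of per-fold F1 scores, their mean, and their standard deviation.
Return the predictions for the training set and the test set.

Calling this function gives the cross-validation results of the CatBoost classifier on the given data, which can be used to evaluate the model and to obtain predictions for both the training set and the test set.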
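
The bookkeeping in the steps above (out-of-fold predictions for the training rows, averaged predictions for the test rows, one F1 score per fold) can be sketched with the standard library alone. This is only an illustration under stated assumptions: kfold_indices and f1_binary are hypothetical helpers standing in for KFold and f1_score, the labels are toy values, and the "model" is a placeholder that predicts the positive rate of its training fold for every row.

```python
import random

N_SPLITS = 5


def kfold_indices(n_samples, n_splits=N_SPLITS, seed=2022):
    """Yield (train_idx, valid_idx) pairs from a shuffled index list."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        train_idx = [j for f, fold in enumerate(folds) if f != k for j in fold]
        yield train_idx, folds[k]


def f1_binary(y_true, y_pred):
    """F1 score for 0/1 labels; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


y = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # toy labels
n_test = 4                          # toy test-set size
oof = [0.0] * len(y)                # out-of-fold predictions for training rows
test_pred = [0.0] * n_test          # averaged predictions for test rows
scores = []

for train_idx, valid_idx in kfold_indices(len(y)):
    # Placeholder "model": the positive rate of the training part of this fold.
    rate = sum(y[i] for i in train_idx) / len(train_idx)
    for i in valid_idx:
        oof[i] = rate               # each training row is predicted exactly once
    test_pred = [p + rate / N_SPLITS for p in test_pred]
    y_valid = [y[i] for i in valid_idx]
    scores.append(f1_binary(y_valid, [int(oof[i] > 0.5) for i in valid_idx]))

print(len(scores), test_pred)
```

Because every training row lands in exactly one validation fold, the out-of-fold array contains only predictions made by a model that never saw that row.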

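Output the results

from datetime import datetime
current_time = datetime.now()  # get the current time
formatted_time = current_time.strftime("%Y-%m-%d %H:%M:%S")  # format the time
# print("Current time:", current_time)
# print("Formatted time:", formatted_time)
# Assumption: pred is not defined earlier in these notes; here it is taken to be cat_test
# thresholded at 0.5, the same threshold used for the F1 score in cv_model
pred = np.where(cat_test > 0.5, 1, 0)
# 5. Save the result file locally
pd.DataFrame({'uuid': test['uuid'], 'Label': pred}
).to_csv(formatted_time + '.csv', index=None)

The local torch part was not used.

This summer camp is not simple #AI Summer Camp #Datawhale #Summer Camp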
