Table of Contents
1. Data Overview and Preprocessing Goals
2. Preprocessing Steps and Code Walkthrough
2.1 Data Loading and Initial Cleaning
2.2 Label Encoding
2.3 Missing-Value Handling
(1) Dropping Samples with Missing Values
(2) Per-Class Mean Fill
(3) Per-Class Median Fill
(4) Per-Class Mode Fill
(5) Linear Regression Fill
(6) Random Forest Fill
2.4 Feature Standardization
2.5 Dataset Splitting and Class Balancing
(1) Train/Test Split
(2) Handling Class Imbalance
2.6 Saving the Data
3. Full Code
4. Preprocessing Summary
1. Data Overview and Preprocessing Goals
This mineral classification system is built on 礦物數據.xlsx, which contains 1,044 samples covering 13 elemental features (chlorine, sodium, magnesium, etc.) and 4 mineral-type labels (A, B, C, D). Because the data is sensitive, it cannot be shared; only the code is provided, for learning purposes. The core goal of preprocessing is to normalize formats, handle missing values, and balance the classes so that model training receives reliable input.
2. Preprocessing Steps and Code Walkthrough
2.1 Data Loading and Initial Cleaning
First, load the data and remove invalid entries:
import pandas as pd

# Load the data, keeping only valid samples
data = pd.read_excel('礦物數據.xlsx', sheet_name='Sheet1')
# Drop the special class E (only 1 sample, no statistical value)
data = data[data['礦物類型'] != 'E']
# Convert dtypes: turn non-numeric symbols (e.g. "/", spaces) into NaN
for col in data.columns:
    if col not in ['序號', '礦物類型']:  # skip the ID and label columns
        # errors='coerce' turns non-numeric values into NaN
        data[col] = pd.to_numeric(data[col], errors='coerce')
This step resolves the inconsistent formatting in the raw data and guarantees that every feature column is numeric, laying the groundwork for the steps that follow.
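As a minimal illustration of the coercion step (the values here are made up, not taken from the real dataset):

```python
import pandas as pd

# Hypothetical raw column containing the kinds of junk entries described above
raw = pd.Series(['12.5', '/', ' ', '3.0'])
cleaned = pd.to_numeric(raw, errors='coerce')
print(cleaned.isnull().sum())  # the '/' and blank entries become NaN -> 2
```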
2.2 Label Encoding
Convert the letter labels (A/B/C/D) into integers the model can consume:
# Build the label mapping
label_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
# Map the labels, keeping the DataFrame structure
data['礦物類型'] = data['礦物類型'].map(label_dict)
# Separate features and labels
X = data.drop(['序號', '礦物類型'], axis=1)  # feature matrix
y = data['礦物類型']  # label vector
After encoding, the labels range over 0-3, matching the input format expected by machine learning models.
2.3 Missing-Value Handling
Six strategies were designed for the missing values in the data:
(1) Dropping Samples with Missing Values
Suitable when the missing rate is very low (<1%); simply discard the affected rows:
def drop_missing(train_data, train_label):
    # Concatenate features and labels so rows can be dropped together
    combined = pd.concat([train_data, train_label], axis=1)
    combined = combined.reset_index(drop=True)  # reset the index to avoid gaps after dropping
    cleaned = combined.dropna()  # drop rows containing missing values
    # Split features and labels again
    return cleaned.drop('礦物類型', axis=1), cleaned['礦物類型']
The advantage of this method is that it introduces no bias; the drawback is that it may discard useful information when the missing rate is high.
(2) Per-Class Mean Fill
For numeric features, compute the mean within each mineral-type group and fill missing values with the group mean (reducing cross-class interference):
def mean_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1)
    combined = combined.reset_index(drop=True)
    # Fill within each mineral-type group
    filled_groups = []
    for type_id in combined['礦物類型'].unique():
        group = combined[combined['礦物類型'] == type_id]
        # Fill this group's missing values with the group's column means
        filled_group = group.fillna(group.mean())
        filled_groups.append(filled_group)
    # Recombine the groups
    filled = pd.concat(filled_groups, axis=0).reset_index(drop=True)
    return filled.drop('礦物類型', axis=1), filled['礦物類型']
This suits features with fairly uniform distributions, and it avoids blurring the means across different classes.
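The per-group loop above can also be written more compactly with groupby/transform; a minimal sketch on toy data (the column names here are illustrative, not from the real dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'type': [0, 0, 1, 1],
    'feat': [1.0, None, 10.0, 20.0],
})
# Replace each NaN with the mean of its own class group
df['feat'] = df.groupby('type')['feat'].transform(lambda s: s.fillna(s.mean()))
print(df['feat'].tolist())  # [1.0, 1.0, 10.0, 20.0]
```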
(3) Per-Class Median Fill
When a feature has extreme values (e.g. a few samples whose sodium content far exceeds the mean), filling with the median is more robust:
def median_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1)
    combined = combined.reset_index(drop=True)
    filled_groups = []
    for type_id in combined['礦物類型'].unique():
        group = combined[combined['礦物類型'] == type_id]
        # The median is insensitive to extreme values
        filled_group = group.fillna(group.median())
        filled_groups.append(filled_group)
    filled = pd.concat(filled_groups, axis=0).reset_index(drop=True)
    return filled.drop('礦物類型', axis=1), filled['礦物類型']
(4) Per-Class Mode Fill
For discrete features (e.g. element contents encoded as integers), fill with the mode (the most frequent value):
def mode_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1)
    combined = combined.reset_index(drop=True)
    filled_groups = []
    for type_id in combined['礦物類型'].unique():
        group = combined[combined['礦物類型'] == type_id]
        # Take each column's mode; return None when a column has no mode
        fill_values = group.apply(lambda x: x.mode().iloc[0] if not x.mode().empty else None)
        filled_group = group.fillna(fill_values)
        filled_groups.append(filled_group)
    filled = pd.concat(filled_groups, axis=0).reset_index(drop=True)
    return filled.drop('礦物類型', axis=1), filled['礦物類型']
(5) Linear Regression Fill
Exploit linear correlations between features (e.g. the relationship between chlorine and sodium content) to predict missing values:
from sklearn.linear_model import LinearRegression

def linear_reg_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1).reset_index(drop=True)
    features = combined.drop('礦物類型', axis=1)
    # Process columns in ascending order of missing count, so that columns
    # completed earlier can serve as predictors for later ones
    null_counts = features.isnull().sum().sort_values()
    predictors = []  # columns that are already complete
    for col in null_counts.index:
        if null_counts[col] > 0:
            not_null = features[col].notnull()
            # Training data: predict the current column from the completed columns
            X_train = features.loc[not_null, predictors]
            y_train = features.loc[not_null, col]
            # Rows to fill (current column missing)
            X_pred = features.loc[~not_null, predictors]
            # Fit a linear regression model
            lr = LinearRegression()
            lr.fit(X_train, y_train)
            # Predict and fill the missing values
            features.loc[~not_null, col] = lr.predict(X_pred)
        predictors.append(col)
    return features, combined['礦物類型']
This method requires some linear relationship among the features, making it suitable when element contents vary in proportion to each other.
(6) Random Forest Fill
For nonlinear relationships between features, predict missing values with a random forest model:
from sklearn.ensemble import RandomForestRegressor

def rf_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1).reset_index(drop=True)
    features = combined.drop('礦物類型', axis=1)
    null_counts = features.isnull().sum().sort_values()
    predictors = []  # columns already complete, usable as predictors
    for col in null_counts.index:
        if null_counts[col] > 0:
            not_null = features[col].notnull()
            # Separate training rows from the rows to fill
            X_train = features.loc[not_null, predictors]
            y_train = features.loc[not_null, col]
            X_pred = features.loc[~not_null, predictors]
            # Random forest regressor (100 trees, fixed seed for reproducibility)
            rfr = RandomForestRegressor(n_estimators=100, random_state=10)
            rfr.fit(X_train, y_train)
            # Fill with the predictions
            features.loc[~not_null, col] = rfr.predict(X_pred)
        predictors.append(col)
    return features, combined['礦物類型']
Random forests can capture complex relationships between features, so the fill accuracy is usually higher than linear methods, at a somewhat higher computational cost.
2.4 Feature Standardization
The element contents differ greatly in scale (sodium can reach the thousands, while selenium is mostly 0-1), so the scale differences must be removed:
from sklearn.preprocessing import StandardScaler

def standardize_features(X_train, X_test):
    # Standardize with the training set's mean and std (avoids test-set leakage)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # fit on the training set, then transform it
    X_test_scaled = scaler.transform(X_test)  # transform the test set with the same parameters
    # Convert back to DataFrames, preserving the feature names
    return (pd.DataFrame(X_train_scaled, columns=X_train.columns),
            pd.DataFrame(X_test_scaled, columns=X_test.columns))
After standardization, every feature has mean 0 and standard deviation 1, so the model is not swayed by raw magnitudes.
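As a quick sanity check of that property (toy data, not the real features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, as in the sodium/selenium example
X = np.array([[1000.0, 0.1], [2000.0, 0.5], [3000.0, 0.9]])
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```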
2.5 Dataset Splitting and Class Balancing
(1) Train/Test Split
Split 7:3 while preserving the class distribution:
from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions of the original data in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
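The effect of stratify can be verified on synthetic labels (hypothetical data with a 70/30 class ratio):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 70 + [1] * 30)  # 70/30 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
# Both splits keep roughly the original 70/30 ratio
print((y_te == 1).mean())  # ~0.3
```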
(2) Handling Class Imbalance
Use the SMOTE algorithm to synthesize minority-class samples and balance the class counts:
from imblearn.over_sampling import SMOTE

# Oversample the training set only (the test set keeps its original distribution)
smote = SMOTE(k_neighbors=1, random_state=0)  # k_neighbors=1 to limit synthetic noise
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
2.6 Saving the Data
Store the preprocessed data for later model training:
def save_processed_data(X_train, y_train, X_test, y_test, method):
    # Reattach labels to features
    train_df = pd.concat([X_train, pd.DataFrame(y_train, columns=['礦物類型'])], axis=1)
    test_df = pd.concat([X_test, pd.DataFrame(y_test, columns=['礦物類型'])], axis=1)
    # Save to Excel, tagging the files with the preprocessing method used
    train_df.to_excel(f'訓練集_{method}.xlsx', index=False)
    test_df.to_excel(f'測試集_{method}.xlsx', index=False)

# Example: save the data filled by random forest and standardized
save_processed_data(X_train_balanced, y_train_balanced, X_test, y_test, 'rf_fill_standardized')
3. Full Code
數據預處理.py:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import fill_data

data = pd.read_excel('礦物數據.xlsx')
data = data[data['礦物類型'] != 'E']  # drop special class E; the whole dataset contains only one E sample
null_num = data.isnull()
null_total = data.isnull().sum()

X_whole = data.drop('礦物類型', axis=1).drop('序號', axis=1)
y_whole = data.礦物類型

'''Map the letter labels to integers'''
label_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
encoded_labels = [label_dict[label] for label in y_whole]
y_whole = pd.DataFrame(encoded_labels, columns=['礦物類型'])

'''Convert string columns to float; abnormal entries ('/' and spaces) become NaN'''
for column_name in X_whole.columns:
    X_whole[column_name] = pd.to_numeric(X_whole[column_name], errors='coerce')

'''Z-score standardization'''
scaler = StandardScaler()
X_whole_Z = scaler.fit_transform(X_whole)
X_whole = pd.DataFrame(X_whole_Z, columns=X_whole.columns)  # fit_transform returns a numpy array; convert back to a DataFrame

'''Train/test split'''
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(X_whole, y_whole, test_size=0.3, random_state=0)

'''Missing-value filling: 6 methods'''
# # 1. Drop rows with missing values
# X_train_fill, y_train_fill = fill_data.cca_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.cca_test_fill(X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.cca_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)
#
# # 2. Mean fill
# X_train_fill, y_train_fill = fill_data.mean_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.mean_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.mean_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)
#
# # 3. Median fill
# X_train_fill, y_train_fill = fill_data.median_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.median_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.median_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)
#
# # 4. Mode fill
# X_train_fill, y_train_fill = fill_data.mode_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.mode_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.mode_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)
#
# # 5. Linear regression fill
# X_train_fill, y_train_fill = fill_data.linear_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.linear_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.linear_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)

# 6. Random forest fill
X_train_fill, y_train_fill = fill_data.RandomForest_train_fill(X_train_w, y_train_w)
X_test_fill, y_test_fill = fill_data.RandomForest_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
fill_data.RandomForest_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)
fill_data.py:
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

'''Oversampling'''
def oversampling(train_data, train_label):
    oversampler = SMOTE(k_neighbors=1, random_state=0)
    os_x_train, os_y_train = oversampler.fit_resample(train_data, train_label)
    return os_x_train, os_y_train

'''1. Drop rows with missing values'''
def cca_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)  # reset the index
    df_filled = data.dropna()  # drop rows that contain missing values
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def cca_test_fill(test_data, test_label):
    data = pd.concat([test_data, test_label], axis=1)
    data = data.reset_index(drop=True)
    df_filled = data.dropna()
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def cca_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//訓練數據集[刪除空缺行].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//測試數據集[刪除空缺行].xlsx', index=False)

'''2. Mean fill'''
def mean_train_method(data):
    fill_values = data.mean()
    return data.fillna(fill_values)

def mean_test_method(train_data, test_data):
    fill_values = train_data.mean()
    return test_data.fillna(fill_values)

def mean_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    A = data[data['礦物類型'] == 0]
    B = data[data['礦物類型'] == 1]
    C = data[data['礦物類型'] == 2]
    D = data[data['礦物類型'] == 3]
    A = mean_train_method(A)
    B = mean_train_method(B)
    C = mean_train_method(C)
    D = mean_train_method(D)
    df_filled = pd.concat([A, B, C, D], axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def mean_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    A_train = train_data_all[train_data_all['礦物類型'] == 0]
    B_train = train_data_all[train_data_all['礦物類型'] == 1]
    C_train = train_data_all[train_data_all['礦物類型'] == 2]
    D_train = train_data_all[train_data_all['礦物類型'] == 3]
    A_test = test_data_all[test_data_all['礦物類型'] == 0]
    B_test = test_data_all[test_data_all['礦物類型'] == 1]
    C_test = test_data_all[test_data_all['礦物類型'] == 2]
    D_test = test_data_all[test_data_all['礦物類型'] == 3]
    A = mean_test_method(A_train, A_test)
    B = mean_test_method(B_train, B_test)
    C = mean_test_method(C_train, C_test)
    D = mean_test_method(D_train, D_test)
    df_filled = pd.concat([A, B, C, D], axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def mean_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//訓練數據集[平均值填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//測試數據集[平均值填充].xlsx', index=False)

'''3. Median fill'''
def median_train_method(data):
    fill_values = data.median()
    return data.fillna(fill_values)

def median_test_method(train_data, test_data):
    fill_values = train_data.median()
    return test_data.fillna(fill_values)

def median_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    A = data[data['礦物類型'] == 0]
    B = data[data['礦物類型'] == 1]
    C = data[data['礦物類型'] == 2]
    D = data[data['礦物類型'] == 3]
    A = median_train_method(A)
    B = median_train_method(B)
    C = median_train_method(C)
    D = median_train_method(D)
    df_filled = pd.concat([A, B, C, D], axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def median_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    A_train = train_data_all[train_data_all['礦物類型'] == 0]
    B_train = train_data_all[train_data_all['礦物類型'] == 1]
    C_train = train_data_all[train_data_all['礦物類型'] == 2]
    D_train = train_data_all[train_data_all['礦物類型'] == 3]
    A_test = test_data_all[test_data_all['礦物類型'] == 0]
    B_test = test_data_all[test_data_all['礦物類型'] == 1]
    C_test = test_data_all[test_data_all['礦物類型'] == 2]
    D_test = test_data_all[test_data_all['礦物類型'] == 3]
    A = median_test_method(A_train, A_test)
    B = median_test_method(B_train, B_test)
    C = median_test_method(C_train, C_test)
    D = median_test_method(D_train, D_test)
    df_filled = pd.concat([A, B, C, D], axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def median_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//訓練數據集[中位數填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//測試數據集[中位數填充].xlsx', index=False)

'''4. Mode fill'''
def mode_train_method(data):
    # Take each column's mode; fall back to None when a column has no mode
    fill_values = data.apply(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)
    return data.fillna(fill_values)

def mode_test_method(train_data, test_data):
    fill_values = train_data.apply(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)
    return test_data.fillna(fill_values)

def mode_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    A = data[data['礦物類型'] == 0]
    B = data[data['礦物類型'] == 1]
    C = data[data['礦物類型'] == 2]
    D = data[data['礦物類型'] == 3]
    A = mode_train_method(A)
    B = mode_train_method(B)
    C = mode_train_method(C)
    D = mode_train_method(D)
    df_filled = pd.concat([A, B, C, D], axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def mode_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    A_train = train_data_all[train_data_all['礦物類型'] == 0]
    B_train = train_data_all[train_data_all['礦物類型'] == 1]
    C_train = train_data_all[train_data_all['礦物類型'] == 2]
    D_train = train_data_all[train_data_all['礦物類型'] == 3]
    A_test = test_data_all[test_data_all['礦物類型'] == 0]
    B_test = test_data_all[test_data_all['礦物類型'] == 1]
    C_test = test_data_all[test_data_all['礦物類型'] == 2]
    D_test = test_data_all[test_data_all['礦物類型'] == 3]
    A = mode_test_method(A_train, A_test)
    B = mode_test_method(B_train, B_test)
    C = mode_test_method(C_train, C_test)
    D = mode_test_method(D_train, D_test)
    df_filled = pd.concat([A, B, C, D], axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def mode_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//訓練數據集[眾數填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//測試數據集[眾數填充].xlsx', index=False)

'''5. Linear regression fill'''
def linear_train_fill(train_data, train_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    train_data_X = train_data_all.drop('礦物類型', axis=1)
    null_num = train_data_X.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            X = train_data_X[filling_feature].drop(i, axis=1)
            y = train_data_X[i]
            row_numbers_mg_null = train_data_X[train_data_X[i].isnull()].index.tolist()
            X_train = X.drop(row_numbers_mg_null)
            y_train = y.drop(row_numbers_mg_null)
            X_test = X.iloc[row_numbers_mg_null]
            lr = LinearRegression()
            lr.fit(X_train, y_train)
            y_pred = lr.predict(X_test)
            train_data_X.loc[row_numbers_mg_null, i] = y_pred
            print(f'Finished filling column {i} of the training set')
    return train_data_X, train_data_all.礦物類型

def linear_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    train_data_X = train_data_all.drop('礦物類型', axis=1)
    test_data_X = test_data_all.drop('礦物類型', axis=1)
    null_num = test_data_X.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            X_train = train_data_X[filling_feature].drop(i, axis=1)
            y_train = train_data_X[i]
            X_test = test_data_X[filling_feature].drop(i, axis=1)
            row_numbers_mg_null = test_data_X[test_data_X[i].isnull()].index.tolist()
            X_test = X_test.iloc[row_numbers_mg_null]
            lr = LinearRegression()
            lr.fit(X_train, y_train)
            y_pred = lr.predict(X_test)
            test_data_X.loc[row_numbers_mg_null, i] = y_pred
            print(f'Finished filling column {i} of the test set')
    return test_data_X, test_data_all.礦物類型

def linear_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//訓練數據集[線性回歸填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//測試數據集[線性回歸填充].xlsx', index=False)

'''6. Random forest fill'''
def RandomForest_train_fill(train_data, train_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    train_data_X = train_data_all.drop('礦物類型', axis=1)
    null_num = train_data_X.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            X = train_data_X[filling_feature].drop(i, axis=1)
            y = train_data_X[i]
            row_numbers_mg_null = train_data_X[train_data_X[i].isnull()].index.tolist()
            X_train = X.drop(row_numbers_mg_null)
            y_train = y.drop(row_numbers_mg_null)
            X_test = X.iloc[row_numbers_mg_null]
            rfg = RandomForestRegressor(n_estimators=100, random_state=10)
            rfg.fit(X_train, y_train)
            y_pred = rfg.predict(X_test)
            train_data_X.loc[row_numbers_mg_null, i] = y_pred
            print(f'Finished filling column {i} of the training set')
    return train_data_X, train_data_all.礦物類型

def RandomForest_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    train_data_X = train_data_all.drop('礦物類型', axis=1)
    test_data_X = test_data_all.drop('礦物類型', axis=1)
    null_num = test_data_X.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            X_train = train_data_X[filling_feature].drop(i, axis=1)
            y_train = train_data_X[i]
            X_test = test_data_X[filling_feature].drop(i, axis=1)
            row_numbers_mg_null = test_data_X[test_data_X[i].isnull()].index.tolist()
            X_test = X_test.iloc[row_numbers_mg_null]
            rfg = RandomForestRegressor(n_estimators=100, random_state=10)
            rfg.fit(X_train, y_train)
            y_pred = rfg.predict(X_test)
            test_data_X.loc[row_numbers_mg_null, i] = y_pred
            print(f'Finished filling column {i} of the test set')
    return test_data_X, test_data_all.礦物類型

def RandomForest_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//訓練數據集[隨機森林填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//測試數據集[隨機森林填充].xlsx', index=False)
4. Preprocessing Summary
Data preprocessing accomplished the following key work:
- cleaned invalid samples and abnormal symbols, unifying the data format;
- handled missing values with multiple methods suited to different data characteristics;
- standardized the features, removing scale differences;
- split and balanced the dataset in preparation for model training.
The processed data now meets the model input requirements. The next stage is model training, which will include:
- selecting classification algorithms such as random forest and SVM;
- model evaluation and hyperparameter tuning;
- comparing the classification performance of different models.
A follow-up article will detail the training process and result analysis based on the data preprocessed here.