Mineral Classification System Development Notes (1): Data Preprocessing

Table of Contents

1. Data Basics and Preprocessing Goals

2. Preprocessing Steps and Code Walkthrough

2.1 Data Loading and Initial Cleaning

2.2 Label Encoding

2.3 Missing-Value Handling

(1) Dropping samples with missing values

(2) Per-class mean filling

(3) Per-class median filling

(4) Per-class mode filling

(5) Linear-regression filling

(6) Random-forest filling

2.4 Feature Standardization

2.5 Dataset Splitting and Class Balancing

(1) Splitting into training and test sets

(2) Handling class imbalance

2.6 Saving the Data

3. Full Code

4. Preprocessing Summary


1. Data Basics and Preprocessing Goals

This mineral classification system is built on 礦物數據.xlsx, which contains 1,044 samples covering 13 elemental features (chlorine, sodium, magnesium, etc.) and 4 mineral-type labels (A, B, C, D). Because the data is sensitive it cannot be shared; only the code is provided, for learning purposes. The core goal of preprocessing is to produce reliable input for model training by normalizing formats, handling missing values, and balancing the classes.
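Before making any preprocessing decisions, it helps to look at the class distribution. A minimal sketch, where a hypothetical mini-frame stands in for 礦物數據.xlsx (the real file is not public):

```python
import pandas as pd

# Hypothetical mini-frame standing in for 礦物數據.xlsx (the real file is not public)
data = pd.DataFrame({
    '礦物類型': ['A', 'A', 'B', 'C', 'D', 'A', 'B'],
    '鈉': [1200.0, 980.0, 15.0, 3.2, 44.0, 1100.0, 18.0],
})
counts = data['礦物類型'].value_counts()  # per-class sample counts
```

On the real data, a heavily skewed `counts` is what motivates the SMOTE step in section 2.5.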


2. Preprocessing Steps and Code Walkthrough

2.1 Data Loading and Initial Cleaning

First, load the data and strip out invalid entries:

import pandas as pd

# Load the data, keeping only valid samples
data = pd.read_excel('礦物數據.xlsx', sheet_name='Sheet1')
# Drop the special class E (only 1 sample, no statistical value)
data = data[data['礦物類型'] != 'E']
# Convert dtypes: non-numeric symbols (e.g. "/", spaces) become NaN
for col in data.columns:
    if col not in ['序號', '礦物類型']:  # skip the ID and label columns
        # errors='coerce' turns anything non-numeric into NaN
        data[col] = pd.to_numeric(data[col], errors='coerce')

This step resolves the messy formatting in the raw data and ensures every feature column is numeric, laying the foundation for everything that follows.
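A quick way to confirm the coercion worked is to count the NaN values it produced and check the resulting dtype. A small sketch with made-up values of the kind described above ("/" and spaces):

```python
import pandas as pd

# Made-up column values of the kind described above: numbers mixed with "/" and spaces
raw = pd.DataFrame({'氯': ['1.5', '/', ' ', '2.0'], '礦物類型': ['A', 'B', 'A', 'C']})
for col in raw.columns:
    if col != '礦物類型':  # never coerce the label column
        raw[col] = pd.to_numeric(raw[col], errors='coerce')

# Both invalid entries became NaN and the column is now numeric
n_missing = raw['氯'].isnull().sum()
```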

2.2 Label Encoding

Convert the character labels (A/B/C/D) into integers the model can recognize:

# Define the label mapping
label_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
# Map the labels, keeping the DataFrame format
data['礦物類型'] = data['礦物類型'].map(label_dict)
# Separate features and labels
X = data.drop(['序號', '礦物類型'], axis=1)  # feature set
y = data['礦物類型']  # label set

After encoding, the labels range from 0 to 3, matching the input format machine-learning models expect.
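When the trained model later outputs 0-3, the predictions have to be decoded back to letters. A small sketch of the reverse mapping (the variable names here are illustrative, not from the original code):

```python
# Mapping used in the text; the inverse is handy for decoding predictions later
label_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
inv_label_dict = {v: k for k, v in label_dict.items()}

# Hypothetical model outputs decoded back to letter labels
preds = [0, 3, 1]
decoded = [inv_label_dict[p] for p in preds]
```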

2.3 Missing-Value Handling

Six strategies were designed for the missing values present in the data:

(1) Dropping samples with missing values

Suitable when the missing rate is very low (<1%); simply discard the incomplete samples:

def drop_missing(train_data, train_label):
    # Concatenate features and labels so rows can be dropped together
    combined = pd.concat([train_data, train_label], axis=1)
    combined = combined.reset_index(drop=True)  # reset the index to avoid gaps after dropping
    cleaned = combined.dropna()  # drop rows containing missing values
    # Separate features and labels again
    return cleaned.drop('礦物類型', axis=1), cleaned['礦物類型']

The upside of this method is that it introduces no bias; the downside is that useful information may be lost when the missing rate is high.
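To judge whether dropping is safe, measure the per-column missing rate first. A sketch on synthetic values:

```python
import numpy as np
import pandas as pd

# Synthetic frame: measure the per-column fraction of missing values first
df = pd.DataFrame({'氯': [1.0, np.nan, 2.0, 3.0], '鈉': [np.nan, np.nan, 5.0, 6.0]})
missing_rate = df.isnull().mean()  # isnull() gives booleans; their mean is the NaN fraction
```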

(2) Per-class mean filling

For numeric features, compute the mean within each mineral type and fill missing values with the group mean (reducing cross-class interference):

def mean_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1)
    combined = combined.reset_index(drop=True)
    # Fill group by group, per mineral type
    filled_groups = []
    for type_id in combined['礦物類型'].unique():
        group = combined[combined['礦物類型'] == type_id]
        # Fill this group's missing values with its own per-feature means
        filled_group = group.fillna(group.mean())
        filled_groups.append(filled_group)
    # Merge the groups back together
    filled = pd.concat(filled_groups, axis=0).reset_index(drop=True)
    return filled.drop('礦物類型', axis=1), filled['礦物類型']

This suits features with fairly even distributions; filling within each class avoids blending means across classes.
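For reference, the same per-class mean fill can be written as a single groupby/transform, avoiding the manual split-and-concat; a sketch on synthetic data:

```python
import numpy as np
import pandas as pd

# Synthetic frame: two mineral types, one feature with a gap in each group
df = pd.DataFrame({
    '礦物類型': [0, 0, 0, 1, 1],
    '鎂': [10.0, np.nan, 20.0, 100.0, np.nan],
})
# Fill each group's NaNs with that group's own mean
df['鎂'] = df.groupby('礦物類型')['鎂'].transform(lambda s: s.fillna(s.mean()))
```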

(3) Per-class median filling

When a feature has extreme values (e.g. a few samples with sodium content far above the mean), the median gives a more robust fill:

def median_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1)
    combined = combined.reset_index(drop=True)
    filled_groups = []
    for type_id in combined['礦物類型'].unique():
        group = combined[combined['礦物類型'] == type_id]
        # The median is insensitive to extreme values
        filled_group = group.fillna(group.median())
        filled_groups.append(filled_group)
    filled = pd.concat(filled_groups, axis=0).reset_index(drop=True)
    return filled.drop('礦物類型', axis=1), filled['礦物類型']

(4) Per-class mode filling

For discrete features (e.g. element contents stored as integer codes), fill with the mode (the most frequent value):

def mode_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1)
    combined = combined.reset_index(drop=True)
    filled_groups = []
    for type_id in combined['礦物類型'].unique():
        group = combined[combined['礦物類型'] == type_id]
        # Take each column's mode; return None when a column has no mode
        fill_values = group.apply(lambda x: x.mode().iloc[0] if not x.mode().empty else None)
        filled_group = group.fillna(fill_values)
        filled_groups.append(filled_group)
    filled = pd.concat(filled_groups, axis=0).reset_index(drop=True)
    return filled.drop('礦物類型', axis=1), filled['礦物類型']

(5) Linear-regression filling

Exploit linear correlations between features (e.g. the link between chlorine and sodium content) to predict the missing values:

from sklearn.linear_model import LinearRegression

def linear_reg_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1).reset_index(drop=True)
    features = combined.drop('礦物類型', axis=1)
    # Process columns in ascending order of missing count (fewest-missing first),
    # so every column used as a predictor is already complete
    null_counts = features.isnull().sum().sort_values()
    complete_cols = []  # columns with no remaining missing values
    for col in null_counts.index:
        if null_counts[col] == 0:
            complete_cols.append(col)
            continue
        # Rows to fill (this column missing) vs. training rows (this column observed)
        missing_idx = features[features[col].isnull()].index
        X_train = features.loc[~features.index.isin(missing_idx), complete_cols]
        y_train = features.loc[X_train.index, col]
        X_pred = features.loc[missing_idx, complete_cols]
        # Fit a linear regression, then predict and fill the missing values
        lr = LinearRegression()
        lr.fit(X_train, y_train)
        features.loc[missing_idx, col] = lr.predict(X_pred)
        complete_cols.append(col)  # this column is now complete
    return features, combined['礦物類型']

This method requires some linear relationship between features, and suits scenarios where element contents vary in proportion to one another.
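That linearity assumption can be checked beforehand with a correlation coefficient. The sketch below fabricates a 鈉/氯 pair with a roughly proportional relationship, since the real values are not available:

```python
import numpy as np
import pandas as pd

# Fabricated 鈉/氯 pair: 氯 roughly proportional to 鈉, plus noise
rng = np.random.default_rng(0)
na = rng.normal(500, 100, 200)
cl = 1.8 * na + rng.normal(0, 10, 200)
df = pd.DataFrame({'鈉': na, '氯': cl})

# A Pearson correlation close to 1 supports a linear-regression fill for these columns
corr = df['鈉'].corr(df['氯'])
```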

(6) Random-forest filling

For nonlinear relationships between features, use a random-forest model to predict the missing values:

from sklearn.ensemble import RandomForestRegressor

def rf_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1).reset_index(drop=True)
    features = combined.drop('礦物類型', axis=1)
    null_counts = features.isnull().sum().sort_values()
    complete_cols = []  # columns with no remaining missing values
    for col in null_counts.index:
        if null_counts[col] == 0:
            complete_cols.append(col)
            continue
        # Separate the training rows from the rows to fill
        missing_idx = features[features[col].isnull()].index
        X_train = features.loc[~features.index.isin(missing_idx), complete_cols]
        y_train = features.loc[X_train.index, col]
        X_pred = features.loc[missing_idx, complete_cols]
        # Random-forest regressor (100 trees, fixed seed for reproducibility)
        rfr = RandomForestRegressor(n_estimators=100, random_state=10)
        rfr.fit(X_train, y_train)
        # Fill in the predictions
        features.loc[missing_idx, col] = rfr.predict(X_pred)
        complete_cols.append(col)  # this column is now complete
    return features, combined['礦物類型']

A random forest can capture complex relationships between features; its fill accuracy is usually higher than the linear method's, at a somewhat higher computational cost.

2.4 Feature Standardization

Element contents differ widely in scale (sodium can reach the thousands while selenium is mostly 0-1), so differences in units and magnitude must be neutralized:

from sklearn.preprocessing import StandardScaler

def standardize_features(X_train, X_test):
    # Standardize with the training set's mean and standard deviation
    # (avoids leaking test-set information)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # fit on the training set, then transform
    X_test_scaled = scaler.transform(X_test)  # transform the test set with the same parameters
    # Convert back to DataFrames, keeping the feature names
    return (pd.DataFrame(X_train_scaled, columns=X_train.columns),
            pd.DataFrame(X_test_scaled, columns=X_test.columns))

After standardization every feature has mean 0 and standard deviation 1, so the model is not skewed by raw magnitudes.
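This claim is easy to verify: after fit_transform, each column's mean is ~0 and its (population) standard deviation is 1. A sketch on synthetic magnitudes mimicking 鈉 and 硒:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic magnitudes mimicking 鈉 (thousands) and 硒 (0-1)
X = pd.DataFrame({'鈉': [1000.0, 2000.0, 1500.0, 800.0], '硒': [0.1, 0.5, 0.3, 0.2]})
scaled = StandardScaler().fit_transform(X)
means = scaled.mean(axis=0)
stds = scaled.std(axis=0)  # StandardScaler divides by the population (ddof=0) std
```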

2.5 Dataset Splitting and Class Balancing

(1) Splitting into training and test sets

Split 7:3 while keeping the class distribution consistent:

from sklearn.model_selection import train_test_split

# stratify=y keeps the test set's class ratios consistent with the original data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
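The effect of stratify can be confirmed by counting labels on both sides of the split; the sketch below uses synthetic labels with a 2:1 ratio:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic labels with a 2:1 class ratio; features are just row indices
y = [0] * 60 + [1] * 30
X = [[v] for v in range(90)]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

# Both halves keep the 2:1 ratio
train_counts, test_counts = Counter(y_tr), Counter(y_te)
```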

(2) Handling class imbalance

Use the SMOTE algorithm to synthesize minority-class samples and balance the class counts:

from imblearn.over_sampling import SMOTE

# Oversample the training set only (the test set keeps its original distribution)
smote = SMOTE(k_neighbors=1, random_state=0)  # k_neighbors=1 avoids introducing too much noise
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

2.6 Saving the Data

Store the preprocessed data for later model training:

def save_processed_data(X_train, y_train, X_test, y_test, method):
    # Concatenate features and labels
    train_df = pd.concat([X_train, pd.DataFrame(y_train, columns=['礦物類型'])], axis=1)
    test_df = pd.concat([X_test, pd.DataFrame(y_test, columns=['礦物類型'])], axis=1)
    # Save to Excel, naming the files after the preprocessing method
    train_df.to_excel(f'訓練集_{method}.xlsx', index=False)
    test_df.to_excel(f'測試集_{method}.xlsx', index=False)

# Example: save the data filled by random forest and standardized
save_processed_data(X_train_balanced, y_train_balanced, X_test, y_test, 'rf_fill_standardized')


3. Full Code

數據預處理.py

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import fill_data

data = pd.read_excel('礦物數據.xlsx')
data = data[data['礦物類型'] != 'E']  # drop the special class E (only 1 such sample in the whole dataset)
null_num = data.isnull()
null_total = data.isnull().sum()

X_whole = data.drop('礦物類型', axis=1).drop('序號', axis=1)
y_whole = data.礦物類型

'''Encode the letter labels A-D as integers'''
label_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
encoded_labels = [label_dict[label] for label in y_whole]
y_whole = pd.DataFrame(encoded_labels, columns=['礦物類型'])

'''Convert string data to float; anomalous entries ("/" and spaces) become NaN'''
for column_name in X_whole.columns:
    X_whole[column_name] = pd.to_numeric(X_whole[column_name], errors='coerce')

'''Z-score standardization'''
scaler = StandardScaler()
X_whole_Z = scaler.fit_transform(X_whole)
X_whole = pd.DataFrame(X_whole_Z, columns=X_whole.columns)  # fit_transform returns a NumPy array; convert back to a DataFrame

'''Train/test split'''
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(X_whole, y_whole, test_size=0.3, random_state=0)

'''Missing-value filling: 6 methods'''
# # 1. Drop rows with missing values
# X_train_fill, y_train_fill = fill_data.cca_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.cca_test_fill(X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.cca_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)

# # 2. Mean filling
# X_train_fill, y_train_fill = fill_data.mean_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.mean_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.mean_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)

# # 3. Median filling
# X_train_fill, y_train_fill = fill_data.median_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.median_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.median_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)

# # 4. Mode filling
# X_train_fill, y_train_fill = fill_data.mode_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.mode_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.mode_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)

# # 5. Linear-regression filling
# X_train_fill, y_train_fill = fill_data.linear_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.linear_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.linear_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)

# 6. Random-forest filling
X_train_fill, y_train_fill = fill_data.RandomForest_train_fill(X_train_w, y_train_w)
X_test_fill, y_test_fill = fill_data.RandomForest_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
fill_data.RandomForest_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)

fill_data.py

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

'''Oversampling'''
def oversampling(train_data, train_label):
    oversampler = SMOTE(k_neighbors=1, random_state=0)
    os_x_train, os_y_train = oversampler.fit_resample(train_data, train_label)
    return os_x_train, os_y_train

'''1. Drop rows with missing values'''
def cca_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)  # reset the index
    df_filled = data.dropna()  # drop rows containing missing values
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def cca_test_fill(test_data, test_label):
    data = pd.concat([test_data, test_label], axis=1)
    data = data.reset_index(drop=True)
    df_filled = data.dropna()
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def cca_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//訓練數據集[刪除空缺行].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//測試數據集[刪除空缺行].xlsx', index=False)

'''2. Mean filling'''
def mean_train_method(data):
    fill_values = data.mean()
    return data.fillna(fill_values)

def mean_test_method(train_data, test_data):
    # Fill the test set with statistics computed on the training set
    fill_values = train_data.mean()
    return test_data.fillna(fill_values)

def mean_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    A = data[data['礦物類型'] == 0]
    B = data[data['礦物類型'] == 1]
    C = data[data['礦物類型'] == 2]
    D = data[data['礦物類型'] == 3]
    A = mean_train_method(A)
    B = mean_train_method(B)
    C = mean_train_method(C)
    D = mean_train_method(D)
    df_filled = pd.concat([A, B, C, D], axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def mean_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    A_train = train_data_all[train_data_all['礦物類型'] == 0]
    B_train = train_data_all[train_data_all['礦物類型'] == 1]
    C_train = train_data_all[train_data_all['礦物類型'] == 2]
    D_train = train_data_all[train_data_all['礦物類型'] == 3]
    A_test = test_data_all[test_data_all['礦物類型'] == 0]
    B_test = test_data_all[test_data_all['礦物類型'] == 1]
    C_test = test_data_all[test_data_all['礦物類型'] == 2]
    D_test = test_data_all[test_data_all['礦物類型'] == 3]
    A = mean_test_method(A_train, A_test)
    B = mean_test_method(B_train, B_test)
    C = mean_test_method(C_train, C_test)
    D = mean_test_method(D_train, D_test)
    df_filled = pd.concat([A, B, C, D], axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def mean_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//訓練數據集[平均值填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//測試數據集[平均值填充].xlsx', index=False)

'''3. Median filling'''
def median_train_method(data):
    fill_values = data.median()
    return data.fillna(fill_values)

def median_test_method(train_data, test_data):
    fill_values = train_data.median()
    return test_data.fillna(fill_values)

def median_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    A = data[data['礦物類型'] == 0]
    B = data[data['礦物類型'] == 1]
    C = data[data['礦物類型'] == 2]
    D = data[data['礦物類型'] == 3]
    A = median_train_method(A)
    B = median_train_method(B)
    C = median_train_method(C)
    D = median_train_method(D)
    df_filled = pd.concat([A, B, C, D], axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def median_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    A_train = train_data_all[train_data_all['礦物類型'] == 0]
    B_train = train_data_all[train_data_all['礦物類型'] == 1]
    C_train = train_data_all[train_data_all['礦物類型'] == 2]
    D_train = train_data_all[train_data_all['礦物類型'] == 3]
    A_test = test_data_all[test_data_all['礦物類型'] == 0]
    B_test = test_data_all[test_data_all['礦物類型'] == 1]
    C_test = test_data_all[test_data_all['礦物類型'] == 2]
    D_test = test_data_all[test_data_all['礦物類型'] == 3]
    A = median_test_method(A_train, A_test)
    B = median_test_method(B_train, B_test)
    C = median_test_method(C_train, C_test)
    D = median_test_method(D_train, D_test)
    df_filled = pd.concat([A, B, C, D], axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def median_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//訓練數據集[中位數填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//測試數據集[中位數填充].xlsx', index=False)

'''4. Mode filling'''
def mode_train_method(data):
    fill_values = data.apply(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)
    return data.fillna(fill_values)

def mode_test_method(train_data, test_data):
    fill_values = train_data.apply(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)
    return test_data.fillna(fill_values)

def mode_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    A = data[data['礦物類型'] == 0]
    B = data[data['礦物類型'] == 1]
    C = data[data['礦物類型'] == 2]
    D = data[data['礦物類型'] == 3]
    A = mode_train_method(A)
    B = mode_train_method(B)
    C = mode_train_method(C)
    D = mode_train_method(D)
    df_filled = pd.concat([A, B, C, D], axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def mode_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    A_train = train_data_all[train_data_all['礦物類型'] == 0]
    B_train = train_data_all[train_data_all['礦物類型'] == 1]
    C_train = train_data_all[train_data_all['礦物類型'] == 2]
    D_train = train_data_all[train_data_all['礦物類型'] == 3]
    A_test = test_data_all[test_data_all['礦物類型'] == 0]
    B_test = test_data_all[test_data_all['礦物類型'] == 1]
    C_test = test_data_all[test_data_all['礦物類型'] == 2]
    D_test = test_data_all[test_data_all['礦物類型'] == 3]
    A = mode_test_method(A_train, A_test)
    B = mode_test_method(B_train, B_test)
    C = mode_test_method(C_train, C_test)
    D = mode_test_method(D_train, D_test)
    df_filled = pd.concat([A, B, C, D], axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def mode_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//訓練數據集[眾數填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//測試數據集[眾數填充].xlsx', index=False)

'''5. Linear-regression filling'''
def linear_train_fill(train_data, train_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    train_data_X = train_data_all.drop('礦物類型', axis=1)
    null_num = train_data_X.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []  # columns processed so far (already complete)
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            X = train_data_X[filling_feature].drop(i, axis=1)
            y = train_data_X[i]
            row_numbers_mg_null = train_data_X[train_data_X[i].isnull()].index.tolist()
            X_train = X.drop(row_numbers_mg_null)
            y_train = y.drop(row_numbers_mg_null)
            X_test = X.iloc[row_numbers_mg_null]
            lr = LinearRegression()
            lr.fit(X_train, y_train)
            y_pred = lr.predict(X_test)
            train_data_X.loc[row_numbers_mg_null, i] = y_pred
            print(f'Filled column {i} of the training set')
    return train_data_X, train_data_all.礦物類型

def linear_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    train_data_X = train_data_all.drop('礦物類型', axis=1)
    test_data_X = test_data_all.drop('礦物類型', axis=1)
    null_num = test_data_X.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            X_train = train_data_X[filling_feature].drop(i, axis=1)
            y_train = train_data_X[i]
            X_test = test_data_X[filling_feature].drop(i, axis=1)
            row_numbers_mg_null = test_data_X[test_data_X[i].isnull()].index.tolist()
            X_test = X_test.iloc[row_numbers_mg_null]
            lr = LinearRegression()
            lr.fit(X_train, y_train)
            y_pred = lr.predict(X_test)
            test_data_X.loc[row_numbers_mg_null, i] = y_pred
            print(f'Filled column {i} of the test set')
    return test_data_X, test_data_all.礦物類型

def linear_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//訓練數據集[線性回歸填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//測試數據集[線性回歸填充].xlsx', index=False)

'''6. Random-forest filling'''
def RandomForest_train_fill(train_data, train_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    train_data_X = train_data_all.drop('礦物類型', axis=1)
    null_num = train_data_X.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            X = train_data_X[filling_feature].drop(i, axis=1)
            y = train_data_X[i]
            row_numbers_mg_null = train_data_X[train_data_X[i].isnull()].index.tolist()
            X_train = X.drop(row_numbers_mg_null)
            y_train = y.drop(row_numbers_mg_null)
            X_test = X.iloc[row_numbers_mg_null]
            rfg = RandomForestRegressor(n_estimators=100, random_state=10)
            rfg.fit(X_train, y_train)
            y_pred = rfg.predict(X_test)
            train_data_X.loc[row_numbers_mg_null, i] = y_pred
            print(f'Filled column {i} of the training set')
    return train_data_X, train_data_all.礦物類型

def RandomForest_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    train_data_X = train_data_all.drop('礦物類型', axis=1)
    test_data_X = test_data_all.drop('礦物類型', axis=1)
    null_num = test_data_X.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            X_train = train_data_X[filling_feature].drop(i, axis=1)
            y_train = train_data_X[i]
            X_test = test_data_X[filling_feature].drop(i, axis=1)
            row_numbers_mg_null = test_data_X[test_data_X[i].isnull()].index.tolist()
            X_test = X_test.iloc[row_numbers_mg_null]
            rfg = RandomForestRegressor(n_estimators=100, random_state=10)
            rfg.fit(X_train, y_train)
            y_pred = rfg.predict(X_test)
            test_data_X.loc[row_numbers_mg_null, i] = y_pred
            print(f'Filled column {i} of the test set')
    return test_data_X, test_data_all.礦物類型

def RandomForest_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//訓練數據集[隨機森林填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//測試數據集[隨機森林填充].xlsx', index=False)

4. Preprocessing Summary

The preprocessing stage accomplished the following key work:

  1. Cleaned invalid samples and anomalous symbols, unifying the data format;
  2. Handled missing values with several alternative methods, suiting different data characteristics;
  3. Standardized the features, removing differences in scale;
  4. Split and balanced the dataset, preparing it for model training.

The processed data now meets the models' input requirements. The next stage is model training, which will include:

  • selecting classification algorithms such as random forest and SVM;
  • model evaluation and hyperparameter tuning;
  • comparing the classification performance of the different models.

The next post will cover the model-training process and result analysis in detail, using the data preprocessed here.
