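In machine learning practice, data preprocessing and model building are critical steps. This article reviews the full preprocessing workflow, then builds and evaluates simple machine learning models on the processed data. Complex hyperparameter tuning is out of scope for now.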
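I. Review of the Preprocessing Workflow
The success of a machine learning project depends heavily on the quality of its data. A standard preprocessing workflow looks like this:
- Import libraries: bring in the Python libraries needed for data processing, analysis, visualization, and modeling.
- Read and understand the data: load the dataset and use the info() and head() methods to get a first look at its basic information and structure.
- Missing value handling: identify and handle missing values in the data.
- Outlier handling: detect and handle anomalous data points.
- Categorical value handling: convert categorical (discrete) data into a format the model can work with.
- Feature engineering: feature scaling, deriving new features, and feature selection.
- Dataset splitting: split the data into a training set and a test set for model training and evaluation.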
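1.1 Importing the Required Packages
import pandas as pd  # data processing and analysis for tabular data
import numpy as np  # numerical computing with efficient array operations
import matplotlib.pyplot as plt  # plotting of all kinds of charts
import seaborn as sns  # high-level plotting library built on matplotlib, for nicer statistical graphics

# Configure a font that can render Chinese characters in plots
plt.rcParams['font.sans-serif'] = ['SimHei']  # SimHei is a commonly available font on Windows
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly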
1.2 Inspecting the Data
data = pd.read_csv('data.csv')  # read the data
print("Basic data information:")
data.info()
print("\nPreview of the first 5 rows:")
print(data.head())
Basic data information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Id                            7500 non-null   int64
 1   Home Ownership                7500 non-null   object
 2   Annual Income                 5943 non-null   float64
 3   Years in current job          7129 non-null   object
 4   Tax Liens                     7500 non-null   float64
 5   Number of Open Accounts       7500 non-null   float64
 6   Years of Credit History       7500 non-null   float64
 7   Maximum Open Credit           7500 non-null   float64
 8   Number of Credit Problems     7500 non-null   float64
 9   Months since last delinquent  3419 non-null   float64
 10  Bankruptcies                  7486 non-null   float64
 11  Purpose                       7500 non-null   object
 12  Term                          7500 non-null   object
 13  Current Loan Amount           7500 non-null   float64
 14  Current Credit Balance        7500 non-null   float64
 15  Monthly Debt                  7500 non-null   float64
 16  Credit Score                  5943 non-null   float64
 17  Credit Default                7500 non-null   int64
dtypes: float64(12), int64(2), object(4)
memory usage: 1.0+ MB
Preview of the first 5 rows:
   Id Home Ownership  Annual Income Years in current job  Tax Liens  \
0   0       Own Home       482087.0                  NaN        0.0
1   1       Own Home      1025487.0            10+ years        0.0
2   2  Home Mortgage       751412.0              8 years        0.0
3   3       Own Home       805068.0              6 years        0.0
4   4           Rent       776264.0              8 years        0.0

   Number of Open Accounts  Years of Credit History  Maximum Open Credit  \
0                     11.0                     26.3             685960.0
1                     15.0                     15.3            1181730.0
2                     11.0                     35.0            1182434.0
3                      8.0                     22.5             147400.0
4                     13.0                     13.6             385836.0

   Number of Credit Problems  Months since last delinquent  Bankruptcies  \
0                        1.0                           NaN           1.0
1                        0.0                           NaN           0.0
2                        0.0                           NaN           0.0
3                        1.0                           NaN           1.0
4                        1.0                           NaN           0.0

              Purpose        Term  Current Loan Amount  \
0  debt consolidation  Short Term           99999999.0
1  debt consolidation   Long Term             264968.0
2  debt consolidation  Short Term           99999999.0
3  debt consolidation  Short Term             121396.0
4  debt consolidation  Short Term             125840.0

   Current Credit Balance  Monthly Debt  Credit Score  Credit Default
0                 47386.0        7914.0         749.0               0
1                394972.0       18373.0         737.0               1
2                308389.0       13651.0         742.0               0
3                 95855.0       11338.0         694.0               0
4                 93309.0        7180.0         719.0               0
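1.3 Missing Value Handling
- Annual Income: 1557 missing values. These can be filled with the mean income of a related group, for example grouped by "Home Ownership".
- Years in current job: 371 missing values. Convert the strings to numbers first, then fill with the mode or the median.
- Months since last delinquent: a large number of missing values (4081 of 7500). Depending on how much the feature affects the target variable, either use multiple imputation or drop the rows with missing values; note that dropping those rows would discard more than half of the dataset.
- Credit Score: 1557 missing values, handled in the same way as "Annual Income".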
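The group-based fill suggested for "Annual Income" can be sketched as follows. This is a minimal illustration on a made-up toy frame, not the article's actual processing (section 2.2 uses a plain median fill instead); the numbers are invented so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Toy frame that mimics two columns of the credit dataset (values are made up)
toy = pd.DataFrame({
    "Home Ownership": ["Rent", "Rent", "Rent", "Own Home", "Own Home", "Own Home"],
    "Annual Income": [400000.0, np.nan, 600000.0, 900000.0, 1100000.0, np.nan],
})

# Mean income within each ownership group, aligned row by row with the original frame
group_mean = toy.groupby("Home Ownership")["Annual Income"].transform("mean")

# Fill only the missing entries, each with the mean of its own group
toy["Annual Income"] = toy["Annual Income"].fillna(group_mean)

print(toy)
```

Using transform keeps the result the same length as the original column, so it can be passed straight to fillna without a merge.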
1.4 Data Type Conversion
- Years in current job: convert from string to numeric.
- Home Ownership, Purpose, Term: choose one-hot encoding or label encoding depending on the nature of each feature.
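One possible way to turn the "Years in current job" strings into numbers is to parse the digits out of each value. The helper below is hypothetical (the name years_to_number and the choice to treat "< 1 year" as 0 are assumptions made for illustration); the article itself uses an explicit mapping dictionary in section 2.1.

```python
import numpy as np
import pandas as pd

def years_to_number(value):
    """Convert strings such as '< 1 year' or '10+ years' to a number of years."""
    if pd.isna(value):
        return np.nan  # keep missing values for a later imputation step
    if value.startswith("<"):
        return 0.0  # assumption: '< 1 year' is treated as 0 years
    digits = "".join(ch for ch in value if ch.isdigit())
    return float(digits)  # '10+ years' -> 10.0, '8 years' -> 8.0

# Sample values in the format shown in the data preview
toy = pd.Series(["< 1 year", "1 year", "8 years", "10+ years", np.nan])
converted = toy.map(years_to_number)

print(converted.tolist())
```

Unlike an ordinal mapping, this keeps the values on a "years" scale, at the cost of capping "10+ years" at 10.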
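1.5 Outlier Handling
For numeric features such as "Annual Income" and "Current Loan Amount", outliers can be detected with a box plot; whether to treat them is then decided case by case.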
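A box plot flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles, and the same rule can be applied numerically. The sketch below uses a hypothetical helper, iqr_outlier_mask, on a short series; four of the values (including 99999999.0) appear in the preview of "Current Loan Amount" above, the remaining three are made up to give the quartiles something to work with.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True where a value falls outside the box-plot whiskers."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (series < lower) | (series > upper)

# Loan amounts: some taken from the preview, some invented for illustration
loan = pd.Series([99999999.0, 264968.0, 121396.0, 125840.0,
                  310000.0, 180000.0, 220000.0])

mask = iqr_outlier_mask(loan)
print(loan[mask])  # values flagged as outliers by the 1.5 * IQR rule
```

Whether a flagged value such as 99999999.0 is a data-entry placeholder or a genuine amount cannot be decided from the rule alone; that judgment depends on the dataset's documentation.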
1.6 Feature Scaling
Apply Min-Max normalization or Z-score standardization to the numeric features so that they share a comparable value range.
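Both scaling methods are short formulas: Min-Max maps each column to [0, 1] via (x - min) / (max - min), and Z-score rescales it to zero mean and unit variance via (x - mean) / std. The sketch below applies them with plain pandas to five rows taken from the preview above; in a real pipeline the statistics would normally be computed on the training set only and then reused for the test set.

```python
import pandas as pd

# Two numeric columns with the values shown in the data preview
toy = pd.DataFrame({
    "Credit Score": [749.0, 737.0, 742.0, 694.0, 719.0],
    "Monthly Debt": [7914.0, 18373.0, 13651.0, 11338.0, 7180.0],
})

# Min-Max normalization: x' = (x - min) / (max - min), result lies in [0, 1]
min_max = (toy - toy.min()) / (toy.max() - toy.min())

# Z-score standardization: z = (x - mean) / std
# ddof=0 uses the population standard deviation
z_score = (toy - toy.mean()) / toy.std(ddof=0)

print(min_max)
print(z_score)
```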
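1.7 Feature Engineering
- Deriving new features: for example, computing a debt-to-income ratio.
- Feature selection: use correlation analysis or similar methods to keep the features most strongly correlated with the target variable.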
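Both ideas can be sketched in a few lines. The definition of the ratio used here (twelve times "Monthly Debt" divided by "Annual Income") is an assumption about what the columns mean, not something the dataset documents, and the five rows from the preview are far too few for the correlations to be meaningful; the point is only to show the mechanics.

```python
import pandas as pd

# Five rows taken from the data preview
toy = pd.DataFrame({
    "Annual Income": [482087.0, 1025487.0, 751412.0, 805068.0, 776264.0],
    "Monthly Debt": [7914.0, 18373.0, 13651.0, 11338.0, 7180.0],
    "Credit Default": [0, 1, 0, 0, 0],
})

# Derived feature (assumed definition): yearly debt payments relative to yearly income
toy["Debt-to-Income Ratio"] = toy["Monthly Debt"] * 12 / toy["Annual Income"]

# Absolute Pearson correlation of every feature with the target, strongest first
corr_with_target = (
    toy.corr()["Credit Default"]
    .drop("Credit Default")
    .abs()
    .sort_values(ascending=False)
)

print(corr_with_target)
```

Pearson correlation only captures linear relationships, so a low value here does not by itself prove a feature is useless.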
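II. Preprocessing in Practice
2.1 Handling object-Type Variables
# Select the string (object) columns
discrete_features = data.select_dtypes(include=['object']).columns.tolist()
print(discrete_features)

# Show the unique values of each string column and how often they occur
for feature in discrete_features:
    print(f"\nUnique values of {feature}:")
    print(data[feature].value_counts())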
How each variable is handled:
- Home Ownership: label encoding
mapping = {
    'Own Home': 1,
    'Rent': 2,
    'Have Mortgage': 3,
    'Home Mortgage': 4
}
data['Home Ownership'] = data['Home Ownership'].map(mapping)
data.head()
- Years in current job: label encoding
years_in_job_mapping = {
    '< 1 year': 1,
    '1 year': 2,
    '2 years': 3,
    '3 years': 4,
    '4 years': 5,
    '5 years': 6,
    '6 years': 7,
    '7 years': 8,
    '8 years': 9,
    '9 years': 10,
    '10+ years': 11
}
# Missing values stay NaN after the mapping and are filled in section 2.2
data['Years in current job'] = data['Years in current job'].map(years_in_job_mapping)
- Purpose: one-hot encoding
data = pd.get_dummies(data, columns=['Purpose'])
# Convert the bool columns produced by one-hot encoding to integers
for col in data.columns:
    if 'Purpose' in col:
        data[col] = data[col].astype(int)
- Term: 0-1 mapping
term_mapping = {
    'Short Term': 0,
    'Long Term': 1
}
data['Term'] = data['Term'].map(term_mapping)
# Rename the column so that its name says what the value 1 means
data.rename(columns={'Term': 'Long Term'}, inplace=True)
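For the one-hot step above, pd.get_dummies also accepts a dtype argument, so the integer conversion can be folded into the same call. The sketch below shows this on a made-up four-row frame; it is an alternative formulation, not what the article's code does.

```python
import pandas as pd

# Toy frame with a categorical column (rows are made up for illustration)
toy = pd.DataFrame({
    "Id": [0, 1, 2, 3],
    "Purpose": ["debt consolidation", "buy a car", "debt consolidation", "other"],
})

# dtype=int makes the indicator columns integers directly instead of bool
encoded = pd.get_dummies(toy, columns=["Purpose"], dtype=int)

print(encoded)
```

Each category becomes its own 0/1 column named Purpose_<category>, and the original "Purpose" column is removed.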
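2.2 Handling Numeric Variables
# Select the numeric features
continuous_features = data.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Fill missing values with the median of each column
for feature in continuous_features:
    median_value = data[feature].median()
    # Assign the result back: calling fillna(inplace=True) on a selected column
    # is deprecated in recent pandas versions and may not modify `data`
    data[feature] = data[feature].fillna(median_value)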
Data information after processing:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 32 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Id                            7500 non-null   int64
 1   Home Ownership                7500 non-null   int64
 2   Annual Income                 7500 non-null   float64
 3   Years in current job          7500 non-null   float64
 4   Tax Liens                     7500 non-null   float64
 5   Number of Open Accounts       7500 non-null   float64
 6   Years of Credit History       7500 non-null   float64
 7   Maximum Open Credit           7500 non-null   float64
 8   Number of Credit Problems     7500 non-null   float64
 9   Months since last delinquent  7500 non-null   float64
 10  Bankruptcies                  7500 non-null   float64
 11  Long Term                     7500 non-null   int64
 12  Current Loan Amount           7500 non-null   float64
 13  Current Credit Balance        7500 non-null   float64
 14  Monthly Debt                  7500 non-null   float64
 15  Credit Score                  7500 non-null   float64
 16  Credit Default                7500 non-null   int64
 17  Purpose_business loan         7500 non-null   int32
 18  Purpose_buy a car             7500 non-null   int32
 19  Purpose_buy house             7500 non-null   int32
 20  Purpose_debt consolidation    7500 non-null   int32
 21  Purpose_educational expenses  7500 non-null   int32
 22  Purpose_home improvements     7500 non-null   int32
 23  Purpose_major purchase        7500 non-null   int32
 24  Purpose_medical bills         7500 non-null   int32
 25  Purpose_moving                7500 non-null   int32
 26  Purpose_other                 7500 non-null   int32
 27  Purpose_renewable energy      7500 non-null   int32
 28  Purpose_small business        7500 non-null   int32
 29  Purpose_take a trip           7500 non-null   int32
 30  Purpose_vacation              7500 non-null   int32
 31  Purpose_wedding               7500 non-null   int32
dtypes: float64(13), int32(15), int64(4)
memory usage: 1.4 MB
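III. Model Building and Evaluation
3.1 Splitting the Data
from sklearn.model_selection import train_test_split
X = data.drop(['Credit Default'], axis=1)  # features
y = data['Credit Default']  # labels
# Hold out 20% of the rows as the test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set shape: {X_train.shape}, test set shape: {X_test.shape}")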
Result:
Training set shape: (6000, 31), test set shape: (1500, 31)
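The split above is purely random. If the target classes turn out to be imbalanced (the article does not report the class ratio, so this is an assumption to check with value_counts()), passing stratify to train_test_split keeps the class proportions the same in both parts. A minimal sketch on synthetic labels with 20% positives:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 20 of them positive (made up for illustration)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 20 + [0] * 80)

# stratify=y_demo preserves the 20% positive rate in both the training and the test part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"positive rate, train: {y_tr.mean():.2f}, test: {y_te.mean():.2f}")
```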
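3.2 Model Training and Evaluation
Several common classification models are trained and evaluated: SVM, KNN, logistic regression, naive Bayes, decision tree, random forest, XGBoost, and LightGBM.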
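Because every model goes through the same fit, predict, and score steps, the comparison can also be written as a loop over a dictionary of models. The sketch below is one compact way to do that, shown on synthetic data from make_classification with three of the listed models; the dictionary layout and variable names are illustrative choices, and the per-model code that follows spells out the same steps one model at a time.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the credit dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    "KNN": KNeighborsClassifier(),
    "Logistic regression": LogisticRegression(random_state=42, max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)          # train on the training split
    pred = model.predict(X_te)     # predict on the held-out split
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred, zero_division=0),
        "f1": f1_score(y_te, pred, zero_division=0),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 4) for metric, value in scores.items()})
```

Collecting the scores in one dictionary makes it straightforward to turn them into a table later, for example with pd.DataFrame(results).T.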
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings("ignore")  # suppress all warnings to keep the output short

# SVM model
svm_model = SVC(random_state=42)
svm_model.fit(X_train, y_train)
svm_pred = svm_model.predict(X_test)
print("\nSVM classification report:")
print(classification_report(y_test, svm_pred))
print("SVM confusion matrix:")
print(confusion_matrix(y_test, svm_pred))
print("SVM evaluation metrics:")
print(f"Accuracy: {accuracy_score(y_test, svm_pred):.4f}")
print(f"Precision: {precision_score(y_test, svm_pred):.4f}")
print(f"Recall: {recall_score(y_test, svm_pred):.4f}")
print(f"F1 score: {f1_score(y_test, svm_pred):.4f}")

# KNN model
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)
print("\nKNN classification report:")
print(classification_report(y_test, knn_pred))
print("KNN confusion matrix:")
print(confusion_matrix(y_test, knn_pred))
print("KNN evaluation metrics:")
print(f"Accuracy: {accuracy_score(y_test, knn_pred):.4f}")
print(f"Precision: {precision_score(y_test, knn_pred):.4f}")
print(f"Recall: {recall_score(y_test, knn_pred):.4f}")
print(f"F1 score: {f1_score(y_test, knn_pred):.4f}")

# Logistic regression model
logreg_model = LogisticRegression(random_state=42)
logreg_model.fit(X_train, y_train)
logreg_pred = logreg_model.predict(X_test)
print("\nLogistic regression classification report:")
print(classification_report(y_test, logreg_pred))
print("Logistic regression confusion matrix:")
print(confusion_matrix(y_test, logreg_pred))
print("Logistic regression evaluation metrics:")
print(f"Accuracy: {accuracy_score(y_test, logreg_pred):.4f}")
print(f"Precision: {precision_score(y_test, logreg
@浙大疏錦行