Day 10: A Full Machine Learning Workflow in Python

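Data preprocessing and model building are critical steps in any machine learning project. This post reviews the full preprocessing workflow, then trains and evaluates simple machine learning models on the processed data. Complex hyperparameter tuning is out of scope for now.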

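1. Preprocessing Workflow Review

The success of machine learning depends heavily on high-quality data. The standard preprocessing workflow is:

Import libraries: bring in the Python libraries needed for data processing, analysis, visualization, and modeling.
Read and understand the data: load the dataset and use info() and head() to get a first look at its basic information and structure.
Missing value handling: identify and handle missing values in the data.
Outlier handling: detect and handle anomalous data points.
Categorical value handling: convert discrete (categorical) data into a format the model can work with.
Feature engineering: feature scaling, deriving new features, and feature selection.
Split the dataset: divide the data into a training set and a test set for model training and evaluation.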

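1.1 Importing the required packages

import pandas as pd  # data processing and analysis for tabular data
import numpy as np   # numerical computing with efficient array operations
import matplotlib.pyplot as plt  # plotting all kinds of charts
import seaborn as sns  # higher-level statistical plotting built on matplotlib

# Set a font with Chinese glyphs (fixes Chinese text rendering in plots)
plt.rcParams['font.sans-serif'] = ['SimHei']  # SimHei is a common font on Windows
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly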

1.2 Inspecting the data

data = pd.read_csv('data.csv')    # load the data
print("Basic information:")
data.info()
print("\nFirst 5 rows:")
print(data.head())

Basic information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Id                            7500 non-null   int64  
 1   Home Ownership                7500 non-null   object 
 2   Annual Income                 5943 non-null   float64
 3   Years in current job          7129 non-null   object 
 4   Tax Liens                     7500 non-null   float64
 5   Number of Open Accounts       7500 non-null   float64
 6   Years of Credit History       7500 non-null   float64
 7   Maximum Open Credit           7500 non-null   float64
 8   Number of Credit Problems     7500 non-null   float64
 9   Months since last delinquent  3419 non-null   float64
 10  Bankruptcies                  7486 non-null   float64
 11  Purpose                       7500 non-null   object 
 12  Term                          7500 non-null   object 
 13  Current Loan Amount           7500 non-null   float64
 14  Current Credit Balance        7500 non-null   float64
 15  Monthly Debt                  7500 non-null   float64
 16  Credit Score                  5943 non-null   float64
 17  Credit Default                7500 non-null   int64  
dtypes: float64(12), int64(2), object(4)
memory usage: 1.0+ MB

First 5 rows:

   Id Home Ownership  Annual Income Years in current job  Tax Liens  \
0   0       Own Home       482087.0                  NaN        0.0   
1   1       Own Home      1025487.0            10+ years        0.0   
2   2  Home Mortgage       751412.0              8 years        0.0   
3   3       Own Home       805068.0              6 years        0.0   
4   4           Rent       776264.0              8 years        0.0   

   Number of Open Accounts  Years of Credit History  Maximum Open Credit  \
0                     11.0                     26.3             685960.0   
1                     15.0                     15.3            1181730.0   
2                     11.0                     35.0            1182434.0   
3                      8.0                     22.5             147400.0   
4                     13.0                     13.6             385836.0   

   Number of Credit Problems  Months since last delinquent  Bankruptcies  \
0                        1.0                           NaN           1.0   
1                        0.0                           NaN           0.0   
2                        0.0                           NaN           0.0   
3                        1.0                           NaN           1.0   
4                        1.0                           NaN           0.0   

              Purpose        Term  Current Loan Amount  \
0  debt consolidation  Short Term           99999999.0   
1  debt consolidation   Long Term             264968.0   
2  debt consolidation  Short Term           99999999.0   
3  debt consolidation  Short Term             121396.0   
4  debt consolidation  Short Term             125840.0   

   Current Credit Balance  Monthly Debt  Credit Score  Credit Default  
0                 47386.0        7914.0         749.0               0  
1                394972.0       18373.0         737.0               1  
2                308389.0       13651.0         742.0               0  
3                 95855.0       11338.0         694.0               0  
4                 93309.0        7180.0         719.0               0  

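1.3 Missing value handling

Annual Income: 1,557 missing values. These can be filled with the mean income of a related group, for example grouped by "Home Ownership".
Years in current job: 371 missing values. Convert the strings to numbers first, then fill with the mode or the median.
Months since last delinquent: many missing values (4,081). Depending on how strongly the feature affects the target variable, use multiple imputation or drop the rows with missing values.
Credit Score: 1,557 missing values. Handle these the same way as "Annual Income".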
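
The group-wise fill suggested for "Annual Income" can be sketched as follows. This is a minimal illustration on a made-up toy DataFrame, not the author's code; the column names match the dataset, but the values are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "Home Ownership": ["Rent", "Rent", "Own Home", "Own Home", "Rent"],
    "Annual Income": [400000.0, np.nan, 900000.0, np.nan, 600000.0],
})

# Mean income within each Home Ownership group, broadcast back to every row.
group_mean = toy.groupby("Home Ownership")["Annual Income"].transform("mean")

# Fill only the missing entries with the mean of the row's own group.
toy["Annual Income"] = toy["Annual Income"].fillna(group_mean)
print(toy)
```

Here the missing "Rent" income is filled with the mean of the other "Rent" rows, and the missing "Own Home" income with the mean of the "Own Home" rows.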

1.4 Data type conversion

Years in current job: convert from string to numeric.
Home Ownership, Purpose, Term: choose one-hot encoding or label encoding depending on the nature of each feature.
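
One possible way to turn the "Years in current job" labels into numbers is to extract the digits from each label. The sketch below is only an illustration: the helper name parse_years is hypothetical, and treating "< 1 year" as 0 is an assumption. Section 2.1 of this post uses an explicit mapping dictionary instead.

```python
import numpy as np
import pandas as pd

def parse_years(value):
    """Convert labels such as '< 1 year' or '10+ years' to a number; keep NaN."""
    if pd.isna(value):
        return np.nan
    if value.startswith("<"):
        return 0.0  # assumption: "< 1 year" is treated as 0 years
    digits = "".join(ch for ch in value if ch.isdigit())
    return float(digits)

# Sample labels in the same format as the dataset.
years = pd.Series(["< 1 year", "1 year", "8 years", "10+ years", np.nan])
years_num = years.map(parse_years)
print(years_num)
```

Missing values stay missing after the conversion, so they can still be filled with the mode or median afterwards.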

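1.5 Outlier handling

For numeric features such as "Annual Income" and "Current Loan Amount", outliers can be detected with a box plot. Whether to treat them depends on the actual situation.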
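
A box plot flags points beyond 1.5 times the interquartile range (IQR) from the quartiles, and the same rule can be applied numerically. The sketch below assumes that rule; the helper name iqr_outlier_mask is hypothetical, and the sample mixes loan amounts from the preview rows with two invented values.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Loan amounts: four from the preview rows, two invented for illustration.
loan = pd.Series([121396.0, 125840.0, 264968.0, 99999999.0, 150000.0, 180000.0])
mask = iqr_outlier_mask(loan)
print(loan[mask])
```

On this sample only 99999999.0 is flagged. That value looks like a placeholder rather than a real loan amount, although the source does not confirm this.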

1.6 Feature scaling

Apply Min-Max normalization or Z-score standardization to the numeric features so that they share a comparable range of values.
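
The two scaling methods reduce to simple formulas: Min-Max maps values to [0, 1] via (x - min) / (max - min), and Z-score gives zero mean and unit variance via (x - mean) / std. A minimal NumPy sketch, using the five Credit Score values from the preview rows:

```python
import numpy as np

# Credit Score values from the five preview rows.
scores = np.array([749.0, 737.0, 742.0, 694.0, 719.0])

# Min-Max normalization: smallest value becomes 0, largest becomes 1.
min_max = (scores - scores.min()) / (scores.max() - scores.min())

# Z-score standardization: subtract the mean, divide by the (population) standard deviation.
z_score = (scores - scores.mean()) / scores.std()

print(min_max)
print(z_score)
```

In a real pipeline, scikit-learn's MinMaxScaler and StandardScaler do the same job and can be fitted on the training set only, so that test-set statistics do not leak into training.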

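1.7 Feature engineering

Deriving new features: for example, computing a debt-to-income ratio.
Feature selection: use correlation analysis or similar methods to keep the features most strongly correlated with the target variable.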
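
Both ideas can be sketched on the five preview rows. The definition used here (annualized Monthly Debt divided by Annual Income) and the column name "Debt to Income" are assumptions for illustration, since the source does not give a formula. Correlations computed on five rows are not meaningful; the point is only the mechanics.

```python
import pandas as pd

# Values taken from the five preview rows of the dataset.
toy = pd.DataFrame({
    "Annual Income": [482087.0, 1025487.0, 751412.0, 805068.0, 776264.0],
    "Monthly Debt": [7914.0, 18373.0, 13651.0, 11338.0, 7180.0],
    "Credit Default": [0, 1, 0, 0, 0],
})

# Derived feature (assumed definition): yearly debt payments relative to yearly income.
toy["Debt to Income"] = toy["Monthly Debt"] * 12 / toy["Annual Income"]

# Rank the features by absolute Pearson correlation with the target.
corr = toy.corr()["Credit Default"].drop("Credit Default")
ranked = corr.abs().sort_values(ascending=False)
print(ranked)
```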

2. Preprocessing in Practice

2.1 Handling object-type variables

# Select the string (object) columns
discrete_features = data.select_dtypes(include=['object']).columns.tolist()
print(discrete_features)

# Show the unique values of each string column
for feature in discrete_features:
    print(f"\nUnique values of {feature}:")
    print(data[feature].value_counts())

How each variable is handled:

Home Ownership: label encoding

mapping = {
    'Own Home': 1,
    'Rent': 2,
    'Have Mortgage': 3,
    'Home Mortgage': 4
}
data['Home Ownership'] = data['Home Ownership'].map(mapping)
data.head()

Years in current job: label encoding

years_in_job_mapping = {
    '< 1 year': 1,
    '1 year': 2,
    '2 years': 3,
    '3 years': 4,
    '4 years': 5,
    '5 years': 6,
    '6 years': 7,
    '7 years': 8,
    '8 years': 9,
    '9 years': 10,
    '10+ years': 11
}
data['Years in current job'] = data['Years in current job'].map(years_in_job_mapping)

Purpose: one-hot encoding

data = pd.get_dummies(data, columns=['Purpose'])
# Convert the bool columns produced by one-hot encoding to integers
for col in data.columns:
    if 'Purpose' in col:
        data[col] = data[col].astype(int)

Term: 0-1 mapping

term_mapping = {
    'Short Term': 0,
    'Long Term': 1
}
data['Term'] = data['Term'].map(term_mapping)
data.rename(columns={'Term': 'Long Term'}, inplace=True)
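
As a side note on the one-hot step for Purpose: pd.get_dummies accepts a dtype argument, so the separate bool-to-int loop can be avoided. The sketch below uses a made-up three-row frame; the exact integer width of the result (int32 or int64) depends on the platform and NumPy version.

```python
import pandas as pd

# Small made-up frame with the same kind of Purpose labels as the dataset.
toy = pd.DataFrame({
    "Id": [0, 1, 2],
    "Purpose": ["debt consolidation", "buy a car", "debt consolidation"],
})

# dtype=int makes get_dummies emit 0/1 integers directly instead of booleans.
encoded = pd.get_dummies(toy, columns=["Purpose"], dtype=int)
print(encoded)
```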

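2.2 Handling numeric variables

# Select the numeric features
continuous_features = data.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Fill missing values with the median of each column
for feature in continuous_features:
    median_value = data[feature].median()
    # Assign the result back; fillna(inplace=True) on a column selection is unreliable in recent pandas
    data[feature] = data[feature].fillna(median_value)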

Data information after processing:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 32 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Id                            7500 non-null   int64  
 1   Home Ownership                7500 non-null   int64  
 2   Annual Income                 7500 non-null   float64
 3   Years in current job          7500 non-null   float64
 4   Tax Liens                     7500 non-null   float64
 5   Number of Open Accounts       7500 non-null   float64
 6   Years of Credit History       7500 non-null   float64
 7   Maximum Open Credit           7500 non-null   float64
 8   Number of Credit Problems     7500 non-null   float64
 9   Months since last delinquent  7500 non-null   float64
 10  Bankruptcies                  7500 non-null   float64
 11  Long Term                     7500 non-null   int64  
 12  Current Loan Amount           7500 non-null   float64
 13  Current Credit Balance        7500 non-null   float64
 14  Monthly Debt                  7500 non-null   float64
 15  Credit Score                  7500 non-null   float64
 16  Credit Default                7500 non-null   int64  
 17  Purpose_business loan         7500 non-null   int32  
 18  Purpose_buy a car             7500 non-null   int32  
 19  Purpose_buy house             7500 non-null   int32  
 20  Purpose_debt consolidation    7500 non-null   int32  
 21  Purpose_educational expenses  7500 non-null   int32  
 22  Purpose_home improvements     7500 non-null   int32  
 23  Purpose_major purchase        7500 non-null   int32  
 24  Purpose_medical bills         7500 non-null   int32  
 25  Purpose_moving                7500 non-null   int32  
 26  Purpose_other                 7500 non-null   int32  
 27  Purpose_renewable energy      7500 non-null   int32  
 28  Purpose_small business        7500 non-null   int32  
 29  Purpose_take a trip           7500 non-null   int32  
 30  Purpose_vacation              7500 non-null   int32  
 31  Purpose_wedding               7500 non-null   int32  
dtypes: float64(13), int32(15), int64(4)
memory usage: 1.4 MB

3. Model Building and Evaluation

3.1 Splitting the data

from sklearn.model_selection import train_test_split
X = data.drop(['Credit Default'], axis=1)  # features
y = data['Credit Default']  # labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set shape: {X_train.shape}, test set shape: {X_test.shape}")

Result:

Training set shape: (6000, 31), test set shape: (1500, 31)
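
If the two classes of the target turn out to be imbalanced (the source does not report the class ratio), passing stratify to train_test_split keeps the class proportions the same in both splits. A minimal sketch on synthetic data with an assumed 80/20 class ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 2 features, 80 negatives and 20 positives (assumed ratio).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo preserves the 80/20 class ratio in both the training and test split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(f"positives in train: {y_tr.sum()}, positives in test: {y_te.sum()}")
```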

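3.2 Model training and evaluation

Several common classification models are trained and evaluated: SVM, KNN, logistic regression, naive Bayes, decision tree, random forest, XGBoost, and LightGBM.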

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings("ignore")

# SVM model
svm_model = SVC(random_state=42)
svm_model.fit(X_train, y_train)
svm_pred = svm_model.predict(X_test)
print("\nSVM classification report:")
print(classification_report(y_test, svm_pred))
print("SVM confusion matrix:")
print(confusion_matrix(y_test, svm_pred))
print("SVM evaluation metrics:")
print(f"Accuracy: {accuracy_score(y_test, svm_pred):.4f}")
print(f"Precision: {precision_score(y_test, svm_pred):.4f}")
print(f"Recall: {recall_score(y_test, svm_pred):.4f}")
print(f"F1 score: {f1_score(y_test, svm_pred):.4f}")

# KNN model
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)
print("\nKNN classification report:")
print(classification_report(y_test, knn_pred))
print("KNN confusion matrix:")
print(confusion_matrix(y_test, knn_pred))
print("KNN evaluation metrics:")
print(f"Accuracy: {accuracy_score(y_test, knn_pred):.4f}")
print(f"Precision: {precision_score(y_test, knn_pred):.4f}")
print(f"Recall: {recall_score(y_test, knn_pred):.4f}")
print(f"F1 score: {f1_score(y_test, knn_pred):.4f}")

# Logistic regression model
logreg_model = LogisticRegression(random_state=42)
logreg_model.fit(X_train, y_train)
logreg_pred = logreg_model.predict(X_test)
print("\nLogistic regression classification report:")
print(classification_report(y_test, logreg_pred))
print("Logistic regression confusion matrix:")
print(confusion_matrix(y_test, logreg_pred))
print("Logistic regression evaluation metrics:")
print(f"Accuracy: {accuracy_score(y_test, logreg_pred):.4f}")
print(f"Precision: {precision_score(y_test, logreg_pred):.4f}")
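
The fit, predict, and report pattern repeated for each model above can be folded into one helper. The sketch below is an illustration rather than the author's code: the helper name evaluate is hypothetical, the data is synthetic (generated with make_classification), and only three scikit-learn models are included so the example runs without xgboost or lightgbm installed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model, predict on the test set, and return the four metrics as a dict."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }

# Synthetic binary classification data standing in for the credit dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    "KNN": KNeighborsClassifier(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=42),
}

# One loop replaces the copy-pasted block per model.
results = {name: evaluate(m, X_tr, X_te, y_tr, y_te) for name, m in models.items()}
for name, metrics in results.items():
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

Adding another classifier then only requires a new entry in the models dictionary.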

@浙大疏錦行