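Detailed Case Study: Ensemble Algorithms

Below is a complete worked example that uses Random Forest (RF) and XGBoost to solve a classification problem on structured data, taking Titanic survival prediction as the example. It covers data processing, modeling, and result analysis.

Case study: Titanic passenger survival prediction

Goal: predict whether a passenger survived (0 = died, 1 = survived) from passenger features such as age and passenger class.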

I. Data preparation

1. Load the data

import pandas as pd
train = pd.read_csv("train.csv")  # training set
test = pd.read_csv("test.csv")  # test set

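2. Data cleaning

- Missing values:

  - Age (Age): fill with the median.

  - Port of embarkation (Embarked): fill with the mode.

  - Ticket fare (Fare): fill the missing test-set values with the mean.

# Assign the result back rather than calling fillna(inplace=True) on a column:
# the chained inplace form warns in recent pandas versions and does not update
# the DataFrame once copy-on-write is enabled.
train['Age'] = train['Age'].fillna(train['Age'].median())
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])
test['Age'] = test['Age'].fillna(test['Age'].median())
test['Fare'] = test['Fare'].fillna(test['Fare'].mean())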

- Feature engineering:

  - Extract the first letter of the cabin (e.g. Cabin='C85' → 'C'); mark missing values as 'X'.

  - Convert categorical variables (such as sex and port of embarkation) to numeric columns with one-hot encoding.

train['Cabin'] = train['Cabin'].fillna('X').apply(lambda x: x[0])
test['Cabin'] = test['Cabin'].fillna('X').apply(lambda x: x[0])
train = pd.get_dummies(train, columns=['Sex', 'Embarked', 'Cabin'])
test = pd.get_dummies(test, columns=['Sex', 'Embarked', 'Cabin'])
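One-hot encoding the two frames separately can produce different column sets when a category appears in only one of them. The sketch below illustrates this on made-up toy data (the frames, the `encode` helper, and the cabin values are all hypothetical) and shows one way to align the test columns to the training columns.

```python
import pandas as pd

# Toy frames (hypothetical data): the 'T' deck appears only in the training frame.
toy_train = pd.DataFrame({"Cabin": ["C85", None, "T12"], "Sex": ["male", "female", "male"]})
toy_test = pd.DataFrame({"Cabin": ["B20", None], "Sex": ["female", "male"]})


def encode(df):
    """Reduce Cabin to its first letter ('X' if missing), then one-hot encode."""
    out = df.copy()
    out["Cabin"] = out["Cabin"].fillna("X").str[0]
    return pd.get_dummies(out, columns=["Sex", "Cabin"])


enc_train = encode(toy_train)
enc_test = encode(toy_test)

# Dummy columns present in training but absent from the test frame.
missing_in_test = sorted(set(enc_train.columns) - set(enc_test.columns))
print("missing in test:", missing_in_test)

# Align the test frame to the training columns: absent categories become 0,
# and categories never seen in training (Cabin_B here) are dropped.
enc_test_aligned = enc_test.reindex(columns=enc_train.columns, fill_value=0)
print(list(enc_test_aligned.columns))
```

Dropping unseen categories is a design choice: the model has no weights for them anyway, so the row simply falls back to all-zero dummies for that variable.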

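- Select the core features:

features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'] + [
    col for col in train.columns
    if col.startswith(('Sex_', 'Embarked_', 'Cabin_'))
]
X_train = train[features]
y_train = train['Survived']
# Encoding train and test separately can leave the test set without some dummy
# columns (e.g. a cabin letter seen only in training); reindex adds them as 0.
X_test = test.reindex(columns=features, fill_value=0)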

II. Model training and tuning

1. Random Forest (RF)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Baseline model with default hyperparameters
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Tune hyperparameters with a grid search (5-fold cross-validation)
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
}
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_rf = grid_search.best_estimator_  # refit on all training rows with the best parameters
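After a grid search it is usually worth looking at which combination won and what cross-validated score it reached. The following is a minimal sketch on synthetic data; `X_demo` and `y_demo` are stand-ins generated with `make_classification`, not the Titanic features, and the grid is shrunk to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (assumption, not the Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

param_grid = {"n_estimators": [20, 40], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated accuracy of the winning combination,
# a less optimistic figure than accuracy measured on the training rows.
print("best params:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 3))
```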

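2. XGBoost

import xgboost as xgb
from xgboost import plot_importance
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Hold out part of the training data for early stopping; monitoring the
# training set itself would never signal overfitting.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42
)

# Baseline model. In XGBoost 1.6+ early_stopping_rounds is a constructor
# argument; passing it to fit() was removed in 2.0.
xgb_model = xgb.XGBClassifier(
    objective='binary:logistic',
    random_state=42,
    n_estimators=500,
    learning_rate=0.1,
    early_stopping_rounds=50,
)
xgb_model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)

# Visualize feature importance
plot_importance(xgb_model)
plt.show()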
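A plot is convenient for a quick look, but a numeric ranking is easier to log or compare between runs. The sketch below uses scikit-learn's RandomForestClassifier on synthetic data as a stand-in, since both it and XGBClassifier expose a `feature_importances_` attribute; the feature names `f0` to `f4` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names (assumption, not the Titanic data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
names = [f"f{i}" for i in range(5)]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Pair each importance with its feature name and sort from most to least important.
ranking = pd.Series(model.feature_importances_, index=names).sort_values(ascending=False)
print(ranking)
```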

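III. Result analysis

1. Model evaluation (training set)

from sklearn.metrics import accuracy_score, classification_report

# Note: these are training-set scores, so they overstate performance on unseen data.

# Random Forest
y_pred_rf = best_rf.predict(X_train)
print("RF accuracy:", accuracy_score(y_train, y_pred_rf))  # example output: ~0.85

# XGBoost
y_pred_xgb = xgb_model.predict(X_train)
print("XGBoost accuracy:", accuracy_score(y_train, y_pred_xgb))  # example output: ~0.88
print(classification_report(y_train, y_pred_xgb))  # per-class precision, recall, F1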
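Because accuracy on the rows a model was trained on tends to be optimistic, a cross-validated estimate is a useful companion figure. Here is a small sketch of the comparison on synthetic data (`X_demo` and `y_demo` are stand-ins, and the model settings are illustrative only).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y_train (assumption, not the Titanic data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# Accuracy on the same rows the model was fitted on.
train_acc = model.fit(X_demo, y_demo).score(X_demo, y_demo)

# Accuracy on held-out folds: each fold is scored by a model that never saw it.
cv_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print("training accuracy:", round(train_acc, 3))
print("5-fold CV accuracy:", round(cv_scores.mean(), 3))
```

A large gap between the two numbers is a common sign of overfitting.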

2. Prediction and submission (test set)

# Generate predictions
test['Survived'] = best_rf.predict(X_test)  # or use xgb_model.predict(X_test)
submission = test[['PassengerId', 'Survived']]
submission.to_csv("submission.csv", index=False)

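IV. Key takeaways

1. Feature importance:

- XGBoost ranks sex (Sex_female), passenger class (Pclass), and age (Age) as the most important predictors of survival (see the feature importance plot).

2. Model comparison:

- After tuning, Random Forest reaches about 85% accuracy; XGBoost, with regularization and early stopping, scores higher (about 88%). Both figures are measured on the training set, so they should be read with the risk of overfitting in mind.

3. Possible improvements:

- Try a Stacking ensemble (for example, RF as a base learner with logistic regression as the meta-model), or tune XGBoost's max_depth and reg_alpha further.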
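The Stacking idea mentioned above could look roughly like the sketch below, built with scikit-learn's StackingClassifier. It runs on synthetic data, and GradientBoostingClassifier is used as a stand-in second base learner in place of XGBoost; both choices are assumptions made to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y_train (assumption, not the Titanic data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),  # stand-in for XGBoost
    ],
    final_estimator=LogisticRegression(),  # meta-model trained on base predictions
    cv=5,  # base predictions for the meta-model come from out-of-fold fits
)

scores = cross_val_score(stack, X_demo, y_demo, cv=3, scoring="accuracy")
print("stacking CV accuracy:", round(scores.mean(), 3))

stack.fit(X_demo, y_demo)
```

The inner `cv=5` matters: the meta-model is trained on predictions the base learners made for rows they did not see, which limits leakage from base learners that fit the training rows almost perfectly.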

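Code optimization tips

- Parallelism: set n_jobs=-1 on both XGBoost and RF to train on all available CPU cores.

- Data leakage check: derive imputation values and other feature-engineering statistics without mixing the two datasets; in particular, never use test-set data to fill missing values in the training set.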
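One way to make the leakage rule hard to break is to put the imputer and the model in a single scikit-learn pipeline, so the fill values are learned only from the rows passed to `fit`. The sketch below is illustrative: the data is synthetic with values knocked out at random, and median imputation with a Random Forest is just one possible combination.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data with roughly 10% of values removed (assumption, not the Titanic data).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

pipe = make_pipeline(
    SimpleImputer(strategy="median"),  # medians are learned from X_tr only
    RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=42),
)
pipe.fit(X_tr, y_tr)

# The fitted imputer stores the per-column medians of the training rows;
# the same values are reused when the held-out rows are transformed.
learned_medians = pipe.named_steps["simpleimputer"].statistics_
print("held-out accuracy:", round(pipe.score(X_te, y_te), 3))
```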

