Machine Learning Algorithms: An Introduction to XGBoost and How to Use It

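Preface:

I recently ran into a regression problem on structured data at work. I tried a number of regression algorithms (multiple linear regression among them), but none worked well, so I switched to XGBoost. I also hand-crafted several additional features from the three original ones, and the trained model performed quite well. XGBoost is well suited to prediction on structured data. I hope this write-up helps.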

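1. Basic Introduction to XGBoost

XGBoost (eXtreme Gradient Boosting) is an efficient machine learning algorithm based on gradient boosted decision trees (GBDT), proposed by Tianqi Chen and collaborators. It adds a number of optimizations on top of GBDT and is known for fast training, strong predictive performance, and good generalization. It is widely used in machine learning competitions and in industrial applications.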

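2. Core Principles of XGBoost

(1) The gradient boosting framework

XGBoost follows the gradient boosting idea: it makes predictions with a sequence of decision trees built iteratively, where each new tree is fit to correct the prediction error of all the trees before it. Concretely, suppose t-1 trees have already been built, and for sample i the sum of their predictions is \(\hat{y}_i^{(t-1)}\). The t-th tree \(f_t\) is then chosen so that the updated prediction \(\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)\) reduces the error as much as possible.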
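To make the iterative idea concrete, here is a minimal sketch of gradient boosting for squared loss in plain Python. It is an illustration only, not XGBoost's implementation: the weak learner is a one-split "stump" on a single feature, and the names fit_stump and boost, the toy data, and the learning rate are all assumptions made for this example.

```python
def fit_stump(xs, residuals):
    """Find the threshold on a single feature that best fits the residuals."""
    best = None
    # Dropping the largest value guarantees the right side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Return the fitted stumps and the training MSE after each round."""
    preds = [0.0] * len(xs)  # start from a prediction of 0 for every sample
    trees, history = [], []
    for _ in range(n_trees):
        # For squared loss the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # New prediction = previous prediction + shrunken output of the new tree.
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return trees, history


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
trees, history = boost(xs, ys)
print(f"MSE after round 1: {history[0]:.3f}, after round {len(history)}: {history[-1]:.3f}")
```

Each round fits the new stump to what the previous rounds still get wrong, so the training error never increases from one round to the next.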

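(2) Regularization

XGBoost adds a regularization term to the objective to prevent overfitting. The term penalizes the number of leaf nodes in each tree and the L2 norm of the leaf weights. The objective function can be written as:

\[ Obj(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \]

Here \(l(y_i, \hat{y}_i)\) is the loss function, which measures the difference between the prediction \(\hat{y}_i\) and the true value \(y_i\); \(\Omega(f_k)\) is the regularization term, \(f_k\) denotes the k-th tree, and \(K\) is the number of trees.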
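For reference, the XGBoost paper defines the regularization term of a single tree as follows. Here \(T\) is the number of leaves, \(w_j\) is the weight (output score) of leaf \(j\), and \(\gamma\) and \(\lambda\) are hyperparameters; in the Python API they correspond to gamma and reg_lambda.

```latex
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
```

A larger \(\gamma\) makes every extra leaf more expensive, and a larger \(\lambda\) shrinks the leaf weights toward zero.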

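(3) Tree construction

XGBoost builds each tree with a greedy algorithm. For every feature it tries different split points, computes the gain of each candidate split, and splits at the point with the largest gain. To speed this up, XGBoost also supports histogram-based optimization: the values of a continuous feature are discretized into a limited number of histogram bins, so only the bin boundaries need to be evaluated as split points.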
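The greedy search can be sketched as follows, using the split gain formula from the XGBoost paper: Gain = 1/2 [G_L^2/(H_L+λ) + G_R^2/(H_R+λ) − (G_L+G_R)^2/(H_L+H_R+λ)] − γ, where G and H are the sums of first and second derivatives of the loss on each side. This is a simplified sketch for a single feature with exact (not histogram-based) search; the function names and toy data are assumptions for illustration.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of splitting a node into a left and a right child."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma


def best_split(values, grads, hess, reg_lambda=1.0, gamma=0.0):
    """Exact greedy search on one feature: try every boundary between sorted values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    g_total, h_total = sum(grads), sum(hess)
    g_left = h_left = 0.0
    best_gain, best_threshold = 0.0, None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        g_left += grads[i]
        h_left += hess[i]
        if values[i] == values[nxt]:
            continue  # cannot split between identical values
        gain = split_gain(g_left, h_left, g_total - g_left, h_total - h_left,
                          reg_lambda, gamma)
        if gain > best_gain:
            best_gain = gain
            best_threshold = (values[i] + values[nxt]) / 2
    return best_threshold, best_gain


# Toy data: the target jumps between x=3 and x=10; current prediction is 0.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
targets = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
grads = [0.0 - y for y in targets]  # g = y_hat - y for the loss 1/2 (y - y_hat)^2
hess = [1.0] * len(targets)         # h = 1 for that loss
threshold, gain = best_split(values, grads, hess)
print(threshold, round(gain, 3))
```

On this toy data the search picks the threshold 6.5, the midpoint of the gap where the target jumps. A histogram-based variant would run the same gain computation but only at bin boundaries instead of between every pair of neighboring values.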

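3. Core Features of XGBoost

(1) Efficiency

Parallel computation: while a tree is being built, the search for the best split is parallelized across features, which speeds up training considerably. (The trees themselves are still built one after another.)
Histogram optimization: as described above, discretizing continuous features reduces the time spent searching for split points.

(2) Flexibility

Multiple loss functions: a suitable loss can be chosen for the task at hand (classification, regression, and so on), for example squared loss or logistic loss.
Multiple data types: both numeric and categorical data can be used; categorical data needs to be encoded appropriately first.

(3) Robustness

Robust to missing values: XGBoost handles missing values natively, learning during training which branch samples with a missing value should follow at each split.
Built-in regularization: the regularization term helps prevent overfitting and improves generalization.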
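How the handling of missing values is learned can be sketched as follows: for a candidate split, samples with a missing feature value are tentatively sent to the left child and then to the right child, and the direction with the higher gain becomes the default. This is a simplified illustration of the idea described in the XGBoost paper (sparsity-aware split finding), not the library's code; the function names and toy data are assumptions.

```python
import math


def leaf_score(g, h, reg_lambda=1.0):
    """Structure score G^2 / (H + lambda) of a leaf."""
    return g * g / (h + reg_lambda)


def best_default_direction(values, grads, hess, threshold, reg_lambda=1.0):
    """For one candidate threshold, decide where rows with a missing value go."""
    g_left = h_left = g_right = h_right = g_miss = h_miss = 0.0
    for v, g, h in zip(values, grads, hess):
        if math.isnan(v):
            g_miss += g
            h_miss += h
        elif v <= threshold:
            g_left += g
            h_left += h
        else:
            g_right += g
            h_right += h

    parent = leaf_score(g_left + g_right + g_miss,
                        h_left + h_right + h_miss, reg_lambda)
    gain_if_left = 0.5 * (leaf_score(g_left + g_miss, h_left + h_miss, reg_lambda)
                          + leaf_score(g_right, h_right, reg_lambda) - parent)
    gain_if_right = 0.5 * (leaf_score(g_left, h_left, reg_lambda)
                           + leaf_score(g_right + g_miss, h_right + h_miss, reg_lambda)
                           - parent)
    if gain_if_left >= gain_if_right:
        return "left", gain_if_left
    return "right", gain_if_right


nan = float("nan")
values = [1.0, 2.0, nan, 10.0, 11.0, nan]   # two rows have no feature value
targets = [1.0, 1.0, 5.0, 5.0, 5.0, 5.0]    # the missing rows look like the "large x" rows
grads = [0.0 - y for y in targets]          # squared loss, current prediction 0
hess = [1.0] * len(targets)
direction, gain = best_default_direction(values, grads, hess, threshold=6.0)
print(direction, round(gain, 3))
```

Here the rows with missing values have targets similar to the right-hand group, so the right child is chosen as the default direction. When using the library itself, NaN entries in the input array are treated as missing by default.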

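4. Steps for Using XGBoost

(1) Data preparation

Data collection: obtain the datasets used for training and testing.
Data cleaning: handle missing values, outliers, and similar problems. Missing values can be filled with the mean, median, or mode; outliers can be removed or corrected depending on the situation.
Feature engineering: apply feature selection, feature transformation, and similar operations to improve model performance, for example standardization, normalization, or constructing new features. (Tree-based models such as XGBoost are largely insensitive to feature scaling, so scaling mainly matters when the same features are also fed to scale-sensitive models such as linear regression.)

(2) Model training

Import the XGBoost library: in Python, use import xgboost as xgb.
Split into training and test sets: use the train_test_split function (from sklearn.model_selection); a 7:3 or 8:2 split is typical.
Define the parameters: set the relevant XGBoost parameters, such as the learning rate (learning_rate), the number of trees (n_estimators), and the maximum depth (max_depth).
Train the model: build the model with XGBClassifier (classification) or XGBRegressor (regression) and train it with the fit method.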

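(3) Model evaluation

Prediction: use the trained model to predict on the test set.
Evaluation metrics: choose metrics that fit the task. For classification, common choices are accuracy, precision, recall, F1 score, and the ROC curve; for regression, mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE).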
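The metrics listed above are simple to compute by hand, which can help when reading library output. The following plain-Python sketch shows the definitions; in practice the equivalent functions in sklearn.metrics would normally be used. The function names and the small example arrays are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """MSE, RMSE and MAE for two equal-length sequences of numbers."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


def binary_classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


reg = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
clf = binary_classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(reg)
print(clf)
```

RMSE is in the same unit as the target, which usually makes it easier to interpret than MSE; MAE is less sensitive to a few large errors.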

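(4) Model tuning

Grid search: define a set of candidate values for each parameter and evaluate every combination to find the best one.
Random search: similar to grid search, but parameter combinations are sampled at random, which can make the search more efficient when the grid is large.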
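The difference between the two strategies comes down to how candidate parameter sets are generated. The sketch below shows this in plain Python; toy_score is a hypothetical stand-in for a real cross-validated score of a model trained with the given parameters, and the grid values are arbitrary. With scikit-learn, GridSearchCV and RandomizedSearchCV provide the same two strategies together with cross-validation.

```python
import itertools
import random


def grid_candidates(param_grid):
    """Yield every combination of the listed parameter values."""
    keys = list(param_grid)
    for combo in itertools.product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, combo))


def random_candidates(param_grid, n_iter, seed=0):
    """Yield n_iter randomly sampled combinations."""
    rng = random.Random(seed)
    for _ in range(n_iter):
        yield {k: rng.choice(v) for k, v in param_grid.items()}


def search(candidates, score_fn):
    """Return (best_params, best_score); a higher score is better."""
    return max(((p, score_fn(p)) for p in candidates), key=lambda pair: pair[1])


param_grid = {
    "max_depth": [3, 5, 8],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [100, 500],
}


def toy_score(params):
    # Hypothetical score that peaks at max_depth=5, learning_rate=0.1.
    # In real use this would train a model and return a validation score.
    return -abs(params["max_depth"] - 5) - abs(params["learning_rate"] - 0.1)


best_params, best_score = search(grid_candidates(param_grid), toy_score)
rand_params, rand_score = search(random_candidates(param_grid, n_iter=5), toy_score)
print("grid search:", best_params)
print("random search:", rand_params)
```

Grid search evaluates all 18 combinations here, while random search evaluates only 5, so it is cheaper but may miss the best combination.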

5. Detailed Examples

(1) Classification example (Iris dataset)

Data preparation

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Model training

# Define the model
model = XGBClassifier(
    learning_rate=0.1,
    n_estimators=100,
    max_depth=3,
    objective='multi:softmax',  # multi-class classification
    num_class=3,                # number of classes
    random_state=42
)

# Train the model
model.fit(X_train, y_train)

Model evaluation

# Predict
y_pred = model.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Detailed evaluation report
print(classification_report(y_test, y_pred))

Output:

Accuracy: 1.00
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        14
           1       1.00      1.00      1.00        15
           2       1.00      1.00      1.00        11

    accuracy                           1.00        40
   macro avg       1.00      1.00      1.00        40
weighted avg       1.00      1.00      1.00        40

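The results show the XGBoost model reaching 100% classification accuracy on the Iris test split. Iris is a small and easy dataset, so a perfect score here is not unusual. (The support column sums to 40 samples, whereas test_size=0.3 on the 150-sample Iris dataset gives 45, so this output appears to come from a run with a different split.)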

(2) Regression example (Boston housing dataset)

Data preparation

from mlxtend.data import boston_housing_data
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Note: load_boston was removed from scikit-learn in version 1.2, so the data
# is loaded from the mlxtend library here instead (assumes mlxtend is installed)
X, y = boston_housing_data()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Model training

# Define the model
model = XGBRegressor(
    learning_rate=0.1,
    n_estimators=100,
    max_depth=3,
    objective='reg:squarederror',  # regression with squared error loss
    random_state=42
)

# Train the model
model.fit(X_train, y_train)

Model evaluation

# Predict
y_pred = model.predict(X_test)

# Compute the evaluation metrics
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
mae = mean_absolute_error(y_test, y_pred)

print(f"Mean squared error (MSE): {mse:.2f}")
print(f"Root mean squared error (RMSE): {rmse:.2f}")
print(f"Mean absolute error (MAE): {mae:.2f}")

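Output (example):

Mean squared error (MSE): 10.23
Root mean squared error (RMSE): 3.20
Mean absolute error (MAE): 2.35

The target in this dataset is the median home value in thousands of dollars, ranging from 5 to 50, so an RMSE of about 3.2 and an MAE of about 2.35 indicate that the model predicts reasonably well.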

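6. XGBoost Compared with Other Algorithms

(1) Compared with GBDT

Similarities: both are gradient boosted tree algorithms that make predictions by building decision trees iteratively.
Differences

Regularization: XGBoost adds an explicit regularization term that penalizes the number of leaves and the L2 norm of the leaf weights; the classic GBDT objective has no explicit regularization term, which makes it more prone to overfitting.
Computational efficiency: XGBoost supports parallel computation and histogram optimization, so it is usually much faster than a traditional GBDT implementation.
Missing values: XGBoost handles missing values automatically, while traditional GBDT implementations usually require them to be handled manually beforehand.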

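(2) Compared with random forests

Similarities: both consist of many decision trees and can be used for classification and regression.
Differences

Construction: a random forest builds its trees on bootstrap samples and combines them by voting or averaging, so the trees are independent of each other; XGBoost builds trees iteratively, and each new tree depends on the ones before it.
Bias and variance: a random forest reduces variance by averaging many trees, whereas XGBoost reduces bias through gradient boosting.
Performance: on many tasks XGBoost outperforms random forests, but a random forest may train faster because its independent trees can all be built in parallel.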

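(3) Compared with support vector machines (SVM)

Similarities: both can be used for classification and regression.
Differences

Data scale: SVMs become inefficient on large datasets, while XGBoost copes well with them.
Kernel functions: an SVM needs a suitable kernel to handle non-linear problems, whereas XGBoost handles non-linearity naturally through its combination of decision trees.
Interpretability: XGBoost models are comparatively easy to interpret, for example through feature importance scores; SVMs are harder to interpret.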

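7. My Training Script (for reference only)

max_depth: the depth of each tree; affects model complexity and the risk of overfitting.
learning_rate (or eta): the learning rate; controls the step size of each boosting iteration.
n_estimators: the number of boosted trees, that is, the number of training rounds.

Code: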

import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import xgboost as xgb
from matplotlib.font_manager import FontProperties
import joblib
font_path = r"simhei.ttf"
font = FontProperties(fname=font_path)
with open(r'test.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Data processing: flatten the nested lists into a table suitable for modeling
rows = []
for item in data:
    gcbh = item['gcbh']
    jxtbhj = item['jxtbhj']
    for debh, sl, dj, dwsl in zip(item['debh'], item['sl'], item['dj'], item['dwsl']):
        rows.append([gcbh, debh, sl, dj, dwsl, jxtbhj])
df = pd.DataFrame(rows, columns=['gcbh', 'debh', 'sl', 'dj', 'dwsl', 'jxtbhj'])

# New feature
df['single_item_total'] = df['sl'] * df['dj'] * df['dwsl']

# Add more features once the DataFrame exists
# 1. Project-level statistical features (aggregated per gcbh)
df['total_by_gcbh'] = df.groupby('gcbh')['single_item_total'].transform('sum')
df['items_count'] = df.groupby('gcbh')['debh'].transform('count')
df['avg_price'] = df.groupby('gcbh')['dj'].transform('mean')
df['price_std'] = df.groupby('gcbh')['dj'].transform('std')  # NaN for projects with a single row
df['total_quantity'] = df.groupby('gcbh')['sl'].transform('sum')

# 2. Item-level ratio features
df['price_ratio'] = df['dj'] / df['avg_price']
df['quantity_ratio'] = df['sl'] / df['total_quantity']
df['value_ratio'] = df['single_item_total'] / df['total_by_gcbh']

# Feature lists
cat_features = ['debh']
num_features = [
    'sl', 'dj', 'single_item_total',  # base features
    'total_by_gcbh', 'items_count', 'avg_price', 'price_std', 'total_quantity',
    'price_ratio', 'quantity_ratio', 'value_ratio',
]

# 3. Handle missing values
print("Data shape before handling missing values:", df.shape)
print("Missing value counts:")
print(df[num_features].isnull().sum())

# Fill missing numeric features with the median
for col in num_features:
    n_missing = df[col].isnull().sum()  # count before filling, so the message is correct
    if n_missing > 0:
        median_val = df[col].median()
        df[col] = df[col].fillna(median_val)
        print(f"Column {col}: filled {n_missing} missing values with the median {median_val:.2f}")

# Fill missing categorical features with the mode
for col in cat_features:
    n_missing = df[col].isnull().sum()
    if n_missing > 0:
        mode_val = df[col].mode()[0]
        df[col] = df[col].fillna(mode_val)
        print(f"Column {col}: filled {n_missing} missing values with the mode {mode_val}")

print("Data shape after handling missing values:", df.shape)
print("Missing value counts:")
print(df[num_features + cat_features].isnull().sum())

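# 4. Handle outliers
def handle_outliers(df, columns, n_sigmas=3):
    """Clip each column to mean +/- n_sigmas standard deviations."""
    for col in columns:
        mean = df[col].mean()
        std = df[col].std()
        df[col] = df[col].clip(mean - n_sigmas * std, mean + n_sigmas * std)
    return df

# Clip outliers in the numeric features
df = handle_outliers(df, num_features)

# No log transform on the target, since it did not seem to help
y = df['jxtbhj'].values

# Check the target for missing values
if np.isnan(y).any():
    print("Warning: the target contains missing values; those rows will be dropped")
    mask = ~np.isnan(y)
    df = df[mask]
    y = y[mask]
    print(f"Data shape after dropping missing targets: {df.shape}")
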
# One-hot encode the categorical feature (debh)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_cat = encoder.fit_transform(df[cat_features])
X_num = df[num_features].values

# Final check that no NaN values remain
print("Final data check:")
print(f"X_cat contains NaN: {np.isnan(X_cat).any()}")
print(f"X_num contains NaN: {np.isnan(X_num).any()}")
print(f"y contains NaN: {np.isnan(y).any()}")

X = np.hstack([X_cat, X_num])

# Feature scaling (fit on all rows here; XGBoost itself does not require scaled inputs)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Inspect the target distribution after data processing and before training
plt.figure(figsize=(12, 4))

# Raw distribution
plt.subplot(121)
plt.hist(df['jxtbhj'], bins=50)
plt.title('Raw jxtbhj distribution', fontproperties=font)

# Distribution after a log transform
plt.subplot(122)
plt.hist(np.log1p(df['jxtbhj']), bins=50)
plt.title('log1p(jxtbhj) distribution', fontproperties=font)

plt.tight_layout()
plt.show()

# Print some basic statistics
print("Raw jxtbhj statistics:")
print(df['jxtbhj'].describe())

def analyze_predictions(y_true, y_pred, name):
    """Report absolute percentage errors (assumes y_true contains no zeros)."""
    error_percent = np.abs((y_pred - y_true) / y_true) * 100

    print(f"\n{name} prediction analysis:")
    print(f"Mean percentage error: {error_percent.mean():.2f}%")
    print(f"Median percentage error: {np.median(error_percent):.2f}%")
    print(f"90% of predictions have an error within {np.percentile(error_percent, 90):.2f}%")

    # Share of predictions within each error threshold
    for threshold in [10, 20, 30, 50]:
        accuracy = (error_percent <= threshold).mean() * 100
        print(f"Predictions with an error within {threshold}%: {accuracy:.2f}%")


# 5. Build the models
models = {
    # 'Ridge': Ridge(
    #     alpha=0.5,  # moderate regularization
    #     fit_intercept=True
    # ),
    # 'RandomForest': RandomForestRegressor(
    #     n_estimators=300,  # more trees
    #     max_depth=8,       # limit overfitting
    #     min_samples_leaf=5,
    #     max_features='sqrt',
    #     random_state=42
    # ),
    'XGBoost': xgb.XGBRegressor(
        n_estimators=1500,     # 900
        max_depth=8,
        learning_rate=0.1,     # lower learning rate: 0.09
        subsample=0.8,         # row subsampling
        colsample_bytree=0.8,  # feature subsampling
        reg_alpha=0.1,         # L1 regularization
        reg_lambda=1.0,        # L2 regularization
        random_state=42,
        verbosity=0
    )
}
results = {}

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    cv_scores = cross_val_score(model, X_scaled, y, cv=kf, scoring='r2', error_score='raise')
    print(f"\nModel: {name}")
    print(f"Cross-validated R2: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

    # Train on the training set and evaluate on the held-out test set
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results[name] = {'model': model, 'mse': mse, 'r2': r2, 'y_pred': y_pred}
    print(f"Test MSE: {mse:.2f}")
    print(f"Test R2: {r2:.4f}")

    analyze_predictions(y_test, y_pred, name)

# 6. Pick the best model for the interpretability analysis
best_model_name = max(results, key=lambda k: results[k]['r2'])
best_model = results[best_model_name]['model']
y_pred = results[best_model_name]['y_pred']
print(f"Best model: {best_model_name}")

# Save the XGBoost model
if best_model_name == 'XGBoost':
    best_model.save_model('xgb_model.json')
    print('XGBoost model saved to xgb_model.json')
    # Save the encoder and scaler as well
    joblib.dump(encoder, 'encoder.joblib')
    joblib.dump(scaler, 'scaler.joblib')
    print('Encoder and scaler saved')

# 7. Feature importance analysis
feature_names = list(encoder.get_feature_names_out(cat_features)) + num_features
if hasattr(best_model, 'coef_'):
    importances = best_model.coef_
elif hasattr(best_model, 'feature_importances_'):
    importances = best_model.feature_importances_
else:
    importances = np.zeros(len(feature_names))

plt.figure(figsize=(12, 6))
sns.barplot(x=feature_names, y=importances)
plt.title(f'{best_model_name} feature importance', fontproperties=font)
plt.xlabel('Feature', fontproperties=font)
plt.ylabel('Importance', fontproperties=font)
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

# 8. Predicted vs. actual values
plt.figure(figsize=(8, 8))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.title(f'{best_model_name} actual vs. predicted', fontproperties=font)
plt.xlabel('Actual', fontproperties=font)
plt.ylabel('Predicted', fontproperties=font)
plt.tight_layout()
plt.show()

# 9. Residual analysis
residuals = y_test - y_pred
plt.figure(figsize=(8, 5))
sns.histplot(residuals, bins=30, kde=True)
plt.title(f'{best_model_name} residual distribution', fontproperties=font)
plt.xlabel('Residual', fontproperties=font)
plt.tight_layout()
plt.show()

plt.figure(figsize=(8, 5))
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, color='red', linestyle='--')
plt.title(f'{best_model_name} predicted values vs. residuals', fontproperties=font)
plt.xlabel('Predicted', fontproperties=font)
plt.ylabel('Residual', fontproperties=font)
plt.tight_layout()
plt.show()
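One thing that may be worth checking in a script like this: every row of a project (gcbh) carries the same target jxtbhj and the same project-level aggregates, so a purely random row split can put rows of one project into both the training and the test set, and the test scores may then look better than they would on unseen projects. The sketch below shows one way to split by project instead. It is not part of the original script; the function name and the toy project IDs are assumptions, and scikit-learn's GroupShuffleSplit and GroupKFold offer the same idea for real use.

```python
import random


def group_train_test_split(groups, test_size=0.2, seed=42):
    """Split row indices so that all rows of a group land on the same side."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    n_test = max(1, round(len(unique_groups) * test_size))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Hypothetical project IDs, one per row of the flattened table
gcbh = ["A", "A", "A", "B", "B", "C", "D", "D", "E"]
train_idx, test_idx = group_train_test_split(gcbh)
print("train rows:", train_idx)
print("test rows:", test_idx)
```

The returned index lists can be used to select the matching rows of the feature matrix and the target, for example X[train_idx] and y[train_idx] with NumPy arrays.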