kaggle比賽必備算法XGBoost入門及實戰

xgboost一直在kaggle競賽江湖里被傳為神器，它在對結構化數據的應用占據主導地位，是目前開源的最快最好的工具包，與常見的工具包算法相比速度提高了10倍以上！

XGBoost is an implementation of gradient boosted decision trees (GBDT) designed for speed and performance.

xgboost 是對梯度增強決策樹(GBDT)的一種實現，具有更高的性能以及速度，既可以用于分類也可以用于回歸問題中。本文將盡量用通俗易懂的方式進行介紹并給出實戰案例。

1.什么是XGBoost?

xgboost 是 eXtreme?Gradient?Boosting的簡寫。

The name xgboost, though, actually refers to the engineering goal to push the limit of computations resources for boosted tree algorithms. Which is the reason why many people use xgboost.

以上是xgboost作者陳天奇在Quora（類似國內的知乎）上對名字的解釋。

xgboost是一個軟件包，可以通過各種API接口訪問使用：

Command Line Interface (CLI).命令行
C++
Java
R
python 和 scikit-learn
Julia
Scala

xgboost 的底層實現在我看來還是比較復雜的，想深入理解可以查看陳博士的Paper (XGBoost: A Scalable Tree Boosting System)?，它針對傳統GBDT算法做了很多細節改進，包括損失函數、正則化、稀疏感知算法、并行化算法設計等等。

Szilard Pafka 數據科學家做了詳細的性能對比，并發表的文章進行解釋說明（githup：https://github.com/szilard/benchm-ml）?以下是 XGBoost 與其它 gradient boosting 和 bagged decision trees 實現的效果比較，可以看出它比 R, Python，Spark，H2O 中的基準配置要更快。

2.XGBoost的特性

說到Xgboost，不得不先從GBDT(Gradient Boosting Decision Tree)說起。因為xgboost本質上還是一個GBDT，但是力爭把速度和效率發揮到極致，所以叫X (Extreme) GBoosted，兩者都是boosting方法。

Gradient Boosting 是 boosting 的其中一種方法，所謂 Boosting ，就是將弱分離器 f_i(x) 組合起來形成強分類器 F(x) 的一種方法。

那什么是GBDT？？我們看一張圖：

GBDT的原理很簡單，我們反復循環構建新的模型并將它們組合成一個集成模型的，從初始Naive模型開始，我們從計算數據集中每個觀測的誤差，然后下一個樹去擬合誤差函數對預測值的殘差(殘差就是預測值與真實值之間的誤差)，最終這樣多個學習器相加在一起用來進行最終預測，準確率就會比單獨的一個要高。

之所以稱為 Gradient，是因為在添加新模型時使用了梯度下降算法來最小化的損失。

如圖所示：Y = Y1 + Y2 + Y3。

Parallelization并行處理

XGBoost工具支持并行處理。Boosting不是一種串行的結構嗎?怎么并行的？注意XGBoost的并行不是 tree 粒度的并行，XGBoost 也是一次迭代完才能進行下一次迭代的。

XGBoost的并行是在特征粒度上的，決策樹的學習最耗時的一個步驟就是對特征的值進行排序，XGBoost在訓練之前，預先對數據進行了排序，然后保存為block結構，后面的迭代中重復地使用這個結構，大大減小計算量。這個block結構也使得并行成為了可能，在進行節點的分裂時，需要計算每個特征的增益，最終選增益最大的那個特征去做分裂，那么各個特征的增益計算就可以開多線程進行。

正則化

XGBoost 在代價函數里加入了正則項，用于控制模型的復雜度。正則項里包含了樹的葉子節點個數、每個葉子節點上輸出的 score 的L2模的平方和。從Bias-variance tradeoff 角度來講，正則項降低了模型的 variance，使學習出來的模型更加簡單，防止過擬合，這也是xgboost優于傳統GBDT的一個特性。

缺失值處理

對于特征的值有缺失的樣本，xgboost可以自動學習出它的分裂方向。

損失函數

傳統GBDT在優化時只用到一階導數信息，xgboost則對代價函數進行了二階泰勒展開，同時用到了一階和二階導數。

分布式計算

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way.

3.實戰示例

官方指導文檔：
https://xgboost.readthedocs.io/en/latest/index.html

github 示例文檔：
https://github.com/dmlc/xgboost/tree/master/demo/guide-python

本文將繼續使用kaggle競賽平臺上的房價預測的數據，具體數據請移步kaggle實戰之房價預測，了解一下？

準備數據

import pandas as pd
from sklearn.model_selection import train_test_split from sklearn.preprocessing import Imputer data = pd.read_csv('../input/train.csv') data.dropna(axis=0, subset=['SalePrice'], inplace=True) y = data.SalePrice X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object']) train_X, test_X, train_y, test_y = train_test_split(X.as_matrix(), y.as_matrix(), test_size=0.25) my_imputer = Imputer() train_X = my_imputer.fit_transform(train_X) test_X = my_imputer.transform(test_X)

構建模型

#引入xgboost
from xgboost import XGBRegressor
#xgboost 有封裝好的分類器和回歸器，可以直接用XGBRegressor建立模型
my_model = XGBRegressor() # Add silent=True to avoid printing out updates with each cycle my_model.fit(train_X, train_y, verbose=False)

評估及預測

# make predictions
predictions = my_model.predict(test_X) from sklearn.metrics import mean_absolute_error print("Mean Absolute Error : " + str(mean_absolute_error(predictions, test_y)))

模型調參

xgboost模型參數有幾十種，以上我們只是使用的默認的參數進行訓練預測，參數詳解請參照：
Parameters (official guide)

XGBoost有幾個參數會顯著影響模型的準確性和訓練速度，你應該需要特別關注，并適當的進行調參：

#1. n_estimators

n_estimators指定通過建模周期的次數，可以理解成迭代次數，值太大會導致過度擬合，這是對訓練數據的準確預測，但對新數據的預測偏差很大。一般的范圍在100-1000之間。

#2.early_stopping_rounds

這個參數 early_stopping_rounds 提供了一種自動查找理想值的方法，試驗證分數停止改進時停止迭代。

我們可以為 n_estimators 設置一個較高值，然后使用 early_stopping_rounds來找到停止迭代的最佳時間。

my_model = XGBRegressor(n_estimators=1000) #評價模型在測試集上的表現，也可以輸出每一步的分數 verbose=True my_model.fit(train_X, train_y, early_stopping_rounds=5, eval_set=[(test_X, test_y)], verbose=True) #[0] validation_0-rmse:181513 #Will train until validation_0-rmse hasn't improved in 5 rounds. #[1] validation_0-rmse:164369 #[2] validation_0-rmse:149195 #[3] validation_0-rmse:135397 #[4] validation_0-rmse:123108 #[5] validation_0-rmse:111850 #[6] validation_0-rmse:102150 #[7] validation_0-rmse:93362.6 #[8] validation_0-rmse:85537.3 #[9] validation_0-rmse:78620.8 #[10] validation_0-rmse:72306.5 #[11] validation_0-rmse:67021.3 #[12] validation_0-rmse:62228.3 # XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, # colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0, # max_depth=3, min_child_weight=1, missing=None, n_estimators=1000, # n_jobs=1, nthread=None, objective='reg:linear', random_state=0, # reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, # silent=True, subsample=1)

當使用 early_stopping_rounds 參數時，您需要留出一些數據來檢查要使用的輪數，從以上實例代碼可以看出 n_estimators=1000 為停止迭代次數。

#3.learning_rate

以下是一個小技巧但很重要，可用于訓練更好的xgboost模型：

一般來說，一個較小的 learning_rate（和較大的 n_estimators ）將產生更精確的xgboost 模型，由于該模型在整個循環中進行了更多的迭代，因此它也需要更長的時間來訓練。

我們用?GridSearchCV?來進行調整 learning_rate 的值，以找到最合適的參數：

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold # 設定要調節的 learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3] my_model= XGBRegressor(n_estimators=1000) learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3] param_grid = dict(learning_rate=learning_rate) kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7) grid_search = GridSearchCV(my_model, param_grid, n_jobs=-1, cv=kfold) grid_result = grid_search.fit(train_X, train_y) print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_)) #Best: 0.285843 using {'learning_rate': 0.01}

#4.n_jobs

在運行時的較大數據集上，可以使用并行性更快地構建模型，通常將參數n_jobs設置為機器上的CPU的數量。對于較小的數據集，這個參數沒有任何作用。

可視化XGBoost樹

訓練完模型后, 一般只是看看特征重要性(feature importance score)，但是 XGBoost提供了一個方法plot_tree()，使得模型中每個決策樹是可以畫出來看看的。

import matplotlib.pyplot as plt
from xgboost import plot_tree plot_tree(my_model,num_trees=0) #設置圖形的大小 plt.rcParams['figure.figsize'] = [50, 20] plt.show()

可以得到以下這個的圖，plot_tree有些參數可以調整, 比如num_trees=0表示畫第一棵樹，rankdir='LR’表示圖片是從左到右(Left to Right)。

上圖中f1,f2是feature ID，需要進行轉換成特征屬性顯示：

#這個函數就是根據給定的特征名字(我直接使用了數據的列名稱), 按照特定格式生成一個xgb.fmap文件
def ceate_feature_map(features): outfile = open('xgb.fmap', 'w') i = 0 for feat in features: outfile.write('{0}\t{1}\tq\n'.format(i, feat)) i = i + 1 outfile.close() ceate_feature_map(X.columns) #在調用plot_tree函數的時候, 直接指定fmap文件即可 plot_tree(my_model,num_trees=0,fmap='xgb.fmap') plt.rcParams['figure.figsize'] = [50, 20] plt.show()

可視化XGBoost模型的另一種方法是顯示模型中原始數據集中每個要屬性的重要性，XGBoost有一個?plot_importance()?方法，可以展示特征的重要性。

from xgboost import plot_importance
plot_importance(my_model)
plt.rcParams['figure.figsize'] = [10, 10] plt.show()

從圖中可以看出，特征屬性 f16 - GrLivArea得分最高，可以展示特征的重要性。因此，XGBoost還為我們提供了一種進行特征選擇的方法！！

4.參考資料

1.xgboost作者講義PPT：https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
2.xgboost入門與實戰（原理篇）：https://blog.csdn.net/sb19931201/article/details/52557382
3.史上最詳細的XGBoost實戰：https://zhuanlan.zhihu.com/p/31182879
4.一文讀懂機器學習大殺器XGBOOST原理
5.A Gentle Introduction to XGBoost for Applied Machine Learning

轉載于:https://www.cnblogs.com/yumoye/p/10421153.html