Python28-11 CatBoost梯度提升算法

CatBoost（Categorical Boosting）是由Yandex(一家俄羅斯互聯網企業，旗下的搜索引擎曾在俄國內擁有逾60%的市場占有率，同時也提供其他互聯網產品和服務)開發的一種基于梯度提升的機器學習算法。CatBoost特別擅長處理類別特征，并且能夠有效地避免過擬合和數據泄露問題。CatBoost的全稱是“Categorical Boosting”，它的設計初衷是為了在處理包含大量類別特征的數據時表現得更好。

CatBoost的特點

處理類別特征：CatBoost可以直接處理類別特征而不需要進行額外的編碼（如one-hot編碼）。
避免過擬合：CatBoost采用了一種新的處理類別特征的方法，有效地減少了過擬合。
高效性：CatBoost在訓練速度和預測速度方面都表現優異。
支持CPU和GPU訓練：CatBoost既可以在CPU上運行，也可以利用GPU進行加速訓練。
自動處理缺失值：CatBoost可以自動處理缺失值，無需額外的預處理步驟。

CatBoost的核心原理

CatBoost的核心原理基于梯度提升決策樹（GBDT），但在處理類別特征和避免過擬合方面進行了創新。以下是一些關鍵的技術點：

類別特征處理：
- CatBoost引入了一個稱為“均值編碼”的方法，基于類別的均值計算新特征。
- 使用一種稱為“目標編碼”的技術，將類別特征轉化為數值特征時，通過使用目標值的平均值來減少數據泄露的風險。
- 在訓練過程中，通過使用統計信息對數據進行處理，避免直接使用目標變量進行編碼。
有序提升（Ordered Boosting）：
- 為了防止數據泄露和過擬合，CatBoost在訓練時對數據進行了有序處理。
- 有序提升通過在訓練過程中隨機打亂數據，并確保模型在某一時刻只看到過去的數據，而不會使用未來的信息進行決策。
計算優化：
- CatBoost通過預計算和緩存的方式加速了特征的計算過程。
- 支持CPU和GPU訓練，能夠在大規模數據集上表現出色。

CatBoost的基本使用

以下是一個使用CatBoost進行分類任務的基本示例，我們使用Auto MPG（Miles Per Gallon）數據集，它是一個經典的回歸問題數據集，常用于機器學習和統計分析。該數據集記錄了不同型號汽車的燃油效率（即每加侖燃油行駛的英里數）以及其他多個相關特征。

數據集特征:

mpg: 每加侖燃油行駛的英里數（目標變量）。
cylinders: 氣缸數量，表示發動機的氣缸數。
displacement: 發動機排量（立方英寸）。
horsepower: 發動機功率（馬力）。
weight: 車輛重量（磅）。
acceleration: 0到60英里每小時的加速度時間（秒）。
model_year: 車輛生產年份。
origin: 車輛產地（1=美國，2=歐洲，3=日本）。

數據集前幾行：

????mpg??cylinders??displacement??horsepower??weight??acceleration??model_year??origin
0??18.0??????????8?????????307.0???????130.0??3504.0??????????12.0??????????70???????1
1??15.0??????????8?????????350.0???????165.0??3693.0??????????11.5??????????70???????1
2??18.0??????????8?????????318.0???????150.0??3436.0??????????11.0??????????70???????1
3??16.0??????????8?????????304.0???????150.0??3433.0??????????12.0??????????70???????1
4??17.0??????????8?????????302.0???????140.0??3449.0??????????10.5??????????70???????1

代碼示例：

import?pandas?as?pd??#?導入Pandas庫，用于數據處理
import?numpy?as?np??#?導入Numpy庫，用于數值計算
from?sklearn.model_selection?import?train_test_split??#?從sklearn庫導入train_test_split，用于劃分數據集
from?sklearn.metrics?import?mean_squared_error,?mean_absolute_error??#?導入均方誤差和平均絕對誤差，用于評估模型性能
from?catboost?import?CatBoostRegressor??#?導入CatBoost庫中的CatBoostRegressor，用于回歸任務
import?matplotlib.pyplot?as?plt??#?導入Matplotlib庫，用于繪圖
import?seaborn?as?sns??#?導入Seaborn庫，用于繪制統計圖#?設置隨機種子以便結果復現
np.random.seed(42)#?從UCI機器學習庫加載Auto?MPG數據集
url?=?"http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
column_names?=?['mpg',?'cylinders',?'displacement',?'horsepower',?'weight',?'acceleration',?'model_year',?'origin']
data?=?pd.read_csv(url,?names=column_names,?na_values='?',?comment='\t',?sep='?',?skipinitialspace=True)#?查看數據集的前幾行
print(data.head())#?處理缺失值
data?=?data.dropna()#?特征和目標變量
X?=?data.drop('mpg',?axis=1)??#?特征變量
y?=?data['mpg']??#?目標變量#?將類別特征轉換為字符串類型（CatBoost可以直接處理類別特征）
X['cylinders']?=?X['cylinders'].astype(str)
X['model_year']?=?X['model_year'].astype(str)
X['origin']?=?X['origin'].astype(str)#?將數據集劃分為訓練集和測試集
X_train,?X_test,?y_train,?y_test?=?train_test_split(X,?y,?test_size=0.2,?random_state=42)#?定義CatBoost回歸器
model?=?CatBoostRegressor(iterations=1000,??#?迭代次數learning_rate=0.1,??#?學習率depth=6,??#?決策樹深度loss_function='RMSE',??#?損失函數verbose=100??#?輸出訓練過程信息
)#?訓練模型
model.fit(X_train,?y_train,?eval_set=(X_test,?y_test),?early_stopping_rounds=50)#?進行預測
y_pred?=?model.predict(X_test)#?評估模型性能
mse?=?mean_squared_error(y_test,?y_pred)??#?計算均方誤差
mae?=?mean_absolute_error(y_test,?y_pred)??#?計算平均絕對誤差#?打印模型的評估結果
print(f'Mean?Squared?Error?(MSE):?{mse:.4f}')
print(f'Mean?Absolute?Error?(MAE):?{mae:.4f}')#?繪制真實值與預測值的對比圖
plt.figure(figsize=(10,?6))
plt.scatter(y_test,?y_pred,?alpha=0.5)??#?繪制散點圖
plt.plot([y_test.min(),?y_test.max()],?[y_test.min(),?y_test.max()],?'--k')??#?繪制對角線
plt.xlabel('True?Values')??#?X軸標簽
plt.ylabel('Predictions')??#?Y軸標簽
plt.title('True?Values?vs?Predictions')??#?圖標題
plt.show()#?特征重要性可視化
feature_importances?=?model.get_feature_importance()??#?獲取特征重要性
feature_names?=?X.columns??#?獲取特征名稱plt.figure(figsize=(10,?6))
sns.barplot(x=feature_importances,?y=feature_names)??#?繪制特征重要性條形圖
plt.title('Feature?Importances')??#?圖標題
plt.show()#?輸出
'''
mpg??cylinders??displacement??horsepower??weight??acceleration??\
0??18.0??????????8?????????307.0???????130.0??3504.0??????????12.0???
1??15.0??????????8?????????350.0???????165.0??3693.0??????????11.5???
2??18.0??????????8?????????318.0???????150.0??3436.0??????????11.0???
3??16.0??????????8?????????304.0???????150.0??3433.0??????????12.0???
4??17.0??????????8?????????302.0???????140.0??3449.0??????????10.5???model_year??origin??
0??????????70???????1??
1??????????70???????1??
2??????????70???????1??
3??????????70???????1??
4??????????70???????1??
0:?learn:?7.3598113?test:?6.6405869?best:?6.6405869?(0)?total:?1.7ms?remaining:?1.69s
100:?learn:?1.5990203?test:?2.3207830?best:?2.3207666?(94)?total:?132ms?remaining:?1.17s
200:?learn:?1.0613606?test:?2.2319632?best:?2.2284239?(183)?total:?272ms?remaining:?1.08s
Stopped?by?overfitting?detector??(50?iterations?wait)bestTest?=?2.21453232
bestIteration?=?238Shrink?model?to?first?239?iterations.
Mean?Squared?Error?(MSE):?4.9042
Mean?Absolute?Error?(MAE):?1.6381<Figure?size?1000x600?with?1?Axes>
<Figure?size?1000x600?with?1?Axes>
'''

Mean Squared Error (MSE): 均方誤差，表示預測值與實際值之間的平均平方差異。值越小，模型性能越好，在這里MSE的值是4.9042。

Mean Absolute Error (MAE): 平均絕對誤差，表示預測值與實際值之間的平均絕對差異。值越小，模型性能越好，在這里MAE的值是1.6381。