Custom Loss Functions and Evaluation Metrics for LightGBM, XGBoost, and CatBoost

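- The Scaled Error Function
- Mathematical Principles
  - Loss Function Definition
  - Gradient Computation
  - Evaluation Metric
- LightGBM Implementation
  - Custom Loss Function
  - Custom Evaluation Metric
  - Usage
- XGBoost Implementation
  - Custom Loss Function
  - Custom Evaluation Metric
  - Usage
- CatBoost Implementation
  - Custom Loss Function
  - Custom Evaluation Metric
  - Usage
- Framework Comparison
- Practical Applications
  - Suitable Scenarios
- Frequently Asked Questions
  - 1. Why set a minimum threshold?
  - 2. What if the gradient or Hessian is computed incorrectly?
  - 3. Performance differences between frameworks
  - 4. Hyperparameter tuning suggestions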

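The Scaled Error Function

Mean squared error (MSE) and mean absolute error (MAE) penalize an error of a given absolute size equally, no matter how large the target is. In some settings the relative error matters more than the absolute error. The scaled error captures this by dividing the error by the true value:

scaled error = (true value - predicted value) / max(true value, threshold)

This design has several advantages:

- Predictions for large and small target values receive roughly equal weight
- Large target values no longer dominate the loss
- It suits problems where the target range varies widely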
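A small numeric illustration of the point above, in plain Python. The two samples and the threshold of 1.0 are hypothetical values chosen only to make the arithmetic easy to follow; both predictions are 10% too low.

```python
# Two hypothetical samples, each predicted 10% too low.
samples = [
    (10.0, 9.0),      # (true value, predicted value), small target
    (1000.0, 900.0),  # large target
]
threshold = 1.0  # assumed minimum denominator, for illustration only


def scaled_error(y_true, y_pred, threshold):
    """Error divided by the true value, with the denominator clamped at threshold."""
    return (y_true - y_pred) / max(y_true, threshold)


squared_errors = [(y - p) ** 2 for y, p in samples]
scaled_squared_errors = [scaled_error(y, p, threshold) ** 2 for y, p in samples]

print(squared_errors)         # the large target contributes 10,000 times more
print(scaled_squared_errors)  # both samples contribute about 0.01
```

With plain squared error the large target dominates the total, while the scaled version treats the two equally bad relative misses the same way.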

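Mathematical Principles

Loss Function Definition

Define the loss as:

$$L(y, \hat{y}) = \left( \frac{y - \hat{y}}{\max(y, \text{threshold})} \right)^2$$

where:

- $y$ is the true value
- $\hat{y}$ is the predicted value
- threshold is a minimum value for the denominator that prevents division by zero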

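Gradient Computation

Gradient boosting needs the first derivative (gradient) and the second derivative (Hessian) of the loss with respect to the prediction.

Let $d = \max(y, \text{threshold})$ and $e = (y - \hat{y}) / d$, so that $L = e^2$. Then:

- First derivative (gradient): $\partial L / \partial \hat{y} = -2e / d$
- Second derivative (Hessian): $\partial^2 L / \partial \hat{y}^2 = 2 / d^2$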
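Both results follow from the chain rule. A short derivation, treating d as a constant because it depends only on the true value and the threshold, not on the prediction:

```latex
\begin{aligned}
L &= e^2, \qquad e = \frac{y - \hat{y}}{d}, \qquad \frac{\partial e}{\partial \hat{y}} = -\frac{1}{d} \\
\frac{\partial L}{\partial \hat{y}} &= 2e \cdot \frac{\partial e}{\partial \hat{y}} = -\frac{2e}{d} \\
\frac{\partial^2 L}{\partial \hat{y}^2} &= -\frac{2}{d} \cdot \frac{\partial e}{\partial \hat{y}} = \frac{2}{d^2}
\end{aligned}
```

The Hessian is strictly positive and does not depend on the prediction, so the loss is convex in the prediction for every sample.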

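Evaluation Metric

The companion evaluation metric is the scaled mean absolute error (Scaled MAE):

$$\text{Scaled MAE} = \operatorname{mean}\left( \frac{|y - \hat{y}|}{\max(y, \text{threshold})} \right)$$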
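As a worked example of the metric, here is a minimal plain-Python sketch. The function name `scaled_mae`, the sample values, and the threshold of 1.0 are assumptions made for illustration.

```python
def scaled_mae(y_true, y_pred, threshold):
    """Mean of |y - y_hat| / max(y, threshold) over all samples."""
    ratios = [abs(y - p) / max(y, threshold) for y, p in zip(y_true, y_pred)]
    return sum(ratios) / len(ratios)


# Hypothetical values; the first target is 0, so its denominator is the threshold.
y_true = [0.0, 50.0, 200.0]
y_pred = [0.5, 45.0, 220.0]

value = scaled_mae(y_true, y_pred, threshold=1.0)
print(value)  # per-sample ratios are 0.5, 0.1 and 0.1, so the mean is about 0.233
```

The zero target would make the ratio undefined without the threshold; with it, that sample simply falls back to an absolute error measured in units of the threshold.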

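LightGBM Implementation

Custom Loss Function

import numpy as np

def custom_loss_squared_lgb(y_pred, train_data):
    """Custom scaled squared-error objective for LightGBM.

    Relies on a module-level `threshold` (the minimum denominator),
    which must be defined before training.

    Args:
        y_pred: array of predictions
        train_data: LightGBM Dataset object

    Returns:
        tuple: (gradient array, Hessian array)
    """
    y_true = train_data.get_label()  # true labels
    # Denominator, clamped to avoid division by zero
    denominator = np.maximum(y_true, threshold)
    # Scaled error
    error = (y_true - y_pred) / denominator
    # Gradient and Hessian
    grad = -2 * error / denominator
    hess = 2 / (denominator ** 2)
    return grad, hess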
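To inspect what an objective of this shape hands back to the booster, it can be called directly on a stand-in object, with no training involved. In the sketch below, `FakeDataset` is a hypothetical stub that exposes only `get_label()`, `scaled_squared_objective` repeats the same formulas so the block runs on its own, and the threshold of 1.0 is an assumed value.

```python
import numpy as np


class FakeDataset:
    """Hypothetical stand-in that exposes only get_label()."""

    def __init__(self, label):
        self._label = np.asarray(label, dtype=float)

    def get_label(self):
        return self._label


threshold = 1.0  # assumed value, for illustration only


def scaled_squared_objective(y_pred, train_data):
    """Scaled squared-error objective: returns (grad, hess) arrays."""
    y_true = train_data.get_label()
    denominator = np.maximum(y_true, threshold)
    error = (y_true - y_pred) / denominator
    return -2 * error / denominator, 2 / denominator ** 2


data = FakeDataset([10.0, 10.0, 0.0])
preds = np.array([8.0, 12.0, 0.0])
grad, hess = scaled_squared_objective(preds, data)

print(grad)  # negative where the model under-predicts, positive where it over-predicts
print(hess)  # depends only on the target, never on the prediction
```

Both arrays have one entry per sample, which is the shape a single-output regression objective is expected to return.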

Custom Evaluation Metric

def mae_metric_lgb(preds, train_data):
    """Custom scaled MAE evaluation metric for LightGBM.

    Args:
        preds: array of predictions
        train_data: LightGBM Dataset object

    Returns:
        tuple: (metric name, metric value, is_higher_better)
    """
    y_true = train_data.get_label()
    denominator = np.maximum(y_true, threshold)
    error = np.abs(preds - y_true) / denominator
    return 'scaled_mae', np.mean(error), False

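Usage

import lightgbm as lgb
import numpy as np

# Parameters
params = {
    # Custom loss; passing a callable here requires LightGBM >= 4.0
    # (earlier versions took it through the `fobj` argument of lgb.train)
    'objective': custom_loss_squared_lgb,
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.01,
    'verbosity': -1,
}

# Train the model (train_set and valid_set are lgb.Dataset objects)
model = lgb.train(
    params,
    train_set,
    valid_sets=[train_set, valid_set],
    feval=mae_metric_lgb,  # custom evaluation metric
    num_boost_round=1000,
    callbacks=[lgb.early_stopping(100)],
)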

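XGBoost Implementation

Custom Loss Function

def custom_loss_squared_xgb(y_pred, train_data):
    """Custom scaled squared-error objective for XGBoost.

    Relies on a module-level `threshold` (the minimum denominator),
    which must be defined before training.

    Args:
        y_pred: array of predictions
        train_data: XGBoost DMatrix object

    Returns:
        tuple: (gradient array, Hessian array)
    """
    y_true = train_data.get_label()  # true labels
    # Denominator, clamped to avoid division by zero
    denominator = np.maximum(y_true, threshold)
    # Scaled error
    error = (y_true - y_pred) / denominator
    # Gradient and Hessian
    grad = -2 * error / denominator
    hess = 2 / (denominator ** 2)
    return grad, hess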

Custom Evaluation Metric

def mae_metric_xgb(y_pred, train_data):
    """Custom scaled MAE evaluation metric for XGBoost.

    Args:
        y_pred: array of predictions
        train_data: XGBoost DMatrix object

    Returns:
        tuple: (metric name, metric value)
    """
    y_true = train_data.get_label()
    denominator = np.maximum(y_true, threshold)
    error = np.abs(y_true - y_pred) / denominator
    return 'custom_mae', np.mean(error)

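Usage

import xgboost as xgb
import numpy as np

# Parameters
params = {
    'booster': 'gbtree',
    'learning_rate': 0.01,
    'max_depth': 6,
    'random_state': 42,
}

# Train the model (train_matrix and valid_matrix are xgb.DMatrix objects)
model = xgb.train(
    params,
    train_matrix,
    num_boost_round=1000,
    evals=[(train_matrix, 'train'), (valid_matrix, 'valid')],
    obj=custom_loss_squared_xgb,   # custom loss
    # Custom metric; `custom_metric` replaces `feval`, which is deprecated
    # since XGBoost 1.6 (use `feval` on older versions)
    custom_metric=mae_metric_xgb,
    early_stopping_rounds=100,
    verbose_eval=50,
)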

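CatBoost Implementation

In CatBoost, custom functions must be implemented as classes.

Custom Loss Function

class CustomCatBoostObjective(object):
    """Custom scaled squared-error objective for CatBoost."""

    def calc_ders_range(self, approxes, targets, weights):
        """Compute the first and second derivatives for each sample.

        Args:
            approxes: current predictions
            targets: true labels
            weights: sample weights (optional)

        Returns:
            list: [(der1, der2), ...]
        """
        assert len(approxes) == len(targets)
        if weights is not None:
            assert len(weights) == len(approxes)

        result = []
        for index in range(len(targets)):
            y_true = targets[index]
            y_pred = approxes[index]
            # Denominator, clamped to avoid division by zero
            denominator = max(y_true, threshold)
            # Scaled error
            error = (y_true - y_pred) / denominator
            # CatBoost's documented custom objectives (e.g. its RMSE example,
            # with der1 = target - approx and der2 = -1) return derivatives of
            # the function being maximized, i.e. of -L, so both signs are
            # flipped relative to the LightGBM and XGBoost versions.
            der1 = 2 * error / denominator
            der2 = -2 / (denominator ** 2)
            # Apply sample weights
            if weights is not None:
                der1 *= weights[index]
                der2 *= weights[index]
            result.append((der1, der2))
        return result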
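One thing to check when porting the formulas to CatBoost is the sign convention. CatBoost's documented custom-objective examples (for instance its RMSE example, where the first derivative is `target - approx` and the second is `-1`) return derivatives of a function that is maximized, which is the negative of the loss. A quick sanity check under that convention: because the loss is quadratic in the prediction, a single Newton step `approx - der1 / der2` should land on the target. The helper name `scaled_ders` and the threshold of 1.0 are assumptions for this sketch.

```python
threshold = 1.0  # assumed value, for illustration only


def scaled_ders(y_true, y_pred):
    """Derivatives of -L (the quantity to maximize) for a single sample."""
    d = max(y_true, threshold)
    e = (y_true - y_pred) / d
    der1 = 2 * e / d    # first derivative of -L
    der2 = -2 / d ** 2  # second derivative of -L, always negative
    return der1, der2


y_true, y_pred = 40.0, 25.0
der1, der2 = scaled_ders(y_true, y_pred)

# One Newton step on a quadratic reaches its optimum.
newton_prediction = y_pred - der1 / der2
print(newton_prediction)  # approximately 40.0, the target
```

If both signs were those of the loss itself (positive second derivative, first derivative negative when under-predicting), the same formula would still give the target, so the check that matters is that `der2` is negative and `der1` points from the prediction toward the target.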

Custom Evaluation Metric

class CustomCatBoostEval(object):
    """Custom scaled MAE evaluation metric for CatBoost."""

    def is_max_optimal(self):
        """Whether a larger metric value is better."""
        return False

    def evaluate(self, approxes, targets, weights):
        """Accumulate the metric.

        Args:
            approxes: list of prediction lists, [[pred1, pred2, ...]]
            targets: true labels
            weights: sample weights (optional)

        Returns:
            tuple: (error sum, weight sum)
        """
        assert len(approxes) == 1
        assert len(targets) == len(approxes[0])

        error_sum = 0.0
        weight_sum = 0.0
        for i in range(len(targets)):
            y_true = targets[i]
            y_pred = approxes[0][i]
            # Scaled absolute error
            denominator = max(y_true, threshold)
            error = abs(y_true - y_pred) / denominator
            # Apply sample weights
            if weights is not None:
                error *= weights[i]
                weight_sum += weights[i]
            else:
                weight_sum += 1.0
            error_sum += error
        return error_sum, weight_sum

    def get_final_error(self, error, weight):
        """Compute the final metric value."""
        return error / (weight + 1e-38)
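The metric is computed in two steps: `evaluate` returns a weighted error sum and a weight sum, and `get_final_error` divides one by the other. The same arithmetic can be tried outside CatBoost with plain lists. In this sketch the function names `accumulate` and `finalize`, the data, the weights, and the threshold of 1.0 are all hypothetical.

```python
threshold = 1.0  # assumed value, for illustration only


def accumulate(y_true, y_pred, weights=None):
    """Return (weighted error sum, weight sum), mirroring evaluate()."""
    error_sum, weight_sum = 0.0, 0.0
    for i, (y, p) in enumerate(zip(y_true, y_pred)):
        w = 1.0 if weights is None else weights[i]
        error_sum += w * abs(y - p) / max(y, threshold)
        weight_sum += w
    return error_sum, weight_sum


def finalize(error_sum, weight_sum):
    """Divide the sums, mirroring get_final_error()."""
    return error_sum / (weight_sum + 1e-38)


y_true = [100.0, 10.0]
y_pred = [90.0, 5.0]  # relative errors of 0.1 and 0.5

unweighted = finalize(*accumulate(y_true, y_pred))
weighted = finalize(*accumulate(y_true, y_pred, weights=[3.0, 1.0]))
print(unweighted, weighted)  # about 0.3 and 0.2
```

Giving the first sample three times the weight pulls the metric toward its smaller relative error.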

Usage

from catboost import CatBoostRegressor, Pool
import numpy as np

# Create data pools
train_pool = Pool(X_train, y_train)
valid_pool = Pool(X_valid, y_valid)

# Parameters
params = {
    'objective': CustomCatBoostObjective(),
    'eval_metric': CustomCatBoostEval(),
    'iterations': 1000,
    'learning_rate': 0.01,
    'depth': 6,
    'random_state': 42,
    'verbose': False,
}

# Train the model
model = CatBoostRegressor(**params)
model.fit(
    train_pool,
    eval_set=valid_pool,
    early_stopping_rounds=100,
    verbose_eval=50,
    use_best_model=True,
)

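Framework Comparison

| Feature | LightGBM | XGBoost | CatBoost |
|---|---|---|---|
| Loss function form | Function | Function | Class method |
| Parameter name | `objective` | `obj` | `objective` |
| Access to labels | `train_data.get_label()` | `dtrain.get_label()` | Passed directly as `targets` |
| Evaluation metric form | Function | Function | Class methods |
| Metric return format | `(name, value, is_higher_better)` | `(name, value)` | `(error_sum, weight_sum)` |
| Sample weights | Not passed in; read with `train_data.get_weight()` and apply manually | Not passed in; read with `dtrain.get_weight()` and apply manually | Passed as `weights`; apply manually |
| Implementation complexity | Simple | Simple | Moderate |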

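Practical Applications

Suitable Scenarios

- Renewable power forecasting: wind and solar output ranges from 0 to full capacity
- Financial risk assessment: risk estimates for companies of very different sizes
- Sales forecasting: sales volumes across different product categories
- Network traffic forecasting: traffic varies widely across time periods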

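Frequently Asked Questions

1. Why set a minimum threshold?

Question: What goes wrong if the true value is used directly as the denominator?

Answer:

- When the true value is 0 or close to 0, the division fails or the gradient explodes
- A minimum threshold keeps the computation numerically stable
- The threshold should be chosen from the actual distribution of the data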
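The effect is easy to see on the Hessian, 2 / d², for a target close to zero. The sketch below uses plain Python; the target of 1e-6 and the threshold of 1.0 are hypothetical values.

```python
def hessian(y_true, threshold):
    """Second derivative 2 / d^2 with d = max(y_true, threshold)."""
    d = max(y_true, threshold)
    return 2.0 / d ** 2


tiny_target = 1e-6
no_clamp = hessian(tiny_target, threshold=0.0)    # denominator is the raw target
with_clamp = hessian(tiny_target, threshold=1.0)  # assumed threshold of 1.0

print(no_clamp)    # on the order of 1e12
print(with_clamp)  # 2.0

# A target of exactly zero with no threshold cannot be evaluated at all.
# (Plain Python raises here; NumPy would instead produce inf with a warning.)
try:
    hessian(0.0, threshold=0.0)
    zero_target_failed = False
except ZeroDivisionError:
    zero_target_failed = True
print(zero_target_failed)
```

A single near-zero target with an unclamped denominator would outweigh every other sample by many orders of magnitude, which is why the threshold should sit near the smallest target value that is meaningful for the data.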

2. What if the gradient or Hessian is computed incorrectly?

Question: How can the gradient computation be verified?

Answer: Compare it against numerical differentiation:

def verify_gradients(y_true, y_pred, eps=1e-6):
    """Check the analytical gradient against a central-difference estimate."""
    # Analytical gradient
    denominator = np.maximum(y_true, threshold)
    error = (y_true - y_pred) / denominator
    grad_analytical = -2 * error / denominator
    # Numerical gradient
    loss_plus = ((y_true - (y_pred + eps)) / denominator) ** 2
    loss_minus = ((y_true - (y_pred - eps)) / denominator) ** 2
    grad_numerical = (loss_plus - loss_minus) / (2 * eps)
    # Compare
    diff = np.abs(grad_analytical - grad_numerical)
    print(f"Max gradient difference: {np.max(diff)}")
    return np.allclose(grad_analytical, grad_numerical, atol=1e-5)
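The check above covers the gradient. The Hessian can be verified in the same spirit with a second-order central difference. This is a minimal plain-Python sketch; the function names, the test cases, and the threshold of 1.0 are assumptions, and the step size is larger than for the gradient because the difference is divided by eps squared, which amplifies rounding error.

```python
def loss(y_true, y_pred, threshold):
    """Scaled squared error for a single sample."""
    d = max(y_true, threshold)
    return ((y_true - y_pred) / d) ** 2


def analytical_hessian(y_true, threshold):
    """Closed-form second derivative, 2 / d^2."""
    return 2.0 / max(y_true, threshold) ** 2


def numerical_hessian(y_true, y_pred, threshold, eps=1e-3):
    """Second-order central difference: (L(p+eps) - 2L(p) + L(p-eps)) / eps^2."""
    return (
        loss(y_true, y_pred + eps, threshold)
        - 2.0 * loss(y_true, y_pred, threshold)
        + loss(y_true, y_pred - eps, threshold)
    ) / eps ** 2


threshold = 1.0  # assumed value, for illustration only
cases = [(5.0, 4.0), (0.0, 0.3), (120.0, 150.0)]  # hypothetical (true, predicted) pairs

max_diff = max(
    abs(analytical_hessian(y, threshold) - numerical_hessian(y, p, threshold))
    for y, p in cases
)
print(f"Max Hessian difference: {max_diff}")
```

Because the loss is exactly quadratic in the prediction, the finite-difference estimate has no truncation error and any remaining difference is floating-point noise.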

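3. Performance differences between frameworks

Question: How do the three frameworks perform when a custom loss function is used?

Answer:

- LightGBM: usually the fastest, with efficient memory use
- XGBoost: stable, with thorough documentation
- CatBoost: handles categorical features well, but custom functions take more work to implement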

4. Hyperparameter tuning suggestions

# LightGBM tuning example
import lightgbm as lgb
import numpy as np
from optuna import create_study

def objective(trial):
    params = {
        'objective': custom_loss_squared_lgb,
        'boosting_type': 'gbdt',
        'num_leaves': trial.suggest_int('num_leaves', 10, 100),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
        'bagging_freq': 1,  # bagging_fraction only takes effect when bagging_freq > 0
        'verbosity': -1,
    }
    model = lgb.train(
        params,
        train_data,
        valid_sets=[valid_data],
        feval=mae_metric_lgb,
        num_boost_round=1000,
        # lgb.train no longer accepts verbose_eval in LightGBM 4.0;
        # silence the early-stopping callback instead
        callbacks=[lgb.early_stopping(100, verbose=False)],
    )
    y_pred = model.predict(X_valid)
    scaled_mae = np.mean(np.abs(y_valid - y_pred) / np.maximum(y_valid, threshold))
    return scaled_mae

study = create_study(direction='minimize')
study.optimize(objective, n_trials=100)