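Scorecard Modeling: End-to-End Workflow
Learning objectives
- Understand the scorecard modeling workflow
- Build a scorecard with the Toad library
1 Loading the data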
import pandas as pd
from sklearn.metrics import roc_auc_score,roc_curve,auc
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import numpy as np
import math
import xgboost as xgb
import toad
# Load the data
data_all = pd.read_csv("scorecard.txt")
# Columns excluded from training
ex_lis = ['uid', 'samp_type', 'bad_ind']
# Columns used for training
ft_lis = list(data_all.columns)
for i in ex_lis:
    ft_lis.remove(i)
# Development, validation, and out-of-time samples
dev = data_all[data_all['samp_type'] == 'dev']
val = data_all[data_all['samp_type'] == 'val']
off = data_all[data_all['samp_type'] == 'off']
Exploratory data analysis, covering numeric and string columns at the same time
toad.detector.detect(data_all)
Output:
| | type | size | missing | unique | mean_or_top1 | std_or_top2 | min_or_top3 | 1%_or_top4 | 10%_or_top5 | 50%_or_bottom5 | 75%_or_bottom4 | 90%_or_bottom3 | 99%_or_bottom2 | max_or_bottom1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bad_ind | float64 | 95806 | 0.00% | 2 | 0.0187671 | 0.135702 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| uid | object | 95806 | 0.00% | 95806 | Ab99_96002866062686144:0.00% | A7511004:0.00% | A10729014:0.00% | A8502810:0.00% | A594541:0.00% | A8899777:0.00% | A10150838:0.00% | A3044048:0.00% | A1888452:0.00% | A7659794:0.00% |
| td_score | float64 | 95806 | 0.00% | 95806 | 0.499739 | 0.288349 | 5.46966e-06 | 0.00961341 | 0.0997056 | 0.500719 | 0.747984 | 0.900024 | 0.990041 | 0.999999 |
| jxl_score | float64 | 95806 | 0.00% | 95806 | 0.499338 | 0.28885 | 1.28155e-05 | 0.00994678 | 0.0991025 | 0.499795 | 0.748646 | 0.899703 | 0.989348 | 0.999985 |
| mj_score | float64 | 95806 | 0.00% | 95806 | 0.50164 | 0.288679 | 6.92442e-06 | 0.0105076 | 0.100882 | 0.503048 | 0.752032 | 0.899308 | 0.990047 | 0.999993 |
| rh_score | float64 | 95806 | 0.00% | 95806 | 0.498407 | 0.287797 | 5.00212e-06 | 0.00991632 | 0.0999483 | 0.497466 | 0.747188 | 0.899286 | 0.989473 | 0.999986 |
| zzc_score | float64 | 95806 | 0.00% | 95806 | 0.500627 | 0.289067 | 1.15778e-05 | 0.0101856 | 0.0990114 | 0.501688 | 0.750986 | 0.899924 | 0.990043 | 0.999998 |
| zcx_score | float64 | 95806 | 0.00% | 95806 | 0.499672 | 0.289137 | 9.97767e-06 | 0.0103249 | 0.0997429 | 0.49913 | 0.750683 | 0.901942 | 0.989712 | 0.999987 |
| person_info | float64 | 95806 | 0.00% | 7 | -0.078229 | 0.156859 | -0.322581 | -0.322581 | -0.322581 | -0.0537176 | 0.078853 | 0.078853 | 0.078853 | 0.078853 |
| finance_info | float64 | 95806 | 0.00% | 35 | 0.0367625 | 0.0396866 | 0.0238095 | 0.0238095 | 0.0238095 | 0.0238095 | 0.0238095 | 0.0714286 | 0.214286 | 1.02381 |
| credit_info | float64 | 95806 | 0.00% | 100 | 0.0636262 | 0.143098 | 0 | 0 | 0 | 0 | 0.06 | 0.18 | 0.8 | 1 |
| act_info | float64 | 95806 | 0.00% | 74 | 0.236197 | 0.157132 | 0.0769231 | 0.0769231 | 0.0769231 | 0.205128 | 0.346154 | 0.487179 | 0.615385 | 1.08974 |
| samp_type | object | 95806 | 0.00% | 3 | dev:68.16% | off:16.67% | val:15.16% | None | None | None | None | dev:68.16% | off:16.67% | val:15.16% |
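2 Feature selection (missing rate, IV, correlation)
Features are screened by missing rate, IV, and correlation coefficient. Because the later modeling steps bin the variables, which lowers each variable's IV and raises the correlation between variables, the IV and correlation thresholds can be relaxed at this stage, or left unrestricted.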
dev_slct1, drop_lst = toad.selection.select(dev, dev['bad_ind'], empty=0.7, iv=0.03, corr=0.7, return_drop=True, exclude=ex_lis)
print("keep:", dev_slct1.shape[1], "drop empty:", len(drop_lst['empty']), "drop iv:", len(drop_lst['iv']), "drop corr:", len(drop_lst['corr']))
Output:
keep: 12 drop empty: 0 drop iv: 1 drop corr: 0
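The iv=0.03 threshold used in the selection call filters on information value. As a rough illustration of what that number measures, the sketch below computes WOE and IV for an already-binned feature in plain Python. The function name woe_iv, the toy data, the 1e-6 floor, and the sign convention ln(bad share / good share) are assumptions made for illustration; Toad's own calculation may differ in binning and smoothing details.

```python
import math

def woe_iv(bin_labels, targets):
    """Return per-bin WOE and the total IV of a binned feature.

    bin_labels: bin index of each sample; targets: 1 = bad, 0 = good.
    Assumes the sample contains at least one good and one bad.
    """
    total_bad = sum(targets)
    total_good = len(targets) - total_bad
    counts = {}
    for b, y in zip(bin_labels, targets):
        good, bad = counts.get(b, (0, 0))
        counts[b] = (good + (1 - y), bad + y)

    woe, iv = {}, 0.0
    for b, (good, bad) in sorted(counts.items()):
        # Floor the shares so a bin with no goods or no bads does not hit log(0).
        bad_share = max(bad / total_bad, 1e-6)
        good_share = max(good / total_good, 1e-6)
        woe[b] = math.log(bad_share / good_share)
        iv += (bad_share - good_share) * woe[b]
    return woe, iv

# Toy data: bin 0 is mostly good, bin 1 is mostly bad.
bins_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_toy = [0, 0, 0, 1, 0, 1, 1, 1]
woe_toy, iv_toy = woe_iv(bins_toy, y_toy)
print(woe_toy, round(iv_toy, 4))
```

A feature whose bins all have the same bad rate as the overall sample would get an IV of 0, which is why low-IV features are candidates for removal.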
3 Chi-square binning
# Fit the cut points
combiner = toad.transform.Combiner()
combiner.fit(dev_slct1, dev_slct1['bad_ind'], method='chi', min_samples=0.05, exclude=ex_lis)
# Export the cut points of each bin
bins = combiner.export()
print(bins)
Output:
{'td_score': [0.7989831262724624], 'jxl_score': [0.4197048501965005], 'mj_score': [0.3615303943747963], 'zzc_score': [0.4469861520889339], 'zcx_score': [0.7007847486465795], 'person_info': [-0.2610139784946237, -0.1286774193548387, -0.05371756272401434, 0.013863440860215051, 0.06266021505376344, 0.07885304659498207], 'finance_info': [0.047619047619047616], 'credit_info': [0.02, 0.04, 0.11], 'act_info': [0.1153846153846154, 0.14102564102564102, 0.16666666666666666, 0.20512820512820512, 0.2692307692307692, 0.35897435897435903, 0.3974358974358974, 0.5256410256410257]}
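Chi-square binning is commonly described as a bottom-up merge: start from many fine bins and repeatedly merge the adjacent pair whose good/bad distributions are most alike, that is, the pair with the smallest chi-square statistic. The sketch below illustrates that idea in plain Python. The helper names chi2_adjacent and chi_merge and the toy counts are hypothetical, the sketch ignores constraints such as min_samples, and it is not meant to reproduce Toad's implementation.

```python
def chi2_adjacent(bin_a, bin_b):
    """Chi-square statistic of two adjacent bins, each given as (good, bad)."""
    total = sum(bin_a) + sum(bin_b)
    col_totals = [bin_a[0] + bin_b[0], bin_a[1] + bin_b[1]]
    chi2 = 0.0
    for row in (bin_a, bin_b):
        row_total = sum(row)
        for j, observed in enumerate(row):
            expected = row_total * col_totals[j] / total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

def chi_merge(bins, max_bins):
    """Merge the most similar adjacent pair until only max_bins remain."""
    bins = [tuple(b) for b in bins]
    while len(bins) > max_bins:
        scores = [chi2_adjacent(bins[i], bins[i + 1])
                  for i in range(len(bins) - 1)]
        i = scores.index(min(scores))
        merged = (bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])
        bins[i:i + 2] = [merged]
    return bins

# Toy (good, bad) counts for four bins ordered by feature value.
toy = [(90, 10), (88, 12), (60, 40), (58, 42)]
merged_toy = chi_merge(toy, max_bins=2)
print(merged_toy)  # [(178, 22), (118, 82)]
```

In the toy data the two low-risk bins and the two high-risk bins end up merged, because their bad rates are close to each other and far from the other pair.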
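4 Bivar plots and bin adjustment
Plot each variable's Bivar chart on the development sample and the out-of-time sample. To keep this readable, only the single variable act_info is used as a demonstration here.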
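A Bivar chart plots the bad rate of each bin, so the numbers behind it can be sketched without any plotting library. The example below assigns values to bins from a list of cut points and computes the bad rate per bin, plus a simple monotonicity check that is often used when deciding whether bins need adjusting. The function names, the left-closed interval convention, and the toy data are assumptions for illustration only.

```python
from bisect import bisect_right

def bin_bad_rates(values, targets, cut_points):
    """Bad rate per bin; bin k is assumed to cover [cut[k-1], cut[k])."""
    n_bins = len(cut_points) + 1
    totals = [0] * n_bins
    bads = [0] * n_bins
    for v, y in zip(values, targets):
        k = bisect_right(cut_points, v)  # a value equal to a cut goes right
        totals[k] += 1
        bads[k] += y
    # Empty bins get None instead of a misleading 0.0.
    return [b / t if t else None for b, t in zip(bads, totals)]

def is_monotonic(rates):
    """True if the non-empty bin rates never change direction."""
    r = [x for x in rates if x is not None]
    pairs = list(zip(r, r[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

# Toy data: higher feature values carry lower risk.
values_toy = [0.1, 0.15, 0.2, 0.3, 0.4, 0.6, 0.7, 0.9]
targets_toy = [1, 1, 0, 1, 0, 0, 0, 0]
rates_toy = bin_bad_rates(values_toy, targets_toy, cut_points=[0.2, 0.5])
print(rates_toy, is_monotonic(rates_toy))
```

Running the same function on the development and out-of-time samples with identical cut points gives the two series that a Bivar chart would compare.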
# Apply the bins using the fitted cut points
dev_slct2 = combiner.transform(dev_slct1)
val2 = combiner.transform(val[dev_slct1.columns])
off2 = combine