Table of Contents
Feature Selection
Filter Methods
Variance Threshold
Correlation Coefficient Method
Chi-Square Test
Wrapper Methods
Recursive Feature Elimination (RFE)
Embedded Methods
L1 Regularization (Lasso Regression)
Tree-Based Feature Importance
Dimensionality Reduction
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA)
Automatic Feature Generation
Generating Features with Featuretools
Feature Interaction
Polynomial Features
Manual Feature Combination
Feature Encoding
One-Hot Encoding
Label Encoding
Frequency/Target Encoding
Frequency Encoding
Target Encoding
Target Encoding with Smoothing
Time Feature Extraction
Text Feature Extraction
Bag of Words
TF-IDF
Feature Selection
Feature selection is a key step for reducing the number of features and improving model performance. Common approaches are filter, wrapper, and embedded methods.
Filter Methods
Filter methods screen features with statistical measures, independently of any model.
- Variance Threshold
Select features whose variance exceeds a given threshold.
## Feature Selection
# Filter Methods
# 1. Variance threshold
import numpy as np
from sklearn.feature_selection import VarianceThreshold

data = np.array([[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]])
selector = VarianceThreshold(threshold=0.5)
selected_data = selector.fit_transform(data)
print(selected_data)
# [[0]
# [4]
# [1]]
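Which of the original columns survived is easiest to read off the fitted selector's boolean support mask; get_support() is available on all scikit-learn selectors:

print(selector.get_support())  # True marks a retained column
# [False False  True False]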
- Correlation Coefficient Method
Select features that are highly correlated with the target variable.
# 2. Correlation coefficient method
import numpy as np
from sklearn.feature_selection import SelectKBest
from scipy.stats import pearsonr

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([1, 0, 1])

# Define a score_func that computes the Pearson correlation
# between each feature and the target y
def pearsonr_score(X, y):
    scores = []
    for feature in X.T:  # iterate over each feature (column)
        corr, _ = pearsonr(feature, y)  # Pearson correlation coefficient
        scores.append(abs(corr))  # use the absolute value as the score
    return np.array(scores)

# Use SelectKBest to keep the 2 highest-scoring features
selector = SelectKBest(score_func=pearsonr_score, k=2)
selected_data = selector.fit_transform(X, y)
print(selected_data)
# [[2 3]
# [5 6]
# [8 9]]
- Chi-Square Test
Select features with a strong chi-square association with the target variable (note: chi2 requires non-negative feature values).
# 3. Chi-square test
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([1, 0, 1])
selector = SelectKBest(chi2, k=2)
selected_data = selector.fit_transform(X, y)
print(selected_data)
# [[2 3]
# [5 6]
# [8 9]]
Wrapper Methods
Wrapper methods evaluate candidate feature subsets by training a model on them.
- Recursive Feature Elimination (RFE)
Recursively remove the least important features.
# Wrapper Methods
# Recursive Feature Elimination (RFE)
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([1, 0, 1])
model = LogisticRegression()
selector = RFE(model, n_features_to_select=2)
selected_data = selector.fit_transform(X, y)
print(selected_data)
# [[1 3]
# [4 6]
# [7 9]]
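The fitted selector also records which features were kept and the order of elimination:

print(selector.support_)   # [ True False  True]: mask of the retained features
print(selector.ranking_)   # [1 2 1]: rank 1 means retained, larger ranks were eliminated earlier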
Embedded Methods
Embedded methods perform feature selection automatically as part of model training.
- L1 Regularization (Lasso Regression)
L1 regularization shrinks the weights of unimportant features to exactly zero.
# Embedded Methods
# 1. L1 regularization (Lasso regression)
import numpy as np
from sklearn.linear_model import Lasso

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([1, 0, 1])
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print(lasso.coef_)
# [0. 0. 0.]
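On the three-sample toy data above, alpha=0.1 shrinks every coefficient to zero, so nothing would be selected. A minimal sketch on larger synthetic data (make_regression with 5 informative features out of 10; the parameters here are illustrative) shows the intended behavior, with signal-carrying features keeping nonzero weights:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, of which only 5 carry signal
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       noise=1.0, random_state=0)
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)
print(np.nonzero(lasso.coef_)[0])  # indices of the features Lasso kept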
- Tree-Based Feature Importance
Use tree models (such as random forests) to estimate feature importance.
# 2. Tree-based feature importance
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([1, 0, 1])
model = RandomForestClassifier()
model.fit(X, y)
print(model.feature_importances_)
# [0.37857143 0.30357143 0.31785714]  (values vary across runs without a fixed random_state)
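Importances by themselves only rank the features; SelectFromModel turns them into an actual selection step. A short sketch reusing the fitted model above, where threshold='median' keeps the features whose importance is above the median:

from sklearn.feature_selection import SelectFromModel

# Wrap the already-fitted forest and keep above-median-importance features
sfm = SelectFromModel(model, threshold='median', prefit=True)
print(sfm.transform(X))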
Dimensionality Reduction
Dimensionality reduction projects high-dimensional features into a lower-dimensional space. Common methods include PCA and LDA.
Principal Component Analysis (PCA)
Linearly project the data onto the directions of maximum variance.
## Dimensionality Reduction
# 1. Principal Component Analysis (PCA)
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(X)
print(reduced_data)
# [[-5.19615242e+00 3.62353582e-16]
# [ 0.00000000e+00 0.00000000e+00]
# [ 5.19615242e+00 3.62353582e-16]]
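How many components to keep is usually decided from the explained-variance ratios; PCA also accepts a float n_components to retain just enough components for a target variance fraction:

print(pca.explained_variance_ratio_)  # ~[1. 0.] here: all variance lies along one direction
pca95 = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
print(pca95.fit_transform(X).shape)  # (3, 1): one component suffices for this X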
Linear Discriminant Analysis (LDA)
Linearly project the data onto the directions that best separate the classes.
# 2. Linear Discriminant Analysis (LDA)
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
y = np.array([1, 0, 1, 2])
# Note: n_components must not exceed n_classes - 1; this collinear toy data
# has rank 1, which is why the output below has a single column
lda = LinearDiscriminantAnalysis(n_components=2)
reduced_data = lda.fit_transform(X, y)
print(reduced_data)
# [[ 1.06066017]
# [ 0.35355339]
# [-0.35355339]
# [-1.06066017]]
Automatic Feature Generation
Automatic feature generation uses tools or algorithms to create new features.
Generating Features with Featuretools
Automatically generate aggregation and transformation features from relational data.
## Automatic Feature Generation
# 1. Generating features with Featuretools
import featuretools as ft
import pandas as pd

# Create transaction data
data = pd.DataFrame({
    'transaction_id': range(1, 6),  # add a transaction_id column
    'customer_id': [1, 2, 1, 2, 3],
    'amount': [100, 150, 200, 300, 500],
    'timestamp': pd.date_range('2025-01-01', periods=5)
})

# Create an EntitySet
es = ft.EntitySet(id='transactions')

# Add the transactions dataframe
es.add_dataframe(
    dataframe=data,
    dataframe_name='transactions',
    index='transaction_id',  # index column
    time_index='timestamp'   # time index column
)

# Define the customers entity
customers = pd.DataFrame({'customer_id': [1, 2, 3]})

# Add the customers dataframe
es.add_dataframe(
    dataframe=customers,
    dataframe_name='customers',
    index='customer_id'  # index column
)

# Establish the one-to-many relationship between customers and transactions
es.add_relationship(
    parent_dataframe_name='customers',
    parent_column_name='customer_id',
    child_dataframe_name='transactions',
    child_column_name='customer_id'
)

# Run Deep Feature Synthesis to generate features automatically
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    agg_primitives=['mean', 'max', 'std'],
    trans_primitives=['day', 'month']
)
print(feature_matrix)
# MAX(transactions.amount) ... STD(transactions.amount)
# customer_id ...
# 1 200.0 ... 70.710678
# 2 300.0 ... 106.066017
# 3 500.0 ... NaN
#
# [3 rows x 3 columns]
Feature Interaction
Feature interaction uncovers new information by combining features. Common methods include polynomial features and manually combined features.
Polynomial Features
Generate polynomial combinations of the features.
## Feature Interaction
# 1. Polynomial features
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1, 2], [3, 4], [5, 6]])
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(X)
print(poly_features)
# [[ 1. 2. 1. 2. 4.]
# [ 3. 4. 9. 12. 16.]
# [ 5. 6. 25. 30. 36.]]
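The column order of the expanded matrix is easiest to read off the generated feature names:

print(poly.get_feature_names_out(['A', 'B']))
# ['A' 'B' 'A^2' 'A B' 'B^2']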
Manual Feature Combination
Combine features by hand using domain knowledge.
# 2. Manual feature combination
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
data['A_plus_B'] = data['A'] + data['B']
data['A_times_B'] = data['A'] * data['B']
print(data)
# A B A_plus_B A_times_B
# 0 1 4 5 4
# 1 2 5 7 10
# 2 3 6 9 18
Feature Encoding
Feature encoding converts non-numeric features into numeric ones. Common methods include one-hot encoding, label encoding, and frequency/target encoding.
One-Hot Encoding
Convert categorical features into one-hot vectors.
## Feature Encoding
# 1. One-Hot Encoding
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = np.array([['cat'], ['dog'], ['bird']])
encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
encoded_data = encoder.fit_transform(data)
print(encoded_data)
# [[0. 1. 0.]
# [0. 0. 1.]
# [1. 0. 0.]]
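At prediction time, category values can appear that were never seen during fit; handle_unknown='ignore' encodes them as an all-zero row instead of raising an error:

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(data)
print(encoder.transform(np.array([['fish']])))
# [[0. 0. 0.]]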
Label Encoding
Convert categorical features into integer labels.
# 2. Label Encoding
import numpy as np
from sklearn.preprocessing import LabelEncoder

data = np.array(['cat', 'dog', 'bird'])
encoder = LabelEncoder()
encoded_data = encoder.fit_transform(data)
print(encoded_data)
# [1 2 0]
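Strictly speaking, LabelEncoder is intended for the target column; for 2-D feature matrices scikit-learn provides OrdinalEncoder, which encodes several columns at once (the size column here is purely illustrative):

from sklearn.preprocessing import OrdinalEncoder

features = np.array([['cat', 'S'], ['dog', 'M'], ['bird', 'L']])
encoder = OrdinalEncoder()
print(encoder.fit_transform(features))
# [[1. 2.]
#  [2. 1.]
#  [0. 0.]]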
Frequency/Target Encoding
For categorical features with many distinct values, frequency-based or target-based encodings can be used.
- Frequency Encoding
Replace each category value with how frequently it occurs in the dataset.
# 3. Frequency or target encoding
# 3.1 Frequency Encoding
import pandas as pd

# Example data
data = pd.DataFrame({'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'C', 'B']})

# Compute each category's frequency
frequency_map = data['category'].value_counts(normalize=True).to_dict()

# Replace category values with their frequencies
data['frequency_encoded'] = data['category'].map(frequency_map)
print(data)
# category frequency_encoded
# 0 A 0.333333
# 1 B 0.333333
# 2 A 0.333333
# 3 C 0.333333
# 4 B 0.333333
# 5 A 0.333333
# 6 C 0.333333
# 7 C 0.333333
# 8 B 0.333333
- Target Encoding
Replace each category value with a statistic of the target variable (such as its mean or median).
# 3.2 Target Encoding
import pandas as pd

# Example data
data = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'C', 'B'],
    'target': [1, 0, 1, 1, 0, 1, 0, 1, 0]
})

# Compute the target mean for each category
target_mean = data.groupby('category')['target'].mean().to_dict()

# Replace category values with the target means
data['target_encoded'] = data['category'].map(target_mean)
print(data)
# category target target_encoded
# 0 A 1 1.000000
# 1 B 0 0.000000
# 2 A 1 1.000000
# 3 C 1 0.666667
# 4 B 0 0.000000
# 5 A 1 1.000000
# 6 C 0 0.666667
# 7 C 1 0.666667
# 8 B 0 0.000000
- Target Encoding with Smoothing
To reduce overfitting, blend the per-category target mean with the global target mean, controlling the weighting with a smoothing parameter alpha: encoded(c) = (n_c * mean_c + alpha * global_mean) / (n_c + alpha), where n_c is the number of rows in category c.
# Target Encoding with smoothing
import pandas as pd

# Example data
data = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'C', 'B'],
    'target': [1, 0, 1, 1, 0, 1, 0, 1, 0]
})

# Global target mean
global_mean = data['target'].mean()

# Per-category target mean
target_mean = data.groupby('category')['target'].mean()

# Number of occurrences of each category
category_count = data['category'].value_counts()

# Smoothing parameter
alpha = 10

# Compute the smoothed target mean
smoothed_target_mean = (target_mean * category_count + global_mean * alpha) / (category_count + alpha)

# Replace category values with the smoothed target means
data['smoothed_target_encoded'] = data['category'].map(smoothed_target_mean)
print(data)
# category target smoothed_target_encoded
# 0 A 1 0.658120
# 1 B 0 0.427350
# 2 A 1 0.658120
# 3 C 1 0.581197
# 4 B 0 0.427350
# 5 A 1 0.658120
# 6 C 0 0.581197
# 7 C 1 0.581197
# 8 B 0 0.427350
Time Feature Extraction
Time feature extraction derives useful information, such as year, month, day, and hour, from timestamps.
## Time Feature Extraction
import pandas as pd

data = pd.DataFrame({'timestamp': pd.date_range('2025-01-01', periods=5)})
data['year'] = data['timestamp'].dt.year
data['month'] = data['timestamp'].dt.month
data['day'] = data['timestamp'].dt.day
data['hour'] = data['timestamp'].dt.hour
print(data)
# timestamp year month day hour
# 0 2025-01-01 2025 1 1 0
# 1 2025-01-02 2025 1 2 0
# 2 2025-01-03 2025 1 3 0
# 3 2025-01-04 2025 1 4 0
# 4 2025-01-05 2025 1 5 0
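Day-of-week and weekend indicators are derived the same way and are often useful features in their own right (a small sketch extending the dataframe above):

data['dayofweek'] = data['timestamp'].dt.dayofweek  # Monday = 0, Sunday = 6
data['is_weekend'] = data['timestamp'].dt.dayofweek >= 5
print(data[['timestamp', 'dayofweek', 'is_weekend']])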
Text Feature Extraction
Text feature extraction derives useful information from text data. Common methods include the bag-of-words model and TF-IDF.
Bag of Words
Convert text into word-frequency vectors.
## Text Feature Extraction
# 1. Bag of Words
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out()) # ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
print(X.toarray())
# [[0 1 1 1 0 0 1 0 1]
# [0 2 0 1 0 1 1 0 1]
# [1 0 0 1 1 0 1 1 1]
# [0 1 1 1 0 0 1 0 1]]
TF-IDF
Convert text into TF-IDF-weighted vectors.
# 2. TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out()) # ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
print(X.toarray())
# [[0. 0.46979139 0.58028582 0.38408524 0. 0.
# 0.38408524 0. 0.38408524]
# [0. 0.6876236 0. 0.28108867 0. 0.53864762
# 0.28108867 0. 0.28108867]
# [0.51184851 0. 0. 0.26710379 0.51184851 0.
# 0.26710379 0.51184851 0.26710379]
# [0. 0.46979139 0.58028582 0.38408524 0. 0.
# 0.38408524 0. 0.38408524]]