Table of Contents
Simple Random Split (train_test_split)
Group Splitting
Simple Group Splitting
Stratified Group Splitting
Cross-Validation
Group K-Fold Cross-Validation (GroupKFold)
Leave-One-Group-Out (LeaveOneGroupOut)
Simple Random Split (train_test_split)
Simple random group splitting assigns groups at random to the training and test sets, which ensures that rows from the same group never appear in both. It is easy to implement, but it does not guarantee a balanced data distribution between the two sets.
## Simple random split
import pandas as pd
from sklearn.model_selection import train_test_split

data = {
    'A': [1.0, 1.0, 2.0, 2.0, 1.0, 1.0],
    'B': [0.5, 0.5, 1.5, 1.5, 0.5, 0.5],
    'C': [1.0, 1.0, 2.0, 2.0, 1.0, 1.0],
    'D': [0.5, 0.5, 1.5, 1.5, 0.5, 0.5],
    'E': [0.1, 0.2, 0.1, 0.2, 0.3, 0.4],
    'y': [0, 1, 0, 1, 0, 1]
}
df = pd.DataFrame(data)

# Define the grouping key from the parameter columns
df['group'] = df[['A', 'B', 'C', 'D']].apply(lambda row: '_'.join(map(str, row)), axis=1)

# Collect the unique groups and the target label of each group
# (using the same index for both keeps the groups and the stratify labels aligned)
group_targets = df.groupby('group')['y'].first()
unique_groups = group_targets.index.values

# Split the groups (not the rows)
train_groups, test_groups = train_test_split(
    unique_groups,
    test_size=0.2,                  # proportion of groups in the test set
    random_state=42,                # random seed
    stratify=group_targets.values   # stratify by each group's target label
)

# Assemble the row-level splits from the group assignment
train_data = df[df['group'].isin(train_groups)]
test_data = df[df['group'].isin(test_groups)]

print(f"Training set: {len(train_data)} rows (from {len(train_groups)} groups)")
print(f"Test set: {len(test_data)} rows (from {len(test_groups)} groups)")
# Training set: 2 rows (from 1 group)
# Test set: 4 rows (from 1 group)
# Note: the split is over groups, so the row counts need not follow test_size
# when the groups have different sizes.
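One quick check worth doing after any group-level split is to confirm that the group keys of the two sets are disjoint. A minimal sketch, reusing the train_data and test_data frames from the example above:

# Sanity check (not in the original): no group key may appear in both splits
assert set(train_data['group']).isdisjoint(set(test_data['group']))
print("No group overlaps between the training and test sets.")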
Group Splitting
Simple Group Splitting
The core idea of simple group splitting is to place rows that share the same parameter combination into a single group and then split at the group level, so that the groups in the training and test sets are mutually exclusive.
## Simple group split
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

data = {
    'A': [1.0, 1.0, 2.0, 2.0, 1.0, 1.0],
    'B': [0.5, 0.5, 1.5, 1.5, 0.5, 0.5],
    'C': [1.0, 1.0, 2.0, 2.0, 1.0, 1.0],
    'D': [0.5, 0.5, 1.5, 1.5, 0.5, 0.5],
    'E': [0.1, 0.2, 0.1, 0.2, 0.3, 0.4],
    'y': [0, 1, 0, 1, 0, 1]
}
df = pd.DataFrame(data)

# Define the grouping key from the parameter columns
df['group'] = df[['A', 'B', 'C', 'D']].apply(lambda row: '_'.join(map(str, row)), axis=1)

# Use GroupShuffleSplit for the group-level split
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in gss.split(df, groups=df['group']):
    train = df.iloc[train_index]
    test = df.iloc[test_index]

# Separate features and target
X_train, y_train = train.drop(columns=['y', 'group']), train['y']
X_test, y_test = test.drop(columns=['y', 'group']), test['y']

print("Training set:")
print(X_train)
#      A    B    C    D    E
# 0  1.0  0.5  1.0  0.5  0.1
# 1  1.0  0.5  1.0  0.5  0.2
# 4  1.0  0.5  1.0  0.5  0.3
# 5  1.0  0.5  1.0  0.5  0.4
print("Test set:")
print(X_test)
#      A    B    C    D    E
# 2  2.0  1.5  2.0  1.5  0.1
# 3  2.0  1.5  2.0  1.5  0.2
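GroupShuffleSplit can also draw several independent random group-level partitions in one pass by raising n_splits, which is handy for repeated hold-out evaluation. A minimal sketch reusing df and its 'group' column from above; n_splits=3 and random_state=0 are arbitrary choices:

# Sketch: draw three independent random group-level partitions
from sklearn.model_selection import GroupShuffleSplit

gss_repeat = GroupShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
for i, (tr_idx, te_idx) in enumerate(gss_repeat.split(df, groups=df['group'])):
    print(f"Split {i}: train groups = {sorted(set(df['group'].iloc[tr_idx]))}, "
          f"test groups = {sorted(set(df['group'].iloc[te_idx]))}")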
Stratified Group Splitting
Stratified group splitting builds on random group splitting and additionally keeps the distribution of a key variable (such as the target y) consistent between the training and test sets, which preserves the overall data distribution better.
## Stratified group split
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

data = {
    'A': [1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.5, 2, 1.5],
    'B': [0.5, 0.5, 1.5, 1.5, 0.5, 0.5, 1.5, 2, 1.5],
    'C': [1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.5, 2, 1.5],
    'D': [0.5, 0.5, 1.5, 1.5, 0.5, 0.5, 1.5, 2, 1.5],
    'E': [0.1, 0.2, 0.1, 0.2, 0.3, 0.4, 1.5, 2, 1.5],
    'y': [0, 1, 0, 1, 0, 1, 2, 2, 1]
}
df = pd.DataFrame(data)

# Define the grouping key and the target label of each group
df['group_id'] = df[['A', 'B', 'C', 'D']].astype(str).apply('_'.join, axis=1)
group_targets = df.groupby('group_id')['y'].first()
unique_groups = group_targets.index.values

# Use StratifiedShuffleSplit on the groups, stratified by each group's target label
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
for train_groups_idx, test_groups_idx in sss.split(np.zeros(len(unique_groups)), group_targets):
    train_groups = unique_groups[train_groups_idx]
    test_groups = unique_groups[test_groups_idx]
    train_df = df[df['group_id'].isin(train_groups)]
    test_df = df[df['group_id'].isin(test_groups)]

# Separate features and target
X_train, y_train = train_df.drop(columns=['y', 'group_id']), train_df['y']
X_test, y_test = test_df.drop(columns=['y', 'group_id']), test_df['y']

print("Training set:")
print(X_train)
print("Test set:")
print(X_test)
# Each split contains whole groups; with test_size=0.3, roughly 30% of the groups,
# stratified by their target label, end up in the test set.
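Newer versions of scikit-learn (1.0 and later) also ship StratifiedGroupKFold, which combines both requirements directly: folds keep every group intact while approximating the class distribution of y. A minimal sketch reusing the 9-row df and its 'group_id' column from above; n_splits=2 is an arbitrary choice that fits this tiny toy dataset:

# Sketch: stratified + grouped folds with StratifiedGroupKFold (scikit-learn >= 1.0)
from sklearn.model_selection import StratifiedGroupKFold

X = df.drop(columns=['y', 'group_id'])
y = df['y']
groups = df['group_id']

sgkf = StratifiedGroupKFold(n_splits=2, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(sgkf.split(X, y, groups)):
    print(f"Fold {fold}: test groups = {sorted(set(groups.iloc[test_idx]))}, "
          f"test class counts = {y.iloc[test_idx].value_counts().to_dict()}")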
Cross-Validation
Cross-validation splits the data into several subsets (folds); each fold serves once as the test set while the remaining folds form the training set, and the procedure is repeated across all folds to evaluate model performance. This makes full use of the data and yields more stable estimates.
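Before the grouped variants, a minimal sketch of plain (non-grouped) K-fold cross-validation with cross_val_score; the random toy data and the LogisticRegression model are arbitrary stand-ins, not part of the original example:

# Plain K-fold cross-validation (sketch with made-up toy data)
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((20, 3))            # 20 samples, 3 features
y = rng.integers(0, 2, size=20)    # binary target

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(), X, y, cv=kf)
print(f"Accuracy per fold: {scores}, mean: {scores.mean():.3f}")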
Group K-Fold Cross-Validation (GroupKFold)
GroupKFold is a grouped K-fold cross-validation method: every group's rows are kept together in a single fold, so no group ever spans the training and test sets of the same split. It is a good fit for data with an explicit grouping structure.
## Group K-fold cross-validation
import pandas as pd
from sklearn.model_selection import GroupKFold

data = {
    'A': [1.0, 1.0, 2.0, 2.0, 1.0, 1.0],
    'B': [0.5, 0.5, 1.5, 1.5, 0.5, 0.5],
    'C': [1.0, 1.0, 2.0, 2.0, 1.0, 1.0],
    'D': [0.5, 0.5, 1.5, 1.5, 0.5, 0.5],
    'E': [0.1, 0.2, 0.1, 0.2, 0.3, 0.4],
    'y': [0, 1, 0, 1, 0, 1]
}
df = pd.DataFrame(data)

# Define the grouping key from the parameter columns
df['group'] = df[['A', 'B', 'C', 'D']].apply(lambda row: '_'.join(map(str, row)), axis=1)
# Grouping labels passed to the splitter
groups = df['group'].values

# Use GroupKFold for grouped K-fold cross-validation
gkf = GroupKFold(n_splits=2)  # 2 folds, since the toy data only has 2 groups
for train_index, test_index in gkf.split(df, groups=groups):
    train = df.iloc[train_index]
    test = df.iloc[test_index]
    X_train, y_train = train.drop(columns=['y', 'group']), train['y']
    X_test, y_test = test.drop(columns=['y', 'group']), test['y']
    print("Training set:")
    print(X_train)
    print("Test set:")
    print(X_test)

# Output of one of the two folds:
#      A    B    C    D    E   (training set)
# 0  1.0  0.5  1.0  0.5  0.1
# 1  1.0  0.5  1.0  0.5  0.2
# 4  1.0  0.5  1.0  0.5  0.3
# 5  1.0  0.5  1.0  0.5  0.4
#      A    B    C    D    E   (test set)
# 2  2.0  1.5  2.0  1.5  0.1
# 3  2.0  1.5  2.0  1.5  0.2
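In practice the explicit per-fold loop is often replaced by cross_val_score, which takes the splitter as cv and forwards the group labels via its groups parameter. A minimal sketch reusing df and groups from above; RandomForestClassifier is an arbitrary stand-in model, not part of the original example:

# Sketch: scoring a model across the grouped folds with cross_val_score
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

X = df.drop(columns=['y', 'group'])
y = df['y']

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=GroupKFold(n_splits=2), groups=groups)
print(f"Accuracy per fold: {scores}")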
Leave-One-Group-Out (LeaveOneGroupOut)
LeaveOneGroupOut is the extreme case of grouped cross-validation: each iteration holds out exactly one group as the test set and trains on all remaining groups, so the number of splits equals the number of groups.
## Leave-one-group-out
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut

data = {
    'A': [1.0, 1.0, 2.0, 2.0, 1.0, 1.0],
    'B': [0.5, 0.5, 1.5, 1.5, 0.5, 0.5],
    'C': [1.0, 1.0, 2.0, 2.0, 1.0, 1.0],
    'D': [0.5, 0.5, 1.5, 1.5, 0.5, 0.5],
    'E': [0.1, 0.2, 0.1, 0.2, 0.3, 0.4],
    'y': [0, 1, 0, 1, 0, 1]
}
df = pd.DataFrame(data)

# Define the grouping key from the parameter columns
df['group'] = df[['A', 'B', 'C', 'D']].apply(lambda row: '_'.join(map(str, row)), axis=1)
# Grouping labels passed to the splitter
groups = df['group'].values

# Use LeaveOneGroupOut: each group is held out once as the test set
logo = LeaveOneGroupOut()
for train_index, test_index in logo.split(df, groups=groups):
    train = df.iloc[train_index]
    test = df.iloc[test_index]
    X_train, y_train = train.drop(columns=['y', 'group']), train['y']
    X_test, y_test = test.drop(columns=['y', 'group']), test['y']
    print("Training set:")
    print(X_train)
    print("Test set:")
    print(X_test)

# Output of one of the two splits (the toy data has only 2 groups):
#      A    B    C    D    E   (training set)
# 0  1.0  0.5  1.0  0.5  0.1
# 1  1.0  0.5  1.0  0.5  0.2
# 4  1.0  0.5  1.0  0.5  0.3
# 5  1.0  0.5  1.0  0.5  0.4
#      A    B    C    D    E   (test set)
# 2  2.0  1.5  2.0  1.5  0.1
# 3  2.0  1.5  2.0  1.5  0.2
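A closely related splitter is LeavePGroupsOut, which holds out every combination of P groups instead of a single one. A minimal sketch reusing df and groups from above; with only 2 groups in the toy data, n_groups=1 simply reproduces LeaveOneGroupOut:

# Sketch: LeavePGroupsOut generalizes LeaveOneGroupOut to P held-out groups
from sklearn.model_selection import LeavePGroupsOut

lpgo = LeavePGroupsOut(n_groups=1)
print("Number of splits:", lpgo.get_n_splits(groups=groups))
for train_index, test_index in lpgo.split(df, groups=groups):
    print("Held-out test groups:", sorted(set(df['group'].iloc[test_index])))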