「日拱一碼」022 Machine Learning

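Single random split methods

Plain single random split (train_test_split)

Stratified single random split (the stratify parameter of train_test_split)

Repeated random split methods

Plain repeated random split (ShuffleSplit)

Stratified repeated random split (StratifiedShuffleSplit)

Cross-validation methods

K-fold cross-validation (KFold)

Stratified K-fold cross-validation (StratifiedKFold)

Group-based split methods

Group random split (GroupShuffleSplit)

Group K-fold cross-validation (GroupKFold)

Stratified group K-fold cross-validation (StratifiedGroupKFold)

Time-series split methods

Custom split methods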

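Data splitting is an important step in data preprocessing. It is typically used to divide a dataset into training, validation, and test sets so that a model can be trained, its hyperparameters tuned, and its performance evaluated. Below are several common data splitting methods:

Single random split methods

Plain single random split (train_test_split)

A plain single random split randomly divides the dataset into a training set and a test set (or into training, validation, and test sets). It suits data whose distribution is fairly uniform, for example when the classes are reasonably balanced.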

## Single random split methods
# Create the dataset
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create a simple classification dataset: 100 samples, 4 features (2 informative, 2 redundant)
X, y = make_classification(n_samples=100, n_features=4, n_informative=2, n_redundant=2, random_state=42)

# Convert the data to a DataFrame
df = pd.DataFrame(X, columns=['Feature1', 'Feature2', 'Feature3', 'Feature4'])
df['Label'] = y

# Add group information (assume every sample belongs to one group).
# np.random.choice is not seeded here, so the group sizes change from run to run.
df['Group'] = np.random.choice(['Group1', 'Group2', 'Group3'], size=len(df), p=[0.4, 0.3, 0.3])

# Add timestamps (assume the data was generated in chronological order)
df['Timestamp'] = pd.date_range(start='2025-01-01', periods=len(df), freq='D')

# print(df)
#     Feature1  Feature2  Feature3  Feature4  Label   Group  Timestamp
# 0  -1.053839 -1.027544 -0.329294  0.826007      1  Group1 2025-01-01
# 1   1.569317  1.306542 -0.239385 -0.331376      0  Group1 2025-01-02
# 2  -0.358856 -0.691021 -1.225329  1.652145      1  Group2 2025-01-03
# 3  -0.136856  0.460938  1.896911 -2.281386      0  Group3 2025-01-04
# 4  -0.048629  0.502301  1.778730 -2.171053      0  Group2 2025-01-05
# ..       ...       ...       ...       ...    ...     ...        ...
# 95 -2.241820 -1.248690  2.357902 -2.009185      0  Group2 2025-04-06
# 96  0.573042  0.362054 -0.462814  0.341294      1  Group2 2025-04-07
# 97 -0.375121 -0.149518  0.588465 -0.575002      0  Group2 2025-04-08
# 98  1.594888  0.780256 -2.030223  1.863789      1  Group1 2025-04-09
# 99 -0.149941 -0.566037 -1.416933  1.804741      1  Group1 2025-04-10
#
# [100 rows x 7 columns]

# Plain single random split (train_test_split)
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['Label', 'Group', 'Timestamp']), df['Label'], test_size=0.2, random_state=42)

print("Plain single random split result:")
print("Training set size:", X_train.shape) # (80, 4)
print("Test set size:", X_test.shape) # (20, 4)
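The description above also mentions a three-way split into training, validation, and test sets, which train_test_split does not do in a single call. A minimal sketch, assuming a 60/20/20 target ratio (the ratio is an illustrative choice, not something the original specifies), is to call it twice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same synthetic dataset as in the example above.
X, y = make_classification(n_samples=100, n_features=4, n_informative=2,
                           n_redundant=2, random_state=42)

# Step 1: hold out 20% of all samples as the test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 2: take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2 of the total).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)

print(X_train.shape, X_val.shape, X_test.shape)  # (60, 4) (20, 4) (20, 4)
```

The second test_size is expressed relative to what is left after the first call, which is why 0.25 rather than 0.2 is used there.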

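Stratified single random split (the stratify parameter of train_test_split)

A stratified single random split is a random split that additionally keeps the target-variable distribution in each resulting subset consistent with the original dataset. This matters most for classification problems, especially when some classes have few samples.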

# Stratified single random split
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(df.drop(columns=['Label', 'Group', 'Timestamp']), df['Label'], test_size=0.2, stratify=df['Label'], random_state=42)

print("Stratified single random split result:")
print("Training set label distribution:", y_train_strat.value_counts())
# Label
# 0    40
# 1    40
# Name: count, dtype: int64
print("Test set label distribution:", y_test_strat.value_counts())
# Label
# 1    10
# 0    10
# Name: count, dtype: int64
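The balanced dataset above hides most of the benefit of stratification. The sketch below uses a made-up imbalanced label vector (90 negatives, 10 positives; purely illustrative data) to show how stratify pins the number of rare-class samples in the test set, whereas a plain split leaves it to chance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 90 negatives and 10 positives (illustrative data only).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# Plain split: the number of positives in the test set depends on the shuffle.
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: the 10% positive rate is preserved, so 2 of 20 test samples are positive.
_, _, _, y_test_strat = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

print("Positives in plain test set:", int(y_test_plain.sum()))
print("Positives in stratified test set:", int(y_test_strat.sum()))  # 2
```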

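Repeated random split methods

Plain repeated random split (ShuffleSplit)

A plain repeated random split shuffles the data and then divides it into training and test sets by the given ratio, repeating this n_splits times. The split ignores the distribution of the target variable (label).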

## Repeated random split methods
# Plain repeated random split: ShuffleSplit
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

for fold, (train_idx, test_idx) in enumerate(ss.split(df)):
    train_data = df.iloc[train_idx]
    test_data = df.iloc[test_idx]
    print(f"Split {fold + 1}:")
    print("Training set size:", train_data.shape)
    print("Test set size:", test_data.shape)
    print("Training set label distribution:", train_data['Label'].value_counts())
    print("Test set label distribution:", test_data['Label'].value_counts())
    print()
# Split 1:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 0    42
# 1    38
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    12
# 0     8
# Name: count, dtype: int64
#
# Split 2:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 0    43
# 1    37
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    13
# 0     7
# Name: count, dtype: int64
#
# Split 3:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 1    40
# 0    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    10
# 0    10
# Name: count, dtype: int64
#
# Split 4:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 1    42
# 0    38
# Name: count, dtype: int64
# Test set label distribution: Label
# 0    12
# 1     8
# Name: count, dtype: int64
#
# Split 5:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 0    41
# 1    39
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    11
# 0     9
# Name: count, dtype: int64

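Stratified repeated random split (StratifiedShuffleSplit)

A stratified repeated random split keeps the target-variable (label) distribution of every split consistent with the original dataset.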

# Stratified repeated random split: StratifiedShuffleSplit
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

for fold, (train_idx, test_idx) in enumerate(sss.split(df.drop(columns=['Label']), df['Label'])):
    train_data = df.iloc[train_idx]
    test_data = df.iloc[test_idx]
    print(f"Split {fold + 1}:")
    print("Training set size:", train_data.shape)
    print("Test set size:", test_data.shape)
    print("Training set label distribution:", train_data['Label'].value_counts())
    print("Test set label distribution:", test_data['Label'].value_counts())
    print()
# Split 1:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 0    40
# 1    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    10
# 0    10
# Name: count, dtype: int64
#
# Split 2:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 1    40
# 0    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 0    10
# 1    10
# Name: count, dtype: int64
#
# Split 3:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 1    40
# 0    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 0    10
# 1    10
# Name: count, dtype: int64
#
# Split 4:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 1    40
# 0    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    10
# 0    10
# Name: count, dtype: int64
#
# Split 5:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 0    40
# 1    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    10
# 0    10
# Name: count, dtype: int64

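Cross-validation methods

Cross-validation is a more robust way to split data: by splitting the dataset several times and training a model on each split, it gives a more reliable estimate of model performance than a single split.

K-fold cross-validation (KFold)

The dataset is divided into K subsets. Each time, one subset serves as the test set and the rest form the training set; this is repeated K times so that every subset is used for testing exactly once.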

## Cross-validation methods
# K-fold cross-validation
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kf.split(df)):
    train_data_kfold = df.iloc[train_idx]
    test_data_kfold = df.iloc[test_idx]
    print(f"Fold {fold + 1}:")
    print("Training set size:", train_data_kfold.shape)
    print("Test set size:", test_data_kfold.shape)
# Fold 1:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Fold 2:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Fold 3:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Fold 4:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Fold 5:
# Training set size: (80, 7)
# Test set size: (20, 7)

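Stratified K-fold cross-validation (StratifiedKFold)

Building on K-fold cross-validation, this variant keeps the target-variable distribution in every fold consistent with the original dataset.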

# Stratified K-fold cross-validation
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(skf.split(df.drop(columns=['Label', 'Group', 'Timestamp']), df['Label'])):
    train_data_skfold = df.iloc[train_idx]
    test_data_skfold = df.iloc[test_idx]
    print(f"Fold {fold + 1}:")
    print("Training set label distribution:", train_data_skfold['Label'].value_counts())
    print("Test set label distribution:", test_data_skfold['Label'].value_counts())
# Fold 1:
# Training set label distribution: Label
# 1    40
# 0    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 0    10
# 1    10
# Name: count, dtype: int64
# Fold 2:
# Training set label distribution: Label
# 0    40
# 1    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    10
# 0    10
# Name: count, dtype: int64
# Fold 3:
# Training set label distribution: Label
# 1    40
# 0    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 0    10
# 1    10
# Name: count, dtype: int64
# Fold 4:
# Training set label distribution: Label
# 1    40
# 0    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    10
# 0    10
# Name: count, dtype: int64
# Fold 5:
# Training set label distribution: Label
# 1    40
# 0    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 0    10
# 1    10
# Name: count, dtype: int64
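The examples in this section only print the folds; they do not train anything. To connect the splitter to the stated goal of evaluating model performance, here is a minimal sketch that passes a StratifiedKFold object to cross_val_score. The LogisticRegression model and the accuracy metric are assumptions made for illustration and are not part of the original examples:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same synthetic dataset as in the earlier examples.
X, y = make_classification(n_samples=100, n_features=4, n_informative=2,
                           n_redundant=2, random_state=42)

# The splitter decides which rows go to train/test in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score fits a fresh copy of the model on every training fold
# and scores it on the matching test fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")

print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```

Any of the splitter objects shown in this article can be passed as cv in the same way; group-aware splitters additionally need the groups argument of cross_val_score.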

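Group-based split methods

When the data has a group structure (for example users or experimental groups), every group must land whole in a single subset rather than being spread across several. Otherwise, samples from the same group appear in both training and test sets and the evaluation becomes overly optimistic.

Group random split (GroupShuffleSplit)

A group random split randomly assigns whole groups to the training set or the test set, so samples from the same group are never split across subsets.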

## Group-based split methods
from sklearn.model_selection import GroupShuffleSplit

# Group random split: GroupShuffleSplit
# test_size applies to the number of groups, not samples: with 3 groups,
# 0.2 rounds up to one whole group, so the test set is not exactly 20% of the rows.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(df.drop(columns=['Label', 'Group']), df['Label'], groups=df['Group']))

train_data_gss = df.iloc[train_idx]
test_data_gss = df.iloc[test_idx]

print("Group random split result:")
print("Training set size:", train_data_gss.shape) # (64, 7)
print("Test set size:", test_data_gss.shape) # (36, 7)
print("Training set group distribution:", train_data_gss['Group'].value_counts())
# Group
# Group2    34
# Group3    30
# Name: count, dtype: int64
print("Test set group distribution:", test_data_gss['Group'].value_counts())
# Group
# Group1    36
# Name: count, dtype: int64

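Group K-fold cross-validation (GroupKFold)

Group K-fold cross-validation is a repeated split method: it divides the dataset into K subsets and, each time, uses one subset as the test set and the rest as the training set. Unlike plain K-fold, it builds every subset from whole groups.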

from sklearn.model_selection import GroupKFold

# Group K-fold cross-validation: GroupKFold
gkf = GroupKFold(n_splits=3)

for fold, (train_idx, test_idx) in enumerate(gkf.split(df.drop(columns=['Label', 'Group']), df['Label'], groups=df['Group'])):
    train_data_gkf = df.iloc[train_idx]
    test_data_gkf = df.iloc[test_idx]
    print(f"Fold {fold + 1}:")
    print("Training set size:", train_data_gkf.shape)
    print("Test set size:", test_data_gkf.shape)
    print("Training set group distribution:", train_data_gkf['Group'].value_counts())
    print("Test set group distribution:", test_data_gkf['Group'].value_counts())
# Fold 1:
# Training set size: (58, 7)
# Test set size: (42, 7)
# Training set group distribution: Group
# Group3    31
# Group2    27
# Name: count, dtype: int64
# Test set group distribution: Group
# Group1    42
# Name: count, dtype: int64
# Fold 2:
# Training set size: (69, 7)
# Test set size: (31, 7)
# Training set group distribution: Group
# Group1    42
# Group2    27
# Name: count, dtype: int64
# Test set group distribution: Group
# Group3    31
# Name: count, dtype: int64
# Fold 3:
# Training set size: (73, 7)
# Test set size: (27, 7)
# Training set group distribution: Group
# Group1    42
# Group3    31
# Name: count, dtype: int64
# Test set group distribution: Group
# Group2    27
# Name: count, dtype: int64
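The property that matters for group-based splitting is that no group ever appears on both sides of a split. A small sketch that checks this directly, using a deterministic, made-up group vector (40/30/30 samples per group, chosen only for illustration):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical group labels: 40, 30 and 30 samples in three groups.
groups = np.repeat(["Group1", "Group2", "Group3"], [40, 30, 30])
X = np.zeros((len(groups), 4))  # feature values are irrelevant for the split itself

shared_per_fold = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, groups=groups):
    # Groups present in both the training and the test indices of this fold.
    shared = set(groups[train_idx]) & set(groups[test_idx])
    shared_per_fold.append(shared)

print(shared_per_fold)  # [set(), set(), set()]
```

An empty intersection in every fold confirms there is no group leakage; the same check can be reused with GroupShuffleSplit.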

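Stratified group K-fold cross-validation (StratifiedGroupKFold)

Stratified group K-fold cross-validation combines grouping and stratification. It ensures that samples from the same group are never split, and it also tries to keep the target-variable distribution of each subset as close as possible to that of the original dataset. Because whole groups must stay together, the match is usually approximate rather than exact. The class below is a simplified hand-written version; scikit-learn 1.0 and later also provide sklearn.model_selection.StratifiedGroupKFold.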

import numpy as np

# A simplified, hand-written splitter (not the scikit-learn class of the same name)
class StratifiedGroupKFold:
    def __init__(self, n_splits=5):
        self.n_splits = n_splits

    def split(self, X, y, groups):
        y = np.asarray(y)
        groups = np.asarray(groups)
        unique_groups = np.unique(groups)
        group_to_y = {group: y[groups == group] for group in unique_groups}
        group_to_idx = {group: np.where(groups == group)[0] for group in unique_groups}
        # Sort groups by the proportion of positive labels
        sorted_groups = sorted(unique_groups, key=lambda g: np.mean(group_to_y[g]))
        # Split groups into folds (each fold receives whole groups only)
        folds = np.array_split(sorted_groups, self.n_splits)
        for fold in folds:
            test_idx = np.concatenate([group_to_idx[group] for group in fold])
            train_idx = np.setdiff1d(np.arange(len(groups)), test_idx)
            yield train_idx, test_idx

# Stratified group K-fold cross-validation: StratifiedGroupKFold
sgkf = StratifiedGroupKFold(n_splits=3)

for fold, (train_idx, test_idx) in enumerate(sgkf.split(df.drop(columns=['Label', 'Group']), df['Label'], groups=df['Group'])):
    train_data_sgkf = df.iloc[train_idx]
    test_data_sgkf = df.iloc[test_idx]
    print(f"Fold {fold + 1}:")
    print("Training set size:", train_data_sgkf.shape)
    print("Test set size:", test_data_sgkf.shape)
    print("Training set group distribution:", train_data_sgkf['Group'].value_counts())
    print("Test set group distribution:", test_data_sgkf['Group'].value_counts())
    print("Training set label distribution:", train_data_sgkf['Label'].value_counts())
    print("Test set label distribution:", test_data_sgkf['Label'].value_counts())
# Fold 1:
# Training set size: (73, 7)
# Test set size: (27, 7)
# Training set group distribution: Group
# Group1    38
# Group3    35
# Name: count, dtype: int64
# Test set group distribution: Group
# Group2    27
# Name: count, dtype: int64
# Training set label distribution: Label
# 1    38
# 0    35
# Name: count, dtype: int64
# Test set label distribution: Label
# 0    15
# 1    12
# Name: count, dtype: int64
# Fold 2:
# Training set size: (65, 7)
# Test set size: (35, 7)
# Training set group distribution: Group
# Group1    38
# Group2    27
# Name: count, dtype: int64
# Test set group distribution: Group
# Group3    35
# Name: count, dtype: int64
# Training set label distribution: Label
# 1    34
# 0    31
# Name: count, dtype: int64
# Test set label distribution: Label
# 0    19
# 1    16
# Name: count, dtype: int64
# Fold 3:
# Training set size: (62, 7)
# Test set size: (38, 7)
# Training set group distribution: Group
# Group3    35
# Group2    27
# Name: count, dtype: int64
# Test set group distribution: Group
# Group1    38
# Name: count, dtype: int64
# Training set label distribution: Label
# 0    34
# 1    28
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    22
# 0    16
# Name: count, dtype: int64

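Time-series split methods

Time-series data cannot simply be split at random, because the observations depend on their order in time. The data is usually divided chronologically into training, validation, and test sets, so that everything in the training set is earlier than the validation and test sets and the model never sees the future during training.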

## Time-series split methods
# Split in chronological order
df_sorted = df.sort_values(by='Timestamp')
train_size = int(len(df_sorted) * 0.8)
train_data = df_sorted.iloc[:train_size]
test_data = df_sorted.iloc[train_size:]

print("Chronological split result:")
print("Training set time range:", train_data['Timestamp'].min(), "to", train_data['Timestamp'].max()) # 2025-01-01 00:00:00 to 2025-03-21 00:00:00
print("Test set time range:", test_data['Timestamp'].min(), "to", test_data['Timestamp'].max()) # 2025-03-22 00:00:00 to 2025-04-10 00:00:00
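The manual cut above produces a single chronological split. For cross-validation on ordered data, scikit-learn provides TimeSeriesSplit, which yields expanding training windows that are each followed by a later test window. A minimal sketch, assuming 100 samples that are already sorted by time (the data here is a stand-in, not the DataFrame from the example):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in for 100 observations already sorted chronologically.
X = np.arange(100).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
splits = list(tscv.split(X))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    # Every training index precedes every test index in the same fold.
    print(f"Fold {fold}: train {train_idx[0]}..{train_idx[-1]}, "
          f"test {test_idx[0]}..{test_idx[-1]}")
```

Unlike KFold, the training set grows from fold to fold and the folds are never shuffled, which keeps future observations out of training.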

Custom split methods

In some cases the split has to follow specific business logic or properties of the data, and a custom rule is needed.

## Custom split
# Split by the value of the Group column
train_data = df[df['Group'] == 'Group1']
val_data = df[df['Group'] == 'Group2']
test_data = df[df['Group'] == 'Group3']

# Print the split result
print("Training set size:", train_data.shape)  # (40, 7)
print("Validation set size:", val_data.shape)  # (24, 7)
print("Test set size:", test_data.shape)  # (36, 7)

print("\nTraining set group distribution:")
print(train_data['Group'].value_counts())
# Group
# Group1    40
# Name: count, dtype: int64

print("\nValidation set group distribution:")
print(val_data['Group'].value_counts())
# Group
# Group2    24
# Name: count, dtype: int64

print("\nTest set group distribution:")
print(test_data['Group'].value_counts())
# Group
# Group3    36
# Name: count, dtype: int64
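If a custom train/validation assignment like this needs to be handed to scikit-learn tools that expect a splitter (for example the cv argument of a search or scoring function), PredefinedSplit can wrap it. The sketch below rebuilds a group vector with the same sizes as the output above (40/24/36, used here only as illustrative numbers) and treats Group3 as a held-out test set that stays outside the splitter:

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Hypothetical group labels with the sizes shown in the example output.
groups = np.repeat(["Group1", "Group2", "Group3"], [40, 24, 36])

# Group3 is the final test set and is kept out of the train/validation splitter.
is_test = groups == "Group3"
trainval_groups = groups[~is_test]

# -1 marks rows that are always used for training; 0 marks rows of validation fold 0.
test_fold = np.where(trainval_groups == "Group2", 0, -1)

ps = PredefinedSplit(test_fold)
train_idx, val_idx = next(ps.split())

print(len(train_idx), len(val_idx), int(is_test.sum()))  # 40 24 36
```

The indices returned here refer to positions within the train/validation subset, not within the full dataset.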