About the Authors
杜嘉寶 (Du Jiabao), male, School of Electronic Information, Xi'an Polytechnic University, graduate student (class of 2024)
Research interests: transformer fault early warning and detection
Email: djb857497378@gmail.com
王子謙 (Wang Ziqian), male, School of Electronic Information, Xi'an Polytechnic University, graduate student (class of 2024), member of 張宏偉's (Zhang Hongwei's) artificial intelligence research group
Research interests: machine vision and artificial intelligence
Email: 1523018430@qq.com
In this article I will share how to classify heart disease data with the support vector machine (SVM) algorithm. The full pipeline covers data loading, preprocessing, SMOTE oversampling, PCA dimensionality reduction, hyperparameter tuning, and the Grey Wolf Optimizer. By the end, you should have a clear picture of how combining these techniques yields better classification results.
1. Preparing the packages
First, install the required Python libraries. The main ones are:
pip install numpy pandas scikit-learn imbalanced-learn matplotlib
These libraries provide the SVM, SMOTE oversampling, PCA, and the other data-processing tools used throughout this article.
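To confirm the installation succeeded, you can import each package and print its version (a quick sanity check only, not part of the pipeline):
import numpy, pandas, sklearn, imblearn, matplotlib

for pkg in (numpy, pandas, sklearn, imblearn, matplotlib):
    print(pkg.__name__, pkg.__version__)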
2. The dataset
We use the UCI Heart Disease dataset (Cleveland), which contains features related to heart disease such as age, sex, blood pressure, and cholesterol level. The target variable target has five classes (0 to 4), indicating increasing degrees of heart disease.
After loading the data, we split it into a feature matrix (X) and a target vector (y) and clean it by replacing the missing-value marker ? with 0.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
features, target = load_data(url, columns)
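To see the class imbalance that motivates the SMOTE step later, print the label counts right after loading (this mirrors the check in load_data.py's __main__ block in Section 9):
import numpy as np

labels, count = np.unique(target, return_counts=True)
print('labels', labels, ' ', 'count:', count)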
3. The SVM algorithm
The support vector machine (SVM) is a powerful classification algorithm that is particularly well suited to high-dimensional data. It looks for a hyperplane that separates the classes while maximizing the margin between them. In this article we use SVC (support vector classification) to classify the heart disease data.
from sklearn.svm import SVC
svm_clf = SVC(kernel='rbf', C=1.0, gamma=0.1)
svm_clf.fit(X_train, y_train)
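Once fitted, the model can be evaluated on the held-out test set. A minimal check, assuming X_test and y_test come from the train/test split shown in Section 8:
accuracy = svm_clf.score(X_test, y_test)
print('Test-set accuracy:', accuracy)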
4. The SMOTE algorithm
The heart disease dataset suffers from a pronounced class imbalance. To address it, we use SMOTE (Synthetic Minority Over-sampling Technique), which balances the classes by generating synthetic samples of the minority classes.
from load_data import smote  # thin wrapper around imblearn's SMOTE, defined in load_data.py below
X_resampled, y_resampled = smote(X_train, y_train, sampling_strategy={2: 55, 3: 55, 4: 55})
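You can verify the effect by comparing the label counts before and after resampling. A minimal sketch calling imblearn's SMOTE class directly, with the same settings as the smote() helper:
import numpy as np
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=1, sampling_strategy={2: 55, 3: 55, 4: 55}, k_neighbors=1)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)
print('before:', np.unique(y_train, return_counts=True))
print('after: ', np.unique(y_resampled, return_counts=True))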
5. Grid search
To find good SVM hyperparameters, we use GridSearchCV: a grid search over combinations of C and gamma values that reports the best-scoring combination.
from utils import gridsearch  # helper wrapping GridSearchCV, defined in utils.py below
gridsearch(X_train, y_train, X_test, y_test)
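For reference, the core of that helper is a plain GridSearchCV call. A self-contained sketch using the same grid as utils.py (assuming X_train, y_train, X_test, y_test are already defined):
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': np.linspace(0.1, 100, 100),       # C and gamma must be > 0
              'gamma': np.linspace(0.1, 100, 100),
              'kernel': ['rbf', 'poly']}
grid_search = GridSearchCV(SVC(), param_grid, n_jobs=-1, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print('Best parameters:', grid_search.best_params_)
print('Test-set accuracy:', grid_search.score(X_test, y_test))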
6. The PCA algorithm
Principal component analysis (PCA) is a standard dimensionality-reduction method that reduces the number of features while preserving most of the information in the data. In this task we keep enough principal components to retain 95% of the variance.
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
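After fitting, it is worth checking how many components were kept and how much variance each one explains:
print('components kept:', pca.n_components_)
print('explained variance ratio:', pca.explained_variance_ratio_)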
7. The Grey Wolf Optimizer
The Grey Wolf Optimizer (GWO) is a metaheuristic that mimics the hunting behavior and leadership hierarchy of grey wolf packs. Here we use it to search for the best SVM hyperparameters C and gamma.
def grey_wolf_optimizer(lb, ub, n_wolves, max_iter, dim, x_train, y_train, x_test, y_test):
    ...  # full implementation in Section 9
    return alpha_pos, alpha_score
By simulating the leadership behavior of a grey wolf pack, GWO helps us find the optimal solution in the search space.
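To build intuition before applying GWO to the SVM, here is a minimal sketch that minimizes a toy sphere function f(x) = sum(x^2); the update rules are the same ones used in grey_wolf_optimizer in Section 9, while the pack size, bounds, and iteration count here are illustrative only:
import numpy as np

def gwo_sphere_demo(n_wolves=20, max_iter=50, dim=2, lb=-10, ub=10):
    f = lambda x: np.sum(x ** 2)  # toy objective: global minimum 0 at the origin
    wolves = np.random.uniform(lb, ub, (n_wolves, dim))
    alpha = beta = delta = np.zeros(dim)
    fa = fb = fd = float('inf')
    for t in range(max_iter):
        # rank the pack: alpha, beta, delta are the three best wolves found so far
        for i in range(n_wolves):
            wolves[i] = np.clip(wolves[i], lb, ub)  # constrain to the search bounds
            fit = f(wolves[i])
            if fit < fa:
                fa, fb, fd = fit, fa, fb
                alpha, beta, delta = wolves[i].copy(), alpha, beta
            elif fit < fb:
                fb, fd = fit, fb
                beta, delta = wolves[i].copy(), beta
            elif fit < fd:
                fd, delta = fit, wolves[i].copy()
        a = 2 - t * (2 / max_iter)  # a decays linearly from 2 to 0
        # move every wolf toward the three leaders
        for i in range(n_wolves):
            new_pos = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = np.random.rand(dim), np.random.rand(dim)
                A, C = 2 * a * r1 - a, 2 * r2
                new_pos += leader - A * np.abs(C * leader - wolves[i])
            wolves[i] = new_pos / 3  # average of the three guided moves
    return alpha, fa

print(gwo_sphere_demo())  # should print a point near the origin and a fitness near 0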
8. Implementation workflow
8.1 Data loading and preprocessing
First, load the dataset and clean it. Then apply SMOTE to the training split to fix the class imbalance.
(X_train, X_test, y_train, y_test) = train_test_split(features, target, test_size=0.3, random_state=1, stratify=target)
X_train, y_train = smote(X_train, y_train, sampling_strategy={2: 38, 3: 38, 4: 38})
8.2 Standardization and PCA
Standardize the features with StandardScaler (fitting on the training set only), then reduce the dimensionality with PCA.
X_train, X_test = scaler(X_train, X_test)
pca = PCA(n_components=0.95)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
8.3 Model training and optimization
Train the SVM and tune its hyperparameters with the Grey Wolf Optimizer.
# Arguments: lb=0, ub=100, n_wolves=10000, max_iter=100, dim=2 (the two variables are C and gamma)
print(grey_wolf_optimizer(0, 100, 10000, 100, 2, X_train, y_train, X_test, y_test))
8.4 Visualization
Use Matplotlib to visualize the distribution of the data in the first three principal components after PCA.
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_train[:, 0], X_train[:, 1], X_train[:, 2], c=y_train, cmap='viridis', alpha=0.7)
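To make the figure readable, add the same axis labels and title used in the full script in Section 9:
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_zlabel("Principal Component 3")
ax.set_title("PCA 3D Scatter Plot")
plt.show()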
9. Full source code
utils.py
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA


def scaler(x_train, x_test):
    standard_transform = StandardScaler()
    # Fit the scaler on the training set only, then apply the same transform to the test set
    x_train_scaled = standard_transform.fit_transform(x_train)
    x_test_scaled = standard_transform.transform(x_test)
    return x_train_scaled, x_test_scaled


def gridsearch(x_train, y_train, x_test, y_test):
    # C and gamma must be strictly positive, so start the grids just above 0
    C = np.linspace(0.1, 100, 100)
    gamma = np.linspace(0.1, 100, 100)
    param_grid = {'C': C, 'gamma': gamma, 'kernel': ['rbf', 'poly']}
    svm_clf = SVC()
    grid_search = GridSearchCV(estimator=svm_clf, param_grid=param_grid,
                               n_jobs=-1, cv=5, scoring='accuracy')
    grid_search.fit(x_train, y_train)
    print('Best parameters in the grid:', grid_search.best_params_)
    print('Test-set accuracy:', grid_search.score(x_test, y_test))


def apply_pca(data, n_components):
    # Standardize the data to zero mean and unit variance
    pca_scaler = StandardScaler()
    scaled_data = pca_scaler.fit_transform(data)
    # Reduce dimensionality with PCA
    pca = PCA(n_components=n_components)
    reduced_data = pca.fit_transform(scaled_data)
    # Explained variance ratio of the retained components
    explained_variance = pca.explained_variance_ratio_
    return reduced_data, explained_variance


def grey_wolf_optimizer(lb, ub, n_wolves, max_iter, dim, x_train, y_train, x_test, y_test):
    # Objective: classification error of an RBF SVM with the candidate C and gamma
    def objective_function(C, gamma):
        clf = SVC(kernel='rbf', C=C, gamma=gamma, random_state=1)
        clf.fit(x_train, y_train)
        return 1 - clf.score(x_test, y_test)

    # Initialize the wolf pack
    wolves = np.random.uniform(lb, ub, (n_wolves, dim))
    # Initialize alpha, beta, delta positions and fitness values
    alpha_pos = np.zeros(dim)
    alpha_score = float('inf')
    beta_pos = np.zeros(dim)
    beta_score = float('inf')
    delta_pos = np.zeros(dim)
    delta_score = float('inf')

    # Iterative optimization
    for t in range(max_iter):
        # Evaluate the fitness of every wolf in the pack
        for i in range(n_wolves):
            wolves[i, :] = np.clip(wolves[i, :], lb, ub)  # constrain to the search bounds
            C = max(float(wolves[i, 0]), 1e-3)
            gamma = max(float(wolves[i, 1]), 1e-3)
            fitness = objective_function(C, gamma)
            # Update alpha, beta, delta
            if fitness < alpha_score:
                delta_score, delta_pos = beta_score, beta_pos.copy()
                beta_score, beta_pos = alpha_score, alpha_pos.copy()
                alpha_score, alpha_pos = fitness, wolves[i, :].copy()
            elif fitness < beta_score:
                delta_score, delta_pos = beta_score, beta_pos.copy()
                beta_score, beta_pos = fitness, wolves[i, :].copy()
            elif fitness < delta_score:
                delta_score, delta_pos = fitness, wolves[i, :].copy()

        # Coefficient a decays linearly from 2 to 0
        a = 2 - t * (2 / max_iter)

        # Move every wolf toward alpha, beta and delta
        for i in range(n_wolves):
            r1, r2 = np.random.rand(dim), np.random.rand(dim)
            A1 = 2 * a * r1 - a
            C1 = 2 * r2
            D_alpha = abs(C1 * alpha_pos - wolves[i, :])
            X1 = alpha_pos - A1 * D_alpha

            r1, r2 = np.random.rand(dim), np.random.rand(dim)
            A2 = 2 * a * r1 - a
            C2 = 2 * r2
            D_beta = abs(C2 * beta_pos - wolves[i, :])
            X2 = beta_pos - A2 * D_beta

            r1, r2 = np.random.rand(dim), np.random.rand(dim)
            A3 = 2 * a * r1 - a
            C3 = 2 * r2
            D_delta = abs(C3 * delta_pos - wolves[i, :])
            X3 = delta_pos - A3 * D_delta

            # New position is the average of the three guided moves
            wolves[i, :] = (X1 + X2 + X3) / 3

        print(f"Iteration {t+1}: Best C={alpha_pos[0]}, Best gamma={alpha_pos[1]}, Best fitness={1-alpha_score}")

    return alpha_pos, alpha_score
load_data.py
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE


def load_data(url, columns):
    # Read the data
    df = pd.read_csv(url, names=columns)
    # Replace the missing-value marker '?' with 0
    df_cleaned = df.replace('?', 0)
    X = df_cleaned.iloc[:, :-1].astype(float)  # features; cast to float because '?' left some columns as strings
    y = df_cleaned.iloc[:, -1]                 # target
    return X, y


def smote(x, y, sampling_strategy, random_state=1, k_neighbors=1):
    smote_sampler = SMOTE(random_state=random_state, sampling_strategy=sampling_strategy, k_neighbors=k_neighbors)
    x_resampled, y_resampled = smote_sampler.fit_resample(x, y)
    return x_resampled, y_resampled


if __name__ == '__main__':
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
    columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
               "exang", "oldpeak", "slope", "ca", "thal", "target"]
    features, target = load_data(url, columns)
    labels, count = np.unique(target, return_counts=True)
    print('labels', labels, ' ', 'count:', count)
    sampling_strategy = {2: 55, 3: 55, 4: 55}
    # Use a distinct name so the smote() helper above is not shadowed
    smote_sampler = SMOTE(random_state=42, sampling_strategy=sampling_strategy, k_neighbors=1)
    features_resampled, target_resampled = smote_sampler.fit_resample(features, target)
    labels, count = np.unique(target_resampled, return_counts=True)
    print('labels', labels, ' ', 'count:', count)

train.py
from sklearn.model_selection import train_test_split
from load_data import load_data, smote
from utils import scaler, gridsearch, grey_wolf_optimizer
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
features, target = load_data(url, columns)
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=1, stratify=target)
labels, count = np.unique(y_train, return_counts=True)
print('labels', labels, ' ', 'count:', count)
sampling_strategy = {2: 38, 3: 38, 4: 38}
X_train, y_train = smote(X_train, y_train, sampling_strategy)
labels, count = np.unique(y_train, return_counts=True)
print('labels', labels, ' ', 'count:', count)
X_train, X_test = scaler(X_train, X_test)
pca = PCA(n_components=0.95)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_train[:, 0], X_train[:, 1], X_train[:, 2], c=y_train, cmap='viridis', alpha=0.7)
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_zlabel("Principal Component 3")
ax.set_title("PCA 3D Scatter Plot")
plt.show()
print(grey_wolf_optimizer(0, 100, 10000, 100, 2, X_train, y_train, X_test, y_test))