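[Machine Learning] Breaking Through Classification Bottlenecks: Unlocking Multi-Class Classification with Logistic Regression and Softmax Regression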


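Table of Contents

🍋1. Introduction
🍋2. Logistic Regression
- 🍋Code Implementation
🍋3. Softmax Regression
- 🍋Code Implementation
🍋4. Ensemble Learning
🍋5. Class Imbalance
- 🍋Code Example (Resampling)
🍋6. Logistic Regression Case Study (Iris Dataset)
- 🍋Evaluation Results
🍋Conclusion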

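🍋1. Introduction

Classification is one of the most common tasks in machine learning. Many algorithms handle binary and multi-class problems; logistic regression, softmax regression, and ensemble methods are among the most widely used in practice. Real-world data is often class-imbalanced, though, which can hurt model performance, and handling that imbalance effectively remains a practical challenge.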

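🍋2. Logistic Regression

Overview: Logistic regression is a classic linear classifier for binary classification. It uses the features of the training data to estimate the probability that a sample belongs to a given class.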

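How it works: At its core, logistic regression passes a linear combination of the inputs through the sigmoid function, which maps the result into the interval (0, 1) so that it can be read as a probability for binary classification:

$$P(y = 1 \mid X) = \sigma(w^\top X + b) = \frac{1}{1 + e^{-(w^\top X + b)}}$$
where $w$ is the weight vector, $b$ is the bias, and $X$ is the input feature vector.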
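
To make the mapping concrete, here is a minimal NumPy sketch of the sigmoid step. The weight vector `w`, bias `b`, and sample `x` are made-up values chosen for illustration, not parameters from a trained model.

```python
import numpy as np

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and one input sample (illustrative values only)
w = np.array([0.5, -1.0])
b = 0.25
x = np.array([2.0, 1.0])

z = np.dot(w, x) + b      # linear combination w^T x + b
p = sigmoid(z)            # estimated P(y = 1 | x)
label = int(p >= 0.5)     # threshold at 0.5 to turn the probability into a class
print(z, p, label)
```

The 0.5 threshold is only the usual default; in practice it can be moved when false positives and false negatives have different costs.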

🍋Code Implementation

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=42)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate
print(classification_report(y_test, y_pred))

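Pros and cons:

Pros: The model is simple, cheap to compute, and easy to interpret.
Cons: It performs poorly on non-linear problems and is sensitive to outliers.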

🍋3. Softmax Regression

Overview: Softmax regression extends logistic regression to multi-class problems. It maps linear combinations of the inputs to a probability for each class.

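How it works: The softmax function generalizes the sigmoid used in logistic regression. For $K$ classes:

$$P(y = k \mid X) = \frac{e^{w_k^\top X + b_k}}{\sum_{j=1}^{K} e^{w_j^\top X + b_j}}$$

where $w_k$ is the weight vector for class $k$ and $b_k$ is its bias.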
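
As a small illustration of the softmax step, the sketch below turns a vector of class scores into probabilities. The scores are made-up values standing in for $w_k^\top X + b_k$; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    """Turn a vector of class scores into probabilities that sum to 1."""
    z = z - np.max(z)          # shift for numerical stability (result is unchanged)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical scores w_k^T x + b_k for three classes (illustrative values only)
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum(), probs.argmax())

# With two classes and scores [z, 0], softmax reduces to the sigmoid of z
two_class = softmax(np.array([0.25, 0.0]))
print(two_class[0], 1.0 / (1.0 + np.exp(-0.25)))
```

The last two lines show why softmax regression is described as an extension of logistic regression: the two-class case gives back the sigmoid.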

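🍋Code Implementation

from sklearn.linear_model import LogisticRegression

# Softmax regression for multi-class targets. Recent scikit-learn versions fit the
# multinomial (softmax) model by default with lbfgs; multi_class='multinomial' is deprecated.
# X_train/y_train are reused from above (generate them with n_classes=3 for a multi-class run).
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))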
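
To connect the formula to the library, the following sketch fits a three-class model and checks whether `predict_proba` matches a softmax applied by hand to `decision_function`. The dataset parameters and the names `X_mc`, `y_mc`, and `clf` are illustrative choices, and the comparison assumes a recent scikit-learn version in which multi-class logistic regression uses the multinomial formulation by default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic three-class data (parameters chosen for illustration)
X_mc, y_mc = make_classification(n_samples=1000, n_features=20, n_informative=10,
                                 n_classes=3, random_state=42)
X_mc_train, X_mc_test, y_mc_train, y_mc_test = train_test_split(
    X_mc, y_mc, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_mc_train, y_mc_train)

scores = clf.decision_function(X_mc_test)   # one score per class: w_k^T x + b_k
proba = clf.predict_proba(X_mc_test)        # one probability per class

# Apply softmax row by row to the raw scores
shifted = scores - scores.max(axis=1, keepdims=True)
manual = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print("matches hand-written softmax:", np.allclose(proba, manual))
print("first row sums to:", proba[0].sum())
```

If the two disagree, the installed version is probably using a one-vs-rest scheme instead of the multinomial one.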

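Pros and cons:

Pros: Handles multi-class problems and works well when the classes are linearly separable.
Cons: Computational cost grows when the number of classes is large.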

🍋4. Ensemble Learning

Overview: Ensemble learning improves model performance by combining multiple weak learners. Common approaches are Bagging, Boosting, and Stacking.

from sklearn.ensemble import RandomForestClassifier

# Random forest classifier (a Bagging-style ensemble of decision trees)
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
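
The random forest above covers only the Bagging family. As a rough sketch of how the three approaches named in the overview look in scikit-learn, the example below fits one model of each kind on synthetic data. The choice of base learners, the hyperparameters, and the variable names are assumptions made for illustration, not a recommended configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_ens, y_ens = make_classification(n_samples=500, n_features=20,
                                   n_informative=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_ens, y_ens, test_size=0.3, random_state=42)

models = {
    # Bagging: many trees trained on bootstrap samples, predictions combined by voting
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42),
    # Boosting: trees added one at a time, each correcting the previous ones
    "boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
    # Stacking: a final model learns how to combine the base models' predictions
    "stacking": StackingClassifier(
        estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
                    ("tree", DecisionTreeClassifier(max_depth=3, random_state=42))],
        final_estimator=LogisticRegression(max_iter=1000)),
}

ensemble_scores = {}
for name, ensemble in models.items():
    ensemble.fit(X_tr, y_tr)
    ensemble_scores[name] = ensemble.score(X_te, y_te)   # test-set accuracy
    print(name, round(ensemble_scores[name], 3))
```

Which family works best depends on the data, so the printed accuracies are a starting point for comparison and not a ranking.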

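🍋5. Class Imbalance

Overview: Many real-world classification tasks are class-imbalanced: one class has far fewer samples than the others. The model then tends to predict the majority class, which lowers its overall performance.

Solutions:

Resampling: Balance the dataset by adding minority-class samples or removing majority-class samples.
Weighted loss function: Penalize the model's errors on minority-class samples more heavily.
Ensemble methods: For example, combine SMOTE with Boosting to improve predictions on the minority class.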
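
The weighted-loss option can be sketched with scikit-learn's `class_weight="balanced"` setting, which weights each class inversely to its frequency. The 90/10 class split and the other dataset parameters below are assumptions chosen to make the imbalance visible.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 90% class 0 and 10% class 1 (illustrative values)
X_imb, y_imb = make_classification(n_samples=2000, n_features=20, n_informative=10,
                                   weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.3, stratify=y_imb, random_state=42)

# Same model twice: once with the default loss, once with class-weighted loss
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Recall on the minority class (label 1) is the number to watch
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print("minority recall, default loss: ", recall_plain)
print("minority recall, weighted loss:", recall_weighted)
```

Higher minority recall usually comes at the cost of more false positives, so precision is worth checking alongside it.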

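🍋Code Example (Resampling)

from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Oversample the minority class (training data only)
smote = SMOTE()
X_res, y_res = smote.fit_resample(X_train, y_train)

# Train the model on the resampled data
model = LogisticRegression()
model.fit(X_res, y_res)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))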
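
If imbalanced-learn is not installed, plain random oversampling gives a feel for what resampling does. The sketch below is a simpler stand-in for SMOTE, not an equivalent: it duplicates existing minority rows with `sklearn.utils.resample` instead of synthesizing new ones. The dataset and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

# Synthetic imbalanced data: roughly 90% class 0 and 10% class 1 (illustrative values)
X_imb, y_imb = make_classification(n_samples=1000, n_features=20, n_informative=10,
                                   weights=[0.9, 0.1], random_state=42)

X_min, y_min = X_imb[y_imb == 1], y_imb[y_imb == 1]   # minority class
X_maj, y_maj = X_imb[y_imb == 0], y_imb[y_imb == 0]   # majority class

# Draw minority rows with replacement until both classes have the same count
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=42)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
print("before:", np.bincount(y_imb))
print("after: ", np.bincount(y_bal))
```

As with SMOTE, resampling should be applied to the training split only, so that the test set keeps the real class distribution.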

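🍋6. Logistic Regression Case Study (Iris Dataset)

We will work through the following steps:

Load the data: Use the Iris dataset.
Preprocess the data: Split the data and standardize the features.
Train the model: Fit a logistic regression model.
Evaluate the model: Assess performance with a confusion matrix, accuracy, and related metrics.

The Iris dataset contains 150 samples from 3 species of iris (Setosa, Versicolor, Virginica). Each sample has 4 features: sepal length, sepal width, petal length, and petal width. The goal is to predict the species from these features.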

# Import the required libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data    # features
y = iris.target  # labels

# 2. Split the data: 80% training, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Standardize the features: logistic regression is fairly sensitive to feature scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4. Train the logistic regression model
logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train, y_train)

# 5. Predict on the test set
y_pred = logreg.predict(X_test)

# 6. Evaluate the model
print("Classification report:")
print(classification_report(y_test, y_pred))

# 7. Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(cm)

# Visualize the confusion matrix
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
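
Accuracy and per-class recall can also be read straight off the confusion matrix. The sketch below repeats the same split and model, wrapped in a `Pipeline` so that scaling is fit on the training data only, and compares `accuracy_score` with the value computed from the matrix. The names `pipe`, `cm_iris`, and `y_hat` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris_data = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_data.data, iris_data.target, test_size=0.2, random_state=42)

# Scaling and the classifier in one object; fit() scales using training data only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipe.fit(X_tr, y_tr)
y_hat = pipe.predict(X_te)

cm_iris = confusion_matrix(y_te, y_hat)      # rows: true class, columns: predicted class
acc = accuracy_score(y_te, y_hat)
acc_from_cm = np.trace(cm_iris) / cm_iris.sum()              # correct / all predictions
recall_per_class = np.diag(cm_iris) / cm_iris.sum(axis=1)    # correct / true count per class

print("accuracy:", acc)
print("accuracy from confusion matrix:", acc_from_cm)
print("recall per class:", dict(zip(iris_data.target_names, recall_per_class)))
```

The diagonal of the matrix holds the correct predictions, so its sum divided by the total gives the same accuracy that `accuracy_score` reports.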