Decision Tree / SVM / KNN Compared × Model Evaluation Metrics Explained
What you will learn: the complete classic machine-learning workflow
When 80% of machine-learning problems can be solved with Scikit-learn, mastering its core workflow becomes a core competitive skill. This article uses comparative experiments to reveal how these algorithms behave and walks you through the entire workflow end to end.
1. Scikit-learn at a Glance: Three Core Modules
1.1 Algorithm Selection Matrix
1.2 Quick Environment Setup
# Create a dedicated environment
conda create -n sklearn_env python=3.10
conda activate sklearn_env

# Install the core libraries
pip install numpy pandas matplotlib seaborn scikit-learn

# Verify the installation (in a Python session)
import sklearn
print(f"Scikit-learn version: {sklearn.__version__}")
2. Classification in Practice: Iris Recognition
2.1 Data Exploration and Preprocessing
from sklearn.datasets import load_iris
import pandas as pd
import seaborn as sns

# Load the dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Data overview
print(f"Samples: {df.shape[0]}")
print(f"Features: {df.shape[1] - 1}")
print(f"Class distribution:\n{df['target'].value_counts()}")

# Visual analysis
sns.pairplot(df, hue='target', palette='viridis')
2.2 Comparing Three Classifiers
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Split the dataset
X = df.drop(columns='target')
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Initialize the classifiers
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=3),
    "SVM": SVC(kernel='rbf', probability=True),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Train and evaluate
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = model.predict(X_test)
2.3 Visualizing the Classification Results
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Plot the confusion matrices
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for i, (name, y_pred) in enumerate(results.items()):
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[i])
    axes[i].set_title(f"{name} confusion matrix")
plt.show()
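Beyond confusion matrices, `classification_report` prints a per-class precision/recall/F1 table in one call. The sketch below is self-contained: it reloads iris and reuses the same 80/20 split with `random_state=42`; the `random_state=0` on the tree is an addition for reproducibility, not part of the experiment above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Per-class precision, recall, F1, and support in one table
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test),
                            target_names=iris.target_names))
```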
3. Regression in Practice: Boston Housing Prices
3.1 Data Analysis and Feature Engineering
from sklearn.datasets import fetch_openml

# Load the dataset (the Boston dataset was removed from
# sklearn.datasets in 1.2; fetch it from OpenML instead)
boston = fetch_openml(name='boston', version=1, as_frame=True)
df = boston.data.astype(float)  # CHAS/RAD come back as categorical
df['PRICE'] = boston.target.astype(float)

# Key feature correlations
corr = df.corr()['PRICE'].sort_values(ascending=False)
print(f"Features most correlated with price:\n{corr.head(5)}")

# Feature engineering: create a new ratio feature
df['RM_LSTAT'] = df['RM'] / df['LSTAT']
3.2 Comparing Regression Models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

# Split the dataset
X = df.drop(columns='PRICE')
y = df['PRICE']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Initialize the regressors
regressors = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regressor": DecisionTreeRegressor(max_depth=5),
    "Support Vector Regression": SVR(kernel='rbf'),
}

# Train and predict
predictions = {}
for name, reg in regressors.items():
    reg.fit(X_train, y_train)
    predictions[name] = reg.predict(X_test)
3.3 Visualizing the Regression Results
# Plot predicted vs. actual values
plt.figure(figsize=(15, 10))
for i, (name, pred) in enumerate(predictions.items(), 1):
    plt.subplot(3, 1, i)
    plt.scatter(y_test, pred, alpha=0.7)
    plt.plot([y_test.min(), y_test.max()],
             [y_test.min(), y_test.max()], 'r--')
    plt.xlabel('Actual price')
    plt.ylabel('Predicted price')
    plt.title(f'{name} predictions')
plt.tight_layout()
plt.show()
4. Model Evaluation Metrics in Depth
4.1 Four Classification Metrics
Evaluating the iris classifiers:
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

metrics = []
for name, y_pred in results.items():
    metrics.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, average='macro'),
        "Recall": recall_score(y_test, y_pred, average='macro'),
        "F1": f1_score(y_test, y_pred, average='macro'),
    })
metrics_df = pd.DataFrame(metrics)
print(metrics_df)
4.2 Three Regression Metrics
Evaluating the Boston housing regressors:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

reg_metrics = []
for name, pred in predictions.items():
    reg_metrics.append({
        "Model": name,
        "MSE": mean_squared_error(y_test, pred),
        "MAE": mean_absolute_error(y_test, pred),
        "R2": r2_score(y_test, pred),
    })
reg_metrics_df = pd.DataFrame(reg_metrics)
print(reg_metrics_df)
5. How the Algorithms Differ
5.1 Decision Trees: King of Interpretability
Tuning the core parameters:
from sklearn.model_selection import GridSearchCV

params = {
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy'],
}
best_tree = GridSearchCV(
    DecisionTreeClassifier(),
    param_grid=params,
    cv=5,
    scoring='f1_macro',
)
best_tree.fit(X_train, y_train)
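To see why decision trees earn the interpretability title, the learned rules can be dumped as plain text with `export_text`. A self-contained sketch on the full iris data (the `random_state=0` is added only for reproducibility):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Prints the tree as nested if/else rules over feature thresholds,
# ending in a predicted class at each leaf
print(export_text(tree, feature_names=list(iris.feature_names)))
```

Neither SVMs nor KNN can produce anything this directly human-readable.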
5.2 SVM: Master of High-Dimensional Separation
Choosing a kernel:
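A practical way to choose a kernel is simply to cross-validate each candidate. A minimal sketch on iris, scaling features first since SVMs are scale-sensitive:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    # Scale inside the pipeline so each CV fold is scaled independently
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{kernel:>8}: mean accuracy {scores.mean():.3f}")
```

As a rule of thumb, 'linear' suits high-dimensional or near-linearly-separable data, 'rbf' is the general-purpose default, and 'poly'/'sigmoid' usually need careful hyperparameter tuning to be competitive.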
5.3 KNN: Simple, Effective Lazy Learning
Comparing distance metrics:
distance_metrics = [
    ('euclidean', 'Euclidean distance'),
    ('manhattan', 'Manhattan distance'),
    ('cosine', 'Cosine similarity'),
]
for metric, name in distance_metrics:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(X_train, y_train)
    score = knn.score(X_test, y_test)
    print(f"{name} accuracy: {score:.4f}")
6. Practical Model Optimization
6.1 Feature Engineering: The Key to Performance
Optimizing the Boston housing features:
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Re-split and retrain on the expanded features
X_train_poly, X_test_poly, y_train, y_test = train_test_split(
    X_poly, y, test_size=0.2, random_state=42)
lr_poly = LinearRegression()
lr_poly.fit(X_train_poly, y_train)
r2 = lr_poly.score(X_test_poly, y_test)
print(f"R2 improvement: {reg_metrics_df.loc[0, 'R2']:.2f} → {r2:.2f}")
6.2 Cross-Validation: Guarding Against Overfitting
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
scores = cross_val_score(SVC(), X, y, cv=5, scoring='accuracy')
print(f"Mean accuracy: {scores.mean():.4f} (±{scores.std():.4f})")
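To actually see overfitting, compare train and validation scores side by side with `cross_validate`; a large gap is the warning sign. A minimal sketch using an unconstrained tree on iris:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = cross_validate(DecisionTreeClassifier(random_state=0),
                    X, y, cv=5, return_train_score=True)

# A perfect training score paired with a lower validation score
# is the classic signature of overfitting
print(f"train: {cv['train_score'].mean():.3f}")
print(f"test:  {cv['test_score'].mean():.3f}")
```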
6.3 Grid Search: Automated Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf', 'linear'],
}

# Run the search
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
grid.fit(X_train, y_train)
print(f"Best parameters: {grid.best_params_}")
7. Deployment for Production
7.1 Model Persistence
import joblib

# Save a trained model (here, the SVM fitted in Section 2.2)
joblib.dump(models["SVM"], 'iris_classifier.pkl')

# Load the model
clf = joblib.load('iris_classifier.pkl')

# Online prediction
new_data = [[5.1, 3.5, 1.4, 0.2]]
prediction = clf.predict(new_data)
print(f"Predicted class: {iris.target_names[prediction[0]]}")
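In practice it pays to persist the whole preprocessing-plus-model pipeline rather than a bare estimator, so scaling statistics travel with the model. A self-contained sketch (the filename is illustrative):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC()).fit(X, y)

# One artifact holds both the fitted scaler and the fitted SVM
joblib.dump(pipe, 'iris_pipeline.pkl')
restored = joblib.load('iris_pipeline.pkl')
print(restored.predict([[5.1, 3.5, 1.4, 0.2]]))  # → [0] (setosa)
```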
7.2 Building a Prediction API
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('iris_classifier.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    features = [data['sepal_length'], data['sepal_width'],
                data['petal_length'], data['petal_width']]
    prediction = model.predict([features])
    return jsonify({'class': iris.target_names[prediction[0]]})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
7.3 Performance Monitoring Dashboard
# plot_roc_curve / plot_precision_recall_curve were removed in
# scikit-learn 1.2; use the Display classes instead. Note that both
# curves are defined for binary classification problems.
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay

# Classification performance visualization
fig, ax = plt.subplots(1, 2, figsize=(15, 6))
RocCurveDisplay.from_estimator(model, X_test, y_test, ax=ax[0])
PrecisionRecallDisplay.from_estimator(model, X_test, y_test, ax=ax[1])
8. Pitfalls and How to Avoid Them
8.1 Data Preprocessing Traps
Problem: categories appear in the test set that were never seen in training
Solution:
from sklearn.preprocessing import OneHotEncoder

# Training phase
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(X_train_categorical)

# At test time, unknown categories are silently encoded as all zeros
X_test_encoded = encoder.transform(X_test_categorical)
8.2 Feature Scaling Issues
Symptom: SVM/KNN underperform unexpectedly
Remedy:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # note: transform only, never fit on test data
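A `Pipeline` makes this mistake impossible: `fit` only ever sees training data, and the stored scaling statistics are reused automatically at prediction time. A minimal sketch on iris:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# fit() fits the scaler on X_train only; score() reuses those
# statistics to transform X_test before predicting
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
pipe.fit(X_train, y_train)
print(f"accuracy: {pipe.score(X_test, y_test):.4f}")
```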
8.3 Handling Class Imbalance
Comparing the options:
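One low-effort option supported by most scikit-learn classifiers is `class_weight='balanced'`, which reweights the loss by inverse class frequency; heavier alternatives include resampling techniques such as SMOTE from the third-party imbalanced-learn package. A sketch on a synthetic 9:1 problem showing the effect on minority-class recall (all dataset parameters here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem with a 9:1 class imbalance
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, random_state=0)

for cw in [None, 'balanced']:
    clf = LogisticRegression(class_weight=cw, max_iter=1000)
    clf.fit(X_tr, y_tr)
    rec = recall_score(y_te, clf.predict(X_te))
    print(f"class_weight={cw!s:>8}: minority recall {rec:.3f}")
```

The trade-off: balanced weighting raises minority recall at the cost of more false positives on the majority class, so pick the option that matches the cost structure of your application.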
Conclusion: The Road to Becoming an ML Engineer
Once you can run the full pipeline in Scikit-learn, from data loading to model deployment, you are already ahead of 70% of beginners. But the real journey is just beginning.
Next steps:
# 1. Reproduce classic algorithms from papers
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty='l1', solver='liblinear')

# 2. Enter Kaggle competitions (requires the kaggle package and API credentials)
from kaggle import api
api.competitions_list(search='getting started')

# 3. Build a personal project portfolio
projects = [
    {"name": "Iris classifier", "type": "classification", "accuracy": 0.97},
    {"name": "House price prediction", "type": "regression", "R2": 0.85},
]
Remember: in machine learning, the depth of your theoretical understanding is built on the volume of your hands-on practice. Run your first end-to-end pipeline now, and make Scikit-learn the most reliable companion on your AI journey.
Appendix: Scikit-learn Cheat Sheet
Task | Import Path | Key Parameters
---|---|---
Classification | from sklearn.ensemble import RandomForestClassifier | n_estimators, max_depth
Regression | from sklearn.linear_model import LinearRegression | fit_intercept, positive
Clustering | from sklearn.cluster import KMeans | n_clusters, init
Dimensionality reduction | from sklearn.decomposition import PCA | n_components
Model selection | from sklearn.model_selection import GridSearchCV | param_grid, cv
Preprocessing | from sklearn.preprocessing import StandardScaler | with_mean, with_std