`scikit-learn` Tidbits

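scikit-learn is a very nice machine learning library. Using it can often save you a great deal of time; at the very least, it would be hard for us to write code this fast in pure Python.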

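The scikit-learn project publishes official documentation, but personally I feel it leaves a lot unexplained. When it covers how an algorithm works, it only sketches the idea; unless you already know the algorithm inside out, the description will not click. When it describes the API, it often covers only the common usage and is vague about the more advanced uses. Many people say the documentation is well written, but I find it full of traps. So this post records the pitfalls I ran into while using the library and how to get past them. I'll update it gradually; of course, if I stop using the library, the post probably won't be updated anymore, and I'm not expecting many readers anyway. That's it.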

Clustering

Pitfall 1: How do you define a custom distance function?

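Although scikit-learn implements many clustering algorithms, most of them use the Euclidean or Minkowski distance. As the textbooks describe, though, these are far from the only distances: for different purposes we can measure how far apart two vectors are in different ways. Unfortunately, I did not see an option in scikit-learn for supplying a custom distance, and a long search online turned up nothing either.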

But don't worry, we can get there indirectly. Take the DBSCAN algorithm as an example; here is the class constructor:

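class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, p=None, n_jobs=1)
# eps: the maximum distance between two samples for them to be treated as neighbors (and so end up in the same cluster)
# min_samples: the number of samples (the point itself included) required within eps for a point to count as a core point; groups that never reach this number do not form a cluster and are labeled as noise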

Pay particular attention to the metric option. Here is its description:

metric : string, or callable
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only “nonzero” elements may be considered neighbors for DBSCAN. New in version 0.17: metric precomputed to accept precomputed sparse matrix.
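The parameter description above also lists a callable as an accepted value for metric. As a minimal sketch of what that could look like, assuming a reasonably recent scikit-learn: the function `manhattan` and the toy array `X` below are made up for illustration and are not part of the original post. The callable takes two 1-D sample vectors and returns a number.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def manhattan(a, b):
    """Hypothetical user-defined distance: sum of absolute coordinate differences."""
    return float(np.sum(np.abs(a - b)))


# Toy data (illustrative only): two tight groups of three points and one far-away point.
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
              [10.0, 0.0]])

# Pass the Python function itself as the metric.
labels_callable = DBSCAN(eps=0.5, min_samples=3, metric=manhattan).fit(X).labels_

# For comparison: the same distance selected by its built-in string name.
labels_builtin = DBSCAN(eps=0.5, min_samples=3, metric="manhattan").fit(X).labels_

print(labels_callable)
print(labels_builtin)
```

A Python callable is invoked once per pair of samples, so it is likely to be much slower than a built-in metric name on large data; the precomputed-matrix route discussed in this post is the other way to plug in a custom distance.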

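This description actually reveals something important: you can compute the distances between all the vectors yourself ahead of time and assemble them into a distance matrix; all you need to do is set metric='precomputed'. So how do you call it?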

Let's look at the fit function.

fit(X, y=None, sample_weight=None)
# X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)
# A feature array, or array of distances between samples if metric='precomputed'.

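What does the comment above mean? Let me translate: if you set metric to precomputed, the X you pass in should be the matrix of pairwise distances between the vectors, of shape (n_samples, n_samples), and fit will compute directly on that matrix. Otherwise, you still have to dutifully pass in the vectors in (n_samples, n_features) form.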

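What does this mean, comrades? It means we can compute the pairwise distances between the vectors ahead of time with our own custom distance, then call this function to get the result. Pretty sweet, isn't it?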

As for how to actually code it, here is an example to get things started.

import numpy as np
from sklearn.cluster import DBSCAN
if __name__ == '__main__':
    # Distance matrix: a smaller value means the two vectors are closer together
    Y = np.array([[0, 1, 2],
                  [1, 0, 3],
                  [2, 3, 0]])
    # N = Y.shape[0]
    db = DBSCAN(eps=0.13, metric='precomputed', min_samples=3).fit(Y)
    labels = db.labels_
    # Now let's look at the clustering result!
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)  # number of clusters (the noise label -1 is not counted)
    print('Number of clusters: %d' % n_clusters_)
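The matrix in the example above is hard-coded. The following is a rough sketch of how such a matrix might be built from a custom distance and then handed to DBSCAN. The function `weighted_l1`, the weights, and the toy data are all hypothetical, chosen only for illustration; `sklearn.metrics.pairwise_distances` is used here as one convenient way to fill the matrix, though a plain double loop would work as well.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# Hypothetical per-feature weights for the custom distance.
WEIGHTS = np.array([1.0, 2.0])


def weighted_l1(a, b):
    """Hypothetical custom distance: weighted sum of absolute differences."""
    return float(np.sum(WEIGHTS * np.abs(a - b)))


# Toy data (illustrative only): two tight groups of three points and one outlier.
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
              [10.0, 0.0]])

# Step 1: compute the full (n_samples, n_samples) distance matrix ourselves.
D = pairwise_distances(X, metric=weighted_l1)

# Step 2: tell DBSCAN that the input already is a distance matrix.
db = DBSCAN(eps=0.5, min_samples=3, metric="precomputed").fit(D)
labels_custom = db.labels_

n_clusters = len(set(labels_custom)) - (1 if -1 in labels_custom else 0)
print(labels_custom, n_clusters)
```

Keep in mind that the precomputed matrix needs memory proportional to the square of the number of samples, so this route suits small and medium-sized datasets best.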

Next, let's look at AP (Affinity Propagation) clustering, which works in a very similar way:

class sklearn.cluster.AffinityPropagation(damping=0.5, max_iter=200, convergence_iter=15, copy=True, preference=None, affinity='euclidean', verbose=False)

The key is the affinity parameter:

affinity : string, optional, default=``euclidean``
Which affinity to use. At the moment precomputed and euclidean are supported. euclidean uses the negative squared euclidean distance between points.

This one supports precomputed as well. Now look at the fit function:

fit(X, y=None)
# Create affinity matrix from negative euclidean distances, then apply affinity propagation clustering.
# Parameters:   
#   X: array-like, shape (n_samples, n_features) or (n_samples, n_samples) :
# Data matrix or, if affinity is precomputed, matrix of similarities / affinities.

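The X here works much like before: if you set affinity to precomputed, the X you pass in should be the matrix of pairwise similarities between the vectors (a larger value means more similar, which is why negated distances are used), and fit will compute directly on that matrix. Otherwise, you still have to dutifully pass in the vectors in (n_samples, n_features) form.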

Example 1

"""
Goal:
~~~~~~~~~~~~~~~~
In this file, what I most want to test is whether the clustering approaches described above are correct.
First up is AP clustering.
"""
from itertools import cycle

import matplotlib.pyplot as plt
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator no longer exists in newer releases
from sklearn.metrics.pairwise import euclidean_distances


def draw_pic(n_clusters, cluster_centers_indices, labels, X):
    '''Seeing is believing: one plot makes everything clear.'''
    colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
    for k, col in zip(range(n_clusters), colors):
        class_members = labels == k
        cluster_center = X[cluster_centers_indices[k]]  # the center (exemplar) of cluster k
        plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
        plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=14)
        for x in X[class_members]:
            plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)
    plt.title('Estimated number of clusters: %d' % n_clusters)
    plt.show()


if __name__ == '__main__':
    centers = [[1, 1], [-1, -1], [1, -1]]
    # Generate 300 points; the center each point belongs to is recorded in labels_true.
    X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)
    af = AffinityPropagation(preference=-50).fit(X)  # run AP clustering on the raw features
    cluster_centers_indices = af.cluster_centers_indices_  # indices of the cluster centers
    labels = af.labels_  # cluster label of each point
    n_clusters = len(cluster_centers_indices)  # number of clusters
    draw_pic(n_clusters, cluster_centers_indices, labels, X)

    #=========== Now precompute the similarities instead =================#
    # Negated squared Euclidean distance, which is what affinity='euclidean' uses internally.
    similarity_matrix = -euclidean_distances(X, squared=True)
    af1 = AffinityPropagation(affinity='precomputed', preference=-50).fit(similarity_matrix)
    cluster_centers_indices1 = af1.cluster_centers_indices_  # indices of the cluster centers
    labels1 = af1.labels_  # cluster label of each point
    n_clusters1 = len(cluster_centers_indices1)  # number of clusters
    draw_pic(n_clusters1, cluster_centers_indices1, labels1, X)

Both approaches produce a plot like this:

[Figure: AP clustering]
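Comparing two plots by eye is a little loose, so here is a small sketch that compares the two Affinity Propagation fits numerically instead. It reuses the data setup from Example 1; passing `random_state=0` is my own assumption (the parameter exists in recent scikit-learn releases, not in the version quoted above) so that both runs start from the same internal random state.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import euclidean_distances

centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

# Fit 1: let the estimator build the affinities from the raw features.
# Assumption: random_state is available in the installed scikit-learn.
af_features = AffinityPropagation(preference=-50, random_state=0).fit(X)

# Fit 2: build the similarity matrix ourselves (negated squared Euclidean distance).
similarity = -euclidean_distances(X, squared=True)
af_precomputed = AffinityPropagation(
    affinity="precomputed", preference=-50, random_state=0
).fit(similarity)

# Compare the results element by element.
same_labels = np.array_equal(af_features.labels_, af_precomputed.labels_)
same_centers = np.array_equal(
    af_features.cluster_centers_indices_, af_precomputed.cluster_centers_indices_
)
print(same_labels, same_centers)
```

If both flags come out true, the precomputed route reproduces the default behavior, and swapping in a different similarity only means changing the line that builds `similarity`.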

Example 2

Since we've come this far, we might as well test the DBSCAN algorithm too.

"""
Goal:
~~~~~~~~~~~~~~
AP clustering has been tested above; next, test DBSCAN.
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator no longer exists in newer releases
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler


def draw_pic(n_clusters, core_samples_mask, labels, X):
    '''Draw the plot.'''
    # Black removed and is used for noise instead.
    unique_labels = set(labels)
    colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Black used for noise.
            col = 'k'
        class_member_mask = (labels == k)
        # Core samples are drawn with large markers
        xy = X[class_member_mask & core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=14)
        # Non-core samples (border points and noise) are drawn with small markers
        xy = X[class_member_mask & ~core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=6)
    plt.title('Estimated number of clusters: %d' % n_clusters)
    plt.show()


if __name__ == '__main__':
    #========= First, generate the data ===========#
    centers = [[1, 1], [-1, -1], [1, -1]]
    X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)
    X = StandardScaler().fit_transform(X)

    #========= Next, cluster on the raw features ==========#
    db = DBSCAN(eps=0.3, min_samples=10).fit(X)
    labels = db.labels_  # label of each point
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # number of clusters
    draw_pic(n_clusters, core_samples_mask, labels, X)

    #========== Now precompute the distances instead ============#
    distance_matrix = euclidean_distances(X)
    db1 = DBSCAN(eps=0.3, min_samples=10, metric='precomputed').fit(distance_matrix)
    labels1 = db1.labels_  # label of each point
    core_samples_mask1 = np.zeros_like(db1.labels_, dtype=bool)
    core_samples_mask1[db1.core_sample_indices_] = True
    n_clusters1 = len(set(labels1)) - (1 if -1 in labels1 else 0)  # number of clusters
    draw_pic(n_clusters1, core_samples_mask1, labels1, X)
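As with AP clustering, the two DBSCAN runs can also be compared without plotting. The sketch below repeats the data setup from Example 2 and measures how well the two labelings agree; using `adjusted_rand_score` for the comparison is my own choice, not something from the original example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler

centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

# Run 1: DBSCAN computes Euclidean distances from the features itself.
db_features = DBSCAN(eps=0.3, min_samples=10).fit(X)

# Run 2: we hand DBSCAN a precomputed Euclidean distance matrix.
distance_matrix = euclidean_distances(X)
db_precomputed = DBSCAN(eps=0.3, min_samples=10, metric="precomputed").fit(distance_matrix)

# An adjusted Rand index of 1.0 means the two labelings are identical up to renaming.
agreement = adjusted_rand_score(db_features.labels_, db_precomputed.labels_)
labels_match = np.array_equal(db_features.labels_, db_precomputed.labels_)
print(agreement, labels_match)
```

The two runs are expected to agree here because they use the same distance; replacing `euclidean_distances(X)` with a matrix built from any other distance is all it takes to cluster with a custom one.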