One of machine learning's most popular applications is in solving classification problems.
Classification problems are situations where you have a data set, and you want to classify observations from that data set into a specific category.
A famous example is a spam filter for email providers. Gmail uses supervised machine learning techniques to automatically place emails in your spam folder based on their content, subject line, and other features.
Two machine learning models perform much of the heavy lifting when it comes to classification problems:
- K-nearest neighbors
- K-means clustering
This tutorial will teach you how to code K-nearest neighbors and K-means clustering algorithms in Python.
K-Nearest Neighbors Models
The K-nearest neighbors algorithm is one of the world’s most popular machine learning models for solving classification problems.
A common exercise for students exploring machine learning is to apply the K nearest neighbors algorithm to a data set where the categories are not known. A real-life example of this would be if you needed to make predictions using machine learning on a data set of classified government information.
In this tutorial, you will learn to write your first K nearest neighbors machine learning algorithm in Python. We will be working with an anonymous data set similar to the situation described above.
The Data Set You Will Need in This Tutorial
The first thing you need to do is download the data set we will be using in this tutorial. I have uploaded the file to my website. You can access it by clicking here.
Now that you have downloaded the data set, you will want to move the file to the directory that you’ll be working in. After that, open a Jupyter Notebook and we can get started writing Python code!
The Libraries You Will Need in This Tutorial
To write a K nearest neighbors algorithm, we will take advantage of many open-source Python libraries including NumPy, pandas, and scikit-learn.
Begin your Python script by writing the following import statements:
```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
```
Importing the Data Set Into Our Python Script
Our next step is to import the `classified_data.csv` file into our Python script. The pandas library makes it easy to import data into a pandas DataFrame.
Since the data set is stored in a `csv` file, we will be using the `read_csv` method to do this:
```
raw_data = pd.read_csv('classified_data.csv')
```
Printing this DataFrame inside of your Jupyter Notebook will give you a sense of what the data looks like:
You will notice that the DataFrame starts with an unnamed column whose values are equal to the DataFrame’s index. We can fix this by making a slight adjustment to the command that imported our data set into the Python script:
```
raw_data = pd.read_csv('classified_data.csv', index_col = 0)
```
Next, let’s take a look at the actual features that are contained in this data set. You can print a list of the data set’s column names with the following statement:
```
print(raw_data.columns)
```
This returns:
```
Index(['WTT', 'PTI', 'EQW', 'SBI', 'LQE', 'QWG', 'FDJ', 'PJF', 'HQE', 'NXJ',
       'TARGET CLASS'],
      dtype='object')
```
Since this is a classified data set, we have no idea what any of these columns mean. For now, it is sufficient to recognize that every column is numerical in nature and thus well-suited for modelling with machine learning techniques.
Standardizing the Data Set
Since the K nearest neighbors algorithm makes predictions about a data point by using the observations that are closest to it, the scale of the features within a data set matters a lot.
Because of this, machine learning practitioners typically standardize the data set, which means adjusting every `x` value so that they are roughly on the same scale.
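For intuition, here is the arithmetic behind that adjustment, sketched by hand for a single column. This manual version is just an illustration of the idea; we will use scikit-learn's built-in tooling in a moment:

```
# A minimal by-hand sketch of z-score standardization for one column.
# StandardScaler computes the equivalent of (x - mean) / std for every column,
# using the population standard deviation (ddof=0).
wtt_scaled = (raw_data['WTT'] - raw_data['WTT'].mean()) / raw_data['WTT'].std(ddof=0)
```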
Fortunately, scikit-learn includes some excellent functionality to do this with very little headache.
To start, we will need to import the `StandardScaler` class from scikit-learn. Add the following command to your Python script to do this:
```
from sklearn.preprocessing import StandardScaler
```
This class behaves a lot like the `LinearRegression` and `LogisticRegression` classes that we used earlier in this course. We will want to create an instance of this class and then fit that instance on our data set.
First, let's create an instance of the `StandardScaler` class named `scaler` with the following statement:
```
scaler = StandardScaler()
```
We can now train this instance on our data set using the `fit` method:
```
scaler.fit(raw_data.drop('TARGET CLASS', axis=1))
```
Now we can use the `transform` method to standardize all of the features in the data set so that they are on roughly the same scale. We'll assign these scaled features to the variable named `scaled_features`:
```
scaled_features = scaler.transform(raw_data.drop('TARGET CLASS', axis=1))
```
This actually creates a NumPy array of all the features in the data set, and we want it to be a pandas DataFrame instead.
Fortunately, this is an easy fix. We'll simply pass the `scaled_features` variable into the `pd.DataFrame` constructor and assign the resulting DataFrame to a new variable called `scaled_data`, with an appropriate argument to specify the column names:
```
scaled_data = pd.DataFrame(scaled_features, columns = raw_data.drop('TARGET CLASS', axis=1).columns)
```
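If you want to double-check the result (an optional step), the scaled columns should now each have a mean of roughly 0 and a standard deviation of roughly 1:

```
# Optional sanity check: every scaled column should have mean ~0 and std ~1.
print(scaled_data.describe().loc[['mean', 'std']])
```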
Now that we have imported our data set and standardized its features, we are ready to split the data set into training data and test data.
Splitting the Data Set Into Training Data and Test Data
We will use the `train_test_split` function from scikit-learn combined with list unpacking to create training data and test data from our classified data set.
First, you'll need to import `train_test_split` from the `model_selection` module of scikit-learn with the following statement:
```
from sklearn.model_selection import train_test_split
```
Next, we will need to specify the `x` and `y` values that will be passed into this `train_test_split` function.
The `x` values will be the `scaled_data` DataFrame that we created previously. The `y` values will be the `TARGET CLASS` column of our original `raw_data` DataFrame.
You can create these variables with the following statements:
```
x = scaled_data
y = raw_data['TARGET CLASS']
```
Next, you'll need to run the `train_test_split` function using these two arguments and a reasonable `test_size`. We will use a `test_size` of 30%, which gives the following parameters for the function:
```
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3)
```
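Note that `train_test_split` shuffles the rows randomly before splitting, so your exact split will differ from run to run; passing the optional `random_state` parameter makes it reproducible. If you'd like to confirm the split sizes (an optional check), print the shapes:

```
# With test_size = 0.3, roughly 70% of rows go to training and 30% to test.
print(x_training_data.shape, x_test_data.shape)
```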
Now that our data set has been split into training data and test data, we’re ready to start training our model!
Training a K Nearest Neighbors Model
Let's start by importing the `KNeighborsClassifier` from scikit-learn:
```
from sklearn.neighbors import KNeighborsClassifier
```
Next, let's create an instance of the `KNeighborsClassifier` class and assign it to a variable named `model`.
This class requires a parameter named `n_neighbors`, which is equal to the `K` value of the K nearest neighbors algorithm that you're building. To start, let's specify `n_neighbors = 1`:
```
model = KNeighborsClassifier(n_neighbors = 1)
```
Now we can train our K nearest neighbors model using the `fit` method and our `x_training_data` and `y_training_data` variables:
```
model.fit(x_training_data, y_training_data)
```
Now let’s make some predictions with our newly-trained K nearest neighbors algorithm!
Making Predictions With Our K Nearest Neighbors Algorithm
We can make predictions with our K nearest neighbors algorithm in the same way that we did with our linear regression and logistic regression models earlier in this course: by using the `predict` method and passing in our `x_test_data` variable.
More specifically, here's how you can make predictions and assign them to a variable called `predictions`:
```
predictions = model.predict(x_test_data)
```
Let's explore how accurate our `predictions` are in the next section of this tutorial.
Measuring the Accuracy of Our Model
We saw in our logistic regression tutorial that scikit-learn comes with built-in functions that make it easy to measure the performance of machine learning classification models.
Let's import two of these functions (`classification_report` and `confusion_matrix`) into our report now:
```
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
```
Let's work through each of these one-by-one, starting with the `classification_report`. You can generate the report with the following statement:
```
print(classification_report(y_test_data, predictions))
```
This generates:
```
              precision    recall  f1-score   support

           0       0.94      0.85      0.89       150
           1       0.86      0.95      0.90       150

    accuracy                           0.90       300
   macro avg       0.90      0.90      0.90       300
weighted avg       0.90      0.90      0.90       300
```
Similarly, you can generate a confusion matrix with the following statement:
```
print(confusion_matrix(y_test_data, predictions))
```
This generates:
```
[[141  12]
 [ 18 129]]
```
Looking at these performance metrics, it looks like our model is already fairly performant. In scikit-learn's convention, the confusion matrix rows are the actual classes and the columns are the predicted classes, so 141 + 129 = 270 of the 300 test observations were classified correctly, matching the 90% accuracy shown in the report above. It can still be improved.
In the next section, we will see how we can improve the performance of our K nearest neighbors model by choosing a better value for `K`.
Choosing An Optimal K Value Using the Elbow Method
In this section, we will use the elbow method to choose an optimal value of `K` for our K nearest neighbors algorithm.
The elbow method involves iterating through different K values and selecting the value with the lowest error rate when applied to our test data.
To start, let's create an empty list called `error_rates`. We will loop through different `K` values and append their error rates to this list.
```
error_rates = []
```
Next, we need to make a Python loop that iterates through the different values of `K` we'd like to test and executes the following functionality with each iteration:
- Creates a new instance of the `KNeighborsClassifier` class from scikit-learn
- Trains the new model using our training data
- Makes predictions on our test data
- Calculates the error rate, i.e. the proportion of incorrect predictions (the lower this is, the more accurate our model is)
Here is the code to do this for `K` values between `1` and `100`:
```
for i in np.arange(1, 101):
    new_model = KNeighborsClassifier(n_neighbors = i)
    new_model.fit(x_training_data, y_training_data)
    new_predictions = new_model.predict(x_test_data)
    error_rates.append(np.mean(new_predictions != y_test_data))
```
Let's visualize how our error rate changes with different `K` values using a quick matplotlib visualization:
```
plt.plot(error_rates)
```
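One small caveat (my addition, not part of the original tutorial): `plt.plot(error_rates)` puts the list indices 0 through 99 on the x-axis, while our `K` values run from 1 to 100, so the curve is shifted left by one. To plot against the actual `K` values with labeled axes, you can write:

```
# Plot error rate against the actual K values rather than the list index.
plt.plot(np.arange(1, 101), error_rates)
plt.xlabel('K')
plt.ylabel('Error rate')
```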
As you can see, our error rates tend to be minimized with a `K` value of approximately 50. This means that `50` is a suitable choice for `K` that balances both simplicity and predictive power.
The Full Code For This Tutorial
You can view the full code for this tutorial in this GitHub repository. It is also pasted below for your reference:
```
#Common imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Import the data set
raw_data = pd.read_csv('classified_data.csv', index_col = 0)

#Import standardization functions from scikit-learn
from sklearn.preprocessing import StandardScaler

#Standardize the data set
scaler = StandardScaler()
scaler.fit(raw_data.drop('TARGET CLASS', axis=1))
scaled_features = scaler.transform(raw_data.drop('TARGET CLASS', axis=1))
scaled_data = pd.DataFrame(scaled_features, columns = raw_data.drop('TARGET CLASS', axis=1).columns)

#Split the data set into training data and test data
from sklearn.model_selection import train_test_split
x = scaled_data
y = raw_data['TARGET CLASS']
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3)

#Train the model and make predictions
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors = 1)
model.fit(x_training_data, y_training_data)
predictions = model.predict(x_test_data)

#Performance measurement
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(classification_report(y_test_data, predictions))
print(confusion_matrix(y_test_data, predictions))

#Selecting an optimal K value
error_rates = []
for i in np.arange(1, 101):
    new_model = KNeighborsClassifier(n_neighbors = i)
    new_model.fit(x_training_data, y_training_data)
    new_predictions = new_model.predict(x_test_data)
    error_rates.append(np.mean(new_predictions != y_test_data))

plt.figure(figsize=(16,12))
plt.plot(error_rates)
```
K-Means Clustering Models
The K-means clustering algorithm is typically the first unsupervised machine learning model that students will learn.
It allows machine learning practitioners to create groups of data points within a data set with similar quantitative characteristics. It is useful for solving problems like creating customer segments or identifying localities in a city with high crime rates.
In this section, you will learn how to build your first K means clustering algorithm in Python.
The Data Set We Will Use In This Tutorial
In this tutorial, we will be using a data set generated with scikit-learn.
Let's import scikit-learn's `make_blobs` function to create this artificial data. Open up a Jupyter Notebook and start your Python script with the following statement:
```
from sklearn.datasets import make_blobs
```
Now let's use the `make_blobs` function to create some artificial data!
More specifically, here is how you could create a data set with `200` samples that has `2` features and `4` cluster centers. The standard deviation within each cluster will be set to `1.8`.
```
raw_data = make_blobs(n_samples = 200, n_features = 2, centers = 4, cluster_std = 1.8)
```
If you print this `raw_data` object, you'll notice that it is actually a Python tuple. The first element of this tuple is a NumPy array with 200 observations. Each observation contains 2 features (just like we specified with our `make_blobs` function!).
Now that our data has been created, we can move on to importing other important open-source libraries into our Python script.
The Imports We Will Use In This Tutorial
This tutorial will make use of a number of popular open-source Python libraries, including pandas, NumPy, and matplotlib. Let’s continue our Python script by adding the following imports:
```
import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt

%matplotlib inline
```
The first group of imports in this code block is for manipulating large data sets. The second group of imports is for creating data visualizations.
Let’s move on to visualizing our data set next.
Visualizing Our Data Set
In our `make_blobs` function, we specified that our data set should have 4 cluster centers. The best way to verify that this has been handled correctly is by creating some quick data visualizations.
To start, let’s use the following command to plot all of the rows in the first column of our data set against all of the rows in the second column of our data set:
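```
plt.scatter(raw_data[0][:,0], raw_data[0][:,1])
```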
Note: your data set will appear differently than mine since this is randomly-generated data.
This image seems to indicate that our data set has only three clusters. This is because two of the clusters are very close to each other.
To fix this, we need to reference the second element of our `raw_data` tuple, which is a NumPy array that contains the cluster to which each observation belongs.
If we color our data set using each observation’s cluster, the unique clusters will quickly become clear. Here is the code to do this:
```
plt.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])
```
We can now see that our data set has four unique clusters. Let’s move on to building our K means cluster model in Python!
Building and Training Our K Means Clustering Model
The first step to building our K means clustering algorithm is importing it from scikit-learn. To do this, add the following command to your Python script:
```
from sklearn.cluster import KMeans
```
Next, let's create an instance of this `KMeans` class with a parameter of `n_clusters=4` and assign it to the variable `model`:
```
model = KMeans(n_clusters=4)
```
Now let's train our model by invoking the `fit` method on it and passing in the first element of our `raw_data` tuple:
```
model.fit(raw_data[0])
```
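Note that K means chooses its initial centroids at random, so the clusters (and the numeric labels assigned to them) can differ between runs. If you need reproducible results, scikit-learn's `KMeans` accepts a `random_state` parameter, which the original code does not use; a sketch:

```
# Hypothetical reproducible variant: fixing the seed makes runs repeatable.
reproducible_model = KMeans(n_clusters=4, random_state=42)
reproducible_model.fit(raw_data[0])
```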
In the next section, we’ll explore how to make predictions with this K means clustering model.
Before moving on, I wanted to point out one difference that you may have noticed between the process for building this K means clustering algorithm (which is an unsupervised machine learning algorithm) and the supervised machine learning algorithms we’ve worked with so far in this course.
Namely, we did not have to split the data set into training data and test data. This is an important difference: in fact, you never need to make the train/test split on a data set when building unsupervised machine learning models!
Making Predictions With Our K Means Clustering Model
Machine learning practitioners generally use K means clustering algorithms to make two types of predictions:
- Which cluster each data point belongs to
- Where the center of each cluster is
It is easy to generate these predictions now that our model has been trained.
First, let's predict which cluster each data point belongs to. To do this, access the `labels_` attribute from our `model` object using the dot operator, like this:
```
model.labels_
```
This generates a NumPy array with predictions for each data point that looks like this:
```
array([3, 2, 7, 0, 5, 1, 7, 7, 6, 1, 2, 4, 6, 7, 6, 4, 4, 3, 3, 6, 0, 0,
       6, 4, 5, 6, 0, 2, 6, 5, 4, 3, 4, 2, 6, 6, 6, 5, 6, 2, 1, 1, 3, 4,
       3, 5, 7, 1, 7, 5, 3, 6, 0, 3, 5, 5, 7, 1, 3, 1, 5, 7, 7, 0, 5, 7,
       3, 4, 0, 5, 6, 5, 1, 4, 6, 4, 5, 6, 7, 2, 2, 0, 4, 1, 1, 1, 6, 3,
       3, 7, 3, 6, 7, 7, 0, 3, 4, 3, 4, 0, 3, 5, 0, 3, 6, 4, 3, 3, 4, 6,
       1, 3, 0, 5, 4, 2, 7, 0, 2, 6, 4, 2, 1, 4, 7, 0, 3, 2, 6, 7, 5, 7,
       5, 4, 1, 7, 2, 4, 7, 7, 4, 6, 6, 3, 7, 6, 4, 5, 5, 5, 7, 0, 1, 1,
       0, 0, 2, 5, 0, 3, 2, 5, 1, 5, 6, 5, 1, 3, 5, 1, 2, 0, 4, 5, 6, 3,
       4, 4, 5, 6, 4, 4, 2, 1, 7, 4, 6, 6, 0, 6, 3, 5, 0, 5, 2, 4, 6, 0,
       1, 0], dtype=int32)
```
To see where the center of each cluster lies, access the `cluster_centers_` attribute using the dot operator like this:
```
model.cluster_centers_
```
This generates a two-dimensional NumPy array that contains the coordinates of each cluster's center. It will look like this:
```
array([[ -8.06473328,  -0.42044783],
       [  0.15944397,  -9.4873621 ],
       [  1.49194628,   0.21216413],
       [-10.97238157,  -2.49017206],
       [  3.54673215,  -9.7433692 ],
       [ -3.41262049,   7.80784834],
       [  2.53980034,  -2.96376999],
       [ -0.4195847 ,   6.92561289]])
```
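Although this tutorial doesn't use it, a trained `KMeans` model can also assign brand-new observations to the nearest learned cluster via its `predict` method. A quick sketch with made-up coordinates:

```
# Assign two hypothetical new points to their nearest learned cluster centers.
new_points = np.array([[0.0, 0.0], [-8.0, -0.5]])
print(model.predict(new_points))
```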
We’ll assess the accuracy of these predictions in the next section.
Visualizing the Accuracy of Our Model
The last thing we’ll do in this tutorial is visualize the accuracy of our model. You can use the following code to do this:
```
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10,6))

ax1.set_title('Our Model')
ax1.scatter(raw_data[0][:,0], raw_data[0][:,1], c=model.labels_)

ax2.set_title('Original Data')
ax2.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])
```
This generates two different plots side-by-side where one plot shows the clusters according to the real data set and the other plot shows the clusters according to our model. Here is what the output looks like:
Although the coloring between the two plots is different, you can see that our model did a fairly good job of predicting the clusters within our data set. You can also see that the model was not perfect - if you look at the data points along a cluster’s edge, you can see that it occasionally misclassified an observation from our data set.
There's one last thing that needs to be mentioned about measuring our model's predictions. In this example, we knew which cluster each observation belonged to because we actually generated this data set ourselves.
This is highly unusual. K means clustering is more often applied when the clusters aren’t known in advance. Instead, machine learning practitioners use K means clustering to find patterns that they don’t already know within a data set.
The Full Code For This Tutorial
You can view the full code for this tutorial in this GitHub repository. It is also pasted below for your reference:
```
#Create artificial data set
from sklearn.datasets import make_blobs
raw_data = make_blobs(n_samples = 200, n_features = 2, centers = 4, cluster_std = 1.8)

#Data imports
import pandas as pd
import numpy as np

#Visualization imports
import seaborn
import matplotlib.pyplot as plt
%matplotlib inline

#Visualize the data
plt.scatter(raw_data[0][:,0], raw_data[0][:,1])
plt.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])

#Build and train the model
from sklearn.cluster import KMeans
model = KMeans(n_clusters=4)
model.fit(raw_data[0])

#See the predictions
model.labels_
model.cluster_centers_

#Plot the predictions against the original data set
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10,6))
ax1.set_title('Our Model')
ax1.scatter(raw_data[0][:,0], raw_data[0][:,1], c=model.labels_)
ax2.set_title('Original Data')
ax2.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])
```
Final Thoughts
This tutorial taught you how to build K-nearest neighbors and K-means clustering machine learning models in Python.
If you're interested in learning more about machine learning, my book Pragmatic Machine Learning will teach you practical machine learning techniques by building 9 real projects. The book launches August 3rd, and you can preorder it for 50% off.
Here is a brief summary of what you learned about K-nearest neighbors models in Python:
- How classified data is a common tool used to teach students how to solve their first K nearest neighbor problems
- Why it's important to standardize your data set when building K nearest neighbor models
- How to split your data set into training data and test data using the `train_test_split` function
- How to train your first K nearest neighbors model and make predictions with it
- How to measure the performance of a K nearest neighbors model
- How to use the elbow method to select an optimal value of K in a K nearest neighbors model
Similarly, here is a brief summary of what you learned about K-means clustering models in Python:
- How to create artificial data in scikit-learn using the `make_blobs` function
- How to build and train a K means clustering model
- That unsupervised machine learning techniques do not require you to split your data into training data and test data
- How to build and train a K means clustering model using scikit-learn
- How to visualize the performance of a K means clustering algorithm when you know the clusters in advance
Original article: https://www.freecodecamp.org/news/how-to-build-and-train-k-nearest-neighbors-ml-models-in-python/