How to Build and Train K-Nearest Neighbors and K-Means Clustering ML Models in Python

One of machine learning's most popular applications is in solving classification problems.

Classification problems are situations where you have a data set, and you want to classify observations from that data set into a specific category.

A famous example is a spam filter for email providers. Gmail uses supervised machine learning techniques to automatically place emails in your spam folder based on their content, subject line, and other features.

Two machine learning models do much of the heavy lifting when it comes to sorting observations into groups:

  • K-nearest neighbors (a supervised classifier)
  • K-means clustering (an unsupervised clustering method)

This tutorial will teach you how to code K-nearest neighbors and K-means clustering algorithms in Python.

K-Nearest Neighbors Models

The K-nearest neighbors algorithm is one of the world's most popular machine learning models for solving classification problems.

A common exercise for students exploring machine learning is to apply the K-nearest neighbors algorithm to a data set whose features have been anonymized, so you do not know what each column represents. A real-life example would be making predictions with machine learning on a data set of classified government information.

In this tutorial, you will learn to write your first K-nearest neighbors machine learning algorithm in Python. We will be working with an anonymized data set similar to the situation described above.

The Data Set You Will Need in This Tutorial

The first thing you need to do is download the data set we will be using in this tutorial. I have uploaded the file to my website. You can access it by clicking here.

Now that you have downloaded the data set, move the file to the directory that you'll be working in. After that, open a Jupyter Notebook and we can get started writing Python code!

The Libraries You Will Need in This Tutorial

To write a K-nearest neighbors algorithm, we will take advantage of several open-source Python libraries, including NumPy, pandas, scikit-learn, matplotlib, and seaborn.

Begin your Python script by writing the following import statements:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The final line, %matplotlib inline, is a Jupyter Notebook magic command that displays plots inside the notebook. Leave it out if you run the code as a plain Python script.

Importing the Data Set Into Our Python Script

Our next step is to import the classified_data.csv file into our Python script. The pandas library makes it easy to import data into a pandas DataFrame.

Since the data set is stored in a csv file, we will use the read_csv function to do this:

raw_data = pd.read_csv('classified_data.csv')

Printing this DataFrame inside of your Jupyter Notebook will give you a sense of what the data looks like.

You will notice that the DataFrame starts with an unnamed column whose values are equal to the DataFrame's index. We can fix this by telling read_csv to use the first column as the index:

raw_data = pd.read_csv('classified_data.csv', index_col = 0)
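To see what index_col = 0 changes without needing the tutorial's file, here is a minimal sketch that reads a tiny CSV from memory. The CSV text, its values, and the variable names are invented for illustration and are not taken from classified_data.csv.

```python
import io
import pandas as pd

# A tiny stand-in for a CSV that was saved together with its index.
# The header starts with an empty name, which is what produces the unnamed column.
csv_text = ",WTT,PTI,TARGET CLASS\n0,0.91,1.16,1\n1,0.64,1.00,0\n"

with_extra_column = pd.read_csv(io.StringIO(csv_text))
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra_column.columns))  # includes 'Unnamed: 0'
print(list(with_index.columns))         # only the real columns
```

With index_col=0, the first column becomes the row index instead of being treated as a feature.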

Next, let's take a look at the actual features contained in this data set. You can print the data set's column names with the following statement:

print(raw_data.columns)

This returns:

Index(['WTT', 'PTI', 'EQW', 'SBI', 'LQE', 'QWG', 'FDJ', 'PJF', 'HQE', 'NXJ',
       'TARGET CLASS'],
      dtype='object')

Since this is a classified data set, we have no idea what any of these columns mean. For now, it is sufficient to recognize that every column is numerical and thus well-suited for modeling with machine learning techniques.

Standardizing the Data Set

Since the K nearest neighbors algorithm makes predictions about a data point by using the observations that are closest to it, the scale of the features within a data set matters a lot. A feature measured in large units will dominate the distance calculation simply because its numbers are bigger.

Because of this, machine learning practitioners typically standardize the data set, which means adjusting every x value so that all features are on roughly the same scale.
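The effect of feature scale on distance can be sketched with plain NumPy. The three observations below are made up, and the manual standardization (subtract the column mean, divide by the column standard deviation) is only an illustration of the idea, not the tutorial's code.

```python
import numpy as np

# Three made-up observations with two features on very different scales:
# column 0 is in the thousands, column 1 is between 0 and 1.
points = np.array([
    [1000.0, 0.1],   # query point
    [1010.0, 0.9],   # close in feature 0, far away in feature 1
    [1100.0, 0.1],   # far away in feature 0, identical in feature 1
])

def distances_from_first(data):
    """Euclidean distance from the first row to every other row."""
    return np.linalg.norm(data[1:] - data[0], axis=1)

raw_distances = distances_from_first(points)

# Standardize each column: subtract its mean and divide by its standard deviation.
standardized = (points - points.mean(axis=0)) / points.std(axis=0)
scaled_distances = distances_from_first(standardized)

print(raw_distances)     # the large-scale feature decides almost everything
print(scaled_distances)  # both features now contribute
```

Before scaling, the second point looks about ten times closer to the query than the third. After scaling, the two are nearly the same distance away, because the difference in the small-scale feature is no longer drowned out.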

Fortunately, scikit-learn includes some excellent functionality to do this with very little headache.

To start, we will need to import the StandardScaler class from scikit-learn. Add the following command to your Python script to do this:

from sklearn.preprocessing import StandardScaler

This class behaves a lot like the LinearRegression and LogisticRegression classes that we used earlier in this course. We will create an instance of the class and then fit that instance on our data set.

First, let's create an instance of the StandardScaler class named scaler with the following statement:

scaler = StandardScaler()

We can now train this instance on our data set using the fit method. We drop the TARGET CLASS column first because it is the label we want to predict, not a feature:

scaler.fit(raw_data.drop('TARGET CLASS', axis=1))

Now we can use the transform method to standardize all of the features in the data set so they are roughly the same scale. We'll assign these scaled features to the variable named scaled_features:

scaled_features = scaler.transform(raw_data.drop('TARGET CLASS', axis=1))

This actually creates a NumPy array of all the features in the data set, and we want it to be a pandas DataFrame instead.

Fortunately, this is an easy fix. We'll simply pass the scaled_features variable to the pd.DataFrame constructor, use the columns argument to restore the column names, and assign the result to a new variable called scaled_data:

scaled_data = pd.DataFrame(scaled_features, columns = raw_data.drop('TARGET CLASS', axis=1).columns)
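As a quick check of what standardization does, here is a small sketch on made-up data. The column names height_cm and weight_kg and all values are hypothetical; only the StandardScaler workflow mirrors the steps above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on different scales plus a label column.
toy_data = pd.DataFrame({
    'height_cm': [150.0, 160.0, 170.0, 180.0],
    'weight_kg': [50.0, 60.0, 80.0, 90.0],
    'TARGET CLASS': [0, 0, 1, 1],
})

features = toy_data.drop('TARGET CLASS', axis=1)

scaler = StandardScaler()
scaler.fit(features)                       # learns each column's mean and standard deviation
scaled_array = scaler.transform(features)  # returns a NumPy array, not a DataFrame

scaled_toy = pd.DataFrame(scaled_array, columns=features.columns)

print(scaled_toy.mean().round(6))       # each column's mean is numerically zero
print(scaled_toy.std(ddof=0).round(6))  # each column's population standard deviation is one
```

Note that ddof=0 is passed to std because StandardScaler divides by the population standard deviation, while pandas defaults to the sample standard deviation.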

Now that we have imported our data set and standardized its features, we are ready to split the data set into training data and test data.

Splitting the Data Set Into Training Data and Test Data

We will use the train_test_split function from scikit-learn, combined with unpacking, to create training data and test data from our classified data set.

First, you'll need to import train_test_split from the model_selection module of scikit-learn with the following statement:

from sklearn.model_selection import train_test_split

Next, we will need to specify the x and y values that will be passed into this train_test_split function.

The x values will be the scaled_data DataFrame that we created previously. The y values will be the TARGET CLASS column of our original raw_data DataFrame.

You can create these variables with the following statements:

x = scaled_data
y = raw_data['TARGET CLASS']

Next, you'll need to run the train_test_split function using these two arguments and a reasonable test_size. We will use a test_size of 30%, which gives the following call:

x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3)
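The split is easier to picture on a small example. The sketch below uses ten made-up observations. The random_state argument is an addition for this illustration (the tutorial's call does not set it) so that the shuffle is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 observations with 2 features each, and alternating labels.
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# random_state is an addition for this sketch so the shuffle is repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=42
)

print(x_train.shape, x_test.shape)  # 7 training rows and 3 test rows
print(y_train.shape, y_test.shape)
```

The function returns four objects in a fixed order (x train, x test, y train, y test), which is why they can be unpacked into four variables in a single statement.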

Now that our data set has been split into training data and test data, we're ready to start training our model!

Training a K Nearest Neighbors Model

Let's start by importing the KNeighborsClassifier from scikit-learn:

from sklearn.neighbors import KNeighborsClassifier

Next, let's create an instance of the KNeighborsClassifier class and assign it to a variable named model.

This class takes a parameter named n_neighbors (5 by default), which is the K value of the K nearest neighbors algorithm that you're building. To start, let's specify n_neighbors = 1:

model = KNeighborsClassifier(n_neighbors = 1)

Now we can train our K nearest neighbors model using the fit method and our x_training_data and y_training_data variables:

model.fit(x_training_data, y_training_data)

Now let's make some predictions with our newly-trained K nearest neighbors algorithm!

Making Predictions With Our K Nearest Neighbors Algorithm

We can make predictions with our K nearest neighbors algorithm in the same way that we did with our linear regression and logistic regression models earlier in this course: by using the predict method and passing in our x_test_data variable.

More specifically, here's how you can make predictions and assign them to a variable called predictions:

predictions = model.predict(x_test_data)
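To make the fit and predict steps concrete, here is a minimal sketch on six made-up points that form two obvious groups. The data, the variable names, and the choice of n_neighbors=3 are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: two well-separated groups of points.
x_train_demo = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train_demo = np.array([0, 0, 0, 1, 1, 1])

# With n_neighbors=3, each prediction is a majority vote among the 3 closest training points.
demo_model = KNeighborsClassifier(n_neighbors=3)
demo_model.fit(x_train_demo, y_train_demo)

new_points = np.array([[0.5, 0.5], [5.5, 5.5]])
demo_predictions = demo_model.predict(new_points)
print(demo_predictions)  # [0 1]
```

Each new point is assigned the class held by the majority of its three nearest training points, so the point near the origin gets class 0 and the point near (5, 5) gets class 1.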

Let's explore how accurate our predictions are in the next section of this tutorial.

Measuring the Accuracy of Our Model

We saw in our logistic regression tutorial that scikit-learn comes with built-in functions that make it easy to measure the performance of machine learning classification models.

Let's import two of these functions (classification_report and confusion_matrix) into our script now:

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

Let's work through each of these one-by-one, starting with the classification_report. You can generate the report with the following statement:

print(classification_report(y_test_data, predictions))

This generates:

              precision    recall  f1-score   support

           0       0.94      0.85      0.89       150
           1       0.86      0.95      0.90       150

    accuracy                           0.90       300
   macro avg       0.90      0.90      0.90       300
weighted avg       0.90      0.90      0.90       300

Similarly, you can generate a confusion matrix with the following statement:

print(confusion_matrix(y_test_data, predictions))

This generates:

[[141  12]
 [ 18 129]]

Because train_test_split shuffles the data randomly, your numbers will differ from these. The report and the matrix shown here also appear to come from different runs, which is why their class counts do not match exactly.

Looking at these performance metrics, our model already performs fairly well, with roughly 90% of test observations classified correctly. It can still be improved.
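A confusion matrix is easier to read once you compute a few metrics from it by hand. The sketch below reuses the matrix printed above and relies on scikit-learn's convention that rows are true classes and columns are predicted classes. The variable names are my own.

```python
import numpy as np

# The confusion matrix printed above. In scikit-learn, rows are the true
# classes and columns are the predicted classes.
matrix = np.array([[141, 12],
                   [18, 129]])

true_negatives, false_positives = matrix[0]
false_negatives, true_positives = matrix[1]

accuracy = (true_negatives + true_positives) / matrix.sum()
precision_class_1 = true_positives / (true_positives + false_positives)
recall_class_1 = true_positives / (true_positives + false_negatives)

print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision_class_1:.2f}")
print(f"recall:    {recall_class_1:.2f}")
```

The diagonal holds the correct predictions (141 + 129 = 270 out of 300), which is where the 90% accuracy figure comes from.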

In the next section, we will see how we can improve the performance of our K nearest neighbors model by choosing a better value for K.

Choosing An Optimal K Value Using the Elbow Method

In this section, we will use the elbow method to choose an optimal value of K for our K nearest neighbors algorithm.

The elbow method involves iterating through different K values and selecting the value with the lowest error rate when applied to our test data.

To start, let's create an empty list called error_rates. We will loop through different K values and append their error rates to this list.

error_rates = []

Next, we need to make a Python loop that iterates through the different values of K we'd like to test and does the following on each iteration:

  • Creates a new instance of the KNeighborsClassifier class from scikit-learn
  • Trains the new model using our training data
  • Makes predictions on our test data
  • Calculates the error rate, which is the fraction of test predictions that are incorrect (the lower this is, the more accurate our model is)

Here is the code to do this for K values from 1 to 100:

for i in np.arange(1, 101):
    new_model = KNeighborsClassifier(n_neighbors = i)
    new_model.fit(x_training_data, y_training_data)
    new_predictions = new_model.predict(x_test_data)
    error_rates.append(np.mean(new_predictions != y_test_data))

Let's visualize how our error rate changes with different K values using a quick matplotlib visualization. Note that the x-axis shows positions in the list, so position 0 corresponds to K = 1:

plt.plot(error_rates)

As you can see, our error rates tend to be minimized with a K value of approximately 50. This means that 50 is a suitable choice for K that balances both simplicity and predictive power.
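If you would rather read the best K off the numbers than off the plot, the same loop can be paired with np.argmin. This is a sketch under stated assumptions: it uses synthetic data from make_classification instead of the tutorial's file, tests only K values from 1 to 30 to keep it quick, and fixes random_state so the result is repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for this sketch, not the tutorial's file).
x_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=0
)

k_values = np.arange(1, 31)
error_rates_demo = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    # Fraction of test observations that were predicted incorrectly.
    error_rates_demo.append(np.mean(knn.predict(x_test) != y_test))

# np.argmin returns a list position, so map it back to the K value it represents.
best_k = k_values[np.argmin(error_rates_demo)]
print(best_k, min(error_rates_demo))
```

Indexing k_values with the result of np.argmin avoids the off-by-one mistake of treating a list position as a K value.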

The Full Code For This Tutorial

You can view the full code for this tutorial in this GitHub repository. It is also pasted below for your reference:

#Common imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Import the data set
raw_data = pd.read_csv('classified_data.csv', index_col = 0)

#Import standardization functions from scikit-learn
from sklearn.preprocessing import StandardScaler

#Standardize the data set
scaler = StandardScaler()
scaler.fit(raw_data.drop('TARGET CLASS', axis=1))
scaled_features = scaler.transform(raw_data.drop('TARGET CLASS', axis=1))
scaled_data = pd.DataFrame(scaled_features, columns = raw_data.drop('TARGET CLASS', axis=1).columns)

#Split the data set into training data and test data
from sklearn.model_selection import train_test_split
x = scaled_data
y = raw_data['TARGET CLASS']
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3)

#Train the model and make predictions
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors = 1)
model.fit(x_training_data, y_training_data)
predictions = model.predict(x_test_data)

#Performance measurement
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(classification_report(y_test_data, predictions))
print(confusion_matrix(y_test_data, predictions))

#Selecting an optimal K value
error_rates = []
for i in np.arange(1, 101):
    new_model = KNeighborsClassifier(n_neighbors = i)
    new_model.fit(x_training_data, y_training_data)
    new_predictions = new_model.predict(x_test_data)
    error_rates.append(np.mean(new_predictions != y_test_data))

plt.figure(figsize=(16,12))
plt.plot(error_rates)

K-Means Clustering Models

The K-means clustering algorithm is typically the first unsupervised machine learning model that students will learn.

It allows machine learning practitioners to group the data points within a data set that have similar quantitative characteristics. It is useful for solving problems like creating customer segments or identifying localities in a city with high crime rates.

In this section, you will learn how to build your first K means clustering algorithm in Python.

The Data Set We Will Use In This Tutorial

In this tutorial, we will be using a data set generated with scikit-learn.

Let's import scikit-learn's make_blobs function to create this artificial data. Open up a Jupyter Notebook and start your Python script with the following statement:

from sklearn.datasets import make_blobs

Now let's use the make_blobs function to create some artificial data!

More specifically, here is how you could create a data set with 200 samples that has 2 features and 4 cluster centers. The standard deviation within each cluster will be set to 1.8.

raw_data = make_blobs(n_samples = 200, n_features = 2, centers = 4, cluster_std = 1.8)

If you print this raw_data object, you'll notice that it is actually a Python tuple. The first element of this tuple is a NumPy array with 200 observations. Each observation contains 2 features (just like we specified with our make_blobs function!). The second element is a NumPy array holding the cluster that each observation was generated from.
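The structure of the tuple is easier to see if you unpack it into two named variables. In this sketch the names blob_features and blob_labels are my own, and random_state is an addition so the generated data is the same on every run.

```python
from sklearn.datasets import make_blobs

# random_state is an addition for this sketch so the data is the same on every run.
blob_features, blob_labels = make_blobs(
    n_samples=200, n_features=2, centers=4, cluster_std=1.8, random_state=0
)

print(blob_features.shape)                # (200, 2): 200 observations with 2 features each
print(blob_labels.shape)                  # (200,): one cluster number per observation
print(sorted(set(blob_labels.tolist())))  # [0, 1, 2, 3]
```

Here blob_features plays the role of raw_data[0] and blob_labels the role of raw_data[1] in the tutorial's code.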

Now that our data has been created, we can move on to importing other important open-source libraries into our Python script.

The Imports We Will Use In This Tutorial

This tutorial will make use of a number of popular open-source Python libraries, including pandas, NumPy, and matplotlib. Let's continue our Python script by adding the following imports:

import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
%matplotlib inline

The first group of imports in this code block (pandas and NumPy) is for manipulating large data sets. The second group (seaborn and matplotlib) is for creating data visualizations.

Let's move on to visualizing our data set next.

Visualizing Our Data Set

In our make_blobs function, we specified for our data set to have 4 cluster centers. The best way to verify that this has been handled correctly is by creating some quick data visualizations.

To start, let's use the following command to plot all of the rows in the first column of our data set against all of the rows in the second column of our data set:

plt.scatter(raw_data[0][:,0], raw_data[0][:,1])

Note: your data set will appear differently than mine since this is randomly-generated data.

This image seems to indicate that our data set has only three clusters. This is because two of the clusters are very close to each other.

To fix this, we need to reference the second element of our raw_data tuple, which is a NumPy array that contains the cluster to which each observation belongs.

If we color our data set using each observation's cluster, the unique clusters will quickly become clear. Here is the code to do this:

plt.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])

We can now see that our data set has four unique clusters. Let's move on to building our K means cluster model in Python!

Building and Training Our K Means Clustering Model

The first step to building our K means clustering algorithm is importing it from scikit-learn. To do this, add the following command to your Python script:

from sklearn.cluster import KMeans

Next, let's create an instance of this KMeans class with a parameter of n_clusters=4 and assign it to the variable model:

model = KMeans(n_clusters=4)

Now let's train our model by invoking the fit method on it and passing in the first element of our raw_data tuple:

model.fit(raw_data[0])

In the next section, we'll explore how to make predictions with this K means clustering model.

Before moving on, I wanted to point out one difference that you may have noticed between the process for building this K means clustering algorithm (which is an unsupervised machine learning algorithm) and the supervised machine learning algorithms we've worked with so far in this course.

Namely, we did not have to split the data set into training data and test data. This is an important difference. Because an unsupervised model is not trained against known labels, there is no labeled test set to hold back, so you typically do not need a train/test split when building unsupervised machine learning models!

Making Predictions With Our K Means Clustering Model

Machine learning practitioners generally use K means clustering algorithms to make two types of predictions:

  • Which cluster each data point belongs to
  • Where the center of each cluster is

It is easy to generate these predictions now that our model has been trained.

First, let's predict which cluster each data point belongs to. To do this, access the labels_ attribute from our model object using the dot operator, like this:

model.labels_

This generates a NumPy array with predictions for each data point that looks like this:

array([3, 2, 7, 0, 5, 1, 7, 7, 6, 1, 2, 4, 6, 7, 6, 4, 4, 3, 3, 6, 0, 0,
       6, 4, 5, 6, 0, 2, 6, 5, 4, 3, 4, 2, 6, 6, 6, 5, 6, 2, 1, 1, 3, 4,
       3, 5, 7, 1, 7, 5, 3, 6, 0, 3, 5, 5, 7, 1, 3, 1, 5, 7, 7, 0, 5, 7,
       3, 4, 0, 5, 6, 5, 1, 4, 6, 4, 5, 6, 7, 2, 2, 0, 4, 1, 1, 1, 6, 3,
       3, 7, 3, 6, 7, 7, 0, 3, 4, 3, 4, 0, 3, 5, 0, 3, 6, 4, 3, 3, 4, 6,
       1, 3, 0, 5, 4, 2, 7, 0, 2, 6, 4, 2, 1, 4, 7, 0, 3, 2, 6, 7, 5, 7,
       5, 4, 1, 7, 2, 4, 7, 7, 4, 6, 6, 3, 7, 6, 4, 5, 5, 5, 7, 0, 1, 1,
       0, 0, 2, 5, 0, 3, 2, 5, 1, 5, 6, 5, 1, 3, 5, 1, 2, 0, 4, 5, 6, 3,
       4, 4, 5, 6, 4, 4, 2, 1, 7, 4, 6, 6, 0, 6, 3, 5, 0, 5, 2, 4, 6, 0,
       1, 0], dtype=int32)

Note that this sample output contains labels from 0 through 7, so it appears to have been produced by a model with eight clusters. With n_clusters=4, as used in this tutorial, you will see only the labels 0 through 3.

To see where the center of each cluster lies, access the cluster_centers_ attribute using the dot operator like this:

model.cluster_centers_

This generates a two-dimensional NumPy array that contains the coordinates of each cluster's center, with one row per cluster. The sample below has eight rows for the same reason as above; with n_clusters=4 you will get four. It will look like this:

array([[ -8.06473328,  -0.42044783],
       [  0.15944397,  -9.4873621 ],
       [  1.49194628,   0.21216413],
       [-10.97238157,  -2.49017206],
       [  3.54673215,  -9.7433692 ],
       [ -3.41262049,   7.80784834],
       [  2.53980034,  -2.96376999],
       [ -0.4195847 ,   6.92561289]])
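Here is a self-contained sketch of the two attributes, plus the predict method for assigning observations the model has not seen. The stand-in data, the variable names, and the explicit n_init and random_state arguments are assumptions added to make the run repeatable. They are not part of the tutorial's code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in data; random_state is an addition so the sketch is repeatable.
blob_features, _ = make_blobs(
    n_samples=200, n_features=2, centers=4, cluster_std=1.8, random_state=0
)

# n_init and random_state are set explicitly here only to make runs repeatable.
kmeans_demo = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans_demo.fit(blob_features)

print(np.unique(kmeans_demo.labels_))      # [0 1 2 3]
print(kmeans_demo.cluster_centers_.shape)  # (4, 2): one (x, y) center per cluster

# predict assigns new observations to the nearest learned center.
new_observations = np.array([[0.0, 0.0], [5.0, 5.0]])
new_clusters = kmeans_demo.predict(new_observations)
print(new_clusters)
```

The labels_ attribute only covers the observations used in fit, while predict can be called on any new array with the same number of features.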

We'll assess the accuracy of these predictions in the next section.

Visualizing the Accuracy of Our Model

The last thing we'll do in this tutorial is visualize the accuracy of our model. You can use the following code to do this:

f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10,6))
ax1.set_title('Our Model')
ax1.scatter(raw_data[0][:,0], raw_data[0][:,1], c=model.labels_)
ax2.set_title('Original Data')
ax2.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])

This generates two plots side-by-side, where one plot shows the clusters according to our model and the other shows the clusters according to the real data set.

Although the coloring between the two plots is different (K-means numbers its clusters arbitrarily), you can see that our model did a fairly good job of predicting the clusters within our data set. You can also see that the model was not perfect. If you look at the data points along a cluster's edge, you can see that it occasionally misclassified an observation from our data set.

There's one last thing that needs to be mentioned about measuring our model's predictions. In this example, we knew which cluster each observation belonged to because we actually generated this data set ourselves.

This is highly unusual. K means clustering is more often applied when the clusters aren't known in advance. Instead, machine learning practitioners use K means clustering to find patterns that they don't already know within a data set.
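When the true clusters do happen to be known, as in this tutorial, the visual comparison can be backed by a number. One option, which is not used in the original tutorial, is scikit-learn's adjusted_rand_score, which measures agreement between two groupings regardless of how the clusters are numbered. The data and variable names below are stand-ins.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# The score ignores how clusters are numbered: swapping the names 0 and 1
# still counts as a perfect match.
relabelled_score = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
print(relabelled_score)  # 1.0

# Stand-in data and model; random_state and n_init are additions for repeatability.
blob_features, true_clusters = make_blobs(
    n_samples=200, n_features=2, centers=4, cluster_std=1.8, random_state=0
)
kmeans_demo = KMeans(n_clusters=4, n_init=10, random_state=0).fit(blob_features)

agreement = adjusted_rand_score(true_clusters, kmeans_demo.labels_)
print(round(agreement, 3))  # 1.0 would mean the clusters match exactly
```

A plain accuracy score would be misleading here, because cluster 2 in the model may correspond to cluster 0 in the generated data. That is the same reason the colors differ between the two plots.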

The Full Code For This Tutorial

You can view the full code for this tutorial in this GitHub repository. It is also pasted below for your reference:

#Create artificial data set
from sklearn.datasets import make_blobs
raw_data = make_blobs(n_samples = 200, n_features = 2, centers = 4, cluster_std = 1.8)

#Data imports
import pandas as pd
import numpy as np

#Visualization imports
import seaborn
import matplotlib.pyplot as plt
%matplotlib inline

#Visualize the data
plt.scatter(raw_data[0][:,0], raw_data[0][:,1])
plt.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])

#Build and train the model
from sklearn.cluster import KMeans
model = KMeans(n_clusters=4)
model.fit(raw_data[0])

#See the predictions
model.labels_
model.cluster_centers_

#Plot the predictions against the original data set
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10,6))
ax1.set_title('Our Model')
ax1.scatter(raw_data[0][:,0], raw_data[0][:,1], c=model.labels_)
ax2.set_title('Original Data')
ax2.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])

Final Thoughts

This tutorial taught you how to build K-nearest neighbors and K-means clustering machine learning models in Python.

If you're interested in learning more about machine learning, my book Pragmatic Machine Learning will teach you practical machine learning techniques by building 9 real projects. The book launches August 3rd. You can preorder it for 50% off using the link below:

Here is a brief summary of what you learned about K-nearest neighbors models in Python:

  • How classified data is a common tool used to teach students how to solve their first K nearest neighbor problems
  • Why it's important to standardize your data set when building K nearest neighbor models
  • How to split your data set into training data and test data using the train_test_split function
  • How to train your first K nearest neighbors model and make predictions with it
  • How to measure the performance of a K nearest neighbors model
  • How to use the elbow method to select an optimal value of K in a K nearest neighbors model

Similarly, here is a brief summary of what you learned about K-means clustering models in Python:

  • How to create artificial data in scikit-learn using the make_blobs function
  • How to build and train a K means clustering model using scikit-learn
  • That unsupervised machine learning techniques do not require you to split your data into training data and test data
  • How to visualize the performance of a K means clustering algorithm when you know the clusters in advance

Translated from: https://www.freecodecamp.org/news/how-to-build-and-train-k-nearest-neighbors-ml-models-in-python/
