Machine Learning Basics: K-Nearest Neighbors Classification
In the previous stories, I explained the programs for implementing various regression models and described the implementation of the Logistic Regression model. In this article, we shall look at the algorithm behind K-Nearest Neighbors (KNN) classification, along with a simple example.
Overview of KNN Classification
The K-Nearest Neighbors or KNN classification is a simple, easy-to-implement supervised machine learning algorithm that is used mostly for classification problems.
Let us understand this algorithm with a very simple example. Suppose there are two classes, represented by Rectangles and Triangles. If we want to assign a new shape (a Diamond) to one of the classes, we can implement the KNN Classification model.
In this model, we have to choose the number of nearest neighbors (N). Here, as we have chosen N=4, we calculate the distance from the new data point to each existing point and draw a circular region around its 4 nearest neighbors. In this problem, since all four nearest neighbors lie in Class 1 (Rectangles), the new data point (the Diamond) is also assigned to Class 1.
In this way, we can try different values of the parameter N and choose the most accurate one for the model by trial and error, while also avoiding over-fitting and high loss. A minimal cross-validation sketch for this search is shown below.
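A minimal sketch of this search using scikit-learn's cross-validation, assuming X is an already-scaled feature matrix and y its class labels (both hypothetical here; the actual dataset is introduced in the next section):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Score each candidate N with 5-fold cross-validation and keep the best.
def best_n_neighbors(X, y, candidates=range(1, 21)):
    scores = {n: cross_val_score(KNeighborsClassifier(n_neighbors=n), X, y, cv=5).mean()
              for n in candidates}
    return max(scores, key=scores.get)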
This is how the KNN Classification algorithm works. Let us now move on to its implementation with a real-world example in the next section.
Problem Analysis
To apply the KNN Classification model in practice, I am using the same dataset used in building the Logistic Regression model: the DMV Test dataset, which has three columns. The first two columns contain the scores of two DMV written tests (DMV_Test_1 and DMV_Test_2), which are the independent variables, and the last column contains the dependent variable, Results, which denotes whether the driver got the license (1) or not (0).
Using this data, we have to build a KNN Classification model that predicts, from the marks obtained in the two DMV written tests, whether a driver will get the license, and classify the results.
Step 1: Importing the Libraries
As always, the first step is to import the libraries: NumPy, Pandas and Matplotlib.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Step 2: Importing the dataset
In this step, we shall get the dataset from my GitHub repository as "DMVWrittenTests.csv". The variable X will store the two "DMV Tests" and the variable y will store the final output, "Results". dataset.head(5) is used to visualize the first 5 rows of the data.
dataset = pd.read_csv('https://raw.githubusercontent.com/mk-gurucharan/Classification/master/DMVWrittenTests.csv')
X = dataset.iloc[:, [0, 1]].values
y = dataset.iloc[:, 2].values

dataset.head(5)

>>
DMV_Test_1 DMV_Test_2 Results
34.623660 78.024693 0
30.286711 43.894998 0
35.847409 72.902198 0
60.182599 86.308552 1
79.032736 75.344376 1
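As an optional sanity check (not part of the original walkthrough), we can also look at how balanced the two classes are before modeling:

# Count how many drivers got the license (1) versus not (0).
print(dataset['Results'].value_counts())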
Step 3: Splitting the dataset into the Training set and Test set
In this step, we have to split the dataset into the Training set, on which the KNN Classification model will be trained, and the Test set, on which the trained model will be applied to classify the results. Here, test_size=0.25 denotes that 25% of the data will be kept as the Test set and the remaining 75% will be used for training as the Training set.
from sklearn.model_selection import train_test_split

# Hold out 25% of the data for testing (no random_state is set,
# so the split changes from run to run).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
Step 4: Feature Scaling
This is an additional step used to normalize the data within a particular range; it also helps speed up the calculations. As the raw scores vary widely, we standardize each feature to zero mean and unit variance, so that most values fall within a small range (roughly -2 to 2). For example, the score 62.0730638 is normalized to -0.21231162 and the score 96.51142588 is normalized to 1.55187648. In this way, the scores of X_train and X_test are normalized to a smaller range.
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit the scaler on the training data only
X_test = sc.transform(X_test)        # reuse the same scaling for the test data
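As a quick check (an optional addition, not in the original code), we can confirm that each scaled training column now has roughly zero mean and unit variance:

# Each column of X_train should now have mean ~0 and standard deviation ~1.
print(X_train.mean(axis=0))  # values close to 0
print(X_train.std(axis=0))   # values close to 1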
Step 5: Training the KNN Classification model on the Training Set
In this step, the class KNeighborsClassifier is imported and assigned to the variable "classifier". The classifier.fit() function is then fitted with X_train and y_train, on which the model is trained.
from sklearn.neighbors import KNeighborsClassifier

# The Minkowski metric with p = 2 is the ordinary Euclidean distance.
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)
Step 6: Predicting the Test set results
In this step, the classifier.predict() function is used to predict the values for the Test set, and these values are stored in the variable y_pred.
y_pred = classifier.predict(X_test)
y_pred
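The trained model can also classify a single new applicant. In this hypothetical example (the two scores below are made up), note that the same fitted scaler must be applied before predicting, since the model was trained on scaled data:

# Hypothetical applicant who scored 65.0 and 72.0 on the two written tests.
new_applicant = [[65.0, 72.0]]
print(classifier.predict(sc.transform(new_applicant)))  # array([0]) or array([1])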
Step 7: Confusion Matrix and Accuracy
This step is used mostly in classification techniques. Here, we check the accuracy of the trained model and plot the confusion matrix.
The confusion matrix is a table that shows the number of correct and incorrect predictions on a classification problem when the real values of the Test set are known. For a binary problem in scikit-learn, it has the format

             Predicted 0   Predicted 1
Actual 0     TN            FP
Actual 1     FN            TP
The True values (TN and TP) are the numbers of correct predictions made.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

cm = confusion_matrix(y_test, y_pred)
print("Accuracy : ", accuracy_score(y_test, y_pred))
cm

>> Accuracy :  0.92
>> array([[11,  1],
          [ 1, 12]])
From the above confusion matrix, we infer that, out of the 25 test set values, 23 were correctly classified and 2 were incorrectly classified, which is a little better than the Logistic Regression model.
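Beyond raw accuracy, scikit-learn's classification_report gives per-class precision, recall and F1-score; a small optional addition:

from sklearn.metrics import classification_report

# Precision, recall and F1-score for each class on the test set.
print(classification_report(y_test, y_pred))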
Step 8: Comparing the Real Values with Predicted Values
In this step, a Pandas DataFrame is created to compare the classified values of the original Test set (y_test) with the predicted results (y_pred).
df = pd.DataFrame({'Real Values':y_test, 'Predicted Values':y_pred})
df

>>
Real Values Predicted Values
0 0
0 1
1 1
0 0
0 0
1 1
1 1
0 0
0 0
1 1
0 0
1 0
1 1
1 1
0 0
0 0
0 0
1 1
1 1
1 1
1 1
0 0
1 1
1 1
0 0
Though this visualization may not be as useful here as it was with regression, we can see from it that the model classifies the test set values with a decent accuracy of 92%, as calculated above. A small snippet for pulling out just the misclassified rows is shown below.
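To extract just the misclassified rows from this DataFrame (an optional convenience, not in the original):

# Rows where the predicted label disagrees with the real label.
print(df[df['Real Values'] != df['Predicted Values']])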
Step 9: Visualizing the Results
In this last step, we visualize the results of the KNN Classification model on a graph plotted along with the two classification regions.
from matplotlib.colors import ListedColormap

X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('KNN Classification')
plt.xlabel('DMV_Test_1')
plt.ylabel('DMV_Test_2')
plt.legend()
plt.show()

In this graph, following the colormap in the code, the value 0 (i.e., No) is plotted in red and the value 1 (i.e., Yes) is plotted in green. The KNN Classification model separates the two regions, and unlike the Logistic Regression model, the decision boundary is not linear. Thus, any data point with the two given values (DMV_Test_1 and DMV_Test_2) can be plotted on the graph, and depending upon which region it falls in, the result (getting the driver's license) can be classified as Yes or No.
As calculated above, we can see that there are two values in the test set, one in each region, that are wrongly classified.
Conclusion
Thus, in this story, we have successfully built a KNN Classification model that is able to predict whether a person will get their driving license based on their written examination scores, and visualized the results.
I am also attaching the link to my GitHub repository, where you can download this Google Colab notebook and the data files for your reference.
You can also find the explanation of the program for other Classification models below:
- Logistic Regression
- K-Nearest Neighbours (KNN) Classification
- Support Vector Machine (SVM) Classification (Coming Soon)
- Naive Bayes Classification (Coming Soon)
- Random Forest Classification (Coming Soon)
We will come across more complex models of Regression, Classification and Clustering in the upcoming articles. Till then, Happy Machine Learning!
Source: https://towardsdatascience.com/machine-learning-basics-k-nearest-neighbors-classification-6c1e0b209542