機器學習:分類_機器學習基礎:K最近鄰居分類

機器學習:分類

In the previous stories, I had given an explanation of the program for implementation of various Regression models. Also, I had described the implementation of the Logistic Regression model. In this article, we shall see the algorithm of the K-Nearest Neighbors or KNN Classification along with a simple example.

在先前的故事中 ,我已經解釋了用于實現各種回歸模型的程序。 另外,我已經描述了Logistic回歸模型的實現。 在本文中,我們將看到K最近鄰算法或KNN分類算法以及一個簡單示例。

KNN分類概述 (Overview of KNN Classification)

The K-Nearest Neighbors or KNN Classification is a simple and easy to implement, supervised machine learning algorithm that is used mostly for classification problems.

K最近鄰或KNN分類是一種簡單且易于實現的,受監督的機器學習算法,主要用于分類問題。

Let us understand this algorithm with a very simple example. Suppose there are two classes represented by Rectangles and Triangles. If we want to add a new shape (Diamond) to any one of the classes, then we can implement the KNN Classification model.

讓我們通過一個非常簡單的示例來了解該算法。 假設有兩個用矩形和三角形表示的類。 如果要向任何一個類添加新的形狀(鉆石),則可以實現KNN分類模型。

In this model, we have to choose the number of nearest neighbors (N). Here, as we have chosen N=4, the new data point calculates the distance between each of the points and draws a circular region around its nearest 4 neighbors ( as N=4). In this problem as all the four nearest neighbors lie in the Class 1 (Rectangles), the new data point (Diamond) is also assigned as a Class 1 data point.

在此模型中,我們必須選擇最近鄰居的數量(N)。 在這里,由于我們選擇了N = 4,因此新數據點將計算每個點之間的距離,并在其最近的4個鄰居周圍繪制一個圓形區域(當N = 4時)。 在此問題中,由于四個最近的鄰居都位于Class 1(矩形)中,因此新數據點(Diamond)也被分配為Class 1數據點。

In this way, we can alter the parameter, N with various values and choose the most accurate value for the model by a trial and error basis, also avoiding over-fitting and high loss.

這樣,我們可以通過反復試驗改變參數N的值,并為模型選擇最準確的值,從而避免過擬合和高損失。

In this way, we can implement the KNN Classification algorithm. Let us now move to its implementation with a real world example in the next section.

這樣,我們可以實現KNN分類算法。 現在,讓我們在下一部分中以一個真實的示例轉到其實現。

問題分析 (Problem Analysis)

To apply the KNN Classification model in practical use, I am using the same dataset used in building the Logistic Regression model. In this, we DMV Test dataset which has three columns. The first two columns consist of the two DMV written tests (DMV_Test_1 and DMV_Test_2) which are the independent variables and the last column consists of the dependent variable, Results which denote that the driver has got the license (1) or not (0).

為了在實際應用中應用KNN分類模型,我使用的是用于構建Logistic回歸模型的相同數據集。 在這里,我們有三列的DMV Test數據集。 前兩列包含兩個DMV書面測試( DMV_Test_1DMV_Test_2 ),它們是自變量,最后一列包含因變量, 結果表示驅動程序已獲得許可證(1)或沒有獲得許可證(0)。

In this, we have to build a KNN Classification model using this data to predict if a driver who has taken the two DMV written tests will get the license or not using those marks obtained in their written tests and classify the results.

在這種情況下,我們必須使用此數據構建KNN分類模型,以預測已參加兩次DMV筆試的駕駛員是否會使用在其筆試中獲得的那些標記來獲得駕照,然后對結果進行分類。

步驟1:導入庫 (Step 1: Importing the Libraries)

As always, the first step will always include importing the libraries which are the NumPy, Pandas and the Matplotlib.

與往常一樣,第一步將始終包括導入NumPy,Pandas和Matplotlib庫。

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

步驟2:導入數據集 (Step 2: Importing the dataset)

In this step, we shall get the dataset from my GitHub repository as “DMVWrittenTests.csv”. The variable X will store the two “DMV Tests ”and the variable Y will store the final output as “Results. The dataset.head(5)is used to visualize the first 5 rows of the data.

在這一步中,我們將從GitHub存儲庫中獲取數據集,名稱為“ DMVWrittenTests.csv”。 變量X將存儲兩個“ DMV測試 ”,變量Y將最終輸出存儲為“ 結果 dataset.head(5)用于可視化數據的前5行。

dataset = pd.read_csv('https://raw.githubusercontent.com/mk-gurucharan/Classification/master/DMVWrittenTests.csv')X = dataset.iloc[:, [0, 1]].values
y = dataset.iloc[:, 2].valuesdataset.head(5)>>
DMV_Test_1 DMV_Test_2 Results
34.623660 78.024693 0
30.286711 43.894998 0
35.847409 72.902198 0
60.182599 86.308552 1
79.032736 75.344376 1

步驟3:將資料集分為訓練集和測試集 (Step 3: Splitting the dataset into the Training set and Test set)

In this step, we have to split the dataset into the Training set, on which the Logistic Regression model will be trained and the Test set, on which the trained model will be applied to classify the results. In this the test_size=0.25 denotes that 25% of the data will be kept as the Test set and the remaining 75% will be used for training as the Training set.

在這一步中,我們必須將數據集分為訓練集和測試集,訓練集將在該訓練集上訓練邏輯回歸模型,測試集將在訓練集上應用訓練后的模型對結果進行分類。 在這種情況下, test_size=0.25表示將保留25%的數據作為測試集,而將剩余的75 %的數據用作培訓集

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

步驟4:功能縮放 (Step 4: Feature Scaling)

This is an additional step that is used to normalize the data within a particular range. It also aids in speeding up the calculations. As the data is widely varying, we use this function to limit the range of the data within a small limit ( -2,2). For example, the score 62.0730638 is normalized to -0.21231162 and the score 96.51142588 is normalized to 1.55187648. In this way, the scores of X_train and X_test are normalized to a smaller range.

這是一個附加步驟,用于對特定范圍內的數據進行規范化。 它還有助于加快計算速度。 由于數據變化很大,我們使用此功能將數據范圍限制在很小的限制(-2,2)內。 例如,將分數62.0730638標準化為-0.21231162,將分數96.51142588標準化為1.55187648。 這樣,將X_train和X_test的分數歸一化為較小的范圍。

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

步驟5:在訓練集上訓練KNN分類模型 (Step 5: Training the KNN Classification model on the Training Set)

In this step, the class KNeighborsClassifier is imported and is assigned to the variable “classifier”. The classifier.fit() function is fitted with X_train and Y_train on which the model will be trained.

在此步驟中,將導入類KNeighborsClassifier并將其分配給變量“ classifier”classifier.fit()函數配有X_trainY_train ,將在其上訓練模型。

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)

步驟6:預測測試集結果 (Step 6: Predicting the Test set results)

In this step, the classifier.predict() function is used to predict the values for the Test set and the values are stored to the variable y_pred.

在此步驟中, classifier.predict()函數用于預測測試集的值,并將這些值存儲到變量y_pred.

y_pred = classifier.predict(X_test) 
y_pred

步驟7:混淆矩陣和準確性 (Step 7: Confusion Matrix and Accuracy)

This is a step that is mostly used in classification techniques. In this, we see the Accuracy of the trained model and plot the confusion matrix.

這是分類技術中最常用的步驟。 在此,我們看到了訓練模型的準確性,并繪制了混淆矩陣。

The confusion matrix is a table that is used to show the number of correct and incorrect predictions on a classification problem when the real values of the Test Set are known. It is of the format

混淆矩陣是一個表,用于在已知測試集的實際值時顯示有關分類問題的正確和不正確預測的數量。 它的格式

Image for post
Source — Self
來源—自我

The True values are the number of correct predictions made.

True值是做出正確預測的次數。

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)from sklearn.metrics import accuracy_score
print ("Accuracy : ", accuracy_score(y_test, y_pred))
cm>>Accuracy : 0.92>>array([[11, 1],
[ 1, 12]])

From the above confusion matrix, we infer that, out of 25 test set data, 23 were correctly classified and 2 were incorrectly classified which is little better than the Logistic Regression model.

從上面的混淆矩陣中,我們推斷出,在25個測試集數據中,有23個被正確分類,而2個被錯誤分類,這比Logistic回歸模型好一點。

步驟8:將實際值與預測值進行比較 (Step 8: Comparing the Real Values with Predicted Values)

In this step, a Pandas DataFrame is created to compare the classified values of both the original Test set (y_test) and the predicted results (y_pred).

在此步驟中,將創建一個Pandas DataFrame來比較原始測試集( y_test )和預測結果( y_pred )的分類值。

df = pd.DataFrame({'Real Values':y_test, 'Predicted Values':y_pred})
df>>
Real Values Predicted Values
0 0
0 1
1 1
0 0
0 0
1 1
1 1
0 0
0 0
1 1
0 0
1 0
1 1
1 1
0 0
0 0
0 0
1 1
1 1
1 1
1 1
0 0
1 1
1 1
0 0

Though this visualization may not be of much use as it was with Regression, from this, we can see that the model is able to classify the test set values with a decent accuracy of 92% as calculated above.

盡管這種可視化可能不像使用回歸那樣有用,但是從中我們可以看到,該模型能夠以如上計算的92%的準確度對測試集值進行分類。

步驟9:可視化結果 (Step 9: Visualizing the Results)

In this last step, we visualize the results of the KNN Classification model on a graph that is plotted along with the two regions.

在這最后一步中,我們在與兩個區域一起繪制的圖形上可視化KNN分類模型的結果。

from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('KNN Classification')
plt.xlabel('DMV_Test_1')
plt.ylabel('DMV_Test_2')
plt.legend()
plt.show()
Image for post
KNN Classification
KNN分類

In this graph, the value 1 (i.e, Yes) is plotted in “Red” color and the value 0 (i.e, No) is plotted in “Green” color. The KNN Classification model separates the two regions. It is not linear as the Logistic Regression model. Thus, any data with the two data points (DMV_Test_1 and DMV_Test_2) given, can be plotted on the graph and depending upon which region if falls in, the result (Getting the Driver’s License) can be classified as Yes or No.

在該圖中,值1(即“是”)以“ 紅色 ”顏色繪制,而值0(即“否”)以“ 綠色 ”顏色繪制。 KNN分類模型將兩個區域分開。 它與Logistic回歸模型不是線性的。 因此,具有給定兩個數據點(DMV_Test_1和DMV_Test_2)的任何數據都可以繪制在圖形上,并且根據所處的區域而定,結果(獲得駕駛執照)可以分類為是或否。

As calculated above, we can see that there are two values in the test set, one on each region that are wrongly classified.

如上計算,我們可以看到測試集中有兩個值,每個區域一個值被錯誤分類。

結論— (Conclusion —)

Thus in this story, we have successfully been able to build a KNN Classification Model that is able to predict if a person is able to get the driving license from their written examinations and visualize the results.

因此,在這個故事中,我們已經成功地建立了一個KNN分類模型,該模型能夠預測一個人是否能夠通過筆試獲得駕照并將結果可視化。

I am also attaching the link to my GitHub repository where you can download this Google Colab notebook and the data files for your reference.

我還將鏈接附加到我的GitHub存儲庫中,您可以在其中下載此Google Colab筆記本和數據文件以供參考。

You can also find the explanation of the program for other Classification models below:

您還可以在下面找到其他分類模型的程序說明:

  • Logistic Regression

    邏輯回歸

  • K-Nearest Neighbours (KNN) Classification

    K最近鄰居(KNN)分類
  • Support Vector Machine (SVM) Classification (Coming Soon)

    支持向量機(SVM)分類(即將推出)
  • Naive Bayes Classification (Coming Soon)

    樸素貝葉斯分類(即將推出)
  • Random Forest Classification (Coming Soon)

    隨機森林分類(即將推出)

We will come across the more complex models of Regression, Classification and Clustering in the upcoming articles. Till then, Happy Machine Learning!

在接下來的文章中,我們將介紹更復雜的回歸,分類和聚類模型。 到那時,快樂機器學習!

翻譯自: https://towardsdatascience.com/machine-learning-basics-k-nearest-neighbors-classification-6c1e0b209542

機器學習:分類

本文來自互聯網用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權,不承擔相關法律責任。
如若轉載,請注明出處:http://www.pswp.cn/news/392471.shtml
繁體地址,請注明出處:http://hk.pswp.cn/news/392471.shtml
英文地址,請注明出處:http://en.pswp.cn/news/392471.shtml

如若內容造成侵權/違法違規/事實不符,請聯系多彩編程網進行投訴反饋email:809451989@qq.com,一經查實,立即刪除!

相關文章

leetcode 714. 買賣股票的最佳時機含手續費(dp)

給定一個整數數組 prices,其中第 i 個元素代表了第 i 天的股票價格 ;非負整數 fee 代表了交易股票的手續費用。 你可以無限次地完成交易,但是你每筆交易都需要付手續費。如果你已經購買了一個股票,在賣出它之前你就不能再繼續購買…

如何在Angular Material中制作自定義主題

by Charlee Li通過李李 如何在Angular Material中制作自定義主題 (How to make a custom theme in Angular Material) Angular Material is a great library that implements Material Design for Angular 2. The official document is sufficient regarding the component us…

最感嘆的莫過于一見如故,最悲傷的莫過于再見陌路。最深的孤獨,是你明知道自己的渴望,卻得對它裝聾作啞。最美的你不是生如夏花,而是在時間的長河里,波瀾不驚。...

最感嘆的莫過于一見如故,最悲傷的莫過于再見陌路。最深的孤獨,是你明知道自己的渴望,卻得對它裝聾作啞。最美的你不是生如夏花,而是在時間的長河里,波瀾不驚。轉載于:https://www.cnblogs.com/dj258/p/7003890.html

java vimrc_.vimrc技巧

-------------------------------------------------------------------" 設置字符編碼。參考:http://www.rainux.org/blog/index.php/2005/10/20/106" encoding: Vim 內部使用的字符編碼方式,包括 Vim 的buffer (緩沖區)、菜單文" 本、消…

將PDF和Gutenberg文檔格式轉換為文本:生產中的自然語言處理

Estimates state that 70%–85% of the world’s data is text (unstructured data). Most of the English and EU business data formats as byte text, MS Word, or Adobe PDF. [1]據估計,全球數據的70%–85%是文本(非結構化數據)。 大多數…

Go_筆試題記錄-指針與值類型實現接口的區別

1、如果Add函數的調用代碼為: func main() {var a Integer 1var b Integer 2var i interface{} &asum : i.(*Integer).Add(b)fmt.Println(sum) } 則Add函數定義正確的是() A.type Integer int func (a Integer) Add(b Integer) Intege…

leetcode 48. 旋轉圖像

解題思路 將數組從里到外分為若干層, 數組 [1,2,3], [4,5,6][7,8,9]的最外層即為 [1,2,3] [4 6][7,8,9] ,將一層分為4條邊,如741 123,將741放到123的位置,123放到369的位置,如此類推(但是放置的…

如何恢復誤刪的OneNote頁面

今天不小心把半個月的日記刪掉了!(為了減少頁面數量,每個月的日記會放在同一個頁面上)。 幸運的是OneNote有自動備份功能,喜極而泣。 操作方法來自微軟支持 打開丟失了最近筆記的筆記本。 單擊“文件”>“信息”&g…

javascript函數式_JavaScript中的函數式編程原理

javascript函數式After a long time learning and working with object-oriented programming, I took a step back to think about system complexity.經過長時間的學習和使用面向對象的編程,我退后了一步來思考系統的復雜性。 “Complexity is anything that mak…

java writeint_Java DataOutputStream.writeInt(int v)類型

DataOutputStream.writeInt(int v)方法示例DataOutputStream的DataOutputStream.writeInt(int v)方法具有以下語法。public final void writeInt(int v) throws IOException示例在下面的代碼中展示了如何使用DataOutputStream.writeInt(int v)方法。import java.io.DataInputSt…

協方差意味著什么_“零”到底意味著什么?

協方差意味著什么When I was an undergraduate student studying Data Science, one of my professors always asked the same question for every data set we worked with — “What does zero mean?”當我是一名研究數據科學的本科生時,我的一位教授總是對我們處…

Go_筆試題記錄-不熟悉的

1、golang中沒有隱藏的this指針,這句話的含義是() A. 方法施加的對象顯式傳遞,沒有被隱藏起來 B. golang沿襲了傳統面向對象編程中的諸多概念,比如繼承、虛函數和構造函數 C. golang的面向對象表達更直觀,對…

leetcode 316. 去除重復字母(單調棧)

給你一個字符串 s ,請你去除字符串中重復的字母,使得每個字母只出現一次。需保證 返回結果的字典序最小(要求不能打亂其他字符的相對位置)。 注意:該題與 1081 https://leetcode-cn.com/problems/smallest-subsequenc…

Go-json解碼到結構體

廢話不多說,直接干就得了,上代碼 package mainimport ("encoding/json""fmt" )type IT struct {Company string json:"company" Subjects []string json:"subjects"IsOk bool json:"isok"…

leetcode 746. 使用最小花費爬樓梯(dp)

數組的每個索引作為一個階梯,第 i個階梯對應著一個非負數的體力花費值 costi。 每當你爬上一個階梯你都要花費對應的體力花費值,然后你可以選擇繼續爬一個階梯或者爬兩個階梯。 您需要找到達到樓層頂部的最低花費。在開始時,你可以選擇從索…

安卓中經常使用控件遇到問題解決方法(持續更新和發現篇幅)(在textview上加一條線、待續)...

TextView設置最多顯示30個字符。超過部分顯示...(省略號)&#xff0c;有人說分別設置TextView的android:signature"true",而且設置android:ellipsize"end";可是我試了。居然成功了&#xff0c;供大家參考 [java] view plaincopy<TextView android:id…

網絡工程師晉升_晉升為工程師的最快方法

網絡工程師晉升by Sihui Huang黃思慧 晉升為工程師的最快方法 (The Fastest Way to Get Promoted as an Engineer) We all want to live up to our potential, grow in our career, and do the best work of our lives. Getting promoted at work not only proves that we hav…

java 銀行存取款_用Java編寫銀行存錢取錢

const readline require(‘readline-sync‘)//引用readline-synclet s 2;//錯誤的次數for (let i 0; i < 3; i) {console.log(‘請輸入名&#xff1a;(由英文組成)‘);let user readline.question();console.log(‘請輸入密碼&#xff1a;(由數字組成)‘);let password …

垃圾郵件分類 python_在python中創建SMS垃圾郵件分類器

垃圾郵件分類 python介紹 (Introduction) I have always been fascinated with Google’s gmail spam detection system, where it is able to seemingly effortlessly judge whether incoming emails are spam and therefore not worthy of our limited attention.我一直對Goo…

leetcode 103. 二叉樹的鋸齒形層序遍歷(層序遍歷)

給定一個二叉樹&#xff0c;返回其節點值的鋸齒形層序遍歷。&#xff08;即先從左往右&#xff0c;再從右往左進行下一層遍歷&#xff0c;以此類推&#xff0c;層與層之間交替進行&#xff09;。例如&#xff1a; 給定二叉樹 [3,9,20,null,null,15,7],3/ \9 20/ \15 7 返回…