密度聚類dbscan_DBSCAN —基于密度的聚類方法的演練

密度聚類dbscan

The idea of having newer algorithms come into the picture doesn’t make the older ones ‘completely redundant’. British statistician, George E. P. Box had once quoted that, “All models are wrong, but some are useful”, meaning that no model is exact enough to certify as cent percent accurate. Reverse claims can only lead to the loss of generalization. The most accurate thing to do is to find the most approximate model.

出現新算法的想法并不能使舊算法“完全冗余”。 英國統計學家George EP Box曾經引述過: “所有模型都是錯誤的,但有些模型是有用的” ,這意味著沒有任何一種模型能夠精確到百分之一的精度。 反向主張只能導致泛化。 最準確的事情是找到最近似的模型。

Clustering is an unsupervised learning technique where the aim is to group similar objects together. We are virtually living in a world where our past and present choices have become a dataset that can be clustered to identify patterns in our searches, shopping carts, the books we read, etc such that the machine algorithm is sophisticated enough to recommend the things to us. It is fascinating that the algorithms know much more about us then we ourselves can recognize!

聚類是一種無監督的學習技術,其目的是將相似的對象分組在一起。 實際上,我們生活在一個世界中,過去和現在的選擇已成為一個數據集,可以將其聚類以識別我們的搜索,購物車,閱讀的書籍等中的模式,從而機器算法足夠復雜,可以向您推薦事物我們。 令人著迷的是,這些算法對我們的了解更多,然后我們自己就能意識到!

As already discussed in the previous blog, K-means makes use of Euclidean distance as a metric to form the clusters. This leads to a variety of drawbacks as mentioned. Please refer to the blog to read about the K-means algorithm, implementation, and drawbacks: Clustering — Diving deep into K-means algorithm

如先前博客中已討論的,K-means利用歐幾里得距離作為度量來形成聚類。 如上所述,這導致了各種缺點。 請參閱博客,以了解有關K-means算法,實現和缺點的信息: 聚類-深入探討K-means算法

The real-life data has outliers and is irregular in shape. K-means fails to address these important points and becomes unsuitable for arbitrary shaped, noisy data. In this blog, we are going to learn about an interesting density-based clustering approach — DBSCAN.

現實生活中的數據存在異常值,并且形狀不規則。 K均值無法解決這些重要問題,因此不適用于任意形狀的嘈雜數據。 在此博客中,我們將學習一種有趣的基于密度的聚類方法-DBSCAN。

應用程序基于密度的空間聚類— DBSCAN (Density-based spatial clustering of applications with noise — DBSCAN)

DBSCAN is a density-based clustering approach that separates regions with a high density of data points from the regions with a lower density of data points. Its fundamental definition is that the cluster is a contiguous region of dense data points separated from another such region by a region of the low density of data points. Unlike K-means clustering, the number of clusters is determined by the algorithm. Two important concepts are density reachability and density connectivity, which can be understood as follows:

DBSCAN是基于密度的聚類方法,可將數據點密度較高的區域與數據點密度較低的區域分開 它的基本定義是,群集是密集數據點的連續區域,該區域與另一個此類區域之間被數據點的低密度區域分隔開 。 與K均值聚類不同,聚類的數量由算法確定。 密度可達性密度連通 是兩個重要的概念,可以理解如下:

“A point is considered to be density reachable to another point if it is situated within a particular distance range from it. It is the criteria for calling two points as neighbors. Similarly, if two points A and B are density reachable (neighbors), also B and C are density reachable (neighbors), then by chaining approach A and C belong to the same cluster. This concept is called density connectivity. By this approach, the algorithm performs cluster propagation.”

“如果一個點位于另一個點的特定距離范圍內,則認為該點可以密度達到另一個點。 這是將兩個點稱為鄰居的標準。 類似地,如果兩個點A和B是密度可達的(鄰居),則B和C也是密度可達的(鄰居),則通過鏈接方法A和C屬于同一群集。 這個概念稱為密度連接。 通過這種方法,該算法執行集群傳播。”

The key constructs of the DBSCAN algorithm that help it determine the ‘concept of density’ are as follows:

DBSCAN算法可幫助確定“密度概念”的關鍵結構如下:

Epsilon ε (measure): ε is the threshold radius distance which determines the neighborhood of a point. If a point is located at a distance less than or equal to ε from another point, it becomes its neighbor, that is, it becomes density reachable to it.

小量 ε(測量):ε 是確定點附近的閾值半徑距離。 如果一個點與另一個點的距離小于或等于ε,則該點成為其相鄰點,即它可以達到的密度。

Choice of ε: The choice of ε is made in a way that the clusters and the outlier data can be segregated perfectly. Too large ε value can cluster the entire data as one cluster and too small value can classify each point as noise. In layman terms, the average distance of each point from its k-nearest neighbors is determined, sorted, and plotted. The point of maximum change (the elbows bend) determines the optimal value of ε.

的選擇 ε :選擇ε時,可以將聚類和離群數據完美地分開。 太大的ε值會將整個數據聚類為一個聚類,而太小的ε值會將每個點歸類為噪聲。 用外行術語來說,每個點到其k個最近鄰居的平均距離被確定,排序和繪制。 最大變化點(肘部彎曲)確定ε的最佳值。

Min points m (measure): It is a threshold number of points present in the ε distance of a data point that dictates the category of that data point. It is driven by the number of dimensions present.

最小點數m (小節):它是數據點的ε距離中存在的閾值點數,它決定了該數據點的類別。 它由當前尺寸的數量驅動。

Choice of Min points: Minimum value of Min points has to be 3. Larger density and dimensionality means larger value should be chosen. The formula to be used while assigning value to Min points is: Min points>= Dimensionality + 1

最小點數的選擇:最小點數的最小值必須為3。較大的密度和維數表示應選擇較大的值。 將值分配給“最小點”時要使用的公式為: 最小點> =維+ 1

Core points (data points): A point is a core point if it has at least m number of points within radii of ε distance from it.

核心點 (數據點):如果一個點在距其ε距離的半徑內至少有m個點,則它是一個核心點。

Border points (data points): A point that doesn’t qualify as a core point but is a neighbor of a core point.

邊界點 (數據點):不符合核心點要求但與核心點相鄰的點。

Noise points (data points): An outlier point that doesn’t fulfill any of the above-given criteria.

噪聲點 (數據點):不滿足上述任何標準的異常點。

Image for post

Algorithm:

算法:

  1. Select a value for ε and m.

    εm選擇一個值。

  2. Mark all points as outliers.

    將所有點標記為離群值。
  3. For each point, if at least m points are present within its ε distance range:

    對于每個點,如果在其ε距離范圍內至少存在m個點:

  • Identify it as a core point and mark the point as visited.

    將其標識為核心點并將該點標記為已訪問。
  • Assign the core point and its density reachable points in one cluster and remove them from the outlier list.

    在一個群集中分配核心點及其密度可達到的點,并將其從異常值列表中刪除。

4. Check for the density connectivity of the clusters. If so, merge the clusters into one.

4.檢查集群的密度連接。 如果是這樣,請將群集合并為一個。

5. For points remaining in the outlier list, identify them as noise.

5.對于剩余在異常值列表中的點,將其標識為噪聲。

Image for post
Example of cluster formation by DBSCAN
DBSCAN形成集群的示例

The time complexity of the DBSCAN lies between O(n log n) (best case scenario) to O(n2) (worst case), depending upon the indexing structure, ε, and m values chosen.

-O之間的DBSCAN位于(N log n)的 (最好的情況下)至O(N2)(最壞情況),取決于所選擇索引結構 ,ε,和m值的時間復雜度。

Python code:

Python代碼:

As a part of the scikit-learn module, below is the code of DBSCAN with some of the hyperparameters set to the default value:

作為scikit-learn模塊的一部分,以下是DBSCAN的代碼,其中一些超參數設置為默認值:

class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean')

eps is the epsilon value as already explained.

如前所述,eps是epsilon值。

min_samples is the Min points value.

min_samples最低分值。

metric is the process by which distance is calculated in the algorithm. By default, it is Euclidean distance, other than that it can be any user-defined distance function or a ‘precomputed’ distance matrix.

metric是在算法中計算距離的過程。 默認情況下,它是歐幾里得距離,除了可以是任何用戶定義的距離函數或“預計算”距離矩陣。

There are some advanced hyperparameters which will be best discussed in future projects.

有一些高級超參數將在以后的項目中進行最佳討論。

Drawbacks:

缺點:

Image for post
  1. For the large differences in densities and unequal density spacing between clusters, DBSCAN shows unimpressive results at times. At times, the dataset may require different ε and ‘Min points’ value, which is not possible with DBSCAN.

    對于群集之間的密度差異和不相等的密度間距,DBSCAN有時會顯示令人印象深刻的結果。 有時,數據集可能需要不同的ε和“最小點”值,而DBSCAN則不可能。
  2. DBSCAN sometimes shows different results on each run for the same dataset. Although rarely so, but it has been termed as non-deterministic.

    對于同一數據集,DBSCAN有時每次運行都會顯示不同的結果。 雖然很少這樣,但是它被稱為不確定性的。
  3. DBSCAN faces the curse of dimensionality. It doesn’t work as expected in high dimensional datasets.

    DBSCAN面臨著維度的詛咒。 在高維數據集中無法正常工作。

To overcome these, other advanced algorithms have been designed which will be discussed in future blogs.

為了克服這些問題,已經設計了其他高級算法,這些算法將在以后的博客中討論。

Stay tuned. Happy learning :)

敬請關注。 快樂學習:)

翻譯自: https://medium.com/devtorq/dbscan-a-walkthrough-of-a-density-based-clustering-method-b5e74ca9fcfa

密度聚類dbscan

本文來自互聯網用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權,不承擔相關法律責任。
如若轉載,請注明出處:http://www.pswp.cn/news/391871.shtml
繁體地址,請注明出處:http://hk.pswp.cn/news/391871.shtml
英文地址,請注明出處:http://en.pswp.cn/news/391871.shtml

如若內容造成侵權/違法違規/事實不符,請聯系多彩編程網進行投訴反饋email:809451989@qq.com,一經查實,立即刪除!

相關文章

node aws 內存溢出_在AWS Elastic Beanstalk上運行生產Node應用程序的現實

node aws 內存溢出by Jared Nutt賈里德努特(Jared Nutt) 在AWS Elastic Beanstalk上運行生產Node應用程序的現實 (The reality of running a production Node app on AWS Elastic Beanstalk) 從在AWS的ELB平臺上運行生產Node應用程序兩年的經驗教訓 (Lessons learned from 2 y…

Day2-數據類型

數據類型與內置方法 數據類型 數字字符串列表字典元組集合字符串 1.用途 用來描述某個物體的特征:姓名,性別,愛好等 2.定義方式 變量名 字符串 如:name huazai 3.常用操作和內置方法 1.按索引取值:(只能取…

嵌套路由

父組件不能用精準匹配,否則只組件路由無法展示 轉載于:https://www.cnblogs.com/dianzan/p/11308146.html

leetcode 992. K 個不同整數的子數組(滑動窗口)

給定一個正整數數組 A,如果 A 的某個子數組中不同整數的個數恰好為 K,則稱 A 的這個連續、不一定獨立的子數組為好子數組。 (例如,[1,2,3,1,2] 中有 3 個不同的整數:1,2,以及 3。) …

從完整的新手到通過TensorFlow開發人員證書考試

I recently graduated with a bachelor’s degree in Civil Engineering and was all set to start with a Master’s degree in Transportation Engineering this fall. Unfortunately, my plans got pushed to the winter term because of COVID-19. So as of January this y…

微信開發者平臺如何編寫代碼_編寫超級清晰易讀的代碼的初級開發者指南

微信開發者平臺如何編寫代碼Writing code is one thing, but writing clean, readable code is another thing. But what is “clean code?” I’ve created this short clean code for beginners guide to help you on your way to mastering and understanding the art of c…

【轉】PHP面試題總結

PHP面試總結 PHP基礎1:變量的傳值與引用。 2:變量的類型轉換和判斷類型方法。 3:php運算符優先級,一般是寫出運算符的運算結果。 4:PHP中函數傳參,閉包,判斷輸出的echo,print是不是函…

Winform控件WebBrowser與JS腳本交互

1)在c#中調用js函數 如果要傳值,則可以定義object[]數組。 具體方法如下例子: 首先在js中定義被c#調用的方法: function Messageaa(message) { alert(message); } 在c#調用js方法Messageaa private void button1_Click(object …

從零開始擼一個Kotlin Demo

####前言 自從google將kotlin作為親兒子后就想用它擼一管app玩玩,由于工作原因一直沒時間下手,直到項目上線后才有了空余時間,期間又由于各種各樣煩人的事斷了一個月,現在終于開發完成項目分為服務器和客戶端;服務器用…

移動平均線ma分析_使用動態移動平均線構建交互式庫存量和價格分析圖

移動平均線ma分析I decided to code out my own stock tracking chart despite a wide array of freely available tools that serve the same purpose. Why? Knowledge gain, it’s fun, and because I recognize that a simple project can generate many new ideas. Even t…

敏捷開發創始人_開發人員和技術創始人如何將他們的想法轉化為UI設計

敏捷開發創始人by Simon McCade西蒙麥卡德(Simon McCade) 開發人員和技術創始人如何將他們的想法轉化為UI設計 (How developers and tech founders can turn their ideas into UI design) Discover how to turn a great idea for a product or service into a beautiful UI de…

在ubuntu怎樣修改默認的編碼格式

ubuntu修改系統默認編碼的方法是:1. 參考 /usr/share/i18n/SUPPORTED 編輯/var/lib/locales/supported.d/* gedit /var/lib/locales/supported.d/localgedit /var/lib/locales/supported.d/zh-hans如:more /var/lib/locales/supported.d/localzh_CN GB18…

JAVA中PO,BO,VO,DTO,POJO,Entity

https://my.oschina.net/liaodo/blog/2988512轉載于:https://www.cnblogs.com/dianzan/p/11311217.html

【Lolttery】項目開發日志 (三)維護好一個項目好難

項目的各種配置開始出現混亂的現象了 在只有一個人開發的情況下也開始感受到維護一個項目的難度。 之前明明還好用的東西,轉眼就各種莫名其妙的報錯,完全不知道為什么。 今天一天的工作基本上就是整理各種配置。 再加上之前數據庫設計出現了問題&#xf…

leetcode 567. 字符串的排列(滑動窗口)

給定兩個字符串 s1 和 s2,寫一個函數來判斷 s2 是否包含 s1 的排列。 換句話說,第一個字符串的排列之一是第二個字符串的子串。 示例1: 輸入: s1 “ab” s2 “eidbaooo” 輸出: True 解釋: s2 包含 s1 的排列之一 (“ba”). 解題思路 和s1每個字符…

靜態變數和非靜態變數_統計資料:了解變數

靜態變數和非靜態變數Statistics 101: Understanding the different type of variables.統計101:了解變量的不同類型。 As we enter the latter part of the year 2020, it is safe to say that companies utilize data to assist in making business decisions. F…

代碼走查和代碼審查_如何避免代碼審查陷阱降低生產率

代碼走查和代碼審查Code reviewing is an engineering practice used by many high performing teams. And even though this software practice has many advantages, teams doing code reviews also encounter quite a few code review pitfalls.代碼審查是許多高性能團隊使用…

Zabbix3.2安裝

一、環境 OS: CentOS7.0.1406 Zabbix版本: Zabbix-3.2 下載地址: http://repo.zabbix.com/zabbix/3.2/rhel/7/x86_64/zabbix-release-3.2-1.el7.noarch.rpm MySQL版本: 5.6.37 MySQL: http://repo.mysql.com/mysql-community-release-el7-5.noarch.r…

Warensoft Unity3D通信庫使用向導4-SQL SERVER訪問組件使用說明

Warensoft Unity3D通信庫使用向導4-SQL SERVER訪問組件使用說明 (作者:warensoft,有問題請聯系warensoft163.com) 在前一節《warensoft unity3d通信庫使用向導3-建立WarensoftDataService》中已經說明如何配置Warensoft Data Service,從本節開始,將說明…

01-gt;選中UITableViewCell后,Cell中的UILabel的背景顏色變成透明色

解決方案有兩種方法一 -> 新建一個UILabel類, 繼承UILabel, 然后重寫 setBackgroundColor: 方法, 在這個方法里不做任何操作, 讓UILabel的backgroundColor不發生改變.寫在最后, 感謝參考的出處:不是謝志偉StackOverflow: UITableViewCell makes labels background clear whe…