Detecting Anomalies Using Machine Learning

What is Anomaly Detection?

Anomaly detection is a classic, frequently explored problem in machine learning. An anomaly is any unusual sequence or pattern inside a large corpus of data. Left unresolved, anomalies usually cause unexpected and complex errors or inefficiencies. Searching for them by hand is feasible when the corpus is relatively small, but that approach becomes unreasonable once the corpus grows to an enormous size. Finding a grammatical mistake in a 200-word paragraph is pretty easy; finding every grammatical error in a 5,000-page encyclopedia is far more difficult for a human. Fortunately, with the help of machine learning, we can solve this problem much more easily (kind of).

First of all, what is machine learning? Machine learning essentially uses statistics to model how a system (or corpus) normally behaves, learning that model from a training set (the background data set). Afterwards, we can compare a system that may be behaving abnormally (the target data set) against our model of normal behavior and try to uncover anomalies in the target. Although the main idea sounds easy and intuitive, the process comes with many complexities, such as finding a background data set that is representative of the whole population, or distributing the calculations across different machines for large data sets. These are difficult obstacles that software engineers have to tackle before they have a polished machine learning model, but I will not discuss them here. Instead, I will focus on applying machine learning to find anomalies.
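As a rough illustration of the background/target idea, here is a minimal sketch. It assumes the "model" is nothing more than the mean and standard deviation of a numeric background set, and that a target value counts as anomalous when its z-score exceeds a chosen threshold. The function names, the sample data, and the threshold of 3.0 are all hypothetical choices for this example, not something the approach above prescribes.

```python
from statistics import mean, stdev

def fit_background(background):
    """Summarise normal behaviour as the mean and sample standard deviation."""
    return mean(background), stdev(background)

def find_anomalies(target, model, threshold=3.0):
    """Return (index, value) pairs whose z-score against the model exceeds threshold.

    Assumes the background set was not constant, so sigma is non-zero.
    """
    mu, sigma = model
    return [(i, x) for i, x in enumerate(target) if abs(x - mu) / sigma > threshold]

# Hypothetical data: readings from a system that normally hovers around 10.
background = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0]
target = [10.1, 9.9, 14.5, 10.0, 5.2]

model = fit_background(background)
anomalies = find_anomalies(target, model)
print(anomalies)
```

A real model would usually capture far more than a single mean and spread, but the shape is the same: learn from the background, then score the target against what was learned.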

Types of Anomaly Detection Problems

Structured Anomalies in a Known Corpus of Data

There are four main types of anomaly detection problems. The first (and easiest) type is detecting structured anomalies in a known corpus. These are problems where you know what structure the anomalies will have and you know the format of the corpus. As a simplified analogy, suppose the corpus is a sequence of strictly increasing numbers and the task is to detect any number that is smaller than the one before it. Here we know the pattern of normal behavior (strictly increasing numbers) and we are looking for a known anomaly (a decrease between adjacent numbers). This problem is relatively easy because we have a clear structure to compare against, so we can measure and know for sure when something is an anomaly. In this case, it is relatively easy to build a high-performing machine learning algorithm with negligible false negatives.
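The increasing-numbers analogy is simple enough to write down directly. The sketch below needs no learning at all, which is the point: when both the corpus format and the anomaly structure are known, detection reduces to a direct check. The function name and the sample corpus are made up for illustration.

```python
def find_decreases(numbers):
    """Return the indices of values that are smaller than the value before them."""
    return [i for i in range(1, len(numbers)) if numbers[i] < numbers[i - 1]]

# Hypothetical corpus: meant to be strictly increasing, with two decreases.
corpus = [1, 3, 5, 4, 6, 9, 8, 10]
print(find_decreases(corpus))  # [3, 6]
```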

Structured Anomalies in an Unknown Corpus of Data

The second type is detecting structured anomalies in an unknown corpus. These problems are harder than the first type because we now have to work out how to parse and evaluate the corpus before we can uncover the anomalies. The extra difficulty is modest, though: we still know the structure of the anomalies, so once the parsing problem is solved, this type becomes identical to the previous one. However, because the target corpus has an unknown structure, there will most likely be more false negatives than in the first type.

Unstructured Anomalies in a Known Corpus of Data

The third type is detecting unstructured anomalies in a known corpus. Again, this type of problem is more complex than the previous one. Although we have a defined structure to build our parsing algorithm on, the anomalies are unstructured, which means we have to truly understand the heuristics of the background corpus in order to evaluate the target corpus against them. In this case, we start to see false positives in addition to false negatives, because the program has no proper way to check, without human interaction, whether the anomalies it detects are in fact true positives.

Unstructured Anomalies in an Unknown Corpus of Data

The last type, detecting unstructured anomalies in an unknown corpus, is the toughest anomaly detection problem and is still being researched and improved today. In this case, not only do we have to understand the heuristics of the corpus, we also have to create many measures based on those heuristics to evaluate how anomalous each segment of the target corpus is. For each of these measures, we need to set a threshold above which we classify a segment as an anomaly. Each threshold has its own trade-offs, and finding the optimal combination requires evaluating performance in a multi-dimensional space, with each dimension representing one of the thresholds. Additionally, after exploring this space, one might realize that the machine learning model did not properly represent the heuristics of the background corpus, and have to start over with another way to quantify or identify the patterns of the corpus. Because of this performance feedback loop, the whole process can be really complex and frustrating. This type of anomaly detection, although very difficult, can potentially yield amazing results.
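To make the multi-dimensional threshold search concrete, here is a minimal sketch with just two measures. It assumes each segment has already been scored on two hypothetical measures, and that a small hand-labelled sample is available to evaluate against (without some labels, there is nothing to tune toward). Choosing F1 as the quantity to optimise, flagging a segment when either score exceeds its threshold, and the grid values are all assumptions made for this example.

```python
from itertools import product

def evaluate(segments, labels, t_a, t_b):
    """Flag a segment when either score exceeds its threshold; return (precision, recall)."""
    flagged = [a > t_a or b > t_b for a, b in segments]
    tp = sum(f and l for f, l in zip(flagged, labels))      # true positives
    fp = sum(f and not l for f, l in zip(flagged, labels))  # false positives
    fn = sum(l and not f for f, l in zip(flagged, labels))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def best_thresholds(segments, labels, grid_a, grid_b):
    """Try every threshold pair on the grid and return the pair with the highest F1."""
    def f1(pair):
        p, r = evaluate(segments, labels, *pair)
        return 2 * p * r / (p + r) if p + r else 0.0
    return max(product(grid_a, grid_b), key=f1)

# Hypothetical scores (measure A, measure B) per segment, with hand labels.
segments = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.3, 0.3), (0.7, 0.9), (0.4, 0.1)]
labels = [False, True, True, False, True, False]

grid = [0.2, 0.5, 0.8]
best = best_thresholds(segments, labels, grid, grid)
print(best, evaluate(segments, labels, *best))
```

With two measures the grid has nine points; with ten measures and the same three candidate values per threshold it has 59,049, which is one reason this type of problem gets expensive quickly.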

Conclusion

Understandably, the more of the structure of the anomalies and corpus we can ignore, the harder the algorithm is to create. The more specific we are about the structure of the anomalies and the corpus, the easier the machine learning algorithm is to build. The less structure we assume, the wider the range of problems the algorithm can be applied to. However, accuracy and precision also become issues as the structure of the anomalies and corpus becomes more vague. In an ideal world, if we made a super generic and accurate machine learning algorithm and tuned it perfectly to fit every situation, we would be able to apply it to any problem in the world. In health and medicine, we could detect problematic sub-sequences in genomes to catch illnesses like cancer long before they become an issue. In technology, we could apply the algorithm to a real-time logging system and uncover hackers or malicious activity the instant it occurs. There are many other fields that anomaly detection can be applied to, and if we can one day perfect it, we can solve many issues that are stumping scientists, engineers, and researchers today.

Translated from: https://towardsdatascience.com/detecting-anomalies-using-machine-learning-e3495f79718
