Anomaly Detection with Machine Learning
What is Anomaly Detection?
Anomaly detection is a classic, frequently studied problem in machine learning. An anomaly is any unusual sequence or pattern inside a large corpus of data. Left unresolved, anomalies tend to cause unexpected, complex errors or inefficiencies. Searching for them by hand is feasible when the corpus is small, but becomes unreasonable once the corpus grows to an enormous size. For example, finding a grammatical mistake in a 200-word paragraph is easy; finding every grammatical error in a 5000-page encyclopedia is far harder for a human. Fortunately, machine learning makes this problem much more tractable (kind of).
First of all, what is machine learning? Machine learning is essentially the use of statistics to model how a system (or corpus) normally behaves, with the model trained on a training set (the background data set). We can then compare the system under inspection (the target data set) against that model of normal behavior and try to uncover anomalies in the target. The main idea sounds easy and intuitive, but the process carries many complexities, such as finding a background data set that is representative of the whole population, or distributing the computation across machines for large data sets. These are difficult obstacles that software engineers must tackle before they have a polished machine learning model, but I will not discuss them here. Instead, I will focus on applying machine learning to find anomalies.
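The background/target comparison described above can be sketched in a few lines. This is only an illustration under strong assumptions: the "model" is just a mean and standard deviation, the data is made up, and the names `fit_background`, `find_anomalies`, and `z_threshold` are hypothetical rather than anything from the original article.

```python
from statistics import mean, stdev


def fit_background(values):
    """Summarize normal behavior as (mean, sample standard deviation)."""
    return mean(values), stdev(values)


def find_anomalies(target, model, z_threshold=3.0):
    """Return (index, value) pairs in the target that lie far from the model."""
    mu, sigma = model
    if sigma == 0:
        raise ValueError("background has no variation; z-scores are undefined")
    return [
        (i, x)
        for i, x in enumerate(target)
        if abs(x - mu) / sigma > z_threshold
    ]


# Hypothetical data: the background set describes normal behavior,
# the target set is what we want to inspect.
background = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0]
target = [10.0, 9.9, 14.5, 10.1]

model = fit_background(background)
anomalies = find_anomalies(target, model)
print(anomalies)  # [(2, 14.5)]
```

A real system would use a richer model than two summary statistics, but the shape is the same: fit on the background, then score the target against the fit.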
Types of Anomaly Detection Problems
Structured Anomalies in a Known Corpus of Data
There are four main types of anomaly detection problems. The first (and easiest) type is detecting structured anomalies in a known corpus. In these problems you know what structure the anomalies will have and you know the format of the corpus. As a simplified analogy, suppose the corpus is a sequence of strictly increasing numbers and the task is to detect any number that is smaller than the one before it. Here we know the pattern of normal behavior (strictly increasing numbers) and we are looking for a known anomaly (a decrease between adjacent numbers). This problem is relatively easy because we have a clear structure to compare against, so we can measure and know for certain when something is an anomaly. In this case, it is relatively easy to build a high-performing machine learning algorithm with negligible false negatives.
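The analogy above is simple enough to write out directly. The sketch below is a plain rule check rather than a learned model, which is exactly why this type is the easy one; the function name `find_decreases` and the sample sequence are made up for illustration.

```python
def find_decreases(numbers):
    """Return (index, previous, current) wherever a value drops below its predecessor."""
    return [
        (i, numbers[i - 1], numbers[i])
        for i in range(1, len(numbers))
        if numbers[i] < numbers[i - 1]
    ]


# Hypothetical corpus: mostly strictly increasing, with two known-shape anomalies.
corpus = [1, 3, 5, 4, 6, 8, 7, 9]
decreases = find_decreases(corpus)
print(decreases)  # [(3, 5, 4), (6, 8, 7)]
```

Because both the corpus format and the anomaly shape are known, every decrease is caught and nothing else is flagged.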
Structured Anomalies in an Unknown Corpus of Data
The second type is detecting a structured anomaly in an unknown corpus. These problems are harder than the first type because we now have to work out how to parse and evaluate the corpus before we can uncover the anomalies. The added difficulty is modest, though: we still know the structure of the anomalies, so once the parsing problem is solved, this type reduces to the previous one. However, because the target corpus has an unknown structure, there will most likely be more false negatives than in the first type.
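To make the "parse first, then apply the known rule" idea concrete, here is a small sketch that continues the increasing-numbers analogy. It assumes, purely for illustration, that the corpus arrives as free-form text and that a regular expression for integers is a good enough parser; the raw string and the helper names are hypothetical.

```python
import re


def parse_numbers(raw_text):
    """Best-effort parser: pull out every integer written in digits."""
    return [int(token) for token in re.findall(r"-?\d+", raw_text)]


def find_decreases(numbers):
    """Return (index, previous, current) wherever a value drops below its predecessor."""
    return [
        (i, numbers[i - 1], numbers[i])
        for i in range(1, len(numbers))
        if numbers[i] < numbers[i - 1]
    ]


# Hypothetical corpus with no agreed format. Note the value spelled "four".
raw = "1, 3, 5 then four; 6 | 8 / 7 ... 9"

parsed = parse_numbers(raw)
found = find_decreases(parsed)
print(parsed)  # [1, 3, 5, 6, 8, 7, 9]
print(found)   # [(5, 8, 7)]
```

The parser never sees the word "four", so the drop from 5 to four goes unreported. That is a false negative caused by the parsing step alone, not by the detection rule.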
Unstructured Anomalies in a Known Corpus of Data
The third type is detecting an unstructured anomaly in a known corpus. Again, this type of problem is more complex than the previous one. We have a defined structure on which to build our parsing algorithm, but the anomalies are unstructured, which means we have to truly understand the heuristics of the background corpus in order to evaluate the target corpus against them. In this case, we start to see false positives in addition to false negatives, because the program has no proper way to check, without human involvement, whether the anomalies it detects are in fact true positives.
Unstructured Anomalies in an Unknown Corpus of Data
The remaining type is, of course, detecting unstructured anomalies in an unknown corpus. It is the toughest anomaly detection problem and is still being researched and improved today. Here we not only have to understand the heuristics of the corpus, we also have to create many measures based on those heuristics to evaluate how anomalous each segment of the target corpus is. For each of these measures, we need to set a threshold above which a segment is classified as an anomaly. Each threshold has its own trade-offs, and finding the optimal thresholds requires evaluating performance across a multi-dimensional space in which each dimension represents one threshold. Additionally, after exploring this space, one might realize that the machine learning model does not properly represent the heuristics of the background corpus, and have to start over with another way to quantify or identify the patterns of the corpus. The whole process can be complex and frustrating because of this performance feedback loop. This type of anomaly detection, although very difficult, can potentially yield amazing results.
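The search over thresholds can be pictured as a grid sweep, with one grid axis per measure. The sketch below is a toy under several assumptions that are not in the original: there are only two measures, a segment is flagged when either measure exceeds its threshold, a small hand-labeled sample exists to score against, and F1 is the chosen performance metric. All data and names are hypothetical.

```python
from itertools import product


def f1_score(predicted, actual):
    """F1 from two equal-length lists of booleans (0.0 when nothing is correctly flagged)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_thresholds(scores, labels, grid_a, grid_b):
    """Try every threshold pair and keep the first one with the highest F1."""
    best_pair, best_f1 = None, -1.0
    for t_a, t_b in product(grid_a, grid_b):
        # Assumed decision rule: flag a segment if either measure is too high.
        predicted = [a > t_a or b > t_b for a, b in scores]
        score = f1_score(predicted, labels)
        if score > best_f1:
            best_pair, best_f1 = (t_a, t_b), score
    return best_pair, best_f1


# Hypothetical per-segment measures (measure_a, measure_b) and hand labels.
scores = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.3, 0.3), (0.4, 0.1), (0.7, 0.6)]
labels = [False, True, True, False, False, True]
grid = [0.25, 0.5, 0.75]

best = best_thresholds(scores, labels, grid, grid)
print(best)  # ((0.5, 0.5), 1.0)
```

With two measures and three candidate values each, the sweep covers 9 combinations. The count grows multiplicatively with every added measure, which is one reason the tuning loop described above becomes expensive.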
Conclusion
Understandably, the more we try to ignore the structure of the anomalies and the corpus, the more difficult the algorithm is to create. The more specific we are about the structure of the anomalies and the corpus, the easier the machine learning algorithm is to build. The less structure we assume, the wider the range of problems the algorithm can be applied to. However, accuracy and precision also become harder to maintain as the structure of the anomalies and corpus becomes more vague. In an ideal world, if we built a highly generic and accurate machine learning algorithm and tuned it perfectly for every situation, we would be able to apply it to any problem. In health and medicine, we could detect problematic sub-sequences in genomes and catch illnesses like cancer well before they become an issue. In technology, we could apply the algorithm to a real-time logging system and uncover hackers or malicious activity the instant it occurs. Anomaly detection can be applied to many other fields, and if we can one day perfect it, we can solve many issues that are stumping scientists, engineers, and researchers today.
Translated from: https://towardsdatascience.com/detecting-anomalies-using-machine-learning-e3495f79718