Unsupervised Anomaly Detection on Time Series
Understanding the normal behaviour of a flow on the time axis and detecting anomalous situations is one of the prominent fields in data-driven studies. These studies are mostly conducted in an unsupervised manner, since labelling data in real-life projects is a very tough process: if you don't already have label information, it requires a deep retrospective analysis. Keep in mind that the terms outlier detection and anomaly detection are used interchangeably most of the time.
There is no magical silver bullet that performs well in all anomaly detection use cases. In this writing, I touch on the fundamental methodologies mainly utilized for detecting anomalies on time series in an unsupervised way, and mention their basic working principles. In this sense, this writing can be thought of as an overview of anomaly detection on time series, including real-life experiences.
Probability-Based Approaches
Using the z-score is one of the most straightforward methodologies. The z-score basically stands for the number of standard deviations by which a sample value lies below or above the mean of the distribution. It assumes that each feature fits a normal distribution, and calculating the z-score of each feature of a sample gives insight for detecting anomalies. Samples with many features whose values are located far from the means are likely to be anomalies.
While estimating the z-scores, you should take into account the several factors that affect the pattern in order to get more robust inferences. Let me give you an example: you aim to detect anomalies in traffic values on devices in the telco domain. Hour information, weekday information, and device information (if multiple devices exist in the dataset) are likely to shape the pattern of traffic values. For this reason, in this example the z-score should be estimated separately per device, hour, and weekday. For instance, if you expect 2.5 Mbps average traffic on device A at 8 p.m. at weekends, you should take that value into consideration while making a decision for the corresponding device and time information.
One of the drawbacks of this approach is that it assumes features fit a normal distribution, which is not true all the time. Another is that it ignores the correlations between features in the above-mentioned solution. One important point is that z-scores can be used as inputs for other anomaly detection models as well.
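As a sketch of the grouped z-score idea above, the snippet below computes a per-device, per-hour z-score with pandas. The column names, the toy traffic values, and the 2.0 cut-off are illustrative assumptions, not fixed rules.

```python
import pandas as pd

# Toy traffic data: device A has one spike, device B is stable.
df = pd.DataFrame({
    "device":  ["A"] * 8 + ["B"] * 4,
    "hour":    [20] * 12,
    "traffic": [2.4, 2.5, 2.6, 2.5, 2.4, 2.6, 2.5, 12.0,
                1.0, 1.1, 0.9, 1.0],
})

# Estimate the z-score within each (device, hour) context, so every sample
# is judged against the pattern of its own device and time slot.
grp = df.groupby(["device", "hour"])["traffic"]
df["zscore"] = (df["traffic"] - grp.transform("mean")) / grp.transform("std")
df["anomaly"] = df["zscore"].abs() > 2.0

print(df[df["anomaly"]])  # only the 12.0 spike on device A is flagged
```

Because the z-score is computed per group, device B's much lower traffic level does not distort the judgment of device A, and vice versa.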
A quartiles-based solution implements a very similar idea to the z-score, except that it takes into account the median instead of the mean, in a simple manner. Sometimes it achieves better results than the z-score, depending on the distribution of the data.
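A minimal sketch of the quartile idea, using the common "Tukey's fences" rule; the 1.5 multiplier is the conventional choice, not a mandatory one.

```python
import numpy as np

values = np.array([2.3, 2.5, 2.4, 2.6, 2.5, 2.7, 2.4, 9.8])

# Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. Unlike a mean-based
# rule, the fences are built from quartiles, so a single extreme value
# barely moves them.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers)  # → [9.8]
```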
Elliptic Envelope is another option for outlier detection; it fits a multivariate Gaussian distribution to the data. However, it might not perform well with high-dimensional data.
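A short sketch with scikit-learn's EllipticEnvelope; the contamination value is a tuning assumption about the expected anomaly fraction, and the synthetic data is illustrative.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(42)
# Two correlated features plus one clearly off-distribution point.
X = rng.multivariate_normal([10.0, 5.0], [[1.0, 0.6], [0.6, 1.0]], size=200)
X = np.vstack([X, [[25.0, -10.0]]])

model = EllipticEnvelope(contamination=0.01, random_state=42)
labels = model.fit_predict(X)  # -1 = outlier, 1 = inlier

print(np.where(labels == -1)[0])  # the appended point is among the flagged
```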
Forecasting-Based Approaches
In this methodology, a prediction is performed with a forecasting model for the next time period, and if the forecasted value is outside the confidence interval, the sample is flagged as an anomaly. LSTM and ARIMA models can be used as the forecasting model. The advantage of these methods is that they perform well on time series and, most of the time, can be applied directly to a time series without a feature engineering step. On the other hand, estimating the confidence interval is not a trivial and easy task. Furthermore, the accuracy of the forecasting model directly affects the success of anomaly detection. The good news is that you are able to evaluate the accuracy of the forecasting model in a supervised manner, even if you are performing anomaly detection without any label info.
Prophet is also worth a look; it is basically a forecasting algorithm designed for time series and developed by Facebook, but I have encountered many implementations of this algorithm in anomaly detection use cases.
Neural Network-Based Approaches
An autoencoder is an unsupervised type of neural network, mainly used for feature extraction and dimension reduction. At the same time, it is a good option for anomaly detection problems. An autoencoder consists of encoding and decoding parts. In the encoding part, the main features representing the patterns in the data are extracted, and then each sample is reconstructed in the decoding part. The reconstruction error will be minimal for normal samples. On the other hand, the model is not able to reconstruct a sample that behaves abnormally, resulting in a high reconstruction error. So, basically, the higher the reconstruction error a sample has, the more likely it is to be an anomaly.
Autoencoders are very convenient for time series, so they can also be considered among the preferred alternatives for anomaly detection on time series. Note that the layers of an autoencoder can be composed of LSTMs. Thus, dependencies in sequential data, just like in time series, can be captured.
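As a minimal sketch of the reconstruction-error idea, a linear one-unit bottleneck network stands in for a real autoencoder here (a production model would use deeper or LSTM layers in Keras or PyTorch); scikit-learn's MLPRegressor is simply trained to reproduce its own input. The synthetic data is an illustrative assumption.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
# Normal data lives close to a 1-D line inside 3-D space.
t = rng.normal(size=(200, 1))
X_normal = np.hstack([t, 2 * t, -t]) + rng.normal(scale=0.05, size=(200, 3))

# Train the network to reconstruct its input through a 1-unit bottleneck:
# encoder = input -> hidden layer, decoder = hidden layer -> output.
ae = MLPRegressor(hidden_layer_sizes=(1,), activation="identity",
                  solver="lbfgs", max_iter=2000, random_state=0)
ae.fit(X_normal, X_normal)

# A sample that breaks the learned pattern reconstructs poorly.
X_test = np.vstack([X_normal[:5], [[1.0, -2.0, 3.0]]])
errors = np.linalg.norm(X_test - ae.predict(X_test), axis=1)
print(errors.round(3))  # the last (abnormal) error is far larger
```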
Self-Organizing Maps (SOM) is another unsupervised neural network-based implementation, and it has a simpler working principle compared to other neural network models. Although it does not have widespread usage in anomaly detection use cases, it is good to keep in mind that it is also an alternative.
Clustering-Based Approaches
The idea behind the usage of clustering in anomaly detection is that outliers don't belong to any cluster, or form their own clusters. k-means is one of the best-known clustering algorithms and is easy to implement. However, it brings some limitations, like picking an appropriate k value. Moreover, it forms spherical clusters, which is not correct for all cases. Another drawback is that it is not able to supply a probability while assigning samples to clusters, especially considering that clusters can overlap in some situations.
Gaussian Mixture Models (GMM) focus on the above-mentioned weaknesses of k-means and present a probabilistic approach. A GMM attempts to find a mixture of a finite number of Gaussian distributions inside the dataset.
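A sketch of GMM-based scoring with scikit-learn: samples with the lowest likelihood under the fitted mixture are treated as anomaly candidates. The two synthetic regimes and the 1% score cut-off are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Two normal operating regimes plus one abnormal point.
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2)),
    [[10.0, -5.0]],
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_dens = gmm.score_samples(X)          # per-sample log-likelihood
threshold = np.percentile(log_dens, 1)   # lowest 1% = anomaly candidates

print(np.where(log_dens < threshold)[0])  # includes the abnormal point
```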
DBSCAN is a density-based clustering algorithm. It determines the core points in the dataset, those with at least min_samples neighbors within epsilon distance, and creates clusters from these samples. After that, it finds all points which are density-reachable (within epsilon distance) from any sample in the cluster and adds them to the cluster. Then, iteratively, it performs the same procedure for the newly added samples and extends the cluster. DBSCAN determines the number of clusters by itself, and outlier samples are assigned the label -1. In other words, it directly serves anomaly detection. Note that it might suffer from performance issues with large datasets.
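DBSCAN's noise label can be used directly as an anomaly flag, as sketched below; eps and min_samples are data-dependent tuning choices, and the data is synthetic.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A dense cluster of normal points plus two isolated samples.
X = np.vstack([
    np.random.RandomState(1).normal(loc=0.0, scale=0.3, size=(50, 2)),
    [[5.0, 5.0], [-6.0, 4.0]],
])

# eps = neighborhood radius, min_samples = density requirement.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

print(np.where(labels == -1)[0])  # noise points are labeled -1
```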
Proximity-Based Approaches
The first algorithm that comes to mind is the k-nearest neighbors (k-NN) algorithm. The simple logic behind it is that outliers are far away from the rest of the samples in the data plane. The distances to the nearest neighbors of all samples are estimated, and the samples located far from the other samples can be flagged as outliers. k-NN can use different distance metrics like Euclidean, Manhattan, Minkowski, Hamming distance, etc.
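The k-NN idea can be sketched by using the distance to the k-th nearest neighbor as an outlier score; k = 5 and the synthetic data are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(7)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)), [[8.0, 8.0]]])

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own neighbor
dists, _ = nn.kneighbors(X)
scores = dists[:, -1]                            # distance to the k-th neighbor

print(np.argmax(scores))  # index of the most isolated sample
```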
Another alternative algorithm is the Local Outlier Factor (LOF), which identifies local outliers with respect to local neighbors rather than the global data distribution. It utilizes a metric named local reachability density (lrd) in order to represent the density level of each point. The LOF of a sample is simply the ratio of the average lrd values of the sample's neighbors to the lrd value of the sample itself. If the density of a point is much smaller than the average density of its neighbors, then it is likely to be an anomaly.
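scikit-learn implements this as LocalOutlierFactor. The example below, with illustrative synthetic data, places an isolated point near a dense region, exactly the setting where comparing local densities pays off.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(3)
X = np.vstack([
    rng.normal(0.0, 0.2, size=(80, 2)),    # dense region
    rng.normal(10.0, 2.0, size=(80, 2)),   # sparse region
    [[5.0, -5.0]],                         # isolated point near the dense region
])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)   # -1 = outlier

print(np.where(labels == -1)[0])
```

The isolated point's own density is far below that of its (dense-region) neighbors, so its LOF is high; meanwhile points in the sparse region are judged against other sparse points and stay unflagged.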
Tree-Based Approaches
Isolation Forest is a tree-based, very effective algorithm for detecting anomalies. It builds multiple trees. To build a tree, it randomly picks a feature and a split value within the minimum and maximum values of the corresponding feature. This procedure is applied to all samples in the dataset. Finally, a tree ensemble is composed by averaging all trees in the forest.
The idea behind Isolation Forest is that outliers are easy to separate from the rest of the samples in the dataset. For this reason, for abnormal samples we expect shorter paths from the root to a leaf node in a tree (the number of splits required to isolate the sample) compared to the rest of the samples in the dataset.
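A minimal scikit-learn sketch of this; the contamination value is again an assumption about how many anomalies to report, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)), [[7.0, -7.0]]])

forest = IsolationForest(n_estimators=100, contamination=0.005, random_state=0)
labels = forest.fit_predict(X)     # -1 = anomaly
scores = forest.score_samples(X)   # lower = shorter average path = more anomalous

print(np.where(labels == -1)[0])   # the extreme point is flagged
```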
Extended Isolation Forest comes with an improvement to the splitting process of Isolation Forest. In Isolation Forest, splitting is performed parallel to the axes, in other words in a horizontal or vertical manner, which results in too many redundant regions in the domain and, similarly, in the construction of many unnecessary trees. Extended Isolation Forest remedies these shortcomings by allowing the splitting process to happen in every direction: instead of selecting a random feature with a random splitting value, it selects a random normal vector along with a random intercept point.
Dimension Reduction-Based Approaches
Principal Component Analysis (PCA) is mainly used as a dimension reduction method for high-dimensional data. In a basic manner, it helps to cover most of the variance in the data with a smaller dimension by extracting the eigenvectors that have the largest eigenvalues. Therefore, it is able to keep most of the information in the data with a much smaller dimension.
When PCA is used for anomaly detection, it follows a very similar approach to autoencoders. First, it decomposes the data into a smaller dimension, and then it reconstructs the data from the decomposed version again. Abnormal samples tend to have a high reconstruction error since they behave differently from the other observations in the data, so it is difficult to obtain the same observation from the decomposed version. PCA can be a good option for multivariate anomaly detection scenarios.
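The decompose-then-reconstruct loop can be sketched directly with scikit-learn's PCA; the synthetic data and n_components=1 are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# 3-D data that really lives on a 1-D line, plus one off-line point.
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t, -t]) + rng.normal(scale=0.05, size=(200, 3))
X = np.vstack([X, [[1.0, -2.0, 3.0]]])   # breaks the linear pattern

pca = PCA(n_components=1).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))   # reconstruct from 1 component
errors = np.linalg.norm(X - X_hat, axis=1)

print(np.argmax(errors))  # the sample PCA cannot reconstruct
```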

REAL LIFE EXPERIENCES
- Before starting the study, answer the following questions: How much data do you have retroactively? Univariate or multivariate data? What is the frequency of anomaly detection (near real time, hourly, weekly)? On what unit are you supposed to perform anomaly detection? (For instance, you are studying traffic values, and you might perform anomaly detection only on devices, or for each slot/port of the devices.)
- Does your data have multiple items? Let me clarify: assume that you are supposed to perform anomaly detection on the traffic values of devices, as in the previous telco example. You probably have traffic values for many devices (maybe thousands of different devices), each with a different pattern, and you should avoid designing separate models for each device because of the complexity and maintenance issues in production. In such situations, selecting the correct features is more useful than focusing on trying different models. Determine the pattern of each device considering properties like hour and weekday/weekend info, extract the deviations from their patterns (like z-scores), and feed the models with these features. Note that contextual anomalies are tackled mostly in time series. This way you can handle the problem with only one model, which is really valuable. From the forecasting perspective, a multi-head neural network based model can be adopted as an advanced solution.
- Before starting, if it is possible, you should definitely ask the client for a few anomaly examples from the past. This will give you insight into what is expected from you.
- The number of anomalies is another concern. Most anomaly detection algorithms have a scoring process internally, so you are able to tune the number of anomalies by selecting an optimum threshold. Most of the time, clients don't want to be disturbed by too many anomalies, even if they are real anomalies. Therefore, you might need a separate false positive elimination module. For simplicity, if a device has a traffic pattern of 10 Mbps and it increases to 30 Mbps at some point, then it is absolutely an anomaly. However, it might not attract more attention than an increase from 1 Gbps to 1.3 Gbps.
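Tuning the alert volume can be as simple as thresholding the detector's scores at a percentile; the 0.1% budget below is a stand-in for whatever alert rate the client tolerates.

```python
import numpy as np

def top_fraction_threshold(scores, budget=0.001):
    """Pick a threshold so only the top `budget` fraction of scores alert."""
    return np.quantile(scores, 1.0 - budget)

# Stand-in scores from some detector (higher = more anomalous).
scores = np.random.RandomState(0).exponential(size=100_000)
thr = top_fraction_threshold(scores, budget=0.001)

print((scores > thr).sum())  # roughly 100 alerts out of 100,000 samples
```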
- Before making any decision about methodologies, I recommend visualizing the data for at least a sub-sample; this will give you a deep view of the data.
- While some of the methods accept time series directly without any preprocessing step, for other methods you need to implement a preprocessing or feature extraction step in order to turn the data into a convenient format.
- Note that novelty detection and anomaly detection are different concepts. In short, in novelty detection you have a dataset that consists entirely of normal observations, and you decide whether a newly received observation fits the data in the training set. In contrast, in anomaly detection the training set consists of both normal and abnormal samples. One-class SVM might be a good option for novelty detection problems.
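A novelty-detection sketch with a one-class SVM trained only on normal observations; the nu parameter bounds the fraction of training points allowed outside the boundary, and the data is an illustrative assumption.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(0.0, 1.0, size=(200, 2))   # normal observations only

oc_svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

X_new = np.array([[0.1, -0.2],   # looks like the training data
                  [6.0, 6.0]])   # clearly novel
print(oc_svm.predict(X_new))     # 1 = fits the training data, -1 = novelty
```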
- I encourage you to take a look at the pyod and pycaret libraries in Python, which provide off-the-shelf solutions for anomaly detection.
Translated from: https://towardsdatascience.com/unsupervised-anomaly-detection-on-time-series-9bcee10ab473