Identifying and Handling Missing Values and Outliers: Identifying Outliers, Part One

📈 Python for finance series

Warning: There is no magic formula or Holy Grail here, though a new world might open up for you.

Identifying Outliers
Identifying Outliers - Part Two
Identifying Outliers - Part Three
Stylized Facts
Feature Engineering & Feature Selection
Data Transformation

Pandas has quite a few handy methods for cleaning up messy data, such as dropna and drop_duplicates. However, it has no built-in function for finding and removing outliers, which is something we would like to have. Here I will show how to do it step by step.

The key to defining an outlier lies in the boundary we choose. Here I will give three different ways to define the boundary: the overall mean, the moving average, and the exponentially weighted moving average.
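As a rough sketch of what these three boundaries can look like in pandas, the snippet below computes each one on synthetic returns. The random data, the 21-row window, the 3-std width, and names such as `returns` and `bands` are assumptions made for illustration, not values taken from this series.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns stand in for real data (assumption for illustration).
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0, 0.02, 500))

n_sigmas = 3   # boundary width (assumed)
window = 21    # look-back length in rows (assumed)

# 1) Overall mean and std: one constant band for the whole series.
mu, sigma = returns.mean(), returns.std()

# 2) Moving average: the band follows the last `window` observations.
roll = returns.rolling(window)
roll_mu, roll_sigma = roll.mean(), roll.std()

# 3) Exponentially weighted moving average: recent rows weigh more.
ewm = returns.ewm(span=window)
ewm_mu, ewm_sigma = ewm.mean(), ewm.std()

# Upper boundary for each definition (the lower one mirrors it with a minus sign).
bands = pd.DataFrame({
    'global_upper': mu + n_sigmas * sigma,
    'rolling_upper': roll_mu + n_sigmas * roll_sigma,
    'ewm_upper': ewm_mu + n_sigmas * ewm_sigma,
}, index=returns.index)
print(bands.tail())
```

The first boundary is a single number, while the other two change row by row, so the same return can be flagged under one definition and not under another.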

1. Data preparation

Here I use Apple's stock prices and returns from 2000 through 2010, downloaded from Yahoo Finance, as an example. Of course, you can use any data.

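import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt

# 'seaborn' was renamed 'seaborn-v0_8' in matplotlib 3.6; use that name on newer versions.
plt.style.use('seaborn')
plt.rcParams['figure.dpi'] = 300

df = yf.download('AAPL',
                 start='2000-01-01',
                 end='2010-12-31')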

As we only care about the returns, a new DataFrame (d1) is created to hold the adjusted price and the returns.

d1 = pd.DataFrame(df['Adj Close'])
d1.rename(columns={'Adj Close':'adj_close'}, inplace=True)
d1['simple_rtn']=d1.adj_close.pct_change()
d1.head()
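The simple return that `pct_change` produces is the price ratio minus one, r_t = P_t / P_{t-1} - 1. A minimal check on made-up prices (the numbers and the `prices` name are purely illustrative):

```python
import pandas as pd

# Made-up prices, for illustration only.
prices = pd.Series([100.0, 102.0, 99.96])

# pct_change() divides each price by the previous one and subtracts 1.
via_pandas = prices.pct_change()
by_hand = prices / prices.shift(1) - 1

print(via_pandas)
# The first row is NaN because there is no previous price to compare with.
```

This is why the first row of `simple_rtn` in `d1.head()` is NaN.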

2. Using the mean and standard deviation as the boundary

Calculate the mean and standard deviation (std) of simple_rtn:

d1_mean = d1['simple_rtn'].agg(['mean', 'std'])

If we use the mean plus and minus one std as the boundary, the result looks like this:

fig, ax = plt.subplots(figsize=(10,6))
d1['simple_rtn'].plot(label='simple_rtn', legend=True, ax = ax)
plt.axhline(y=d1_mean.loc['mean'], c='r', label='mean')
# Boundary lines sit one std above and below the mean.
plt.axhline(y=d1_mean.loc['mean'] + d1_mean.loc['std'], c='c', linestyle='-.', label='mean + 1 std')
plt.axhline(y=d1_mean.loc['mean'] - d1_mean.loc['std'], c='c', linestyle='-.', label='mean - 1 std')
plt.legend(loc='lower right')

What happens if I use three times the std instead?
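One way to compare the two widths numerically is sketched below. It uses synthetic returns in place of `d1['simple_rtn']`, and `sigma_bounds` is a hypothetical helper introduced here, not part of pandas.

```python
import numpy as np
import pandas as pd

def sigma_bounds(series, n_sigmas):
    """Hypothetical helper: return (lower, upper) = mean -/+ n_sigmas * std."""
    mu, sigma = series.mean(), series.std()
    return mu - n_sigmas * sigma, mu + n_sigmas * sigma

# Synthetic returns stand in for d1['simple_rtn'] (assumption for illustration).
rng = np.random.default_rng(42)
rtn = pd.Series(rng.normal(0, 0.02, 1000))

for n in (1, 3):
    lower, upper = sigma_bounds(rtn, n)
    # Share of observations that fall outside the band.
    share_outside = ((rtn < lower) | (rtn > upper)).mean()
    print(f"{n} std: [{lower:.4f}, {upper:.4f}], share outside = {share_outside:.2%}")
```

To draw the wider band, the `lower` and `upper` values can be passed to `plt.axhline` in the same way as the mean line in the plot above.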

Looks good! Now it is time to look for those outliers:

mu = d1_mean.loc['mean']
sigma = d1_mean.loc['std']

def get_outliers(df, mu=mu, sigma=sigma, n_sigmas=3):
    '''
    df: a row of the DataFrame (the function is applied row by row)
    mu: mean
    sigma: std
    n_sigmas: number of std as boundary
    '''
    x = df['simple_rtn']

    # Flag the row when the return falls outside mu +/- n_sigmas * sigma.
    if (x > mu+n_sigmas*sigma) | (x<mu-n_sigmas*sigma):
        return 1
    else:
        return 0

After applying the get_outliers rule to the stock returns, a new column is created:

d1['outlier'] = d1.apply(get_outliers, axis=1)
d1.head()

Tip!

# The above code snippet can be refactored as follows (same 3-std rule, same 'outlier' column):
import numpy as np
cond = (d1['simple_rtn'] > mu + sigma * 3) | (d1['simple_rtn'] < mu - sigma * 3)
d1['outlier'] = np.where(cond, 1, 0)
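As a sanity check that the row-wise rule and the vectorized rule agree, here is a small sketch on synthetic data. The `demo` frame, the injected spikes, and the `flag_row` helper are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic returns with two injected spikes (illustrative data only).
rng = np.random.default_rng(1)
demo = pd.DataFrame({'simple_rtn': rng.normal(0, 0.01, 300)})
demo.loc[[50, 200], 'simple_rtn'] = [0.15, -0.12]

mu = demo['simple_rtn'].mean()
sigma = demo['simple_rtn'].std()
n_sigmas = 3

def flag_row(row):
    # Row-wise rule, same logic as get_outliers.
    x = row['simple_rtn']
    return int((x > mu + n_sigmas * sigma) or (x < mu - n_sigmas * sigma))

row_wise = demo.apply(flag_row, axis=1)

# Vectorized rule: one boolean mask over the whole column.
cond = (demo['simple_rtn'] - mu).abs() > n_sigmas * sigma
vectorized = pd.Series(np.where(cond, 1, 0), index=demo.index)

# Both approaches should flag exactly the same rows.
print((row_wise == vectorized).all())
```

The vectorized form avoids calling a Python function once per row, which tends to matter as the DataFrame grows.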

Let's have a look at the outliers. We can check how many we found with a value count.

d1.outlier.value_counts()

We found 30 outliers when using three times the std as the boundary. We can pick those outliers out, put them into another DataFrame, and show them in a graph:

outliers = d1.loc[d1['outlier'] == 1, ['simple_rtn']]

fig, ax = plt.subplots()
ax.plot(d1.index, d1.simple_rtn, 
        color='blue', label='Normal')
ax.scatter(outliers.index, outliers.simple_rtn, 
           color='red', label='Anomaly')
ax.set_title("Apple's stock returns")
ax.legend(loc='lower right')

plt.tight_layout()
plt.show()

In the plot above, the outliers are marked with red dots. In the next post, I will show you how to use the moving average and moving standard deviation as the boundary.

Happy learning, happy coding!

Translated from: https://medium.com/python-in-plain-english/identifying-outliers-part-one-c0a31d9faefa
