Identifying and Handling Missing Values and Outliers
📈 Python for finance series
Warning: There is no magic formula or Holy Grail here, though a new world might open up for you.
Identifying Outliers
Identifying Outliers - Part Two
Identifying Outliers - Part Three
Stylized Facts
Feature Engineering & Feature Selection
Data Transformation
Pandas has quite a few handy methods for cleaning up messy data, such as dropna and drop_duplicates. However, it has no built-in function for finding and removing outliers, which is something we would like to have. Here I will show you how to do it, step by step and in detail:
The key to defining an outlier lies in the boundary we choose. Here I will give three different ways to define the boundary: the mean, the moving average, and the exponentially weighted moving average.
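As a rough preview of the three boundaries named above, here is a minimal sketch on synthetic returns. The window length of 21 and the use of 3 standard deviations are assumptions chosen for illustration, and only the upper boundary is shown (the lower one mirrors it with a minus sign).

```python
import numpy as np
import pandas as pd

# Synthetic daily returns, for illustration only (not Apple data)
rng = np.random.default_rng(0)
rtn = pd.Series(rng.normal(0, 0.02, 250))

n_sigmas = 3
window = 21  # hypothetical window length, roughly one trading month

# 1. Mean and std over the whole sample: one constant boundary
upper_static = rtn.mean() + n_sigmas * rtn.std()

# 2. Moving average and moving std: the boundary follows recent data
upper_rolling = rtn.rolling(window).mean() + n_sigmas * rtn.rolling(window).std()

# 3. Exponentially weighted mean and std: recent data weighs more
upper_ewm = rtn.ewm(span=window).mean() + n_sigmas * rtn.ewm(span=window).std()
```

The first boundary is a single number, while the other two are Series that change over time, so a return that looks extreme in a calm period can be ordinary in a volatile one.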
1. Data preparation
Here I use about ten years of Apple's stock price history and returns from Yahoo Finance as an example; of course, you can use any data.
import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt

# Note: on matplotlib >= 3.6 this style is named 'seaborn-v0_8'
plt.style.use('seaborn')
plt.rcParams['figure.dpi'] = 300

df = yf.download('AAPL',
                 start='2000-01-01',
                 end='2010-12-31')
As we only care about the returns, a new DataFrame (d1) is created to hold the adjusted price and returns.
d1 = pd.DataFrame(df['Adj Close'])
d1.rename(columns={'Adj Close':'adj_close'}, inplace=True)
d1['simple_rtn']=d1.adj_close.pct_change()
d1.head()
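To make the simple_rtn column concrete, here is a small sketch on a made-up price series (the prices are illustrative, not Apple data). It checks that pct_change matches the return computed by hand as P_t / P_{t-1} - 1.

```python
import pandas as pd

# Hypothetical prices, for illustration only
prices = pd.Series([100.0, 102.0, 99.96, 104.958], name='adj_close')

# pct_change computes (P_t - P_{t-1}) / P_{t-1}
simple_rtn = prices.pct_change()

# The same quantity computed by hand with shift
manual_rtn = prices / prices.shift(1) - 1

print(simple_rtn.round(4).tolist())
```

The first value is NaN because there is no previous price to compare with, which is also why the first row of d1 has no return.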

2. Using mean and standard deviation as the boundary
Calculate the mean and std of simple_rtn:
d1_mean = d1['simple_rtn'].agg(['mean', 'std'])
If we use the mean and one std as the boundary, the result looks like this:
fig, ax = plt.subplots(figsize=(10,6))
d1['simple_rtn'].plot(label='simple_rtn', legend=True, ax=ax)
plt.axhline(y=d1_mean.loc['mean'], c='r', label='mean')
# The boundaries sit one std above and below the mean
plt.axhline(y=d1_mean.loc['mean'] + d1_mean.loc['std'], c='c', linestyle='-.', label='mean + std')
plt.axhline(y=d1_mean.loc['mean'] - d1_mean.loc['std'], c='c', linestyle='-.', label='mean - std')
plt.legend(loc='lower right')

What happens if I use three times the std instead?
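One way to see why the wider band is worth trying is to count how much of the data falls outside each boundary. The sketch below does this on synthetic, normally distributed returns; the helper name share_outside is hypothetical, and real stock returns will give different numbers.

```python
import numpy as np
import pandas as pd

# Synthetic returns, for illustration only (not Apple data)
rng = np.random.default_rng(42)
rtn = pd.Series(rng.normal(0, 0.02, 2500))

mu, sigma = rtn.mean(), rtn.std()

def share_outside(series, mu, sigma, n_sigmas):
    """Fraction of observations outside mu +/- n_sigmas * sigma."""
    outside = (series > mu + n_sigmas * sigma) | (series < mu - n_sigmas * sigma)
    return outside.mean()

share_1 = share_outside(rtn, mu, sigma, 1)
share_3 = share_outside(rtn, mu, sigma, 3)
print(f"outside 1 std: {share_1:.1%}, outside 3 std: {share_3:.1%}")
```

With one std, roughly a third of perfectly ordinary days would be labelled as outliers, so that boundary is too tight to be useful.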

Looks good! Now it is time to look for those outliers:
mu = d1_mean.loc['mean']
sigma = d1_mean.loc['std']

def get_outliers(row, mu=mu, sigma=sigma, n_sigmas=3):
    '''
    row: one row of the DataFrame (passed in by apply with axis=1)
    mu: mean
    sigma: std
    n_sigmas: number of std as boundary
    '''
    x = row['simple_rtn']
    # Flag the return if it lies outside mu +/- n_sigmas * sigma
    if (x > mu + n_sigmas * sigma) or (x < mu - n_sigmas * sigma):
        return 1
    else:
        return 0
After applying the rule get_outliers to the stock price returns, a new column is created:
d1['outlier'] = d1.apply(get_outliers, axis=1)
d1.head()

Tip!
# The above code snippet can be refactored as follows (requires numpy):
import numpy as np
cond = (d1['simple_rtn'] > mu + sigma * 3) | (d1['simple_rtn'] < mu - sigma * 3)
d1['outlier'] = np.where(cond, 1, 0)
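The vectorized form can also be wrapped in a small reusable function. This is only a sketch: the name flag_outliers is hypothetical, and the data is synthetic with two extreme values injected on purpose. It also checks that the vectorized flags agree with a row-by-row version.

```python
import numpy as np
import pandas as pd

def flag_outliers(series, n_sigmas=3):
    """Return a 0/1 Series marking values outside mean +/- n_sigmas * std."""
    mu, sigma = series.mean(), series.std()
    cond = (series > mu + n_sigmas * sigma) | (series < mu - n_sigmas * sigma)
    return pd.Series(np.where(cond, 1, 0), index=series.index)

# Synthetic returns with two injected extreme values
rng = np.random.default_rng(1)
rtn = pd.Series(rng.normal(0, 0.01, 500))
rtn.iloc[100] = 0.15
rtn.iloc[300] = -0.20

flags = flag_outliers(rtn)

# Row-by-row version, for comparison with the vectorized result
mu, sigma = rtn.mean(), rtn.std()
flags_apply = rtn.apply(
    lambda x: 1 if (x > mu + 3 * sigma or x < mu - 3 * sigma) else 0
)
```

Both versions give the same flags; the vectorized one avoids a Python-level call per row, which usually matters on long price histories.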
Let's have a look at the outliers. We can check how many we found by doing a value count.
d1.outlier.value_counts()
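To judge whether a given count is large, it can help to know how often a normally distributed variable would cross the same boundary. The sketch below computes that two-sided tail probability with the standard library; treating returns as normal is an assumption made only for comparison, and the function name two_sided_tail is hypothetical.

```python
import math

def two_sided_tail(n_sigmas):
    """P(|Z| > n_sigmas) for a standard normal variable Z."""
    return math.erfc(n_sigmas / math.sqrt(2))

p3 = two_sided_tail(3)
print(f"{p3:.4%}")
```

Under normality only about 0.27% of observations lie beyond three std, so a noticeably higher share in the data suggests heavier tails than the normal distribution.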

We found 30 outliers when using three times the std as the boundary. We can pick those outliers out, put them into another DataFrame, and show them in a graph:
outliers = d1.loc[d1['outlier'] == 1, ['simple_rtn']]

fig, ax = plt.subplots()
ax.plot(d1.index, d1.simple_rtn,
        color='blue', label='Normal')
ax.scatter(outliers.index, outliers.simple_rtn,
           color='red', label='Anomaly')
ax.set_title("Apple's stock returns")
ax.legend(loc='lower right')

plt.tight_layout()
plt.show()

In the above plot, the outliers are marked with red dots. In the next post, I will show you how to use the moving average and moving standard deviation as the boundary.
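The introduction mentions removing outliers as well as finding them. Once an outlier flag column exists, one possible way to do that is sketched below. The small d1_demo frame is a made-up stand-in for d1, and whether to drop or mask flagged rows depends on what the data is used for.

```python
import pandas as pd

# Small stand-in for d1 with an 'outlier' flag column (values are made up)
d1_demo = pd.DataFrame({
    'simple_rtn': [0.01, -0.02, 0.15, 0.00, -0.18],
    'outlier':    [0, 0, 1, 0, 1],
})

# Option 1: keep only the rows that were not flagged
cleaned = d1_demo.loc[d1_demo['outlier'] == 0].copy()

# Option 2: keep every row but replace flagged returns with NaN
masked = d1_demo['simple_rtn'].mask(d1_demo['outlier'] == 1)
```

Dropping rows shortens the series, while masking keeps the original index intact, which is convenient when the dates still need to line up with other data.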
Happy learning, happy coding!
Translated from: https://medium.com/python-in-plain-english/identifying-outliers-part-one-c0a31d9faefa