Data Transformation

📈Python for finance series

Warning: There is no magical formula or Holy Grail here, though a new world might open the door for you.


📈Python for finance series

  1. Identifying Outliers

  2. Identifying Outliers — Part Two

  3. Identifying Outliers — Part Three

  4. Stylized Facts

  5. Feature Engineering & Feature Selection

  6. Data Transformation

In the previous article, I briefly introduced Volume Spread Analysis (VSA). After we did feature engineering and feature selection, there were two things I noticed immediately: the first was that there were outliers in the dataset, and the second was that the distributions were nowhere close to normal. By using the methods described here, here and here, I removed most of the outliers. Now it is time to face the bigger problem: normality.

There are many ways to transform the data. One of the well-known examples is one-hot encoding; an even better one is word embedding in natural language processing (NLP). One of the advantages of using deep learning is that it largely automates what used to be the most crucial step in a machine-learning workflow: feature engineering. Before we get into deep learning in later articles, let's have a look at some simple ways to transform data and see whether we can bring it closer to a normal distribution.
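
As a quick refresher, below is a minimal sketch of one-hot encoding with pandas; the 'sector' column and its values are made up purely for illustration and are not part of this article's dataset.

import pandas as pd

#hypothetical categorical column, for illustration only
df = pd.DataFrame({'sector': ['tech', 'energy', 'tech', 'finance']})
one_hot = pd.get_dummies(df['sector'], prefix='sector')
#each category becomes its own indicator column:
#sector_energy, sector_finance, sector_tech
print(one_hot)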

In this article, I would like to try a few things. The first is to transform all the features into a simple percentage change. The second is to do a percentile ranking. In the end, I will show you what happens if I only keep the sign of all the data. Methods like the Z-score, which are standard pre-processing in deep learning, I would rather leave for now.
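
To make the plan concrete, here is a minimal sketch of the three transformations on a made-up toy series (the prices below are purely illustrative, not the AAPL data used later):

import numpy as np
import pandas as pd

#a made-up toy series of closing prices, for illustration only
s = pd.Series([10.0, 10.5, 10.2, 11.0, 10.8])

pct_change = s.pct_change()                         #1. simple percentage change
trailing_rank = s.rolling(3).apply(                 #2. percentile rank within a trailing window
    lambda x: pd.Series(x).rank(pct=True).iloc[-1])
sign_only = pct_change.apply(np.sign)               #3. keep only the sign of the change (+1/-1)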

1. Data preparation

For consistency, in all the 📈Python for finance series, I will try to reuse the same data as much as I can. More details about data preparation can be found here, here and here, or you can refer back to my previous article. Or, if you like, you can ignore all the code below and use whatever clean data you have at hand; it won't affect the things we are going to do together.

#import all the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import yfinance as yf #the stock data from Yahoo Finance
import matplotlib.pyplot as plt

#set the parameters for plotting
plt.style.use('seaborn')
plt.rcParams['figure.dpi'] = 300

#define a function to get data
def get_data(symbols, begin_date=None, end_date=None):
    df = yf.download('AAPL', start='2000-01-01',
                     auto_adjust=True,  #only download adjusted data
                     end='2010-12-31')
    #my convention: always lowercase
    df.columns = ['open', 'high', 'low', 'close', 'volume']
    return df

prices = get_data('AAPL', '2000-01-01', '2010-12-31')

#create some features
def create_HLCV(i):
    #as we don't care about open that much, that leaves volume,
    #high, low and close
    df = pd.DataFrame(index=prices.index)
    df[f'high_{i}D'] = prices.high.rolling(i).max()
    df[f'low_{i}D'] = prices.low.rolling(i).min()
    df[f'close_{i}D'] = prices.close.rolling(i).apply(lambda x: x[-1])
    #close_{i}D equals close, as rolling backwards means today is
    #literally the last day of the rolling window
    df[f'volume_{i}D'] = prices.volume.rolling(i).sum()
    return df

#create features at different rolling windows
def create_features_and_outcomes(i):
    df = create_HLCV(i)
    high = df[f'high_{i}D']
    low = df[f'low_{i}D']
    close = df[f'close_{i}D']
    volume = df[f'volume_{i}D']

    features = pd.DataFrame(index=prices.index)
    outcomes = pd.DataFrame(index=prices.index)

    #as we already considered the different time spans,
    #only a 1-day simple percentage change is used here
    features[f'volume_{i}D'] = volume.pct_change()
    features[f'price_spread_{i}D'] = (high - low).pct_change()
    #align the close location with the stock price change
    features[f'close_loc_{i}D'] = ((close - low) / (high - low)).pct_change()
    #the future outcome is what we are going to predict
    outcomes[f'close_change_{i}D'] = close.pct_change(-i)

    return features, outcomes

def create_bunch_of_features_and_outcomes():
    '''
    the timespans that I would like to explore
    are 1, 2, 3 days and 1 week, 1 month, 2 months, 3 months,
    which roughly are [1, 2, 3, 5, 20, 40, 60]
    '''
    days = [1, 2, 3, 5, 20, 40, 60]
    bunch_of_features = pd.DataFrame(index=prices.index)
    bunch_of_outcomes = pd.DataFrame(index=prices.index)

    for day in days:
        f, o = create_features_and_outcomes(day)
        bunch_of_features = bunch_of_features.join(f)
        bunch_of_outcomes = bunch_of_outcomes.join(o)

    return bunch_of_features, bunch_of_outcomes

bunch_of_features, bunch_of_outcomes = create_bunch_of_features_and_outcomes()

#define the method to identify outliers
def get_outliers(df, i=4):
    #i is the number of sigmas, which defines the boundary around the mean
    outliers = pd.DataFrame()
    stats = df.describe()

    for col in df.columns:
        mu = stats.loc['mean', col]
        sigma = stats.loc['std', col]
        condition = (df[col] > mu + sigma * i) | (df[col] < mu - sigma * i)
        outliers[f'{col}_outliers'] = df[col][condition]

    return outliers

#remove all the outliers
features_outcomes = bunch_of_features.join(bunch_of_outcomes)
outliers = get_outliers(features_outcomes, i=1)
features_outcomes_rmv_outliers = features_outcomes.drop(index=outliers.index).dropna()

features = features_outcomes_rmv_outliers[bunch_of_features.columns]
outcomes = features_outcomes_rmv_outliers[bunch_of_outcomes.columns]
features.info(), outcomes.info()
[Image: information of the features dataset]

[Image: information of the outcomes dataset]

In the end, we will have the four basic features based on Volume Spread Analysis (VSA) at the different time scales listed below, namely 1 day, 2 days, 3 days, a week, a month, 2 months and 3 months.

  • Volume: pretty straightforward

  • Range/Spread: difference between high and low

  • Closing Price Relative to Range: is the closing price near the top or the bottom of the price bar?

  • The change of stock price: pretty straightforward

2. Percentage Returns

I know that is a whole lot of code above. We have all the features transformed into a simple percentage change through the function below.

def create_features_and_outcomes(i):
    df = create_HLCV(i)
    high = df[f'high_{i}D']
    low = df[f'low_{i}D']
    close = df[f'close_{i}D']
    volume = df[f'volume_{i}D']

    features = pd.DataFrame(index=prices.index)
    outcomes = pd.DataFrame(index=prices.index)

    #as we already considered the different time spans,
    #only a 1-day simple percentage change is used here
    features[f'volume_{i}D'] = volume.pct_change()
    features[f'price_spread_{i}D'] = (high - low).pct_change()
    #align the close location with the stock price change
    features[f'close_loc_{i}D'] = ((close - low) / (high - low)).pct_change()
    #the future outcome is what we are going to predict
    outcomes[f'close_change_{i}D'] = close.pct_change(-i)

    return features, outcomes

Now, let's have a look at their correlations using a cluster map. Seaborn's clustermap() hierarchical clustering algorithm offers a nice way to group the most closely related features.

corr_features = features.corr().sort_index()
sns.clustermap(corr_features, cmap='coolwarm', linewidth=1);
[Image: cluster map of the feature correlations]

Based on this cluster map, to minimize the amount of feature overlap in the selected features, I will remove those features that are closely paired with other features and have less correlation with the outcome targets. From the cluster map above, it is easy to spot that the features on [40D, 60D] and [2D, 3D] are paired together. To see how those features are related to the outcomes, let's first have a look at how the outcomes are correlated.

corr_outcomes = outcomes.corr()
sns.clustermap(corr_outcomes, cmap='coolwarm', linewidth=2);
[Image: cluster map of the outcome correlations]

From top to bottom, the 20-day, 40-day and 60-day price percentage changes are grouped together, as are the 2-day, 3-day and 5-day changes, whereas the 1-day stock price percentage change is relatively independent of those two groups. If we pick the next-day price percentage change as the outcome target, let's see how those features are related to it.

corr_features_outcomes = features.corrwith(
    outcomes.close_change_1D).sort_values()
corr_features_outcomes.dropna(inplace=True)
corr_features_outcomes.plot(kind='barh', title='Strength of Correlation');

[Image: bar chart of the correlation strength between the features and the 1-day outcome]

The correlation coefficients are way too small to draw a solid conclusion. I would expect the most recent data to have a stronger correlation, but that is not the case here.

How about the pair plot? We only pick the features based on a 1-day time scale as a demonstration. In the meantime, I converted close_change_1D to its sign, based on whether it is a negative or positive number, to add an extra dimension to the plots.

selected_features_1D_list = ['volume_1D', 'price_spread_1D',
                             'close_loc_1D', 'close_change_1D']
features_outcomes_rmv_outliers['sign_of_close'] = \
    features_outcomes_rmv_outliers['close_change_1D'].apply(np.sign)

sns.pairplot(features_outcomes_rmv_outliers,
             vars=selected_features_1D_list,
             diag_kind='kde',
             palette='husl', hue='sign_of_close',
             markers=['*', '<', '+'],
             plot_kws={'alpha': 0.3});

[Image: pair plot of the 1-day features, coloured by the sign of the close change]

The pair plot builds on two basic figures, the histogram and the scatter plot. The histograms on the diagonal allow us to see the distribution of a single variable, while the scatter plots on the upper and lower triangles show the relationship (or lack thereof) between two variables. From the plots above, we can see that price spreads get wider with high volume. Most of the price changes are located within a narrow price spread; in other words, a wider spread doesn't always come with a bigger price fluctuation. Either low volume or high volume can cause price changes at almost all scales. And we can apply all those conclusions to both up days and down days.

You can also use the close location of the bars to add more dimensionality; simply apply

features['sign_of_close_loc'] = np.where(
    features['close_loc_1D'] > 0.5,
    1, -1)

to see how many bars' close locations are above 0.5 or below 0.5.
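
For example, once that column exists, a quick way to count the two groups would be:

#count (and share of) bars on each side of the 0.5 threshold
print(features['sign_of_close_loc'].value_counts())
print(features['sign_of_close_loc'].value_counts(normalize=True))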

One thing that I don't really like in the pair plot is that all the plots involving close_loc_1D are condensed; it looks like the outliers are still there, even though I used one standard deviation as the boundary, which is a very low threshold, and 338 outliers were removed. I realize that because the location of the close is already expressed as a fraction of the range, adding another percentage change on top doesn't make much sense. Let's change it.

def create_features_and_outcomes(i):
    df = create_HLCV(i)
    high = df[f'high_{i}D']
    low = df[f'low_{i}D']
    close = df[f'close_{i}D']
    volume = df[f'volume_{i}D']

    features = pd.DataFrame(index=prices.index)
    outcomes = pd.DataFrame(index=prices.index)

    #as we already considered the different time spans,
    #only a 1-day simple percentage change is used here
    features[f'volume_{i}D'] = volume.pct_change()
    features[f'price_spread_{i}D'] = (high - low).pct_change()
    #pct_change() removed here
    features[f'close_loc_{i}D'] = (close - low) / (high - low)
    #predict the future with -i
    outcomes[f'close_change_{i}D'] = close.pct_change(-i)

    return features, outcomes

With pct_change() removed, let's see what the cluster map looks like now.

corr_features = features.corr().sort_index()
sns.clustermap(corr_features, cmap='coolwarm', linewidth=1);
[Image: cluster map of the feature correlations without pct_change() on close_loc]

The cluster map makes more sense now. All four basic features have pretty much the same pattern: [40D, 60D] and [2D, 3D] are paired together.

And in terms of the features' correlations with the outcome:

corr_features_outcomes.plot(kind='barh',title = 'Strength of Correlation');
[Image: bar chart of the correlation strength after removing pct_change() from close_loc_1D]

The longer-range time-scale features have weak correlations with the stock price return, while more recent events have more effect on the price returns.

By removing pct_change() from close_loc_1D, the biggest difference shows up in the pairplot().

[Image: pair plot with close_loc_1D plotted in its natural range]

Finally, the close_loc_1D variable plots in the right range. This illustrates that we should be careful with over-engineering; it may lead somewhere totally unexpected.

3. Percentile Ranking

According to Wikipedia, the percentile rank is


“The percentile rank of a score is the percentage of scores in its frequency distribution that are equal to or lower than it. For example, a test score that is greater than 75% of the scores of people taking the test is said to be at the 75th percentile, where 75 is the percentile rank.”

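As a quick worked example of that definition (with made-up scores), pandas' rank(pct=True) computes exactly this:

import pandas as pd

#hypothetical test scores, purely for illustration
scores = pd.Series([55, 60, 62, 71, 80, 85, 90, 95])
print(scores.rank(pct=True))
#80 is greater than or equal to 5 of the 8 scores, so its percentile rank is 5/8 = 0.625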

The example below returns, for each value, its percentile rank (from 0.00 to 1.00) compared to a trailing 60-day window.

roll_rank = lambda x: pd.Series(x).rank(pct=True)[-1]
#you would only pick the first value, [0], of the 60-day window's rank
#if you were rolling forward; as we roll backward, we pick the last one, [-1]

features_rank = features.rolling(60, min_periods=60).apply(roll_rank).dropna()
outcomes_rank = outcomes.rolling(60, min_periods=60).apply(roll_rank).dropna()

Tip!

With pandas rolling(), by default, the result is set to the right edge of the window. That means the windows are backward-looking, rolling from the past towards the current timestamp. That is why, when applying rank() within that window frame, we pick the last value, [-1].
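
A tiny sketch with a made-up series shows why the last element of each window is the one that belongs to the current timestamp:

import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])   #made-up values, for illustration only
#for each position, rank the latest value within its trailing 3-observation window
trailing = s.rolling(3).apply(lambda x: pd.Series(x).rank(pct=True).iloc[-1])
print(trailing)
#the result at position i describes s[i] itself, i.e. the right edge of the window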

For more information about rolling(), please check the official documentation.

First, let's have a quick look at the outcomes' cluster map. It is almost identical to the percentage-change one, with a different order.

corr_outcomes_rank = outcomes_rank.corr().sort_index()
sns.clustermap(corr_outcomes_rank, cmap='coolwarm', linewidth=2);
[Image: cluster map of the ranked outcome correlations]

The same pattern holds for the features' cluster map.

corr_features_rank = features_rank.corr().sort_index()
sns.clustermap(corr_features_rank, cmap='coolwarm', linewidth=2);
[Image: cluster map of the ranked feature correlations]

Even with a different method,


# using 'ward' method
corr_features_rank = features_rank.corr().sort_index()
sns.clustermap(corr_features_rank, cmap='coolwarm', linewidth=2, method='ward');
[Image: cluster map of the ranked feature correlations using the 'ward' method]

And of course, the correlations of the features with the outcome are the same as well.

corr_features_outcomes_rank = features_rank.corrwith(
    outcomes_rank.close_change_1D).sort_values()
corr_features_outcomes_rank

[Image: list of correlation values between the ranked features and the 1-day outcome]

corr_features_outcomes_rank.plot(kind='barh', title='Strength of Correlation');

[Image: bar chart of the correlation strength for the ranked data]

Last, you may guess the pair plot will be the same as well.


selected_features_1D_list = ['volume_1D', 'price_spread_1D',
                             'close_loc_1D', 'close_change_1D']
#join the ranked features and outcomes first (assumed step, not shown in the original)
features_outcomes_rank = features_rank.join(outcomes_rank)
features_outcomes_rank['sign_of_close'] = \
    features_outcomes_rmv_outliers['close_change_1D'].apply(np.sign)

sns.pairplot(features_outcomes_rank,
             vars=selected_features_1D_list,
             diag_kind='kde',
             palette='husl', hue='sign_of_close',
             markers=['*', '<', '+'],
             plot_kws={'alpha': 0.3});

[Image: pair plot of the ranked 1-day features, coloured by the sign of the close change]

Because of the percentile rank (from 0.00 to 1.00) we computed over the set window, the spots are evenly distributed across all the features. The distributions of all the features are more or less closer to a normal distribution than the same data without transformation.

4. Signing

Last but not least, I would like to remove all the granularity of the data and see how those features are related under this scenario.

features_sign = features.apply(np.sign)
outcomes_sign = outcomes.apply(np.sign)

Then calculate the correlation coefficients again.

corr_features_outcomes_sign = features_sign.corrwith(
    outcomes_sign.close_change_1D).sort_values(ascending=False)
corr_features_outcomes_sign

[Image: list of correlation values between the signed features and the 1-day outcome]

corr_features_outcomes_sign.plot(kind='barh', title='Strength of Correlation');

[Image: bar chart of the correlation strength for the signed data]

It turns out a bit weird now; for instance, volume_1D and price_spread_1D have a very weak correlation with the outcome.

Luckily, the cluster map remains pretty much the same.


corr_features_sign = features_sign.corr().sort_index()
sns.clustermap(corr_features_sign, cmap='coolwarm', linewidth=2);
[Image: cluster map of the signed feature correlations]

And the same goes for the relationship between outcomes.


corr_outcomes_sign = outcomes_sign.corr().sort_index()
sns.clustermap(corr_outcomes_sign, cmap='coolwarm', linewidth=2);
[Image: cluster map of the signed outcome correlations]

As for the pair plot, since all the data are transformed to either -1 or 1, it doesn't show anything meaningful.

[Image: pair plot of the signed data]

It is sometimes vital to "standardize" or "normalize" data so that we get fair comparisons between features of differing scales. I am tempted to use the Z-score to normalize the data set.

[Image: the Z-score formula, z = (x − μ) / σ]

The formula of the Z-score requires the mean and the standard deviation; by calculating these two parameters across the entire dataset, we get the chance to peek into the future. Of course, we can take advantage of the rolling window again. But generally, people normalize their data before feeding it into their model.
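
If one did want a Z-score without peeking into the future, a backward-looking rolling version along these lines would work; the 60-day window below is an arbitrary choice for illustration, not something used elsewhere in this article.

#a hedged sketch: rolling (backward-looking) Z-score with an assumed 60-day window
rolling_mean = features.rolling(60, min_periods=60).mean()
rolling_std = features.rolling(60, min_periods=60).std()
features_zscore = ((features - rolling_mean) / rolling_std).dropna()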

In summary, by utilizing 3 different data transformation methods, we are now pretty confident that we can select the most related features and discard the redundant ones, as all 3 methods pretty much share the same patterns.

5. Stationarity and Normality Test

The last question is: can the transformed data pass a stationarity/normality test? Here, I will use the Augmented Dickey-Fuller test [1], which is a type of statistical test called a unit root test. In the meantime, I want to look at the skewness and kurtosis as well.

import statsmodels.api as sm
import scipy.stats as scs

p_val = lambda s: sm.tsa.stattools.adfuller(s)[1]

def build_stats(df):
    stats = pd.DataFrame({'skew': scs.skew(df),
                          'skew_test': scs.skewtest(df)[1],
                          'kurtosis': scs.kurtosis(df),
                          'kurtosis_test': scs.kurtosistest(df)[1],
                          'normal_test': scs.normaltest(df)[1]},
                         index=df.columns)
    return stats

The null hypothesis of the test is that the time series can be represented by a unit root, i.e. that it is not stationary (it has some time-dependent structure). The alternative hypothesis (rejecting the null hypothesis) is that the time series is stationary.

  • Null Hypothesis (H0): if it fails to be rejected, it suggests the time series has a unit root, meaning it is non-stationary. It has some time-dependent structure.

  • Alternative Hypothesis (H1): the null hypothesis is rejected; it suggests the time series does not have a unit root, meaning it is stationary. It does not have a time-dependent structure.

Here is the result from the Augmented Dickey-Fuller test:

For features and outcomes:


features_p_val = features.apply(p_val)
outcomes_p_val = outcomes.apply(p_val)
outcomes_p_val, features_p_val

[Image: ADF p-values for the outcomes and the features]

The test can be interpreted through the p-value. A p-value below a threshold (such as 5% or 1%) suggests we reject the null hypothesis (stationary); otherwise, a p-value above the threshold suggests we cannot reject the null hypothesis (non-stationary). A quick illustrative check follows the two rules below.

  • p-value > 0.05: cannot reject the null hypothesis (H0); the data has a unit root and is non-stationary.

  • p-value <= 0.05: reject the null hypothesis (H0); the data does not have a unit root and is stationary.
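
For instance, the raw closing price would be expected to fail this test, while its percentage change should pass; a quick illustrative check with the p_val helper defined above:

#illustrative check: raw close vs. its 1-day percentage change
print(p_val(prices.close))                        #typically well above 0.05 (non-stationary)
print(p_val(prices.close.pct_change().dropna()))  #typically far below 0.05 (stationary)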

From this test, we can see that all the results are well below 5%, which shows that we can reject the null hypothesis and that all the transformed data are stationary.

Next, let’s test the normality.


build_stats(features_outcomes_rmv_outliers)

[Image: skewness, kurtosis and normality test results]

For normally distributed data, the skewness should be about zero. For unimodal continuous distributions, a skewness value greater than zero means that there is more weight in the right tail of the distribution, and vice versa.

scs.skewtest() tests the null hypothesis that the skewness of the population the sample was drawn from is the same as that of a corresponding normal distribution. As all the numbers are below the 5% threshold, we have to reject the null hypothesis and say the skewness doesn't correspond to a normal distribution. The same goes for scs.kurtosistest().

scs.normaltest() tests the null hypothesis that a sample comes from a normal distribution. It is based on D'Agostino and Pearson's test [2][3], which combines skew and kurtosis to produce an omnibus test of normality. Again, all the numbers are below the 5% threshold. We have to reject the null hypothesis and say the data transformed by percentage change is not normally distributed.

We can do the same tests on the data transformed by Percentile Ranking and Signing. I don't want to scare people off by complicating things further, so I am better off ending here before this article gets way too long.
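
If you want to try it yourself, the same machinery from above applies directly; a minimal sketch (not run here):

#a sketch of applying the same tests to the ranked and the signed data
rank_p_values = features_rank.apply(p_val)        #Augmented Dickey-Fuller p-values
sign_p_values = features_sign.apply(p_val)
rank_normality = build_stats(features_rank.dropna())
sign_normality = build_stats(features_sign.dropna())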

Source: https://towardsdatascience.com/data-transformation-e7b3b4268151
