How to run statistical tests on two groups of data in R
Business analytics and data science are a convergence of many fields of expertise. Professionals from multiple domains and educational backgrounds are joining the analytics industry in pursuit of becoming data scientists.
I have met two kinds of data scientists in my career. The first kind pays attention to the details of the algorithms and models. They always try to understand the mathematics and statistics behind the scenes, and they want full control over the solution and the theory behind it. The other kind is more interested in the end result than in the theoretical details. They are fascinated by the implementation of new and advanced models, and are inclined towards solving the problem at hand rather than the theory behind the solution.
Believers in both approaches have their own logic to support their stand. I respect their choices.
In this post, I shall share some statistical tests that are commonly used in data science. It is good to know some of these irrespective of the approach you believe in.
In statistics, there are two ways of drawing an inference from data. One is the estimation of parameters, where unknown values of population parameters are computed through various methods. The other is hypothesis testing, which helps us test parameter values that are guessed from some prior knowledge.
I shall list some statistical test procedures that you will frequently encounter in data science.
“The only relevant test of the validity of a hypothesis is comparison of its predictions with experience.” — Milton Friedman
As a data scientist, do I really need to know hypothesis testing?
In most decision-making procedures in data science, we knowingly or unknowingly use hypothesis testing. Here is some evidence in support of my statement.
As data scientists, the kind of data analysis we do can be segregated into four broad areas:
1. Exploratory Data Analysis (EDA)
2. Regression and Classification
3. Forecasting
4. Data Grouping
Each of these areas includes some amount of statistical testing.
Exploratory Data Analysis (EDA)
EDA is an unavoidable part of data science on which every data scientist spends a significant amount of time. It establishes the foundation for creating machine learning and statistical models. Some common tasks that involve statistical testing in EDA are:
1. Test for normality
2. Test for outliers
3. Test for correlation
4. Test of homogeneity
5. Test for equality of distribution
Each of these tasks involves hypothesis testing at some point.
1. How to test for normality?
Normality is everywhere in statistics. Most theories we use in statistics are based on the normality assumption. Normality means the data should follow a particular kind of probability distribution, the normal distribution, which has a particular shape and is represented by a particular function.
In Analysis of Variance (ANOVA), we assume normality of the data. When doing regression, we expect the residuals to follow a normal distribution.
To check the normality of data we can use the Shapiro-Wilk test. The null hypothesis for this test is that the data sample is normally distributed.
Python implementation:
import numpy as np
from scipy import stats
data = stats.norm.rvs(loc=2.5, scale=2, size=100)
shapiro_test = stats.shapiro(data)
print(shapiro_test)
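To read the result, the p-value is compared with a chosen significance level. The following is a minimal sketch of that decision rule, assuming a 5% level and two synthetic samples. The helper `looks_normal` and the sample names are made up for this illustration and are not part of SciPy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
normal_sample = rng.normal(loc=2.5, scale=2, size=200)
skewed_sample = rng.exponential(scale=2, size=200)  # clearly non-normal

def looks_normal(sample, alpha=0.05):
    """Return True when Shapiro-Wilk fails to reject normality at level alpha."""
    stat, p_value = stats.shapiro(sample)
    return bool(p_value >= alpha)

print("normal sample passes:", looks_normal(normal_sample))
print("skewed sample passes:", looks_normal(skewed_sample))
```

A result of True only means the test found no evidence against normality; it does not prove that the data are normal.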
2. How to test whether a data point is an outlier?
When I start any new data science use case where I have to fit a model, one of my routine tasks is detecting outliers in the response variable. Outliers affect regression models greatly, so a careful elimination or substitution strategy is required for them.
An outlier is a global outlier if its value deviates significantly from the rest of the data. It is a contextual outlier if it deviates only from the data points originating from a particular context. A set of data points can also be a collective outlier when together they deviate considerably from the rest.
The Tietjen-Moore test is useful for detecting multiple outliers in a data set. It requires the suspected number of outliers to be specified in advance. The null hypothesis for this test is that there are no outliers in the data.
Python implementation:
import numpy as np
import scikit_posthocs
x = np.array([-1.40, -0.44, -0.30, -0.24, -0.22, -0.13, -0.05, 0.06, 0.10, 0.18, 0.20, 0.39, 0.48, 0.63, 1.01])
# Test whether the k = 2 most extreme values are outliers
print(scikit_posthocs.outliers_tietjen(x, 2))
3. How to test the significance of the correlation coefficient between two variables?
In data science, we deal with a number of independent variables that explain the behavior of the dependent variable. Significant correlation between the independent variables may affect the estimated coefficients of the variables. It makes the standard errors of the regression coefficients unreliable, which hurts the interpretability of the regression.
When we calculate the correlation between two variables, we should check its significance. This can be done with a t-test. The null hypothesis of this test is that the true correlation between the variables is zero.
Python implementation:
from scipy.stats import pearsonr
data1 = stats.norm.rvs(loc=3, scale=1.5, size=20)
data2 = stats.norm.rvs(loc=-5, scale=0.5, size=20)
stat, p = pearsonr(data1, data2)
print(stat, p)
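The t-test behind that p-value can be written out by hand. With sample correlation r from n pairs, the statistic t = r * sqrt((n - 2) / (1 - r^2)) follows a t distribution with n - 2 degrees of freedom under the null hypothesis. The sketch below uses synthetic data (the variables `x` and `y` are made up for illustration) and checks that the manual p-value agrees with the one from `pearsonr`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)  # y depends partly on x

r, p_scipy = stats.pearsonr(x, y)

# Manual t-test for H0: the true correlation is zero
n = len(x)
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value

print(r, t_stat, p_manual, p_scipy)
```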
4. How to test the homogeneity of a categorical variable in two data sets?
The test of homogeneity is easiest to explain with an example. Suppose we want to check whether the viewing preferences of Netflix subscribers are the same for males and females. You can use the chi-square test of homogeneity for this. You have to check whether the frequency distributions for males and females differ significantly from each other.
The null hypothesis for the test is that the two data sets are homogeneous.
Python implementation:
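import numpy as np
from scipy.stats import chi2_contingency
# Hypothetical viewing counts: rows are males and females, columns are three genres
observed = np.array([[30, 45, 25],
                     [35, 30, 35]])
# chi2_contingency tests whether the row distributions are homogeneous
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)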
5. How to check if a given data set follows a particular distribution?
Sometimes in data analysis we need to check whether the data follow a particular distribution. We may even want to check whether two samples follow the same distribution. In such cases we use the Kolmogorov-Smirnov (KS) test. We often use the KS test to check the goodness of fit of a regression model.
This test compares the empirical cumulative distribution function (ECDF) of the data with the theoretical distribution function. The null hypothesis for this test is that the given data follow the specified distribution.
Python implementation:
import numpy as np
from scipy import stats
x = np.linspace(-25, 17, 6)
# Compare the sample with the standard normal distribution
print(stats.kstest(x, 'norm'))
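The two-sample case mentioned above, checking whether two samples follow the same distribution, can be sketched with `scipy.stats.ks_2samp`. The samples here are synthetic and their names are made up for illustration: two are drawn from the same normal distribution and the third has a shifted mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample_a = rng.normal(loc=0, scale=1, size=300)
sample_b = rng.normal(loc=0, scale=1, size=300)  # same distribution as sample_a
sample_c = rng.normal(loc=1, scale=1, size=300)  # mean shifted by one unit

# H0 for each test: both samples come from the same distribution
same = stats.ks_2samp(sample_a, sample_b)
different = stats.ks_2samp(sample_a, sample_c)

print("a vs b:", same.statistic, same.pvalue)
print("a vs c:", different.statistic, different.pvalue)
```

A small p-value for the second comparison indicates that the two ECDFs are further apart than chance alone would explain.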
Regression and Classification
Most of the modeling we do in data science falls under either regression or classification. Whenever we predict a value or a class, we rely on these two methods.
Both regression and classification involve statistical tests at different stages of decision making. The data also need to satisfy some prerequisite conditions to be eligible for these tasks, and some tests are required to check these conditions.
Some common statistical tests associated with regression and classification are:
1. Test for heteroscedasticity
2. Test for multicollinearity
3. Test of the significance of regression coefficients
4. ANOVA for regression or classification model
1. How to test for heteroscedasticity?
Heteroscedasticity is quite a heavy term. It simply means unequal variance. Let me explain it with an example. Suppose you are collecting income data from different cities. You will see that the variation of income differs significantly across cities.
If the data are heteroscedastic, the estimation of the regression coefficients suffers. The coefficient estimates become less precise and their standard errors unreliable, so any single estimate may lie further from the actual value.
To test for heteroscedasticity in the data, White's test can be used. Its null hypothesis is that the variance is constant over the data.
Python implementation:
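import statsmodels.api as sm
from patsy import dmatrices
from statsmodels.stats.diagnostic import het_white
from statsmodels.compat import lzip
# df is assumed to be a pandas DataFrame with columns y_var and x_var
expr = 'y_var ~ x_var'
y, X = dmatrices(expr, df, return_type='dataframe')
# Fit the OLS model whose residuals are tested
olsr_results = sm.OLS(y, X).fit()
keys = ['LM stat', 'LM test p-value', 'F-stat', 'F-test p-value']
results = het_white(olsr_results.resid, X)
print(lzip(keys, results))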
2. How to test for multicollinearity in the variables?
Data science problems often include multiple explanatory variables. Sometimes these variables are correlated because of their origin and nature. Also, sometimes we create more than one variable from the same underlying fact. In these cases the variables become highly correlated. This is called multicollinearity.
The presence of multicollinearity increases the standard errors of the coefficients of the regression or classification model. It can make some important variables appear insignificant in the model.
The Farrar–Glauber test can be used to check for the presence of multicollinearity in the data.
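The original gives no implementation for this test. As a hedged sketch, the chi-square part of the Farrar–Glauber procedure can be computed from the correlation matrix R of the k explanatory variables as chi2 = -(n - 1 - (2k + 5) / 6) * ln(det(R)) with k(k - 1) / 2 degrees of freedom, where the null hypothesis is that the variables are orthogonal. The function name `farrar_glauber_chi2` and the synthetic data are assumptions made for this example, not a library API.

```python
import numpy as np
from scipy import stats

def farrar_glauber_chi2(X):
    """Chi-square test for overall multicollinearity among the columns of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    R = np.corrcoef(X, rowvar=False)  # k x k correlation matrix
    chi2 = -(n - 1 - (2 * k + 5) / 6) * np.log(np.linalg.det(R))
    dof = k * (k - 1) // 2
    p_value = stats.chi2.sf(chi2, dof)
    return chi2, dof, p_value

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)                   # unrelated to the others

chi2, dof, p = farrar_glauber_chi2(np.column_stack([x1, x2, x3]))
print(chi2, dof, p)
```

A small p-value rejects orthogonality, which points to multicollinearity somewhere among the variables. The full procedure follows up with further tests to locate which variables are responsible.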
3. How to test if the model coefficients are significant?
In classification or regression models we need to identify the important variables that have a strong influence on the target variable. The models perform some tests and report the significance of each variable.
A t-test is used in models to check the significance of the variables. The null hypothesis of the test is that the coefficient is zero. You need to check the p-values of the tests to understand the significance of the coefficients.
Python implementation:
from scipy import stats
rvs1 = stats.norm.rvs(loc=5, scale=10, size=500)
# One-sample t-test as a simple illustration: is the mean of rvs1 equal to 7?
print(stats.ttest_1samp(rvs1, 7))
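For regression coefficients specifically, the t-statistic is the estimated coefficient divided by its standard error. The sketch below works this out for ordinary least squares using only NumPy and SciPy. The data are synthetic and the variable names are made up for illustration; by construction `x1` has a real effect on `y` and `x2` has none.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                  # no effect on y by construction
y = 2.0 + 3.0 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])  # intercept, x1, x2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta
dof = n - X.shape[1]                       # residual degrees of freedom
sigma2 = resid @ resid / dof               # estimated error variance
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

t_stats = beta / se                        # H0 for each: coefficient is zero
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)

for name, b, t, p in zip(["const", "x1", "x2"], beta, t_stats, p_values):
    print(f"{name}: coef = {b:.3f}, t = {t:.2f}, p = {p:.4f}")
```

Libraries such as statsmodels report the same quantities in their regression summaries, so in practice you rarely compute them by hand.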
4. How to test the statistical significance of a model?
While developing a regression or classification model, we perform Analysis of Variance (ANOVA). It checks the validity of the regression coefficients. ANOVA compares the variation due to the model with the variation due to error. If the variation due to the model is significantly larger than the variation due to error, the effect of the variables is significant.
An F-test is used to make the decision. The null hypothesis in this test is that the regression coefficients are equal to zero.
Python implementation:
import scipy.stats as stats
data1 = stats.norm.rvs(loc=3, scale=1.5, size=20)
data2 = stats.norm.rvs(loc=-5, scale=0.5, size=20)
stats.f_oneway(data1,data2)
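The comparison of model variation with error variation can also be written out for a regression. The sketch below, on synthetic data with made-up variable names, splits the total sum of squares into a model part and an error part and forms the overall F-statistic F = (SS_model / k) / (SS_error / (n - k - 1)) for k explanatory variables.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 150, 2
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([1.5, -2.0]) + rng.normal(size=n)

design = np.column_stack([np.ones(n), X])  # add the intercept column
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ beta

ss_model = np.sum((fitted - y.mean()) ** 2)  # variation explained by the model
ss_error = np.sum((y - fitted) ** 2)         # variation left in the residuals

f_stat = (ss_model / k) / (ss_error / (n - k - 1))
p_value = stats.f.sf(f_stat, k, n - k - 1)   # H0: all slope coefficients are zero

print(f_stat, p_value)
```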
Forecasting
In data science we deal with two kinds of data: cross-sectional and time series. The profiles of a set of customers on an e-commerce website are cross-sectional data, whereas the daily sales of an item on the website over a year are time series data.
We often use forecasting models on time series data to estimate future sales or profits. But before forecasting, we run some diagnostic checks on the data to understand its pattern and its fitness for forecasting.
As a data scientist I frequently use these tests on time series data:
1. Test for trend
2. Test for stationarity
3. Test for autocorrelation
4. Test for causality
5. Test for temporal relationship
1. How to test for trend in time series data?
Data generated over time by a business often show an upward or downward trend. Be it sales, profit, or any other metric that depicts business performance, we always want to estimate its future movements.
To forecast such movements, you need to estimate or eliminate the trend component. To understand whether the trend is significant, you can use a statistical test.
The Mann-Kendall test can be used to test for the existence of a trend. The null hypothesis is that there is no monotonic trend.
Python implementation:
# Install the package first (shell command): pip install pymannkendall
import numpy as np
import pymannkendall as mk
data = np.random.rand(250,1)
test_result = mk.original_test(data)
print(test_result)
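To see what the test computes, here is a minimal sketch of the basic Mann-Kendall statistic written with NumPy and SciPy. It counts, over all pairs of observations, how often a later value is larger or smaller than an earlier one, then standardizes the count. The function `mann_kendall` is written for this illustration and assumes there are no tied values; it is not the pymannkendall implementation.

```python
import numpy as np
from scipy import stats

def mann_kendall(x):
    """Two-sided Mann-Kendall trend test, assuming no tied values."""
    x = np.asarray(x, dtype=float).ravel()
    n = len(x)
    # signs[i, j] = sign(x[j] - x[i]); keep only pairs with i < j
    signs = np.sign(x[None, :] - x[:, None])
    s = np.triu(signs, k=1).sum()
    var_s = n * (n - 1) * (2 * n + 5) / 18  # variance of S without ties
    if s > 0:
        z = (s - 1) / np.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / np.sqrt(var_s)
    else:
        z = 0.0
    p_value = 2 * stats.norm.sf(abs(z))
    return s, z, p_value

rng = np.random.default_rng(0)
trending = 0.05 * np.arange(100) + rng.normal(size=100)  # upward drift plus noise
s, z, p_value = mann_kendall(trending)
print(s, z, p_value)
```

A positive S with a small p-value points to an upward trend, and a negative S to a downward one.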
2. How to test whether time series data is stationary?
Non-stationarity is an inherent characteristic of most time series data. We always need to test for stationarity before any time series modeling. If the data are non-stationary, modeling may produce unreliable and spurious results and lead to a poor understanding of the data.
The Augmented Dickey-Fuller (ADF) test can be used to check for non-stationarity. The null hypothesis for ADF is that the series is non-stationary. At the 5% level of significance, if the p-value is less than 0.05, we reject the null hypothesis.
Python implementation:
from statsmodels.tsa.stattools import adfuller
X = [15, 20, 21, 20, 21, 30, 33, 45, 56]
result = adfuller(X)
print(result)
3. How to check autocorrelation among the values of a time series?
For time series data, dependence between past and present values is a common phenomenon. In financial time series we often see that the current price is influenced by the prices of the last few days. This feature of time series data is measured by autocorrelation.
To know whether the autocorrelation is strong enough, you can test for it. The Durbin-Watson test reveals its extent. The null hypothesis for this test is that there is no autocorrelation between the values. The statistic ranges from 0 to 4, and a value close to 2 indicates no first-order autocorrelation.
Python implementation:
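import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
X = [15, 20, 21, 20, 21, 30, 33, 45, 56]
# Durbin-Watson is computed on regression residuals, so fit a linear trend first
resid = sm.OLS(X, sm.add_constant(np.arange(len(X)))).fit().resid
result = durbin_watson(resid)
print(result)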
4. How can you test whether one variable has a causal effect on another?
Two time series variables can share a causal relationship. If you are familiar with financial derivatives, which are financial instruments defined on underlying stocks, you will know that spot and futures values have causal relationships. They influence each other according to the situation.
The causality between two variables can be tested with the Granger causality test. This test uses a regression setup: the current value of one variable is regressed on lagged values of the other variable along with its own lagged values. The null hypothesis of no causality is tested with an F-test.
Python implementation:
import statsmodels.api as sm
from statsmodels.tsa.stattools import grangercausalitytests
import numpy as np
data = sm.datasets.macrodata.load_pandas()
data = data.data[["realgdp", "realcons"]].pct_change().dropna()
gc_res = grangercausalitytests(data, 4)
5. How can you check the temporal relationship between two variables?
Two time series sometimes move together over time. In financial time series you will often observe that the spot and futures prices of derivatives move together.
These co-movements can be checked through a characteristic called cointegration, which can be tested with Johansen's test. The null hypothesis of this test is that there is no cointegration between the variables.
Python implementation:
import statsmodels.api as sm
from statsmodels.tsa.vector_ar.vecm import coint_johansen
data = sm.datasets.macrodata.load_pandas()
data = data.data[["realgdp", "realcons"]].pct_change().dropna()
jres = coint_johansen(data, det_order=0, k_ar_diff=1)
print(jres.max_eig_stat)
print(jres.max_eig_stat_crit_vals)
Data Grouping
Many times in real-life scenarios we try to find similarity among the data points. The intention is to group them into buckets and study them closely to understand how the different buckets behave.
The same applies to variables as well. We identify latent variables that are formed by combining a number of observable variables.
A retail store might be interested in forming segments among its customers, such as cost-conscious, brand-conscious, and bulk purchasers. This requires grouping the customers based on characteristics such as transactions, demographics, and psychographics.
In this area we often encounter the following tests:
1. Test of sphericity
2. Test for sampling adequacy
3. Test for clustering tendency
1. How to test for sphericity of the variables?
If the number of variables in the data is very high, regression models tend to perform badly. Besides, identifying the important variables becomes challenging. In this scenario, we try to reduce the number of variables.
Principal Component Analysis (PCA) is one method of reducing the number of variables and identifying major factors. These factors will help you build a regression model with reduced dimension. They also help identify key features of any object or incident of interest.
Now, variables can form factors only when they share some amount of correlation. This is tested with Bartlett's test of sphericity. The null hypothesis of this test is that the variables are uncorrelated.
Python implementation:
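import pandas as pd
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
a = [8.88, 9.12, 9.04, 8.98, 9.00, 9.08, 9.01, 8.85, 9.06, 8.99]
b = [8.88, 8.95, 9.29, 9.44, 9.15, 9.58, 8.36, 9.18, 8.67, 9.05]
c = [8.95, 9.12, 8.95, 8.85, 9.03, 8.84, 9.07, 8.98, 8.86, 8.98]
df = pd.DataFrame({'x': a, 'y': b, 'z': c})
# scipy.stats.bartlett tests equal variances; the sphericity test is in factor_analyzer
stat, p = calculate_bartlett_sphericity(df)
print(stat, p)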
2. How to test for sampling adequacy of variables?
The PCA method will produce a reliable result when the sample size is large enough. This is called sampling adequacy. It is to be checked for each variable.
The Kaiser-Meyer-Olkin (KMO) test is used to check sampling adequacy for the overall data set. The statistic measures the proportion of variance among the variables that might be common variance.
Python implementation:
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_kmo
a = [8.88, 9.12, 9.04, 8.98, 9.00, 9.08, 9.01, 8.85, 9.06, 8.99]
b = [8.88, 8.95, 9.29, 9.44, 9.15, 9.58, 8.36, 9.18, 8.67, 9.05]
c = [8.95, 9.12, 8.95, 8.85, 9.03, 8.84, 9.07, 8.98, 8.86, 8.98]
df = pd.DataFrame({'x': a, 'y': b, 'z': c})
kmo_all,kmo_model=calculate_kmo(df)
print(kmo_all,kmo_model)
3. How to test for clustering tendency of a data set?
To group the data into different buckets, we use clustering techniques. But before clustering, you need to check whether there is a clustering tendency in the data. If the data are uniformly distributed, they are not suitable for clustering.
The Hopkins test can check for spatial randomness of the variables. The null hypothesis in this test is that the data are generated from a uniform random distribution, that is, they contain no meaningful clusters.
Python implementation:
from sklearn import datasets
from pyclustertend import hopkins
from sklearn.preprocessing import scale
X = scale(datasets.load_iris().data)
hopkins(X,150)
In this article, I mentioned some frequently used tests in data science. There are many others that I could not mention. Let me know if you find some that I haven't mentioned here.
Reference:
https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.grangercausalitytests.html
https://pypi.org/project/pyclustertend/
Translated from: https://towardsdatascience.com/what-are-the-commonly-used-statistical-tests-in-data-science-a95cfc2e6b5e