What Is the Z Test in Inferential Statistics and How Does It Work?
I Feel:
The more you analyze data, the more enlightened a data engineer you will become.
In data engineering, you will often need to establish whether a data sample drawn from a population is reliable enough to build a model around. For instance, the data may have come from an old archive and may no longer represent the true behavior of the process it models in a production environment; behavior changes with time, and so does the process on which the model was built.
So if we go ahead and build a new model around such old sample data, we may end up with a faulty process, and the model will not be effective or useful. What we do instead is perform an inferential statistical test to check that the data is reliable.
One such test is the Normal Deviate Z Test. Before we go ahead and build a model around our sample data, we test it to infer whether it came from the population that truly represents process behavior in the production environment.
Earlier, in part 1 of Inferential Statistics, we learned about the Chi-Square test.
I invite you all to read it. As promised, today we will cover more of the statistical testing techniques used in inferential-statistics hypothesis testing to establish sample data reliability. So let's get started with one such test, the normal deviate Z test, which we will cover in detail.
What Is the Normal Deviate Z Test and How Does It Work?
When we try to establish the reliability of a large sample data set (a sample size above 30 is the norm) using the normal deviate Z test, we compare the means of two distributions, such as the sample data in our data science project and the production data.
The Z-test compares the sample mean and the population mean to determine whether there is a significant difference. The Z test statistic is assumed to follow a normal distribution, and nuisance parameters such as the standard deviation should be known in order to perform an accurate Z-test.
How Does the Normal Deviate Z Test Work?
We will walk through how the Z test works in the following steps.
Step 1: Establishing the Hypothesis:
This is the first thing a data engineer needs to state before performing any statistical test in inferential statistics.
H0: The difference between the sample mean and the population mean is a statistical fluctuation.
H1: The difference between the sample mean and the population mean is significant; it is too large to be the result of statistical fluctuation.
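In symbols, and assuming a two-sided test (which is what the worked example later in this article uses), the two hypotheses can be sketched as follows. Here \mu is the mean of the population the sample was actually drawn from and \mu_0 is the known reference population mean; both symbols are notation introduced here for illustration.

```latex
H_0:\ \mu = \mu_0
\qquad\text{versus}\qquad
H_1:\ \mu \neq \mu_0
```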
Step 2: Calculating the Z Test Statistic
Prerequisites: to perform a Z test on normally distributed data, the following must hold:
- Number of samples >= 30
- The mean and standard deviation of the population should be known
Z-Test Statistic Formula:
The Z measure is calculated as:
Z = (M - μ) / SE
where M is the sample mean to be standardized, μ (mu) is the population mean, and SE is the standard error of the mean.
SE is calculated using the formula below:
SE = s / SQRT(n)
where s is the population standard deviation and n is the sample size.
The standard error is the standard deviation of the sampling distribution of the mean (the distribution described by the central limit theorem).
The formula above looks very similar to the z-score calculation because both standardize a value against the population mean: the z-score divides by the standard deviation to place a single observation, while the normal deviate Z divides by the standard error to place a sample mean, which makes it a test of statistical significance.
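The two formulas above can be sketched as a pair of small helper functions. This is a minimal illustration, not the article's own code: the function names and the numbers (a sample mean of 52, a population mean of 50, a population standard deviation of 10, and n = 100) are hypothetical and chosen only to keep the arithmetic easy to check by hand.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean: SE = s / sqrt(n)."""
    return sigma / math.sqrt(n)


def z_statistic(sample_mean: float, mu: float, sigma: float, n: int) -> float:
    """Normal deviate Z = (M - mu) / SE."""
    return (sample_mean - mu) / standard_error(sigma, n)


# Hypothetical numbers, chosen only to keep the arithmetic easy to follow
se = standard_error(sigma=10.0, n=100)                         # 10 / sqrt(100) = 1.0
z = z_statistic(sample_mean=52.0, mu=50.0, sigma=10.0, n=100)  # (52 - 50) / 1.0 = 2.0
print(se, z)
```

Because n sits under a square root, quadrupling the sample size only halves the standard error, which is why large samples make even small differences in means show up as large Z values.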
Step 3: Analyze the Z Value to Interpret the P-Value
Once we have the Z value, we calculate the p-value, on the basis of which we either reject or fail to reject the null hypothesis.
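As a rough sketch of this step, a Z value can be turned into a two-sided p-value with the survival function of the standard normal distribution in SciPy. The helper name `two_sided_p_value`, the Z value of 2.0, and the 0.05 significance level are assumptions made for illustration.

```python
from scipy.stats import norm


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    # sf is the survival function, 1 - cdf; doubling it covers both tails
    return float(2 * norm.sf(abs(z)))


alpha = 0.05      # assumed significance level
z_example = 2.0   # hypothetical Z value
p = two_sided_p_value(z_example)
print(f"p-value = {p:.4f}")  # roughly 0.0455
print("reject H0" if p < alpha else "fail to reject H0")
```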
Example Using Python and Jupyter Notebook:
Let's work through the steps above with a practical example.
Install the Anaconda distribution:
Download the latest version of Anaconda for Python based on your OS. It comes with Jupyter Notebook pre-installed, along with required Python packages such as pandas and SciPy.
Once you are done with the installation, launch Jupyter Notebook and enter the following code to get started.
Import the Required Packages:
Let's import the relevant Python packages as shown below and create a data frame by reading "pima-indians-diabetes.csv", sourced from Kaggle.
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
import scipy.stats as st
import seaborn as sns

# Read the CSV file into df as a pandas DataFrame
df = pd.read_csv("pima-indians-diabetes.csv")
Let's view the first 20 rows of the sample data set by calling df.head(20).
df.head(20)
Step 1: Let's Formulate Our Null and Alternate Hypotheses:
Null Hypothesis:
H0: The difference between the mean of the sample BP column (the Pres column in the data frame) and the population mean for BP is a statistical fluctuation.
Alternate Hypothesis:
H1: The difference between the mean of the sample BP column and the population mean is significant and is not a case of mere statistical fluctuation.
Step 2: Let's Calculate the Z Stat (Z Test):
As we have already discussed, the Z stat formula is
Z = (M - μ) / SE
where M is the sample mean to be standardized, μ (mu) is the population mean, and SE is the standard error of the mean.
Here is the code snippet that performs this calculation in Jupyter Notebook:
# Prerequisites: number of samples >= 30, and the mean and standard
# deviation of the population should be known.
# Here the population mean for diastolic blood pressure is 71.3,
# with a standard deviation of 7.2.
# Apply the normal deviate Z test to the blood pressure (Pres) column.

# mu = μ, the population mean
mu = 71.3  # source - http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_BiostatisticsBasics/BS704_BiostatisticsBasics3.html
std = 7.2

# M, the mean of the BP column (Pres) in the data frame
MeanOfBpSample = np.average(df['Pres'])
print("Mean Of BP Column", MeanOfBpSample)

# n is the number of rows (observations); df.size would count every cell
n = len(df)
SE = std / np.sqrt(n)
print(f"Standard Error: {SE:.4f}")

# Z_norm_deviate = (sample_mean - population_mean) / std_error
Z_norm_deviate = (MeanOfBpSample - mu) / SE
print(f"Normal Deviate Z value: {Z_norm_deviate:.4f}")
If you run the above code in your notebook on the 768-row data set, you will see the following output:
Mean Of BP Column 69.10546875
Standard Error: 0.2598
Normal Deviate Z value: -8.4468
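One detail is easy to get wrong in the standard-error step: for a pandas DataFrame, `df.size` is the number of cells (rows times columns), whereas the sample size n in SE = s / SQRT(n) is the number of rows. A small sketch with a made-up frame shows the difference; the `toy` frame, its values, and the assumed standard deviation of 2.0 are hypothetical and are not the Pima data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 4 rows (observations) and 3 columns
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [5, 6, 7, 8],
    "c": [9, 10, 11, 12],
})

n_cells = toy.size            # rows * columns = 12
n_rows = len(toy)             # number of observations = 4
sigma = 2.0                   # assumed population standard deviation
se = sigma / np.sqrt(n_rows)  # 2 / sqrt(4) = 1.0
print(n_cells, n_rows, se)
```

Using the cell count instead of the row count shrinks the standard error and inflates the magnitude of Z, so it is worth checking which of the two a snippet uses.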
Now that we know the Z value, let's find our p-value.
Calculating the P-Value, Code Snippet:
# We use the survival function sf of the normal distribution in scipy.stats.
# Multiplying sf by 2 gives the two-sided p-value (a two-tailed test).
p_value = scipy.stats.norm.sf(abs(Z_norm_deviate)) * 2
print('p value:', p_value)

if p_value > 0.05:
    print('Samples are likely drawn from the same distributions (fail to reject H0)')
else:
    print('Samples are likely drawn from different distributions (reject H0)')
Step 3: Analyze the Z Value to Interpret the P-Value
Running the snippet above in Jupyter prints a p-value that is extremely small, many orders of magnitude below the widely used significance level of 0.05. We therefore conclude that the given sample is unlikely to have come from the population distribution on which the process was built. The difference between the mean of the sample BP column and the population mean is significant, so we reject the null hypothesis H0 in favor of the alternate hypothesis H1.
Because the normal deviate Z test rejects the null hypothesis here, it is advisable not to build an ML model on this sample data.
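The same decision can also be reached without computing a p-value, by comparing |Z| with a critical value of the standard normal distribution. The sketch below assumes a two-sided test; the function name `reject_null` and the example Z values are hypothetical.

```python
from scipy.stats import norm


def reject_null(z: float, alpha: float = 0.05) -> bool:
    """Two-sided decision: reject H0 when |z| exceeds the critical value."""
    z_critical = norm.ppf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return bool(abs(z) > z_critical)


print(reject_null(-2.5))  # True: |z| lies beyond the critical value
print(reject_null(1.0))   # False: |z| lies inside the critical value
```

Both routes give the same answer: |Z| exceeds the critical value exactly when the two-sided p-value falls below alpha.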
Aspiring and working data engineers need a clear understanding of the p-value, since it is the basis of most statistical data reliability tests. So let me quickly cover a few basics here; we will look at it more deeply in a separate article devoted entirely to the p-value.
What Is the P-Value?
The p-value, or probability value, is the probability of obtaining a result at least as extreme (as small or as large) as the one observed in the sample, given that the null hypothesis is true.
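For a two-sided Z test, this definition can be written as below. Here z_obs denotes the observed statistic, Z a standard normal random variable, and \Phi the standard normal cumulative distribution function; the symbols are notation chosen here for illustration.

```latex
p = P\left(|Z| \ge |z_{\text{obs}}| \,\middle|\, H_0\right)
  = 2\left(1 - \Phi\left(|z_{\text{obs}}|\right)\right)
```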
How to Calculate the P-Value in General?
1. Frame your hypotheses.
2. Assume the null hypothesis to be true.
3. Calculate the z or t value for the observed sample.
4. From the z/t-table, find the probability associated with the z or t value obtained above. You can also find the p-value with SciPy's built-in methods; you just need to pass in the z or t statistic calculated in step 3.
5. This probability is the p-value you are looking for.
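The table lookup in these steps can be sketched with SciPy's distribution objects. This is one possible way to do it, assuming a two-sided test; the helper name `p_value`, its `dof` parameter, and the example statistics are hypothetical.

```python
from typing import Optional

from scipy import stats


def p_value(statistic: float, dof: Optional[int] = None) -> float:
    """Two-sided p-value for a z statistic (dof is None) or a t statistic."""
    if dof is None:
        # z statistic: use the standard normal distribution
        return float(2 * stats.norm.sf(abs(statistic)))
    # t statistic: use Student's t distribution with the given degrees of freedom
    return float(2 * stats.t.sf(abs(statistic), dof))


# Hypothetical statistics, for illustration only
p_z = p_value(1.96)          # close to 0.05
p_t = p_value(1.96, dof=10)  # larger than p_z: the t distribution has heavier tails
print(p_z, p_t)
```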
We will cover p-value calculations, their interpretation, and their use cases separately later on. You will also see the p-value in action as we cover all the hypothesis test types on our journey through inferential statistics.
What's Next?
In our next article, "Inferential Statistics: Hypothesis Testing Using the T-Test", we will cover the T-test in detail.
Before closing, I would like to cover some basics of the T-test.
What Is a T-Test?
A t-test is an inferential statistical test used to find out whether there is a significant difference between the means of two given groups, which may be related through certain features.
A t-test uses the t-statistic, the t-distribution values, and the degrees of freedom to determine whether the difference between two sets of data is statistically significant.
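For the one-sample case, the t-statistic has the same shape as the Z formula, except that the standard deviation is estimated from the sample instead of being known for the population. A sketch in standard notation follows: \bar{x} is the sample mean, \mu_0 the reference mean, s the sample standard deviation (not the population value used in the Z test), and n the sample size. These symbols are introduced here for illustration.

```latex
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}},
\qquad \text{degrees of freedom} = n - 1
```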
Types of T-Test:
There are three types of t-test:
One-sample t-test:
- Used to compare a sample mean with a known population mean or some other meaningful, fixed value.
Independent samples t-test:
- Used to compare two means from independent groups.
Paired samples t-test:
- Used to compare two means that are repeated measures for the same participants; scores might be repeated across different measures or across time.
- Used also to compare paired samples, as in a two-treatment randomized block design.
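As a preview, the three types listed above map onto three SciPy functions. The sketch below runs them on synthetic data generated purely for illustration; the group names, means, spreads, and the reference value of 71.3 are assumptions, not results from the article's data set.

```python
import numpy as np
from scipy import stats

# Synthetic data, generated only for illustration
rng = np.random.default_rng(0)
group_a = rng.normal(loc=70.0, scale=7.0, size=40)
group_b = rng.normal(loc=73.0, scale=7.0, size=40)
after = group_a + rng.normal(loc=1.0, scale=2.0, size=40)  # repeated measure on group_a

# One-sample: compare a sample mean with a fixed reference value
one_sample = stats.ttest_1samp(group_a, popmean=71.3)

# Independent samples: compare the means of two unrelated groups
independent = stats.ttest_ind(group_a, group_b)

# Paired samples: compare two measurements taken on the same subjects
paired = stats.ttest_rel(group_a, after)

for name, result in [("one-sample", one_sample),
                     ("independent", independent),
                     ("paired", paired)]:
    print(f"{name}: t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```

Each call returns a result with a `statistic` and a `pvalue`, which are interpreted the same way as the Z value and p-value earlier in this article.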
We will cover how to perform the T-tests described above using examples and hands-on lab exercises.
Refer to the graphic below, which shows a decision tree to help you choose the right kind of hypothesis test for a given problem statement.
How to Decide What Test to Use and When?
[Figure: decision tree for choosing a hypothesis test]
Summary:
Never rely on plain observation or assumption when you build a model on a given sample. Make sure you determine its distribution type and test the data sample with statistical hypothesis testing to confirm that your sample data is reliable. Descriptive and inferential statistical techniques are designed to help you make better decisions about data sampling before you model the data with machine learning.
Since data cleansing and EDA will fill a large part of your working life as a data scientist, it is imperative that you take responsibility for handling data with the utmost clarity and care and test it for reliability. Your work will influence market dynamics in a big way, because your model is going to drive some really critical business decisions.
I Feel:
Going wrong with data interpretation while building ML models can be costly. So don't build models just for the sake of building them; make sure the model has been fed the right kind of food in terms of data. The right data feeding habits will do wonders when your machine makes intelligent and precise ML-based predictions and recommendations for your business. Everybody in the ecosystem benefits from a model building process that is done right.
Translated from: https://medium.com/swlh/what-is-z-test-in-inferential-statistics-how-it-works-3dde6eae64e5