What Is the Z-Test in Inferential Statistics and How Does It Work?


I Feel:

The more you analyze data, the more enlightened a data engineer you will become.

In data engineering, you will often need to establish whether a data sample drawn from population data is reliable enough to build a model around. For instance, the data may have come from an old archive that no longer represents the true behavior of the process in a production environment. Behavior changes with time, and so does the process on which the model was built.

If we go ahead and build a new model around such old sample data, we may end up with a faulty process, and the model will not be effective or useful. So we perform an inferential statistical test to check that the data is reliable.

One such test is the Normal Deviate Z Test. Before we go ahead and build a model around our sample data, we use the test to infer whether the sample has come from the population data that truly represents process behavior in a production environment.

Earlier, in part 1 of this series on inferential statistics, we learned about the Chi-Square test. I invite you all to read it.

As promised, today we will cover more of the statistical testing techniques used in inferential hypothesis testing to establish sample data reliability. So let's get started with one such test, the normal deviate Z test, which we will cover in detail in this article.

What Is the Normal Deviate Z Test & How Does It Work?

When we try to establish the reliability of a large sample data set (a sample size of at least 30 is the usual rule of thumb) using the normal deviate Z test, we compare the means of two distributions, such as the given sample data in our data science project and the production (population) data.

The Z-test compares the sample mean with the population mean to determine whether the difference between them is significant. The Z test statistic is assumed to follow a normal distribution, and nuisance parameters such as the population standard deviation must be known in order to perform an accurate Z-test.

How Does the Normal Deviate Z Test Work?

We will walk through how the Z test works in the following steps.

Step 1: Establishing the Hypotheses

Stating the hypotheses is the first thing a data engineer needs to do before performing any statistical test in inferential statistics.

H0: The difference between the sample mean and the population mean is a statistical fluctuation.

H1: The difference between the sample mean and the population mean is significant; it is too large to be the result of statistical fluctuation.

Step 2: Calculating the Z Test Statistic

Prerequisites: To perform a Z test on normally distributed data, the following conditions must hold:

  • Number of samples >= 30,
  • The mean and standard deviation of the population should be known

Z-Test Statistic Formula:

The Z measure is calculated as:

Z = (M - μ) / SE

Where M is the sample mean to be standardized, μ (mu) is the population mean, and SE is the standard error of the mean.

SE is calculated using the following formula:

SE = s / sqrt(n)

Where s is the population standard deviation and n is the sample size.
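The two formulas above can be sketched as a small helper. This is only an illustration: the function name `z_statistic` and the numbers passed to it are hypothetical and are not taken from the article's data set.

```python
import math


def z_statistic(sample_mean, population_mean, population_std, n):
    """Return (Z, SE) for a one-sample Z test.

    SE = s / sqrt(n) and Z = (M - mu) / SE, as in the formulas above.
    """
    standard_error = population_std / math.sqrt(n)
    z = (sample_mean - population_mean) / standard_error
    return z, standard_error


# Hypothetical numbers, chosen only to make the arithmetic easy to follow
z, se = z_statistic(sample_mean=52.0, population_mean=50.0,
                    population_std=10.0, n=100)
print(f"SE = {se:.2f}, Z = {z:.2f}")  # SE = 1.00, Z = 2.00
```

Note that n is the number of observations in the sample, so a larger sample shrinks SE and makes the same difference in means produce a larger Z.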

The standard error is the standard deviation of the sampling distribution of the mean, which the central limit theorem tells us is approximately normal for large samples.
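To make the idea of a standard error more concrete, here is a small simulation sketch. It assumes a hypothetical normal population (mean 50, standard deviation 10), draws many samples of size n, and compares the spread of the sample means with s / sqrt(n). The population values, sample counts, and variable names are all illustrative assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

population_mean, population_std = 50.0, 10.0  # hypothetical population
n = 100             # size of each sample
num_samples = 2000  # how many samples we draw

# Draw many samples and record the mean of each one
sample_means = [
    statistics.fmean(random.gauss(population_mean, population_std)
                     for _ in range(n))
    for _ in range(num_samples)
]

# Standard deviation of the sample means vs. the formula s / sqrt(n)
empirical_se = statistics.stdev(sample_means)
theoretical_se = population_std / n ** 0.5

print(f"Spread of the sample means: {empirical_se:.3f}")
print(f"s / sqrt(n):                {theoretical_se:.3f}")
```

The two printed values should be close to each other, which is the reason the Z formula divides by SE rather than by the population standard deviation itself.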

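The formula above looks very similar to the Z-score calculation. Both standardize a value by subtracting a mean and dividing by a measure of spread: the Z score does this for a single observation using the standard deviation, while the normal deviate Z does it for a sample mean using the standard error, which is what lets it serve as a test of statistical significance.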

Step 3: Analyze the Z Value to Interpret the P-Value

Once we have the Z value, we calculate the p-value, based on which we either reject or fail to reject the null hypothesis.
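As a rough sketch of this step, the snippet below converts a few Z values into two-sided p-values with `scipy.stats.norm.sf` and applies a 0.05 significance level. The helper name `two_sided_p_value` and the Z values are hypothetical, chosen only to show both outcomes.

```python
from scipy.stats import norm


def two_sided_p_value(z):
    """Two-sided p-value for a Z statistic under the standard normal."""
    # sf is the survival function, 1 - cdf; doubling covers both tails
    return 2 * norm.sf(abs(z))


alpha = 0.05  # significance level used in this article

for z in (0.5, -3.0):  # hypothetical Z values
    p = two_sided_p_value(z)
    decision = "fail to reject H0" if p > alpha else "reject H0"
    print(f"Z = {z}: p = {p:.4f} -> {decision}")
```

Taking the absolute value of Z means the sign of the difference does not matter in a two-tailed test; only its size does.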

Example Using Python & Jupyter Notebook:

Let's work through the steps above using a practical example.

Install Anaconda distribution:

Download the latest Anaconda distribution for Python for your operating system from the Anaconda download page. It comes with Jupyter Notebook pre-installed, along with the required Python packages such as pandas and SciPy.

Once you are done with the installation, launch Jupyter Notebook and write (or copy) the following code to get started.

Import The Required Packages:

Let's import some relevant Python packages as shown below and create a data frame by reading "pima-indians-diabetes.csv", sourced from Kaggle.

import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
import scipy.stats as st
import seaborn as sns

# Read the CSV file into df as a pandas DataFrame
df = pd.read_csv("pima-indians-diabetes.csv")

Let's view the data frame by calling df.head(20), which shows the first 20 rows of the given sample data set.

df.head(20)
[Image: first 20 rows of the data frame]

Step 1: Let's Formulate Our Null and Alternate Hypotheses:

Null Hypothesis:

H0: The difference between the mean of the sample BP column (the Pres column in the data frame) and the population mean for BP is a statistical fluctuation.

Alternate Hypothesis:

H1: The difference between the mean of the sample BP column and the population mean is significant, and is not a case of mere statistical fluctuation.

Step 2: Let's Calculate the Z Stat (Z Test):

As we have already discussed, the Z statistic formula is:

Z = (M - μ) / SE

Where M is the sample mean to be standardized, μ (mu) is the population mean, and SE is the standard error of the mean.

So let's do this calculation in Jupyter Notebook.

Here is the code snippet to calculate the Z test:

# Prerequisites: number of samples >= 30, and the mean and standard
# deviation of the population should be known.
# Here the population diastolic blood pressure has a mean of 71.3
# and a standard deviation of 7.2.
# source - http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_BiostatisticsBasics/BS704_BiostatisticsBasics3.html

# Apply the normal deviate Z test to the blood pressure (Pres) column
mu = 71.3   # population mean (μ)
std = 7.2   # population standard deviation

# M: the mean of the BP column (Pres) in the given data frame
MeanOfBpSample = np.average(df['Pres'])
print("Mean Of BP Column:", round(MeanOfBpSample, 4))

# n is the number of rows; df.size would count rows * columns instead
n = len(df)
SE = std / np.sqrt(n)
print("Standard Error:", round(SE, 4))

# Z_norm_deviate = (sample_mean - population_mean) / standard_error
Z_norm_deviate = (MeanOfBpSample - mu) / SE
print("Normal Deviate Z Value:", round(Z_norm_deviate, 4))

If you run the above code in your notebook on the 768-row data set, you will see the following output:

Mean Of BP Column: 69.1055
Standard Error: 0.2598
Normal Deviate Z Value: -8.4468

Now that we know the Z test value, let's find our p-value.

Calculating P-Value, Code Snippet:

# We use the survival function sf of the standard normal distribution
# from scipy.stats. Multiplying by 2 gives the two-sided p-value
# for a two-tailed test.
p_value = scipy.stats.norm.sf(abs(Z_norm_deviate)) * 2
print('p value:', p_value)

if p_value > 0.05:
    print('Sample is likely drawn from the same distribution (fail to reject H0)')
else:
    print('Sample is likely drawn from a different distribution (reject H0)')

Running the above code snippet in Jupyter prints the p-value followed by the "reject H0" message.

Step 3: Analyze the Z Value to Interpret the P-Value

The p-value comes out to roughly 3e-17. As the p-value is far below the conventional significance level of 0.05, we can conclude that the given sample has not come from the same population distribution on which the process was built. There is a significant difference between the mean of the sample BP column and the population mean, so we reject the null hypothesis H0 and accept the alternate hypothesis H1.

Since we reject the null hypothesis here using the normal deviate Z test, it is advisable to avoid building an ML model on this sample data.

Aspiring and working data engineers need a clear understanding of the p-value, because it is the basis of most statistical data reliability tests. So let me quickly cover a few basics here; we will look into it more deeply in a separate article devoted entirely to the p-value.

What Is P-Value?

The p-value, or probability value, tells us the probability of getting a value at least as extreme (as small or as large) as the one observed in the sample, given that our null hypothesis is true.

How to calculate p-value in general?

  1. Frame your hypotheses

  2. Assume the null hypothesis to be true

  3. Calculate the z or t statistic from the sample

  4. From the z/t-table, find the probability associated with the z or t value obtained above. You can also find the p-value with SciPy's built-in methods; you just need to pass in the z or t statistic calculated in step 3.

  5. This probability is the p-value
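Steps 3 to 5 can be sketched with SciPy instead of a printed z/t-table. The helper below is a hypothetical illustration (the name `p_value` and its `tail` argument are assumptions, not part of SciPy); it relies only on the `sf` and `cdf` methods of frozen `scipy.stats` distributions.

```python
from scipy import stats


def p_value(statistic, dist, tail="two-sided"):
    """Look up the p-value for a test statistic.

    dist is a frozen scipy.stats distribution that describes the
    statistic under H0, e.g. stats.norm() for z or stats.t(df=9) for t.
    """
    if tail == "two-sided":
        return 2 * dist.sf(abs(statistic))
    if tail == "greater":
        return dist.sf(statistic)   # probability of a value this large or larger
    if tail == "less":
        return dist.cdf(statistic)  # probability of a value this small or smaller
    raise ValueError("tail must be 'two-sided', 'greater' or 'less'")


# The same statistic of 2.0, read from the z-table and from a t-table
print("z:", round(p_value(2.0, stats.norm()), 4))
print("t:", round(p_value(2.0, stats.t(df=9)), 4))
```

For the same statistic, the t distribution with few degrees of freedom gives a larger p-value than the normal distribution, because its tails are heavier.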

We will cover p-value calculations, their interpretation, and their use cases separately later on. You will also see the p-value in action as we cover the other hypothesis test types in our journey through inferential statistics.

What's Next?

In our next article, "Inferential Statistics: Hypothesis Testing using T-Test", we will cover the T-test in detail.

Before leaving you, I would like to cover some basics of the T-test.

What Is T-Test?

A t-test is a kind of inferential statistic used to find out whether there is a significant difference between the means of two given groups, which may be related in certain features.

A t-test uses the t-statistic, the t-distribution values, and the degrees of freedom to assess whether the difference between two sets of data is statistically significant.

Types Of T-Test:

There are three types of t-test:

One-sample t-test:

Used to compare a sample mean with a known population mean or some other meaningful, fixed value.
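As a minimal sketch of a one-sample t-test, the snippet below uses `scipy.stats.ttest_1samp`. The ten readings are made-up values for illustration only; the fixed value 71.3 is the population mean used earlier in this article.

```python
from scipy import stats

# Hypothetical diastolic blood pressure readings (not from the data set)
sample = [72, 68, 75, 70, 66, 74, 71, 69, 73, 67]

# Compare the sample mean with a fixed, known value
t_stat, p = stats.ttest_1samp(sample, popmean=71.3)

print(f"t = {t_stat:.3f}, p = {p:.3f}")
if p > 0.05:
    print("Fail to reject H0: the sample mean is consistent with 71.3")
else:
    print("Reject H0: the sample mean differs from 71.3")
```

Unlike the Z test, this test estimates the standard deviation from the sample itself, so the population standard deviation does not need to be known.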

Independent samples t-test:

Used to compare the means of two independent groups.
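An independent samples t-test might look like the following sketch, which uses `scipy.stats.ttest_ind`. Both groups are hypothetical measurements invented for this example.

```python
from scipy import stats

# Hypothetical measurements from two unrelated groups
group_a = [5.1, 4.9, 5.6, 5.8, 5.0, 5.3]
group_b = [6.2, 6.0, 6.8, 6.5, 6.1, 6.4]

# By default ttest_ind assumes equal variances (Student's t-test);
# passing equal_var=False would run Welch's t-test instead
t_stat, p = stats.ttest_ind(group_a, group_b)

print(f"t = {t_stat:.3f}, p = {p:.5f}")
if p > 0.05:
    print("Fail to reject H0: the group means are not significantly different")
else:
    print("Reject H0: the group means differ significantly")
```

The two groups do not need to be the same size, because the observations are not matched to each other.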

Paired samples t-test:

  1. Used to compare two means that are repeated measures for the same participants; scores might be repeated across different measures or across time.
  2. Also used to compare paired samples, as in a two-treatment randomized block design.
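For the paired case, a minimal sketch with `scipy.stats.ttest_rel` is shown below. The before and after readings are hypothetical values for the same eight participants, listed in the same order.

```python
from scipy import stats

# Hypothetical readings for the same eight participants, in the same order
before = [140, 152, 138, 145, 150, 147, 143, 149]
after = [135, 147, 136, 140, 146, 141, 140, 143]

# ttest_rel works on the per-participant differences (before - after)
t_stat, p = stats.ttest_rel(before, after)

print(f"t = {t_stat:.3f}, p = {p:.6f}")
if p > 0.05:
    print("Fail to reject H0: no significant change between measurements")
else:
    print("Reject H0: the measurements changed significantly")
```

Because each value in `before` is matched with the value at the same position in `after`, both lists must have the same length and ordering.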

We will cover how to perform the T-tests described above using examples and hands-on lab exercises.

Refer to the graphic below, which shows a decision tree to help you choose the right kind of hypothesis test for a given problem statement.

How To Decide What Test To Use & When?

[Image: decision tree for choosing a hypothesis test]

Summary:

Never rely on plain observation or assumption when you build a model on a given sample. Make sure you measure its distribution type and test the data sample using statistical hypothesis testing, so that you know your sample data is reliable. Descriptive and inferential statistical techniques are designed to help you make better decisions about data sampling before you model the data in machine learning.

Since data cleansing and EDA will fill the larger part of your work life as a data scientist, it is imperative that you handle data with the utmost clarity and care and test it for reliability. Your models will influence market dynamics at large, because they will drive some really critical business decisions.

I Feel:

Getting data interpretation wrong while building ML models can be very costly. So don't build models just for the sake of building them; make sure each one has been fed the right kind of food in terms of data. The right data feeding habits will do wonders when your machine makes intelligent and precise ML-based predictions and recommendations for your business. Everybody in the ecosystem benefits when the model building process is done right.

Translated from: https://medium.com/swlh/what-is-z-test-in-inferential-statistics-how-it-works-3dde6eae64e5
