數據統計 測試方法_統計測試:了解如何為數據選擇最佳測試!

數據統計 測試方法

This post is not meant for seasoned statisticians. This is geared towards data scientists and machine learning (ML) learners & practitioners, who like me, do not come from a statistical background.

?他的職位是不是意味著經驗豐富的統計人員。 這是針對數據科學家和機器學習(ML)學習者和從業者的 ,他們和我一樣,并非來自統計背景。

For a person being from a non-statistical background the most confusing aspect of statistics, are the fundamental statistical tests, and when to use which test?. This post is an attempt to mark out the difference between the most common tests and the relevant key assumptions.

對于一個非統計學背景的人來說,統計方面最令人困惑的方面是基本統計檢驗 ,以及何時使用哪種檢驗? 這篇文章是試圖指出最常見的測試和相關的關鍵假設之間的差異。

目錄 (Table of contents)

  1. Terminologies: (KEY TERMINOLOGIES FOR THIS POST)

    術語:( 此職位的主要術語)

  2. Statistical Test(Hypothesis Testing)

    統計檢驗(假設檢驗)
  3. Statistical Assumptions

    統計假設
  4. Parametric tests

    參數測試
  5. Parametric test Flowchart

    參數測試流程圖
  6. Dealing with non-normal distributions (Non-Parametric tests)

    處理非正態分布(非參數檢驗)

1)術語: (1) TERMINOLIGIES:)

獨立變量和獨立變量 (DEPENDENT AND INDEPENDENT VARIABLES)

An independent variable often called “predictor variable”, is a variable that is being manipulated in order to observe the effect on a dependent variable, sometimes called an outcome/output variable.

通常被稱為“預測變量”的自變量是為了觀察對因變量的影響而被操縱的變量,有時稱為結果/輸出變量。

  • Independent variable(s)-> Predictor variable(s)

    自變量->預測變量
  • Dependent variable(s) -> Outcome/Output variable(s)

    因變量->結果/輸出變量

變量類型 (TYPES OF VARIABLES)

It is important to distinguish the difference between the type of variables because this plays a key role in determining the correct type of statistical test to adopt. There are two main categories:

區分變量類型之間的差異非常重要,因為這在確定要采用的正確統計檢驗類型中起著關鍵作用。 主要有兩個類別:

  • QUANTITATIVE: express the amounts of things (e.g. the number of cigarettes in a pack). The two different types of quantitative variables are:

    數量 表達物品的數量(例如,一包香煙的數量)。 兩種不同類型的定量變量是:

  1. CONTINOUS (a.k.a Ratio): is used to describe measures and can usually be divided into units smaller than one (e.g. 1.50 kg).

    連續 (又稱比率 ):用于描述度量,通常可以劃分為小于一的單位(例如1.50千克)。

  2. DISCRETE (a.k.a Interval): is used to describe counts and usually can’t be divided into units smaller than one (e.g. 1 cigarette).

    DISCRETE (又名Interval ):用于描述計數,通常不能分為小于1的單位(例如1支香煙)。

  • CATEGORICAL: express groupings of things (e.g. the different type of fruits). The three different types of categorical variables are:

    類別 表達事物的分組(例如,不同類型的水果)。 三種不同類型的類別變量是:

  1. ORDINAL: represent data with an order (e.g. rankings).

    序數表示具有順序的數據(例如排名)。

  2. NOMINAL: represent group names (e.g. brands or species names).

    名詞代表組名(例如品牌或品種名稱)。

  3. BINARY: represent data with a yes/no or 1/0 outcome (e.g. LEFT or RIGHT).

    BINARY :表示結果為是/否或1/0的數據(例如,左或右)。

Image for post
TYPES OF VARIABLES SUMMARY (Image by author)
變量類型摘要(作者提供)

2)統計測試 (2) STATISTICAL TESTS)

Statistics is all about data. Data alone is not interesting. It is the interpretation of the data that we are interested in.

統計信息都是關于數據的。 單獨的數據并不有趣。 它是對我們感興趣的數據的解釋。

In Statistics, one very important thing is statistical testing, if statistics “is the interpretation of the data”, statistical testing can be considered as the “formal procedure for investigating our ideas about the world”.

在統計中,非常重要的一件事是統計測試,如果統計“是對數據的解釋”,則統計測試可以被視為“調查我們對世界的看法的正式程序”。

In other words, whenever we want to make claims about the distribution of data or whether one set of results are different from another set of results, data scientists must rely on hypothesis testing.

換句話說,每當我們要對數據的分布或一組結果是否與另一組結果有所不同時,數據科學家必須依靠假設檢驗。

假設檢驗 (HYPOTHESIS TESTING)

Using Hypothesis Testing, we try to interpret or draw conclusions about the population using sample data, evaluating two mutually exclusive statements about a population to determine which statement is best supported by the sample data.

使用“ 假設檢驗” ,我們嘗試使用樣本數據來解釋或得出有關總體的結論,評估關于總體的兩個互斥陳述,以確定樣本數據最能支持哪種陳述。

假設檢驗有五個主要步驟: (THERE ARE FIVE MAIN STEPS IN HYPOTHESIS TESTING:)

Step 1) State your hypothesis as a Null (Ho) and Alternate (Ha) hypothesis.

步驟1)將您的假設陳述為零(Ho)和替代(Ha)假設。

Step 2) Choose a significance level (also called alpha or α).

步驟2)選擇顯著性水平(也稱為alpha或α)。

Step 3) Collect data in a way designed to test the hypothesis.

步驟3)以旨在檢驗假設的方式收集數據。

Step 4) Perform an appropriate statistical test: compute the p-value and compare from the test to the significance level.

步驟4)執行適當的統計檢驗:計算p值,然后將檢驗與顯著性水平進行比較。

Step 5) Decide whether to “ REJECT ” the null hypothesis(Ho) or “ FAIL TO REJECT ” the null hypothesis(Ho).

步驟5)決定是“拒絕”無效假設(Ho)還是“失敗”無效假設(Ho)。

Note: Though the specific details might vary, the procedure you will use when testing a hypothesis will always follow some version of these steps.

注意 :盡管具體細節可能有所不同,但是在檢驗假設時將使用的過程將始終遵循這些步驟的某些版本。

If you want to further understand hypothesis testing, I would highly recommend these two great posts on Hypothesis testing.

如果您想進一步了解假設檢驗,我強烈推薦有關假設檢驗的這兩篇好文章。

3)統計假設 (3) STATISTICAL ASSUMPTIONS)

Statistical tests make some common assumptions about the data being tested (If these assumptions are violated then the test may not be valid: e.g. the resulting p-value may not be correct)

統計測試對要測試的數據做出一些通用假設(如果違反了這些假設,則該測試可能無效:例如,得出的p值可能不正確)

  1. Independence of observations: the observations/variables you include in your test should not be related(e.g. several tests from a same test subject are not independent, while several tests from multiple different test subjects are independent)

    觀察結果的獨立性 :您包含在測試中的觀察值/變量不應該相關(例如,來自同一測試對象的多個測試不是獨立的,而來自多個不同測試對象的多個測試是獨立的)

  2. Homogeneity of variance: the “variance” within each group is being compared should be similar to the rest of the group variance. If a group has a bigger variance than the other(s) this will limit the test’s effectiveness.

    方差的同質性 :比較每個組中的“方差”應與其余組方差相似。 如果組的方差大于其他方,這將限制測試的有效性。

  3. Normality of data: the data follows a normal distribution, normality means that the distribution of the test is normally distributed (or bell-shaped) with mean 0, with 1 standard deviation and a symmetric bell-shaped curve.

    數據的正態性 :數據遵循正態分布,正態性表示測試的分布呈正態分布(或鐘形),平均值為0,標準差為1,鐘形曲線對稱。

Image for post
source: https://studylib.net/doc/10831020/the-bell-curve-the-standard-normal-bell-curve
來源: https : //studylib.net/doc/10831020/the-bell-curve-the-standard-normal-bell-curve

4)參數測試 (4) PARAMETRIC TESTS)

Parametric tests are the ones that can only be run with data that stick with the “three statistical assumptions” mentioned above. The most common types of parametric tests are divided into three categories.

參數測試是只能使用符合上述“三個統計假設”的數據運行的測試。 最常見的參數測試類型分為三類。

回歸測試: (Regression tests:)

These tests are used test cause-and-effect relationships, if the change in one or more continuous variable predicts change in another variable.

如果一個或多個連續變量的變化預示著另一個變量的變化則將這些檢驗用于檢驗因果關系

  • Simple linear regression: tests how a change in the predictor variable predicts the level of change in the outcome variable.

    簡單線性回歸:測試預測變量的變化如何預測結果變量的變化水平。

  • Multiple linear regression: tests how changes in the combination of two or more predictor variables predict the level of change in the outcome variable

    多元線性回歸:測試兩個或多個預測變量組合的變化如何預測結果變量的變化水平

  • Logistic regression: is used to describe data and to explain the relationship between one dependent (binary) variable and one or more nominal, ordinal, interval or ratio-level independent variable(s).

    Logistic回歸:用于描述數據并解釋一個(二元)變量與一個或多個名義,有序,區間或比率級別的自變量之間的關系。

比較測試: (Comparison tests:)

These tests look for the difference between the means of variables:Comparison of Means.

這些測試尋找變量均值之間的差異:均值比較。

  • T-tests are used when comparing the means of precisely two groups (e.g. the average heights of men and women).

    在精確比較兩組的平均值(例如,男性和女性的平均身高)時,使用T檢驗

  • Independent t-test: Tests the difference between the same variable from different populations (e.g., comparing dogs to cats)

    獨立t檢驗 :測試來自不同人群相同變量之間的差異 (例如,比較狗和貓)

  • ANOVA and MANOVA tests are used to compare the means of more than two groups or more(e.g. the average weights of children, teenagers, and adults).

    ANOVAMANOVA檢驗用于比較兩組或以上兩組的均值(例如,兒童,青少年和成人的平均體重)。

關聯測試: (Correlation tests:)

These tests look for an association between variable checking whether two variables are related.

這些測試在變量之間尋找關聯,檢查兩個變量是否相關。

  • Pearson Correlation: Tests for the strength of the association between two continuous variables.

    皮爾遜相關:測試兩個連續變量之間關聯的強度。

  • Spearman Correlation: Tests for the strength of the association between two ordinal variables (it does not rely on the assumption of normally distributed data)

    Spearman相關性:測試兩個序數變量之間的關聯強度(它不依賴于正態分布數據的假設)

  • Chi-Square Test: Tests for the strength of the association between two categorical variables.

    卡方檢驗:測試兩個類別變量之間的關聯強度。

Image for post
PARAMETRIC TESTS SUMMARY( Image by Author)
參數測試摘要(作者提供)

5)流程圖:選擇參數測試 (5) FLOWCHART: CHOOSING A PARAMETRIC TEST)

This flowchart will help you choose among the above described parametric tests. For nonparametric alternatives, check the following section.

該流程圖將幫助您在上述參數測試中進行選擇。 對于非參數替代,請檢查以下部分。

Image for post
PARAMETRIC TEST Flowchart (Image by author)
參數測試流程圖(作者提供)

6)處理非正態分布 (6) DEALING WITH NON- NORMAL DISTRIBUTIONS)

Although the normal distribution takes centre part in statistics, many processes follow non-normal distributions. Many datasets naturally fit a non-normal model:

盡管正態分布在統計中占據中心位置,但是許多過程遵循非正態分布。 許多數據集自然適合于非正常模型:

-The number of accidents tends to fit a “Poisson distribution”

-事故數量趨于符合“泊松分布”

-The Lifetimes of products usually fit a “Weibull distribution”.

-產品的使用壽命通常符合“威布爾分布”。

非正態分布的示例 (Example of Non-Normal Distributions)

  1. Beta Distribution.

    Beta發行版。
  2. Exponential Distribution.

    指數分布。
  3. Gamma Distribution.

    伽瑪分布。
  4. Inverse Gamma Distribution.

    反伽瑪分布。
  5. Log-Normal Distribution.

    對數正態分布。
  6. Logistic Distribution.

    物流配送。
  7. Maxwell-Boltzmann Distribution.

    Maxwell-Boltzmann分布。
  8. Poisson Distribution.

    泊松分布。
  9. Skewed Distribution.

    分布偏斜。
  10. Symmetric Distribution.

    對稱分布。
  11. Uniform Distribution.

    均勻分布。
  12. Unimodal Distribution.

    單峰分布。
  13. Weibull Distribution.

    威布爾分布。

那么,我們如何處理非正態分布? (Well then, How do we deal with non-Normal-Distributions?)

When your data is supposed to fit a normal distribution but doesn’t, we could do a few things to handle them:

當您的數據應該符合正態分布但不符合正態分布時,我們可以做一些事情來處理它們:

  • We may still be able to run parametric tests if your sample size is large enough (usually over 20 items) and try to interpret the results accordingly.

    如果您的樣本量足夠大(通常超過20個項目),我們仍然可以運行參數測試,并嘗試相應地解釋結果。
  • We may choose to transform the data with different statistical techniques, forcing it to fit a normal distribution.

    我們可能選擇使用不同的統計技術來轉換數據,迫使其適應正態分布。
  • If the sample size is small, skewed or if it represents another distribution type, you might run a non-parametric test.

    如果樣本量小,偏斜或代表其他分布類型,則可以運行非參數檢驗

非參數測試 (Non-Parametric Tests)

Non-parametric tests (figure below) don’t make as many assumptions about the data and are useful when one or more of the three statistical assumptions are violated.

非參數檢驗(下圖)對數據的假設不多,當違反三個統計假設中的一個或多個時很有用。

Note that: The inferences that non-parametric tests make aren’t as strong as the parametric tests.

請注意:非參數測試的推論不如參數測試強。

Image for post
NON- PARAMETRIC TESTS(Image by author)
非參數測試(作者提供)

Hope you find this post informative and useful. Please let me know if you have any feedback. Thanks a lot for reading!

希望您發現這篇文章有益和有用。 如果您有任何反饋意見,請告訴我。 非常感謝您的閱讀!

翻譯自: https://towardsdatascience.com/statistical-testing-understanding-how-to-select-the-best-test-for-your-data-52141c305168

數據統計 測試方法

本文來自互聯網用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權,不承擔相關法律責任。
如若轉載,請注明出處:http://www.pswp.cn/news/388746.shtml
繁體地址,請注明出處:http://hk.pswp.cn/news/388746.shtml
英文地址,請注明出處:http://en.pswp.cn/news/388746.shtml

如若內容造成侵權/違法違規/事實不符,請聯系多彩編程網進行投訴反饋email:809451989@qq.com,一經查實,立即刪除!

相關文章

前端介紹-35

前端介紹-35 # 前端## 一、什么是前端 前端即網站前臺部分,運行在PC端,移動端等瀏覽器上展現給用戶瀏覽的網頁。隨著互聯網技術的發展,HTML5,CSS3,前端框架的應用,跨平臺響應式網頁設計能夠適應各種屏幕…

spring的幾個通知(前置、后置、環繞、異常、最終)

1、沒有異常的 2、有異常的 1、被代理類接口Person.java 1 package com.xiaostudy;2 3 /**4 * desc 被代理類接口5 * 6 * author xiaostudy7 *8 */9 public interface Person { 10 11 public void add(); 12 public void update(); 13 public void delete();…

每個Power BI開發人員的Power Query提示

If someone asks you to define the Power Query, what should you say? If you’ve ever worked with Power BI, there is no chance that you haven’t used Power Query, even if you weren’t aware of it. Therefore, one could easily say that Power Query is the “he…

c# PDF 轉換成圖片

1.新建項目 2.新增一個新文件夾“lib”(主要是為了存放引用的dll) 3.將“gsdll32.dll 、PDFLibNet.dll 、PDFView.dll”3個dll添加到文件夾中 4.項目添加“PDFLibNet.dll 、PDFView.dll”2個類庫的引用,并將gsdll32.dll 拷貝到項目生產根…

java finally在return_Java finally語句到底是在return之前還是之后執行?

點擊上方“方志朋”,選擇“置頂或者星標”你的關注意義重大!網上有很多人探討Java中異常捕獲機制try...catch...finally塊中的finally語句是不是一定會被執行?很多人都說不是,當然他們的回答是正確的,經過我試驗&#…

oracle 死鎖

為什么80%的碼農都做不了架構師?>>> ORA-01013: user requested cancel of current operation 轉載于:https://my.oschina.net/8808/blog/2994537

面試題:二叉樹的深度

題目描述:輸入一棵二叉樹,求該樹的深度。從根結點到葉結點依次經過的結點(含根、葉結點)形成樹的一條路徑,最長路徑的長度為樹的深度。 思路:遞歸 //遞歸 public class Solution {public int TreeDepth(Tre…

a/b測試_如何進行A / B測試?

a/b測試The idea of A/B testing is to present different content to different variants (user groups), gather their reactions and user behaviour and use the results to build product or marketing strategies in the future.A / B測試的想法是將不同的內容呈現給不同…

hibernate h2變mysql_struts2-hibernate-mysql開發案例 -解道Jdon

Hibernate專題struts2-hibernate-mysql開發案例與源碼源碼下載本案例展示使用Struts2,Hibernate和MySQL數據庫開發一個個人音樂管理器Web應用程序。,可將您的音樂收藏添加到數據庫中。功能有:顯示一個添加記錄的表單和所有的音樂收藏的列表。…

P5024 保衛王國

傳送門 我現在還是不明白為什么NOIPd2t3會是一道動態dp…… 首先關于動態dp可以看這里 然后這里就是把把矩陣給改一改,改成這個形式\[\left[dp_{i-1,0},dp_{i-1,1}\right]\times \left[\begin{matrix}\infty&ldp_{i,1}\\ldp_{i,0}&ldp_{i,1}\end{matrix}\ri…

提取圖像感興趣區域_從圖像中提取感興趣區域

提取圖像感興趣區域Welcome to the second post in this series where we talk about extracting regions of interest (ROI) from images using OpenCV and Python.歡迎來到本系列的第二篇文章,我們討論使用OpenCV和Python從圖像中提取感興趣區域(ROI)。 As a rec…

解決java compiler level does not match the version of the installed java project facet

ava compiler level does not match the version of the installed java project facet錯誤的解決 因工作的關系,Eclipse開發的Java項目拷來拷去,有時候會報一個很奇怪的錯誤。明明源碼一模一樣,為什么項目復制到另一臺機器上,就會…

php模板如何使用,ThinkPHP如何使用模板

到目前為止,我們只是使用了控制器和模型,還沒有接觸視圖,下面來給上面的應用添加視圖模板。首先我們修改下 Action 的 index 操作方法,添加模板賦值和渲染模板操作。PHP代碼classIndexActionextendsAction{publicfunctionindex(){…

理解Windows窗體和WPF中的跨線程調用

你曾開發過Windows窗體程序,可能會注意到有時事件處理程序將拋出InvalidOperationException異常,信息為“ 跨線程調用非法:在非創建控件的線程上訪問該控件”。這種Windows窗體應用程序中 跨線程調用時的一個最為奇怪的行為就是,有…

什么是嵌入式系統

在我們的日常生活中,我們經常使用許多使用嵌入式系統技術設計的電氣和電子電路和套件。計算機,手機,平板,筆記本電腦,數字電子系統以及其他電子和電子設備都是使用嵌入式系統設計的。 什么是嵌入式系統?將硬…

面向數據科學家的實用統計學_數據科學家必知的統計數據

面向數據科學家的實用統計學Beginners usually ignore most foundational statistical knowledge. To understand different models, and various techniques better, these concepts are essential. These work as baseline knowledge for various concepts involved in data …

字符串、指針、引用、數組基礎

1.字符串:字符是由單引號所括住的單個字母、數字或符號。若將單引號改為雙引號,該字符就會變成字符串。它們之間主要的差別是:雙引號的字符串“A”會比單引號的字符串’A’在字符串的最后補上一個結束符’\0’(Null字符&#xff0…

suse安裝php,SUSE下安裝LAMP

安裝Apache可以看到編譯安裝Apache出錯,rpm包安裝gcc (首先要安裝GCC)makemake install修改apache端口cd /home/sxit/apache2vi conf/httpd.confListen 8000啟動 apache/home/root/apache2/bin/apachectl start(stop restart)http://localhost:8000安裝一下PHP開發…

自己動手寫事件總線(EventBus)

2019獨角獸企業重金招聘Python工程師標準>>> 本文由云社區發表 事件總線核心邏輯的實現。 <!--more--> EventBus的作用 Android中存在各種通信場景&#xff0c;如Activity之間的跳轉&#xff0c;Activity與Fragment以及其他組件之間的交互&#xff0c;以及在某…

viz::viz3d報錯_我可以在Excel中獲得該Viz嗎?

viz::viz3d報錯Have you ever found yourself in the following situation?您是否遇到以下情況&#xff1f; Your team has been preparing and working tireless hours to create and showcase the end product — an interactive visual dashboard. It’s a culmination of…