The Reasoning Behind Bessel's Correction: n-1


A standard deviation seems like a simple enough concept. It’s a measure of the dispersion of data: the square root of the summed squared differences between each data point and the mean, divided by the number of data points… minus one to correct for bias.

This is, I believe, the most oversimplified and maddening concept for any learner, and the intent of this post is to provide a clear and intuitive explanation for Bessel’s Correction, or n-1.


To start, recall the formula for a population mean:


Population Mean Formula:

μ = (1/N) Σ xᵢ  (summing xᵢ over all N points in the population)

What about a sample mean?


Sample Mean Formula:

x̄ = (1/n) Σ xᵢ  (summing xᵢ over the n points in the sample)

Well, they look identical, except for the lowercase n. In each case, you just add up each xᵢ and divide by how many x’s there are. If we are dealing with an entire population, we use N, instead of n, to indicate the total number of points in the population.

Now, what is the standard deviation, σ (the Greek letter sigma)?

If a population contains N points, then the standard deviation is the square root of the variance, which is the average of the squared differences between each data point and the population mean, μ:

Formula for Population Standard Deviation:

σ = √( (1/N) Σ (xᵢ − μ)² )

But what about a sample standard deviation, s, with n data points and sample mean x̄?

Formula for Sample Standard Deviation:

s = √( (1/(n−1)) Σ (xᵢ − x̄)² )
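In code, the two formulas differ only in the denominator, which NumPy exposes through the ddof (“delta degrees of freedom”) argument. A minimal sketch, using made-up data values of my own:

```python
import numpy as np

sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative values

n = len(sample)
mean = sample.sum() / n
sq_diffs = (sample - mean) ** 2

biased_var = sq_diffs.sum() / n          # divide by n: the population formula
unbiased_var = sq_diffs.sum() / (n - 1)  # divide by n - 1: Bessel's Correction

# NumPy exposes the same choice via ddof: var = sum(sq_diffs) / (n - ddof)
assert np.isclose(biased_var, sample.var(ddof=0))
assert np.isclose(unbiased_var, sample.var(ddof=1))

print(np.sqrt(biased_var), np.sqrt(unbiased_var))  # 2.0 vs roughly 2.138
```

Dividing by the smaller n − 1 always yields the larger result, which, as we will see, is the whole point of the correction.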

Alas, the dreaded n-1 appears. Why? Shouldn’t it be the same formula? It was virtually the same formula for population mean and sample mean!


The short answer is: this is very complex, to such an extent that most instructors explain n-1 by saying the sample standard deviation will be ‘a biased estimator’ if you don’t do it.

What is Bias, and Why is it There?

The Wikipedia explanation can be found here.


It’s not helpful.


To really understand n-1, as with any other brief attempt to explain Bessel’s Correction, you need to hold a lot in your head at once. I’m not talking about a proof, either. I’m talking about truly understanding the differences between a sample and a population.

What is a sample?


A sample is always a subset of the population it’s intended to represent (in the unusual case of sampling an entire population without replacement, the subset can be the same size as the original set). This alone is a massive leap. Once a sample is taken, there are presumed, hypothetical parameters and distributions built into that sample-representation.

The very word statistic refers to some piece of information about a sample (such as a mean, or median) which corresponds to some piece of analogous information about the population (again, such as a mean, or median) called a parameter. The field is named ‘Statistics’, not ‘Parametrics’, to convey this attitude of inference from smaller to larger, and this leap, again, has many assumptions built into it. For example, if prior assumptions about a sample’s population are actually quantified, this leads to Bayesian statistics. If not, this leads to frequentism. Both are outside the scope of this post, but they are nevertheless important angles to consider in the context of Bessel’s Correction. (In fact, Bayesian inference does not use Bessel’s Correction, since prior probabilities about population parameters are intended to handle bias in a different way, upfront. Variance and standard deviation are calculated with plain old n.)

But let’s not lose focus. Now that we’ve stated the important fundamental difference between a sample and a population, let’s consider the implications of sampling. For the sake of simplicity, I will be using the Normal distribution in the following examples, along with this Jupyter notebook, which contains one million simulated, Normally distributed data points for visualizing intuitions about samples. I highly recommend playing with it yourself, or simply using from sklearn.datasets import make_gaussian_quantiles to get a hands-on feel for what’s really going on with sampling.
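A population like the notebook’s can be simulated in a few lines. This is a sketch rather than the notebook’s actual code: the seed and the use of numpy.random.default_rng are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# One million standard-Normal points: this plays the role of the population
population = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

# For a full population, dividing by plain N (ddof=0) is the correct formula
print(population.mean())       # close to 0.0
print(population.std(ddof=0))  # close to 1.0

# Sampling is just picking points without replacement
sample_10 = rng.choice(population, size=10, replace=False)
sample_100 = rng.choice(population, size=100, replace=False)
```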

Here is an image of one million randomly-generated, Normally distributed points. We will call it our population:


[Figure: just one million points]

To further simplify things, we will only consider the mean, variance, standard deviation, etc. of the x-values. (That is, I could have used a mere number line for these visualizations, but having a y-axis displays the distribution across the x-axis more effectively.)

This is a population, so N = 1,000,000. It’s Normally distributed, so the mean is 0.0, and the standard deviation is 1.0.


I took two random samples, the first only 10 points and the second 100 points:


[Figure: 100-point sample in black, 10-point sample in orange; red lines mark one standard deviation from the mean]

Now, let’s take a look at these two samples, without and with Bessel’s Correction, along with their standard deviations (biased and unbiased, respectively). The first sample is only 10 points, and the second sample is 100 points.

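The comparison in the image can be reproduced numerically. A sketch with freshly drawn samples under a seed of my own choosing, so the exact numbers will differ from the figure:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, my own choice
population = rng.normal(0.0, 1.0, size=1_000_000)

for n in (10, 100):
    sample = rng.choice(population, size=n, replace=False)
    biased = sample.std(ddof=0)    # divide by n
    unbiased = sample.std(ddof=1)  # divide by n - 1 (Bessel's Correction)
    print(f"n={n:3d}  biased std={biased:.4f}  corrected std={unbiased:.4f}")
```

The corrected value is always the larger of the two, by a factor of exactly √(n/(n−1)), so the correction matters much more at n = 10 than at n = 100.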

[Figure: biased vs. Bessel-corrected standard deviations for the two samples. The Correction Seems to Help!]

Take a good long look at the above image. Bessel’s Correction does seem to be helping. It makes sense: very often the sample standard deviation will be lower than the population standard deviation, especially if the sample is small, because unrepresentative points (‘biased’ points, i.e. points farther from the mean) will have more of an impact on the calculation of variance. Because each difference is measured from the sample mean (the value that, by construction, minimizes the summed squared differences), the result will be smaller than it would be if the population mean were used. Furthermore, the square root is a concave function, and therefore introduces ‘downward bias’ into the estimate.

Another way of thinking about it is this: the larger your sample, the more opportunity you have to run into population-representative points, i.e. points that are close to the mean. Therefore, you have less of a chance of getting a sample mean whose differences come out too small, which would lead to a too-small variance and leave you with an undershot standard deviation.

On average, samples of a Normally-distributed population will produce a variance which is biased downward by a factor of (n-1)/n. (Incidentally, the sampling distribution of the sample variance is described by a scaled chi-squared distribution with n-1 degrees of freedom; Student’s t-distribution instead describes the standardized sample mean.) Therefore, by dividing the summed squared differences by n-1 instead of n, we make the denominator smaller, thereby making the result larger and leading to a so-called ‘unbiased’ estimate.
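The downward bias, a factor of (n−1)/n on average, can be checked by simulation: average the biased (divide-by-n) variance over many small samples and compare it to the true variance. A sketch; the sample size, trial count, and seed are my own choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 10, 200_000
# True population: standard Normal, so the true variance is 1.0

samples = rng.normal(0.0, 1.0, size=(trials, n))
biased_vars = samples.var(axis=1, ddof=0)  # divide by n in every trial

print(biased_vars.mean())                # about (n - 1) / n = 0.9, not 1.0
print(biased_vars.mean() * n / (n - 1))  # rescaled by n / (n - 1): about 1.0
```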

The key point to emphasize here is that Bessel’s Correction, or dividing by n-1, doesn’t always actually help! Because the sample variances have a sampling distribution of their own, you will unwittingly run into cases where dividing by n-1 overshoots the real population standard deviation. It just so happens that n-1 is the best tool we have to correct for bias most of the time.

To prove this, check out the same Jupyter notebook where I’ve merely changed the random seed until I found some samples whose standard deviation was already close to the population standard deviation, and where n-1 added more bias:

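The notebook’s seed-hunting can be sketched as a simple loop: draw 10-point samples under different seeds until one’s uncorrected standard deviation already matches or exceeds the population’s, at which point dividing by n − 1 can only overshoot further. The seed range and sample size here are my own choices:

```python
import numpy as np

population = np.random.default_rng(0).normal(0.0, 1.0, size=1_000_000)
pop_std = population.std(ddof=0)  # the "true" parameter, about 1.0

for seed in range(1000):
    rng = np.random.default_rng(seed)
    sample = rng.choice(population, size=10, replace=False)
    if sample.std(ddof=0) >= pop_std:
        # A lucky sample: the uncorrected estimate is already on target,
        # so the n - 1 version overshoots and adds bias instead of removing it
        print(seed, sample.std(ddof=0), sample.std(ddof=1))
        break
```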


In this case, Bessel’s Correction actually hurt us!


Thus, Bessel’s Correction is not always a correction. It’s called one because most of the time, when sampling, we don’t know the population parameters. We don’t know the real mean or variance or standard deviation. Thus, we rely on the fact that, because we know the typical amount of bad luck (undershooting, or downward bias), we can counteract it by dividing by n-1 instead of n.

But what if you get lucky? Just like in the cells above, this can happen sometimes. Your sample can occasionally produce the correct standard deviation, or even overshoot it, in which case n-1 ironically adds bias.


Nevertheless, it’s the best tool we have for bias correction in a state of ignorance. The need for bias correction doesn’t exist from a God’s-eye point of view, where the parameters are known.


At the end of the day, this fundamentally comes down to understanding the crucial difference between a sample and a population, as well as why Bayesian Inference is such a different approach to classical problems, where guesses about the parameters are made upfront via prior probabilities, thus removing the need for Bessel’s Correction.


I’ll focus on Bayesian statistics in future posts. Thanks for reading!


Translated from: https://towardsdatascience.com/the-reasoning-behind-bessels-correction-n-1-eeea25ec9bc9
