大數定理 中心極限定理_中心極限定理:直觀的遍歷

大數定理 中心極限定理

One of the most beautiful concepts in statistics and probability is Central Limit Theorem,people often face difficulties in getting a clear understanding of this and the related concepts, I myself struggled understanding this during my college days (eventually mugged up the definitions and formulas to pass the exam). In its core it is a very simple yet elegant theorem that enables us to estimate the population mean. Here I will try to explain these concepts using this toy dataset on customer demographics available on Kaggle (this is a fictional dataset created for educational purposes). Without wasting much time lets dive in and try to understand what CLT is

統計量和概率中最美麗的概念之一是中央極限定理 ,人們常常在難以清楚地理解這一概念和相關概念時遇到困難,我本人在上大學期間一直難以理解這一概念(最終弄糟了要通過的定義和公式考試)。 其核心是一個非常簡單而優雅的定理,使我們能夠估計總體均值。 在這里,我將嘗試在Kaggle上提供的有關客戶受眾特征的玩具數據集中解釋這些概念(這是為教育目的而創建的虛構數據集)。 不要浪費太多時間,讓我們深入研究一下CLT是什么

中心極限定理 (CENTRAL LIMIT THEOREM)

Here is what Central Limit Theorem states

這是中心極限定理指出的

If you take sufficiently large samples from a distribution,then the mean of these samples would follow approximately normal distribution with mean of distribution approximately equal to population mean and standard deviation equal on 1/√n times the population standard deviation (here n is number of elements in a sample)

如果從分布中獲取足夠大的樣本,則這些樣本的均值將遵循近似正態分布,分布均值大約等于總體均值,標準偏差等于總體標準偏差的1 /√n倍(此處n是樣本中的元素)

Image for post
Image for post
Creative Commons知識共享

Now comes the fun part, in order to have a better understanding and appreciation of the above statement, let us take our toy dateset (will take the annual income column for our analysis) and try to check if these approximations actually holds true.

現在是有趣的部分,為了更好地理解和理解上述說法,讓我們以玩具的日期集(將采用“年收入”列進行分析)并嘗試檢查這些近似值是否正確。

We will try to estimate the mean income of a population, first let us have a look at the distribution and size the population

我們將嘗試估算人口的平均收入,首先讓我們看一下人口的分布和規模

df = pd.read_csv(r'toy_dataset.csv')
print("Number of samples in our data: ",df.shape[0])
sns.kdeplot(df['Income'],shade=True)
Image for post
Annual Income Distribution of Population (Image By Author)
人口年收入分布(作者提供)

Well, we can fairly say this is isn’t exactly a normal distribution and the original population mean and standard deviation is 91252.798 and 24989.501 respectively. Now have a good look on to these numbers and let’s see if we could use Central Limit Theorem to approximate these values

好吧,我們可以很公平地說這不是正態分布,原始總體均值標準差分別為91252.79824989.501 。 現在看一下這些數字,讓我們看看是否可以使用中心極限定理來近似這些值

生成隨機樣本 (GENERATING RANDOM SAMPLES)

Now let us try to generate random samples from the population and try to plot sample mean distributions

現在讓我們嘗試從總體中生成隨機樣本,并嘗試繪制樣本均值分布

def return_mean_of_samples(total_samples,element_in_each_sample):
sample_with_n_elements_m_size = []
for i in range(total_samples):
sample = df.sample(element_in_each_sample).mean()['Income']
sample_with_n_elements_m_size.append(sample)
return (sample_with_n_elements_m_size)

We will use this function to generate random sample means and later use it to calculate sampling distributions

我們將使用此函數生成隨機樣本均值,然后使用它來計算采樣分布

Here we are taking 200 samples with 100 elements in each samples and see how the sample mean distribution looks like

在這里,我們以200個樣本為例,每個樣本中有100個元素,并查看樣本均值分布如何

sample_means = return_mean_of_samples(200,100)
sns.kdeplot(sample_means,shade=True)
print("Total Samples: ",200)
print("Total elements in each sample: ",100)
Image for post
Sample Mean Distribution (Image By Author)
樣本均值分布(作者提供的圖像)

Well that looks pretty normal, so now we can assume that with sufficient sample size, sample means do follow normal distribution irrespective ofthe original distributions

好吧,這看起來很正常,因此現在我們可以假設,在有足夠的樣本量的情況下,樣本均值確實遵循正態分布,而與原始分布無關

Now comes the second part, let us try to see if we could estimate the population mean from this sampling distribution, below is a piece of code that generates different sampling distributions by varying total sample size and elements in each samples

現在是第二部分,讓我們嘗試看看是否可以從該采樣分布中估計總體平均值,下面是一段代碼,該代碼通過改變總采樣大小和每個采樣中的元素來生成不同的采樣分布

total_samples_list = [100,500]
elements_in_each_sample_list = [50,100,500]
mean_list = []
std_list = []
key_list = []
estimate_std_list = []
key=''
pop_mean = [population_mean]*6
pop_std = [population_std]*6
for tot in total_samples_list:
for ele in elements_in_each_sample_list:
key = '{}_samples_with_{}_elements_each'.format(tot,ele)
key_list.append(key)
mean_list.append(np.round(np.mean(return_mean_of_samples(tot,ele)),3))
std_list.append(np.round(np.array(return_mean_of_samples(tot,ele)).std(),3))
estimate_std_list.append(np.round(population_std/(np.sqrt(ele)),3))pd.DataFrame(zip(key_list,pop_mean,mean_list,pop_std,estimate_std_list,std_list),columns=['Sample_Description','Population_Mean','Sample_Mean','Population_Standard_Deviation',"Pop_Std_Dev/"+u"\u221A"+"sample_size",'Sample_Standard_Deviation'])
Image for post
Summary of sampling distribution generated (Image By Author)
生成的采樣分布摘要(作者提供的圖像)
  • Look at second and third columns, we can clearly see the mean of sampling distribution is very close to the population mean in all the distributions

    查看第二和第三列,我們可以清楚地看到抽樣分布的均值與所有分布中的總體均值非常接近
  • Have a look at the last two columns, initially there is some difference in the deviations but as the sample size increases this difference becomes negligible

    看看最后兩列,最初的偏差有所不同,但是隨著樣本數量的增加,這種差異可以忽略不計

Let us further plot these sampling distributions and population mean and see how the plots look

讓我們進一步繪制這些采樣分布和總體均值,并查看這些圖的外觀

def plot_distribution(sample,population_mean,i,j,color,sampling_dist_type):
sns.kdeplot(np.array(sample),color = color,ax = axs[i,j],shade=True)
axs[i, j].axvline(population_mean, linestyle="-", color='r', label="p_mean")
axs[i, j].axvline(np.array(sample).mean(), linestyle="-.", color='b', label="s_mean")
axs[i, j].set_title(key)
axs[i, j].legend()colors = ['r','g','b','y', 'c', 'm', 'k']
plt_grid = [(0,0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
sample_sizes = [(100,50), (100, 100), (100, 500), (500, 50), (500, 100), (500, 500)]total_samples_list = [100,500]
elements_in_each_sample_list = [50,100,500]fig, axs = plt.subplots(3, 2, figsize=(10, 9))
i = 0
for tot in total_samples_list:
for ele in elements_in_each_sample_list:
key = '{}_samples_with_{}_elements_each'.format(tot,ele)
plot_distribution(return_mean_of_samples(tot,ele), population_mean , plt_grid[i][0], plt_grid[i][1] , colors[i], key)
i = i + 1
plt.show()
Image for post
Plot for sample mean distribution and population mean (Image By Author)
樣本均值分布和總體均值的圖(作者提供)

As you can see the mean of sampling distribution is pretty close to the population mean. (Here is a food for thought, have a look at first and last plots do you notice there is a difference in spread of data, first one is more spread around population mean as compared to last, look at the scale on x axis for better clarity, well once your reach the end of this blog try to answer this question yourself)

如您所見,抽樣分布的平均值非常接近總體平均值。 ( 這是一個值得深思的地方,請查看第一和最后一個圖,您是否注意到數據分布上的差異,第一個是在人口均值上的分布比最后一個更大,請查看x軸上的比例更好清楚,一旦您到達本博客的結尾,嘗試自己回答這個問題 )

保密間隔 (CONFIDENCE INTERVAL)

A confidence interval can be defined as an entire interval of plausible values of a population parameter, such as mean based on observations obtained from a random sample of size n.

置信區間可以定義為總體參數的合理值的整個區間,例如基于從大小為n的隨機樣本獲得的觀察值的平均值。

Image for post
Creative Commons知識共享

Let’s summarize our leanings so far and and try to understand Confidence Intervals from it.

讓我們總結到目前為止的觀點,并嘗試從中了解置信區間。

  • Sampling distribution of mean of samples follow a normal distribution

    樣本均值的抽樣分布遵循正態分布
  • Hence using property of normal distribution 95% of sample means lie within two standard deviations of population mean

    因此,使用正態分布特性, 樣本均值的95%處于總體均值的兩個標準差

  • We can rephrase the sentence and say that 95% of these Intervals or rather Confidence Intervals (two standard deviations away from the mean on either side) contains the population mean

    我們可以改寫句子,說95%的這些區間或置信區間(兩側的均值有兩個標準差)包含總體均值

Here I would like to give more focus on the last point we talked about. People often get confused and say once we take a random sample and calculate its mean and corresponding 95% CI, there is a 95% chance that the population mean lies within 2 standard deviations of this sample mean, this statement is wrong

在這里,我想更加關注我們剛才談到的最后一點。 人們常常會感到困惑,說一旦我們抽取了一個隨機樣本并計算出其均值和相應的95%CI,總體均值就有95%的機會位于該樣本均值的2個標準差之內,這是錯誤的

When we talk about a probability estimate w.r.t to a sample then that sample gets fixed here (including the corresponding sample mean and confidence interval), also population mean eventually is a fixed value, hence there is no point in saying there is a 95% probability of a fixed point (population mean) lying in a fixed interval (CI of sample used), it would either exist there or not, instead the more proper definition of 95% CI is

當我們談論樣本的概率估計值時,樣本在此處固定( 包括相應的樣本均值和置信區間 ),總體均值最終也是固定值,因此,毫無疑問地說概率為95%固定間隔(使用的樣本的CI)中的固定點(人口平均值)的平均值,則該值是否存在或不存在,取而代之的是更準確地定義95%CI

If random samples are taken and corresponding sample means and CI’s (two standard deviations from the mean on either side) are calculated, then 95% of these CI’s would contain the population mean. For example let’s say we take 100 random samples and calculate their CI’s then 95% of these CI’s would contain the population mean

如果抽取隨機樣本并計算出相應的樣本均值和CI(任一側均值有兩個標準差),則這些CI的95%將包含總體均值。 例如,假設我們抽取100個隨機樣本并計算其CI,則這些CI中的95%將包含總體均值

Image for post
Creative Commons知識共享

Enough of the talking business, now as always lets take our toy dataset and see if this holds true.

足夠多說話的生意了,現在像往常一樣讓我們獲取玩具數據集,看看這是否成立。

def get_CI_percent(size):
counter = 0
for i in range(size):
is_contains = False
sample_mean = df.sample(50)['Income'].mean()
lower_lim = sample_mean - 2*standard_error
upper_lim = sample_mean + 2*standard_error
if (population_mean>=lower_lim)&(population_mean<=upper_lim):
is_contains = True
counter = counter + 1
return np.round(counter/size*100,2)
Image for post
95% Confidence Interval Validation(Image By Author)
95%置信區間驗證(作者提供)

I took 20 random samples with 50 elements in each sample and calculated their sample means and respective two standard deviation intervals (95% CI’s), 18 out of 20 intervals contained these intervals. (actually this interval count varied between 18–20)

我抽取了20個隨機樣本,每個樣本中包含50個元素,并計算了它們的樣本均值和兩個標準差區間(95%CI),其中20個區間中有18個包含這些區間。 (實際上,此間隔計數在18–20之間變化)

Instead of giving a point estimate (taking a sample and calculating its mean), it is more plausible to give an interval estimate (this helps us include any error that might occur due to sampling) hence we calculate these confidence intervals along with the sample mean.

與其給出點估計(獲取樣本并計算其均值),不如給出區間估計(這有助于我們包括由于采樣而可能出現的任何誤差),因此我們將這些置信區間與樣本均值一起計算。

結論 (CONCLUSION)

One question that some of you might be having, why go through so much hard work of taking random samples, calculating its sample mean hence the Confidence Interval later, why not simply do np.mean() like it was done in the very first line of code here.

你們中的一些人可能會遇到一個問題,為什么要經過如此艱巨的工作來抽取隨機樣本,然后計算其樣本均值,從而得出置信區間,為什么不像第一行那樣簡單地做np.mean()代碼在這里。

Well if you have your data completely available in digital form and you could calculate your population metric just from a single line of code within milliseconds then definitely go for it there is no point in going through the whole process

好吧,如果您的數據完全以數字形式提供,并且您可以僅在幾毫秒內通過單行代碼來計算人口指標,那么絕對可以,整個過程沒有意義

However there are many scenarios where data collection itself is a big challenge, suppose we want to estimate mean height of people in Bangalore then it is practically impossible to go to every person and record the data. This is where these concepts of sampling, Confidence Intervals are really useful.

但是,在很多情況下,數據收集本身就是一個很大的挑戰,假設我們要估算班加羅爾的平均人口高度,那么幾乎不可能到每個人那里記錄數據。 這是抽樣,置信區間這些概念真正有用的地方。

Here is the link to the code file and the dataset. For those of you who are new to these topics, I would strongly recommend to try running the code yourself, play around with the dataset (there are other fields like age) to get a better understanding.

這是代碼文件和數據集的鏈接 。 對于那些不熟悉這些主題的人,我強烈建議您自己嘗試運行代碼,并嘗試使用數據集(還有age等其他字段)以更好地理解。

翻譯自: https://medium.com/swlh/central-limit-theorem-an-intuitive-walk-through-36f55bd7668d

大數定理 中心極限定理

本文來自互聯網用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權,不承擔相關法律責任。
如若轉載,請注明出處:http://www.pswp.cn/news/388193.shtml
繁體地址,請注明出處:http://hk.pswp.cn/news/388193.shtml
英文地址,請注明出處:http://en.pswp.cn/news/388193.shtml

如若內容造成侵權/違法違規/事實不符,請聯系多彩編程網進行投訴反饋email:809451989@qq.com,一經查實,立即刪除!

相關文章

萬惡之源 - Python數據類型二

列表 列表的介紹 列表是python的基礎數據類型之一 ,其他編程語言也有類似的數據類型. 比如JS中的數 組, java中的數組等等. 它是以[ ]括起來, 每個元素用 , 隔開而且可以存放各種數據類型: lst [1,a,True,[2,3,4]]列表相比于字符串,不僅可以存放不同的數據類型.而且可…

230. Kth Smallest Element in a BST

Given a binary search tree, write a function kthSmallest to find the kth smallest element in it.Note: You may assume k is always valid, 1 ≤ k ≤ BSTs total elements.Example 1: Input: root [3,1,4,null,2], k 13/ \1 4\2 Output: 1 Example 2: Input: root …

探索性數據分析(EDA)-不要問如何,不要問什么

數據科學 &#xff0c; 機器學習 (Data Science, Machine Learning) This is part 1 in a series of articles guiding the reader through an entire data science project.這是一系列文章的第1部分 &#xff0c;指導讀者完成整個數據科學項目。 I am a new writer on Medium…

unity3d 攝像機跟隨鼠標和鍵盤的控制

鼠標控制&#xff1a; using UnityEngine; using System.Collections; public class shubiao : MonoBehaviour { //public Transform firepos; public int Ball30; public int CurBall1; public Rigidbody projectile; public Vector3 point; public float time100f; public…

《必然》九、享受重混盛宴,是每個人的機會

今天說的是《必然》的第七個關鍵詞&#xff0c;過濾Filtering。1我們需要過濾如今有一個問題&#xff0c;彌漫在我們的生活當中&#xff0c;困擾著所有人。那就是“今天我要吃什么呢&#xff1f;”同樣的&#xff0c;書店里這么多的書&#xff0c;我要看哪一本呢&#xff1f;網…

IDEA 插件開發入門教程

2019獨角獸企業重金招聘Python工程師標準>>> IntelliJ IDEA 是目前最好用的 JAVA 開發 IDE&#xff0c;它本身的功能已經非常強大了&#xff0c;但是每個人的需求不一樣&#xff0c;有些需求 IDEA 本身無法滿足&#xff0c;于是我們就需要自己開發插件來解決。工欲善…

安卓代碼還是xml繪制頁面_我們應該繪制實際還是預測,預測還是實際還是無關緊要?

安卓代碼還是xml繪制頁面Plotting the actual and predicted data is frequently used for visualizing and analyzing how the actual data correlate with those predicted by the model. Ideally, this should correspond to a slope of 1 and an intercept of 0. However, …

Mecanim動畫系統

本期教程和大家分享Mecanim動畫系統的重定向特性&#xff0c;Mecanim動畫系統是Unity3D推出的全新的動畫系統&#xff0c;具有重定向、可融合等諸多新特性&#xff0c;通過和美工人員的緊密合作&#xff0c;可以幫助程序設計人員快速地設計出角色動畫。一起跟著人氣博主秦元培學…

【嵌入式硬件Esp32】Ubuntu 1804下ESP32交叉編譯環境搭建

一、ESP32概述EPS32是樂鑫最新推出的集成2.4GWi-Fi和藍牙雙模的單芯片方案&#xff0c;采用臺積電(TSMC)超低功耗的40nm工藝&#xff0c;擁有最佳的功耗性能、射頻性能、穩定性、通用性和可靠性&#xff0c;適用于多種應用和不同的功耗要求。 ESP32搭載低功耗的Xtensa LX6 32bi…

你認為已經過時的C語言,是如何影響500萬程序員的?...

看招聘職位要c語言的占比真不多了&#xff0c;是否c語言真得落伍了&#xff1f; 看一下許多招聘平臺有關于找純粹的c語言開發的占比確實沒有很多&#xff0c;都被Java&#xff0c;php&#xff0c;python等等語言刷屏。這對于入門正在學習c語言的小白真他媽就是驚天霹靂&#xf…

換熱站起停條件

循環泵 自動條件&#xff1a; 一、循環泵啟動條件 兩臺泵/三臺泵&#xff1a; 1&#xff09;本循環泵在遠程狀態 2&#xff09;本循環泵自動狀態 3&#xff09;本循環泵沒有故障 4&#xff09;二次網的回水壓力&#xff08;測量值&#xff09;>設定值 5&#xff09;…

云尚制片管理系統_電影制片廠的未來

云尚制片管理系統Data visualization is a key step of any data science project. During the process of exploratory data analysis, visualizing data allows us to locate outliers and identify distribution, helping us to control for possible biases in our data ea…

JAVA單向鏈表實現

JAVA單向鏈表實現 單向鏈表 鏈表和數組一樣是一種最常用的線性數據結構&#xff0c;兩者各有優缺點。數組我們知道是在內存上的一塊連續的空間構成&#xff0c;所以其元素訪問可以通過下標進行&#xff0c;隨機訪問速度很快&#xff0c;但數組也有其缺點&#xff0c;由于數組的…

軟件公司管理基本原則

商業人格&#xff1a;獨立履行責任 獨立堅持原則兩大要素&#xff1a;1)靠原則做事&#xff0c;原則高于一切。2)靠結果做交換&#xff0c;我要什么我清楚兩個標準&#xff1a; 1)我不是孩子&#xff0c;我不需要照顧2)承認邏輯&#xff0c;我履行我的責任社會人心態: 1)用社會…

201771010102 常惠琢《面向對象程序設計(java)》第八周學習總結

1、實驗目的與要求 (1) 掌握接口定義方法&#xff1b; (2) 掌握實現接口類的定義要求&#xff1b; (3) 掌握實現了接口類的使用要求&#xff1b; (4) 掌握程序回調設計模式&#xff1b; (5) 掌握Comparator接口用法&#xff1b; (6) 掌握對象淺層拷貝與深層拷貝方法&#xff1b…

新版 Android 已支持 FIDO2 標準,免密登錄應用或網站

谷歌剛剛宣布了與 FIDO 聯盟達成的最新合作&#xff0c;為 Android 用戶帶來了無需密碼、即可登錄網站或應用的便捷選項。 這項服務基于 FIDO2 標準實現&#xff0c;任何運行 Android 7.0 及后續版本的設備&#xff0c;都可以在升級最新版 Google Play 服務后&#xff0c;通過指…

t-sne原理解釋_T-SNE解釋-數學與直覺

t-sne原理解釋The method of t-distributed Stochastic Neighbor Embedding (t-SNE) is a method for dimensionality reduction, used mainly for visualization of data in 2D and 3D maps. This method can find non-linear connections in the data and therefore it is hi…

oracle操作

imp kfqrlcs/kfqrlcshx fileC:\kfqrlcs.dmp fully //創建臨時表空間 create temporary tablespace kfqrlcs_temp tempfile C:\oracledata\kfqrlcs_temp.dbf size 32m autoextend on next 32m maxsize 8048m extent management local; //tempfile參數必須有 //創建數據表…

strust2自定義攔截器

1.創建一個攔截器類&#xff0c;繼承MethodFilterInterceptor類&#xff0c;實現doIntercept方法 package com.yqg.bos.web.interceptor;import com.opensymphony.xwork2.ActionInvocation; import com.opensymphony.xwork2.interceptor.MethodFilterInterceptor; import com.y…

Android Studio如何減小APK體積

最近在用AndroidStudio開發一個小計算器&#xff0c;代碼加起來還不到200行。但是遇到一個問題&#xff0c;導出的APK文件大小竟然達到了1034K。這不科學&#xff0c;于是就自己動手精簡APK。下面我們大家一起學習怎么縮小一個APK的大小&#xff0c;以hello world為例。 新建工…