Introduction
Welcome to this lesson on calculating p-values.
Before we jump into how to calculate a p-value, it’s important to think about what the p-value is really for.
Hypothesis Testing Refresher
Without going into too much detail for this post: when establishing a hypothesis test, you determine a null hypothesis. Your null hypothesis represents the world in which the two variables you’re assessing have no relationship. Conversely, the alternative hypothesis represents the world where there is a relationship. If the evidence for that relationship is statistically significant, you reject the null hypothesis in favor of the alternative hypothesis.
Diving Deeper
Before we move on from the idea of hypothesis testing… think about what we just said. You effectively need to show, with little room for error, that what we’re seeing in the real world would be very unlikely to take place in a world where these variables are not related, that is, a world where they are independent.
Sometimes when learning concepts in statistics, you hear the definition but take little time to conceptualize it. There is often a lot of memorization of rule sets… I find that understanding the intuitive foundation of these principles will serve you far better when it comes to their practical applications.
Continuing in this vein of thought: if you want to compare your real world stat with the fake world, that’s exactly what you should do.
As you’d guess, we can calculate our observed statistic by creating a linear regression model where we explain our response variable as a function of our explanatory variable. Once we’ve done this, we can quantify the relationship between these two variables using the slope, or coefficient, identified through our OLS regression.
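Written out, the setup looks roughly like this. The symbols y (response), x (explanatory variable), \beta_0, \beta_1, and \varepsilon are notation assumed here for illustration, and the two-sided alternative is one common choice rather than something this post states explicitly:

```latex
\begin{align*}
y &= \beta_0 + \beta_1 x + \varepsilon \\
H_0 &: \beta_1 = 0 \\
H_A &: \beta_1 \neq 0
\end{align*}
```

The slope \beta_1 is the statistic we estimate from the real data, and the null hypothesis is the world in which it equals 0.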
But now we need to come up with this idea of the null world… or the world where these variables are independent. This is something we don’t have, so we’ll need to simulate it. For our convenience, we’re going to leverage the infer package.
Let’s Calculate our Observed Statistic
First things first, let’s get our observed statistic!
The dataset we’re working with is a Seattle home prices dataset. I’ve used this dataset many times before and find it particularly flexible for demonstration. Each record in the dataset is a single home, with details such as price, square footage, number of beds, number of baths, and so forth.
Through the course of this post, we’ll be trying to explain price as a function of square footage.
Let’s create our regression model:
fit <- lm(price_log ~ sqft_living_log,
data = housing)
summary(fit)
As you can see in the output above, the statistic we’re after is the Estimate for our explanatory variable, sqft_living_log.
A very clean way to do this is to tidy our results so that rather than a linear model object, we get a tibble. Tibbles, tables, or data frames are a lot easier for us to interact with systematically.
We’ll then want to filter down to the sqft_living_log term, and we’ll wrap it up by using the pull function to return the estimate itself. This returns the slope as a number, which will make it easier to compare with our null distribution later on.
Take a look!
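library(dplyr)
library(broom)
# Fit the model, tidy the coefficients into a tibble, and keep the slope as obs_slope.
obs_slope <- lm(price_log ~ sqft_living_log,
                data = housing) %>%
  tidy() %>%
  filter(term == 'sqft_living_log') %>%
  pull(estimate)
obs_slope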
Time to Simulate!
To kick things off, you should know there are various types of simulation. The one we’ll be using here is what’s called permutation.
Permutation is particularly helpful when it comes to showing a world where variables are independent of one another.
We won’t be going into the specifics of how a permutation sample is created under the hood, but it’s worth noting that the distribution of the statistic across permuted samples will be approximately normal and centered around 0.
In this case, the slope would center around 0 as we’re operating under the premise that there is no relationship between our explanatory and response variables.
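To make the idea concrete, here is a minimal base R sketch of a single permutation and a small batch of them. It uses made-up data (the vectors x and y are hypothetical, not the housing dataset) and it is not a description of how infer works internally; it only illustrates that shuffling the response breaks the pairing with the explanatory variable and pulls the slope toward 0.

```r
set.seed(42)

# Hypothetical data with a real relationship: y depends on x.
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n)

# Slope from the data as observed.
slope_observed <- unname(coef(lm(y ~ x))[2])

# One permutation: shuffle y so each x is paired with a random y.
y_shuffled <- sample(y)
slope_permuted <- unname(coef(lm(y_shuffled ~ x))[2])

# Many permutations: collect one slope per shuffle.
null_slopes <- replicate(500, unname(coef(lm(sample(y) ~ x))[2]))

slope_observed      # close to the true slope of 2
slope_permuted      # close to 0
mean(null_slopes)   # also close to 0
```

Because every shuffle keeps the same x values and the same y values and only changes which ones go together, any slope that shows up in null_slopes is there by chance alone.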
Infer Fundamentals
A few things for you to know:
- specify is how we determine the relationship we’re modeling: price_log ~ sqft_living_log
- hypothesize is where we designate the null hypothesis, here independence
- generate is how we determine the number of replicates of our dataset we want to make. Note that if you generate a single replicate and skip calculate, it returns a sample dataset of the same size as the original dataset.
- calculate lets you choose the statistic to compute (slope, mean, median, diff in means, etc.)
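library(infer)
set.seed(1)
# Build the null distribution: permute, then compute one slope per replicate.
perm <- housing %>%
  specify(price_log ~ sqft_living_log) %>%
  hypothesize(null = 'independence') %>%
  generate(reps = 100, type = 'permute') %>%
  calculate(stat = 'slope')
perm
# Histogram of the permuted slopes.
hist(perm$stat)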
Rerunning the same pipeline with reps = 1000 produces the same kind of distribution, and that 1000-rep version is the one discussed below.
Null Sampling Distribution
Ok, we’ve done it! We’ve created what is known as the null sampling distribution. What we’re seeing above is a distribution of 1000 slopes, one from each of 1000 simulations of independent data.
This gives us just what we needed: a simulated world against which we can compare reality.
Taking the visual we just made, let’s use a density plot and add a vertical line for our observed slope, marked in red.
library(ggplot2)
# Density of the permuted slopes, with the observed slope marked in red.
ggplot(perm, aes(stat)) +
  geom_density() +
  geom_vline(xintercept = obs_slope, color = 'red')
Visually, you can see that the observed slope sits far beyond anything the permuted data produced by random chance.
As you can guess from looking at the plot, the p-value here is going to be 0. That is to say, 0% of the null sampling distribution is greater than or equal to our observed statistic.
If in fact we were seeing many cases where the slope from our permuted data was greater than or equal to our observed statistic, that would tell us the observed slope could plausibly be just random.
To reiterate the message here, the purpose of the p-value is to give you an idea of how plausible it is that we saw such a slope by random chance rather than because of a statistically significant relationship.
Calculating P-value
While we know what our p-value will be here, let’s get you set up with the calculation for p-value.
To re-prime this idea: the p-value is the proportion of replicates that were (randomly) greater than or equal to our observed slope.
You’ll see in our summarise function that we’re checking whether our stat, or slope, is greater than or equal to the observed slope. Each record is assigned TRUE or FALSE accordingly. When you wrap that in a mean function, TRUE counts as 1 and FALSE as 0, resulting in the proportion of cases where stat was greater than or equal to our observed slope. The code then doubles that proportion, which turns the one-sided proportion into a two-sided p-value.
# mean() of a logical vector is the share of TRUEs; the factor of 2 makes the test two-sided.
perm %>%
  summarise(p_val = 2 * round(mean(stat >= obs_slope), 2))
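The TRUE/FALSE trick is easier to see on a handful of numbers. The values below are hypothetical, picked by hand so the arithmetic can be checked by eye; they are not results from the housing data.

```r
# Hypothetical permuted slopes and observed slope, chosen by hand.
stat <- c(-0.2, 0.1, 0.4, -0.3, 0.5)
obs_slope_example <- 0.3

# Compare each permuted slope with the observed one.
is_extreme <- stat >= obs_slope_example
is_extreme        # FALSE FALSE  TRUE FALSE  TRUE

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion of TRUEs.
one_sided <- mean(is_extreme)   # 2 of 5 = 0.4
two_sided <- 2 * one_sided      # doubled, as in the summarise() call: 0.8
```

With only five values the result is not meaningful as a p-value, but the calculation is the same one applied to the full set of replicates.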
To see a case with a weaker relationship, where we would not have sufficient evidence to reject the null hypothesis, let’s look at price explained as a function of the year the home was built.
Using the same calculation as above, this results in a p-value of 12%. At the standard 95% confidence level, which corresponds to a 5% significance threshold, that is not sufficient evidence to reject the null hypothesis.
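For readers who want to try the full loop without the housing data, here is a rough end-to-end sketch in base R. Everything in it is an assumption made for illustration: the helper perm_p_value is hypothetical, the data are simulated, and the two-sided p-value is computed by comparing absolute values of the slopes, which is a different (though related) approach from doubling one tail as in the summarise call earlier.

```r
# Two-sided permutation p-value for the slope of y ~ x (base R only).
perm_p_value <- function(x, y, reps = 1000) {
  obs <- unname(coef(lm(y ~ x))[2])
  null <- replicate(reps, unname(coef(lm(sample(y) ~ x))[2]))
  # Share of permuted slopes at least as far from 0 as the observed slope.
  mean(abs(null) >= abs(obs))
}

set.seed(7)
n <- 100
x <- rnorm(n)
y_strong <- 2 * x + rnorm(n)   # hypothetical strong relationship
y_weak <- rnorm(n)             # hypothetical: no relationship built in

alpha <- 0.05                  # 5% significance level (95% confidence)
p_strong <- perm_p_value(x, y_strong)
p_weak <- perm_p_value(x, y_weak)

p_strong < alpha   # TRUE here: reject the null hypothesis
p_weak < alpha     # compare and decide; y_weak was generated independently of x
```

The decision rule is the same as in the text: a p-value below alpha counts as sufficient evidence to reject the null hypothesis, and a p-value such as 0.12 does not.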
Final Notes on P-value Interpretation
There’s one final thing I want to highlight one more time…
The meaning of 12%: we saw that when we randomly generated independent samples… a whole 12% of the time, our randomly generated slope was as extreme or more extreme than the one we observed.
In other words, you might see such a result as much as 12% of the time just due to random chance.
Conclusion
That’s it! You’re a master of calculating and understanding the p-value.
In a few short minutes we have learned a lot:
- hypothesis testing
- linear regression refresher
- sampling explanation
- learning about the infer package
- building a sampling distribution
- visualizing p-value
- calculating p-value
It’s easy to get lost when dissecting statistics concepts like p-value. My hope is that having a strong foundational understanding of the need and the corresponding execution allows you to understand and correctly apply this to a variety of problems.
If this was helpful, feel free to check out my other posts at https://medium.com/@datasciencelessons. Happy Data Science-ing!
Translated from: https://towardsdatascience.com/getting-to-the-bottom-of-p-value-the-intuitive-explanation-calculation-fec46bb15a92