Bayesian Statistical Inference
by Kirill Dubovikov
Statistical Inference Showdown: The Frequentists VS The Bayesians
Inference
Statistical Inference is a very important topic that powers modern Machine Learning and Deep Learning algorithms. This article will help you to familiarize yourself with the concepts and mathematics that make up inference.
Imagine we want to trick some friends with an unfair coin. We have 10 coins and want to judge whether any one of them is unfair — meaning it will come up as heads more often than tails, or vice versa.
So we take each coin, toss it a bunch of times — say 100 — and record the results. The thing is we now have a subset of measurements from a true distribution (a sample) for each coin. We’ve considered the condition of our thumbs and concluded that collecting more data would be very tedious.
It is uncommon to know the parameters of the true distribution. Frequently, we want to infer the true population parameters from the sample.
So now we want to estimate the probability of a coin landing on Heads. We are interested in the sample mean.
By now you've likely thought, "Just count the number of heads and divide by the total number of attempts already!" Yep, this is the way to find an unfair coin, but how could we come up with this formula if we didn't know it in the first place?
Frequentist Inference
Recall that coin tosses are best modeled with Bernoulli distribution, so we are sure that it represents our data well. Probability Mass Function (PMF) for Bernoulli distribution looks like this:
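f(x; p) = p^x * (1 - p)^(1 - x),  where x ∈ {0, 1}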
x is a random variable that represents an observation of a coin toss (assume 1 for Heads and 0 for Tails) and p is a parameter: the probability of Heads. From here onward we will refer to all possible parameters as θ. This function represents how probable each value of x is according to the distribution law we have chosen.
When x is equal to 1 we get f(1; p) = p, and when it is zero f(0; p) = 1-p. Thus, the Bernoulli distribution answers the question 'How probable is it that we get heads with a coin that lands on heads with probability p?'. Actually, it is one of the simplest examples of a discrete probability distribution.
So, we are interested in determining the parameter p from the data. A frequentist statistician will probably suggest using a Maximum Likelihood Estimation (MLE) procedure. This method takes the approach of maximizing the likelihood of the parameters given the dataset D:
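θ_MLE = arg max_θ P(D | θ)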
This means that the likelihood is defined as the probability of the data given the parameters of the model. To maximize this probability, we will need to find parameters that help our model match the data as closely as possible. Doesn't it look like learning? Maximum Likelihood is one of the methods that make supervised learning work.
Now let’s assume all observations we make are independent. This means that joint probability in the expression above may be simplified to a product by basic rules of probability:
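P(D | θ) = P(x_1, x_2, ..., x_n | θ) = P(x_1 | θ) * P(x_2 | θ) * ... * P(x_n | θ)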
Now comes the main part: how do we maximize the likelihood function? We call on calculus for help: differentiate the likelihood function with respect to the model parameters θ, set the derivative to 0, and solve the equation. There is a neat trick that makes the differentiation much easier most of the time: logarithms do not change a function's extrema (minimum and maximum), so we can maximize the log-likelihood instead.
Maximum Likelihood Estimation is of immense importance in almost every machine learning algorithm. It is one of the most popular ways to formulate the process of learning mathematically.
And now let's apply what we've learned and play with our coins. We've done n independent Bernoulli trials to evaluate the fairness of our coin. Thus, all probabilities can be multiplied and the likelihood function will look like this:
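P(D | p) = Π_i p^(x_i) * (1 - p)^(1 - x_i)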
Taking the derivative of the expression above won’t be nice. So, we need to find the log-likelihood:
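log P(D | p) = Σ_i [ x_i * log(p) + (1 - x_i) * log(1 - p) ]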
That looks easier. Moving on to differentiation:
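d/dp log P(D | p) = Σ_i d/dp [ x_i * log(p) ] + Σ_i d/dp [ (1 - x_i) * log(1 - p) ]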
Here we split the derivative using the standard rule d(f + g) = df + dg. Next, we move the constants out and differentiate the logarithms:
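d/dp log P(D | p) = (1/p) * Σ_i x_i - (1/(1 - p)) * Σ_i (1 - x_i)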
The last step might seem funny because of the sign flip. The cause is that log(1-p) is actually a composition of two functions and we must use the chain rule here:
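d/dp log(1 - p) = (1/(1 - p)) * d/dp (1 - p) = -1/(1 - p)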
Voilà, we are done with the log-likelihood! Now we are close to finding the maximum likelihood statistic for the mean of the Bernoulli distribution. The last step is to solve the equation:
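(1/p) * Σ_i x_i - (1/(1 - p)) * Σ_i (1 - x_i) = 0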
Multiplying everything by p(1-p) and expanding the parentheses, we get
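(1 - p) * Σ_i x_i - p * Σ_i (1 - x_i) = 0
Σ_i x_i - p * Σ_i x_i - p * n + p * Σ_i x_i = 0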
Canceling out the terms and rearranging:
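Σ_i x_i - p * n = 0,  so  p = (1/n) * Σ_i x_i
that is, the number of heads divided by the total number of tosses.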
So, here is the derivation of our intuitive formula. You can now play with the Bernoulli distribution and its MLE estimate of the mean yourself; a small simulation is sketched below.
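If you want to experiment further, here is a minimal sketch, assuming NumPy and a made-up true probability of heads, that simulates Bernoulli trials and recovers p with the MLE we just derived:

```python
import numpy as np

rng = np.random.default_rng(42)

true_p = 0.7                              # hypothetical probability of heads
n = 100                                   # number of tosses
tosses = rng.binomial(1, true_p, size=n)  # 1 = heads, 0 = tails

p_mle = tosses.mean()                     # MLE: number of heads / number of tosses
print(f"true p = {true_p}, MLE estimate = {p_mle:.3f}")
```

Increasing n should bring the estimate closer to the true value of p.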
Congratulations on your awesome new skill of Maximum Likelihood Estimation! Or on just refreshing your existing knowledge.
Bayesian Inference
Recall that there exists another approach to probability. Bayesian statistics has its own way to do probabilistic inference. We want to find the probability distribution of the parameters θ given the sample, P(θ | D). But how can we infer this probability? Bayes' theorem comes to the rescue:
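P(θ | D) = P(D | θ) * P(θ) / P(D)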
P(θ) is called a prior distribution and incorporates our beliefs about what the parameters could be before we have seen any data. The ability to state prior beliefs is one of the main differences between maximum likelihood and Bayesian inference. However, this is also the main point of criticism of the Bayesian approach. How do we state the prior distribution if we do not know anything about the problem of interest? What if we choose a bad prior?
P(D | θ) is the likelihood; we have already encountered it in Maximum Likelihood Estimation.
P(D) is called evidence or marginal likelihood
P(D) is also called the normalization constant, since it makes sure that the result we get is a valid probability distribution. If we rewrite P(D) as
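P(D) = Σ_θ P(D | θ) * P(θ)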
We will see that it is similar to the numerator in Bayes' Theorem, but the summation goes over all possible parameters θ. This way we get two things:
The output is always a valid probability distribution in the domain of [0, 1].
Major difficulties arise when we try to compute P(D), since this requires integrating or summing over all possible parameters. This is impossible in most real-world problems.
But does the marginal likelihood P(D) make all things Bayesian impractical? The answer is: not quite. Most of the time, we will use one of two options to get rid of this problem.
The first one is to somehow approximate P(D). This can be achieved by using various sampling methods like Importance Sampling or Gibbs Sampling, or a technique called Variational Inference (which is a cool name, by the way).
The second is to get it out of the equation completely. Let's explore this approach in more detail. What if we concentrate on finding the single most probable parameter combination (that is, the best possible one)? This procedure is called Maximum A Posteriori estimation (MAP).
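θ_MAP = arg max_θ P(θ | D) = arg max_θ [ P(D | θ) * P(θ) / P(D) ]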
The equation above means that we want to find the θ for which the expression inside the arg max takes its maximum value, hence the name: the argument of the maximum. The main thing to notice here is that P(D) is independent of the parameters and may be excluded from the arg max:
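θ_MAP = arg max_θ P(D | θ) * P(θ)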
In other words, P(D) will always be constant with respect to the model parameters, so its derivative with respect to them is equal to 0.
This fact is so widely used that it is common to see Bayes' Theorem written in this form:
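P(θ | D) ∝ P(D | θ) * P(θ)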
The weird incomplete-infinity sign ∝ in the expression above means "proportional to" or "equal up to a constant".
Thus, we have removed the most computationally heavy part of MAP. This makes sense, since we basically discarded the full probability distribution over parameter values and just skimmed off the single most probable one.
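To make this concrete, here is a minimal sketch of MAP for the coin example. It assumes a hypothetical Beta(2, 2) prior on p and made-up data, and finds the most probable p by a simple grid search over the proportional form P(p | D) ∝ P(D | p) * P(p), never computing the evidence P(D):

```python
import numpy as np
from scipy.stats import beta, binom

heads, n = 62, 100                     # hypothetical observed data
grid = np.linspace(0.001, 0.999, 999)  # candidate values of p

log_prior = beta.logpdf(grid, 2, 2)            # log P(p) under the assumed Beta(2, 2) prior
log_likelihood = binom.logpmf(heads, n, grid)  # log P(D | p)
log_unnormalized_posterior = log_likelihood + log_prior

p_map = grid[np.argmax(log_unnormalized_posterior)]
print(f"MAP estimate of p: {p_map:.3f}")
```

With a flat prior (for example Beta(1, 1)), the same search would return the MLE, which is exactly the link discussed in the next section.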
A link between MLE and MAP
And now consider what happens when we assume the prior to be uniform (a constant probability).
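If P(θ) = C, then:
θ_MAP = arg max_θ [ P(D | θ) * C ] = arg max_θ P(D | θ) = θ_MLE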
We have moved the constant C out of the arg max since it does not affect the result, just as with the evidence. The result certainly looks like a Maximum Likelihood estimate! In the end, the mathematical gap between frequentist and Bayesian inference is not that large.
We can also build the bridge from the other side and view maximum likelihood estimation through Bayesian glasses. In particular, it can be shown that Bayesian priors have close connections with regularization terms. But that topic deserves another post (see this SO question and the ESLR book for more details).
Conclusion
These differences may seem subtle at first, but they give rise to two schools of statistics. The frequentist and Bayesian approaches differ not only in mathematical treatment but in their philosophical views on fundamental concepts in statistics.
If you put on a Bayesian hat, you view the unknowns as probability distributions and the data as non-random, fixed observations. You incorporate prior beliefs to make inferences about the events you observe.
As a frequentist, you believe that there is a single true value for the unknowns we seek, and that it is the data that is random and incomplete. A frequentist randomly samples data from an unknown population and makes inferences about the true values of unknown parameters using this sample.
In the end, the Bayesian and frequentist approaches have their own strengths and weaknesses. Each has the tools to solve almost any problem the other can. Like different programming languages, they should be considered tools of equal strength that may be a better fit for one problem and fall short on another. Use them both, use them wisely, and do not fall into the fury of a holy war between two camps of statisticians!
Translated from: https://www.freecodecamp.org/news/statistical-inference-showdown-the-frequentists-vs-the-bayesians-4c1c986f25de/