What the Heck Does Zero Mean?
When I was an undergraduate student studying Data Science, one of my professors always asked the same question for every data set we worked with — “What does zero mean?”
On the surface, this seems trivial. If the data set records how many apples each student has, zero means that a student has no apples.
Then why ask the question?
Well, zero can mean zero… but it can also mean a slew of other things, and if you’re not careful, ignoring it could come back to haunt you.
It is easy to disregard missing data. Whether the values are 0, NULL, NA, or blank, we are often quick to ignore these records because they “lack information”. However, these data points can be critical to our problem, and the “lack” of information is itself information.
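As a first step, it can help to count zeros and missing values separately rather than lumping them together. The sketch below assumes the data is loaded into a pandas DataFrame; the column names and values are hypothetical and are not the bank's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical records for illustration only (not the bank's data).
df = pd.DataFrame({
    "age": [45, 23, 20, 38],
    "credit_score": [720.0, np.nan, np.nan, 650.0],
    "missed_payments": [0.0, 2.0, np.nan, 4.0],
})

# Count zeros and missing values per column as two separate quantities.
# NaN == 0 evaluates to False, so missing cells are not counted as zeros.
summary = pd.DataFrame({
    "n_zero": (df == 0).sum(),
    "n_missing": df.isna().sum(),
})
print(summary)
```

Keeping the two counts side by side makes it harder to treat "0" and "blank" as interchangeable without noticing.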
Let’s consider the following scenario. We are consulting for a bank and they want us to determine if a customer is likely to default on their credit card payments. Below is a sample of the data we are given to evaluate this problem.
We see that there are five variables we could use to make predictions and of those five, four contain values of 0, blank, or in some cases both.
It might be easy to ignore these values or chalk them up to some kind of error in the bank’s system, but let’s take a closer look and see if there might be more to the story.
Our first variable with missing data is Credit Score. Two of the eight customers have no credit score value. While it may seem like these values were skipped over, notice that these two customers are 23 and 20 years old. Since they are relatively young, there is a good chance they only recently opened a credit card and consequently do not have a credit score yet. It might be tempting to backfill these records with a value of 0, but that would not make sense either, since we don’t know how they will actually perform. How would we handle this in a real-life scenario? One approach would be to find the average score of individuals of similar ages and use that value for our missing records.
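The "average score of similar ages" idea could be sketched roughly as follows. This is a minimal illustration, not the author's method: the data, the age bands, and the column names (`age_band`, `score_was_missing`, `credit_score_imputed`) are all assumptions made for the example. It also keeps a flag recording which scores were originally missing, so that information is not lost after imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical customers; values are made up for illustration.
customers = pd.DataFrame({
    "age": [23, 20, 24, 22, 41, 47, 35, 52],
    "credit_score": [np.nan, np.nan, 640.0, 660.0, 720.0, 700.0, 690.0, 750.0],
})

# Assumed age bands; the cut points are arbitrary choices for this sketch.
customers["age_band"] = pd.cut(
    customers["age"], bins=[17, 25, 40, 120], labels=["18-25", "26-40", "41+"]
)

# Remember which rows were missing before filling anything in.
customers["score_was_missing"] = customers["credit_score"].isna()

# Mean score within each age band (NaN values are skipped by mean).
band_mean = customers.groupby("age_band", observed=True)["credit_score"].transform("mean")

# Fill only the missing scores with the mean of their age band.
customers["credit_score_imputed"] = customers["credit_score"].fillna(band_mean)
print(customers)
```

Note that this only works when each age band contains at least some customers with known scores; a band where every score is missing would stay missing.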
The next column with missing data is Missed Payments. This time we have records with both missing data and values of 0. Intuitively, we can infer that a value of 0 indicates a customer has never made a late payment. In this case, 0 really does mean 0. What about the missing values? Well, as we discussed for Credit Score, there might be other factors impacting this variable. Notice again that our missing record is for a customer who is only 20 years old. Given that they also do not have a credit score, we might infer that they have never missed a payment because they have not yet had the chance to make one. If our customer only opened their account this month, then they would not have had to make a payment and consequently could not have missed one either.
Moving on to our final two variables, Credit Limit and Payment Due, we see there are again both missing values and 0 values. The values of 0 are fairly intuitive: $0 is a plausible amount in these cases. Our missing data, however, poses a bigger question. How can an individual have no value for a credit limit? Are they allowed to spend as much as they want? This same individual also has no payment due… how does that work?
Let’s approach this the same way as our other scenarios: by looking at the rest of the customer’s information. First, we can see this individual has a much lower score than all our other customers, including the 23-year-old (higher scores are better for credit). We also see that they have missed 8 payments, double the next-highest count. Hmmm… so why would they have no credit limit?
One possible answer: this individual performed so poorly that the bank decided to terminate their account. As a result, the customer is still in our database but is no longer active with the bank. Consequently, they cannot spend any money and would not be required to make payments either.
All of the above are obviously hypothetical scenarios and explanations as to why data might be missing, why it might be zero, and how we might handle it. Data can be missing for a number of reasons, but understanding why it is missing or zero can be critical in learning more and making better decisions. In a real-world scenario, you might be able to go back to the bank and ask clarifying questions about the data to verify whether your assumptions are correct. Of course, there are plenty of times when that will not be an option either.
In the modern world of data science and machine learning, many models cannot handle missing values, so we are forced to deal with this data in some other way. While it is easy to simply drop these records or impute averages or medians, we should also take time to consider what these missing values represent. Imagine if we had simply imputed a value of 0 for every blank credit score in our bank example. A model may have made horrible predictions because we falsely assumed that 0 and missing were the same when, in this case, they were not. Similarly, if we had backfilled our missing credit limit values with 0, we would have been training a model to predict whether a customer will default on data that includes at least one customer who had already defaulted.
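To make the risk concrete, here is a small sketch of how zero-filling distorts a simple statistic, and of one way to set aside accounts that look closed before training. The numbers and column names are hypothetical, and the rule "credit limit and payment due both missing means the account may be closed" is only the interpretation suggested above, not a confirmed fact about the bank's data.

```python
import numpy as np
import pandas as pd

# Hypothetical credit scores with two missing values.
scores = pd.Series([720.0, np.nan, 650.0, np.nan, 700.0], name="credit_score")

# Treating missing as 0 drags the average down sharply.
zero_filled = scores.fillna(0)
print(scores.mean())       # mean over the three known scores
print(zero_filled.mean())  # mean after pretending missing means 0

# Hypothetical accounts: a $0 limit is a real value, a blank one is not.
accounts = pd.DataFrame({
    "credit_limit": [5000.0, 0.0, np.nan],
    "payment_due": [120.0, 0.0, np.nan],
})

# Assumed rule: both fields missing suggests the account may be closed.
possibly_closed = accounts["credit_limit"].isna() & accounts["payment_due"].isna()

# Review flagged rows separately instead of zero-filling them into training data.
training = accounts[~possibly_closed]
print(training)
```

Whether flagged rows should be dropped, labeled, or investigated further depends on what the bank can confirm about them.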
Sometimes zero really is zero. Sometimes a missing value is simply human error or a failed data entry job. Sometimes, there’s a much deeper story going on. To quote my professor, “What the heck does zero mean?”
Translated from: https://medium.com/@ivanecky/what-the-heck-does-zero-mean-8c5f42266dc6