What Does “Zero” Really Mean?

When I was an undergraduate studying Data Science, one of my professors asked the same question about every data set we worked with: “What does zero mean?”

On the surface, this seems trivial. If the data records how many apples each student has, zero means that a student has no apples.

Then why ask the question?

Well, zero can mean zero… but it can also mean a slew of other things, and if you’re not careful, ignoring it could come back to haunt you.

It is easy to disregard missing data. Whether the values are 0, NULL, NA, or blank, we are often quick to ignore these records because they “lack information”. However, these data points can be critical to our problem, and the “lack” of information is itself information.

Let’s consider the following scenario. We are consulting for a bank and they want us to determine if a customer is likely to default on their credit card payments. Below is a sample of the data we are given to evaluate this problem.

We see that there are five variables we could use to make predictions and of those five, four contain values of 0, blank, or in some cases both.
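
A quick audit can make this kind of observation concrete. The sketch below counts recorded zeros and blanks separately for each column. The records, the column names, and the helper `audit_zero_and_missing` are all invented for illustration and are not the bank's data; `None` stands in for a blank cell.

```python
# Hypothetical records, invented for illustration only; they are not the bank's data.
# None stands for a blank cell, which is kept distinct from a recorded 0.
customers = [
    {"age": 45, "credit_score": 710, "missed_payments": 0, "credit_limit": 5000, "payment_due": 120},
    {"age": 23, "credit_score": None, "missed_payments": 0, "credit_limit": 1000, "payment_due": 0},
    {"age": 20, "credit_score": None, "missed_payments": None, "credit_limit": 500, "payment_due": 0},
    {"age": 52, "credit_score": 480, "missed_payments": 8, "credit_limit": None, "payment_due": None},
]


def audit_zero_and_missing(rows):
    """Count recorded zeros and blanks separately for every column."""
    report = {}
    for row in rows:
        for column, value in row.items():
            counts = report.setdefault(column, {"zero": 0, "missing": 0})
            if value is None:
                counts["missing"] += 1
            elif value == 0:
                counts["zero"] += 1
    return report


report = audit_zero_and_missing(customers)
for column, counts in report.items():
    print(f"{column}: zeros={counts['zero']}, missing={counts['missing']}")
```

Keeping the two counts apart is the point: a column with many zeros and a column with many blanks raise different questions.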

It might be easy to ignore these values or chalk them up to some kind of error in the bank’s system, but let’s take a closer look and see if there might be more to the story.

Our first variable with missing data is Credit Score. Two of the eight customers have no credit score value. While it may seem like these values were skipped over, notice that these two customers are 23 and 20 years old. Since they are relatively young, there is a good chance they only recently opened a credit card and consequently do not have a credit score yet. It might be tempting to backfill these records with a value of 0, but that would not make sense either, given that we don’t know how they will actually perform. How would we handle this in a real-life scenario? One approach would be to find the average score of individuals of similar ages and use that value for our missing records.
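
The age-based averaging idea could be sketched roughly as follows. This is only one possible reading of "similar ages": the 10-year band width, the sample records, and the helper names `age_band` and `impute_by_age_band` are assumptions made for illustration. The sketch also records which scores were filled in, so the fact that a value was missing is not lost.

```python
from statistics import mean

# Hypothetical (age, credit_score) records invented for illustration; None marks a missing score.
customers = [
    {"age": 23, "credit_score": None},
    {"age": 20, "credit_score": None},
    {"age": 22, "credit_score": 640},
    {"age": 24, "credit_score": 660},
    {"age": 41, "credit_score": 720},
    {"age": 47, "credit_score": 700},
]


def age_band(age, width=10):
    """Return the lower edge of the age band, e.g. 23 -> 20 (the band width is an assumption)."""
    return (age // width) * width


def impute_by_age_band(rows, width=10):
    """Fill missing scores with the mean score of customers in the same age band."""
    scores_by_band = {}
    for row in rows:
        if row["credit_score"] is not None:
            scores_by_band.setdefault(age_band(row["age"], width), []).append(row["credit_score"])

    imputed = []
    for row in rows:
        new_row = dict(row)
        # Remember that the value was originally blank, so that signal survives imputation.
        new_row["score_was_missing"] = row["credit_score"] is None
        if row["credit_score"] is None:
            peers = scores_by_band.get(age_band(row["age"], width))
            # Leave the gap in place when no peer in the band has a known score.
            new_row["credit_score"] = mean(peers) if peers else None
        imputed.append(new_row)
    return imputed


imputed = impute_by_age_band(customers)
for row in imputed:
    print(row)
```

With these made-up records, both young customers receive the mean of the two known scores in the 20 to 29 band. Whether a peer average is a fair stand-in for someone with no credit history at all is a separate judgment call.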

The next column with missing data is Missed Payments. This time we have records with both missing data and values of 0. Intuitively, we can infer that a value of 0 indicates a customer has never missed a payment. In this case, 0 really does mean 0. What about the missing values? Well, as we discussed for Credit Score, there might be other factors affecting this variable. Notice again that our missing record is for a customer who is only 20 years old. Given that they also do not have a credit score, we might infer that they have never missed a payment because they have not yet had the chance to make one. If our customer only opened their account this month, then they would not have had to make a payment and consequently could not have missed one either.
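
One place this distinction is easily lost is at load time, when a blank cell is silently converted to 0. The sketch below shows one way to keep "never missed a payment" apart from "no payment history yet". The raw cell values and the helper `parse_count` are hypothetical, and the set of blank markers is an assumption based on the 0, NULL, NA, and blank cases mentioned earlier.

```python
def parse_count(raw):
    """Parse a raw text cell, keeping a blank cell (None) distinct from a recorded 0."""
    text = raw.strip()
    if text == "" or text.upper() in {"NA", "NULL"}:
        return None
    return int(text)


# Hypothetical raw cells for a Missed Payments column, invented for illustration.
raw_cells = ["0", "", "4", "NA", "8"]
parsed = [parse_count(cell) for cell in raw_cells]

# A recorded 0 means the customer never missed a payment.
never_missed = [value == 0 for value in parsed]
# A blank may mean there is no payment history to speak of yet.
no_history = [value is None for value in parsed]

print(parsed)
print(never_missed)
print(no_history)
```

A shortcut such as `int(cell or 0)` would merge the two groups, which is exactly the assumption the Missed Payments column warns against.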

Moving on to our final two variables, Credit Limit and Payment Due, we see there are again both missing values and 0 values. The 0 values are fairly intuitive: $0 is a plausible amount in these cases. Our missing data, however, poses a bigger question. How can an individual have no value for a credit limit? Are they allowed to spend as much as they want? This same individual also has no payment due… how does that work?

Let’s approach this the same way as our other scenarios: by looking at the rest of the customer’s information. First, we can see this individual has a much lower score than all our other customers, including the 23-year-old (higher scores are better for credit). We also see that they have missed 8 payments, double the next-highest count. Hmmm… so why would they have no credit limit?

One possible answer: this individual performed so poorly that the bank decided to terminate their account. As a result, the customer is still in our database but is no longer active with the bank. Consequently, they cannot spend any money and are not required to make payments either.
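
If that explanation seemed plausible, one cautious way to act on it would be to flag such records for review instead of filling them in. The sketch below uses a made-up heuristic, `looks_closed`, that treats a blank credit limit together with a blank payment due as a sign of a possibly closed account. The rule and the sample records are assumptions for illustration, not anything the bank has confirmed.

```python
def looks_closed(row):
    """Heuristic (an assumption, not a bank rule): no credit limit and no payment due."""
    return row["credit_limit"] is None and row["payment_due"] is None


# Hypothetical records invented for illustration; None marks a blank cell.
customers = [
    {"id": "A", "missed_payments": 0, "credit_limit": 5000, "payment_due": 120},
    {"id": "B", "missed_payments": 4, "credit_limit": 2000, "payment_due": 0},
    {"id": "C", "missed_payments": 8, "credit_limit": None, "payment_due": None},
]

# Records to raise with the bank before deciding how to treat them.
needs_review = [row["id"] for row in customers if looks_closed(row)]
# Records that still look like active accounts.
active = [row for row in customers if not looks_closed(row)]

print(needs_review)
print(len(active))
```

Note that customer B's payment due of 0 does not trigger the flag; only the blank values do.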

All of the above are, of course, hypothetical scenarios and explanations of why data might be missing, why it might be zero, and how we might handle it. Data can be missing for a number of reasons, but understanding why it is missing or zero can be critical to learning more and making better decisions. In a real-world scenario, you might be able to go back to the bank and ask clarifying questions about the data to verify whether your assumptions are correct. Of course, there are plenty of times when that will not be an option.

In the modern world of data science and machine learning, we often work with models that cannot handle missing values, so we are forced to handle this data in another way. While it can be easy to simply drop these records or impute averages or medians, we should also take time to consider what these missing values represent. Imagine if we had simply imputed 0 for every blank credit score in our bank example. A model might have made horrible predictions because we falsely assumed that 0 and missing were the same when, in this case, they were not. Similarly, if we had backfilled our missing credit limit values with 0, we would have been including at least one customer who had already defaulted in a model trying to predict whether a customer will default.
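
A small numeric check shows how much a zero fill can distort even a simple summary. The scores below are invented for illustration; the sketch compares the mean of the known scores with the mean after every blank has been replaced by 0.

```python
from statistics import mean

# Hypothetical credit scores invented for illustration; None marks a blank.
scores = [700, 650, None, 720, None, 630]

# Option 1: summarise only the scores we actually know.
known = [score for score in scores if score is not None]
mean_known = mean(known)

# Option 2: pretend every blank is a real score of 0.
zero_filled = [0 if score is None else score for score in scores]
mean_zero_filled = mean(zero_filled)

print(mean_known)        # 675
print(mean_zero_filled)  # 450
```

Two blanks out of six are enough to pull the average down by a third, and a model trained on the zero-filled column would see the youngest customers as the worst credit risks in the data.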

Sometimes zero really is zero. Sometimes missing is simply human error in a data entry job. Sometimes, there’s a much deeper story going on. To quote my professor, “What the heck does zero mean?”

Translated from: https://medium.com/@ivanecky/what-the-heck-does-zero-mean-8c5f42266dc6
