Different Types of Missing Values and Approaches to Deal with Them
Before handling missing values, we need to know which types of missing data exist. Most sources on the web describe three types, but some research papers add a fourth. Let me briefly introduce all of them:
Structurally Missing Data: Suppose we have the results of a university's students for a particular semester, and some of the result values are missing. This can happen when a student dropped out before the exams or was absent. Such a value is structurally missing: it is absent for a logical reason, not by accident. In this case, one simple option is to insert 0 at those missing positions, provided that a score of 0 is a reasonable stand-in for "did not sit the exam" in the analysis at hand.
MCAR (Missing Completely at Random): The missing values are randomly distributed over the entire dataset. The missingness is related neither to the values of the variable in question nor to the values of any other variable under analysis. An example is data missing for respondents whose questionnaires were lost. Say we have complete data for 15 questionnaires and incomplete data for 10. We can compare these two groups on the observed variables with a test such as a t-test; if we find no difference in means between the two groups, we can reasonably assume the data are MCAR, although such a test cannot prove it.
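The t-test comparison described above can be sketched as follows. This is only an illustration on simulated data: the column names age and score, the sample sizes, and the use of Welch's variant of the t-test from SciPy are assumptions, not part of the original article.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical survey data: "age" is fully observed, "score" will have gaps.
df = pd.DataFrame({
    "age": rng.normal(40, 10, size=200),
    "score": rng.normal(70, 8, size=200),
})

# Remove 25 scores completely at random (independent of age and of score).
lost = rng.choice(df.index, size=25, replace=False)
df.loc[lost, "score"] = np.nan

# Split the fully observed variable by whether "score" is missing.
is_missing = df["score"].isna()
age_missing = df.loc[is_missing, "age"]
age_observed = df.loc[~is_missing, "age"]

# Welch's t-test: do the two groups differ in mean age?
t_stat, p_value = stats.ttest_ind(age_missing, age_observed, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A large p-value is consistent with MCAR; it does not prove MCAR.
```

In practice the same comparison would be repeated for every fully observed variable, since a difference on any one of them argues against MCAR.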
MAR (Missing at Random): The data are not missing randomly across the entire dataset, but only within subsamples of it. Data are MAR when the probability of a missing value on a variable is related to some other measured variable in the model, but not to the value of the variable itself. For example, in an IQ dataset, only older people have missing IQ values, so the probability of missing IQ data is related to age. Assuming MAR is difficult to justify, because it cannot be tested from the observed data alone.
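To make the IQ example concrete, here is a small simulated sketch in which missingness depends on age but not on IQ itself. The column names, age bands, and the 40% missingness rate are made-up assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical data: IQ is missing only for respondents aged 60 or older.
age = rng.integers(20, 80, size=500)        # ages 20..79
iq = rng.normal(100, 15, size=500)
p_missing = np.where(age >= 60, 0.4, 0.0)   # depends on age, not on IQ
iq_with_gaps = np.where(rng.random(500) < p_missing, np.nan, iq)
df = pd.DataFrame({"age": age, "iq": iq_with_gaps})

# Share of missing IQ values within each age band.
age_band = pd.cut(df["age"], bins=[19, 39, 59, 79],
                  labels=["20-39", "40-59", "60-79"])
missing_rate = df["iq"].isna().groupby(age_band, observed=True).mean()
print(missing_rate)
```

A missingness rate that varies across groups of an observed variable rules out MCAR. It is compatible with MAR, but it cannot exclude that the missingness also depends on the unobserved IQ values.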
NMAR (Not Missing at Random): The probability of a value being missing depends on the missing value itself or on something that was not measured, so we cannot treat the data as missing at random. In such cases we may be unable to draw reliable conclusions about the missing values from the observed data.
Some common approaches to dealing with these types of missing data:
1. Simple approach: drop the corresponding column or row. In pandas, df.isnull().sum() counts the missing values per column of a DataFrame df, and df.dropna() drops the affected rows (or columns, with axis=1). We use this approach when the dataset is large and the number of missing values in the affected columns or rows is comparatively low.
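A minimal sketch of the drop approach with pandas, using a tiny made-up DataFrame. The 50% threshold in the last step is an arbitrary assumption, not a recommendation from the article.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in columns "a" and "c".
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
    "c": [np.nan, np.nan, np.nan, 1.0],
})

# Count missing values per column before deciding what to drop.
missing_per_column = df.isnull().sum()

# Drop every row that contains at least one missing value.
rows_dropped = df.dropna(axis=0)

# Drop every column that contains at least one missing value.
cols_dropped = df.dropna(axis=1)

# Keep only columns in which at least half of the values are present.
mostly_complete = df.dropna(axis=1, thresh=int(0.5 * len(df)))

print(missing_per_column)
print(rows_dropped.shape, cols_dropped.shape, mostly_complete.shape)
```

Checking the counts first matters: here, dropping rows would leave a single row, while dropping the sparse column "c" keeps most of the data.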
2. Imputation: This fills each missing value with an estimated number. The imputed value won't be exactly right in most cases, but it usually leads to more accurate models than dropping the column or row entirely. Some imputation techniques are:
a) Mean/Median Imputation: As the name suggests, we replace the missing values in a column with the mean or median of that column's observed values. We use this approach when the number of missing observations is low.
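Mean and median imputation can be sketched with pandas as below. The income column and its values are invented for illustration; the outlier is included to show how the two statistics can differ.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values and one large outlier (500.0).
df = pd.DataFrame({"income": [30.0, 35.0, np.nan, 40.0, 500.0, np.nan]})

# mean() and median() skip NaN by default, so they use observed values only.
mean_value = df["income"].mean()
median_value = df["income"].median()

mean_filled = df["income"].fillna(mean_value)
median_filled = df["income"].fillna(median_value)

print(f"mean = {mean_value}, median = {median_value}")
```

With skewed data or outliers, the median is usually the more representative fill value, since the mean is pulled toward the extreme observations.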
b) Multivariate Imputation by Chained Equations (MICE): This assumes that the data are Missing at Random (MAR). It imputes data variable by variable, specifying one imputation model per variable, and uses the other variables in the data as predictors.
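One way to try a chained-equations style imputation in Python is scikit-learn's IterativeImputer, which is inspired by MICE but is not the original algorithm: it is marked experimental and returns a single imputed dataset, not multiple ones. The simulated columns x1, x2, x3 and the 20% missingness rate below are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
n = 200

# Simulated data: x2 is strongly related to x1, x3 is unrelated noise.
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Keep the true values for comparison, then remove about 20% of x2.
true_x2 = df["x2"].copy()
mask = rng.random(n) < 0.2
df.loc[mask, "x2"] = np.nan

# Each column with gaps is modelled from the other columns, in rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(df),
                       columns=df.columns, index=df.index)

mae = (imputed.loc[mask, "x2"] - true_x2[mask]).abs().mean()
print(f"mean absolute error on imputed x2: {mae:.3f}")
```

Because x2 is almost a linear function of x1 here, the model-based estimates land much closer to the true values than a single column mean would.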
3. Random Forest: This is also a non-parametric imputation method, and it is reported to work reasonably well both for data missing at random and for data not missing at random. It uses multiple decision trees to estimate the missing values and outputs OOB (out-of-bag) estimates of the imputation error.
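The idea can be sketched for a single column with scikit-learn's RandomForestRegressor. This is a simplified, one-pass illustration on simulated data, not a full iterative random-forest imputer; the column names and missingness rate are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n = 300

# Simulated data: "y" depends on the fully observed columns x1 and x2.
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 3.0 * df["x1"] - df["x2"] + rng.normal(scale=0.1, size=n)

# Remove about 15% of the "y" values.
mask = rng.random(n) < 0.15
df.loc[mask, "y"] = np.nan

features = ["x1", "x2"]
known = df[~mask]     # rows where y is observed: training data
unknown = df[mask]    # rows where y is missing: to be imputed

# oob_score=True makes the forest report an out-of-bag R^2 estimate.
forest = RandomForestRegressor(n_estimators=200, oob_score=True,
                               random_state=0)
forest.fit(known[features], known["y"])

# Fill the gaps with the forest's predictions.
df.loc[mask, "y"] = forest.predict(unknown[features])
print(f"OOB R^2: {forest.oob_score_:.3f}")
```

The out-of-bag score gives a rough indication of how well the forest predicts the column on rows it did not train on, which serves as a proxy for imputation quality.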
There are various other efficient methods for handling missing values, depending on the scenario and the type of data; I have discussed the most common ones here. I hope this was helpful. Thanks for reading! Good luck, and stay safe!
Translated from: https://medium.com/analytics-vidhya/different-types-of-missing-values-approaches-to-deal-with-them-1f67c617374c