Mean and Median Imputation

Understanding Mean/Median Imputation and Its Implementation Using feature-engine

Mean or Median Imputation:

The mean or median should be calculated on the training set only and then used to replace NA in both the training and test sets. Learning the value from the training set alone keeps test-set information out of the model and helps avoid overfitting.
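
A minimal sketch of that rule, using only the Python standard library. The variable names (`train_age`, `test_age`) and the helper functions are illustrative assumptions, not part of the original post.

```python
import math
from statistics import median

def fit_median(values):
    """Return the median of the observed (non-NaN) values."""
    observed = [v for v in values if not math.isnan(v)]
    return median(observed)

def impute(values, fill_value):
    """Replace every NaN with fill_value; observed values are left untouched."""
    return [fill_value if math.isnan(v) else v for v in values]

nan = float("nan")
train_age = [22.0, 35.0, nan, 41.0, 29.0, nan, 58.0]  # made-up training data
test_age = [nan, 33.0, 47.0, nan]                     # made-up test data

# Learn the statistic from the training set only.
train_median = fit_median(train_age)

# Reuse the same learned value for both sets; the test set is never used to compute it.
train_imputed = impute(train_age, train_median)
test_imputed = impute(test_age, train_median)
```

Here the test set is filled with the training median (35.0) even though its own observed values would give a different median.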

Mean/Median Imputation: Definition:

Mean/median imputation consists of replacing all occurrences of missing values (NA) within a variable with the mean or median of that variable.

Which variables can I impute with Mean/Median Imputation?

· The mean and median can only be calculated on numerical variables, so these methods are suitable for continuous and discrete numerical variables only.

Assumptions:

1. Data is missing completely at random (MCAR).

2. The missing observations most likely look like the majority of the observations in the variable (that is, like the mean or median).

3. If data is missing completely at random, it is fair to assume that the missing values are close to the mean or median of the distribution, since these represent the average or typical observation.

Advantages:

· Easy to implement.

· Fast way of obtaining complete datasets.

· Can be integrated into production (during model deployment).

Limitations:

· Distortion of the original variable distribution.

· Distortion of the original variance.

· Distortion of the covariance with the remaining variables of the dataset.

When NA is replaced with the mean or median, the variance of the variable is distorted if the number of NA is large relative to the total number of observations, which leads to underestimation of the variance.
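
The effect can be checked on a small made-up sample. In the sketch below (the `income` values are invented for illustration), each imputed value sits exactly at the mean, so it adds nothing to the sum of squared deviations while still increasing the number of observations.

```python
import math
from statistics import mean, variance

nan = float("nan")
# Made-up variable with 4 of 10 values missing.
income = [30.0, 45.0, nan, 52.0, nan, 38.0, nan, 60.0, 45.0, nan]

observed = [v for v in income if not math.isnan(v)]
fill = mean(observed)
imputed = [fill if math.isnan(v) else v for v in income]

var_observed = variance(observed)  # sample variance of the 6 observed values
var_imputed = variance(imputed)    # sample variance of all 10 values after imputation

# With mean imputation the sample variance shrinks by (n_observed - 1) / (n_total - 1).
shrink_factor = (len(observed) - 1) / (len(imputed) - 1)
```

For this sample the variance drops from 109.6 to roughly 60.9, a factor of 5/9, which is why a large share of missing values leads to a clear underestimate.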

In addition, estimates of covariance and correlation with other variables in the dataset may be affected. Mean/median imputation may alter intrinsic correlations, since the mean/median value that now replaces the missing data will not necessarily preserve the relationship with the remaining variables.
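
As a rough illustration, the sketch below starts from a hypothetical pair of variables where `y` is exactly twice `x`, blanks out two values of `y`, and fills them with the mean. The data and the `sample_cov` helper are assumptions made for this example only.

```python
import math
from statistics import mean

def sample_cov(xs, ys):
    """Sample covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

nan = float("nan")
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_full = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]   # hypothetical complete data, y = 2x
y_missing = [nan, 4.0, 6.0, 8.0, 10.0, nan]  # the same variable with two gaps

fill = mean(v for v in y_missing if not math.isnan(v))
y_imputed = [fill if math.isnan(v) else v for v in y_missing]

cov_full = sample_cov(x, y_full)
cov_imputed = sample_cov(x, y_imputed)
```

The imputed rows carry the same `y` value regardless of `x`, so they contribute nothing to the cross-products and the covariance falls from 7.0 to 2.0 in this toy case.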

Finally, concentrating all missing values at the mean/median value may cause observations that are common occurrences in the distribution to be picked up as outliers.
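
A deliberately exaggerated sketch of this effect, assuming outliers are flagged with the common 1.5 × IQR rule (the original post does not say which rule it has in mind). Half of the made-up values are missing, so imputation piles them onto the median and the interquartile range collapses.

```python
import math
from statistics import median, quantiles

def iqr_outliers(values):
    """Return the values that fall outside the 1.5 * IQR fences."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

nan = float("nan")
observed = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0]
with_gaps = observed + [nan] * 9          # made-up variable, 50% missing

fill = median(observed)
imputed = [fill if math.isnan(v) else v for v in with_gaps]

outliers_before = iqr_outliers(observed)  # nothing is flagged in the observed data
outliers_after = iqr_outliers(imputed)    # ordinary values are now flagged
```

Before imputation no value is flagged; afterwards 10, 20, 30, 70, 80 and 90 all fall outside the fences, even though they were unremarkable in the observed data.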

When to use mean/median imputation?

· Data is missing completely at random.

· No more than 5% of the variable contains missing data.

· In theory, the above conditions should be met to minimize the impact of this imputation technique. In practice, mean/median imputation is very commonly used even when data is not MCAR and there are many missing values, because the technique is so simple.

Typically, mean/median imputation is done together with adding a binary “missing indicator” variable to capture those observations where the data was missing.

If the data were missing completely at random, this would be captured by the mean/median imputation, and if they were not, this would be captured by the additional “missing indicator” variable. Both methods are extremely straightforward to implement and are therefore a top choice in data science competitions.
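
The combination might look like the following sketch. The column name `age_na` is a hypothetical naming choice, and for brevity the median is computed on the same rows it fills; in a real workflow it would come from the training set.

```python
import math
from statistics import median

nan = float("nan")
rows = [{"age": 22.0}, {"age": nan}, {"age": 35.0}, {"age": nan}, {"age": 41.0}]

observed = [r["age"] for r in rows if not math.isnan(r["age"])]
fill = median(observed)

for r in rows:
    was_missing = math.isnan(r["age"])
    # Record missingness first, so the information survives the imputation.
    r["age_na"] = 1 if was_missing else 0
    if was_missing:
        r["age"] = fill

ages = [r["age"] for r in rows]
indicators = [r["age_na"] for r in rows]
```

A model can then use `age_na` to learn whether missingness itself carries signal, which the imputed value alone cannot express.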

Note the following:

1. If a variable is normally distributed, the mean, median, and mode are approximately the same, so replacing missing values with the mean or with the median is equivalent. Replacing missing data with the mode is not common practice for numerical variables.

2. If the variable is skewed, the mean is biased by the values at the far end of the distribution. The median is therefore a better representation of the majority of the values in the variable.
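
The second point is easy to see on a small, made-up right-skewed sample, where a single large value drags the mean well above almost every observation:

```python
from statistics import mean, median

# Made-up right-skewed variable: seven similar values and one very large one.
income = [28.0, 30.0, 31.0, 33.0, 35.0, 36.0, 38.0, 400.0]

sample_mean = mean(income)      # pulled upward by the single large value
sample_median = median(income)  # stays in the middle of the bulk of the data

# Number of observations that lie below the mean.
below_mean = sum(1 for v in income if v < sample_mean)
```

Here the mean is 78.875 and the median is 34.0; seven of the eight values lie below the mean, so filling gaps with the mean would place the imputed observations where almost no real observation is.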

Implementation
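
As a rough sketch of the fit/transform pattern that imputation transformers typically follow, the class below learns one fill value per variable and reuses it on new data. `SimpleMeanMedianImputer` is a hypothetical stand-in written with the standard library; it is not feature-engine's own class or API, and the column names and values are invented.

```python
import math
from statistics import mean, median

class SimpleMeanMedianImputer:
    """Hypothetical fit/transform imputer for columns stored as lists of floats."""

    def __init__(self, method="median", variables=None):
        if method not in ("mean", "median"):
            raise ValueError("method must be 'mean' or 'median'")
        self.method = method
        self.variables = variables  # None means: impute every column
        self.fill_values_ = {}

    def fit(self, data):
        """Learn one fill value per variable from the observed values."""
        stat = mean if self.method == "mean" else median
        for name in self.variables or list(data):
            observed = [v for v in data[name] if not math.isnan(v)]
            self.fill_values_[name] = stat(observed)
        return self

    def transform(self, data):
        """Return a copy of data with NaN replaced by the learned values."""
        out = {name: list(values) for name, values in data.items()}
        for name, fill in self.fill_values_.items():
            out[name] = [fill if math.isnan(v) else v for v in out[name]]
        return out

nan = float("nan")
train = {"age": [22.0, nan, 35.0, 41.0], "fare": [7.0, 8.0, nan, 72.0]}
test = {"age": [nan, 30.0], "fare": [nan, 10.0]}

imputer = SimpleMeanMedianImputer(method="median").fit(train)  # fit on train only
train_t = imputer.transform(train)
test_t = imputer.transform(test)
```

Keeping the learned values in `fill_values_` is what makes the step usable at deployment time: the same numbers can be applied to any later batch without refitting.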

Let’s discuss in the comments if you find anything wrong in the post or if you have anything to add. Thanks.

Translated from: https://medium.com/analytics-vidhya/feature-engineering-part-1-mean-median-imputation-761043b95379
