Multiple Imputation: Mean Imputation
Understanding Mean/Median Imputation and Its Implementation Using feature-engine
Mean or Median Imputation:
The mean or median should be calculated only on the training set and then used to replace NA in both the training and test sets. Fitting on the training set alone keeps test-set information out of the model and helps avoid over-fitting.
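The train-only rule can be sketched in plain Python as follows. This is a minimal illustration, not a library implementation: the helper names `fit_median` and `impute`, the toy values, and the use of `None` to mark NA are all assumptions made for the example.

```python
from statistics import median


def fit_median(train_values):
    """Learn the median from the observed (non-None) training values."""
    observed = [v for v in train_values if v is not None]
    return median(observed)


def impute(values, fill_value):
    """Replace every None with the value learned on the training set."""
    return [fill_value if v is None else v for v in values]


# Hypothetical toy data; None marks a missing value (NA).
train = [22.0, None, 35.0, 58.0, None, 41.0]
test = [None, 30.0, 64.0]

fill = fit_median(train)             # learned on the training set only
train_imputed = impute(train, fill)
test_imputed = impute(test, fill)    # the test set reuses the train median
```

The test set never contributes to `fill`, so nothing about the held-out data leaks into the imputation.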
Mean/Median Imputation: Definition:
Mean/median imputation consists of replacing all occurrences of missing values (NA) within a variable with the mean or median of that variable.
Which variables can I impute with mean/median imputation?
· The mean and median can only be calculated on numerical variables; therefore, these methods are suitable for continuous and discrete numerical variables only.

Assumptions:
1. Data is missing completely at random (MCAR).
2. The missing observations most likely look like the majority of the observations in the variable (that is, like the mean/median).
3. If data is missing completely at random, it is fair to assume that the missing values are most likely close to the mean or median of the distribution, as these represent the typical (central) observation.
Advantages:
- Easy to implement.
- Fast way of obtaining complete datasets.
- Can be integrated into production (during model deployment).
Limitations:
- Distortion of the original variable distribution.
- Distortion of the original variance.
- Distortion of the covariance with the remaining variables of the dataset.

When NA is replaced with the mean or median, the variance of the variable is distorted if the number of NA is large relative to the total number of observations, leading to an underestimation of the variance.
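A small numeric sketch of the variance shrinkage, using made-up values. With mean imputation the sum of squared deviations stays the same while the number of observations grows, so the population variance is scaled by n_observed / (n_observed + n_missing).

```python
from statistics import mean, pvariance

# Hypothetical variable: 5 observed values and 5 missing ones.
observed = [2.0, 4.0, 6.0, 8.0, 10.0]
n_missing = 5

fill = mean(observed)                      # mean of the observed values
imputed = observed + [fill] * n_missing    # every NA becomes the mean

var_before = pvariance(observed)           # population variance, observed only
var_after = pvariance(imputed)             # population variance after imputation

# Half of the values are imputed, so the variance is halved.
shrink_factor = var_after / var_before
```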
In addition, estimates of covariance and correlation with other variables in the dataset may be affected. Mean/median imputation may alter intrinsic correlations, since the mean/median value that now replaces the missing data will not necessarily preserve the relationship with the remaining variables.
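The effect on correlation can be simulated with toy data. In this sketch `x_true` is a hypothetical, fully known variable that is perfectly correlated with `y`; some of its values are then hidden and mean-imputed. In real data the true values are of course unknown, so this is only an illustration, and `pearson` is a hand-written helper rather than a library function.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x_true = [2.0 * v for v in y]          # x = 2y, so the correlation is 1

# Hide three of the x values (None marks NA) and impute them with the mean.
x_with_na = [2.0, 4.0, None, None, 10.0, 12.0, None, 16.0]
fill = mean(v for v in x_with_na if v is not None)
x_imputed = [fill if v is None else v for v in x_with_na]

r_true = pearson(x_true, y)
r_imputed = pearson(x_imputed, y)      # weaker than r_true
```

The imputed points all sit at the same x value regardless of their y value, which is what weakens the linear relationship.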
Finally, concentrating all missing values at the mean/median value may cause observations that are common occurrences in the distribution to be flagged as outliers.
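A deliberately extreme sketch of this outlier effect, assuming the common 1.5 × IQR rule as the outlier criterion (the text does not name a specific rule) and a made-up variable in which 60% of the values are missing.

```python
from statistics import median, quantiles


def iqr_outliers(values):
    """Return the values outside the 1.5 * IQR fences."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical variable: 20 observed values and 30 missing ones.
observed = [float(v) for v in range(1, 21)]
n_missing = 30

fill = median(observed)
imputed = observed + [fill] * n_missing

outliers_before = iqr_outliers(observed)   # nothing is flagged
outliers_after = iqr_outliers(imputed)     # every originally observed value is flagged
```

After imputation both quartiles fall on the imputed value, the interquartile range collapses to zero, and perfectly ordinary observations end up outside the fences.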
When to use mean/median imputation?
· Data is missing completely at random.
· No more than 5% of the observations in the variable are missing.
· In theory, the above conditions should be met to minimize the impact of this imputation technique. In practice, mean/median imputation is very commonly used even when data is not MCAR and there are many missing values, because the technique is so simple.
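The 5% guideline can be checked per variable before imputing. The following is a minimal sketch; the column names, values, and the helper `missing_fraction` are hypothetical, and `None` stands in for NA.

```python
def missing_fraction(values):
    """Share of entries that are missing (None) in a variable."""
    return sum(v is None for v in values) / len(values)


# Hypothetical columns; None marks a missing value (NA).
columns = {
    "age": [30.0] * 24 + [None],    # 1 of 25 missing
    "fare": [7.25, None, 8.05, None, 13.0, None, 26.55, 8.46, 51.86, 21.08],
}

THRESHOLD = 0.05  # the "no more than 5%" guideline

within_guideline = {
    name: missing_fraction(values) <= THRESHOLD
    for name, values in columns.items()
}
```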
Typically, mean/median imputation is done together with adding a binary "missing indicator" variable to capture those observations where the data was missing.
If the data were missing completely at random, this would be captured by the mean/median imputation; if not, it would be captured by the additional "missing indicator" variable. Both methods are extremely straightforward to implement and are therefore a top choice in data science competitions.
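Combining the two steps might look like the sketch below. The helper `impute_with_indicator` and the toy values are assumptions for illustration; the indicator is built from the original data, before the NA values are overwritten.

```python
from statistics import median


def impute_with_indicator(values, fill_value):
    """Return (imputed values, binary indicator with 1 where the value was missing)."""
    indicator = [1 if v is None else 0 for v in values]
    imputed = [fill_value if v is None else v for v in values]
    return imputed, indicator


# Hypothetical training variable; None marks a missing value (NA).
train = [22.0, None, 35.0, 58.0, None, 41.0]
fill = median([v for v in train if v is not None])

train_imputed, train_was_missing = impute_with_indicator(train, fill)
```

A model can then use `train_was_missing` as an extra feature alongside the imputed variable.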
Note the following:
1. If a variable is normally distributed, the mean, median, and mode are approximately the same. Therefore, replacing missing values with the mean or with the median is roughly equivalent. Replacing missing data with the mode is not common practice for numerical variables.

2. If the variable is skewed, the mean is biased by the values at the far end of the distribution. Therefore, the median is a better representation of the majority of the values in the variable.
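The second note can be seen on a small right-skewed sample. The numbers are made up for illustration: a single large value pulls the mean well above most observations, while the median stays among them.

```python
from statistics import mean, median

# Hypothetical right-skewed variable: one value sits far out in the tail.
incomes = [20.0, 22.0, 25.0, 27.0, 30.0, 32.0, 35.0, 400.0]

mean_value = mean(incomes)
median_value = median(incomes)

# How many observations lie below each candidate fill value?
n_below_mean = sum(v < mean_value for v in incomes)
n_below_median = sum(v < median_value for v in incomes)
```

Here seven of the eight observations lie below the mean, so filling NA with the mean would place imputed values where almost no real observation sits.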

Implementation
Let's discuss in the comments if you find anything wrong in the post or if you have anything to add :P Thanks.

Translated from: https://medium.com/analytics-vidhya/feature-engineering-part-1-mean-median-imputation-761043b95379