A Quick Guide on Missing Data Imputation Techniques in Python (2020)

Most machine learning algorithms expect complete, clean, noise-free datasets. Unfortunately, real-world datasets are messy and have multiple missing cells, and in such cases handling missing data becomes quite complex.

Therefore in today’s article, we are going to discuss some of the most effective and indeed easy-to-use data imputation techniques which can be used to deal with missing data.

So without any further delay, let’s get started.

[Image: cover photo. Copyright © Ampcool22 | Dreamstime.com]

What is Data Imputation?

Data imputation is a method in which the missing values in any variable or data frame (in machine learning) are filled with some values so that the task can be performed. Using this method, the sample size remains the same; only the blanks that were missing are now filled with values. This method is easy to use, but it reduces the variance of the dataset.

Why Data Imputation?

There can be various reasons for imputing data. Many real-world datasets (not talking about CIFAR or MNIST) contain missing values, which can appear in any form such as blanks, NaN, 0s, arbitrary integers, or categorical symbols. Instead of just dropping the rows or columns containing the missing values, which comes at the price of losing data that may be valuable, a better strategy is to impute the missing values.

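To make that trade-off concrete, here is a minimal sketch on a small, made-up pandas dataframe (the column names and values are hypothetical) contrasting dropping incomplete rows with imputing them:

# Contrast dropping rows that contain missing values with imputing them.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47, np.nan],
    "salary": [50000, 62000, np.nan, 81000, 58000],
})

# Dropping every row with a missing value shrinks the sample...
print(df.dropna().shape)   # (2, 2): 3 of the 5 rows are lost

# ...whereas imputing (here with the column mean) keeps every row.
imputed = df.fillna(df.mean(numeric_only=True))
print(imputed.shape)       # (5, 2)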

Having good theoretical knowledge is great, but implementing it in code in a real machine learning project is a completely different thing. You might get different and unexpected results depending on the problem and the dataset. So, as a bonus, I am also adding links to the various courses that have helped me a lot in my journey to learn data science and ML and to experiment with and compare different data imputation strategies, which led me to write this article on the comparisons between different imputation methods.

I am personally a fan of DataCamp; I started with it, and I am still learning through DataCamp and keep taking new courses. They seriously have some exciting courses. Do check them out.

1. handling-missing-data-with-imputations-in-r

2. dealing-with-missing-data-in-python

3. dealing-with-missing-data-in-r

4. building-data-engineering-pipelines-in-python

5. introduction-to-data-engineering

6. data-manipulation-with-python

7. data-manipulation-with-pandas

8. data-manipulation-with-r

P.S.: I am still using DataCamp and keep taking courses in my free time. I really do encourage readers to try out any of the above courses, as per their interest, to get started and build a good foundation in machine learning and data science. The best thing about these DataCamp courses is that they explain things in a very elegant and distinctive manner, with a balanced focus on practical as well as conceptual knowledge, and at the end there is always a case study. That is what I love most about them. These courses are truly worth your time and money. They will surely help you understand and implement deep learning and machine learning in a better way, and implement them in Python or R. I am quite sure you will love them, and I am saying this from my personal opinion and experience.

Also, I have noticed that DataCamp is giving unlimited free access to all of its courses for one week, from the 1st of September through the 8th of September 2020, 12 PM EST. So this would literally be the best time to grab a yearly subscription (which I have), which gives unlimited access to all the courses and other things on DataCamp, and to make fruitful use of your time sitting at home during this pandemic. So go for it, folks, and happy learning!

Coming back to the topic -

The sklearn.impute package provides two types of imputation algorithms for filling in missing values:

1. SimpleImputer

SimpleImputer is used for imputation on univariate datasets; univariate datasets are datasets that have only a single variable. SimpleImputer allows us to impute the values in any feature column using only the non-missing values in that feature column.

Different strategies are provided for imputing data, such as imputing with a constant value or using a statistical measure such as the mean, median, or mode to impute each column of missing values.

For categorical data, it supports the 'most_frequent' strategy, which is analogous to the mode for numerical values.

[Image: dataframe with 5 columns]
[Image: number of missing values in each column]
[Image: imputing data using the mean, median, and mode strategies from SimpleImputer]
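
The screenshots above correspond roughly to code like the following minimal sketch (the dataframe and column names are made up for illustration; the mean, median, and most_frequent strategies are all swapped in via the strategy argument):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0, np.nan],
    "weight": [65.0, 72.0, np.nan, 90.0, 58.0],
})

# Count the missing values in each column.
print(df.isnull().sum())

# Numerical columns: impute with the column mean
# (the median works the same way, just pass strategy="median").
mean_imputer = SimpleImputer(strategy="mean")
df_mean = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns)

# Categorical columns: impute with the most frequent value (the mode).
cities = pd.DataFrame({"city": ["Delhi", np.nan, "Delhi", "Mumbai", np.nan]})
mode_imputer = SimpleImputer(strategy="most_frequent")
cities_filled = pd.DataFrame(mode_imputer.fit_transform(cities),
                             columns=cities.columns)

# A constant fill value is also supported.
const_imputer = SimpleImputer(strategy="constant", fill_value=0)
df_const = pd.DataFrame(const_imputer.fit_transform(df), columns=df.columns)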

2. IterativeImputer

IterativeImputer is used for imputation on multivariate datasets; multivariate datasets are datasets that have more than one variable or feature column per observation. IterativeImputer allows us to make use of the entire set of available feature columns to impute the missing values.

In IterativeImputer, each feature with missing values is modeled as a function of the other features, and that model is used for imputation. The process is then iterated in a loop for a number of rounds: at each step, a feature column is selected as the output y, the other feature columns are treated as inputs X, a regressor is fit on (X, y) for the known values of y, and it is used to predict the missing values of y.

The same process is repeated for each feature column in the loop, and the average of the regression predictions is taken to impute the missing values for the data points.

[Image: imputing data using IterativeImputer]
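
As a minimal sketch of how this looks in code (the array is made up; note that IterativeImputer is still flagged as experimental in scikit-learn and has to be enabled explicitly before import):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [1.0,    2.0,    np.nan],
    [3.0,    np.nan, 6.0],
    [np.nan, 8.0,    9.0],
    [10.0,   11.0,   12.0],
])

# At each round, one column is treated as the target y and the remaining
# columns as inputs; a regressor (BayesianRidge by default) is fit on the
# rows where y is known and used to predict the missing entries of y.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)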

Missingpy library

Missingpy is a Python library used for the imputation of missing values. Currently, it supports a K-Nearest Neighbours based imputation technique and MissForest, a Random Forest based imputation technique.

To install the missingpy library, you can type the following on the command line:

pip install missingpy

3. KNNImputer

KNNImputer is a multivariate data imputation technique that fills in missing values using the K-Nearest Neighbours approach. Each missing value is filled with the mean value from the n nearest neighbours found in the training set, either weighted or unweighted.

If a sample has more than one feature missing, the neighbours used for that sample can be different for each feature, and if fewer than the specified n_neighbors neighbours have a defined distance in the training set, the training-set average for that feature is used during imputation.

Nearest neighbours are selected on the basis of a distance metric; by default it is set to the Euclidean distance, and n_neighbors specifies how many neighbours to consider at each step.

[Image: imputing data using KNN from missingpy]
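
A minimal sketch of KNN-based imputation is shown below; I am assuming the scikit-learn-style interface that missingpy documents, and sklearn.impute.KNNImputer can be used in the same way if missingpy is not available:

import numpy as np
from missingpy import KNNImputer  # sklearn.impute.KNNImputer offers the same interface

X = np.array([
    [1.0,    2.0,    np.nan],
    [3.0,    4.0,    3.0],
    [np.nan, 6.0,    5.0],
    [8.0,    8.0,    7.0],
])

# Each missing entry is replaced by the (optionally weighted) mean of that
# feature over the n_neighbors closest rows, measured by the distance metric.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X)
print(X_filled)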

4. MissForest

It is another technique used to fill in missing values, using Random Forests in an iterated fashion. The candidate column is selected as the column with the fewest missing values among all the columns that contain missing values.

In the first step, all the other (non-candidate) columns that have missing values are filled with the mean for numerical columns and the mode for categorical columns. After that, the imputer fits a random forest model with the candidate column as the outcome (target) variable and the remaining columns as independent variables, and then fills the missing values in the candidate column using the predictions from the fitted Random Forest model.

The imputer then moves on: the column with the second fewest missing values is selected as the next candidate, and the process repeats itself for each column with missing values.

[Image: imputing data using MissForest from missingpy]
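
And a minimal, hedged sketch of MissForest from missingpy (the array is made up and the parameters are left at their documented defaults; treat the exact signature as an assumption):

import numpy as np
from missingpy import MissForest

X = np.array([
    [1.0,    2.0,    np.nan],
    [3.0,    np.nan, 6.0],
    [np.nan, 8.0,    9.0],
    [10.0,   11.0,   12.0],
])

# Columns are imputed in order of how few missing values they have; for each
# candidate column, a Random Forest is fit on the remaining columns and its
# predictions fill the candidate column's missing entries.
imputer = MissForest(random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)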

Further Readings

FancyImpute:

IterativeImputer was originally a part of fancyimpute but was later merged into sklearn. Apart from IterativeImputer, fancyimpute has many different algorithms that can be helpful for imputing missing values. A few of them are not very common in industry but have proven their worth in particular projects, which is why I am not including them in today's article. You can study them here.

AutoImpute:

It is yet another Python package for the analysis and imputation of missing values in datasets. It provides various utility functions for examining patterns in missing values and offers imputation methods for continuous, categorical, and time-series data. It also supports both multiple and single imputation frameworks. You can study them here.

If you enjoyed reading this article, I am sure that we share similar interests and are/will be in similar industries. So let’s connect via LinkedIn and Github. Please do not hesitate to send a contact request!

Translated from: https://medium.com/analytics-vidhya/a-quick-guide-on-missing-data-imputation-techniques-in-python-2020-5410f3df1c1e
