An Extensive Step-by-Step Guide for Data Preparation

Table of Contents

  1. Introduction
  2. What is Data Preparation?
  3. Exploratory Data Analysis (EDA)
  4. Data Preprocessing
  5. Data Splitting

Introduction

Before we get into this, I want to make it clear that there is no rigid process when it comes to data preparation. How you prepare one set of data will most likely be different from how you prepare another. Therefore, this guide aims to provide an overarching framework that you can refer to when preparing any particular set of data.

Before we get into the guide, I should probably go over what data preparation is…

What is Data Preparation?

Data preparation is the step after data collection in the machine learning life cycle: the process of cleaning and transforming the raw data you collected. By doing so, you'll have a much easier time when it comes to analyzing and modeling your data.

There are three main parts to data preparation that I'll go over in this article:

  1. Exploratory Data Analysis (EDA)

  2. Data preprocessing

  3. Data splitting

1. Exploratory Data Analysis (EDA)

Exploratory data analysis, or EDA for short, is exactly what it sounds like: exploring your data. In this step, you're simply getting an understanding of the data that you're working with. In the real world, datasets are rarely as clean or intuitive as Kaggle datasets.

The more you explore and understand the data you're working with, the easier data preprocessing will be.

Below is a list of things that you should consider in this step:

Feature and Target Variables

Determine what the feature (input) variables are and what the target variable is. Don't worry about determining the final input variables yet, but make sure you can identify both types of variables.

Data Types

Figure out what type of data you're working with. Is it categorical, numerical, or neither? This is especially important for the target variable, as its data type will narrow down which machine learning models you may want to use. Pandas functions like df.describe() and df.dtypes are useful here.
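As a quick sketch of this step (the columns here are made up purely for illustration), pandas makes it easy to inspect types and summary statistics:

```python
import pandas as pd

# A small, hypothetical dataset for illustration.
df = pd.DataFrame({
    "age": [23, 35, 41, 29],           # numerical
    "city": ["NY", "LA", "NY", "SF"],  # categorical
    "income": [40000.0, 85000.0, 72000.0, 51000.0],
})

print(df.dtypes)                  # data type of each column
print(df.describe())              # count/mean/std/quartiles for numeric columns
print(df["city"].value_counts())  # distribution of a categorical feature
```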

Check for Outliers

An outlier is a data point that differs significantly from other observations. In this step, you'll want to identify outliers and try to understand why they're in the data. Depending on the reason they're there, you may decide to remove them from the dataset or keep them. There are a couple of ways to identify outliers:

  1. Z-score/standard deviations: if we know that 99.7% of the data in a normally distributed dataset lies within three standard deviations of the mean, then we can calculate the size of one standard deviation, multiply it by 3, and identify the data points outside of this range. Likewise, we can calculate the z-score of a given point, and if it's beyond +/- 3, then it's an outlier. Note that there are a few contingencies to consider when using this method: the data must be normally distributed, it's not applicable to small datasets, and the presence of too many outliers can throw off the z-scores.

  2. Interquartile Range (IQR): the IQR, the concept used to build boxplots, can also be used to identify outliers. The IQR is the difference between the 3rd quartile (Q3) and the 1st quartile (Q1). A point is flagged as an outlier if it is less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR. For a normal distribution, this corresponds to approximately 2.7 standard deviations.
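Both checks can be sketched in a few lines of pandas. The toy series below even demonstrates the small-sample caveat from point 1: the extreme value's z-score stays under 3, while the IQR rule catches it.

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])  # 95 is an obvious outlier

# 1. Z-score method: flag points more than 3 standard deviations from the mean.
# With only 8 points, the outlier itself inflates the std, so its z-score is
# only ~2.5 and nothing is flagged -- the small-dataset caveat in action.
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

# 2. IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(z_outliers.tolist())    # []
print(iqr_outliers.tolist())  # [95]
```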

Ask Questions

There's no doubt that you'll have questions about the data that you're working with, especially for a dataset outside of your domain knowledge. For example, when Kaggle ran a competition on NFL analytics and injuries, I had to do some research to understand what the different positions were and what function they served for the team.

2. Data Preprocessing

Once you understand your data, the majority of your time as a data scientist is spent on this step: data preprocessing. This is when you manipulate the data so that it can be modeled properly. Like I said before, there is no universal way to go about this. However, there are a number of essential things you should consider, which we'll go through below.

Feature Imputation

Feature imputation is the process of filling in missing values. This is important because most machine learning models don't work when there is missing data in the dataset.

One of the main reasons that I wanted to write this guide is specifically for this step. Many articles say that you should default to filling missing values with the mean or simply removing the row, and this is not necessarily true.

Ideally, you want to choose the method that makes the most sense. For example, if you were modeling people's age and income, it wouldn't make sense for a 14-year-old to be imputed with the national average salary.

All things considered, there are a number of ways you can deal with missing values:

  • Single value imputation: replacing missing values with the mean, median, or mode of a column.

  • Multiple value imputation: modeling the features that have missing data and imputing the missing values with what your model predicts.

  • K-nearest neighbors: filling in data with a value from another, similar sample.

  • Deleting the row: this isn't an imputation technique, but it tends to be okay when the sample size is large enough that you can afford to lose rows.

  • Others include: random imputation, moving window, most frequent, etc.
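A minimal sketch of single-value and k-nearest-neighbor imputation using scikit-learn (the columns and values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 45, 29],
    "income": [40000, 55000, 61000, np.nan, 47000],
})

# Single value imputation: fill each column's NaNs with the column median.
median_imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns
)

# K-nearest neighbors: fill each NaN from the k most similar rows.
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)

# Deleting the row is simply df.dropna() -- acceptable with a large sample.
print(median_imputed)
print(knn_imputed)
```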

Feature Encoding

Feature encoding is the process of turning values (e.g. strings) into numbers. This is because a machine learning model requires all values to be numbers.

There are a few ways that you can go about this:

  1. Label encoding: label encoding simply converts a feature's non-numerical values into numerical values, whether the feature is ordinal or not. For example, if a feature called car_colour had distinct values of red, green, and blue, then label encoding would convert these values to 1, 2, and 3 respectively. Be wary when using this method, because while some ML models will be able to make sense of the encoding, others won't.

  2. One-hot encoding (aka get_dummies): one-hot encoding works by creating a binary feature (1/0) for each non-numerical value of a given feature. Reusing the example above, if we had a feature called car_colour, then one-hot encoding would create three features called car_colour_red, car_colour_green, and car_colour_blue, each holding a 1 or 0 indicating whether that colour applies.
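Both encodings can be sketched with pandas alone (scikit-learn's LabelEncoder and OneHotEncoder do the same jobs); note that pandas assigns label codes alphabetically here, not in order of appearance:

```python
import pandas as pd

df = pd.DataFrame({"car_colour": ["red", "green", "blue", "red"]})

# Label encoding: map each category to an integer code (alphabetical order:
# blue=0, green=1, red=2).
df["car_colour_label"] = df["car_colour"].astype("category").cat.codes

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["car_colour"], prefix="car_colour")
print(one_hot.columns.tolist())
# ['car_colour_blue', 'car_colour_green', 'car_colour_red']
```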

Feature Normalization

When numerical values are on different scales, e.g. height in centimeters and weight in pounds, most machine learning algorithms don't perform well. The k-nearest neighbors algorithm is a prime example where features on different scales cause problems. Normalizing or standardizing the data can help:

  • Feature normalization rescales the values so that they're within a range of [0, 1].

  • Feature standardization rescales the data to have a mean of 0 and a standard deviation of 1.

Feature Engineering

Feature engineering is the process of transforming raw data into features that better represent the underlying problem that one is trying to solve. There's no specific way to go about this step, but here are some things that you can consider:

  • Converting a DateTime variable to extract just the day of the week, the month of the year, etc.

  • Creating bins or buckets for a variable (e.g. for a height variable: 100–149 cm, 150–199 cm, 200–249 cm, etc.).

  • Combining multiple features and/or values to create a new one. For example, one of the most accurate models for the Titanic challenge engineered a new variable called "Is_women_or_child", which was True if the person was a woman or a child and False otherwise.
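The first two ideas can be sketched in pandas (the column names and dates are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "signup_time": pd.to_datetime(["2021-01-04", "2021-06-15", "2021-12-25"]),
    "height_cm": [120, 176, 203],
})

# Extract parts of a DateTime variable.
df["signup_month"] = df["signup_time"].dt.month
df["signup_dayofweek"] = df["signup_time"].dt.dayofweek  # Monday = 0

# Bin a numerical variable into buckets.
df["height_bucket"] = pd.cut(
    df["height_cm"],
    bins=[100, 150, 200, 250],
    labels=["100-149cm", "150-199cm", "200-249cm"],
)
print(df[["signup_month", "signup_dayofweek", "height_bucket"]])
```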

Feature Selection

Next is feature selection: choosing the most relevant/valuable features of your dataset. There are a few methods that I like to use that you can leverage to help you with selecting your features:

  • Feature importance: some algorithms, like random forests or XGBoost, allow you to determine which features were the most "important" in predicting the target variable's value. By quickly creating one of these models and inspecting feature importances, you'll get an understanding of which variables are more useful than others.

  • Dimensionality reduction: one of the most common dimensionality reduction techniques, Principal Component Analysis (PCA), takes a large number of features and uses linear algebra to reduce them to fewer features.
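A sketch of both approaches on synthetic data (make_classification stands in for your real dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, only 3 of which are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=42)

# Feature importance from a quickly trained random forest.
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
ranked = np.argsort(rf.feature_importances_)[::-1]
print("features ranked by importance:", ranked)

# Dimensionality reduction with PCA: 10 features down to 3 components.
X_reduced = PCA(n_components=3).fit_transform(X)
print(X_reduced.shape)  # (200, 3)
```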

Dealing with Data Imbalances

One other thing that you'll want to consider is data imbalance. For example, if there are 5,000 examples of one class (e.g. not fraudulent) but only 50 examples of another class (e.g. fraudulent), then you'll want to consider one of a few things:

  • Collecting more data: this always works in your favor but is usually not possible or too expensive.

  • Over- or undersampling the data, for example with the imbalanced-learn (scikit-learn-contrib) Python package.
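As a dependency-free sketch (the contrib package above wraps the same idea), random oversampling of the minority class can be done with sklearn.utils.resample; the fraud dataset here is invented:

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "amount": range(105),
    "is_fraud": [0] * 100 + [1] * 5,  # 100 legit rows vs 5 fraudulent rows
})

majority = df[df["is_fraud"] == 0]
minority = df[df["is_fraud"] == 1]

# Oversample the minority class (with replacement) up to the majority size.
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["is_fraud"].value_counts())  # 100 of each class
```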

3. Data Splitting

Last comes splitting your data. I'm just going to give a very generic, generally agreed-upon framework that you can use here.

Typically you'll want to split your data into three sets:

  1. Training set (70–80%): this is what the model learns on.

  2. Validation set (10–15%): the model's hyperparameters are tuned on this set.

  3. Test set (10–15%): finally, the model's final performance is evaluated on this set. If you've prepared the data correctly, the results from the test set should give a good indication of how the model will perform in the real world.
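The three-way split above can be obtained by calling scikit-learn's train_test_split twice (70/15/15 here; the data is a placeholder):

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [i % 2 for i in X]

# First carve out the 70% training set...
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42)

# ...then split the remaining 30% in half: 15% validation, 15% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```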

Thanks for Reading!

I hope you've learned a thing or two from this. By reading this, you should now have a general framework in mind when it comes to data preparation. There are many things to consider, but having resources like this to remind you is always helpful.

If you follow these steps and keep these things in mind, you'll definitely have your data better prepared, and you'll ultimately be able to develop a more accurate model!

Terence Shin

  • Check out my free data science resource with new material every week!

  • If you enjoyed this, follow me on Medium for more.

  • Let's connect on LinkedIn.

Translated from: https://towardsdatascience.com/an-extensive-step-by-step-guide-for-data-preparation-aee4a109051d

