An Extensive Step-by-Step Guide for Data Preparation

Table of Contents

  1. Introduction
  2. What is Data Preparation?
  3. Exploratory Data Analysis (EDA)
  4. Data Preprocessing
  5. Data Splitting

Introduction

Before we get into this, I want to make it clear that there is no rigid process when it comes to data preparation. How you prepare one set of data will most likely be different from how you prepare another. Therefore, this guide aims to provide an overarching framework that you can refer to when preparing any particular set of data.

Before we get into the guide, I should probably go over what data preparation is…

What is Data Preparation?

Data preparation is the step after data collection in the machine learning life cycle: the process of cleaning and transforming the raw data you collected. By doing so, you'll have a much easier time when it comes to analyzing and modeling your data.

There are three main parts to data preparation that I'll go over in this article:

  1. Exploratory Data Analysis (EDA)

  2. Data preprocessing

  3. Data splitting

1. Exploratory Data Analysis (EDA)

Exploratory data analysis, or EDA for short, is exactly what it sounds like: exploring your data. In this step, you're simply getting an understanding of the data that you're working with. In the real world, datasets are rarely as clean or intuitive as Kaggle datasets.

The more you explore and understand the data you're working with, the easier data preprocessing will be.

Below is a list of things that you should consider in this step:

Feature and Target Variables

Determine what the feature (input) variables are and what the target variable is. Don't worry about determining the final input variables yet, but make sure you can identify both types of variables.

Data Types

Figure out what type of data you're working with. Is it categorical, numerical, or neither? This is especially important for the target variable, as its data type will narrow down which machine learning models you may want to use. Pandas functions like df.describe() and df.dtypes are useful here.
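As a quick sketch of this step (the columns here are made up purely for illustration), pandas makes it easy to inspect types and summary statistics:

```python
import pandas as pd

# A small, hypothetical dataset for illustration.
df = pd.DataFrame({
    "age": [23, 35, 41, 29],           # numerical
    "city": ["NY", "LA", "NY", "SF"],  # categorical
    "income": [40000.0, 85000.0, 72000.0, 51000.0],
})

print(df.dtypes)                  # data type of each column
print(df.describe())              # count/mean/std/quartiles for numeric columns
print(df["city"].value_counts())  # distribution of a categorical feature
```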

Check for Outliers

An outlier is a data point that differs significantly from other observations. In this step, you'll want to identify outliers and try to understand why they're in the data. Depending on the reason they're there, you may decide to remove them from the dataset or keep them. There are a couple of ways to identify outliers:

  1. Z-score/standard deviations: if we know that 99.7% of the data in a normally distributed dataset lies within three standard deviations of the mean, then we can calculate the size of one standard deviation, multiply it by 3, and identify the data points outside of this range. Likewise, we can calculate the z-score of a given point, and if it's beyond +/- 3, then it's an outlier. Note that there are a few contingencies to consider when using this method: the data must be normally distributed, it's not applicable to small datasets, and the presence of too many outliers can throw off the z-scores.

  2. Interquartile Range (IQR): the IQR, the concept used to build boxplots, can also be used to identify outliers. The IQR is the difference between the 3rd quartile (Q3) and the 1st quartile (Q1). A point is flagged as an outlier if it is less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR. For a normal distribution, this corresponds to approximately 2.7 standard deviations.
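Both checks can be sketched in a few lines of pandas. The toy series below even demonstrates the small-sample caveat from point 1: the extreme value's z-score stays under 3, while the IQR rule catches it.

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])  # 95 is an obvious outlier

# 1. Z-score method: flag points more than 3 standard deviations from the mean.
# With only 8 points, the outlier itself inflates the std, so its z-score is
# only ~2.5 and nothing is flagged -- the small-dataset caveat in action.
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

# 2. IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(z_outliers.tolist())    # []
print(iqr_outliers.tolist())  # [95]
```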

Ask Questions

There's no doubt that you'll have questions about the data that you're working with, especially for a dataset outside of your domain knowledge. For example, when Kaggle ran a competition on NFL analytics and injuries, I had to do some research to understand what the different positions were and what function they served for the team.

2. Data Preprocessing

Once you understand your data, the majority of your time as a data scientist is spent on this step: data preprocessing. This is when you manipulate the data so that it can be modeled properly. Like I said before, there is no universal way to go about this. However, there are a number of essential things you should consider, which we'll go through below.

Feature Imputation

Feature imputation is the process of filling in missing values. This is important because most machine learning models don't work when there is missing data in the dataset.

One of the main reasons that I wanted to write this guide is specifically for this step. Many articles say that you should default to filling missing values with the mean or simply removing the row, and this is not necessarily true.

Ideally, you want to choose the method that makes the most sense. For example, if you were modeling people's age and income, it wouldn't make sense for a 14-year-old to be imputed with the national average salary.

All things considered, there are a number of ways you can deal with missing values:

  • Single value imputation: replacing missing values with the mean, median, or mode of a column.

  • Multiple value imputation: modeling the features that have missing data and imputing the missing values with what your model predicts.

  • K-nearest neighbors: filling in data with a value from another, similar sample.

  • Deleting the row: this isn't an imputation technique, but it tends to be okay when the sample size is large enough that you can afford to lose rows.

  • Others include: random imputation, moving window, most frequent, etc.
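A minimal sketch of single-value and k-nearest-neighbor imputation using scikit-learn (the columns and values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 45, 29],
    "income": [40000, 55000, 61000, np.nan, 47000],
})

# Single value imputation: fill each column's NaNs with the column median.
median_imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns
)

# K-nearest neighbors: fill each NaN from the k most similar rows.
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)

# Deleting the row is simply df.dropna() -- acceptable with a large sample.
print(median_imputed)
print(knn_imputed)
```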

Feature Encoding

Feature encoding is the process of turning values (e.g. strings) into numbers. This is because a machine learning model requires all values to be numbers.

There are a few ways that you can go about this:

  1. Label encoding: label encoding simply converts a feature's non-numerical values into numerical values, whether the feature is ordinal or not. For example, if a feature called car_colour had distinct values of red, green, and blue, then label encoding would convert these values to 1, 2, and 3 respectively. Be wary when using this method, because while some ML models will be able to make sense of the encoding, others won't.

  2. One-hot encoding (aka get_dummies): one-hot encoding works by creating a binary feature (1/0) for each non-numerical value of a given feature. Reusing the example above, if we had a feature called car_colour, then one-hot encoding would create three features called car_colour_red, car_colour_green, and car_colour_blue, each holding a 1 or 0 indicating whether that colour applies.
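Both encodings can be sketched with pandas alone (scikit-learn's LabelEncoder and OneHotEncoder do the same jobs); note that pandas assigns label codes alphabetically here, not in order of appearance:

```python
import pandas as pd

df = pd.DataFrame({"car_colour": ["red", "green", "blue", "red"]})

# Label encoding: map each category to an integer code (alphabetical order:
# blue=0, green=1, red=2).
df["car_colour_label"] = df["car_colour"].astype("category").cat.codes

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["car_colour"], prefix="car_colour")
print(one_hot.columns.tolist())
# ['car_colour_blue', 'car_colour_green', 'car_colour_red']
```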

Feature Normalization

When numerical values are on different scales, e.g. height in centimeters and weight in pounds, most machine learning algorithms don't perform well. The k-nearest neighbors algorithm is a prime example where features on different scales cause problems. Normalizing or standardizing the data can help:

  • Feature normalization rescales the values so that they're within a range of [0, 1].

  • Feature standardization rescales the data to have a mean of 0 and a standard deviation of 1.

Feature Engineering

Feature engineering is the process of transforming raw data into features that better represent the underlying problem that one is trying to solve. There's no specific way to go about this step, but here are some things that you can consider:

  • Converting a DateTime variable to extract just the day of the week, the month of the year, etc.

  • Creating bins or buckets for a variable (e.g. for a height variable: 100–149 cm, 150–199 cm, 200–249 cm, etc.).

  • Combining multiple features and/or values to create a new one. For example, one of the most accurate models for the Titanic challenge engineered a new variable called "Is_women_or_child", which was True if the person was a woman or a child and False otherwise.
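The first two ideas can be sketched in pandas (the column names and dates are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "signup_time": pd.to_datetime(["2021-01-04", "2021-06-15", "2021-12-25"]),
    "height_cm": [120, 176, 203],
})

# Extract parts of a DateTime variable.
df["signup_month"] = df["signup_time"].dt.month
df["signup_dayofweek"] = df["signup_time"].dt.dayofweek  # Monday = 0

# Bin a numerical variable into buckets.
df["height_bucket"] = pd.cut(
    df["height_cm"],
    bins=[100, 150, 200, 250],
    labels=["100-149cm", "150-199cm", "200-249cm"],
)
print(df[["signup_month", "signup_dayofweek", "height_bucket"]])
```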

Feature Selection

Next is feature selection: choosing the most relevant/valuable features of your dataset. There are a few methods that I like to use that you can leverage to help you with selecting your features:

  • Feature importance: some algorithms, like random forests or XGBoost, allow you to determine which features were the most "important" in predicting the target variable's value. By quickly creating one of these models and inspecting feature importances, you'll get an understanding of which variables are more useful than others.

  • Dimensionality reduction: one of the most common dimensionality reduction techniques, Principal Component Analysis (PCA), takes a large number of features and uses linear algebra to reduce them to fewer features.
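A sketch of both approaches on synthetic data (make_classification stands in for your real dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, only 3 of which are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=42)

# Feature importance from a quickly trained random forest.
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
ranked = np.argsort(rf.feature_importances_)[::-1]
print("features ranked by importance:", ranked)

# Dimensionality reduction with PCA: 10 features down to 3 components.
X_reduced = PCA(n_components=3).fit_transform(X)
print(X_reduced.shape)  # (200, 3)
```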

Dealing with Data Imbalances

One other thing that you'll want to consider is data imbalance. For example, if there are 5,000 examples of one class (e.g. not fraudulent) but only 50 examples of another class (e.g. fraudulent), then you'll want to consider one of a few things:

  • Collecting more data: this always works in your favor but is usually not possible or too expensive.

  • Over- or undersampling the data, for example with the imbalanced-learn (scikit-learn-contrib) Python package.
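As a dependency-free sketch (the contrib package above wraps the same idea), random oversampling of the minority class can be done with sklearn.utils.resample; the fraud dataset here is invented:

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "amount": range(105),
    "is_fraud": [0] * 100 + [1] * 5,  # 100 legit rows vs 5 fraudulent rows
})

majority = df[df["is_fraud"] == 0]
minority = df[df["is_fraud"] == 1]

# Oversample the minority class (with replacement) up to the majority size.
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["is_fraud"].value_counts())  # 100 of each class
```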

3. Data Splitting

Last comes splitting your data. I'm just going to give a very generic, generally agreed-upon framework that you can use here.

Typically you'll want to split your data into three sets:

  1. Training set (70–80%): this is what the model learns on.

  2. Validation set (10–15%): the model's hyperparameters are tuned on this set.

  3. Test set (10–15%): finally, the model's final performance is evaluated on this set. If you've prepared the data correctly, the results from the test set should give a good indication of how the model will perform in the real world.
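The three-way split above can be obtained by calling scikit-learn's train_test_split twice (70/15/15 here; the data is a placeholder):

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [i % 2 for i in X]

# First carve out the 70% training set...
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42)

# ...then split the remaining 30% in half: 15% validation, 15% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```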

Thanks for Reading!

I hope you've learned a thing or two from this. By reading this, you should now have a general framework in mind when it comes to data preparation. There are many things to consider, but having resources like this to remind you is always helpful.

If you follow these steps and keep these things in mind, you'll definitely have your data better prepared, and you'll ultimately be able to develop a more accurate model!

Terence Shin

  • Check out my free data science resource with new material every week!

  • If you enjoyed this, follow me on Medium for more.

  • Let's connect on LinkedIn.

Translated from: https://towardsdatascience.com/an-extensive-step-by-step-guide-for-data-preparation-aee4a109051d

