Understanding Data Preprocessing with the Titanic Dataset

What is Data Pre-Processing?

We know from my last blog that data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares raw data for further processing.

So in this blog we will learn how to implement data pre-processing on a dataset. I have decided to do my implementation using the Titanic dataset, which I downloaded from Kaggle. Here is the link to get this dataset: https://www.kaggle.com/c/titanic-gettingStarted/data

Note: Kaggle provides 2 datasets, a train and a test dataset, so we will use both of them in this process.

What is the expected outcome?

The Titanic shipwreck was a massive disaster, so we will implement data pre-processing on this dataset to learn the number of survivors and their details.

I will show you how to apply data preprocessing techniques on the Titanic dataset, with a tinge of my own ideas added in.

So let’s get started…

Importing all the important libraries

Firstly, after loading the datasets into our system, we will import the libraries that are needed to perform the required functions. In my case I imported the NumPy, Pandas and Matplotlib libraries.

#importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Importing dataset using Pandas

To work with the data, you can load the CSV either in Excel or in pandas. I will load the CSV data in pandas. Then we will also use a function to view that data in the Jupyter notebook.

#importing dataset using pandas
df = pd.read_csv(r'C:\Users\KIIT\Desktop\Internity Internship\Day 4 task\train.csv')
df.shape
df.head()

#Taking a look at the data format below
df.info()

Let's take a look at the data output that we get from the above code snippets:

[Screenshot: df.info() output]

If you carefully observe the pandas summary above, there are 891 rows in total, but Age shows only 714 non-null values (meaning some are missing), Embarked is missing 2 values, and Cabin is missing a lot as well. Object data types are non-numeric, so we have to find a way to encode them as numerical values.
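
A quick way to see exactly how many values each column is missing is pandas' isnull(); a minimal sketch using the df loaded above:

#Counting missing values per column
print(df.isnull().sum())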

Viewing the columns in the particular dataset

We use a function to view all the columns that are being used in this dataset, to get a better sense of the kind of data we are working with.

#Taking a look at all the columns in the data set
print(df.columns)

Defining values for independent and dependent data

Here we will declare the values of X and y for our independent and dependent data.

#independent data
X = df.iloc[:, 1:-1].values
#dependent data (note that Survived is re-extracted explicitly from the frame later on)
y = df.iloc[:, -1].values

Dropping Columns which are not useful

Let's try to drop some of the columns which may not contribute much to our machine learning model, such as Name, Ticket, and Cabin.

So we will drop 3 columns and then we will take a look at the newly generated data.

#Dropping columns which are not useful, so we drop 3 of them here according to our convenience
cols = ['Name', 'Ticket', 'Cabin']
df = df.drop(cols, axis=1)

#Taking a look at the newly formed data format below
df.info()

[Screenshot: df.info() output after dropping the columns]

Dropping rows having missing values

Next, if we want, we can drop all rows in the data that have missing values (NaN). You can do it as the code shows:

#Dropping the rows that have missing values
df = df.dropna()
df.info()

Problem with dropping rows having missing values

After dropping rows with missing values we find that the dataset is reduced from 891 to 712 rows, which means we are wasting data. Machine learning models need data for training to perform well. So we preserve the data and make use of it as much as we can. We will see how later.
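
If you are tempted by dropna(), it helps to quantify the loss first; a minimal sketch (my own addition, meant to be run on the frame before applying the dropna() above):

#Counting how many rows dropna() would discard
lost = len(df) - len(df.dropna())
print(lost, "of", len(df), "rows would be lost")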

Creating Dummy Variables

Now we convert the Pclass, Sex and Embarked values into dummy columns in pandas, and drop the original columns after conversion.

#Creating Dummy Variables
dummies = []
cols = ['Pclass', 'Sex', 'Embarked']
for col in cols:
    dummies.append(pd.get_dummies(df[col]))
titanic_dummies = pd.concat(dummies, axis=1)

On inspecting the result, we see that the three original columns have been transformed into 8 dummy columns, where 1, 2 and 3 represent the passenger class.
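
One caveat: pd.get_dummies(df['Pclass']) names its new columns simply 1, 2 and 3, which is easy to confuse later. Passing a prefix (my own suggestion, not part of the original post) keeps the names self-describing; a sketch:

#Same dummies, but with readable column names such as Pclass_1, Sex_male, Embarked_S
dummies = [pd.get_dummies(df[col], prefix=col) for col in ['Pclass', 'Sex', 'Embarked']]
titanic_dummies = pd.concat(dummies, axis=1)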

And finally we concatenate the dummies to the original data frame, column-wise.

#Combining the original dataset
df = pd.concat((df, titanic_dummies), axis=1)

Now that we have converted the Pclass, Sex and Embarked values into dummy columns, we drop the redundant original columns from the data frame, and take a look at the new dataset.

df = df.drop(['Pclass', 'Sex', 'Embarked'], axis=1)
df.info()

[Screenshot: df.info() output with the dummy columns]

Taking Care of Missing Data

All is good, except Age, which has lots of missing values. Let's compute a median or interpolate() over all the ages and fill in those missing age values. Pandas has an interpolate() function that will replace all the missing NaNs with interpolated values.
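
If you prefer the median route, a one-line fillna() does it; a minimal sketch (the post itself uses interpolate(), shown next):

#Median imputation as an alternative to interpolation
df['Age'] = df['Age'].fillna(df['Age'].median())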

#Taking care of the missing data by interpolate function
df['Age'] = df['Age'].interpolate()

df.info()

[Screenshot: df.info() output after interpolation]

Now let's observe the data columns. Notice that Age has now been filled in with interpolated values.
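
A quick sanity check that no missing ages remain; a minimal sketch:

#Should print 0 after the interpolation above
print(df['Age'].isnull().sum())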

Converting the data frame to NumPy

Now that we have converted all the data to numeric, it's time to prepare the data for machine learning models. This is where scikit-learn and NumPy come into play:

X = Input set with 14 attributes
y = Small y Output, in this case 'Survived'

Now we convert our data frame from pandas to NumPy and assign the input and output.

#Using the concept of survived values, we convert the dataframe to NumPy and view it
X = df.values
y = df['Survived'].values

#Delete the 'Survived' column (index 1) from X so the target is not part of the input
X = np.delete(X, 1, axis=1)

Dividing data set into training set and test set

Now that we are ready with X and y, let's split the dataset into a 70% training set and a 30% test set using scikit-learn's model_selection, as in the code below, followed by the four print functions:

#Dividing data set into training set and test set (Most important step)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
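
The four print functions mentioned above aren't included in the snippet; presumably they just display the four resulting splits, along the lines of this sketch:

#Inspecting the four splits produced by train_test_split
print(X_train)
print(X_test)
print(y_train)
print(y_test)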

Feature Scaling

Feature scaling is an important step of data preprocessing. Feature scaling transforms the data so that it all lies on the same scale, usually about -3 to +3.

In our data set some fields have small values and some fields have large values. If we apply a machine learning model without feature scaling, then the model's predictions will have a high cost (because small values are dominated by large values). So before applying the model we have to perform feature scaling.

We can perform feature scaling in two ways.

I. Standardization: x' = (x - mean(X)) / std(X)

II. Normalization: x' = (x - min(X)) / (max(X) - min(X))
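
Both formulas are easy to implement directly with NumPy; here is a minimal sketch (my own addition, operating column-wise on a 2-D float array). The scikit-learn code below uses StandardScaler, which implements case I; MinMaxScaler is its counterpart for case II.

#Hand-rolled versions of the two scaling formulas (assumes NumPy arrays)
def standardize(x):
    #case I: (x - mean) / std, computed per column
    return (x - x.mean(axis=0)) / x.std(axis=0)

def normalize(x):
    #case II: (x - min) / (max - min), squashes each column into [0, 1]
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))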

#Using the concept of feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])

That’s all for today guys!

This is the final outcome of the whole process. For more blogs like this, stay tuned!

Translated from: https://medium.com/all-about-machine-learning/understanding-data-preprocessing-taking-the-titanic-dataset-ebb78de162e0
