熊貓數據集_熊貓邁向數據科學的第三部分

熊貓數據集

Data is almost never perfect. Data Scientist spend more time in preprocessing dataset than in creating a model. Often we come across scenario where we find some missing data in data set. Such data points are represented with NaN or Not a Number in Pandas. So it is very important that we discover columns with NaN/null values in early stages while analyzing data.

數據幾乎從來都不是完美的。 與創建模型相比,數據科學家在預處理數據集上花費的時間更多。 通常,我們會遇到在數據集中發現一些缺失數據的情況。 此類數據點用NaN表示None Not Number表示 因此,在分析數據的早期發現具有NaN / null值的列非常重要。

We have covered many methods in Pandas library and if you haven’t read previous articles, I recommend you to go through those articles to get in a flow. But if you are following from the beginning then lets get started.

我們已經在Pandas庫中介紹了許多方法,如果您還沒有閱讀過以前的文章,我建議您仔細閱讀這些文章以進行學習。 但是,如果您從頭開始關注,那就開始吧。

In this article, we are going to learn

在本文中,我們將學習

  1. What is NaN ?

    什么是NaN?
  2. How to find NaN in dataset ?

    如何在數據集中找到NaN?
  3. How to deal with NaN as beginner ?

    如何應對NaN作為初學者?
  4. Finally, some methods to make dataframe more readable.

    最后,一些使數據框更具可讀性的方法。

如何在數據集中找到NaN? (How to find NaN in dataset ?)

To check NaN data in a column or in entire dataframe, we use isnull() or isna(). Both of these works as same , so we will use isnull() in this article. If you want to understand why there are two methods for same task, you can learn it here. Lets begin by checking null values in entire dataset.

要檢查列或整個數據框中的NaN數據,我們使用isnull()或isna()。 兩者的工作原理相同,因此我們將在本文中使用isnull() 。 如果您想了解為什么有兩種方法可以完成同一任務,則可以在此處學習 首先檢查整個數據集中的空值。

>> print(titanic_data.info())output :RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Here you can see some valuable information about dataset. But information that we are interested is in Non-Null Count column. It shows number of non-null data points in each column. First line of output shows that there are total 891 entries that is 891 data points. We can also directly check number of non-null entries in each column using count() method as well.

在這里,您可以看到有關數據集的一些有價值的信息。 但是我們感興趣的信息在“ 非空計數”列中。 它顯示每列中非空數據點的數量。 輸出的第一行顯示總共有891個條目,即891個數據點。 我們也可以使用count()方法直接檢查每列中非空條目的數量

>> print(titanic_data.count())output :PassengerId    891
Survived 891
Pclass 891
Name 891
Sex 891
Age 714
SibSp 891
Parch 891
Ticket 891
Fare 891
Cabin 204
Embarked 889
dtype: int64

From here we can conclude that Age, Cabin and Embarked are the columns with null values. There another way to get this result using isnull() method as we discussed earlier.

從這里我們可以得出結論,“ 年齡”,“機艙”和“ 登機”是具有空值的列。 如前所述,還有另一種方法可以使用isnull()方法獲得此結果。

>> print(titanic_data.isnull().any())output :PassengerId    False
Survived False
Pclass False
Name False
Sex False
Age True
SibSp False
Parch False
Ticket False
Fare False
Cabin True
Embarked True
dtype: bool
>> print(titanic_data.isnull().sum())output :PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

As we can see this result is much better if we are solely interested in null values.

如我們所見,如果我們只對null值感興趣,則此結果會更好。

如何應對NaN作為初學者? (How to deal with NaN as beginner ?)

It is important to know number of null values in a column as it can help us understand how to deal with null values. If there are small numbers of null values like in Embarked, then we can remove those entries from dataset. However if most of the values are null like in Cabin, then it is better to skip that column while creating model.

知道一列中空值的數量很重要,因為它可以幫助我們了解如何處理空值。 如果像Embarked中那樣有少量的空值那么我們可以從數據集中刪除這些條目。 但是,如果像Cabin中的大多數值都為空那么在創建模型時最好跳過該列。

There is another case where null values are not large enough to skip the column and small enough to remove entries as in the case of Age here. For such cases we have many ways to deal with null values, but as a beginner we will learn just one trick here and that is to fill it with a value. We will use fillna() method to do that.

在另一種情況下,空值的大小不足以跳過該列,而其大小不足以刪除條目,如此處的Age一樣 。 對于這種情況,我們有很多方法可以處理空值,但作為一個初學者,我們將在這里僅學習一個技巧,那就是用值填充它。 我們將使用fillna()方法來做到這一點。

>> titanic_data.Age.fillna("Unknown", inplace = True)
>> print(titanic_data.Age.isnull().any())output :false
# It is Age column have no null values

We used inplace argument so that changes are implemented in dataframe which is calling the method. If we do not pass this argument or keep it False then changes will not appear in our dataset. We can also check if a specific column have null values in same manner as we did for whole dataset.

我們使用了inplace參數,以便在調用該方法的數據框中實現更改。 如果我們不傳遞此參數或將其保留為False,則更改將不會出現在我們的數據集中。 我們還可以以與整個數據集相同的方式檢查特定列是否具有空值。

We can also replace values in a column which are not NaN using replace() method.

我們還可以使用replace()方法替換非NaN列中的值。

>> titanic_data.Sex.replace("male","M",inplace = True)
>> titanic_data.Sex.replace("female","F",inplace = True)
>> print(titanic_data.Sex)output :0 M
1 F
2 F
3 F
4 M
..
886 M
887 F
888 F
889 M
890 M
Name: Sex, Length: 891, dtype: object

一些使數據集更具可讀性的方法 (Some methods to make Dataset more readable)

  1. rename() : There might be situation, when we realize that column name is not suitable as per our requirement. We can use rename() method to change column name.

    named() :在某些情況下,我們意識到列名不符合我們的要求。 我們可以使用rename()方法來更改列名。

>> titanic_data.rename(columns={"Sex":"Gender"},inplace=True)
>> print(titanic_data.Gender)output :0 M
1 F
2 F
3 F
4 M
..
886 M
887 F
888 F
889 M
890 M
Name: Gender, Length: 891, dtype: object

2. rename_axis() : It is a simple method and as name suggest is used to provide names for axis.

2. named_axis() :這是一種簡單的方法,顧名思義,該名稱用于提供軸的名稱。

>> titanic_data.rename_axis("Sr.No",axis='rows',inplace=True)
>> titanic_data.rename_axis("Catergory",axis='columns',inplace=True)
>> print(titanic_data.head(2))output :Catergory PassengerId Survived Pclass .....
Sr.No
0 1 0 3
1 2 1 1
[2 rows x 12 columns]

With this we come to end of this article and series on Pandas. I believe that methods which we came across in this series are very helpful for analyzing data before we can start training them. However, this is just a small fraction of methods in Pandas library and just a beginning of data exploration and preprocessing. But as a beginner, I think these are enough to get started with Data Science journey. I hope you found this series valuable. Thank you for reading. Keep practicing. Happy Coding ! 😄

這樣,我們就結束了本文和有關熊貓的系列文章的結尾。 我相信本系列中遇到的方法在開始訓練數據之前對分析數據非常有幫助。 但是,這只是Pandas庫中方法的一小部分,也是數據探索和預處理的開始。 但是,作為一個初學者,我認為這些足以開始Data Science之旅。 希望您覺得本系列有價值。 感謝您的閱讀。 保持練習。 編碼愉快! 😄

翻譯自: https://medium.com/swlh/pandas-first-step-towards-data-science-part-3-351321c24cc0

熊貓數據集

本文來自互聯網用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權,不承擔相關法律責任。
如若轉載,請注明出處:http://www.pswp.cn/news/389372.shtml
繁體地址,請注明出處:http://hk.pswp.cn/news/389372.shtml
英文地址,請注明出處:http://en.pswp.cn/news/389372.shtml

如若內容造成侵權/違法違規/事實不符,請聯系多彩編程網進行投訴反饋email:809451989@qq.com,一經查實,立即刪除!

相關文章

Pytorch有關張量的各種操作

一,創建張量 1. 生成float格式的張量: a torch.tensor([1,2,3],dtype torch.float)2. 生成從1到10,間隔是2的張量: b torch.arange(1,10,step 2)3. 隨機生成從0.0到6.28的10個張量 注意: (1).生成的10個張量中包含0.0和6.28&#xff…

mongodb安裝失敗與解決方法(附安裝教程)

安裝mongodb遇到的一些坑 浪費了大量的時間 在此記錄一下 主要是電腦系統win10企業版自帶的防火墻 當然還有其他的一些坑 一般的問題在第6步驟都可以解決,本教程的安裝步驟不夠詳細的話 請自行百度或谷歌 安裝教程很多 我是基于node.js使用mongodb結合Robo 3T數…

【洛谷算法題】P1046-[NOIP2005 普及組] 陶陶摘蘋果【入門2分支結構】Java題解

👨?💻博客主頁:花無缺 歡迎 點贊👍 收藏? 留言📝 加關注?! 本文由 花無缺 原創 收錄于專欄 【洛谷算法題】 文章目錄 【洛谷算法題】P1046-[NOIP2005 普及組] 陶陶摘蘋果【入門2分支結構】Java題解🌏題目…

web性能優化(理論)

什么是性能優化? 就是讓用戶感覺你的網站加載速度很快。。。哈哈哈。 分析 讓我們來分析一下從用戶按下回車鍵到網站呈現出來經歷了哪些和前端相關的過程。 緩存 首先看本地是否有緩存,如果有符合使用條件的緩存則不需要向服務器發送請求了。DNS查詢建立…

python多項式回歸_如何在Python中實現多項式回歸模型

python多項式回歸Let’s start with an example. We want to predict the Price of a home based on the Area and Age. The function below was used to generate Home Prices and we can pretend this is “real-world data” and our “job” is to create a model which wi…

充分利用UC berkeleys數據科學專業

By Kyra Wong and Kendall Kikkawa黃凱拉(Kyra Wong)和菊川健多 ( Kendall Kikkawa) 什么是“數據科學”? (What is ‘Data Science’?) Data collection, an important aspect of “data science”, is not a new idea. Before the tech boom, every industry al…

文本二叉樹折半查詢及其截取值

using System;using System.ComponentModel;using System.Data;using System.Drawing;using System.Text;using System.Windows.Forms;using System.Collections;using System.IO;namespace CS_ScanSample1{ /// <summary> /// Logic 的摘要說明。 /// </summary> …

nn.functional 和 nn.Module入門講解

本文來自《20天吃透Pytorch》 一&#xff0c;nn.functional 和 nn.Module 前面我們介紹了Pytorch的張量的結構操作和數學運算中的一些常用API。 利用這些張量的API我們可以構建出神經網絡相關的組件(如激活函數&#xff0c;模型層&#xff0c;損失函數)。 Pytorch和神經網絡…

10.30PMP試題每日一題

SC>0&#xff0c;CPI<1&#xff0c;說明項目截止到當前&#xff1a;A、進度超前&#xff0c;成本超值B、進度落后&#xff0c;成本結余C、進度超前&#xff0c;成本結余D、無法判斷 答案將于明天和新題一起揭曉&#xff01; 10.29試題答案&#xff1a;A轉載于:https://bl…

02-web框架

1 while True:print(server is waiting...)conn, addr server.accept()data conn.recv(1024) print(data:, data)# 1.得到請求的url路徑# ------------dict/obj d["path":"/login"]# d.get(”path“)# 按著http請求協議解析數據# 專注于web業…

ai驅動數據安全治理_AI驅動的Web數據收集解決方案的新起點

ai驅動數據安全治理Data gathering consists of many time-consuming and complex activities. These include proxy management, data parsing, infrastructure management, overcoming fingerprinting anti-measures, rendering JavaScript-heavy websites at scale, and muc…

從Text文本中讀值插入到數據庫中

/// <summary> /// 轉換數據&#xff0c;從Text文本中導入到數據庫中 /// </summary> private void ChangeTextToDb() { if(File.Exists("Storage Card/Zyk.txt")) { try { this.RecNum.Visibletrue; SqlCeCommand sqlCreateTable…

Dataset和DataLoader構建數據通道

重點在第二部分的構建數據通道和第三部分的加載數據集 Pytorch通常使用Dataset和DataLoader這兩個工具類來構建數據管道。 Dataset定義了數據集的內容&#xff0c;它相當于一個類似列表的數據結構&#xff0c;具有確定的長度&#xff0c;能夠用索引獲取數據集中的元素。 而D…

鐵拳nat映射_鐵拳如何重塑我的數據可視化設計流程

鐵拳nat映射It’s been a full year since I’ve become an independent data visualization designer. When I first started, projects that came to me didn’t relate to my interests or skills. Over the past eight months, it’s become very clear to me that when cl…

Django2 Web 實戰03-文件上傳

作者&#xff1a;Hubery 時間&#xff1a;2018.10.31 接上文&#xff1a;接上文&#xff1a;Django2 Web 實戰02-用戶注冊登錄退出 視頻是一種可視化媒介&#xff0c;因此視頻數據庫至少應該存儲圖像。讓用戶上傳文件是個很大的隱患&#xff0c;因此接下來會討論這倆話題&#…

BZOJ.2738.矩陣乘法(整體二分 二維樹狀數組)

題目鏈接 BZOJ洛谷 整體二分。把求序列第K小的樹狀數組改成二維樹狀數組就行了。 初始答案區間有點大&#xff0c;離散化一下。 因為這題是一開始給點&#xff0c;之后詢問&#xff0c;so可以先處理該區間值在l~mid的修改&#xff0c;再處理詢問。即二分標準可以直接用點的標號…

從數據庫里讀值往TEXT文本里寫

/// <summary> /// 把預定內容導入到Text文檔 /// </summary> private void ChangeDbToText() { this.RecNum.Visibletrue; //建立文件&#xff0c;并打開 string oneLine ""; string filename "Storage Card/YD" DateTime.Now.…

DengAI —如何應對數據科學競賽? (EDA)

了解機器學習 (Understanding ML) This article is based on my entry into DengAI competition on the DrivenData platform. I’ve managed to score within 0.2% (14/9069 as on 02 Jun 2020). Some of the ideas presented here are strictly designed for competitions li…

Pytorch模型層簡單介紹

模型層layers 深度學習模型一般由各種模型層組合而成。 torch.nn中內置了非常豐富的各種模型層。它們都屬于nn.Module的子類&#xff0c;具備參數管理功能。 例如&#xff1a; nn.Linear, nn.Flatten, nn.Dropout, nn.BatchNorm2d nn.Conv2d,nn.AvgPool2d,nn.Conv1d,nn.Co…

有效溝通的技能有哪些_如何有效地展示您的數據科學或軟件工程技能

有效溝通的技能有哪些What is the most important thing to do after you got your skills to be a data scientist? It has to be to show off your skills. Otherwise, there is no use of your skills. If you want to get a job or freelance or start a start-up, you ha…