Pandas Datasets
Data is almost never perfect. Data scientists spend more time preprocessing datasets than building models. We often find that some values in a dataset are missing. Pandas represents such data points as NaN (Not a Number). It is therefore important to find the columns that contain NaN/null values early in the analysis.
We have already covered many methods in the Pandas library. If you haven't read the previous articles, I recommend going through them first to get into the flow. If you have been following from the beginning, let's get started.
In this article, we are going to learn:
- What is NaN?
- How to find NaN in a dataset?
- How to deal with NaN as a beginner?
- Finally, some methods to make a dataframe more readable.
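As a quick illustration of the first point, here is a minimal sketch of how Pandas treats missing entries. The `ages` Series is hypothetical toy data, not part of the Titanic dataset used in the rest of the article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration, not the Titanic dataset).
ages = pd.Series([22.0, np.nan, 38.0, None], name="Age")

# In a float column, pandas stores both np.nan and None as NaN.
missing_mask = ages.isnull()
missing_count = int(missing_mask.sum())

# NaN is not equal to anything, including itself,
# so == cannot be used to find it; use isnull()/isna() instead.
nan_equals_itself = np.nan == np.nan

print(ages)
print(missing_mask.tolist())
print(missing_count)
print(nan_equals_itself)
```

This is why the rest of the article relies on isnull() rather than on equality comparisons.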
How to find NaN in a dataset?
To check for NaN values in a column or in an entire dataframe, we use isnull() or isna(). Both work the same way (isnull() is an alias of isna()), so we will use isnull() in this article. Let's begin by getting an overview of the null values in the entire dataset with info().
>> titanic_data.info()
output :
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Here you can see some valuable information about the dataset. The part we are interested in is the Non-Null Count column, which shows the number of non-null data points in each column. The first line of the output shows that there are 891 entries in total, that is, 891 data points. We can also check the number of non-null entries in each column directly with the count() method.
>> print(titanic_data.count())
output :
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64
From this we can conclude that Age, Cabin and Embarked are the columns with null values, because their counts are below 891. As discussed earlier, we can get the same result with the isnull() method: any() reports whether a column contains at least one null, and sum() counts the nulls in each column.
>> print(titanic_data.isnull().any())
output :
PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool
>> print(titanic_data.isnull().sum())
output :
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
As we can see, this result is much more convenient if we are solely interested in null values.
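The checks above can be tried without the Titanic file. The following is a minimal sketch on a hypothetical toy frame `df` (the column values are made up for illustration); it also expresses the null counts as a percentage of rows, which the original walkthrough does not do.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_data (values are made up).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isna() and isnull() return identical boolean masks.
same_mask = df.isna().equals(df.isnull())

# Number of nulls per column, and the same figure as a percentage of rows.
null_counts = df.isnull().sum()
null_percent = df.isnull().mean() * 100

# Keep only the names of columns that actually contain nulls.
columns_with_nulls = null_counts[null_counts > 0].index.tolist()

print(same_mask)
print(null_counts)
print(null_percent)
print(columns_with_nulls)
```

The percentage view can make it easier to judge whether a column is mostly empty or only has a few gaps.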
How to deal with NaN as a beginner?
It is important to know the number of null values in a column, because it helps us decide how to deal with them. If there are only a few null values, as in Embarked (2 of 891), we can remove those entries from the dataset. However, if most of the values are null, as in Cabin (687 of 891), it is better to skip that column while creating the model.
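The two strategies just described could look roughly like this. This is a sketch on a hypothetical toy frame `df` whose column names mirror the Titanic dataset; dropna() and drop() are one common way to do it, not necessarily what the original series used.

```python
import pandas as pd

# Hypothetical toy frame: Embarked has one null, Cabin is mostly null.
df = pd.DataFrame({
    "Embarked": ["S", "C", None, "Q", "S"],
    "Cabin": [None, "C85", None, None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10],
})

# Few nulls: drop only the rows where Embarked is missing.
rows_dropped = df.dropna(subset=["Embarked"])

# Mostly null: drop the whole Cabin column instead of its rows.
column_dropped = rows_dropped.drop(columns=["Cabin"])

print(rows_dropped.shape)
print(column_dropped.columns.tolist())
```

Both calls return new dataframes, so the original `df` stays available for comparison.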
There is another case, where the null values are too many to drop the entries but too few to skip the column, as with Age here. For such cases there are many ways to deal with null values, but as beginners we will learn just one trick here: filling them with a value. We will use the fillna() method to do that.
>> titanic_data["Age"] = titanic_data["Age"].fillna("Unknown")
>> print(titanic_data["Age"].isnull().any())
output :
False
# The Age column now has no null values
fillna() returns a new column and leaves the original untouched, so we assign the result back to the dataframe; otherwise the change would not appear in our dataset. fillna() also accepts an inplace argument, but using inplace=True on a column selected from a dataframe, as in titanic_data.Age.fillna("Unknown", inplace=True), is not guaranteed to update the dataframe in recent versions of Pandas. We can also check whether a specific column has null values in the same manner as we did for the whole dataset.
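Filling a numeric column with a string such as "Unknown" leaves it holding mixed types, which gets in the way of later arithmetic. One common alternative, which is not the approach taken in this article, is to fill with the column median. The sketch below uses a hypothetical toy frame `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up for illustration).
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 40.0, np.nan]})

# median() skips NaN by default, so it is computed from 22, 26 and 40.
median_age = df["Age"].median()

# fillna() returns a new Series; assign it back to update the frame.
df["Age"] = df["Age"].fillna(median_age)

print(median_age)
print(df["Age"].tolist())
print(df["Age"].dtype)
```

Whether the median, the mean, or a placeholder is appropriate depends on the data and the model, so treat this as one option rather than a rule.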
We can also replace values that are not NaN in a column, using the replace() method.
>> titanic_data["Sex"] = titanic_data["Sex"].replace({"male": "M", "female": "F"})
>> print(titanic_data["Sex"])
output :
0      M
1      F
2      F
3      F
4      M
      ..
886    M
887    F
888    F
889    M
890    M
Name: Sex, Length: 891, dtype: object
Some methods to make a dataset more readable
1. rename() : Sometimes we realize that a column name does not suit our requirements. We can use the rename() method to change it.
>> titanic_data.rename(columns={"Sex": "Gender"}, inplace=True)
>> print(titanic_data.Gender)
output :
0      M
1      F
2      F
3      F
4      M
      ..
886    M
887    F
888    F
889    M
890    M
Name: Gender, Length: 891, dtype: object
2. rename_axis() : This is a simple method and, as the name suggests, it is used to give names to the axes.
>> titanic_data.rename_axis("Sr.No", axis='rows', inplace=True)
>> titanic_data.rename_axis("Category", axis='columns', inplace=True)
>> print(titanic_data.head(2))
output :
Category  PassengerId  Survived  Pclass  .....
Sr.No
0                   1         0       3
1                   2         1       1
[2 rows x 12 columns]
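Both methods also work without inplace, in which case they return a new dataframe and can be chained. The following is a minimal sketch on a hypothetical two-row frame `df`; the axis names are only illustrative.

```python
import pandas as pd

# Hypothetical two-row frame (values are made up for illustration).
df = pd.DataFrame({"PassengerId": [1, 2], "Sex": ["M", "F"]})

# Without inplace=True each call returns a new frame, so the calls can be chained.
renamed = (
    df.rename(columns={"Sex": "Gender"})
      .rename_axis("Sr.No", axis="rows")
      .rename_axis("Category", axis="columns")
)

print(renamed)
print(df.columns.tolist())  # the original frame keeps its old column names
```

Chaining keeps the original frame intact, which is handy while experimenting with names.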
With this we come to the end of this article and of the series on Pandas. I believe the methods we came across in this series are very helpful for analyzing data before we start training models on it. However, they are just a small fraction of the methods in the Pandas library and only the beginning of data exploration and preprocessing. As a beginner, I think these are enough to get started on the Data Science journey. I hope you found this series valuable. Thank you for reading. Keep practicing. Happy Coding! 😄
Translated from: https://medium.com/swlh/pandas-first-step-towards-data-science-part-3-351321c24cc0