Data Analysis Essentials: A Step-by-Step Guide to Data Analysis with Pandas (15)

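1. Missing Data in Pandas

Worked examples of handling missing data in Pandas
Missing data is a constant problem in real-world work. In fields such as machine learning and data mining, missing values lower data quality and can seriously hurt the accuracy of model predictions, so handling them is a major focus when trying to make models more accurate and effective.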

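1.1. When and Why Does Data Go Missing?

Consider an online survey about a product. People often do not share all the information that concerns them. Some share their experience with the product but not how long they have used it; others share how long they have used it and their experience, but not their contact details. One way or another, part of the data is always missing, and this is very common in real-world situations.
Now let's see how to handle missing values (such as NA or NaN) with Pandas.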

    # import the pandas library
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    # Labels 'b', 'd' and 'g' are new, so their rows are filled with NaN
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    print(df)

Output

         one        two     three
    a  -0.576991  -0.741695  0.553172
    b        NaN        NaN       NaN
    c   0.744328  -1.735166  1.749580

(Only the first rows of the output are shown. The full frame has eight rows, a through h; rows b, d and g are entirely NaN, and the random values differ on every run.)

By reindexing, we created a DataFrame with missing values. In the output, NaN stands for "Not a Number".
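Because the example above uses random numbers, its values change on every run. A minimal sketch with fixed values shows the same effect reproducibly; the numbers and the names `small` and `n_missing` are made up for illustration.

```python
import pandas as pd

# Fixed values instead of random ones, so the result is reproducible.
small = pd.DataFrame({'one': [1.0, 2.0], 'two': [3.0, 4.0]}, index=['a', 'c'])

# 'b' does not exist in the original index, so reindex fills its row with NaN.
small = small.reindex(['a', 'b', 'c'])
print(small)

# Count the missing cells in the whole frame (one per column in row 'b').
n_missing = int(small.isnull().sum().sum())
print(n_missing)
```

Rows whose labels already existed keep their values; only the newly introduced label is filled with NaN.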

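1.2. Checking for Missing Values

To make detecting missing values easier (and consistent across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also available as methods on Series and DataFrame objects: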

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    # True where the value in column 'one' is missing
    print(df['one'].isnull())

Output

    a    False
    b     True
    c    False
    d     True
    e    False
    f    False
    g     True
    h    False
    Name: one, dtype: bool

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    # True where the value in column 'one' is present
    print(df['one'].notnull())

Output

    a     True
    b    False
    c     True
    d    False
    e     True
    f     True
    g    False
    h     True
    Name: one, dtype: bool
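The boolean masks returned by isnull() and notnull() are most useful when combined with other operations. The sketch below, using made-up data and the illustrative names `frame`, `missing_per_column` and `rows_with_one`, counts missing values per column and filters rows with a mask.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'one': [1.0, np.nan, 3.0, np.nan],
                      'two': [np.nan, 2.0, 3.0, 4.0]},
                     index=['a', 'b', 'c', 'd'])

# True counts as 1 when summed, so this gives the missing values per column.
missing_per_column = frame.isnull().sum()
print(missing_per_column)

# Use the notnull() mask to keep only the rows where 'one' is present.
rows_with_one = frame[frame['one'].notnull()]
print(rows_with_one)
```

Filtering with a mask on one column leaves missing values in other columns untouched, which is why row 'a' still carries a NaN in column 'two'.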

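1.3. Calculations with Missing Data

When summing data, NA is treated as zero.
If all of the data are NA, the result is NA. (This describes older pandas releases; since pandas 0.22 the sum of an all-NA Series is 0 unless min_count=1 is passed to sum().)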

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    # NaN entries are skipped when summing
    print(df['one'].sum())

Output

    2.02357685917

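    import pandas as pd
    import numpy as np

    # No data is supplied, so every cell is NaN
    df = pd.DataFrame(index=[0, 1, 2, 3, 4, 5], columns=['one', 'two'])
    print(df['one'].sum())

Output (on older pandas releases; pandas 0.22 and later print 0 here by default):

    nan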
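How sums treat missing values can be checked on a tiny Series. This is a sketch with made-up values, assuming a pandas version that supports the `min_count` argument (0.22 or later); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])
all_missing = pd.Series([np.nan, np.nan])

total = values.sum()                      # NaN is skipped: 1.0 + 3.0
average = values.mean()                   # mean of the two present values only
strict_total = values.sum(skipna=False)   # NaN propagates when skipping is disabled
guarded = all_missing.sum(min_count=1)    # NaN unless at least one valid value exists

print(total, average, strict_total, guarded)
```

Passing `min_count=1` is one way to get a missing result back for all-NA input on newer pandas versions.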

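1.4. Cleaning / Filling Missing Data

Pandas provides several ways to clean up missing values. The fillna function can "fill in" NA values with non-null data in the ways shown in the following sections.

1.5. Replacing NaN with a Scalar Value

The following program shows how to replace "NaN" with "0".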

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                      columns=['one', 'two', 'three'])
    # 'b' is a new label, so its row is filled with NaN
    df = df.reindex(['a', 'b', 'c'])
    print(df)
    print("NaN replaced with '0':")
    print(df.fillna(0))

Output

         one        two     three
    a  -0.576991  -0.741695  0.553172
    b        NaN        NaN       NaN
    c   0.744328  -1.735166  1.749580
    NaN replaced with '0':
         one        two     three
    a  -0.576991  -0.741695  0.553172
    b   0.000000   0.000000  0.000000
    c   0.744328  -1.735166  1.749580

Here we fill with zero; we could just as well fill with any other value.
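As a sketch of filling with values other than zero, fillna also accepts a dict or a Series keyed by column name, so each column can get its own fill value. The data and the names `scores`, `by_constant` and `by_mean` below are made up for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                       'two': [np.nan, 10.0, 20.0]})

# A dict maps column name -> fill value, so each column gets its own constant.
by_constant = scores.fillna({'one': 0.0, 'two': -1.0})

# Passing a Series (here the column means) fills each column with its own mean.
by_mean = scores.fillna(scores.mean())

print(by_constant)
print(by_mean)
```

Whether a constant or a statistic such as the mean is the better choice depends on the data and on what the downstream analysis assumes.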

1.6. Filling NA Forward and Backward

Using the fill concepts discussed in the chapter on reindexing, we will now fill the missing values.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    # Forward fill; same result as the older df.fillna(method='pad'),
    # which recent pandas versions deprecate
    print(df.ffill())

Output

            one        two      three
    a   0.077988   0.476149   0.965836
    b   0.077988   0.476149   0.965836
    c  -0.390208  -0.551605  -2.301950
    d  -0.390208  -0.551605  -2.301950
    e  -2.000303  -0.788201   1.510072
    f  -0.930230  -0.670473   1.146615
    g  -0.930230  -0.670473   1.146615
    h   0.085100   0.532791   0.887415

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    # Backward fill; same result as the older df.fillna(method='backfill'),
    # which recent pandas versions deprecate
    print(df.bfill())

Output

            one        two      three
    a   0.077988   0.476149   0.965836
    b  -0.390208  -0.551605  -2.301950
    c  -0.390208  -0.551605  -2.301950
    d  -2.000303  -0.788201   1.510072
    e  -2.000303  -0.788201   1.510072
    f  -0.930230  -0.670473   1.146615
    g   0.085100   0.532791   0.887415
    h   0.085100   0.532791   0.887415
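Forward and backward filling are easier to follow on a small, fixed Series. The sketch below uses made-up values and the illustrative name `readings`; it also shows the `limit` argument, which caps how many consecutive missing values get filled.

```python
import numpy as np
import pandas as pd

readings = pd.Series([1.0, np.nan, np.nan, 4.0], index=['a', 'b', 'c', 'd'])

forward = readings.ffill()          # carry the last valid value forward
backward = readings.bfill()         # pull the next valid value backward
limited = readings.ffill(limit=1)   # fill at most one consecutive NaN

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```

With `limit=1`, the value at 'b' is filled from 'a' while 'c' stays missing, which can be useful when long gaps should not be papered over.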

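1.7. Dropping Missing Values

If you simply want to exclude missing values, use the dropna function together with the axis argument. By default axis=0, i.e. along rows, which means that if any value in a row is NA, the whole row is excluded.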

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    # Drop every row that contains at least one NaN
    print(df.dropna())

Output

            one        two      three
    a   0.077988   0.476149   0.965836
    c  -0.390208  -0.551605  -2.301950
    e  -2.000303  -0.788201   1.510072
    f  -0.930230  -0.670473   1.146615
    h   0.085100   0.532791   0.887415

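    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    # Drop every column that contains at least one NaN
    print(df.dropna(axis=1))

Output

    Empty DataFrame
    Columns: []
    Index: [a, b, c, d, e, f, g, h]

Every column contains NaN in rows b, d and g, so dropping along axis=1 removes all of the columns.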
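Dropping every row or column that has any NaN can discard a lot of data. As a sketch of the finer controls dropna offers (`how`, `thresh` and `subset`), the example below uses made-up data and the illustrative name `table`.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({'one': [1.0, np.nan, np.nan, 7.0],
                      'two': [2.0, 5.0, np.nan, np.nan],
                      'three': [3.0, np.nan, np.nan, 8.0]},
                     index=['a', 'b', 'c', 'd'])

any_missing = table.dropna()               # default how='any': drop rows with any NaN
all_missing = table.dropna(how='all')      # drop only rows that are entirely NaN
at_least_two = table.dropna(thresh=2)      # keep rows with at least 2 non-NaN values
by_column = table.dropna(subset=['two'])   # only consider NaN in column 'two'

for result in (any_missing, all_missing, at_least_two, by_column):
    print(list(result.index))
```

Each call returns a new DataFrame and leaves `table` unchanged, so the variants can be compared side by side.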

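1.8. Replacing Missing (or) Generic Values

Often we need to replace a generic value with some specific value. We can do this with the replace method.
Replacing NA with a scalar value behaves the same as the fillna() function.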

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'one': [10, 20, 30, 40, 50, 2000],
                       'two': [1000, 0, 30, 40, 50, 60]})
    # Map each old value to its replacement: 1000 -> 10 and 2000 -> 60
    print(df.replace({1000: 10, 2000: 60}))

Output

       one  two
    0   10   10
    1   20    0
    2   30   30
    3   40   40
    4   50   50
    5   60   60
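The statement that replacing NA with a scalar matches fillna() can be checked directly. This is a small sketch with made-up data; the names `data`, `via_replace`, `via_fillna` and `same` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'one': [10.0, np.nan, 30.0],
                     'two': [np.nan, 0.0, 60.0]})

via_replace = data.replace(np.nan, 0)   # treat NaN as the value to be replaced
via_fillna = data.fillna(0)             # the dedicated missing-value method

# equals() compares both values and dtypes of the two frames.
same = via_replace.equals(via_fillna)
print(same)
```

For missing values specifically, fillna() states the intent more clearly, while replace() is the more general tool for swapping arbitrary values.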

