Pandas Data Structures

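Anyone doing data analysis, data mining, or machine learning in Python will find it hard to get by without pandas. It is especially handy for data preprocessing.

This article introduces the two main data structures in pandas: Series and DataFrame.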

import pandas as pd

1.Series

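We start with the Series data structure.

A Series is an object similar to a one-dimensional array. The essentials to master are:

Building a Series
Getting the data and the index of a Series
Previewing the data
Getting data by index
Series operations
The name attribute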

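1.1 Building a Series

Building a Series from a list

Passing a list (or a similar sequence, such as the range used here) to pd.Series() converts it into a Series.

Printing the type confirms that the result is a Series.

ser_obj = pd.Series(range(10, 20))
print(type(ser_obj))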

<class 'pandas.core.series.Series'>

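Building a Series from a dict

Each key of the dict becomes an index label, and the corresponding value becomes the data. The values should share one data type; if they do not, the Series falls back to the generic object dtype.

year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5}
ser_obj2 = pd.Series(year_data)
print(ser_obj2.head())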

2001    17.8
2002    20.1
2003    16.5
dtype: float64
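
As a minimal sketch of the two construction routes described above, assuming Python 3 and a recent pandas release, the snippet below builds one Series from a list and one from a dict, then shows what happens when the value types are mixed. The names from_list, from_dict, and mixed are illustrative and not part of the original example.

```python
import pandas as pd

# From a list: the index defaults to 0, 1, 2, ...
from_list = pd.Series([17.8, 20.1, 16.5])

# From a dict: the keys become the index labels.
year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5}
from_dict = pd.Series(year_data)

# Mixed value types are accepted, but the dtype becomes the generic object.
mixed = pd.Series({"a": 1, "b": "two"})

print(list(from_list.index))
print(list(from_dict.index))
print(from_dict.dtype, mixed.dtype)
```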

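1.2 Getting the data and the index

For a Series, the .values attribute returns its data and the .index attribute returns its index. Both are attributes, so they are written without parentheses.

In the example below the index is not printed element by element. Instead a RangeIndex is printed, whose parameters are the start (inclusive), the stop (exclusive), and the step, which is 1 here.

# Get the data
print(ser_obj.values)
# Get the index
print(ser_obj.index)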

[10 11 12 13 14 15 16 17 18 19]
RangeIndex(start=0, stop=10, step=1)
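
The following sketch, assuming Python 3 and a recent pandas release, looks a little closer at what those two attributes return. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series(range(10, 20))

values = ser.values  # a NumPy array holding the data
index = ser.index    # a RangeIndex, because no index was given

print(type(values))
print(index.start, index.stop, index.step)

# A RangeIndex can be expanded into its individual labels when needed.
labels = list(index)
print(labels)
```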

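1.3 Previewing the data

When a data set is too large to print in full but you still want to see what it looks like, you can pull out just the first few records.

Call .head() directly. Without an argument it returns the first 5 records; pass an argument n to get the first n records instead.

# Preview the data
print(ser_obj.head(3))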

0    10
1    11
2    12
dtype: int64
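
A short sketch of the preview calls, assuming Python 3 and a recent pandas release. It also uses .tail(), the counterpart of .head() for the end of the data, which the original text does not cover.

```python
import pandas as pd

ser = pd.Series(range(10, 20))

first_default = ser.head()   # no argument: first 5 records
first_three = ser.head(3)    # first 3 records
last_two = ser.tail(2)       # last 2 records

print(first_three.tolist())
print(last_two.tolist())
```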

1.4 Getting data

You can use an index label to get the corresponding value from a Series. The label goes inside square brackets [].

# Get data by index
print(ser_obj[0])
print(ser_obj[8])

10
18
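
With the default integer index, a label and a position are the same number, so the bracket lookup above is unambiguous. Once the index holds other labels, it can help to say explicitly which one is meant. The sketch below, assuming Python 3 and a recent pandas release, uses .loc for label-based and .iloc for position-based access; the sample data is illustrative.

```python
import pandas as pd

ser = pd.Series([17.8, 20.1, 16.5], index=[2001, 2002, 2003])

by_label = ser.loc[2002]    # look up the label 2002
by_position = ser.iloc[1]   # look up the second element

print(by_label, by_position)
```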

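1.5 Operations

Applying addition, subtraction, multiplication, or division to a Series applies the operation to every element and returns a Series of the same length.

# The index-to-value mapping is preserved in the result of the array operation
print(ser_obj * 3)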

0    30
1    33
2    36
3    39
4    42
5    45
6    48
7    51
8    54
9    57
dtype: int64
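
A minimal sketch of element-wise arithmetic, assuming Python 3 and a recent pandas release. It checks that the result keeps the length and index of the input. The variable names are illustrative.

```python
import pandas as pd

ser = pd.Series(range(10, 20))

tripled = ser * 3   # every element multiplied by 3
shifted = ser - 10  # every element reduced by 10
doubled = ser + ser  # two Series are combined label by label

print(tripled.tolist())
print(tripled.index.equals(ser.index))
```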

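Besides ordinary arithmetic, a Series also supports Boolean comparisons. In the example below every value greater than 15 becomes True and every other value, including 15 itself, becomes False.

print(ser_obj > 15)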

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7     True
8     True
9     True
dtype: bool
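
A Boolean Series like the one above is commonly used as a mask to select elements. The sketch below, assuming Python 3 and a recent pandas release, shows that usage; the names mask and selected are illustrative.

```python
import pandas as pd

ser = pd.Series(range(10, 20))

mask = ser > 15       # Boolean Series with the same index as ser
selected = ser[mask]  # keeps only the rows where mask is True

print(selected.tolist())
print(int(mask.sum()))  # True counts as 1, so this counts the matches
```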

1.6 The name attribute

Both the index and the values of a Series can be given custom names.

# The name attribute
ser_obj2.name = 'score'
ser_obj2.index.name = 'year'
print(ser_obj2.head())

year
2001    17.8
2002    20.1
2003    16.5
Name: score, dtype: float64
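
One place the names become useful is when a Series is turned into a table. The sketch below, assuming Python 3 and a recent pandas release, sets both names and converts the Series with to_frame() and reset_index(); the variable names are illustrative.

```python
import pandas as pd

scores = pd.Series({2001: 17.8, 2002: 20.1, 2003: 16.5}, name="score")
scores.index.name = "year"

# The Series name becomes the column name, the index name stays on the index.
frame = scores.to_frame()

# reset_index() moves the named index into an ordinary column.
flat = scores.reset_index()

print(list(frame.columns), frame.index.name)
print(list(flat.columns))
```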

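2.DataFrame

A DataFrame resembles a two-dimensional array or a table of data, much like an Excel sheet.

Different columns may hold different data types, but the data within one column should share a single type.

A DataFrame has both a row index and a column index.

To get comfortable with basic DataFrame usage, you need to know the following points.

Two ways to build a DataFrame: from an ndarray and from a dict
Getting data by index
Adding and deleting data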

import numpy as np

2.1 Building a DataFrame

Building a DataFrame from an ndarray

# First create an ndarray (of size 5*4)
array = np.random.randn(5, 4)
print(array)
# Pass the ndarray to pd.DataFrame() to get a DataFrame
df_obj = pd.DataFrame(array)
print(df_obj.head())

[[-1.15943918  0.41562598  0.24219151 -0.54127251]
 [-0.72949761  0.7299977  -0.35770911 -1.55597979]
 [-0.26508669  0.73079105  0.019037   -0.28775191]
 [ 2.35757276  0.54826604 -1.10932131  0.36925581]
 [ 0.60940029  0.11843865 -0.30061918  0.44980428]]
          0         1         2         3
0 -1.159439  0.415626  0.242192 -0.541273
1 -0.729498  0.729998 -0.357709 -1.555980
2 -0.265087  0.730791  0.019037 -0.287752
3  2.357573  0.548266 -1.109321  0.369256
4  0.609400  0.118439 -0.300619  0.449804

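In the DataFrame built above, the column on the left is the row index and the row on top is the column index. Unless you specify otherwise, pandas generates default row and column indexes.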
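
The default labels can be replaced at construction time through the index and columns parameters of pd.DataFrame(). The sketch below assumes Python 3 and a recent pandas release, and uses a deterministic array instead of random numbers so that the printed values are predictable. The labels are illustrative.

```python
import numpy as np
import pandas as pd

array = np.arange(20).reshape(5, 4)

# Default labels: 0..4 for rows and 0..3 for columns.
df_default = pd.DataFrame(array)

# Explicit row and column labels.
df_labeled = pd.DataFrame(
    array,
    index=["a", "b", "c", "d", "e"],
    columns=["w", "x", "y", "z"],
)

print(list(df_default.columns))
print(df_labeled.loc["b", "x"])
```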

Building a DataFrame from a dict

Recall that when a Series is built from a dict, the keys become the index. In a DataFrame, the keys become the column index (the column names).

Passing a dict to pd.DataFrame() produces a DataFrame.

dict_data = {'A': 1.,
             'B': pd.Timestamp('20161223'),
             'C': pd.Series(1, index=list(range(4)), dtype='float32'),
             'D': np.array([3] * 4, dtype='int32'),
             'E': pd.Categorical(["Python", "Java", "C++", "C#"]),
             'F': 'wangxiaocao'}
# print(dict_data)
df_obj2 = pd.DataFrame(dict_data)
print(df_obj2.head())

     A          B    C  D       E            F
0  1.0 2016-12-23  1.0  3  Python  wangxiaocao
1  1.0 2016-12-23  1.0  3    Java  wangxiaocao
2  1.0 2016-12-23  1.0  3     C++  wangxiaocao
3  1.0 2016-12-23  1.0  3      C#  wangxiaocao
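
In the example above the scalar values ('A', 'B', 'F') are repeated to match the length of the other columns, and each column keeps its own data type. The sketch below, assuming Python 3 and a recent pandas release, rebuilds a smaller version of that dict and inspects the per-column types through the dtypes attribute. The column contents are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": 1.0,                                              # scalar, repeated
    "C": pd.Series(1, index=range(4), dtype="float32"),    # sets the row index
    "D": np.array([3] * 4, dtype="int32"),
    "F": "text",                                           # scalar, repeated
})

# One dtype per column.
print(df.dtypes)
print(df.shape)
```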

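2.2 Getting data by index

Here we only cover getting data through the column index.

As the name suggests, selecting by a column label returns that entire column. The column comes back as a Series.

So a DataFrame can be seen as a collection of Series, one per column.

There are two ways to do this:
1. df_obj2['F']
2. df_obj2.F

# Way 1
print(df_obj2['F'])
print(type(df_obj2['F']))
# Way 2
print(df_obj2.F)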

0    wangxiaocao
1    wangxiaocao
2    wangxiaocao
3    wangxiaocao
Name: F, dtype: object
<class 'pandas.core.series.Series'>
0    wangxiaocao
1    wangxiaocao
2    wangxiaocao
3    wangxiaocao
Name: F, dtype: object
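
The two forms return the same column, but attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. The sketch below, assuming Python 3 and a recent pandas release, illustrates both cases; the column names are made up for the demonstration.

```python
import pandas as pd

df = pd.DataFrame({"F": ["x", "y"], "my col": [1, 2], "count": [3, 4]})

by_bracket = df["F"]
by_attribute = df.F

# A name containing a space can only be reached with brackets.
spaced = df["my col"]

# "count" is also a DataFrame method, so df.count is the method, not the column.
count_column = df["count"]
count_is_method = callable(df.count)

print(by_bracket.equals(by_attribute))
print(count_is_method)
```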

2.3 Adding and deleting columns

# Add a column
df_obj2['G'] = df_obj2['D'] + 4
print(df_obj2.head())

     A          B    C  D       E            F  G
0  1.0 2016-12-23  1.0  3  Python  wangxiaocao  7
1  1.0 2016-12-23  1.0  3    Java  wangxiaocao  7
2  1.0 2016-12-23  1.0  3     C++  wangxiaocao  7
3  1.0 2016-12-23  1.0  3      C#  wangxiaocao  7

# Delete a column
del df_obj2['G']
print(df_obj2.head())

     A          B    C  D       E            F
0  1.0 2016-12-23  1.0  3  Python  wangxiaocao
1  1.0 2016-12-23  1.0  3    Java  wangxiaocao
2  1.0 2016-12-23  1.0  3     C++  wangxiaocao
3  1.0 2016-12-23  1.0  3      C#  wangxiaocao
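
The del statement removes the column in place. As a hedged alternative, the sketch below (Python 3, recent pandas release) also shows DataFrame.drop(columns=...), which returns a new DataFrame and leaves the original untouched. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"D": [3, 3, 3, 3]})

# Add a column derived from an existing one.
df["G"] = df["D"] + 4
added = df["G"].tolist()

# drop() returns a new DataFrame; df itself still has column G.
trimmed = df.drop(columns=["G"])
still_has_g = "G" in df.columns

# del removes the column from df itself.
del df["G"]

print(added)
print(list(trimmed.columns), still_has_g, list(df.columns))
```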

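3.The Index object

Both pandas data structures are closely tied to indexes, so this section lists some basics about them.

The first thing to know about an index is that it is immutable.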

# Index objects are immutable
df_obj2.index[0] = 2

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-17-7f40a356d7d1> in <module>()
      1 # Index objects are immutable
----> 2 df_obj2.index[0] = 2

/home/cc/anaconda2/lib/python2.7/site-packages/pandas/indexes/base.pyc in __setitem__(self, key, value)
   1243 
   1244     def __setitem__(self, key, value):
-> 1245         raise TypeError("Index does not support mutable operations")
   1246 
   1247     def __getitem__(self, key):

TypeError: Index does not support mutable operations
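
Immutability applies to the elements of an Index object, not to the DataFrame's choice of index. A minimal sketch, assuming Python 3 and a recent pandas release: item assignment raises TypeError, while assigning a whole new set of labels to df.index is allowed. The data and labels are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"D": [3, 3, 3, 3]})

# Changing a single label in place is rejected.
try:
    df.index[0] = 2
    item_assignment_failed = False
except TypeError as exc:
    item_assignment_failed = True
    print(exc)

# Replacing the entire index with a new one of the same length works.
df.index = [10, 11, 12, 13]
print(list(df.index))
```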

Common kinds of Index include:

Index
Int64Index
MultiIndex: hierarchical index
DatetimeIndex: index of timestamps

print(type(ser_obj.index))
print(type(df_obj2.index))
print(df_obj2.index)

<class 'pandas.indexes.range.RangeIndex'>
<class 'pandas.indexes.numeric.Int64Index'>
Int64Index([0, 1, 2, 3], dtype='int64')
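
The output above comes from an older pandas release. In pandas 2.0 and later the Int64Index class no longer exists and integer labels are held by a plain Index, so the sketch below sticks to index types that are available across versions. It assumes Python 3 and a recent pandas release; the sample labels are illustrative.

```python
import pandas as pd

# Default index of a Series built without labels.
range_idx = pd.Series(range(3)).index

# Index of timestamps.
date_idx = pd.date_range("2016-12-23", periods=3)

# Hierarchical index with two levels.
multi_idx = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)])

for idx in (range_idx, date_idx, multi_idx):
    print(type(idx).__name__)
```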

Note: some of the examples come from Robin's course at 小象學院.
