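pandas: Series and DataFrame

pandas is a powerful Python package that provides a large set of functions and methods for processing and analyzing data.

Before using pandas, install the package and import it with import pandas as pd. Several examples below also use NumPy, imported with import numpy as np.

I. Series

A Series is a one-dimensional labeled array; the labels form its index.

1. Creating a Series

General form: s = pd.Series(obj, index=..., name=...)

If name is not passed, the Series name defaults to None; it is not taken from the variable on the left of the = sign (s here).

① From a dictionary

When a Series is built from a dictionary, the keys become the index. If a key is repeated in the dict literal, Python itself keeps only the last value for that key, so the Series ends up with that last value.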

dic = {'name':'Alice','age':23,'age':20,'age':25,'hobby':'dance'}
s = pd.Series(dic,name='dic_Series')
print(s)
# name     Alice
# age         25
# hobby    dance
# Name: dic_Series, dtype: object
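
The duplicate-key behavior comes from Python's dict rather than from pandas. A minimal sketch that checks this, reusing the dic from the example above:

```python
import pandas as pd

# A dict literal with repeated keys keeps only the last value per key,
# so the duplicates are already gone before pandas sees the data.
dic = {'name': 'Alice', 'age': 23, 'age': 20, 'age': 25, 'hobby': 'dance'}
print(len(dic))     # number of distinct keys
print(dic['age'])   # the last value assigned to 'age'

s = pd.Series(dic)
print(list(s.index))
```

Each key keeps the position of its first appearance, which is why age stays between name and hobby.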

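② From a one-dimensional array, list, or tuple

With this approach, the index defaults to integers starting at 0 if index is not given. If index is given, its length must equal the number of elements in the Series; otherwise an error is raised.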

arr = np.arange(1,6)
s1 = pd.Series(arr)
s2 = pd.Series(arr,index=list('abcde'),name='iter_Series')
print(s1.name,s2.name)
print(s1)
print('-------------')
print(s2)
# None iter_Series
# 0    1
# 1    2
# 2    3
# 3    4
# 4    5
# dtype: int32
# -------------
# a    1
# b    2
# c    3
# d    4
# e    5
# Name: iter_Series, dtype: int32
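
The length requirement on index can be checked directly. The sketch below assumes nothing beyond pandas and NumPy; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Lists and tuples work the same way as arrays.
s_list = pd.Series([1, 2, 3])
s_tuple = pd.Series((1, 2, 3))
print(s_list.equals(s_tuple))

arr = np.arange(1, 6)  # five elements
try:
    pd.Series(arr, index=list('abc'))  # only three labels
    mismatch_raised = False
except ValueError as exc:
    mismatch_raised = True
    print('ValueError:', exc)
```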

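③ From a scalar

When creating from a scalar, obj is a single fixed value that every element takes. Pass index to set the number of elements; without it the result is a single-element Series.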

s = pd.Series('hi',index=list('abc'),name='s_Series')
print(s)
# a    hi
# b    hi
# c    hi
# Name: s_Series, dtype: object

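2. Indexing a Series

① Positional indexing

Positions start at 0 and -1 refers to the last element; a slice [m:n] includes m and excludes n. Each element is a NumPy scalar, with a type such as <class 'numpy.int64'>.

A list of positions, [[m, n, x]], returns the elements at positions m, n and x; plain lists and tuples do not support this. When the index labels are not integers, as below, use .iloc for positional access: recent pandas versions deprecate treating an integer key in plain s[...] as a position.

s = pd.Series([1,2,3,4,5],index=list('abcde'))
print(s.iloc[1],type(s.iloc[1]))
print(s.iloc[-2])
print(s.iloc[1:3])
print(s.iloc[[0,4]])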
# 2 <class 'numpy.int64'>
# 4
# b    2
# c    3
# dtype: int64
# a    1
# e    5
# dtype: int64

② Label indexing

Unlike positional slicing, a label slice [m:n] includes both m and n. A list of labels, [[m, n, x]], returns the elements labeled m, n and x.

s = pd.Series([1,2,3,4,5],index=list('abcde'))
print(s['b'])
print(s['c':'d'])
print(s[['a','e']])
# 2
# c    3
# d    4
# dtype: int64
# a    1
# e    5
# dtype: int64

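Note that integer labels make plain [] indexing ambiguous: in the example below, s[3] is looked up by label while the slice s[2:4] is taken by position. Avoid integer labels where possible, or state the intent explicitly with .loc (labels) and .iloc (positions).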

s = pd.Series([1,2,3,4,5],index=[1,2,3,4,5])
print(s)
print('------------')
print(s[3])
print('------------')
print(s[2:4])
# 1    1
# 2    2
# 3    3
# 4    4
# 5    5
# dtype: int64
# ------------
# 3
# ------------
# 3    3
# 4    4
# dtype: int64
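
One way to remove the ambiguity is to always spell out the intent with .loc or .iloc. A small sketch on the same Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=[1, 2, 3, 4, 5])

by_label = s.loc[3]        # element whose label is 3
by_position = s.iloc[3]    # fourth element, counting from position 0
label_slice = s.loc[2:4]   # labels 2, 3 and 4 (end label included)
pos_slice = s.iloc[2:4]    # positions 2 and 3 (end position excluded)

print(by_label, by_position)
print(label_slice.tolist(), pos_slice.tolist())
```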

③ Boolean indexing

A boolean Series used as the key keeps the elements where it is True.

s = pd.Series([1,2,3,4,5],index=list('abcde'))
m = s > 3
print(m)
print(s[m])
# a    False
# b    False
# c    False
# d     True
# e     True
# dtype: bool
# d    4
# e    5
# dtype: int64
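
Conditions can also be combined. The sketch below uses the element-wise operators &, | and ~, which are not shown in the original example; each condition needs its own parentheses.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))

both = s[(s > 1) & (s < 5)]     # and: greater than 1 and less than 5
either = s[(s < 2) | (s > 4)]   # or: less than 2 or greater than 4
negated = s[~(s > 3)]           # not: everything that is not greater than 3

print(both.tolist(), either.tolist(), negated.tolist())
```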

3. Inspecting a Series and common methods

① head() and tail()

head() returns the first n elements and tail() the last n; n defaults to 5 and can be passed explicitly.

s = pd.Series([1,2,3,4,5,6,7,8,9,10])
print(s.head(2))
print(s.tail(3))
# 0    1
# 1    2
# dtype: int64
# 7     8
# 8     9
# 9    10
# dtype: int64

② tolist() (also available as to_list())

Converts the Series to a Python list.

s = pd.Series(np.random.randint(1,10,10))
print(s.tolist())
# [3, 8, 8, 9, 8, 2, 2, 7, 7, 7]

③ reindex(index, fill_value=NaN)

reindex returns a new Series. For each label in index, the value is kept if the label exists in the original Series; otherwise it is filled with fill_value, which defaults to NaN.

arr = np.arange(1,6)
s1 = pd.Series(arr,index = list('abcde'))
s2 =s1.reindex(['a','d','f','h'],fill_value=0)
print(s1)
print(s2)
# a    1
# b    2
# c    3
# d    4
# e    5
# dtype: int32
# a    1
# d    4
# f    0
# h    0
# dtype: int32
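
When fill_value is left at its default, missing labels get NaN, and an integer Series is converted to float to hold it. A short sketch of that case:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(1, 6), index=list('abcde'))
s2 = s1.reindex(['a', 'd', 'f'])  # 'f' is not in s1, so it becomes NaN

print(s2)
print(s2.isna().tolist())
```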

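④ + and -

Adding or subtracting a single value applies the operation to every element of the Series.

Adding or subtracting two Series aligns them on index: values with matching labels are combined, while labels present in only one Series are kept with the value NaN.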

s1 = pd.Series(np.arange(1,4),index = list('abc'))
s2 = pd.Series(np.arange(5,8),index = list('bcd'))
print(s1+s2)
print('--------')
print(s2-s1)
print('--------')
print(s1+10)
# a    NaN
# b    7.0
# c    9.0
# d    NaN
# dtype: float64
# --------
# a    NaN
# b    3.0
# c    3.0
# d    NaN
# dtype: float64
# --------
# a    11
# b    12
# c    13
# dtype: int32
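
If NaN for unmatched labels is not wanted, the add() and sub() methods accept a fill_value that stands in for the missing side. This is an addition to the original example, sketched on the same two Series:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(1, 4), index=list('abc'))  # a=1, b=2, c=3
s2 = pd.Series(np.arange(5, 8), index=list('bcd'))  # b=5, c=6, d=7

total = s1.add(s2, fill_value=0)  # a missing value is treated as 0
diff = s2.sub(s1, fill_value=0)

print(total)
print(diff)
```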

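⑤ Adding elements

Assigning to a new label adds an element and modifies the Series in place. Assigning to an out-of-range position (the commented-out s[3] = 10 below) raised an index-out-of-bounds error in the pandas version used here.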

s = pd.Series(np.arange(1,4),index = list('abc'))
# s[3] = 10
s['p'] = 15
print(s)
# a     1
# b     2
# c     3
# p    15
# dtype: int64

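pd.concat([s1, s2]) returns a new Series and leaves s1 and s2 unchanged. The older s1.append(s2) did the same but was removed in pandas 2.0.

s1 = pd.Series(np.arange(1,3),index = list('ab'))
s2 = pd.Series(np.arange(3,5),index = list('mn'))
a = pd.concat([s1,s2])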
print(s1)
print(s2)
print(a)
# a    1
# b    2
# dtype: int32
# m    3
# n    4
# dtype: int32
# a    1
# b    2
# m    3
# n    4
# dtype: int32

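⑥ Removing elements with drop()

Usage: drop(index, inplace=False) removes the elements whose labels are given in index. By default it returns a new Series with those elements removed and leaves the original unchanged; with inplace=True it modifies the original Series and returns None.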

s1 = pd.Series(np.arange(1,4),index = list('abc'))
s2 = s1.drop(['a','c'])
print(s1)
print(s2)
s3 = pd.Series(np.arange(5,8),index = list('lmn'))
s4 = s3.drop('m',inplace=True)
print(s3)
print(s4)
# a    1
# b    2
# c    3
# dtype: int32
# b    2
# dtype: int32
# l    5
# n    7
# dtype: int32
# None
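
Dropping a label that does not exist raises KeyError. As a sketch beyond the original example, the errors='ignore' option skips missing labels instead:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1, 4), index=list('abc'))

try:
    s.drop('z')  # 'z' is not a label of s
    raised_key_error = False
except KeyError as exc:
    raised_key_error = True
    print('KeyError:', exc)

kept = s.drop(['a', 'z'], errors='ignore')  # drops 'a', silently skips 'z'
print(kept)
```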

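II. DataFrame

A DataFrame is a tabular data structure, a two-dimensional array with labeled rows and columns, and the most commonly used structure in pandas. For a DataFrame df, df.index holds the row labels, df.columns the column labels, and df.values the underlying data.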

dic = {'name':['alice','Bob','Jane'],'age':[23,26,25]}
df = pd.DataFrame(dic)
print(df)
print(type(df))
print(df.index)
print(df.columns)
print(df.values)
#     name  age
# 0  alice   23
# 1    Bob   26
# 2   Jane   25
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex(start=0, stop=3, step=1)
# Index(['name', 'age'], dtype='object')
# [['alice' 23]
#  ['Bob' 26]
#  ['Jane' 25]]

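1. Creating a DataFrame

① From a dictionary of lists or a list of dictionaries

The dictionary keys become the column labels, and the row index defaults to integers starting at 0. In the output below, df1 shows its columns in alphabetical order because older pandas versions sorted the keys of a list of dictionaries; newer versions keep the insertion order.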

dic1 = [{'name':'Alice','age':23},{'name':'Bob','age':26},{'name':'Jane','age':25}]
dic2 = {'name':['alice','Bob','Jane'],'age':[23,26,25]}
df1 = pd.DataFrame(dic1)
df2 = pd.DataFrame(dic2)
print(df1)
print('---------------')
# print(pd.DataFrame(df1,columns=['name','age']))
print(df2)
#    age   name
# 0   23  Alice
# 1   26    Bob
# 2   25   Jane
# ---------------
#     name  age
# 0  alice   23
# 1    Bob   26
# 2   Jane   25

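The row index can be set with index at creation; its length must equal the number of rows, otherwise an error is raised.

The column labels can be set with columns, and they need not match the dictionary keys: keys listed in columns are kept, keys not listed are dropped, and labels in columns that have no matching key become columns filled with NaN.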

dic = {'name':['alice','Bob','Jane'],'age':[23,26,25]}
df1 = pd.DataFrame(dic,columns=['name','hobby'])
df2 = pd.DataFrame(dic,index=['a','b','c'])
print(df1)
print(df2)
#    name hobby
# 0  alice   NaN
# 1    Bob   NaN
# 2   Jane   NaN
#     name  age
# a  alice   23
# b    Bob   26
# c   Jane   25

② From Series

The Series may differ in length. The DataFrame index is the union of the Series indexes, so shorter Series are padded with NaN where they have no value.

dic1 = {'one':pd.Series(np.arange(2)),'two':pd.Series(np.arange(3))}
dic2 = {'one':pd.Series(np.arange(2),index=['a','b']),'two':pd.Series(np.arange(3),index = ['a','b','c'])}
print(pd.DataFrame(dic1))
print('------------')
print(pd.DataFrame(dic2))
#    one  two
# 0  0.0    0
# 1  1.0    1
# 2  NaN    2
# ------------
#    one  two
# a  0.0    0
# b  1.0    1
# c  NaN    2
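
Because the Series are combined by label, the order of the values inside each Series does not matter. A small sketch with two illustrative Series whose labels are in different orders:

```python
import pandas as pd

one = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
two = pd.Series([30, 10, 20], index=['c', 'a', 'b'])  # same labels, shuffled

df = pd.DataFrame({'one': one, 'two': two})
print(df)

# Values are matched by label, not by position.
print(df.loc['a', 'two'], df.loc['c', 'two'])
```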

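③ From a two-dimensional array

Usage: pd.DataFrame(arr, index=..., columns=...). If index and columns are not given, both default to integers starting at 0; if given, their lengths must equal the number of rows and columns of the array, otherwise an error is raised.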

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index=['a','b','c'],columns=['col1','col2','col3','col4'])
print(df)
#    col1  col2  col3  col4
# a     0     1     2     3
# b     4     5     6     7
# c     8     9    10    11

④ From a nested dictionary

The outer keys become the column labels and the inner keys become the row labels.

dic = {'Chinese':{'Alice':92,'Bob':95,'Jane':93},'Math':{'Alice':96,'Bob':98,'Jane':95}}
print(pd.DataFrame(dic))
#        Chinese  Math
# Alice       92    96
# Bob         95    98
# Jane        93    95

2. Indexing a DataFrame

.values returns the data without the index and columns, as a two-dimensional array.

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
print(df.values)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

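① Column indexing

A single column is selected with df['column label']. The result is a Series whose name is the column label and whose index is the index of the original DataFrame.

Several columns are selected with df[['column label 1', 'column label 2', ...]]. The result is a DataFrame whose columns are the requested labels and whose index is the index of the original DataFrame.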

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
print(df)
print('-------------')
print(df['one'],type(df['one']))
print('-------------')
print(df[['one','three']])
#    one  two  three  four
# a    0    1      2     3
# b    4    5      6     7
# c    8    9     10    11
# -------------
# a    0
# b    4
# c    8
# Name: one, dtype: int32 <class 'pandas.core.series.Series'>
# -------------
#    one  three
# a    0      2
# b    4      6
# c    8     10

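② Row indexing

A single row is selected with df.loc['row label']. The result is a Series whose name is the row label and whose index is the columns of the original DataFrame.

Several rows are selected with df.loc[['row label 1', 'row label 2', ...]]. The result is a DataFrame with the original columns and the requested row labels as its index.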

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
print(df.loc['a'],type(df.loc['a']))
print(df.loc[['a','c']])
# one      0
# two      1
# three    2
# four     3
# Name: a, dtype: int32 <class 'pandas.core.series.Series'>
#    one  two  three  four
# a    0    1      2     3
# c    8    9     10    11

Rows can also be selected with iloc[]: loc[] takes row labels, while iloc[] takes integer positions (the n-th row, counting from 0).

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
print(df.iloc[1],type(df.iloc[1]))
print(df.iloc[[0,2]])
# one      4
# two      5
# three    6
# four     7
# Name: b, dtype: int32 <class 'pandas.core.series.Series'>
#    one  two  three  four
# a    0    1      2     3
# c    8    9     10    11
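
The two accessors also differ in how they slice, in the same way as for a Series: a label slice includes its end, a position slice does not. A short sketch on the same DataFrame:

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(3, 4)
df = pd.DataFrame(arr, index=['a', 'b', 'c'],
                  columns=['one', 'two', 'three', 'four'])

by_label = df.loc['a':'b']   # rows a and b: the end label is included
by_position = df.iloc[0:1]   # row a only: the end position is excluded

print(by_label)
print(by_position)
```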

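③ Cell and block indexing

A single cell can be reached in three ways: df['column label'].loc['row label'], df.loc['row label']['column label'], and df.loc['row label', 'column label']. The last form does the lookup in a single step and is the one to prefer, especially when assigning.

A block can be reached the same way: df[['column label 1', 'column label 2', ...]].loc[['row label 1', 'row label 2', ...]], df.loc[['row label 1', 'row label 2', ...]][['column label 1', 'column label 2', ...]], and df.loc[['row label 1', 'row label 2', ...], ['column label 1', 'column label 2', ...]].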

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
print(df)
print('--------------------------')
print(df['two'].loc['b'] , df.loc['b']['two'] , df.loc['b','two'])
print('--------------------------')
print(df.loc[['a','c'],['one','four']])
#    one  two  three  four
# a    0    1      2     3
# b    4    5      6     7
# c    8    9     10    11
# --------------------------
# 5 5 5
# --------------------------
#    one  four
# a    0     3
# c    8    11
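
For single cells pandas also offers the scalar accessors .at (labels) and .iat (positions). They are not part of the original walkthrough; the sketch below shows them next to single-step assignment through .loc.

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(3, 4)
df = pd.DataFrame(arr, index=['a', 'b', 'c'],
                  columns=['one', 'two', 'three', 'four'])

cell_by_label = df.at['b', 'two']   # row label, column label
cell_by_position = df.iat[1, 1]     # row position, column position

# Assign to a cell and to a block in one step each.
df.loc['b', 'two'] = 50
df.loc[['a', 'c'], ['one', 'four']] = 0
print(df)
```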

④ Boolean indexing

Indexing a DataFrame with a boolean Series built from a single column returns the rows where that Series is True.

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
m1= df['one']>5
print(df)
print('------------------------')
print(m1) # True only for index c
print('------------------------')
print(df[m1])  # shows the row for index c, with all columns
#    one  two  three  four
# a    0    1      2     3
# b    4    5      6     7
# c    8    9     10    11
# ------------------------
# a    False
# b    False
# c     True
# Name: one, dtype: bool
# ------------------------
#    one  two  three  four
# c    8    9     10    11

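Indexing with a boolean DataFrame, built from several columns or from the whole DataFrame, returns a DataFrame with the same shape as the original: cells where the mask is True keep their value, and all other cells, including those in columns the mask does not cover, become NaN.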

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
m1 = df[['one','three']] > 5
print(m1)
print(df[m1])   # cells of one and three that meet the condition keep their value; everything else is NaN
#      one  three
# a  False  False
# b  False   True
# c   True   True
#    one  two  three  four
# a  NaN  NaN    NaN   NaN
# b  NaN  NaN    6.0   NaN
# c  8.0  NaN   10.0   NaN

The same applies to a mask built from the whole DataFrame:

df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
m = df >5
print(m)
print(df[m])
#     one    two  three   four
# a  False  False  False  False
# b  False  False   True   True
# c   True   True   True   True
#    one  two  three  four
# a  NaN  NaN    NaN   NaN
# b  NaN  NaN    6.0   7.0
# c  8.0  9.0   10.0  11.0

Using a boolean Series built from a row, which is indexed by column labels, as df[...] raises pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
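
A row-based mask can still be used if it is applied to the column axis. The sketch below reproduces the error and then selects columns with df.loc[:, mask]; it assumes the error is what the note above refers to and catches a generic Exception so it does not depend on where the error class lives.

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(3, 4)
df = pd.DataFrame(arr, index=['a', 'b', 'c'],
                  columns=['one', 'two', 'three', 'four'])

mask = df.loc['a'] > 1   # indexed by column labels, not by row labels

try:
    df[mask]             # the mask's index does not match the row index
    raised = False
except Exception as exc:
    raised = True
    print(type(exc).__name__, exc)

cols = df.loc[:, mask]   # apply the mask to the columns instead
print(cols)
```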

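3. Common DataFrame methods

① Transpose with .T

Transposing turns the original columns into the index and the original index into the columns. In the output below, produced with an older pandas version on a single-dtype DataFrame, the transpose shares data with the original, so a change to either one shows up in the other. This sharing is an implementation detail and should not be relied on; recent pandas versions with Copy-on-Write enabled keep the two independent.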

arr = np.arange(12).reshape(3,4)
df1 = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
df2 = df1.T
df1.loc['a','one'] = 100
print(df1)
print(df2)
df2.loc['two','b'] = 500
print(df1)
print(df2)
#    one  two  three  four
# a  100    1      2     3
# b    4    5      6     7
# c    8    9     10    11
#          a  b   c
# one    100  4   8
# two      1  5   9
# three    2  6  10
# four     3  7  11
#    one  two  three  four
# a  100    1      2     3
# b    4  500      6     7
# c    8    9     10    11
#          a    b   c
# one    100    4   8
# two      1  500   9
# three    2    6  10
# four     3    7  11
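
When the transposed DataFrame must not follow later changes to the original, an explicit copy makes that independent of the pandas version. A minimal sketch:

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(3, 4)
df1 = pd.DataFrame(arr, index=['a', 'b', 'c'],
                   columns=['one', 'two', 'three', 'four'])

df2 = df1.T.copy()          # independent copy of the transpose
df1.loc['a', 'one'] = 100   # change the original afterwards

print(df1.loc['a', 'one'])
print(df2.loc['one', 'a'])  # the copy keeps the old value
```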

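② Adding and modifying

Add a column: df['new column label'] = [...]. The number of elements must equal the number of rows, otherwise an error is raised.

Add a row: df.loc['new row label'] = [...]. The number of elements must equal the number of columns, otherwise an error is raised.

To modify existing data, select the cell or block as in the previous section, preferably with df.loc[rows, columns], and assign to it.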

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
print(df)
df['five'] = [11,22,33]  # the number of elements must equal the number of rows
print(df)
df.loc['d'] = [100,200,300,400,500]  # the number of elements must equal the number of columns
print(df)
#    one  two  three  four
# a    0    1      2     3
# b    4    5      6     7
# c    8    9     10    11
#    one  two  three  four  five
# a    0    1      2     3    11
# b    4    5      6     7    22
# c    8    9     10    11    33
#    one  two  three  four  five
# a    0    1      2     3    11
# b    4    5      6     7    22
# c    8    9     10    11    33
# d  100  200    300   400   500

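③ Deleting

del df['column label'] removes the column from the original DataFrame.

df.drop('label', axis=0, inplace=False) can remove rows or columns. axis defaults to 0, which removes rows; axis=1 removes columns. A label that does not exist on the chosen axis raises an error.

By default (inplace=False) drop returns a new DataFrame and leaves the original unchanged. With inplace=True it modifies the original DataFrame and returns None.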

arr = np.arange(12).reshape(3,4)
df = pd.DataFrame(arr,index = ['a','b','c'],columns = ['one','two','three','four'])
print(df)
del df['four']
print(df)  # del removed the column from the original DataFrame
f = df.drop('c')
print(f)
print(df)
f = df.drop('three',axis=1,inplace=True)
print(f)
print(df)
#    one  two  three  four
# a    0    1      2     3
# b    4    5      6     7
# c    8    9     10    11
#    one  two  three
# a    0    1      2
# b    4    5      6
# c    8    9     10
#    one  two  three
# a    0    1      2
# b    4    5      6
#    one  two  three
# a    0    1      2
# b    4    5      6
# c    8    9     10
# None
#    one  two
# a    0    1
# b    4    5
# c    8    9
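
As an alternative to axis, drop also accepts index= and columns= keywords, which name the axis directly and can be combined in one call. A sketch on a fresh copy of the DataFrame:

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(3, 4)
df = pd.DataFrame(arr, index=['a', 'b', 'c'],
                  columns=['one', 'two', 'three', 'four'])

# Remove row 'c' and columns 'three' and 'four' in a single call.
out = df.drop(index='c', columns=['three', 'four'])

print(out)
print(df.shape)  # the original is unchanged because inplace defaults to False
```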

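④ Addition and subtraction

Adding or subtracting a single value applies the operation to every element of the DataFrame.

Two DataFrames need not share the same index and columns. Cells whose row and column labels match are combined; rows and columns present in only one DataFrame are kept and filled with NaN.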

arr1 = np.arange(12).reshape(3,4)
arr2 = np.arange(12).reshape(4,3)
df1 = pd.DataFrame(arr1,index = ['a','b','c'],columns = ['one','two','three','four'])
df2 = pd.DataFrame(arr2,index = ['a','b','c','d'],columns = ['one','two','three'])
print( df1 + 1 )
print( df1 + df2 )
#    one  two  three  four
# a    1    2      3     4
# b    5    6      7     8
# c    9   10     11    12
#    four   one  three   two
# a   NaN   0.0    4.0   2.0
# b   NaN   7.0   11.0   9.0
# c   NaN  14.0   18.0  16.0
# d   NaN   NaN    NaN   NaN
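
To keep the values that exist in only one of the two DataFrames, the add() method takes a fill_value for the missing side. This goes beyond the original example; cells missing from both DataFrames still come out as NaN.

```python
import numpy as np
import pandas as pd

arr1 = np.arange(12).reshape(3, 4)
arr2 = np.arange(12).reshape(4, 3)
df1 = pd.DataFrame(arr1, index=['a', 'b', 'c'],
                   columns=['one', 'two', 'three', 'four'])
df2 = pd.DataFrame(arr2, index=['a', 'b', 'c', 'd'],
                   columns=['one', 'two', 'three'])

# A value missing on one side is treated as 0 before adding.
result = df1.add(df2, fill_value=0)
print(result)
```

Here column four comes only from df1 and row d only from df2, so the cell at row d, column four exists in neither and stays NaN.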

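⑤ Sorting

By value: sort_values('column label', ascending=True) sorts the rows by the values in one column, ascending by default. To sort by several columns, pass a list such as ['column label 1', 'column label 2', ...].

By index: sort_index(ascending=True) sorts the rows by their index labels, ascending by default.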

arr = np.random.randint(1,10,[4,3])
df = pd.DataFrame(arr,index = ['a','b','c','d'],columns = ['one','two','three'])
print(df)
print(df.sort_values(['one','three'],ascending=True))
print(df.sort_index(ascending=False))
#    one  two  three
# a    7    7      1
# b    5    7      1
# c    1    9      4
# d    7    9      9
#    one  two  three
# c    1    9      4
# b    5    7      1
# a    7    7      1
# d    7    9      9
#    one  two  three
# d    7    9      9
# c    1    9      4
# b    5    7      1
# a    7    7      1
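
Two variations that the example above does not cover are a separate sort direction per column and sorting the column labels themselves. The sketch uses fixed values equal to the random output shown above so the result is reproducible.

```python
import pandas as pd

df = pd.DataFrame({'one': [7, 5, 1, 7],
                   'two': [7, 7, 9, 9],
                   'three': [1, 1, 4, 9]},
                  index=['a', 'b', 'c', 'd'])

# Sort by 'one' ascending, and break ties by 'three' descending.
by_values = df.sort_values(['one', 'three'], ascending=[True, False])

# axis=1 sorts the column labels instead of the row labels.
by_columns = df.sort_index(axis=1)

print(by_values)
print(by_columns)
```

Rows a and d tie on column one, so the descending sort on three places d before a.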