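Previous chapter: Machine Learning Hands-On Project 01: NumPy Basics (basic operations, array shape manipulation, copies and views, advanced indexing techniques, linear algebra)
Machine learning core concepts index: Machine Learning Core Concepts Index
Machine learning hands-on projects index: [From 0 to 1] Machine Learning Hands-On Project Index: from beginner to advanced, essential for university students preparing for jobs and competitions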
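Contents
- Import
- Creating objects
- Viewing data
- Selecting data
- Getting data
- Selection by label
- Selection by position
- Boolean indexing
- Setting data
- Handling missing data
- Basic operations
- Basic statistics
- Applying functions
- Histogramming
- String methods
- Merging data
- Concatenation
- Joins
- Appending data (Append)
- Grouping
- Reshaping
- Stack
- Pivot tables
- Time series
- Categorical data
- Plotting
- Getting data in and out
- CSV
- HDF5
- Excel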
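Import
By convention, pandas is imported under the alias pd. NumPy is imported as np as well, because the examples below rely on it:
Code:
import numpy as np
import pandas as pd
df = pd.DataFrame()
print(df)
Output: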
Empty DataFrame
Columns: []
Index: []
Creating objects
A Series can be created from a list; pandas automatically creates a default integer index.
Code:
s=pd.Series([1,3,5,np.nan,6,8])
print(s)
Output:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
A DataFrame can be created from a NumPy array, here with a datetime index and labeled columns. The values are random, so your numbers will differ.
Code:
dates=pd.date_range('20130101',periods=6)
print(dates)
df=pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
print(df)
Output:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')
A B C D
2013-01-01 0.069113 -1.260614 -0.841799 -2.473165
2013-01-02 0.129673 0.077880 -1.757650 2.095012
2013-01-03 0.372136 0.763536 0.719388 0.048272
2013-01-04 1.292856 -0.854011 0.081932 -0.534714
2013-01-05 0.086568 -0.351541 -1.134527 -0.847503
2013-01-06 0.962887 1.711989 -0.465767 -0.278005
A DataFrame can also be created from a dictionary of objects that can be converted to a series-like structure.
Code:
df2=pd.DataFrame({'A':1.,
                  'B':pd.Timestamp('20130102'),
                  'C':pd.Series(1,index=list(range(4)),dtype='float32'),
                  'D':np.array([3]*4,dtype='int32'),
                  'E':pd.Categorical(['test','train','test','train']),
                  'F':'foo'})
print(df2)
Output:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
Each column of the resulting DataFrame has its own dtype:
Code:
df2.dtypes
Output:
A float64
B datetime64[s]
C float32
D int32
E category
F object
dtype: object
Viewing data
View the top and bottom rows of the data
Code:
print(df.head())
print(df.tail(3))
Output:
A B C D
2013-01-01 0.069113 -1.260614 -0.841799 -2.473165
2013-01-02 0.129673 0.077880 -1.757650 2.095012
2013-01-03 0.372136 0.763536 0.719388 0.048272
2013-01-04 1.292856 -0.854011 0.081932 -0.534714
2013-01-05 0.086568 -0.351541 -1.134527 -0.847503
A B C D
2013-01-04 1.292856 -0.854011 0.081932 -0.534714
2013-01-05 0.086568 -0.351541 -1.134527 -0.847503
2013-01-06 0.962887 1.711989 -0.465767 -0.278005
View the index, the column names, and the underlying NumPy data
Code:
print(df.index)
print(df.columns)
print(df.values)
Output:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')
[[ 0.06911332 -1.26061399 -0.84179936 -2.47316502]
 [ 0.12967295 0.07787973 -1.75764971 2.09501184]
 [ 0.37213615 0.76353566 0.71938772 0.04827239]
 [ 1.29285614 -0.85401117 0.0819325 -0.53471385]
 [ 0.08656817 -0.35154148 -1.1345272 -0.84750345]
 [ 0.96288654 1.71198865 -0.46576693 -0.27800489]]
DataFrame.describe() provides a quick statistical summary of the data
Code:
print(df.describe())
Output:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.485539 0.014540 -0.566404 -0.331684
std 0.519944 1.091856 0.883335 1.478468
min 0.069113 -1.260614 -1.757650 -2.473165
25% 0.097344 -0.728394 -1.061345 -0.769306
50% 0.250905 -0.136831 -0.653783 -0.406359
75% 0.815199 0.592122 -0.054992 -0.033297
max 1.292856 1.711989 0.719388 2.095012
Transpose the data
Code:
print(df.T)
Output:
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 0.069113 0.129673 0.372136 1.292856 0.086568 0.962887
B -1.260614 0.077880 0.763536 -0.854011 -0.351541 1.711989
C -0.841799 -1.757650 0.719388 0.081932 -1.134527 -0.465767
D -2.473165 2.095012 0.048272 -0.534714 -0.847503 -0.278005
Sort along an axis (here by column labels, in descending order)
Code:
print(df.sort_index(axis=1,ascending=False))
Output:
D C B A
2013-01-01 -2.473165 -0.841799 -1.260614 0.069113
2013-01-02 2.095012 -1.757650 0.077880 0.129673
2013-01-03 0.048272 0.719388 0.763536 0.372136
2013-01-04 -0.534714 0.081932 -0.854011 1.292856
2013-01-05 -0.847503 -1.134527 -0.351541 0.086568
2013-01-06 -0.278005 -0.465767 1.711989 0.962887
Sort by the values of a column
Code:
print(df.sort_values(by='B'))
Output:
A B C D
2013-01-01 0.069113 -1.260614 -0.841799 -2.473165
2013-01-04 1.292856 -0.854011 0.081932 -0.534714
2013-01-05 0.086568 -0.351541 -1.134527 -0.847503
2013-01-02 0.129673 0.077880 -1.757650 2.095012
2013-01-03 0.372136 0.763536 0.719388 0.048272
2013-01-06 0.962887 1.711989 -0.465767 -0.278005
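Selecting data
Note: Although standard Python and NumPy expressions for selecting and setting data are intuitive and convenient for interactive work, for production code we recommend the pandas data access methods: .at, .iat, .loc, and .iloc. The older .ix indexer has been removed from pandas and should not be used.
Getting data
Select a single column, which returns a Series. This is equivalent to df.A
Code:
df['A']
Output: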
2013-01-01 0.069113
2013-01-02 0.129673
2013-01-03 0.372136
2013-01-04 1.292856
2013-01-05 0.086568
2013-01-06 0.962887
Freq: D, Name: A, dtype: float64
Use [] to slice rows, either by position or by index label
Code:
print(df[0:3])
print(df['20130102':'20130104'])
Output:
A B C D
2013-01-01 0.069113 -1.260614 -0.841799 -2.473165
2013-01-02 0.129673 0.077880 -1.757650 2.095012
2013-01-03 0.372136 0.763536 0.719388 0.048272
A B C D
2013-01-02 0.129673 0.077880 -1.757650 2.095012
2013-01-03 0.372136 0.763536 0.719388 0.048272
2013-01-04 1.292856 -0.854011 0.081932 -0.534714
Selection by label
Get a single row by its label
Code:
df.loc[dates[0]]
Output:
A 0.069113
B -1.260614
C -0.841799
D -2.473165
Name: 2013-01-01 00:00:00, dtype: float64
Select multiple columns by label
Code:
print(df.loc[:,['A','B']])
Output:
A B
2013-01-01 0.069113 -1.260614
2013-01-02 0.129673 0.077880
2013-01-03 0.372136 0.763536
2013-01-04 1.292856 -0.854011
2013-01-05 0.086568 -0.351541
2013-01-06 0.962887 1.711989
When slicing by label, both endpoints are included
Code:
print(df.loc['20130102':'20130104',['A','B']])
Output:
A B
2013-01-02 0.129673 0.077880
2013-01-03 0.372136 0.763536
2013-01-04 1.292856 -0.854011
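The difference between label-based and position-based slicing is easy to trip over. The following is a minimal sketch on a small hypothetical DataFrame (the name frame and its values are made up for illustration): a .loc slice includes both endpoints, while an .iloc slice excludes the stop position, as in standard Python.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: 6 daily rows, 4 columns
dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label slice: both '20130102' and '20130104' are included
by_label = frame.loc["20130102":"20130104"]

# Position slice: row 1 and row 2 only, the stop position 3 is excluded
by_position = frame.iloc[1:3]

print(len(by_label), len(by_position))
```

When a fixed number of rows is needed regardless of the labels, .iloc is usually the safer choice; when the selection is defined by dates or names, .loc tends to read more clearly.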
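Selecting a single row label reduces the dimensionality of the returned object (a Series instead of a DataFrame)
Code:
df.loc['20130102',['A','B']]
Output: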
A 0.129673
B 0.077880
Name: 2013-01-02 00:00:00, dtype: float64
Get a scalar value
Code:
df.loc[dates[0],'A']
Output:
0.0691133191542612
Get a scalar value quickly (equivalent to the previous method)
Code:
df.at[dates[0],'A']
Output:
0.0691133191542612
Selection by position
Select a row by its integer position
Code:
df.iloc[3]
Output:
A 1.292856
B -0.854011
C 0.081932
D -0.534714
Name: 2013-01-04 00:00:00, dtype: float64
Slice by integers, similar to standard Python and NumPy
Code:
print(df.iloc[3:5,0:2])
Output:
A B
2013-01-04 1.292856 -0.854011
2013-01-05 0.086568 -0.351541
Select by lists of integer positions, similar to NumPy
Code:
print(df.iloc[[1,2,4],[0,2]])
Output:
A C
2013-01-02 0.129673 -1.757650
2013-01-03 0.372136 0.719388
2013-01-05 0.086568 -1.134527
Slice rows explicitly
Code:
print(df.iloc[1:3,:])
Output:
A B C D
2013-01-02 0.129673 0.077880 -1.757650 2.095012
2013-01-03 0.372136 0.763536 0.719388 0.048272
Slice columns explicitly
Code:
print(df.iloc[:,1:3])
Output:
B C
2013-01-01 -1.260614 -0.841799
2013-01-02 0.077880 -1.757650
2013-01-03 0.763536 0.719388
2013-01-04 -0.854011 0.081932
2013-01-05 -0.351541 -1.134527
2013-01-06 1.711989 -0.465767
Get a scalar value by position
Code:
df.iloc[1,1]
Output:
0.07787972504106509
Get a scalar value by position quickly (equivalent to the previous method)
Code:
df.iat[1,1]
Output:
0.07787972504106509
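Boolean indexing
Use the values of a single column to select rows. In this run every value in column A is positive, so all rows are returned.
Code:
print(df[df.A>0])
Output: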
A B C D
2013-01-01 0.069113 -1.260614 -0.841799 -2.473165
2013-01-02 0.129673 0.077880 -1.757650 2.095012
2013-01-03 0.372136 0.763536 0.719388 0.048272
2013-01-04 1.292856 -0.854011 0.081932 -0.534714
2013-01-05 0.086568 -0.351541 -1.134527 -0.847503
2013-01-06 0.962887 1.711989 -0.465767 -0.278005
Select the values of a DataFrame that meet a condition; values that do not meet it are shown as NaN
Code:
print(df[df>0])
Output:
A B C D
2013-01-01 0.069113 NaN NaN NaN
2013-01-02 0.129673 0.077880 NaN 2.095012
2013-01-03 0.372136 0.763536 0.719388 0.048272
2013-01-04 1.292856 NaN 0.081932 NaN
2013-01-05 0.086568 NaN NaN NaN
2013-01-06 0.962887 1.711989 NaN NaN
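Boolean masks can also be combined. The sketch below uses a small hypothetical DataFrame (frame and its values are assumptions for illustration) and shows that conditions are joined with & and |, with each condition wrapped in parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical example data
frame = pd.DataFrame({"A": [1, -2, 3, -4], "B": [10, 20, -30, -40]})

# Rows where both columns are positive
both = frame[(frame["A"] > 0) & (frame["B"] > 0)]

# Rows where at least one column is positive
either = frame[(frame["A"] > 0) | (frame["B"] > 0)]

print(both.index.tolist())
print(either.index.tolist())
```

Python's and / or keywords do not work element-wise on a Series, which is why the bitwise operators are used here.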
Use the isin() method to filter data
Code:
df2=df.copy()
df2['E']=['one','one','two','three','four','three']
print(df2)
print(df2[df2['E'].isin(['two','four'])])
Output:
A B C D E
2013-01-01 0.069113 -1.260614 -0.841799 -2.473165 one
2013-01-02 0.129673 0.077880 -1.757650 2.095012 one
2013-01-03 0.372136 0.763536 0.719388 0.048272 two
2013-01-04 1.292856 -0.854011 0.081932 -0.534714 three
2013-01-05 0.086568 -0.351541 -1.134527 -0.847503 four
2013-01-06 0.962887 1.711989 -0.465767 -0.278005 three
A B C D E
2013-01-03 0.372136 0.763536 0.719388 0.048272 two
2013-01-05 0.086568 -0.351541 -1.134527 -0.847503 four
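Setting data
Setting a new column automatically aligns the data by index. Here s1 starts on 2013-01-02, so the row for 2013-01-01 receives NaN in column F and the value for 2013-01-07 is dropped.
Code:
s1=pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6))
df['F']=s1
s1
Output: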
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
Set a value by label
Code:
df.at[dates[0],'A']=0
print(df)
Output:
A B C D F
2013-01-01 0.000000 -1.260614 -0.841799 -2.473165 NaN
2013-01-02 0.129673 0.077880 -1.757650 2.095012 1.0
2013-01-03 0.372136 0.763536 0.719388 0.048272 2.0
2013-01-04 1.292856 -0.854011 0.081932 -0.534714 3.0
2013-01-05 0.086568 -0.351541 -1.134527 -0.847503 4.0
2013-01-06 0.962887 1.711989 -0.465767 -0.278005 5.0
Set a value by position
Code:
df.iat[0,1]=0
print(df)
Output:
A B C D F
2013-01-01 0.000000 0.000000 -0.841799 -2.473165 NaN
2013-01-02 0.129673 0.077880 -1.757650 2.095012 1.0
2013-01-03 0.372136 0.763536 0.719388 0.048272 2.0
2013-01-04 1.292856 -0.854011 0.081932 -0.534714 3.0
2013-01-05 0.086568 -0.351541 -1.134527 -0.847503 4.0
2013-01-06 0.962887 1.711989 -0.465767 -0.278005 5.0
Assign a whole column from a NumPy array
Code:
df.loc[:,'D']=np.array([5]*len(df))
print(df)
Output:
A B C D F
2013-01-01 0.000000 0.000000 -0.841799 5.0 NaN
2013-01-02 0.129673 0.077880 -1.757650 5.0 1.0
2013-01-03 0.372136 0.763536 0.719388 5.0 2.0
2013-01-04 1.292856 -0.854011 0.081932 5.0 3.0
2013-01-05 0.086568 -0.351541 -1.134527 5.0 4.0
2013-01-06 0.962887 1.711989 -0.465767 5.0 5.0
Assign while filtering (here every positive value is replaced by its negation)
Code:
df2=df.copy()
df2[df2>0]=-df2
print(df2)
Output:
A B C D F
2013-01-01 0.000000 0.000000 -0.841799 -5.0 NaN
2013-01-02 -0.129673 -0.077880 -1.757650 -5.0 -1.0
2013-01-03 -0.372136 -0.763536 -0.719388 -5.0 -2.0
2013-01-04 -1.292856 -0.854011 -0.081932 -5.0 -3.0
2013-01-05 -0.086568 -0.351541 -1.134527 -5.0 -4.0
2013-01-06 -0.962887 -1.711989 -0.465767 -5.0 -5.0
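Handling missing data
pandas primarily uses np.nan to represent missing data. By default, missing values are excluded from computations.
Reindexing lets you change, add, or delete the index on a specified axis. It returns a copy of the data.
Code:
df1=df.reindex(index=dates[0:4],columns=list(df.columns)+['E'])
df1.loc[dates[0]:dates[1],'E']=1
print(df1)
Output: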
A B C D F E
2013-01-01 0.000000 0.000000 -0.841799 5.0 NaN 1.0
2013-01-02 0.129673 0.077880 -1.757650 5.0 1.0 1.0
2013-01-03 0.372136 0.763536 0.719388 5.0 2.0 NaN
2013-01-04 1.292856 -0.854011 0.081932 5.0 3.0 NaN
Drop every row that contains missing data
Code:
print(df1.dropna(how='any'))
Output:
A B C D F E
2013-01-02 0.129673 0.07788 -1.75765 5.0 1.0 1.0
Fill in missing data
Code:
print(df1.fillna(value=5))
Output:
A B C D F E
2013-01-01 0.000000 0.000000 -0.841799 5.0 5.0 1.0
2013-01-02 0.129673 0.077880 -1.757650 5.0 1.0 1.0
2013-01-03 0.372136 0.763536 0.719388 5.0 2.0 5.0
2013-01-04 1.292856 -0.854011 0.081932 5.0 3.0 5.0
Get a boolean mask marking where the values are NaN
Code:
print(pd.isnull(df1))
Output:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
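dropna() and fillna() accept options beyond the ones shown above. As a minimal sketch on a hypothetical DataFrame (frame and its values are made up for illustration), how='all' drops only rows in which every value is missing, and passing a dictionary to fillna() fills each column with its own value.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with missing values
frame = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, np.nan, 6.0]})

# Drop only the rows where every column is NaN (row 1 here)
dropped = frame.dropna(how="all")

# Fill column A with 0.0 and column B with its own mean (NaN is skipped, so 6.0)
filled = frame.fillna({"A": 0.0, "B": frame["B"].mean()})

print(dropped)
print(filled)
```

Both calls return new objects and leave frame unchanged.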
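Basic operations
Basic statistics
These operations exclude missing data by default.
Compute a descriptive statistic (the mean of each column)
Code:
df.mean()
Output: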
A 0.474020
B 0.224642
C -0.566404
D 5.000000
F 3.000000
dtype: float64
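To make the default handling of missing values concrete, here is a small sketch with a hypothetical Series (the values are assumptions for illustration). NaN is skipped by default, and skipna=False makes the result NaN as soon as any value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one missing value
values = pd.Series([1.0, np.nan, 3.0])

mean_default = values.mean()             # NaN is skipped: (1 + 3) / 2
total = values.sum()                     # NaN is skipped: 1 + 3
n_valid = values.count()                 # counts only non-missing values
mean_strict = values.mean(skipna=False)  # NaN propagates

print(mean_default, total, n_valid, mean_strict)
```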
Perform the same operation along the other axis (the mean of each row)
Code:
df.mean(axis=1)
Output:
2013-01-01 1.039550
2013-01-02 0.889981
2013-01-03 1.771012
2013-01-04 1.704155
2013-01-05 1.520100
2013-01-06 2.441822
Freq: D, dtype: float64
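Operating on objects of different dimensionality requires alignment. pandas aligns on the index and automatically broadcasts along the specified dimension.
Code:
s=pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)
print(s)
print(df.sub(s,axis='index'))
Output: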
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
A B C D F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -0.627864 -0.236464 -0.280612 4.0 1.0
2013-01-04 -1.707144 -3.854011 -2.918068 2.0 0.0
2013-01-05 -4.913432 -5.351541 -6.134527 0.0 -1.0
2013-01-06 NaN NaN NaN NaN NaN
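Alignment also applies when two Series have different indexes. The following sketch uses two hypothetical Series (names and values are assumptions): the result covers the union of both indexes, labels present on only one side yield NaN, and add() with fill_value treats the missing side as the given value instead.

```python
import pandas as pd

# Hypothetical example data with partially overlapping indexes
a = pd.Series([1, 2, 3], index=list("abc"))
b = pd.Series([10, 20, 30], index=list("bcd"))

# Labels 'a' and 'd' exist on one side only, so they become NaN
total = a + b

# Treat a missing side as 0 instead of producing NaN
total_filled = a.add(b, fill_value=0)

print(total)
print(total_filled)
```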
Applying functions
Apply functions to the data (by default, one column at a time)
Code:
print(df.apply(np.cumsum))
df.apply(lambda x:x.max()-x.min())
Output:
A B C D F
2013-01-01 0.000000 0.000000 -0.841799 5.0 NaN
2013-01-02 0.129673 0.077880 -2.599449 10.0 1.0
2013-01-03 0.501809 0.841415 -1.880061 15.0 3.0
2013-01-04 1.794665 -0.012596 -1.798129 20.0 6.0
2013-01-05 1.881233 -0.364137 -2.932656 25.0 10.0
2013-01-06 2.844120 1.347851 -3.398423 30.0 15.0
A 1.292856
B 2.566000
C 2.477037
D 0.000000
F 4.000000
dtype: float64
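apply() works column by column unless told otherwise. As a small sketch on a hypothetical DataFrame (frame and its values are assumptions for illustration), passing axis=1 applies the same function to each row instead.

```python
import pandas as pd

# Hypothetical example data
frame = pd.DataFrame({"x": [1, 5], "y": [4, 2]})

# Default (axis=0): the function receives each column as a Series
col_range = frame.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_range = frame.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range)
print(row_range)
```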
Histogramming
Code:
s=pd.Series(np.random.randint(0,7,size=10))
print(s)
print(s.value_counts())
Output:
0 5
1 4
2 1
3 6
4 2
5 1
6 4
7 5
8 2
9 1
dtype: int32
1 3
5 2
4 2
2 2
6 1
Name: count, dtype: int64
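String methods
As the code below shows, a Series exposes a set of string processing methods through its str attribute, which makes it easy to operate on each element. Note that pattern matching in str generally uses regular expressions by default.
Code:
s=pd.Series(['A','B','C','Aaba','Baca',np.nan,'CABA','dog','cat'])
s.str.lower()
Output: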
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
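The regular-expression default can produce surprising matches. The sketch below uses a hypothetical Series (the values are assumptions): with str.contains(), the pattern 'a.b' treats the dot as "any character" unless regex=False is passed, and na=False turns missing values into False rather than NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
texts = pd.Series(["a.b", "axb", np.nan])

# Default: the pattern is a regular expression, so '.' matches any character
as_regex = texts.str.contains("a.b", na=False)

# regex=False: the pattern is matched literally
as_literal = texts.str.contains("a.b", regex=False, na=False)

print(as_regex.tolist())
print(as_literal.tolist())
```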
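Merging data
Concatenation
pandas provides various facilities for combining Series and DataFrame objects, with set logic for the indexes and relational-algebra functionality for join and merge operations.
Use concat() to concatenate pandas objects
Code:
df=pd.DataFrame(np.random.randn(10,4))
print(df)
pieces=[df[:3],df[3:7],df[7:]]
print(pd.concat(pieces))
Output: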
0 1 2 3
0 2.458179 -1.015589 0.213889 -0.751448
1 0.367231 -0.894997 -0.174994 -0.088047
2 1.778165 -2.184130 -0.137273 -1.098623
3 1.625942 0.327543 0.048892 0.045402
4 -0.321092 1.332430 -0.799614 0.460845
5 0.495226 0.377637 -0.981761 0.999542
6 0.325414 0.121565 1.271907 0.611660
7 -0.218155 0.699750 1.291248 -0.020842
8 -1.848206 0.120794 -1.332943 1.596437
9 -1.182888 0.252381 -0.725382 1.875988
0 1 2 3
0 2.458179 -1.015589 0.213889 -0.751448
1 0.367231 -0.894997 -0.174994 -0.088047
2 1.778165 -2.184130 -0.137273 -1.098623
3 1.625942 0.327543 0.048892 0.045402
4 -0.321092 1.332430 -0.799614 0.460845
5 0.495226 0.377637 -0.981761 0.999542
6 0.325414 0.121565 1.271907 0.611660
7 -0.218155 0.699750 1.291248 -0.020842
8 -1.848206 0.120794 -1.332943 1.596437
9 -1.182888 0.252381 -0.725382 1.875988
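Joins
SQL-style merges. When the key values are duplicated on both sides, every matching pair is combined, so two rows on each side produce four rows.
Code:
left=pd.DataFrame({'key':['foo','foo'],'lval':[1,2]})
right=pd.DataFrame({'key':['foo','foo'],'rval':[4,5]})
print(left)
print(right)
print(pd.merge(left,right,on='key'))
Output:
key lval
0 foo 1
1 foo 2
key rval
0 foo 4
1 foo 5
key lval rval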
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
When the keys are unique, each key matches exactly one row on the other side:
Code:
left=pd.DataFrame({'key':['foo','bar'],'lval':[1,2]})
right=pd.DataFrame({'key':['foo','bar'],'rval':[4,5]})
print(left)
print(right)
print(pd.merge(left,right,on='key'))
Output:
key lval
0 foo 1
1 bar 2
key rval
0 foo 4
1 bar 5
key lval rval
0 foo 1 4
1 bar 2 5
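merge() performs an inner join by default, so keys found on only one side are dropped. As a hedged sketch with hypothetical data (the key 'baz' and the variable names are assumptions for illustration), how='outer' keeps all keys and indicator=True adds a _merge column recording where each row came from.

```python
import pandas as pd

# Hypothetical example data: 'bar' exists only on the left, 'baz' only on the right
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

# Default inner join: only the shared key 'foo' survives
inner = pd.merge(left, right, on="key")

# Outer join: all keys are kept, missing values become NaN
outer = pd.merge(left, right, on="key", how="outer", indicator=True)

# Map each key to its origin: 'both', 'left_only', or 'right_only'
merge_source = dict(zip(outer["key"], outer["_merge"].astype(str)))

print(inner)
print(outer)
```

The _merge column is a convenient way to check for unmatched keys before relying on the joined result.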
Appending data (Append)
Append rows to a DataFrame
Code:
import pandas as pd
import numpy as np
# Create an example DataFrame
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
print("Original DataFrame:")
print(df)
# Select the 4th row (index 3)
s = df.iloc[3]
# Use pd.concat to append this row to the DataFrame
df = pd.concat([df, pd.DataFrame([s])], ignore_index=True)
print("\nDataFrame after appending:")
print(df)
Output:
Original DataFrame:
A B C D
0 0.087724 0.581745 0.632270 -1.216399
1 2.036586 -0.299526 -1.769757 0.586204
2 -0.345521 -0.814169 -1.160057 1.644311
3 -0.465290 0.108643 -0.585329 2.574058
4 -0.311185 -0.650010 0.448088 1.939568
5 -0.500311 0.320773 -0.751497 0.764619
6 0.680282 1.098069 -2.233506 0.203445
7 -0.311157 -1.404242 0.405042 -0.455882

DataFrame after appending:
A B C D
0 0.087724 0.581745 0.632270 -1.216399
1 2.036586 -0.299526 -1.769757 0.586204
2 -0.345521 -0.814169 -1.160057 1.644311
3 -0.465290 0.108643 -0.585329 2.574058
4 -0.311185 -0.650010 0.448088 1.939568
5 -0.500311 0.320773 -0.751497 0.764619
6 0.680282 1.098069 -2.233506 0.203445
7 -0.311157 -1.404242 0.405042 -0.455882
8 -0.465290 0.108643 -0.585329 2.574058
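Grouping
"Group by" refers to a process involving one or more of the following steps:
- Splitting: split the data into groups based on some criteria
- Applying: apply a function to each group independently
- Combining: combine the results into a single data structure
Code:
df=pd.DataFrame({'A':['foo','bar','foo','bar','foo','bar','foo','foo'],
                 'B':['one','one','two','three','two','two','one','three'],
                 'C':np.random.randn(8),
                 'D':np.random.randn(8)})
print(df)
Output: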
A B C D
0 foo one 1.431858 0.050144
1 bar one -1.919458 0.880757
2 foo two 0.292022 -0.222843
3 bar three 0.327120 0.267704
4 foo two 0.216566 0.439842
5 bar two -0.085984 0.284717
6 foo one -1.684079 1.090991
7 foo three -0.670694 0.381422
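Group by a column and sum each group. Because column B holds strings, sum() concatenates them; passing numeric_only=True to sum() would restrict the result to the numeric columns.
Code:
print(df.groupby('A').sum())
Output: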
B C D
A
bar onethreetwo -1.678323 1.433178
foo onetwotwoonethree -0.414327 1.739556
Grouping by multiple columns produces a hierarchical index
Code:
print(df.groupby(['A','B']).sum())
Output:
C D
A B
bar one -1.919458 0.880757
    three 0.327120 0.267704
    two -0.085984 0.284717
foo one -0.252221 1.141135
    three -0.670694 0.381422
    two 0.508588 0.216999
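The split, apply, and combine steps can be seen on a small deterministic example. This is a sketch with hypothetical data (frame and its values are assumptions for illustration): numeric_only=True keeps string columns out of the sum, and agg() applies several functions to each group at once.

```python
import pandas as pd

# Hypothetical example data
frame = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "two"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# Sum only the numeric columns of each group; the string column B is left out
sums = frame.groupby("A").sum(numeric_only=True)

# Apply several aggregations to column C of each group
stats = frame.groupby("A")["C"].agg(["mean", "max"])

print(sums)
print(stats)
```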
Reshaping
Stack
Code:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz','foo', 'foo', 'qux', 'qux'],['one', 'two', 'one', 'two','one', 'two', 'one', 'two']]))
print(tuples)
index=pd.MultiIndex.from_tuples(tuples,names=['first','second'])
df=pd.DataFrame(np.random.randn(8,2),index=index,columns=['A','B'])
df2 = df[:4]
print(df2)
Output:
[('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]
A B
first second
bar one 1.042058 0.474557
    two -0.066296 -0.102406
baz one -1.060188 -0.102398
    two -0.822575 -0.374684
The stack() method "compresses" a level of the DataFrame's columns into the index
Code:
stacked=df2.stack()
stacked
Output:
first second
bar one A 1.042058
        B 0.474557
    two A -0.066296
        B -0.102406
baz one A -1.060188
        B -0.102398
    two A -0.822575
        B -0.374684
dtype: float64
For a stacked DataFrame or Series (one with a MultiIndex as its index), the inverse of stack() is unstack(), which by default unstacks the last level
Code:
print(stacked.unstack())
print(stacked.unstack(1))
print(stacked.unstack(0))
Output:
A B
first second
bar one 1.042058 0.474557
    two -0.066296 -0.102406
baz one -1.060188 -0.102398
    two -0.822575 -0.374684
second one two
first
bar A 1.042058 -0.066296
    B 0.474557 -0.102406
baz A -1.060188 -0.822575
    B -0.102398 -0.374684
first bar baz
second
one A 1.042058 -1.060188
    B 0.474557 -0.102398
two A -0.066296 -0.822575
    B -0.102406 -0.374684
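That stack() and unstack() are inverse operations can be checked on a deterministic example. The sketch below uses a hypothetical DataFrame filled with the integers 0 to 7 (the names and values are assumptions for illustration) so that individual cells are easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with a two-level row index
index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
frame = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=["A", "B"])

# stack() moves the column labels into a third index level, giving a Series
stacked = frame.stack()

# unstack() moves the last index level back into the columns
roundtrip = stacked.unstack()

print(stacked.loc[("bar", "one", "B")])
print(roundtrip.shape)
```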
Pivot tables
Code:
df=pd.DataFrame({'A':['one','one','two','three']*3,
                 'B':['A','B','C']*4,
                 'C':['foo','foo','foo','bar','bar','bar']*2,
                 'D':np.random.randn(12),
                 'E':np.random.randn(12)})
print(df)
Output:
A B C D E
0 one A foo -0.843464 -1.096178
1 one B foo 0.397729 0.829569
2 two C foo -0.063220 0.085359
3 three A bar 0.751889 0.204106
4 one B bar 0.560714 0.000916
5 one C bar 0.369430 -0.764335
6 two A foo 0.834749 -0.701209
7 three B foo 0.166880 -0.773311
8 one C foo -1.575741 0.677317
9 one A bar 0.588925 -0.357284
10 two B bar -0.019995 0.651831
11 three C bar 1.569239 0.192160
We can easily produce a pivot table from this data. Combinations of A, B, and C that do not occur in the data appear as NaN.
Code:
print(pd.pivot_table(df,values='D',index=['A','B'],columns=['C']))
Output:
C bar foo
A B
one A 0.588925 -0.843464
    B 0.560714 0.397729
    C 0.369430 -1.575741
three A 0.751889 NaN
      B NaN 0.166880
      C 1.569239 NaN
two A NaN 0.834749
    B -0.019995 NaN
    C NaN -0.063220
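pivot_table() aggregates when several rows fall into the same cell and leaves NaN where no rows do. As a minimal sketch with hypothetical data (frame and its values are assumptions for illustration), aggfunc selects the aggregation and fill_value replaces the empty cells.

```python
import pandas as pd

# Hypothetical example data: two rows share the cell (A='two', C='foo'),
# and no row falls into the cell (A='two', C='bar')
frame = pd.DataFrame({
    "A": ["one", "one", "two", "two"],
    "C": ["foo", "bar", "foo", "foo"],
    "D": [1.0, 2.0, 3.0, 5.0],
})

table = pd.pivot_table(
    frame, values="D", index="A", columns="C", aggfunc="mean", fill_value=0
)

print(table)
```

The mean is already the default aggregation; it is spelled out here only to make the behavior explicit.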
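Time series
pandas has simple, powerful, and efficient functionality for resampling during frequency conversion (for example, converting second-level data into 5-minute data). This is very common in financial applications, but is not limited to them.
Code:
rng=pd.date_range('1/1/2012',periods=100,freq='s')
ts=pd.Series(np.random.randint(0,500,len(rng)),index=rng)
ts.resample('5min').sum()
Output: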
2012-01-01 25049
Freq: 5min, dtype: int32
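Because the example above uses random data and only 100 seconds, everything lands in a single 5-minute bin. The following sketch uses hypothetical constant data (an assumption for illustration) so the effect of resampling is easy to verify: 120 one-second samples of value 1 are summed into two 1-minute bins of 60 each.

```python
import pandas as pd

# Hypothetical example data: one sample per second for two minutes
rng = pd.date_range("2012-01-01", periods=120, freq="s")
ts = pd.Series(1, index=rng)

# Downsample to 1-minute bins and sum the values inside each bin
per_minute = ts.resample("1min").sum()

print(per_minute)
```

Other aggregations such as mean() or max() can be used in place of sum() in the same way.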
Time zone representation
Code:
rng=pd.date_range('3/6/2012 00:00',periods=5,freq='D')
ts=pd.Series(np.random.randn(len(rng)),rng)
print(ts)
ts_utc=ts.tz_localize('UTC')
ts_utc
Output:
2012-03-06 0.337857
2012-03-07 -0.814280
2012-03-08 -0.202582
2012-03-09 1.883021
2012-03-10 0.914846
Freq: D, dtype: float64
2012-03-06 00:00:00+00:00 0.337857
2012-03-07 00:00:00+00:00 -0.814280
2012-03-08 00:00:00+00:00 -0.202582
2012-03-09 00:00:00+00:00 1.883021
2012-03-10 00:00:00+00:00 0.914846
Freq: D, dtype: float64
Convert to another time zone
Code:
ts_utc.tz_convert('US/Eastern')
Output:
2012-03-05 19:00:00-05:00 0.337857
2012-03-06 19:00:00-05:00 -0.814280
2012-03-07 19:00:00-05:00 -0.202582
2012-03-08 19:00:00-05:00 1.883021
2012-03-09 19:00:00-05:00 0.914846
Freq: D, dtype: float64
Convert between time span representations (timestamps and periods). The month-end frequency alias is 'ME' in pandas 2.2 and later; older versions use 'M', which is now deprecated.
Code:
rng=pd.date_range('1/1/2012',periods=5,freq='ME')
ts=pd.Series(np.random.randn(len(rng)),index=rng)
print(ts)
ps=ts.to_period()
print(ps)
ps.to_timestamp()
Output:
2012-01-31 2.189840
2012-02-29 -0.480638
2012-03-31 1.682603
2012-04-30 0.729958
2012-05-31 -1.207027
Freq: ME, dtype: float64
2012-01 2.189840
2012-02 -0.480638
2012-03 1.682603
2012-04 0.729958
2012-05 -1.207027
Freq: M, dtype: float64
2012-01-01 2.189840
2012-02-01 -0.480638
2012-03-01 1.682603
2012-04-01 0.729958
2012-05-01 -1.207027
Freq: MS, dtype: float64
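Categorical data
pandas can include categorical data in a DataFrame.
Code:
df=pd.DataFrame({'id':[1,2,3,4,5,6],'raw_grade':['a','b','b','a','a','e']})
Convert the raw grades to a categorical data type
Code:
df['grade']=df['raw_grade'].astype('category')
df['grade']
Output:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']
Rename the categories to more meaningful names. In current pandas this is done with rename_categories(), which returns a new Series; assigning directly to Series.cat.categories is no longer supported.
Code:
df['grade']=df['grade'].cat.rename_categories(['very good','good','very bad'])
Reorder the categories and simultaneously add the missing ones (the methods under Series.cat return a new Series by default)
Code:
df['grade']=df['grade'].cat.set_categories(['very bad','bad','medium','good','very good'])
df['grade']
Output:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']
Sorting follows the order of the categories, not lexical order
Code:
print(df.sort_values(by='grade'))
Output:
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
Grouping by a categorical column with observed=False also shows the empty categories
Code:
df.groupby('grade',observed=False).size()
Output:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64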
Plotting
Plotting relies on matplotlib, which must be installed.
Code:
ts=pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2000',periods=1000))
ts=ts.cumsum()
ts.plot()
Output:
<Axes: >
On a DataFrame, plot() is a convenient way to plot every column at once
Code:
df=pd.DataFrame(np.random.randn(1000,4),index=ts.index,columns=['A','B','C','D'])
df=df.cumsum()
df.plot()
Output:
<Axes: >
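Getting data in and out
CSV
Write the data to a csv file
Code:
df.to_csv('foo.csv')
Read the data back from the csv file. The index was written as the first column and is not restored automatically, so it comes back as a regular column named Unnamed: 0; passing index_col=0 to read_csv would restore it as the index.
Code:
df=pd.read_csv('foo.csv')
print(df[:10])
Output: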
Unnamed: 0 A B C D
0 2000-01-01 -1.209412 -0.049535 0.380112 -0.109876
1 2000-01-02 -1.196146 -1.215185 -0.009250 1.115329
2 2000-01-03 -3.395013 -0.871985 -1.713391 0.439392
3 2000-01-04 -3.272589 0.451864 -1.860023 -0.933256
4 2000-01-05 -5.055778 0.073810 0.064276 -0.367640
5 2000-01-06 -5.676100 0.716393 1.457514 0.250130
6 2000-01-07 -6.043756 0.051985 2.045378 -0.782177
7 2000-01-08 -5.045143 -1.418815 3.013393 -0.286026
8 2000-01-09 -4.484707 0.327100 3.622895 1.287296
9 2000-01-10 -4.885518 0.525415 4.710187 1.067082
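To get the original index back instead of an Unnamed: 0 column, the index column can be named when reading. The sketch below is an illustration under assumptions: it uses a small hypothetical DataFrame and an in-memory io.StringIO buffer in place of foo.csv so that it runs without touching the file system.

```python
import io

import pandas as pd

# Hypothetical example data with a datetime index
frame = pd.DataFrame(
    {"A": [1.5, 2.5]}, index=pd.date_range("2000-01-01", periods=2)
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)

# index_col=0 uses the first column as the index; parse_dates=True parses it as dates
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)

print(restored)
```

The same two arguments work with a file path such as 'foo.csv'.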
HDF5
Write to an HDF5 store (requires the PyTables library: pip3 install tables)
Code:
df.to_hdf('foo.h5',key='df')
Read from the HDF5 store
Code:
df=pd.read_hdf('foo.h5','df')
print(df[:10])
Output:
Unnamed: 0 A B C D
0 2000-01-01 -1.209412 -0.049535 0.380112 -0.109876
1 2000-01-02 -1.196146 -1.215185 -0.009250 1.115329
2 2000-01-03 -3.395013 -0.871985 -1.713391 0.439392
3 2000-01-04 -3.272589 0.451864 -1.860023 -0.933256
4 2000-01-05 -5.055778 0.073810 0.064276 -0.367640
5 2000-01-06 -5.676100 0.716393 1.457514 0.250130
6 2000-01-07 -6.043756 0.051985 2.045378 -0.782177
7 2000-01-08 -5.045143 -1.418815 3.013393 -0.286026
8 2000-01-09 -4.484707 0.327100 3.622895 1.287296
9 2000-01-10 -4.885518 0.525415 4.710187 1.067082
Excel
Write to an Excel file (requires the openpyxl library: pip3 install openpyxl)
Code:
df.to_excel('foo.xlsx',sheet_name='Sheet01')
Read from the Excel file
Code:
print(pd.read_excel('foo.xlsx', 'Sheet01',index_col=None, na_values=['NA'])[:10])
Output:
Unnamed: 0.1 Unnamed: 0 A B C D
0 0 2000-01-01 -1.209412 -0.049535 0.380112 -0.109876
1 1 2000-01-02 -1.196146 -1.215185 -0.009250 1.115329
2 2 2000-01-03 -3.395013 -0.871985 -1.713391 0.439392
3 3 2000-01-04 -3.272589 0.451864 -1.860023 -0.933256
4 4 2000-01-05 -5.055778 0.073810 0.064276 -0.367640
5 5 2000-01-06 -5.676100 0.716393 1.457514 0.250130
6 6 2000-01-07 -6.043756 0.051985 2.045378 -0.782177
7 7 2000-01-08 -5.045143 -1.418815 3.013393 -0.286026
8 8 2000-01-09 -4.484707 0.327100 3.622895 1.287296
9 9 2000-01-10 -4.885518 0.525415 4.710187 1.067082