pandas: Time Data

1. Timestamps: Timestamp()

The argument can be a time written in many different formats; Timestamp() converts it into a timestamp.

# Imports used by the examples in this post
import datetime
import numpy as np
import pandas as pd

time1 = pd.Timestamp('2019/7/13')
time2 = pd.Timestamp('13/7/2019 13:05')
time3 = pd.Timestamp('2019-7-13')
time4 = pd.Timestamp('2019 7 13 13:05')
time5 = pd.Timestamp('2019 July 13 13')
time6 = pd.Timestamp(datetime.datetime(2019,7,13,13,5))
print(datetime.datetime.now(),type(datetime.datetime.now()))
print(time1,type(time1))
print(time2)
print(time3)
print(time4)
print(time5)
print(time6)
# 2019-07-25 14:33:20.482696 <class 'datetime.datetime'>
# 2019-07-13 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
# 2019-07-13 13:05:00
# 2019-07-13 00:00:00
# 2019-07-13 13:05:00
# 2019-07-13 13:00:00
# 2019-07-13 13:05:00

Timestamp()
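As a small supplementary sketch of the point above (different input forms end up as the same Timestamp), the snippet below builds one moment in three ways and compares them. The variable names are illustrative only, and the keyword-argument form and day_name() are not used in the original examples.

```python
import datetime

import pandas as pd

# Three ways to describe the same moment (2019-07-13 13:05).
from_string = pd.Timestamp('2019-07-13 13:05')
from_datetime = pd.Timestamp(datetime.datetime(2019, 7, 13, 13, 5))
from_fields = pd.Timestamp(year=2019, month=7, day=13, hour=13, minute=5)

# All three compare equal once converted.
print(from_string == from_datetime == from_fields)

# A Timestamp exposes its components as attributes.
print(from_string.year, from_string.month, from_string.day, from_string.hour)
print(from_string.day_name())
```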

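2. to_datetime(): timestamps and time series

For a single time, to_datetime() is used the same way as Timestamp(): it converts a time argument given in any of various formats into a timestamp.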

time1 = pd.to_datetime('2019/7/13')
time2 = pd.to_datetime('13/7/2019 13:05')
time3 = pd.to_datetime(datetime.datetime(2019,7,13,13,5))
print(datetime.datetime.now(),type(datetime.datetime.now()))
print(time1,type(time1))
print(time2)
print(time3)
# 2019-07-23 22:33:56.650290 <class 'datetime.datetime'>
# 2019-07-13 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
# 2019-07-13 13:05:00
# 2019-07-13 13:05:00

to_datetime() with a single time

Timestamp() cannot handle several times at once, whereas to_datetime() converts them into a time series (a DatetimeIndex).

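timelist = ['2019/7/13','13/7/2019 13:05',datetime.datetime(2019,7,13,13,5)]
# Note: pandas 2.0 and later infer a single format from the first string, so a
# mixed-format list like this one may need format='mixed' there.
t = pd.to_datetime(timelist)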
print(t)
print(type(t))
# DatetimeIndex(['2019-07-13 00:00:00', '2019-07-13 13:05:00','2019-07-13 13:05:00'],
#               dtype='datetime64[ns]', freq=None)
# <class 'pandas.core.indexes.datetimes.DatetimeIndex'>

to_datetime() with a time series
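When the input format is known, or some entries may be invalid, to_datetime() can be steered with its format and errors parameters. The following is a minimal sketch; the sample strings are made up for illustration and do not come from the original examples.

```python
import pandas as pd

# An explicit format removes the day-first / month-first ambiguity.
single = pd.to_datetime('13/7/2019 13:05', format='%d/%m/%Y %H:%M')
print(single)

# errors='coerce' turns unparseable entries into NaT instead of raising.
raw = ['2019-07-13', '2019-07-14', 'not a date']  # hypothetical sample input
parsed = pd.to_datetime(raw, errors='coerce')
print(parsed)
print(parsed.isna())
```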

3. DatetimeIndex time series

A sequence of timestamps; individual values can be retrieved by position.

t1 = pd.DatetimeIndex(['2019/7/13','13/7/2019 13:05',datetime.datetime(2019,7,13,18,5)])
print(t1,type(t1))
print(t1[1])
# DatetimeIndex(['2019-07-13 00:00:00', '2019-07-13 13:05:00',
#                '2019-07-13 18:05:00'],
#               dtype='datetime64[ns]', freq=None) <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
# 2019-07-13 13:05:00

DatetimeIndex
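Beyond fetching a single element, a DatetimeIndex also supports slices and element-wise component attributes. A short sketch, assuming ISO-formatted input strings (chosen here to avoid any parsing ambiguity):

```python
import pandas as pd

idx = pd.DatetimeIndex(['2019-07-13 00:00', '2019-07-13 13:05', '2019-07-14 18:05'])

print(idx[0])      # a single position gives a Timestamp
print(idx[-1])     # negative positions count from the end
print(idx[:2])     # a slice gives another DatetimeIndex

# Component attributes are computed for every element at once.
print(idx.hour)
print(idx.day)
```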

4. TimeSeries

A Series whose index is a DatetimeIndex.

v = ['a','b','c']
t = pd.DatetimeIndex(['2019/7/13','13/7/2019 13:05',datetime.datetime(2019,7,13,18,5)])
s = pd.Series(v,index = t,name='s')
print(s)
# 2019-07-13 00:00:00    a
# 2019-07-13 13:05:00    b
# 2019-07-13 18:05:00    c
# Name: s, dtype: object

TimeSeries

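Changing the frequency: asfreq('new_freq', method)

asfreq() rebuilds the index of the original time series at a new frequency. If the new index contains labels that did not exist before, method=None (the default) leaves their values as NaN, while ffill and bfill fill them with the preceding and the following value, respectively.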

t = pd.date_range('2019/1/3','2019/1/5')
arr = pd.Series(np.arange(3),index=t)
print(arr)
print('----------------------------')
print(arr.asfreq('8H'))
print('----------------------------')
print(arr.asfreq('8H',method='bfill'))
# 2019-01-03    0
# 2019-01-04    1
# 2019-01-05    2
# Freq: D, dtype: int32
# ----------------------------
# 2019-01-03 00:00:00    0.0
# 2019-01-03 08:00:00    NaN
# 2019-01-03 16:00:00    NaN
# 2019-01-04 00:00:00    1.0
# 2019-01-04 08:00:00    NaN
# 2019-01-04 16:00:00    NaN
# 2019-01-05 00:00:00    2.0
# Freq: 8H, dtype: float64
# ----------------------------
# 2019-01-03 00:00:00    0
# 2019-01-03 08:00:00    1
# 2019-01-03 16:00:00    1
# 2019-01-04 00:00:00    1
# 2019-01-04 08:00:00    2
# 2019-01-04 16:00:00    2
# 2019-01-05 00:00:00    2
# Freq: 8H, dtype: int32

asfreq() on a time series
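The fill options of asfreq() can be compared side by side. This is only a sketch: it uses a 12-hour frequency written in lowercase ('12h', the spelling newer pandas releases prefer) and also shows fill_value, a parameter the examples above do not use.

```python
import numpy as np
import pandas as pd

days = pd.date_range('2019-01-03', '2019-01-05')     # daily index, 3 entries
s = pd.Series(np.arange(3), index=days)

plain = s.asfreq('12h')                      # new 12:00 labels become NaN
forward = s.asfreq('12h', method='ffill')    # copy the previous value
constant = s.asfreq('12h', fill_value=-1)    # use a fixed value for new labels

print(pd.DataFrame({'plain': plain, 'ffill': forward, 'fill_value': constant}))
```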

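Shifting: shift(n, freq, fill_value)

If only n is given, the index stays fixed and the values move: a positive n moves them later (down), a negative n moves them earlier (up). Slots left empty by the move are filled with fill_value, which defaults to NaN.

If both n and freq are given, the index labels are shifted by n times freq (added or subtracted) while the values stay unchanged.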

t = pd.date_range('2019/1/3','2019/1/5')
arr = pd.Series([15,16,14],index=t)
print(arr)
print('-----------------------')
print(arr.shift(1,fill_value='haha'))  # after the shift the first label has no value, so it is filled with 'haha'
print('-----------------------')
print(arr.shift(-1))  # after the shift the last label has no value; it defaults to NaN
# 2019-01-03    15
# 2019-01-04    16
# 2019-01-05    14
# Freq: D, dtype: int64
# -----------------------
# 2019-01-03    haha
# 2019-01-04      15
# 2019-01-05      16
# Freq: D, dtype: object
# -----------------------
# 2019-01-03    16.0
# 2019-01-04    14.0
# 2019-01-05     NaN
# Freq: D, dtype: float64

shift(): moving the values

t = pd.date_range('2019/1/3','2019/1/5')
arr = pd.Series([15,16,14],index=t)
print(arr)
print('-----------------------')
print(arr.shift(2,freq='D'))
print('-----------------------')
print(arr.shift(-2,freq='H'))
# 2019-01-03    15
# 2019-01-04    16
# 2019-01-05    14
# Freq: D, dtype: int64
# -----------------------
# 2019-01-05    15
# 2019-01-06    16
# 2019-01-07    14
# Freq: D, dtype: int64
# -----------------------
# 2019-01-02 22:00:00    15
# 2019-01-03 22:00:00    16
# 2019-01-04 22:00:00    14
# Freq: D, dtype: int64

shift(): moving the index
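To make the two modes of shift() easier to compare, the sketch below applies both to the same series and then shows a common use of a value shift, the day-over-day difference. The variable names are illustrative.

```python
import pandas as pd

days = pd.date_range('2019-01-03', periods=3, freq='D')
s = pd.Series([15, 16, 14], index=days)

moved_values = s.shift(1)            # index unchanged, values slide down one row
moved_index = s.shift(1, freq='D')   # values unchanged, every label moves one day later

print(moved_values)
print(moved_index)

# Typical use of a value shift: change relative to the previous day.
daily_change = s - s.shift(1)
print(daily_change)
```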

5. date_range() and bdate_range()

Both generate a range of times as a DatetimeIndex: date_range() generates calendar days and bdate_range() generates business days. The examples below use date_range().

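Usage: date_range(start,end,periods,freq,closed,normalize,name,tz)

start: the starting point

end: the ending point

periods: the number of timestamps to generate

freq: the frequency; defaults to D (calendar day). Other values include Y, M, B, H, T/MIN, S, L and U for year, month, business day, hour, minute, second, millisecond and microsecond. Older pandas versions accept these case-insensitively; newer releases (2.2 and later) prefer the spellings h, min, s, ms, us, ME and YE.

　　Anchored values: W-MON means weekly on the given weekday (here Monday); WOM-2MON means the given weekday in the given week of each month (here the second Monday).

closed: defaults to None, which includes both the start and the end point; left includes only the start point and right includes only the end point. In pandas 1.4 and later this parameter is named inclusive, and closed was removed in 2.0.

normalize: defaults to False; True sets the time of day to 00:00:00

name: the name of the resulting index

tz: the time zone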

t1 = pd.date_range('2000/1/5','2003/1/5',freq='y')
t2 = pd.date_range('2000/1/1','2000/3/5',freq='m')
t3 = pd.date_range('2000/1/1','2000/1/10',periods=3)
t4 = pd.date_range('2000/1/1 12','2000/1/1 15',freq='h')
t5 = pd.date_range('2000/1/1 12','2000/1/1 15',freq='h',closed='left',name='t3')
t6 = pd.date_range(start = '2000/1/1 11:00:00',periods=3)
t7 = pd.date_range(end = '2000/1/1 12:00:00',periods=3)
print(t1)
print(t2)
print(t3)
print(t4)
print(t5)
print(t6)
print(t7)
# DatetimeIndex(['2000-12-31', '2001-12-31', '2002-12-31'], dtype='datetime64[ns]', freq='A-DEC')
# DatetimeIndex(['2000-01-31', '2000-02-29'], dtype='datetime64[ns]', freq='M')
# DatetimeIndex(['2000-01-01 00:00:00', '2000-01-05 12:00:00', '2000-01-10 00:00:00'],
#               dtype='datetime64[ns]', freq=None)
# DatetimeIndex(['2000-01-01 12:00:00', '2000-01-01 13:00:00', '2000-01-01 14:00:00', '2000-01-01 15:00:00'],
#               dtype='datetime64[ns]', freq='H')
# DatetimeIndex(['2000-01-01 12:00:00', '2000-01-01 13:00:00', '2000-01-01 14:00:00'],
#               dtype='datetime64[ns]', name='t3', freq='H')
# DatetimeIndex(['2000-01-01 11:00:00', '2000-01-02 11:00:00', '2000-01-03 11:00:00'],
#               dtype='datetime64[ns]', freq='D')
# DatetimeIndex(['1999-12-30 12:00:00', '1999-12-31 12:00:00', '2000-01-01 12:00:00'],
#               dtype='datetime64[ns]', freq='D')

date_range()
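Since the examples above only use date_range(), here is a brief sketch contrasting it with bdate_range() and illustrating the anchored frequencies W-MON and WOM-2MON. The dates are chosen for illustration: 13 and 14 July 2019 fall on a weekend.

```python
import pandas as pd

# 2019-07-13 and 2019-07-14 fall on a weekend.
calendar_days = pd.date_range('2019-07-12', '2019-07-16')
business_days = pd.bdate_range('2019-07-12', '2019-07-16')
print(len(calendar_days), len(business_days))

# Anchored frequencies: every Monday, and the second Monday of each month.
mondays = pd.date_range('2019-01-01', periods=3, freq='W-MON')
second_mondays = pd.date_range('2019-01-01', '2019-03-31', freq='WOM-2MON')
print(mondays)
print(second_mondays)
```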

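6. Periods: Period()

Period('date', freq='*'): the default freq is the smallest unit present in the input. For example, if the smallest unit given is the month, the default frequency is monthly; if the smallest unit given is the minute, the default frequency is minutely.

In the example below, p4 is given the frequency 2M, i.e. two months, so one unit for p4 equals 2M and p4+3 is p4 + 3*2M (six months later).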

p1 = pd.Period('2017')
p2 = pd.Period('2017',freq = 'M')
print(p1,type(p1),p1+1)
print(p2,p2+1)
p3 = pd.Period('2017-1-1')
p4 = pd.Period('2017-1-1',freq = '2M')
print(p3,p3+2)
print(p4,p4+3)
p5 = pd.Period('2017-1-1 13:00')
p6 = pd.Period('2017-1-1 13:00',freq = '5T')
print(p5,p5+4)
print(p6,p6+5)
# 2017 <class 'pandas._libs.tslibs.period.Period'> 2018
# 2017-01 2017-02
# 2017-01-01 2017-01-03
# 2017-01 2017-07
# 2017-01-01 13:00 2017-01-01 13:04
# 2017-01-01 13:00 2017-01-01 13:25

Period()
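A Period differs from a Timestamp in that it represents a span rather than an instant. The sketch below shows integer arithmetic on a monthly Period along with its start_time and end_time attributes, which the examples above do not use.

```python
import pandas as pd

month = pd.Period('2017-01', freq='M')

# Adding an integer moves by whole units of the period's frequency.
next_month = month + 1
prev_month = month - 1
print(next_month, prev_month)

# A Period is a span of time with a start and an end.
print(month.start_time)
print(month.end_time)

# A Timestamp is a single instant, which may fall inside the span.
instant = pd.Timestamp('2017-01-15')
inside = month.start_time <= instant <= month.end_time
print(inside)
```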

7. period_range()

A range of periods, of type PeriodIndex; usage is similar to date_range().

p = pd.period_range('2000/1/1','2000/1/2',freq='6H')
print(p,type(p))
# PeriodIndex(['2000-01-01 00:00', '2000-01-01 06:00', '2000-01-01 12:00','2000-01-01 18:00', '2000-01-02 00:00'],
#             dtype='period[6H]', freq='6H') <class 'pandas.core.indexes.period.PeriodIndex'>

period_range()

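asfreq() on a Period or on a period_range() result returns, by default, the last unit of the new frequency inside the original period; with how='start' it returns the first one.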

p1 = pd.Period( '2019/5/1')   #2019-05-01
p2 = p1.asfreq('H')  #2019-05-01 23:00
p3 = p1.asfreq('2H',how='start')  #2019-05-01 00:00; the multiplier 2 in '2H' has no effect here
p4 = p1.asfreq('S')  #2019-05-01 23:59:59
p5 = p1.asfreq('S',how='start')  #2019-05-01 00:00:00

asfreq() on a Period
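The same start/end behaviour can be seen going from a monthly Period to days and back again. A minimal sketch with illustrative names:

```python
import pandas as pd

may = pd.Period('2019-05', freq='M')

last_day = may.asfreq('D')                  # how='end' is the default
first_day = may.asfreq('D', how='start')
print(last_day, first_day)

# Going from a finer to a coarser frequency finds the containing period.
containing_month = first_day.asfreq('M')
print(containing_month)
```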

p = pd.period_range('2015/3','2015/6',freq='M')
ps1 = pd.Series(np.random.rand(len(p)),index=p.asfreq('D'))
ps2 = pd.Series(np.random.rand(len(p)),index=p.asfreq('D',how='start'))
print(p)
print(ps1)
print('--------------------------')
print(ps2)
# PeriodIndex(['2015-03', '2015-04', '2015-05', '2015-06'], dtype='period[M]', freq='M')
# 2015-03-31    0.708730
# 2015-04-30    0.238101
# 2015-05-31    0.793451
# 2015-06-30    0.584621
# Freq: D, dtype: float64
# --------------------------
# 2015-03-01    0.397659
# 2015-04-01    0.032417
# 2015-05-01    0.763550
# 2015-06-01    0.129498
# Freq: D, dtype: float64

asfreq() on a period_range() result

8. to_timestamp() and to_period()

Converting between timestamps and periods.

p1 = pd.date_range('2015/3','2015/6',freq='M')
p2 = pd.period_range('2015/3','2015/6',freq='M')
ps1 = pd.Series(np.random.rand(len(p1)),index=p1)
ps2 = pd.Series(np.random.rand(len(p2)),index=p2)
print(ps1)
print('---------------')
print(ps2)
print('---------------')
print(ps1.to_period())
print('---------------')
print(ps2.to_timestamp())
# 2015-03-31    0.066644
# 2015-04-30    0.159969
# 2015-05-31    0.111716
# Freq: M, dtype: float64
# ---------------
# 2015-03    0.966091
# 2015-04    0.779257
# 2015-05    0.953817
# 2015-06    0.765121
# Freq: M, dtype: float64
# ---------------
# 2015-03    0.066644
# 2015-04    0.159969
# 2015-05    0.111716
# Freq: M, dtype: float64
# ---------------
# 2015-03-01    0.966091
# 2015-04-01    0.779257
# 2015-05-01    0.953817
# 2015-06-01    0.765121
# Freq: MS, dtype: float64

to_timestamp() and to_period()

9. Indexing time series

Values can be selected by position or by label, and a label can be written as a time in various formats.
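A round trip makes the relationship between the two index types concrete. The sketch below assumes a month-start DatetimeIndex (freq='MS') so that converting to periods and back reproduces the original labels; the values are arbitrary.

```python
import pandas as pd

stamps = pd.date_range('2015-03-01', periods=3, freq='MS')   # month starts
by_stamp = pd.Series([10, 20, 30], index=stamps)

by_period = by_stamp.to_period('M')      # DatetimeIndex -> PeriodIndex
back = by_period.to_timestamp()          # PeriodIndex -> DatetimeIndex (period start)

print(type(by_period.index).__name__)
print(type(back.index).__name__)
print(back.index.equals(by_stamp.index))
```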

p = pd.Series(np.random.rand(4),pd.period_range('2015/3','2015/6',freq='M'))
print(p)
print(p[0])  # positional access with []; recent pandas versions deprecate this, prefer p.iloc[0]
print(p.iloc[1])
print(p.loc['2015/5'])
print(p.loc['2015-5'])
print(p.loc['201505'])
# 2015-03    0.846543
# 2015-04    0.631335
# 2015-05    0.218029
# 2015-06    0.646544
# Freq: M, dtype: float64
# 0.846543180730373
# 0.6313347971612441
# 0.21802886896115137
# 0.21802886896115137
# 0.21802886896115137

Indexing time series

p = pd.Series(np.random.rand(5),index=pd.period_range('2015/5/30','2015/6/3'))
print(p)
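print(p[0:2])  # positional slice, end excluded
print(p.iloc[0:2]) # positional slice, end excluded
print(p.loc['2015/5/30':'2015/6/1']) # label slice, both ends included
print(p['2015/5'])  # passing only a month selects every row that falls in that month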
# 2015-05-30    0.976255
# 2015-05-31    0.671226
# 2015-06-01    0.888682
# 2015-06-02    0.875901
# 2015-06-03    0.953603
# Freq: D, dtype: float64
# 2015-05-30    0.976255
# 2015-05-31    0.671226
# Freq: D, dtype: float64
# 2015-05-30    0.976255
# 2015-05-31    0.671226
# Freq: D, dtype: float64
# 2015-05-30    0.976255
# 2015-05-31    0.671226
# 2015-06-01    0.888682
# Freq: D, dtype: float64
# 2015-05-30    0.976255
# 2015-05-31    0.671226
# Freq: D, dtype: float64

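Slicing time series

10. Uniqueness: unique()

is_unique tells whether the values of a series are unique; index.is_unique tells whether the labels are unique.

When the time index has no duplicates, selecting a single time returns a single value.

When the time index has duplicates, selecting a time that is not duplicated still returns a Series, and selecting a duplicated time returns all matching rows by default. groupby can be used to group and aggregate them.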
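The same kinds of selection also work on a Series with a DatetimeIndex. The following sketch uses .iloc and .loc explicitly, with small integer values chosen so the results are easy to follow:

```python
import pandas as pd

days = pd.date_range('2015-05-30', '2015-06-03')
s = pd.Series([1, 2, 3, 4, 5], index=days)

print(s.iloc[0])                          # by position
print(s.loc['2015-06-01'])                # by exact label
print(s.loc['2015-05'])                   # every row in May 2015
print(s.loc['2015-05-31':'2015-06-02'])   # label slice, both ends included
```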

p = pd.Series(np.random.rand(5),index=pd.DatetimeIndex(['2019/5/1','2019/5/2','2019/5/3','2019/5/1','2019/5/2']))
print(p)
print(p.is_unique,p.index.is_unique)
print('--------------------')
print(p['2019/5/3'])
print('--------------------')
print(p['2019/5/1'])
print('--------------------')
print(p['2019/5/1'].groupby(level=0).mean())  # group the rows labelled 2019/5/1 by index label and take the mean of the two values
# 2019-05-01    0.653468
# 2019-05-02    0.116834
# 2019-05-03    0.978432
# 2019-05-01    0.724633
# 2019-05-02    0.250191
# dtype: float64
# True False
# --------------------
# 2019-05-03    0.978432
# dtype: float64
# --------------------
# 2019-05-01    0.653468
# 2019-05-01    0.724633
# dtype: float64
# --------------------
# 2019-05-01    0.689051
# dtype: float64

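Duplicate time index

Time resampling

resample('new_freq') performs the resampling. The result is a Resampler object; call an aggregation on it, such as sum(), mean(), max(), min(), median(), first(), last() or ohlc() (from finance: open, high, low, close), to obtain the values.

Resampling converts a time series from one frequency to another, which involves either filling in or combining data.

Downsampling: high-frequency data → low-frequency data, for example converting daily data to monthly data; values are combined.

Upsampling: low-frequency data → high-frequency data, for example converting yearly data to monthly data; values are filled in.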
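Two common ways of resolving duplicate timestamps across a whole series are averaging them and keeping only the first occurrence. This sketch uses fixed values instead of random ones so the result is predictable; index.duplicated() is not used in the example above.

```python
import pandas as pd

idx = pd.DatetimeIndex(['2019-05-01', '2019-05-02', '2019-05-03',
                        '2019-05-01', '2019-05-02'])
s = pd.Series([1.0, 2.0, 3.0, 5.0, 6.0], index=idx)

print(s.index.is_unique)

# Option 1: aggregate rows that share a timestamp.
averaged = s.groupby(level=0).mean()
print(averaged)

# Option 2: keep only the first row seen for each timestamp.
first_only = s[~s.index.duplicated(keep='first')]
print(first_only)
```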

ts = pd.Series(np.arange(1,9),index=pd.date_range(start = '2019/5/1',periods=8))
print(ts)
print('Resampled:',ts.resample('3D'),' type:',type(ts.resample('3D')))
print('Resampled sum:',type(ts.resample('3D').sum()),'\n',ts.resample('3D').sum())
print('Resampled mean:\n',ts.resample('3D').mean())
print('Resampled max:\n',ts.resample('3D').max())
print('Resampled min:\n',ts.resample('3D').min())
print('Resampled median:\n',ts.resample('3D').median())
print('Resampled first:\n',ts.resample('3D').first())
print('Resampled last:\n',ts.resample('3D').last())
print('Resampled OHLC:\n',ts.resample('3D').ohlc())
# 2019-05-01    1
# 2019-05-02    2
# 2019-05-03    3
# 2019-05-04    4
# 2019-05-05    5
# 2019-05-06    6
# 2019-05-07    7
# 2019-05-08    8
# Freq: D, dtype: int32
# Resampled: DatetimeIndexResampler [freq=<3 * Days>, axis=0, closed=left, label=left, convention=start, base=0]  type: <class 'pandas.core.resample.DatetimeIndexResampler'>
# Resampled sum: <class 'pandas.core.series.Series'>
#  2019-05-01     6
# 2019-05-04    15
# 2019-05-07    15
# Freq: 3D, dtype: int32
# Resampled mean:
#  2019-05-01    2.0
# 2019-05-04    5.0
# 2019-05-07    7.5
# Freq: 3D, dtype: float64
# Resampled max:
#  2019-05-01    3
# 2019-05-04    6
# 2019-05-07    8
# Freq: 3D, dtype: int32
# Resampled min:
#  2019-05-01    1
# 2019-05-04    4
# 2019-05-07    7
# Freq: 3D, dtype: int32
# Resampled median:
#  2019-05-01    2.0
# 2019-05-04    5.0
# 2019-05-07    7.5
# Freq: 3D, dtype: float64
# Resampled first:
#  2019-05-01    1
# 2019-05-04    4
# 2019-05-07    7
# Freq: 3D, dtype: int32
# Resampled last:
#  2019-05-01    3
# 2019-05-04    6
# 2019-05-07    8
# Freq: 3D, dtype: int32
# Resampled OHLC:
#              open  high  low  close
# 2019-05-01     1     3    1      3
# 2019-05-04     4     6    4      6
# 2019-05-07     7     8    7      8

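resample() example

When downsampling, the closed parameter of resample() decides which edge of each bin is included in that bin. With closed='left' (the default for most frequencies, including 3D) a bin includes its left edge and excludes its right edge. With closed='right' a bin excludes its left edge and includes its right edge, so here the first observation (2019-05-01) lands in the bin that starts on 2019-04-28.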
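Several aggregations can also be computed in one call with agg(), which the example above does not use. A short sketch on the same data:

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.arange(1, 9), index=pd.date_range('2019-05-01', periods=8))

# agg() with a list returns one column per aggregation.
summary = ts.resample('3D').agg(['sum', 'mean', 'count'])
print(summary)
```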

ts = pd.Series(np.arange(1,9),index=pd.date_range(start = '2019/5/1',periods=8))
print(ts.resample('3D').sum())   
print(ts.resample('3D',closed='right').sum())
# closed='left' (default): groups are [1,2,3], [4,5,6], [7,8]
# closed='right': groups are [(4/29, 4/30, no data), 1], [2,3,4], [5,6,7], [8]
# 2019-05-01     6
# 2019-05-04    15
# 2019-05-07    15
# Freq: 3D, dtype: int32
# 2019-04-28     1
# 2019-05-01     9
# 2019-05-04    18
# 2019-05-07     8
# Freq: 3D, dtype: int32

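resample(): closed='left' versus closed='right'

When downsampling, setting label='right' in resample() labels each bin with its right edge, that is, the first label of the next bin. By default each bin is labelled with its left edge, the first label of the current bin.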
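Writing out the bin intervals may make closed easier to follow. In the sketch below, the comments list the intervals implied by each setting for this data, and the last call combines closed='right' with label='right' so that every bin is labelled by the edge it includes.

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.arange(1, 9), index=pd.date_range('2019-05-01', periods=8))

# closed='left': bins are [05-01, 05-04), [05-04, 05-07), [05-07, 05-10)
left = ts.resample('3D', closed='left').sum()

# closed='right': bins are (04-28, 05-01], (05-01, 05-04], (05-04, 05-07], (05-07, 05-10]
right = ts.resample('3D', closed='right').sum()

# Labelling each right-closed bin by its right edge is often easier to read.
right_labelled = ts.resample('3D', closed='right', label='right').sum()

print(left)
print(right)
print(right_labelled)
```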

ts = pd.Series(np.arange(1,9),index=pd.date_range(start = '2019/5/1',periods=8))
print(ts.resample('3D').sum())   # the label shown is the first label of the current group
print(ts.resample('3D',label='right').sum())  # the label shown is the first label of the next group
# resampling by 3D gives the groups [1,2,3] [4,5,6] [7,8]
# 2019-05-01     6
# 2019-05-04    15
# 2019-05-07    15
# Freq: 3D, dtype: int32
# 2019-05-04     6
# 2019-05-07    15
# 2019-05-10    15
# Freq: 3D, dtype: int32

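Labels shown when downsampling

Upsampling adds new labels, so empty values appear. bfill() fills them with the following value and ffill() fills them with the preceding value.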

ts = pd.Series(np.arange(1,4),index=pd.date_range(start = '2019/5/1',periods=3))
print(ts)
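print(ts.resample('12H'))  # a Resampler object
print(ts.resample('12H').asfreq())   # leave the new labels as NaN
print(ts.resample('12H').bfill())    # fill empty values with the following value
print(ts.resample('12H').ffill())   # fill empty values with the preceding value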
# 2019-05-01    1
# 2019-05-02    2
# 2019-05-03    3
# Freq: D, dtype: int32
# DatetimeIndexResampler [freq=<12 * Hours>, axis=0, closed=left, label=left, convention=start, base=0]
# 2019-05-01 00:00:00    1.0
# 2019-05-01 12:00:00    NaN
# 2019-05-02 00:00:00    2.0
# 2019-05-02 12:00:00    NaN
# 2019-05-03 00:00:00    3.0
# Freq: 12H, dtype: float64
# 2019-05-01 00:00:00    1
# 2019-05-01 12:00:00    2
# 2019-05-02 00:00:00    2
# 2019-05-02 12:00:00    3
# 2019-05-03 00:00:00    3
# Freq: 12H, dtype: int32
# 2019-05-01 00:00:00    1
# 2019-05-01 12:00:00    1
# 2019-05-02 00:00:00    2
# 2019-05-02 12:00:00    2
# 2019-05-03 00:00:00    3
# Freq: 12H, dtype: int32
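Besides bfill() and ffill(), upsampled gaps can be filled by interpolation, or filled only up to a limit. This is a supplementary sketch rather than part of the original walkthrough; it writes the frequencies in lowercase ('12h', '6h'), the spelling newer pandas releases prefer.

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.arange(1, 4), index=pd.date_range('2019-05-01', periods=3))

# Linear interpolation between the known daily values.
interpolated = ts.resample('12h').interpolate()
print(interpolated)

# ffill(limit=1) fills at most one consecutive gap after each known value.
limited = ts.resample('6h').ffill(limit=1)
print(limited)
```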