hive-窗口函數

1 窗口函數語法

分析函數/專用窗口函數 over(partition by 列名 order by 列名 rows between 開始位置 and
結束位置)

常用的分析函數

常用的分析函數：sum()、max()、min()、avg()、count()

常用的專用窗口函數

專用窗口函數：row_number()、rank()、dense_rank()

窗口函數

窗口函數的3個組成部分可以單獨使用，也可以混合使用，也可以全部不用

over(partition by 列名 order by 列名 rows between 開始位置 and 結束位置)

partition by

對指定的字段進行分組，后續都會以組為單位，把每個分組單獨作為一個窗口進行統計分析操作。

案例 01：對窗口中的數據求和，并把求和結果分別分發到對應窗口的每一條數據中

with temp as(
select 'A' as col1,1 as col2
union all 
select 'A' as col1,1 as col2
union all 
select 'B' as col1,1 as col2
)select
col1
,sum(col2) over(partition by col1) as '對窗口中的數據求和'
from temp
輸出結果：
col     對窗口中的數據求和
A       2
A       2
B       1

案例 02：對整體數據求和，并把求和結果分發到每一條數據中

with temp as(
select 'A' as col1,1 as col2
union all 
select 'A' as col1,1 as col2
union all 
select 'B' as col1,1 as col2
)select
col1
,sum(col2) over() as '對整體數據求和'
from temp
輸出結果：
col     對整體數據求和
A       3
A       3
B       3

order by

order by 與 partition by 連用的時候，可以對各個分組內的數據，按照指定的字段進行排序。如果沒有 partition by 指定分組字段，那么會對全局的數據進行排序。

案例 01：對數據進行全局排序

with temp as(
select 'A' as col1,1 as col2
union all 
select 'C' as col1,1 as col2
union all 
select 'B' as col1,1 as col2
)select col1,row_number() over(order by col1 desc) as 排序 from temp
輸出結果：
col1     排序 
C        1
C        2
B        3
A        4

案例 02：當排序的維度不存在重復的情況下，即 order by 指定的字段，使用 order by + 分析函數 sum()，可以產生求整體累計數的效果。

with temp_01 as(
select 'A' as col1,1 as col2
union all 
select 'D' as col1,1 as col2
union all 
select 'C' as col1,1 as col2
union all 
select 'B' as col1,1 as col2
)select col1,sum(col2) over(order by col1) as 求累計 from temp_01
輸出結果：
col1    求累計
A       1
B       2
C       3
D       4

但是當 order by 指定的字段，數據存在重復的時候，會在不重復的數據中產生累計效果，重復的數據中，會把整體的累計結果分發到每條重復的數據中。

with temp_02 as(
select 'A' as col1,1 as col2
union all 
select 'C' as col1,1 as col2
union all 
select 'C' as col1,1 as col2
union all 
select 'B' as col1,1 as col2
)select col1,sum(col2) over(order by col1) as 求累計 from temp_02
輸出結果：
col1    求累計
A       1
B       2
C       4
C       4

案例 03：當排序的維度不存在重復的情況下，即 order by 指定的字段，parition by 與 order by 連用，并使用分析函數 sum() ，可以產生求窗口中累計數的效果。

with temp_01 as(
select 'A' as col1,'b' as col2,1 as col3
union all 
select 'A' as col1,'a' as col2,1 as col3
union all 
select 'C' as col1,'a' as col2,1 as col3
union all 
select 'C' as col1,'b' as col2,1 as col3
)select col1,sum(col3) over(partition by col1 order by col1,col2) as 求累計 from temp_01
輸出結果：
col1     求累計
A        1
A        2
C        1
C        2

但是當 order by 指定的字段組合，數據存在重復的時候，會在不重復的數據中產生累計效果，而重復的數據中，也是會把整體的累計結果分發到每條重復的數據中，如下，用 col1 和 col2 排序。

with temp_02 as(
select 'A' as col1,'b' as col2,1 as col3
union all 
select 'A' as col1,'a' as col2,1 as col3
union all 
select 'C' as col1,'a' as col2,1 as col3
union all 
select 'C' as col1,'a' as col2,1 as col3
)select col1,sum(col3) over(partition by col1 order by col1,col2) as 求累計 from temp_02
輸出結果：
col1     求累計
A        1
A        2
C        2
C        2

rows between 開始位置 and 結束位置

rows between 是用來劃分窗口中，函數發揮作用的數據范圍。我們用如下例子加深 rows between 的理解。
在這里插入圖片描述

在A分組中，計算第2行，第2列的值的時候，會從窗口起點（第2行，第2列），計算到當前行（第2行，第2列）。結果是1。
在A分組中，計算第3行，第2列的值的時候，會從窗口起點（第2行，第2列），計算到當前行（第3行，第2列）。結果是2。

rows between 常用的參數如下：

① n preceding：往前

② n following：往后

③ current row：當前行

④ unbounded：起點（一般結合preceding，following使用）

a. unbounded preceding：表示該窗口最前面的行（起點） b. unbounded
following：表示該窗口最后面的行（終點）

使用例子如下：

rows between unbounded preceding and current row（表示從起點到當前行的數據進行） rows
between current row and unbounded following（表示當前行到終點的數據進行） rows
between unbounded preceding and unbounded following （表示起點到終點的數據） rows
between 1 preceding and 1 following（表示往前1行到往后1行的數據） rows between 1
preceding and current row（表示往前1行到當前行）

rows between unbounded preceding and current row與 partition by 、order by 連用，可以產生對窗口中的數據求累計數的效果。

with temp as(
select 'A' as col1,1 as col2
union all 
select 'A' as col1,1 as col2
union all 
select 'B' as col1,1 as col2
)select
col1
,sum(col2) over(partition by col1 order by col2 desc rows between unbounded preceding and current row) as '對窗口中的數據求和'
from temp
輸出結果：
col1     對窗口中的數據求和
A        1
A        2
B        1

排序窗口函數
2.1 排序并產生自增編號，自增編號不重復且連續
我們可以使用函數：row_number() over()

數據樣例：

col1                             ranks
a   &nbsp; &nbsp; &nbsp;    1
b    &nbsp; &nbsp; &nbsp;     2
b    &nbsp; &nbsp; &nbsp;     3
b    &nbsp; &nbsp; &nbsp;     4
c   &nbsp; &nbsp; &nbsp;      5
d    &nbsp; &nbsp; &nbsp;     6具體語法如下：> row_number() over(partition by 列名 order by 列名 rows between 開始位置 and
> 結束位置)案例如下：>with temp as(
select 'a' as col1
union all 
select 'b' as col1
union all 
select 'b' as col1
union all 
select 'b' as col1
union all 
select 'c' as col1
union all 
select 'd' as col1
)
>
>select col1,row_number() over(order by col1) as ranks from temp
輸出結果：
col1     rank
a        1
b        2
b        3
b        4
c        5
d        6

2.2 排序并產生自增編號，自增編號會重復且不連續
我們可以使用函數：rank() over()

數據樣例：

col1     ranks
a        1
b        2
b        2
b        2
c        5
d        6

具體語法如下：

rank() over(partition by 列名 order by 列名 rows between 開始位置 and 結束位置)

案例如下：

with temp as(
select 'a' as col1
union all 
select 'b' as col1
union all 
select 'b' as col1
union all 
select 'b' as col1
union all 
select 'c' as col1
union all 
select 'd' as col1
)select col1,rank() over(order by col1) as ranks from temp
輸出結果：
col1     rank
a        1
b        2
b        2
b        2
c        5
d        6

2.3 排序并產生自增編號，自增編號會重復且連續
我們可以使用函數：dense_rank() over()

數據樣例：

col1     ranks
a        1
b        2
b        2
b        2
c        3
d        4

具體語法如下：

dense_rank() over(partition by 列名 order by 列名 rows between 開始位置 and
結束位置)

案例如下：

with temp as(
select 'a' as col1
union all 
select 'b' as col1
union all 
select 'b' as col1
union all 
select 'b' as col1
union all 
select 'c' as col1
union all 
select 'd' as col1
)select col1,dense_rank() over(order by col1) as ranks from temp	
輸出結果：
col1     ranks
a        1
b        2
b        2
b        2
c        3
d        4

聚合窗口函數

3.1 求窗口中的累計值
我們可以使用：sum() over()

with temp as(
select 'A' as col1,1 as col2
union all 
select 'A' as col1,1 as col2
union all
select 'A' as col1,1 as col2
union all 
select 'B' as col1,1 as col2
union all 
select 'B' as col1,1 as col2
)select
col1
,sum(col2) over(partition by col1 order by col2 desc rows between unbounded preceding and current row) as '對窗口中的數據求和'
from temp
輸出結果：
col1     對窗口中的數據求和
A        1
A        2
A        3
B        1
B        1

3.2 求窗口中 3 天的平均價格
我們可以使用 avg() over()

with temp as(
select 'A' as col1,'2022-11-01' as date_time,50 as price
union all 
select 'A' as col1,'2022-11-02' as date_time,60 as price
union all 
select 'A' as col1,'2022-11-03' as date_time,45 as price
union all 
select 'A' as col1,'2022-11-04' as date_time,70 as price
union all 
select 'A' as col1,'2022-11-05' as date_time,40 as price
union all 
select 'A' as col1,'2022-11-06' as date_time,40 as price
union all 
select 'B' as col1,'2022-11-01' as date_time,40 as price
union all 
select 'B' as col1,'2022-11-02' as date_time,30 as price
union all 
select 'B' as col1,'2022-11-03' as date_time,50 as price
union all 
select 'B' as col1,'2022-11-04' as date_time,50 as price
)select
col1
,date_time
,price
,avg(price) over(partition by col1 order by date_time rows between 2 preceding and current row) as '3天的平均價格'
from temp輸出結果：
col1      date_time        price      3天的平均價格
A         2022-11-01       50         50
A         2022-11-02       60         55
A         2022-11-03       45         51.666666666666664
A         2022-11-04       70         58.333333333333336
A         2022-11-05       40         51.666666666666664
A         2022-11-06       40         50
B         2022-11-01       40         40
B         2022-11-02       30         35
B         2022-11-03       50         40
B         2022-11-01       50         43.333333333333336

3.3 求分組中的最大值/最小值

with temp_01 as(
select 'A' as col1,10 as col2
union all 
select 'C' as col1,10 as col2
union all 
select 'C' as col1,20 as col2
union all 
select 'A' as col1,20 as col2
union all
select 'A' as col1,20 as col2
)select
col1
,col2
,max(col2) over(partition by col1) as 窗口中的最大值
,min(col2) over(partition by col1) as 窗口中的最小值
from temp_01輸出結果：
col1     col2     窗口中的最大值     窗口中的最小值
A        10       20                 10
A        20       20                 10
A        20       20                 10
C        10       20                 10
C        20       20                 10

3.4 求分組中的總記錄數

with temp_01 as(
select 'A' as col1,'a' as col2
union all 
select 'C' as col1,'a' as col2
union all 
select 'C' as col1,'a' as col2
union all 
select 'A' as col1,'b' as col2
union all
select 'A' as col1,'b' as col2
)select
col1
,col2
,count(col2) over(partition by col1) as 分組中的記錄數
from temp_01輸出結果：
col1     col2     分組中的記錄數
A        a        3
A        b        3
A        b        3
C        a        2
C        a        2

位移窗口函數
4.1 獲取分組中往前 n 行的值
基礎語法：

lead(field,n,default_value) over()

語法解析：

field 是指定的列名
n 是往前的行數
行往前導致的，最后的 n 行值為 null，可以用 default_value 代替。

使用案例：

with temp_01 as(
select 'A' as col1,'2022-12-01' as date_time
union all 
select 'C' as col1,'2022-12-01' as date_time
union all 
select 'C' as col1,'2022-12-02' as date_time
union all 
select 'A' as col1,'2022-12-02' as date_time
union all
select 'A' as col1,'2022-12-03' as date_time
)select
col1
,date_time
,lead(date_time,1,'2022-12-30') over(partition by col1 order by date_time) as 往前n行的值 
from temp_01輸出結果：
col1     date_time     往前n行的值
A        2022-12-01    2022-12-02
A        2022-12-02    2022-12-03
A        2022-12-03    2022-12-30
C        2022-12-01    2022-12-02
C        2022-12-02    2022-12-30

4.2 獲取分組中往后 n 行的值
基礎語法：

lag(field,n, default_value) over()

語法解析：

field 是指定的列名
n 是往前的行數
行往后導致的，前面的 n 行值為 null，可以用 default_value 代替。

使用案例：

with temp_01 as(
select 'A' as col1,'2022-12-01' as date_time
union all 
select 'C' as col1,'2022-12-01' as date_time
union all 
select 'C' as col1,'2022-12-02' as date_time
union all 
select 'A' as col1,'2022-12-02' as date_time
union all
select 'A' as col1,'2022-12-03' as date_time
)select
col1
,date_time
,lag(date_time,1,'2022-12-30') over(partition by col1 order by date_time) as 往前n行的值 
from temp_01輸出結果：
col1      date_time     往前n行的值
A         2022-12-01    2022-12-30
A         2022-12-02    2022-12-01
A         2022-12-03    2022-12-02
C         2022-12-01    2022-12-30
C         2022-12-02    2022-12-01

極值窗口函數

5.1 獲取分組內第一行的值
我們可以使用 first_value(col,true/false) over()，作用是：取分組內排序后，截止到當前行，第一個值。

注意：

當第二個參數為 true 的時候，會跳過空值當 over() 中不指定排序的時候，會默認使用表中數據的原排序。
案例：將每行數據換成當前窗口指定字段的第一個值

with temp_01 as(
select 'A' as col1,'b' as col2
union all 
select 'C' as col1,'a' as col2
union all 
select 'C' as col1,'b' as col2
union all 
select 'A' as col1,'a' as col2
union all
select 'A' as col1,'b' as col2
)select
col1
,first_value(col2) over(partition by col1 order by col2) as 第一個值 
from temp_01
輸出結果：
col1     第一個值
A        a
A        a
A        a
C        a 
C        a

select
col1
,first_value(col2) over(partition by col1) as 第一個值 
from temp_01
輸出結果：
col1     第一個值
A        b
A        b
A        b
C        a
C        a

5.2 獲取分組內最后一行的值
我們可以使用 last_value(col,true/false) over()，作用是：取分組內排序后，截止到當前行，最后一個值。所以，如果使用 order by 排序的時候，想要取最后一個值，需要與 rows between unbounded preceding and unbounded following 連用。

注意：

當第二個參數為 true 的時候，會跳過空值當 over() 中不指定排序的時候，會默認使用表中數據的原排序。當 over()
中指定排序的時候，要與 rows between unbounded preceding and unbounded following
連用

with temp_01 as(
select 'A' as col1,'b' as col2
union all 
select 'C' as col1,'a' as col2
union all 
select 'C' as col1,'b' as col2
union all 
select 'A' as col1,'a' as col2
union all
select 'A' as col1,'b' as col2
)select
col1
,last_value(col2) over(partition by col1 order by col2 rows between unbounded preceding and unbounded following) as 第一個值 
from temp_01輸出結果：
col1       第一個值
A          b
A          b
A          b
C          b
C          b

相信大家都發現了，在本案例中，我們使用 order by 的時候與 rows between unbounded preceding and unbounded following 連用了，這是需要注意的一個點，如果不連用，將會產生以下效果：

with temp_01 as(
select 'A' as col1,'b' as col2
union all 
select 'C' as col1,'a' as col2
union all 
select 'C' as col1,'b' as col2
union all 
select 'A' as col1,'a' as col2
union all
select 'A' as col1,'b' as col2
)select
col1
,last_value(col2) over(partition by col1 order by col2) as 第一個值 
from temp_01輸出結果：
col1      第一個值
A         a
A         b
A         b
C         a
C         b

分箱窗口函數
ntile() over() 分箱窗口函數，用于將分組數據按照順序切分成 n 片，返回當前切片值，如果切片不均勻，默認增加到第一個切片中。

案例：查詢成績前 20% 的人。

with temp as(
select 'A' as col1,90 as grade
union all 
select 'B' as col1,80 as grade
union all
select 'C' as col1,82 as grade
union all
select 'D' as col1,99 as grade
union all 
select 'E' as col1,100 as grade
union all 
select 'F' as col1,92 as grade
union all
select 'G' as col1,93 as grade
union all
select 'H' as col1,85 as grade
union all
select 'I' as col1,95 as grade
union all
select 'J' as col1,70 as grade
)select 
col1
,grade
from(selectcol1,grade,ntile(5) over(order by grade desc) as levelfrom temp)t1
where t1.level = 1輸出結果：
col1     grade
E        100
D        99