本文用于復習數據庫窗口函數,希望能夠溫故知新,也希望讀到這篇文章的有所收獲。
本文以:MySQL為例
參考文檔: https://www.begtut.com/mysql/mysql-window-functions.html
使用的樣例數據:https://www.begtut.com/mysql/mysql-sample-database.html
1. 概括的說明
函數 | 說明 |
---|---|
ROW_NUMBER | 為其分區中的每一行分配一個序號。 |
RANK | 根據ORDER BY的字段,為每一行分配一個排名。 值相同的行分配相同的排名, 下一行排名不聯系,會累加值相同的行數。 |
DENSE_RANK | 與RANK()函數類似,只是當出現值相同的行時,排名是連續的,不是累加行數。 |
PERCENT_RANK | 計算分區或結果集中行的百分位數。計算公式為:(當前從小到大排序序號-1 ) / (總序號數-1) 【就是(rank - 1) / (total_rows - 1) 】 |
FIRST_VALUE | 返回指定表達式相對于窗口框架中第一行的值。 |
LAST_VALUE | 返回指定表達式相對于窗口框架中最后一行的值。 |
LEAD | 返回分區中當前行之后的第N行的值。 如果不存在后續行,則返回NULL。 |
LAG | 返回分區中當前行之前的第N行的值。 如果不存在前一行,則返回NULL。 |
NTILE | 將每個窗口分區的行分配到指定數量的已排名組中。 (把結果分成n個組) |
CUME_DIST | 計算一組值中值的累積分布。 |
NTH_VALUE | 返回窗口框架第N行的參數 |
2.??注意 rows between 的用法
- rows between …… and ……
- unbounded preceding 前面所有行 、n preceding ?前面n行
- unbounded following 后面所有行 、n following ?后面n行
- current row 當前行
SELECTorderNumber,productCode,quantityOrdered,SUM(quantityOrdered) OVER(PARTITION BY orderNumber ORDER BY productcode) AS quantity_amount,-- 前面一行和當前行的值累加SUM(quantityOrdered) OVER(PARTITION BY orderNumber ORDER BY productcode ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) quantity_add
FROMmysqldemo.orderdetails
WHEREorderNumber = 10103;
3. 注意 range between的用法
range between 按照排序字段的值限制窗口大小。主要將order by后面字段排序后,然后根據排序字段的值,框定一個范圍,再對這個范圍內的行進行匯總。樣例如下:
?
sum(num) over(order by dateTime range between interval 6 day preceding and current row)
-- 必須是date類型的數據,這一天和前面6天(如果存在)的數據sum(close) over(order by salary range between 100 preceding and 200 following)
--通過 salary 字段差值來進行選擇。如當前行的 salary 字段值是 200,那么這個窗口大小的定義就會選擇分區中 salary 字段值落在 100 至 400 區間的記錄(行),再求這些行的sum(close).
需要注意的點:
- rows表示行,就是前n行,后n行。
- range表示的是具體的值,比這個值小n的行,比這個值大n的行。是以當前值為錨點進行計算。
- 同時 range 也可以使用 between unbounded preceding and unbounded following,效果和等同于rows一樣,取上下限所有行,不指定值。
- range 窗口僅對數字和日期起作用,因為需要計算值的范圍。
- 在range 的開窗中,order by 中只能有一列;rows 的開窗的order by 可以有多列。
SELECTorderNumber,productCode,quantityOrdered,SUM(quantityOrdered) OVER(PARTITION BY orderNumber ORDER BY quantityOrdered) AS quantity_amount,-- quantityOrdered 的值-1 和 +2的值區間范圍內的行的累加SUM(quantityOrdered) OVER(PARTITION BY orderNumber ORDER BY quantityOrdered RANGE BETWEEN 1 PRECEDING AND 2 following) quantity_add
FROMmysqldemo.orderdetails;
4.?ROW_NUMBER & RANK & DENSE_RANK
比較常用,都很熟悉,基本用法就不用贅述了。
SELECTorderNumber,productCode,quantityOrdered,ROW_NUMBER() OVER (ORDER BY quantityOrdered) AS nb,RANK() OVER (PARTITION BY orderNumber ORDER BY quantityOrdered) AS rank_quantity,DENSE_RANK() OVER (PARTITION BY orderNumber ORDER BY quantityOrdered) AS dense_rank_quantity
FROMmysqldemo.orderdetails
WHEREorderNumber = 10103;
rank和dense_rank 的區別,就是遇到有多行值相同時,那么下一行的序號,rank會加上重復的行數,那么rank對應的序號就不連續了;dense_rank 不會加上重復的行數,保持序號任然是連續的。
需要注意的點:
- ROW_NUMBER 不加partition的時候,對所有行加序號,加partition之后分組加序號。
- RANK 注意不加order by的時候,不排序,全是1,即使加partition也沒用;一定要加order by才會排序。
- SUM 用法和 ROW_NUMBER 相同,匯總和分組匯總。
SELECT *, ROW_NUMBER() OVER () row_num0,ROW_NUMBER() OVER (PARTITION BY productline) row_num1,RANK() OVER() AS Rank00,RANK() OVER(PARTITION BY productline) AS Rank01,RANK() OVER(PARTITION BY productline,order_year) AS Rank02,RANK() OVER(order by amount) AS Rank1,RANK() OVER(PARTITION BY productline order by amount) AS Rank2,SUM(amount) OVER(PARTITION BY productline,order_year ) AS amount0,SUM(amount) OVER(PARTITION BY productline ) AS amount1,SUM(amount) OVER() AS amount2
FROM (SELECT productline, year(orderDate) order_year, sum(quantityOrdered) as amountFROM ordersINNER JOIN orderdetails USING (orderNumber)INNER JOIN products USING (productCode)GROUP BY productline,order_year) T;
5.?PERCENT_RANK()
函數返回一個從0到1的數字。 計算公式為:(rank - 1) / (total_rows - 1)。
rank是當前行的等級,total_rows是要計算的行數。 公式的意思就是計算當前行的等級減1,除以分區或結果集中的總行數減1。
- PERCENT_RANK()對于分區或結果集中的第一行,函數始終返回零。重復的列值將接收相同的PERCENT_RANK()值。
- PERCENT_RANK()是一個順序敏感函數,因此,您應始終使用ORDER BY子句。
CREATE TABLE productLineSales -- 我們創建了一張表,后面還會重復用到它
SELECTproductLine,YEAR(orderDate) orderYear,SUM(quantityOrdered * priceEach) orderValue
FROM orderDetails
INNER JOIN orders USING (orderNumber)
INNER JOIN products USING (productCode)
GROUP BY productLine , YEAR(orderDate);?WITH t AS (SELECT productLine, SUM(orderValue) orderValueFROM productLineSalesGROUP BY productLine
)
SELECTproductLine,orderValue,ROUND(PERCENT_RANK() OVER (ORDER BY orderValue),2) percentile_rank
FROM t;?
在這個例子中: 首先,我們使用表達式按產品線匯總訂單值。 其次,我們用它PERCENT_RANK()來計算每種產品的訂單價值的百分等級。
以下是輸出中的一些分析:
- 訂單價值Trains并不比任何其他產品線更好,后者用零表示。
- Vintage Cars 表現優于50%的其他產品。
- Classic Cars 表現優于任何其他產品系列,因此其百分比等級為1或100%
6. CUME_DIST
它表示值小于或等于當前行的值除以總行數。 公式為: ROW_NUMBER() / total_rows 。注意和 PERCENT_RANK 的區別。
- CUME_DIST()函數的返回值大于零且小于或等于1。
- 重復的列值接收相同的CUME_DIST()值。?
樣例:計算某產品的訂單訂貨量數量分布 (注意第7行開始有重復的值31,對應百分比也是相同的。表示數量小于等于31的一共10行,占總行數28的35.71%)
?
SELECT orderNumber, productCode, quantityOrdered, ROW_NUMBER() OVER(ORDER BY quantityordered) AS nb,CUME_DIST() OVER(ORDER BY quantityordered) AS pct,PERCENT_RANK() OVER(ORDER BY quantityordered) AS pct_rank
FROM mysqldemo.orderdetails
WHERE productcode = 'S18_2949';
7. FIRST_VALUE
樣例:獲取客戶首單訂單金額。
SELECT customernumber,amount,paymentDate,FIRST_VALUE (amount) OVER (PARTITION BY customernumber ORDER BY paymentDate) AS first_amount
FROM payments
ORDER BY customernumber;
8. Last_Value
樣例:獲取客戶最后一筆訂單金額。
注意:Last_Value 和 First_Value 不同, 他認為每一行,是當前行中的最后一行。注意對比下面兩個字段的不同。
SELECT customernumber, amount, paymentDate,last_value (amount) OVER (PARTITION BY customernumber ORDER BY paymentDate) AS last_amount,last_value (amount) OVER (PARTITION BY customernumber ORDER BY paymentDate RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_amount_umbounded
FROM payments
ORDER BY customernumber;
8. Lead和Lag
函數類似,是查詢某一字段的從當前行往后找到第N行的數據(Lead)和往前找到第N行的數據(Lag)。在找到某一行的偏移n行的數據非常有用。
lead/lag(expression, offffset, default) over(partion by ......order by ......)
- expression 要取的是哪一個字段
- offset 是從當前行前進(lead)/后退(lag)的行數。 必須是一個非負整數,為零則取當前行。
- default 如果沒有后續行,則函數返回default。例如,如果offset是1,則lead的最后一行,lag的第一行的返回值為default。 未指定default_value,則返回 NULL 。
樣例: 查詢出上一個訂單,下一個訂單的時間?
SELECT?customerName,orderDate,LEAD(orderDate,1) OVER (PARTITION BY customerNumber ORDER BY orderDate ) nextOrderDate,LAG(orderDate,1) OVER (PARTITION BY customerNumber ORDER BY orderDate ) PreviousOrderDate
FROM orders
INNER JOIN customers USING (customerNumber);?
樣例:查詢出下單時間間隔最長的用戶。
SELECT customerName,?MAX(orderdate_interval) AS MAX_interval,?RANK() OVER(ORDER BY MAX(orderdate_interval) DESC) AS data_rank
FROM(SELECT?customerName,orderDate,LEAD(orderDate,1) OVER (PARTITION BY customerNumber ORDER BY orderDate ) nextOrderDate,datediff(LEAD(orderDate,1) OVER (PARTITION BY customerNumber ORDER BY orderDate), orderDate) orderdate_intervalFROM ordersINNER JOIN customers USING (customerNumber)) T1
WHERE nextOrderDate IS NOT NULL
GROUP BY customerName;?
9. NTILE 平均分組
樣例:將產品線按照年份,匯總訂單金額,并且劃分為三個組。
注意不能平均分配時,例如將9行數據分成4個組,他會把第1組分3個,剩余3個組每個組2個;
SELECTproductline,?orderYear,?orderValue,NTILE(3) OVER (PARTITION BY orderYear ORDER BY orderValue DESC) product_line_group
FROM?productlineSales;?
樣例: 查詢出2013支付金額排名前30%的所有用戶?
SELECT customerNumber, pay_amount, level
FROM (SELECT customerNumber, SUM(amount) AS pay_amount,NTILE(10) OVER(ORDER BY SUM(amount) DESC) AS levelFROM mysqldemo.paymentsWHERE Year(paymentDate) = 2013GROUP BY customerNumber)a
WHERE level in (1,2,3);
通過這種方法計算出來的百分比不準確,通過下面的SQL,會發現前3個組的人數超過了30%。
SELECT COUNT(customerNumber), level
FROM (SELECT customerNumber, SUM(amount) AS pay_amount,NTILE(10) OVER(ORDER BY SUM(amount) DESC) AS levelFROM mysqldemo.paymentsWHERE Year(paymentDate) = 2013GROUP BY customerNumber)a
GROUP BY level;
使用用 CUME_DIST 效果更好。
SELECT customerNumber, pay_amount, level, pct
FROM (SELECT customerNumber, SUM(amount) AS pay_amount,NTILE(10) OVER(ORDER BY SUM(amount) DESC) AS level,CUME_DIST() over(order by SUM(amount) desc) as pctFROM mysqldemo.paymentsWHERE Year(paymentDate) = 2013GROUP BY customerNumber)a
直接定位帶排序小于等于30%的即可。從結果可以看出,和NTILE不一樣,第三組的人沒有全部都取。?
10. NTH_VALUE
函數格式為:
NTH_VALUE(expression, N)
OVER (partition_clauseorder_clauseframe_clause)?
從有序行集中的第N行獲取值;如果第N行不存在,則函數返回NULL;N必須是正整數。
注意:From First(標準SQL 支持 From Last, MySQL只支持From First。如果要模擬效果From Last,則可以使用其中ORDER BY倒敘排列)
樣例:2015年每月購買金額第三的人
SELECT paymentmonth,customernumber, amount,
NTH_VALUE(customernumber, 3) OVER(PARTITION BY paymentmonth ORDER BY amount DESC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS NTH
FROM (SELECT MONTH(paymentDate) AS paymentmonth, customernumber, SUM(amount) amountFROM paymentsWHERE YEAR(paymentDate) = 2015GROUP BY customernumber, paymentmonth) T1
11. 測試:查詢出每年連續下單的客戶和連續的年份
方法1
使用 lag 取上一年的年份,計算差值是1的,就是這兩年是連續的;然后對customerName進行group by。
SELECT customerName, max(orderYear), min(previousYear), SUM(gap)+1
FROM (SELECT customerName,?orderYear,?lag(orderYear) over(partition by customerName order by orderYear) AS previousYear,orderYear - lag(orderYear) over(partition by customerName order by orderYear) ?gapFROM (SELECT?customerName,YEAR(orderDate) AS orderYearFROM ordersINNER JOIN customers USING (customerNumber)GROUP BY customerName, orderYear?) T1) T2
WHERE gap =1
GROUP BY customerName
方法2
用Year 減去row_number, 取得gap,gap相同的,就是年份連續的。
SELECT customerName, minYear, maxYear, max(nb)
FROM (SELECT customerName,gap,orderYear,?min(orderYear) OVER (partition by customerName,gap ORDER BY customerName,gap) minYear,?max(orderYear)OVER (partition by customerName,gap ORDER BY customerName,gap) maxYear,ROW_NUMBER() OVER (partition by customerName,gap ORDER BY customerName,gap) nbFROM(SELECT customerName, orderYear, orderYear-nbbycustomer as gapFROM (SELECT?customerName,YEAR(orderDate) AS orderYear,ROW_NUMBER() OVER (PARTITION BY customerName ORDER BY YEAR(orderDate)) nbbycustomerFROM ordersINNER JOIN customers USING (customerNumber)GROUP BY customerName,orderYear)T1) T2
) T3
WHERE minYear <> maxYear
GROUP BY customerName, minYear, maxYear;
12. 其它有趣的函數
使用rand() 獲取隨機10行數據。
select * from customers order by rand() limit 10;