從博客上發現兩個面試題,其中有個用到了lag函數。整理學習
LAG 函數是 Hive 中常用的窗口函數,用于訪問同一分區內?前一行(或前 N 行)的數據。它在分析時間序列數據、計算相鄰記錄差異等場景中非常有用。
一、語法
LAG(column, offset, default) OVER (PARTITION BY partition_expression ORDER BY sort_expression [ASC|DESC]
)
-
column:需要訪問的列。
-
offset:向前回溯的行數(默認為 1)。
-
default:當沒有前 N 行時的默認值(默認為 NULL)。
-
PARTITION BY:按指定字段分區,每個分區獨立計算。
-
ORDER BY:定義分區內的排序方式。
二、代碼示例
1. 示例數據表?sales
sale_date | revenue | user_id |
---|---|---|
2023-01-01 | 100 | 1 |
2023-01-02 | 150 | 1 |
2023-01-03 | 200 | 1 |
2023-01-01 | 50 | 2 |
2023-01-02 | 80 | 2 |
2. 創建表并插入數據
CREATE TABLE sales (sale_date STRING,revenue INT,user_id INT
);INSERT INTO sales VALUES
('2023-01-01', 100, 1),
('2023-01-02', 150, 1),
('2023-01-03', 200, 1),
('2023-01-01', 50, 2),
('2023-01-02', 80, 2);
3. 使用 LAG 計算每日環比增長
select user_id --用戶,sale_date --銷售日期,revenue --收入,lag(revenue,1) over(partition by user_id order by sale_date ) as prev_revenue --前一天的收入,revenue - LAG(revenue, 1) OVER (PARTITION BY user_id ORDER BY sale_date) AS growth --每日環比增長from sales
三、使用場景
1、時間序列分析
-
計算每日/月銷售額的環比增長(如示例所示)。
-
檢測異常波動(如某天收入驟降 90%)。
?異常波動示例:假設有一張銷售表?sales
,需要檢測?單用戶單日銷售額相比前一日波動超過 50%?的異常情況。
CREATE TABLE sales_yc (user_id INT,sale_date STRING,revenue DOUBLE
)
stored as orcINSERT INTO sales_yc VALUES
(1, '2023-01-01', 100.0),
(1, '2023-01-02', 150.0), -- 正常增長 50%
(1, '2023-01-03', 30.0), -- 異常下降 80%
(2, '2023-01-01', 200.0),
(2, '2023-01-02', 450.0); -- 異常增長 125%with tmp as
(
select user_id,sale_date,revenue,lag(revenue,1,0) over(partition by user_id order by sale_date) prev_revenuefrom sales_yc
)
select user_id,sale_date,revenue,if(prev_revenue = 0,null,round((revenue - prev_revenue) / prev_revenue * 100,2))||'%' change_percentfrom tmp
where abs((revenue - prev_revenue) / prev_revenue) > 0.5and prev_revenue <> 0
user_id | sale_date | revenue | prev_revenue | change_percent | |
---|---|---|---|---|---|
1 | 2023-01-03 | 30.0 | 150.0 | -80.0 | -- 下降 80% |
2 | 2023-01-02 | 450.0 | 200.0 | 125.0 | -- 增長 125% |
2、填充缺失值
若數據缺失,可用前一行值填充:?
SELECT sale_date,COALESCE(revenue, LAG(revenue) OVER (ORDER BY sale_date)) AS imputed_revenue
FROM sales;
3、用戶行為分析
計算用戶兩次操作的時間間隔:
CREATE TABLE user_events (user_id STRING,event_time STRING,event_type STRING
)
stored as orcINSERT overwrite table user_events VALUES
('u1', '2023-01-01 08:00:00', 'login'),
('u1', '2023-01-01 08:05:30', 'click'),
('u1', '2023-01-01 08:15:45', 'purchase'),
('u2', '2023-01-01 09:00:00', 'login'),
('u2', '2023-01-01 09:30:00', 'logout'),
('u1', '2023-01-15 08:15:45', 'Add shopp');with tmp as
(
select user_id,event_time,event_type,LAG(event_time, 1) OVER (PARTITION BY user_id ORDER BY event_time ) AS prev_event_time -- 獲取前一次操作時間(按用戶分區,時間排序)from user_events
)
select user_id,event_time,event_type,prev_event_time,ROUND((UNIX_TIMESTAMP(event_time) - UNIX_TIMESTAMP(prev_event_time)) / 60,2) prev_event_time -- 計算時間間隔(轉換為分鐘,保留2位小數)from tmp;
4.庫存管理
跟蹤庫存變化時,對比當前庫存與前一日的差異。
CREATE TABLE stock (product_id STRING,stock_date STRING, -- 日期格式需為 yyyy-MM-ddquantity INT
)
stored as orcINSERT INTO stock VALUES
('1', '2023-01-01', 100),
('1', '2023-01-02', 80),
('1', '2023-01-03', 120),
('2', '2023-01-01', 200),
('2', '2023-01-03', 180);SELECT product_id,stock_date,quantity,-- 獲取前一日庫存(按商品分區,按日期排序)LAG(quantity, 1) OVER (PARTITION BY product_id ORDER BY stock_date) AS prev_quantity,-- 計算差異(當前庫存 - 前一日庫存)quantity - LAG(quantity, 1) OVER (PARTITION BY product_id ORDER BY stock_date) AS diff
FROM stock;--篩選出庫/入庫記錄
SELECT *
FROM (-- 上述計算差異的 SQL
) tmp
WHERE diff IS NOT NULL;
四、面試題
已知有數據A如下,請分別根據A生成B和C。
數據A
+-----+-------+
| id | name |
+-----+-------+
| 1 | aa |
| 2 | aa |
| 3 | aa |
| 4 | d |
| 5 | c |
| 6 | aa |
| 7 | aa |
| 8 | e |
| 9 | f |
| 10 | g |
+-----+-------+
數據B?
+-----+-----------------+
| id | name |
+-----+-----------------+
| 7 | aa|aa|aa|aa|aa |
| 4 | d |
| 5 | c |
| 8 | e |
| 9 | f |
| 10 | g |
+-----+-----------------+
+-----+-----------+
| id | name |
+-----+-----------+
| 3 | aa|aa|aa |
| 4 | d |
| 5 | c |
| 7 | aa|aa |
| 8 | e |
| 9 | f |
| 10 | g |
+-----+-----------+
1、題目一 要求對name相同的數據進行合并處理,name相同的合并到一起用’|'進行拼接,id取組內最大值
2、題目二 要求對相鄰name相同的數據進行合并,name相同的合并到一起用’|'進行拼接,id取組內最大值
問題1實現邏輯:
with tmp as (
select id,name,max(id) over(partition by name) new_idfrom data_a
)
select new_id,concat_ws('|',collect_list(name))from tmp
group by new_id
order by new_id
問題2實現邏輯:
selectmax(id) as id,concat_ws("|",collect_list(name)) as name
from(selectid,name,sum(is_continus) over(order by id) as continus_gpfrom(selectid,name,if(name=lag(name) over( order by id),0,1) as is_continusfromdata_a)tmp
) a
group bycontinus_gp
注:sql面試題目來源?SQL面試題——京東SQL面試題 合并數據-CSDN博客