Below are high-frequency HQL interview questions with matching SQL examples, covering core syntax, optimization techniques, and typical scenarios — ready to use for interview prep:
I. Basic operations and DDL
1. Creating a partitioned table & dynamic partition insert

```sql
-- Create an external partitioned table (partitioned by date)
CREATE EXTERNAL TABLE logs (
  user_id  STRING,
  event    STRING,
  duration INT
)
PARTITIONED BY (dt STRING)   -- partition column
STORED AS ORC
LOCATION '/user/hive/warehouse/logs';

-- Dynamic partition insert (partition directories are created automatically)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE logs PARTITION (dt)
SELECT user_id, event, duration,
       event_date AS dt      -- the last column feeds the partition field
FROM raw_logs;
```
II. Query optimization
2. Bucketed tables & SMB join

```sql
-- Create a bucketed, sorted table (4 buckets, sorted by user_id)
CREATE TABLE users_bucketed (
  user_id STRING,
  name    STRING
)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 4 BUCKETS
STORED AS ORC;

-- Enable sort-merge-bucket (SMB) join
SET hive.optimize.bucketmapjoin=true;
SET hive.auto.convert.sortmerge.join=true;

SELECT /*+ MAPJOIN(b) */ a.user_id, a.event, b.name
FROM logs a
JOIN users_bucketed b ON a.user_id = b.user_id;  -- avoids a shuffle
```
III. Window functions in practice
3. Users active on 3 consecutive days

```sql
SELECT DISTINCT user_id
FROM (
  SELECT user_id, dt,
         LAG(dt, 2) OVER (PARTITION BY user_id ORDER BY dt) AS lag2  -- the date two rows back
  FROM (SELECT DISTINCT user_id, dt  -- dedupe to one row per user per day
        FROM logs WHERE event = 'active') d
) t
WHERE datediff(dt, lag2) = 2;  -- current date is exactly 2 days after the date two rows back, i.e. 3 consecutive days
```
// added by me — Method 2: the date-minus-rank trick
Reference: SQL統計連續登陸3天的用戶(連續活躍超3天用戶)_sql連續登錄3天用戶數 - CSDN blog

```sql
-- For each user, subtract the row number (ordered by date) from the date itself;
-- dates in an unbroken run all map to the same sub_date, so grouping by it
-- counts the length of each run.
SELECT user_id, count(1) AS cnt
FROM (
  SELECT user_id, DATE_SUB(dt, rn) AS sub_date
  FROM (
    SELECT user_id, dt,
           row_number() OVER (PARTITION BY user_id ORDER BY dt) AS rn
    FROM (SELECT DISTINCT user_id, dt  -- dedupe first, or duplicate days break the trick
          FROM logs WHERE event = 'active') d
  ) t
) diffTable
GROUP BY user_id, sub_date
HAVING count(1) >= 3;
```
Consecutive logins (allowing gaps):
https://zhuanlan.zhihu.com/p/29641524870
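One common approach to the gap-tolerant variant is gaps-and-islands: start a new group whenever the gap since the previous login exceeds the allowed interval. A minimal sketch, reusing the logs table from above; the 2-day tolerance is an illustrative assumption, not from the linked article:

```sql
-- Each run of logins with gaps <= 2 days becomes one group (grp);
-- the running SUM increments only when a gap is too large.
SELECT user_id, min(dt) AS start_dt, max(dt) AS end_dt, count(1) AS login_days
FROM (
  SELECT user_id, dt,
         sum(CASE WHEN datediff(dt, prev_dt) <= 2 THEN 0 ELSE 1 END)
           OVER (PARTITION BY user_id ORDER BY dt) AS grp
  FROM (
    SELECT user_id, dt,
           lag(dt, 1) OVER (PARTITION BY user_id ORDER BY dt) AS prev_dt
    FROM (SELECT DISTINCT user_id, dt FROM logs WHERE event = 'login') d
  ) t
) g
GROUP BY user_id, grp;
```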
4. Top N per group (top three salaries per department)

```sql
SELECT dept, name, salary
FROM (
  SELECT dept, name, salary,
         DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rk
  FROM employees
) t
WHERE rk <= 3;
```
5. Trimmed mean: drop the max and min, then average
https://zhuanlan.zhihu.com/p/28466428027
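A hedged sketch of one way to do this, reusing the employees(dept, salary) table from the Top N example; it drops exactly one highest and one lowest value per department (assumption — the linked article may define the trim differently):

```sql
-- Rank each salary from both ends, then exclude the two extremes
-- before averaging what remains.
SELECT dept, avg(salary) AS trimmed_avg
FROM (
  SELECT dept, salary,
         row_number() OVER (PARTITION BY dept ORDER BY salary ASC)  AS rn_asc,
         row_number() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn_desc
  FROM employees
) t
WHERE rn_asc > 1 AND rn_desc > 1
GROUP BY dept;
```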
6. Peaks and valleys (Kuaishou)
https://zhuanlan.zhihu.com/p/1885646993512200080
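A minimal sketch of the usual LAG/LEAD approach: a point is a peak when it is higher than both neighbours, a valley when lower than both. The metrics(dt, val) table is hypothetical, and ties are ignored for simplicity:

```sql
SELECT dt, val,
       CASE WHEN val > prev_val AND val > next_val THEN 'peak'
            WHEN val < prev_val AND val < next_val THEN 'valley'
       END AS point_type
FROM (
  SELECT dt, val,
         lag(val, 1)  OVER (ORDER BY dt) AS prev_val,
         lead(val, 1) OVER (ORDER BY dt) AS next_val
  FROM metrics
) t
WHERE (val > prev_val AND val > next_val)
   OR (val < prev_val AND val < next_val);
```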
7. Invalid AI oncall tickets
https://zhuanlan.zhihu.com/p/1893678485387588081
IV. Handling data skew
5. Salting a skewed join key

```sql
-- Assume user_id = '999' is the skewed key.
-- Salt the skewed key on the large table, and replicate the matching rows
-- of the other table with every possible salt so the join still matches.
SELECT *
FROM (
  SELECT user_id, event,
         CASE WHEN user_id = '999'
              THEN concat(user_id, '_', cast(floor(rand() * 10) AS STRING))  -- spread '999' across 10 keys
              ELSE user_id
         END AS join_key
  FROM logs
) a
JOIN (
  -- replicate the skewed user's row once per salt value
  SELECT concat(user_id, '_', suffix) AS join_key, name
  FROM users
  LATERAL VIEW explode(split('0,1,2,3,4,5,6,7,8,9', ',')) s AS suffix
  WHERE user_id = '999'
  UNION ALL
  SELECT user_id AS join_key, name
  FROM users
  WHERE user_id <> '999'
) b
ON a.join_key = b.join_key;
```
V. Advanced functions and transformations
6. Row/column conversion

```sql
-- Rows to columns (aggregate many rows into one)
SELECT user_id,
       CONCAT_WS(',', COLLECT_LIST(event)) AS events  -- merge the event list
FROM logs
GROUP BY user_id;

-- Columns to rows (explode an array)
SELECT user_id, event_name
FROM logs
LATERAL VIEW EXPLODE(SPLIT(events, ',')) e AS event_name;  -- events is a comma-separated string
```
7. JSON parsing

```sql
-- get_json_object extracts a single path; json_tuple is a UDTF, so when it is
-- mixed with other columns it must go through LATERAL VIEW
SELECT get_json_object(json_col, '$.user.id') AS user_id,
       t.event, t.ts
FROM json_logs
LATERAL VIEW json_tuple(json_col, 'event', 'timestamp') t AS event, ts;  -- parse several fields at once
```
VI. Performance tuning
8. Predicate pushdown

```sql
-- Before (full scan of all partitions and columns)
SELECT * FROM logs WHERE dt = '2023-08-12' AND duration > 1000;

-- After (partition pruning + column pruning)
SELECT user_id, event            -- read only the needed columns
FROM logs
WHERE dt = '2023-08-12'          -- filter on the partition column
  AND duration > 1000;           -- pushed down automatically with ORC
```
9. Forcing a MapJoin with a hint

```sql
SELECT /*+ MAPJOIN(small_table) */ big_table.id, small_table.name
FROM big_table
JOIN small_table ON big_table.id = small_table.id;
```
VII. Scenario-question templates
10. Retention rate (next-day retention)
https://www.zhihu.com/question/294871305/answer/1903544000008417365

```sql
SELECT a.dt,
       COUNT(DISTINCT a.user_id) AS dau,
       COUNT(DISTINCT b.user_id) AS next_day_retained,
       COUNT(DISTINCT b.user_id) / COUNT(DISTINCT a.user_id) AS retention_rate
FROM (SELECT dt, user_id FROM logs WHERE event = 'login') a
LEFT JOIN (SELECT dt, user_id FROM logs WHERE event = 'login') b
  ON a.user_id = b.user_id
 AND b.dt = DATE_ADD(a.dt, 1)  -- logged in again the next day
GROUP BY a.dt;
```
High-frequency topics, summarized

| Type | Key points |
|---|---|
| Syntax | PARTITIONED BY vs CLUSTERED BY, LATERAL VIEW explode() |
| Window functions | ROW_NUMBER()/RANK()/DENSE_RANK(), LAG()/LEAD(), ROWS BETWEEN |
| Optimization | partition pruning, column pruning, MapJoin, SMB join, salting skewed keys |
| Complex types | COLLECT_LIST()/COLLECT_SET(), get_json_object() |
| Scenarios | retention rate, consecutive logins, Top N, UV/PV counts |

Tip: in the interview, always explain why an optimization works (e.g. "SMB join avoids the shuffle because both sides are bucketed and sorted"), and emphasize hands-on experience with data skew.
A few key functions
ROWS BETWEEN

```sql
-- Window frames (PARTITION BY / ORDER BY omitted for brevity; in practice the
-- OVER clause normally carries an ORDER BY so the frame is deterministic)
SUM(sales_volume) OVER (ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)         AS sum_sales  -- previous row + current row
SUM(sales_volume) OVER (ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS sum_sales  -- current row through the end
SUM(sales_volume) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum_sales  -- running total from the start
SUM(sales_volume) OVER (ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)         AS sum_sales  -- current row + next two rows
```
LAG() / LEAD()

```sql
LAG(column_name, offset, default_value)  OVER (PARTITION BY partition_column ORDER BY sort_column)
LEAD(column_name, offset, default_value) OVER (PARTITION BY partition_column ORDER BY sort_column)
```

Example scenario: an employee salary table emp with ename (name), job (title), and sal (salary).
- Fetch the previous row's salary: LAG(sal, 1) OVER (ORDER BY sal) returns the salary from the row before the current one.
- Set a default value: LAG(sal, 1, 3.1415) returns 3.1415 when the current row has no preceding row.
- Continuity checks: combining LAG(sal, 1) with LEAD(sal, 1) lets you compare each row against both of its neighbours.
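The three points above can be combined into one query — a minimal sketch against the emp table described above:

```sql
-- Each row's salary alongside its neighbours in sorted order;
-- prev_sal is NULL on the first row, next_sal is NULL on the last.
SELECT ename, sal,
       LAG(sal, 1)  OVER (ORDER BY sal) AS prev_sal,
       LEAD(sal, 1) OVER (ORDER BY sal) AS next_sal
FROM emp;
```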
ROW_NUMBER() / RANK() / DENSE_RANK()

```sql
dense_rank() OVER (PARTITION BY class ORDER BY pjs.sn_scores DESC) AS d_rank
rank()       OVER (PARTITION BY class ORDER BY pjs.sn_scores DESC) AS rnk
row_number() OVER (PARTITION BY class ORDER BY pjs.sn_scores DESC) AS row_num
```

For a score table sorted descending, e.g. scores 95, 90, 90, 85:
- ROW_NUMBER(): 1, 2, 3, 4 (strictly sequential; ties broken arbitrarily)
- RANK(): 1, 2, 2, 4 (ties share a rank; the next rank is skipped)
- DENSE_RANK(): 1, 2, 2, 3 (ties share a rank; ranks stay consecutive)
PARTITIONED BY vs CLUSTERED BY

CLUSTERED BY (bucketing)
Purpose: at table-creation time, define how rows are hashed into a fixed number of files (buckets).
Example:

```sql
CREATE TABLE table_name (col1 INT, col2 STRING)
CLUSTERED BY (col1) INTO 4 BUCKETS;
```

How it works: Hive hashes col1 to spread rows across the 4 buckets, which can speed up sampling and bucketed joins.

SORTED BY (sort order inside each bucket)
Purpose: at table-creation time, define how rows are sorted within each bucket.
Example:

```sql
CREATE TABLE table_name (col1 INT, col2 STRING)
CLUSTERED BY (col1) SORTED BY (col2 ASC) INTO 4 BUCKETS;
```

Use case: columns that are filtered on frequently, such as a time field.

DISTRIBUTE BY (controlling data distribution)
Purpose: in a SELECT / INSERT OVERWRITE statement, control which reducer each row goes to; sometimes used to lay data out ahead of a join.
Example:

```sql
INSERT OVERWRITE TABLE target_table
SELECT * FROM source_table
DISTRIBUTE BY (join_key);
```

PARTITIONED BY (partitioned tables)
Purpose: create a partitioned table; data is stored physically in a separate directory per partition value.
Example:

```sql
CREATE TABLE table_name (col1 INT, col2 STRING)
PARTITIONED BY (dt STRING);
```

Query optimization:

```sql
SELECT * FROM table_name WHERE dt = '2023-01-01';  -- prunes directly to the partition directory
```
Source for the comparison above:
HiveSQL 入門避坑指南:搞懂這些 "BY",讓你的 SQL 少跑 80% 的冤枉路_hive cluster by作用 - CSDN blog
LATERAL VIEW explode()
Hive之explode()函數和posexplode()函數和lateral view函數_lateral view explode函數-CSDN博客
Hive中的explode函數、posexplode函數與later view函數_hive explode-CSDN博客
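Since the references above also cover posexplode(), a minimal sketch of the difference (the user_tags table and its comma-separated tags column are hypothetical):

```sql
-- Unlike explode(), posexplode() also returns each element's position in the array
SELECT user_id, pos, tag
FROM user_tags
LATERAL VIEW posexplode(split(tags, ',')) t AS pos, tag;
```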