????🍋🍋大數據學習🍋🍋
🔥系列專欄: 👑哲學語錄: 用力所能及,改變世界。
💖如果覺得博主的文章還不錯的話,請點贊👍+收藏??+留言📝支持一下博主哦🤞
一、分組排序問題(Top N 變體)
1. 按多個條件排序并取 Top N
問題:查詢每個部門薪資最高且入職最早的前 2 名員工。
思路:
- 窗口函數中用?
ORDER BY salary DESC, hire_date ASC
?實現多條件排序。 - 用?
ROW_NUMBER()
?生成唯一排名,避免并列。
代碼模板:
WITH ranked_employees AS (SELECT *,ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC, hire_date ASC) AS rankFROM employees
)
SELECT * FROM ranked_employees WHERE rank <= 2;
2. 動態 Top N(按分組比例取前 N%)
問題:查詢每個部門薪資前 10% 的員工。
思路:
- 用?
NTILE(10)
?將數據按薪資分為 10 組,取第 1 組。
代碼模板:
WITH salary_tiles AS (SELECT *,NTILE(10) OVER (PARTITION BY dept_id ORDER BY salary DESC) AS salary_tileFROM employees
)
SELECT * FROM salary_tiles WHERE salary_tile = 1;
二、連續區間問題(變體)
1. 連續缺失值檢測
問題:檢測用戶登錄記錄中連續缺失超過 3 天的區間。
思路:
- 生成完整日期序列,左連接實際記錄,標記缺失日期。
- 用?
日期-行號
?分組連續缺失區間。
代碼模板:
WITH all_dates AS (-- 生成日期序列(略)
),
missing_dates AS (SELECT user_id,date,CASE WHEN login_id IS NULL THEN 1 ELSE 0 END AS is_missingFROM all_datesLEFT JOIN user_logins USING (user_id, date)
),
missing_groups AS (SELECT user_id,date,DATE_SUB(date, ROW_NUMBER() OVER (PARTITION BY user_id, is_missing ORDER BY date)) AS grpFROM missing_datesWHERE is_missing = 1
)
SELECT user_id,MIN(date) AS start_date,MAX(date) AS end_date,COUNT(*) AS missing_days
FROM missing_groups
GROUP BY user_id, grp
HAVING COUNT(*) > 3;
2. 周期性行為識別
問題:識別用戶每周固定某天的登錄習慣(如每周三)。
思路:
- 用?
DAYOFWEEK()
?獲取星期幾,按用戶和星期分組統計頻次。
代碼模板:
SELECT user_id,DAYOFWEEK(login_date) AS day_of_week,COUNT(*) AS login_count,ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY COUNT(*) DESC) AS rank
FROM user_logins
GROUP BY user_id, DAYOFWEEK(login_date)
HAVING rank = 1; -- 取頻次最高的一天
三、復雜聚合問題
1. 分組內條件聚合(加權平均)
問題:計算每個商品在不同促銷活動下的加權平均銷量(權重為活動持續天數)。
思路:
- 用?
SUM(銷量*權重)/SUM(權重)
?實現加權平均。
代碼模板:
SELECT product_id,SUM(sales * duration_days) / SUM(duration_days) AS weighted_avg_sales
FROM (SELECT product_id,campaign_id,SUM(daily_sales) AS sales,DATEDIFF(end_date, start_date) + 1 AS duration_daysFROM sales_recordsGROUP BY product_id, campaign_id, start_date, end_date
) t
GROUP BY product_id;
2. 動態區間聚合(按事件觸發)
問題:計算用戶每次登錄后 24 小時內的消費總額。
思路:
- 用?
JOIN
?關聯同一用戶的登錄和消費記錄,篩選時間窗口。
代碼模板:
SELECT l.user_id,l.login_time,SUM(o.amount) AS total_spent
FROM user_logins l
LEFT JOIN orders o
ON l.user_id = o.user_id
AND o.order_time BETWEEN l.login_time AND DATE_ADD(l.login_time, 1)
GROUP BY l.user_id, l.login_time;
四、多維分析(OLAP 風格)
1. 小計與總計(GROUPING SETS/CUBE/ROLLUP)
問題:同時計算按部門、職位和兩者組合的薪資總和。
思路:
- 用?
GROUPING SETS
?生成多種分組組合。
代碼模板:
SELECT dept_id,position,SUM(salary) AS total_salary
FROM employees
GROUP BY GROUPING SETS((dept_id, position), -- 部門+職位分組(dept_id), -- 部門分組(position), -- 職位分組() -- 總計
);
2. 同比 / 環比(跨時間周期比較)
問題:計算 2023 年每月銷售額的同比和環比增長率。
思路:
- 用?
LAG()
?獲取上月 / 去年同月數據,或用?JOIN
?關聯時間偏移表。
代碼模板:
WITH monthly_sales AS (SELECT YEAR(sale_date) AS sale_year,MONTH(sale_date) AS sale_month,SUM(amount) AS total_amountFROM salesGROUP BY YEAR(sale_date), MONTH(sale_date)
)
SELECT curr.sale_year,curr.sale_month,curr.total_amount,prev_month.total_amount AS prev_month_amount,prev_year.total_amount AS prev_year_amount,(curr.total_amount - prev_month.total_amount) / prev_month.total_amount AS mom_growth,(curr.total_amount - prev_year.total_amount) / prev_year.total_amount AS yoy_growth
FROM monthly_sales curr
LEFT JOIN monthly_sales prev_month
ON curr.sale_year = prev_month.sale_year
AND curr.sale_month = prev_month.sale_month + 1
LEFT JOIN monthly_sales prev_year
ON curr.sale_year = prev_year.sale_year + 1
AND curr.sale_month = prev_year.sale_month;
五、地理信息與空間分析
1. 區域聚合(按地理邊界統計)
問題:統計每個城市商圈內的店鋪數量。
思路:
- 用?
ST_Contains()
?判斷點(店鋪)是否在多邊形(商圈)內。
代碼模板:
SELECT district_name,COUNT(shop_id) AS shop_count
FROM shops s
JOIN districts d
ON ST_Contains(ST_GeomFromText(d.polygon_wkt), -- 商圈多邊形ST_Point(s.longitude, s.latitude) -- 店鋪坐標
)
GROUP BY district_name;
2. 距離最近點查詢
問題:為每個用戶找到距離最近的 3 個服務點。
思路:
- 用 Haversine 公式計算距離,
ROW_NUMBER()
?取 Top N。
代碼模板:
WITH distances AS (SELECT u.user_id,s.service_id,6371 * 2 * ASIN(SQRT(POWER(SIN((s.lat - u.lat) * PI()/180 / 2), 2) +COS(u.lat * PI()/180) * COS(s.lat * PI()/180) *POWER(SIN((s.lon - u.lon) * PI()/180 / 2), 2))) AS distance_kmFROM users uCROSS JOIN service_points s
)
SELECT *
FROM (SELECT *,ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY distance_km) AS rankFROM distances
) t
WHERE rank <= 3;
六、時間窗口滑動聚合
1. 固定窗口聚合(每小時 / 每天)
問題:計算每小時的平均請求數。
思路:
- 用?
DATE_TRUNC()
?截斷時間到小時,按小時分組。
代碼模板:
SELECT DATE_TRUNC('HOUR', request_time) AS hour,COUNT(request_id) AS request_count,AVG(response_time) AS avg_response_time
FROM requests
GROUP BY DATE_TRUNC('HOUR', request_time);
2. 滑動窗口聚合(過去 N 條記錄)
問題:計算每個用戶最近 5 次登錄的平均停留時長。
思路:
- 用?
ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
?定義滑動窗口。
代碼模板:
SELECT user_id,login_time,session_duration,AVG(session_duration) OVER (PARTITION BY user_id ORDER BY login_time ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS avg_last_5_sessions
FROM user_sessions;
七、數據透視與交叉表
1. 動態列轉置(不確定列數)
問題:將用戶標簽(每行一個標簽)轉為列(每個標簽一列)。
思路:
- 用?
collect_set()
?聚合標簽,size()
?判斷是否存在。
代碼模板:
WITH user_tags AS (SELECT user_id,collect_set(tag) AS tagsFROM user_tag_mappingGROUP BY user_id
)
SELECT user_id,CASE WHEN 'vip' IN (SELECT * FROM UNNEST(tags)) THEN 1 ELSE 0 END AS is_vip,CASE WHEN 'new' IN (SELECT * FROM UNNEST(tags)) THEN 1 ELSE 0 END AS is_new,-- 動態添加更多標簽判斷
FROM user_tags;
2. 交叉表統計(多維度組合)
問題:統計不同年齡段和性別用戶的消費金額分布。
思路:
- 用?
CASE WHEN
?組合維度,SUM()
?聚合金額。
代碼模板:
SELECT age_group,SUM(CASE WHEN gender = 'M' THEN amount ELSE 0 END) AS male_amount,SUM(CASE WHEN gender = 'F' THEN amount ELSE 0 END) AS female_amount,SUM(amount) AS total_amount
FROM users u
JOIN orders o USING (user_id)
GROUP BY age_group;
八、遞歸查詢與層級結構
1. 樹形結構路徑查詢(如組織架構)
問題:查詢員工及其所有上級的完整路徑。
思路:
- 用遞歸 CTE 逐層向上查找上級。
代碼模板:
WITH RECURSIVE employee_hierarchy AS (SELECT emp_id,manager_id,emp_name,CAST(emp_name AS STRING) AS pathFROM employeesWHERE manager_id IS NULL -- 根節點(CEO)UNION ALLSELECT e.emp_id,e.manager_id,e.emp_name,CONCAT(eh.path, ' -> ', e.emp_name) AS pathFROM employees eJOIN employee_hierarchy eh ON e.manager_id = eh.emp_id
)
SELECT * FROM employee_hierarchy;
2. 層級聚合(如區域銷售額匯總)
問題:計算每個區域及其子區域的總銷售額。
思路:
- 自下而上遞歸聚合,用?
SUM() OVER (PARTITION BY region_id)
。
代碼模板:
WITH region_sales AS (-- 基礎銷售額(略)
),
region_hierarchy AS (-- 區域層級關系(略)
),
recursive_sales AS (-- 遞歸計算子區域銷售額(略)
)
SELECT region_id,region_name,SUM(sales_amount) OVER (PARTITION BY region_id) AS total_sales
FROM recursive_sales;
九、字符串與模式匹配
1. 復雜字符串分割與提取
問題:從日志中提取?user_id
?和?action
(格式:[user_id:1001][action:click]
)。
思路:
- 用?
regexp_extract()
?或?substr()
+instr()
?提取子串。
代碼模板:
SELECT regexp_extract(log_line, '\\[user_id:(\\d+)\\]', 1) AS user_id,regexp_extract(log_line, '\\[action:(\\w+)\\]', 1) AS action
FROM logs;
2. 字符串相似度計算
問題:找出商品名稱中包含特定關鍵詞的記錄。
思路:
- 用?
LIKE
?或?REGEXP
?匹配,或用?levenshtein_distance()
?計算編輯距離。
代碼模板:
-- 方法1:模糊匹配
SELECT * FROM products WHERE product_name LIKE '%關鍵詞%';-- 方法2:正則匹配
SELECT * FROM products WHERE product_name REGEXP '關鍵詞';-- 方法3:相似度計算
SELECT *
FROM products
WHERE levenshtein_distance(product_name, '目標名稱') <= 3;
解題思路:
- 問題拆解:將復雜需求分解為子問題(如 “連續登錄” → “生成連續標識” → “分組統計”)。
- 數據建模:明確輸入輸出表結構,確定關聯字段和聚合維度。
- 技術選型:
- 窗口函數:排名、累計計算、滑動窗口。
- JOIN:關聯多表數據,注意過濾條件前置。
- 正則 / JSON 函數:處理復雜字符串和嵌套結構。
- 性能優化:
- 用?
EXPLAIN
?分析執行計劃,避免全表掃描。 - 對大表 JOIN 考慮 MapJoin 或分桶表。
- 過濾條件盡量前置,減少中間數據量。
- 用?