clickhouse 隨心所欲的聚合模型-AggregatingMergeTree

clickhouse 強大的 MergeTree 系列引擎令人信服，其 ReplacingMergeTree、SummingMergeTree 在數據唯一性和匯總場景中表現非凡。但你是否還有保留最小(大)、平均等預聚合需求，甚至在一個模型中既有唯一性語意也有匯總、最小、最大、平均值語意該如何處理。在 doris 中 Aggregate 數據模型可以輕松解決，那么同為頭部 AP 數據庫的 clickhouse 是否可以隨心所欲的定義聚合模型呢？

一、AggregatingMergeTree

1.1 基本使用

AggregatingMergeTree 表引擎作為 MergeTree 系列引擎也是遵循其家族的基本邏輯的，它能夠在合并分區的時候按照預先定義的方式聚合數據。與 ReplacingMergeTree、SummingMergeTree 不同的是表引擎已經內置好了聚合方式，用戶只能指定字段在分區合并時對字段進行去重或累加，AggregatingMergeTree 則進一步開發底層給用戶，用戶需要指定在分區合并時采用何種聚合函數，以及針對哪些字段進行計算，下面是該引擎的使用方式（復刻 doris 官方文檔的案例）

drop table if exists tbl_agg;
create table if not exists tbl_agg
(`user_id` String comment '用戶id',`date`    datetime comment '數據灌入日期時間',`city`    String comment '用戶所在城市',`age`     Int8 comment '用戶年齡',`sex`     Int8 comment '用戶性別',`last_visit_date` AggregateFunction(anyLast,DateTime) comment '用戶最后一次訪問時間',`cost` AggregateFunction(sum, Int256) comment '用戶總消費',`max_dwell_time` AggregateFunction(max,Int64) comment '用戶最大停留時間',`min_dwell_time` AggregateFunction(min,Int64) comment '用戶最小停留時間'
) engine AggregatingMergeTree()order by (user_id, date, city, age, sex);

AggregateFunction 是 clickhouse 提供的特殊數據類型，它能夠以二進制的形式存儲中間狀態結果。其使用方式也十分特殊，在定義的時候需要提供聚合方式以及數據類型。常用的聚合方式整理如下：

count: 計數非空行數
sum: 累加
max: 最大值
min: 最小值
anyLast: 最后一個非空值
uniq: 去重計數

當然 clickhouse 提供的聚合函數很多，詳情可以訪問: https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference

因為 AggregateFunction 是二進制存儲的中間結果，我們在插入數據時也需要將明文數據轉換為 AggregateFunction 可以接受的數據類型，clickhouse 為每個聚合函數都提供了轉換為 AggregateFunction 類型的 *State 函數

insert into tbl_agg
select 10000,'2017-10-01','北京',20,0,anyLastState(toDateTime('2017-10-01 06:00:00')),sumState(toInt256(20)),maxState(toInt64(10)),minState(toInt64(10));insert into tbl_agg
select 10000,'2017-10-01','北京',20,0,anyLastState(toDateTime('2017-10-01 07:00:00')),sumState(toInt256(15)),maxState(toInt64(2)),minState(toInt64(2));

同理我們在查詢是也需要特殊的函數將 AggregateFunction 類型轉換為明文（類似序列化與反序列區別），而查詢時需要使用 *Merge 函數

SELECTuser_id,anyLastMerge(last_visit_date) AS last_visit_date,sumMerge(cost) AS cost,maxMerge(max_dwell_time) AS max_dwell_time,minMerge(min_dwell_time) AS min_dwell_time
FROM tbl_agg
GROUP BY user_id
ORDER BY user_id ASCQuery id: 30a237df-6018-42fa-a6a9-1d324e21310d┌─user_id─┬─────last_visit_date─┬─cost─┬─max_dwell_time─┬─min_dwell_time─┐
│ 10000   │ 2017-10-01 06:00:00 │   35 │             10 │              2 │
└─────────┴─────────────────────┴──────┴────────────────┴────────────────┘1 row in set. Elapsed: 0.005 sec.

看到這里是否覺得這種方式過于繁瑣，連正常的數據插入都需要借助 State 函數，那么在升級改造時將寸步難行。好在上面的方式并不是主流的方式，我們可以借助物化視圖來屏蔽 State 過程，讓數據插入保持原生。

1.2 優化體驗

首先我們創建相同結構的普通表作為底表

drop table if exists tbl_agg_basic;
create table if not exists tbl_agg_basic
(`user_id`         String comment '用戶id',`date`            datetime comment '數據灌入日期時間',`city`            String comment '用戶所在城市',`age`             Int8 comment '用戶年齡',`sex`             Int8 comment '用戶性別',`last_visit_date` datetime comment '用戶最后一次訪問時間',`cost`            Int256 comment '用戶總消費',`max_dwell_time`  Int64 comment '用戶最大停留時間',`min_dwell_time`  Int64 comment '用戶最小停留時間'
) engine MergeTreeorder by (user_id, date, city, age, sex);

之后我們將 State 過程寫入物化視圖中

drop table if exists mv_tbl_agg;
create materialized view if not exists mv_tbl_agg to tbl_agg
as
select user_id,date,city,age,sex,anyLastState(last_visit_date) as last_visit_date,sumState(cost)                as cost,maxState(max_dwell_time)      as max_dwell_time,minState(min_dwell_time)      as min_dwell_time
from tbl_agg_basic
group by user_id, date, city, age, sex;

對用戶來說將明細數據優雅的寫入底表中，tbl_agg 對外提供查詢功能，用戶無需關系數據怎么序列化

flow

下面我們只需要假裝什么都不知道向明細數據表插入數據

insert into tbl_agg_basic
values (10000, '2017-10-01', '北京', 20, 0, '2017-10-01 06:00:00', 20, 10, 10),(10000, '2017-10-01', '北京', 20, 0, '2017-10-01 07:00:00', 15, 2, 2),(10001, '2017-10-01', '北京', 30, 1, '2017-10-01 17:05:45', 2, 22, 22),(10002, '2017-10-02', '上海', 20, 1, '2017-10-02 12:59:12', 200, 5, 5),(10003, '2017-10-02', '廣州', 32, 0, '2017-10-02 11:20:00', 30, 11, 11),(10004, '2017-10-01', '深圳', 35, 0, '2017-10-01 10:00:15', 100, 3, 3),(10004, '2017-10-03', '深圳', 35, 0, '2017-10-03 10:20:22', 11, 6, 6);

數據會自動同步到 tbl_agg 中，在查詢時我們只需要面向 tbl_agg 此時會比直接查詢 tbl_agg_basic 有更高的性能

SELECTuser_id,date,city,age,sex,anyLastMerge(last_visit_date) AS last_visit_date,sumMerge(cost) AS cost,maxMerge(max_dwell_time) AS max_dwell_time,minMerge(min_dwell_time) AS min_dwell_time
FROM tbl_agg
GROUP BYuser_id,date,city,age,sex
ORDER BY user_id ASCQuery id: 6f7fd017-9378-4f42-8c20-56bd711487d1┌─user_id─┬────────────────date─┬─city─┬─age─┬─sex─┬─────last_visit_date─┬─cost─┬─max_dwell_time─┬─min_dwell_time─┐
│ 10000   │ 2017-10-01 00:00:00 │ 北京 │  20 │   0 │ 2017-10-01 07:00:00 │   35 │             10 │              2 │
│ 10001   │ 2017-10-01 00:00:00 │ 北京 │  30 │   1 │ 2017-10-01 17:05:45 │    2 │             22 │             22 │
│ 10002   │ 2017-10-02 00:00:00 │ 上海 │  20 │   1 │ 2017-10-02 12:59:12 │  200 │              5 │              5 │
│ 10003   │ 2017-10-02 00:00:00 │ 廣州 │  32 │   0 │ 2017-10-02 11:20:00 │   30 │             11 │             11 │
│ 10004   │ 2017-10-01 00:00:00 │ 深圳 │  35 │   0 │ 2017-10-01 10:00:15 │  100 │              3 │              3 │
│ 10004   │ 2017-10-03 00:00:00 │ 深圳 │  35 │   0 │ 2017-10-03 10:20:22 │   11 │              6 │              6 │
└─────────┴─────────────────────┴──────┴─────┴─────┴─────────────────────┴──────┴────────────────┴────────────────┘6 rows in set. Elapsed: 0.008 sec.

還可以插入幾條數據來觀察 tbl_agg 的結果是否符合我們定義的聚合語意

二、SimpleAggregateFunction

對于上面的案例其實在查詢時依然不方便需要調用 Merge 函數，本質因為 AggregateFunction 使用二進制存儲。如果數據以明文存儲是不是就不需要這么麻煩，clickhouse 針對這類場景提供了 SimpleAggregateFunction

drop table if exists tbl_agg_s;
create table if not exists tbl_agg_s
(`user_id` String comment '用戶id',`date`    datetime comment '數據灌入日期時間',`city`    String comment '用戶所在城市',`age`     Int8 comment '用戶年齡',`sex`     Int8 comment '用戶性別',`last_visit_date` SimpleAggregateFunction(anyLast,datetime) comment '用戶最后一次訪問時間',`cost` SimpleAggregateFunction(sum, Int256) comment '用戶總消費',`max_dwell_time` SimpleAggregateFunction(max,Int64) comment '用戶最大停留時間',`min_dwell_time` SimpleAggregateFunction(min,Int64) comment '用戶最小停留時間'
) engine AggregatingMergeTree()order by (user_id, date, city, age, sex);

此時該模型就可以視為完美復刻了 doris 的聚合模型，因為插入和查詢將變得原生化

insert into tbl_agg_s
values (10000, '2017-10-01', '北京', 20, 0, '2017-10-01 06:00:00', 20, 10, 10),(10000, '2017-10-01', '北京', 20, 0, '2017-10-01 07:00:00', 15, 2, 2),(10001, '2017-10-01', '北京', 30, 1, '2017-10-01 17:05:45', 2, 22, 22),(10002, '2017-10-02', '上海', 20, 1, '2017-10-02 12:59:12', 200, 5, 5),(10003, '2017-10-02', '廣州', 32, 0, '2017-10-02 11:20:00', 30, 11, 11),(10004, '2017-10-01', '深圳', 35, 0, '2017-10-01 10:00:15', 100, 3, 3),(10004, '2017-10-03', '深圳', 35, 0, '2017-10-03 10:20:22', 11, 6, 6);select * from tbl_agg_s;