Hive Logical Optimizers

Optimizer

PointLookupOptimizer

Purpose: converts qualifying OR expressions into IN.
The parameter hive.optimize.point.lookup controls whether PointLookupOptimizer is enabled; the default is true.
The parameter hive.optimize.point.lookup.min controls how many OR branches are required before the conversion kicks in; the default is 31.
For example, in the SQL below the OR has 3 children: web_site_sk = 1, web_site_sk = 2, and web_site_sk = 3. Every child must be an equality (=) comparison, and all of them must test the same column, web_site_sk. Then web_site_sk = 1 or web_site_sk = 2 or web_site_sk = 3 can be converted to web_site_sk in (1, 2, 3).

set hive.optimize.point.lookup.min=2;
select web_county,count(1) cnt 
from web_site 
where web_site_sk = 1 or web_site_sk = 2 or web_site_sk =3 
group by web_county;

Writing 2 = web_site_sk instead of web_site_sk = 2 also works.

  • Multiple columns combined with OR, as below, can also be converted.
set hive.optimize.point.lookup.min=2;
explain select web_county,count(1) cnt 
from web_site 
where  (web_site_sk = 1 and web_open_date_sk=1) or 
(web_site_sk = 2 and web_open_date_sk=2) or
(web_site_sk = 3 and web_open_date_sk=3)
group by web_county;

The generated execution plan:

Explain
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Stage-0
  Fetch Operator
    limit:-1
Stage-1
  Reducer 2
  File Output Operator [FS_7]
    Group By Operator [GBY_5] (rows=1 width=109)
      Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
    <-Map 1 [SIMPLE_EDGE]
      SHUFFLE [RS_4]
        PartitionCols:_col0
        Group By Operator [GBY_3] (rows=1 width=109)
          Output:["_col0","_col1"],aggregations:["count(1)"],keys:web_county
          Select Operator [SEL_2] (rows=1 width=117)
            Output:["web_county"]
            Filter Operator [FIL_8] (rows=1 width=117)
              predicate:(struct(web_open_date_sk,web_site_sk)) IN (const struct(1L,1L), const struct(2L,2L), const struct(3L,3L))
              TableScan [TS_0] (rows=32 width=117)
                tpcds_hdfs_orc_3@web_site,web_site,Tbl:COMPLETE,Col:COMPLETE,Output:["web_site_sk","web_open_date_sk","web_county"]
  • struct fields cannot be converted to IN
drop table if exists complex_table;
create table complex_table(c_int int,c_struct struct<name:string,age:int>
)
row format delimited
FIELDS TERMINATED BY '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;
insert into complex_table values(1, named_struct("name","Alice", "age",18));
select distinct c_int from complex_table where 
(c_struct.name="Alice" and  c_struct.age=18) or 
(c_struct.name="Alice" and  c_struct.age=19) or 
(c_struct.name="Alice" and  c_struct.age=20);

PartitionColumnsSeparator

Purpose: pulls partition columns out of an IN predicate. For example, given STRUCT(T1.a, T1.b, T2.b, T2.c) IN (STRUCT(1, 2, 3, 4), STRUCT(2, 3, 4, 5)), where T1.a, T1.b, and T2.c are partition columns and T2.b is not, processing adds the redundant extra predicates STRUCT(T1.a, T1.b) IN (STRUCT(1, 2), STRUCT(2, 3))
AND
STRUCT(T2.c) IN (STRUCT(4), STRUCT(5))
These extra predicates are used for partition filtering. Once partition filtering is complete, the partition conditions are removed from the execution plan.

For example, the following statement lists only 3 partitions, because ws_sold_date_sk is a partition column.

explain extended select distinct ws_item_sk from web_sales 
where (ws_sold_date_sk=2452642 and ws_sold_time_sk=1) or (ws_sold_date_sk=2452641 and ws_sold_time_sk=2) or (ws_sold_date_sk=2452640 and ws_sold_time_sk=3);

PredicateTransitivePropagate (no SQL found that triggers it)

The following statement reaches this code path, but the execution plan is unchanged:

select sum(ws_net_paid) from web_sales join(select * from web_site where web_site_sk < 10) web_site where ws_sold_date_sk=2452640 and web_site_sk = 1;

ConstantPropagate

Traverses the DAG in order from root to children and, for each conditional expression, does the following:

  1. Fold the expression if it is a UDF and all of its arguments are constants.
  2. Remove expressions in a Filter that are always true.
explain select count(1) from web_site where 1=1 and web_site_sk<10;

You can see that 1=1 is always true, so it is removed.

Filter Operator [FIL_8] (rows=10 width=8)
  predicate:(web_site_sk < 10L)
  3. Expressions in a Filter can be short-circuited on a constant (false short-circuits AND; true short-circuits OR).
explain select count(1) from web_site where 1=1 or web_site_sk<10;

You can see that 1=1 is always true, so the whole OR is true and none of the expressions need to be evaluated.

  4. A null condition is equivalent to false.

The following two SQLs both return 0:

select count(1) from web_site where null;
select count(1) from web_site where false;
  5. Expression propagation.
    For example, in the statement below web_site_sk = 5 can be propagated upward and combined with web_site_sk < 10.
explain select * from (select * from web_site where web_site_sk < 10) t where web_site_sk=5;

The final Filter:

Filter Operator [FIL_6]
  predicate:((web_site_sk < 10L) and (web_site_sk = 5L))

SyntheticJoinPredicate

explain logical 
select sum(ws_net_paid) net_paid 
from  web_sales 
where ws_web_site_sk in(select web_site_sk from web_site where web_site_sk < 10 ) 
and ws_sold_date_sk=2452642;

The output is as follows.
After the TableScan of web_site, the Filter gains (web_site_sk) IN (RS[7]):

predicate: ((web_site_sk < 10L) and (web_site_sk) IN (RS[7])) (type: boolean)

After web_sales, the following Filter is added:

alias: web_sales
Filter Operator (FIL_20)
  predicate: (ws_web_site_sk is not null and (ws_web_site_sk) IN (RS[9])) (type: boolean)
explain logical 
select sum(ws_net_paid) net_paid 
from  web_sales join web_site on ws_web_site_sk=web_site_sk
where ws_sold_date_sk=2452642;

The generated execution plan:

Explain
LOGICAL PLAN:
web_sales TableScan (TS_0)
  alias: web_sales
  Filter Operator (FIL_16)
    predicate: (ws_web_site_sk is not null and (ws_web_site_sk) IN (RS[5])) (type: boolean)
    // ...
web_site TableScan (TS_1)
  alias: web_site
  Filter Operator (FIL_17)
    predicate: (web_site_sk is not null and (web_site_sk) IN (RS[3])) (type: boolean)
    // ...

SortedDynPartitionOptimizer

Dynamic partition sort optimization. When enabled, records are sorted by the partition (or bucket) columns before being inserted, guaranteeing that each reducer has only one writer, which reduces reducer memory pressure.

set hive.stats.autogather=false;
create table profile(c1 string) stored as textfile;
load data local inpath '/etc/profile' overwrite into table profile;
create table p_profile(c1 string) partitioned by (len int);
  • Without the sort optimization
set hive.optimize.sort.dynamic.partition=false;
explain logical insert overwrite table  p_profile select c1,length(c1) from profile;

Output:

LOGICAL PLAN:
profile TableScan (TS_0)
  alias: profile
  Select Operator (SEL_1)
    expressions: c1 (type: string), length(c1) (type: int)
    outputColumnNames: _col0, _col1
    File Output Operator (FS_2)
      compressed: false
      table:
        input format: org.apache.hadoop.mapred.TextInputFormat
        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
        name: test.p_profile
  • With the sort optimization
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.optimize.sort.dynamic.partition=true;
explain logical insert overwrite table  p_profile select c1,length(c1) from profile;

A sort step is added:

Explain
LOGICAL PLAN:
profile TableScan (TS_0)
  alias: profile
  Select Operator (SEL_1)
    expressions: c1 (type: string), length(c1) (type: int)
    outputColumnNames: _col0, _col1
    Reduce Output Operator (RS_3)
      key expressions: _col1 (type: int)
      sort order: +
      Map-reduce partition columns: _col1 (type: int)
      value expressions: _col0 (type: string)
      Select Operator (SEL_4)
        expressions: VALUE._col0 (type: string), KEY._col1 (type: int)
        outputColumnNames: _col0, _col1
        File Output Operator (FS_2)
          compressed: false
          Dp Sort State: PARTITION_SORTED
          table:
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            name: test.p_profile

SortedDynPartitionTimeGranularityOptimizer

Applies only to a FileSink, and only when the FileSink's output format is org.apache.hadoop.hive.druid.io.DruidOutputFormat.

PartitionPruner (ppr for short) & PartitionConditionRemover (PCR for short)

select sum(ws_net_paid) sum_ws_net_paid
from web_sales 
where ws_sold_date_sk >= 2452640 and ws_sold_date_sk <=2452642 and ws_net_paid>1.0 ;

PartitionPruner puts the predicate information from the Filter that follows the TableScan into opToPartList:

HashMap<TableScanOperator, PrunedPartitionList> opToPartList;

PartitionConditionRemover takes the entry for the TableScan (ts) from opToPartList, runs PartitionPruner.prune to obtain partsList, and calls opToPartList.put(ts, partsList).

If the Filter contains only conditions on partition columns, the Filter after the TableScan is removed entirely; otherwise only the partition-column predicates are removed from the Filter.
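
To see both cases, a sketch using the same table (hedged: the exact plan text varies by Hive version):

-- Only partition-column conditions: the Filter after the TableScan can be
-- removed entirely; partition pruning alone selects the partitions.
explain select sum(ws_net_paid) from web_sales
where ws_sold_date_sk = 2452642;

-- Mixed conditions: only the ws_sold_date_sk predicate is removed from
-- the Filter; the ws_net_paid predicate remains.
explain select sum(ws_net_paid) from web_sales
where ws_sold_date_sk = 2452642 and ws_net_paid > 1.0;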

ListBucketingPruner

drop table if exists list_bucket_test;
CREATE TABLE list_bucket_test (key int, value int) partitioned by (dt string)
SKEWED BY (key) ON (1,2) STORED AS DIRECTORIES;
insert overwrite table list_bucket_test partition(dt=20250519) values(1,1),(1,2),(2,3),(2,4),(3,5);

Under the list_bucket_test table directory there are 3 subdirectories: key=1, key=2, and HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME.

drwxr-xr-x  4 houzhizhen  staff  128  5 19 16:32 HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME
drwxr-xr-x  4 houzhizhen  staff  128  5 19 16:32 key=1
drwxr-xr-x  4 houzhizhen  staff  128  5 19 16:32 key=2
set hive.optimize.listbucketing=true;
select sum(value) from list_bucket_test where dt=20250519 and key=1;

ListBucketingPruner puts the TableScanOperator -> (Partition -> "GenericUDFOPEqual(Column[key], Const int 1)") mapping into the following field of ParseContext:

Map<TableScanOperator, Map<String, ExprNodeDesc>> opToPartToSkewedPruner;

The value is dt=20250519 -> "GenericUDFOPEqual(Column[key], Const int 1)".

Then, in GenMapRedUtils, the paths each Partition needs to process are generated from the opToPartToSkewedPruner information.

GroupByOptimizer

Group by optimization: if the group keys include all of the bucketing keys and sort keys, in the same order, the group by can be performed entirely in the map phase.


drop table if exists test ;
CREATE TABLE test (bkey int,skey int, value int) partitioned by(pkey int)
clustered by(bkey) sorted by (bkey,skey asc) into 8 buckets;
insert overwrite table test partition(pkey=5) values(1,1,1),(1,2,3),(2,3,3),(2,3,4);
-- small data volume; prevent conversion to a fetch task
set hive.fetch.task.conversion=none;
explain select bkey,skey, sum(value) from test where pkey=5 group by bkey,skey ;

The output is below. The group by completes on the map side with no reduce stage, which may lower parallelism.

Explain
Stage-0
  Fetch Operator
    limit:-1
Stage-1
  Map 1
  File Output Operator [FS_7]
    Group By Operator [GBY_3] (rows=4 width=16)
      Output:["_col0","_col1","_col2"],aggregations:["sum(value)"],keys:bkey, skey
      Select Operator [SEL_2] (rows=4 width=12)
        Output:["bkey","skey","value"]
        TableScan [TS_0] (rows=4 width=12)
          tpcds_hdfs_orc_3@test,test,Tbl:COMPLETE,Col:COMPLETE,Output:["bkey","skey","value"]

ColumnPruner

A post-order traversal of all Operators that keeps only the columns used by the child operators.

explain logical 
select sum(ws_net_paid) net_paid 
from  web_sales 
where ws_sold_date_sk=2452642;

Without the optimization, the TableScan would emit all columns, and so would the Select after it. After optimization, they carry only the columns the child operators need.

CountDistinctRewriteProc

This optimization only takes effect for the tez engine.

The count(distinct) may reference only one column, and there can be only one count(distinct) in the query.

Example where it applies:

explain  
select count(distinct ws_web_site_sk)
from  web_sales;

The generated execution plan:

Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)
Stage-0
  Fetch Operator
    limit:-1
Stage-1
  Reducer 3 vectorized
  File Output Operator [FS_19]
    Group By Operator [GBY_18] (rows=1 width=8)
      Output:["_col0"],aggregations:["count(VALUE._col0)"]
    <-Reducer 2 [CUSTOM_SIMPLE_EDGE] vectorized
      PARTITION_ONLY_SHUFFLE [RS_17]
        Group By Operator [GBY_16] (rows=1 width=8)
          Output:["_col0"],aggregations:["count(_col0)"]
          Group By Operator [GBY_15] (rows=360000188 width=934)
            Output:["_col0"],keys:KEY._col0
          <-Map 1 [SIMPLE_EDGE] vectorized
            SHUFFLE [RS_14]
              PartitionCols:_col0
              Group By Operator [GBY_13] (rows=720000376 width=934)
                Output:["_col0"],keys:ws_web_site_sk
                Select Operator [SEL_12] (rows=720000376 width=934)
                  Output:["ws_web_site_sk"]
                  TableScan [TS_0] (rows=720000376 width=934)
                    tpcds_bos_parquet_1000@web_sales,web_sales,Tbl:COMPLETE,Col:NONE,Output:["ws_web_site_sk"]

How Count Distinct executes:
For the mr engine:

  1. Multiple map tasks each run distinct ws_web_site_sk and output the deduplicated ws_web_site_sk values.
  2. A single reducer then pulls the data from all map tasks and performs the final count(distinct ws_web_site_sk). If the distinct ws_web_site_sk sets are large, it may OOM.

For the tez engine (an equivalent manual rewrite follows this list):

  1. Multiple map tasks run distinct ws_web_site_sk and distribute the data to multiple reducers using ws_web_site_sk as the partition key.
  2. Multiple reducers each perform count(distinct ws_web_site_sk) and output the counts.
  3. One reducer adds up the counts from step 2.
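
The tez shape is equivalent to the classic manual two-stage rewrite sketched below (for intuition only; the optimizer rewrites the plan internally, not your SQL):

-- Stage 1: deduplicate ws_web_site_sk across many reducers (group by).
-- Stage 2: count the deduplicated rows.
select count(1)
from (select ws_web_site_sk
      from web_sales
      group by ws_web_site_sk) t;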

Example 1 where it does not apply:

explain 
select count(distinct ws_web_site_sk),
count(distinct ws_sold_time_sk)
from  web_sales;
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)
Stage-0
  Fetch Operator
    limit:-1
Stage-1
  Reducer 3 vectorized
  File Output Operator [FS_21]
    Group By Operator [GBY_20] (rows=1 width=16)
      Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(VALUE._col1)"]
    <-Reducer 2 [CUSTOM_SIMPLE_EDGE] vectorized
      PARTITION_ONLY_SHUFFLE [RS_19]
        Group By Operator [GBY_18] (rows=1 width=16)
          Output:["_col0","_col1"],aggregations:["count(_col0)","count(_col1)"]
          Select Operator [SEL_17] (rows=720000376 width=934)
            Output:["_col0","_col1"]
            Group By Operator [GBY_16] (rows=720000376 width=934)
              Output:["_col0","_col1","_col2"],keys:KEY._col0, KEY._col1, KEY._col2
            <-Map 1 [SIMPLE_EDGE] vectorized
              SHUFFLE [RS_15]
                PartitionCols:_col0, _col1, _col2
                Group By Operator [GBY_14] (rows=1440000752 width=934)
                  Output:["_col0","_col1","_col2"],keys:_col0, _col1, 0L
                  Select Operator [SEL_13] (rows=720000376 width=934)
                    Output:["_col0","_col1"]
                    TableScan [TS_0] (rows=720000376 width=934)
                      tpcds_bos_parquet_1000@web_sales,web_sales,Tbl:COMPLETE,Col:NONE,Output:["ws_web_site_sk","ws_sold_time_sk"]

Example 2 where it does not apply:

explain 
select count(distinct ws_web_site_sk, ws_sold_time_sk)
from  web_sales;

The generated execution plan:

Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Stage-0
  Fetch Operator
    limit:-1
Stage-1
  Reducer 2
  File Output Operator [FS_6]
    Group By Operator [GBY_4] (rows=1 width=16)
      Output:["_col0"],aggregations:["count(DISTINCT KEY._col0:0._col0, KEY._col0:0._col1)"]
    <-Map 1 [SIMPLE_EDGE]
      SHUFFLE [RS_3]
        Group By Operator [GBY_2] (rows=720000376 width=934)
          Output:["_col0","_col1","_col2"],aggregations:["count(DISTINCT ws_web_site_sk, ws_sold_time_sk)"],keys:ws_web_site_sk, ws_sold_time_sk
          Select Operator [SEL_1] (rows=720000376 width=934)
            Output:["ws_sold_time_sk","ws_web_site_sk"]
            TableScan [TS_0] (rows=720000376 width=934)
              tpcds_bos_parquet_1000@web_sales,web_sales,Tbl:COMPLETE,Col:NONE,Output:["ws_sold_time_sk","ws_web_site_sk"]

SkewJoinOptimizer

Only effective for the MR engine.

drop table if exists skew_web_sales;
CREATE TABLE `skew_web_sales`(`ws_sold_time_sk` bigint, `ws_ship_date_sk` bigint, `ws_item_sk` bigint, `ws_bill_customer_sk` bigint, `ws_bill_cdemo_sk` bigint, `ws_bill_hdemo_sk` bigint, `ws_bill_addr_sk` bigint, `ws_ship_customer_sk` bigint, `ws_ship_cdemo_sk` bigint, `ws_ship_hdemo_sk` bigint, `ws_ship_addr_sk` bigint, `ws_web_page_sk` bigint, `ws_web_site_sk` bigint, `ws_ship_mode_sk` bigint, `ws_warehouse_sk` bigint, `ws_promo_sk` bigint, `ws_order_number` bigint, `ws_quantity` int, `ws_wholesale_cost` decimal(7,2), `ws_list_price` decimal(7,2), `ws_sales_price` decimal(7,2), `ws_ext_discount_amt` decimal(7,2), `ws_ext_sales_price` decimal(7,2), `ws_ext_wholesale_cost` decimal(7,2), `ws_ext_list_price` decimal(7,2), `ws_ext_tax` decimal(7,2), `ws_coupon_amt` decimal(7,2), `ws_ext_ship_cost` decimal(7,2), `ws_net_paid` decimal(7,2), `ws_net_paid_inc_tax` decimal(7,2), `ws_net_paid_inc_ship` decimal(7,2), `ws_net_paid_inc_ship_tax` decimal(7,2), `ws_net_profit` decimal(7,2),`ws_sold_date_sk` bigint) 
skewed by (ws_web_site_sk) on (1);
insert overwrite table skew_web_sales select * from web_sales where ws_web_site_sk=1;
insert into table skew_web_sales select * from web_sales where ws_sold_date_sk=2452642 and ws_web_site_sk=3 limit 1;
set hive.optimize.skewjoin.compiletime=true;
set hive.execution.engine=mr;
explain  
select ws_web_site_sk,web_class,sum(ws_net_paid) 
from  web_site join  skew_web_sales
on skew_web_sales.ws_web_site_sk=web_site.web_site_sk 
group by ws_web_site_sk,web_class;

The generated execution plan:

Explain
LOGICAL PLAN:
skew_web_sales TableScan (TS_1)
  alias: skew_web_sales
  Statistics: Num rows: 126827 Data size: 15217112 Basic stats: COMPLETE Column stats: COMPLETE
  Filter Operator (FIL_24)
    predicate: (ws_web_site_sk is not null and (ws_web_site_sk = 1L)) (type: boolean)
    Statistics: Num rows: 63414 Data size: 7608672 Basic stats: COMPLETE Column stats: COMPLETE
    Reduce Output Operator (RS_5)
      key expressions: ws_web_site_sk (type: bigint)
      sort order: +
      Map-reduce partition columns: ws_web_site_sk (type: bigint)
      Statistics: Num rows: 63414 Data size: 7608672 Basic stats: COMPLETE Column stats: COMPLETE
      value expressions: ws_net_paid (type: decimal(7,2))
      Join Operator (JOIN_6)
        condition map:
          Inner Join 0 to 1
        keys:
          0 web_site_sk (type: bigint)
          1 ws_web_site_sk (type: bigint)
        outputColumnNames: _col7, _col42, _col58
        Statistics: Num rows: 63414 Data size: 13379346 Basic stats: COMPLETE Column stats: COMPLETE
        Union (UNION_27)
          Statistics: Num rows: 94130 Data size: 19859414 Basic stats: COMPLETE Column stats: COMPLETE
          Group By Operator (GBY_8)
            aggregations: sum(_col58)
            keys: _col42 (type: bigint), _col7 (type: varchar(50))
            mode: hash
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 2 Data size: 422 Basic stats: COMPLETE Column stats: COMPLETE
            Reduce Output Operator (RS_9)
              key expressions: _col0 (type: bigint), _col1 (type: varchar(50))
              sort order: ++
              Map-reduce partition columns: _col0 (type: bigint), _col1 (type: varchar(50))
              Statistics: Num rows: 2 Data size: 422 Basic stats: COMPLETE Column stats: COMPLETE
              value expressions: _col2 (type: decimal(17,2))
              Group By Operator (GBY_10)
                aggregations: sum(VALUE._col0)
                keys: KEY._col0 (type: bigint), KEY._col1 (type: varchar(50))
                mode: mergepartial
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 2 Data size: 422 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator (FS_12)
                  compressed: false
                  Statistics: Num rows: 2 Data size: 422 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
subquery1:skew_web_sales TableScan (TS_18)
  alias: skew_web_sales
  Statistics: Num rows: 126827 Data size: 15217112 Basic stats: COMPLETE Column stats: COMPLETE
  Filter Operator (FIL_26)
    predicate: (ws_web_site_sk is not null and (not (ws_web_site_sk = 1L))) (type: boolean)
    Statistics: Num rows: 63413 Data size: 7608552 Basic stats: COMPLETE Column stats: COMPLETE
    Reduce Output Operator (RS_20)
      key expressions: ws_web_site_sk (type: bigint)
      sort order: +
      Map-reduce partition columns: ws_web_site_sk (type: bigint)
      Statistics: Num rows: 63413 Data size: 7608552 Basic stats: COMPLETE Column stats: COMPLETE
      value expressions: ws_net_paid (type: decimal(7,2))
      Join Operator (JOIN_21)
        condition map:
          Inner Join 0 to 1
        keys:
          0 web_site_sk (type: bigint)
          1 ws_web_site_sk (type: bigint)
        outputColumnNames: _col7, _col42, _col58
        Statistics: Num rows: 30716 Data size: 6480068 Basic stats: COMPLETE Column stats: COMPLETE
        Union (UNION_27)
          Statistics: Num rows: 94130 Data size: 19859414 Basic stats: COMPLETE Column stats: COMPLETE
subquery1:web_site TableScan (TS_15)
  alias: web_site
  Statistics: Num rows: 32 Data size: 3168 Basic stats: COMPLETE Column stats: COMPLETE
  Filter Operator (FIL_25)
    predicate: (web_site_sk is not null and (not (ws_web_site_sk = 1L))) (type: boolean)
    Statistics: Num rows: 1 Data size: 99 Basic stats: COMPLETE Column stats: COMPLETE
    Reduce Output Operator (RS_17)
      key expressions: web_site_sk (type: bigint)
      sort order: +
      Map-reduce partition columns: web_site_sk (type: bigint)
      Statistics: Num rows: 1 Data size: 99 Basic stats: COMPLETE Column stats: COMPLETE
      value expressions: web_class (type: varchar(50))
      Join Operator (JOIN_21)
        condition map:
          Inner Join 0 to 1
        keys:
          0 web_site_sk (type: bigint)
          1 ws_web_site_sk (type: bigint)
        outputColumnNames: _col7, _col42, _col58
        Statistics: Num rows: 30716 Data size: 6480068 Basic stats: COMPLETE Column stats: COMPLETE
web_site TableScan (TS_0)
  alias: web_site
  Statistics: Num rows: 32 Data size: 3168 Basic stats: COMPLETE Column stats: COMPLETE
  Filter Operator (FIL_23)
    predicate: (web_site_sk is not null and (ws_web_site_sk = 1L)) (type: boolean)
    Statistics: Num rows: 32 Data size: 3168 Basic stats: COMPLETE Column stats: COMPLETE
    Reduce Output Operator (RS_3)
      key expressions: web_site_sk (type: bigint)
      sort order: +
      Map-reduce partition columns: web_site_sk (type: bigint)
      Statistics: Num rows: 32 Data size: 3168 Basic stats: COMPLETE Column stats: COMPLETE
      value expressions: web_class (type: varchar(50))
      Join Operator (JOIN_6)
        condition map:
          Inner Join 0 to 1
        keys:
          0 web_site_sk (type: bigint)
          1 ws_web_site_sk (type: bigint)
        outputColumnNames: _col7, _col42, _col58
        Statistics: Num rows: 63414 Data size: 13379346 Basic stats: COMPLETE Column stats: COMPLETE

SamplePruner

Example:

drop table if exists t1 ;
create table t1(c1 string) stored as textfile;
load data local inpath '/etc/profile' overwrite into table t1;
drop table if exists t2;
create external table t2(c1 string) CLUSTERED BY (c1) into 8 buckets;
insert overwrite table t2 select * from t1;
set hive.execution.engine=mr;
set hive.fetch.task.conversion=none;
explain select * from t2 TABLESAMPLE(BUCKET 3 OUT OF 8 ON c1) s;

Conditions for SamplePruner to take effect:

  1. t2 is an external table.
  2. The CLUSTERED BY column matches the ON column of TABLESAMPLE in the query; ON rand() does not qualify.
  3. The number in "into 8 buckets" in the table definition must match the "OUT OF 8" number in the query.
  4. The 3 in (BUCKET 3 OUT OF 8 ON c1) means the 3rd file is read.

MapJoinProcessor

The user requests a map join via a hint. This approach is no longer used; joins are now converted to map joins automatically based on statistics.

set hive.ignore.mapjoin.hint=false;
explain select /*+mapjoin(web_site) */ sum(ws_net_paid) from web_sales join web_site on web_site_sk=ws_web_site_sk 
where ws_sold_date_sk=2452640 ;

BucketMapJoinOptimizer

Conditions for use (a sketch follows the list):

  1. The mr engine.
  2. A mapjoin hint in the SQL that takes effect.
  3. hive.optimize.bucketmapjoin set to true (the default is false).
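
A minimal sketch of how the settings line up, assuming two hypothetical tables bucketed_a and bucketed_b bucketed on the join key with compatible bucket counts:

set hive.execution.engine=mr;
set hive.ignore.mapjoin.hint=false;
set hive.optimize.bucketmapjoin=true;
-- bucketed_a and bucketed_b are assumed CLUSTERED BY (id) into the same
-- number of buckets (or a multiple of each other).
explain select /*+ mapjoin(b) */ a.id
from bucketed_a a join bucketed_b b on a.id = b.id;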

BucketMapJoinOptimizer, SortedMergeBucketMapJoinOptimizer

Both take effect only on the mr engine and are disabled by default.
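
For the sort-merge variant there is one more flag; a sketch under the same hypothetical tables, which now also need to be SORTED BY the join key:

set hive.execution.engine=mr;
set hive.ignore.mapjoin.hint=false;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
-- bucketed_a and bucketed_b are assumed CLUSTERED BY (id) SORTED BY (id)
-- into the same number of buckets.
explain select /*+ mapjoin(b) */ a.id
from bucketed_a a join bucketed_b b on a.id = b.id;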

BucketingSortingReduceSinkOptimizer

For insert overwrite table T2 select * from T1:
if T1 and T2 have the same bucket and sort keys and the same number of buckets, no reducer task is needed.
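
A sketch of the table shapes this requires (the table names are illustrative):

drop table if exists t_src;
drop table if exists t_dst;
create table t_src(k int, v int) clustered by (k) sorted by (k) into 4 buckets;
create table t_dst(k int, v int) clustered by (k) sorted by (k) into 4 buckets;
-- Matching bucket/sort keys and bucket counts: the insert below can be
-- planned without a reducer.
explain insert overwrite table t_dst select * from t_src;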

UnionProcessor

If both subqueries of a UNION are map-only, this information is stored in unionDesc/UnionOperator. If one of the subqueries is a map-reduce job, a FileSink (FS) is inserted before the UNION.

An example where both UNION subqueries are map-only follows below.
On the difference between union and union all:

union has a group by and a reduce sink after the Union operator, because it must deduplicate.
union all has neither after the Union operator, because no deduplication is needed.

explain logical select ws_web_site_sk from web_sales where ws_sold_date_sk=2452640
union
select ws_web_site_sk from web_sales where ws_sold_date_sk=2452641;

An example where one of the subqueries is a map-reduce job:

explain logical select ws_web_site_sk from web_sales where ws_sold_date_sk=2452640
union
select distinct ws_web_site_sk from web_sales where ws_sold_date_sk=2452641;

Used in MapReduceCompiler.

BucketVersionPopulator

Different hash algorithms distribute rows across N buckets differently.
Hive uses bucketingVersion to decide which hash algorithm to use.
After each Reduce Sink a different bucketing version can be used, because a full shuffle can redistribute the data under the new hash algorithm.

Hive writes a table's data using the table's bucket version.
When reading from a table, the table's bucketing_version property is taken into account.
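
The version a table carries can be inspected in its properties; a sketch (bucketing_version is the property name written by Hive 3, and table t is hypothetical):

-- Bucketed tables created by Hive 3 typically carry bucketing_version=2
-- (the murmur-based hash); tables from older releases carry version 1.
show tblproperties t;
-- Forcing the legacy hash (illustrative only; changing this on a table
-- that already holds data will mis-route bucket lookups):
alter table t set tblproperties ('bucketing_version'='1');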

ReduceSinkDeDuplication

If two adjacent reduce sinks share the same partition/sort columns, in the same order, they can be merged into a single reduce stage.

For example, the two 'partition by' clauses in the following SQL differ:

explain select 
avg(sum(web_tax_percentage)) over(partition by web_city) avg_monthly_sales,
rank() over(partition by web_county, web_city order by web_county) rn
from web_site
group by web_county, web_city;

The resulting vertex dependencies:

Explain
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
Reducer 4 <- Reducer 3 (SIMPLE_EDGE)

The following SQL uses identical partition by clauses:

explain select 
avg(sum(web_tax_percentage)) over(partition by web_county, web_city) avg_monthly_sales,
rank() over(partition by web_county, web_city order by web_county) rn
from web_site
group by web_county, web_city;

The generated vertices follow; there is one reducer fewer than in the previous SQL, because the partition by clauses are the same:

Explain
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)

NonBlockingOpDeDupProc

Merges two adjacent projection (SEL) operators into one projection, and two adjacent filter (FIL) operators into one filter.
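
A minimal sketch that yields adjacent filters before this pass runs; with the pass on, the two predicates should end up in a single Filter operator (a hedged expectation, not verified output):

explain select web_site_sk
from (select web_site_sk from web_site where web_site_sk > 1) t
where web_site_sk < 10;
-- Expected Filter (approximate):
--   predicate: ((web_site_sk > 1L) and (web_site_sk < 10L))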

IdentityProjectRemover

Removes unnecessary projection (SEL) operators. If a projection merely forwards columns (it performs no computation, unlike, say, select x+1, and does not rename any column), it can be removed.

explain select web_city 
from web_site  
where web_city > '1' 
group by web_city;

The resulting plan has no Select operator:

Explain
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Stage-0
  Fetch Operator
    limit:-1
Stage-1
  Reducer 2
  File Output Operator [FS_7]
    Group By Operator [GBY_5] (rows=1 width=91)
      Output:["_col0"],keys:KEY._col0
    <-Map 1 [SIMPLE_EDGE]
      SHUFFLE [RS_4]
        PartitionCols:_col0
        Group By Operator [GBY_3] (rows=1 width=91)
          Output:["_col0"],keys:web_city
          Filter Operator [FIL_8] (rows=10 width=91)
            predicate:(web_city > '1')
            TableScan [TS_0] (rows=32 width=91)
              tpcds_hdfs_orc_3@web_site,web_site,Tbl:COMPLETE,Col:COMPLETE,Output:["web_city"]

Without a filter:

explain select web_city 
from web_site  
group by web_city;

There is a Select after the TableScan:

Explain
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Stage-0
  Fetch Operator
    limit:-1
Stage-1
  Reducer 2
  File Output Operator [FS_6]
    Group By Operator [GBY_4] (rows=2 width=91)
      Output:["_col0"],keys:KEY._col0
    <-Map 1 [SIMPLE_EDGE]
      SHUFFLE [RS_3]
        PartitionCols:_col0
        Group By Operator [GBY_2] (rows=2 width=91)
          Output:["_col0"],keys:web_city
          Select Operator [SEL_1] (rows=32 width=91)
            Output:["web_city"]
            TableScan [TS_0] (rows=32 width=91)
              tpcds_hdfs_orc_3@web_site,web_site,Tbl:COMPLETE,Col:COMPLETE,Output:["web_city"]

explain select upper(web_city) d_web_city
from web_site  
where web_city > '1' 
group by web_city;

An extra Select appears before the File Output Operator:

Explain
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Stage-0
  Fetch Operator
    limit:-1
Stage-1
  Reducer 2
  File Output Operator [FS_7]
    Select Operator [SEL_6] (rows=1 width=144)
      Output:["_col0"]
      Group By Operator [GBY_5] (rows=1 width=91)
        Output:["_col0"],keys:KEY._col0
      <-Map 1 [SIMPLE_EDGE]
        SHUFFLE [RS_4]
          PartitionCols:_col0
          Group By Operator [GBY_3] (rows=1 width=91)
            Output:["_col0"],keys:web_city
            Filter Operator [FIL_8] (rows=10 width=91)
              predicate:(web_city > '1')
              TableScan [TS_0] (rows=32 width=91)
                tpcds_hdfs_orc_3@web_site,web_site,Tbl:COMPLETE,Col:COMPLETE,Output:["web_city"]

GlobalLimitOptimizer

set hive.limit.optimize.enable=true;
select * from web_sales limit 10;

If web_sales has many partitions, each with many files, a query for 10 rows has no need to open every file; the first file may already hold 10 records, so the input read can be reduced.

CorrelationOptimizer

Only effective for the MR engine. Reference: YSmart: Yet Another SQL-to-MapReduce Translator (Rubao Lee, Tian Luo, Yin Huai, Fusheng Wang, Yongqiang He, and Xiaodong Zhang).
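
A sketch of the kind of intra-query correlation YSmart targets: the aggregation and the join share the same key, so their shuffles can be merged. The switch hive.optimize.correlation (default false) is real; whether the plan actually changes depends on version and statistics:

set hive.execution.engine=mr;
set hive.optimize.correlation=true;
-- The group by key and the join key are both the web site key.
explain
select w.web_site_sk, s.cnt
from web_site w
join (select ws_web_site_sk, count(1) cnt
      from web_sales
      group by ws_web_site_sk) s
on w.web_site_sk = s.ws_web_site_sk;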

LimitPushdownOptimizer

explain extended select sum(web_tax_percentage)
from web_site  
group by web_city limit 1;

For order by, the limit is pushed down to the final reduce sink:

explain select ws_web_site_sk
from web_sales  
order by ws_web_site_sk limit 1;

StatsOptimizer

Some query results can be obtained directly from the statistics.
First generate the table's statistics:

analyze table web_site compute statistics for columns;

Run the SQL:

explain select min(web_site_sk) from web_site;

The plan that answers from statistics has only a Fetch:

Explain
Stage-0
  Fetch Operator
    limit:1
  • Example where the result cannot be obtained from statistics
explain select min(web_site_sk) from web_site where web_site_sk <> 2;

The execution plan for this SQL:

Explain
Vertex dependency in root stage
Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
Stage-0
  Fetch Operator
    limit:-1
Stage-1
  Reducer 2
  File Output Operator [FS_7]
    Group By Operator [GBY_5] (rows=1 width=8)
      Output:["_col0"],aggregations:["min(VALUE._col0)"]
    <-Map 1 [CUSTOM_SIMPLE_EDGE]
      PARTITION_ONLY_SHUFFLE [RS_4]
        Group By Operator [GBY_3] (rows=1 width=8)
          Output:["_col0"],aggregations:["min(web_site_sk)"]
          Filter Operator [FIL_8] (rows=32 width=8)
            predicate:(web_site_sk <> 2L)
            TableScan [TS_0] (rows=32 width=8)
              tpcds_hdfs_orc_3@web_site,web_site,Tbl:COMPLETE,Col:COMPLETE,Output:["web_site_sk"]

AnnotateWithStatistics, AnnotateWithOpTraits

For the mr engine, these run during logical plan optimization (the Optimizer) when explain is issued.
For TEZ, they run during physical plan optimization.
AnnotateWithStatistics attaches stats (statistics) to each Operator.
AnnotateWithOpTraits sets opTraits, which matter for bucketed tables, e.g. a join between two bucketed tables.
OpTraits is defined as follows:

public OpTraits(List<List<String>> bucketColNames, int numBuckets,
    List<List<String>> sortColNames, int numReduceSinks) {
  this.bucketColNames = bucketColNames;
  this.numBuckets = numBuckets;
  this.sortColNames = sortColNames;
  this.numReduceSinks = numReduceSinks;
}
explain
select web_site.web_city, sum(ws_ext_tax) ws_ext_tax_sum
from web_site join web_sales on web_site.web_site_sk = web_sales.ws_web_site_sk
where ws_sold_date_sk=2452642 and ws_web_site_sk =1 
group by web_site.web_city;

SimpleFetchOptimizer

For simple single-table operations (no group by, no distinct, a single table), no distributed job is launched; the table is read and returned directly in the fetch task, which speeds up execution.
Values of hive.fetch.task.conversion:
none: disabled
minimal: SELECT *, filters on partition columns, LIMIT
more: SELECT, FILTER, LIMIT only (supports TABLESAMPLE and virtual columns)

hive.fetch.task.conversion.threshold: caps the amount of data eligible for this optimization; the default is 1 GB.
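
A quick way to see the effect (hedged; the exact plan text varies by version):

set hive.fetch.task.conversion=more;
-- Single table, simple filter and limit: expect only a Fetch Operator,
-- no Map/Reduce stages.
explain select web_city from web_site where web_city > '1' limit 5;

set hive.fetch.task.conversion=none;
-- The same query now plans a distributed stage.
explain select web_city from web_site where web_city > '1' limit 5;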

TablePropertyEnrichmentOptimizer

Disabled by default. Beyond the table's own properties, it can fetch the table's SerDe properties and merge them all into the table properties.
Which SerDes are processed is controlled by hive.optimize.update.table.properties.from.serde.list; by default only AvroSerDe is listed.
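
The relevant switches, as a sketch (the boolean flag hive.optimize.update.table.properties.from.serde pairs with the list property named above; treat the exact names as assumptions about your Hive version):

set hive.optimize.update.table.properties.from.serde=true;
-- Default list; extend it with other SerDe class names, comma-separated.
set hive.optimize.update.table.properties.from.serde.list=org.apache.hadoop.hive.serde2.avro.AvroSerDe;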

HiveOpConverterPostProc

It runs only when all three of the conditions below hold; returnPathEnabled defaults to false, so in practice it does not run.

final boolean cboEnabled = HiveConf.getBoolVar(pctx.getConf(),
    HiveConf.ConfVars.HIVE_CBO_ENABLED);
final boolean returnPathEnabled = HiveConf.getBoolVar(pctx.getConf(),
    HiveConf.ConfVars.HIVE_CBO_RETPATH_HIVEOP);
final boolean cboSucceeded = pctx.getContext().isCboSucceeded();
