Hive Data Model: Bucketed Tables

Overview

A bucketed table hashes a column value of each row and stores the rows in different files according to that hash.

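When data is loaded into a bucketed table, Hive hashes the bucketing column and takes the result modulo the number of buckets. Each row is then written to the file that corresponds to that bucket.
Physically, each bucket is one file in the table (or partition) directory, and the number of buckets (output files) a job produces equals its number of reduce tasks.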
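
The hash-then-modulo rule above can be sketched in Python. This is only an illustration under stated assumptions: it assumes a Java `String.hashCode`-style hash over ASCII strings and a sign-bit mask before the modulus. The hash a given Hive installation actually uses depends on the Hive version and the column type, and the function names `java_style_hash` and `bucket_index` are made up for this sketch.

```python
def java_style_hash(text: str) -> int:
    """31-based rolling hash with 32-bit signed wraparound (assumed hash)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as a signed integer.
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_index(value: str, num_buckets: int) -> int:
    """Zero-based bucket index: clear the sign bit, then take the modulus."""
    return (java_style_hash(value) & 0x7FFFFFFF) % num_buckets


if __name__ == "__main__":
    # Six buckets, as in the bucket_t1 table created in the experiment.
    for value in ["1", "2", "6", "51"]:
        print(value, "->", bucket_index(value, 6))
```

Under these assumptions the ids `6` and `51` both land in bucket index 0, that is, the first bucket file.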

Purpose

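Bucketed tables are aimed at sampling queries. They are a specialized tool, not the tables you use for day-to-day storage: create and use one when you need to run sampling queries.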

Experiment

Create the table

[22:39:03]hive (zmgdb)> create table bucket_t1(id string)
[22:39:26]           > clustered by(id) into 6 buckets;
[22:39:27]OK
[22:39:27]Time taken: 0.546 seconds
clustered by: the column to bucket on. Each id is hashed, and the row is placed in one of the 6 buckets according to that hash.

-----------------------------

Prepare the data

[root@hello110 data]# vi bucket_test
1
2
3
4
5
6

.............

.........

Load the data

The correct way: insert from the table that normally holds the data

[21:27:45]hive (zmgdb)> create table t2(id string);
[21:27:45]OK
[21:27:45]Time taken: 0.073 seconds
[21:28:24]hive (zmgdb)> load data local inpath '/data/bucket_test' into table t2;
[21:28:24]Loading data to table zmgdb.t2
[21:28:25]OK

Insert from the regular table

[22:39:47]hive (zmgdb)> insert overwrite table bucket_t1 select id from t2;

Hive launches a MapReduce job:
[22:39:48]WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
[22:39:48]Query ID = hadoop_20160922063946_34bf30c4-3f23-43e9-ad8f-edd5ee214948
[22:39:48]Total jobs = 1
[22:39:48]Launching Job 1 out of 1
[22:39:48]Number of reduce tasks determined at compile time: 6
[22:39:48]In order to change the average load for a reducer (in bytes):
[22:39:48]  set hive.exec.reducers.bytes.per.reducer=<number>
[22:39:48]In order to limit the maximum number of reducers:
[22:39:48]  set hive.exec.reducers.max=<number>
[22:39:48]In order to set a constant number of reducers:
[22:39:48]  set mapreduce.job.reduces=<number>
[22:39:51]Starting Job = job_1474497386931_0001, Tracking URL = http://hello110:8088/proxy/application_1474497386931_0001/
[22:39:51]Kill Command = /home/hadoop/app/hadoop-2.7.2/bin/hadoop job  -kill job_1474497386931_0001
[22:39:59]Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 6
[22:39:59]2016-09-22 06:39:59,419 Stage-1 map = 0%,  reduce = 0%
[22:40:06]2016-09-22 06:40:05,828 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.63 sec
[22:40:12]2016-09-22 06:40:12,347 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 3.48 sec
[22:40:16]2016-09-22 06:40:15,739 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 5.4 sec
[22:40:17]2016-09-22 06:40:16,807 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 7.52 sec
[22:40:19]2016-09-22 06:40:18,929 Stage-1 map = 100%,  reduce = 83%, Cumulative CPU 11.35 sec
[22:40:20]2016-09-22 06:40:19,991 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 13.19 sec
[22:40:21]MapReduce Total cumulative CPU time: 13 seconds 190 msec
[22:40:21]Ended Job = job_1474497386931_0001
[22:40:21]Loading data to table zmgdb.bucket_t1
[22:40:22]MapReduce Jobs Launched: 
[22:40:22]Stage-Stage-1: Map: 1  Reduce: 6   Cumulative CPU: 13.19 sec   HDFS Read: 25355 HDFS Write: 1434 SUCCESS
[22:40:22]Total MapReduce CPU Time Spent: 13 seconds 190 msec
[22:40:22]OK
[22:40:22]id
[22:40:22]Time taken: 34.91 seconds

The wrong way: load data from a file

hive (zmgdb)> create table bucket_t2 like bucket_t1;
OK
Time taken: 0.707 seconds

hive (zmgdb)> load data local inpath '/data/bucket_test' into table bucket_t2;
Loading data to table zmgdb.bucket_t2
OK
Time taken: 1.485 seconds

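No MapReduce job was launched to hash the data. The file was simply copied in as-is, so the table cannot serve its sampling purpose. Comparing select * from on both tables shows that bucket_t1 and bucket_t2 return their rows in different orders: bucket_t2 keeps the order of the original data file because its rows were never hashed.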
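
The difference between the two load paths can be imitated with a small Python simulation. It is a rough sketch, not Hive's implementation: it assumes a Java `String.hashCode`-style hash for ASCII ids, assumes rows are read back bucket file by bucket file, and uses the hypothetical helper names `load_as_is` and `insert_bucketed`.

```python
from collections import defaultdict


def bucket_index(value: str, num_buckets: int) -> int:
    """Assumed rule: 31-based rolling hash, sign bit cleared, modulo bucket count."""
    h = 0
    for ch in value:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return (h & 0x7FFFFFFF) % num_buckets


def load_as_is(rows):
    """LOAD DATA analogue: the file is copied unchanged, so row order is kept."""
    return list(rows)


def insert_bucketed(rows, num_buckets):
    """INSERT ... SELECT analogue: each row goes to the file chosen by its hash."""
    files = defaultdict(list)
    for row in rows:
        files[bucket_index(row, num_buckets)].append(row)
    # One list per bucket file, in bucket order.
    return [files[i] for i in range(num_buckets)]


if __name__ == "__main__":
    rows = ["1", "2", "3", "4", "5", "6"]
    print(load_as_is(rows))
    print(insert_bucketed(rows, 6))
```

With the first six ids of the test file, the simulated bucketed layout starts with `6` rather than `1`, which matches the kind of reordering described above.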

Query

select * from bucket_table tablesample(bucket x out of y on column);
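tablesample is the sampling clause.
Syntax: TABLESAMPLE(BUCKET x OUT OF y ON column)
y must be a multiple or a factor of the table's total number of buckets.
Hive uses y to decide the sampling ratio.
For example, if the table has 64 buckets, y=32 reads 64/32 = 2 buckets of data, and y=128 reads 64/128 = 1/2 of one bucket. x is the bucket to start sampling from.
For example, if the table has 32 buckets, tablesample(bucket 3 out of 16) reads 32/16 = 2 buckets in total: bucket 3 and bucket 3+16 = 19. With y=64, it reads half of bucket 3.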
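
The arithmetic in these examples can be written down as a short Python helper. It only encodes the rule as stated above (y is a factor or a multiple of the bucket count); the function name `sampled_buckets` is hypothetical and the code says nothing about how Hive plans the query internally.

```python
from fractions import Fraction


def sampled_buckets(total_buckets: int, x: int, y: int):
    """Return (bucket_number, fraction_read) pairs for BUCKET x OUT OF y.

    Bucket numbers are 1-based, as in the TABLESAMPLE clause.
    """
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    if total_buckets % y == 0:
        # y is a factor of the bucket count: read whole buckets x, x+y, ...
        return [(b, Fraction(1)) for b in range(x, total_buckets + 1, y)]
    if y % total_buckets == 0:
        # y is a multiple of the bucket count: read a fraction of one bucket.
        bucket = (x - 1) % total_buckets + 1
        return [(bucket, Fraction(total_buckets, y))]
    # The rule stated above does not cover this case.
    raise ValueError("y must be a factor or a multiple of total_buckets")


if __name__ == "__main__":
    print(sampled_buckets(32, 3, 16))   # two whole buckets: 3 and 19
    print(sampled_buckets(32, 3, 64))   # half of bucket 3
    print(sampled_buckets(64, 1, 128))  # half of bucket 1
```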

[22:44:31]hive (zmgdb)> select * from bucket_t1 tablesample (bucket 1 out of 6 on id);
[22:44:31]OK
[22:44:31]bucket_t1.id
[22:44:31]6
[22:44:31]iu
[22:44:31]0
[22:44:31]6
[22:44:31]hj
[22:44:31]6
[22:44:31]6
[22:44:31]51
[22:44:31]
[22:44:31]
[22:44:31]r
[22:44:31]99
[22:44:31]0
[22:44:31]57
[22:44:31]loo
[22:44:31]r
[22:44:31]r
[22:44:31]r
[22:44:31]60
[22:44:31]66
[22:44:31]75
[22:44:31]6
[22:44:31]84
[22:44:31]x
[22:44:31]24
[22:44:31]93
[22:44:31]99
[22:44:31]105
[22:44:31]f
[22:44:31]r
[22:44:31]114
[22:44:31]0
[22:44:31]123
[22:44:31]129
[22:44:31]132
[22:44:31]x
[22:44:31]138
[22:44:31]141
[22:44:31]147
[22:44:31]33
[22:44:31]150
[22:44:31]156
[22:44:31]r
[22:44:31]f
[22:44:31]39
[22:44:31]15
[22:44:31]r
[22:44:31]ddd
[22:44:31]
[22:44:31]06
[22:44:31]hj
[22:44:31]f
[22:44:31]l
[22:44:31]f
[22:44:31]f
[22:44:31]f
[22:44:31]f
[22:44:31]42
[22:44:31]f
[22:44:31]r
[22:44:31]r
[22:44:31]f
[22:44:31]f
[22:44:31]r
[22:44:31]48
[22:44:31]6
[22:44:31]Time taken: 0.142 seconds, Fetched: 66 row(s)

[22:44:43]hive (zmgdb)> select * from bucket_t1 tablesample (bucket 1 out of 60 on id);
[22:44:43]OK
[22:44:43]bucket_t1.id
[22:44:43]
[22:44:43]
[22:44:43]loo
[22:44:43]x
[22:44:43]114
[22:44:43]132
[22:44:43]x
[22:44:43]150
[22:44:43]ddd
[22:44:43]
[22:44:43]Time taken: 0.064 seconds, Fetched: 10 row(s)
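
The two result sets above can be cross-checked against a row-level view of the sampling rule. The sketch below assumes that a row is selected when its hash, with the sign bit cleared, taken modulo y equals x - 1, and that ASCII ids are hashed Java `String.hashCode`-style. Both are assumptions about the Hive version used here, and `in_sample` is a made-up name.

```python
def java_style_hash(text: str) -> int:
    """31-based rolling hash with 32-bit signed wraparound (assumed hash)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def in_sample(value: str, x: int, y: int) -> bool:
    """Assumed row-level rule for TABLESAMPLE(BUCKET x OUT OF y ON id)."""
    return (java_style_hash(value) & 0x7FFFFFFF) % y == x - 1


if __name__ == "__main__":
    # Distinct ids copied from the "bucket 1 out of 60" output above.
    ids_out_of_60 = ["", "loo", "x", "114", "132", "150", "ddd"]
    print(all(in_sample(v, 1, 60) for v in ids_out_of_60))
    # "6" shows up for y=6 but not for y=60.
    print(in_sample("6", 1, 6), in_sample("6", 1, 60))
```

Under these assumptions every id returned for y=60 satisfies the predicate, while an id such as `6` satisfies it only for y=6, which is consistent with the second query returning a subset of the first.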
