Hive: Partitioning and Bucketing Operations

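One of the most common ideas in big data is divide and conquer: split a large file into many small files, so that each operation only has to touch a small one. Hive supports the same idea. Large datasets can be split by day or by hour into smaller files, which are much easier to work with.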

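I. Partitioned table operations

A common partitioning rule in production: partition by day (one partition per day).

1. Syntax for creating a partitioned table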

create table score(s_id string,c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

2. Create a table with multiple partition columns

create table score2 (s_id string,c_id string, s_score int) partitioned by (year string,month string,day string) row format delimited fields terminated by '\t';

3. Load data into a partitioned table

load data local inpath '/export/servers/hivedatas/score.csv' into table score partition (month='201806');

4. Load data into a table with multiple partition columns

load data local inpath '/export/servers/hivedatas/score.csv' into table score2 partition(year='2018',month='06',day='01');

5. Query several partitions together with union all

select * from score where month = '201806' union all select * from score where month = '201805';
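
When the partitions belong to the same table, a single filter on the partition column can also read several partitions at once. A minimal sketch, assuming the score table above and that both partitions exist:

```sql
-- Read two partitions of score with one filter on the partition column.
-- Assumes partitions month='201805' and month='201806' both exist.
select * from score where month in ('201805', '201806');
```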

6. List partitions

show partitions score;

7. Add a partition

alter table score add partition(month='201805');

8. Add several partitions at once

alter table score add partition(month='201804') partition(month = '201803');

Note: after a partition is added, a new directory appears under the table's directory in HDFS.
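
One way to check where a partition lives without leaving Hive is describe formatted, which prints the partition's metadata, including its Location. A sketch, assuming the month='201805' partition added above; the actual path depends on the warehouse directory configured on your cluster:

```sql
-- Print metadata for one partition; the Location row shows its HDFS directory.
describe formatted score partition (month='201805');
```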

9. Drop a partition

alter table score drop partition(month = '201806');

Important:
A partition column must never duplicate a column already defined in the table's column list!
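
As an illustration of this rule, the first statement below repeats a column name in both lists, and Hive is expected to reject it at compile time. The table name score_bad is hypothetical and used only for this sketch:

```sql
-- Expected to fail: month appears both as a regular column and as a partition column.
create table score_bad (s_id string, month string)
partitioned by (month string)
row format delimited fields terminated by '\t';

-- Works: month is declared only in partitioned by,
-- yet it can still be selected and filtered like a regular column.
select s_id, s_score, month from score where month = '201806';
```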

Purpose:
Data is divided by partition, so a query that filters on the partition column does not scan irrelevant data and runs faster.
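
To see the effect, filter on the partition column and compare it with an unfiltered query. A sketch, assuming the score table above; the exact output of explain dependency varies between Hive versions:

```sql
-- Only the month='201806' directory needs to be read.
select s_id, c_id, s_score from score where month = '201806';

-- explain dependency lists the tables and partitions the query reads,
-- which helps confirm that only one partition is touched.
explain dependency
select s_id, c_id, s_score from score where month = '201806';

-- No filter on month: every partition of score is scanned.
select count(*) from score;
```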

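II. Bucketed table operations

Bucketing adds an extra layer of structure on top of an existing table structure.

Data is split into multiple buckets by a specified column. Put simply, rows are divided by the value of that column and written out to multiple files.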

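1. Enable bucketing in Hive (needed on Hive 1.x and earlier; from Hive 2.0 this property was removed and bucketing is always enforced)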

set hive.enforce.bucketing=true;

2. Set the number of reducers

set mapreduce.job.reduces=3;

3. Create a bucketed table

create table course (c_id string,c_name string,t_id string) clustered by(c_id) into 3 buckets row format delimited fields terminated by '\t';

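Loading data into a bucketed table: neither hdfs dfs -put nor load data works well for a bucketed table, so the data has to be loaded with insert overwrite.

Create an ordinary table first, then use insert overwrite with a query to load the ordinary table's data into the bucketed table.

4. Create an ordinary table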

create table course_common (c_id string,c_name string,t_id string) row format delimited fields terminated by '\t';

5. Load data into the ordinary table

load data local inpath '/export/servers/hivedatas/course.csv' into table course_common;

6. Load data into the bucketed table with insert overwrite

insert overwrite table course select * from course_common cluster by(c_id);
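
After the load, one way to spot-check the buckets is tablesample, which reads a single bucket. A sketch, assuming the course table created above with 3 buckets on c_id:

```sql
-- Rows that fall into the first of the 3 buckets.
select * from course tablesample (bucket 1 out of 3 on c_id);

-- Rows that fall into the second bucket.
select * from course tablesample (bucket 2 out of 3 on c_id);
```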

Important:
The bucketing column must be one of the table's columns.

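Bucketing logic:
Hive hashes the bucketing column and takes the hash modulo the number of buckets. The remainder determines which bucket the row goes into.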
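
The rule can be sketched with Hive's built-in hash() function. This is only an illustration of the hash-then-modulo idea: the hash function Hive actually uses for bucketing depends on the Hive version and the table's bucketing version, so the numbers below are not guaranteed to match the real bucket files. The alias bucket_no is made up for this example:

```sql
-- Illustrative only: compute hash modulo 3 for each row of course_common.
-- The bitwise AND with 2147483647 clears the sign bit so the remainder is never negative.
select c_id,
       (hash(c_id) & 2147483647) % 3 as bucket_no
from course_common;
```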
