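【Getting Started with Hive】

Notes I wrote during an internship, uploaded here as a backup.

1. Setting up a Hive cluster quickly with docker-compose

Configure a Hive environment quickly with Docker

Pull the image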

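2. Hive data types

Implicit conversion: a narrower type is widened to a broader one automatically (for example, int to bigint).
Explicit conversion: use cast(value as type).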
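
The two kinds of conversion can be tried directly in a Hive session. This is a minimal sketch; the literals are made up for illustration, and the exact result type of the first query depends on the Hive version.

```sql
-- Implicit: the int operand is widened so both sides share a numeric type.
select 1 + 2.5;

-- Explicit: cast a string to a number before doing arithmetic on it.
select cast('100' as int) + 1;

-- A cast that cannot succeed yields NULL rather than raising an error.
select cast('abc' as int);
```

The NULL result in the last query is worth remembering: a bad cast fails silently, so dirty data tends to surface as unexpected NULLs rather than as a failed job.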

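3. How Hive reads and writes files

SerDe: Serializer (turns a row object into bytes) and Deserializer (turns bytes back into a row object)

3.1 Hive read/write flow

Deserialization (mapping a file to a table)

Hive calls the InputFormat to read the file as <key, value> records, then the SerDe deserializes each record's value into a row.
Writing runs the other way: the SerDe serializes the row, then the OutputFormat writes it to the file.

3.2 SerDe syntax

row format specifies the SerDe and the delimiters
- delimited: the default SerDe, for delimited text
- JSON: switch to a different SerDe with row format serde
Hive's default field delimiter is "\001" (Ctrl-A)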
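
As a sketch of the two row format variants, the statements below create one delimited table and one JSON table. The table names are hypothetical, and the JSON SerDe class shown is the one shipped in hive-hcatalog-core; whether that jar is already on the classpath depends on the installation, so treat the class name as an assumption about your environment.

```sql
-- Delimited text, with the default field delimiter written out explicitly.
create table t_serde_delimited (  -- hypothetical table name
    id int,
    name string
)
row format delimited
fields terminated by '\001';

-- One JSON object per line, parsed by a JSON SerDe instead of the default one.
create table t_serde_json (       -- hypothetical table name
    id int,
    name string
)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as textfile;
```

With the JSON table, the keys of each JSON object are matched to the column names, so no delimiter needs to be declared.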

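4. Storage path

Default storage: the directory set by hive.metastore.warehouse.dir. Hive's stock default is /user/hive/warehouse; the commands in these notes use /usr/hive/warehouse.
Custom storage path: location hdfs_path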

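5. Practice

Create a table and load data into it.

use test; -- the upload below targets test.db, so the table is created in the test database
create external table hero_info_1(
    id bigint comment "ID",
    name string comment "Hero name",
    hp_max bigint comment "Max HP"
) comment "Honor of Kings info"
row format delimited
fields terminated by "\t";

Upload the file to the table's directory. As long as the delimiter declared in the table matches the one used in the file, Hive can read it.

hadoop fs -put test1.txt /usr/hive/warehouse/test.db/hero_info_1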

The map type

create table hero_info_2(
    id int comment "ID",
    name string comment "Hero name",
    win_rate int comment "Win rate",
    skin map<string, int> comment "Skin: price" -- a map column needs the two extra delimiters below
) comment "Hero skin table"
row format delimited
fields terminated by "," -- delimiter between fields
collection items terminated by '-' -- delimiter between map entries
map keys terminated by ':'; -- delimiter between the key and the value of one entry

hadoop fs -put test2.txt /usr/hive/warehouse/test.db/hero_info_2
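
Once the file is in place, the map column can be queried by key or taken apart with the built-in map functions. The sketch below assumes a line of test2.txt looks like the commented sample; the skin names and prices are invented for illustration and are not from the original data.

```sql
-- Assumed sample line in test2.txt (hypothetical data):
--   1,houyi,53,skinA:288-skinB:888
select
    name,
    skin['skinA']    as skin_a_price,  -- NULL when the key is absent
    map_keys(skin)   as skin_names,    -- array of keys
    map_values(skin) as skin_prices,   -- array of values
    size(skin)       as skin_count     -- number of entries in the map
from hero_info_2;
```

If every map column comes back NULL or with a single odd-looking key, the usual cause is a mismatch between the delimiters in the file and the ones declared in the table.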

6. Using a custom path

create table t_hero_info_3(
    id int comment "ID",
    name string comment "Hero name",
    win_rate int comment "Win rate",
    skin map<string, int> comment "Skin: price" -- no row format clause here, so Hive's default delimiters apply
) comment "Hero skin table"
location "/tmp";

select * from t_hero_info_3;

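7. Managed (internal) tables and external tables

External table: dropping the table removes only the metadata; the files on HDFS stay.
Managed table: dropping the table deletes the files as well.
In practice, external tables are used most of the time.

drop table t_hero_info_3; -- t_hero_info_3 is a managed table, so its files are deleted too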
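
To check which kind a table is before dropping it, the table metadata can be inspected. The sketch below uses hero_info_2 from the practice section, which was created without the external keyword; some Hive configurations restrict converting managed tables, so the alter statement may not be permitted on every cluster.

```sql
-- "Table Type" reads MANAGED_TABLE or EXTERNAL_TABLE;
-- "Location" shows the HDFS directory that holds the data.
describe formatted hero_info_2;

-- Turn a managed table into an external one, so that a later
-- drop table leaves its files in place.
alter table hero_info_2 set tblproperties('EXTERNAL'='TRUE');
```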

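9. Partitioned tables

After uploading several files into one table, queries turn out to be slow, because a where clause has to scan the whole table.
The files are already split by hero role (marksman, for example), so if each role becomes a partition, a query only has to scan that one partition.
A partition column must not share its name with a column that already exists in the table.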

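create external table t_hero_info_1(
    id int comment "ID",
    name string comment "Name",
    hp_max bigint comment "Max HP" -- added so the hp_max filter below resolves; assumes the third field of a.txt is the max HP
) comment "Hero info"
partitioned by (role string)
row format delimited
fields terminated by "\t";

Static partitioning

The partition value is written by hand in the load statement.

load data local inpath '/root/a.txt' into table t_hero_info_1 partition(role='sheshou');

-- role is the partition column, so only the role=sheshou partition is scanned, not the whole table
select count(*) from t_hero_info_1 where role = "sheshou" and hp_max > 6000;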
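
A partition is simply a subdirectory named column=value under the table directory, and the metastore keeps a list of the partitions it knows about. The sketch below shows how to inspect and maintain that list; the role value 'fashi' is a hypothetical second partition, not one from the original notes.

```sql
-- List the partitions registered in the metastore.
-- After the load above, the data sits under .../t_hero_info_1/role=sheshou/
show partitions t_hero_info_1;

-- Register an empty partition by hand ('fashi' is a made-up value).
alter table t_hero_info_1 add if not exists partition (role='fashi');

-- If partition directories were created directly on HDFS (for example with
-- hadoop fs -put) instead of through load data, ask Hive to discover them.
msck repair table t_hero_info_1;
```

The last statement matters for external tables in particular: files placed in a new role=... directory are invisible to queries until the partition is registered.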

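10. Multi-level partitioned tables

Two levels of partitioning is the usual case.

-- this reuses the table name from section 9; drop that table first or pick another name
create external table t_hero_info_1(
    id int comment "ID",
    name string comment "Name"
) comment "Hero info"
partitioned by (province string, city string); -- order matters: province is the parent directory, city the child

-- partition 1
load data local inpath '/root/a.txt' into table t_hero_info_1 partition(province='beijing',city='chaoyang');
-- partition 2, same province
load data local inpath '/root/b.txt' into table t_hero_info_1 partition(province='beijing',city='haidian');
-- a different first-level partition
load data local inpath '/root/b.txt' into table t_hero_info_1 partition(province='shanghai',city='pudong');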
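
With two levels, how much data a query reads depends on how many of the partition columns the where clause pins down. A short sketch against the three partitions loaded above:

```sql
-- Expected layout after the loads:
--   .../t_hero_info_1/province=beijing/city=chaoyang/
--   .../t_hero_info_1/province=beijing/city=haidian/
--   .../t_hero_info_1/province=shanghai/city=pudong/

-- List only the partitions under one province.
show partitions t_hero_info_1 partition(province='beijing');

-- Filters on the first level only: reads both beijing city directories.
select count(*) from t_hero_info_1 where province = 'beijing';

-- Filters on both levels: reads a single directory.
select count(*) from t_hero_info_1 where province = 'beijing' and city = 'haidian';
```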

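11. Dynamic partitioning

The partition of each row is decided by a column's value, using insert + select.
Steps: create the partitioned table with a partition column role, then use insert + select to copy the rows of the original table into the partitioned table.

-- source table: t_all_hero
-- partitioned table: t_all_hero_part

-- role is the partition column; role_main is the source column whose value decides each row's partition
insert into table t_all_hero_part partition(role) select tmp.*, tmp.role_main from t_all_hero tmp;

In production, tables are usually partitioned by date.
Note: a partition column cannot reuse the name of an existing column.
A partition column is a virtual column: it is not stored in the underlying data files, and its value comes from the partition directory name.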
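
A date-partitioned load might look like the sketch below. The tables t_order and t_order_part and their columns are hypothetical, and create_time is assumed to be a string of the form 'yyyy-MM-dd HH:mm:ss'. The two set statements matter on installations where dynamic partitioning runs in strict mode, which rejects an insert whose partition columns are all dynamic.

```sql
-- Allow inserts in which every partition column is dynamic.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical target table, partitioned by day.
create table t_order_part (
    order_id bigint,
    amount double
)
partitioned by (dt string);

-- The dynamic partition value must be the last column of the select list.
insert into table t_order_part partition(dt)
select
    order_id,
    amount,
    substr(create_time, 1, 10) as dt  -- 'yyyy-MM-dd' prefix of the timestamp string
from t_order;                          -- hypothetical source table
```

Hive matches dynamic partition columns by position, not by name, which is why the insert in the notes above selects tmp.role_main last.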

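12. Bucketed tables

Bucketing is used to speed up queries.
It splits a table's data into a fixed number of files.

Rule

Each row is hashed on the bucketing column and assigned to a bucket according to that hash.
The primary key is the usual choice of bucketing column.
Create an ordinary table and upload the data to it, then fill the bucketed table with insert + select.

-- create the bucketed table
create table test.t_state_info_bucket(
    state string comment "State" -- column type assumed
    -- remaining columns, matching t_state_info, are not shown in the original notes
)
clustered by(state) into 5 buckets; -- state must be an existing column of the table

-- insert the data
insert into t_state_info_bucket select * from t_state_info;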
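
Once the bucketed table is filled, a single bucket can be read on its own. The sketch below is illustrative; the exact hash function Hive applies to the bucketing column depends on the Hive version, so the modulo rule in the comment is conceptual and not a formula to reproduce by hand.

```sql
-- Conceptually, a row goes to bucket number (hash of state) mod 5,
-- so all rows with the same state end up in the same file.

-- Hive 1.x needs bucketing enforced before the insert above; later
-- releases removed this setting, so it is left commented out here.
-- set hive.enforce.bucketing=true;

-- Read only the first of the five buckets.
select * from t_state_info_bucket tablesample(bucket 1 out of 5 on state);
```

Sampling one bucket is a cheap way to test a query on roughly a fifth of the data before running it on the full table.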

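Benefits

Lookups on the bucketing column can go to the matching bucket instead of filtering the whole table.
When two tables are bucketed on the join key, the join can compare bucket against bucket, which reduces the number of row combinations (the Cartesian product) to examine.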

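Window functions
- A function with an over clause leaves the row count unchanged: every input row produces exactly one output row.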
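
The contrast with group by is easiest to see in a query. The sketch below assumes t_all_hero has name and hp_max columns alongside role_main; only role_main is confirmed by the notes above.

```sql
-- One output row per hero: the window columns are added next to the
-- original columns instead of collapsing rows the way group by would.
select
    name,
    role_main,
    hp_max,
    row_number() over (partition by role_main order by hp_max desc) as rn,          -- rank within the role
    max(hp_max)  over (partition by role_main)                      as role_hp_max  -- per-role maximum
from t_all_hero;
```

Filtering on rn = 1 in an outer query then gives the hero with the highest hp_max in each role.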

Parsing JSON

get_json_object: extracts only one field per call
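
When several fields are needed, calling get_json_object once per field parses the same string repeatedly; the built-in json_tuple extracts several keys in one pass. In the sketch below the JSON literal is made up, and t_hero_raw with its columns id and line is a hypothetical table holding one JSON string per row.

```sql
-- One call per field.
select get_json_object('{"name":"houyi","hp_max":5986}', '$.name');    -- houyi
select get_json_object('{"name":"houyi","hp_max":5986}', '$.hp_max');  -- 5986

-- Several fields in one pass with json_tuple and lateral view.
select t.id, j.name, j.hp_max
from t_hero_raw t  -- hypothetical table: id int, line string
lateral view json_tuple(t.line, 'name', 'hp_max') j as name, hp_max;
```

Both functions return strings, so numeric fields need a cast before arithmetic or comparison.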
