Hive File Formats

Hive supports several file formats. This article covers four of them: TextFile, SequenceFile, RCFile, and ORC.

TextFile

The default format: plain text.

SequenceFile

Introduction

See: http://blog.csdn.net/zengmingen/article/details/52242768

Usage

hive (zmgdb)> create table t2(str string) stored as sequencefile;

OK

Time taken: 0.299 seconds

hive (zmgdb)> desc formatted t2;

OK

..............................

# Storage Information

SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat

OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Compressed: No

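A plain text file cannot be loaded into a SequenceFile table with load data, because load does not convert the file format. For example: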

[root@hello110 data]# vi test_data

3

we

ew

e

re

er51

2

hive (zmgdb)> load data local inpath '/data/test_data' into table t1;

Loading data to table zmgdb.t1

OK

Time taken: 1.498 seconds

hive (zmgdb)> load data local inpath '/data/test_data' into table t2;

FAILED: SemanticException Unable to load data to destination table. Error: The file that you are trying to load does not match the file format of the destination table.

Instead, use a statement of the form INSERT OVERWRITE TABLE test2 SELECT * FROM test1; which launches a MapReduce job to write the data in the target format:

hive (zmgdb)> insert overwrite table t2 select * from t1;

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

Query ID = hadoop_20160914215205_992081a3-1783-4052-8da8-53e6097a2775

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1473855624724_0001, Tracking URL = http://hello110:8088/proxy/application_1473855624724_0001/

Kill Command = /home/hadoop/app/hadoop-2.7.2/bin/hadoop job -kill job_1473855624724_0001

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2016-09-14 21:52:22,073 Stage-1 map = 0%, reduce = 0%

2016-09-14 21:52:43,733 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.9 sec

MapReduce Total cumulative CPU time: 2 seconds 900 msec

Ended Job = job_1473855624724_0001

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to directory hdfs://hello110:9000/user/hive/warehouse/zmgdb.db/t2/.hive-staging_hive_2016-09-14_21-52-05_274_2207100662758769951-1/-ext-10000

Loading data to table zmgdb.t2

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Cumulative CPU: 2.9 sec HDFS Read: 3844 HDFS Write: 1534 SUCCESS

Total MapReduce CPU Time Spent: 2 seconds 900 msec

OK

t1.str

Time taken: 43.709 seconds

hive (zmgdb)> select * from t2;

OK

t2.str

1

2

2

43

4

dds

ads

fdsdsf

fds

ad
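
The steps above can be condensed into the short sequence below. It is a sketch rather than the exact session: the transcript never shows how t1 was created, so the CREATE TABLE for t1 (a single string column stored as TextFile) is an assumption.

```sql
-- Assumed definition of the staging table; the transcript does not show it.
CREATE TABLE t1 (str STRING) STORED AS TEXTFILE;

-- LOAD DATA copies the local file into the table directory without
-- converting it, so it only works when the file already matches the format.
LOAD DATA LOCAL INPATH '/data/test_data' INTO TABLE t1;

-- Target table in SequenceFile format.
CREATE TABLE t2 (str STRING) STORED AS SEQUENCEFILE;

-- Rewrite the rows through a job so they are serialized as SequenceFile.
INSERT OVERWRITE TABLE t2 SELECT * FROM t1;
```

The same staging pattern applies to any table whose format differs from that of the source file.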

Inspecting the raw SequenceFile in HDFS

Under the hood, a SequenceFile is stored in a binary format, so the raw file is not readable as plain text.
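
One way to look at the stored file without leaving the Hive CLI is its dfs command, which forwards to the Hadoop file system shell. The directory below is derived from the job log above; the file name 000000_0 is an assumption, so list the directory first to find the real name.

```sql
-- List the files that back table t2 (directory taken from the job log).
dfs -ls /user/hive/warehouse/zmgdb.db/t2;

-- Decode the SequenceFile records to text instead of dumping raw bytes.
-- The file name 000000_0 is assumed; substitute the name printed by -ls.
dfs -text /user/hive/warehouse/zmgdb.db/t2/000000_0;
```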

RCFile

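A storage layout that combines row-oriented and column-oriented storage. First, it partitions the data horizontally into row groups, so that a whole record always lives in a single block and reading one record never requires reading multiple blocks. Second, within each block the data is stored column by column, which helps compression and speeds up access to individual columns.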

hive (zmgdb)> create table rc_t1(id string) stored as rcfile;
OK
Time taken: 0.334 seconds

hive (zmgdb)> desc formatted rc_t1;
OK
col_name                data_type               comment
# col_name              data_type               comment

id                      string

# Detailed Table Information
Database:               zmgdb
Owner:                  hadoop
CreateTime:             Fri Sep 23 19:21:15 CST 2016
LastAccessTime:         UNKNOWN
Retention:              0
Location:               hdfs://hello110:9000/user/hive/warehouse/zmgdb.db/rc_t1
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                0
        numRows                 0
        rawDataSize             0
        totalSize               0
        transient_lastDdlTime   1474629675

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.RCFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.RCFileOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.135 seconds, Fetched: 30 row(s)
hive (zmgdb)> insert overwrite table rc_t1 select * from t2;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20160923192210_96320492-f8bf-483a-83c4-b9874fd05ef4
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1474629517907_0001, Tracking URL = http://hello110:8088/proxy/application_1474629517907_0001/
Kill Command = /home/hadoop/app/hadoop-2.7.2/bin/hadoop job -kill job_1474629517907_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-23 19:22:22,091 Stage-1 map = 0%, reduce = 0%
2016-09-23 19:22:28,446 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.83 sec
MapReduce Total cumulative CPU time: 1 seconds 830 msec
Ended Job = job_1474629517907_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://hello110:9000/user/hive/warehouse/zmgdb.db/rc_t1/.hive-staging_hive_2016-09-23_19-22-10_649_8279187505632970863-1/-ext-10000
Loading data to table zmgdb.rc_t1
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.83 sec HDFS Read: 4755 HDFS Write: 876 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 830 msec
OK
t2.id
Time taken: 19.126 seconds

hive (zmgdb)> select * from rc_t1;
OK
rc_t1.id
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

ORC

An optimized successor to RCFile, with built-in compression and indexes.
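
The original gives no ORC transcript, so the following is only a minimal sketch of how an ORC table could be created and filled, mirroring the RCFile steps above. The table name orc_t1 is hypothetical, and the explicit "orc.compress" property is shown just to make the built-in compression visible (ZLIB is one of the accepted values, alongside NONE and SNAPPY).

```sql
-- Hypothetical table name; the column mirrors t2 from the earlier examples.
CREATE TABLE orc_t1 (str STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");

-- As with SequenceFile and RCFile, populate the table by rewriting rows
-- from an existing table rather than with LOAD DATA on a text file.
INSERT OVERWRITE TABLE orc_t1 SELECT * FROM t2;

-- Check the Storage Information section for the ORC SerDe and formats.
DESC FORMATTED orc_t1;
```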

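Storage Summary

textfile: relatively high storage consumption; compressed text cannot be split or merged; lowest query efficiency; data can be stored directly, so loading is the fastest.

sequencefile: high storage consumption; compressed files can be split and merged; good query efficiency; data has to be converted from a text file in order to be loaded.

rcfile: smallest storage footprint; highest query efficiency; data has to be converted from a text file in order to be loaded; loading is the slowest.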
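
A rough way to check these trade-offs on your own data is to write the same rows into one table per format and compare the totalSize parameter reported by desc formatted (the same field that appears in the rc_t1 output earlier). This is a sketch under assumptions: the cmp_* table names are hypothetical, t1 is assumed to be the text table used earlier, and totalSize is only filled in when Hive gathers statistics on insert. On a data set as small as the one in this article the differences will not be meaningful.

```sql
-- Write the same rows once per format (hypothetical table names).
CREATE TABLE cmp_text STORED AS TEXTFILE     AS SELECT * FROM t1;
CREATE TABLE cmp_seq  STORED AS SEQUENCEFILE AS SELECT * FROM t1;
CREATE TABLE cmp_rc   STORED AS RCFILE       AS SELECT * FROM t1;

-- Compare the totalSize value under "Table Parameters" for each table.
DESC FORMATTED cmp_text;
DESC FORMATTED cmp_seq;
DESC FORMATTED cmp_rc;
```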
