Introduction
History:
Hadoop Archives (HAR files) were introduced in Hadoop 0.18.0.
Purpose:
Pack many small HDFS files into a single archive file, much like zip/rar on Windows or tar on Linux: many files bundled into one.
Why it matters:
HAR was introduced to relieve the NameNode memory pressure caused by storing metadata for large numbers of small files.
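To see why this matters, here is a rough back-of-the-envelope estimate in Python. It assumes the commonly cited rule of thumb of about 150 bytes of NameNode heap per metadata object (file, directory, or block); the real figure varies by version and object type, so treat this as an order-of-magnitude sketch only.

```python
# Rough NameNode heap estimate. Assumes ~150 bytes per metadata object
# (file or block) -- a common rule of thumb, not an exact figure.
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    # Each file costs one file object plus one object per block.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# A million small files (one block each) vs. the same data packed into a
# single archive occupying, say, 8 large blocks:
small_files = namenode_heap_bytes(1_000_000)
archived = namenode_heap_bytes(1, blocks_per_file=8)
print(f"1,000,000 small files: ~{small_files / 1024**2:.0f} MB of heap")
print(f"one archive file:      ~{archived} bytes of heap")
```

Even under these crude assumptions, packing a million small files into one archive shrinks the NameNode-side metadata from hundreds of megabytes to a handful of objects (the HAR's index and part files still exist as HDFS files, so the real saving is slightly smaller).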
How it works:
A HAR file works by layering a hierarchical filesystem on top of HDFS.
A HAR file is created with Hadoop's archive command, which actually runs a MapReduce job to pack the small files into the archive.
For the client, using a HAR file is transparent; on the HDFS side, however, the number of files it must track is reduced.
Read efficiency is not high:
Reading a file through a HAR archive is no more efficient than reading it directly from HDFS, and in practice may be slightly slower, because every access to a file inside a HAR must first read the two-level index files (the archive's _masterindex and _index) before reading the file's data itself.
Although HAR files can be used as MapReduce job input, there is no special mechanism that lets maps treat the files packed inside a HAR as individual HDFS files.
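The two-level lookup described above can be sketched in Python. This is a simplified illustrative model, not the real on-disk format: an actual HAR directory holds _masterindex, _index, and part-* files, with _masterindex mapping filename-hash ranges to regions of _index. The sketch only shows why each read pays for two index lookups before touching the data.

```python
# Simplified model of HAR's two-level index lookup (illustrative only;
# the real _masterindex/_index on-disk format differs).

# _index: archived path -> (part file, offset, length)
index = {
    "/tmp/a.txt": ("part-0", 0, 120),
    "/tmp/b.txt": ("part-0", 120, 64),
}

def bucket_of(path, buckets=2):
    # Deterministic stand-in for the filename hash that HAR uses.
    return sum(path.encode()) % buckets

# _masterindex: hash bucket -> the slice of _index covering that bucket
master_index = {}
for p in index:
    master_index.setdefault(bucket_of(p), []).append(p)

def har_open(path):
    bucket = bucket_of(path)                        # read 1: probe _masterindex
    for candidate in master_index.get(bucket, []):  # read 2: scan _index slice
        if candidate == path:
            return index[candidate]                 # read 3: the data itself
    raise FileNotFoundError(path)

print(har_open("/tmp/a.txt"))  # ('part-0', 0, 120)
```

The extra hop through both index files is the overhead the text refers to: a plain HDFS read resolves the file in one metadata lookup, while a HAR read does two before it can seek into the part file.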
Creating an archive:
hadoop archive -archiveName xxx.har -p /src /dest
General syntax: archive -archiveName <NAME>.har -p <parent path> [-r <replication factor>] <src>* <dest>
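The general syntax can be wrapped in a small helper (a hypothetical convenience function, not part of Hadoop) that assembles the argument list. It only builds the command; running it still requires a Hadoop cluster.

```python
# Hypothetical helper that assembles the `hadoop archive` argument list
# from the general syntax above. It does not execute anything.
def har_create_cmd(name, parent, srcs, dest, replication=None):
    cmd = ["hadoop", "archive", "-archiveName", f"{name}.har", "-p", parent]
    if replication is not None:
        cmd += ["-r", str(replication)]  # optional [-r <replication factor>]
    return cmd + list(srcs) + [dest]

# Reproduces the command used in the worked example below
# (empty <src>* means: archive everything under the parent path):
print(" ".join(har_create_cmd("temp", "/tmp", [], "/")))
# hadoop archive -archiveName temp.har -p /tmp /
```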
Listing an archive:
hadoop fs -ls -R har:///<path>/xxx.har
Worked example:
Note: only files that are already in HDFS can be archived; passing a non-HDFS path causes an error.
1. hdfs dfs -ls /
drwx------   - hadoop supergroup          0 2016-04-14 22:19 /tmp
drwxr-xr-x   - hadoop supergroup          0 2016-04-14 22:19 /wc
2. hadoop archive -archiveName temp.har -p /tmp /
This launches a MapReduce job:
16/08/13 00:41:16 INFO client.RMProxy: Connecting to ResourceManager at hello110/192.168.255.130:8032
16/08/13 00:41:18 INFO client.RMProxy: Connecting to ResourceManager at hello110/192.168.255.130:8032
16/08/13 00:41:18 INFO client.RMProxy: Connecting to ResourceManager at hello110/192.168.255.130:8032
16/08/13 00:41:18 INFO mapreduce.JobSubmitter: number of splits:1
16/08/13 00:41:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1471019987033_0001
16/08/13 00:41:19 INFO impl.YarnClientImpl: Submitted application application_1471019987033_0001
16/08/13 00:41:19 INFO mapreduce.Job: The url to track the job: http://hello110:8088/proxy/application_1471019987033_0001/
16/08/13 00:41:19 INFO mapreduce.Job: Running job: job_1471019987033_0001
16/08/13 00:41:35 INFO mapreduce.Job: Job job_1471019987033_0001 running in uber mode : false
16/08/13 00:41:35 INFO mapreduce.Job:  map 0% reduce 0%
16/08/13 00:41:57 INFO mapreduce.Job:  map 100% reduce 0%
16/08/13 00:42:21 INFO mapreduce.Job:  map 100% reduce 100%
16/08/13 00:42:23 INFO mapreduce.Job: Job job_1471019987033_0001 completed successfully
3. hdfs dfs -ls /
drwxr-xr-x   - hadoop supergroup          0 2016-08-13 00:42 /temp.har  (newly created)
drwx------   - hadoop supergroup          0 2016-04-14 22:19 /tmp
drwxr-xr-x   - hadoop supergroup          0 2016-04-14 22:19 /wc
4. hadoop fs -ls -R har:///temp.har
drwxr-xr-x   - hadoop supergroup          0 2016-04-14 22:19 har:///temp.har/hadoop-yarn
drwxr-xr-x   - hadoop supergroup          0 2016-04-14 22:19 har:///temp.har/hadoop-yarn/staging
drwxr-xr-x   - hadoop supergroup          0 2016-04-14 22:19 har:///temp.har/hadoop-yarn/staging/hadoop
drwxr-xr-x   - hadoop supergroup          0 2016-08-13 00:41 har:///temp.har/hadoop-yarn/staging/hadoop/.staging
drwxr-xr-x   - hadoop supergroup          0 2016-08-13 00:41 har:///temp.har/hadoop-yarn/staging/hadoop/.staging/har_dj36hy
-rw-r--r--   1 hadoop supergroup       1593 2016-08-13 00:41 har:///temp.har/hadoop-yarn/staging/hadoop/.staging/har_dj36hy/_har_src_files
drwxr-xr-x   - hadoop supergroup          0 2016-04-14 22:19 har:///temp.har/hadoop-yarn/staging/history
drwxr-xr-x   - hadoop supergroup          0 2016-04-14 22:19 har:///temp.har/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x   - hadoop supergroup          0 2016-04-14 22:20 har:///temp.har/hadoop-yarn/staging/history/done_intermediate/hadoop
-rw-r--r--   1 hadoop supergroup      33303 2016-04-14 22:20 har:///temp.har/hadoop-yarn/staging/history/done_intermediate/hadoop/job_1460643564332_0001-1460643581404-hadoop-wcount.jar-1460643608082-1-1-SUCCEEDED-default-1460643592087.jhist
-rw-r--r--   1 hadoop supergroup        349 2016-04-14 22:20 har:///temp.har/hadoop-yarn/staging/history/done_intermediate/hadoop/job_1460643564332_0001.summary
-rw-r--r--   1 hadoop supergroup     115449 2016-04-14 22:20 har:///temp.har/hadoop-yarn/staging/history/done_intermediate/hadoop/job_1460643564332_0001_conf.xml
5. hdfs dfs -cat har:///temp.har/hadoop-yarn/staging/history/done_intermediate/hadoop/job_1460643564332_0001_conf.xml
<property><name>mapreduce.tasktracker.instrumentation</name><value>org.apache.hadoop.mapred.TaskTrackerMetricsInst</value><source>mapred-default.xml</source><source>job.xml</source></property>
<property><name>io.seqfile.sorter.recordlimit</name><value>1000000</value><source>core-default.xml</source><source>job.xml</source></property>
<property><name>yarn.sharedcache.webapp.address</name><value>0.0.0.0:8788</value><source>yarn-default.xml</source><source>job.xml</source></property>
<property><name>yarn.app.mapreduce.am.resource.mb</name><value>1536</value><source>mapred-default.xml</source><source>job.xml</source></property>
<property><name>mapreduce.framework.name</name><value>yarn</value><source>mapred-site.xml</source><source>job.xml</source></property>
<property><name>mapreduce.job.reduce.slowstart.completedmaps</name><value>0.05</value><source>mapred-default.xml</source><source>job.xml</source></property>
... (output truncated; far too long to show in full)