Hadoop: Setting up a Single Node Cluster.
- Purpose
- Prerequisites
- Supported Platforms
- Required Software
- Installing Software
- Download
- Prepare to Start the Hadoop Cluster
- Standalone Operation
- Pseudo-Distributed Operation
- Configuration
- Setup passphraseless ssh
- Execution
- YARN on a Single Node
- Fully-Distributed Operation
Purpose
This document describes how to set up and configure a single-node Hadoop installation so that you can quickly perform simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS).
Prerequisites
Supported Platforms
- GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
- Windows is also a supported platform, but the following steps are for Linux only. To set up Hadoop on Windows, see the wiki page.
Required Software
Required software for Linux includes:
- Java™ must be installed. Recommended Java versions are described at HadoopJavaVersions.
- ssh must be installed and sshd must be running in order to use the Hadoop scripts that manage remote Hadoop daemons, if the optional start and stop scripts are to be used. Additionally, it is recommended that pdsh also be installed for better ssh resource management.
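Before continuing, it is worth confirming that these prerequisites are in place. A minimal check (a sketch; the exact output varies by distribution) is:

$ java -version       # confirm Java is installed and on the PATH
$ ssh -V              # confirm the ssh client is installed
$ command -v pdsh     # confirm pdsh is available (optional but recommended)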
Installing Software
If your cluster doesn’t have the requisite software you will need to install it.
For example on Ubuntu Linux:
$ sudo apt-get install ssh
$ sudo apt-get install pdsh
Download
To get a Hadoop distribution, download a recent stable release from one of the Apache Download Mirrors.
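For example, assuming the archive you downloaded is named hadoop-3.0.0-alpha4.tar.gz (the release used in the examples below), unpacking and entering the distribution looks like:

$ tar xzf hadoop-3.0.0-alpha4.tar.gz
$ cd hadoop-3.0.0-alpha4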
Prepare to Start the Hadoop Cluster
Unpack the downloaded Hadoop distribution. In the distribution, edit the file etc/hadoop/hadoop-env.sh to define some parameters as follows:
# set to the root of your Java installation
export JAVA_HOME=/usr/java/latest
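If you are not sure where Java is installed on your system, one way to find a candidate JAVA_HOME (a sketch, assuming java is on your PATH and GNU readlink is available, as on most Linux systems) is:

$ dirname $(dirname $(readlink -f $(which java)))   # resolves symlinks and strips the trailing /bin/java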
Try the following command:
$ bin/hadoop
This will display the usage documentation for the hadoop script.
Now you are ready to start your Hadoop cluster in one of the three supported modes:
- Local (Standalone) Mode
- Pseudo-Distributed Mode
- Fully-Distributed Mode
Standalone Operation
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha4.jar grep input output 'dfs[a-z.]+'
$ cat output/*
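The examples jar bundles several other small jobs as well. As a variation (a sketch reusing the same input; wc-output is an arbitrary directory name, not part of the original walkthrough), the wordcount example counts word occurrences across the copied config files:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha4.jar wordcount input wc-output
$ cat wc-output/*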
Pseudo-Distributed Operation
Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
Configuration
Use the following:
etc/hadoop/core-site.xml:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
etc/hadoop/hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
Setup passphraseless ssh
Now check that you can ssh to the localhost without a passphrase:
$ ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
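If pdsh is installed, the Hadoop start scripts may use it to launch daemons, and some pdsh builds default to rsh rather than ssh. A commonly used workaround (an assumption about your pdsh build; skip it if the start scripts already work) is:

$ export PDSH_RCMD_TYPE=ssh   # tell pdsh to use ssh instead of its default rcmd module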
Execution
The following instructions are to run a MapReduce job locally. If you want to execute a job on YARN, see YARN on a Single Node.
1. Format the filesystem:

   $ bin/hdfs namenode -format

2. Start the NameNode daemon and DataNode daemon (to verify that they came up, see the jps check after this list):

   $ sbin/start-dfs.sh

   The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).

3. Browse the web interface for the NameNode; by default it is available at:

   - NameNode - http://localhost:9870/

4. Make the HDFS directories required to execute MapReduce jobs:

   $ bin/hdfs dfs -mkdir /user
   $ bin/hdfs dfs -mkdir /user/<username>

5. Copy the input files into the distributed filesystem:

   $ bin/hdfs dfs -mkdir input
   $ bin/hdfs dfs -put etc/hadoop/*.xml input

6. Run some of the examples provided:

   $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha4.jar grep input output 'dfs[a-z.]+'

7. Examine the output files: copy the output files from the distributed filesystem to the local filesystem and examine them:

   $ bin/hdfs dfs -get output output
   $ cat output/*

   or view the output files on the distributed filesystem:

   $ bin/hdfs dfs -cat output/*

8. When you're done, stop the daemons with:

   $ sbin/stop-dfs.sh
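To verify that the daemons started in step 2 are actually running, the JDK's jps tool lists the Java processes on the machine (a quick check; process ids will differ, and start-dfs.sh also launches a SecondaryNameNode by default):

$ jps   # expect NameNode, DataNode and SecondaryNameNode among the listed JVMs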
YARN on a Single Node
You can run a MapReduce job on YARN in a pseudo-distributed mode by setting a few parameters and additionally running the ResourceManager daemon and NodeManager daemon.
The following instructions assume that steps 1 through 4 of the above instructions have already been executed.
1. Configure parameters as follows:

   etc/hadoop/mapred-site.xml:

   <configuration>
       <property>
           <name>mapreduce.framework.name</name>
           <value>yarn</value>
       </property>
   </configuration>

   etc/hadoop/yarn-site.xml:

   <configuration>
       <property>
           <name>yarn.nodemanager.aux-services</name>
           <value>mapreduce_shuffle</value>
       </property>
       <property>
           <name>yarn.nodemanager.env-whitelist</name>
           <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
       </property>
   </configuration>

2. Start the ResourceManager daemon and NodeManager daemon:

   $ sbin/start-yarn.sh

3. Browse the web interface for the ResourceManager; by default it is available at:

   - ResourceManager - http://localhost:8088/

4. Run a MapReduce job (see the example after this list).

5. When you're done, stop the daemons with:

   $ sbin/stop-yarn.sh
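For step 4, any of the bundled examples will do. One option (a sketch reusing the examples jar from the standalone section; the pi estimator's map count and sample count are chosen arbitrarily here) is:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha4.jar pi 16 1000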