Deploying a Big Data Environment with Docker and YARN
Purpose
This manual guides you through building a complete big data environment quickly using Docker containers. The environment includes the following core components:
- Hadoop HDFS/YARN (distributed storage and resource scheduling)
- Spark on YARN (distributed computing)
- Kafka (message queue)
- Hive (data warehouse)
- JupyterLab (interactive development environment)
Through clear step-by-step instructions and verification methods, you will learn:
- How to set up the container network (Weave)
- How to write the Docker Compose orchestration file
- The key configuration points for making the components work together
- How to scale out and verify the cluster
Overall Architecture
Component Overview
Component | Function | Depends On | Ports | Data Storage |
---|---|---|---|---|
Hadoop NameNode | HDFS metadata management | None | 9870 (Web UI), 8020 | Docker volume: hadoop_namenode |
Hadoop DataNode | HDFS data storage node | NameNode | 9864 (data transfer) | Local or Docker volume |
YARN ResourceManager | Resource scheduling and management | NameNode | 8088 (Web UI), 8032 | None |
YARN NodeManager | Per-node resource management | ResourceManager | 8042 (Web UI) | None |
Spark (YARN mode) | Distributed computing framework | YARN ResourceManager | None | Runs inside YARN |
JupyterLab | Interactive development environment | Spark, YARN | 8888 (Web UI) | Local directory mount |
Kafka | Distributed message queue | ZooKeeper | 9092 (broker) | Docker volumes: kafka_data, kafka_logs |
Hive | Data warehouse service | HDFS, MySQL | 10000 (HiveServer2) | Metadata stored in MySQL |
MySQL | Stores the Hive metastore | None | 3306 | Docker volume: mysql_data |
ZooKeeper | Distributed coordination service (required by Kafka) | None | 2181 | Docker volume: zookeeper_data |
Key Interaction Flows
- Data storage:
  - HDFS manages metadata through the NameNode, while DataNodes store the actual data.
  - JupyterLab accesses data through mounted local directories and can also read and write HDFS.
- Resource scheduling:
  - Spark jobs request resources from the YARN ResourceManager, and tasks are executed by the NodeManagers.
- Data processing:
  - Kafka receives real-time data streams, which Spark consumes for real-time computation (see the sketch below).
  - Hive stores table data in HDFS, with its metadata kept in MySQL.
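To make the Kafka-to-Spark leg of this flow concrete, here is a minimal PySpark Structured Streaming sketch (not part of the deployed stack) that consumes the test-topic created later in this guide. It assumes the spark-sql-kafka connector package matching your Spark and Scala versions has been added, for example via spark-submit --packages.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("kafka-stream-sketch")
         .getOrCreate())

# Subscribe to the topic on the kafka-1 broker defined in docker-compose.yml
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka-1:9092")
          .option("subscribe", "test-topic")
          .load())

# Kafka delivers values as bytes; cast them to strings before processing
messages = stream.select(col("value").cast("string").alias("message"))

# Print each micro-batch to the console (demonstration only)
query = (messages.writeStream
         .outputMode("append")
         .format("console")
         .start())
query.awaitTermination()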
Environment Setup
1. Container Network Preparation (Weave)
# Install the Weave network plugin
sudo curl -L git.io/weave -o /usr/local/bin/weave
sudo chmod +x /usr/local/bin/weave
# Launch the Weave network
weave launch
# Check the network status
weave status
# Run on each additional node, pointing it at the primary node
weave launch <primary-node-IP>
2. Docker Compose Orchestration File
Create docker-compose.yml with the following core configuration:
version: "3.8"services:# ZooKeeperzookeeper-1:image: bitnami/zookeeper:3.8.0privileged: true #使用二進制文件安裝的docker需要開啟特權模式,每個容器都需要開啟該模式container_name: zookeeper-1hostname: zookeeper-1ports:- "2181:2181"environment:- ALLOW_ANONYMOUS_LOGIN=yes- TZ=Asia/Shanghaivolumes:- zookeeper_data:/bitnami/zookeepernetworks:- bigdata-netdns:- 172.17.0.1restart: alwayslogging:driver: "json-file"options:max-size: "100m"max-file: "7"# Kafkakafka-1:image: bitnami/kafka:3.3.1container_name: kafka-1hostname: kafka-1environment:- KAFKA_BROKER_ID=1- KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper-1:2181- ALLOW_PLAINTEXT_LISTENER=yes- TZ=Asia/Shanghaiports:- "9092:9092"volumes:- kafka_data:/bitnami/kafka # Kafka數據持久化- kafka_logs:/kafka-logs # 獨立日志目錄depends_on:- zookeeper-1networks:- bigdata-netdns:- 172.17.0.1restart: alwayslogging:driver: "json-file"options:max-size: "100m"max-file: "7"# Hadoop HDFShadoop-namenode:image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8container_name: hadoop-namenodehostname: hadoop-namenodeenvironment:- CLUSTER_NAME=bigdata- CORE_CONF_fs_defaultFS=hdfs://hadoop-namenode:8020- HDFS_CONF_dfs_replication=2- TZ=Asia/Shanghaiports:- "9870:9870"- "8020:8020"networks:- bigdata-netdns:- 172.17.0.1volumes:- hadoop_namenode:/hadoop/dfs/namerestart: alwayshadoop-datanode:image: bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8container_name: hadoop-datanodehostname: hadoop-datanodeenvironment:- CORE_CONF_fs_defaultFS=hdfs://hadoop-namenode:8020- HDFS_CONF_dfs_replication=2- TZ=Asia/Shanghaidepends_on:- hadoop-namenodenetworks:- bigdata-netdns:- 172.17.0.1restart: alwayslogging:driver: "json-file"options:max-size: "100m"max-file: "7"# YARNhadoop-resourcemanager:image: bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8container_name: hadoop-resourcemanagerhostname: hadoop-resourcemanagerports:- "8088:8088" # YARN Web UIenvironment:- CORE_CONF_fs_defaultFS=hdfs://hadoop-namenode:8020- YARN_CONF_yarn_resourcemanager_hostname=hadoop-resourcemanager- TZ=Asia/Shanghaidepends_on:- hadoop-namenodenetworks:- bigdata-netdns:- 172.17.0.1restart: alwayslogging:driver: "json-file"options:max-size: "100m"max-file: "7"hadoop-nodemanager:image: bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8container_name: hadoop-nodemanagerhostname: hadoop-nodemanagerenvironment:- CORE_CONF_fs_defaultFS=hdfs://hadoop-namenode:8020- YARN_CONF_yarn_resourcemanager_hostname=hadoop-resourcemanager- TZ=Asia/Shanghaidepends_on:- hadoop-resourcemanagernetworks:- bigdata-netdns:- 172.17.0.1volumes:- ./hadoop-conf/yarn-site.xml:/etc/hadoop/yarn-site.xml # 掛載主節點的Hadoop配置文件,用于上報內存與cpu核心數restart: alwayslogging:driver: "json-file"options:max-size: "100m"max-file: "7"# Hivehive:image: bde2020/hive:2.3.2container_name: hivehostname: hiveenvironment:- HIVE_METASTORE_URI=thrift://hive:9083- SERVICE_PRECONDITION=hadoop-namenode:8020,mysql:3306- TZ=Asia/Shanghaiports:- "10000:10000"- "9083:9083"depends_on:- hadoop-namenode- mysqlnetworks:- bigdata-netdns:- 172.17.0.1restart: alwayslogging:driver: "json-file"options:max-size: "100m"max-file: "7"# MySQLmysql:image: mysql:8.0container_name: mysqlenvironment:- MYSQL_ROOT_PASSWORD=root- MYSQL_DATABASE=metastore- TZ=Asia/Shanghaiports:- "3306:3306"networks:- bigdata-netdns:- 172.17.0.1volumes:- mysql_data:/var/lib/mysqlrestart: alwayslogging:driver: "json-file"options:max-size: "100m"max-file: "7"# JupyterLab(集成Spark on YARN)jupyter:image: jupyter/all-spark-notebook:latestcontainer_name: jupyter-labenvironment:- JUPYTER_ENABLE_LAB=yes- TZ=Asia/Shanghai- SPARK_OPTS="--master yarn --deploy-mode client" # 
默認使用YARN模式- HADOOP_CONF_DIR=/etc/hadoop/conf # 必須定義- YARN_CONF_DIR=/etc/hadoop/conf # 必須定義ports:- "8888:8888"volumes:- ./notebooks:/home/jovyan/work- /path/to/local/data:/data- ./hadoop-conf:/etc/hadoop/conf # 掛載Hadoop配置文件,./hadoop-conf代表在docker-compose.yml同目錄下的hadoop-confnetworks:- bigdata-netdns:- 172.17.0.1depends_on:- hadoop-resourcemanager- hadoop-namenoderestart: alwayslogging:driver: "json-file"options:max-size: "100m"max-file: "7"volumes:hadoop_namenode:mysql_data:zookeeper_data:kafka_data:kafka_logs:hadoop-nodemanager:networks:bigdata-net:external: truename: weave
Hadoop Configuration Files
yarn-site.xml:
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop-resourcemanager</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>hadoop-resourcemanager:8032</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-namenode:8020</value>
  </property>
</configuration>
Place both files in the hadoop-conf directory.
3. Start the Services
# Start the containers
docker-compose up -d
# Check container status
docker-compose ps
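As an optional sanity check after the containers come up, a short Python script run on the host can poll the main Web UIs exposed in the Compose file. This is a sketch, not part of the original steps; it assumes the requests package is available on the host.

import requests

ENDPOINTS = {
    "HDFS NameNode": "http://localhost:9870",
    "YARN ResourceManager": "http://localhost:8088",
    "JupyterLab": "http://localhost:8888",
}

for name, url in ENDPOINTS.items():
    try:
        # Any HTTP response (even 302/403) means the service is at least listening
        status = requests.get(url, timeout=5).status_code
        print(f"{name:22s} {url:25s} HTTP {status}")
    except requests.exceptions.RequestException as exc:
        print(f"{name:22s} {url:25s} unreachable: {exc}")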
4. Verify the Services
Verify HDFS
(1) Access the HDFS Web UI
- Action: open http://localhost:9870 in a browser.
- Expected result:
  - The Overview page shows the total HDFS capacity.
  - Datanodes shows at least one active node (corresponding to the hadoop-datanode container).
(2) Operate HDFS from the Command Line
docker exec -it hadoop-namenode bash
# Create a test directory
hdfs dfs -mkdir /test
# Upload a local file
echo "hello hdfs" > test.txt
hdfs dfs -put test.txt /test/
# List the files
hdfs dfs -ls /test
# Leave safe mode if necessary
hdfs dfsadmin -safemode leave
- Expected result: the directory is created, the file is uploaded, and the listing shows it.
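The same check can also be scripted over WebHDFS. The sketch below assumes the hdfs PyPI package (HdfsCLI) is installed; note that file reads are redirected to the DataNode's hostname, so it is easiest to run from a container attached to the same Weave network (or with the container hostnames resolvable from where the script runs).

from hdfs import InsecureClient

# Talk to the NameNode's WebHDFS endpoint on port 9870
client = InsecureClient("http://hadoop-namenode:9870", user="root")
print(client.list("/test"))              # expected: ['test.txt']
with client.read("/test/test.txt") as reader:
    print(reader.read().decode())        # expected: hello hdfs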
Verify YARN
(1) Access the YARN ResourceManager Web UI
- Action: open http://localhost:8088 in a browser.
- Expected result:
  - Cluster Metrics shows the total resources (memory, CPU, and so on).
  - Nodes shows at least one NodeManager (corresponding to the hadoop-nodemanager container).
(2) Submit a Test Job to YARN
# Enter the Jupyter container to submit a Spark job
docker exec -it jupyter-lab bash
# Submit the Spark Pi example job
spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi /usr/local/spark/examples/jars/spark-examples_*.jar 10
- Expected result:
  - The job output contains Pi is roughly 3.14.
  - The YARN Web UI (http://localhost:8088) shows the job with status SUCCEEDED.
If you see the error: Permission denied: user=jovyan, access=WRITE, inode="/user":root:supergroup:drwxr-xr-x
Cause:
- The user running Spark is jovyan (the default Jupyter user).
- After the job is submitted, Spark automatically tries to create the directory /user/jovyan on HDFS.
- That directory does not exist, or the /user directory does not allow jovyan to write.
- HDFS therefore refuses to create the staging directory and the whole job submission fails.
Solution
Create the directory and grant permissions.
Enter the NameNode container:
docker exec -it hadoop-namenode bash
Then run the HDFS commands:
hdfs dfs -mkdir -p /user/jovyan
hdfs dfs -chown jovyan:supergroup /user/jovyan
This gives the jovyan user write access to its own staging directory.
Tip: run hdfs dfs -ls /user first to check whether a jovyan subdirectory already exists.
Finally, run spark-submit again. You should see:
- The console prints:
  Submitting application application_xxx to ResourceManager
- On the YARN Web UI (port 8088):
  - a record for the job appears;
  - its state is RUNNING or FINISHED.
Verify Spark on YARN (via JupyterLab)
(1) Access JupyterLab
- Action: open http://localhost:8888 in a browser and log in with the token (obtain it via docker logs jupyter-lab).
- Expected result: the JupyterLab interface loads successfully.
(2) Run PySpark Code
Create a new notebook in Jupyter and run the following code:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("jupyter-yarn-test") \
    .getOrCreate()

# Test the Spark context
print("Spark Version:", spark.version)
print("YARN Cluster Mode:", spark.sparkContext.master)

# Read a file from HDFS
df = spark.read.text("hdfs://hadoop-namenode:8020/test/test.txt")
df.show()

# Read locally mounted data
local_df = spark.read.csv("/data/example.csv", header=True)  # replace with the actual file path
local_df.show()
- Expected result:
  - The Spark version and the YARN master (e.g. yarn) are printed.
  - The HDFS file is read successfully and hello hdfs is displayed.
  - The CSV file is read successfully (place a test file there in advance).
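As an optional follow-up (a sketch, not part of the original steps), you can also confirm that the YARN executors can write back to HDFS by saving the DataFrame as Parquet and re-reading it; the output path below is a hypothetical example.

# Write the locally read DataFrame back to HDFS and read it again
out_path = "hdfs://hadoop-namenode:8020/test/example_parquet"
local_df.write.mode("overwrite").parquet(out_path)
spark.read.parquet(out_path).show()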
Verify Hive
(1) Create a Hive Table and Query It
# Copy the JDBC driver into the container with docker cp, for example:
docker cp mysql-connector-java-8.0.12.jar <container ID or name>:/opt/hive/lib
docker exec -it hive bash
# Re-initialize the Hive Metastore
schematool -dbType mysql -initSchema --verbose
# Check that the Metastore process is running
ps -ef | grep MetaStore
# Start the Hive Beeline client
beeline -u jdbc:hive2://localhost:10000 -n root
# Driver download: https://downloads.mysql.com/archives/c-j/
If the commands above fail, apply the following changes.
1. Configure hive-site.xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.0.78:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
  </property>
  <!-- Metastore -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
  </property>
</configuration>
Copy this file into the Hive container's /opt/hive/conf directory with docker cp.
2. Configure core-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-namenode:8020</value>
  </property>
</configuration>
The hadoop-namenode container's IP can be obtained by running weave ps on the host. Once the files are edited, copy them with docker cp into the Hive container's /opt/hadoop-2.7.4/etc/hadoop and /opt/hive/conf directories.
Start the Metastore
# Run inside the Hive container
hive --service metastore &
Start HiveServer2
# Run inside the Hive container (first take hadoop-namenode out of safe mode)
hive --service hiveserver2 --hiveconf hive.root.logger=DEBUG,console
Run the HQL:
CREATE TABLE test_hive (id INT, name STRING);
INSERT INTO test_hive VALUES (1, 'hive-test');
SELECT * FROM test_hive;
- Expected result: the query returns 1, hive-test.
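The same query can also be run from Python instead of Beeline. This is a sketch assuming the pyhive and thrift packages are installed wherever the script runs (they are not part of the deployed stack) and that port 10000 is reachable.

from pyhive import hive

# Connect to HiveServer2 the same way Beeline does
conn = hive.Connection(host="localhost", port=10000, username="root")
cursor = conn.cursor()
cursor.execute("SELECT * FROM test_hive")
print(cursor.fetchall())   # expected: [(1, 'hive-test')]
conn.close()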
(2) Verify the MySQL Metadata
docker exec -it mysql mysql -uroot -proot
use metastore_db;
SELECT TBL_NAME FROM TBLS;
- Expected result: the table name test_hive is listed.
Verify Kafka
(1) Produce and Consume Messages
docker exec -it kafka-1 bash
# Create a topic
kafka-topics.sh --create --topic test-topic --bootstrap-server localhost:9092
# Produce a message
echo "hello kafka" | kafka-console-producer.sh --topic test-topic --bootstrap-server localhost:9092
# Consume messages (open another terminal)
kafka-console-consumer.sh --topic test-topic --from-beginning --bootstrap-server localhost:9092
- Expected result: the consumer terminal prints hello kafka.
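The same round trip can be driven from Python. The sketch below assumes the kafka-python package is installed and that the broker's advertised listener is reachable from wherever the script runs (from the host this depends on how the Bitnami image advertises port 9092).

from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test-topic", b"hello kafka from python")
producer.flush()

consumer = KafkaConsumer(
    "test-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",     # read from the beginning, like --from-beginning
    consumer_timeout_ms=10000,        # stop iterating after 10 s without new messages
)
for record in consumer:
    print(record.value.decode())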
Verify the Local Data Mounts
In JupyterLab:
- In the file browser on the left, check /home/jovyan/work (mapped to the local ./notebooks directory).
- Check that the /data directory contains the locally mounted files (for example, the contents of /path/to/local/data); a programmatic check is sketched below.
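The mounts can also be checked from a notebook cell. This is a small sketch; /data/example.csv is a placeholder for whichever test file you mounted.

import os
import pandas as pd

print(os.listdir("/home/jovyan/work"))           # should match the local ./notebooks directory
print(os.listdir("/data"))                       # should list the files from /path/to/local/data
print(pd.read_csv("/data/example.csv").head())   # hypothetical test file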
Worker Node Setup
version: "3.8"services:# HDFS DataNode 服務hadoop-datanode:image: bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8privileged: true #使用二進制文件安裝的docker需要開啟特權模式,每個容器都需要開啟該模式container_name: hadoop-datanode-2 # 子節點容器名稱唯一(例如按編號命名)hostname: hadoop-datanode-2environment:- CORE_CONF_fs_defaultFS=hdfs://hadoop-namenode:8020 # 指向主節點NameNode- HDFS_CONF_dfs_replication=2- TZ=Asia/Shanghainetworks:- bigdata-netdns:- 172.17.0.1volumes:- ./hadoop-conf:/etc/hadoop/conf # 掛載主節點的Hadoop配置文件restart: always
# extra_hosts:
# - "hadoop-namenode:10.32.0.32"logging:driver: "json-file"options:max-size: "100m"max-file: "5"# YARN NodeManager 服務hadoop-nodemanager:image: bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8privileged: true #使用二進制文件安裝的docker需要開啟特權模式,每個容器都需要開啟該模式container_name: hadoop-nodemanager-2 # 子節點容器名稱唯一hostname: hadoop-nodemanager-2environment:- YARN_CONF_yarn_resourcemanager_hostname=hadoop-resourcemanager # 指向主節點ResourceManager- CORE_CONF_fs_defaultFS=hdfs://hadoop-namenode:8020- TZ=Asia/Shanghainetworks:- bigdata-netdns:- 172.17.0.1volumes:- ./hadoop-conf/yarn-site.xml:/etc/hadoop/yarn-site.xml # 掛載主節點的Hadoop配置文件,用于上報內存與cpu核心數depends_on:- hadoop-datanode # 確保DataNode先啟動(可選)restart: alwayslogging:driver: "json-file"options:max-size: "100m"max-file: "5"# 共享網絡配置(必須與主節點一致)
networks:bigdata-net:external: truename: weave # 使用主節點創建的Weave網絡
yarn-site.xml for the worker node:
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop-resourcemanager</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>hadoop-resourcemanager:8032</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
</configuration>
PS: fill in the memory size and CPU core count according to the node's actual hardware.
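If you want to derive those two values from the worker's actual hardware, a small helper like the following can print suggested settings (a sketch, not from the source; the 20% headroom is an arbitrary assumption to leave room for the OS and other containers).

import os

# Total physical memory in MB and number of logical CPU cores (Linux)
total_mem_mb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") // (1024 * 1024)
vcores = os.cpu_count()

print("yarn.nodemanager.resource.memory-mb :", int(total_mem_mb * 0.8))
print("yarn.nodemanager.resource.cpu-vcores:", vcores)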