SSH passwordless login
Ubuntu ships with the SSH client by default; the SSH server has to be installed separately:
sudo apt install openssh-server
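- Generate a public/private key pair on the local machine
ssh-keygen -t rsa
This creates two files in ~/.ssh/: id_rsa (the private key) and id_rsa.pub (the public key). ssh-keygen creates ~/.ssh itself if the directory does not exist yet.
Note: the .ssh directory and the files in it must have the correct permissions. These are files you own, so sudo is not needed; authorized_keys exists on a machine once a public key has been uploaded to it.
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys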
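- Upload the public key to the target machine
ssh-copy-id star@192.168.0.100
Note: the part before @ is the user name, the part after it is the IP address.
- Test passwordless login
ssh star@192.168.0.100
Repeat these steps on every machine, and make sure that all machines can log in to each other without a password.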
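The final check can be scripted rather than done by hand. The following is a minimal sketch, not part of the original steps: check_passwordless is a hypothetical helper name, and it relies on the standard OpenSSH option BatchMode=yes, which makes ssh fail instead of prompting for a password.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether each host accepts key-based login.
# BatchMode=yes disables password prompts, so a host that still needs a
# password shows up as FAIL instead of blocking the script.
check_passwordless() {
    local failed=0 host
    for host in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}

# Example (run on every machine, listing all the other machines):
# check_passwordless star@192.168.0.100
```

The function returns a non-zero status if any host fails, so it can gate a later mpirun launch.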
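Installing NCCL (Ubuntu)
Installing NCCL on Ubuntu requires first adding a repository that contains the NCCL packages to the APT system, then installing the NCCL packages through APT. Two repositories are available: a local repository and a network repository. The network repository is recommended, because it makes it easy to pick up upgrades when new versions are released.
- Install the repository.
- For the local NCCL repository:
sudo dpkg -i nccl-repo-<version>.deb
Note: installing the local repository prompts you to install the local key it embeds, which is used to sign the packages. Make sure to follow the instructions and install the local key, otherwise the install phase will fail.
- For the network repository: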
wget https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<architecture>/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
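- Update the APT database:
sudo apt update
- Install the libnccl2 package with APT. If you also need to compile applications against NCCL, install the libnccl-dev package as well.
If you are using the network repository, the following command upgrades CUDA to the latest version:
sudo apt install libnccl2 libnccl-dev
To keep an older version of CUDA, pin a specific version instead: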
sudo apt install libnccl2=2.8.4-1+cuda11.1 libnccl-dev=2.8.4-1+cuda11.1
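The version pin in the last command encodes both the NCCL version and the CUDA version it was built for. The sketch below splits such a string; parse_nccl_pin is a hypothetical helper, and the layout <nccl>-<revision>+cuda<cuda> is an assumption inferred from the example above rather than a documented format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split an APT version pin such as "2.8.4-1+cuda11.1".
# Assumed layout (inferred from the example): <nccl>-<revision>+cuda<cuda>
parse_nccl_pin() {
    local pin=$1
    local nccl=${pin%%-*}      # everything before the first "-"
    local cuda=${pin##*+cuda}  # everything after "+cuda"
    echo "nccl=$nccl cuda=$cuda"
}

parse_nccl_pin "2.8.4-1+cuda11.1"   # prints: nccl=2.8.4 cuda=11.1

# The versions actually available from the configured repositories can be
# listed with: apt-cache madison libnccl2
```

Pick a pin whose CUDA part matches the CUDA toolkit already installed on the machine.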
Installing MPI (Ubuntu)
Open MPI is built and installed from source.
- Download the Open MPI source
Download it from the official Open MPI website, or use wget:
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.gz
- Extract
tar -zxvf openmpi-4.1.4.tar.gz
- Build and install
cd openmpi-4.1.4
./configure --prefix=/usr/local/openmpi
make
sudo make install
- Configure environment variables
Add the following lines to /etc/profile:
export PATH=/usr/local/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH
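Note: the change takes effect at the next login (a reboot also works); to apply it to the current shell, run source /etc/profile.
- Verify
Run the following commands to verify that Open MPI is installed correctly: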
mpicc --version
mpirun --version
If both commands print their version information, Open MPI has been installed and configured successfully.
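If the version commands are not found, a quick way to narrow it down is to check whether the two directories configured above are on the search paths of the current shell. This is a small sketch; path_contains is a hypothetical helper name, and the directories are the ones used in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if directory $2 is one of the
# colon-separated entries of the list $1.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if path_contains "${PATH:-}" /usr/local/openmpi/bin; then
    echo "PATH contains /usr/local/openmpi/bin"
else
    echo "PATH is missing /usr/local/openmpi/bin"
fi

if path_contains "${LD_LIBRARY_PATH:-}" /usr/local/openmpi/lib; then
    echo "LD_LIBRARY_PATH contains /usr/local/openmpi/lib"
else
    echo "LD_LIBRARY_PATH is missing /usr/local/openmpi/lib"
fi
```

Wrapping the list in colons before matching avoids false positives such as /usr/local/openmpi/bin2.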
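NCCL tests
These tests check both the performance and the correctness of NCCL operations.
- Build
To build the tests, just run make. If CUDA is not installed in /usr/local/cuda, specify CUDA_HOME. Similarly, if NCCL is not installed in /usr, specify NCCL_HOME.
make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
NCCL tests rely on MPI to work on multiple processes (and hence on multiple nodes). To compile the tests with MPI support, set MPI=1 and set MPI_HOME to the path where MPI is installed.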
make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
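- Usage
NCCL tests can run on multiple processes, multiple threads, and multiple CUDA devices per thread. The number of processes is managed by MPI and is therefore not passed to the tests as an argument.
- Examples
Run on 8 GPUs (-g 8), scanning from 8 bytes to 128 MB:
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
Run with MPI on 4 processes (2 nodes), 1 GPU per process, 4 GPUs in total: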
mpirun -np 4 -H 192.168.0.111:2,192.168.0.100:2 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
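In the MPI example, the value of -np matches the slots listed after -H (2 + 2 = 4). The sketch below adds up the slots in such a host list so the two can be compared before launching; total_slots is a hypothetical helper, and counting a host without ":slots" as one slot is an assumption about Open MPI's behaviour.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum the slot counts in an mpirun -H host list
# of the form "host:slots,host:slots,...".
# Assumption: a host given without ":slots" counts as 1 slot.
total_slots() {
    local list=$1 total=0 entry slots
    local IFS=,
    for entry in $list; do
        case "$entry" in
            *:*) slots=${entry##*:} ;;
            *)   slots=1 ;;
        esac
        total=$((total + slots))
    done
    echo "$total"
}

total_slots "192.168.0.111:2,192.168.0.100:2"   # prints: 4
```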
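- Parameters
All tests support the same set of parameters.
- Number of GPUs
-t, --nthreads <num threads>: number of threads per process. Default: 1.
-g, --ngpus <GPUs per thread>: number of GPUs per thread. Default: 1.
- Sizes to scan
-b, --minbytes <min size in bytes>: minimum size to start with. Default: 32M.
-e, --maxbytes <max size in bytes>: maximum size to end at. Default: 32M.
- The increment can be either a fixed size or a multiplication factor. Only one of the two should be used.
-i, --stepbytes <increment size>: fixed increment between sizes. Default: 1M.
-f, --stepfactor <increment factor>: multiplication factor between sizes. Default: disabled.
- NCCL operation arguments
-o, --op <sum/prod/min/max/avg/all>: the reduction operation to perform. Only relevant for reduction operations such as Allreduce, Reduce or ReduceScatter. Default: Sum.
-d, --datatype <nccltype/all>: the datatype to use. Default: Float.
-r, --root <root/all>: the root to use. Only used for operations that have a root, such as Broadcast or Reduce. Default: 0.
- Performance
-n, --iters <iteration count>: number of iterations. Default: 20.
-w, --warmup_iters <warmup iteration count>: number of warmup iterations (not timed). Default: 5.
-m, --agg_iters <aggregation count>: number of operations to aggregate in each iteration. Default: 1.
-a, --average <0/1/2/3>: report performance as an average across all ranks (MPI=1 only). <0=Rank0, 1=Avg, 2=Min, 3=Max>. Default: 1.
- Test operation
-p, --parallel_init <0/1>: use threads to initialize NCCL in parallel. Default: 0.
-c, --check <check iteration count>: perform count iterations, checking the correctness of the results on each iteration. This can be quite slow on a large number of GPUs. Default: 1.
-z, --blocking <0/1>: make NCCL collectives blocking, i.e. have the CPU wait and synchronize after each collective. Default: 0.
-G, --cudagraph <num graph launches>: capture the iterations as a CUDA graph and replay it the specified number of times. Default: 0.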
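To see what -b, -e and -f mean together, the sketch below prints the message sizes a sweep such as -b 8 -e 128M -f 2 would visit. It is an illustration of the multiplication-factor sweep as described above, not code taken from the tests; to_bytes and sweep_sizes are hypothetical helpers, and treating K/M/G as powers of 1024 is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert "8", "64K", "128M" or "1G" to bytes.
# Assumption: the suffixes are powers of 1024.
to_bytes() {
    local s=$1
    case "$s" in
        *[Kk]) echo $(( ${s%?} * 1024 )) ;;
        *[Mm]) echo $(( ${s%?} * 1024 * 1024 )) ;;
        *[Gg]) echo $(( ${s%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$s" ;;
    esac
}

# Hypothetical helper: print every size from min ($1) to max ($2),
# multiplying by the step factor ($3) each time.
sweep_sizes() {
    local size max factor=$3
    [ "$factor" -gt 1 ] || return 1   # a factor of 1 or less would never finish
    size=$(to_bytes "$1")
    max=$(to_bytes "$2")
    while [ "$size" -le "$max" ]; do
        echo "$size"
        size=$(( size * factor ))
    done
}

sweep_sizes 8 128M 2 | tail -n 1   # prints: 134217728
```

With these arguments the sweep covers 25 sizes, from 8 bytes (2^3) up to 134217728 bytes (2^27).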
Common problems when running on multiple machines
Problem 1:
bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
Fix: add the --prefix argument, set to the Open MPI installation prefix, so that the remote nodes can find orted:
mpirun -np 4 -H 192.168.0.111:2,192.168.0.100:2 --prefix /usr/local/openmpi ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
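The value passed to --prefix is the installation root, i.e. the directory whose bin/ subdirectory holds mpirun and orted. If the prefix used at configure time has been forgotten, it can be derived from the path of mpirun. This is a small sketch; prefix_of is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given the full path of an executable located in
# <prefix>/bin, print <prefix>.
prefix_of() {
    dirname "$(dirname "$1")"
}

prefix_of /usr/local/openmpi/bin/mpirun   # prints: /usr/local/openmpi

# On a node where mpirun is already on PATH:
# prefix_of "$(command -v mpirun)"
```

The error message itself names an alternative: configuring Open MPI with --enable-orterun-prefix-by-default.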
Problem 2:
--------------------------------------------------------------------------
A compressed message was received by the Open MPI run time system
(PMIx) that could not be decompressed. This means that Open MPI has
compression support enabled on one node and not enabled on another.
This is an unsupported configuration.
Compression support is enabled when both of the following conditions
are met:
1. The Open MPI run time system (PMIx) is built with compression
   support.
2. The necessary compression libraries (e.g., libz) can be found at
   run time.
You should check that both of these conditions are true on both the
node where mpirun is invoked and all the nodes where MPI processes
will be launched. The node listed below does not have both conditions
met:
  node without compression support: wenji-Ubuntu
NOTE: There may also be other nodes without compression support.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
Fix: install zlib:
sudo apt install zlib1g
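Because the message asks for the compression library to be present on every node, it can help to check each node before and after installing the package. The sketch below looks for the zlib runtime library in the dynamic linker cache; has_zlib is a hypothetical helper, and matching on the name libz.so.1 is an assumption about how the library appears in ldconfig -p output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ldconfig -p` output on stdin and succeed
# if the zlib runtime library (assumed name: libz.so.1) is listed.
has_zlib() {
    grep -q 'libz\.so\.1'
}

if ldconfig -p 2>/dev/null | has_zlib; then
    echo "zlib found"
else
    echo "zlib not found"
fi
```

Run it on the node where mpirun is invoked and on every node that launches MPI processes.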