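Contents
I. Root cause of the error
II. Troubleshooting steps
1. Check the current driver version
2. Check the CUDA runtime version
3. Verify driver and CUDA compatibility
III. Solutions
1. Make sure the driver is loaded correctly
2. Reinstall a matching driver and CUDA
3. Verify that the environment is correct
IV. Key notes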
Error log:
bash nccl.sh
------------5. Install nccl-test and run the tests-------------
Cloning into 'nccl-tests'...
remote: Enumerating objects: 504, done.
remote: Counting objects: 100% (347/347), done.
remote: Compressing objects: 100% (153/153), done.
remote: Total 504 (delta 302), reused 206 (delta 194), pack-reused 157 (from 2)
Receiving objects: 100% (504/504), 188.86 KiB | 1.20 MiB/s, done.
Resolving deltas: 100% (341/341), done.
make -C src build BUILDDIR=/home/test/nccl-tests/build
make[1]: Entering directory '/home/test/nccl-tests/src'
Compiling timer.cc > /home/test/nccl-tests/build/timer.o
Compiling /home/test/nccl-tests/build/verifiable/verifiable.o
Compiling all_reduce.cu