Welcome to follow my CSDN: https://spike.blog.csdn.net/
Link to this article: https://spike.blog.csdn.net/article/details/148421901
Disclaimer: This article is based on personal knowledge and publicly available materials, and is intended for academic exchange only. Discussion is welcome; reposting is not permitted.
It is recommended to configure the PyTorch development environment with Docker: machines differ considerably in hardware and system configuration, and a bare-metal install can still fail to launch training jobs at the very end, wasting a lot of time. Building virtual environments with Docker + Conda (Mamba) supports most tasks directly.
1. Network Proxy
Downloads from GitHub are often slow; using a proxy is recommended to speed them up.
To use a specific network proxy:
export https_proxy=http://xxx:80
export http_proxy=http://xxx:80

# to disable the proxy:
unset https_proxy http_proxy
xxx is the proxy's IP address.
Alternatively, use a free online proxy, e.g., https://ghproxy.link/
# https://ghfast.top
git clone https://ghfast.top/https://github.com/hiyouga/LLaMA-Factory.git # example
Note: free proxies may stop working at any time; check their availability before use.
For the Hugging Face environment, refer to https://hf-mirror.com/:
export HF_ENDPOINT=https://hf-mirror.com
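To confirm that the mirror setting takes effect, a quick sanity check can be run (a minimal sketch; the expected outputs are indicative):
curl -sI https://hf-mirror.com | head -n 1   # expect an HTTP success status
echo $HF_ENDPOINT                            # expect https://hf-mirror.com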
2. Environment Variables
Print the system environment variables:
printenv
Write the LLM-related environment variables into ~/.bashrc, as follows:
export WORK_DIR="xxx"
export TORCH_HOME="$WORK_DIR/torch_home/"
export HF_HOME="$WORK_DIR/huggingface/"
export HUGGINGFACE_TOKEN="xxx"
export MODELSCOPE_CACHE="$WORK_DIR/modelscope_models/"
export MODELSCOPE_API_TOKEN="xxx"
export CUDA_HOME="/usr/local/cuda"
export OMP_NUM_THREADS=64
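After editing ~/.bashrc, reload it and spot-check a variable (a minimal sketch; WORK_DIR follows the placeholder above):
source ~/.bashrc
echo $TORCH_HOME   # should print $WORK_DIR/torch_home/
python3 -c "import os; print(os.environ.get('HF_HOME'))"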
3. Docker
NVIDIA's image is recommended, as it ships with default configuration and environment: https://docker.aityp.com/r/docker.io/nvcr.io/nvidia/pytorch
Pull the Docker image (via a domestic mirror). The 24.12-py3 version is recommended; do not use the latest version, which has compatibility issues:
docker pull swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/nvcr.io/nvidia/pytorch:24.12-py3
A standard template for starting the Docker container:
docker run -itd \
--name [your name] \
--gpus all \
--shm-size=128g \
--memory=256g \
--cpus=64 \
--restart=unless-stopped \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v [your path]:[your path] \
-e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
--privileged \
--network host \
swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/nvcr.io/nvidia/pytorch:24.12-py3 \
/bin/bash
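Once the container is up, enter it and confirm that the GPUs are visible (a minimal sketch; [your name] is the container name chosen above):
docker exec -it [your name] /bin/bash
nvidia-smi   # the GPUs should be listed inside the container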
4. Virtual Environment
Conda or Mamba is recommended; taking Mamba as an example:
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)
In manual-download mode, download micro.mamba.pm/install.sh directly (i.e., its GitHub path) and route the download through the https://ghfast.top/ proxy, as in the sketch below.
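A minimal sketch of the manual download, assuming the script is hosted in the mamba-org/micromamba-releases repository (the raw.githubusercontent.com path is an assumption):
curl -L https://ghfast.top/https://raw.githubusercontent.com/mamba-org/micromamba-releases/main/install.sh -o mamba_install.sh
"${SHELL}" mamba_install.sh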
Configure the pip source:
# pip config files baked into the Docker image take priority; remove them
rm -rf /usr/pip.conf
rm -rf /root/.config/pip/pip.conf
rm -rf /etc/pip.conf
rm -rf /etc/xdg/pip/pip.conf

# configure a new source
mkdir ~/.pip
vim ~/.pip/pip.conf

Content of ~/.pip/pip.conf:
[global]
no-cache-dir = true
index-url = http://mirrors.aliyun.com/pypi/simple/
trusted-host = mirrors.aliyun.com
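Verify that pip picks up the new configuration:
pip config list   # should show the Aliyun index-url and trusted-host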
Install the torch_def environment:
micromamba create -n torch_def python=3.11
micromamba activate torch_def
pip3 install torch torchvision torchaudio --timeout=100
Downloads can be slow; --timeout=100 helps avoid timeouts.
Verify the PyTorch environment:
import torch
print(torch.__version__) # 2.7.0+cu126
print(torch.cuda.is_available()) # True
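A slightly richer check that GPU compute actually works (a minimal sketch; the expected outputs are indicative):
import torch
print(torch.cuda.device_count())      # number of visible GPUs
print(torch.cuda.get_device_name(0))  # name of the first GPU
x = torch.randn(2, 3, device="cuda")
print((x @ x.T).shape)                # torch.Size([2, 2]) if matmul runs on the GPU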
Install the related Python packages:
pip install datasets accelerate bitsandbytes peft swanlab sentencepiece trl deepspeed modelscope
pip install -U "huggingface_hub[cli]"
Download Hugging Face models and datasets, for example:
huggingface-cli download Qwen/Qwen3-8B --local-dir Qwen/Qwen3-8B
huggingface-cli download --repo-type dataset FreedomIntelligence/medical-o1-reasoning-SFT --local-dir FreedomIntelligence/medical-o1-reasoning-SFT
Reference dataset: FreedomIntelligence/medical-o1-reasoning-SFT
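A minimal sketch of loading the downloaded model and dataset; the hub IDs resolve through the mirror endpoint from Section 1, and the "en" dataset config name is an assumption:
from datasets import load_dataset
from transformers import AutoTokenizer

# "Qwen/Qwen3-8B" also matches the --local-dir path used above
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
# the "en" config name is an assumption; adjust to the actual subset
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train")
print(dataset[0])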
Others
The optimized Mamba installation script mamba_install.sh (GitHub downloads routed through the ghfast.top proxy) is as follows:
#!/bin/sh

set -eu

# Detect the shell from which the script was called
parent=$(ps -o comm $PPID | tail -1)
parent=${parent#-} # remove the leading dash that login shells have
case "$parent" in
  # shells supported by `micromamba shell init`
  bash|fish|xonsh|zsh)
    shell=$parent
    ;;
  *)
    # use the login shell (basename of $SHELL) as a fallback
    shell=${SHELL##*/}
    ;;
esac

# Parsing arguments
if [ -t 0 ] ; then
  printf "Micromamba binary folder? [~/.local/bin] "
  read BIN_FOLDER
  printf "Init shell ($shell)? [Y/n] "
  read INIT_YES
  printf "Configure conda-forge? [Y/n] "
  read CONDA_FORGE_YES
fi

# Fallbacks
BIN_FOLDER="${BIN_FOLDER:-${HOME}/.local/bin}"
INIT_YES="${INIT_YES:-yes}"
CONDA_FORGE_YES="${CONDA_FORGE_YES:-yes}"

# Prefix location is relevant only if we want to call `micromamba shell init`
case "$INIT_YES" in
  y|Y|yes)
    if [ -t 0 ]; then
      printf "Prefix location? [~/micromamba] "
      read PREFIX_LOCATION
    fi
    ;;
esac
PREFIX_LOCATION="${PREFIX_LOCATION:-${HOME}/micromamba}"

# Computing artifact location
case "$(uname)" in
  Linux)  PLATFORM="linux" ;;
  Darwin) PLATFORM="osx" ;;
  *NT*)   PLATFORM="win" ;;
esac

ARCH="$(uname -m)"
case "$ARCH" in
  aarch64|ppc64le|arm64) ;; # pass
  *) ARCH="64" ;;
esac

case "$PLATFORM-$ARCH" in
  linux-aarch64|linux-ppc64le|linux-64|osx-arm64|osx-64|win-64) ;; # pass
  *)
    echo "Failed to detect your OS" >&2
    exit 1
    ;;
esac

if [ "${VERSION:-}" = "" ]; then
  RELEASE_URL="https://ghfast.top/https://github.com/mamba-org/micromamba-releases/releases/latest/download/micromamba-${PLATFORM}-${ARCH}"
else
  RELEASE_URL="https://ghfast.top/https://github.com/mamba-org/micromamba-releases/releases/download/${VERSION}/micromamba-${PLATFORM}-${ARCH}"
fi

# Downloading artifact
mkdir -p "${BIN_FOLDER}"
if hash curl >/dev/null 2>&1; then
  curl "${RELEASE_URL}" -o "${BIN_FOLDER}/micromamba" -fsSL --compressed ${CURL_OPTS:-}
elif hash wget >/dev/null 2>&1; then
  wget ${WGET_OPTS:-} -qO "${BIN_FOLDER}/micromamba" "${RELEASE_URL}"
else
  echo "Neither curl nor wget was found" >&2
  exit 1
fi
chmod +x "${BIN_FOLDER}/micromamba"

# Initializing shell
case "$INIT_YES" in
  y|Y|yes)
    case $("${BIN_FOLDER}/micromamba" --version) in
      1.*|0.*)
        shell_arg=-s
        prefix_arg=-p
        ;;
      *)
        shell_arg=--shell
        prefix_arg=--root-prefix
        ;;
    esac
    "${BIN_FOLDER}/micromamba" shell init $shell_arg "$shell" $prefix_arg "$PREFIX_LOCATION"
    echo "Please restart your shell to activate micromamba or run the following:\n"
    echo "  source ~/.bashrc (or ~/.zshrc, ~/.xonshrc, ~/.config/fish/config.fish, ...)"
    ;;
  *)
    echo "You can initialize your shell later by running:"
    echo "  micromamba shell init"
    ;;
esac

# Initializing conda-forge
case "$CONDA_FORGE_YES" in
  y|Y|yes)
    "${BIN_FOLDER}/micromamba" config append channels conda-forge
    "${BIN_FOLDER}/micromamba" config append channels nodefaults
    "${BIN_FOLDER}/micromamba" config set channel_priority strict
    ;;
esac
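After saving the script, run it and confirm that micromamba is available (a minimal sketch):
chmod +x mamba_install.sh
./mamba_install.sh
micromamba --version   # confirm the installation succeeded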