PyTorch Deep Learning in Practice (37): Distributed Training (DP/DDP/DeepSpeed)

In the previous article we explored mixed-precision training and gradient scaling. This article takes a close look at the three mainstream approaches to distributed training: DataParallel (DP), DistributedDataParallel (DDP), and DeepSpeed, to help you master the key techniques for training models at scale. We implement all three in PyTorch on the CIFAR-10 classification task and compare how they behave.

I. Distributed Training Basics

1. The Three Approaches at a Glance

| Feature                   | DataParallel (DP)       | DistributedDataParallel (DDP) | DeepSpeed                            |
|---------------------------|-------------------------|-------------------------------|--------------------------------------|
| Implementation effort     | Simple                  | Moderate                      | Complex                              |
| Multi-node support        | No                      | Yes                           | Yes                                  |
| Communication efficiency  | Low (single process)    | High (NCCL)                   | Very high (optimized communication)  |
| Memory optimization       | Average                 | Average                       | Excellent (ZeRO stages 1-3)          |
| Best suited for           | Single-node multi-GPU   | Multi-node multi-GPU          | Very large model training            |

2. Key Concepts

class DistributedTerms:
    def __init__(self):
        self.world_size = "Total number of processes participating in training"
        self.rank = "Global ID of the current process"
        self.local_rank = "Process ID on the current node"
        self.backend = "Communication backend (nccl/gloo/mpi)"
        self.init_method = "Initialization method (env/file/tcp)"
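
To see where these values come from at runtime, the short sketch below (an illustration, not part of the original tutorial code) prints the variables that torchrun injects into every process it launches; init_method="env://" reads the same variables.

import os
import torch.distributed as dist

def describe_process():
    # torchrun exports RANK / LOCAL_RANK / WORLD_SIZE (plus MASTER_ADDR and
    # MASTER_PORT) into the environment of every process it spawns.
    rank = int(os.environ.get("RANK", 0))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    print(f"rank={rank} | local_rank={local_rank} | world_size={world_size}")
    # init_method="env://" makes init_process_group read the same variables
    dist.init_process_group(backend="nccl", init_method="env://")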

II. DataParallel (DP) in Practice

1. Basic DP Implementation

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import torch.nn.functional as F

# Device setup (DataParallel automatically uses all visible GPUs)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Model definition
class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(64 * 6 * 6, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = x.view(-1, 64 * 6 * 6)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Data preparation
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

# Wrap the model with DataParallel
model = ConvNet().to(device)
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    model = nn.DataParallel(model)

# Training loop
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % 100 == 0:
            print(f"Epoch: {epoch} | Batch: {batch_idx}/{len(train_loader)} | Loss: {loss.item():.4f}")

Output:

Epoch: 0 | Batch: 0/196 | Loss: 2.2991
Epoch: 0 | Batch: 100/196 | Loss: 1.4741
Epoch: 1 | Batch: 0/196 | Loss: 1.3207
Epoch: 1 | Batch: 100/196 | Loss: 1.1767
Epoch: 2 | Batch: 0/196 | Loss: 1.0995
Epoch: 2 | Batch: 100/196 | Loss: 1.0768
Epoch: 3 | Batch: 0/196 | Loss: 1.0013
Epoch: 3 | Batch: 100/196 | Loss: 0.8572
Epoch: 4 | Batch: 0/196 | Loss: 0.9214
Epoch: 4 | Batch: 100/196 | Loss: 0.8575
Epoch: 5 | Batch: 0/196 | Loss: 0.8531
Epoch: 5 | Batch: 100/196 | Loss: 0.9046
Epoch: 6 | Batch: 0/196 | Loss: 0.8146
Epoch: 6 | Batch: 100/196 | Loss: 0.6766
Epoch: 7 | Batch: 0/196 | Loss: 0.6395
Epoch: 7 | Batch: 100/196 | Loss: 0.6545
Epoch: 8 | Batch: 0/196 | Loss: 0.7298
Epoch: 8 | Batch: 100/196 | Loss: 0.7052
Epoch: 9 | Batch: 0/196 | Loss: 0.6747
Epoch: 9 | Batch: 100/196 | Loss: 0.6076

2. Limitations of DP

class DPLimitations:
    def __init__(self):
        self.issues = [
            "Single process with multiple threads causes GIL contention",
            "Gradient aggregation on a single (master) GPU becomes a bottleneck",
            "Unbalanced memory usage (OOM risk on the master GPU)",
            "No multi-node scaling",
        ]
        self.solution = "Use DDP instead"
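
One way to observe the imbalance in practice is to inspect per-GPU memory after a few training steps. The helper below is a minimal sketch using PyTorch's standard memory counters (it is not part of the original script); under DP, cuda:0 usually reports noticeably higher usage because it gathers the outputs and accumulates the gradients.

import torch

def report_gpu_memory():
    # Print allocated/reserved memory for every visible GPU; under DP the
    # master GPU (cuda:0) tends to be the clear outlier.
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1024 ** 2
        reserved = torch.cuda.memory_reserved(i) / 1024 ** 2
        print(f"cuda:{i} | allocated: {allocated:.1f} MiB | reserved: {reserved:.1f} MiB")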

III. DistributedDataParallel (DDP) in Practice

1. Single-Node, Multi-GPU DDP Implementation

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import torch.nn.functional as F
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
import os

class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(64 * 6 * 6, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = x.view(-1, 64 * 6 * 6)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

def ddp_setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    if dist.is_initialized():
        dist.destroy_process_group()

class DDPTrainer:
    def __init__(self, rank, world_size):
        self.rank = rank
        self.world_size = world_size
        self.model = ConvNet().to(rank)
        self.model = DDP(self.model, device_ids=[rank])
        self.optimizer = optim.Adam(self.model.parameters(), lr=0.001)
        self.criterion = nn.CrossEntropyLoss()
        self.train_loader = self._prepare_dataloader()

    def _prepare_dataloader(self):
        transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
        ])
        dataset = datasets.CIFAR10(
            root='./data',
            train=True,
            download=True,
            transform=transform
        )
        sampler = torch.utils.data.distributed.DistributedSampler(
            dataset,
            num_replicas=self.world_size,
            rank=self.rank,
            shuffle=True
        )
        return DataLoader(
            dataset,
            batch_size=256,
            sampler=sampler,
            num_workers=2,
            pin_memory=True
        )

    def train(self, total_epochs=10):
        for epoch in range(total_epochs):
            self.model.train()
            self.train_loader.sampler.set_epoch(epoch)  # ensure a different shuffle every epoch
            for batch_idx, (data, target) in enumerate(self.train_loader):
                data, target = data.to(self.rank), target.to(self.rank)
                self.optimizer.zero_grad()
                output = self.model(data)
                loss = self.criterion(output, target)
                loss.backward()
                self.optimizer.step()
                if batch_idx % 100 == 0 and self.rank == 0:  # only the main process prints
                    print(f"Epoch: {epoch} | Batch: {batch_idx}/{len(self.train_loader)} | Loss: {loss.item():.4f}")
        if self.rank == 0:
            torch.save(self.model.module.state_dict(), "ddp_model.pth")

def main_ddp(rank, world_size):
    ddp_setup(rank, world_size)
    trainer = DDPTrainer(rank, world_size)
    trainer.train()
    cleanup()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(main_ddp, args=(world_size,), nprocs=world_size)

Output:

Files already downloaded and verified
Epoch: 0 | Batch: 0/196 | Loss: 2.3034
Epoch: 0 | Batch: 100/196 | Loss: 1.4479
Epoch: 1 | Batch: 0/196 | Loss: 1.2776
Epoch: 1 | Batch: 100/196 | Loss: 1.3570
Epoch: 2 | Batch: 0/196 | Loss: 1.1137
Epoch: 2 | Batch: 100/196 | Loss: 0.9530
Epoch: 3 | Batch: 0/196 | Loss: 1.0887
Epoch: 3 | Batch: 100/196 | Loss: 0.8944
Epoch: 4 | Batch: 0/196 | Loss: 0.8131
Epoch: 4 | Batch: 100/196 | Loss: 0.9242
Epoch: 5 | Batch: 0/196 | Loss: 0.9026
Epoch: 5 | Batch: 100/196 | Loss: 1.0272
Epoch: 6 | Batch: 0/196 | Loss: 0.6772
Epoch: 6 | Batch: 100/196 | Loss: 0.8683
Epoch: 7 | Batch: 0/196 | Loss: 0.7747
Epoch: 7 | Batch: 100/196 | Loss: 0.7627
Epoch: 8 | Batch: 0/196 | Loss: 0.6991
Epoch: 8 | Batch: 100/196 | Loss: 0.7003
Epoch: 9 | Batch: 0/196 | Loss: 0.6409
Epoch: 9 | Batch: 100/196 | Loss: 0.6621
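
Because the checkpoint above was taken from self.model.module.state_dict(), its keys carry no "module." prefix and load directly into a plain model. A minimal sketch for single-GPU evaluation (assuming the ConvNet definition from the script above):

import torch

model = ConvNet()
state_dict = torch.load("ddp_model.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()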

2. Key Configuration for Multi-Node DDP

def multi_machine_setup():
    config = {
        "init_method": "env://",  # initialize from environment variables
        "environment_vars": {
            "MASTER_ADDR": "192.168.1.1",  # master node IP
            "MASTER_PORT": "29500",        # open port
            "WORLD_SIZE": "8",             # total number of GPUs
            "RANK": "0-7"                  # rank of each process
        },
        "per_machine_cmd": "python train.py --local_rank=$LOCAL_RANK",
        "launcher": "torchrun"  # torchrun is recommended over mp.spawn
    }
    return config
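
With torchrun the per-process bookkeeping disappears: each node runs the same command (for example, torchrun --nnodes=2 --nproc_per_node=4 --node_rank=<0 or 1> --master_addr=192.168.1.1 --master_port=29500 train.py), and the script reads its identity from the environment. A minimal sketch of the corresponding setup inside train.py (illustrative, not the article's full script):

import os
import torch
import torch.distributed as dist

def setup_from_env():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
    # so no explicit rank/world_size arguments are needed here.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return dist.get_rank(), local_rank, dist.get_world_size()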

IV. DeepSpeed in Practice

1. Basic DeepSpeed Configuration

# ds_config.json
{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.001,
      "betas": [0.8, 0.999],
      "eps": 1e-8,
      "weight_decay": 3e-7
    }
  },
  "fp16": {
    "enabled": true,
    "auto_cast": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 200000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 200000000,
    "contiguous_gradients": true
  }
}

2. DeepSpeed Training Implementation

# requires: pip install deepspeed
import argparse

import deepspeed
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms


class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(64 * 6 * 6, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        # Convert the input to the model's dtype if needed (fp16 under DeepSpeed)
        if next(self.parameters()).dtype != x.dtype:
            x = x.to(next(self.parameters()).dtype)

        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = x.view(-1, 64 * 6 * 6)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x


def train_deepspeed(args):
    model = ConvNet()

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
    dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

    # Initialize DeepSpeed
    model_engine, optimizer, train_loader, _ = deepspeed.initialize(
        args=args,
        model=model,
        model_parameters=model.parameters(),
        training_data=dataset
    )

    for epoch in range(args.epochs):
        model_engine.train()
        for batch_idx, (inputs, targets) in enumerate(train_loader):
            inputs = inputs.to(model_engine.device)
            targets = targets.to(model_engine.device)

            # Ensure the inputs match the model dtype
            if model_engine.fp16_enabled():
                inputs = inputs.half()

            outputs = model_engine(inputs)
            loss = F.cross_entropy(outputs, targets)

            model_engine.backward(loss)
            model_engine.step()

            if batch_idx % 100 == 0 and model_engine.local_rank == 0:
                print(f"Epoch: {epoch} | Batch: {batch_idx} | Loss: {loss.item():.4f}")


def get_args():
    parser = argparse.ArgumentParser(description='DeepSpeed CIFAR10 Training')
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--local_rank', type=int, default=-1)
    parser.add_argument('--deepspeed_config', type=str, default='ds_config.json',
                        help='path to DeepSpeed config file')
    return parser.parse_args()


if __name__ == "__main__":
    args = get_args()
    train_deepspeed(args)

# Launch script (num_gpus is the number of GPUs to use):
#   deepspeed --num_gpus=1 deepspeed_train.py --deepspeed_config ds_config.json
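
The script above never saves a checkpoint. Under ZeRO the optimizer state is partitioned across ranks, so saving should go through the engine, and every rank must make the call (not just rank 0). A hedged sketch, with a hypothetical checkpoint directory:

# Inside train_deepspeed(), after the epoch loop (sketch, not original code):
ckpt_dir = "./ds_checkpoints"  # hypothetical directory
model_engine.save_checkpoint(ckpt_dir, tag="final")

# To resume later, rebuild the engine with deepspeed.initialize() and then:
load_path, client_state = model_engine.load_checkpoint(ckpt_dir, tag="final")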

Output:

[2025-04-05 03:34:29,680] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-05 03:34:31,203] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2025-04-05 03:34:31,203] [INFO] [runner.py:605:main] cmd = /root/miniforge3/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None e38-3.py --deepspeed_config ds_config.json
[2025-04-05 03:34:32,803] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-05 03:34:34,305] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.21.5-1+cuda12.4
[2025-04-05 03:34:34,305] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.21.5-1
[2025-04-05 03:34:34,305] [INFO] [launch.py:139:main] 0 NCCL_VERSION=2.21.5-1
[2025-04-05 03:34:34,305] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2025-04-05 03:34:34,305] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.21.5-1+cuda12.4
[2025-04-05 03:34:34,305] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2025-04-05 03:34:34,305] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.21.5-1
[2025-04-05 03:34:34,305] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0]}
[2025-04-05 03:34:34,305] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=1, node_rank=0
[2025-04-05 03:34:34,305] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2025-04-05 03:34:34,305] [INFO] [launch.py:164:main] dist_world_size=1
[2025-04-05 03:34:34,305] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0
[2025-04-05 03:34:34,306] [INFO] [launch.py:256:main] process 13722 spawned with command: ['/root/miniforge3/bin/python3.10', '-u', 'e38-3.py', '--local_rank=0', '--deepspeed_config', 'ds_config.json']
[2025-04-05 03:34:35,911] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Files already downloaded and verified
[2025-04-05 03:34:38,546] [INFO] [logging.py:107:log_dist] [Rank -1] DeepSpeed info: version=0.16.5, git-hash=unknown, git-branch=unknown
[2025-04-05 03:34:38,546] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-04-05 03:34:38,546] [INFO] [comm.py:689:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-04-05 03:34:38,549] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 1
[2025-04-05 03:34:38,992] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu124/fused_adam/build.ninja...
/root/miniforge3/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.055249929428100586 seconds
[2025-04-05 03:34:39,049] [INFO] [logging.py:107:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adam as basic optimizer
[2025-04-05 03:34:39,049] [INFO] [logging.py:107:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-04-05 03:34:39,049] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2025-04-05 03:34:39,049] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2025-04-05 03:34:39,049] [INFO] [logging.py:107:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2025-04-05 03:34:39,049] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 200000000
[2025-04-05 03:34:39,049] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 200000000
[2025-04-05 03:34:39,050] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2025-04-05 03:34:39,050] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
[2025-04-05 03:34:39,174] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-04-05 03:34:39,175] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.02 GB         Max_CA 0 GB
[2025-04-05 03:34:39,175] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 3.37 GB, percent = 11.0%
[2025-04-05 03:34:39,285] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-04-05 03:34:39,285] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.02 GB         Max_CA 0 GB
[2025-04-05 03:34:39,286] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 3.37 GB, percent = 11.0%
[2025-04-05 03:34:39,286] [INFO] [stage_1_and_2.py:556:__init__] optimizer state initialized
[2025-04-05 03:34:39,380] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2025-04-05 03:34:39,380] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.02 GB         Max_CA 0 GB
[2025-04-05 03:34:39,381] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 3.37 GB, percent = 11.0%
[2025-04-05 03:34:39,381] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2025-04-05 03:34:39,381] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = None
[2025-04-05 03:34:39,381] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2025-04-05 03:34:39,381] [INFO] [logging.py:107:log_dist] [Rank 0] step=0, skipped=0, lr=[0.001], mom=[[0.8, 0.999]]
[2025-04-05 03:34:39,381] [INFO] [config.py:1000:print] DeepSpeedEngine configuration:
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   activation_checkpointing_config  {"partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false
}
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'intra_op_parallelism': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   amp_enabled .................. False
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   amp_params ................... False
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   autotuning_config ............ {"enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3
}
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   bfloat16_enabled ............. False
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   bfloat16_immediate_grad_update  False
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   checkpoint_parallel_write_pipeline  False
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   checkpoint_tag_validation_enabled  True
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   checkpoint_tag_validation_fail  False
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f7f620f2b30>
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   communication_data_type ...... None
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   curriculum_enabled_legacy .... False
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   curriculum_params_legacy ..... False
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'pin_memory': False, 'curriculum_learning': {'enabled': False}, 'dynamic_batching': {'enabled': False, 'lr_scaling_method': 'linear', 'min_batch_size': 1, 'max_batch_size': None, 'sequence_picking_order': 'dataloader', 'verbose': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   data_efficiency_enabled ...... False
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   dataloader_drop_last ......... False
[2025-04-05 03:34:39,382] [INFO] [config.py:1004:print]   disable_allgather ............ False
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   dump_state ................... False
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'consecutive_hysteresis': False, 'min_scale': 1}
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   eigenvalue_enabled ........... False
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   eigenvalue_gas_boundary_resolution  1
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   eigenvalue_layer_num ......... 0
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   eigenvalue_max_iter .......... 100
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   eigenvalue_stability ......... 1e-06
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   eigenvalue_tol ............... 0.01
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   eigenvalue_verbose ........... False
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   elasticity_enabled ........... False
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   flops_profiler_config ........ {"enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null
}
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   fp16_auto_cast ............... True
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   fp16_enabled ................. True
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   fp16_master_weights_and_gradients  False
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   global_rank .................. 0
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   grad_accum_dtype ............. None
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   gradient_accumulation_steps .. 1
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   gradient_clipping ............ 0.0
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   gradient_predivide_factor .... 1.0
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   graph_harvesting ............. False
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   initial_dynamic_scale ........ 65536
[2025-04-05 03:34:39,383] [INFO] [config.py:1004:print]   load_universal_checkpoint .... False
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   loss_scale ................... 0
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   memory_breakdown ............. False
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   mics_hierarchial_params_gather  False
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   mics_shard_size .............. -1
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   nebula_config ................ {"enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null
}
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   optimizer_legacy_fusion ...... False
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   optimizer_name ............... adam
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   optimizer_params ............. {'lr': 0.001, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07}
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   pld_enabled .................. False
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   pld_params ................... False
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   prescale_gradients ........... False
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   scheduler_name ............... None
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   scheduler_params ............. None
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   seq_parallel_communication_data_type  torch.float32
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   sparse_attention ............. None
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   sparse_gradients_enabled ..... False
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   steps_per_print .............. None
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   tensor_parallel_config ....... dtype=torch.float16 autotp_size=0 tensor_parallel=TPConfig(tp_size=1, tp_grain_size=1, mpu=None, tp_group=None) injection_policy_tuple=None keep_module_on_host=False replace_with_kernel_inject=False
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   timers_config ................ enabled=True synchronized=True
[2025-04-05 03:34:39,384] [INFO] [config.py:1004:print]   train_batch_size ............. 32
[2025-04-05 03:34:39,385] [INFO] [config.py:1004:print]   train_micro_batch_size_per_gpu  32
[2025-04-05 03:34:39,385] [INFO] [config.py:1004:print]   use_data_before_expert_parallel_  False
[2025-04-05 03:34:39,385] [INFO] [config.py:1004:print]   use_node_local_storage ....... False
[2025-04-05 03:34:39,385] [INFO] [config.py:1004:print]   wall_clock_breakdown ......... False
[2025-04-05 03:34:39,385] [INFO] [config.py:1004:print]   weight_quantization_config ... None
[2025-04-05 03:34:39,385] [INFO] [config.py:1004:print]   world_size ................... 1
[2025-04-05 03:34:39,385] [INFO] [config.py:1004:print]   zero_allow_untested_optimizer  False
[2025-04-05 03:34:39,385] [INFO] [config.py:1004:print]   zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=200000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=200000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False
[2025-04-05 03:34:39,385] [INFO] [config.py:1004:print]   zero_enabled ................. True
[2025-04-05 03:34:39,385] [INFO] [config.py:1004:print]   zero_force_ds_cpu_optimizer .. True
[2025-04-05 03:34:39,385] [INFO] [config.py:1004:print]   zero_optimization_stage ...... 2
[2025-04-05 03:34:39,385] [INFO] [config.py:990:print_user_config]   json = {"train_batch_size": 32, "gradient_accumulation_steps": 1, "optimizer": {"type": "Adam", "params": {"lr": 0.001, "betas": [0.8, 0.999], "eps": 1e-08, "weight_decay": 3e-07}}, "fp16": {"enabled": true, "auto_cast": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 2.000000e+08, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 2.000000e+08, "contiguous_gradients": true}
}
[2025-04-05 03:34:40,640] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
Epoch: 0 | Batch: 0 | Loss: 2.3125
[2025-04-05 03:34:40,646] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
Epoch: 0 | Batch: 100 | Loss: 1.8994
Epoch: 0 | Batch: 200 | Loss: 1.5488
Epoch: 0 | Batch: 300 | Loss: 1.7979
Epoch: 0 | Batch: 400 | Loss: 1.5400
Epoch: 0 | Batch: 500 | Loss: 1.3965
Epoch: 0 | Batch: 600 | Loss: 1.6631
Epoch: 0 | Batch: 700 | Loss: 1.4385
Epoch: 0 | Batch: 800 | Loss: 1.1826
Epoch: 0 | Batch: 900 | Loss: 1.2168
Epoch: 0 | Batch: 1000 | Loss: 1.3877
[2025-04-05 03:34:47,610] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2025-04-05 03:34:47,617] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
Epoch: 0 | Batch: 1100 | Loss: 1.2227
Epoch: 0 | Batch: 1200 | Loss: 1.3486
Epoch: 0 | Batch: 1300 | Loss: 1.4824
Epoch: 0 | Batch: 1400 | Loss: 1.2939
Epoch: 0 | Batch: 1500 | Loss: 1.3477
Epoch: 1 | Batch: 0 | Loss: 1.0166
Epoch: 1 | Batch: 100 | Loss: 1.3037
Epoch: 1 | Batch: 200 | Loss: 0.9531
Epoch: 1 | Batch: 300 | Loss: 1.2051
Epoch: 1 | Batch: 400 | Loss: 1.1250
[2025-04-05 03:34:54,593] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2025-04-05 03:34:54,599] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
Epoch: 1 | Batch: 500 | Loss: 1.0322
Epoch: 1 | Batch: 600 | Loss: 1.1992
Epoch: 1 | Batch: 700 | Loss: 0.9302
Epoch: 1 | Batch: 800 | Loss: 1.0225
Epoch: 1 | Batch: 900 | Loss: 0.9985
Epoch: 1 | Batch: 1000 | Loss: 1.0186
Epoch: 1 | Batch: 1100 | Loss: 0.7178
Epoch: 1 | Batch: 1200 | Loss: 1.0322
Epoch: 1 | Batch: 1300 | Loss: 1.3174
Epoch: 1 | Batch: 1400 | Loss: 1.0762
[2025-04-05 03:35:01,555] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2025-04-05 03:35:01,560] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
Epoch: 1 | Batch: 1500 | Loss: 1.0664
Epoch: 2 | Batch: 0 | Loss: 0.7202
Epoch: 2 | Batch: 100 | Loss: 1.1133
Epoch: 2 | Batch: 200 | Loss: 0.7842
Epoch: 2 | Batch: 300 | Loss: 1.0312
Epoch: 2 | Batch: 400 | Loss: 0.8369
Epoch: 2 | Batch: 500 | Loss: 0.8916
Epoch: 2 | Batch: 600 | Loss: 0.8364
Epoch: 2 | Batch: 700 | Loss: 0.8110
Epoch: 2 | Batch: 800 | Loss: 0.8823
[2025-04-05 03:35:08,718] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2025-04-05 03:35:08,725] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
Epoch: 2 | Batch: 900 | Loss: 0.9097
Epoch: 2 | Batch: 1000 | Loss: 0.7773
Epoch: 2 | Batch: 1100 | Loss: 0.6465
Epoch: 2 | Batch: 1200 | Loss: 0.8125
Epoch: 2 | Batch: 1300 | Loss: 1.2598
Epoch: 2 | Batch: 1400 | Loss: 0.9136
Epoch: 2 | Batch: 1500 | Loss: 0.9321
Epoch: 3 | Batch: 0 | Loss: 0.5562
Epoch: 3 | Batch: 100 | Loss: 0.9727
Epoch: 3 | Batch: 200 | Loss: 0.6489
Epoch: 3 | Batch: 300 | Loss: 0.8125
[2025-04-05 03:35:15,595] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2025-04-05 03:35:15,600] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
Epoch: 3 | Batch: 400 | Loss: 0.6260
Epoch: 3 | Batch: 500 | Loss: 0.7510
Epoch: 3 | Batch: 600 | Loss: 0.6504
Epoch: 3 | Batch: 700 | Loss: 0.7002
Epoch: 3 | Batch: 800 | Loss: 0.9238
Epoch: 3 | Batch: 900 | Loss: 0.8267
Epoch: 3 | Batch: 1000 | Loss: 0.5518
Epoch: 3 | Batch: 1100 | Loss: 0.6138
Epoch: 3 | Batch: 1200 | Loss: 0.6401
Epoch: 3 | Batch: 1300 | Loss: 1.0996
[2025-04-05 03:35:22,512] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2025-04-05 03:35:22,518] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
Epoch: 3 | Batch: 1400 | Loss: 0.7559
Epoch: 3 | Batch: 1500 | Loss: 0.8115
Epoch: 4 | Batch: 0 | Loss: 0.4153
Epoch: 4 | Batch: 100 | Loss: 0.8018
Epoch: 4 | Batch: 200 | Loss: 0.5034
Epoch: 4 | Batch: 300 | Loss: 0.7324
Epoch: 4 | Batch: 400 | Loss: 0.5508
Epoch: 4 | Batch: 500 | Loss: 0.6274
Epoch: 4 | Batch: 600 | Loss: 0.4768
Epoch: 4 | Batch: 700 | Loss: 0.6201
[2025-04-05 03:35:29,241] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2025-04-05 03:35:29,247] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
Epoch: 4 | Batch: 800 | Loss: 0.9185
Epoch: 4 | Batch: 900 | Loss: 0.7114
Epoch: 4 | Batch: 1000 | Loss: 0.4934
Epoch: 4 | Batch: 1100 | Loss: 0.5396
Epoch: 4 | Batch: 1200 | Loss: 0.5234
Epoch: 4 | Batch: 1300 | Loss: 1.0225
Epoch: 4 | Batch: 1400 | Loss: 0.6353
Epoch: 4 | Batch: 1500 | Loss: 0.6787
Epoch: 5 | Batch: 0 | Loss: 0.3416
Epoch: 5 | Batch: 100 | Loss: 0.6562
Epoch: 5 | Batch: 200 | Loss: 0.3862
[2025-04-05 03:35:35,900] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2025-04-05 03:35:35,905] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
Epoch: 5 | Batch: 300 | Loss: 0.6436
Epoch: 5 | Batch: 400 | Loss: 0.5190
Epoch: 5 | Batch: 500 | Loss: 0.4709
Epoch: 5 | Batch: 600 | Loss: 0.3555
Epoch: 5 | Batch: 700 | Loss: 0.5088
Epoch: 5 | Batch: 800 | Loss: 0.8574
Epoch: 5 | Batch: 900 | Loss: 0.5903
Epoch: 5 | Batch: 1000 | Loss: 0.4016
Epoch: 5 | Batch: 1100 | Loss: 0.4583
Epoch: 5 | Batch: 1200 | Loss: 0.4756
[2025-04-05 03:35:42,493] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2025-04-05 03:35:42,499] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
Epoch: 5 | Batch: 1300 | Loss: 0.8867
Epoch: 5 | Batch: 1400 | Loss: 0.4451
Epoch: 5 | Batch: 1500 | Loss: 0.5991
Epoch: 6 | Batch: 0 | Loss: 0.3013
Epoch: 6 | Batch: 100 | Loss: 0.5547
Epoch: 6 | Batch: 200 | Loss: 0.4202
Epoch: 6 | Batch: 300 | Loss: 0.5542
Epoch: 6 | Batch: 400 | Loss: 0.5361
Epoch: 6 | Batch: 500 | Loss: 0.4175
Epoch: 6 | Batch: 600 | Loss: 0.2228
[2025-04-05 03:35:49,174] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2025-04-05 03:35:49,179] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
Epoch: 6 | Batch: 700 | Loss: 0.4939
Epoch: 6 | Batch: 800 | Loss: 0.7461
Epoch: 6 | Batch: 900 | Loss: 0.4568
Epoch: 6 | Batch: 1000 | Loss: 0.3052
Epoch: 6 | Batch: 1100 | Loss: 0.3608
Epoch: 6 | Batch: 1200 | Loss: 0.3230
Epoch: 6 | Batch: 1300 | Loss: 0.7437
Epoch: 6 | Batch: 1400 | Loss: 0.3889
Epoch: 6 | Batch: 1500 | Loss: 0.4861
Epoch: 7 | Batch: 0 | Loss: 0.2820
[2025-04-05 03:35:56,089] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2025-04-05 03:35:56,095] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
Epoch: 7 | Batch: 100 | Loss: 0.4968
Epoch: 7 | Batch: 200 | Loss: 0.4492
Epoch: 7 | Batch: 300 | Loss: 0.4956
Epoch: 7 | Batch: 400 | Loss: 0.4265
Epoch: 7 | Batch: 500 | Loss: 0.3296
Epoch: 7 | Batch: 600 | Loss: 0.2382
Epoch: 7 | Batch: 700 | Loss: 0.3796
Epoch: 7 | Batch: 800 | Loss: 0.7739
Epoch: 7 | Batch: 900 | Loss: 0.3135
Epoch: 7 | Batch: 1000 | Loss: 0.3032
[2025-04-05 03:36:02,665] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2025-04-05 03:36:02,672] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
Epoch: 7 | Batch: 1100 | Loss: 0.3655
Epoch: 7 | Batch: 1200 | Loss: 0.2328
Epoch: 7 | Batch: 1300 | Loss: 0.5957
Epoch: 7 | Batch: 1400 | Loss: 0.3350
Epoch: 7 | Batch: 1500 | Loss: 0.3853
Epoch: 8 | Batch: 0 | Loss: 0.1986
Epoch: 8 | Batch: 100 | Loss: 0.3679
Epoch: 8 | Batch: 200 | Loss: 0.4087
Epoch: 8 | Batch: 300 | Loss: 0.4595
Epoch: 8 | Batch: 400 | Loss: 0.3025
Epoch: 8 | Batch: 500 | Loss: 0.2805
[2025-04-05 03:36:09,754] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2025-04-05 03:36:09,761] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
Epoch: 8 | Batch: 600 | Loss: 0.1461
Epoch: 8 | Batch: 700 | Loss: 0.2495
Epoch: 8 | Batch: 800 | Loss: 0.7319
Epoch: 8 | Batch: 900 | Loss: 0.2815
Epoch: 8 | Batch: 1000 | Loss: 0.2094
Epoch: 8 | Batch: 1100 | Loss: 0.2241
Epoch: 8 | Batch: 1200 | Loss: 0.2345
Epoch: 8 | Batch: 1300 | Loss: 0.5073
Epoch: 8 | Batch: 1400 | Loss: 0.3091
Epoch: 8 | Batch: 1500 | Loss: 0.3027
[2025-04-05 03:36:16,435] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2025-04-05 03:36:16,440] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
Epoch: 9 | Batch: 0 | Loss: 0.2148
Epoch: 9 | Batch: 100 | Loss: 0.2515
Epoch: 9 | Batch: 200 | Loss: 0.3958
Epoch: 9 | Batch: 300 | Loss: 0.3501
[2025-04-05 03:36:19,196] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, reducing to 16384
Epoch: 9 | Batch: 400 | Loss: 0.3362
Epoch: 9 | Batch: 500 | Loss: 0.2739
Epoch: 9 | Batch: 600 | Loss: 0.2272
Epoch: 9 | Batch: 700 | Loss: 0.2355
Epoch: 9 | Batch: 800 | Loss: 0.5708
Epoch: 9 | Batch: 900 | Loss: 0.1432
Epoch: 9 | Batch: 1000 | Loss: 0.2303
Epoch: 9 | Batch: 1100 | Loss: 0.2595
Epoch: 9 | Batch: 1200 | Loss: 0.1442
Epoch: 9 | Batch: 1300 | Loss: 0.4683
Epoch: 9 | Batch: 1400 | Loss: 0.2491
Epoch: 9 | Batch: 1500 | Loss: 0.2827
[rank0]:[W405 03:36:27.708658550 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
[2025-04-05 03:36:28,420] [INFO] [launch.py:351:main] Process 13722 exits successfully.
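
The interleaved OVERFLOW! lines above are expected fp16 behavior rather than errors: when the scaled gradients contain inf/NaN, the step is skipped and the loss scale is reduced, and after a window of stable steps the scaler tries a larger scale again. A toy sketch of the policy (deliberately simplified; the real logic lives in DeepSpeed's loss scaler):

class ToyLossScaler:
    # Simplified dynamic loss scaling, mirroring the log messages above.
    def __init__(self, init_scale=65536, scale_window=1000, min_scale=1):
        self.scale = init_scale
        self.scale_window = scale_window
        self.min_scale = min_scale
        self.good_steps = 0

    def update(self, found_overflow):
        if found_overflow:
            # Skip the step and halve the scale (the "reducing to ..." lines).
            self.scale = max(self.scale / 2, self.min_scale)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps % self.scale_window == 0:
                self.scale *= 2  # a stable window earns a larger scale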

3. ZeRO Stage Comparison

| ZeRO Stage | Memory Optimization           | Communication Volume | Use Case            |
|------------|-------------------------------|----------------------|---------------------|
| Stage 0    | None (plain data parallelism) | Baseline             | Baseline comparison |
| Stage 1    | Optimizer states partitioned  | Same as baseline     | Medium-sized models |
| Stage 2    | + gradients partitioned       | Same as baseline     | Large models        |
| Stage 3    | + parameters partitioned      | ~1.5x baseline       | Very large models   |
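
For reference, here is a sketch of what a stage-3 configuration with CPU offload might look like; the field names follow DeepSpeed's ZeRO-3 documentation, while the values are illustrative rather than tuned.

{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 1000000000,
    "stage3_prefetch_bucket_size": 50000000,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}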

V. Selection Guide
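
As a rough rule of thumb, the decision comes down to a few questions about scale. The helper below is purely illustrative; the thresholds are chosen for exposition, and real choices also depend on interconnect bandwidth and GPU memory.

def choose_strategy(num_nodes, gpus_per_node, model_params_billions):
    # Illustrative thresholds only -- not a definitive policy.
    if model_params_billions >= 10:
        return "DeepSpeed + ZeRO-3"
    if num_nodes > 1:
        return "DDP launched with torchrun"
    if gpus_per_node > 1:
        return "DDP (DP only for quick prototyping)"
    return "single-GPU training"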

VI. Summary and Outlook

This article covered three distributed training approaches:

  1. DP: good for quick prototyping, but with clear performance bottlenecks

  2. DDP: the first choice for production, with multi-node, multi-GPU support

  3. DeepSpeed: built for very large models, with ZeRO memory optimizations

Typical use cases

  • Single machine with up to 4 GPUs: DDP

  • Multi-node training: DDP + torchrun

  • Models with 10B+ parameters: DeepSpeed + ZeRO-3

In the next article, we will take a close look at attention mechanisms, tracing the full evolution from Seq2Seq to Transformer and the core ideas behind modern deep learning architectures.
