Ref
- DeepSeek-V3 inference Gate implementation: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py
- deepseek-ai/EPLB (Expert Parallelism Load Balancer): https://github.com/deepseek-ai/EPLB
- DeepSeek-V3 Technical Report
DeepSeek's Routing Method
```python
from typing import Tuple

import torch
from torch import nn

# `linear` and `ModelArgs` are defined earlier in the same inference/model.py file.

class Gate(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.dim = args.dim
        self.topk = args.n_activated_experts
        self.n_groups = args.n_expert_groups
        self.topk_groups = args.n_limited_groups
        self.score_func = args.score_func
        self.route_scale = args.route_scale
        self.weight = nn.Parameter(torch.empty(args.n_routed_experts, args.dim))
        self.bias = nn.Parameter(torch.empty(args.n_routed_experts)) if self.dim == 7168 else None

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        scores = linear(x, self.weight)
        if self.score_func == "softmax":
            scores = scores.softmax(dim=-1, dtype=torch.float32)
        else:
            scores = scores.sigmoid()
        original_scores = scores
        if self.bias is not None:
            scores = scores + self.bias
        if self.n_groups > 1:  # n_groups = n_expert_groups = 8
            scores = scores.view(x.size(0), self.n_groups, -1)
            if self.bias is None:
                group_scores = scores.amax(dim=-1)
            else:
                group_scores = scores.topk(2, dim=-1)[0].sum(dim=-1)
            indices = group_scores.topk(self.topk_groups, dim=-1)[1]  # topk_groups = n_limited_groups = 4
            mask = torch.zeros_like(scores[..., 0]).scatter_(1, indices, True)
            scores = (scores * mask.unsqueeze(-1)).flatten(1)  # zero out the scores of non-selected groups
        indices = torch.topk(scores, self.topk, dim=-1)[1]  # topk = n_activated_experts = 8
        weights = original_scores.gather(1, indices)
        if self.score_func == "sigmoid":
            weights /= weights.sum(dim=-1, keepdim=True)
        weights *= self.route_scale
        return weights.type_as(x), indices
```
Each MoE layer has 1 shared expert and 256 routed experts; for every token, 8 of the 256 routed experts are selected for the actual computation.
Conventional MoE routing simply takes the top-k experts by score, so the chosen experts can be scattered almost arbitrarily across all experts.
DeepSeek instead splits the 256 routed experts into n_expert_groups = 8 contiguous groups, keeps the n_limited_groups = 4 groups with the highest group scores (the per-group maximum score, or the sum of the top-2 scores when the gating bias is used), and zeroes out the scores of the remaining groups. The final topk = 8 experts are therefore confined to those 4 groups, although there is no constraint on how many experts are picked from each group. This limits, to some extent, how scattered the top-k selection can be.
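To make the group limiting concrete, here is a small toy run of the same selection logic with made-up sizes (16 routed experts, 4 groups of 4, keep 2 groups, pick top-4); it follows the no-bias branch of the Gate code above, and the numbers are purely illustrative.

```python
import torch

n_experts, n_groups, topk_groups, topk = 16, 4, 2, 4   # toy config, not DeepSeek-V3's 256/8/4/8
scores = torch.rand(1, n_experts)                      # (batch=1, 16) gate scores after sigmoid

grouped = scores.view(1, n_groups, -1)                 # (1, 4, 4): 4 contiguous groups of 4 experts
group_scores = grouped.amax(dim=-1)                    # best score in each group (no-bias branch)
kept_groups = group_scores.topk(topk_groups, dim=-1)[1]   # the 2 strongest groups survive

mask = torch.zeros(1, n_groups).scatter_(1, kept_groups, 1.0)
masked = (grouped * mask.unsqueeze(-1)).flatten(1)     # scores of the other 2 groups are zeroed

experts = masked.topk(topk, dim=-1)[1]                 # final top-4 experts, all inside the kept groups
print(kept_groups)                                     # e.g. tensor([[3, 0]])
print(experts // (n_experts // n_groups))              # group id of each selected expert: always one of kept_groups
```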
Expert-Parallel Load Balancer
- Core problem: for a given MoE model, some experts are naturally high-load ("hot"), so the expert computation load ends up unbalanced across GPUs.
- Optimization objective: balance the expert computation on every GPU, i.e. minimize the maximum dispatch volume received by any single GPU. To achieve load balancing among different experts in the MoE part, we need to ensure that each GPU processes approximately the same number of tokens.
Moreover, thanks to the group-limited expert routing used in DeepSeek-V3, we also attempt to place the experts of the same group to the same node to reduce inter-node data traffic, whenever possible.
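To make the "minimize the maximum dispatch volume" objective concrete, here is a small sketch (the function name and the uniform expert placement are assumptions for illustration) that counts how many token-expert dispatches each GPU would receive under a given placement:

```python
import torch

def gpu_dispatch_load(indices: torch.Tensor, expert_to_gpu: torch.Tensor, n_gpus: int) -> torch.Tensor:
    """Count the (token, expert) dispatches landing on each GPU.

    indices:       (n_tokens, topk) routed expert ids produced by the gate
    expert_to_gpu: (n_experts,) GPU hosting each expert under the current placement
    """
    return torch.bincount(expert_to_gpu[indices.flatten()], minlength=n_gpus)

# Toy placement: 256 experts spread uniformly over 32 GPUs (8 experts each).
n_experts, n_gpus, topk = 256, 32, 8
expert_to_gpu = torch.arange(n_experts) // (n_experts // n_gpus)
indices = torch.randint(0, n_experts, (4096, topk))     # stand-in for real gate output

load = gpu_dispatch_load(indices, expert_to_gpu, n_gpus)
print("max / mean load:", (load.max() / load.float().mean()).item())
```

A placement is better the closer max/mean gets to 1; EPLB additionally tries to keep each group's experts on one node, which the hierarchical policy below formalizes.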
Prefill Stage: Hierarchical Load Balancing
Prefill: routed experts EP32, MLA and shared expert DP32; one deployment unit is 4 nodes with 32 redundant routed experts, i.e. 9 routed experts plus 1 shared expert per GPU.
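A quick arithmetic check of these counts (assuming 8 GPUs per node, which is not stated above; the helper name is made up):

```python
def routed_experts_per_gpu(nodes: int, gpus_per_node: int = 8,
                           routed: int = 256, redundant: int = 32) -> int:
    """Routed experts hosted per GPU once the redundant replicas are spread evenly."""
    gpus = nodes * gpus_per_node
    assert (routed + redundant) % gpus == 0
    return (routed + redundant) // gpus

print(routed_experts_per_gpu(nodes=4))   # 9 routed experts per GPU in the EP32 prefill unit (+1 shared via DP)
```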
When the number of server nodes divides the number of expert groups, we use the hierarchical load balancing policy to harness the group-limited expert routing. We first pack the expert groups to nodes evenly, ensuring the loads of different nodes are balanced. Then, we replicate the experts within each node. Finally, we pack the replicated experts to individual GPUs to ensure different GPUs are load-balanced. The hierarchical load balancing policy can be used in the prefilling stage with a smaller expert-parallel size.
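Below is a greedy sketch of those three steps, written from the prose above rather than from EPLB's actual code (the real algorithm and all function names may differ): pack expert groups onto nodes, replicate each node's hottest experts, then pack the replicas onto that node's GPUs.

```python
from collections import defaultdict

def pack_to_bins(items, n_bins, weight):
    """Greedy packing with equal bin capacity: heaviest item goes to the lightest bin with room."""
    cap = -(-len(items) // n_bins)                       # ceil(len(items) / n_bins)
    bins, loads = defaultdict(list), [0.0] * n_bins
    for item in sorted(items, key=weight, reverse=True):
        b = min((i for i in range(n_bins) if len(bins[i]) < cap), key=loads.__getitem__)
        bins[b].append(item)
        loads[b] += weight(item)
    return bins

def hierarchical_balance(expert_load, n_groups, n_nodes, gpus_per_node, redundant_per_node):
    """expert_load: per-expert token counts, e.g. collected from the online service."""
    group_size = len(expert_load) // n_groups
    groups = [list(range(g * group_size, (g + 1) * group_size)) for g in range(n_groups)]

    # Step 1: pack whole expert groups onto nodes so that node loads are balanced.
    node_groups = pack_to_bins(groups, n_nodes, lambda g: sum(expert_load[e] for e in g))

    placement = {}
    for node, node_group_list in node_groups.items():
        experts = [e for g in node_group_list for e in g]
        # Step 2: replicate this node's hottest experts.
        copies = {e: 1 for e in experts}
        for e in sorted(experts, key=lambda e: expert_load[e], reverse=True)[:redundant_per_node]:
            copies[e] += 1
        replicas = [e for e in experts for _ in range(copies[e])]
        # Step 3: pack the replicas onto this node's GPUs; each replica of an
        # expert is assumed to take an equal share of that expert's load.
        placement[node] = pack_to_bins(replicas, gpus_per_node, lambda e: expert_load[e] / copies[e])
    return placement
```

With 256 experts, n_groups=8, n_nodes=4, gpus_per_node=8 and redundant_per_node=8 (32 redundant experts in total), every node receives 2 groups (64 experts) plus 8 replicas, i.e. 9 routed experts per GPU, matching the prefill deployment above.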
To achieve load balancing among different experts in the MoE part, we need to ensure that each GPU processes approximately the same number of tokens. To this end, we introduce a deployment strategy of redundant experts, which duplicates high-load experts and deploys them redundantly. The high-load experts are detected based on statistics collected during the online deployment and are adjusted periodically (e.g., every 10 minutes). After determining the set of redundant experts, we carefully rearrange experts among GPUs within a node based on the observed loads, striving to balance the load across GPUs as much as possible without increasing the cross-node all-to-all communication overhead. For the deployment of DeepSeek-V3, we set 32 redundant experts for the prefilling stage. For each GPU, besides the original 8 experts it hosts, it will also host one additional redundant expert.
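A hedged sketch of that periodic refresh (the counters and function names are invented; only the "the 32 highest-load experts become redundant, updated on an interval" logic comes from the text):

```python
import torch

N_ROUTED_EXPERTS, N_REDUNDANT = 256, 32

expert_token_count = torch.zeros(N_ROUTED_EXPERTS, dtype=torch.long)

def record_batch(indices: torch.Tensor) -> None:
    """Accumulate online routing statistics; `indices` is the (n_tokens, topk) gate output."""
    expert_token_count.add_(torch.bincount(indices.flatten(), minlength=N_ROUTED_EXPERTS))

def refresh_redundant_set() -> torch.Tensor:
    """Run periodically (e.g. every 10 minutes): the highest-load experts get a redundant replica."""
    redundant = torch.topk(expert_token_count, N_REDUNDANT).indices
    expert_token_count.zero_()          # start a fresh statistics window
    return redundant
```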
Furthermore, in the prefilling stage, to improve the throughput and hide the overhead of all-to-all and TP communication, we simultaneously process two micro-batches with similar computational workloads, overlapping the attention and MoE of one micro-batch with the dispatch and combine of another.
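The overlap can be pictured as a per-layer schedule in which micro-batch B trails micro-batch A by half a phase, so one micro-batch's attention/MoE compute always runs alongside the other's dispatch/combine. This is only a schematic of the ordering (the labels are invented), not the real kernel scheduling:

```python
def overlap_schedule(n_layers: int):
    """Return (compute_op, comm_op) pairs that run concurrently at each step."""
    compute, comm = [], []
    for i in range(n_layers):
        compute += [("A", f"attn[{i}]"), ("B", f"attn[{i}]"), ("A", f"moe[{i}]"), ("B", f"moe[{i}]")]
        comm    += [("B", f"combine[{i-1}]"),                 # step-0 entry is just a warm-up placeholder
                    ("A", f"dispatch[{i}]"), ("B", f"dispatch[{i}]"), ("A", f"combine[{i}]")]
    return list(zip(compute, comm))

for step, (c, m) in enumerate(overlap_schedule(2)):
    print(step, "compute:", c, "| all-to-all:", m)
```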
Finally, we are exploring a dynamic redundancy strategy for experts, where each GPU hosts more experts (e.g., 16 experts), but only 9 will be activated during each inference step. Before the all-to-all operation at each layer begins, we compute the globally optimal routing scheme on the fly. Given the substantial computation involved in the prefilling stage, the overhead of computing this routing scheme is almost negligible.
Decoding Stage: Global Load Balancing
Decode: routed experts EP144, MLA and shared expert DP144; one deployment unit is 18 nodes with 32 redundant routed experts, i.e. 2 routed experts plus 1 shared expert per GPU. (The technical-report excerpt below describes a different, larger decoding deployment of 40 nodes with EP320.)
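The same hypothetical helper from the prefill section reproduces the decode counts (again assuming 8 GPUs per node):

```python
print(routed_experts_per_gpu(nodes=18))   # 2 routed experts per GPU in the EP144 decode unit (+1 shared via DP)
```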
In other cases, we use the global load balancing policy that replicates the experts globally regardless of expert groups, and pack the replicated experts to individual GPUs. This policy can be adopted in the decoding stage with a larger expert-parallel size.
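A matching sketch of the global policy, reusing `pack_to_bins` from the hierarchical sketch above (again written from the prose, not from EPLB's code): replicate the globally hottest experts, ignore groups, and pack every replica directly onto the GPUs of the whole deployment unit.

```python
def global_balance(expert_load, n_gpus, n_redundant):
    """expert_load: per-expert token counts across the whole deployment unit."""
    copies = {e: 1 for e in range(len(expert_load))}
    for e in sorted(copies, key=lambda e: expert_load[e], reverse=True)[:n_redundant]:
        copies[e] += 1                                   # the globally hottest experts get a replica
    replicas = [e for e in copies for _ in range(copies[e])]
    return pack_to_bins(replicas, n_gpus, lambda e: expert_load[e] / copies[e])
```

With 256 experts, 32 redundant replicas and 144 GPUs, every GPU ends up hosting exactly 2 routed-expert replicas, matching the EP144 decode configuration above.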
During decoding, we treat the shared expert as a routed one. From this perspective, each token will select 9 experts during routing, where the shared expert is regarded as a heavy-load one that will always be selected. The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs. The attention part employs TP4 with SP, combined with DP80, while the MoE part uses EP320. For the MoE part, each GPU hosts only one expert, and 64 GPUs are responsible for hosting redundant experts and shared experts. All-to-all communication of the dispatch and combine parts is performed via direct point-to-point transfers over IB to achieve low latency. Additionally, we leverage the IBGDA (NVIDIA, 2022) technology to further minimize latency and enhance communication efficiency.
Similar to prefilling, we periodically determine the set of redundant experts in a certain interval, based on the statistical expert load from our online service. However, we do not need to rearrange experts since each GPU only hosts one expert. We are also exploring the dynamic redundancy strategy for decoding. However, this requires more careful optimization of the algorithm that computes the globally optimal routing scheme and the fusion with the dispatch kernel to reduce overhead.