MoE Echo is a research prototype of a new MoE training paradigm that targets large-scale distributed training. It focuses on achieving load balance, sync-free execution, and fully CUDA-graph-capturable dropless MoE training.
Concretely, MoE Echo aims to:
- ⚖️ Reduce expert load imbalance across Expert Parallel (EP) ranks.
- ⏱️ Remove host-side synchronization under dynamic routing in dropless MoE.
- 📊 Enable CUDA-graph-capturable MoE with minimal compute and memory fragmentation.
In token-dropless MoE, the number of tokens sent to each EP rank can vary significantly from step to step. The routing decisions (and thus the per-rank shapes) are produced on the GPU, but the host traditionally needs this shape information to:
- launch dispatch/combine and grouped GEMM kernels, and
- allocate sufficient memory for these kernels.
Naively, this requires device-to-host copies and host-side synchronization on every step, which both slows down training and makes CUDA graph capture difficult.
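As an illustration of this naive pattern, here is a PyTorch-style sketch (the function and tensor names are hypothetical, not MoE Echo's API) in which the host pulls routing-derived counts off the device before it can allocate buffers and launch kernels:

```python
import torch

def naive_moe_dispatch(tokens, routing_map, num_local_experts):
    """Naive dropless dispatch: the host must see per-expert token counts
    before it can size buffers and launch the expert kernels."""
    # Per-expert token counts are produced on the GPU by routing.
    tokens_per_expert = routing_map.sum(dim=0)  # device tensor, [num_local_experts]

    # Device-to-host copy: .tolist() blocks until routing has finished.
    # This per-step sync stalls the CPU and prevents CUDA graph capture.
    counts_host = tokens_per_expert.tolist()

    expert_inputs = []
    for e in range(num_local_experts):
        # Exact-size allocation requires host-visible counts every step.
        buf = torch.empty(counts_host[e], tokens.size(-1),
                          device=tokens.device, dtype=tokens.dtype)
        expert_inputs.append(buf)
    return expert_inputs
```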
To build a sync-free, CUDA-graph-friendly MoE, we:
- Pre-allocate GPU buffers and decide kernel launches without waiting for host-visible shapes.
- Avoid excessive over-provisioning of buffers, which would otherwise cause:
- Compute fragmentation (wasted compute/communication on padded tokens).
- Memory fragmentation (oversized static buffers).
MoE Echo tackles this by:
- Reducing compute fragmentation: GPU kernels consume routing/shape information that stays on the device and operate only on the true token volume. For example, HybridEP reads shapes directly from the routing map on the GPU.
- Reducing memory fragmentation: We reduce load imbalance across EP ranks and manage memory more efficiently inside CUDA graphs, so that the pre-allocated buffers are better utilized (see the sketch below).
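A minimal sketch of the pre-allocation idea, assuming a simple capacity-factor sizing rule and illustrative function names (not the actual MoE Echo implementation): buffers are sized once outside the graph, while the step-dependent shapes stay on the device for the kernels to consume.

```python
import torch

def build_recv_buffer(max_tokens_per_rank, hidden_size, capacity_factor, device):
    """Size the receive buffer once, outside the CUDA graph, from a
    capacity factor instead of the step-dependent true token count."""
    max_recv = int(max_tokens_per_rank * capacity_factor)
    return torch.empty(max_recv, hidden_size, device=device, dtype=torch.bfloat16)

def device_side_shapes(routing_map):
    """Keep routing-derived shapes on the GPU so the host never syncs.
    Dispatch and grouped-GEMM kernels read these tensors on device and
    touch only the true token volume inside the oversized buffer."""
    tokens_per_expert = routing_map.sum(dim=0)              # device tensor
    expert_offsets = torch.cumsum(tokens_per_expert, dim=0) # device tensor
    return tokens_per_expert, expert_offsets
```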
To further reduce expert load imbalance, MoE Echo introduces elastic cloning for hot experts (ECHO). The key idea is to dynamically clone high-traffic (“hot”) experts onto EP ranks that receive fewer-than-average tokens.
Cloning experts during training is challenging because expert weights and gradients must remain coherent across all clones at every step. This means:
- Synchronizing cloned expert parameters and gradients.
- Carefully limiting the number of cloned experts to balance extra communication cost against the load-balance benefit.
MoE Echo addresses this with:
- An ECHO planner that decides which popular experts to clone and where to place them, given the spare expert slots on each EP rank (a simplified planner heuristic is sketched after this list).
- An ECHO dispatcher that:
- dispatches tokens to the appropriate cloned experts and their spare slots according to the plan, and
- during backward, handles any necessary re-dispatch when spare slots are shared across layers, and then combines/reduces gradients from all cloned copies into the main expert.
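For intuition, here is a simplified greedy planner heuristic (purely illustrative; the actual ECHO planner and its inputs are not specified here): it clones the hottest experts into spare slots on the least-loaded EP ranks.

```python
from typing import Dict, List, Tuple

def plan_echo_clones(expert_load: Dict[int, int],
                     rank_load: Dict[int, int],
                     spare_slots: Dict[int, int],
                     num_clones: int) -> List[Tuple[int, int]]:
    """Greedy sketch: clone the hottest experts onto the least-loaded
    EP ranks that still have spare expert slots.

    Returns a list of (expert_id, target_rank) placements.
    """
    placements = []
    hot_experts = sorted(expert_load, key=expert_load.get, reverse=True)
    for expert_id in hot_experts[:num_clones]:
        # Pick the least-loaded rank that still has a free slot.
        candidates = [r for r, s in spare_slots.items() if s > 0]
        if not candidates:
            break
        target = min(candidates, key=lambda r: rank_load[r])
        placements.append((expert_id, target))
        spare_slots[target] -= 1
        # Assume roughly half of the expert's traffic moves to the clone
        # (source-rank relief is omitted in this sketch).
        rank_load[target] += expert_load[expert_id] // 2
    return placements
```

A real planner would also weigh the parameter/gradient synchronization cost of each additional clone against the expected load-balance gain, as noted above.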
Install the hybrid-ep branch of DeepEP:

```bash
git clone https://github.com/deepseek-ai/DeepEP.git
cd DeepEP && git checkout hybrid-ep
TORCH_CUDA_ARCH_LIST="10.0" pip install -e .
```
Note that this kernel is only available for Blackwell GPUs.
Install the cutlass_device_grouped_gemm branch of TransformerEngine:

```bash
git clone https://github.com/QiZhangNV/TransformerEngine.git
cd TransformerEngine && git checkout cutlass_device_grouped_gemm
git submodule update --init --recursive
NVTE_CUDA_ARCHS="100a" NVTE_BUILD_THREADS_PER_JOB=8 NVTE_FRAMEWORK=pytorch pip install --no-cache-dir --no-build-isolation .
```
Add the following flags to your training command line to enable MoE Echo:
```bash
--moe-enable-echo
--moe-num-echo-experts 32                    # total number of echo experts
--moe-echo-expert-dispatcher-type hybridep   # only hybridep supports sync-free dispatch
--moe-received-token-capacity 2.0            # capacity of total received tokens on each EP rank (if not set, the sync version will be used)
--moe-use-device-initiated-grouped-gemm      # use device-initiated grouped GEMM (only available for MXFP8 GEMM on Blackwell GPUs)
--fp8-format e4m3
--fp8-recipe mxfp8
--fp8-param-gather
--reuse-grad-buf-for-mxfp8-param-ag
--moe-echo-recompute-expert-dispatch         # recompute expert dispatch so that the echo expert buffer is shared across layers

# Enable CUDA graph
--enable-cuda-graph
--cuda-graph-scope full_iteration
--te-rng-tracker
```
- Sync-free GroupedGEMM
- Sync-free token and expert dispatcher
- Planner for MoE Echo
- Expert Dispatcher for MoE Echo
- Full-iteration CUDA Graph
- E2E examples
- Add E2E performance benchmark
- Add expert dispatch overlapping
- Activation stashing to reduce memory fragmentation
- Activation CPU offloading
- Contributors (equal contribution, sorted alphabetically): Ahan Huang, Dennis Liu, Nan Zheng, Patrick Haft, Qi Zhang, Robin Zhang, Tong Liu, Zijie Yan