
call check_quantized_moe_compatibility after initialize#13876

Merged
ispobock merged 11 commits into sgl-project:main from chunyuan-w:chunyuan/fix_fp8_check
Dec 13, 2025

Conversation

@chunyuan-w
Contributor

Motivation

Fixes the error when running DeepSeek-V3.1-Terminus-FP8 with TP=6 on CPU.

  File "/sglang/srt/model_executor/model_runner.py", line 306, in __init__
    self.check_quantized_moe_compatibility()
  File "/sglang/srt/model_executor/model_runner.py", line 597, in check_quantized_moe_compatibility
    raise ValueError(
ValueError: moe_intermediate_size 2048 must be divisible by moe_tp_size (6) which is tp_size (6) divided by moe_ep_size (1).
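The failing check boils down to a divisibility test. Below is a minimal standalone sketch of that test; `check_moe_divisibility` is a hypothetical name for illustration, while the real check lives in `ModelRunner.check_quantized_moe_compatibility`:

```python
# Hypothetical standalone version of the divisibility check behind the error.
def check_moe_divisibility(moe_intermediate_size, tp_size, moe_ep_size=1):
    moe_tp_size = tp_size // moe_ep_size
    if moe_intermediate_size % moe_tp_size != 0:
        raise ValueError(
            f"moe_intermediate_size {moe_intermediate_size} must be divisible by "
            f"moe_tp_size ({moe_tp_size}) which is tp_size ({tp_size}) divided by "
            f"moe_ep_size ({moe_ep_size})."
        )

# DeepSeek-V3.1's moe_intermediate_size of 2048 is not divisible by 6,
# so with TP=6 the check fires before any padding has been applied:
try:
    check_moe_divisibility(2048, 6)
except ValueError as e:
    print(e)
```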

Modifications

Move check_quantized_moe_compatibility() to be after self.initialize(min_per_gpu_memory).

On CPU, we pad moe_intermediate_size if it isn't divisible by tp_size, inside self.initialize(min_per_gpu_memory):

if self.device == "cpu":
    self.model_config = adjust_config_with_unaligned_cpu_tp(
        self.model_config, self.load_config, self.tp_size
    )

We need to call check_quantized_moe_compatibility() after this padding; otherwise we hit the error above.
We cannot move the adjust_config_with_unaligned_cpu_tp() call earlier, because it requires self.load_config to be set first, and that happens inside self.initialize(min_per_gpu_memory) here:
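The padding idea can be sketched as rounding the intermediate size up to the next multiple of tp_size so every rank gets an equal shard. `pad_to_multiple` below is a hypothetical helper; the real logic lives in `adjust_config_with_unaligned_cpu_tp`:

```python
# Illustrative padding helper (not the actual sglang implementation).
def pad_to_multiple(size: int, tp_size: int) -> int:
    # Ceiling-divide, then scale back up to the nearest multiple of tp_size.
    return -(-size // tp_size) * tp_size

# 2048 is padded to 2052, which is evenly divisible by 6.
print(pad_to_multiple(2048, 6))
```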

self.load_config = LoadConfig(
    load_format=self.server_args.load_format,
    download_dir=self.server_args.download_dir,
    model_loader_extra_config=self.server_args.model_loader_extra_config,
    tp_rank=self.tp_rank,
    remote_instance_weight_loader_seed_instance_ip=self.server_args.remote_instance_weight_loader_seed_instance_ip,
    remote_instance_weight_loader_seed_instance_service_port=self.server_args.remote_instance_weight_loader_seed_instance_service_port,
    remote_instance_weight_loader_send_weights_group_ports=self.server_args.remote_instance_weight_loader_send_weights_group_ports,
    modelopt_config=modelopt_config,
)

@chunyuan-w chunyuan-w marked this pull request as ready for review November 25, 2025 05:03
@chunyuan-w
Contributor Author

/rerun-failed-ci

@chunyuan-w
Contributor Author

@zhyncs @Alcanderian could you please help review this PR? The CI failures are unrelated.

@Fridge003
Collaborator

maybe cc @JustinTong0323

@chunyuan-w
Contributor Author

Hi @JustinTong0323 could you please take a look at this PR?

@chunyuan-w
Contributor Author

Hi @Alcanderian I checked that the CI failures are unrelated to this PR. Could you please help land this PR?

@ispobock ispobock merged commit 2a39cfe into sgl-project:main Dec 13, 2025
20 of 33 checks passed
Liwansi added a commit to iforgetmyname/sglang that referenced this pull request Dec 13, 2025
…n_eagle3_npu

* 'main' of https://github.com/sgl-project/sglang: (25 commits)
  [NPU] perf update with kvcache nz & w4a8 quant (sgl-project#14423)
  [PP Prefill][NIXL] Fix PP mode transfer completion tracking to wait for all ranks (sgl-project#15027)
  Fix GLM-4.6 tool calls don't support streaming output for arguments i… (sgl-project#13989)
  feature: adding nightly wheel workflow and indexer (sgl-project#14924)
  [diffusion] feat: Improve LoRA compatibility by adding unified format detection and diffusers-based normalization (sgl-project#14659)
  [Fix] Disable trtllm moe backend for draft model for a qucik fix (sgl-project#15002)
  [diffusion] fix: use NDRotaryEmbedding in flux_2   (sgl-project#15034)
  Mistral Large 3 NVFP4 support (sgl-project#14485)
  call check_quantized_moe_compatibility after initialize (sgl-project#13876)
  Add sgl_router_attempt_http_responses_total for single attempt information (sgl-project#15037)
  Add error code in prometheus metrics and add X-SMG-Error-Code header (sgl-project#15036)
  Provide more fine grained error reason for reqwest error (sgl-project#15032)
  Tiny change http router response format to unify (sgl-project#15031)
  Tiny unify grpc existing error responses into new format (sgl-project#15030)
  Add `code` field and unify error responses for router (sgl-project#15028)
  Super tiny remove unused log_request (sgl-project#15035)
  Fix decode OOM caused by retraction (sgl-project#14939)
  [CI]Add gb200 runner back (sgl-project#15024)
  Add a special label for b200 CI runner that can run kernel tests (sgl-project#15033)
  Fix regression caused by fa3 block_table (sgl-project#15009)
  ...

# Conflicts:
#	python/sglang/srt/hardware_backend/npu/attention/ascend_backend.py
Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 17, 2025
GuoYechang pushed a commit to GuoYechang/sglang that referenced this pull request Jan 13, 2026
