feat:support exporting fp8 weights in FP8 E2E#6045
feat:support exporting fp8 weights in FP8 E2E#6045eternally-z wants to merge 1 commit intoverl-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces support for FP8 weight export and rollout by adding export_weight_dtype and skip_mid_quantization configurations. It updates the Megatron engine and workers to handle FP8 weight initialization and optimizer parameter reloading, and ensures the Prometheus metrics directory is created on Ray workers. Feedback was provided regarding redundant validation and normalization of the export weight data type in the optimizer initialization logic, suggesting the use of the already validated value stored in the bridge.
| export_weight_dtype = self.engine_config.export_weight_dtype | ||
| if isinstance(export_weight_dtype, str): | ||
| export_weight_dtype = export_weight_dtype.lower() | ||
| if export_weight_dtype == "fp8" and self.bridge is not None: |
There was a problem hiding this comment.
The normalization and validation of export_weight_dtype is redundant here. It has already been processed and validated in _build_tf_config (lines 238-245) and stored in self.bridge.export_weight_dtype. Repeating this logic here is redundant and could lead to inconsistencies if the validation logic changes in one place but not the other.
| export_weight_dtype = self.engine_config.export_weight_dtype | |
| if isinstance(export_weight_dtype, str): | |
| export_weight_dtype = export_weight_dtype.lower() | |
| if export_weight_dtype == "fp8" and self.bridge is not None: | |
| if self.bridge is not None and getattr(self.bridge, "export_weight_dtype", None) == "fp8": |
| export_weight_dtype = self.config.actor.megatron.get("export_weight_dtype", "bf16") | ||
| if isinstance(export_weight_dtype, str): | ||
| export_weight_dtype = export_weight_dtype.lower() | ||
| if export_weight_dtype not in ("bf16", "fp16", "fp8"): | ||
| export_weight_dtype = "bf16" | ||
| if export_weight_dtype == "fp8" and self.bridge is not None: |
There was a problem hiding this comment.
This block redundantly re-fetches and re-normalizes export_weight_dtype from the configuration. This logic is already handled in _init_hf_config_and_tf_config (lines 239-246), where the value is validated (raising ValueError for invalid inputs) and stored in self.bridge.export_weight_dtype. Additionally, the fallback to "bf16" at line 534 is inconsistent with the ValueError raised earlier in the initialization flow. Using the stored value from the bridge is safer and cleaner.
| export_weight_dtype = self.config.actor.megatron.get("export_weight_dtype", "bf16") | |
| if isinstance(export_weight_dtype, str): | |
| export_weight_dtype = export_weight_dtype.lower() | |
| if export_weight_dtype not in ("bf16", "fp16", "fp8"): | |
| export_weight_dtype = "bf16" | |
| if export_weight_dtype == "fp8" and self.bridge is not None: | |
| if self.bridge is not None and getattr(self.bridge, "export_weight_dtype", None) == "fp8": |
What does this PR do?
we support exporting fp8 weights in Megatron-Bridge. This pr enable verl using the feature.
Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,veomni,sglang,vllm,vllm_omni,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,fully_async,one_step_off,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)recipesubmodule, please also update the reference to the submodule commit viagit submodule update --remoteorcd recipe && git pull origin main.