[fsdp] fix: make GDN work correctly when using ulysses and varlen sequence#5346
[fsdp] fix: make GDN work correctly when using ulysses and varlen sequence#5346nono-Sang wants to merge 1 commit intoverl-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces monkey patches for the qwen3-next model to make GatedDeltaNet compatible with Ulysses sequence parallelism and variable-length sequences. The changes involve gathering the full sequence within the patched GatedDeltaNet when sequence parallelism is used, and passing cu_seqlens to handle variable-length inputs. The trainer code is also updated to conditionally handle the validation dataset. The core logic for monkey-patching and handling sequence parallelism appears correct. However, I've identified a critical issue in the trainer logic that could lead to a crash during validation if a validation dataset is not provided but validation frequency is configured.
|
|
||
| # early exit or validation step | ||
| if is_last_step or (self.config.trainer.test_freq > 0 and is_valid_step): | ||
| if self.config.trainer.test_freq > 0 and (is_last_step or is_valid_step): |
There was a problem hiding this comment.
There is a potential TypeError here. If self.config.trainer.test_freq > 0 but no validation dataset is provided (self.val_dataloader will be None), the code will enter this if block and then fail at the loop for val_data in self.val_dataloader:. This can happen if config.trainer.test_freq > 0 is set in the configuration but config.data.val_files is None. To prevent this crash, you should add a check to ensure self.val_dataloader is not None before attempting to iterate over it.
| if self.config.trainer.test_freq > 0 and (is_last_step or is_valid_step): | |
| if self.config.trainer.test_freq > 0 and self.val_dataloader is not None and (is_last_step or is_valid_step): |
What does this PR do?
For the qwen3-next model, the original SFT implementation has the following issues:
GatedDeltaNetis cross-tokens, sequence parallelism cannot be directly used.GatedDeltaNetintransformersdoes not take into account varlen sequence.Therefore, I made the following fixes through the monkey patch:
cu_seqlensparameter toGatedDeltaNet.GatedDeltaNet, and obtain the corresponding fragments after completing the calculation. This ensures thatGatedDeltaNetcan work correctly whenulyssesanduse_remove_paddingis turned on.Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,veomni,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)recipesubmodule, please also update the reference to the submodule commit viagit submodule update --remoteorcd recipe && git pull origin main.