Add MuonW optimizer: Muon with AdamW fallback for non-matrix parameters #882
Open
JenWei0312 wants to merge 1 commit into allenai:main
This PR adds the MuonW optimizer to OLMo, implementing the Muon optimization algorithm with AdamW fallback for non-matrix parameters.

Key features:
- Implements Muon's Newton-Schulz orthogonalization for matrix parameters (2D+)
- Falls back to AdamW for scalar/vector parameters and embeddings/heads
- Fully compatible with distributed training (FSDP)
- Includes comprehensive metric tracking for monitoring

Implementation details:
- Based on the original Muon paper and reference implementation
- Adds distributed metric collection and reduction
- Handles conv filters through reshaping
- Supports selective weight updates and gradient clipping

Testing:
- Tested on single GPU/CPU with comprehensive test suite
- Mock tests verify distributed code paths
- Convergence verified on regression tasks

Happy to add config integration if there's interest. Tested locally - all core functionality working.
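For readers unfamiliar with Muon, the routing and orthogonalization described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: it uses plain Python lists so it runs without dependencies (a real optimizer would operate on `torch` tensors), the function names `newton_schulz`, `flatten_to_matrix`, and `choose_update_rule` are hypothetical, and the `"embed"`/`"head"` name hints are an assumed way of detecting embeddings and output heads. The quintic coefficients are the ones commonly quoted from the public Muon reference code; the values used in this PR may differ.

```python
import math

# Quintic Newton-Schulz coefficients as quoted from the public Muon reference
# code (assumption: the PR may use different values).
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz(grad, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D gradient.

    Each step applies X <- a*X + (b*A + c*A@A) @ X with A = X @ X^T, which
    pushes the singular values of X toward 1 without computing an SVD.
    """
    rows, cols = len(grad), len(grad[0])
    # Work on the wide orientation so the Gram matrix X @ X^T stays small.
    x = transpose(grad) if rows > cols else [list(r) for r in grad]
    # Normalize by the Frobenius norm so all singular values start in [0, 1].
    norm = math.sqrt(sum(v * v for row in x for v in row)) + eps
    x = [[v / norm for v in row] for row in x]
    a, b, c = NS_COEFFS
    for _ in range(steps):
        gram = matmul(x, transpose(x))
        gram_sq = matmul(gram, gram)
        poly = [[b * g + c * g2 for g, g2 in zip(rg, rg2)]
                for rg, rg2 in zip(gram, gram_sq)]
        px = matmul(poly, x)
        x = [[a * v + p for v, p in zip(rx, rp)] for rx, rp in zip(x, px)]
    return transpose(x) if rows > cols else x


def flatten_to_matrix(shape):
    """View a conv filter shape (out, in, kh, kw) as a 2D shape (out, in*kh*kw)."""
    return (shape[0], math.prod(shape[1:]))


def choose_update_rule(name, shape, adamw_name_hints=("embed", "head")):
    """Route a parameter to 'muon' or 'adamw'.

    Scalars, vectors, and parameters whose name matches a hint (hypothetical
    heuristic for embeddings/heads) fall back to AdamW; other 2D+ tensors use Muon.
    """
    if len(shape) < 2 or any(hint in name for hint in adamw_name_hints):
        return "adamw"
    return "muon"
```

The result of `newton_schulz` is only approximately orthogonal: after five steps the singular values land near 1 rather than exactly on it, which is the usual trade-off for avoiding an SVD on every step.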
Author
Hi team, just wanted to gently follow up on this PR for the MuonW optimizer. I know you're all very busy, so no rush at all. Please let me know if there are any questions, changes, or additional tests I can provide from my end to help move the review process along. Thanks for your time and for maintaining this great project!
Contributor
Hi there, thanks for your contribution and interest! We apologize for the delay in response to your PR - we are indeed at a busy time of year. We will take a look at this as soon as we can!