Add MuonW optimizer: Muon with AdamW fallback for non-matrix parameters by JenWei0312 · Pull Request #882 · allenai/OLMo

JenWei0312 · 2025-08-31T02:24:24Z

This PR adds the MuonW optimizer to OLMo, implementing the Muon optimization algorithm with AdamW fallback for non-matrix parameters.

Key features:

- Implements Muon's Newton-Schulz orthogonalization for matrix parameters (2D+)
- Falls back to AdamW for scalar/vector parameters and embeddings/heads
- Fully compatible with distributed training (FSDP)
- Includes comprehensive metric tracking for monitoring
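For readers unfamiliar with the orthogonalization step, here is a minimal NumPy sketch of a quintic Newton-Schulz iteration. It is an illustration only, not the code in this PR: the function name, the `steps` and `eps` defaults, and the coefficients (taken from the publicly available Muon reference implementation) are assumptions here.

```python
import numpy as np


def newton_schulz_orthogonalize(grad, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D update with a quintic Newton-Schulz iteration.

    Hypothetical helper for illustration; coefficients are assumed to match
    the public Muon reference implementation.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    x = np.asarray(grad, dtype=np.float64)

    # Work on the "wide" orientation so the Gram matrix x @ x.T is the smaller one.
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T

    # Scale so that every singular value is at most 1 (Frobenius norm bounds the spectral norm).
    x = x / (np.linalg.norm(x) + eps)

    for _ in range(steps):
        gram = x @ x.T
        poly = b * gram + c * (gram @ gram)
        # Applies s -> a*s + b*s^3 + c*s^5 to each singular value of x.
        x = a * x + poly @ x

    return x.T if transposed else x
```

The result is not exactly orthogonal: with these coefficients the singular values are pushed into a band around 1 rather than converging to 1, which trades exactness for fewer iterations.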

Implementation details:

- Based on the original Muon paper and reference implementation
- Adds distributed metric collection and reduction
- Handles conv filters through reshaping
- Supports selective weight updates and gradient clipping
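As a rough illustration of the parameter routing and the conv-filter reshaping described above, the sketch below shows one way such a split could work. The helper names and the name-based check for embeddings/heads (`"embed"`, `"head"`) are hypothetical; the PR description does not say how those parameters are identified.

```python
import numpy as np


def use_muon(name, param):
    """Return True if the parameter would take the Muon path in this sketch.

    Assumption: embeddings and output heads are recognized by substring
    match on the parameter name; the real rule may differ.
    """
    if param.ndim < 2:
        # Scalars and vectors (biases, norm gains) fall back to AdamW.
        return False
    return not any(key in name for key in ("embed", "head"))


def as_matrix(param):
    """Flatten a conv filter (out, in, kh, kw) to a 2D matrix (out, in*kh*kw)."""
    return param.reshape(param.shape[0], -1)


# Example: a 4D conv filter is treated as an 8 x 27 matrix.
conv_filter = np.zeros((8, 3, 3, 3))
matrix_view = as_matrix(conv_filter)
```

After the orthogonalized update is computed on the 2D view, it would be reshaped back to the original filter shape before being applied.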

Testing:

- Tested on single GPU/CPU with comprehensive test suite
- Mock tests verify distributed code paths
- Convergence verified on regression tasks

Happy to add config integration if there's interest. Tested locally; all core functionality is working.


JenWei0312 · 2025-09-17T02:19:41Z

Hi team, just wanted to gently follow up on this PR for the MuonW optimizer.

I know you're all very busy, so no rush at all. Please let me know if there are any questions, changes, or additional tests I can provide from my end to help move the review process along.

Thanks for your time and for maintaining this great project!

baileykuehl · 2025-09-17T19:13:53Z

Hi there, thanks for your contribution and interest! We apologize for the delay in response to your PR - we are indeed at a busy time of year. We will take a look at this as soon as we can!

JenWei0312 wants to merge 1 commit into allenai:main from JenWei0312:patch-1