Decoupled Momentum Optimization by peter-sk · Pull Request #771 · allenai/OLMo

peter-sk · 2024-12-24T12:24:15Z

Cleaned-up version of https://github.com/bloc97/DeMo for integrating efficient distributed training à la Decoupled Momentum Optimization (https://arxiv.org/abs/2411.19870).

dirkgr

This looks interesting, but I don't actually know what this optimizer is. Can you give some background?

Edit: There is a description at the top. Am blind.

dirkgr · 2025-02-04T01:27:06Z

olmo/optim.py

+            compression_topk=cfg.optimizer.compression_topk,
+            compression_chunk=cfg.optimizer.compression_chunk,
+            weight_decay=cfg.optimizer.weight_decay,
+            process_group=None,  # TODO: fix for hybrid sharding


This seems important? Hybrid is necessary for big models.
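
For context on why `process_group=None` matters here: under hybrid sharding, ranks are split into shard groups (parameters are sharded within each) and replicate groups (ranks holding the same shard). A minimal sketch of that layout follows, assuming the common contiguous assignment where each shard group is a block of consecutive ranks. The function name and the 8-rank example are illustrative only and are not taken from this PR. Which of these groups DeMo should communicate over is exactly what the TODO leaves open.

```python
from typing import List, Tuple


def hybrid_shard_groups(
    world_size: int, shard_size: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (shard_groups, replicate_groups) for a contiguous rank layout.

    Hypothetical helper for illustration; assumes each shard group is a
    block of `shard_size` consecutive ranks (e.g. one node).
    """
    if world_size % shard_size != 0:
        raise ValueError("world_size must be divisible by shard_size")
    num_replicas = world_size // shard_size
    # Ranks that jointly hold one full copy of the model.
    shard_groups = [
        list(range(r * shard_size, (r + 1) * shard_size))
        for r in range(num_replicas)
    ]
    # Ranks that hold the same shard in different model copies.
    replicate_groups = [
        [r * shard_size + i for r in range(num_replicas)]
        for i in range(shard_size)
    ]
    return shard_groups, replicate_groups


shard_groups, replicate_groups = hybrid_shard_groups(world_size=8, shard_size=4)
```

With a single global group (`None`), every rank would be treated as a peer of every other, which no longer matches this two-level structure.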

dirkgr · 2025-02-04T01:28:02Z

olmo/config.py

+    ### DeMo parameters
+    compression_decay: float = 0.999
+
+    compression_topk: int = 32
+    """
+    How many numbers of topk to transmit per chunk, if dynamic is enabled, this is the initial topk
+    """
+
+    compression_chunk: int = 64
+    """
+    Size of the chunk of the gradients, note that 2D gradients are chunked in 2D, which the topk sparsity is squared compared to 1D
+    """


Prefix these with demo_?
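
A sketch of what the renamed fields might look like, together with a small helper that illustrates the 1D versus 2D sparsity mentioned in the `compression_chunk` docstring. The class name, the `demo_`-prefixed field names, and the helper are hypothetical. The helper assumes a 2D chunk is `chunk x chunk` elements and that `topk` values are kept per chunk, which is how the docstring reads; it does not model anything else the optimizer does.

```python
from dataclasses import dataclass


@dataclass
class OptimizerConfigSketch:
    # Hypothetical names: the PR's fields with the suggested demo_ prefix.
    demo_compression_decay: float = 0.999
    demo_compression_topk: int = 32
    demo_compression_chunk: int = 64


def kept_fraction(cfg: OptimizerConfigSketch, ndim: int) -> float:
    """Fraction of entries transmitted per chunk for an ndim-dimensional chunk.

    Assumes a chunk holds chunk**ndim elements and topk of them are kept.
    """
    chunk_elems = cfg.demo_compression_chunk ** ndim
    return min(cfg.demo_compression_topk, chunk_elems) / chunk_elems


cfg = OptimizerConfigSketch()
frac_1d = kept_fraction(cfg, ndim=1)  # 32 / 64   = 0.5
frac_2d = kept_fraction(cfg, ndim=2)  # 32 / 4096 = 0.0078125
```

Under these assumptions the same `topk=32` keeps half of a 1D chunk but under 1% of a 2D chunk, so the defaults behave very differently depending on parameter shape.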

dirkgr · 2025-02-04T01:28:57Z

olmo/config.py

+    disable_grad_sync: bool = False
+


I see this setting twice, once here, and once in DDPGradSyncMode?

dirkgr · 2025-02-04T01:30:11Z

olmo/optim.py

            return metrics


+class DeMo(torch.optim.SGD, Optimizer):


It seems like the organization would make more sense if this class, and demo_utils.py, were in their own file together, and then we use __all__ to make this optimizer appear the same as the others.
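
To make the suggestion concrete, here is a self-contained sketch of the re-export pattern. It writes a throwaway package to a temporary directory so the snippet can run anywhere. The package name `optim_sketch`, the module name `demo.py`, and the placeholder class body are all hypothetical stand-ins, not the layout of this repository.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_sketch_package(root: Path) -> None:
    """Create optim_sketch/{__init__.py, demo.py} under root (illustrative only)."""
    pkg = root / "optim_sketch"
    pkg.mkdir()
    # The optimizer and its helpers live together in one module.
    (pkg / "demo.py").write_text(
        "class DeMo:\n"
        "    pass  # placeholder for the real optimizer\n"
        "\n"
        "def _compress(x):  # private helper, deliberately not exported\n"
        "    return x\n"
        "\n"
        "__all__ = ['DeMo']\n"
    )
    # The package re-exports only the public name.
    (pkg / "__init__.py").write_text(
        "from .demo import DeMo\n"
        "\n"
        "__all__ = ['DeMo']\n"
    )


with tempfile.TemporaryDirectory() as tmp:
    build_sketch_package(Path(tmp))
    sys.path.insert(0, tmp)
    try:
        optim_sketch = importlib.import_module("optim_sketch")
        exported = list(optim_sketch.__all__)
        demo_cls_name = optim_sketch.DeMo.__name__
        has_private = hasattr(optim_sketch, "_compress")
    finally:
        sys.path.remove(tmp)
```

Callers then import `DeMo` from the package exactly as they would any other optimizer, while the helper functions stay private to the module.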

dirkgr · 2025-02-04T01:33:47Z

Oh, I see. You put a reference in the description 🙈.

Paper says you pushed this to 1B/100B tokens. Can you go further? Experience says, things like this stop working if you go really big.

DeMo

1ca4390

peter-sk mentioned this pull request Jan 6, 2025

empty tensors when using DeMo with FSDP bloc97/DeMo#2

Open

Merge branch 'main' into demo

2d3baaf

dirkgr requested changes Feb 4, 2025

dirkgr self-assigned this Feb 4, 2025

peter-sk wants to merge 2 commits into allenai:main from schneiderkamplab:demo


2 participants
