Train a few steps after time limit reached by epwalsh · Pull Request #362 · allenai/OLMo

epwalsh · 2023-11-06T21:08:59Z

This expands on the cancellation logic so that when a run is canceled due to reaching the time limit, it will train for 10 more steps after the cancellation goes into effect and after saving the final checkpoint. That way when we restart the run from the latest checkpoint we'll have some overlap in metrics on W&B, which is good for verifying that the restart worked properly.
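The behaviour described above can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not the code in olmo/train.py: `simulate_run`, `time_limit_step`, and `max_steps` are hypothetical names, and the time limit is modeled as a step number rather than wall-clock time.

```python
from typing import List, Tuple


def simulate_run(
    time_limit_step: int, extra_steps: int = 10, max_steps: int = 1000
) -> Tuple[int, List[int]]:
    """Return (last step trained, steps where the final checkpoint was saved)."""
    cancel_initiated = False
    stop_at = max_steps
    checkpoint_steps: List[int] = []
    global_step = 0
    while global_step < stop_at:
        global_step += 1  # stand-in for one real training step
        if not cancel_initiated and global_step >= time_limit_step:
            # Cancellation takes effect: save the final checkpoint once,
            # then allow a few more steps before actually stopping.
            cancel_initiated = True
            checkpoint_steps.append(global_step)
            stop_at = min(max_steps, global_step + extra_steps)
    return global_step, checkpoint_steps


last_step, saved = simulate_run(time_limit_step=100, extra_steps=10)
print(last_step, saved)
```

In this sketch a restart from the checkpoint at step 100 would re-log steps 101 through 110, which is the overlap in metrics the description refers to.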

2015aroras · 2023-11-06T21:41:00Z

olmo/train.py

+                    canceled = hard_stop = True

                # Maybe save sharded checkpoint.
                if canceled or (


This will save a checkpoint on every one of the extra steps. Consider having this and some of the later code check hard_stop instead.

Alternatively, you could have canceled represent a hard stop and cancel_initiated represent the beginning of a cancellation.

Ah good catch!
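A toy illustration of the concern raised here, not the PR's actual code: a condition gated on a flag that stays true during the extra steps fires on each of them, while a flag that is true on only one step fires once. The function name and the `gate` parameter are hypothetical.

```python
from typing import List


def steps_with_checkpoint(cancel_at: int, extra_steps: int, gate: str) -> List[int]:
    """List the steps on which a checkpoint would be written in a toy loop.

    gate="canceled" checks a flag that stays true for all the extra steps;
    gate="hard_stop" checks a flag that is true only on the very last step.
    """
    if gate not in ("canceled", "hard_stop"):
        raise ValueError(f"unknown gate: {gate!r}")
    saved: List[int] = []
    canceled = hard_stop = False
    step = 0
    while not hard_stop:
        step += 1
        if step >= cancel_at:
            canceled = True
        if step >= cancel_at + extra_steps:
            hard_stop = True
        flag = canceled if gate == "canceled" else hard_stop
        if flag:
            saved.append(step)
    return saved


print(steps_with_checkpoint(5, 3, "canceled"))
print(steps_with_checkpoint(5, 3, "hard_stop"))
```

Which single step should carry the final checkpoint is a separate decision; the PR description says the final checkpoint is saved before the extra steps begin.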

2015aroras · 2023-11-06T21:45:39Z

olmo/train.py

                # First check if we've reached the training time limit.
                should_cancel = True
                cancel_reason = "time limit reached"
+                extra_steps = 10  # train for 10 extra steps so we get an overlap in metrics when we restart


Consider making this a config setting
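One way the suggestion could look, as a sketch only: the dataclass, its field names, and `check_time_limit` are hypothetical and are not taken from the OLMo config.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TrainConfig:
    # Hypothetical fields for illustration only.
    time_limit: Optional[float] = None  # seconds; None disables the limit
    extra_steps_after_cancel: int = 10  # replaces the hard-coded 10


def check_time_limit(cfg: TrainConfig, elapsed: float) -> Tuple[bool, int]:
    """Return (should_cancel, extra_steps) for the elapsed wall-clock time."""
    if cfg.time_limit is not None and elapsed >= cfg.time_limit:
        return True, cfg.extra_steps_after_cancel
    return False, 0


cfg = TrainConfig(time_limit=3600.0, extra_steps_after_cancel=5)
print(check_time_limit(cfg, elapsed=3700.0))
```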

2015aroras · 2023-11-06T23:48:36Z

olmo/train.py


                # Maybe run evaluations.
-                if not canceled and self.global_step % self.cfg.eval_interval == 0:
+                if not cancel_initiated and self.global_step % self.cfg.eval_interval == 0:


To be clear, you don't want eval metrics if they happen in those extra steps?

Right... though it's debatable. I think when we cancel we want to stop ASAP, and the eval loop adds time.

Yeah, no eval loops. This is a sanity check.
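The condition from the diff above, pulled out into a standalone function for illustration. `should_run_eval` is a hypothetical helper; in the PR the check is inline and reads from self.global_step and self.cfg.eval_interval.

```python
def should_run_eval(global_step: int, eval_interval: int, cancel_initiated: bool) -> bool:
    """Skip evaluation once cancellation has begun, even on an eval step."""
    return not cancel_initiated and global_step % eval_interval == 0


# Step 1000 would normally trigger an eval with eval_interval=1000,
# but not during the extra steps that follow a cancellation.
print(should_run_eval(1000, 1000, cancel_initiated=False))
print(should_run_eval(1000, 1000, cancel_initiated=True))
```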

olmo/util.py

Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>

dirkgr · 2024-01-04T01:45:41Z

Do we still want this? Can we get it merged?

epwalsh · 2024-01-04T19:00:30Z

@dirkgr yes, do you want to give a final review? Otherwise I think we're good to go with this.

dirkgr

I did not review again. I was fine with it last time, except for those spelling errors.

Train a few steps after time limit reached

b828938

epwalsh requested review from 2015aroras and dirkgr November 6, 2023 21:09

2015aroras reviewed Nov 6, 2023

epwalsh added 2 commits November 6, 2023 14:40

fix: canceled vs cancel_initiated

a1c32e9

add configuration option

4ed81c6

epwalsh requested a review from 2015aroras November 6, 2023 22:43

2015aroras approved these changes Nov 6, 2023

Merge branch 'main' into epwalsh/train-after-cancel

56fc2cb

dirkgr requested changes Nov 8, 2023

I never won a spelling bee

84ad7a1

Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>

epwalsh requested a review from dirkgr November 9, 2023 01:26

epwalsh added 2 commits January 4, 2024 10:45

fix merge conflicts

7e044cd

Clean up

c38b642

dirkgr approved these changes Jan 4, 2024

epwalsh merged commit 23eb949 into main Jan 4, 2024

epwalsh deleted the epwalsh/train-after-cancel branch January 4, 2024 22:47

3 participants