Skip to content

Comments

Train a few steps after time limit reached#362

Merged
epwalsh merged 7 commits intomainfrom
epwalsh/train-after-cancel
Jan 4, 2024
Merged

Train a few steps after time limit reached#362
epwalsh merged 7 commits intomainfrom
epwalsh/train-after-cancel

Conversation

@epwalsh
Copy link
Member

@epwalsh epwalsh commented Nov 6, 2023

This expands on the cancellation logic so that when a run is canceled due to reaching the time limit, it will train for 10 more steps after the cancellation goes into effect and after saving the final checkpoint. That way when we restart the run from the latest checkpoint we'll have some overlap in metrics on W&B, which is good for verifying that the restart worked properly.

@epwalsh epwalsh requested review from 2015aroras and dirkgr November 6, 2023 21:09
olmo/train.py Outdated
canceled = hard_stop = True

# Maybe save sharded checkpoint.
if canceled or (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will save a checkpoint for all the extra steps. Consider making this and some later code hard_stop instead

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, you could have canceled represent a hard stop and cancel_initiated represent the beginning of a cancellation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good catch!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

olmo/train.py Outdated
# First check if we've reached the training time limit.
should_cancel = True
cancel_reason = "time limit reached"
extra_steps = 10 # train for 10 extra steps so we get an overlap in metrics when we restart
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making this a config setting

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@epwalsh epwalsh requested a review from 2015aroras November 6, 2023 22:43

# Maybe run evaluations.
if not canceled and self.global_step % self.cfg.eval_interval == 0:
if not cancel_initiated and self.global_step % self.cfg.eval_interval == 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, you don't want eval metrics if they happen in those extra steps?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right... though it's debatable. I think when we cancel we want to stop ASAP, and the eval loop adds time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, no eval loops. This is a sanity check.

Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>
@epwalsh epwalsh requested a review from dirkgr November 9, 2023 01:26
@dirkgr
Copy link
Member

dirkgr commented Jan 4, 2024

Do we still want this? Can we get it merged?

@epwalsh
Copy link
Member Author

epwalsh commented Jan 4, 2024

@dirkgr yes, do you want to give a final review? Otherwise I think we're good to go with this.

Copy link
Member

@dirkgr dirkgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not review again. I was fine with it last time, except those spelling errors.

@epwalsh epwalsh merged commit 23eb949 into main Jan 4, 2024
@epwalsh epwalsh deleted the epwalsh/train-after-cancel branch January 4, 2024 22:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants