Train a few steps after time limit reached #362
olmo/train.py
Outdated
canceled = hard_stop = True

# Maybe save sharded checkpoint.
if canceled or (
This will save a checkpoint on every one of the extra steps. Consider making this and some of the later code check hard_stop instead.
Alternatively, you could have canceled represent a hard stop and cancel_initiated represent the beginning of a cancellation.
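The two-flag idea suggested above can be sketched as a toy loop. This is only an illustration under assumptions, not the code in olmo/train.py: `simulate_run`, `time_limit_step`, and `max_steps` are hypothetical names, and the time limit is modeled as a step number rather than a wall-clock check.

```python
from typing import List, Tuple


def simulate_run(
    time_limit_step: int, extra_steps: int, max_steps: int = 100
) -> Tuple[List[int], List[int]]:
    """Toy loop (hypothetical): return (steps trained, steps where a checkpoint was saved)."""
    cancel_initiated = False  # cancellation has begun, e.g. the time limit was reached
    canceled = False          # hard stop: leave the training loop
    stop_at = 0               # step at which the hard stop takes effect
    steps_trained: List[int] = []
    checkpoint_steps: List[int] = []

    for step in range(1, max_steps + 1):
        steps_trained.append(step)  # stand-in for one real training step

        if not cancel_initiated and step >= time_limit_step:
            cancel_initiated = True
            stop_at = step + extra_steps
            # Save the final checkpoint once, when cancellation begins,
            # rather than on every one of the extra steps.
            checkpoint_steps.append(step)

        if cancel_initiated and step >= stop_at:
            canceled = True

        if canceled:
            break

    return steps_trained, checkpoint_steps
```

With `time_limit_step=5` and `extra_steps=3`, the loop saves a single checkpoint at step 5 and then trains steps 6 through 8 before stopping.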
olmo/train.py
Outdated
# First check if we've reached the training time limit.
should_cancel = True
cancel_reason = "time limit reached"
extra_steps = 10  # train for 10 extra steps so we get an overlap in metrics when we restart
Consider making this a config setting
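One way the hard-coded value could become a setting is sketched below. The class `TimeLimitSettings` and the field name `extra_steps_after_cancel` are hypothetical and chosen only for illustration; the project's actual config class and naming may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimeLimitSettings:
    """Hypothetical config fragment, not the project's real config class."""

    # Wall-clock training limit in seconds; None means no limit.
    time_limit: Optional[float] = None
    # Steps to keep training after the time limit triggers cancellation.
    # The default of 10 mirrors the hard-coded value in the diff.
    extra_steps_after_cancel: int = 10

    def __post_init__(self) -> None:
        if self.extra_steps_after_cancel < 0:
            raise ValueError("extra_steps_after_cancel must be >= 0")
```

Setting the value to 0 would then restore the previous behavior of stopping immediately once the limit is reached.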
# Maybe run evaluations.
- if not canceled and self.global_step % self.cfg.eval_interval == 0:
+ if not cancel_initiated and self.global_step % self.cfg.eval_interval == 0:
To be clear, you don't want eval metrics if they happen in those extra steps?
Right... though it's debatable. I think when we cancel we want to stop ASAP, and the eval loop adds time.
Yeah, no eval loops. This is a sanity check.
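The agreed behavior, skipping evaluation once cancellation has begun, amounts to the condition below. This is a standalone sketch: `should_run_eval` is a hypothetical helper, whereas the real check is inline in the training loop and reads `self.global_step` and `self.cfg.eval_interval`.

```python
def should_run_eval(global_step: int, eval_interval: int, cancel_initiated: bool) -> bool:
    """Hypothetical helper mirroring the inline condition in the diff."""
    if cancel_initiated:
        # Extra steps after the time limit are a sanity check only,
        # so the run should not spend time in the eval loop.
        return False
    return global_step % eval_interval == 0
```

So a step that lands on the eval interval still skips evaluation if it falls among the extra steps.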
Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>
Do we still want this? Can we get it merged?
@dirkgr yes, do you want to give a final review? Otherwise I think we're good to go with this.
dirkgr left a comment
I did not review again. I was fine with it last time, except those spelling errors.
This expands on the cancellation logic so that when a run is canceled due to reaching the time limit, it trains for 10 more steps after the cancellation goes into effect and after the final checkpoint is saved. That way, when we restart the run from the latest checkpoint, we'll have some overlap in metrics on W&B, which is good for verifying that the restart worked properly.
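To make the overlap concrete: if the final checkpoint is saved at step N and the run then trains 10 more steps, the original run and the restarted run both log steps N+1 through N+10. A rough sketch of comparing the two follows. The helper names, the dict-of-losses input format, and the tolerance are all assumptions; in practice the comparison is done by looking at the W&B charts, and exact agreement depends on how deterministic the training is.

```python
import math
from typing import Dict


def overlap_steps(checkpoint_step: int, extra_steps: int) -> range:
    """Steps logged by both the original run and the run restarted from the checkpoint."""
    return range(checkpoint_step + 1, checkpoint_step + extra_steps + 1)


def restart_matches(
    before: Dict[int, float],
    after: Dict[int, float],
    checkpoint_step: int,
    extra_steps: int,
    rel_tol: float = 1e-3,  # assumed tolerance, not taken from the PR
) -> bool:
    """Return True if both runs report close metric values on every overlapping step."""
    for step in overlap_steps(checkpoint_step, extra_steps):
        if step not in before or step not in after:
            return False  # a missing step means there is no overlap to compare
        if not math.isclose(before[step], after[step], rel_tol=rel_tol):
            return False
    return True
```

Here `before` and `after` map a global step to a logged metric such as training loss for the original and restarted runs respectively.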