Conversation
AkshitaB
left a comment
Discussed my queries offline with @ananyahjha93:
- How were the model shapes decided? Based on Pythia, then adjusted for the target number of parameters (see the sketch after this list).
- How about the LR? Also a ballpark from Pythia.
Other things to note:
- Global batch size may also require some ablation.
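To illustrate what "model shapes" means for these configs, here is a hypothetical model block in the style of the tiny configs; the numbers are placeholders chosen for illustration, not the values in this PR:

```yaml
model:
  d_model: 1024        # placeholder width, not this PR's value
  n_heads: 16          # placeholder head count
  n_layers: 16         # placeholder depth
  weight_tying: false
```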
-    if cfg.save_num_unsharded_checkpoints_to_keep < 1:
+    if cfg.save_num_unsharded_checkpoints_to_keep == 0:
         log.warning(
What if save_num_checkpoints_to_keep is also 0?
it then assumes that you did not want to keep checkpoints at all!
-1 assumes you want to save all checkpoints and so I made it ==0 instead of < 1.
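As a reference for the retention settings discussed in this thread, a minimal config sketch assuming the semantics described above (-1 keeps every checkpoint, 0 keeps none, a positive value keeps that many):

```yaml
# Assumed semantics, per this thread: -1 = keep all, 0 = keep none, N > 0 = keep the last N.
save_num_checkpoints_to_keep: -1            # keep every sharded checkpoint
save_num_unsharded_checkpoints_to_keep: 0   # keep no unsharded checkpoints
```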
dirkgr
left a comment
I assume the configs for the different sizes are all the same, so I didn't look at all of them.
configs/tiny/OLMo-300M.yaml
Outdated
  weight_tying: false
  alibi: false
  rope: true
  flash_attention: true  # not available on AMD
removed the comment
  - label: basic_arithmetic
    type: downstream
Ah, basic_arithmetic should be in; the others don't provide any signal, based on my experience.
Ah, this was commented out with the note: # Doesn't work from cache.
Should work with cache v4
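For clarity, a sketch of the evaluator entry once it is re-enabled, assuming the usual downstream-evaluator shape used elsewhere in these configs:

```yaml
evaluators:
  # Previously commented out with "# Doesn't work from cache."; expected to work with cache v4.
  - label: basic_arithmetic
    type: downstream
```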
configs/tiny/OLMo-300M.yaml
Outdated
  stop_at: 100_000
  global_train_batch_size: 2048
  device_train_microbatch_size: 8
  max_duration: 2ep
This means you'll run into this bug: #584
It might not matter; the problem is only that the second epoch will be shuffled the same way as the first.
I'll add a stop_at of 400k steps!
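A sketch of the proposed change, using the step count from the reply above; stop_at caps the run at a fixed step count rather than relying on the epoch boundary affected by #584:

```yaml
global_train_batch_size: 2048
device_train_microbatch_size: 8
max_duration: 2ep
stop_at: 400_000   # hard step cap from the reply above; the exact value is an assumption pending the final config
```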
@@ -9,17 +9,15 @@ wandb:
  model:
configs/tiny/OLMo-20M.yaml
Outdated
  grad_clip_warmup_steps: null
  grad_clip_warmup_factor: 5
took these from @AkshitaB's llamaish1-normal-weka.yaml.
removed them for now!
  paths:
    ######### NON WEB DATA #########
    # ~> GUTENBERG BOOKS (5.256 GT)
    - s3://ai2-llm/preprocessed/olmo-mix/v1_6-decontaminated/books/gpt-neox-olmo-dolma-v1_5/part-0-00000.npy
Was planning to run on pluto; now I can see free nodes on jupiter, so I'm making the change!
  # Unsharded checkpoints (for ddp)
  save_interval_unsharded: 5000
- save_num_unsharded_checkpoints_to_keep: 3
+ save_num_unsharded_checkpoints_to_keep: -1
-1 is for keeping all checkpoints, but I'll double check
dirkgr
left a comment
Approved with a small comment about the long warmup.
  units: tokens
- t_warmup: 4194304000
  t_max: 3e12
+ t_warmup: 5000
For normal init, this is a lot of warmup? Not a big deal, but unusual?
Smaller models, higher LR, so I didn't take a chance! It's never bad to do a longer warmup!
  max_duration: 1ep
  stop_at: 406_934
Do you need both max_duration and stop_at?
Yes; from what I have observed (and Dave mentioned), training goes past max_duration if stop_at is not set.
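A minimal sketch of keeping both fields, reflecting the observation above that training can run past max_duration when stop_at is absent:

```yaml
max_duration: 1ep    # intended length of the run
stop_at: 406_934     # hard stop in steps; without it the run has been observed to continue past max_duration
```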
  # Doesn't work from cache.
  # - label: basic_arithmetic
  #   type: downstream
      --wandb.group=$TASK_NAME \
-     --wandb.project=tiny_olmo \
+     --wandb.project=olmo-tiny \
      --max_grad_norm=2.0 \
Do you want to use this clipping value for all the small models?
Ah, let me fix this; the model with clipping value 2.0 does not show any downstream improvement!
Co-authored-by: Pete <epwalsh10@gmail.com>
olmo/train.py
Outdated
  num_fwd_flops=self.model.num_fwd_flops,  # this is per sequence
  num_bck_flops=self.model.num_bck_flops,  # this is per sequence
"this is per sequence" ... it's per-token now, right?