Conversation
This reverts commit 6cc6f62.
…to peteish13-augusta
```python
del eval_batches
# Eval compiles a bunch more versions, and the result is terrible. This way we get back to zero.
```
What do you mean, the result is terrible?
so this prompted me to look into this a bit more and I think I've found a better solution: just mark the model input sizes as dynamic. I tested this out in OLMo-core and it appears to work well.
allenai/OLMo-core#105
I think it compiles a bunch of versions for different batch sizes, because that's how we call it during eval, and then they stick around. In all of my early runs I had high tps until the first eval, and then low tps afterwards. This is what fixed it.
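The "get back to zero" part can be illustrated with `torch._dynamo.reset()`, which drops all cached compiled graphs. A hypothetical sketch of the failure mode and the reset, not the trainer's actual code (`backend="eager"` keeps it runnable without a compiler toolchain):

```python
import torch
import torch._dynamo as dynamo

def f(x):
    return x * 2

compiled = torch.compile(f, backend="eager")

# Eval calls the model with several distinct batch sizes; each distinct
# shape can trigger a fresh compiled graph, and those graphs stay cached
# after eval finishes.
for bs in (1, 2, 4, 8):
    compiled(torch.randn(bs, 16))

# Drop every cached graph, so the next call starts from a clean slate.
dynamo.reset()
y = compiled(torch.randn(8, 16))
```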
I tried dynamic and it was bad. I don't remember the way in which it was bad, but it didn't work. That's why I added that version in the first place.
Ok, oh well. I tested with nightly so maybe it's just better now with recent compiler advances.
Applying torch.compile() to one block at a time.
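A sketch of what per-block compilation looks like, assuming a ModuleList of transformer-style blocks (hypothetical `Block` class; `backend="eager"` just keeps the snippet self-contained). Each block gets its own small graph, which keeps compile times and recompilation blast radius down compared to compiling the whole model:

```python
import torch

class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.ff = torch.nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.ff(x))

blocks = torch.nn.ModuleList([Block() for _ in range(4)])

# Compile each block separately instead of wrapping the whole model.
for i, block in enumerate(blocks):
    blocks[i] = torch.compile(block, backend="eager")

x = torch.randn(2, 16)
for block in blocks:
    x = block(x)
```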