Conversation
README.md
Outdated
|
|
||
| - `config.yaml`: the config at that training step. | ||
| - `model.pt`, `optim.pt`, `train.pt`: model, optimizer and training state at that training step. | ||
| - `model.safetensors`, `optim.safetensors`, `train.pt`: model, optimizer and training state at that training step. |
There was a problem hiding this comment.
train.safetensors? Also, for the original model we just have *.pt so we should have that format mentioned somewhere.
There was a problem hiding this comment.
We are going to save in .safetensors starting from OLMo-2
There was a problem hiding this comment.
Sure, but people might still try to use older OLMo models. The documentation should be backwards-compatible?
scripts/download_checkpoints.py
Outdated
| except requests.exceptions.HTTPError as e: | ||
| print(f"HTTP error for {pattern}: {e}") | ||
| except requests.exceptions.RequestException as e: | ||
| print(f"Connection error for {pattern}: {e}") |
There was a problem hiding this comment.
This is still swallowing exceptions. It just prints about them. This part of the code should check whether the exception is a 404 error, in which case it's fine, and otherwise let the exception propagate.
scripts/download_checkpoints.py
Outdated
| print(f"Saving to: {base_path}") | ||
| available_files = try_get_directory_listing(public_url) | ||
|
|
||
| if not available_files: |
There was a problem hiding this comment.
| if not available_files: | |
| if len(available_files) <= 0: |
There was a problem hiding this comment.
How does this make any better? found_files can never be negative right? if not available_files: is same as if len(available_files) == 0:.
scripts/download_checkpoints.py
Outdated
| try: | ||
| test_url = urljoin(url.rstrip('/') + '/', pattern) | ||
| response = requests.head(test_url) | ||
| # response.raise_for_status() |
scripts/download_checkpoints.py
Outdated
| try: | ||
| print(f"\nDownloading: {file}") | ||
| download_file(file_url, file_path) | ||
| except requests.exceptions.Timeout: | ||
| print(f"Timeout error for {file}, retrying once...") | ||
| try: | ||
| download_file(file_url, file_path) | ||
| except requests.exceptions.RequestException as e: | ||
| failed_files.append(file) | ||
| print(f"Failed to download {file}: {e}") | ||
| except requests.exceptions.RequestException as e: | ||
| failed_files.append(file) | ||
| print(f"Failed to download {file}: {e}") |
There was a problem hiding this comment.
Don't retry in this way. The Python requests library already supports retries. In this code we should just call download_file(). If a file fails, we fail the whole process (i.e., we let the exception propagate).
scripts/download_checkpoints.py
Outdated
| response = requests.get(url, stream=True) | ||
| response.raise_for_status() | ||
| total_size = int(response.headers.get('content-length', 0)) | ||
| save_path.parent.mkdir(parents=True, exist_ok=True) | ||
|
|
||
| with open(save_path, 'wb') as f: | ||
| with tqdm(total=total_size, unit='B', unit_scale=True, desc=save_path.name) as pbar: | ||
| for chunk in response.iter_content(chunk_size=chunk_size): | ||
| if chunk: | ||
| f.write(chunk) | ||
| pbar.update(len(chunk)) |
There was a problem hiding this comment.
Retries should be handled in here. If the request fails with a recoverable error code (anything that starts with 5XX, 408, 409, 429), it should wait one second, and then try again. Try up to 5 times, and then give up. When giving up, it must make sure that the file at save_path does not exist.
scripts/download_checkpoints.py
Outdated
| print(f"Error: Step {args.step} not found in the CSV file.") | ||
| print("Use list argument to see available step numbers.") | ||
| return |
There was a problem hiding this comment.
Don't print and then return. If you do this, the process will return normally, and whoever called it will think it succeeded. If you want to fail, raise an exception and let it crash the process.
scripts/download_checkpoints.py
Outdated
| reader = csv.DictReader(f) | ||
| urls = [(row['Step'], row['Checkpoint Directory']) for row in reader] | ||
|
|
||
| if args.command == 'list': |
There was a problem hiding this comment.
This should be an if ladder:
if args.command == 'foo':
do_foo_things()
do_other_things()
elif args.command == 'bar':
do_bar_things()
do_more_things()
else:
raise RuntimeException(f"Unexpected command: {args.command}")
scripts/download_checkpoints.py
Outdated
| print(f"Step {step}") | ||
| return | ||
|
|
||
| if args.step: |
There was a problem hiding this comment.
| if args.step: | |
| if args.step is not None: |
scripts/download_checkpoints.py
Outdated
| help='Download checkpoints from CSV file') | ||
| download_parser.add_argument('csv_file', type=str, | ||
| help='Path to the CSV file containing checkpoint URLs') | ||
| download_parser.add_argument('--step', type=str, required=True, |
There was a problem hiding this comment.
In the code it seems like step is not required if you want to download everything?
There was a problem hiding this comment.
No, step is required argument. It throws this error when user fails to add --step.
python scripts/download_checkpoints.py download checkpoints/official/OLMo-1B.csv --save-dir ./new_checkpoints
usage: download_checkpoints.py download [-h] --step STEP [--save-dir SAVE_DIR] csv_file
download_checkpoints.py download: error: the following arguments are required: --step
scripts/download_checkpoints.py
Outdated
| r2_url = convert_to_r2_url(url) | ||
| public_url = convert_to_public_url(r2_url) |
There was a problem hiding this comment.
The URLs in the CSV are already public URLs? Why are we doing this change?
There was a problem hiding this comment.
I've taken these functions from my another file where I will have r2 urls in csv while uploading. I missed this here.
Documentation Improvements
Changes Made
scripts/download_checkpoints.pyto automate checkpoint downloads.scripts/train.py.New Features
The new
scripts/download_checkpoints.pyscript: