
Improve run_dataloader.py usability and performance feedback (addresses #869) #902

Open
ezylopx5 wants to merge 1 commit into allenai:main from ezylopx5:improve-run-dataloader-ux

Conversation

@ezylopx5

Summary

Improves run_dataloader.py usability for users who need to extract partial training data or debug data loading performance (addresses #869).

Changes

  • Add --max_batches flag to limit processing (useful for testing or partial data extraction)
  • Add --local_data_root flag to auto-substitute remote paths with local cached data
  • Add real-time throughput statistics (batches/s, MB/s) to progress bar
  • Warn users when loading from remote paths without local cache
  • Fix: Save any remaining batches that don't fill a complete file (was silently dropping data)
  • Log final statistics at completion
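Since the diff itself is not shown in this thread, here is a hypothetical sketch of how the two new flags could be wired into the script's argument parser. Flag names mirror the PR description; the other arguments and defaults are assumptions, not the actual implementation.

```python
import argparse

def build_parser():
    # Illustrative parser only; the real run_dataloader.py may differ.
    parser = argparse.ArgumentParser(description="Dump dataloader batches to disk")
    parser.add_argument("-o", "--output", required=True, help="Output directory")
    parser.add_argument("config", help="Training config YAML")
    parser.add_argument(
        "--max_batches", type=int, default=None,
        help="Stop after this many batches (testing / partial extraction)",
    )
    parser.add_argument(
        "--local_data_root", type=str, default=None,
        help="Substitute remote (s3://, r2://) path prefixes with this local root",
    )
    return parser

args = build_parser().parse_args(
    ["-o", "./output", "--max_batches", "100", "config.yaml"]
)
print(args.max_batches, args.local_data_root)  # 100 None
```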

Motivation

Users running run_dataloader.py with remote data (S3/R2) experience very slow performance due to per-chunk network requests (#869 reported only 44 batches after running overnight).

This PR makes the script more user-friendly by:

  1. Allowing partial runs with --max_batches for testing
  2. Making it easier to use local data with --local_data_root
  3. Providing clear feedback about processing speed
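The throughput feedback in point 3 could be computed along these lines; this is a minimal stdlib-only sketch (the helper name `format_rates` is hypothetical), where the returned dict would feed something like tqdm's `set_postfix` on the progress bar.

```python
import time

def format_rates(n_batches, n_bytes, start_time):
    # Hypothetical helper: compute running batches/s and MB/s for the
    # progress bar. Guard against a zero elapsed time on the first batch.
    elapsed = max(time.monotonic() - start_time, 1e-9)
    return {
        "batches/s": round(n_batches / elapsed, 2),
        "MB/s": round(n_bytes / (1024 * 1024) / elapsed, 2),
    }

# Inside the loading loop one would periodically do, e.g.:
#   pbar.set_postfix(format_rates(seen_batches, seen_bytes, start))
start = time.monotonic()
print(format_rates(n_batches=100, n_bytes=50 * 1024 * 1024, start_time=start))
```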

Example usage

# Process only first 100 batches
python scripts/run_dataloader.py -o ./output --max_batches 100 config.yaml

# Use locally cached data instead of S3
python scripts/run_dataloader.py -o ./output --local_data_root /mnt/data config.yaml
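The --local_data_root substitution in the second example could be implemented roughly as follows. This is a sketch under the assumption that the local cache mirrors the remote bucket/key layout; the helper name `localize_path` and the set of schemes are illustrative, not taken from the PR.

```python
def localize_path(remote_path: str, local_data_root: str) -> str:
    # Hypothetical helper: map a remote URL onto a local mirror by
    # replacing the scheme prefix with the local root directory.
    for scheme in ("s3://", "r2://", "http://", "https://"):
        if remote_path.startswith(scheme):
            return local_data_root.rstrip("/") + "/" + remote_path[len(scheme):]
    return remote_path  # already local; leave untouched

print(localize_path("s3://my-bucket/data/part-000.npy", "/mnt/data"))
# /mnt/data/my-bucket/data/part-000.npy
```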

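The trailing-batch fix listed under Changes (batches that don't fill a complete file were silently dropped) can be sketched as below. The grouping size, file naming, and serialization are assumptions for illustration; only the flush-the-remainder logic is the point.

```python
import os

BATCHES_PER_FILE = 4  # illustrative grouping, not the script's real value

def dump_batches(batches, out_dir):
    """Write batches in groups of BATCHES_PER_FILE, flushing any remainder."""
    os.makedirs(out_dir, exist_ok=True)
    buffer, file_idx, written = [], 0, 0

    def flush():
        nonlocal file_idx, written, buffer
        path = os.path.join(out_dir, f"part-{file_idx:05d}.txt")
        with open(path, "w") as f:
            f.write("\n".join(map(str, buffer)))
        file_idx += 1
        written += len(buffer)
        buffer = []

    for batch in batches:
        buffer.append(batch)
        if len(buffer) == BATCHES_PER_FILE:
            flush()
    if buffer:  # the fix: persist the incomplete final file instead of dropping it
        flush()
    return written
```

With 10 batches and 4 per file, this writes two full files plus one remainder file, so all 10 batches survive.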
