
Improve run_dataloader.py usability and performance feedback (addresses #869) #902

Open
ezylopx5 wants to merge 1 commit into allenai:main from ezylopx5:improve-run-dataloader-ux

Conversation

@ezylopx5

Summary

Improves run_dataloader.py usability for users who need to extract partial training data or debug data loading performance (addresses #869).

Changes

  • Add --max_batches flag to limit processing (useful for testing or partial data extraction)
  • Add --local_data_root flag to auto-substitute remote paths with local cached data
  • Add real-time throughput statistics (batches/s, MB/s) to progress bar
  • Warn users when loading from remote paths without local cache
  • Fix: Save any remaining batches that don't fill a complete file (was silently dropping data)
  • Log final statistics at completion
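Since the diff itself is not shown in this thread, here is a hypothetical sketch of how the two new flags could be wired into the script's argument parser. Flag names mirror the PR description; the other arguments and defaults are assumptions, not the actual implementation.

```python
import argparse

def build_parser():
    # Illustrative parser only; the real run_dataloader.py may differ.
    parser = argparse.ArgumentParser(description="Dump dataloader batches to disk")
    parser.add_argument("-o", "--output", required=True, help="Output directory")
    parser.add_argument("config", help="Training config YAML")
    parser.add_argument(
        "--max_batches", type=int, default=None,
        help="Stop after this many batches (testing / partial extraction)",
    )
    parser.add_argument(
        "--local_data_root", type=str, default=None,
        help="Substitute remote (s3://, r2://) path prefixes with this local root",
    )
    return parser

args = build_parser().parse_args(
    ["-o", "./output", "--max_batches", "100", "config.yaml"]
)
print(args.max_batches, args.local_data_root)  # 100 None
```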

Motivation

Users running run_dataloader.py with remote data (S3/R2) experience very slow performance due to per-chunk network requests (#869 reported only 44 batches after running overnight).

This PR makes the script more user-friendly by:

  1. Allowing partial runs with --max_batches for testing
  2. Making it easier to use local data with --local_data_root
  3. Providing clear feedback about processing speed
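The throughput feedback in point 3 could be computed along these lines; this is a minimal stdlib-only sketch (the helper name `format_rates` is hypothetical), where the returned dict would feed something like tqdm's `set_postfix` on the progress bar.

```python
import time

def format_rates(n_batches, n_bytes, start_time):
    # Hypothetical helper: compute running batches/s and MB/s for the
    # progress bar. Guard against a zero elapsed time on the first batch.
    elapsed = max(time.monotonic() - start_time, 1e-9)
    return {
        "batches/s": round(n_batches / elapsed, 2),
        "MB/s": round(n_bytes / (1024 * 1024) / elapsed, 2),
    }

# Inside the loading loop one would periodically do, e.g.:
#   pbar.set_postfix(format_rates(seen_batches, seen_bytes, start))
start = time.monotonic()
print(format_rates(n_batches=100, n_bytes=50 * 1024 * 1024, start_time=start))
```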

Example usage

# Process only first 100 batches
python scripts/run_dataloader.py -o ./output --max_batches 100 config.yaml

# Use locally cached data instead of S3
python scripts/run_dataloader.py -o ./output --local_data_root /mnt/data config.yaml
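The --local_data_root substitution in the second example could be implemented roughly as follows. This is a sketch under the assumption that the local cache mirrors the remote bucket/key layout; the helper name `localize_path` and the set of schemes are illustrative, not taken from the PR.

```python
def localize_path(remote_path: str, local_data_root: str) -> str:
    # Hypothetical helper: map a remote URL onto a local mirror by
    # replacing the scheme prefix with the local root directory.
    for scheme in ("s3://", "r2://", "http://", "https://"):
        if remote_path.startswith(scheme):
            return local_data_root.rstrip("/") + "/" + remote_path[len(scheme):]
    return remote_path  # already local; leave untouched

print(localize_path("s3://my-bucket/data/part-000.npy", "/mnt/data"))
# /mnt/data/my-bucket/data/part-000.npy
```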

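The trailing-batch fix listed under Changes (batches that don't fill a complete file were silently dropped) can be sketched as below. The grouping size, file naming, and serialization are assumptions for illustration; only the flush-the-remainder logic is the point.

```python
import os

BATCHES_PER_FILE = 4  # illustrative grouping, not the script's real value

def dump_batches(batches, out_dir):
    """Write batches in groups of BATCHES_PER_FILE, flushing any remainder."""
    os.makedirs(out_dir, exist_ok=True)
    buffer, file_idx, written = [], 0, 0

    def flush():
        nonlocal file_idx, written, buffer
        path = os.path.join(out_dir, f"part-{file_idx:05d}.txt")
        with open(path, "w") as f:
            f.write("\n".join(map(str, buffer)))
        file_idx += 1
        written += len(buffer)
        buffer = []

    for batch in batches:
        buffer.append(batch)
        if len(buffer) == BATCHES_PER_FILE:
            flush()
    if buffer:  # the fix: persist the incomplete final file instead of dropping it
        flush()
    return written
```

With 10 batches and 4 per file, this writes two full files plus one remainder file, so all 10 batches survive.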
