Improve run_dataloader.py usability and performance feedback (addresses #869) #902
Open
ezylopx5 wants to merge 1 commit into allenai:main
- Add --max_batches flag to limit processing (addresses allenai#869)
- Add --local_data_root flag to auto-substitute remote paths with local data
- Add real-time throughput stats (batches/s, MB/s) to progress bar
- Warn users when loading from remote paths without local cache
- Fix: Save any remaining batches that don't fill a complete file
- Log final statistics at completion

This addresses user pain points when using run_dataloader.py with remote data sources (S3/R2), which can be extremely slow due to per-chunk network requests. Users can now:

1. Limit runs with --max_batches for testing
2. Use locally cached data easily with --local_data_root
3. See real-time feedback on processing speed
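
The batch limit, the fix for leftover batches, and the throughput numbers described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `chunk_batches`, `throughput`, and the `batches_per_file` parameter are hypothetical names, and batches are modeled as plain `bytes` objects.

```python
import itertools
from typing import Iterable, Iterator, List, Optional, Tuple


def chunk_batches(
    batches: Iterable[bytes],
    batches_per_file: int,
    max_batches: Optional[int] = None,
) -> Iterator[List[bytes]]:
    """Yield groups of batches, one group per output file (hypothetical helper).

    Stops after `max_batches` batches when a limit is given, and yields the
    final partial group instead of silently dropping it.
    """
    if max_batches is not None:
        # Cap the stream lazily so no extra batches are fetched.
        batches = itertools.islice(batches, max_batches)

    pending: List[bytes] = []
    for batch in batches:
        pending.append(batch)
        if len(pending) == batches_per_file:
            yield pending
            pending = []

    # Flush whatever is left over when the stream ends mid-file.
    if pending:
        yield pending


def throughput(num_batches: int, num_bytes: int, elapsed_s: float) -> Tuple[float, float]:
    """Return (batches/s, MB/s), e.g. for a progress-bar postfix."""
    elapsed_s = max(elapsed_s, 1e-9)  # guard against division by zero
    return num_batches / elapsed_s, num_bytes / 1e6 / elapsed_s


# Ten 4-byte batches, capped at seven, written four per file.
stream = (bytes(4) for _ in range(10))
sizes = [len(group) for group in chunk_batches(stream, batches_per_file=4, max_batches=7)]
```

With these assumptions, the final group holds the three leftover batches rather than being discarded, which is the behavior the "Fix" item describes.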
Summary
Improves run_dataloader.py usability for users who need to extract partial training data or debug data loading performance (addresses #869).

Changes
- --max_batches flag to limit processing (useful for testing or partial data extraction)
- --local_data_root flag to auto-substitute remote paths with local cached data

Motivation
Users running run_dataloader.py with remote data (S3/R2) experience very slow performance due to per-chunk network requests (#869 reported only 44 batches after running overnight).

This PR makes the script more user-friendly by:
- Limiting runs with --max_batches for testing
- Using locally cached data with --local_data_root

Example usage
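
As a rough illustration of what --local_data_root does conceptually, the sketch below rewrites a remote path to a local one. It is an assumption-laden example, not the PR's implementation: the helper names, the list of remote prefixes, and the choice to drop the bucket name and keep only the object key are all hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical set of prefixes treated as "remote"; the real script may differ.
REMOTE_PREFIXES = ("s3://", "r2://", "gs://", "http://", "https://")


def is_remote(path: str) -> bool:
    """Return True if the path points at a remote store rather than local disk."""
    return path.startswith(REMOTE_PREFIXES)


def substitute_local_root(path: str, local_data_root: Optional[str]) -> str:
    """Map scheme://bucket/key to <local_data_root>/key (assumed layout).

    Local paths, and all paths when no local root is given, are returned unchanged.
    """
    if local_data_root is None or not is_remote(path):
        return path
    key = urlparse(path).path.lstrip("/")  # object key without the bucket
    return f"{local_data_root.rstrip('/')}/{key}"


# Hypothetical paths, for illustration only.
remote = "s3://bucket/preprocessed/part-000.npy"
local = substitute_local_root(remote, "/data/local/")
needs_warning = is_remote(remote) and not is_remote(local)
```

A check like `is_remote` is also the natural place to hang the warning about loading from remote paths without a local cache.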