Running run_dataloader.py is very slow #869

@andrewivan123

🐛 Describe the bug

I wanted to get the ordered training data for the first n indices by running run_dataloader.py, but it is very slow: I left it running overnight and it finished only 44 batches. I would like to get the first 20k steps. I expected this to take around 2 days, since that is roughly the training time on 8 H100 cards once the dataset has been downloaded. Is there a problem with OLMo's data server? Is there a way to improve the speed?
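
To put the observed rate in perspective, here is a rough extrapolation. It is only a sketch: the report does not say exactly how long "one night" was, so the 8-hour figure is an assumption, and the function and constant names are made up for illustration.

```python
# Rough extrapolation of the observed run_dataloader.py throughput.
# ASSUMPTION: "one night" is taken as 8 hours; the actual duration was not recorded.
HOURS_ELAPSED = 8.0
BATCHES_DONE = 44        # batches finished overnight (from the report)
TARGET_STEPS = 20_000    # number of steps wanted (from the report)


def estimate_days(batches_done: int, hours_elapsed: float, target_steps: int) -> float:
    """Return the days needed to reach target_steps at the observed rate."""
    batches_per_hour = batches_done / hours_elapsed
    hours_needed = target_steps / batches_per_hour
    return hours_needed / 24.0


days = estimate_days(BATCHES_DONE, HOURS_ELAPSED, TARGET_STEPS)
print(f"Estimated time for {TARGET_STEPS} steps: {days:.1f} days")
```

Under that assumption the run would need months rather than the roughly 2 days expected, which is why this looks like a bottleneck rather than normal behavior.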


Versions

Python 3.13.5
absl-py==2.3.1
accelerate==1.8.1
-e git+https://github.com/allenai/OLMo.git@f3dff833c880add075b123df9ddc31423086ef31#egg=ai2_olmo
ai2-olmo-core==2.1.0
ai2-olmo-eval==0.7.1
aiohappyeyeballs==2.6.1
aiohttp==3.12.13
aiosignal==1.3.2
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
attrs==25.3.0
beaker-gantry==2.7.1
beaker-py==2.4.4
black==23.12.1
blessed==1.21.0
boltons==25.0.0
boto3==1.38.46
botocore==1.38.46
build==1.2.2.post1
cached_path==1.7.3
cachetools==5.5.2
certifi==2025.6.15
cffi==1.17.1
chardet==5.2.0
charset-normalizer==2.0.12
click==8.2.1
click-help-colors==0.9.4
click-option-group==0.5.7
colorama==0.4.6
cryptography==45.0.4
DataProperty==1.1.0
datasets==3.6.0
dill==0.3.8
docutils==0.21.2
einops==0.8.1
enlighten==1.10.1
evaluate==0.4.5
face==24.0.0
filelock==3.18.0
flash_attn==2.8.0.post2
frozenlist==1.7.0
fsspec==2025.3.0
ftfy==6.3.1
gitdb==4.0.12
GitPython==3.1.44
glom==24.11.0
google-api-core==2.25.1
google-auth==2.40.3
google-cloud-core==2.4.3
google-cloud-storage==2.19.0
google-crc32c==1.7.1
google-resumable-media==2.7.2
googleapis-common-protos==1.70.0
grpcio==1.73.1
hf-xet==1.1.5
huggingface-hub==0.33.1
id==1.5.0
idna==3.10
importlib_resources==6.5.2
iniconfig==2.1.0
isort==5.12.0
jaraco.classes==3.4.0
jaraco.context==6.0.1
jaraco.functools==4.2.1
jeepney==0.9.0
Jinja2==3.1.6
jmespath==1.0.1
joblib==1.5.1
jsonlines==4.0.0
keyring==25.6.0
latexcodec==3.0.1
lightning-utilities==0.14.3
-e git+https://github.com/EleutherAI/lm-evaluation-harness@fcddf195ec6bb69c63e36d54d75354f6ecaabab7#egg=lm_eval
lxml==6.0.0
markdown-it-py==3.0.0
MarkupSafe==3.0.2
mbstrdecoder==1.1.4
mdurl==0.1.2
more-itertools==10.7.0
mpmath==1.3.0
msgspec==0.19.0
mtdata==0.4.0
multidict==6.6.2
multiprocess==0.70.16
mypy==1.3.0
mypy_extensions==1.1.0
necessary==0.4.3
networkx==3.5
nh3==0.2.21
nltk==3.9.1
numexpr==2.11.0
numpy==1.26.4
nvidia-cublas-cu12==12.6.4.1
nvidia-cuda-cupti-cu12==12.6.80
nvidia-cuda-nvrtc-cu12==12.6.77
nvidia-cuda-runtime-cu12==12.6.77
nvidia-cudnn-cu12==9.5.1.17
nvidia-cufft-cu12==11.3.0.4
nvidia-cufile-cu12==1.11.1.6
nvidia-curand-cu12==10.3.7.77
nvidia-cusolver-cu12==11.7.1.2
nvidia-cusparse-cu12==12.5.4.2
nvidia-cusparselt-cu12==0.6.3
nvidia-nccl-cu12==2.26.2
nvidia-nvjitlink-cu12==12.6.85
nvidia-nvtx-cu12==12.6.77
omegaconf==2.3.0
packaging==25.0
pandas==2.3.0
pathspec==0.12.1
pathvalidate==3.3.1
peft==0.16.0
petname==2.6
platformdirs==4.3.8
pluggy==1.6.0
portalocker==2.3.0
prefixed==0.9.0
propcache==0.3.2
proto-plus==1.26.1
protobuf==5.29.5
psutil==7.0.0
pyarrow==20.0.0
pyasn1==0.6.1
pyasn1_modules==0.4.2
pybind11==3.0.0
pybtex==0.24.0
pycparser==2.22
pydantic==2.11.7
pydantic_core==2.33.2
Pygments==2.19.2
pyproject_hooks==1.2.0
pytablewriter==1.2.1
pytest==8.4.1
pytest-sphinx==0.6.3
python-dateutil==2.9.0.post0
pytz==2025.2
PyYAML==6.0.2
readme_renderer==44.0
regex==2024.11.6
requests==2.32.4
requests-toolbelt==1.0.0
requirements-parser==0.13.0
rfc3986==2.0.0
rich==13.9.4
rouge_score==0.1.2
rsa==4.9.1
ruamel.yaml==0.18.14
ruamel.yaml.clib==0.2.12
ruff==0.12.1
s3transfer==0.13.0
sacrebleu==2.5.1
safetensors==0.5.3
scikit-learn==1.7.0
scipy==1.16.0
SecretStorage==3.3.3
sentencepiece @ file:///croot/sentencepiece-split_1742566759237/work/python
sentry-sdk==2.32.0
setproctitle==1.3.6
setuptools==78.1.1
six==1.17.0
smart-open==7.1.0
smashed==0.21.5
smmap==5.0.2
sqlitedict==2.1.0
sympy==1.14.0
tabledata==1.3.4
tabulate==0.9.0
tcolorpy==0.1.7
threadpoolctl==3.6.0
tokenizers==0.21.2
torch==2.7.1
torchmetrics==1.7.3
tqdm==4.67.1
tqdm-multiprocess==0.0.11
-e git+https://github.com/huggingface/transformers.git@67ddc82fbc7e52c6f42a395b4a6d278c55b77a39#egg=transformers
triton==3.3.1
trouting==0.3.3
twine==6.1.0
typepy==1.3.4
typing-inspection==0.4.1
typing_extensions==4.14.0
tzdata==2025.2
urllib3==1.26.20
wandb==0.20.1
wcwidth==0.2.13
wheel==0.45.1
wmtformat @ git+https://github.com/wmt-conference/wmt-format-tools.git@d46d4d75cf47095fbe7b15da29afd9348dfafeb1
word2number==1.1
wrapt==1.17.2
xxhash==3.5.0
yarl==1.20.1
zstandard==0.23.0

Labels: type/bug (an issue about a bug)
