cache_dir option in download_config in load_dataset is not respected #8029

@TsXor

Describe the bug

Downloaded files still go to ~/.cache/huggingface/hub/ even though I set the cache_dir option of the download_config passed to load_dataset.

Steps to reproduce the bug

I ran the freshly written script below and found that the downloaded files did not go where I wanted.

'''
Download the OpenWebText dataset, with optional proxy support
'''

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(description='Download the OpenWebText dataset')
    parser.add_argument('--output-path', required=True, metavar='PATH', help='output directory')
    parser.add_argument('--mirror', required=False, metavar='URL', help='HF mirror URL, e.g. https://hf-mirror.com')
    parser.add_argument('--proxy', required=False, metavar='URL', help='proxy URL')
    args = parser.parse_args()
else: args = None


import os
import shutil
from pathlib import Path
from typing import cast


if __name__ == '__main__':
    assert args is not None
    output_path = Path(args.output_path).resolve()
    proxy_url = None if args.proxy is None else str(args.proxy)
    mirror_url = None if args.mirror is None else str(args.mirror)

    output_path.mkdir(parents=True, exist_ok=True)
    download_cache_dir = output_path / 'download_cache'
    read_cache_dir = output_path / 'read_cache'
    save_dir = output_path / 'saved'
    complete_mark = output_path / 'completed'

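    def clear_cache():
        # ignore_errors=True: a cache directory may not exist (e.g. on a re-run after cleanup)
        shutil.rmtree(download_cache_dir, ignore_errors=True)
        shutil.rmtree(read_cache_dir, ignore_errors=True)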

    def download_and_save():
        if mirror_url is not None:
            os.environ["HF_ENDPOINT"] = mirror_url

        from datasets import DownloadConfig, load_dataset

        if proxy_url is not None: proxy_dict = { "http": proxy_url, "https": proxy_url }
        else: proxy_dict = None

        dataset = load_dataset(
            'openwebtext',
            cache_dir=str(read_cache_dir),
            download_config=DownloadConfig(cache_dir=download_cache_dir, proxies=proxy_dict)
        )
        dataset.save_to_disk(save_dir)

    if complete_mark.is_file():
        print('OpenWebText is already downloaded')
        clear_cache()
    else:
        download_and_save()
        complete_mark.touch(exist_ok=True)
        clear_cache()
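
To confirm where the files actually land, the sizes of the candidate directories can be compared after a run, before `clear_cache()` deletes anything. This is only a minimal sketch: `dir_size_bytes` and `report_cache_usage` are hypothetical helpers written for this report, not part of `datasets`, and the default hub cache path is assumed to be the one mentioned above.

```python
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Total size of regular files under root; 0 if root does not exist."""
    if not root.is_dir():
        return 0
    # Skip symlinks so that hub-cache snapshot links to blobs are not counted twice.
    return sum(
        p.stat().st_size
        for p in root.rglob('*')
        if p.is_file() and not p.is_symlink()
    )


def report_cache_usage(candidates: dict[str, Path]) -> dict[str, int]:
    """Map each label to the number of bytes currently stored under its directory."""
    return {name: dir_size_bytes(path) for name, path in candidates.items()}


if __name__ == '__main__':
    # Assumed locations: adjust 'output' to whatever was passed as --output-path.
    output = Path('output').resolve()
    usage = report_cache_usage({
        'download_cache_dir': output / 'download_cache',
        'read_cache_dir': output / 'read_cache',
        'default hub cache': Path.home() / '.cache' / 'huggingface' / 'hub',
    })
    for name, size in usage.items():
        print(f'{name}: {size} bytes')
```

If `download_cache_dir` stays near zero while the default hub cache grows, that matches the behavior described in this report.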

Expected behavior

Downloaded files go to the directory I specified in download_config.
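
A possible stopgap, untested against this exact setup: `huggingface_hub` documents the `HF_HUB_CACHE` environment variable as the location of its hub cache, and it is read when the library is imported. The sketch below sets it before any Hugging Face import, the same way the script sets `HF_ENDPOINT`. `point_hub_cache_at` is a hypothetical helper name, and whether this fully replaces `DownloadConfig(cache_dir=...)` for `load_dataset` is an assumption, not something verified here.

```python
import os
from pathlib import Path


def point_hub_cache_at(directory: Path) -> Path:
    """Create directory and export it as HF_HUB_CACHE.

    Must be called before `datasets` or `huggingface_hub` is imported,
    because the variable is read at import time.
    """
    directory = directory.resolve()
    directory.mkdir(parents=True, exist_ok=True)
    os.environ['HF_HUB_CACHE'] = str(directory)
    return directory
```

In the script above this would be called at the top of `download_and_save()`, for example `point_hub_cache_at(download_cache_dir)`, before `from datasets import DownloadConfig, load_dataset`.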

Environment info

> uv run datasets-cli env

Copy-and-paste the text below in your GitHub issue.

- `datasets` version: 4.6.0
- Platform: Windows-11-10.0.26200-SP0
- Python version: 3.14.3
- `huggingface_hub` version: 1.5.0
- PyArrow version: 23.0.1
- Pandas version: 3.0.1
- `fsspec` version: 2026.2.0
