Empty string defaults for HF dataset config fields cause ValueError in datasets.load_dataset() #3097

@katjasrz

Description

Bug report

The default empty string values for hf_data_dir, hf_train_files, and hf_eval_files in base.yml cause a ValueError when passed to datasets.load_dataset().

Location

Config file: https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/configs/base.yml (lines 614-616):

  hf_data_dir: ''
  hf_train_files: ''
  hf_eval_files: ''

Affected code

https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/input_pipeline/_hf_data_processing.py

Line 367

  train_ds = datasets.load_dataset(
      config.hf_path,
      name=config.hf_name,
      data_dir=config.hf_data_dir,      # '' passed here
      data_files=config.hf_train_files,  # '' passed here
      split=config.train_split,
      streaming=True,
      token=config.hf_access_token,
  )

Line 422

  eval_ds = datasets.load_dataset(
      config.hf_path,
      name=config.hf_name,
      data_dir=config.hf_data_dir,      # '' passed here
      data_files=config.hf_eval_files,   # '' passed here
      split=config.hf_eval_split,
      streaming=True,
      token=config.hf_access_token,
  )

Root Cause

The datasets library distinguishes between None (use default behavior) and '' (invalid empty string). When users don't specify these optional fields, the empty string defaults are passed through, causing the error.
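One possible fix on the MaxText side is to normalize these fields before they reach datasets.load_dataset(). The sketch below is only an illustration: none_if_empty and hf_load_kwargs are hypothetical helper names, not existing MaxText functions, and the sketch assumes the only problematic value is the literal '' default from base.yml.

```python
from typing import Any, Dict, Optional


def none_if_empty(value: Optional[Any]) -> Optional[Any]:
    """Map the '' default from base.yml to None; pass other values through."""
    # Hypothetical helper: only the exact empty string is treated as "unset".
    return None if value == "" else value


def hf_load_kwargs(data_dir: Optional[Any], data_files: Optional[Any]) -> Dict[str, Any]:
    """Build the data_dir/data_files keyword arguments for load_dataset."""
    # Hypothetical helper: returns a dict that can be splatted into the call.
    return {
        "data_dir": none_if_empty(data_dir),
        "data_files": none_if_empty(data_files),
    }


# Assumed usage at the two call sites shown above (not executed here):
#   train_ds = datasets.load_dataset(
#       config.hf_path,
#       name=config.hf_name,
#       split=config.train_split,
#       streaming=True,
#       token=config.hf_access_token,
#       **hf_load_kwargs(config.hf_data_dir, config.hf_train_files),
#   )
```

Changing the defaults in base.yml would be an alternative, but normalizing at the call site also covers users who explicitly leave the fields as empty strings in their own configs.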

Workaround

Users can monkey-patch datasets.load_dataset to convert empty strings to None:

import datasets

# Guard so the patch is applied only once, even if this code runs repeatedly.
if not hasattr(datasets, '_original_load_dataset'):
    datasets._original_load_dataset = datasets.load_dataset

    def _patched_load_dataset(*args, **kwargs):
        # Only keyword arguments are checked, which matches how MaxText calls load_dataset.
        for key in ['data_files', 'data_dir']:
            if key in kwargs and kwargs[key] == '':
                kwargs[key] = None
        return datasets._original_load_dataset(*args, **kwargs)

    datasets.load_dataset = _patched_load_dataset

Logs/Output

Error Message

ValueError: Empty 'data_files': ''. It should be either non-empty or None (default).

Environment Information

Running on a system with 8 x H100 GPUs, in the latest NGC container.

Additional Context

No response
