Description
Bug report
The default empty string values for hf_data_dir, hf_train_files, and hf_eval_files in base.yml cause a ValueError when passed to datasets.load_dataset().
Location
Config file: https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/configs/base.yml (lines 614-616):
hf_data_dir: ''
hf_train_files: ''
hf_eval_files: ''
Affected code
Line 367
train_ds = datasets.load_dataset(
    config.hf_path,
    name=config.hf_name,
    data_dir=config.hf_data_dir,  # '' passed here
    data_files=config.hf_train_files,  # '' passed here
    split=config.train_split,
    streaming=True,
    token=config.hf_access_token,
)
Line 422
eval_ds = datasets.load_dataset(
    config.hf_path,
    name=config.hf_name,
    data_dir=config.hf_data_dir,  # '' passed here
    data_files=config.hf_eval_files,  # '' passed here
    split=config.hf_eval_split,
    streaming=True,
    token=config.hf_access_token,
)
Root Cause
The datasets library distinguishes between None (use default behavior) and '' (invalid empty string). When users don't set these optional fields, the empty-string defaults from base.yml are passed unchanged to datasets.load_dataset(), which raises the ValueError shown under Logs/Output.
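One possible fix, sketched here rather than taken from the MaxText codebase, is to normalize the values at the call site so that '' becomes None before it reaches datasets.load_dataset(). The helper names none_if_empty and build_load_dataset_kwargs are hypothetical; the config attributes are the ones shown in the affected code above.

```python
from typing import Any, Optional


def none_if_empty(value: Optional[str]) -> Optional[str]:
    """Map '' (the base.yml default) to None; leave any other value untouched."""
    return None if value == "" else value


def build_load_dataset_kwargs(config: Any, data_files: Optional[str], split: str) -> dict:
    """Assemble keyword arguments for datasets.load_dataset.

    Hypothetical helper: only data_dir and data_files are normalized,
    since those are the two fields named in this report.
    """
    return {
        "name": config.hf_name,
        "data_dir": none_if_empty(config.hf_data_dir),
        "data_files": none_if_empty(data_files),
        "split": split,
        "streaming": True,
        "token": config.hf_access_token,
    }
```

Under these assumptions, the training call would become datasets.load_dataset(config.hf_path, **build_load_dataset_kwargs(config, config.hf_train_files, config.train_split)), and the eval call would pass config.hf_eval_files and config.hf_eval_split instead.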
Workaround
Users can monkey-patch datasets.load_dataset to convert empty strings to None:
import datasets

# Guard so the patch is applied only once, even if this snippet runs again.
if not hasattr(datasets, '_original_load_dataset'):
    datasets._original_load_dataset = datasets.load_dataset

    def _patched_load_dataset(*args, **kwargs):
        # Replace empty-string keyword arguments with None before delegating.
        for key in ['data_files', 'data_dir']:
            if key in kwargs and kwargs[key] == '':
                kwargs[key] = None
        return datasets._original_load_dataset(*args, **kwargs)

    datasets.load_dataset = _patched_load_dataset
Logs/Output
Error Message
ValueError: Empty 'data_files': ''. It should be either non-empty or None (default).
Environment Information
Running on a system with 8 x H100 GPUs, in the latest NGC container.
Additional Context
No response