
omegaconf error (with ddp_spawn): Unsupported interpolation type hydra #495

@zekunlou

Description

I think OmegaConf fails during ddp_spawn training when interpolating strings like ${hydra:xxxxxxx}.

The simplest way to reproduce such an error on my machine is as follows:

  1. clone the repo
  2. run python src/train.py trainer=ddp trainer.max_epochs=5 logger=csv

The command-line output and error trace follow (I left out the parts that seemed unimportant to me, marked by ########):

│ 18 │ test_loss    │ MeanMetric         │      0 │
│ 19 │ val_acc_best │ MaxMetric          │      0 │
└────┴──────────────┴────────────────────┴────────┘
Trainable params: 68.0 K                                                                                                                                      
Non-trainable params: 0                                                                                                                                       
Total params: 68.0 K                                                                                                                                          
Total estimated model params size (MB): 0                                                                                                                     
[2023-01-04 16:34:17,037][src.utils.utils][ERROR] - 
Traceback (most recent call last):
  File "/nvme/louzekun/playground/lightning-hydra-template-1.5.0/src/utils/utils.py", line 38, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
######## (This part is about multiprocessing)
  File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
######## (This part is about omegaconf)
  File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/omegaconf/base.py", line 703, in _evaluate_custom_resolver
    raise UnsupportedInterpolationType(
omegaconf.errors.UnsupportedInterpolationType: Unsupported interpolation type hydra
    full_key: trainer.default_root_dir
    object_type=dict

[2023-01-04 16:34:17,039][src.utils.utils][INFO] - Output dir: /nvme/louzekun/playground/lightning-hydra-template-1.5.0/logs/train/runs/2023-01-04_16-34-10
[2023-01-04 16:34:17,039][src.utils.utils][INFO] - Closing loggers...
Error executing job with overrides: ['trainer=ddp', 'trainer.max_epochs=5', 'logger=csv']
Traceback (most recent call last):
  File "/nvme/louzekun/playground/lightning-hydra-template-1.5.0/src/train.py", line 122, in main
    metric_dict, _ = train(cfg)
######## (This part repeated the same errors as above)
  File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/omegaconf/base.py", line 703, in _evaluate_custom_resolver
    raise UnsupportedInterpolationType(
omegaconf.errors.UnsupportedInterpolationType: Unsupported interpolation type hydra
    full_key: trainer.default_root_dir
    object_type=dict

One can see in configs/trainer/default.yaml that trainer.default_root_dir=${paths.output_dir}, and configs/paths/default.yaml in turn sets output_dir: ${hydra:runtime.output_dir}.
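For illustration, that interpolation chain can be reproduced standalone (a minimal sketch; only the two config values above come from the repo):

from omegaconf import OmegaConf

# The two values below are copied from configs/trainer/default.yaml and
# configs/paths/default.yaml; OmegaConf.create is used here only to make a
# self-contained demo.
cfg = OmegaConf.create(
    {
        "paths": {"output_dir": "${hydra:runtime.output_dir}"},
        "trainer": {"default_root_dir": "${paths.output_dir}"},
    }
)

# Accessing the node resolves ${paths.output_dir} and then
# ${hydra:runtime.output_dir}; in any process where Hydra never registered
# its 'hydra' resolver, this raises UnsupportedInterpolationType, exactly as
# in the trace above.
print(cfg.trainer.default_root_dir)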

The error UnsupportedInterpolationType is raised in omegaconf/base.py (line 703 in the trace), because no resolver named 'hydra' is registered in the failing process.

It seems that ${hydra:runtime.xxxxxx} resolves fine before and after training (otherwise the pl.Trainer could not have been instantiated, and there would be no log lines like [2023-01-04 16:34:17,039][src.utils.utils][INFO] ... logs/train/runs/2023-01-04_16-34-10), but fails during ddp_spawn training (note the "Process 0 terminated with the following error" in the trace).
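If that guess is right, the failure should be reproducible without Lightning at all, since torch.multiprocessing.spawn starts fresh interpreter processes that never run Hydra's setup. A minimal sketch of that assumption (not code from the template):

import torch.multiprocessing as mp
from omegaconf import OmegaConf

def worker(rank: int, cfg) -> None:
    # The spawned child re-imports modules but never runs the Hydra app
    # launch, so no 'hydra' resolver is registered here and this access
    # should fail with UnsupportedInterpolationType.
    print(cfg.trainer.default_root_dir)

if __name__ == "__main__":
    cfg = OmegaConf.create(
        {"trainer": {"default_root_dir": "${hydra:runtime.output_dir}"}}
    )
    mp.spawn(worker, args=(cfg,), nprocs=1)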

To verify my guess, after cfg was created and before train(cfg) was called, I removed the 'hydra' resolver with OmegaConf.clear_resolver("hydra") and registered a new 'hydra' resolver, implemented as a class with a __call__ method built around HydraConfig.get(). Exactly the same error occurred.
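For reference, that experiment looked roughly like this (a reconstruction; the class name is hypothetical):

from hydra.core.hydra_config import HydraConfig
from omegaconf import OmegaConf

class HydraResolver:
    """Hypothetical replacement for Hydra's own 'hydra' resolver."""

    def __init__(self) -> None:
        # Snapshot of the hydra config node, taken in the main process.
        self.hydra_cfg = HydraConfig.get()

    def __call__(self, path: str):
        # Walk e.g. "runtime.output_dir" through the snapshot.
        node = self.hydra_cfg
        for key in path.split("."):
            node = getattr(node, key)
        return node

OmegaConf.clear_resolver("hydra")
OmegaConf.register_new_resolver("hydra", HydraResolver())

That this changed nothing is consistent with resolvers living in per-process OmegaConf state: a child started with the spawn method re-imports everything from scratch and does not inherit registrations made at runtime in the parent.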

My Python package versions:

# Name                    Version                   Build  Channel
hydra-colorlog            1.2.0                    pypi_0    pypi
hydra-core                1.3.1                    pypi_0    pypi
hydra-optuna-sweeper      1.2.0                    pypi_0    pypi
pytorch                   1.12.1          py3.10_cuda11.3_cudnn8.3.2_0    pytorch
pytorch-cluster           1.6.0           py310_torch_1.12.0_cu113    pyg
pytorch-lightning         1.8.3                    pypi_0    pypi
pytorch-mutex             1.0                        cuda    pytorch
pytorch-scatter           2.0.9           py310_torch_1.12.0_cu113    pyg
pytorch-sparse            0.6.15          py310_torch_1.12.0_cu113    pyg
torchaudio                0.12.1              py310_cu113    pytorch
torchmetrics              0.11.0             pyhd8ed1ab_0    conda-forge
torchvision               0.13.1              py310_cu113    pytorch

My GPUs are 8× A100-SXM4-80GB.
My GPU driver version:

NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4

So is this my own mistake, or is there some remedy? Thank you!
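One workaround that seems plausible to me (an assumption, not something the template documents) is to resolve the config eagerly in the main process, before anything is handed to the spawned workers, so no ${hydra:...} interpolation is left to evaluate in a child:

from omegaconf import DictConfig, OmegaConf

def main(cfg: DictConfig) -> None:
    # Hypothetical placement in src/train.py's main(): resolve every
    # interpolation while the 'hydra' resolver is still registered (i.e. in
    # the main Hydra process), so spawned workers only ever see plain values.
    OmegaConf.resolve(cfg)
    metric_dict, _ = train(cfg)

OmegaConf.resolve replaces each interpolation with its current value, so this trades the error for a config that is frozen at launch time.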
