Description
I think omegaconf fails to resolve interpolated strings like `${hydra:xxxxxxx}` during ddp_spawn training.
The simplest way to reproduce such an error on my machine is as follows:
- pull the repo
- run `python src/train.py trainer=ddp trainer.max_epochs=5 logger=csv`
The command-line output and error trace are below (I leave out the parts that seemed unimportant to me, marked by ########):
```
│ 18 │ test_loss │ MeanMetric │ 0 │
│ 19 │ val_acc_best │ MaxMetric │ 0 │
└────┴──────────────┴────────────────────┴────────┘
Trainable params: 68.0 K
Non-trainable params: 0
Total params: 68.0 K
Total estimated model params size (MB): 0
[2023-01-04 16:34:17,037][src.utils.utils][ERROR] -
Traceback (most recent call last):
  File "/nvme/louzekun/playground/lightning-hydra-template-1.5.0/src/utils/utils.py", line 38, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
######## (This part is about multiprocessing)
  File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
######## (This part is about omegaconf)
  File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/omegaconf/base.py", line 703, in _evaluate_custom_resolver
    raise UnsupportedInterpolationType(
omegaconf.errors.UnsupportedInterpolationType: Unsupported interpolation type hydra
    full_key: trainer.default_root_dir
    object_type=dict
[2023-01-04 16:34:17,039][src.utils.utils][INFO] - Output dir: /nvme/louzekun/playground/lightning-hydra-template-1.5.0/logs/train/runs/2023-01-04_16-34-10
[2023-01-04 16:34:17,039][src.utils.utils][INFO] - Closing loggers...
Error executing job with overrides: ['trainer=ddp', 'trainer.max_epochs=5', 'logger=csv']
Traceback (most recent call last):
  File "/nvme/louzekun/playground/lightning-hydra-template-1.5.0/src/train.py", line 122, in main
    metric_dict, _ = train(cfg)
######## (This part repeated the same errors as above)
  File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/omegaconf/base.py", line 703, in _evaluate_custom_resolver
    raise UnsupportedInterpolationType(
omegaconf.errors.UnsupportedInterpolationType: Unsupported interpolation type hydra
    full_key: trainer.default_root_dir
    object_type=dict
```
One can see in configs/trainer/default.yaml that `default_root_dir: ${paths.output_dir}`, and configs/paths/default.yaml in turn sets `output_dir: ${hydra:runtime.output_dir}`.
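Spelled out, the interpolation chain is (only the relevant keys shown), so resolving `trainer.default_root_dir` ultimately requires a resolver named `hydra`:

```yaml
# configs/trainer/default.yaml
default_root_dir: ${paths.output_dir}

# configs/paths/default.yaml
output_dir: ${hydra:runtime.output_dir}
```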
The `UnsupportedInterpolationType` error is raised in `_evaluate_custom_resolver` (omegaconf/base.py:703 in the trace), where no resolver named `hydra` can be found.
It seems that `${hydra:runtime.xxxxxx}` works well before and after training (otherwise the `pl.Trainer` could not be properly instantiated, and there would be no logs like `[2023-01-04 16:34:17,039][src.utils.utils][INFO] ... logs/train/runs/2023-01-04_16-34-10`), but fails during ddp_spawn training (note the `Process 0 terminated with the following error` in the trace above).
To verify my guess, after `cfg` was created and before `train(cfg)` was called, I deleted the `hydra` resolver from OmegaConf with `OmegaConf.clear_resolver("hydra")` and registered a new resolver named `hydra` by adding a `__call__` method to a class built on `HydraConfig.get()`. The exact same error happened as above.
My Python package versions:
```
# Name                 Version  Build                         Channel
hydra-colorlog         1.2.0    pypi_0                        pypi
hydra-core             1.3.1    pypi_0                        pypi
hydra-optuna-sweeper   1.2.0    pypi_0                        pypi
pytorch                1.12.1   py3.10_cuda11.3_cudnn8.3.2_0  pytorch
pytorch-cluster        1.6.0    py310_torch_1.12.0_cu113      pyg
pytorch-lightning      1.8.3    pypi_0                        pypi
pytorch-mutex          1.0      cuda                          pytorch
pytorch-scatter        2.0.9    py310_torch_1.12.0_cu113      pyg
pytorch-sparse         0.6.15   py310_torch_1.12.0_cu113      pyg
torchaudio             0.12.1   py310_cu113                   pytorch
torchmetrics           0.11.0   pyhd8ed1ab_0                  conda-forge
torchvision            0.13.1   py310_cu113                   pytorch
```
My GPUs are 8xA100-SXM4-80GB
My GPU driver version:
`NVIDIA-SMI 470.129.06  Driver Version: 470.129.06  CUDA Version: 11.4`
So, is this my own mistake, or might there be a remedy? Thank you!