Description
I think omegaconf fails to resolve interpolated strings like `${hydra:xxxxxxx}` during ddp_spawn training.
The simplest way to reproduce such an error on my machine is as follows:
- pull the repo
- run `python src/train.py trainer=ddp trainer.max_epochs=5 logger=csv`
The command-line output and error trace are below (I leave out the parts that seemed unimportant to me, marked by ########):
```
│ 18 │ test_loss │ MeanMetric │ 0 │
│ 19 │ val_acc_best │ MaxMetric │ 0 │
└────┴──────────────┴────────────────────┴────────┘
Trainable params: 68.0 K
Non-trainable params: 0
Total params: 68.0 K
Total estimated model params size (MB): 0
[2023-01-04 16:34:17,037][src.utils.utils][ERROR] -
Traceback (most recent call last):
  File "/nvme/louzekun/playground/lightning-hydra-template-1.5.0/src/utils/utils.py", line 38, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
######## (This part is about multiprocessing)
  File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
######## (This part is about omegaconf)
  File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/omegaconf/base.py", line 703, in _evaluate_custom_resolver
    raise UnsupportedInterpolationType(
omegaconf.errors.UnsupportedInterpolationType: Unsupported interpolation type hydra
    full_key: trainer.default_root_dir
    object_type=dict
[2023-01-04 16:34:17,039][src.utils.utils][INFO] - Output dir: /nvme/louzekun/playground/lightning-hydra-template-1.5.0/logs/train/runs/2023-01-04_16-34-10
[2023-01-04 16:34:17,039][src.utils.utils][INFO] - Closing loggers...
Error executing job with overrides: ['trainer=ddp', 'trainer.max_epochs=5', 'logger=csv']
Traceback (most recent call last):
  File "/nvme/louzekun/playground/lightning-hydra-template-1.5.0/src/train.py", line 122, in main
    metric_dict, _ = train(cfg)
######## (This part repeated the same errors as above)
  File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/omegaconf/base.py", line 703, in _evaluate_custom_resolver
    raise UnsupportedInterpolationType(
omegaconf.errors.UnsupportedInterpolationType: Unsupported interpolation type hydra
    full_key: trainer.default_root_dir
    object_type=dict
```
One can see in configs/trainer/default.yaml that `default_root_dir: ${paths.output_dir}`, and configs/paths/default.yaml in turn sets `output_dir: ${hydra:runtime.output_dir}`.
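Spelled out, the interpolation chain is (only the relevant keys shown), so resolving `trainer.default_root_dir` ultimately requires a resolver named `hydra`:

```yaml
# configs/trainer/default.yaml
default_root_dir: ${paths.output_dir}

# configs/paths/default.yaml
output_dir: ${hydra:runtime.output_dir}
```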
The `UnsupportedInterpolationType` error is raised in `_evaluate_custom_resolver` (omegaconf/base.py:703 in the trace), where no resolver named `hydra` can be found.
It seems that `${hydra:runtime.xxxxxx}` works well before and after training (otherwise the `pl.Trainer` could not be properly instantiated, and there would be no logs like `[2023-01-04 16:34:17,039][src.utils.utils][INFO] ... logs/train/runs/2023-01-04_16-34-10`), but fails during ddp_spawn training (note the `Process 0 terminated with the following error` in the trace above).
To verify my guess, after `cfg` was created and before `train(cfg)` was called, I deleted the `hydra` resolver from OmegaConf with `OmegaConf.clear_resolver("hydra")` and registered a new resolver named `hydra` by adding a `__call__` method to a class built on `HydraConfig.get()`. The exact same error happened as above.
My Python package versions:
```
# Name                 Version  Build                         Channel
hydra-colorlog         1.2.0    pypi_0                        pypi
hydra-core             1.3.1    pypi_0                        pypi
hydra-optuna-sweeper   1.2.0    pypi_0                        pypi
pytorch                1.12.1   py3.10_cuda11.3_cudnn8.3.2_0  pytorch
pytorch-cluster        1.6.0    py310_torch_1.12.0_cu113      pyg
pytorch-lightning      1.8.3    pypi_0                        pypi
pytorch-mutex          1.0      cuda                          pytorch
pytorch-scatter        2.0.9    py310_torch_1.12.0_cu113      pyg
pytorch-sparse         0.6.15   py310_torch_1.12.0_cu113      pyg
torchaudio             0.12.1   py310_cu113                   pytorch
torchmetrics           0.11.0   pyhd8ed1ab_0                  conda-forge
torchvision            0.13.1   py310_cu113                   pytorch
```
My GPUs are 8xA100-SXM4-80GB
My GPU driver version:
`NVIDIA-SMI 470.129.06  Driver Version: 470.129.06  CUDA Version: 11.4`
So, is this my own mistake, or might there be a remedy? Thank you!