Error during validation when using 2 GPUs (works fine with one GPU) #17

@ZHO9504

Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
EM: 61.2193, f1: 69.6262, qas_used_fraction: 1.0000, loss: 4.3453 ||: : 17502it [6:26:59, 1.33s/it]
2019-07-20 15:09:22,954 - INFO - allennlp.training.trainer - Validating
EM: 48.9301, f1: 59.0550, qas_used_fraction: 1.0000, loss: 5.1889 ||: : 94it [00:41, 2.15it/s]
Traceback (most recent call last):
File "/home/gpu245/anaconda3/envs/emnlp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/gpu245/anaconda3/envs/emnlp/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/run.py", line 21, in <module>
run()
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
args.func(args)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 116, in train_model_from_args
args.cache_prefix)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 160, in train_model_from_file
cache_directory, cache_prefix)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 243, in train_model
metrics = trainer.train()
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 493, in train
val_loss, num_batches = self._validation_loss()
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 430, in _validation_loss
loss = self.batch_loss(batch_group, for_training=False)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 258, in batch_loss
output_dict = training_util.data_parallel(batch_group, self.model, self._cuda_devices)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/util.py", line 336, in data_parallel
losses = gather([output['loss'].unsqueeze(0) for output in outputs], used_device_ids[0], 0)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
return gather_map(outputs)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
return Gather.apply(target_device, dim, outputs)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/cuda/comm.py", line 165, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: tensor.ndimension() == static_cast<int64_t>(expected_size.size()) ASSERT FAILED at /pytorch/torch/csrc/cuda/comm.cpp:232, please report a bug to PyTorch. (gather at /pytorch/torch/csrc/cuda/comm.cpp:232)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f6d3dad8441 in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f6d3dad7d7a in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: torch::cuda::gather(c10::ArrayRef<at::Tensor>, long, c10::optional<int>) + 0x962 (0x7f6d132be792 in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
frame #3: <unknown function> + 0x5a3d1c (0x7f6d33e0bd1c in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x130fac (0x7f6d33998fac in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: _PyMethodDef_RawFastCallKeywords + 0x264 (0x5567e0e3c6e4 in python3.7)
frame #6: _PyCFunction_FastCallKeywords + 0x21 (0x5567e0e3c801 in python3.7)
frame #7: _PyEval_EvalFrameDefault + 0x4e8c (0x5567e0e982bc in python3.7)
frame #8: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #9: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #10: _PyEval_EvalFrameDefault + 0x4aa9 (0x5567e0e97ed9 in python3.7)
frame #11: _PyEval_EvalCodeWithName + 0xbb9 (0x5567e0dd9db9 in python3.7)
frame #12: _PyFunction_FastCallDict + 0x1d5 (0x5567e0dda5d5 in python3.7)
frame #13: THPFunction_apply(_object*, _object*) + 0x6b1 (0x7f6d33c1c301 in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #14: PyCFunction_Call + 0xe7 (0x5567e0dffbe7 in python3.7)
frame #15: _PyEval_EvalFrameDefault + 0x5d21 (0x5567e0e99151 in python3.7)
frame #16: _PyEval_EvalCodeWithName + 0xbb9 (0x5567e0dd9db9 in python3.7)
frame #17: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #18: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #19: _PyEval_EvalCodeWithName + 0xbb9 (0x5567e0dd9db9 in python3.7)
frame #20: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #21: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #22: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #23: _PyEval_EvalFrameDefault + 0x4aa9 (0x5567e0e97ed9 in python3.7)
frame #24: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #25: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #26: _PyEval_EvalFrameDefault + 0x14ce (0x5567e0e948fe in python3.7)
frame #27: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #28: _PyEval_EvalFrameDefault + 0x6a0 (0x5567e0e93ad0 in python3.7)
frame #29: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #30: _PyEval_EvalFrameDefault + 0x6a0 (0x5567e0e93ad0 in python3.7)
frame #31: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #32: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #33: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #34: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #35: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #36: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #37: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #38: _PyEval_EvalFrameDefault + 0x4aa9 (0x5567e0e97ed9 in python3.7)
frame #39: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #40: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #41: _PyEval_EvalFrameDefault + 0x14ce (0x5567e0e948fe in python3.7)
frame #42: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #43: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #44: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #45: PyEval_EvalCodeEx + 0x44 (0x5567e0dda3c4 in python3.7)
frame #46: PyEval_EvalCode + 0x1c (0x5567e0dda3ec in python3.7)
frame #47: <unknown function> + 0x1e004d (0x5567e0ea304d in python3.7)
frame #48: _PyMethodDef_RawFastCallKeywords + 0xe9 (0x5567e0e3c569 in python3.7)
frame #49: _PyCFunction_FastCallKeywords + 0x21 (0x5567e0e3c801 in python3.7)
frame #50: _PyEval_EvalFrameDefault + 0x4755 (0x5567e0e97b85 in python3.7)
frame #51: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #52: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #53: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #54: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #55: _PyFunction_FastCallDict + 0x1d5 (0x5567e0dda5d5 in python3.7)
frame #56: <unknown function> + 0x222d77 (0x5567e0ee5d77 in python3.7)
frame #57: <unknown function> + 0x23ae95 (0x5567e0efde95 in python3.7)
frame #58: _Py_UnixMain + 0x3c (0x5567e0efdf7c in python3.7)
frame #59: __libc_start_main + 0xf0 (0x7f6d4ea12830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #60: <unknown function> + 0x1e0122 (0x5567e0ea3122 in python3.7)

I don't know why this happens. The same run works with one GPU; the crash only shows up during validation with 2 GPUs.
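
A possible way to narrow this down, offered as a sketch rather than a confirmed diagnosis: the failed assertion in `gather` appears to compare each tensor's number of dimensions against the first tensor's, so the crash would be consistent with the per-GPU `loss` tensors having different ranks (for example a 0-d scalar on one replica and a 1-d tensor on the other). The helper below is hypothetical (`find_mismatched_losses` is not part of AllenNLP or PyTorch) and works on plain shape tuples, so it runs without a GPU; in a real run one would pass something like `[tuple(output['loss'].shape) for output in outputs]` just before the `gather` call in `data_parallel`.

```python
from typing import List, Sequence, Tuple


def find_mismatched_losses(shapes: Sequence[Tuple[int, ...]]) -> List[int]:
    """Return the replica indices whose loss rank differs from replica 0.

    `shapes` holds one shape tuple per replica, e.g. () for a scalar loss
    or (1,) for a one-element vector. This is a hypothetical debugging
    helper, not an AllenNLP or PyTorch API.
    """
    if not shapes:
        return []
    expected_rank = len(shapes[0])
    return [i for i, shape in enumerate(shapes) if len(shape) != expected_rank]


# Made-up example shapes (not taken from the log above):
# replica 0 returned a scalar loss, replica 1 a 1-d loss.
example_shapes = [(), (1,)]
print(find_mismatched_losses(example_shapes))
```

If this reports a non-empty list right before the crash, the next thing to check would be which code path in the model's `forward` produces the differently shaped loss on that replica (for instance, an empty or NaN batch).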
