Silent failures were recently observed in Mellanox Jenkins.
The failures seem to occur only on the GitHub v2.x branch.
20:54:55 + /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/bin/mpirun -np 8 \
-bind-to none -mca orte_tmpdir_base /tmp/tmp.8mj45mghXh --report-state-on-timeout \
--get-stack-traces --timeout 900 -mca btl_openib_if_include mlx5_0:1 \
-x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm \
-mca pml ob1 -mca btl self,openib \
-mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 taskset -c 6,7 \
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/examples/hello_c
20:54:55 [1499968495.528199] [jenkins03:1355 :0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.535609] [jenkins03:1354 :0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.534361] [jenkins03:1359 :0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.541761] [jenkins03:1356 :0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.552215] [jenkins03:1360 :0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.560606] [jenkins03:1361 :0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.562930] [jenkins03:1353 :0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.567548] [jenkins03:1363 :0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3496.08
20:54:56 + jenkins_cleanup
20:54:56 + echo 'Script exited with code = 1'
20:54:56 Script exited with code = 1
20:54:56 + rm -rf /tmp/tmp.8mj45mghXh
20:54:56 + echo 'rm -rf ... returned 0'
20:54:56 rm -rf ... returned 0
21:43:05 Hello, world, I am 4 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 6 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 0 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 2 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 7 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 5 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 1 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 3 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
$/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/bin/mpirun --debug-daemons -np 8 -bind-to none -mca orte_tmpdir_base /tmp/tmp.8mj45mghXh --report-state-on-timeout --get-stack-traces --timeout 900 -mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm -mca pml ob1 -mca btl self,tcp -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 taskset -c 6,7 /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/examples/hello_c
[jenkins03:01400] [[15875,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[jenkins03:01400] [[15875,0],0] orted_cmd: received add_local_procs
MPIR_being_debugged = 0
MPIR_debug_state = 1
MPIR_partial_attach_ok = 1
MPIR_i_am_starter = 0
MPIR_forward_output = 0
MPIR_proctable_size = 8
MPIR_proctable:
(i, host, exe, pid) = (0, jenkins03, /usr/bin/taskset, 1416)
(i, host, exe, pid) = (1, jenkins03, /usr/bin/taskset, 1417)
(i, host, exe, pid) = (2, jenkins03, /usr/bin/taskset, 1419)
(i, host, exe, pid) = (3, jenkins03, /usr/bin/taskset, 1420)
(i, host, exe, pid) = (4, jenkins03, /usr/bin/taskset, 1421)
(i, host, exe, pid) = (5, jenkins03, /usr/bin/taskset, 1423)
(i, host, exe, pid) = (6, jenkins03, /usr/bin/taskset, 1428)
(i, host, exe, pid) = (7, jenkins03, /usr/bin/taskset, 1431)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
Hello, world, I am 2 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 4 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 0 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 3 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 7 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 6 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 1 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 5 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
[jenkins03:01400] [[15875,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_EXIT_CMD
[jenkins03:01400] [[15875,0],0] orted_cmd: received exit cmd
[jenkins03:01400] [[15875,0],0] orted_cmd: all routes and children gone - exiting
The Mellanox Jenkins script has been updated to print the exit status, so this behavior should not cause such confusion in the future.
Background information
Silent failures were recently observed in Mellanox Jenkins.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
The failures seem to occur only on the GitHub v2.x branch.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Regular Mellanox CI build
Please describe the system on which you are running
Details of the problem
The following command silently fails:
The expected output is:
The same command with btl/tcp works fine:
Here is a more detailed log (with btl verbose on):
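To make the openib/tcp comparison repeatable, the same command can be run once per BTL and the exit status printed each time. The following is only a sketch: `compare_btls` is a hypothetical helper, not part of the CI script, and it selects the components through Open MPI's `OMPI_MCA_<param>` environment variables, so the command passed in should not also set `-mca pml` or `-mca btl` on its command line.

```shell
# Hypothetical helper: run the given command once with btl=self,openib and
# once with btl=self,tcp, and print the exit status of each run.
# Assumption: the command does not pass its own "-mca btl ..." option,
# since command-line MCA settings take precedence over the environment.
compare_btls() {
    local btl rc
    for btl in openib tcp; do
        rc=0
        # "|| rc=$?" keeps the loop going even when the run fails
        OMPI_MCA_pml=ob1 OMPI_MCA_btl="self,$btl" "$@" || rc=$?
        echo "btl=self,$btl exit code = $rc"
    done
}
```

It would be invoked with the launcher and test binary as arguments, for example `compare_btls mpirun -np 8 ./hello_c`.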
openib_failure.txt
The Mellanox Jenkins script has been updated to print the exit status, so this behavior should not cause such confusion in the future.
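As an illustration of the idea, a test step can be wrapped so that its exit status always appears in the log, even when the command itself prints no error. This is a minimal sketch and not the actual Mellanox Jenkins script; `run_and_report` is a hypothetical name.

```shell
# Hypothetical wrapper: run a command, print its exit status in the log,
# and pass that status back to the caller.
run_and_report() {
    local rc=0
    # "|| rc=$?" captures the status without aborting under "set -e"
    "$@" || rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "FAILED (exit code = $rc): $*"
    else
        echo "OK (exit code = 0): $*"
    fi
    return "$rc"
}
```

With a wrapper like this, a run that exits with a non-zero status and no error message still leaves a `FAILED (exit code = ...)` line next to the command that caused it.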