read init-p: connection reset by peer #20212

@wusikijeronii

Nomad version

Nomad v1.7.6
BuildDate 2024-03-12T07:27:36Z
Revision 594fedbfbc4f0e532b65e8a69b28ff9403eb822e

Operating system and Environment details

NAME="Oracle Linux Server"
VERSION="8.9"

Issue

After upgrading Nomad from 1.5.3 to 1.7.6, I can't run any job on two of the three nodes (the same job on each node). I get this error:

failed to launch command with executor: rpc error: code = Unknown desc = unable to start container process: error during container init: read init-p: connection reset by peer

I also tried a minimal job that only runs /bin/bash, but I hit the same error. Rebooting the servers and updating all packages on the host machines didn't help.
I also removed all cache data from all servers. At first I thought it was a file access issue, but the same error occurs with the default user (anonymous).

Job file

job "test-job" {
  datacenters = ["dc1"]
  type = "service"
  group "test-group" {
    count = 3
    constraint {
      operator  = "distinct_hosts"
      value     = "true"
    }
    
    restart {
      attempts = 2
      interval = "5m"
      delay = "15s"
      mode = "fail"
    }

    ephemeral_disk {
      size = 300
    }

    task "test-task" {
      user = "cored"
      driver = "exec"
      
      config {
        command = "/bin/bash"
        args = []
      }

      resources {
        cpu = 200
      }
    }
  }
}

Nomad logs

-- Logs begin at Mon 2024-03-25 10:46:06 MSK, end at Mon 2024-03-25 11:14:14 MSK. --
Mar 25 10:58:00 srv2.prod nomad[5744]:     2024-03-25T10:58:00.101+0300 [DEBUG] client: updated allocations: index=443167 total=15 pulled=13 filtered=2
Mar 25 10:58:00 srv2.prod nomad[5744]:     2024-03-25T10:58:00.101+0300 [DEBUG] client: allocation updates: added=0 removed=0 updated=13 ignored=2
Mar 25 10:58:00 srv2.prod nomad[5744]:     2024-03-25T10:58:00.124+0300 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=13 ignored=2 errors=0
Mar 25 10:58:01 srv2.prod nomad[5744]:     2024-03-25T10:58:01.265+0300 [DEBUG] nomad: memberlist: Initiating push/pull sync with: srv1-prod.global 10.0.1.4:4648
Mar 25 10:58:01 srv2.prod nomad[5744]:     2024-03-25T10:58:01.791+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration="150.062µs"
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.417+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: starting plugin: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 task=test-task path=/usr/bin/nomad args=["/usr/bin/nomad", "logmon"]
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.418+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin started: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 task=test-task path=/usr/bin/nomad pid=37643
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.418+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: waiting for RPC address: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 task=test-task plugin=/usr/bin/nomad
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.451+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.nomad: plugin address: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 task=test-task network=unix @module=logmon address=/tmp/plugin4056566260 timestamp="2024-03-25T10:58:02.451+0300"
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.452+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: using plugin: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 task=test-task version=2
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.453+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 task=test-task @module=logmon path=/opt/nomad/alloc/c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2/alloc/logs/.test-task.stdout.fifo timestamp="2024-03-25T10:58:02.453+0300"
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.453+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 task=test-task @module=logmon path=/opt/nomad/alloc/c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2/alloc/logs/.test-task.stderr.fifo timestamp="2024-03-25T10:58:02.453+0300"
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.468+0300 [INFO]  client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:/bin/bash Args:[] ModePID: ModeIPC: CapAdd:[] CapDrop:[]}"
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.468+0300 [DEBUG] client.driver_mgr.exec.executor: starting plugin: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 driver=exec task_name=test-task path=/usr/bin/nomad args=["/usr/bin/nomad", "executor", "{\"LogFile\":\"/opt/nomad/alloc/c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2/test-task/executor.out\",\"LogLevel\":\"debug\",\"FSIsolation\":true,\"Compute\":{\"tc\":8000,\"nc\":4}}"]
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.469+0300 [DEBUG] client.driver_mgr.exec.executor: plugin started: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 driver=exec task_name=test-task path=/usr/bin/nomad pid=37654
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.469+0300 [DEBUG] client.driver_mgr.exec.executor: waiting for RPC address: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 driver=exec task_name=test-task plugin=/usr/bin/nomad
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.505+0300 [DEBUG] client.driver_mgr.exec.executor: using plugin: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 driver=exec task_name=test-task version=2
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.507+0300 [DEBUG] client.driver_mgr.exec: task capabilities: driver=exec capabilities=["CAP_AUDIT_WRITE", "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FOWNER", "CAP_FSETID", "CAP_KILL", "CAP_MKNOD", "CAP_NET_BIND_SERVICE", "CAP_SETFCAP", "CAP_SETGID", "CAP_SETPCAP", "CAP_SETUID", "CAP_SYS_CHROOT"]
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.521+0300 [DEBUG] client.driver_mgr.exec.executor.nomad: time="2024-03-25T10:58:02+03:00" level=warning msg="cannot serialize hook of type configs.FuncHook, skipping": alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 driver=exec task_name=test-task
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.655+0300 [DEBUG] client.driver_mgr.exec.executor.stdio: received EOF, stopping recv loop: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 driver=exec task_name=test-task err="rpc error: code = Unavailable desc = error reading from server: EOF"
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.666+0300 [INFO]  client.driver_mgr.exec.executor: plugin process exited: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 driver=exec task_name=test-task plugin=/usr/bin/nomad id=37654
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.666+0300 [DEBUG] client.driver_mgr.exec.executor: plugin exited: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 driver=exec task_name=test-task
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.666+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 task=test-task type="Driver Failure" msg="failed to launch command with executor: rpc error: code = Unknown desc = unable to start container process: error during container init: read init-p: connection reset by peer" failed=false
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.668+0300 [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 task=test-task error="failed to launch command with executor: rpc error: code = Unknown desc = unable to start container process: error during container init: read init-p: connection reset by peer"
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.668+0300 [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 task=test-task reason="Error was unrecoverable"
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.668+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 task=test-task type="Not Restarting" msg="Error was unrecoverable" failed=true
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.674+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.stdio: received EOF, stopping recv loop: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 task=test-task err="rpc error: code = Unavailable desc = error reading from server: EOF"
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.680+0300 [INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 task=test-task plugin=/usr/bin/nomad id=37643
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.680+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 task=test-task
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.680+0300 [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2 task=test-task
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.680+0300 [INFO]  client.gc: marking allocation for GC: alloc_id=c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2
Mar 25 10:58:02 srv2.prod nomad[5744]:     2024-03-25T10:58:02.886+0300 [DEBUG] nomad.client: adding evaluations for rescheduling failed allocations: num_evals=1
Mar 25 10:58:03 srv2.prod nomad[5744]:     2024-03-25T10:58:03.078+0300 [DEBUG] client: updated allocations: index=443169 total=15 pulled=13 filtered=2
Mar 25 10:58:03 srv2.prod nomad[5744]:     2024-03-25T10:58:03.078+0300 [DEBUG] client: allocation updates: added=0 removed=0 updated=13 ignored=2
Mar 25 10:58:03 srv2.prod nomad[5744]:     2024-03-25T10:58:03.105+0300 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=13 ignored=2 errors=0
Mar 25 10:58:04 srv2.prod nomad[5744]:     2024-03-25T10:58:04.212+0300 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:32790
Mar 25 10:58:04 srv2.prod nomad[5744]:     2024-03-25T10:58:04.344+0300 [DEBUG] client: updated allocations: index=443171 total=15 pulled=14 filtered=1
Mar 25 10:58:04 srv2.prod nomad[5744]:     2024-03-25T10:58:04.344+0300 [DEBUG] client: allocation updates: added=0 removed=0 updated=14 ignored=1
Mar 25 10:58:04 srv2.prod nomad[5744]:     2024-03-25T10:58:04.375+0300 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=14 ignored=1 errors=0
Mar 25 10:58:05 srv2.prod nomad[5744]:     2024-03-25T10:58:05.618+0300 [DEBUG] nomad.client: adding evaluations for rescheduling failed allocations: num_evals=1
Mar 25 10:58:05 srv2.prod nomad[5744]:     2024-03-25T10:58:05.815+0300 [DEBUG] client: updated allocations: index=443175 total=15 pulled=14 filtered=1
Mar 25 10:58:05 srv2.prod nomad[5744]:     2024-03-25T10:58:05.815+0300 [DEBUG] client: allocation updates: added=0 removed=0 updated=14 ignored=1
Mar 25 10:58:05 srv2.prod nomad[5744]:     2024-03-25T10:58:05.839+0300 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=14 ignored=1 errors=0
Mar 25 10:58:08 srv2.prod nomad[5744]:     2024-03-25T10:58:08.459+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration="684.523µs"
Mar 25 10:58:11 srv2.prod nomad[5744]:     2024-03-25T10:58:11.792+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration="732.682µs"
Mar 25 10:58:14 srv2.prod nomad[5744]:     2024-03-25T10:58:14.213+0300 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:34694
Mar 25 10:58:18 srv2.prod nomad[5744]:     2024-03-25T10:58:18.461+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=1.276092ms
Mar 25 10:58:18 srv2.prod nomad[5744]:     2024-03-25T10:58:18.581+0300 [DEBUG] nomad.deployments_watcher: deadline hit: deployment_id=d8885905-d4de-46c0-d7b5-7ac35e164f14 job="<ns: \"default\", id: \"cored\">" rollback=false
Mar 25 10:58:18 srv2.prod nomad[5744]:     2024-03-25T10:58:18.616+0300 [DEBUG] http: request complete: method=GET path="/v1/deployment/d8885905-d4de-46c0-d7b5-7ac35e164f14?index=443082&stale=" duration=3m30.444134347s
Mar 25 10:58:21 srv2.prod nomad[5744]:     2024-03-25T10:58:21.793+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration="108.49µs"
Mar 25 10:58:24 srv2.prod nomad[5744]:     2024-03-25T10:58:24.214+0300 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:45212
Mar 25 10:58:28 srv2.prod nomad[5744]:     2024-03-25T10:58:28.462+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration="147.718µs"
Mar 25 10:58:31 srv2.prod nomad[5744]:     2024-03-25T10:58:31.795+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration="169.04µs"
Mar 25 10:58:34 srv2.prod nomad[5744]:     2024-03-25T10:58:34.215+0300 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:33958
Mar 25 10:58:38 srv2.prod nomad[5744]:     2024-03-25T10:58:38.464+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration="102.038µs"
Mar 25 10:58:41 srv2.prod nomad[5744]:     2024-03-25T10:58:41.797+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration="202.884µs"
Mar 25 10:58:44 srv2.prod nomad[5744]:     2024-03-25T10:58:44.216+0300 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:41588
Mar 25 10:58:48 srv2.prod nomad[5744]:     2024-03-25T10:58:48.465+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration="153.443µs"
Mar 25 10:58:51 srv2.prod nomad[5744]:     2024-03-25T10:58:51.798+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration="146.317µs"
Mar 25 10:58:53 srv2.prod nomad[5744]:     2024-03-25T10:58:53.286+0300 [DEBUG] nomad: memberlist: Stream connection from=10.0.1.8:35898
Mar 25 10:58:54 srv2.prod nomad[5744]:     2024-03-25T10:58:54.217+0300 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:44692
Mar 25 10:58:58 srv2.prod nomad[5744]:     2024-03-25T10:58:58.467+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration="130.781µs"
Mar 25 10:59:01 srv2.prod nomad[5744]:     2024-03-25T10:59:01.268+0300 [DEBUG] nomad: memberlist: Initiating push/pull sync with: srv1-prod.global 10.0.1.4:4648
Mar 25 10:59:01 srv2.prod nomad[5744]:     2024-03-25T10:59:01.800+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration="170.41µs"
Mar 25 10:59:04 srv2.prod nomad[5744]:     2024-03-25T10:59:04.219+0300 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:60018
Mar 25 10:59:08 srv2.prod nomad[5744]:     2024-03-25T10:59:08.469+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration="108.13µs"
Mar 25 10:59:11 srv2.prod nomad[5744]:     2024-03-25T10:59:11.802+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration="156.268µs"
Mar 25 10:59:14 srv2.prod nomad[5744]:     2024-03-25T10:59:14.219+0300 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:54736
Mar 25 10:59:18 srv2.prod nomad[5744]:     2024-03-25T10:59:18.470+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration="126.923µs"
Mar 25 10:59:21 srv2.prod nomad[5744]:     2024-03-25T10:59:21.804+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration="147.992µs"
Mar 25 10:59:24 srv2.prod nomad[5744]:     2024-03-25T10:59:24.220+0300 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:60886
Mar 25 10:59:28 srv2.prod nomad[5744]:     2024-03-25T10:59:28.472+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration="134.108µs"
Mar 25 10:59:31 srv2.prod nomad[5744]:     2024-03-25T10:59:31.806+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration="168.62µs"
Mar 25 10:59:34 srv2.prod nomad[5744]:     2024-03-25T10:59:34.221+0300 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:51442
Mar 25 10:59:34 srv2.prod nomad[5744]:     2024-03-25T10:59:34.371+0300 [DEBUG] nomad: memberlist: Stream connection from=10.0.1.4:46336
Mar 25 10:59:38 srv2.prod nomad[5744]:     2024-03-25T10:59:38.474+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration="146.373µs"
Mar 25 10:59:41 srv2.prod nomad[5744]:     2024-03-25T10:59:41.807+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration="149.074µs"
Mar 25 10:59:44 srv2.prod nomad[5744]:     2024-03-25T10:59:44.221+0300 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:40574
Mar 25 10:59:48 srv2.prod nomad[5744]:     2024-03-25T10:59:48.475+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration="138.924µs"
Mar 25 10:59:51 srv2.prod nomad[5744]:     2024-03-25T10:59:51.808+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration="164.435µs"
Mar 25 10:59:53 srv2.prod nomad[5744]:     2024-03-25T10:59:53.289+0300 [DEBUG] nomad: memberlist: Stream connection from=10.0.1.8:37994
Mar 25 10:59:54 srv2.prod nomad[5744]:     2024-03-25T10:59:54.222+0300 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:48038
Mar 25 10:59:58 srv2.prod nomad[5744]:     2024-03-25T10:59:58.476+0300 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration="141.933µs"
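
Most of the log above is health checks and memberlist traffic. To isolate the lines for the failing allocation, a filter along these lines may help. This is only a minimal sketch: `filter_alloc_lines` is a hypothetical helper, not part of Nomad, and it assumes the journal output is available as plain text lines in the format shown above.

```python
import re

# Hypothetical helper (not part of Nomad): keep only log lines that mention
# a given allocation ID and are at or above a minimum log level.
LEVELS = {"TRACE": 0, "DEBUG": 1, "INFO": 2, "WARN": 3, "ERROR": 4}
LEVEL_RE = re.compile(r"\[(TRACE|DEBUG|INFO|WARN|ERROR)\]")


def filter_alloc_lines(lines, alloc_id, min_level="INFO"):
    """Return the lines that reference alloc_id with level >= min_level."""
    threshold = LEVELS[min_level]
    kept = []
    for line in lines:
        # Plain substring match on the alloc_id=<id> field used in the log.
        if f"alloc_id={alloc_id}" not in line:
            continue
        match = LEVEL_RE.search(line)
        if match and LEVELS[match.group(1)] >= threshold:
            kept.append(line)
    return kept
```

For example, with the log saved to a file (the name `nomad.log` is an assumption), `filter_alloc_lines(open("nomad.log"), "c4ff2ba3-82f5-96c0-bbb6-2f45d34507f2")` would keep only the INFO and ERROR lines for the failed allocation, including the "Driver Failure" task event.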

If you need me to check anything on my end, let me know; I don't know what else to check.
Reverting to 1.5.3 fixes the issue.
