Stop sshd before reboot (prevent reconnect/reboot race) #585

Merged
matusmarhefka merged 1 commit into main from stop_sshd_on_reboot
Apr 9, 2026

@comps comps commented Apr 8, 2026

The ATEX disconnect method reliably waits for the test to reach a deterministic point, but that doesn't prevent another race from happening *after* 'reboot' is issued.

For example:

  1. systemd kills all user sessions first (ahead of system daemons), which also kills the Python-based test running over ssh
  2. ATEX sees the ssh disconnect, which is expected because the control channel was already disconnected safely, so it "waits for reboot" by repeatedly attempting to reconnect
  3. The reconnect succeeds because sshd has not been shut down yet, even though user sessions were killed; the OS may be blocked for 2-3 minutes on a less important daemon that shuts down before sshd
  4. ATEX restarts the test, assuming the OS has rebooted
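
The sequence above can be illustrated with a toy simulation. This is only a sketch of the race, not ATEX code: `SimulatedHost`, `wait_for_reconnect`, and the tick values are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SimulatedHost:
    """Toy model of a host during shutdown (hypothetical, not ATEX code)."""
    sshd_stops_at: int  # tick at which sshd stops accepting new connections
    boot_done_at: int   # tick at which the rebooted OS starts sshd again

    def accepts_ssh(self, tick: int) -> bool:
        # sshd listens until it is shut down, then again after the reboot
        return tick < self.sshd_stops_at or tick >= self.boot_done_at

    def has_rebooted(self, tick: int) -> bool:
        return tick >= self.boot_done_at


def wait_for_reconnect(host: SimulatedHost, start_tick: int, max_ticks: int = 1000) -> int:
    """Poll until a connection succeeds and return the tick at which it did."""
    for tick in range(start_tick, start_tick + max_ticks):
        if host.accepts_ssh(tick):
            return tick
    raise TimeoutError("host never accepted a connection")


# User sessions die at tick 0, but sshd keeps listening until tick 150
# because a slower daemon is shutting down ahead of it.
slow_shutdown = SimulatedHost(sshd_stops_at=150, boot_done_at=200)
tick = wait_for_reconnect(slow_shutdown, start_tick=1)
print(tick, slow_shutdown.has_rebooted(tick))  # prints "1 False"
```

The reconnect succeeds on the first attempt even though the simulated host has not rebooted, which is the false positive described in steps 3 and 4.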

This can be easily prevented by stopping sshd before issuing the reboot, which prevents new connections while keeping existing sessions alive. ATEX then cannot reconnect until something starts sshd again, which should happen only after the reboot.
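
A minimal sketch of that ordering follows. It is not the code merged in this PR: the function name `reboot_with_sshd_stopped`, the injected `run` callable, and the unit name `sshd` are assumptions made for illustration (some distributions name the unit `ssh`).

```python
from typing import Callable, List


def reboot_with_sshd_stopped(
    run: Callable[[List[str]], None],
    sshd_unit: str = "sshd",  # assumed unit name; may differ per distribution
) -> None:
    """Stop the ssh daemon first, then reboot (hypothetical helper).

    `run` executes one command on the tested system, for example
    lambda argv: subprocess.run(argv, check=True) when running locally as root.
    """
    # Stop accepting new connections; per the description above, sessions
    # that are already established are expected to stay alive.
    run(["systemctl", "stop", sshd_unit])
    # Only now ask for the reboot; any reconnect attempt made from here on
    # fails until sshd is started again after boot.
    run(["reboot"])


# Dry run that records the commands instead of executing them
issued: List[List[str]] = []
reboot_with_sshd_stopped(issued.append)
print(issued)
```

Injecting `run` keeps the sketch independent of how commands actually reach the tested system, which this page does not show.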

This race was reliably reproducible on ppc64le, perhaps due to some daemons shutting down very slowly.

Signed-off-by: Jiri Jaburek <comps@nomail.dom>
@matusmarhefka matusmarhefka merged commit d33c7e7 into main Apr 9, 2026
4 checks passed
@matusmarhefka matusmarhefka deleted the stop_sshd_on_reboot branch April 9, 2026 08:29