Session does not continue if reconnecting to an agent after the controller stopped/crashed #888

@pieqq

Bug Description

While testing #859, I came across the following issue, likely with Checkbox controller.

If the controller reconnects to an agent after the current job has finished, the session does not continue to the next job and instead stays stuck on the output of the current job.

To Reproduce

Setup

If needed, here are the steps I followed to set up my device to easily reproduce this issue:

Steps to set up the Checkbox controller and agent, as well as some sample jobs and a test plan

Checkbox controller

On my laptop, I already have a virtual environment set up for Checkbox. I just point it to your branch:

(venv) $ git switch solve-resume-on-remote

I use this venv for the Checkbox controller.

Checkbox agent

For the Checkbox agent, I create an LXC container running 22.04:

$ lxc launch images:ubuntu/22.04 jammy
$ lxc shell jammy

The rest of the commands are run in the container:

# apt install python3.10-venv python3-virtualenv git
# git clone https://github.com/canonical/checkbox.git
# cd checkbox/
# git switch solve-resume-on-remote

I follow the Contrib guide to get Checkbox installed in a venv. In the end, checkbox-cli lives in /root/checkbox/checkbox-ng/venv/bin/checkbox-cli and the providers are described in /root/checkbox/checkbox-ng/venv/share/plainbox-providers-1.

I put the following in /etc/systemd/system/checkbox-ng.service:

[Unit]
Description=Checkbox Remote Service
Wants=network.target

[Service]
ExecStart=/root/checkbox/checkbox-ng/venv/bin/checkbox-cli run-agent
SyslogIdentifier=checkbox-ng.service
Environment="XDG_CACHE_HOME=/var/cache/"
Environment="PROVIDERPATH=/root/checkbox/checkbox-ng/venv/share/plainbox-providers-1"
Restart=always
RestartSec=1
TimeoutStopSec=30
Type=simple

[Install]
WantedBy=multi-user.target

Then I install the checkbox-ng service, enable it and start it:

# systemctl daemon-reload
# systemctl enable --now checkbox-ng.service

Now, everything is in place. I can start a remote session from the controller by running:

(venv) $ checkbox-cli control <IP of my lxc container>
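
If the controller fails to connect, it can help to first check that the agent is listening. The following is a minimal sketch, not a Checkbox feature: `agent_reachable` is a hypothetical helper name, it assumes `bash` and GNU `timeout` are available on the controller machine, and it assumes the agent listens on TCP port 18871, which is the port the controller prints when connecting in the log further down.

```shell
# Hypothetical helper (not part of Checkbox): returns 0 if a TCP connection
# to HOST:PORT can be opened within 3 seconds, non-zero otherwise.
# The default port 18871 is taken from the controller log in this report.
agent_reachable() {
    # bash's /dev/tcp pseudo-device opens a TCP connection on redirection;
    # inside the bash -c script, $0 is the host and $1 is the port.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "${2:-18871}" 2>/dev/null
}
```

For example, `agent_reachable <IP of my lxc container> && echo "agent is up"` prints the message only when the port accepts connections.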

Sample jobs and test plan

In the 22.04 container, I create a new pieq.pxu file in /root/checkbox/providers/base/units/ and put the following in it:

unit: job
id: pieq/test
command:
 for i in $(seq 1 30);
 do
     echo "Iteration $i/30..."
     sleep 1
 done
flags: simple noreturn

unit: job
id: pieq/wrapup
command:
 echo "Wrapping up..."
flags: simple

unit: test plan
id: pieq
_name: pieq
include:
    pieq/test
    pieq/wrapup

The pieq/test job runs for 30 seconds and prints its current iteration every second, so it's handy to see what's going on. It has the noreturn flag, but of course you can remove this flag if you want to test other use cases.

I need to restart the systemd service, otherwise this test plan will not be visible to Checkbox:

# systemctl restart checkbox-ng.service

Launcher

In order to simulate a non-interactive test run, I create the following launcher file (pieq.launcher):

[launcher]
launcher_version = 1
app_id = com.canonical.certification:PR859
stock_reports = text

[test plan]
unit = com.canonical.certification::pieq
forced = yes

[test selection]
forced = yes

[ui]
type = silent

[transport:outfile]
type = stream
stream = stdout

[exporter:text]
unit = com.canonical.plainbox::text

[report:screen]
transport = outfile
exporter = text

I run it from the controller side with:

(venv) $ checkbox-cli control <IP of my lxc container> pieq.launcher

Test

Reconnecting to agent after the controller stopped/crashed ❌

One of the issues this should fix is #22, which mentions:

While testing is ongoing, restart your host computer.

So:

  1. Run Checkbox remote using the launcher, which starts pieq/test (which runs for 30 seconds):
(venv) $ checkbox-cli control <IP of my lxc container> pieq.launcher

→ The test starts running

  2. Close the terminal where the controller is running. Wait for 30 seconds, then try reconnecting to the agent:
(venv) $ checkbox-cli control 10.146.223.75
$PROVIDERPATH is defined, so following provider sources are ignored ['/usr/local/share/plainbox-providers-1', '/usr/share/plainbox-providers-1', '/home/pieq/.local/share/plainbox-providers-1', '/var/tmp/checkbox-providers-develop'] 
Connecting to 10.146.223.75:18871. Timeout: 600s
Rejoined session.
In progress: com.canonical.certification::pieq/test (1/2)
Iteration 17/30...
Iteration 18/30...
Iteration 19/30...
Iteration 20/30...
Iteration 21/30...
Iteration 22/30...
Iteration 23/30...
Iteration 24/30...
Iteration 25/30...
Iteration 26/30...
Iteration 27/30...
Iteration 28/30...
Iteration 29/30...
Iteration 30/30...

aaaaaaaaand nothing happens. The session never goes on to the next job (pieq/wrapup), and never finishes. This is because the job has finished running by the time we reconnect to the agent.
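
The two manual steps above can be scripted to make the reproduction repeatable. This is a rough sketch under stated assumptions, not a verified reproducer: `reproduce_reconnect` and the `CHECKBOX_CLI` variable are hypothetical conveniences, GNU `timeout` is assumed to be available, and sending SIGHUP to the controller is only an approximation of closing its terminal. The only Checkbox commands used are the two `checkbox-cli control` invocations shown in the steps.

```shell
# Sketch only: CHECKBOX_CLI and reproduce_reconnect are hypothetical names,
# not Checkbox features. Run from the controller's venv.
CHECKBOX_CLI="${CHECKBOX_CLI:-checkbox-cli}"

reproduce_reconnect() {
    agent_ip="$1"
    launcher="${2:-pieq.launcher}"
    run_for="${3:-10}"    # seconds the first controller stays alive
    wait_for="${4:-30}"   # seconds to wait so that pieq/test has finished

    # Step 1: start the session, then hang up the controller after run_for
    # seconds (SIGHUP approximates closing the terminal).
    timeout --signal=HUP "$run_for" "$CHECKBOX_CLI" control "$agent_ip" "$launcher"
    echo "first controller exited with status $?"

    # Step 2: wait for the job to finish, then reconnect without a launcher.
    sleep "$wait_for"
    "$CHECKBOX_CLI" control "$agent_ip"
}
```

Calling `reproduce_reconnect <IP of my lxc container>` should end with the reconnected controller showing the output of pieq/test; if the bug is present, the session does not move on to pieq/wrapup.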

Environment

  • Latest Checkbox from main

Relevant log output

No response

Additional context

No response
