Bug Description
While testing #859, I came across the following issue, likely with the Checkbox controller.
If the controller reconnects to an agent after the current job has finished, the session does not continue to the next job and instead stays stuck on the output from the current job.
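To make the expected behaviour concrete, here is a toy model of the symptom. It is not Checkbox code: `ToyAgent`, `rejoin_observed` and `rejoin_expected` are hypothetical names invented for this sketch, and it says nothing about where the fault actually lies.

```python
from dataclasses import dataclass


@dataclass
class ToyAgent:
    """Hypothetical stand-in for an agent; not the Checkbox API."""

    jobs: list
    index: int = 0          # position of the current job in the test plan
    finished: bool = False  # True once the current job's command has exited

    def finish_current_job(self):
        """Mark the current job as done (nobody is connected to see it)."""
        self.finished = True

    def advance(self):
        """Move on to the next job in the plan."""
        self.index += 1
        self.finished = False


def rejoin_observed(agent):
    """What the report describes: the session stays on the finished job."""
    return f"In progress: {agent.jobs[agent.index]}"


def rejoin_expected(agent):
    """What one would expect: a finished job is left behind on rejoin."""
    if agent.finished and agent.index + 1 < len(agent.jobs):
        agent.advance()
    return f"In progress: {agent.jobs[agent.index]}"
```

With the two sample jobs from this report, `rejoin_observed` keeps reporting `pieq/test` after it has finished, whereas `rejoin_expected` moves on to `pieq/wrapup`.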
To Reproduce
Setup
If needed, here are the steps I followed to set up my device to easily reproduce this issue:
Steps to set up the Checkbox controller and agent as well as some sample jobs and test plan
Checkbox controller
On my laptop, I already have a virtual environment set up for Checkbox. I just point to your branch:
(venv) $ git switch solve-resume-on-remote
I use this venv for the Checkbox controller.
Checkbox agent
For the Checkbox agent, I create an LXC container running 22.04:
$ lxc launch images:ubuntu/22.04 jammy
$ lxc shell jammy
The rest of the commands are run in the container:
# apt install python3.10-venv python3-virtualenv git
# git clone https://github.com/canonical/checkbox.git
# cd checkbox/
# git switch solve-resume-on-remote
I follow the Contrib guide to get Checkbox installed in a venv. In the end, checkbox-cli lives in /root/checkbox/checkbox-ng/venv/bin/checkbox-cli and the providers are described in /root/checkbox/checkbox-ng/venv/share/plainbox-providers-1.
I put the following in /etc/systemd/system/checkbox-ng.service:
[Unit]
Description=Checkbox Remote Service
Wants=network.target
[Service]
ExecStart=/root/checkbox/checkbox-ng/venv/bin/checkbox-cli run-agent
SyslogIdentifier=checkbox-ng.service
Environment="XDG_CACHE_HOME=/var/cache/"
Environment="PROVIDERPATH=/root/checkbox/checkbox-ng/venv/share/plainbox-providers-1"
Restart=always
RestartSec=1
TimeoutStopSec=30
Type=simple
[Install]
WantedBy=multi-user.target
and I install the checkbox-ng service and start it:
# systemctl daemon-reload
# systemctl enable checkbox-ng.service
Now, everything is in place. I can start a remote session from the controller by running:
(venv) $ checkbox-cli control <IP of my lxc container>
Sample jobs and test plan
In the 22.04 container, I create a new pieq.pxu file in /root/checkbox/providers/base/units/ and put the following in it:
unit: job
id: pieq/test
command:
  for i in $(seq 1 30);
  do
    echo "Iteration $i/30..."
    sleep 1
  done
flags: simple noreturn

unit: job
id: pieq/wrapup
command:
  echo "Wrapping up..."
flags: simple

unit: test plan
id: pieq
_name: pieq
include:
  pieq/test
  pieq/wrapup
The pieq/test job will run for 30 seconds and will show the current status of the job, so it's handy to see what's going on. It has the noreturn flag, but of course you can remove this flag if you want to test other use cases.
I need to restart the systemd service, otherwise this test plan will not be visible to Checkbox:
# systemctl restart checkbox-ng.service
Launcher
In order to simulate a non-interactive test run, I create the following launcher file (pieq.launcher):
[launcher]
launcher_version = 1
app_id = com.canonical.certification:PR859
stock_reports = text
[test plan]
unit = com.canonical.certification::pieq
forced = yes
[test selection]
forced = yes
[ui]
type = silent
[transport:outfile]
type = stream
stream = stdout
[exporter:text]
unit = com.canonical.plainbox::text
[report:screen]
transport = outfile
exporter = text
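As a quick sanity check before running it, the launcher can be parsed as a plain INI file. This is only a sketch: it uses Python's standard configparser on the assumption that the launcher is INI-compatible, and Checkbox's own launcher parser may apply extra rules. `load_launcher` and `is_non_interactive` are hypothetical helper names.

```python
import configparser

# Contents of pieq.launcher, copied from this report.
LAUNCHER_TEXT = """\
[launcher]
launcher_version = 1
app_id = com.canonical.certification:PR859
stock_reports = text
[test plan]
unit = com.canonical.certification::pieq
forced = yes
[test selection]
forced = yes
[ui]
type = silent
[transport:outfile]
type = stream
stream = stdout
[exporter:text]
unit = com.canonical.plainbox::text
[report:screen]
transport = outfile
exporter = text
"""


def load_launcher(text):
    """Parse launcher text with the standard INI parser."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def is_non_interactive(parser):
    """True if the UI is silent and both selection screens are forced."""
    return (
        parser.get("ui", "type", fallback="") == "silent"
        and parser.getboolean("test plan", "forced", fallback=False)
        and parser.getboolean("test selection", "forced", fallback=False)
    )


if __name__ == "__main__":
    launcher = load_launcher(LAUNCHER_TEXT)
    print("test plan:", launcher["test plan"]["unit"])
    print("non-interactive:", is_non_interactive(launcher))
```

The three settings checked here (silent UI, forced test plan, forced test selection) are the ones that make the run proceed without prompting, which is what the reproduction relies on.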
I run it from the controller side with:
(venv) $ checkbox-cli control <IP of my lxc container> pieq.launcher
Test
Reconnecting to agent after the controller stopped/crashed ❌
One of the issues this should fix is #22, which mentions:
While testing is ongoing, restart your host computer.
So:
- Run Checkbox remote using the launcher, which starts pieq/test (which runs for 30 seconds):
(venv) $ checkbox-cli control <IP of my lxc container> pieq.launcher
→ The test starts running
- Close the terminal where the controller is running. Wait for 30 seconds, then try reconnecting to the agent:
(venv) $ checkbox-cli control 10.146.223.75
$PROVIDERPATH is defined, so following provider sources are ignored ['/usr/local/share/plainbox-providers-1', '/usr/share/plainbox-providers-1', '/home/pieq/.local/share/plainbox-providers-1', '/var/tmp/checkbox-providers-develop']
Connecting to 10.146.223.75:18871. Timeout: 600s
Rejoined session.
In progress: com.canonical.certification::pieq/test (1/2)
Iteration 17/30...
Iteration 18/30...
Iteration 19/30...
Iteration 20/30...
Iteration 21/30...
Iteration 22/30...
Iteration 23/30...
Iteration 24/30...
Iteration 25/30...
Iteration 26/30...
Iteration 27/30...
Iteration 28/30...
Iteration 29/30...
Iteration 30/30...
aaaaaaaaand nothing happens. The session never goes on to the next job (pieq/wrapup), and never finishes. This is because the job has finished running by the time we reconnect to the agent.
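The manual steps above can be scripted to make the reproduction repeatable. The sketch below only relies on the `checkbox-cli control` invocations shown in this report; everything else is an assumption: the helper names are hypothetical, terminating the controller process is assumed to be close enough to closing its terminal, and a reconnect that is still running after `stuck_after` seconds is treated as stuck.

```python
import subprocess
import time


def control_command(agent_host, launcher=None):
    """Build the controller command line used in the steps above."""
    cmd = ["checkbox-cli", "control", agent_host]
    if launcher:
        cmd.append(launcher)
    return cmd


def reproduce(agent_host, launcher="pieq.launcher",
              run_for=5, wait=30, stuck_after=60):
    """Return True if the reconnected controller appears to be stuck."""
    # Start the session with the launcher; pieq/test begins running.
    first = subprocess.Popen(control_command(agent_host, launcher))
    time.sleep(run_for)

    # Assumption: SIGTERM approximates closing the controller's terminal.
    first.terminate()
    first.wait()

    # Let pieq/test finish while no controller is connected.
    time.sleep(wait)

    # Reconnect without a launcher, as in the manual steps.
    try:
        subprocess.run(control_command(agent_host), timeout=stuck_after)
    except subprocess.TimeoutExpired:
        return True   # controller never returned: consistent with this bug
    return False


if __name__ == "__main__":
    import sys
    print("stuck:", reproduce(sys.argv[1]))
```

Run it from the controller's venv with the agent's IP address as the only argument; the timing values are defaults chosen to match the 30-second pieq/test job and can be adjusted.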
Environment
- Latest Checkbox from main
Relevant log output
No response
Additional context
No response