Skip to content

Use json encoding to avoid spamming remote API (breaking)#1983

Merged
Hook25 merged 13 commits intomainfrom
jsonbox
Jul 9, 2025
Merged

Use json encoding to avoid spamming remote API (breaking)#1983
Hook25 merged 13 commits intomainfrom
jsonbox

Conversation

@Hook25
Copy link
Copy Markdown
Collaborator

@Hook25 Hook25 commented Jun 30, 2025

Description

Checkbox Remote is very, very slow when used over a slow connection. This is mostly due to a reason: When returning a mutable objects over an rpyc function, any further query on that object is a rpyc call. This is most evident in the following situations:

  • Test plan selection (iterating over the testplan list takes ~600 (number of testplans) rpyc calls
  • Test selection (modifying the desired list takes one call per test that will be selected, so deselecting just one test could lead to hundreds of rpyc calls)
  • Calculating the return code of a test run (This iterated over the whole job_state_map, leading to ~10k rpyc calls as undesired jobs are present when no reboots are done)
  • Exporting (Additional Configurations that were created client side lead to hundreds of rpyc remote calls when used afterward)

Additionally (this is a bit of a bonus), every test on a medium to high latency will always pay 0.5s to "wait" for the test to be completed + the latency. This adds a backoff mechanism that starts from 0 and goes to .5s

Resolved issues

Fixes: https://warthogs.atlassian.net/browse/CHECKBOX-1956
Fixes: https://warthogs.atlassian.net/browse/CHECKBOX-1955

Documentation

Documented this in the class docstring

Tests

Added unit tests for /most/ of the api changed.

To try this change and simulate this issue locally modify checkbox-ng/plainbox/vendor/rpyc/core/protocol.py add to the serve function of the Connection class the following

    def serve(self, timeout=1, wait_for_lock=True):  # serving
        import time
        time.sleep(.1)

This will simulate a 100ms latency (less than what it is Europe to Taiwan). Add the following test units to the metabox provider:

id: tons_of_jobs
_summary: Generates a lot of ids to fill the job_state_map
plugin: resource
shell: python
command:
  for i in range(100):
    print(f"id: stress_{i}")
    print()

unit: template
template-resource: tons_of_jobs
template-unit: job
id: {id}
template-id: template_tons_of_jobs
command:
  echo "Test {id}"
_summary: Test that stresses rpyc calls
flags: simple

unit: test plan
id: stress_remote_tps
_name: Stress test of the checkbox remote-api
bootstrap_include:
  tons_of_jobs
include:
  template_tons_of_jobs

A test run at 100ms latency of this testplan takes about 35minutes on main, 5m on this branch

Additionally, I've used this very ugly decorator to log all the calls made to the server to be able to track them down

if os.getuid() == 0:
    log = open("/tmp/root_agent.log", "w+")
else:
    log = open("/tmp/usr_control.log", "w+")


def _log_call(f):

    # return f

    def _f(self, *args, **kwargs):
        global total
        total += 1
        import json

        log.write(json.dumps([str(x) for x in [total, *args, kwargs]]))
        log.write("\n")
        print(total, *args, kwargs)
        to_r = f(self, *args, **kwargs)
        if (
            not any(isinstance(to_r, x) for x in [str, int, float, bool])
        ) and to_r is not None:
            # breakpoint()
            ...
        return to_r

    return _f

Decorate all handle[...] functions in the Connection class to use this.

Hook25 added 6 commits June 23, 2025 18:25
Minor: Also poll sooner on first try
This is done because every call we pay 1 roundtrip, that may be 100ms
and this, only to get a pretty bar going through the screen, I feel
like 58 tiks are enough out of 8Mb, we don't need 580
Minor: remove old debug stuff wrongly committed
Minor: forgot to bump remote api version
@Hook25 Hook25 changed the title Use json encoding to avoid spamming remote API (bugfix) Use json encoding to avoid spamming remote API (bugfix) Jun 30, 2025
@codecov
Copy link
Copy Markdown

codecov bot commented Jun 30, 2025

Codecov Report

❌ Patch coverage is 74.68354% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.80%. Comparing base (2305e2c) to head (0ff8f24).
⚠️ Report is 124 commits behind head on main.

Files with missing lines Patch % Lines
checkbox-ng/checkbox_ng/launcher/controller.py 61.29% 10 Missing and 2 partials ⚠️
...ckbox-ng/plainbox/impl/session/remote_assistant.py 87.17% 5 Missing ⚠️
checkbox-ng/checkbox_ng/launcher/stages.py 66.66% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1983      +/-   ##
==========================================
+ Coverage   50.60%   50.80%   +0.19%     
==========================================
  Files         384      384              
  Lines       41180    41277      +97     
  Branches     7642     7664      +22     
==========================================
+ Hits        20841    20969     +128     
+ Misses      19589    19552      -37     
- Partials      750      756       +6     
Flag Coverage Δ
checkbox-ng 70.71% <74.68%> (+0.39%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Hook25 Hook25 marked this pull request as ready for review July 1, 2025 09:02
Copy link
Copy Markdown
Collaborator

@pieqq pieqq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First feedback, will test it locally later.

@Hook25 Hook25 changed the title Use json encoding to avoid spamming remote API (bugfix) Use json encoding to avoid spamming remote API (breaking) Jul 9, 2025
@pieqq
Copy link
Copy Markdown
Collaborator

pieqq commented Jul 9, 2025

I found a script that allows to slow down a connection, so I used it inside an LXC container to simulate a slow connection (I originally tried with slow 3G but it introduces a 200ms latency which made Checkbox completely unusable in remote, so I settled down with slow -b 100000 -l 5ms which introduces a 100k bandwidth and a 5ms latency).

Then I ran Checkbox in agent mode in the container (checkbox-cli run-agent) and connected to it from my host (checkbox-cli control <container_ip>).

With the current version of Checkbox in main, getting the test plan selection screen takes around 8 seconds. With the version in jsonbox (this PR), it takes less than 2 seconds which is much better.

I then used the following launcher to simulate a full test run:

[launcher]
launcher_version = 1
app_id = com.canonical.certification:test
stock_reports = text

[test plan]
unit = com.canonical.certification::info-attachment-cert-full
forced = yes

[test selection]
forced = yes

[ui]
type = silent

I then timed the run with both main and jsonbox branches:

With main:

$ time checkbox-cli control 10.155.182.243 lnch
(...)
________________________________________________________
Executed in   59.15 secs      fish           external
   usr time  497.38 millis  408.00 micros  496.98 millis
   sys time  135.36 millis  172.00 micros  135.18 millis

With this PR's branch:

$ time checkbox-cli control 10.155.182.243 lnch
(...)
________________________________________________________
Executed in   44.37 secs      fish           external
   usr time  436.31 millis  430.00 micros  435.88 millis
   sys time  110.99 millis  177.00 micros  110.82 millis

This will surely be helpful on slow connections!

Once the changes in my reviews (above) land, we're good to land this!

@Hook25 Hook25 merged commit 14aaed3 into main Jul 9, 2025
40 of 41 checks passed
@Hook25 Hook25 deleted the jsonbox branch July 9, 2025 13:11
mreed8855 pushed a commit that referenced this pull request Jul 30, 2025
* Initial jsonification

* More json apis

Minor: Also poll sooner on first try

* Increase chunk size greately

This is done because every call we pay 1 roundtrip, that may be 100ms
and this, only to get a pretty bar going through the screen, I feel
like 58 tiks are enough out of 8Mb, we don't need 580

* Document RemoteAssistant better with a docstr

Minor: remove old debug stuff wrongly committed

* Off by 1 error on the polling delay

* Cache attributes of job interactions

Minor: forgot to bump remote api version

* Adjust tests to the new api version

* Test all new json apis

* Fix bug in finish_job_json on None result

Minor: remove unused API
Minor: json api should call non-json API

* Wrong function called

* Test modified controller functions

* Document why we use the api to fetch the Configuration type

* Correct double bump done by mistake

Turns out, 13 + 1 = 14, not 15
mreed8855 pushed a commit that referenced this pull request Jul 31, 2025
* Initial jsonification

* More json apis

Minor: Also poll sooner on first try

* Increase chunk size greately

This is done because every call we pay 1 roundtrip, that may be 100ms
and this, only to get a pretty bar going through the screen, I feel
like 58 tiks are enough out of 8Mb, we don't need 580

* Document RemoteAssistant better with a docstr

Minor: remove old debug stuff wrongly committed

* Off by 1 error on the polling delay

* Cache attributes of job interactions

Minor: forgot to bump remote api version

* Adjust tests to the new api version

* Test all new json apis

* Fix bug in finish_job_json on None result

Minor: remove unused API
Minor: json api should call non-json API

* Wrong function called

* Test modified controller functions

* Document why we use the api to fetch the Configuration type

* Correct double bump done by mistake

Turns out, 13 + 1 = 14, not 15
bladernr pushed a commit that referenced this pull request Aug 28, 2025
* Initial jsonification

* More json apis

Minor: Also poll sooner on first try

* Increase chunk size greately

This is done because every call we pay 1 roundtrip, that may be 100ms
and this, only to get a pretty bar going through the screen, I feel
like 58 tiks are enough out of 8Mb, we don't need 580

* Document RemoteAssistant better with a docstr

Minor: remove old debug stuff wrongly committed

* Off by 1 error on the polling delay

* Cache attributes of job interactions

Minor: forgot to bump remote api version

* Adjust tests to the new api version

* Test all new json apis

* Fix bug in finish_job_json on None result

Minor: remove unused API
Minor: json api should call non-json API

* Wrong function called

* Test modified controller functions

* Document why we use the api to fetch the Configuration type

* Correct double bump done by mistake

Turns out, 13 + 1 = 14, not 15
stanley31huang pushed a commit that referenced this pull request Oct 3, 2025
* Initial jsonification

* More json apis

Minor: Also poll sooner on first try

* Increase chunk size greately

This is done because every call we pay 1 roundtrip, that may be 100ms
and this, only to get a pretty bar going through the screen, I feel
like 58 tiks are enough out of 8Mb, we don't need 580

* Document RemoteAssistant better with a docstr

Minor: remove old debug stuff wrongly committed

* Off by 1 error on the polling delay

* Cache attributes of job interactions

Minor: forgot to bump remote api version

* Adjust tests to the new api version

* Test all new json apis

* Fix bug in finish_job_json on None result

Minor: remove unused API
Minor: json api should call non-json API

* Wrong function called

* Test modified controller functions

* Document why we use the api to fetch the Configuration type

* Correct double bump done by mistake

Turns out, 13 + 1 = 14, not 15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants