Add tests for DSS on NVIDIA GPUs and only CPUs (New) #1609
aa6301f to 968e272
Updated the PR with refactored scripts; I decided against re-implementing them in Python. Since last week, I have also folded in the work for CHECKBOX-1668, which enables customisation of the Microk8s version. Please see the updated description of the PR for more details.
pieqq left a comment
Thanks for this big contribution! As usual, the very clear description and git commit messages help a lot with the review, along with the tests showing a successful run in TF.
Two things:
- I would refrain from removing the `.sh` extension from shell scripts. It's much easier to see what kind of file it is by looking at the extension when the scripts are in the `bin/` directory.
- There is already a `graphics_card` resource in Checkbox that should help with checking if there is at least one Intel/NVIDIA GPU available in the system. Check my inline comment for more info on how to use it.
contrib/checkbox-dss-validation/checkbox-provider-dss/units/resource.pxu
e0b2928 to 47e7f56
The relevant workflow run in Testflinger for the latest commit, which accommodates the requested changes: https://github.com/canonical/checkbox/actions/runs/12122214006
Unfortunately, all jobs testing latest/edge of DSS fail in this run. There was a release of the DSS snap on this risk level yesterday and it seems to have a bug (Issue reported here). Since these are not failures of the validation suite, I believe this PR is ready.
The latest commit is a minor change to the README and does not impact the code, so I propose not to re-run the validations in Testflinger.
For the moment we lump it together in the validate-intel-gpu launcher... more refactoring coming
This is covered by checking that DSS's status says 'MLFlow deployment: Ready'. The removed test assumed a fixed position of the service's name in the output, which made it flaky, especially when re-running the tests.
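To illustrate why matching the status line is more robust than relying on output position, here is a minimal sketch. It is not the code from this PR: the function name `mlflow_is_ready` is hypothetical, and it assumes the status text (for example, captured from `dss status`) is piped in on stdin and contains the exact line quoted above when MLflow is up.

```shell
# Sketch only: `mlflow_is_ready` is a hypothetical helper, not part of the PR.
# Assumption: the DSS status text is provided on stdin and includes the line
# 'MLFlow deployment: Ready' once the deployment is up.
mlflow_is_ready() {
    # -F: match a fixed string, -q: exit status only, no output.
    # The match does not depend on where the line appears in the output.
    grep -Fq 'MLFlow deployment: Ready'
}

# Possible usage (assumed invocation, shown as a comment only):
#   dss status 2>&1 | mlflow_is_ready && echo "MLflow is ready"
```

Because the check looks for the readiness line anywhere in the text, extra or reordered lines on a re-run do not affect the result.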
Many tests here depend on specific resources being available, namely GPUs from Intel or NVIDIA, so not all tests are expected to pass on a given machine. Hence we should not spend much time retrying these tests.
The tests fail on re-runs because the Intel GPU plugin starts counting NVIDIA GPUs too
One redundant test job has been removed since the new test case now implicitly tests importing itex as well
One redundant test job has been removed since the new test case now implicitly tests importing ipex as well
There seems to be a bug in the Intel GPU plugin where it starts counting NVIDIA GPUs too under its label once NVIDIA's plugin is enabled. The tests are now updated to check for matching the minimum slot count instead of an exact one.
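The difference between an exact match and a minimum match can be sketched as follows. This is an illustration rather than the script from this PR: `has_min_slots` is a hypothetical name, and how the actual slot count is obtained from the cluster is left as an assumption in the comments.

```shell
# Sketch only: `has_min_slots` is a hypothetical helper, not part of the PR.
# Usage: has_min_slots ACTUAL MINIMUM
# Returns 0 if ACTUAL >= MINIMUM, 1 if it is lower, 2 if ACTUAL is not a number.
has_min_slots() {
    actual=$1
    minimum=$2
    case "$actual" in
        # Reject empty or non-numeric input instead of letting `[` error out.
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$actual" -ge "$minimum" ]
}

# Assumption: ACTUAL would come from the node's reported capacity, e.g.
#   kubectl get nodes -o jsonpath='{.items[0].status.capacity.gpu\.intel\.com/i915}'
# The resource name and node index above are illustrative, not taken from the PR.
```

With an exact comparison, a node reporting more slots than expected (as happens when NVIDIA GPUs are also counted) would fail; the minimum comparison tolerates the over-count.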
It helps to know which script is being run
The previous approach checked for the driver, but that does not work for NVIDIA GPUs because we don't install their drivers on the machine (the drivers are installed in the k8s operator).
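For context, a driver-independent check can look at the PCI vendor ID instead of the bound driver. The sketch below is only an illustration of that idea and is not how this PR does it (the PR relies on Checkbox's graphics_card resource). The function name `has_gpu_from_vendor` is hypothetical; the sysfs layout, the display-controller class prefix `0x03`, and the vendor IDs `0x10de` (NVIDIA) and `0x8086` (Intel) are standard on Linux.

```shell
# Sketch only: `has_gpu_from_vendor` is a hypothetical helper, not part of the PR.
# Usage: has_gpu_from_vendor VENDOR_ID [PCI_DEVICES_DIR]
# Returns 0 if a display-class PCI device from VENDOR_ID exists, 1 otherwise.
has_gpu_from_vendor() {
    wanted=$1
    # The directory is overridable so the function can be exercised without hardware.
    pci_dir=${2:-/sys/bus/pci/devices}
    for dev in "$pci_dir"/*; do
        if [ ! -r "$dev/class" ] || [ ! -r "$dev/vendor" ]; then
            continue
        fi
        class=$(cat "$dev/class")
        vendor=$(cat "$dev/vendor")
        case "$class" in
            # PCI base class 0x03 covers VGA, 3D and other display controllers.
            0x03*)
                if [ "$vendor" = "$wanted" ]; then
                    return 0
                fi
                ;;
        esac
    done
    return 1
}

# Possible usage: has_gpu_from_vendor 0x10de && echo "NVIDIA GPU present"
```

This works whether or not a kernel driver is bound to the device, which is the situation described above for NVIDIA GPUs.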
e81eeca to 1dc30bd
pieqq left a comment
Thanks for the modifications! LGTM :)
Description
Changes to test jobs
- `graphics_card` resource, and enable skipping respective tests when the relevant GPUs are not available.
Changes to the test plan
- `resource.pxu` and additional NVIDIA tests, as explained above.
Changes to the snap
- `checkbox-dss` snap produced with the provider has been changed from `validate-intel-gpu` to `validate-with-gpu`.
- `install-deps` has been refactored, and now accepts specifying the version of the main snaps to be installed, which currently include DSS itself, Microk8s, and `kubectl`.
- `2.0` to `3.0`, and changes have been made to the relevant `snapcraft.yaml` and to the README.
Changes to the relevant GitHub workflow
Resolved issues
Documentation
No changes to Checkbox's documentation.
Tests
These DSS validations need to be run on machines from Testflinger. See a recent run of the workflow here (the relevant one for this PR is https://github.com/canonical/checkbox/actions/runs/12056842710).