Migrate DSS GPU setup to setup-phase (New) #2066
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #2066 +/- ##
==========================================
+ Coverage 51.86% 52.02% +0.16%
==========================================
Files 387 389 +2
Lines 41674 41819 +145
Branches 7741 7741
==========================================
+ Hits 21613 21758 +145
Misses 19294 19294
Partials 767 767
The script now needs to be called once for NVIDIA and once for Intel.
The manifest is fetched first; if fetching fails, the provided default manifest is used.
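The fetch-then-fall-back behaviour described above could look roughly like the sketch below. It is only an illustration, not the code from this PR: the `load_manifest` helper, the JSON format, and the `has_nvidia_gpus` key shown in the usage note are assumptions.

```python
import json
from pathlib import Path
from typing import Callable


def load_manifest(fetch: Callable[[], str], default_path: Path) -> dict:
    """Return the fetched manifest, or the default manifest if fetching fails.

    `fetch` is a hypothetical callable returning the manifest as JSON text.
    """
    try:
        return json.loads(fetch())
    except (OSError, ValueError) as exc:
        # OSError covers network and file errors; ValueError covers
        # malformed JSON (json.JSONDecodeError is a subclass of ValueError).
        print(f"Fetching manifest failed ({exc}); using {default_path}")
        return json.loads(default_path.read_text())
```

With this shape, a queue-specific default file only has to exist on disk; a fetch that raises or returns invalid JSON ends up on the same fallback path.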
wctaylor
left a comment
I don't have the full context for how everything works, so I won't approve this. That said, I do think everything looks good. I just have one nitpick
contrib/checkbox-dss-validation/checkbox-provider-dss/bin/k8s_gpu_setup.py
fernando79513
left a comment
Good job here!
I've just added a couple of small comments and replied in the variable-renaming thread, but the rest looks good.
contrib/checkbox-dss-validation/checkbox-provider-dss/bin/k8s_gpu_setup.py
I have incorporated the feedback, mainly by removing the unused global variables that were left over by mistake from the previous implementation. None of these changes is functional, but I have triggered a new run just to confirm that.
I had some bad luck with the first few triggered runs: they failed while fetching Docker images for DSS notebooks, which is unrelated to the changes in this PR. I re-ran the jobs recently, and now they all pass.
* Add installing helm snap
* Add first impl of k8s_gpu_setup.py
* Add setup job to setup gpus in k8s
* Fix shellcheck complaints
* Remove setting up GPUs in normal test plan
* Add gpu versions to setup launchers
* Add input for setting gpu versions
* Fix detecting GPUs
  what a classic mistake
* Fix using k8s_gpu_setup.py in setup job
* Add tests for installing GPUs
* Fix gpu setup job removing trailing escape
* Fix tests for installing GPUs
* Fix not parsing version correctly
* Do not retry the main test-plan
* Fix formatting of test
* Add manifest entry jobs for nvidia and intel gpus
* Remove GPU det. using UdevadmParser and simplify
  Now the script needs to be called once per nvidia and intel
* Use manifest in setup jobs to setup gpus
* Add using manifest in testflinger job
  Manifest is first attempted to be fetched, and on failure, the provided default manifest is used.
* Add default manifests for queues and use it in job
* Add main CLI tests
* Disable unlikely branch from coverage reporting
* Fix wrong requirement to install intel gpu tools
* Add docs for GPU setup to README
* Remove unused globals from prev impl
Description
This PR migrates jobs related to GPU setup from the normal test-plan in the DSS provider to the setup test-plan (which will eventually be run in the setup-phase of Checkbox). This includes installing the NVIDIA GPU operator and/or the Intel GPU plugin, based on results from the newly added manifest entry jobs. Setting up the relevant GPU is implemented in the newly added k8s_gpu_setup.py script.
New manifest entries
Since we cannot use Checkbox resources during the setup-phase, two new manifest entry jobs have been added to the provider: has_intel_gpus and has_nvidia_gpus. The gpgpu provider was used as a reference for this.
These manifest entries are only used in the setup jobs. Normal test jobs still use resources to detect which GPUs are available. This is done to catch any mistakes made in the setup phase that didn't fail the setup jobs.
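As a rough illustration of how the two manifest entries might drive the setup jobs, the sketch below turns a manifest into one script invocation per GPU vendor. The `plan_gpu_setup` helper, the plain boolean manifest values, and the positional `vendor version` argument layout are all assumptions made for this example; the real CLI of k8s_gpu_setup.py may differ.

```python
from typing import Dict, List

# Vendors for which a manifest entry job exists (has_<vendor>_gpus).
VENDORS = ("nvidia", "intel")


def plan_gpu_setup(manifest: Dict[str, bool], versions: Dict[str, str]) -> List[List[str]]:
    """Build one command line per vendor whose manifest entry is true.

    The argument layout of k8s_gpu_setup.py used here is hypothetical.
    """
    calls = []
    for vendor in VENDORS:
        # A missing entry is treated as "no GPUs of this vendor".
        if manifest.get(f"has_{vendor}_gpus", False):
            calls.append(["k8s_gpu_setup.py", vendor, versions[vendor]])
    return calls
```

Because the script is called once per vendor, a machine with both kinds of GPU simply yields two entries in the returned list, and a machine with neither yields none.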
GPU Setup
The NVIDIA GPU Operator is set up using Helm so that it can work with all Kubernetes clusters, but it requires some special handling for MicroK8s, similar to how the MicroK8s addon works.
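A minimal sketch of what the Helm-based setup might look like is given below. It only builds the command line and is not the implementation from this PR. It assumes that the `nvidia` Helm repository has already been added, that the release and namespace are both named `gpu-operator`, and that the MicroK8s handling consists of pointing the operator's toolkit at the snap's containerd paths; those paths follow NVIDIA's published MicroK8s guidance and should be checked against the actual script.

```python
from typing import List

# ASSUMPTION: containerd locations inside the MicroK8s snap.
MICROK8S_TOOLKIT_ENV = {
    "CONTAINERD_CONFIG": "/var/snap/microk8s/current/args/containerd-template.toml",
    "CONTAINERD_SOCKET": "/var/snap/microk8s/common/run/containerd.sock",
}


def helm_install_gpu_operator_cmd(version: str, microk8s: bool = False) -> List[str]:
    """Build a `helm install` command for the NVIDIA GPU Operator chart."""
    cmd = [
        "helm", "install", "gpu-operator", "nvidia/gpu-operator",
        "--namespace", "gpu-operator", "--create-namespace",
        "--wait", f"--version={version}",
    ]
    if microk8s:
        # Pass each toolkit environment variable as a name/value pair.
        for i, (name, value) in enumerate(MICROK8S_TOOLKIT_ENV.items()):
            cmd += ["--set", f"toolkit.env[{i}].name={name}"]
            cmd += ["--set", f"toolkit.env[{i}].value={value}"]
    return cmd
```

Returning the argument list instead of running it keeps the function easy to unit test; the caller can hand the list to `subprocess.run(cmd, check=True)`.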
The Intel GPU plugin is enabled using kustomize.
Updates to Testflinger job
The GitHub workflow for running the DSS regression tests from this provider has been updated with:
Resolved issues
CHECKBOX-1902 and CHECKBOX-1903
Documentation
No changes to Checkbox's documentation. Information about GPU setup has been added to the README of the provider.
Tests
Full run of the GitHub Workflow