Migrate DSS GPU setup to setup-phase (New)#2066

Merged
motjuste merged 25 commits into main from CHECKBOX-1902-migrate-dss-gpu-to-setup
Aug 19, 2025

@motjuste
Contributor

@motjuste motjuste commented Aug 7, 2025

Description

This PR migrates the GPU setup jobs from the normal test plan in the DSS provider to the setup test plan (which will eventually run in the setup phase of Checkbox). This includes installing the NVIDIA GPU Operator and/or the Intel GPU plugin, based on results from the newly added manifest entry jobs. Setting up the relevant GPU is implemented in the newly added k8s_gpu_setup.py script.

New manifest entries

Since we cannot use Checkbox resources during the setup-phase, two new manifest entry jobs have been added to the provider: has_intel_gpus and has_nvidia_gpus. The gpgpu provider was used as a reference for this.

These manifest entries are only used in the setup jobs. Normal test jobs still use resources to detect which GPUs are available. This is done to catch any mistakes made in the setup phase that didn't fail the setup jobs.
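As a rough illustration of how the two manifest entries could drive the setup jobs, here is a minimal Python sketch. The entry names `has_nvidia_gpus` and `has_intel_gpus` come from this PR; the `gpus_to_setup` helper, the `MANIFEST_ENTRIES` mapping, and the assumption that the manifest is available as a dictionary of (optionally namespaced) ids are hypothetical and not taken from the provider's code.

```python
from typing import List, Mapping

# Hypothetical mapping from GPU vendor to manifest entry name.
# The entry names are from the PR description; any namespace prefix
# ("<namespace>::<entry>") is deliberately not assumed here.
MANIFEST_ENTRIES = {
    "nvidia": "has_nvidia_gpus",
    "intel": "has_intel_gpus",
}


def gpus_to_setup(manifest: Mapping[str, object]) -> List[str]:
    """Return the vendors whose manifest entry is explicitly True."""
    vendors = []
    for vendor, entry in MANIFEST_ENTRIES.items():
        for key, value in manifest.items():
            # Accept both bare ids and namespaced ids.
            if key.split("::")[-1] == entry and value is True:
                vendors.append(vendor)
                break
    return vendors
```

A setup job could then call the setup script once per returned vendor, while the normal test jobs keep relying on resources as described above.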

GPU Setup

The NVIDIA GPU Operator is set up using Helm so that it can work with all Kubernetes clusters, but it requires some special handling for MicroK8s, similar to how the MicroK8s addon works.

The Intel GPU plugin is enabled using kustomize.
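To make the two install paths concrete, the sketch below builds (but does not run) the command lines involved. It is not the code in k8s_gpu_setup.py: the function names and the `toolkit_env` parameter are hypothetical, and the Helm repository URL, chart name, and kustomize paths are assumptions based on the public NVIDIA and Intel installation guides.

```python
from typing import Dict, List, Optional, Sequence

# Assumed from the upstream installation guides, not from this PR.
NVIDIA_HELM_REPO = "https://helm.ngc.nvidia.com/nvidia"
INTEL_PLUGINS_REPO = (
    "https://github.com/intel/intel-device-plugins-for-kubernetes"
)


def nvidia_operator_commands(
    version: str, toolkit_env: Optional[Dict[str, str]] = None
) -> List[List[str]]:
    """Build the Helm commands to install the NVIDIA GPU Operator."""
    install = [
        "helm", "install", "--wait", "gpu-operator", "nvidia/gpu-operator",
        "--namespace", "gpu-operator", "--create-namespace",
        "--version", version,
    ]
    # Hypothetical hook for cluster-specific handling (e.g. MicroK8s):
    # each entry becomes one element of the chart's toolkit.env list.
    for i, (name, value) in enumerate((toolkit_env or {}).items()):
        install += [
            "--set", f"toolkit.env[{i}].name={name}",
            "--set-string", f"toolkit.env[{i}].value={value}",
        ]
    return [
        ["helm", "repo", "add", "nvidia", NVIDIA_HELM_REPO],
        ["helm", "repo", "update"],
        install,
    ]


def intel_plugin_commands(
    version: str, kubectl: Sequence[str] = ("kubectl",)
) -> List[List[str]]:
    """Build the `kubectl apply -k` commands for the Intel GPU plugin."""
    paths = [
        "deployments/nfd",
        "deployments/nfd/overlays/node-feature-rules",
        "deployments/gpu_plugin/overlays/nfd_labeled_nodes",
    ]
    return [
        [*kubectl, "apply", "-k", f"{INTEL_PLUGINS_REPO}/{path}?ref={version}"]
        for path in paths
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Returning the commands instead of running them keeps the logic easy to unit test, and the `kubectl` parameter lets a caller substitute a cluster-specific wrapper such as `("microk8s", "kubectl")`.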

Updates to Testflinger job

The GitHub workflow for running the DSS regression tests from this provider has been updated with:

  • Choices to specify which version of the NVIDIA GPU Operator and Intel GPU Plugin to install.
  • Default manifest files for each of the used devices, which are used when the appropriate manifest is not available on C3.
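The manifest fallback in the second bullet can be sketched as follows. This is only an illustration in Python, not the workflow's actual implementation: `load_manifest` and its `fetch` callable are hypothetical stand-ins for whatever retrieves the manifest from C3, and the manifest is assumed to be JSON.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def load_manifest(
    fetch: Callable[[], str], default_path: Path
) -> Dict[str, object]:
    """Use the fetched manifest if possible, else the default file.

    `fetch` is a hypothetical callable returning the manifest as JSON text.
    """
    try:
        return json.loads(fetch())
    except Exception:
        # Any failure (network error, missing manifest, invalid JSON)
        # falls back to the default manifest shipped for the device.
        return json.loads(default_path.read_text())
```

Catching every exception is a deliberate simplification here: for a regression run, falling back to a known default is usually preferable to aborting before any test has started.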

Resolved issues

CHECKBOX-1902 and CHECKBOX-1903

Documentation

No changes to Checkbox's documentation. Information about GPU setup has been added to the README of the provider.

Tests

Full run of the GitHub Workflow

@codecov

codecov bot commented Aug 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.02%. Comparing base (fec2b5c) to head (56b6459).
⚠️ Report is 104 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2066      +/-   ##
==========================================
+ Coverage   51.86%   52.02%   +0.16%     
==========================================
  Files         387      389       +2     
  Lines       41674    41819     +145     
  Branches     7741     7741              
==========================================
+ Hits        21613    21758     +145     
  Misses      19294    19294              
  Partials      767      767              
| Flag | Coverage Δ |
|---|---|
| provider-dss | 100.00% <100.00%> (ø) |


@motjuste motjuste marked this pull request as ready for review August 13, 2025 11:51
@motjuste motjuste requested a review from a team as a code owner August 13, 2025 11:51
@motjuste motjuste requested a review from fernando79513 August 18, 2025 06:22

@wctaylor wctaylor left a comment

I don't have the full context for how everything works, so I won't approve this. That said, I do think everything looks good. I just have one nitpick.

Collaborator

@fernando79513 fernando79513 left a comment

Good job here!
I've just added a couple of small comments and replied in the variable renaming thread, but the rest looks nice.

@motjuste
Contributor Author

I have incorporated the feedback, mainly by removing the unused global variables that were left over by mistake from the previous implementation. These changes are not functional, but I have triggered a new run just to confirm.

@motjuste
Contributor Author

I had some bad luck with the first few triggered runs: they failed while fetching Docker images for DSS notebooks, which is unrelated to the changes in this PR. I re-ran the jobs, and now they all pass.

@motjuste motjuste requested a review from fernando79513 August 19, 2025 08:52
Collaborator

@fernando79513 fernando79513 left a comment

LGTM +1!

@motjuste motjuste merged commit 8a97104 into main Aug 19, 2025
37 of 42 checks passed
@motjuste motjuste deleted the CHECKBOX-1902-migrate-dss-gpu-to-setup branch August 19, 2025 09:07
bladernr pushed a commit that referenced this pull request Aug 28, 2025
* Add installing helm snap

* Add first impl of k8s_gpu_setup.py

* Add setup job to setup gpus in k8s

* Fix shellcheck complaints

* Remove setting up GPUs in normal test plan

* Add gpu versions to setup launchers

* Add input for setting gpu versions

* Fix detecting GPUs

what a classic mistake

* Fix using k8s_gpu_setup.py in setup job

* Add tests for installing GPUs

* Fix gpu setup job removing trailing escape

* Fix tests for installing GPUs

* Fix not parsing version correctly

* Do not retry the main test-plan

* Fix formatting of test

* Add manifest entry jobs for nvidia and intel gpus

* Remove GPU det. using UdevadmParser and simplify

Now the script needs to be called once per nvidia and intel

* Use manifest in setup jobs to setup gpus

* Add using manifest in testflinger job

Manifest is first attempted to be fetched, and on failure, the provided
default manifest is used.

* Add default manifests for queues and use it in job

* Add main CLI tests

* Disable unlikely branch from coverage reporting

* Fix wrong requirement to install intel gpu tools

* Add docs for GPU setup to README

* Remove unused globals from prev impl
stanley31huang pushed a commit that referenced this pull request Oct 3, 2025
3 participants