Migrate DSS GPU setup to setup-phase (New) by motjuste · Pull Request #2066 · canonical/checkbox

motjuste · 2025-08-07T06:47:22Z

Description

This PR migrates the GPU-setup jobs from the normal test-plan in the DSS provider to the setup test-plan (which will eventually be run in the setup-phase of Checkbox). This includes installing the NVIDIA GPU Operator and/or the Intel GPU plugin, based on the results of the newly added manifest entry jobs. The GPU setup itself is implemented in the newly added k8s_gpu_setup.py script.

New manifest entries

Since we cannot use Checkbox resources during the setup-phase, two new manifest entry jobs have been added to the provider: has_intel_gpus and has_nvidia_gpus. The gpgpu provider was used as a reference for this.

These manifest entries are only used in the setup jobs. Normal test jobs still use resources to detect which GPUs are available. This is done to catch any mistakes made in the setup phase that didn't fail the setup jobs.
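
As a rough illustration of how a setup job could branch on these two entries, the sketch below maps manifest values to the GPU vendors that need setting up. The `com.canonical.contrib` namespace prefix and the `vendors_to_set_up` helper are assumptions made for this example; they are not taken from the PR.

```python
# Hypothetical namespace prefix; the provider's real namespace may differ.
NAMESPACE = "com.canonical.contrib"

# Manifest entry names as given in the PR description.
ENTRIES = {"nvidia": "has_nvidia_gpus", "intel": "has_intel_gpus"}


def vendors_to_set_up(manifest: dict) -> list:
    """Return the GPU vendors whose manifest entry is truthy.

    A missing entry is treated as "no such GPU", so nothing is set up for it.
    """
    return [
        vendor
        for vendor, entry in ENTRIES.items()
        if manifest.get(f"{NAMESPACE}::{entry}", False)
    ]


if __name__ == "__main__":
    example = {
        f"{NAMESPACE}::has_nvidia_gpus": True,
        f"{NAMESPACE}::has_intel_gpus": False,
    }
    print(vendors_to_set_up(example))
```

Treating a missing entry as false keeps the setup conservative: the resource-based checks in the normal test jobs would then surface a GPU that the manifest failed to declare.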

GPU Setup

The NVIDIA GPU Operator is set up using Helm so that it can work with all Kubernetes clusters, but it requires some special handling for MicroK8s, similar to how MicroK8s' own addon works.
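
To make the MicroK8s special handling more concrete, here is a minimal sketch that only builds a `helm install` command line. It assumes the `nvidia` Helm repository alias has already been added, and it uses the containerd paths that NVIDIA's documentation lists for MicroK8s. The `helm_install_cmd` helper and the release and namespace names are hypothetical; the actual k8s_gpu_setup.py may be structured differently.

```python
import shlex

# MicroK8s keeps containerd's config and socket under its snap directories,
# so the operator's container toolkit has to be pointed at them explicitly.
MICROK8S_TOOLKIT_ENV = {
    "CONTAINERD_CONFIG": "/var/snap/microk8s/current/args/containerd-template.toml",
    "CONTAINERD_SOCKET": "/var/snap/microk8s/common/run/containerd.sock",
    "CONTAINERD_RUNTIME_CLASS": "nvidia",
    "CONTAINERD_SET_AS_DEFAULT": "true",
}


def helm_install_cmd(version: str, microk8s: bool) -> list:
    """Build (but do not run) a helm command installing the GPU Operator."""
    cmd = [
        "helm", "install", "gpu-operator", "nvidia/gpu-operator",
        "--namespace", "gpu-operator", "--create-namespace",
        "--version", version, "--wait",
    ]
    if microk8s:
        # Each toolkit environment variable becomes an indexed name/value pair.
        for i, (name, value) in enumerate(MICROK8S_TOOLKIT_ENV.items()):
            cmd += ["--set", f"toolkit.env[{i}].name={name}"]
            cmd += ["--set-string", f"toolkit.env[{i}].value={value}"]
    return cmd


if __name__ == "__main__":
    # The version string here is only a placeholder.
    print(shlex.join(helm_install_cmd("v24.9.2", microk8s=True)))
```

Keeping the MicroK8s overrides in a separate mapping means the same command works unchanged on other Kubernetes distributions when `microk8s` is false.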

The Intel GPU plugin is enabled using kustomize.
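
For readers unfamiliar with the kustomize route, the sketch below builds the `kubectl apply -k` commands for a given plugin version. The overlay paths follow the layout of the upstream intel-device-plugins-for-kubernetes repository as best recalled and may differ between releases; the `intel_plugin_cmds` helper is hypothetical and not necessarily what this PR's script does.

```python
import shlex

REPO = "https://github.com/intel/intel-device-plugins-for-kubernetes"

# Assumed upstream kustomize targets: Node Feature Discovery, its
# feature rules, then the GPU plugin restricted to NFD-labelled nodes.
KUSTOMIZE_PATHS = [
    "deployments/nfd",
    "deployments/nfd/overlays/node-feature-rules",
    "deployments/gpu_plugin/overlays/nfd_labeled_nodes",
]


def intel_plugin_cmds(version: str) -> list:
    """Build (but do not run) the kubectl commands, in apply order."""
    return [
        ["kubectl", "apply", "-k", f"{REPO}/{path}?ref={version}"]
        for path in KUSTOMIZE_PATHS
    ]


if __name__ == "__main__":
    # The version string here is only a placeholder.
    for cmd in intel_plugin_cmds("v0.30.0"):
        print(shlex.join(cmd))
```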

Updates to Testflinger job

The GitHub workflow for running the DSS regression tests from this provider has been updated with:

Choices to specify which version of the NVIDIA GPU Operator and Intel GPU Plugin to install.
Default manifest files for each of the used devices, which are used when the appropriate manifest is not available on C3.
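
The fallback to a default manifest can be pictured with the sketch below. The workflow itself is not written in Python, so this only illustrates the logic; `load_manifest` and the `fetch` callable are hypothetical stand-ins for whatever the job uses to query C3.

```python
import json
from pathlib import Path
from typing import Callable


def load_manifest(fetch: Callable[[], dict], default_path: Path) -> dict:
    """Try to fetch the device's manifest; fall back to the default file."""
    try:
        return fetch()
    except Exception as exc:  # any fetch failure triggers the fallback
        print(f"Fetching manifest failed ({exc}); using {default_path}")
        return json.loads(default_path.read_text())
```

A caller would pass a function that retrieves the manifest for the device under test, plus the path of the default manifest shipped for that device's queue.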

Resolved issues

CHECKBOX-1902 and CHECKBOX-1903

Documentation

No changes to Checkbox's documentation. Information about GPU setup has been added to the README of the provider.

Tests

Full run of the GitHub Workflow

codecov · 2025-08-07T06:48:33Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.02%. Comparing base (fec2b5c) to head (56b6459).
⚠️ Report is 104 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2066      +/-   ##
==========================================
+ Coverage   51.86%   52.02%   +0.16%     
==========================================
  Files         387      389       +2     
  Lines       41674    41819     +145     
  Branches     7741     7741              
==========================================
+ Hits        21613    21758     +145     
  Misses      19294    19294              
  Partials      767      767

Flag	Coverage Δ
provider-dss	`100.00% <100.00%> (ø)`

wctaylor

I don't have the full context for how everything works, so I won't approve this. That said, I do think everything looks good. I just have one nitpick

contrib/checkbox-dss-validation/checkbox-provider-dss/bin/k8s_gpu_setup.py

fernando79513

Good job here!
I've just added a couple of small comments and replied about the variable renaming, but the rest looks nice.

contrib/checkbox-dss-validation/checkbox-provider-dss/bin/k8s_gpu_setup.py

motjuste · 2025-08-18T14:26:35Z

I have incorporated the feedback, mainly by removing the unused global variables that were left over by mistake from the previous implementation. The changes are not functional, but I have triggered a new run just to confirm.

motjuste · 2025-08-19T08:52:50Z

I had some bad luck with the first few attempts at those triggered runs, but re-running them recently succeeded. The earlier attempts failed while fetching Docker images for the DSS notebooks, which is unrelated to the changes in this PR, so I re-ran the jobs and now they all pass.

fernando79513

LGTM +1!

* Add installing helm snap
* Add first impl of k8s_gpu_setup.py
* Add setup job to setup gpus in k8s
* Fix shellcheck complaints
* Remove setting up GPUs in normal test plan
* Add gpu versions to setup launchers
* Add input for setting gpu versions
* Fix detecting GPUs
  what a classic mistake
* Fix using k8s_gpu_setup.py in setup job
* Add tests for installing GPUs
* Fix gpu setup job removing trailing escape
* Fix tests for installing GPUs
* Fix not parsing version correctly
* Do not retry the main test-plan
* Fix formatting of test
* Add manifest entry jobs for nvidia and intel gpus
* Remove GPU det. using UdevadmParser and simplify
  Now the script needs to be called once per nvidia and intel
* Use manifest in setup jobs to setup gpus
* Add using manifest in testflinger job
  Manifest is first attempted to be fetched, and on failure, the provided default manifest is used.
* Add default manifests for queues and use it in job
* Add main CLI tests
* Disable unlikely branch from coverage reporting
* Fix wrong requirement to insteall intel gpu tools
* Add docs for GPU setup to README
* Remove unused globals from prev impl

motjuste added 14 commits August 5, 2025 16:30

Add installing helm snap

a7bc6e4

Add first impl of k8s_gpu_setup.py

c0d0876

Add setup job to setup gpus in k8s

52577f1

Fix shellcheck complaints

cfd386e

Remove setting up GPUs in normal test plan

7e32632

Add gpu versions to setup launchers

8c49a03

Add input for setting gpu versions

3edd614

Fix detecting GPUs

2a17f0a

what a classic mistake

Fix using k8s_gpu_setup.py in setup job

17b0e78

Add tests for installing GPUs

54be393

Fix gpu setup job removing trailing escape

7cb3e3a

Fix tests for installing GPUs

9a1e2fc

Fix not parsing version correctly

81d2975

Do not retry the main test-plan

085cf9d

motjuste added 10 commits August 7, 2025 08:49

Fix formatting of test

7e4ee5d

Add manifest entry jobs for nvidia and intel gpus

02f4531

Remove GPU det. using UdevadmParser and simplify

408208b

Now the script needs to be called once per nvidia and intel

Use manifest in setup jobs to setup gpus

d444671

Add using manifest in testflinger job

b42d2c8

Manifest is first attempted to be fetched, and on failure, the provided default manifest is used.

Add default manifests for queues and use it in job

50a71bc

Add main CLI tests

31fea33

Disable unlikely branch from coverage reporting

c1ba372

Fix wrong requirement to insteall intel gpu tools

419034d

Add docs for GPU setup to README

d8380bf

motjuste marked this pull request as ready for review August 13, 2025 11:51

motjuste requested a review from a team as a code owner August 13, 2025 11:51

motjuste requested a review from fernando79513 August 18, 2025 06:22

wctaylor reviewed Aug 18, 2025

contrib/checkbox-dss-validation/checkbox-provider-dss/bin/k8s_gpu_setup.py

fernando79513 requested changes Aug 18, 2025

Remove unused globals from prev impl

56b6459

motjuste requested a review from fernando79513 August 19, 2025 08:52

fernando79513 approved these changes Aug 19, 2025

motjuste merged commit 8a97104 into main Aug 19, 2025
37 of 42 checks passed

motjuste deleted the CHECKBOX-1902-migrate-dss-gpu-to-setup branch August 19, 2025 09:07

motjuste mentioned this pull request Aug 27, 2025

Add testing DSS on Canonical K8s (New) #2084

Merged

motjuste merged 25 commits into main from CHECKBOX-1902-migrate-dss-gpu-to-setup
