
Simplify enabling Intel GPU for DSS (New) #1789

Merged
fernando79513 merged 10 commits into main from CHECKBOX-1693-simplify-enabling-intel-gpu-for-dss on Mar 19, 2025
Conversation

@motjuste
Contributor

@motjuste motjuste commented Mar 11, 2025

Description

(Please also refer to the individual commits for more specific details.)

For background: we continue to follow the how-to from the DSS team to enable Intel GPU acceleration.

  • Extract the script for enabling the Intel GPU plugin in the Kubernetes cluster into a standalone file.
  • Inline the check that the Kubernetes node ends up with the correct label signifying that an Intel GPU is available.
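The in-lined label check could be sketched roughly as follows. This is a minimal illustration, not the provider's actual code: the `gpu.intel.com/` label prefix matches the label family discussed in this PR, but the function names and job wiring here are assumptions.

```python
import json
import subprocess

# Label prefix published for Intel GPUs on nodes; used here for illustration.
INTEL_GPU_LABEL_PREFIX = "gpu.intel.com/"


def node_has_intel_gpu_label(node: dict) -> bool:
    """Return True if a node object carries any Intel GPU label."""
    labels = node.get("metadata", {}).get("labels", {})
    return any(key.startswith(INTEL_GPU_LABEL_PREFIX) for key in labels)


def cluster_has_intel_gpu_node() -> bool:
    """Check whether at least one node in the cluster is labelled for Intel GPU."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    nodes = json.loads(out)["items"]
    return any(node_has_intel_gpu_label(node) for node in nodes)
```

Checking for the mere presence of the label (rather than its count) sidesteps the counting inaccuracy described below.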

Furthermore, multiple tests have been removed; the DSS documentation does not ask to run these verifications.

  • The tests checking the daemonsets were redundant: we already use `kubectl rollout status ...` to verify their rollout as part of enabling the Intel GPU plugin, and that command won't succeed if the daemonsets don't roll out correctly.
  • The tests checking and relying on accurate counting of GPUs via the Kubernetes node label gpu.intel.com were wrong because the label itself is inaccurate: it also counts NVIDIA GPUs once the respective plugin is enabled. These tests were implemented by PE before we added tests with NVIDIA GPUs.
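The redundancy argument above can be sketched as follows. The daemonset name and namespace are placeholders, not the provider's actual values; the point is that `kubectl rollout status` itself fails (non-zero exit) if the rollout does not complete.

```python
import subprocess


def rollout_status_cmd(daemonset: str, namespace: str,
                       timeout: str = "5m") -> list:
    """Build the `kubectl rollout status` invocation used during /install."""
    return [
        "kubectl", "rollout", "status", f"daemonset/{daemonset}",
        "--namespace", namespace, "--timeout", timeout,
    ]


def wait_for_rollout(daemonset: str, namespace: str) -> None:
    """Block until the daemonset rolls out.

    check=True turns a failed or timed-out rollout into CalledProcessError,
    so a separate "is the daemonset ready?" test adds nothing.
    """
    subprocess.run(rollout_status_cmd(daemonset, namespace), check=True)
```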

Resolved issues

Documentation

No changes to documentation.

Tests

No changes to the unit tests. A full successful run of the provider in Testflinger can be found here.

@codecov

codecov bot commented Mar 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 49.83%. Comparing base (da6b4e9) to head (023e60e).
Report is 118 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1789   +/-   ##
=======================================
  Coverage   49.83%   49.83%           
=======================================
  Files         377      377           
  Lines       40719    40719           
  Branches     6851     6851           
=======================================
  Hits        20294    20294           
  Misses      19700    19700           
  Partials      725      725           
Flag          Coverage Δ
provider-dss  ∅ <ø> (∅)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@motjuste motjuste marked this pull request as ready for review March 11, 2025 11:54
@motjuste motjuste requested a review from a team as a code owner March 11, 2025 11:54
@motjuste motjuste requested a review from fernando79513 March 11, 2025 11:58
@motjuste
Contributor Author

I am having a lot of bad luck with microk8s today. I am attaching the (zipped) HTML submission report from my manual, baby-sat run of the provider (zipped so GitHub would let me attach it). Please confirm that the following jobs (impacted by this PR) pass:

  • intel_gpu_plugin/install
  • intel_gpu_plugin/labels
  • dss/status_intel_gpu

submission_2025-03-11T15.13.02.329019.html.zip
submission_2025-03-11T15.13.02.329019.tar.xz.zip

@motjuste
Contributor Author

motjuste commented Mar 12, 2025

I used a different machine of the same model and got a full successful run. The only difference was the machine's location in the Taipei Lab (the passing machine vs. the one that failed yesterday).

@motjuste
Contributor Author

I just had a successful run today on the same machine where it was failing yesterday.

@motjuste
Contributor Author

@fernando79513, is there anything still holding up this PR? I am unfortunately sick today, but I hope to be able to look at fixing any issues tomorrow.

@fernando79513 fernando79513 self-assigned this Mar 13, 2025
Collaborator

@fernando79513 fernando79513 left a comment


Good job simplifying these tests!
I have left some comments, mainly focused on moving from bash scripts to Python scripts wherever it makes sense.
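One way such a bash-to-Python move typically looks (illustrative only; the helper below is not from this PR): replace an inline shell pipeline with a small wrapper that fails loudly, the way `set -e` would in bash.

```python
import subprocess


def run_cmd(cmd: list) -> str:
    """Run a command and return its stdout.

    check=True raises CalledProcessError on non-zero exit, mirroring the
    fail-fast behaviour of `set -euo pipefail` in the bash scripts.
    """
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

A caller would then do e.g. `run_cmd(["kubectl", "get", "nodes"])` instead of shelling out via a bash script.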

Collaborator

@fernando79513 fernando79513 left a comment


LGTM +1!

@fernando79513 fernando79513 merged commit 9feec35 into main Mar 19, 2025
18 checks passed
@fernando79513 fernando79513 deleted the CHECKBOX-1693-simplify-enabling-intel-gpu-for-dss branch March 19, 2025 10:43
stanley31huang pushed a commit that referenced this pull request Mar 28, 2025
* Move enabling Intel GPU to standalone scripts

One script enables the Intel GPU plugin, the second verifies the rollout
status of the necessary daemonsets.

* Remove redundant tests to check daemonset rollout

We check the rollout of these daemonsets in the /install job, and the command
`kubectl rollout status` would not have finished successfully if these
daemonsets did not become ready.  We did that during /install to stick
to the pattern of enabling the NVIDIA GPU addon.

* Move checking intel gpu label in-line

* Remove tests relying on Intel GPU counts

These tests are unreliable because the label gpu.intel.com inaccurately
starts counting NVIDIA GPUs too once the respective plugin is enabled.
The same source of inaccuracy also prevented us from making accurate
checks for the capacity and allocatable slots available in the cluster.

We are also removing this test because the DSS how-to we have been
following does not ask for it.  See:
<https://github.com/canonical/data-science-stack/blob/e495bacef97a1b6bf8bdb63dca29912be317be8c/docs/how-to/enable-gpus/enable-intel-gpu.rst>

* Fix name of script to check nvidia gpu rollout

* Add a combined script for checking GPU rollout

* Use unified script to check GPU rollout

* Remove individual scripts for checking GPU rollout

* Switch to checking rollout of parent daemonset

While this branch is not impacted, when enabling NVIDIA GPU in
Canonical Kubernetes the `worker` daemonset can have a version in its
name that we cannot predict.  We are making this update here as well to
avoid merge conflicts or forgetting to update it.

* Update check to verify Intel GPU label

Now we test that at least one node in a potentially multi-node k8s
cluster has the required label for Intel GPU being available.
mreed8855 pushed a commit that referenced this pull request Jul 30, 2025
mreed8855 pushed a commit that referenced this pull request Jul 31, 2025
