
Simplify enabling Intel GPU for DSS (New) #1789

Merged
fernando79513 merged 10 commits into main from CHECKBOX-1693-simplify-enabling-intel-gpu-for-dss on Mar 19, 2025
Conversation

@motjuste
Contributor

@motjuste motjuste commented Mar 11, 2025

Description

(Please also refer to the individual commits for more specific details.)

For background: we continue to follow the how-to from the DSS team to enable Intel GPU acceleration.

  • Extract the script for enabling the Intel GPU plugin in the Kubernetes cluster into a standalone file.
  • Inline the check that the Kubernetes node ends up with the correct label signifying that an Intel GPU is available.
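The in-lined label check could be sketched roughly as follows. This is a minimal illustration, not the provider's actual code: the `gpu.intel.com/` label prefix matches the label family discussed in this PR, but the function names and job wiring here are assumptions.

```python
import json
import subprocess

# Label prefix published for Intel GPUs on nodes; used here for illustration.
INTEL_GPU_LABEL_PREFIX = "gpu.intel.com/"


def node_has_intel_gpu_label(node: dict) -> bool:
    """Return True if a node object carries any Intel GPU label."""
    labels = node.get("metadata", {}).get("labels", {})
    return any(key.startswith(INTEL_GPU_LABEL_PREFIX) for key in labels)


def cluster_has_intel_gpu_node() -> bool:
    """Check whether at least one node in the cluster is labelled for Intel GPU."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    nodes = json.loads(out)["items"]
    return any(node_has_intel_gpu_label(node) for node in nodes)
```

Checking for the mere presence of the label (rather than its count) sidesteps the counting inaccuracy described below.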

Furthermore, multiple tests have been removed; the DSS documentation does not ask to run these verifications.

  • The tests checking the daemonsets were redundant: we already use `kubectl rollout status ...` to verify their rollout as part of enabling the Intel GPU plugin, and that command won't succeed if the daemonsets don't roll out correctly.
  • The tests checking and relying on accurate counting of GPUs via the Kubernetes node label gpu.intel.com were wrong because the label itself is inaccurate: it also counts NVIDIA GPUs once the respective plugin is enabled. These tests were implemented by PE before we added tests with NVIDIA GPUs.
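The redundancy argument above can be sketched as follows. The daemonset name and namespace are placeholders, not the provider's actual values; the point is that `kubectl rollout status` itself fails (non-zero exit) if the rollout does not complete.

```python
import subprocess


def rollout_status_cmd(daemonset: str, namespace: str,
                       timeout: str = "5m") -> list:
    """Build the `kubectl rollout status` invocation used during /install."""
    return [
        "kubectl", "rollout", "status", f"daemonset/{daemonset}",
        "--namespace", namespace, "--timeout", timeout,
    ]


def wait_for_rollout(daemonset: str, namespace: str) -> None:
    """Block until the daemonset rolls out.

    check=True turns a failed or timed-out rollout into CalledProcessError,
    so a separate "is the daemonset ready?" test adds nothing.
    """
    subprocess.run(rollout_status_cmd(daemonset, namespace), check=True)
```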

Resolved issues

Documentation

No changes to documentation.

Tests

No changes to the unit tests. A full successful run of the provider in Testflinger can be found here.

@codecov

codecov bot commented Mar 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 49.83%. Comparing base (da6b4e9) to head (023e60e).
Report is 118 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1789   +/-   ##
=======================================
  Coverage   49.83%   49.83%           
=======================================
  Files         377      377           
  Lines       40719    40719           
  Branches     6851     6851           
=======================================
  Hits        20294    20294           
  Misses      19700    19700           
  Partials      725      725           
Flag          Coverage Δ
provider-dss  ∅ <ø> (∅)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@motjuste motjuste marked this pull request as ready for review March 11, 2025 11:54
@motjuste motjuste requested a review from a team as a code owner March 11, 2025 11:54
@motjuste motjuste requested a review from fernando79513 March 11, 2025 11:58
@motjuste
Contributor Author

I am having a lot of bad luck with microk8s today. I am attaching the (zipped) HTML submission report from my manual, baby-sat run of the provider (zipped so GitHub would let me attach it). Please confirm that the following jobs (impacted by this PR) pass:

  • intel_gpu_plugin/install
  • intel_gpu_plugin/labels
  • dss/status_intel_gpu

submission_2025-03-11T15.13.02.329019.html.zip
submission_2025-03-11T15.13.02.329019.tar.xz.zip

@motjuste
Contributor Author

motjuste commented Mar 12, 2025

I used a different machine of the same model and got a full successful run. The only difference was the machine's location in the Taipei Lab (the passing machine vs. the one that failed yesterday).

@motjuste
Contributor Author

I just had a successful run today on the same machine where it was failing yesterday.

@motjuste
Contributor Author

@fernando79513, is there anything still holding up this PR? I am unfortunately sick today, but I hope to be able to look at fixing any issues tomorrow.

@fernando79513 fernando79513 self-assigned this Mar 13, 2025
Collaborator

@fernando79513 fernando79513 left a comment


Good job simplifying these tests!
I have left some comments, mainly focused on moving from bash scripts to Python scripts wherever it makes sense.
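One way such a bash-to-Python move typically looks (illustrative only; the helper below is not from this PR): replace an inline shell pipeline with a small wrapper that fails loudly, the way `set -e` would in bash.

```python
import subprocess


def run_cmd(cmd: list) -> str:
    """Run a command and return its stdout.

    check=True raises CalledProcessError on non-zero exit, mirroring the
    fail-fast behaviour of `set -euo pipefail` in the bash scripts.
    """
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

A caller would then do e.g. `run_cmd(["kubectl", "get", "nodes"])` instead of shelling out via a bash script.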

Collaborator

@fernando79513 fernando79513 left a comment


LGTM +1!

@fernando79513 fernando79513 merged commit 9feec35 into main Mar 19, 2025
18 checks passed
@fernando79513 fernando79513 deleted the CHECKBOX-1693-simplify-enabling-intel-gpu-for-dss branch March 19, 2025 10:43
stanley31huang pushed a commit that referenced this pull request Mar 28, 2025
* Move enabling Intel GPU to standalone scripts

One script enables the Intel GPU plugin, the second verifies the rollout
status of the necessary daemonsets.

* Remove redundant tests to check daemonset rollout

We check the rollout of these daemonsets in the /install job, and the command
`kubectl rollout status` would not have finished successfully if these
daemonsets did not become ready.  We did that during /install to stick
to the pattern of enabling the NVIDIA GPU addon.

* Move checking intel gpu label in-line

* Remove tests relying on Intel GPU counts

These tests are unreliable because the label gpu.intel.com inaccurately
starts counting NVIDIA GPUs too once the respective plugin is enabled.
The same source of inaccuracy also prevented us from making accurate
checks for the capacity and allocatable slots available in the cluster.

We are also removing this test because the DSS how-to we have been
following does not ask for it.  See:
<https://github.com/canonical/data-science-stack/blob/e495bacef97a1b6bf8bdb63dca29912be317be8c/docs/how-to/enable-gpus/enable-intel-gpu.rst>

* Fix name of script to check nvidia gpu rollout

* Add a combined script for checking GPU rollout

* Use unified script to check GPU rollout

* Remove individual scripts for checking GPU rollout

* Switch to checking rollout of parent daemonset

While this branch is not impacted, when enabling NVIDIA GPU in
Canonical Kubernetes the `worker` daemonset can have a version in its
name that we cannot predict.  We are making this update here as well to
avoid merge conflicts or forgetting to update it.

* Update check to verify Intel GPU label

Now we test that at least one node in a potentially multi-node k8s
cluster has the required label for Intel GPU being available.
mreed8855 pushed a commit that referenced this pull request Jul 30, 2025
mreed8855 pushed a commit that referenced this pull request Jul 31, 2025
