Simplify enabling Intel GPU for DSS (New) #1789
Conversation
One script enables the Intel GPU plugin, the second verifies the rollout status of the necessary daemonsets.
We check the rollout of these daemonsets in the /install job, and the command `kubectl rollout status` would not have finished successfully if these daemonsets had not become ready. We did this during /install to stick to the pattern of enabling the NVIDIA GPU addon.
These tests are unreliable because the label `gpu.intel.com` inaccurately starts counting NVIDIA GPUs too once the respective plugin is enabled. The same source of inaccuracy also prevented us from making accurate checks of the capacity and allocatable slots available in the cluster. We are also removing this test because the DSS how-to we have been following does not ask for it. See: <https://github.com/canonical/data-science-stack/blob/e495bacef97a1b6bf8bdb63dca29912be317be8c/docs/how-to/enable-gpus/enable-intel-gpu.rst>
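The rollout check described above amounts to waiting on each daemonset with `kubectl rollout status`, which exits non-zero if a daemonset does not become ready in time. A minimal sketch, assuming a hypothetical namespace, daemonset names, and timeout (none of these are taken from the provider's actual scripts):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Wait for each named daemonset to finish rolling out. `kubectl rollout status`
# exits non-zero if the daemonset is not ready before the timeout, so a failed
# rollout fails this function (and the Checkbox job running it).
check_daemonset_rollouts() {
  local namespace="$1"
  shift
  local ds
  for ds in "$@"; do
    kubectl -n "$namespace" rollout status "daemonset/${ds}" --timeout=5m
  done
}

# Illustrative call (daemonset name is hypothetical, not the plugin's real one):
# check_daemonset_rollouts kube-system intel-gpu-plugin
```

Because the function fails fast on the first unready daemonset, putting it in the /install job gives the same guarantee the removed standalone tests provided.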
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #1789 +/- ##
=======================================
Coverage 49.83% 49.83%
=======================================
Files 377 377
Lines 40719 40719
Branches 6851 6851
=======================================
Hits 20294 20294
Misses 19700 19700
Partials 725 725
I am having a lot of bad luck with microk8s today. I am attaching the HTML submission report from my manual, babysat run of the provider (zipped so GitHub would let me attach it). Please confirm that the following jobs (impacted by this PR) pass:
submission_2025-03-11T15.13.02.329019.html.zip
I used a different machine of the same model and got a fully successful run. The difference was this machine's location in the Taipei Lab (the passing machine, vs. the one that failed yesterday).
I just had a successful run today on the same machine where it was failing yesterday.
@fernando79513, is there anything still holding this PR? I am unfortunately sick today, but I hope to be able to look at fixing any issues tomorrow.
fernando79513
left a comment
Good job simplifying these tests!
I have left some comments, mainly focused on moving from the bash scripts to python scripts whenever it makes sense.
contrib/checkbox-dss-validation/checkbox-provider-dss/bin/check_intel_gpu_rollout.sh (outdated; resolved)
contrib/checkbox-dss-validation/checkbox-provider-dss/bin/enable_intel_gpu_plugin.sh (resolved)
While this branch is not impacted, when enabling NVIDIA GPU in Canonical Kubernetes, the `worker` daemonset can have a version in its name that we cannot predict. We are making this update here as well to avoid merge conflicts or forgetting to update it later.
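The PR sidesteps the unpredictable suffix by watching the parent daemonset; an alternative, shown here purely for illustration, is to resolve the full name by its stable prefix. The prefix and sample names below are hypothetical:

```shell
# Resolve a daemonset name by prefix from a list of names on stdin
# (e.g. `kubectl get daemonsets -o name` output with the
# "daemonset.apps/" part stripped). Fails if no name matches.
pick_daemonset_by_prefix() {
  local prefix="$1"
  grep -m1 "^${prefix}" || return 1
}

# Illustrative use: resolve "nvidia-worker-<unpredictable version>"
# by its stable "nvidia-worker" prefix.
```

Watching the parent daemonset avoids this lookup entirely, which is why the PR prefers it.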
Now we test that at least one node in a potentially multi-node k8s cluster has the required label indicating an Intel GPU is available.
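The at-least-one-node check can be expressed as a filter over the nodes' labels. The label key below follows the node-feature-discovery convention referenced by the DSS how-to, but treat the exact key and the jsonpath in the comment as assumptions, not the provider's actual values:

```shell
# Succeed if at least one line of stdin mentions the Intel GPU node label.
# stdin is expected to hold one node's labels per line, e.g. from:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.labels}{"\n"}{end}'
any_node_has_intel_gpu_label() {
  grep -q 'intel\.feature\.node\.kubernetes\.io/gpu'
}
```

Piping the label listing through this function makes the job pass on any cluster where at least one node advertises the label, regardless of cluster size.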
* Move enabling Intel GPU to standalone scripts — One script enables the Intel GPU plugin, the second verifies the rollout status of the necessary daemonsets.
* Remove redundant tests to check daemonset rollout — We check the rollout of these daemonsets in the /install job, and the command `kubectl rollout status` would not have finished successfully if these daemonsets had not become ready. We did this during /install to stick to the pattern of enabling the NVIDIA GPU addon.
* Move checking Intel GPU label in-line
* Remove tests relying on Intel GPU counts — These tests are unreliable because the label `gpu.intel.com` inaccurately starts counting NVIDIA GPUs too once the respective plugin is enabled. The same source of inaccuracy also prevented us from making accurate checks of the capacity and allocatable slots available in the cluster. We are also removing this test because the DSS how-to we have been following does not ask for it. See: <https://github.com/canonical/data-science-stack/blob/e495bacef97a1b6bf8bdb63dca29912be317be8c/docs/how-to/enable-gpus/enable-intel-gpu.rst>
* Fix name of script to check NVIDIA GPU rollout
* Add a combined script for checking GPU rollout
* Use unified script to check GPU rollout
* Remove individual scripts for checking GPU rollout
* Switch to checking rollout of parent daemonset — While this branch is not impacted, when enabling NVIDIA GPU in Canonical Kubernetes, the `worker` daemonset can have a version in its name that we cannot predict. We are making this update here as well to avoid merge conflicts or forgetting to update it later.
* Update check to verify Intel GPU label — Now we test that at least one node in a potentially multi-node k8s cluster has the required label indicating an Intel GPU is available.
Description
(Please also refer to the individual commits to get more specific details)
For background, we continue to follow the How-To from the DSS team to enable Intel GPU acceleration.
Furthermore, multiple tests have been removed. DSS documentation does not ask to run these verifications.
* We use `kubectl rollout status ...` to verify their rollout as part of enabling the Intel GPU plugin (that command won't succeed if the daemonsets don't roll out correctly).
* Tests relying on counts from the label `gpu.intel.com` were wrong due to inaccuracy in what the label itself reports: the label also counts NVIDIA GPUs once the respective plugin is enabled. These tests were implemented by PE before we added tests with NVIDIA GPUs.

Resolved issues
Documentation
No changes to documentation.
Tests
No changes to the unit-tests. A full successful run of the provider in Testflinger can be found here.