Support CEL expressions in DRA ResourceClaimTemplates #9742
kannon92 wants to merge 3 commits into kubernetes-sigs:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: kannon92. Needs approval from an approver in each of the relevant OWNERS files.
Adds an e2e test that creates a ResourceClaimTemplate with a CEL selector filtering by driver name, verifies Kueue admits the workload with correct resource usage, and confirms the job completes successfully.
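For illustration, a ResourceClaimTemplate of the kind this PR now accepts might look like the following sketch. The API version, driver name, and class name here are assumptions for illustration, not taken from the PR; Kueue reads only `deviceClassName` and the device count for quota, and passes the selector through untouched:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim-template   # hypothetical name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com   # hypothetical DeviceClass
        selectors:
        - cel:
            # filters which devices match; does not change the count
            expression: device.driver == "gpu.example.com"
```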
/lgtm
LGTM label has been added. Git tree hash: 78f1e907166a95ca7b83bd59a220c6fcd4336b55
cc @sohankunkerkar @alaypatel07 ptal |
I spent some time looking at this from both sides, i.e. the Kueue DRA pipeline and the upstream K8s allocator, and I think there's more to this than removing the CEL rejection.
The problem is that CEL expressions in DRA cover two very different things. For simple filtering (DRANET, selecting by driver name), CEL just narrows which devices match, the cost is still "1 device", and removing the rejection works fine. But for partitionable devices (KEP-4815), which was the primary motivation for this issue, two workloads can request the same DeviceClass with count=1, but one picks a small MIG partition consuming ~5Gi of 80Gi GPU memory while the other picks a full partition consuming the entire 80Gi. The upstream scheduler handles this through SharedCounters/ConsumesCounters on ResourceSlices, but if Kueue just counts devices, both get the same quota cost and the quota becomes meaningless.
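The counter mechanism from KEP-4815 can be sketched roughly as follows. This is an abridged, illustrative ResourceSlice, not a literal manifest; device names and quantities are made up, and the exact field layout depends on the API version:

```yaml
apiVersion: resource.k8s.io/v1beta2
kind: ResourceSlice
spec:
  sharedCounters:
  - name: gpu-0-counters          # one physical 80Gi GPU
    counters:
      memory:
        value: 80Gi
  devices:
  - name: gpu-0-mig-small         # small MIG partition
    consumesCounters:
    - counterSet: gpu-0-counters
      counters:
        memory:
          value: 5Gi
  - name: gpu-0-full              # the whole GPU
    consumesCounters:
    - counterSet: gpu-0-counters
      counters:
        memory:
          value: 80Gi
```

Both devices satisfy a count=1 request against the same DeviceClass, yet one consumes 5Gi of the shared counter and the other consumes all 80Gi, which is exactly why counting devices alone misprices the quota.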
We know Kueue shouldn't evaluate CEL. But we do need some way to determine the actual cost in terms of counter consumption rather than device count; whether through admin config that assigns cost per DeviceClass, user-declared consumption, or ResourceSlice observation.
Removing the CEL rejection is a reasonable first step for simple filtering use cases, but I don't think we should close this issue until we've addressed the counter-based quota problem for partitionable devices.
Edit: Beyond the partitionable devices problem, there's an even more immediate concern with simply removing the rejection. CEL evaluation happens in the scheduler's Filter phase against real ResourceSlice data well beyond Kueue's reach. By the time the scheduler runs, Kueue has already admitted and charged quota. If a user submits a workload with a CEL selector that no device actually matches (wrong attribute name, impossible filter, typo), Kueue charges quota, the scheduler can't place the pod, and that quota is permanently consumed. This blocks other legitimate workloads from getting through. @harche reproduced this on a Kind cluster with the DRA example driver. Without WaitForPodsReady configured (which is optional), there's no self-correction at all.
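For example, a selector with a typoed attribute name passes API validation (it is syntactically valid CEL) but can never match a real device. The attribute and driver names below are hypothetical:

```yaml
selectors:
- cel:
    # "modell" is a typo for "model": valid CEL, so the API server accepts it,
    # but no device ever matches, and the pod is never schedulable
    expression: device.attributes["gpu.example.com"].modell == "a100"
```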
Why is this a necessary assumption? I wouldn't make this a "rule" if it gets in the way. I would assume there are some helpers which could help with that, and it is not rocket science. If this would solve the issue, then I think this is the right thing to do. At least I wouldn't do something just because of "code complexity" if the feature is broken otherwise. We may reject evaluation of CEL if there are other reasons, but I would like to understand those. General ask: let's think about which semantics best serves our users. Then we can think about how to best achieve it.
Yeah, this is sort of a problem with Kueue generally. If a user requests a bogus GPU and sets a ClusterQueue for it (or say they want a pod requesting 256 GPUs), I think Kueue will check against quota, but it won't check with the scheduler whether it can actually be placed. Kueue's TAS does incorporate scheduling capacity and becomes its own scheduler. For Kueue as a quota manager, I'm not sure we need CEL expressions to be evaluated at quota time. We can certainly do it, and it's required for TAS, but I don't know of anyone working on Kueue + DRA + TAS and incorporating DRA scheduling into TAS.
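The gap described above can be sketched with a toy quota check. This is illustrative only, not Kueue's actual code: admission compares the request against quota and never consults the scheduler, so a request no node could ever satisfy still admits and holds quota.

```go
package main

import "fmt"

// quota is a toy model of one ClusterQueue flavor's quota state
// (illustrative only, not Kueue's data structures).
type quota struct {
	limit, used int64
}

// admit charges quota if the request fits. It never checks schedulability,
// so a request for 256 GPUs on a cluster with 8-GPU nodes still passes
// as long as the quota is large enough.
func (q *quota) admit(request int64) bool {
	if q.used+request > q.limit {
		return false
	}
	q.used += request
	return true
}

func main() {
	q := &quota{limit: 512}
	fmt.Println(q.admit(256)) // true: fits the quota, even if unschedulable
	fmt.Println(q.admit(300)) // false: quota is now held by the stuck workload
}
```

The second request is rejected purely because the first (possibly unplaceable) one is still holding quota, which is the leak described in the comment above.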
Yeah, but would evaluating the CEL be equivalent to that? What are the resources Kueue would need to track to evaluate the CELs properly?
Yeah. The problem is that "as quota manager" Kueue works reasonably without TAS - just need to account for node fragmentation and node failures. Still, there were too many admission mistakes to be usable for some of our users. However, DRA without CEL evaluation seems like many more admission mistakes. So, I would like to explore evaluation as a possibility.
We are planning to staff this effort, probably this year. The idea is to use the "Scheduling library" as presented on the last wg-batch by @44past4.
I think you picked the first line :D I wasn't ruling out CEL evaluation as a hard constraint. The rest of that sentence listed ResourceSlice observation as one of the approaches, which is essentially what you're suggesting. Evaluating the user's CEL against ResourceSlice data would solve both the quota leak problem and the counter-based cost problem for partitionable devices. The TOCTOU gap is the same one Kueue already lives with for regular resources. One thing to keep in mind from the extended resources discussion: we'd want to avoid importing from
At least from the DRA folks, I was told that staging/dynamic-resource-allocation was the recommended approach for autoscaling evaluation of CEL constraints. I was thinking we'd have to go this route if we want the control plane to evaluate the CEL expressions. I also noticed that the UX for CEL is tricky even from a core k8s perspective. I filed kubernetes/kubernetes#137490: if you have invalid CEL, I didn't see any information actually bubbled up to the user that the CEL expression failed.
New changes are detected. LGTM label has been removed. |
Pushed up a potential solution for CEL.
Force-pushed from c598440 to 72e89fd.
@kannon92: The following tests failed.
What type of PR is this?
/kind feature
What this PR does / why we need it:
Removes the restriction that rejected CEL selectors in DRA ResourceClaimTemplate device requests. CEL selectors are filters that narrow which devices satisfy a request but don't change how many devices are requested. Kueue only needs the device count and class name for quota management, so CEL selectors can be safely accepted without evaluation. The Kubernetes scheduler handles actual device matching at scheduling time. This enables simple filtering use cases such as selecting devices by driver name (e.g. DRANET).
Which issue(s) this PR fixes:
Fixes #7298
Special notes for your reviewer:
This PR was developed with the assistance of Claude Code (AI).
The upstream issue kubernetes/kubernetes#137490 tracks improving error feedback when CEL selectors don't match any device at scheduling time — that is a separate scheduler-side concern.
Does this PR introduce a user-facing change?