Skip to content

Support CEL expressions in DRA ResourceClaimTemplates#9742

Open
kannon92 wants to merge 3 commits intokubernetes-sigs:mainfrom
kannon92:issue-7298-cel
Open

Support CEL expressions in DRA ResourceClaimTemplates#9742
kannon92 wants to merge 3 commits intokubernetes-sigs:mainfrom
kannon92:issue-7298-cel

Conversation

@kannon92
Copy link
Contributor

@kannon92 kannon92 commented Mar 8, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

Removes the restriction that rejected CEL selectors in DRA ResourceClaimTemplate device requests. CEL selectors are filters that narrow which devices satisfy a request but don't change how many devices are requested. Kueue only needs the device count and class name for quota management, so CEL selectors can be safely accepted without evaluation. The Kubernetes scheduler handles actual device matching at scheduling time.

This enables use cases like:

  • Partitionable devices (MIG GPUs) via KEP-4815
  • Network device filtering via DRANET
  • Any DRA driver that publishes device attributes for CEL-based selection

Which issue(s) this PR fixes:

Fixes #7298

Special notes for your reviewer:

This PR was developed with the assistance of Claude Code (AI).

The upstream issue kubernetes/kubernetes#137490 tracks improving error feedback when CEL selectors don't match any device at scheduling time — that is a separate scheduler-side concern.

Does this PR introduce a user-facing change?

Support CEL expressions in DRA ResourceClaimTemplates. Kueue no longer rejects workloads that use CEL device selectors for filtering devices in resource claims.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. kind/feature Categorizes issue or PR as related to a new feature. labels Mar 8, 2026
@netlify
Copy link

netlify bot commented Mar 8, 2026

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 72e89fd
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/69af6700222af000082176c0

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kannon92
Once this PR has been reviewed and has the lgtm label, please assign mimowo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 8, 2026
kannon92 added 2 commits March 8, 2026 16:09
CEL selectors are filters that narrow which devices satisfy a request
but don't change how many devices are requested. Kueue only needs the
device count and class name for quota management, so CEL selectors can
be safely accepted without evaluation. The Kubernetes scheduler handles
actual device matching at scheduling time.
Adds an e2e test that creates a ResourceClaimTemplate with a CEL
selector filtering by driver name, verifies Kueue admits the workload
with correct resource usage, and confirms the job completes successfully.
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Mar 8, 2026
@vladikkuzn
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 9, 2026
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

DetailsGit tree hash: 78f1e907166a95ca7b83bd59a220c6fcd4336b55

@mimowo
Copy link
Contributor

mimowo commented Mar 9, 2026

cc @sohankunkerkar @alaypatel07 ptal

Copy link
Member

@sohankunkerkar sohankunkerkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent some time looking at this from both sides i.e. the Kueue DRA pipeline and the upstream K8s allocator and I think there's more to this than removing the CEL rejection.

The problem is that CEL expressions in DRA covers two very different things. For simple filtering (DRANET, selecting by driver name), CEL just narrows which devices match and the cost is still "1 device" and removing the rejection works fine. But for partitionable devices (KEP-4815), which was the primary motivation for this issue, two workloads can request the same DeviceClass with count=1, but one picks a small MIG partition consuming ~5Gi of 80Gi GPU memory while the other picks a full partition consuming the entire 80Gi. The upstream scheduler handles this through SharedCounters/ConsumesCounters on ResourceSlices but if Kueue just counts devices, both get the same quota cost and the quota becomes meaningless.

We know Kueue shouldn't evaluate CEL. But we do need some way to determine the actual cost in terms of counter consumption rather than device count; whether through admin config that assigns cost per DeviceClass, user-declared consumption, or ResourceSlice observation.

Removing the CEL rejection is a reasonable first step for simple filtering use cases, but I don't think we should close this issue until we've addressed the counter-based quota problem for partitionable devices.

Edit: Beyond the partitionable devices problem, there's an even more immediate concern with simply removing the rejection. CEL evaluation happens in the scheduler's Filter phase against real ResourceSlice data well beyond Kueue's reach. By the time the scheduler runs, Kueue has already admitted and charged quota. If a user submits a workload with a CEL selector that no device actually matches (wrong attribute name, impossible filter, typo), Kueue charges quota, the scheduler can't place the pod, and that quota is permanently consumed. This blocks other legitimate workloads from getting through. @harche reproduced this on a Kind cluster with the DRA example driver. Without WaitForPodsReady configured (which is optional), there's no self-correction at all.

@mimowo
Copy link
Contributor

mimowo commented Mar 9, 2026

We know Kueue shouldn't evaluate CEL.

Why is this a necessary assumption? I wouldn't make this as a "rule" if this steps in the way. I would assume there are some helpers which could help with that, and it is not rocket science . If this would solve the issue, then I think this is the right thing to do. At least I wouldn't do something just because of "code complexity" if the feature is broken otherwise. We may reject evaluation of CEL if there are other reasons, but I would like to undrestand those.

General ask: let's think which semantics is best to serve our users. Then we can think how to best achieve it.

@kannon92
Copy link
Contributor Author

kannon92 commented Mar 9, 2026

Yea this is sorta a problem with Kueue generally.

If a user requests a bogus gpu and sets ClusterQueue for it (or say they want a pod requiesting 256 gpus). I think Kueue will check against quota.

But it won't check the scheduler if it can actually be placed.

Kueue's TAS does incorporate scheduling capacity and becomes its own scheduler.
So this problem becomes critical to solve for DRA + TAS but we haven't yet designed or worked on that.

For kueue as a quota manager I guess I'm not sure if we need CEL expressions to be evaulated at quota time. We can certainly do it and its required for TAS. But I don't know of anyone working on Kueue + DRA + TAS and incorportating DRA scheduling into TAS.

@mimowo
Copy link
Contributor

mimowo commented Mar 9, 2026

But it won't check the scheduler if it can actually be placed.

Yeah, but would evaluating the CEL be equivalent to that? What are the resources Kueue would need to track to evaluate the CELs properly?

For kueue as a quota manager I guess I'm not sure if we need CEL expressions to be evaulated at quota time.

Yeah. The problem is that "as quota manager" Kueue works reasonably without TAS - just need to account for node fragmentation and node failures. Still, there were too many admission mistakes to be usable for some of our users. However, DRA without CEL evaluation seems like many more admission mistakes. So, I would like to explore evaluation as a possibility.

But I don't know of anyone working on Kueue + DRA + TAS and incorportating DRA scheduling into TAS.

We are planning to staff this effort, probably this year. The idea is to use the "Scheduling library" as presented on the last wg-batch by @44past4.

@sohankunkerkar
Copy link
Member

We know Kueue shouldn't evaluate CEL.

Why is this a necessary assumption? I wouldn't make this as a "rule" if this steps in the way. I would assume there are some helpers which could help with that, and it is not rocket science. If this would solve the issue, then I think this is the right thing to do.

I think you picked the first line :D I wasn't ruling out CEL evaluation as a hard constraint. The rest of that sentence listed ResourceSlice observation as one of the approaches, which is essentially what you're suggesting. Evaluating the user's CEL against ResourceSlice data would solve both the quota leak problem and the counter-based cost problem for partitionable devices. The TOCTOU gap is the same one Kueue already lives with for regular resources.

One thing to keep in mind from the extended resources discussion, we'd want to avoid importing from k8s.io/dynamic-resource-allocation staging repo (per your earlier feedback). cel-go is already a transitive dependency in the vendor tree, so we can build a minimal device-matching environment on top of that directly if we go that route.

@kannon92
Copy link
Contributor Author

kannon92 commented Mar 9, 2026

At least from DRA folks, I was told that staging/dynamic-resource-allocation was the approach they recommend for autoscaling evaulation of CEL constraints.

I was thinking we'd have to go this route if we want the control-plane to evaulate the CEL expressions.

I did also notice that the UX for CEL is tricky from even core k8s perspective. I filled kubernetes/kubernetes#137490 as if you have invalid CEL I didn't see any information actually bubbled up the user that the cel expression failed.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 9, 2026
@k8s-ci-robot k8s-ci-robot requested a review from vladikkuzn March 9, 2026 21:28
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 9, 2026
@kannon92
Copy link
Contributor Author

kannon92 commented Mar 9, 2026

Pushed up a potential solution for cel.

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Mar 10, 2026

@kannon92: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kueue-test-e2e-dra-main 72e89fd link true /test pull-kueue-test-e2e-dra-main
pull-kueue-verify-main 72e89fd link true /test pull-kueue-verify-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support CEL expressions in the DRA support

5 participants