OCPBUGS-77056: Make external cert validation asynchronous#745
OCPBUGS-77056: Make external cert validation asynchronous#745bentito wants to merge 8 commits intoopenshift:masterfrom
Conversation
|
Skipping CI for Draft Pull Request. |
|
@bentito: This pull request references Jira Issue OCPBUGS-77056, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
325f313 to
214b603
Compare
|
/hold |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds asynchronous, per-secret SAR validation and caching for external TLS certificates, two new status reasons ( Changes
Sequence DiagramsequenceDiagram
participant SM as RouteSecretManager
participant VS as ValidationModule
participant Cache as AsyncSARCache
participant SAR as SARClient
participant Secret as SecretsGetter
participant Status as StatusRecorder
participant Callback as onComplete\n(namespace, name)
SM->>VS: ValidateTLSExternalCertificate(route, fldPath, sarc, secretsGetter, Callback)
VS->>VS: Check ExternalCertificate presence
alt has ExternalCertificate
VS->>Cache: Lookup (namespace/secretName)
alt cache miss or not done
VS->>Cache: Insert pending asyncSARResult (done=false)
VS->>VS: Spawn async goroutine
rect rgba(100,150,200,0.5)
VS->>SAR: Perform SubjectAccessReview(s)
VS->>Secret: Fetch secret
VS->>VS: Validate keypair (tls.X509KeyPair)
VS->>Cache: Update asyncSARResult (errs, done=true)
VS->>Callback: Invoke(namespace, secretName)
end
Callback->>SM: Trigger route re-evaluation
SM->>Status: Record RouteUpdate or RouteRejection (SARCompleted / SecretLoaded)
else cache hit & done
VS->>Callback: Invoke immediately (or return cached errs)
end
else no ExternalCertificate
VS->>VS: Return early (no async work)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/router/controller/route_secret_manager.go`:
- Around line 288-298: The current Add handler unconditionally calls
RecordRouteRejection when a secret is replayed; change it to mirror the
UpdateFunc/SAR-completion logic: check the route's admission (via
populateRouteTLSFromSecret result or route.Status.Conditions) and if the route
is already admitted emit a normal update event instead of calling
RecordRouteRejection, otherwise record the rejection; extract the string
"ExternalCertificateSecretLoaded" into a shared constant (e.g.,
ExtCrtStatusReasonSecretLoaded) and add that constant to the
ignoreIngressConditionReason set in contention.go so the admission-ignoring
logic treats this replay reason as benign. Ensure you update references to
RecordRouteRejection, populateRouteTLSFromSecret, UpdateFunc,
ExternalCertificateSecretLoaded, and ignoreIngressConditionReason accordingly.
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 503-556: The async SAR flow stores only a single pending error
(pendingErr) in asyncSARCache for a cacheKey, so subsequent calls (in validate
flow) that hit the cache drop their onComplete callback and never get
re-enqueued; change the cache value to include a list of waiting callbacks (or a
small struct with errs + []onComplete callbacks) keyed by asyncSARCache so that
when you detect a pending entry you append the current onComplete for
route.Namespace/secretName, and in the goroutine after
asyncSARCache.Store(cacheKey, errs) iterate and invoke all stored callbacks (not
just the first) and then replace the cache entry with errs only; update code
paths that read the cache (the cache-hit branch and the goroutine completion) to
handle this new struct and ensure onComplete is invoked for every waiting route.
- Around line 485-490: The global asyncSARCache currently stores completed
validation results forever; change it to avoid permanent sticky entries by (a)
replacing the raw sync.Map value with a small struct that includes the cached
field.ErrorList plus an expiration timestamp or a source type flag, (b) only
writing non-transient successful validation results (or results with a short
TTL) into asyncSARCache, and (c) adding explicit invalidation helpers (e.g.,
InvalidateAsyncSARCache(namespace, secretName) and
InvalidateAllAsyncSARCacheForSecret(namespace, secretName)) and call them from
the secret add/update/delete/recreate paths in the secret manager
(route_secret_manager.go) so secret changes force revalidation; also update
ClearAsyncSARCacheForTest to clear the new structure. Ensure references to
asyncSARCache and ClearAsyncSARCacheForTest (and the code paths mentioned around
the 503-556 region) are updated to use the new value type and invalidation APIs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e07f375c-945a-4e58-b5ec-3347eaff109c
📒 Files selected for processing (5)
pkg/router/controller/contention.gopkg/router/controller/route_secret_manager.gopkg/router/controller/route_secret_manager_test.gopkg/router/routeapihelpers/validation.gopkg/router/routeapihelpers/validation_test.go
There was a problem hiding this comment.
♻️ Duplicate comments (2)
pkg/router/routeapihelpers/validation.go (1)
485-497:⚠️ Potential issue | 🟠 MajorCompleted SAR entries are still permanent.
Only secret add/update/delete clears this cache. A finished allow/deny result for
namespace/secretNamesurvives route deletion, later RBAC grants/revocations, and transient SAR/API failures, so a new route that reuses the same secret name can inherit a stale decision without performing a fresh check. Please put completed entries behind a TTL/generation and avoid caching transient internal errors indefinitely.Also applies to: 593-605
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/router/routeapihelpers/validation.go` around lines 485 - 497, The cache currently stores completed asyncSARResult entries forever; update asyncSARResult (and uses of asyncSARCache) to include a timestamp or generation counter and enforce a TTL/generation check before returning cached results so that completed allow/deny decisions expire after a short interval or when generation changes. Ensure only stable allow/deny outcomes are cached long-term; if asyncSARResult.errs contains transient/internal errors, set a much shorter TTL (or do not mark done) so they trigger fresh SARs. Update the cache read path to validate timestamp/generation and the write path to record it, and keep InvalidateAsyncSARCache(namespace, secretName) as-is to support manual invalidation.pkg/router/controller/route_secret_manager.go (1)
262-304:⚠️ Potential issue | 🟠 MajorDon't clear the fresh SAR result on the initial
SecretLoadedreplay.
RegisterRoute()is only reached aftervalidate()has already completed successfully, so this non-deletedAddFuncpath is replaying an already-validated secret. ClearingasyncSARCachehere throws away that fresh result, defeats the per-secret cache for shared secrets, and can bounce the route back into a second"authorization check pending"/ExternalCertificateValidationFailedcycle on theSecretLoadedrequeue. Keep invalidation on actual secret changes (recreated/updated/deleted), but not on the initial load replay.Suggested shape
AddFunc: func(obj interface{}) { secret := obj.(*kapi.Secret) log.V(4).Info("Secret added for route", "namespace", namespace, "secret", secret.Name, "route", routeName) - routeapihelpers.InvalidateAsyncSARCache(namespace, secret.Name) // Secret re-creation scenario // Check if the route key exists in the deletedSecrets map, indicating that the secret was previously deleted for this route. // If it exists, it means the secret is being recreated. Remove the key from the map and proceed with handling the route. @@ key := generateKey(namespace, routeName) if _, deleted := p.deletedSecrets.LoadAndDelete(key); deleted { + routeapihelpers.InvalidateAsyncSARCache(namespace, secret.Name) log.V(4).Info("Secret recreated for route", "namespace", namespace, "secret", secret.Name, "route", routeName)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/router/controller/route_secret_manager.go` around lines 262 - 304, The AddFunc for secret events currently calls routeapihelpers.InvalidateAsyncSARCache unconditionally, which discards fresh per-secret async SAR results during the initial cache-sync replay; to fix, remove the unconditional call to routeapihelpers.InvalidateAsyncSARCache from the top of the AddFunc in RegisterRoute's secret handler and only invoke routeapihelpers.InvalidateAsyncSARCache when the secret is actually changed/recreated/deleted (e.g., inside the p.deletedSecrets.LoadAndDelete "recreated" branch and in the Update/Delete handlers), using generateKey(namespace, routeName) and p.deletedSecrets to determine real recreation so you don't invalidate the SAR cache on the initial load.
🧹 Nitpick comments (1)
pkg/router/controller/route_secret_manager_test.go (1)
1355-1372: Put a timeout around these async waits.These bare channel receives will hang the whole suite on any missed informer callback. Now that the tests intentionally depend on async completion, a tiny helper with
select+time.Afterwill turn those deadlocks into normal test failures.Suggested helper
+func waitForRecorderEvent(t *testing.T, ch <-chan struct{}) { + t.Helper() + select { + case <-ch: + case <-time.After(2 * time.Second): + t.Fatal("timeout waiting for recorder event") + } +}Then replace each bare receive with:
-<-recorder.doneCh +waitForRecorderEvent(t, recorder.doneCh)Also applies to: 1434-1444, 1496-1516
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/router/controller/route_secret_manager_test.go` around lines 1355 - 1372, Replace bare channel receives on recorder.doneCh with a timed wait helper to avoid test hangs: add a small helper (e.g., waitForDone(ctxTimeout, ch) or waitWithTimeout(t, ch, duration)) that uses select with the channel and time.After to fail the test on timeout, then update every occurrence of "<-recorder.doneCh" in route_secret_manager_test.go (including the occurrences around the Secret Update flow that use kubeClient.CoreV1().Secrets(...).Update and the initial Add wait) to call that helper instead so missed informer callbacks produce a deterministic test failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/router/controller/route_secret_manager.go`:
- Around line 262-304: The AddFunc for secret events currently calls
routeapihelpers.InvalidateAsyncSARCache unconditionally, which discards fresh
per-secret async SAR results during the initial cache-sync replay; to fix,
remove the unconditional call to routeapihelpers.InvalidateAsyncSARCache from
the top of the AddFunc in RegisterRoute's secret handler and only invoke
routeapihelpers.InvalidateAsyncSARCache when the secret is actually
changed/recreated/deleted (e.g., inside the p.deletedSecrets.LoadAndDelete
"recreated" branch and in the Update/Delete handlers), using
generateKey(namespace, routeName) and p.deletedSecrets to determine real
recreation so you don't invalidate the SAR cache on the initial load.
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 485-497: The cache currently stores completed asyncSARResult
entries forever; update asyncSARResult (and uses of asyncSARCache) to include a
timestamp or generation counter and enforce a TTL/generation check before
returning cached results so that completed allow/deny decisions expire after a
short interval or when generation changes. Ensure only stable allow/deny
outcomes are cached long-term; if asyncSARResult.errs contains
transient/internal errors, set a much shorter TTL (or do not mark done) so they
trigger fresh SARs. Update the cache read path to validate timestamp/generation
and the write path to record it, and keep InvalidateAsyncSARCache(namespace,
secretName) as-is to support manual invalidation.
---
Nitpick comments:
In `@pkg/router/controller/route_secret_manager_test.go`:
- Around line 1355-1372: Replace bare channel receives on recorder.doneCh with a
timed wait helper to avoid test hangs: add a small helper (e.g.,
waitForDone(ctxTimeout, ch) or waitWithTimeout(t, ch, duration)) that uses
select with the channel and time.After to fail the test on timeout, then update
every occurrence of "<-recorder.doneCh" in route_secret_manager_test.go
(including the occurrences around the Secret Update flow that use
kubeClient.CoreV1().Secrets(...).Update and the initial Add wait) to call that
helper instead so missed informer callbacks produce a deterministic test
failure.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c800f527-b5e1-4386-b6cf-e20eb4726a31
📒 Files selected for processing (4)
pkg/router/controller/contention.gopkg/router/controller/route_secret_manager.gopkg/router/controller/route_secret_manager_test.gopkg/router/routeapihelpers/validation.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/router/controller/contention.go
|
/retest-required |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 561-568: The semaphore (asyncSARSemaphore) is being acquired on
the caller path before launching the goroutine, causing the caller to block when
the limit is reached; move the acquire into the goroutine so the caller can
spawn all per-secret goroutines without blocking, then release the token inside
that goroutine (use defer to ensure release). Concretely: start the goroutine
immediately, perform asyncSARSemaphore <- struct{}{} at the top of the goroutine
body, and keep the existing defer func() { <-asyncSARSemaphore }() to guarantee
release; ensure any early returns in the goroutine still release the token.
- Around line 532-535: The current guard that returns nil when sarc or
secretsGetter is nil hides wiring bugs and skips RBAC/secret validation;
instead, change the conditional in validation.go so that if sarc == nil ||
secretsGetter == nil you return an internal error (e.g., fmt.Errorf or the
package's internal/server error type) describing "missing validation clients"
rather than nil, or alternatively gate this behind a test-only seam/flag so
production never silently succeeds; update callers to propagate/handle the
returned error as needed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 31462f9b-48bb-466f-8d90-5c399b5fab8b
📒 Files selected for processing (1)
pkg/router/routeapihelpers/validation.go
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 526-529: The code currently returns a synthetic pending error on
cache miss without registering a callback if onComplete is nil; change the logic
in the validation path (the block that manipulates cached.callbacks and returns
cached.errs) so that onComplete must never be nil: either (A) make onComplete a
required parameter at the API boundary, or (B) when onComplete == nil execute
the async validation synchronously (drive the same validation routine on the
current goroutine, wait for completion, populate cached.errs, and then return
the real errors) instead of just returning the pending placeholder; if you
choose (B) ensure you reuse the same validation function that enqueues callbacks
(so cached state is updated) and that cached.callbacks is appended only when
onComplete is non-nil or after synchronous completion you return the finalized
errors.
- Around line 572-599: Wrap each SAR and secret GET with a per-request timeout
context (use context.WithTimeout and defer cancel) instead of context.TODO();
for the SAR checks stop using authorizationutil.Authorize and call the SAR
client Create() directly (e.g., sarClient.Create) with that timeout context so
you can inspect both (response, err) separately, treat API/transport errors as
retryable/internal (append field.InternalError or a distinct cached error) and
only append field.Forbidden when the SAR response explicitly denies, and call
secretsGetter.Secrets(route.Namespace).Get with the same bounded context and map
Get errors to NotFound vs InternalError accordingly. Ensure you still reference
routerServiceAccount, secretName, fldPath and preserve existing
field.Forbidden/field.InternalError/field.NotFound usages when classifying the
outcome.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bbebc97e-5cff-40e0-80d7-bff7bf16a68c
📒 Files selected for processing (1)
pkg/router/routeapihelpers/validation.go
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/router/routeapihelpers/validation.go (1)
637-639: Consider recovering from callback panics to ensure all callbacks are invoked.If a callback panics, the remaining callbacks in the slice won't be invoked. This is unlikely but could leave some routes stuck in a pending state if one route's callback handler has a bug.
🛡️ Proposed defensive callback invocation
for _, cb := range callbacks { - cb(route.Namespace, secretName) + func() { + defer func() { + if r := recover(); r != nil { + // Log panic but continue invoking remaining callbacks + } + }() + cb(route.Namespace, secretName) + }() }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/router/routeapihelpers/validation.go` around lines 637 - 639, Wrap each callback invocation in a deferred recover to prevent a single panic from aborting the loop: inside the loop over callbacks (callbacks, cb) call each cb via a small closure that uses defer + recover to catch any panic, log or report the panic along with the callback context (e.g., route.Namespace and secretName) and continue to the next cb; ensure the recovered panic does not re-panic so all callbacks are invoked.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 637-639: Wrap each callback invocation in a deferred recover to
prevent a single panic from aborting the loop: inside the loop over callbacks
(callbacks, cb) call each cb via a small closure that uses defer + recover to
catch any panic, log or report the panic along with the callback context (e.g.,
route.Namespace and secretName) and continue to the next cb; ensure the
recovered panic does not re-panic so all callbacks are invoked.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c9c78ea5-dba5-483e-8d75-13dc6f9905d7
📒 Files selected for processing (1)
pkg/router/routeapihelpers/validation.go
Signed-off-by: Brett Tofel <btofel@redhat.com>
Include a concurrency governor (semaphore) to limit the number of simultaneous SubjectAccessReview checks and secret syncs to 50. This prevents hitting the API server with thousands of concurrent requests during router startup.
…eouts - Enforce synchronous execution if onComplete is nil - Add 10s timeout to SAR and Secret GET requests - Call SAR client directly instead of using authorizationutil.Authorize to distinguish between internal errors and explicit denials
- Handle DeletedFinalStateUnknown tombstones in secret manager to prevent panics during namespace deletion - Update library-go dependency to include async secret registration fix (library-go PR #2132) - Use /tmp for fake haproxy unix sockets to avoid macOS path length limits in tests
716e042 to
0a99ee0
Compare
|
@bentito: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This PR implements asynchronous and cached
SubjectAccessReviewchecks forspec.tls.externalCertificateduring router startup.Previously, each external certificate triggered synchronous API validations that blocked the main thread. When scaling to thousands of external cert routes, the router hit
O(N * API_latency)delays and would hitCrashLoopBackOfffrom failing liveness probes.This fix maintains full RBAC security checks but executes them asynchronously in the background. A global
sync.Mapacts as anasyncSARCacheto drastically speed up repeated checks for the same underlying secret.Fixes: OCPBUGS-77056
Summary by CodeRabbit
New Features
Tests