✨ Add controller_runtime_reconcile_timeouts_total metric to track ReconciliationTimeout timeouts #3382
Hi @godwinpang. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 0a03d36 to 39c3abd.
Sorry this ended up requesting a lot of reviewers. It's been a while since I've worked on c-r and I forgot about the commit conventions, hence the force pushes.
/ok-to-test
    Expect(func() error {
        Expect(ctrlmetrics.ReconcileTimeouts.WithLabelValues(ctrl.Name).Write(&reconcileTimeouts)).To(Succeed())
        if reconcileTimeouts.GetCounter().GetValue() != 0.0 {
            return fmt.Errorf("metric reconcile timeouts not reset")
        }
        return nil
    }()).Should(Succeed())
Shouldn't this be part of the `BeforeEach`?
Yep good point - fixed and also rewrote to avoid the nesting
This PR adds a new metric, `controller_runtime_reconcile_timeouts_total`, to track when a controller's reconciliation has reached a `ReconciliationTimeout`. This provides visibility into when reconcile operations time out due to the controller-runtime wrapper timeout, allowing users to alert on and monitor unexpectedly long-running controller reconciliations.
Force-pushed from 39c3abd to 93b4145.
Thx! /lgtm
LGTM label has been added. Git tree hash: af34b7fbee2a048e1ad87d730bcc541934bc68bb
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alvaroaleman, godwinpang. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Add `ReconciliationTimeout` panel to the controller details dashboard. The panel uses the newly introduced `controller_runtime_reconcile_timeouts_total` metric (see kubernetes-sigs/controller-runtime#3382).
…`v0.23.1`, `sigs.k8s.io/controller-tools` to `v0.20.0` (#13982)
* `make tidy && make generate`
* Solve `OpenTelemetry-Operator` dependency issue
  The `OpenTelemetry-Operator` depends on an older version of `sigs.k8s.io/controller-runtime`, which is incompatible with the version (v0.23) used in Gardener. Copying the API packages solves the dependency conflict for now. See open-telemetry/opentelemetry-operator#4667 for more information.
* Adapt to incompatible changes
* Adapt `OpenAPI` generation
  The generation needs to be adapted due to kubernetes/kube-openapi#563.
* Migrate to `k8s.io/client-go/tools/events`
  Motivated by kubernetes-sigs/controller-runtime#3262.
* Refactor: Remove obsolete `rest.Config` constructor
* Add `ReconciliationTimeout` panel
  Add `ReconciliationTimeout` panel to the controller details dashboard. The panel uses the newly introduced `controller_runtime_reconcile_timeouts_total` metric (see kubernetes-sigs/controller-runtime#3382).
* Change `NodeAgentAuthorizer` integration test
  Don't use the event recorder to test patch permissions. We've seen cases where the event recorder drops events if they are recorded too quickly.
Add controller_runtime_reconcile_timeouts_total metric
Summary
This PR adds a new metric, `controller_runtime_reconcile_timeouts_total`, to track when a controller's reconciliation has reached a `ReconciliationTimeout`. This provides visibility into when reconcile operations time out due to the controller-runtime wrapper timeout, allowing users to alert on and monitor unexpectedly long-running controller reconciliations.
Problem
The `ReconciliationTimeout` feature was added to prevent head-of-line (HOL) blocking by ensuring that a single reconcile cannot block a worker indefinitely. However, when a reconcile exits because `ReconciliationTimeout` fired, the existing metrics treat it like a normal return. There was no way to distinguish between:
* `ReconciliationTimeout` context timeout
Metric
`controller` (controller name)