Skip to content

gatewayclass: Enable Horizontal Pod Autoscaling#1326

Open
Miciah wants to merge 3 commits intoopenshift:masterfrom
Miciah:gatewayclass-enable-Horizontal-Pod-Autoscaling
Open

gatewayclass: Enable Horizontal Pod Autoscaling#1326
Miciah wants to merge 3 commits intoopenshift:masterfrom
Miciah:gatewayclass-enable-Horizontal-Pod-Autoscaling

Conversation

@Miciah
Copy link
Contributor

@Miciah Miciah commented Dec 16, 2025

Enable Horizontal Pod Autoscaling (HPA) on Istio. Hard-code the autoscaling parameters based on the cluster infrastructure config:

  • If the infrastructure topology is "SingleReplica", set minimum to 1.
  • Otherwise, set minimum to 2.
  • In any case, set maximum to 10.

Summary by CodeRabbit

  • New Features

    • Added infrastructure topology awareness to gateway configuration, enabling automatic HPA (Horizontal Pod Autoscaling) settings based on cluster topology (single-replica or highly-available modes).
  • Refactor

    • Improved controller initialization to apply manager-level defaults consistently across gateway controllers.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 16, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 16, 2025
@Miciah
Copy link
Contributor Author

Miciah commented Dec 16, 2025

/test e2e-aws-operator

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 16, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gcs278 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Miciah
Copy link
Contributor Author

Miciah commented Dec 17, 2025

e2e-aws-operator passed. deployments.json shows that the istiod pod has 2 replicas:

% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1326/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/2001047586798047232/artifacts/e2e-aws-operator/gather-extra/artifacts/deployments.json' | jq '.items.[]|select(.metadata.name=="istiod-openshift-gateway")|.spec.replicas'
2
%

However, there are some errors due to missing metrics:

% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1326/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/2001047586798047232/artifacts/e2e-aws-operator/gather-extra/artifacts/events.json' | jq -r '[.items.[]|select(.metadata.namespace=="openshift-ingress" and .source.component=="horizontal-pod-autoscaler")]|sort_by(.firstTimestamp//.metadata.creationTimestamp)|.[]|.involvedObject.name+": "+.message'
istiod-openshift-gateway: New size: 2; reason: Current number of replicas below Spec.MinReplicas
istiod-openshift-gateway: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
istiod-openshift-gateway: invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
istiod-openshift-gateway: New size: 2; reason: Current number of replicas below Spec.MinReplicas
istiod-openshift-gateway: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
istiod-openshift-gateway: invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
%

Moreover, the proxy pods do not get scaled out more than 1 replica:

% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1326/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/2001047586798047232/artifacts/e2e-aws-operator/gather-extra/artifacts/events.json' | jq -r '[.items.[]|select(.metadata.namespace=="openshift-ingress" and (.metadata.name|startswith("test-gateway")) and .source.component=="deployment-controller")]|sort_by(.firstTimestamp//.metadata.creationTimestamp)|.[]|.involvedObject.name+": "+.message'
test-gateway-openshift-default: Scaled up replica set test-gateway-openshift-default-74dff54c89 from 0 to 1
test-gateway-openshift-default: Scaled up replica set test-gateway-openshift-default-577b5f5db7 from 0 to 1
test-gateway-openshift-default: Scaled down replica set test-gateway-openshift-default-74dff54c89 from 1 to 0
test-gateway-update-openshift-default: Scaled up replica set test-gateway-update-openshift-default-75c4dbd84f from 0 to 1
%


// Watch the cluster infrastructure config in case the infrastructure
// topology changes.
if err := c.Watch(source.Kind[client.Object](operatorCache, &configv1.Infrastructure{}, reconciler.enqueueRequestForSomeGatewayClass())); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question here is, if you change Sail Operator parameter will it reflect on Pilot config?
Also, in case this parameter is also used for Gateway replica deployment, will Sail reconcile all the gateways and HPA?

(Giving some thoughts for test)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

answering myself:
the change of HPA parameter from pilot does not impact on the GatewayClass, it needs its own definition from spec.infrastructure.paramRefs

@rikatz
Copy link
Member

rikatz commented Dec 17, 2025

testing on my own cluster (AWS, HA):

kubectl get hpa -A
NAMESPACE           NAME                       REFERENCE                             TARGETS              MINPODS   MAXPODS   REPLICAS   AGE
openshift-ingress   istiod-openshift-gateway   Deployment/istiod-openshift-gateway   cpu: <unknown>/80%   2         10        2          36s

It is scaling pilot, but not sure about the gateways, let me test

@rikatz
Copy link
Member

rikatz commented Dec 17, 2025

IIUC we can pass to sail the configs of the Gateway class on the Istio resource, and it will create the configmap for us (even the gateway class being created by the user): https://github.com/openshift-service-mesh/sail-operator/blob/f3a3c6d7e6f2c2ae412b09ce1a78bc93258b4db4/resources/v1.28.0/charts/istiod/templates/gateway-class-configmap.yaml

We can make CIO set this property and the configmap will be created for us (this includes even setting the annotations for internal only clusters!), the problem is that today the json.RawMessage used by Sail API is not accepting the content correctly. I have asked the sail operator team about it and if they have been using it, otherwise we can improve this workflow on Sail Operator and consume on CIO

@Miciah Miciah force-pushed the gatewayclass-enable-Horizontal-Pod-Autoscaling branch from 1ea51fc to f407b29 Compare December 22, 2025 03:20
@Miciah
Copy link
Contributor Author

Miciah commented Dec 22, 2025

/test e2e-aws-operator

Enable Horizontal Pod Autoscaling (HPA) on Istio.  Hard-code the
autoscaling parameters based on the cluster infrastructure config:

- If the infrastructure topology is "SingleReplica", set minimum to 1.
- Otherwise, set minimum to 2.
- In any case, set maximum to 10.

* pkg/operator/controller/gatewayclass/controller.go
(gatewayclassControllerIndexFieldName): New const.
(NewUnmanaged): Watch infrastructures.  Initialize fieldIndexer in the
reconciler so that it can be used to create an index over gatewayclasses
later.  The index cannot be created directly in NewUnmanaged as the
gatewayclasses resource might not yet exist when NewUnmanaged is called.
(reconciler): Add fieldIndexer and startGatewayclassControllerIndex.
(Reconcile): Get the cluster infrastructure object and pass it to
ensureIstio.  Create an index over gatewayclasses by
spec.controllerName, using fieldIndexer from the reconciler,
gatewayclassControllerIndexFieldName for the index field name, and
startGatewayclassControllerIndex to ensure the index is only created
once, on first reconciliation.
* pkg/operator/controller/gatewayclass/controller_test.go
(Test_Reconcile): Add test cases for missing cluster infrastructure
config.  Add a test case for SingleReplica topology mode.  Add a test
case with multiple gatewayclasses. Add the cluster infrastructure config
object to existingObjects for the existing test cases. Add the expected
HPA configuration to the expected Istio CRs in test cases.  Initialize
fieldIndexer with a fake indexer in the reconciler.
(FakeIndexer, (FakeIndexer).IndexField): New type and method,
used in Test_Reconcile.
* pkg/operator/controller/gatewayclass/istio.go (ensureIstio): Add an
infraConfig parameter, and pass the argument to desiredIstio.  Use the
new index to list gatewayclasses with the OpenShift gateway controller
name, and pass the list of gatewayclasses to desiredIstio as well.
* pkg/operator/controller/gatewayclass/istio.go (desiredIstio): Add an
infraConfig parameter and a gatewayclasses parameter, and use the
arguments to look up the infrastructure topology mode and and configure
HPA accordingly.
@Miciah Miciah force-pushed the gatewayclass-enable-Horizontal-Pod-Autoscaling branch from f407b29 to 3e6e8c5 Compare December 22, 2025 14:28
@Miciah
Copy link
Contributor Author

Miciah commented Dec 22, 2025

/test e2e-aws-operator

1 similar comment
@Miciah
Copy link
Contributor Author

Miciah commented Dec 22, 2025

/test e2e-aws-operator

Don't overwrite the gateway variable's value from
createGatewayWithListeners with a possibly nil value from
assertGatewaySuccessful.

The gateway variable is used in a cleanup handler, which would panic if
the variable had a nil value.  The gateway value from
assertGatewaySuccessful isn't really needed, so it can be safely
ignored.

* test/e2e/gateway_api_test.go (testGatewayAPIDNSListenerWithNoHostname)
(testGatewayAPIDNSListenerUpdate): Ignore the return value from
assertGatewaySuccessful.
@Miciah Miciah force-pushed the gatewayclass-enable-Horizontal-Pod-Autoscaling branch from fad49ff to c566d6a Compare December 23, 2025 02:45
@Miciah
Copy link
Contributor Author

Miciah commented Dec 23, 2025

/test e2e-aws-operator

Set default values, such as the logger, in the controller options for
unmanaged controllers, namely the gateway-labeler, gateway-service-dns,
and gatewayclass controllers.

Before controller-runtime v0.21.0, controller-runtime implicitly set
default values for controller options for managed and unmanaged
controllers alike.  Controller-runtime internally used the
DefaultFromConfig method to do so.  Since controller-runtime v0.21.0,
these default values are not implicitly set for unmanaged
controllers[1].  In particular, this means that the logger is not
initialized, and so the controller initialization and any reconciliation
errors are not logged.  This commit explicitly sets default values by
calling DefaultFromConfig for unmanaged controllers.

Follow-up to commit 66485b8.

1. kubernetes-sigs/controller-runtime@d9ff283

* pkg/operator/controller/gateway-labeler/controller.go (NewUnmanaged):
* pkg/operator/controller/gateway-service-dns/controller.go
(NewUnmanaged):
* pkg/operator/controller/gatewayclass/controller.go (NewUnmanaged):
Use DefaultFromConfig to set default values for controller options.
@Miciah Miciah marked this pull request as ready for review March 4, 2026 17:03
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 4, 2026
@coderabbitai
Copy link

coderabbitai bot commented Mar 4, 2026

Walkthrough

Three controller implementations adopt manager-supplied default options for consistent configuration. The gatewayclass controller gains infrastructure topology awareness, dynamic field indexing of GatewayClass objects, and infrastructure-driven Istio configuration generation. Tests expand to cover multiple infrastructure topologies and field indexer integration.

Changes

Cohort / File(s) Summary
Gateway Controller Default Options
pkg/operator/controller/gateway-labeler/controller.go, pkg/operator/controller/gateway-service-dns/controller.go
Both controllers switch from direct controller construction to a two-step approach: create Options with reconciler, apply manager defaults via DefaultFromConfig(), then pass to controller.NewUnmanaged(). No functional changes to reconciliation logic.
GatewayClass Controller Infrastructure Integration
pkg/operator/controller/gatewayclass/controller.go
Adds infrastructure topology awareness: imports configv1, introduces field indexer for GatewayClass controller names, watches cluster infrastructure config, creates dynamic index on first reconciliation, fetches infrastructure config during reconciliation, and gates index creation with synchronized startup. Extends Reconcile to handle infrastructure-driven logic.
GatewayClass Controller Test Expansion
pkg/operator/controller/gatewayclass/controller_test.go
Expands test coverage with infraConfig helper, supports multiple topology modes (SingleReplica, HighlyAvailable), introduces FakeIndexer implementation with field indexing for GatewayClass spec.controllerName, and adds test cases for various gatewayclass configurations.
Istio Configuration Generation
pkg/operator/controller/gatewayclass/istio.go
Updates ensureIstio() and desiredIstio() function signatures to accept infrastructure config and gatewayclass list; generates gateway class-specific HPA configuration and serializes to JSON; returns error from desiredIstio() for error handling.
E2E Test Adjustment
test/e2e/gateway_api_test.go
Changes two test assertions to discard gateway return value instead of assigning to variable; only error result is used.

Sequence Diagram(s)

sequenceDiagram
    participant Client as GatewayClass Reconciler
    participant Cache as Field Index Cache
    participant API as Kubernetes API
    participant Infra as Infrastructure Config
    participant Istio as Istio Resource

    Client->>Client: Start reconciliation
    Client->>Cache: Check if index created
    alt Index not created
        Client->>Cache: Create index on spec.controllerName
    end
    Client->>API: Fetch infrastructure config
    API-->>Client: Return infrastructure topology
    Client->>API: List all GatewayClass objects
    API-->>Client: Return gatewayclass list
    Client->>Client: Generate HPA config per topology
    Client->>Client: Serialize gatewayclass config to JSON
    Client->>Istio: Create/update Istio with config
    Istio-->>Client: Confirm creation
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning Test assertions lack meaningful failure messages to aid diagnosis, and E2E changes discard return values that could mask bugs by preventing variable updates with freshly fetched data. Add explicit assertion messages to all calls, structure setup using BeforeEach/AfterEach blocks, and reconsider E2E changes where return values are discarded to capture gateway objects or justify using stale references.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main objective of the pull request - enabling Horizontal Pod Autoscaling for GatewayClass controllers managing Istio resources.
Stable And Deterministic Test Names ✅ Passed Test files use static test function names without dynamic information that changes between runs.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from Thealisyed and alebedev87 March 4, 2026 17:04
@Miciah
Copy link
Contributor Author

Miciah commented Mar 4, 2026

#1350 merged, so Istio.spec.values.gatewayClasses should be functional now.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
pkg/operator/controller/gatewayclass/istio.go (1)

111-116: Consider adding nil-safety for infraConfig parameter.

The function accesses infraConfig.Status.InfrastructureTopology directly at line 114. While the current call site in ensureIstio ensures a non-nil infraConfig is passed, adding a defensive check would prevent potential panics if this function is called from additional locations in the future.

🛡️ Proposed defensive check
 func desiredIstio(name types.NamespacedName, ownerRef metav1.OwnerReference, istioVersion string, enableInferenceExtension bool, infraConfig *configv1.Infrastructure, gatewayclasses []gatewayapiv1.GatewayClass) (*sailv1.Istio, error) {
 	var minReplicas = 2
 	const maxReplicas = 10
+	if infraConfig == nil {
+		return nil, fmt.Errorf("infraConfig must not be nil")
+	}
 	if infraConfig.Status.InfrastructureTopology == configv1.SingleReplicaTopologyMode {
 		minReplicas = 1
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/controller/gatewayclass/istio.go` around lines 111 - 116, The
function desiredIstio reads infraConfig.Status.InfrastructureTopology without
nil-checking; add a defensive nil check at the start of desiredIstio to handle a
nil infraConfig (e.g., treat as non-single-replica/default behavior or return an
error), so the access to infraConfig.Status is safe; update logic around
minReplicas (currently set to 2 then changed to 1 if InfrastructureTopology ==
SingleReplicaTopologyMode) to only inspect infraConfig when infraConfig != nil,
and ensure callers such as ensureIstio still work with the chosen default or
propagated error.
pkg/operator/controller/gatewayclass/controller_test.go (1)

134-142: JSON key ordering assumption may cause test flakiness.

The test helper generates JSON by iterating over the gatewayclasses slice in the order provided. The production code in istio.go uses json.Marshal on a map[string]any, which sorts keys alphabetically. The test at line 268 passes []string{"openshift-default", "openshift-internal"} which happens to be alphabetically sorted, so it works.

If a test case passes gatewayclasses in non-alphabetical order, the byte comparison would fail despite semantically equivalent JSON.

Consider using json.Marshal in the test helper for consistency, or add a JSON-aware comparison in the test assertions.

♻️ Suggested approach using json.Marshal for consistency
-	ret.Spec.Values.GatewayClasses = []byte(fmt.Sprintf(`{%s}`, strings.Join(func() []string {
-		var result []string
-
-		for _, name := range gatewayclasses {
-			result = append(result, fmt.Sprintf(`"%s":{"horizontalPodAutoscaler":{"spec":{"maxReplicas":10,"minReplicas":%d}}}`, name, minReplicas))
-		}
-
-		return result
-	}(), ",")))
+	gatewayclassConfig := map[string]any{
+		"horizontalPodAutoscaler": map[string]any{
+			"spec": map[string]any{
+				"minReplicas": minReplicas,
+				"maxReplicas": 10,
+			},
+		},
+	}
+	gatewayclassesConfig := map[string]any{}
+	for _, name := range gatewayclasses {
+		gatewayclassesConfig[name] = gatewayclassConfig
+	}
+	ret.Spec.Values.GatewayClasses, _ = json.Marshal(gatewayclassesConfig)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/controller/gatewayclass/controller_test.go` around lines 134 -
142, The test helper builds ret.Spec.Values.GatewayClasses by manually joining
strings from the gatewayclasses slice which assumes a stable key order and can
cause flakiness versus production istio.go that json.Marshal's a map; change the
helper to construct a map[string]any (keys = names from gatewayclasses, values =
the HPA spec/minReplicas structure) and call json.Marshal on that map to produce
GatewayClasses bytes (or alternatively, keep the helper but update the test
assertion to perform a JSON-aware comparison of ret.Spec.Values.GatewayClasses
using json.Unmarshal into a map and compare maps) so ordering differences no
longer break tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/operator/controller/gatewayclass/controller.go`:
- Around line 263-279: The current use of
r.startGatewayclassControllerIndex.Do(...) prevents retries if
fieldIndexer.IndexField fails; change the pattern so the index is only marked
"done" after a successful IndexField call (e.g., replace sync.Once.Do with a
small protected boolean or atomic flag plus a mutex or atomic.CompareAndSwap):
run gatewayclassControllerIndexFn and call r.fieldIndexer.IndexField(...) and if
it returns nil set the success flag (e.g., r.gatewayclassIndexInitialized =
true) so future reconciliations skip creation, but if it returns an error do not
set the flag and return/append the error so the reconcile can retry; keep
references to gatewayclassControllerIndexFn,
gatewayclassControllerIndexFieldName and r.fieldIndexer.IndexField when
implementing this change.

---

Nitpick comments:
In `@pkg/operator/controller/gatewayclass/controller_test.go`:
- Around line 134-142: The test helper builds ret.Spec.Values.GatewayClasses by
manually joining strings from the gatewayclasses slice which assumes a stable
key order and can cause flakiness versus production istio.go that json.Marshal's
a map; change the helper to construct a map[string]any (keys = names from
gatewayclasses, values = the HPA spec/minReplicas structure) and call
json.Marshal on that map to produce GatewayClasses bytes (or alternatively, keep
the helper but update the test assertion to perform a JSON-aware comparison of
ret.Spec.Values.GatewayClasses using json.Unmarshal into a map and compare maps)
so ordering differences no longer break tests.

In `@pkg/operator/controller/gatewayclass/istio.go`:
- Around line 111-116: The function desiredIstio reads
infraConfig.Status.InfrastructureTopology without nil-checking; add a defensive
nil check at the start of desiredIstio to handle a nil infraConfig (e.g., treat
as non-single-replica/default behavior or return an error), so the access to
infraConfig.Status is safe; update logic around minReplicas (currently set to 2
then changed to 1 if InfrastructureTopology == SingleReplicaTopologyMode) to
only inspect infraConfig when infraConfig != nil, and ensure callers such as
ensureIstio still work with the chosen default or propagated error.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b377a103-d8f8-48fe-a7ec-6462842dcb0c

📥 Commits

Reviewing files that changed from the base of the PR and between 81c961d and 09b0c49.

📒 Files selected for processing (6)
  • pkg/operator/controller/gateway-labeler/controller.go
  • pkg/operator/controller/gateway-service-dns/controller.go
  • pkg/operator/controller/gatewayclass/controller.go
  • pkg/operator/controller/gatewayclass/controller_test.go
  • pkg/operator/controller/gatewayclass/istio.go
  • test/e2e/gateway_api_test.go

Comment on lines +263 to +279
// Create the Index gatewayclasses once on first reconciliation. The
// index cannot be created in NewUnmanaged as the CRD might not exist
// when NewUnmanaged is called, and so we create the index here.
r.startGatewayclassControllerIndex.Do(func() {
gatewayclassControllerIndexFn := client.IndexerFunc(func(o client.Object) []string {
gatewayclass, ok := o.(*gatewayapiv1.GatewayClass)
if !ok {
return []string{}
}

return []string{string(gatewayclass.Spec.ControllerName)}
})
if err := r.fieldIndexer.IndexField(context.Background(), &gatewayapiv1.GatewayClass{}, gatewayclassControllerIndexFieldName, gatewayclassControllerIndexFn); err != nil {
log.Error(err, "failed to create index for gatewayclasses", "request", request)
errs = append(errs, err)
}
})
Copy link

@coderabbitai coderabbitai bot Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Index creation failure is not retriable due to sync.Once.

If IndexField fails transiently (e.g., temporary API server issue), the sync.Once prevents any retry. Subsequent reconciliations will consistently fail when ensureIstio attempts to list with the non-existent index.

While the error is logged and the reconciliation will fail with an error (surfacing the issue), the system cannot self-heal without operator restart.

Consider using a different synchronization pattern that allows retry on failure, such as tracking success state explicitly.

🔧 Alternative pattern allowing retry on failure
-	// startGatewayclassControllerIndex ensures we create the gatewayclasses index at
-	// most once.
-	startGatewayclassControllerIndex sync.Once
+	// gatewayclassControllerIndexCreated tracks whether the index was successfully created.
+	gatewayclassControllerIndexCreated bool
+	gatewayclassControllerIndexMu      sync.Mutex
 }

 // In Reconcile:
-	r.startGatewayclassControllerIndex.Do(func() {
+	r.gatewayclassControllerIndexMu.Lock()
+	if !r.gatewayclassControllerIndexCreated {
 		gatewayclassControllerIndexFn := client.IndexerFunc(func(o client.Object) []string {
 			gatewayclass, ok := o.(*gatewayapiv1.GatewayClass)
 			if !ok {
 				return []string{}
 			}
 			return []string{string(gatewayclass.Spec.ControllerName)}
 		})
 		if err := r.fieldIndexer.IndexField(context.Background(), &gatewayapiv1.GatewayClass{}, gatewayclassControllerIndexFieldName, gatewayclassControllerIndexFn); err != nil {
 			log.Error(err, "failed to create index for gatewayclasses", "request", request)
 			errs = append(errs, err)
+		} else {
+			r.gatewayclassControllerIndexCreated = true
 		}
-	})
+	}
+	r.gatewayclassControllerIndexMu.Unlock()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/controller/gatewayclass/controller.go` around lines 263 - 279,
The current use of r.startGatewayclassControllerIndex.Do(...) prevents retries
if fieldIndexer.IndexField fails; change the pattern so the index is only marked
"done" after a successful IndexField call (e.g., replace sync.Once.Do with a
small protected boolean or atomic flag plus a mutex or atomic.CompareAndSwap):
run gatewayclassControllerIndexFn and call r.fieldIndexer.IndexField(...) and if
it returns nil set the success flag (e.g., r.gatewayclassIndexInitialized =
true) so future reconciliations skip creation, but if it returns an error do not
set the flag and return/append the error so the reconcile can retry; keep
references to gatewayclassControllerIndexFn,
gatewayclassControllerIndexFieldName and r.fieldIndexer.IndexField when
implementing this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fair point. On the one hand, the current logic for initializing the index follows pattern we established with the logic for initializing the watch on istios. On the other hand, this seems simple enough to fix, and I expect the watch on istios will go away in a couple releases. I would like to allow the current CI run to finish, but I can update the logic after that.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Miciah, sounds good! The suggested mutex+boolean pattern is straightforward to apply whenever you're ready. Since the startIstioWatch watch initialization has the same limitation, you may want to apply a similar retry-capable pattern there as well — though as you noted, that code is expected to be removed in a future release, so it may not be worth the effort.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 4, 2026

@Miciah: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-azure-operator 09b0c49 link true /test e2e-azure-operator
ci/prow/hypershift-e2e-aks 09b0c49 link true /test hypershift-e2e-aks
ci/prow/e2e-gcp-operator 09b0c49 link true /test e2e-gcp-operator
ci/prow/e2e-hypershift 09b0c49 link true /test e2e-hypershift
ci/prow/e2e-aws-ovn-serial-2of2 09b0c49 link true /test e2e-aws-ovn-serial-2of2

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@Miciah
Copy link
Contributor Author

Miciah commented Mar 5, 2026

The results from the last e2e-aws-operator run look better—replica sets for proxies are getting scaled to 2 replicas:

% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1326/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/2029241388389896192/artifacts/e2e-aws-operator/gather-extra/artifacts/events.json' | jq -r '[.items.[]|select(.metadata.namespace=="openshift-ingress" and (.metadata.name|startswith("test-gateway")) and .source.component=="deployment-controller")]|sort_by(.firstTimestamp//.metadata.creationTimestamp)|.[]|.involvedObject.name+": "+.message'
test-gateway-openshift-default: Scaled up replica set test-gateway-openshift-default-6b647f479b from 0 to 1
test-gateway-openshift-default: Scaled up replica set test-gateway-openshift-default-6b647f479b from 1 to 2
test-gateway-update-openshift-default: Scaled up replica set test-gateway-update-openshift-default-7f4bb66dc4 from 0 to 1
test-gateway-update-openshift-default: Scaled up replica set test-gateway-update-openshift-default-7f4bb66dc4 from 1 to 2
test-gateway-infra-annotations-openshift-default: Scaled up replica set test-gateway-infra-annotations-openshift-default-7c5db78c9c from 0 to 1
test-gateway-infra-annotations-openshift-default: Scaled up replica set test-gateway-infra-annotations-openshift-default-7c5db78c9c from 1 to 2
test-gateway-internal-lb-openshift-default: Scaled up replica set test-gateway-internal-lb-openshift-default-6665d79d55 from 0 to 1
% 

I'm not sure why the events only report scale-out to 1 replica for the "test-gateway-internal-lb-openshift-default" gateway.

There are some did not receive metrics for targeted pods (pods might be unready) errors; I don't know whether these are transient failures or something we should be concerned about:

% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1326/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/2029241388389896192/artifacts/e2e-aws-operator/gather-extra/artifacts/events.json' | jq -r '[.items.[]|select(.metadata.namespace=="openshift-ingress" and .source.component=="horizontal-pod-autoscaler")]|sort_by(.firstTimestamp//.metadata.creationTimestamp)|.[]|.involvedObject.name+": "+.message'
istiod-openshift-gateway: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
istiod-openshift-gateway: invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
test-gateway-openshift-default: New size: 2; reason: Current number of replicas below Spec.MinReplicas
istiod-openshift-gateway: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
istiod-openshift-gateway: invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
test-gateway-openshift-default: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
test-gateway-openshift-default: invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
test-gateway-openshift-default: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
test-gateway-openshift-default: invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
istiod-openshift-gateway: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
istiod-openshift-gateway: invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
istiod-openshift-gateway: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
istiod-openshift-gateway: invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
gw1-openshift-default: New size: 2; reason: Current number of replicas below Spec.MinReplicas
gw2-openshift-default: New size: 2; reason: Current number of replicas below Spec.MinReplicas
gw3-openshift-default: New size: 2; reason: Current number of replicas below Spec.MinReplicas
gw4-openshift-default: New size: 2; reason: Current number of replicas below Spec.MinReplicas
gw4-openshift-default: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
gw4-openshift-default: invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
gw3-openshift-default: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
gw3-openshift-default: invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
test-gateway-update-openshift-default: New size: 2; reason: Current number of replicas below Spec.MinReplicas
test-gateway-update-openshift-default: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
test-gateway-update-openshift-default: invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
test-nohost-gateway-openshift-default: New size: 2; reason: Current number of replicas below Spec.MinReplicas
test-nohost-gateway-openshift-default: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
test-nohost-gateway-openshift-default: invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
test-gateway-infra-annotations-openshift-default: New size: 2; reason: Current number of replicas below Spec.MinReplicas
% 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants