gatewayclass: Enable Horizontal Pod Autoscaling#1326
gatewayclass: Enable Horizontal Pod Autoscaling#1326Miciah wants to merge 3 commits intoopenshift:masterfrom
Conversation
|
Skipping CI for Draft Pull Request. |
|
/test e2e-aws-operator |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
e2e-aws-operator passed. However, there are some errors due to missing metrics: Moreover, the proxy pods do not get scaled out more than 1 replica: |
|
|
||
| // Watch the cluster infrastructure config in case the infrastructure | ||
| // topology changes. | ||
| if err := c.Watch(source.Kind[client.Object](operatorCache, &configv1.Infrastructure{}, reconciler.enqueueRequestForSomeGatewayClass())); err != nil { |
There was a problem hiding this comment.
question here is, if you change Sail Operator parameter will it reflect on Pilot config?
Also, in case this parameter is also used for Gateway replica deployment, will Sail reconcile all the gateways and HPA?
(Giving some thoughts for test)
There was a problem hiding this comment.
answering myself:
the change of HPA parameter from pilot does not impact on the GatewayClass, it needs its own definition from spec.infrastructure.paramRefs
|
testing on my own cluster (AWS, HA): It is scaling pilot, but not sure about the gateways, let me test |
|
IIUC we can pass to sail the configs of the Gateway class on the Istio resource, and it will create the configmap for us (even the gateway class being created by the user): https://github.com/openshift-service-mesh/sail-operator/blob/f3a3c6d7e6f2c2ae412b09ce1a78bc93258b4db4/resources/v1.28.0/charts/istiod/templates/gateway-class-configmap.yaml We can make CIO set this property and the configmap will be created for us (this includes even setting the annotations for internal only clusters!), the problem is that today the json.RawMessage used by Sail API is not accepting the content correctly. I have asked the sail operator team about it and if they have been using it, otherwise we can improve this workflow on Sail Operator and consume on CIO |
1ea51fc to
f407b29
Compare
|
/test e2e-aws-operator |
Enable Horizontal Pod Autoscaling (HPA) on Istio. Hard-code the autoscaling parameters based on the cluster infrastructure config: - If the infrastructure topology is "SingleReplica", set minimum to 1. - Otherwise, set minimum to 2. - In any case, set maximum to 10. * pkg/operator/controller/gatewayclass/controller.go (gatewayclassControllerIndexFieldName): New const. (NewUnmanaged): Watch infrastructures. Initialize fieldIndexer in the reconciler so that it can be used to create an index over gatewayclasses later. The index cannot be created directly in NewUnmanaged as the gatewayclasses resource might not yet exist when NewUnmanaged is called. (reconciler): Add fieldIndexer and startGatewayclassControllerIndex. (Reconcile): Get the cluster infrastructure object and pass it to ensureIstio. Create an index over gatewayclasses by spec.controllerName, using fieldIndexer from the reconciler, gatewayclassControllerIndexFieldName for the index field name, and startGatewayclassControllerIndex to ensure the index is only created once, on first reconciliation. * pkg/operator/controller/gatewayclass/controller_test.go (Test_Reconcile): Add test cases for missing cluster infrastructure config. Add a test case for SingleReplica topology mode. Add a test case with multiple gatewayclasses. Add the cluster infrastructure config object to existingObjects for the existing test cases. Add the expected HPA configuration to the expected Istio CRs in test cases. Initialize fieldIndexer with a fake indexer in the reconciler. (FakeIndexer, (FakeIndexer).IndexField): New type and method, used in Test_Reconcile. * pkg/operator/controller/gatewayclass/istio.go (ensureIstio): Add an infraConfig parameter, and pass the argument to desiredIstio. Use the new index to list gatewayclasses with the OpenShift gateway controller name, and pass the list of gatewayclasses to desiredIstio as well. * pkg/operator/controller/gatewayclass/istio.go (desiredIstio): Add an infraConfig parameter and a gatewayclasses parameter, and use the arguments to look up the infrastructure topology mode and and configure HPA accordingly.
f407b29 to
3e6e8c5
Compare
|
/test e2e-aws-operator |
1 similar comment
|
/test e2e-aws-operator |
Don't overwrite the gateway variable's value from createGatewayWithListeners with a possibly nil value from assertGatewaySuccessful. The gateway variable is used in a cleanup handler, which would panic if the variable had a nil value. The gateway value from assertGatewaySuccessful isn't really needed, so it can be safely ignored. * test/e2e/gateway_api_test.go (testGatewayAPIDNSListenerWithNoHostname) (testGatewayAPIDNSListenerUpdate): Ignore the return value from assertGatewaySuccessful.
fad49ff to
c566d6a
Compare
|
/test e2e-aws-operator |
Set default values, such as the logger, in the controller options for unmanaged controllers, namely the gateway-labeler, gateway-service-dns, and gatewayclass controllers. Before controller-runtime v0.21.0, controller-runtime implicitly set default values for controller options for managed and unmanaged controllers alike. Controller-runtime internally used the DefaultFromConfig method to do so. Since controller-runtime v0.21.0, these default values are not implicitly set for unmanaged controllers[1]. In particular, this means that the logger is not initialized, and so the controller initialization and any reconciliation errors are not logged. This commit explicitly sets default values by calling DefaultFromConfig for unmanaged controllers. Follow-up to commit 66485b8. 1. kubernetes-sigs/controller-runtime@d9ff283 * pkg/operator/controller/gateway-labeler/controller.go (NewUnmanaged): * pkg/operator/controller/gateway-service-dns/controller.go (NewUnmanaged): * pkg/operator/controller/gatewayclass/controller.go (NewUnmanaged): Use DefaultFromConfig to set default values for controller options.
WalkthroughThree controller implementations adopt manager-supplied default options for consistent configuration. The gatewayclass controller gains infrastructure topology awareness, dynamic field indexing of GatewayClass objects, and infrastructure-driven Istio configuration generation. Tests expand to cover multiple infrastructure topologies and field indexer integration. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as GatewayClass Reconciler
participant Cache as Field Index Cache
participant API as Kubernetes API
participant Infra as Infrastructure Config
participant Istio as Istio Resource
Client->>Client: Start reconciliation
Client->>Cache: Check if index created
alt Index not created
Client->>Cache: Create index on spec.controllerName
end
Client->>API: Fetch infrastructure config
API-->>Client: Return infrastructure topology
Client->>API: List all GatewayClass objects
API-->>Client: Return gatewayclass list
Client->>Client: Generate HPA config per topology
Client->>Client: Serialize gatewayclass config to JSON
Client->>Istio: Create/update Istio with config
Istio-->>Client: Confirm creation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs). Comment |
|
#1350 merged, so |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
pkg/operator/controller/gatewayclass/istio.go (1)
111-116: Consider adding nil-safety forinfraConfigparameter.The function accesses
infraConfig.Status.InfrastructureTopologydirectly at line 114. While the current call site inensureIstioensures a non-nilinfraConfigis passed, adding a defensive check would prevent potential panics if this function is called from additional locations in the future.🛡️ Proposed defensive check
func desiredIstio(name types.NamespacedName, ownerRef metav1.OwnerReference, istioVersion string, enableInferenceExtension bool, infraConfig *configv1.Infrastructure, gatewayclasses []gatewayapiv1.GatewayClass) (*sailv1.Istio, error) { var minReplicas = 2 const maxReplicas = 10 + if infraConfig == nil { + return nil, fmt.Errorf("infraConfig must not be nil") + } if infraConfig.Status.InfrastructureTopology == configv1.SingleReplicaTopologyMode { minReplicas = 1 }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/controller/gatewayclass/istio.go` around lines 111 - 116, The function desiredIstio reads infraConfig.Status.InfrastructureTopology without nil-checking; add a defensive nil check at the start of desiredIstio to handle a nil infraConfig (e.g., treat as non-single-replica/default behavior or return an error), so the access to infraConfig.Status is safe; update logic around minReplicas (currently set to 2 then changed to 1 if InfrastructureTopology == SingleReplicaTopologyMode) to only inspect infraConfig when infraConfig != nil, and ensure callers such as ensureIstio still work with the chosen default or propagated error.pkg/operator/controller/gatewayclass/controller_test.go (1)
134-142: JSON key ordering assumption may cause test flakiness.The test helper generates JSON by iterating over the
gatewayclassesslice in the order provided. The production code inistio.gousesjson.Marshalon amap[string]any, which sorts keys alphabetically. The test at line 268 passes[]string{"openshift-default", "openshift-internal"}which happens to be alphabetically sorted, so it works.If a test case passes gatewayclasses in non-alphabetical order, the byte comparison would fail despite semantically equivalent JSON.
Consider using
json.Marshalin the test helper for consistency, or add a JSON-aware comparison in the test assertions.♻️ Suggested approach using json.Marshal for consistency
- ret.Spec.Values.GatewayClasses = []byte(fmt.Sprintf(`{%s}`, strings.Join(func() []string { - var result []string - - for _, name := range gatewayclasses { - result = append(result, fmt.Sprintf(`"%s":{"horizontalPodAutoscaler":{"spec":{"maxReplicas":10,"minReplicas":%d}}}`, name, minReplicas)) - } - - return result - }(), ","))) + gatewayclassConfig := map[string]any{ + "horizontalPodAutoscaler": map[string]any{ + "spec": map[string]any{ + "minReplicas": minReplicas, + "maxReplicas": 10, + }, + }, + } + gatewayclassesConfig := map[string]any{} + for _, name := range gatewayclasses { + gatewayclassesConfig[name] = gatewayclassConfig + } + ret.Spec.Values.GatewayClasses, _ = json.Marshal(gatewayclassesConfig)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/controller/gatewayclass/controller_test.go` around lines 134 - 142, The test helper builds ret.Spec.Values.GatewayClasses by manually joining strings from the gatewayclasses slice which assumes a stable key order and can cause flakiness versus production istio.go that json.Marshal's a map; change the helper to construct a map[string]any (keys = names from gatewayclasses, values = the HPA spec/minReplicas structure) and call json.Marshal on that map to produce GatewayClasses bytes (or alternatively, keep the helper but update the test assertion to perform a JSON-aware comparison of ret.Spec.Values.GatewayClasses using json.Unmarshal into a map and compare maps) so ordering differences no longer break tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/operator/controller/gatewayclass/controller.go`:
- Around line 263-279: The current use of
r.startGatewayclassControllerIndex.Do(...) prevents retries if
fieldIndexer.IndexField fails; change the pattern so the index is only marked
"done" after a successful IndexField call (e.g., replace sync.Once.Do with a
small protected boolean or atomic flag plus a mutex or atomic.CompareAndSwap):
run gatewayclassControllerIndexFn and call r.fieldIndexer.IndexField(...) and if
it returns nil set the success flag (e.g., r.gatewayclassIndexInitialized =
true) so future reconciliations skip creation, but if it returns an error do not
set the flag and return/append the error so the reconcile can retry; keep
references to gatewayclassControllerIndexFn,
gatewayclassControllerIndexFieldName and r.fieldIndexer.IndexField when
implementing this change.
---
Nitpick comments:
In `@pkg/operator/controller/gatewayclass/controller_test.go`:
- Around line 134-142: The test helper builds ret.Spec.Values.GatewayClasses by
manually joining strings from the gatewayclasses slice which assumes a stable
key order and can cause flakiness versus production istio.go that json.Marshal's
a map; change the helper to construct a map[string]any (keys = names from
gatewayclasses, values = the HPA spec/minReplicas structure) and call
json.Marshal on that map to produce GatewayClasses bytes (or alternatively, keep
the helper but update the test assertion to perform a JSON-aware comparison of
ret.Spec.Values.GatewayClasses using json.Unmarshal into a map and compare maps)
so ordering differences no longer break tests.
In `@pkg/operator/controller/gatewayclass/istio.go`:
- Around line 111-116: The function desiredIstio reads
infraConfig.Status.InfrastructureTopology without nil-checking; add a defensive
nil check at the start of desiredIstio to handle a nil infraConfig (e.g., treat
as non-single-replica/default behavior or return an error), so the access to
infraConfig.Status is safe; update logic around minReplicas (currently set to 2
then changed to 1 if InfrastructureTopology == SingleReplicaTopologyMode) to
only inspect infraConfig when infraConfig != nil, and ensure callers such as
ensureIstio still work with the chosen default or propagated error.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b377a103-d8f8-48fe-a7ec-6462842dcb0c
📒 Files selected for processing (6)
pkg/operator/controller/gateway-labeler/controller.gopkg/operator/controller/gateway-service-dns/controller.gopkg/operator/controller/gatewayclass/controller.gopkg/operator/controller/gatewayclass/controller_test.gopkg/operator/controller/gatewayclass/istio.gotest/e2e/gateway_api_test.go
| // Create the Index gatewayclasses once on first reconciliation. The | ||
| // index cannot be created in NewUnmanaged as the CRD might not exist | ||
| // when NewUnmanaged is called, and so we create the index here. | ||
| r.startGatewayclassControllerIndex.Do(func() { | ||
| gatewayclassControllerIndexFn := client.IndexerFunc(func(o client.Object) []string { | ||
| gatewayclass, ok := o.(*gatewayapiv1.GatewayClass) | ||
| if !ok { | ||
| return []string{} | ||
| } | ||
|
|
||
| return []string{string(gatewayclass.Spec.ControllerName)} | ||
| }) | ||
| if err := r.fieldIndexer.IndexField(context.Background(), &gatewayapiv1.GatewayClass{}, gatewayclassControllerIndexFieldName, gatewayclassControllerIndexFn); err != nil { | ||
| log.Error(err, "failed to create index for gatewayclasses", "request", request) | ||
| errs = append(errs, err) | ||
| } | ||
| }) |
There was a problem hiding this comment.
Index creation failure is not retriable due to sync.Once.
If IndexField fails transiently (e.g., temporary API server issue), the sync.Once prevents any retry. Subsequent reconciliations will consistently fail when ensureIstio attempts to list with the non-existent index.
While the error is logged and the reconciliation will fail with an error (surfacing the issue), the system cannot self-heal without operator restart.
Consider using a different synchronization pattern that allows retry on failure, such as tracking success state explicitly.
🔧 Alternative pattern allowing retry on failure
- // startGatewayclassControllerIndex ensures we create the gatewayclasses index at
- // most once.
- startGatewayclassControllerIndex sync.Once
+ // gatewayclassControllerIndexCreated tracks whether the index was successfully created.
+ gatewayclassControllerIndexCreated bool
+ gatewayclassControllerIndexMu sync.Mutex
}
// In Reconcile:
- r.startGatewayclassControllerIndex.Do(func() {
+ r.gatewayclassControllerIndexMu.Lock()
+ if !r.gatewayclassControllerIndexCreated {
gatewayclassControllerIndexFn := client.IndexerFunc(func(o client.Object) []string {
gatewayclass, ok := o.(*gatewayapiv1.GatewayClass)
if !ok {
return []string{}
}
return []string{string(gatewayclass.Spec.ControllerName)}
})
if err := r.fieldIndexer.IndexField(context.Background(), &gatewayapiv1.GatewayClass{}, gatewayclassControllerIndexFieldName, gatewayclassControllerIndexFn); err != nil {
log.Error(err, "failed to create index for gatewayclasses", "request", request)
errs = append(errs, err)
+ } else {
+ r.gatewayclassControllerIndexCreated = true
}
- })
+ }
+ r.gatewayclassControllerIndexMu.Unlock()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/operator/controller/gatewayclass/controller.go` around lines 263 - 279,
The current use of r.startGatewayclassControllerIndex.Do(...) prevents retries
if fieldIndexer.IndexField fails; change the pattern so the index is only marked
"done" after a successful IndexField call (e.g., replace sync.Once.Do with a
small protected boolean or atomic flag plus a mutex or atomic.CompareAndSwap):
run gatewayclassControllerIndexFn and call r.fieldIndexer.IndexField(...) and if
it returns nil set the success flag (e.g., r.gatewayclassIndexInitialized =
true) so future reconciliations skip creation, but if it returns an error do not
set the flag and return/append the error so the reconcile can retry; keep
references to gatewayclassControllerIndexFn,
gatewayclassControllerIndexFieldName and r.fieldIndexer.IndexField when
implementing this change.
There was a problem hiding this comment.
This is a fair point. On the one hand, the current logic for initializing the index follows pattern we established with the logic for initializing the watch on istios. On the other hand, this seems simple enough to fix, and I expect the watch on istios will go away in a couple releases. I would like to allow the current CI run to finish, but I can update the logic after that.
There was a problem hiding this comment.
@Miciah, sounds good! The suggested mutex+boolean pattern is straightforward to apply whenever you're ready. Since the startIstioWatch watch initialization has the same limitation, you may want to apply a similar retry-capable pattern there as well — though as you noted, that code is expected to be removed in a future release, so it may not be worth the effort.
|
@Miciah: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
The results from the last e2e-aws-operator run look better—replica sets for proxies are getting scaled to 2 replicas: I'm not sure why the events only report scale-out to 1 replica for the "test-gateway-internal-lb-openshift-default" gateway. There are some |
Enable Horizontal Pod Autoscaling (HPA) on Istio. Hard-code the autoscaling parameters based on the cluster infrastructure config:
Summary by CodeRabbit
New Features
Refactor