
Bug: Generated webhook configuration files too large to store in ETCD #5208

@Revyy

Description


Describe the bug

When installing the latest v2.18.0 version of the Helm chart, we encountered an issue where we could not apply changes to the generated webhook configuration YAML files included in the chart, because they were too large to store in etcd.

Error message: etcdserver: request is too large

When investigating, we found that this happened in all of our persistent clusters where cert-manager had performed a certificate rotation at some point, so that the clientConfig.caBundle field of each webhook configuration contained the two latest versions of the CA, which increases the total size of the object.
In clusters where no certificate rotation had been performed yet, the upgrade went fine due to the smaller object size.
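The state described above can be checked directly on an affected cluster; a minimal sketch, assuming the default resource name from the chart and that jq is installed:

```shell
# Report the total serialized size of the validating webhook configuration.
kubectl get validatingwebhookconfiguration \
  azureserviceoperator-validating-webhook-configuration -o json | wc -c

# Count how many CA certificates are embedded in the first webhook's
# caBundle: two after a rotation, one on a fresh install.
kubectl get validatingwebhookconfiguration \
  azureserviceoperator-validating-webhook-configuration -o json \
  | jq -r '.webhooks[0].clientConfig.caBundle' \
  | base64 -d \
  | grep -c 'BEGIN CERTIFICATE'
```

These commands only read cluster state, so they are safe to run while diagnosing.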

Files:

  • admissionregistration.k8s.io_v1_validatingwebhookconfiguration_azureserviceoperator-validating-webhook-configuration.yaml
  • admissionregistration.k8s.io_v1_mutatingwebhookconfiguration_azureserviceoperator-mutating-webhook-configuration.yaml

We already use Argo CD with Kubernetes server-side apply to deploy changes, in an attempt to keep the size of the applied object down.

Azure Service Operator Version: mcr.microsoft.com/k8s/azureserviceoperator:v2.18.0
Kubernetes Version: 1.33.6

Temporary Remediation:

Deleting the webhook configurations from the cluster and applying them again (letting Argo CD sync them) works, because then only the latest ca.crt is injected as part of the caBundle.
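A sketch of that remediation as kubectl commands (resource names assume the default chart install):

```shell
# Delete the oversized webhook configurations; when they are recreated,
# only the latest ca.crt is injected into the caBundle.
kubectl delete validatingwebhookconfiguration \
  azureserviceoperator-validating-webhook-configuration
kubectl delete mutatingwebhookconfiguration \
  azureserviceoperator-mutating-webhook-configuration

# Then trigger an Argo CD sync (or re-apply the chart) so the
# configurations are recreated with a single CA in the bundle.
```

Note that admission webhooks are disabled between the delete and the re-sync, so this is best done in a maintenance window.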

Expected behavior

I would expect the webhook configurations to respect the crdPattern field of the Helm chart and only create webhook configurations for the CRD resources actually created and used in the cluster. Even with the current setup, where webhook configurations are created for every resource regardless, I would still expect upgrades to work seamlessly, perhaps by splitting the objects into several smaller ones.

To Reproduce

I have not tried to reproduce this in a fresh cluster, but this is roughly the chain of events we have identified as likely:

  1. Install Azure Service Operator v2.17.0 using the helm chart.
  2. Wait for the caBundle to be injected into the webhook configurations by cert-manager.
  3. Perform a certificate rotation so that the two latest ca.crt are injected as part of the caBundle by cert-manager.
  4. Attempt to upgrade to v2.18.0.
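For step 3, one way to force a rotation is cert-manager's cmctl CLI; a sketch, where the Certificate name and namespace are assumptions based on a default install and should be checked against the Certificate resources the chart actually created:

```shell
# Force cert-manager to reissue the chart's certificate, after which the
# injected caBundle can contain the previous CA alongside the new one.
cmctl renew azureserviceoperator-serving-cert -n azureserviceoperator-system
```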

Additional context

We have a way of getting around the issue for now, and we could probably solve it ourselves using the Kustomize layer or similar, but an official fix or a recommended workaround would be appreciated.
