Cluster Autoscaler not backing off exhausted node group #6240

@elohmeier

Description

Which component are you using?:

Cluster Autoscaler

What version of the component are you using?:

registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.4

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.26.6+k3s1
Kustomize Version: v4.5.7
Server Version: v1.26.6+k3s1

What environment is this in?:

Hetzner Cloud

What did you expect to happen?:

When the cluster autoscaler is configured with the priority expander and multiple node groups of differing priorities, it should back off after the cloud provider repeatedly fails to provision nodes in the high-priority node group due to resource unavailability, and fall back to lower-priority node groups.

What happened instead?:

The high-priority node group (pool1 in the log below) currently has no resources available to provision the requested nodes.
The cluster autoscaler is stuck in a loop trying to provision nodes in the high-priority group and never proceeds to pool2 (lower priority, resources available). Setting --max-node-group-backoff-duration=1m also had no effect.

W1101 05:15:05.399825       1 hetzner_servers_cache.go:94] Fetching servers from Hetzner API
I1101 05:15:15.806488       1 hetzner_node_group.go:438] Set node group draining-node-pool size from 0 to 0, expected delta 0
I1101 05:15:15.806519       1 hetzner_node_group.go:438] Set node group pool1 size from 1 to 1, expected delta 0
I1101 05:15:15.806525       1 hetzner_node_group.go:438] Set node group pool2 size from 0 to 0, expected delta 0
I1101 05:15:15.808727       1 scale_up.go:608] Scale-up: setting group pool1 size to 4
E1101 05:15:16.068533       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
E1101 05:15:16.079704       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
E1101 05:15:16.126786       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
W1101 05:15:16.126816       1 hetzner_servers_cache.go:94] Fetching servers from Hetzner API
I1101 05:15:26.655179       1 hetzner_node_group.go:438] Set node group pool1 size from 1 to 1, expected delta 0
I1101 05:15:26.655243       1 hetzner_node_group.go:438] Set node group pool2 size from 0 to 0, expected delta 0
I1101 05:15:26.655257       1 hetzner_node_group.go:438] Set node group draining-node-pool size from 0 to 0, expected delta 0
I1101 05:15:26.660093       1 scale_up.go:608] Scale-up: setting group pool1 size to 4
E1101 05:15:26.948368       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
E1101 05:15:26.981452       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
E1101 05:15:27.044150       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)

How to reproduce it (as minimally and precisely as possible):

apiVersion: v1
data:
  priorities: |
    10:
      - pool2
    20:
      - pool1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
      - command:
        - ./cluster-autoscaler
        - --scale-down-unneeded-time=5m
        - --cloud-provider=hetzner
        - --stderrthreshold=info
        - --nodes=0:4:CCX43:FSN1:pool1
        - --nodes=0:4:CCX43:NBG1:pool2
        - --expander=priority
        env:
        - name: HCLOUD_IMAGE
          value: debian-11
        - name: HCLOUD_TOKEN
          valueFrom:
            secretKeyRef:
              key: token
              name: hcloud
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.4
        name: cluster-autoscaler
      serviceAccountName: cluster-autoscaler
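
For reference, Cluster Autoscaler exposes a few flags that tune the scale-up backoff behavior. The values below are the documented defaults and are shown only for illustration; as noted above, setting --max-node-group-backoff-duration explicitly did not change the behavior in this case:

```yaml
# Hypothetical additions to the container command above; defaults shown
# are from the Cluster Autoscaler documentation and may vary by version.
- --initial-node-group-backoff-duration=5m   # initial backoff after a failed scale-up
- --max-node-group-backoff-duration=30m      # upper bound for the exponential backoff
- --node-group-backoff-reset-timeout=3h      # idle time after which backoff is reset
```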

Anything else we need to know?:

Metadata

Labels

- area/cluster-autoscaler: Issues or PRs related to the Cluster Autoscaler component
- area/provider/hetzner: Issues or PRs related to the Hetzner provider
- kind/bug: Categorizes issue or PR as related to a bug.
