azure: Zombie Node Cleanup #9052

Open
alimaazamat wants to merge 3 commits into kubernetes:master from alimaazamat:zombie-node-cleanup


@alimaazamat commented Jan 13, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

A customer maintains a script to clean up "zombie nodes": non-functional Azure infrastructure/Kubernetes nodes that persist in a Kubernetes cluster (and in the component state machines of cloud-provider-azure and Cluster Autoscaler, CA). The script looks for well-known "bad terminal states" of VMSS VMs and deletes those VMs. This PR moves that logic into CA so that the cleanup runs from a point of authority.

azure_zombie_cleanup.go is the cleanup implementation:

  1. Check if enabled - Returns early if EnableZombieCleanup is false
  2. Build K8s node lookup - Creates a map of normalized provider IDs to nodes for correlation
  3. Scan all VMSS - Lists all scale sets and their VMs with instance views
  4. Detect zombies - Calls evaluateZombieStatus() for each VM
  5. Split by registration status:
    • Unregistered zombies (no K8s node): Safe to delete directly
    • Registered zombies (has K8s node): Logged only; the autoscaler handles them
  6. Batch delete - Groups unregistered zombies by VMSS and calls DeleteInstancesAsync()
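
Steps 5 and 6 can be pictured with a minimal sketch like the one below. It is not the code from azure_zombie_cleanup.go: the `zombieVM` type and the `splitAndBatch` function are hypothetical names introduced only for illustration, and the sketch assumes detection has already happened.

```go
package main

import "fmt"

// zombieVM is a hypothetical, simplified record of a VM already flagged as a zombie.
type zombieVM struct {
	ScaleSet   string // name of the owning VMSS
	InstanceID string // VMSS instance ID
	HasK8sNode bool   // true if a Kubernetes node was correlated with this VM
}

// splitAndBatch separates registered zombies (left for the autoscaler) from
// unregistered ones, and groups the unregistered instance IDs by scale set so
// that each scale set needs only one delete request.
func splitAndBatch(zombies []zombieVM) (batches map[string][]string, registered []zombieVM) {
	batches = make(map[string][]string)
	for _, z := range zombies {
		if z.HasK8sNode {
			registered = append(registered, z)
			continue
		}
		batches[z.ScaleSet] = append(batches[z.ScaleSet], z.InstanceID)
	}
	return batches, registered
}

func main() {
	batches, registered := splitAndBatch([]zombieVM{
		{ScaleSet: "pool-a", InstanceID: "0"},
		{ScaleSet: "pool-a", InstanceID: "3"},
		{ScaleSet: "pool-b", InstanceID: "1", HasK8sNode: true},
	})
	fmt.Println("batches:", batches)
	fmt.Println("registered (log only):", len(registered))
}
```

Each key of `batches` would then map to a single batch delete request for that scale set.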

Key notes:

  • VMs that never registered have no K8s state, so it is safe to delete them directly and immediately
  • VMs with K8s nodes (unreachable/NotReady) are handed off to the autoscaler, which handles proper state deletion
  • Age threshold (default 5 min) prevents deleting recently created VMs
  • The feature is gated by the EnableZombieCleanup config flag
    • Dry-run mode (ZombieCleanupDryRun) logs what would be deleted but takes no action
  • Batch deletion issues one delete call per VMSS instead of one per VM, which reduces API calls
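
A rough sketch of how the two flags could gate deletion follows. Only the flag names EnableZombieCleanup and ZombieCleanupDryRun come from this PR; the `cleanupConfig` struct, the `deleteFunc` type, and `runBatchDelete` are hypothetical, and the real delete call is replaced by a stand-in function.

```go
package main

import (
	"fmt"
	"log"
)

// cleanupConfig holds the two flags named in the PR description.
// The struct itself is hypothetical.
type cleanupConfig struct {
	EnableZombieCleanup bool
	ZombieCleanupDryRun bool
}

// deleteFunc stands in for the real batch-delete call (hypothetical signature).
type deleteFunc func(scaleSet string, instanceIDs []string) error

// runBatchDelete applies the enable and dry-run gates before invoking del.
// It returns the number of delete requests actually issued.
func runBatchDelete(cfg cleanupConfig, batches map[string][]string, del deleteFunc) (int, error) {
	if !cfg.EnableZombieCleanup {
		return 0, nil // feature disabled: do nothing
	}
	issued := 0
	for scaleSet, ids := range batches {
		if cfg.ZombieCleanupDryRun {
			log.Printf("dry-run: would delete %d instance(s) from %s: %v", len(ids), scaleSet, ids)
			continue
		}
		if err := del(scaleSet, ids); err != nil {
			return issued, fmt.Errorf("deleting instances from %s: %w", scaleSet, err)
		}
		issued++
	}
	return issued, nil
}

func main() {
	batches := map[string][]string{"pool-a": {"0", "3"}}
	del := func(scaleSet string, ids []string) error {
		fmt.Printf("deleting %v from %s\n", ids, scaleSet)
		return nil
	}
	cfg := cleanupConfig{EnableZombieCleanup: true, ZombieCleanupDryRun: true}
	n, err := runBatchDelete(cfg, batches, del)
	fmt.Println("delete requests issued:", n, "error:", err)
}
```

Passing the delete call in as a function keeps the gating logic testable without an Azure client.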

Updated:

Now that Cluster Autoscaler has migrated to the Track 2 Azure SDK, we can use vm.TimeCreated, which is populated within 2 seconds of VM creation, instead of Track 1's status.Time, which can take 1-1.5 minutes to populate.
Track 1: InstanceView.Statuses https://pkg.go.dev/github.com/Azure/azure-sdk-for-go@v68.0.0+incompatible/services/compute/mgmt/2022-08-01/compute#InstanceViewStatus
Track 2: TimeCreated https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute/v5#VirtualMachineScaleSetVMProperties

Functions Implemented:

cleanupZombieNodes(): Main entry point
cleanupZombieNodesWithContext(nodes): Accepts K8s nodes for correlation
evaluateZombieStatus(vm, k8sNodeMap, time, minAge): Returns (isZombie, hasK8sNode, reason)
normalizeProviderID(providerID): Matches Azure IDs to K8s provider IDs
The implementation is called from forceRefresh() in azure_manager.go, which runs once per VmssCacheTTLInSeconds interval (default 1 min).

Tests:

  • TestZombieCleanup_NoZombiesFound - Verifies behavior when no zombies exist
  • TestZombieCleanup_DetectsFailedProvisioning - Verifies Scenario 2: Provisioning failed
  • TestZombieCleanup_DetectsFailedExtensions - Verifies Scenario 1: Extensions failed
  • TestZombieCleanup_DetectsNeverRegisteredInstances - Verifies Scenario 3: Never registered
  • TestZombieCleanup_WithK8sNodesContext - Verifies registered zombies are NOT deleted (Scenario 4a: Unreachable taint)
  • TestZombieCleanup_RespectsMinAge - Verifies age threshold is respected
  • TestZombieCleanup_DryRunMode - Verifies dry-run doesn't delete anything
  • TestZombieCleanup_MultipleZombiesInSamePool - Verifies batch deletion
  • TestZombieCleanup_MultipleVMSSPools - Verifies cleanup across multiple pools
  • TestZombieCleanup_MixedZombiesAndHealthy - Verifies only zombies are deleted
  • TestZombieCleanup_IgnoresDeallocatedNodes - Verifies deallocated VMs are not deleted (Scenario 4b: NotReady but deallocated)

Scenario Detection Tests:

  • TestZombieScenario_ExtensionsFailedToInstall - Demonstrates Scenario 1a: Extensions failed
  • TestZombieScenario_ExtensionsNeverInstalled - Demonstrates Scenario 1b: Extensions never installed (flapping zombie)
  • TestZombieScenario_ProvisioningFailed - Demonstrates Scenario 2: Provisioning failed
  • TestZombieScenario_NeverRegisteredInKubernetes - Demonstrates Scenario 3: Never registered (AllocationFailed)
  • TestZombieScenario_NodeUnreachableTaint - Demonstrates Scenario 4a: Node has unreachable taint
  • TestZombieScenario_NodeNotReady - Demonstrates Scenario 4b: Node NotReady with running VM
  • TestZombieScenario_DeallocatedNodesAreHealthy - Demonstrates healthy deallocated nodes are NOT zombies
  • TestZombieScenario_MultipleZombiesWasteQuota - Demonstrates severe quota waste scenario

Helper Functions:

  • setupMockManager - Setup mock manager with Azure clients
  • newTestAzureManagerForZombieCleanup - Setup a test Azure manager with default config
  • newHealthyVM - Creates a healthy VM for testing
  • newZombieVMWithFailedProvisioning - Creates a VM with failed provisioning state
  • newZombieVMWithFailedExtensions - Creates a VM with failed extensions
  • newZombieVMNeverRegistered - Creates a VM that never registered with K8s
  • newUnreachableZombieVM - Creates a VM that is running but will have unreachable taint
  • newRecentVM - Creates a recently created VM (below age threshold)
  • newDeallocatedVM - Creates a deallocated VM (from autoscaler scale-down)

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

azure: Zombie Node Cleanup

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the do-not-merge/work-in-progress, do-not-merge/release-note-label-needed, and do-not-merge/needs-area labels Jan 13, 2026
@k8s-ci-robot added the area/cluster-autoscaler, area/provider/azure, size/L, and cncf-cla: yes labels and removed the do-not-merge/needs-area label Jan 13, 2026
@k8s-ci-robot added the size/XL label and removed the size/L label Jan 14, 2026
@alimaazamat force-pushed the zombie-node-cleanup branch 2 times, most recently from 56ff2f5 to 9711168 January 14, 2026 00:49
@k8s-ci-robot added the size/XXL label and removed the size/XL label Jan 14, 2026
@alimaazamat changed the title from "[WIP] Zombie Node Cleanup" to "Zombie Node Cleanup" Jan 20, 2026
@k8s-ci-robot removed the do-not-merge/work-in-progress label Jan 20, 2026
@jackfrancis
Contributor

/retitle: azure: Zombie Node Cleanup

@k8s-ci-robot changed the title from "Zombie Node Cleanup" to ": azure: Zombie Node Cleanup" Mar 11, 2026
@jackfrancis changed the title from ": azure: Zombie Node Cleanup" to "azure: Zombie Node Cleanup" Mar 11, 2026
@jackfrancis
Contributor

/release-note-edit

azure: Zombie Node Cleanup

/label tide/merge-method-squash

@k8s-ci-robot added the tide/merge-method-squash and release-note labels and removed the do-not-merge/release-note-label-needed label Mar 11, 2026
@alimaazamat force-pushed the zombie-node-cleanup branch from 40d3849 to 5f2060d March 11, 2026 21:34
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alimaazamat
Once this PR has been reviewed and has the lgtm label, please assign tallaxes for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alimaazamat force-pushed the zombie-node-cleanup branch from 9c1db70 to ee114fa March 11, 2026 22:28
Labels

area/cluster-autoscaler, area/provider/azure, cncf-cla: yes, release-note, size/XXL, tide/merge-method-squash
