
Hibernation fails due to existing node #1080

@gagan16k


How to categorize this issue?

/area robustness
/kind bug
/priority 3

What happened:
Shoot hibernation failed with "not all nodes have been deleted" because an orphaned node remained with the finalizer still set and no machine left to reconcile it. In this state MCM had also been scaled down, because the replica count was zero.

The mechanism that leads to this state could not be determined with certainty, because the logging pods are scaled down well before MCM during hibernation (discussed in this issue - Logging Components Terminated Too Early in Hibernation Flow). However, a solution to one likely mechanism was discussed.
During machine deletion, deleteNodeFinalizers() and deleteNodeObject() run as separate phases. If deleteNodeFinalizers() is skipped (e.g., because the node label is not yet present on the machine), the flow advances to deleteNodeObject(), which issues the node deletion but never removes the node.machine.sapcloud.io/machine-controller finalizer. The node then stays stuck in terminating indefinitely, even after the machine has been fully cleaned up.
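
The suspected sequence can be modeled with a small, self-contained Go sketch. This is a toy in-memory model, not MCM code: the `Node` and `Cluster` types and the functions `phaseDeleteNodeFinalizers`, `phaseDeleteNodeObject`, and `nodeStuck` are hypothetical stand-ins, and the node name is assumed to be known to the second phase by other means. The `stripFinalizer` flag shows one conceivable guard (removing the finalizer in the node-deletion phase as well); it is an assumption for illustration and not necessarily the solution that was discussed.

```go
package main

import "fmt"

// Finalizer name taken from the issue text.
const machineFinalizer = "node.machine.sapcloud.io/machine-controller"

// Node is a hypothetical, minimal stand-in for a Kubernetes Node object.
type Node struct {
	Finalizers        []string
	DeletionRequested bool // models metadata.deletionTimestamp being set
}

// Cluster is a hypothetical in-memory stand-in for the API server.
type Cluster map[string]*Node

// gc removes a node only once deletion was requested and no finalizers remain,
// mirroring how finalizers block object removal.
func (c Cluster) gc(name string) {
	if n, ok := c[name]; ok && n.DeletionRequested && len(n.Finalizers) == 0 {
		delete(c, name)
	}
}

// removeFinalizer drops one finalizer from the named node, if the node exists.
func (c Cluster) removeFinalizer(name, fin string) {
	n, ok := c[name]
	if !ok {
		return
	}
	kept := []string{}
	for _, f := range n.Finalizers {
		if f != fin {
			kept = append(kept, f)
		}
	}
	n.Finalizers = kept
	c.gc(name)
}

// requestDelete marks the node for deletion; it disappears only if no finalizers remain.
func (c Cluster) requestDelete(name string) {
	if n, ok := c[name]; ok {
		n.DeletionRequested = true
		c.gc(name)
	}
}

// phaseDeleteNodeFinalizers models the first phase: it is skipped (returns false)
// when the machine carries no node label yet.
func phaseDeleteNodeFinalizers(c Cluster, nodeLabel string) bool {
	if nodeLabel == "" {
		return false
	}
	c.removeFinalizer(nodeLabel, machineFinalizer)
	return true
}

// phaseDeleteNodeObject models the second phase. With stripFinalizer=false it only
// requests deletion, as described in the issue; with true it also removes the
// finalizer first (an assumed guard, for illustration only).
func phaseDeleteNodeObject(c Cluster, nodeName string, stripFinalizer bool) {
	if stripFinalizer {
		c.removeFinalizer(nodeName, machineFinalizer)
	}
	c.requestDelete(nodeName)
}

// nodeStuck runs both phases and reports whether the node still exists afterwards.
func nodeStuck(labelPresent, stripFinalizer bool) bool {
	const nodeName = "node-a"
	c := Cluster{nodeName: {Finalizers: []string{machineFinalizer}}}
	label := ""
	if labelPresent {
		label = nodeName
	}
	phaseDeleteNodeFinalizers(c, label)
	phaseDeleteNodeObject(c, nodeName, stripFinalizer)
	_, stillThere := c[nodeName]
	return stillThere
}

func main() {
	fmt.Println("label present, node stuck:", nodeStuck(true, false))
	fmt.Println("label missing, node stuck:", nodeStuck(false, false))
	fmt.Println("label missing, guarded, node stuck:", nodeStuck(false, true))
}
```

In this model the node is left behind only when the first phase is skipped and the second phase does not touch the finalizer, which matches the orphaned, terminating node observed during hibernation.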

What you expected to happen:
Shoot hibernates successfully with all nodes deleted

How to reproduce it (as minimally and precisely as possible):
Unsure
