Description
How to categorize this issue?
/area robustness
/kind bug
/priority 3
What happened:
Shoot hibernation failed with "not all nodes have been deleted" because an orphaned node remained with its finalizer still set and no machine left to reconcile it. MCM had also been scaled down in this state, because the replica count was zero.
It was not possible to determine exactly which mechanism leads to this state, because the logging pods are scaled down well before MCM during hibernation (discussed in the issue "Logging Components Terminated Too Early in Hibernation Flow"). However, a solution to one plausible mechanism was discussed.
During machine deletion, deleteNodeFinalizers() and deleteNodeObject() run as separate phases. If deleteNodeFinalizers() skips (e.g., node label not yet present on the machine), the flow advances to deleteNodeObject(), which deletes the node but never removes the node.machine.sapcloud.io/machine-controller finalizer. The node gets stuck terminating indefinitely after the machine is fully cleaned up.
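The suspected sequence can be illustrated with a small, self-contained Go simulation. This is only a sketch of the mechanism described above, not MCM code: the `Node`, `Machine`, and `Cluster` types, the `simulate` helper, and the node name `node-1` are hypothetical, and the sketch assumes the node label appears on the machine between the two phases. Only the function names `deleteNodeFinalizers` and `deleteNodeObject` and the finalizer string come from this report.

```go
package main

import "fmt"

// Finalizer name as quoted in this issue.
const mcmFinalizer = "node.machine.sapcloud.io/machine-controller"

// Node, Machine and Cluster are simplified, hypothetical stand-ins
// for the real API objects; they are not MCM types.
type Node struct {
	Finalizers []string
	Deleting   bool // models a set deletionTimestamp
}

type Machine struct {
	NodeLabel string // empty models "node label not yet present on the machine"
}

type Cluster map[string]*Node

// deleteNodeFinalizers models the first phase: it is skipped (returns false)
// when the machine does not yet point at a node.
func deleteNodeFinalizers(c Cluster, m *Machine) bool {
	node, ok := c[m.NodeLabel]
	if m.NodeLabel == "" || !ok {
		return false // skipped, finalizer stays on the node
	}
	kept := node.Finalizers[:0]
	for _, f := range node.Finalizers {
		if f != mcmFinalizer {
			kept = append(kept, f)
		}
	}
	node.Finalizers = kept
	return true
}

// deleteNodeObject models the second phase: it requests deletion but never
// touches finalizers, so a node with finalizers only gets marked as deleting.
func deleteNodeObject(c Cluster, m *Machine) {
	node, ok := c[m.NodeLabel]
	if !ok {
		return
	}
	if len(node.Finalizers) > 0 {
		node.Deleting = true // stuck terminating until someone removes the finalizer
		return
	}
	delete(c, m.NodeLabel)
}

// simulate runs both phases and returns the names of nodes left terminating.
func simulate(labelPresentEarly bool) []string {
	c := Cluster{"node-1": &Node{Finalizers: []string{mcmFinalizer}}}
	m := &Machine{}
	if labelPresentEarly {
		m.NodeLabel = "node-1"
	}
	deleteNodeFinalizers(c, m)
	m.NodeLabel = "node-1" // assumption: the label is present by the next phase
	deleteNodeObject(c, m)

	var stuck []string
	for name, n := range c {
		if n.Deleting && len(n.Finalizers) > 0 {
			stuck = append(stuck, name)
		}
	}
	return stuck
}

func main() {
	fmt.Println("label present early:", simulate(true))
	fmt.Println("label arrives late: ", simulate(false))
}
```

In this model, the run where the label is present from the start leaves no node behind, while the run where the label arrives late leaves `node-1` terminating with the finalizer still attached, which matches the orphaned node observed during hibernation.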
What you expected to happen:
Shoot hibernates successfully with all nodes deleted
How to reproduce it (as minimally and precisely as possible):
Unsure