Hibernation fails due to existing node #1080

**How to categorize this issue?**

/area robustness
/kind bug
/priority 3

**What happened**:
Shoot hibernation failed with "not all nodes have been deleted" because an orphaned node remained with its finalizer set and no machine left to reconcile it. MCM had also been scaled down in this state, because the replica count was zero.

The exact mechanism that leads to this state could not be determined, because the logging pods are scaled down well before MCM during hibernation (see [Logging Components Terminated Too Early in Hibernation Flow](https://github.com/gardener/gardener/issues/14218)). However, a solution to one plausible mechanism was discussed.
During machine deletion, `deleteNodeFinalizers()` and `deleteNodeObject()` run as separate phases. If `deleteNodeFinalizers()` is skipped (e.g., because the node label is not yet present on the machine), the flow advances to `deleteNodeObject()`, which deletes the node but never removes the `node.machine.sapcloud.io/machine-controller` finalizer. The node then stays stuck in terminating state indefinitely, even after the machine has been fully cleaned up.
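
To make the suspected mechanism concrete, here is a minimal in-memory sketch in Go. It is not MCM code: `cluster`, `node`, `separatePhases`, `combinedPhase`, and `newCluster` are hypothetical names made up for illustration, and the `labelPresent` flag is an assumed stand-in for the condition under which the finalizer phase is skipped. `combinedPhase` shows only one possible mitigation (always removing the finalizer in the same step that deletes the node), which is not necessarily the solution that was discussed.

```go
package main

// Finalizer name as quoted in this issue.
const mcmFinalizer = "node.machine.sapcloud.io/machine-controller"

// node is a simplified, hypothetical stand-in for a Kubernetes Node object.
type node struct {
	finalizers []string
	deleting   bool // models a set deletionTimestamp
}

// cluster is a hypothetical in-memory stand-in for the API server's node store.
type cluster struct {
	nodes map[string]*node
}

// newCluster creates a cluster with one node that carries the MCM finalizer.
func newCluster(name string) *cluster {
	return &cluster{nodes: map[string]*node{
		name: {finalizers: []string{mcmFinalizer}},
	}}
}

// deleteNode mimics API-server semantics: an object that still has finalizers
// is only marked as deleting; it disappears once no finalizers remain.
func (c *cluster) deleteNode(name string) {
	n, ok := c.nodes[name]
	if !ok {
		return
	}
	n.deleting = true
	if len(n.finalizers) == 0 {
		delete(c.nodes, name)
	}
}

// removeFinalizer drops f from the node and completes a pending deletion
// if no finalizers are left afterwards.
func (c *cluster) removeFinalizer(name, f string) {
	n, ok := c.nodes[name]
	if !ok {
		return
	}
	kept := n.finalizers[:0]
	for _, x := range n.finalizers {
		if x != f {
			kept = append(kept, x)
		}
	}
	n.finalizers = kept
	if n.deleting && len(n.finalizers) == 0 {
		delete(c.nodes, name)
	}
}

// separatePhases models the suspected flow: the finalizer phase is skipped
// when the node label is missing, but the delete phase still runs.
func separatePhases(c *cluster, nodeName string, labelPresent bool) {
	if labelPresent {
		c.removeFinalizer(nodeName, mcmFinalizer)
	}
	c.deleteNode(nodeName)
}

// combinedPhase sketches one possible mitigation (an assumption): the
// finalizer is always removed together with the node deletion.
func combinedPhase(c *cluster, nodeName string) {
	c.removeFinalizer(nodeName, mcmFinalizer)
	c.deleteNode(nodeName)
}
```

In this model, calling `separatePhases` with `labelPresent` set to `false` leaves the node in the store with `deleting` set and the finalizer still attached, which corresponds to the stuck terminating node described above. Calling `combinedPhase` on a fresh cluster removes the node entirely.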

**What you expected to happen**:
The shoot hibernates successfully with all nodes deleted.

**How to reproduce it (as minimally and precisely as possible)**:
Unsure
