[Enhancement] Streamline etcd resource status

**Feature (What you would like to be added):**
Streamline etcd resource status to make it more meaningful and accurate for human operators.

**Motivation (Why is this needed?):**
Currently, etcd status has the following fields:
- `ObservedGeneration`
- `Etcd`
- `Conditions`
- `ServiceName`
- `LastError`
- `ClusterSize`
- `CurrentReplicas`
- `Replicas`
- `ReadyReplicas`
- `Ready`
- `UpdatedReplicas`
- `LabelSelector`
- `Members`
- `PeerUrlTLSEnabled`

As part of https://github.com/gardener/etcd-druid/pull/594, many of these fields have already been marked deprecated, such as `Status.ClusterSize`, `Status.ServiceName` and `Status.UpdatedReplicas`, and will be removed in the future. But there are still many fields which are not really required to deduce the state of the etcd resource, such as `Status.Etcd` (which provides a self-reference to the Etcd object and isn't required since anybody who has access to the etcd status already has a reference to the etcd object itself), `Status.Ready` (readiness is a state, and can be better represented by a `condition` denoting quorum in the cluster, which means that the cluster is "ready" to serve traffic), `Status.Members` (which will probably be replaced by something like `Status.MemberRefs` via https://github.com/gardener/etcd-druid/issues/206), and fields `Status.Replicas`, `Status.ReadyReplicas` and `Status.CurrentReplicas` which are blindly copied over from the statefulset status and provide no meaningful information about the cluster, but rather druid should to the work of deducing such member-specific info and correctly populating conditions.

The `Status.Conditions` types that set today are:

- `Ready`
  
  Denotes readiness of the cluster, but for a human operator this condition would make more sense when renamed to `QuorumReached`, to denote that the cluster has reached quorum, which means it's ready to serve traffic.

- `AllMembersReady`

  Currently, this does not provide any information that `QuorumReached` cannot already provide. We would still require this condition to deduce whether every member is alive, ie, both containers in the pod are up. In such a case, a better name for this condition would be `AllMembersAlive` since readiness of each member already denotes readiness of the cluster, whereas liveness of a member has no bearing on whether the cluster is ready or not. In order to enable this, we will need to enhance the lease renewal logic in the backup sidecar to also take into consideration the liveness of its etcd member, via a serializable GET call to the etcd (which can be achieved via a liveness endpoint on the etcd-wrapper container).

- `BackupReady`

  This denotes whether the backups are uploaded on schedule as expected and are not stale. It considers both full and delta backups, and is computed using a complex logic that considers various different combinations of the states of the full snapshot lease and delta snapshot lease, which becomes quite complex to read and maintain. It would make better sense to split this condition into two simpler conditions for full and delta snapshot staleness respectively, so that druid can individually compute the condition state for both these and populate them individually, providing a cleaner view to human operators.

**Approach/Hint to the implement solution (optional):**

Proposed status fields to be removed/deprecated, or already deprecated:
- [ ] `Etcd`
- [x] `ServiceName`
- [x] `ClusterSize`
- [ ] `Replicas`
- [ ] `ReadyReplicas`
- [ ] `CurrentReplicas`
- [x] `UpdatedReplicas`
- [x] `LabelSelector`
- [ ] `LastError`

Proposed conditions:
```
- type: QuorumReached
  status: True|False|Unknown
  reason: MajorityMembersReady | MajorityMembersUnready | QuorumNotChecked
  message: "x/y members ready: <member-id>,<member-id>; z/y members unready: <member-id>" | "Quorum not checked"
- type: AllMembersAlive
  status: True|False|Unknown
  reason: MemberLeasesRenewed | MemberLeasesStale | MemberLeasesNotChecked
  message: "x/y members alive: <member-id>,<member-id>; z/y members not alive: <member-id>" | "Member leases not checked"
- type: FullSnapshotStale/FullSnapshotOutdated
  status: True|False|Unknown
  reason: SnapshotUploadedOnSchedule | SnapshotMissedSchedule | CannotDetermineSnapshotUploadStatus
  message: "Full snapshot uploaded successfully in the last x (time)" | "Cannot determine snapshot upload status"
- type: DeltaSnapshotStale/DeltaSnapshotOutdated
  status: True|False|Unknown
  reason: SnapshotUploadedOnSchedule | SnapshotMissedSchedule | CannotDetermineSnapshotUploadStatus
  message: "Delta snapshot uploaded successfully in the last x (time)" | "Cannot determine snapshot upload status"
```

/area usability


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Enhancement] Streamline etcd resource status #645

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Enhancement] Streamline etcd resource status #645

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions