-
Notifications
You must be signed in to change notification settings - Fork 64
Description
Feature (What you would like to be added):
Streamline etcd resource status to make it more meaningful and accurate for human operators.
Motivation (Why is this needed?):
Currently, etcd status has the following fields:
ObservedGenerationEtcdConditionsServiceNameLastErrorClusterSizeCurrentReplicasReplicasReadyReplicasReadyUpdatedReplicasLabelSelectorMembersPeerUrlTLSEnabled
As part of #594, many of these fields have already been marked deprecated, such as Status.ClusterSize, Status.ServiceName and Status.UpdatedReplicas, and will be removed in the future. But there are still many fields which are not really required to deduce the state of the etcd resource, such as Status.Etcd (which provides a self-reference to the Etcd object and isn't required since anybody who has access to the etcd status already has a reference to the etcd object itself), Status.Ready (readiness is a state, and can be better represented by a condition denoting quorum in the cluster, which means that the cluster is "ready" to serve traffic), Status.Members (which will probably be replaced by something like Status.MemberRefs via #206), and fields Status.Replicas, Status.ReadyReplicas and Status.CurrentReplicas which are blindly copied over from the statefulset status and provide no meaningful information about the cluster, but rather druid should to the work of deducing such member-specific info and correctly populating conditions.
The Status.Conditions types that set today are:
-
ReadyDenotes readiness of the cluster, but for a human operator this condition would make more sense when renamed to
QuorumReached, to denote that the cluster has reached quorum, which means it's ready to serve traffic. -
AllMembersReadyCurrently, this does not provide any information that
QuorumReachedcannot already provide. We would still require this condition to deduce whether every member is alive, ie, both containers in the pod are up. In such a case, a better name for this condition would beAllMembersAlivesince readiness of each member already denotes readiness of the cluster, whereas liveness of a member has no bearing on whether the cluster is ready or not. In order to enable this, we will need to enhance the lease renewal logic in the backup sidecar to also take into consideration the liveness of its etcd member, via a serializable GET call to the etcd (which can be achieved via a liveness endpoint on the etcd-wrapper container). -
BackupReadyThis denotes whether the backups are uploaded on schedule as expected and are not stale. It considers both full and delta backups, and is computed using a complex logic that considers various different combinations of the states of the full snapshot lease and delta snapshot lease, which becomes quite complex to read and maintain. It would make better sense to split this condition into two simpler conditions for full and delta snapshot staleness respectively, so that druid can individually compute the condition state for both these and populate them individually, providing a cleaner view to human operators.
Approach/Hint to the implement solution (optional):
Proposed status fields to be removed/deprecated, or already deprecated:
-
Etcd -
ServiceName -
ClusterSize -
Replicas -
ReadyReplicas -
CurrentReplicas -
UpdatedReplicas -
LabelSelector -
LastError
Proposed conditions:
- type: QuorumReached
status: True|False|Unknown
reason: MajorityMembersReady | MajorityMembersUnready | QuorumNotChecked
message: "x/y members ready: <member-id>,<member-id>; z/y members unready: <member-id>" | "Quorum not checked"
- type: AllMembersAlive
status: True|False|Unknown
reason: MemberLeasesRenewed | MemberLeasesStale | MemberLeasesNotChecked
message: "x/y members alive: <member-id>,<member-id>; z/y members not alive: <member-id>" | "Member leases not checked"
- type: FullSnapshotStale/FullSnapshotOutdated
status: True|False|Unknown
reason: SnapshotUploadedOnSchedule | SnapshotMissedSchedule | CannotDetermineSnapshotUploadStatus
message: "Full snapshot uploaded successfully in the last x (time)" | "Cannot determine snapshot upload status"
- type: DeltaSnapshotStale/DeltaSnapshotOutdated
status: True|False|Unknown
reason: SnapshotUploadedOnSchedule | SnapshotMissedSchedule | CannotDetermineSnapshotUploadStatus
message: "Delta snapshot uploaded successfully in the last x (time)" | "Cannot determine snapshot upload status"
/area usability