[Enhancement] Streamline etcd resource status #645

@shreyas-s-rao

Feature (What you would like to be added):
Streamline etcd resource status to make it more meaningful and accurate for human operators.

Motivation (Why is this needed?):
Currently, etcd status has the following fields:

  • ObservedGeneration
  • Etcd
  • Conditions
  • ServiceName
  • LastError
  • ClusterSize
  • CurrentReplicas
  • Replicas
  • ReadyReplicas
  • Ready
  • UpdatedReplicas
  • LabelSelector
  • Members
  • PeerUrlTLSEnabled

As part of #594, several of these fields, such as Status.ClusterSize, Status.ServiceName and Status.UpdatedReplicas, have already been marked deprecated and will be removed in the future. However, many of the remaining fields are not really required to deduce the state of the etcd resource:

  • Status.Etcd provides a self-reference to the Etcd object. It isn't required, since anybody who has access to the etcd status already has a reference to the etcd object itself.
  • Status.Ready: readiness is a state, and is better represented by a condition denoting quorum in the cluster, which means that the cluster is "ready" to serve traffic.
  • Status.Members will probably be replaced by something like Status.MemberRefs via #206.
  • Status.Replicas, Status.ReadyReplicas and Status.CurrentReplicas are blindly copied over from the statefulset status and provide no meaningful information about the cluster. Instead, druid should do the work of deducing such member-specific info and correctly populating conditions.

The Status.Conditions types that are set today are:

  • Ready

    Denotes readiness of the cluster, but for a human operator this condition would make more sense when renamed to QuorumReached, to denote that the cluster has reached quorum, which means it's ready to serve traffic.

  • AllMembersReady

    Currently, this does not provide any information that QuorumReached cannot already provide. We would still need this condition to deduce whether every member is alive, i.e., whether both containers in the pod are up. In that case, a better name for this condition would be AllMembersAlive, since readiness of each member already denotes readiness of the cluster, whereas liveness of a member has no bearing on whether the cluster is ready or not. To enable this, we will need to enhance the lease renewal logic in the backup sidecar to also take into consideration the liveness of its etcd member, via a serializable GET call to etcd (which can be achieved via a liveness endpoint on the etcd-wrapper container).

  • BackupReady

    This denotes whether backups are uploaded on schedule as expected and are not stale. It considers both full and delta backups, and is computed by logic that evaluates many combinations of the states of the full snapshot lease and the delta snapshot lease, which makes it hard to read and maintain. It would make better sense to split this condition into two simpler conditions, for full and delta snapshot staleness respectively, so that druid can compute and populate each one independently, providing a cleaner view to human operators.

Approach/Hint to implement the solution (optional):

Proposed status fields to be removed/deprecated, or already deprecated:

  • Etcd
  • ServiceName
  • ClusterSize
  • Replicas
  • ReadyReplicas
  • CurrentReplicas
  • UpdatedReplicas
  • LabelSelector
  • LastError

Proposed conditions:

- type: QuorumReached
  status: True|False|Unknown
  reason: MajorityMembersReady | MajorityMembersUnready | QuorumNotChecked
  message: "x/y members ready: <member-id>,<member-id>; z/y members unready: <member-id>" | "Quorum not checked"
- type: AllMembersAlive
  status: True|False|Unknown
  reason: MemberLeasesRenewed | MemberLeasesStale | MemberLeasesNotChecked
  message: "x/y members alive: <member-id>,<member-id>; z/y members not alive: <member-id>" | "Member leases not checked"
- type: FullSnapshotStale/FullSnapshotOutdated
  status: True|False|Unknown
  reason: SnapshotUploadedOnSchedule | SnapshotMissedSchedule | CannotDetermineSnapshotUploadStatus
  message: "Full snapshot uploaded successfully in the last x (time)" | "Cannot determine snapshot upload status"
- type: DeltaSnapshotStale/DeltaSnapshotOutdated
  status: True|False|Unknown
  reason: SnapshotUploadedOnSchedule | SnapshotMissedSchedule | CannotDetermineSnapshotUploadStatus
  message: "Delta snapshot uploaded successfully in the last x (time)" | "Cannot determine snapshot upload status"

/area usability

Labels: area/usability (Usability related), kind/enhancement (Enhancement, improvement, extension), lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed)
