Skip to content

Decouple backup ready condition for snapshot leases and remove support for deprecated fields in etcd status#820

Open
anveshreddy18 wants to merge 2 commits intogardener:masterfrom
anveshreddy18:decouple_snapshot_leases
Open

Decouple backup ready condition for snapshot leases and remove support for deprecated fields in etcd status#820
anveshreddy18 wants to merge 2 commits intogardener:masterfrom
anveshreddy18:decouple_snapshot_leases

Conversation

@anveshreddy18
Copy link
Member

@anveshreddy18 anveshreddy18 commented Jun 26, 2024

How to categorize this PR?

/area usability monitoring
/kind enhancement cleanup

What this PR does / why we need it:

This PR splits the existing BackupReady condition which talks about both full and delta snapshot health as a single condition into two new conditions FullSnapshotBackupReady and DeltaSnapshotBackupReady, one for each type of snapshot i.e Full and Delta respectively. This helps in better understanding of status of each backup type thus helping debug issues and analyze more effectively. The existing BackupReady condition will still be present, which aggregates both the individual snapshot conditions into one.

This PR also removes the support for few fields in etcd status which are in deprecated state for enough time. The fields that are removed are ServiceName, ClusterSize, UpdatedReplicas and LabelSelector. These fields were essentially taken from statefulset and not necessarily needed to be present under etcd.status hence they were deprecated and this PR stops the support.

This PR partially fixes this issue for streamlining etcd status

Which issue(s) this PR fixes:

Special notes for your reviewer:

Release note:

The existing `BackupReady` condition in `etcd.status` is split into 2 new specific conditions `FullSnapshotBackupReady` & `DeltaSnapshotBackupReady` to better understand the health of each type of snapshot backup
Support for fields `ServiceName`, `ClusterSize`,  `UpdatedReplicas` and `LabelSelector`  in `etcd.status`  are removed after they've been deprecated for 4 minor releases 

@anveshreddy18 anveshreddy18 requested a review from a team as a code owner June 26, 2024 16:32
@gardener-robot gardener-robot added needs/review Needs review size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. needs/second-opinion Needs second review by someone else labels Jun 26, 2024
@ghost ghost added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 26, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jun 26, 2024
@gardener-robot gardener-robot added area/monitoring Monitoring (including availability monitoring and alerting) related area/usability Usability related kind/cleanup Something that is not needed anymore and can be cleaned up kind/enhancement Enhancement, improvement, extension labels Jun 26, 2024
@gardener gardener deleted a comment from gardener-robot Jun 26, 2024
@anveshreddy18 anveshreddy18 self-assigned this Jun 26, 2024
@anveshreddy18 anveshreddy18 force-pushed the decouple_snapshot_leases branch from 715260b to d0f3163 Compare June 26, 2024 16:43
@ghost ghost added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 26, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 26, 2024
@anveshreddy18
Copy link
Member Author

/retest

1 similar comment
@anveshreddy18
Copy link
Member Author

/retest

@seshachalam-yv
Copy link
Member

/assign

Copy link
Member

@seshachalam-yv seshachalam-yv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove Cluster Size and Backup Ready.

// +kubebuilder:printcolumn:name="Cluster Size",type=integer,JSONPath=`.spec.replicas`,priority=1

// +kubebuilder:printcolumn:name="Backup Ready",type=string,JSONPath=`.status.conditions[?(@.type=="BackupReady")].status`

Add FullSnapshotBackupReady and DeltaSnapshotBackupReady as part of print columns

@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Jul 9, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jul 10, 2024
@anveshreddy18
Copy link
Member Author

@seshachalam-yv Nice find, thanks. Done

@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jul 18, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 21, 2024
@anveshreddy18 anveshreddy18 force-pushed the decouple_snapshot_leases branch from 9036d63 to b0b9725 Compare August 27, 2024 07:55
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Aug 27, 2024
@anveshreddy18
Copy link
Member Author

/test pull-etcd-druid-e2e-kind

Copy link
Member

@unmarshall unmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to pair program this. Current code needs a LOT of changes.

// +kubebuilder:printcolumn:name="Ready Replicas",type=integer,JSONPath=`.status.readyReplicas`,priority=1
// +kubebuilder:printcolumn:name="Backup Ready",type=string,JSONPath=`.status.conditions[?(@.type=="BackupReady")].status`
// +kubebuilder:printcolumn:name="Full Backup Ready",type=string,JSONPath=`.status.conditions[?(@.type=="FullSnapshotBackupReady")].status`
// +kubebuilder:printcolumn:name="Delta Backup Ready",type=string,JSONPath=`.status.conditions[?(@.type=="DeltaSnapshotBackupReady")].status`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call them Delta Backup Ready or Delta Snapshot Backup Ready? Similarly for full snapshot.

- `AllMembersReady`: indicates readiness of all members of the etcd cluster.
- `Ready`: indicates overall readiness of the etcd cluster in serving traffic.
- `BackupReady`: indicates health of the etcd backups, i.e., whether etcd backups are being taken regularly as per schedule. This condition is applicable only when backups are enabled for the etcd cluster.
- `FullSnapshotBackupReady`: indicates health of the full snapshot backups, i.e whether full snapshot backups are being taken regularly as per schedule. This condition is applicable only when backups are enabled for the etcd cluster.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in which situation will this be set to false? If one full snapshot is missed or more than one missed?

- `Ready`: indicates overall readiness of the etcd cluster in serving traffic.
- `BackupReady`: indicates health of the etcd backups, i.e., whether etcd backups are being taken regularly as per schedule. This condition is applicable only when backups are enabled for the etcd cluster.
- `FullSnapshotBackupReady`: indicates health of the full snapshot backups, i.e whether full snapshot backups are being taken regularly as per schedule. This condition is applicable only when backups are enabled for the etcd cluster.
- `DeltaSnapshotBackupReady`: indicates health of the delta snapshot backups, i.e whether delta snapshot backups are being taken regularly as per schedule. This condition is applicable only when backups are enabled for the etcd cluster.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the criteria for this value to be false?

### Verify the Etcd cluster

To obtain information regarding the newly instantiated etcd cluster, perform the following step, which gives details such as the cluster size, readiness status of its members, and various other attributes.
To obtain information regarding the newly instantiated etcd cluster, perform the following step, which gives details such as the current & ready replicas, readiness status of its members, Full and Delta backup status and various other attributes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
To obtain information regarding the newly instantiated etcd cluster, perform the following step, which gives details such as the current & ready replicas, readiness status of its members, Full and Delta backup status and various other attributes.
To obtain information regarding the newly instantiated etcd cluster, perform the following step, which gives details such as the current & ready replicas, readiness status of its members, Full Snapshot and Delta Snapshot backup status and various other attributes.

}

var FullSnapshotBackupReadyCheckResult, DeltaSnapshotBackupReadyCheckResult Result = nil, nil
for _, result := range b.results {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by using the loop variable result you are shadowing the variable result defined at the start of Check method. Please avoid shadowing.

)
fullSnapErr = a.cl.Get(ctx, types.NamespacedName{Name: getFullSnapLeaseName(&etcd), Namespace: etcd.ObjectMeta.Namespace}, fullSnapLease)
incrSnapErr = a.cl.Get(ctx, types.NamespacedName{Name: getDeltaSnapLeaseName(&etcd), Namespace: etcd.ObjectMeta.Namespace}, deltaSnapLease)
fullSnapErr = f.cl.Get(ctx, types.NamespacedName{Name: getFullSnapLeaseName(&etcd), Namespace: etcd.ObjectMeta.Namespace}, fullSnapLease)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there is an error getting the full snapshot lease then is the error being ignored immediately but handled much later? If there is an error you should exit soon with error instead of continuing with the computations.

}
}

func (d *deltaSnapshotBackupReadyCheck) Check(ctx context.Context, etcd druidv1alpha1.Etcd) Result {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a LOT of code repetition and this code needs to be rewritten.

condition.AllMembersReadyCheck,
condition.ReadyCheck,
condition.BackupReadyCheck,
condition.FullSnapshotBackupReadyCheck,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have removed BackupReadyCheck here but you have also changed and kept the implementation of BackupReady in check_backup_ready.go. Any reason for that?

})()

results := make([]condition.Result, 0, len(ConditionChecks))
results := make([]condition.Result, 0, len(ConditionChecks)+1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this conditions checks + 1?

for r := range resultCh {
results = append(results, r)
}
backupReadyChecker := condition.BackupReadyCheck(c.cl, results)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i really do not understand this. You removed it from ConditionChecks but introduce this here? why?

@anveshreddy18 anveshreddy18 removed their assignment Nov 11, 2024
@ghost ghost added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 2, 2025
@gardener-robot gardener-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 10, 2025
@gardener-ci-robot
Copy link

The Gardener project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:

  • After 15d of inactivity, lifecycle/stale is applied
  • After 15d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 7d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close

/lifecycle rotten

@gardener-robot gardener-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 25, 2025
@gardener-ci-robot
Copy link

The Gardener project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:

  • After 15d of inactivity, lifecycle/stale is applied
  • After 15d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 7d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten

/close

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Jan 1, 2026
@shreyas-s-rao shreyas-s-rao removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 1, 2026
@shreyas-s-rao shreyas-s-rao reopened this Jan 1, 2026
@gardener-github-actions gardener-github-actions bot added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jan 1, 2026
@github-actions github-actions bot removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jan 1, 2026
@gardener-robot gardener-robot added status/accepted Issue was accepted as something we need to work on and removed status/closed Issue is closed (either delivered or triaged) labels Jan 1, 2026
@gardener-prow gardener-prow bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 13, 2026
@gardener-prow
Copy link

gardener-prow bot commented Jan 13, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@gardener-prow gardener-prow bot added the cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. label Feb 5, 2026
@gardener-prow
Copy link

gardener-prow bot commented Feb 5, 2026

@anveshreddy18: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-druid-check-renovate-config b0b9725 link true /test pull-etcd-druid-check-renovate-config
pull-etcd-druid-verify-image-build e149f93 link true /test pull-etcd-druid-verify-image-build
pull-etcd-druid-integration e149f93 link true /test pull-etcd-druid-integration
pull-etcd-druid-unit e149f93 link true /test pull-etcd-druid-unit
pull-etcd-druid-api-unit e149f93 link false /test pull-etcd-druid-api-unit
pull-etcd-druid-e2e-kind e149f93 link true /test pull-etcd-druid-e2e-kind
pull-etcd-druid-e2e-kind-recovery e149f93 link true /test pull-etcd-druid-e2e-kind-recovery
pull-etcd-druid-e2e-kind-cluster-update e149f93 link true /test pull-etcd-druid-e2e-kind-cluster-update
pull-etcd-druid-e2e-kind-snapshot-compaction e149f93 link true /test pull-etcd-druid-e2e-kind-snapshot-compaction
pull-etcd-druid-e2e-kind-secret-finalizers e149f93 link true /test pull-etcd-druid-e2e-kind-secret-finalizers
pull-etcd-druid-e2e-kind-basic e149f93 link true /test pull-etcd-druid-e2e-kind-basic
pull-etcd-druid-e2e-kind-scaleout e149f93 link true /test pull-etcd-druid-e2e-kind-scaleout
pull-etcd-druid-e2e-kind-tls-label-updates e149f93 link true /test pull-etcd-druid-e2e-kind-tls-label-updates

Full PR test history. Your PR dashboard. Command help for this repository.
Please help us cut down on flakes by linking this test failure to an open flake report or filing a new flake report if you can't find an existing one. Also see our testing guideline for how to avoid and hunt flakes.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@gardener-ci-robot
Copy link

The Gardener project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 14d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as active with /lifecycle active
  • Mark this PR as fresh with /remove-lifecycle stale
  • Mark this PR as rotten with /lifecycle rotten
  • Close this PR with /close

/lifecycle stale

@gardener-prow gardener-prow bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/monitoring Monitoring (including availability monitoring and alerting) related area/usability Usability related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/cleanup Something that is not needed anymore and can be cleaned up kind/enhancement Enhancement, improvement, extension lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs/changes Needs (more) changes needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/rebase Needs git rebase needs/review Needs review needs/second-opinion Needs second review by someone else needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. status/accepted Issue was accepted as something we need to work on

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants