Skip to content

Take Full-Snapshot via EtcdOpsTask while hibernating and performing etcd version upgrade.#1300

Open
Shreyas-s14 wants to merge 4 commits intogardener:masterfrom
Shreyas-s14:etcd-upgr-snapshot
Open

Take Full-Snapshot via EtcdOpsTask while hibernating and performing etcd version upgrade.#1300
Shreyas-s14 wants to merge 4 commits intogardener:masterfrom
Shreyas-s14:etcd-upgr-snapshot

Conversation

@Shreyas-s14
Copy link
Member

@Shreyas-s14 Shreyas-s14 commented Mar 2, 2026

How to categorize this PR?

/area disaster-recovery
/area control-plane
/area compliance
/kind enhancement

What this PR does / why we need it:
This PR does the below 2 things:

  1. Performs a Full Snapshot Operation via EtcdOpsTask upon hibernation of the etcd resource (i.e spec.Replicas=0) with a backoff of 3.
  2. Performs a Full Snapshot Operation via EtcdOpsTask upon etcd upgrade with a backoff of 3. This Etcd upgrade is carried out by:
    • Introducing a feature gate in etcd-druid to carry out this procedure.
    • Maintaining multiple images for etcd-wrapper and etcd-backup-restore and selecting the right one based on whether the feature-gate is set.

Which issue(s) this PR fixes:
Fixes Partially #445

Checklist:

  • Update documentation in the /docs folder (if applicable)
  • Add tests that cover your changes (if applicable)
    • Unit tests
    • Integration tests
    • E2E tests

Special notes for your reviewer:
As of now, the PR for etcd-upgrade in etcd-wrapper and etcd-backup-restore have not been merged. Hence custom images have been used. Once they are merged, the images will be replaced with the official built ones.

Release note:

Added feature-gate for etcd upgrade to 3.5.26. Take Full Snapshot via EtcdOpsTask on hibernation of etcd and etcd upgrade via feature-gate.

@Shreyas-s14 Shreyas-s14 requested a review from a team as a code owner March 2, 2026 05:36
@gardener-prow gardener-prow bot added area/disaster-recovery Disaster recovery related area/control-plane Control plane related area/compliance Compliance related kind/enhancement Enhancement, improvement, extension cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 2, 2026
@federated-github-access federated-github-access bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Mar 2, 2026
@github-actions github-actions bot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. and removed ok-to-test Indicates a non-member PR verified by an org member that is safe to test. labels Mar 2, 2026
@anveshreddy18 anveshreddy18 self-assigned this Mar 2, 2026
@anveshreddy18 anveshreddy18 added this to the v0.36.0 milestone Mar 2, 2026
Copy link
Member

@anveshreddy18 anveshreddy18 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I have the following comments for now, PTAL. I'll be adding more as I'm reviewing other files.

Copy link
Member

@anveshreddy18 anveshreddy18 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing all my review comments @Shreyas-s14.
/lgtm

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Mar 6, 2026
@gardener-prow
Copy link

gardener-prow bot commented Mar 6, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: anveshreddy18
Once this PR has been reviewed and has the lgtm label, please assign unmarshall for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Member

@CaptainIRS CaptainIRS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Shreyas-s14 thanks for the PR! I have a few comments, PTAL, thanks.


if tc.expectedNil {
g.Expect(syncErr).ToNot(HaveOccurred())
} else if tc.expectedRequeue {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tc.expectedRequeue isn't being set anywhere, so this would always be false. Also I think the tc.expectedNil check should have an else component since if you change the expectedNil values in the current tests, the tests still pass.

Config: druidv1alpha1.EtcdOpsTaskConfig{
OnDemandSnapshot: &druidv1alpha1.OnDemandSnapshotConfig{Type: druidv1alpha1.OnDemandSnapshotTypeFull},
},
TTLSecondsAfterFinished: ptr.To(int32(36)), // 48 hours
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

36 seems too low, wondering if this change is intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh thanks for the catch
It was supposed to be 2 days, but for testing purposes, i made it to 36 secs to see deletion flows. Will change


// Pre-sync snapshot task constants
preSyncTaskPrefixHibernation = "presync-snapshot-hibernation-"
preSyncTaskPrefixUpgrade = "presync-snapshot-upgrade-"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we rename this to something generic since the pre-sync task also executes for any version change (both upgrade and downgrade)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as of now, we do not take snapshots on downgrade.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I missed the return nil in the if !druidconfigv1alpha1.DefaultFeatureGates.IsEnabled(druidconfigv1alpha1.UpgradeEtcdVersion) { block, sorry.
Though, I wonder if we should be taking backups on downgrade (i.e if the feature flag is set to false and reconciliation is triggered) since the etcd doc suggests so in https://etcd.io/docs/v3.5/downgrades/downgrade_3_5/#server-downgrade-checklists

Comment on lines +31 to +34
// ImageKeyEtcdV3_5 is the key for the etcd v3.5 image in the image vector.
ImageKeyEtcdWrapperV3_5 = "etcd-wrapper-3_5"
// ImageKeyEtcdBackupRestoreV3_5 is the key for the etcd-backup-restore v3.5 image in the image vector.
ImageKeyEtcdBackupRestoreV3_5 = "etcd-backup-restore-3_5"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought: could we instead name it generic like ImageKeyEtcdWrapperNext so that in the future if we with to perform upgrades, we could just use the same mechanism to roll out upgrades in landscapes?

stsReplicas: 3,
etcdReplicas: 0,
etcdWrapperImage: currentImage,
existingTasks: []*druidv1alpha1.EtcdOpsTask{buildPreSyncTask(preSyncTaskPrefixHibernation, 2, ptr.To(druidv1alpha1.TaskStateFailed))},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have this reference the constant instead like maxPreSyncRetries - 1? (same for other entries below as well)

}
return nil
}
g.Eventually(checkFn).Within(timeout).WithPolling(pollInterval).WithContext(ctx).Should(BeNil())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
g.Eventually(checkFn).Within(timeout).WithPolling(pollInterval).WithContext(ctx).Should(BeNil())
g.Eventually(checkFn).Within(timeout).WithPolling(pollInterval).WithContext(ctx).Should(Succeed())

// This is now GA. Any attempt to disable this feature will be an error.
UseEtcdWrapper = "UseEtcdWrapper"

// UpgradeEtcdVersionTo is the name of the feature which enables upgrade of etcd version to v3.5.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// UpgradeEtcdVersionTo is the name of the feature which enables upgrade of etcd version to v3.5.
// UpgradeEtcdVersion is the name of the feature which enables upgrade of etcd version to v3.5.

@gardener-prow gardener-prow bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 6, 2026
@gardener-prow gardener-prow bot requested a review from anveshreddy18 March 6, 2026 08:00
@CaptainIRS CaptainIRS self-assigned this Mar 6, 2026
UseEtcdWrapper = "UseEtcdWrapper"

// UpgradeEtcdVersionTo is the name of the feature which enables upgrade of etcd version to v3.5.
UpgradeEtcdVersion = "UpgradeEtcdVersion"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +8 to +9
sourceRepository: github.com/gardener/etcd-wrapper
repository: shreyas14/wrapper
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding this comment as reminder to update the image here.

Comment on lines +17 to +22
#TODO: @Shreyas-s14 Change the image to etcd-backup-restore image once PR is merged
- name: etcd-backup-restore-3_5
resourceId:
name: 'etcdbrctl-3_5'
sourceRepository: github.com/gardener/etcd-backup-restore
repository: shreyas14/backup-restore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@ishan16696
Copy link
Member

/hold until etcd-wrapper and etcd-backup-restore release as image tag need to be updated in this PR.

@gardener-prow gardener-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 6, 2026
@ishan16696
Copy link
Member

ishan16696 commented Mar 6, 2026

wording of release note doesn't seems to be correct:

Take Full Snapshot via EtcdOpsTask on hibernation of etcd and etcd upgrade via feature-gate.

I think you're taking full snapshot before hibernation via EtcdOpsTask irrespective of feature gate. I think it better to mention this as separate release note.

Comment on lines +106 to +108
if etcd.Spec.Replicas == 0 {
return r.ensurePreSyncSnapshot(ctx, etcd, preSyncTaskPrefixHibernation)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if etcd.Spec.Replicas == 0 {
return r.ensurePreSyncSnapshot(ctx, etcd, preSyncTaskPrefixHibernation)
}
if etcd.Spec.Replicas == 0 && *existingSts.Spec.Replicas > 0{
return r.ensurePreSyncSnapshot(ctx, etcd, preSyncTaskPrefixHibernation)
}

if we are taking a full snapshot before hibernation, isn't it make sense to also check that sts spec.replicas is not 0 yet, wdyt ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/compliance Compliance related area/control-plane Control plane related area/disaster-recovery Disaster recovery related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/enhancement Enhancement, improvement, extension needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants