Take Full-Snapshot via EtcdOpsTask while hibernating and performing etcd version upgrade.#1300
Take Full-Snapshot via EtcdOpsTask while hibernating and performing etcd version upgrade.#1300Shreyas-s14 wants to merge 4 commits intogardener:masterfrom
Full-Snapshot via EtcdOpsTask while hibernating and performing etcd version upgrade.#1300Conversation
anveshreddy18
left a comment
There was a problem hiding this comment.
Thanks for the PR. I have the following comments for now, PTAL. I'll be adding more as I'm reviewing other files.
ddafda0 to
36248bc
Compare
anveshreddy18
left a comment
There was a problem hiding this comment.
Thanks for addressing all my review comments @Shreyas-s14.
/lgtm
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: anveshreddy18 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
CaptainIRS
left a comment
There was a problem hiding this comment.
@Shreyas-s14 thanks for the PR! I have a few comments, PTAL, thanks.
|
|
||
| if tc.expectedNil { | ||
| g.Expect(syncErr).ToNot(HaveOccurred()) | ||
| } else if tc.expectedRequeue { |
There was a problem hiding this comment.
tc.expectedRequeue isn't being set anywhere, so this would always be false. Also I think the tc.expectedNil check should have an else component since if you change the expectedNil values in the current tests, the tests still pass.
| Config: druidv1alpha1.EtcdOpsTaskConfig{ | ||
| OnDemandSnapshot: &druidv1alpha1.OnDemandSnapshotConfig{Type: druidv1alpha1.OnDemandSnapshotTypeFull}, | ||
| }, | ||
| TTLSecondsAfterFinished: ptr.To(int32(36)), // 48 hours |
There was a problem hiding this comment.
36 seems too low, wondering if this change is intentional?
There was a problem hiding this comment.
Oh thanks for the catch
It was supposed to be 2 days, but for testing purposes, i made it to 36 secs to see deletion flows. Will change
|
|
||
| // Pre-sync snapshot task constants | ||
| preSyncTaskPrefixHibernation = "presync-snapshot-hibernation-" | ||
| preSyncTaskPrefixUpgrade = "presync-snapshot-upgrade-" |
There was a problem hiding this comment.
Should we rename this to something generic since the pre-sync task also executes for any version change (both upgrade and downgrade)?
There was a problem hiding this comment.
as of now, we do not take snapshots on downgrade.
There was a problem hiding this comment.
Ah, I missed the return nil in the if !druidconfigv1alpha1.DefaultFeatureGates.IsEnabled(druidconfigv1alpha1.UpgradeEtcdVersion) { block, sorry.
Though, I wonder if we should be taking backups on downgrade (i.e if the feature flag is set to false and reconciliation is triggered) since the etcd doc suggests so in https://etcd.io/docs/v3.5/downgrades/downgrade_3_5/#server-downgrade-checklists
| // ImageKeyEtcdV3_5 is the key for the etcd v3.5 image in the image vector. | ||
| ImageKeyEtcdWrapperV3_5 = "etcd-wrapper-3_5" | ||
| // ImageKeyEtcdBackupRestoreV3_5 is the key for the etcd-backup-restore v3.5 image in the image vector. | ||
| ImageKeyEtcdBackupRestoreV3_5 = "etcd-backup-restore-3_5" |
There was a problem hiding this comment.
Just a thought: could we instead name it generic like ImageKeyEtcdWrapperNext so that in the future if we with to perform upgrades, we could just use the same mechanism to roll out upgrades in landscapes?
| stsReplicas: 3, | ||
| etcdReplicas: 0, | ||
| etcdWrapperImage: currentImage, | ||
| existingTasks: []*druidv1alpha1.EtcdOpsTask{buildPreSyncTask(preSyncTaskPrefixHibernation, 2, ptr.To(druidv1alpha1.TaskStateFailed))}, |
There was a problem hiding this comment.
Could we have this reference the constant instead like maxPreSyncRetries - 1? (same for other entries below as well)
| } | ||
| return nil | ||
| } | ||
| g.Eventually(checkFn).Within(timeout).WithPolling(pollInterval).WithContext(ctx).Should(BeNil()) |
There was a problem hiding this comment.
| g.Eventually(checkFn).Within(timeout).WithPolling(pollInterval).WithContext(ctx).Should(BeNil()) | |
| g.Eventually(checkFn).Within(timeout).WithPolling(pollInterval).WithContext(ctx).Should(Succeed()) |
| // This is now GA. Any attempt to disable this feature will be an error. | ||
| UseEtcdWrapper = "UseEtcdWrapper" | ||
|
|
||
| // UpgradeEtcdVersionTo is the name of the feature which enables upgrade of etcd version to v3.5. |
There was a problem hiding this comment.
| // UpgradeEtcdVersionTo is the name of the feature which enables upgrade of etcd version to v3.5. | |
| // UpgradeEtcdVersion is the name of the feature which enables upgrade of etcd version to v3.5. |
| UseEtcdWrapper = "UseEtcdWrapper" | ||
|
|
||
| // UpgradeEtcdVersionTo is the name of the feature which enables upgrade of etcd version to v3.5. | ||
| UpgradeEtcdVersion = "UpgradeEtcdVersion" |
There was a problem hiding this comment.
can you also update the doc here https://github.com/gardener/etcd-druid/blob/master/docs/deployment/feature-gates.md#feature-gates-for-alpha-or-beta-features
| sourceRepository: github.com/gardener/etcd-wrapper | ||
| repository: shreyas14/wrapper |
There was a problem hiding this comment.
adding this comment as reminder to update the image here.
| #TODO: @Shreyas-s14 Change the image to etcd-backup-restore image once PR is merged | ||
| - name: etcd-backup-restore-3_5 | ||
| resourceId: | ||
| name: 'etcdbrctl-3_5' | ||
| sourceRepository: github.com/gardener/etcd-backup-restore | ||
| repository: shreyas14/backup-restore |
|
/hold until etcd-wrapper and etcd-backup-restore release as image tag need to be updated in this PR. |
|
wording of release note doesn't seems to be correct: I think you're taking full snapshot before hibernation via EtcdOpsTask irrespective of feature gate. I think it better to mention this as separate release note. |
| if etcd.Spec.Replicas == 0 { | ||
| return r.ensurePreSyncSnapshot(ctx, etcd, preSyncTaskPrefixHibernation) | ||
| } |
There was a problem hiding this comment.
| if etcd.Spec.Replicas == 0 { | |
| return r.ensurePreSyncSnapshot(ctx, etcd, preSyncTaskPrefixHibernation) | |
| } | |
| if etcd.Spec.Replicas == 0 && *existingSts.Spec.Replicas > 0{ | |
| return r.ensurePreSyncSnapshot(ctx, etcd, preSyncTaskPrefixHibernation) | |
| } |
if we are taking a full snapshot before hibernation, isn't it make sense to also check that sts spec.replicas is not 0 yet, wdyt ?
How to categorize this PR?
/area disaster-recovery
/area control-plane
/area compliance
/kind enhancement
What this PR does / why we need it:
This PR does the below 2 things:
EtcdOpsTaskupon hibernation of the etcd resource (i.espec.Replicas=0) with a backoff of 3.EtcdOpsTaskupon etcd upgrade with a backoff of 3. This Etcd upgrade is carried out by:etcd-wrapperandetcd-backup-restoreand selecting the right one based on whether the feature-gate is set.Which issue(s) this PR fixes:
Fixes Partially #445
Checklist:
/docsfolder (if applicable)Special notes for your reviewer:
As of now, the PR for etcd-upgrade in etcd-wrapper and etcd-backup-restore have not been merged. Hence custom images have been used. Once they are merged, the images will be replaced with the official built ones.
Release note: