Fix backup-finalizer: do not set backup phase to Completed before PutBackupMetadata succeeds#9646
Conversation
Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>
…BackupMetadata succeeds Previously, the backup finalizer controller set backup.Status.Phase to Completed/PartiallyFailed in-memory BEFORE calling PutBackupMetadata and PutBackupContents. When these uploads failed (e.g., due to object lock or immutability), the deferred patch function still wrote the terminal phase to the Kubernetes API server, preventing the controller from retrying the upload on the next reconcile. This fix moves the phase assignment to AFTER both uploads succeed. A DeepCopy of the backup is used to encode the JSON with the final phase for object storage, while the in-memory backup object retains the Finalizing phase until uploads complete. Caveats: - CompletionTimestamp is now captured before upload but only committed to the API server after upload succeeds. On retry after a transient failure, a new timestamp is generated, so the completion time reflects when the upload finally succeeded rather than when finalization processing completed. - Metrics (RegisterBackupSuccess/RegisterBackupPartialFailure) are now recorded after uploads succeed, so they accurately reflect only fully persisted backups. - The metadata uploaded to object storage contains the final phase and completion timestamp via DeepCopy, so storage state is correct even before the API server is patched. Fixes vmware-tanzu#9645 Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering> Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering> Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
7601b10 to
cd86689
Compare
There was a problem hiding this comment.
Pull request overview
Adjusts backup finalization so backups aren’t marked terminal (Completed/PartiallyFailed) in the API server until persistence to object storage has succeeded, preventing “completed but not persisted” states.
Changes:
- Defers setting
backup.Status.Phase/CompletionTimestampuntil after persistence calls succeed, while uploading a deep-copied “final” backup object to object storage. - Moves metric registration to occur only after persistence succeeds.
- Adds unit tests for
PutBackupMetadataandPutBackupContentsfailure paths.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/controller/backup_finalizer_controller.go | Defers terminal status + metrics until after uploads; uploads a deep-copied “final” backup JSON to object storage. |
| pkg/controller/backup_finalizer_controller_test.go | Adds unit tests for failure scenarios of metadata/content uploads and asserts phase/timestamps remain non-terminal on failure. |
| changelogs/unreleased/9646-kaovilai | Adds changelog entry describing the fix. |
Comments suppressed due to low confidence (1)
pkg/controller/backup_finalizer_controller.go:1
- This uploads terminal backup metadata (Completed/PartiallyFailed + CompletionTimestamp) to object storage before
PutBackupContentsruns. IfPutBackupContentsfails, object storage can end up claiming the backup is terminal while final contents were not uploaded, which undermines the goal of only persisting terminal state after all persistence work succeeds. Consider making the persistence atomic from the reader’s perspective by either (a) uploading contents first, then uploading metadata with the final phase/timestamp, or (b) uploading metadata in the current (Finalizing*) phase first and only re-uploading metadata with the terminal phase afterPutBackupContentssucceeds.
/*
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Determine the final phase and completion timestamp, but do NOT set them | ||
| // on the in-memory backup object yet. We first need to upload metadata and | ||
| // contents to object storage. If the upload fails, the deferred patch must | ||
| // NOT write a terminal phase to the API server so the controller can retry. | ||
| var finalPhase velerov1api.BackupPhase | ||
| switch backup.Status.Phase { | ||
| case velerov1api.BackupPhaseFinalizing: | ||
| backup.Status.Phase = velerov1api.BackupPhaseCompleted | ||
| r.metrics.RegisterBackupSuccess(backupScheduleName) | ||
| r.metrics.RegisterBackupLastStatus(backupScheduleName, metrics.BackupLastStatusSucc) | ||
| finalPhase = velerov1api.BackupPhaseCompleted | ||
| case velerov1api.BackupPhaseFinalizingPartiallyFailed: | ||
| backup.Status.Phase = velerov1api.BackupPhasePartiallyFailed | ||
| r.metrics.RegisterBackupPartialFailure(backupScheduleName) | ||
| r.metrics.RegisterBackupLastStatus(backupScheduleName, metrics.BackupLastStatusFailure) | ||
| finalPhase = velerov1api.BackupPhasePartiallyFailed | ||
| } | ||
|
|
||
| backup.Status.CompletionTimestamp = &metav1.Time{Time: r.clock.Now()} | ||
| backup.Status.CSIVolumeSnapshotsCompleted = updateCSIVolumeSnapshotsCompleted(operations) | ||
| completionTimestamp := &metav1.Time{Time: r.clock.Now()} | ||
| csiVolumeSnapshotsCompleted := updateCSIVolumeSnapshotsCompleted(operations) | ||
|
|
||
| recordBackupMetrics(log, backup, outBackupFile, r.metrics, true) | ||
| // Encode backup JSON with the final phase for object storage, so that the | ||
| // metadata in storage reflects the completed state. | ||
| backupForUpload := backup.DeepCopy() | ||
| backupForUpload.Status.Phase = finalPhase | ||
| backupForUpload.Status.CompletionTimestamp = completionTimestamp | ||
| backupForUpload.Status.CSIVolumeSnapshotsCompleted = csiVolumeSnapshotsCompleted |
There was a problem hiding this comment.
finalPhase has no default assignment and is used unconditionally. If backup.Status.Phase is not one of the two handled values in this switch on this code path, finalPhase will be the zero value and could be uploaded/patched as an empty phase. A concrete fix is to add an explicit default/guard (e.g., return early) or initialize finalPhase from backup.Status.Phase and only override it for the handled Finalizing* phases.
| localBackupStore.On("PutBackupContents", mock.Anything, mock.Anything).Return(fmt.Errorf("object lock prevented upload")) | ||
| localBackupStore.On("GetBackupVolumeInfos", mock.Anything).Return(nil, nil) | ||
| localBackupStore.On("PutBackupVolumeInfos", mock.Anything, mock.Anything).Return(nil) | ||
| backupper.On("FinalizeBackup", mock.Anything, mock.Anything, mock.Anything, mock.Anything, framework.BackupItemActionResolverV2{}, mock.Anything, mock.Anything).Return(nil) |
There was a problem hiding this comment.
This mock expectation hard-codes framework.BackupItemActionResolverV2{} as an exact argument match, which is brittle if the reconciler passes a non-zero resolver value. Prefer matching that parameter with mock.Anything or mock.MatchedBy(...) so the test only asserts what it actually cares about (that FinalizeBackup is called and returns nil).
| backupper.On("FinalizeBackup", mock.Anything, mock.Anything, mock.Anything, mock.Anything, framework.BackupItemActionResolverV2{}, mock.Anything, mock.Anything).Return(nil) | |
| backupper.On("FinalizeBackup", mock.Anything, mock.Anything, mock.Anything, mock.Anything, mock.Anything, mock.Anything, mock.Anything).Return(nil) |
| func TestBackupFinalizerReconcile_PutBackupContentsFail(t *testing.T) { | ||
| fakeClock := testclocks.NewFakeClock(time.Now()) | ||
| metav1Now := metav1.NewTime(fakeClock.Now()) | ||
| defaultBackupLocation := builder.ForBackupStorageLocation(velerov1api.DefaultNamespace, "default").Result() | ||
|
|
||
| backup := builder.ForBackup(velerov1api.DefaultNamespace, "backup-contents-fail"). | ||
| StorageLocation("default"). | ||
| ObjectMeta(builder.WithUID("foo")). | ||
| StartTimestamp(fakeClock.Now()). | ||
| Phase(velerov1api.BackupPhaseFinalizing).Result() |
There was a problem hiding this comment.
The new reconciler logic computes a finalPhase for both Finalizing and FinalizingPartiallyFailed, but the PutBackupContents failure test only covers Finalizing. Add a table-driven variant (similar to TestBackupFinalizerReconcile_PutBackupMetadataFail) that also exercises the FinalizingPartiallyFailed starting phase and asserts the phase/timestamp remain unchanged on PutBackupContents failure.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #9646 +/- ##
==========================================
+ Coverage 60.95% 60.98% +0.02%
==========================================
Files 387 387
Lines 36647 36658 +11
==========================================
+ Hits 22340 22355 +15
+ Misses 12692 12690 -2
+ Partials 1615 1613 -2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Summary
Fixes #9645
backup.Status.Phaseassignment from beforePutBackupMetadata/PutBackupContentsto after both succeedbackup.DeepCopy()to encode JSON with the final phase for object storage upload, while keeping the in-memory backup atFinalizingphase until uploads completePutBackupMetadataandPutBackupContentsfailure scenarios for bothFinalizingandFinalizingPartiallyFailedphasesCaveats
RegisterBackupSuccess/RegisterBackupPartialFailureare now recorded after uploads succeed, so they accurately reflect only fully persisted backups. Previously, metrics could report success even when the backup wasn't fully persisted.DeepCopy, so storage state is correct even before the API server is patched.Test plan
TestBackupFinalizerReconcile_PutBackupMetadataFail— verifies phase staysFinalizing/FinalizingPartiallyFailedwhenPutBackupMetadatafailsTestBackupFinalizerReconcile_PutBackupContentsFail— verifies phase staysFinalizingwhenPutBackupContentsfailsCompletionTimestampremains nil on failure🤖 Generated with Claude Code via Happy