Skip to content

Fix backup-finalizer: do not set backup phase to Completed before PutBackupMetadata succeeds#9646

Open
kaovilai wants to merge 2 commits intovmware-tanzu:mainfrom
kaovilai:fix/backup-finalizer-premature-phase-completion
Open

Fix backup-finalizer: do not set backup phase to Completed before PutBackupMetadata succeeds#9646
kaovilai wants to merge 2 commits intovmware-tanzu:mainfrom
kaovilai:fix/backup-finalizer-premature-phase-completion

Conversation

@kaovilai
Copy link
Copy Markdown
Collaborator

Summary

Fixes #9645

  • Moves backup.Status.Phase assignment from before PutBackupMetadata/PutBackupContents to after both succeed
  • Uses backup.DeepCopy() to encode JSON with the final phase for object storage upload, while keeping the in-memory backup at Finalizing phase until uploads complete
  • Adds unit tests covering PutBackupMetadata and PutBackupContents failure scenarios for both Finalizing and FinalizingPartiallyFailed phases

Caveats

  • CompletionTimestamp: Now captured before upload but only committed to the API server after upload succeeds. On retry after a transient failure, a new timestamp is generated, so the completion time reflects when the upload finally succeeded rather than when finalization processing completed.
  • Metrics: RegisterBackupSuccess/RegisterBackupPartialFailure are now recorded after uploads succeed, so they accurately reflect only fully persisted backups. Previously, metrics could report success even when the backup wasn't fully persisted.
  • Object storage metadata: The uploaded JSON contains the final phase and completion timestamp via DeepCopy, so storage state is correct even before the API server is patched.

Test plan

  • New test: TestBackupFinalizerReconcile_PutBackupMetadataFail — verifies phase stays Finalizing/FinalizingPartiallyFailed when PutBackupMetadata fails
  • New test: TestBackupFinalizerReconcile_PutBackupContentsFail — verifies phase stays Finalizing when PutBackupContents fails
  • Both tests assert CompletionTimestamp remains nil on failure
  • Existing tests pass unchanged

🤖 Generated with Claude Code via Happy

Copilot AI review requested due to automatic review settings March 24, 2026 22:20
kaovilai added a commit to kaovilai/velero that referenced this pull request Mar 24, 2026
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
kaovilai and others added 2 commits March 24, 2026 18:20
…BackupMetadata succeeds

Previously, the backup finalizer controller set backup.Status.Phase to
Completed/PartiallyFailed in-memory BEFORE calling PutBackupMetadata and
PutBackupContents. When these uploads failed (e.g., due to object lock
or immutability), the deferred patch function still wrote the terminal
phase to the Kubernetes API server, preventing the controller from
retrying the upload on the next reconcile.

This fix moves the phase assignment to AFTER both uploads succeed. A
DeepCopy of the backup is used to encode the JSON with the final phase
for object storage, while the in-memory backup object retains the
Finalizing phase until uploads complete.

Caveats:
- CompletionTimestamp is now captured before upload but only committed to
  the API server after upload succeeds. On retry after a transient
  failure, a new timestamp is generated, so the completion time reflects
  when the upload finally succeeded rather than when finalization
  processing completed.
- Metrics (RegisterBackupSuccess/RegisterBackupPartialFailure) are now
  recorded after uploads succeed, so they accurately reflect only fully
  persisted backups.
- The metadata uploaded to object storage contains the final phase and
  completion timestamp via DeepCopy, so storage state is correct even
  before the API server is patched.

Fixes vmware-tanzu#9645

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@kaovilai kaovilai force-pushed the fix/backup-finalizer-premature-phase-completion branch from 7601b10 to cd86689 Compare March 24, 2026 22:20
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts backup finalization so backups aren’t marked terminal (Completed/PartiallyFailed) in the API server until persistence to object storage has succeeded, preventing “completed but not persisted” states.

Changes:

  • Defers setting backup.Status.Phase / CompletionTimestamp until after persistence calls succeed, while uploading a deep-copied “final” backup object to object storage.
  • Moves metric registration to occur only after persistence succeeds.
  • Adds unit tests for PutBackupMetadata and PutBackupContents failure paths.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
pkg/controller/backup_finalizer_controller.go Defers terminal status + metrics until after uploads; uploads a deep-copied “final” backup JSON to object storage.
pkg/controller/backup_finalizer_controller_test.go Adds unit tests for failure scenarios of metadata/content uploads and asserts phase/timestamps remain non-terminal on failure.
changelogs/unreleased/9646-kaovilai Adds changelog entry describing the fix.
Comments suppressed due to low confidence (1)

pkg/controller/backup_finalizer_controller.go:1

  • This uploads terminal backup metadata (Completed/PartiallyFailed + CompletionTimestamp) to object storage before PutBackupContents runs. If PutBackupContents fails, object storage can end up claiming the backup is terminal while final contents were not uploaded, which undermines the goal of only persisting terminal state after all persistence work succeeds. Consider making the persistence atomic from the reader’s perspective by either (a) uploading contents first, then uploading metadata with the final phase/timestamp, or (b) uploading metadata in the current (Finalizing*) phase first and only re-uploading metadata with the terminal phase after PutBackupContents succeeds.
/*

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +206 to +226
// Determine the final phase and completion timestamp, but do NOT set them
// on the in-memory backup object yet. We first need to upload metadata and
// contents to object storage. If the upload fails, the deferred patch must
// NOT write a terminal phase to the API server so the controller can retry.
var finalPhase velerov1api.BackupPhase
switch backup.Status.Phase {
case velerov1api.BackupPhaseFinalizing:
backup.Status.Phase = velerov1api.BackupPhaseCompleted
r.metrics.RegisterBackupSuccess(backupScheduleName)
r.metrics.RegisterBackupLastStatus(backupScheduleName, metrics.BackupLastStatusSucc)
finalPhase = velerov1api.BackupPhaseCompleted
case velerov1api.BackupPhaseFinalizingPartiallyFailed:
backup.Status.Phase = velerov1api.BackupPhasePartiallyFailed
r.metrics.RegisterBackupPartialFailure(backupScheduleName)
r.metrics.RegisterBackupLastStatus(backupScheduleName, metrics.BackupLastStatusFailure)
finalPhase = velerov1api.BackupPhasePartiallyFailed
}

backup.Status.CompletionTimestamp = &metav1.Time{Time: r.clock.Now()}
backup.Status.CSIVolumeSnapshotsCompleted = updateCSIVolumeSnapshotsCompleted(operations)
completionTimestamp := &metav1.Time{Time: r.clock.Now()}
csiVolumeSnapshotsCompleted := updateCSIVolumeSnapshotsCompleted(operations)

recordBackupMetrics(log, backup, outBackupFile, r.metrics, true)
// Encode backup JSON with the final phase for object storage, so that the
// metadata in storage reflects the completed state.
backupForUpload := backup.DeepCopy()
backupForUpload.Status.Phase = finalPhase
backupForUpload.Status.CompletionTimestamp = completionTimestamp
backupForUpload.Status.CSIVolumeSnapshotsCompleted = csiVolumeSnapshotsCompleted
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finalPhase has no default assignment and is used unconditionally. If backup.Status.Phase is not one of the two handled values in this switch on this code path, finalPhase will be the zero value and could be uploaded/patched as an empty phase. A concrete fix is to add an explicit default/guard (e.g., return early) or initialize finalPhase from backup.Status.Phase and only override it for the handled Finalizing* phases.

Copilot uses AI. Check for mistakes.
localBackupStore.On("PutBackupContents", mock.Anything, mock.Anything).Return(fmt.Errorf("object lock prevented upload"))
localBackupStore.On("GetBackupVolumeInfos", mock.Anything).Return(nil, nil)
localBackupStore.On("PutBackupVolumeInfos", mock.Anything, mock.Anything).Return(nil)
backupper.On("FinalizeBackup", mock.Anything, mock.Anything, mock.Anything, mock.Anything, framework.BackupItemActionResolverV2{}, mock.Anything, mock.Anything).Return(nil)
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mock expectation hard-codes framework.BackupItemActionResolverV2{} as an exact argument match, which is brittle if the reconciler passes a non-zero resolver value. Prefer matching that parameter with mock.Anything or mock.MatchedBy(...) so the test only asserts what it actually cares about (that FinalizeBackup is called and returns nil).

Suggested change
backupper.On("FinalizeBackup", mock.Anything, mock.Anything, mock.Anything, mock.Anything, framework.BackupItemActionResolverV2{}, mock.Anything, mock.Anything).Return(nil)
backupper.On("FinalizeBackup", mock.Anything, mock.Anything, mock.Anything, mock.Anything, mock.Anything, mock.Anything, mock.Anything).Return(nil)

Copilot uses AI. Check for mistakes.
Comment on lines +320 to +329
func TestBackupFinalizerReconcile_PutBackupContentsFail(t *testing.T) {
fakeClock := testclocks.NewFakeClock(time.Now())
metav1Now := metav1.NewTime(fakeClock.Now())
defaultBackupLocation := builder.ForBackupStorageLocation(velerov1api.DefaultNamespace, "default").Result()

backup := builder.ForBackup(velerov1api.DefaultNamespace, "backup-contents-fail").
StorageLocation("default").
ObjectMeta(builder.WithUID("foo")).
StartTimestamp(fakeClock.Now()).
Phase(velerov1api.BackupPhaseFinalizing).Result()
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new reconciler logic computes a finalPhase for both Finalizing and FinalizingPartiallyFailed, but the PutBackupContents failure test only covers Finalizing. Add a table-driven variant (similar to TestBackupFinalizerReconcile_PutBackupMetadataFail) that also exercises the FinalizingPartiallyFailed starting phase and asserts the phase/timestamp remain unchanged on PutBackupContents failure.

Copilot uses AI. Check for mistakes.
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 95.23810% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 60.98%. Comparing base (a5391e1) to head (cd86689).

Files with missing lines Patch % Lines
pkg/controller/backup_finalizer_controller.go 95.23% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9646      +/-   ##
==========================================
+ Coverage   60.95%   60.98%   +0.02%     
==========================================
  Files         387      387              
  Lines       36647    36658      +11     
==========================================
+ Hits        22340    22355      +15     
+ Misses      12692    12690       -2     
+ Partials     1615     1613       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

backup-finalizer: do not set backup phase to Completed in-memory before PutBackupMetadata succeeds

2 participants