[CORE-14466] ts: Fix spillover race #28281

Merged
Lazin merged 3 commits into redpanda-data:dev from Lazin:fix/downgrade-error-to-warn
Oct 30, 2025

Contributor

@Lazin Lazin commented Oct 30, 2025

This PR

  • Downgrades error to warning.
  • Interrupts housekeeping if replication of an archival STM command times out. This prevents a race in which the replication call times out, yet the command is still replicated and then applied asynchronously.
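
The race described in the second bullet can be illustrated with a small toy model. This is only a sketch, not the Redpanda code: `ToyStm`, `replicate_with_timeout`, the offsets, and the delays are all hypothetical, and Python's `asyncio` stands in for the real asynchronous runtime. It shows that a timeout reported to the caller does not mean the command was dropped; the state can still change afterwards.

```python
import asyncio


class ToyStm:
    """Hypothetical stand-in for the archival STM; holds one metadata field."""

    def __init__(self):
        self.start_offset = 0

    async def replicate_and_apply(self, new_start_offset, delay):
        # Simulate replication that only completes after `delay` seconds.
        await asyncio.sleep(delay)
        self.start_offset = new_start_offset


async def replicate_with_timeout(stm, new_start_offset, delay, timeout):
    """Wait up to `timeout` seconds; the command keeps running after a timeout."""
    task = asyncio.ensure_future(stm.replicate_and_apply(new_start_offset, delay))
    try:
        # shield() protects the in-flight command from the caller's timeout.
        await asyncio.wait_for(asyncio.shield(task), timeout)
        return True, task
    except asyncio.TimeoutError:
        return False, task


async def demo():
    stm = ToyStm()
    ok, task = await replicate_with_timeout(stm, 100, delay=0.05, timeout=0.01)
    seen_at_timeout = stm.start_offset  # caller still sees the old state
    await task                          # the command is applied later anyway
    seen_later = stm.start_offset
    return ok, seen_at_timeout, seen_later


def run_demo():
    return asyncio.run(demo())
```

In this model `run_demo()` reports a failed replication, the old offset at the moment of the timeout, and the new offset shortly after. Any housekeeping step that continued after the timeout would be working from state that is about to change underneath it, which is the situation the PR avoids by interrupting housekeeping.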

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

  • none

Lazin added 3 commits October 30, 2025 07:34
Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
If the metadata is not replicated or applied in time we shouldn't
proceed to spillover.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
If archive retention fails due to replication timeout fail the whole
housekeeping round.
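
A minimal sketch of the behaviour these two commit messages describe, under stated assumptions: the step names, their order, and the `ReplicationTimeout` exception are hypothetical and do not come from the Redpanda sources. The point is only that a timeout raised by one step propagates out of the round, so later steps such as spillover never run.

```python
class ReplicationTimeout(Exception):
    """Hypothetical error: an STM command was not confirmed in time."""


def make_step(name, log, times_out=False):
    """Build a toy housekeeping step that records itself or times out."""
    def step():
        if times_out:
            raise ReplicationTimeout(f"{name}: replication timed out")
        log.append(name)
    return step


def run_housekeeping_round(steps):
    """Run steps in order; the first timeout fails the whole round."""
    for step in steps:
        step()  # a ReplicationTimeout propagates and skips the remaining steps


def demo(retention_times_out):
    log = []
    # Assumed order, for illustration only.
    steps = [
        make_step("archive_retention", log, times_out=retention_times_out),
        make_step("archive_gc", log),
        make_step("spillover", log),
    ]
    try:
        run_housekeeping_round(steps)
        failed = False
    except ReplicationTimeout:
        failed = True
    return failed, log
```

With `demo(True)` the round fails and no step is recorded as completed; with `demo(False)` all three steps run in order.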

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
Contributor

Copilot AI left a comment

Pull Request Overview

This PR addresses race conditions in the spillover handling of the NTP archiver service. It downgrades a spillover invariant violation from an error to a warning and throws an exception when an archive operation fails. These changes help prevent races that occur when replication of an archival STM command times out but the command is still applied asynchronously.

Key Changes

  • Downgrades spillover invariant violation log from error to warning level
  • Adds exception throwing for failed archive retention and garbage collection operations
  • Improves error handling consistency across archival operations

@vbotbuildovich
Collaborator

CI test results

test results on build #75314

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| LogCompactionTxRemovalTest | test_tx_control_batch_removal | null | integration | https://buildkite.com/redpanda/redpanda/builds/75314#019a3532-fce7-42fb-b9af-aac611c9ed5f | FLAKY | 14/21 | upstream reliability is '95.38043478260869'. current run reliability is '66.66666666666666'. drift is 28.71377 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=LogCompactionTxRemovalTest&test_method=test_tx_control_batch_removal |
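
The numbers in the reason column can be reproduced with simple arithmetic. The sketch below assumes that the drift is the upstream reliability minus the current run's pass rate, and that a run is tolerated while the drift stays below the allowed value. Both rules are inferred from the figures in the report, not taken from the bot's source, and the function names are hypothetical.

```python
def reliability_drift(upstream_pct, passed, total):
    """Return (current pass rate in percent, drift from upstream)."""
    current_pct = 100.0 * passed / total
    return current_pct, upstream_pct - current_pct


def verdict(drift, allowed_drift):
    # Assumption: the test is tolerated while drift is below the allowed drift.
    return "PASS" if drift < allowed_drift else "FAIL"


current, drift = reliability_drift(95.38043478260869, passed=14, total=21)
result = verdict(drift, allowed_drift=50)
```

Here 14 passes out of 21 runs gives a pass rate of about 66.67%, the drift rounds to 28.71377, and since that is below 50 the verdict under this assumed rule is PASS, matching the report.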

@dotnwat dotnwat requested review from andrwng and oleiman October 30, 2025 17:51
Member

@oleiman oleiman left a comment

lgtm

@Lazin Lazin merged commit 4fc9cc5 into redpanda-data:dev Oct 30, 2025
18 checks passed
@vbotbuildovich
Collaborator

/backport v25.2.x

@vbotbuildovich
Collaborator

/backport v25.1.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v25.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-28281-v25.1.x-806 remotes/upstream/v25.1.x
git cherry-pick -x 4e5630a2e0 a6058cc31e c85416a4da

Workflow run logs.

4 participants