Skip to content

[CORE-13370] archival: Fix retention fallthrough behavior#27913

Merged
Lazin merged 4 commits into
redpanda-data:devfrom
Lazin:fix/retention-fall-through-behavior
Oct 28, 2025
Merged

[CORE-13370] archival: Fix retention fallthrough behavior#27913
Lazin merged 4 commits into
redpanda-data:devfrom
Lazin:fix/retention-fall-through-behavior

Conversation

@Lazin
Copy link
Copy Markdown
Contributor

@Lazin Lazin commented Oct 6, 2025

In the ntp_archiver::apply_retention method we're not throwing exception if we failed to replicate the command that moves start offset forward. This leads to a situation when even if we failed to replicate the command the apply_gc and apply_spillover methods are invoked. This leads to a situation when spillover invariant is broken. The apply_spillover method expects that the start offset is aligned with the first segment in the STM manifest. If this is not the case the spillover manifest that it produces can't be applied.

This is what leads to this problem:

  • apply_retention fails to replicate the command due to timeout. The timeout could be caused by the high cpu load. The command is still replciated and applied asynchronously.
  • apply_gc invoked before the command that moves start offset is applied. The method does nothing since start offset is still aligned with the first segment in the manifest.
  • apply_spillover is invoked after the command is applied. Now the start offset is moved forward and is no longer aligned with the first segment in the manifest. The method creates a spillover manifest, uploads it, puts it to the cloud storage cache, and replicates the command that adds spill metadata to the manifest. The command fails because it can't pass the check in the manifest (the manifest expects that the base offset of the spill is equal to start offset.

Before #27714 was merged this could cause cloud storage cache to fill up. Now this can only create a single failed spillover attempt.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

  • none

Copilot AI review requested due to automatic review settings October 6, 2025 16:49
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a race condition in the archival retention system where failed replication of start offset commands could lead to broken spillover invariants. The fix prevents subsequent operations from proceeding when the start offset update fails and adds an early validation check to ensure data consistency.

  • Throws an exception when start offset replication fails, preventing fallthrough to subsequent operations
  • Adds spillover invariant validation to ensure start offset aligns with the first segment
  • Fixes a typo in an error message from "offest" to "offset"

@Lazin Lazin changed the title archival: Fix retention fallthrough behavior [CORE-1337] archival: Fix retention fallthrough behavior Oct 6, 2025
@Lazin Lazin requested review from andrwng and oleiman October 6, 2025 16:50
@vbotbuildovich
Copy link
Copy Markdown
Collaborator

vbotbuildovich commented Oct 6, 2025

Retry command for Build#73642

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_storage_scrubber_test.py::CloudStorageScrubberTest.test_scrubber@{"cloud_storage_type":2}
tests/rptest/tests/cloud_storage_scrubber_test.py::CloudStorageScrubberTest.test_scrubber@{"cloud_storage_type":1}

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

vbotbuildovich commented Oct 6, 2025

CI test results

test results on build#73642
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
CloudStorageScrubberTest test_scrubber {"cloud_storage_type": 2} integration https://buildkite.com/redpanda/redpanda/builds/73642#0199ba8a-ba60-4d71-9a40-7bbd52eeea28 FLAKY 2/21 upstream reliability is '100.0'. current run reliability is '9.523809523809524'. drift is 90.47619 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudStorageScrubberTest&test_method=test_scrubber
CloudStorageScrubberTest test_scrubber {"cloud_storage_type": 2} integration https://buildkite.com/redpanda/redpanda/builds/73642#0199ba8b-297a-4401-bd5d-c9ccc187a024 FLAKY 1/21 upstream reliability is '100.0'. current run reliability is '4.761904761904762'. drift is 95.2381 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudStorageScrubberTest&test_method=test_scrubber
CloudStorageScrubberTest test_scrubber {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/73642#0199ba8a-ba62-4431-8570-1e6951b29a86 FLAKY 2/21 upstream reliability is '100.0'. current run reliability is '9.523809523809524'. drift is 90.47619 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudStorageScrubberTest&test_method=test_scrubber
CloudStorageScrubberTest test_scrubber {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/73642#0199ba8b-297b-431e-b360-a363bf3fb157 FLAKY 1/21 upstream reliability is '100.0'. current run reliability is '4.761904761904762'. drift is 95.2381 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudStorageScrubberTest&test_method=test_scrubber
DeleteRecordsTest test_delete_records_concurrent_truncations {"cloud_storage_enabled": true, "truncate_point": "random_offset"} integration https://buildkite.com/redpanda/redpanda/builds/73642#0199ba8a-ba6e-4ed7-bc6a-ed771284140e FLAKY 15/21 upstream reliability is '100.0'. current run reliability is '71.42857142857143'. drift is 28.57143 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DeleteRecordsTest&test_method=test_delete_records_concurrent_truncations
ShardPlacementTest test_core_count_change null integration https://buildkite.com/redpanda/redpanda/builds/73642#0199ba8b-297b-431e-b360-a363bf3fb157 FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShardPlacementTest&test_method=test_core_count_change
WriteCachingFailureInjectionE2ETest test_crash_all {"use_transactions": false} integration https://buildkite.com/redpanda/redpanda/builds/73642#0199ba8b-2980-4598-9dbd-4aad69c91cec FLAKY 19/21 upstream reliability is '83.9544513457557'. current run reliability is '90.47619047619048'. drift is -6.52174 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all
WriteCachingFailureInjectionTest test_unavoidable_data_loss null integration https://buildkite.com/redpanda/redpanda/builds/73642#0199ba8a-ba67-4b9f-b722-7d7a33ef482d FLAKY 20/21 upstream reliability is '92.26277372262773'. current run reliability is '95.23809523809523'. drift is -2.97532 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionTest&test_method=test_unavoidable_data_loss
test results on build#75060
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
CloudStorageScrubberTest test_scrubber {"cloud_storage_type": 2} integration https://buildkite.com/redpanda/redpanda/builds/75060#019a2727-6076-46b9-91db-38f8419b10c5 FLAKY 5/21 upstream reliability is '100.0'. current run reliability is '23.809523809523807'. drift is 76.19048 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudStorageScrubberTest&test_method=test_scrubber
CloudStorageScrubberTest test_scrubber {"cloud_storage_type": 2} integration https://buildkite.com/redpanda/redpanda/builds/75060#019a2727-f278-4262-be10-ae052b508a2e FLAKY 5/21 upstream reliability is '100.0'. current run reliability is '23.809523809523807'. drift is 76.19048 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudStorageScrubberTest&test_method=test_scrubber
CloudStorageScrubberTest test_scrubber {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/75060#019a2727-6077-44c5-a5b7-3b0ff5f327d9 FLAKY 2/21 upstream reliability is '100.0'. current run reliability is '9.523809523809524'. drift is 90.47619 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudStorageScrubberTest&test_method=test_scrubber
CloudStorageScrubberTest test_scrubber {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/75060#019a2727-f279-42f6-b40f-c07110c49471 FLAKY 2/21 upstream reliability is '100.0'. current run reliability is '9.523809523809524'. drift is 90.47619 and the allowed drift is set to 50. The test should FAIL https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudStorageScrubberTest&test_method=test_scrubber
MountUnmountIcebergTest test_simple_remount {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/75060#019a2727-607e-4269-84eb-25c38b7bc5c0 FLAKY 15/21 upstream reliability is '84.85523385300668'. current run reliability is '71.42857142857143'. drift is 13.42666 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MountUnmountIcebergTest&test_method=test_simple_remount
DeleteRecordsTest test_delete_records_concurrent_truncations {"cloud_storage_enabled": true, "truncate_point": "one_below_high_watermark"} integration https://buildkite.com/redpanda/redpanda/builds/75060#019a2727-f27e-4bec-b2df-2fa875947fc5 FLAKY 15/21 upstream reliability is '100.0'. current run reliability is '71.42857142857143'. drift is 28.57143 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DeleteRecordsTest&test_method=test_delete_records_concurrent_truncations
LogCompactionTxRemovalUpgradeTest test_tx_control_batch_removal_with_upgrade {"test_case_name": "All aborts"} integration https://buildkite.com/redpanda/redpanda/builds/75060#019a2727-f28e-4741-9274-0203a7e794f3 FLAKY 16/21 upstream reliability is '100.0'. current run reliability is '76.19047619047619'. drift is 23.80952 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=LogCompactionTxRemovalUpgradeTest&test_method=test_tx_control_batch_removal_with_upgrade
PartitionMovementTest test_static {"num_to_upgrade": 0} integration https://buildkite.com/redpanda/redpanda/builds/75060#019a2727-607d-4489-ace0-6d48d1e7b26a FLAKY 15/21 upstream reliability is '88.66442199775533'. current run reliability is '71.42857142857143'. drift is 17.23585 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=PartitionMovementTest&test_method=test_static
TLSVersionTestRSA test_change_version {"version": 1} integration https://buildkite.com/redpanda/redpanda/builds/75060#019a2727-6078-4d5d-9e65-a17104ef22e8 FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestRSA&test_method=test_change_version

Copy link
Copy Markdown
Member

@oleiman oleiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm modulo log level 😄

one piece of context that's missing from the description is that the backgrounded replicate command holds an exclusive lock over the metadata STM lock as well, so short-circuiting housekeeping in this way guarantees that we won't call gc/spillover until the truncate command is replicated. Is that right? sorry if obvious.

Comment thread src/v/cluster/archival/ntp_archiver_service.cc
@daisukebe daisukebe changed the title [CORE-1337] archival: Fix retention fallthrough behavior [CORE-13370] archival: Fix retention fallthrough behavior Oct 7, 2025
const auto manifest_upload_timeout = _conf->manifest_upload_timeout();
const auto manifest_upload_backoff = _conf->cloud_storage_initial_backoff();

// Check the spillover invariant.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't the fix to demote the archival_metadata_stm error to a warning? The issue is benign at this point, and expected due to races, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it better to eliminate the race?

Lazin added 2 commits October 27, 2025 11:00
The 'apply_spillover' method expects that the start offset field in the
manifest is aligned with the first segment in the manifest.
This commit explicitly checks this invariant.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
If apply_retention method failed to replicate the command it's not safe
to run GC or spillover because the command could be actually replicated
and applied asynchronously.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
@WillemKauf
Copy link
Copy Markdown
Contributor

/ci-repeat 1

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

vbotbuildovich commented Oct 27, 2025

Retry command for Build#75060

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/tls_version_test.py::TLSVersionTestRSA.test_change_version@{"version":1}
tests/rptest/tests/cloud_storage_scrubber_test.py::CloudStorageScrubberTest.test_scrubber@{"cloud_storage_type":1}
tests/rptest/tests/cloud_storage_scrubber_test.py::CloudStorageScrubberTest.test_scrubber@{"cloud_storage_type":2}

Lazin added 2 commits October 28, 2025 12:54
in scrubber test because the test explicitly violates consistency.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
The recovery subsystem in cloud_storage defines error_category function
with exactly the same signature etc. This commit renames it so there is
no name collision.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
@Lazin Lazin force-pushed the fix/retention-fall-through-behavior branch from 243af4f to eed8a38 Compare October 28, 2025 16:56
@Lazin Lazin requested review from andrwng and oleiman October 28, 2025 16:57
Copy link
Copy Markdown
Member

@oleiman oleiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. what's the deal with the last commit?

Comment thread tests/rptest/tests/cloud_storage_scrubber_test.py
Comment thread src/v/cloud_storage/recovery_errors.cc
@Lazin
Copy link
Copy Markdown
Contributor Author

Lazin commented Oct 28, 2025

lgtm. what's the deal with the last commit?

it's unrelated, I stumbled on this when I was fixing the CI failure

@Lazin Lazin enabled auto-merge October 28, 2025 18:49
@Lazin Lazin merged commit ec346a4 into redpanda-data:dev Oct 28, 2025
19 checks passed
@vbotbuildovich
Copy link
Copy Markdown
Collaborator

/backport v25.2.x

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

/backport v25.1.x

Comment on lines -40 to +41
const std::error_category& error_category() noexcept {
static const recovery_error_category e;
const std::error_category& recovery_error_category() noexcept {
static const recovery_error_category_type e;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lazin maybe we can do something like this in cloud topics?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants