[CORE-13370] archival: Fix retention fallthrough behavior by Lazin · Pull Request #27913 · redpanda-data/redpanda

Lazin · 2025-10-06T16:49:44Z

In the ntp_archiver::apply_retention method we're not throwing exception if we failed to replicate the command that moves start offset forward. This leads to a situation when even if we failed to replicate the command the apply_gc and apply_spillover methods are invoked. This leads to a situation when spillover invariant is broken. The apply_spillover method expects that the start offset is aligned with the first segment in the STM manifest. If this is not the case the spillover manifest that it produces can't be applied.

This is what leads to this problem:

apply_retention fails to replicate the command due to timeout. The timeout could be caused by the high cpu load. The command is still replciated and applied asynchronously.
apply_gc invoked before the command that moves start offset is applied. The method does nothing since start offset is still aligned with the first segment in the manifest.
apply_spillover is invoked after the command is applied. Now the start offset is moved forward and is no longer aligned with the first segment in the manifest. The method creates a spillover manifest, uploads it, puts it to the cloud storage cache, and replicates the command that adds spill metadata to the manifest. The command fails because it can't pass the check in the manifest (the manifest expects that the base offset of the spill is equal to start offset.

Before #27714 was merged this could cause cloud storage cache to fill up. Now this can only create a single failed spillover attempt.

Backports Required

Release Notes

none

Copilot

Pull Request Overview

This PR fixes a race condition in the archival retention system where failed replication of start offset commands could lead to broken spillover invariants. The fix prevents subsequent operations from proceeding when the start offset update fails and adds an early validation check to ensure data consistency.

Throws an exception when start offset replication fails, preventing fallthrough to subsequent operations
Adds spillover invariant validation to ensure start offset aligns with the first segment
Fixes a typo in an error message from "offest" to "offset"

vbotbuildovich · 2025-10-06T18:31:04Z

Retry command for Build#73642

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_storage_scrubber_test.py::CloudStorageScrubberTest.test_scrubber@{"cloud_storage_type":2}
tests/rptest/tests/cloud_storage_scrubber_test.py::CloudStorageScrubberTest.test_scrubber@{"cloud_storage_type":1}

vbotbuildovich · 2025-10-06T19:02:24Z

CI test results

test results on build#73642

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
CloudStorageScrubberTest	test_scrubber	{"cloud_storage_type": 2}	integration	https://buildkite.com/redpanda/redpanda/builds/73642#0199ba8a-ba60-4d71-9a40-7bbd52eeea28	FLAKY	2/21	upstream reliability is '100.0'. current run reliability is '9.523809523809524'. drift is 90.47619 and the allowed drift is set to 50. The test should FAIL	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudStorageScrubberTest&test_method=test_scrubber
CloudStorageScrubberTest	test_scrubber	{"cloud_storage_type": 2}	integration	https://buildkite.com/redpanda/redpanda/builds/73642#0199ba8b-297a-4401-bd5d-c9ccc187a024	FLAKY	1/21	upstream reliability is '100.0'. current run reliability is '4.761904761904762'. drift is 95.2381 and the allowed drift is set to 50. The test should FAIL	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudStorageScrubberTest&test_method=test_scrubber
CloudStorageScrubberTest	test_scrubber	{"cloud_storage_type": 1}	integration	https://buildkite.com/redpanda/redpanda/builds/73642#0199ba8a-ba62-4431-8570-1e6951b29a86	FLAKY	2/21	upstream reliability is '100.0'. current run reliability is '9.523809523809524'. drift is 90.47619 and the allowed drift is set to 50. The test should FAIL	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudStorageScrubberTest&test_method=test_scrubber
CloudStorageScrubberTest	test_scrubber	{"cloud_storage_type": 1}	integration	https://buildkite.com/redpanda/redpanda/builds/73642#0199ba8b-297b-431e-b360-a363bf3fb157	FLAKY	1/21	upstream reliability is '100.0'. current run reliability is '4.761904761904762'. drift is 95.2381 and the allowed drift is set to 50. The test should FAIL	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudStorageScrubberTest&test_method=test_scrubber
DeleteRecordsTest	test_delete_records_concurrent_truncations	{"cloud_storage_enabled": true, "truncate_point": "random_offset"}	integration	https://buildkite.com/redpanda/redpanda/builds/73642#0199ba8a-ba6e-4ed7-bc6a-ed771284140e	FLAKY	15/21	upstream reliability is '100.0'. current run reliability is '71.42857142857143'. drift is 28.57143 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DeleteRecordsTest&test_method=test_delete_records_concurrent_truncations
ShardPlacementTest	test_core_count_change	null	integration	https://buildkite.com/redpanda/redpanda/builds/73642#0199ba8b-297b-431e-b360-a363bf3fb157	FLAKY	20/21	upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShardPlacementTest&test_method=test_core_count_change
WriteCachingFailureInjectionE2ETest	test_crash_all	{"use_transactions": false}	integration	https://buildkite.com/redpanda/redpanda/builds/73642#0199ba8b-2980-4598-9dbd-4aad69c91cec	FLAKY	19/21	upstream reliability is '83.9544513457557'. current run reliability is '90.47619047619048'. drift is -6.52174 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all
WriteCachingFailureInjectionTest	test_unavoidable_data_loss	null	integration	https://buildkite.com/redpanda/redpanda/builds/73642#0199ba8a-ba67-4b9f-b722-7d7a33ef482d	FLAKY	20/21	upstream reliability is '92.26277372262773'. current run reliability is '95.23809523809523'. drift is -2.97532 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionTest&test_method=test_unavoidable_data_loss

test results on build#75060

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
CloudStorageScrubberTest	test_scrubber	{"cloud_storage_type": 2}	integration	https://buildkite.com/redpanda/redpanda/builds/75060#019a2727-6076-46b9-91db-38f8419b10c5	FLAKY	5/21	upstream reliability is '100.0'. current run reliability is '23.809523809523807'. drift is 76.19048 and the allowed drift is set to 50. The test should FAIL	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudStorageScrubberTest&test_method=test_scrubber
CloudStorageScrubberTest	test_scrubber	{"cloud_storage_type": 2}	integration	https://buildkite.com/redpanda/redpanda/builds/75060#019a2727-f278-4262-be10-ae052b508a2e	FLAKY	5/21	upstream reliability is '100.0'. current run reliability is '23.809523809523807'. drift is 76.19048 and the allowed drift is set to 50. The test should FAIL	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudStorageScrubberTest&test_method=test_scrubber
CloudStorageScrubberTest	test_scrubber	{"cloud_storage_type": 1}	integration	https://buildkite.com/redpanda/redpanda/builds/75060#019a2727-6077-44c5-a5b7-3b0ff5f327d9	FLAKY	2/21	upstream reliability is '100.0'. current run reliability is '9.523809523809524'. drift is 90.47619 and the allowed drift is set to 50. The test should FAIL	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudStorageScrubberTest&test_method=test_scrubber
CloudStorageScrubberTest	test_scrubber	{"cloud_storage_type": 1}	integration	https://buildkite.com/redpanda/redpanda/builds/75060#019a2727-f279-42f6-b40f-c07110c49471	FLAKY	2/21	upstream reliability is '100.0'. current run reliability is '9.523809523809524'. drift is 90.47619 and the allowed drift is set to 50. The test should FAIL	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudStorageScrubberTest&test_method=test_scrubber
MountUnmountIcebergTest	test_simple_remount	{"cloud_storage_type": 1}	integration	https://buildkite.com/redpanda/redpanda/builds/75060#019a2727-607e-4269-84eb-25c38b7bc5c0	FLAKY	15/21	upstream reliability is '84.85523385300668'. current run reliability is '71.42857142857143'. drift is 13.42666 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MountUnmountIcebergTest&test_method=test_simple_remount
DeleteRecordsTest	test_delete_records_concurrent_truncations	{"cloud_storage_enabled": true, "truncate_point": "one_below_high_watermark"}	integration	https://buildkite.com/redpanda/redpanda/builds/75060#019a2727-f27e-4bec-b2df-2fa875947fc5	FLAKY	15/21	upstream reliability is '100.0'. current run reliability is '71.42857142857143'. drift is 28.57143 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DeleteRecordsTest&test_method=test_delete_records_concurrent_truncations
LogCompactionTxRemovalUpgradeTest	test_tx_control_batch_removal_with_upgrade	{"test_case_name": "All aborts"}	integration	https://buildkite.com/redpanda/redpanda/builds/75060#019a2727-f28e-4741-9274-0203a7e794f3	FLAKY	16/21	upstream reliability is '100.0'. current run reliability is '76.19047619047619'. drift is 23.80952 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=LogCompactionTxRemovalUpgradeTest&test_method=test_tx_control_batch_removal_with_upgrade
PartitionMovementTest	test_static	{"num_to_upgrade": 0}	integration	https://buildkite.com/redpanda/redpanda/builds/75060#019a2727-607d-4489-ace0-6d48d1e7b26a	FLAKY	15/21	upstream reliability is '88.66442199775533'. current run reliability is '71.42857142857143'. drift is 17.23585 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=PartitionMovementTest&test_method=test_static
TLSVersionTestRSA	test_change_version	{"version": 1}	integration	https://buildkite.com/redpanda/redpanda/builds/75060#019a2727-6078-4d5d-9e65-a17104ef22e8	FAIL	0/1		https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestRSA&test_method=test_change_version

oleiman

lgtm modulo log level 😄

one piece of context that's missing from the description is that the backgrounded replicate command holds an exclusive lock over the metadata STM lock as well, so short-circuiting housekeeping in this way guarantees that we won't call gc/spillover until the truncate command is replicated. Is that right? sorry if obvious.

andrwng · 2025-10-07T23:14:48Z

    const auto manifest_upload_timeout = _conf->manifest_upload_timeout();
    const auto manifest_upload_backoff = _conf->cloud_storage_initial_backoff();

+    // Check the spillover invariant.


Why isn't the fix to demote the archival_metadata_stm error to a warning? The issue is benign at this point, and expected due to races, right?

Isn't it better to eliminate the race?

The 'apply_spillover' method expects that the start offset field in the manifest is aligned with the first segment in the manifest. This commit explicitly checks this invariant. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

If apply_retention method failed to replicate the command it's not safe to run GC or spillover because the command could be actually replicated and applied asynchronously. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

WillemKauf · 2025-10-27T18:59:46Z

/ci-repeat 1

vbotbuildovich · 2025-10-27T20:22:49Z

Retry command for Build#75060

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/tls_version_test.py::TLSVersionTestRSA.test_change_version@{"version":1}
tests/rptest/tests/cloud_storage_scrubber_test.py::CloudStorageScrubberTest.test_scrubber@{"cloud_storage_type":1}
tests/rptest/tests/cloud_storage_scrubber_test.py::CloudStorageScrubberTest.test_scrubber@{"cloud_storage_type":2}

in scrubber test because the test explicitly violates consistency. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

The recovery subsystem in cloud_storage defines error_category function with exactly the same signature etc. This commit renames it so there is no name collision. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

oleiman

lgtm. what's the deal with the last commit?

Lazin · 2025-10-28T18:49:44Z

lgtm. what's the deal with the last commit?

it's unrelated, I stumbled on this when I was fixing the CI failure

vbotbuildovich · 2025-10-28T19:07:56Z

/backport v25.2.x

vbotbuildovich · 2025-10-28T19:07:57Z

/backport v25.1.x

dotnwat · 2025-10-28T20:35:20Z

-const std::error_category& error_category() noexcept {
-    static const recovery_error_category e;
+const std::error_category& recovery_error_category() noexcept {
+    static const recovery_error_category_type e;


@Lazin maybe we can do something like this in cloud topics?

Copilot AI review requested due to automatic review settings October 6, 2025 16:49

github-actions Bot added the area/redpanda label Oct 6, 2025

Copilot AI reviewed Oct 6, 2025

View reviewed changes

Lazin changed the title ~~archival: Fix retention fallthrough behavior~~ [CORE-1337] archival: Fix retention fallthrough behavior Oct 6, 2025

Lazin requested review from andrwng and oleiman October 6, 2025 16:50

oleiman reviewed Oct 6, 2025

View reviewed changes

Comment thread src/v/cluster/archival/ntp_archiver_service.cc

daisukebe changed the title ~~[CORE-1337] archival: Fix retention fallthrough behavior~~ [CORE-13370] archival: Fix retention fallthrough behavior Oct 7, 2025

andrwng reviewed Oct 7, 2025

View reviewed changes

Lazin added 2 commits October 27, 2025 11:00

archival: Enforce spillover invariant

97d6cc6

The 'apply_spillover' method expects that the start offset field in the manifest is aligned with the first segment in the manifest. This commit explicitly checks this invariant. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

archival: If throw in apply_retnetion

1c3acdc

If apply_retention method failed to replicate the command it's not safe to run GC or spillover because the command could be actually replicated and applied asynchronously. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

Lazin added 2 commits October 28, 2025 12:54

rptest: Allow spillover violation error message

0e6d0d8

in scrubber test because the test explicitly violates consistency. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

ts: Fix ODR violation

eed8a38

The recovery subsystem in cloud_storage defines error_category function with exactly the same signature etc. This commit renames it so there is no name collision. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

Lazin force-pushed the fix/retention-fall-through-behavior branch from 243af4f to eed8a38 Compare October 28, 2025 16:56

Lazin requested review from andrwng and oleiman October 28, 2025 16:57

oleiman approved these changes Oct 28, 2025

View reviewed changes

Comment thread tests/rptest/tests/cloud_storage_scrubber_test.py

Comment thread src/v/cloud_storage/recovery_errors.cc

WillemKauf approved these changes Oct 28, 2025

View reviewed changes

Lazin enabled auto-merge October 28, 2025 18:49

Lazin merged commit ec346a4 into redpanda-data:dev Oct 28, 2025
19 checks passed

This was referenced Oct 28, 2025

[v25.2.x] [CORE-13370] archival: Fix retention fallthrough behavior #28244

Merged

[v25.1.x] [CORE-13370] archival: Fix retention fallthrough behavior #28245

Merged

dotnwat reviewed Oct 28, 2025

View reviewed changes

Conversation

Lazin commented Oct 6, 2025

Backports Required

Release Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

vbotbuildovich commented Oct 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Retry command for Build#73642

Uh oh!

vbotbuildovich commented Oct 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CI test results

Uh oh!

oleiman left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

andrwng Oct 7, 2025

Choose a reason for hiding this comment

Uh oh!

Lazin Oct 15, 2025

Choose a reason for hiding this comment

Uh oh!

WillemKauf commented Oct 27, 2025

Uh oh!

vbotbuildovich commented Oct 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Retry command for Build#75060

Uh oh!

oleiman left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Lazin commented Oct 28, 2025

Uh oh!

Uh oh!

vbotbuildovich commented Oct 28, 2025

Uh oh!

vbotbuildovich commented Oct 28, 2025

Uh oh!

dotnwat Oct 28, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

vbotbuildovich commented Oct 6, 2025 •

edited

Loading

vbotbuildovich commented Oct 6, 2025 •

edited

Loading

vbotbuildovich commented Oct 27, 2025 •

edited

Loading