[CORE-13370] archival: Fence spillover command by Lazin · Pull Request #27714 · redpanda-data/redpanda

Lazin · 2025-09-24T18:36:34Z

Replicate spillover command with the fence to avoid races.

Backports Required

Release Notes

none

oleiman

looks reasonable. is there a link to an issue we can add or a high level description of the situation that motivated the change? or why we didn't do it before, or whatever.

Copilot

Pull Request Overview

This PR refactors the archival system to use a fenced spillover command approach to prevent race conditions. The change centralizes fence creation logic and adds fencing support specifically to the spillover operation.

- Introduces a new `emit_rw_fence()` method to centralize fence creation logic
- Refactors multiple methods to use the centralized fence creation instead of duplicated code
- Adds proper fencing to the spillover command with fence reset between iterations
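
To illustrate the idea of a fenced command, here is a minimal sketch. It is a toy model, not Redpanda's archival STM: `toy_stm`, `make_fence`, and `demo_race` are hypothetical names, and the assumption that a fence is simply the state offset the command was computed against is mine, not something the PR states. `make_fence` plays the role the overview attributes to the centralized fence helper.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Toy stand-in for a replicated state machine (assumption, not the real STM).
struct toy_stm {
    std::int64_t applied_offset = 0;   // bumped on every applied command
    std::vector<std::string> applied;  // names of the applied commands

    // Apply `cmd` only if the caller's fence still matches the current state.
    bool replicate(const std::string& cmd, std::optional<std::int64_t> fence) {
        if (fence && *fence != applied_offset) {
            return false;  // state moved since the fence was taken: reject
        }
        applied.push_back(cmd);
        ++applied_offset;
        return true;
    }
};

// Hypothetical single place where fences are created, so that every caller
// captures the state in the same way.
inline std::int64_t make_fence(const toy_stm& stm) {
    return stm.applied_offset;
}

// A command built against stale state is rejected; a fresh fence succeeds.
inline void demo_race() {
    toy_stm stm;
    const std::int64_t stale = make_fence(stm);
    assert(stm.replicate("concurrent-update", std::nullopt));
    assert(!stm.replicate("spillover", stale));
    // Take a new fence for the next attempt (the "fence reset" step).
    assert(stm.replicate("spillover", make_fence(stm)));
}
```

In this model an unfenced command is always applied, which is how a command computed from an outdated view of the state could slip through.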

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
src/v/cluster/archival/ntp_archiver_service.h	Declares new `emit_rw_fence()` method for centralized fence creation
src/v/cluster/archival/ntp_archiver_service.cc	Implements `emit_rw_fence()` and refactors multiple methods to use it, adds fencing to spillover operations

src/v/cluster/archival/ntp_archiver_service.cc

Lazin · 2025-09-25T14:37:25Z

> looks reasonable. is there a link to an issue we can add or a high level description of the situation that motivated the change? or why we didn't do it before, or whatever.

I think we ignored it before because it's replicated in a loop. The issue is that on CI it could also attempt to upload/replicate a spillover manifest which is slightly off in some cases.

vbotbuildovich · 2025-09-25T16:19:29Z

CI test results

test results on build#72948

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
ShadowLinkingReplicationTests	test_replication_basic	{"shuffle_leadership": false, "source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}}	integration	https://buildkite.com/redpanda/redpanda/builds/72948#01998113-7dfc-45a0-854a-84d474bb1dde	FLAKY	19/21	upstream reliability is '100.0'. current run reliability is '90.47619047619048'. drift is 9.52381 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic
ShadowLinkingReplicationTests	test_replication_basic	{"shuffle_leadership": true, "source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}}	integration	https://buildkite.com/redpanda/redpanda/builds/72948#0199810c-27e9-4126-ae01-455ee8374f3b	FLAKY	13/21	upstream reliability is '97.03389830508475'. current run reliability is '61.904761904761905'. drift is 35.12914 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic
ClusterRateQuotaTest	test_client_group_consume_rate_throttle_mechanism	null	integration	https://buildkite.com/redpanda/redpanda/builds/72948#01998113-7df1-4ad7-bd6b-030a62faba91	FLAKY	14/21	upstream reliability is '81.31229235880399'. current run reliability is '66.66666666666666'. drift is 14.64563 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_group_consume_rate_throttle_mechanism
ClusterRateQuotaTest	test_client_group_produce_rate_throttle_mechanism	null	integration	https://buildkite.com/redpanda/redpanda/builds/72948#01998113-7df2-41cb-8bd7-b7db1febc310	FLAKY	17/21	upstream reliability is '83.00518134715026'. current run reliability is '80.95238095238095'. drift is 2.0528 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_group_produce_rate_throttle_mechanism
ClusterRateQuotaTest	test_client_response_throttle_mechanism	null	integration	https://buildkite.com/redpanda/redpanda/builds/72948#01998113-7df8-4ecf-8db5-ea3f420f4e18	FLAKY	12/21	upstream reliability is '81.82527301092044'. current run reliability is '57.14285714285714'. drift is 24.68242 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_throttle_mechanism
ClusterRateQuotaTest	test_client_response_throttle_mechanism_applies_to_next_request	null	integration	https://buildkite.com/redpanda/redpanda/builds/72948#0199810c-27f1-43e1-a0a5-7f70d2ebb2be	FAIL	0/1		https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_throttle_mechanism_applies_to_next_request
PartitionBalancerTest	test_rack_awareness	null	integration	https://buildkite.com/redpanda/redpanda/builds/72948#01998113-7df3-405b-bb92-75d65aa92d3b	FLAKY	20/21	upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=PartitionBalancerTest&test_method=test_rack_awareness
RandomNodeOperationsTest	test_node_operations	{"cloud_storage_type": 1, "compaction_mode": "chunked_sliding_window", "enable_failures": false, "mixed_versions": true, "with_iceberg": false}	integration	https://buildkite.com/redpanda/redpanda/builds/72948#01998113-7df2-41cb-8bd7-b7db1febc310	FLAKY	20/21	upstream reliability is '99.51923076923077'. current run reliability is '95.23809523809523'. drift is 4.28114 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RandomNodeOperationsTest&test_method=test_node_operations
RecoveryModeTest	test_rolling_restart	null	integration	https://buildkite.com/redpanda/redpanda/builds/72948#01998113-7df1-4ad7-bd6b-030a62faba91	FLAKY	17/21	upstream reliability is '94.36860068259386'. current run reliability is '80.95238095238095'. drift is 13.41622 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RecoveryModeTest&test_method=test_rolling_restart

test results on build#73057

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
MasterTestSuite	test_replica_pair_frequency		unit	https://buildkite.com/redpanda/redpanda/builds/73057#0199862d-8f59-415f-a4bb-7e63eda8a9fb	FAIL	0/1
ShadowLinkingReplicationTests	test_replication_basic	{"shuffle_leadership": false, "source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}}	integration	https://buildkite.com/redpanda/redpanda/builds/73057#01998688-a51a-4120-99be-8f68479e23e8	FLAKY	19/21	upstream reliability is '100.0'. current run reliability is '90.47619047619048'. drift is 9.52381 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic
ShadowLinkingReplicationTests	test_replication_basic	{"shuffle_leadership": true, "source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}}	integration	https://buildkite.com/redpanda/redpanda/builds/73057#0199865d-61a2-49ed-9e94-8cab3ec781c8	FLAKY	9/21	upstream reliability is '88.51351351351352'. current run reliability is '42.857142857142854'. drift is 45.65637 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic
ControllerLogLimitMirrorMakerTests	test_mirror_maker_with_limits	null	integration	https://buildkite.com/redpanda/redpanda/builds/73057#01998688-a520-4034-89e1-ed87b1cfc150	FLAKY	20/21	upstream reliability is '99.21259842519686'. current run reliability is '95.23809523809523'. drift is 3.9745 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ControllerLogLimitMirrorMakerTests&test_method=test_mirror_maker_with_limits

src/v/cluster/archival/ntp_archiver_service.cc

Also, extract fence initialization into a method in the ntp_archiver to avoid code duplication.

There is a change in the control flow in the 'apply_spillover' method. Previously, the spillover loop wouldn't stop on a replication error, so the error was repeated. The loop would use the manifest to create a spillover manifest and replicate the command with the archival STM. The replicate method waits until the command is applied and propagates the error back to the loop. On error, the error was printed and the loop continued. Since the state of the manifest didn't change, the loop would produce the same manifest and the same command, causing a new failure.

This commit breaks out of the loop if the spillover command can't be applied. This guarantees forward progress.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
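
The control-flow change described in the commit message can be sketched as follows. This is a simplified model under stated assumptions, not the code in `ntp_archiver_service.cc`: `toy_manifest`, `replicate_fn`, `apply_spillover_sketch`, and the `max_attempts` cap are hypothetical and exist only so that the old and new behaviour can be compared side by side.

```cpp
#include <cassert>
#include <functional>

// Toy manifest: counts the chunks still eligible for spillover (assumption).
struct toy_manifest {
    int pending = 0;
};

// Stands in for "replicate the command and wait until it is applied".
// Receives the state the command was built from; returns true on success.
using replicate_fn = std::function<bool(int)>;

// Returns the number of replication attempts made.
inline int apply_spillover_sketch(
  toy_manifest& m,
  const replicate_fn& replicate,
  bool break_on_error,
  int max_attempts) {
    assert(max_attempts > 0);  // the cap only bounds the sketch
    int attempts = 0;
    while (m.pending > 0 && attempts < max_attempts) {
        ++attempts;
        // The command is derived from the current manifest state.
        if (!replicate(m.pending)) {
            if (break_on_error) {
                break;  // behaviour described in the commit message
            }
            // Earlier behaviour: the state is unchanged, so the next
            // iteration rebuilds the same command and fails again.
            continue;
        }
        --m.pending;  // the state advances only when the command is applied
    }
    return attempts;
}
```

With a `replicate` callback that always fails, the `break_on_error = false` variant uses every attempt the cap allows without changing `pending`, while the `break_on_error = true` variant stops after the first failure.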

oleiman

lgtm

vbotbuildovich · 2025-09-26T22:20:08Z

/backport v25.2.x

vbotbuildovich · 2025-09-26T22:20:09Z

/backport v25.1.x

vbotbuildovich · 2025-09-26T22:20:10Z

/backport v24.3.x

vbotbuildovich · 2025-09-26T22:21:12Z

Failed to create a backport PR to v25.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-27714-v25.1.x-416 remotes/upstream/v25.1.x
git cherry-pick -x 35dc6d6630

Workflow run logs.

vbotbuildovich · 2025-09-26T22:21:13Z

Failed to create a backport PR to v25.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-27714-v25.2.x-908 remotes/upstream/v25.2.x
git cherry-pick -x 35dc6d6630

Workflow run logs.

vbotbuildovich · 2025-09-26T22:21:19Z

Failed to create a backport PR to v24.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-27714-v24.3.x-25 remotes/upstream/v24.3.x
git cherry-pick -x 35dc6d6630

Workflow run logs.

Lazin requested review from Copilot and oleiman and removed request for Copilot September 24, 2025 18:36

github-actions bot added the area/redpanda label Sep 24, 2025

oleiman previously approved these changes Sep 24, 2025

Lazin dismissed oleiman’s stale review via b05f01c September 25, 2025 12:32

Lazin force-pushed the fix/spillover-cmd branch from e7399d9 to b05f01c Compare September 25, 2025 12:32

Copilot AI review requested due to automatic review settings September 25, 2025 12:32

Copilot AI reviewed Sep 25, 2025

src/v/cluster/archival/ntp_archiver_service.cc

Lazin requested a review from oleiman September 25, 2025 14:35

oleiman reviewed Sep 25, 2025

src/v/cluster/archival/ntp_archiver_service.cc

Lazin force-pushed the fix/spillover-cmd branch from b05f01c to 35dc6d6 Compare September 26, 2025 13:17

Lazin requested a review from oleiman September 26, 2025 13:18

oleiman approved these changes Sep 26, 2025

Lazin merged commit 4d18c2e into redpanda-data:dev Sep 26, 2025
17 checks passed

This was referenced Sep 26, 2025

[v25.2.x] [CORE-13370] archival: Fence spillover command #27782

Open

[v25.1.x] [CORE-13370] archival: Fence spillover command #27781

Open

vbotbuildovich mentioned this pull request Sep 26, 2025

[v24.3.x] [CORE-13370] archival: Fence spillover command #27783

Open

This was referenced Oct 3, 2025

[v25.2.x] [CORE-13370] archival: Fence spillover command #27889

Merged

[v25.1.x] [CORE-13370] archival: Fence spillover command #27890

Merged

[CORE-13370] archival: Fix retention fallthrough behavior #27913

Merged

dotnwat requested review from andrwng and nvartolomei October 6, 2025 17:38

[CORE-13370] archival: Fence spillover command #27714
Lazin merged 1 commit into redpanda-data:dev from Lazin:fix/spillover-cmd
