[CORE-13370] archival: Fence spillover command#27714
Conversation
oleiman
left a comment
There was a problem hiding this comment.
looks reasonable. is there a link to an issue we can add or a high level description of the situation that motivated the change? or why we didn't do it before, or whatever.
e7399d9 to
b05f01c
Compare
There was a problem hiding this comment.
Pull Request Overview
This PR refactors the archival system to use a fenced spillover command approach to prevent race conditions. The change centralizes fence creation logic and adds fencing support specifically to the spillover operation.
- Introduces a new
emit_rw_fence()method to centralize fence creation logic - Refactors multiple methods to use the centralized fence creation instead of duplicated code
- Adds proper fencing to the spillover command with fence reset between iterations
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/v/cluster/archival/ntp_archiver_service.h | Declares new emit_rw_fence() method for centralized fence creation |
| src/v/cluster/archival/ntp_archiver_service.cc | Implements emit_rw_fence() and refactors multiple methods to use it, adds fencing to spillover operations |
I think we ignored it before because it's replicated in a loop. The issue is that on CI it could also attempt to upload/replicates a spillover manifest which is slightly off in some cases. |
CI test resultstest results on build#72948
test results on build#73057
|
Also, extract fence initialization into a method in the ntp_archiver to avoid code duplication. There is a change in the control flow in the 'apply_spillover' method. Previously, the spillover wouldn't stop in case of replication error causing the error to be repeated. The loop would use manifest to create a spillover manifest and replicate the command with archival STM. The replicate method waits until the command is applied and propagates the error back to the loop. In case of error the error was printed and the loop continued. Since the state of the manifest didn't change the loop would produce the same manifesta and the same command causing new failure. This commit breaks if the spillover command can't be applied. This guarantees forward progress. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
b05f01c to
35dc6d6
Compare
|
/backport v25.2.x |
|
/backport v25.1.x |
|
/backport v24.3.x |
|
Failed to create a backport PR to v25.1.x branch. I tried: |
|
Failed to create a backport PR to v25.2.x branch. I tried: |
|
Failed to create a backport PR to v24.3.x branch. I tried: |
Replicate spillover command with the fence to avoid races.
Backports Required
Release Notes