[CORE-13370] archival: Fix retention fallthrough behavior #27913
Conversation
Pull Request Overview
This PR fixes a race condition in the archival retention system where failed replication of start offset commands could lead to broken spillover invariants. The fix prevents subsequent operations from proceeding when the start offset update fails and adds an early validation check to ensure data consistency.
- Throws an exception when start offset replication fails, preventing fallthrough to subsequent operations
- Adds spillover invariant validation to ensure start offset aligns with the first segment
- Fixes a typo in an error message from "offest" to "offset"
oleiman
left a comment
lgtm modulo log level 😄
one piece of context that's missing from the description is that the backgrounded replicate command holds an exclusive lock over the metadata STM lock as well, so short-circuiting housekeeping in this way guarantees that we won't call gc/spillover until the truncate command is replicated. Is that right? sorry if obvious.
const auto manifest_upload_timeout = _conf->manifest_upload_timeout();
const auto manifest_upload_backoff = _conf->cloud_storage_initial_backoff();
// Check the spillover invariant.
Why isn't the fix to demote the archival_metadata_stm error to a warning? The issue is benign at this point, and expected due to races, right?
Isn't it better to eliminate the race?
The 'apply_spillover' method expects that the start offset field in the manifest is aligned with the first segment in the manifest. This commit explicitly checks this invariant. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
If the apply_retention method fails to replicate the command, it's not safe to run GC or spillover because the command could still be replicated and applied asynchronously. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
/ci-repeat 1
in scrubber test because the test explicitly violates consistency. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
The recovery subsystem in cloud_storage defines an error_category function with exactly the same signature etc. This commit renames it so there is no name collision. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
Force-pushed from 243af4f to eed8a38
oleiman
left a comment
lgtm. what's the deal with the last commit?
it's unrelated, I stumbled on this when I was fixing the CI failure
/backport v25.2.x
/backport v25.1.x
- const std::error_category& error_category() noexcept {
-     static const recovery_error_category e;
+ const std::error_category& recovery_error_category() noexcept {
+     static const recovery_error_category_type e;
@Lazin maybe we can do something like this in cloud topics?
In the ntp_archiver::apply_retention method we're not throwing an exception if we fail to replicate the command that moves the start offset forward. As a result, even when replication fails, the apply_gc and apply_spillover methods are still invoked, which can break the spillover invariant. The apply_spillover method expects the start offset to be aligned with the first segment in the STM manifest; if it is not, the spillover manifest it produces can't be applied. This is what leads to the problem:

1. apply_retention fails to replicate the command due to a timeout. The timeout could be caused by high CPU load. The command is still replicated and applied asynchronously.
2. apply_gc is invoked before the command that moves the start offset is applied. The method does nothing, since the start offset is still aligned with the first segment in the manifest.
3. apply_spillover is invoked after the command is applied. Now the start offset has moved forward and is no longer aligned with the first segment in the manifest. The method creates a spillover manifest, uploads it, puts it into the cloud storage cache, and replicates the command that adds the spill metadata to the manifest. That command fails because it can't pass the check in the manifest (the manifest expects the base offset of the spill to equal the start offset).

Before #27714 was merged this could cause the cloud storage cache to fill up. Now it can only create a single failed spillover attempt.
Backports Required
Release Notes