[CORE-14558] storage: prevent race between disk_log_appender and remove_prefix_full_segments() #28472
Conversation
Pull Request Overview
This PR adjusts the logging severity for batch write errors in the storage layer based on whether the error occurred during system shutdown. Errors during normal operation remain at ERROR level, while shutdown-related errors are downgraded to WARN to avoid alarming users about expected behavior during shutdown.
- Conditionally sets the log level to `WARN` for shutdown exceptions and `ERROR` for other exceptions
- Uses `ssx::is_shutdown_exception()` to distinguish shutdown-related errors
- Replaces the direct `stlog.error()` call with the level-conditional `vlogl()` macro
nvartolomei
left a comment
please see the comments first
@WillemKauf do you happen to know what gate is getting closed? I'm suspicious that we close storage stuff with work in-flight. In most cases we'd want this to be the other way around: cancel work, wait for cancellation, then shut down storage/segments in the correct order.
Force-pushed cc7bb40 to 1770323
@nvartolomei I believe it is the
I agree that perhaps the gate should be held/kept open by the appender if it is already in progress.
@WillemKauf storage is one of the first services to start, and hence I'd assume one of the last to stop. Something is leaking continuations, then? My concern is that they, in this case, get an exception on the gate, but what if instead they access some deallocated memory in other instances? I.e., when the continuation is scheduled slightly later.
Ah, I think I see it.
@WillemKauf this sounds like a potential bug? Should we ensure release of the appender is initiated before closing and takes a gate? Current behavior seems spaghetti 🤔
@nvartolomei yes, for sure this is spaghetti. Pushing an update soon; let me know what you think.
Force-pushed 1770323 to 2ef42a9
@nvartolomei I don't think we should force a log appender to hold a
Title changed: "storage: log batch write errors at WARN iff shutdown" → "storage: wrap maybe_roll_unlocked in as_future() in disk_log_appender"
nvartolomei
left a comment
From phone. Will have a better chance to look at this on Thursday at the earliest.
src/v/storage/disk_log_appender.cc (Outdated)
```diff
-    co_await _log.maybe_roll_unlocked(_last_term, _idx);
+    auto roll_fut = co_await ss::coroutine::as_future(
+      _log.maybe_roll_unlocked(_last_term, _idx));
+    if (roll_fut.failed()) {
```
This is a large catch-all. I'm not comfortable with a catch-all that continues as if nothing happened. If we expect a benign race, can't we handle it at a lower level? I.e., what if storage is indeed stopping (not sure it's possible today, but it may be another race introduced in the future)? Would it spin in this while loop forever?
If it helps, this race (between disk_log_appender and a segment close) has always existed; the only difference in recent history is that release_appender_in_background() can now throw, leading to this new, narrow race condition. The best thing to do here is to ignore the closing of the old segment and attempt to re-roll with the log.
This exceptional path is also not what typically saves us from a spin loop, nor is it the typical exit path, FYI.
And to be honest, I think a lot of this spaghettiness comes from the fact that the disk_log_appender directly interfaces with a segment obtained from the log instead of opaquely dealing with the log itself. It makes the object/behavior very difficult to reason about indeed.
I also need to stare at this a bit more, but I wonder if we should be treating the segment_roll_lock as a general lock around the last segment (regardless of whether it has an appender or not). In which case, maybe we should try to acquire the segment roll lock before calling _segs.pop_back() on the last segment (assuming that's what's happening here).
> but I wonder if we should be treating...
That feels pretty sane. It feels like there are a bunch of overlapping hoops serving as concurrency control in local storage and this gap isn't yet covered.
It also seems like, even if this roll were wrapped in a try/catch block, we may just end up throwing when attempting to append to a closed segment anyway in `append_batch_to_segment`.
I can try to prototype a solution which requires segment_roll_lock be held when popping back a segment from the log.
Hmm, actually, not sure this is the problem yet. Need to think about this more.
Does this commit 59dea3e pass the sniff test?
CI test results
- test results on build#76043
- test results on build#76173
- test results on build#76180
- test results on build#76227
Instead, log batch append errors in the `disk_log_appender` at `ERROR`.
A useful sanity check to prevent any potential concurrency issues.
Force-pushed 2ef42a9 to 1f5b520
Title changed: "storage: wrap maybe_roll_unlocked in as_future() in disk_log_appender" → "storage: prevent race between disk_log_appender and remove_prefix_full_segments()"
Force-pushed 1f5b520 to 9b01363
Force-pushed 8410e99 to 59dea3e
Force-pushed 59dea3e to 70fd3e4
/ci-repeat 1
```cpp
    // concurrency issues.
    std::optional<ssx::semaphore_units> seg_rolling_units;
    if (ptr->has_appender()) {
        seg_rolling_units = co_await _segments_rolling_lock.get_units();
```
if we happen to be waiting for log::close(), should we catch the broken semaphore exception here and quietly return?
We are currently relying on the thrown exception to break out of the ss::do_until() loop (unless we also make that a bit more intelligent: catching and silencing the exception without other changes would cause an infinite loop). The exception is eventually caught further up the chain (I think all the way in log_eviction_stm).
cool yeah, that's what i was asking. i didn't see anything else that would throw in this path at a glance, so this is basically new behavior? maybe it makes sense to add "closed" to the do_until condition instead, but idk.
Yeah, I agree that having an exceptional path here does seem like new behavior. It feels like similar behavior regardless of whether we catch and silently return on an exception here OR allow it to be thrown and bubble up the chain, but I'm still a little unsure.
Happy to get other opinions here and a sanity check on any potential side effects of this change (i.e. previously, if this were to race with a close() operation, it seems that we would still be able to prefix-truncate the entire log if necessary).
I think this is fine, because if we're racing with a close, the log will have still persisted the start offset before getting here, and subsequent usage should be quite limited given the log is closing.
Is this PR overall too complex, then? Hitting a vassert is bad, but if we have your PR to downgrade those to a throw, are we happy to accept that there may be concurrency issues during shutdown which are fairly benign after all?
oleiman
left a comment
looks pretty much legit to me. might want a second set of eyes
…ve()`...
...in `disk_log_impl`
Without any sort of safety guarantees around mutual exclusion
of these functions, we can easily end up with an assert failure like
```
ERROR 2025-11-13 15:50:46,593 [shard 0:main] assert - Assert failure: (src/v/storage/disk_log_impl.cc:2084) '!_closed' flush on closed log - {offsets: {start_offset:850, committed_offset:849, committed_offset_term:-9223372036854775808, dirty_offset:849, dirty_offset_term:-9223372036854775808}, is_closed: true, segments: [{size: 0, []}], config: {ntp: {test.log_builder/OOvVPTBb/0}, base_dir: /home/willem/.cache/bazel/_bazel_willem/6fc2d9663f4e6fbf4046ab3d95678674/sandbox/linux-sandbox/12837/execroot/_main/_tmp/efa90c48b51a675ce0846aa066568b84/test.dir_8IlPIe, overrides: nullptr, revision: 0, topic_revision: -9223372036854775808, remote_revision: 0}}
```
Due to the call to `flush()` in `truncate_prefix()`.
Wait for and break the `_segment_rewrite_lock()` in `disk_log_impl::close()`
and `remove()` to avoid this issue.
Force-pushed 70fd3e4 to 6059c5b
…x_full_segments()`

We will be calling `remove_segment_permanently()` on what is potentially the active segment in `remove_prefix_full_segments()`. If there is a concurrent append occurring, we may get into a race with the `disk_log_appender`, since we return the units in `lock_holder` (i.e. the segment's write lock) while the future for `remove_segment_permanently()` is still unresolved. In this case, we need to obtain units from `_segments_rolling_lock` to prevent any concurrency issues, which manifest as either a thrown exception due to gate closure or an inability to add batches to a closed `segment`.
For use in a non-seastar thread context.
A test which validates potential races between `disk_log_appender`, a `truncate_prefix()` operation, and a `close()` operation in a concurrent context. Without the previous two commits, this test would occasionally demonstrate assert failures around operations on closed logs.
Force-pushed 6059c5b to 799ff14
```cpp
    && !_eviction_monitor->promise.get_future().available()) {
        _eviction_monitor->promise.set_exception(segment_closed_exception());
    }
```
nit: not from this PR (maybe as a follow-up?), but let's move the "closing log" line up top, before we do any awaits (I can imagine this being extra hard to debug without realizing we're even in this call). And maybe update the text here to indicate that we've waited for locks.
Agreed, this is a good follow-up.
```cpp
    }

    ss::abort_source as;
    mutex log_mutex{"e2e_test::log_mutex"};
```
I'm a little confused about the purpose of the mutex here -- is it meant to emulate a specific call pattern (like certain calls must not overlap), or is it just another layer of delaying the calls in funky ways?
The mutex here is just to prevent the append fiber from trying to `make_appender` on a closed log, as well as to avoid a prefix truncation on a closed log (both of which `vassert()`, at least before your PR lands).
/backport v25.3.x
/backport v25.2.x
CI looks green, no new flakes or suspicious storage-related regressions. Merging. Let's see how these changes do in CI before finalizing backports.
Failed to create a backport PR to v25.2.x branch. I tried:
New Cover Letter
Here be dragons. Read the individual commit messages to understand the races and their fixes.
Old Cover Letter
(This only covers one of the discovered issues/race conditions)
This can race with a `segment::close()` operation and result in an exception being thrown. Or, I hypothesize, we could end up attempting to append to a closed segment, which would also be an exceptional case.
Fixes https://redpandadata.atlassian.net/browse/CORE-14558.
Backports Required
Release Notes
Bug Fixes
Fixed an issue in the `storage` layer that could lead to attempting to append to a closed segment or a failed assert