
[CORE-14558] storage: prevent race between disk_log_appender and remove_prefix_full_segments()#28472

Merged
WillemKauf merged 6 commits intoredpanda-data:devfrom
WillemKauf:storage_batch_write_error_logging
Nov 14, 2025

Conversation

@WillemKauf
Contributor

@WillemKauf WillemKauf commented Nov 11, 2025

New Cover Letter

Here be dragons. Read individual commit messages to understand races and their fixes

Old Cover Letter

(This only covers one of the discovered issues/race conditions)

The `disk_log_appender` can race with a `segment::close()` operation (triggered here by `remove_prefix_full_segments()`) and result in an exception being thrown:

1. F1: 13:12:39,356 [shard 0:clus] storage - disk_log_impl.cc:3236 - Removing "/var/lib/redpanda/data/redpanda/controller/0_0/0-1-v1.log" (remove_prefix_full_segments...)
2. F2: 13:12:39,356 [shard 0:rr  ] storage - disk_log_impl.cc:2077 - creating log appender for: {redpanda/controller/0}, next offset: 24, log offsets: {start_offset:24, committed_offset:23, committed_offset_term:1, dirty_offset:23, dirty_offset_term:1}
3. F1: 13:12:39,357 [shard 0:clus] storage - disk_log_impl.cc:3249 - Segment has outstanding locks. Might take a while to close:/var/lib/redpanda/data/redpanda/controller/0_0/0-1-v1.log
4. F2: _log.maybe_roll_unlocked() in disk_log_appender
5. F1: TRACE 2025-11-06 13:12:39,357 [shard 0:clus] storage - segment.cc:94 - closing segment: {offset_tracker:{term:1, base_offset:0, committed_offset:23, stable_offset:23, dirty_offset:23}, compacted_segment=0, finished_self_compaction=0, finished_windowed_compaction=0, generation=48, reader={/var/lib/redpanda/data/redpanda/controller/0_0/0-1-v1.log
6. F1: _gate in segment 0-1-v1.log is closed as a result of the above
7. F2: We have an appender, so maybe_roll_unlocked() goes into segment::release_appender()
8. F2: We are forced into release_appender_in_background() due to held segment locks
9. F2: _gate.check(); in release_appender_in_background() throws.

Or, I hypothesize, we could end up attempting to append to a closed segment, which would also be an exceptional case.

Fixes https://redpandadata.atlassian.net/browse/CORE-14558.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

Bug Fixes

  • Prevent a slew of races between operations in the storage layer that could lead to attempting to append to a closed segment or a failed assert

Contributor

Copilot AI left a comment

Pull Request Overview

This PR adjusts the logging severity for batch write errors in the storage layer based on whether the error occurred during system shutdown. Errors during normal operation remain at ERROR level, while shutdown-related errors are downgraded to WARN to avoid alarming users about expected behavior during shutdown.

  • Conditionally sets log level to WARN for shutdown exceptions and ERROR for other exceptions
  • Uses ssx::is_shutdown_exception() to distinguish shutdown-related errors
  • Replaces direct stlog.error() call with level-conditional vlogl() macro

nvartolomei
nvartolomei previously approved these changes Nov 11, 2025
Contributor

@nvartolomei nvartolomei left a comment

please see the comments first

@nvartolomei
Contributor

@WillemKauf do you happen to know which gate is getting closed? I'm suspicious that we close storage objects with work in flight. In most cases we'd want this the other way around: cancel work, wait for cancellation, then shut down storage/segments in the correct order.

@WillemKauf
Contributor Author

@nvartolomei I believe it is the log_manager's _gate, since we call into log_manager::make_log_segment() from disk_log_impl::new_segment() (from disk_log_impl::maybe_roll_unlocked()) which seems to be the only pathway in the relevant try/catch block here.

@WillemKauf
Contributor Author

I agree that perhaps the gate should be held/kept open by the appender if it is already in progress.

@nvartolomei
Contributor

nvartolomei commented Nov 11, 2025

@WillemKauf storage is one of the first services to start, and hence I'd assume one of the last to stop. Is something leaking continuations, then? My concern is that in this case they get an exception on the gate, but what if instead they access deallocated memory in other instances, i.e. when the continuation is scheduled slightly later?

@WillemKauf
Contributor Author

WillemKauf commented Nov 11, 2025

@nvartolomei It looks like, from the CI failure logs, this was a concurrent partition movement of controller/0 and append requests leading to the ERROR log line.

@WillemKauf
Contributor Author

WillemKauf commented Nov 11, 2025

Ah, I think I see it.

1. F1: 13:12:39,356 [shard 0:clus] storage - disk_log_impl.cc:3236 - Removing "/var/lib/redpanda/data/redpanda/controller/0_0/0-1-v1.log" (remove_prefix_full_segments...)
2. F2: 13:12:39,356 [shard 0:rr  ] storage - disk_log_impl.cc:2077 - creating log appender for: {redpanda/controller/0}, next offset: 24, log offsets: {start_offset:24, committed_offset:23, committed_offset_term:1, dirty_offset:23, dirty_offset_term:1}
3. F1: 13:12:39,357 [shard 0:clus] storage - disk_log_impl.cc:3249 - Segment has outstanding locks. Might take a while to close:/var/lib/redpanda/data/redpanda/controller/0_0/0-1-v1.log
4. F2: _log.maybe_roll_unlocked() in disk_log_appender
5. F1: TRACE 2025-11-06 13:12:39,357 [shard 0:clus] storage - segment.cc:94 - closing segment: {offset_tracker:{term:1, base_offset:0, committed_offset:23, stable_offset:23, dirty_offset:23}, compacted_segment=0, finished_self_compaction=0, finished_windowed_compaction=0, generation=48, reader={/var/lib/redpanda/data/redpanda/controller/0_0/0-1-v1.log
6. F1: _gate in segment 0-1-v1.log is closed as a result of the above
7. F2: We have an appender, so maybe_roll_unlocked() goes into segment::release_appender()
8. F2: We are forced into release_appender_in_background() due to held segment locks
9. F2: _gate.check(); in release_appender_in_background() throws.

@nvartolomei
Contributor

@WillemKauf this sounds like a potential bug? Should we ensure releasing the appender is initiated before closing, and takes a gate? Current behavior seems spaghetti 🤔

@WillemKauf
Contributor Author

@nvartolomei yes, for sure this is spaghetti. Pushing an update soon; let me know what you think.

@WillemKauf WillemKauf force-pushed the storage_batch_write_error_logging branch from 1770323 to 2ef42a9 Compare November 11, 2025 17:39
@WillemKauf
Contributor Author

@nvartolomei I don't think we should force a log appender to hold a segment's gate. We should allow segments to be closed and dispatch any batch appends to a new segment the log provides us with.

@WillemKauf WillemKauf changed the title [CORE-14558] storage: log batch write errors at WARN iff shutdown [CORE-14558] storage: wrap maybe_roll_unlocked in as_future() in disk_log_appender Nov 11, 2025
Contributor

@nvartolomei nvartolomei left a comment

From phone. Will have a better chance to look at this on Thursday the earliest.

```diff
- co_await _log.maybe_roll_unlocked(_last_term, _idx);
+ auto roll_fut = co_await ss::coroutine::as_future(
+   _log.maybe_roll_unlocked(_last_term, _idx));
+ if (roll_fut.failed()) {
```
Contributor

This is a large catch-all. I'm not comfortable with a catch-all that continues as if nothing happened. If we expect a benign race, can't we handle it at a lower level? I.e., what if storage is indeed stopping (not sure it's possible today, but it may be another race introduced in the future)? Would it spin in this while loop forever?

Contributor Author

If it helps, this race (between disk_log_appender and a segment close) has always existed; the only difference in recent history is that release_appender_in_background() can now throw, leading to this new, narrow race condition. The best thing to do here is to ignore the closing of the old segment and attempt to re-roll with the log.

This exceptional path is also not what typically saves us from a spin-loop, nor is it the typical exit path, FYI.

Contributor Author

And to be honest, I think a lot of this spaghettiness comes from the fact the disk_log_appender directly interfaces with a segment obtained from the log instead of opaquely dealing with the log itself. It makes the object/behavior very difficult to reason about indeed.

Contributor

I also need to stare at this a bit more, but I wonder if we should be treating the segment_roll_lock as a general lock around the last segment (regardless of whether it has an appender or not), in which case, maybe we should try to acquire the segment roll lock before calling _segs.pop_back() on the last segment (assuming that's what's happening here)

Contributor Author

but I wonder if we should be treating...

That feels pretty sane. It feels like there are a bunch of overlapping hoops serving as concurrency control in local storage and this gap isn't yet covered.

It also seems like, even if this roll were to be wrapped in a try/catch block, we may just end up throwing when attempting to append to a closed segment anyway in `append_batch_to_segment`.

I can try to prototype a solution which requires segment_roll_lock be held when popping back a segment from the log.

Contributor Author

Hmm, actually, not sure this is the problem yet. Need to think about this more.

Contributor Author

Does this commit 59dea3e pass the sniff test?

@vbotbuildovich
Collaborator

vbotbuildovich commented Nov 11, 2025

CI test results

test results on build#76043
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
ShadowLinkingReplicationTests test_replication_basic {"shuffle_leadership": true, "source_cluster_spec": {"cluster_type": "redpanda"}} integration https://buildkite.com/redpanda/redpanda/builds/76043#019a7434-4e74-4c26-bd96-90ad39601145 FLAKY 18/21 upstream reliability is '90.12256669069934'. current run reliability is '85.71428571428571'. drift is 4.40828 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic
MountUnmountIcebergTest test_simple_remount {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/76043#019a7434-4e6e-492e-a41d-e45c5e84b60f FLAKY 14/21 upstream reliability is '88.74773139745916'. current run reliability is '66.66666666666666'. drift is 22.08106 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MountUnmountIcebergTest&test_method=test_simple_remount
SegmentMsTest test_segment_rolling_with_retention_consumer null integration https://buildkite.com/redpanda/redpanda/builds/76043#019a7434-4e70-46c4-b515-30095b06a3be FLAKY 19/21 upstream reliability is '93.28028293545535'. current run reliability is '90.47619047619048'. drift is 2.80409 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SegmentMsTest&test_method=test_segment_rolling_with_retention_consumer
WriteCachingFailureInjectionTest test_unavoidable_data_loss null integration https://buildkite.com/redpanda/redpanda/builds/76043#019a7434-4e74-4c26-bd96-90ad39601145 FLAKY 19/21 upstream reliability is '94.77911646586345'. current run reliability is '90.47619047619048'. drift is 4.30293 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionTest&test_method=test_unavoidable_data_loss
test results on build#76173
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
SegmentMsTest test_segment_rolling_with_retention_consumer null integration https://buildkite.com/redpanda/redpanda/builds/76173#019a7aaa-134c-4400-87ac-d5b7801649da FLAKY 20/21 upstream reliability is '92.67241379310344'. current run reliability is '95.23809523809523'. drift is -2.56568 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SegmentMsTest&test_method=test_segment_rolling_with_retention_consumer
simple_fair_scheduling_policy_fixture test_happy_path unit https://buildkite.com/redpanda/redpanda/builds/76173#019a7a36-3eff-4f69-837a-be4fe7ef25b1 FAIL 0/1
src/v/storage/tests/segment_appender_rpbench_test src/v/storage/tests/segment_appender_rpbench_test unit https://buildkite.com/redpanda/redpanda/builds/76173#019a7a36-3f01-4fab-96ef-6445f584e2b5 FAIL 0/1
test results on build#76180
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
WriteCachingFailureInjectionTest test_unavoidable_data_loss null integration https://buildkite.com/redpanda/redpanda/builds/76180#019a7b1e-0307-4f10-a9bc-87fb5358a61a FLAKY 20/21 upstream reliability is '94.93201483312733'. current run reliability is '95.23809523809523'. drift is -0.30608 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionTest&test_method=test_unavoidable_data_loss
test results on build#76227
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
ReplicatedMetastoreTest TestBasicRemoveTopics unit https://buildkite.com/redpanda/redpanda/builds/76227#019a7e2a-82af-42bc-98a9-b871394d48d7 FAIL 0/1
MountUnmountIcebergTest test_simple_remount {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/76227#019a7ed2-ee61-421e-80d0-717cf4cc0e98 FLAKY 16/21 upstream reliability is '83.94160583941606'. current run reliability is '76.19047619047619'. drift is 7.75113 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MountUnmountIcebergTest&test_method=test_simple_remount
DatalakeThrottlingTest test_backlog_metric {"catalog_type": "rest_hadoop", "cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/76227#019a7ed2-ee61-421e-80d0-717cf4cc0e98 FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeThrottlingTest&test_method=test_backlog_metric
SegmentMsTest test_segment_rolling_with_retention_consumer null integration https://buildkite.com/redpanda/redpanda/builds/76227#019a7ed2-ee6d-4838-846b-62f300952c8d FLAKY 20/21 upstream reliability is '92.65785609397945'. current run reliability is '95.23809523809523'. drift is -2.58024 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SegmentMsTest&test_method=test_segment_rolling_with_retention_consumer
WriteCachingFailureInjectionE2ETest test_crash_all {"use_transactions": false} integration https://buildkite.com/redpanda/redpanda/builds/76227#019a7ed2-ee6f-4ac4-8351-97155fd6af80 FLAKY 16/21 upstream reliability is '91.0295616717635'. current run reliability is '76.19047619047619'. drift is 14.83909 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all
src/v/storage/mvlog/tests/file_test src/v/storage/mvlog/tests/file_test unit https://buildkite.com/redpanda/redpanda/builds/76227#019a7e8b-8e76-49dd-8a3b-8aabc6f8bbf5 FAIL 0/1

Instead, log batch append errors in the `disk_log_appender` at `ERROR`.
A useful sanity check to prevent any potential concurrency issues.
@WillemKauf WillemKauf force-pushed the storage_batch_write_error_logging branch from 2ef42a9 to 1f5b520 Compare November 12, 2025 18:52
@WillemKauf WillemKauf changed the title [CORE-14558] storage: wrap maybe_roll_unlocked in as_future() in disk_log_appender [CORE-14558] storage: prevent race between disk_log_appender and remove_prefix_full_segments() Nov 12, 2025
@WillemKauf WillemKauf changed the title [CORE-14558] storage: prevent race between disk_log_appender and remove_prefix_full_segments() [CORE-14558] storage: prevent race between disk_log_appender and remove_prefix_full_segments() Nov 12, 2025
@WillemKauf WillemKauf force-pushed the storage_batch_write_error_logging branch from 1f5b520 to 9b01363 Compare November 12, 2025 19:06
@WillemKauf WillemKauf force-pushed the storage_batch_write_error_logging branch 2 times, most recently from 8410e99 to 59dea3e Compare November 12, 2025 19:10
@WillemKauf WillemKauf force-pushed the storage_batch_write_error_logging branch from 59dea3e to 70fd3e4 Compare November 12, 2025 22:34
@WillemKauf
Contributor Author

/ci-repeat 1

```cpp
// concurrency issues.
std::optional<ssx::semaphore_units> seg_rolling_units;
if (ptr->has_appender()) {
    seg_rolling_units = co_await _segments_rolling_lock.get_units();
```
Member

if we happen to be waiting for log::close(), should we catch the broken semaphore exception here and quietly return?

Contributor Author

We are currently relying on the thrown exception to break out of the ss::do_until() loop (unless we also make that a bit more intelligent- catching and silencing the exception without other changes would cause an infinite loop). The exception is eventually caught further up the chain (I think all the way in log_eviction_stm).

Member

cool yeah, that's what I was asking. I didn't see anything else that would throw in this path at a glance, so this is basically new behavior? Maybe it makes sense to add "closed" to the do_until condition instead, but idk.

Contributor Author

Yeah, I agree that having an exceptional path here does seem like new behavior. It feels like similar behavior regardless of whether we catch and silently return on an exception here OR allow it to be thrown and bubble up the chain, but I'm still a little unsure.

Happy to get other opinions here and a sanity check on any potential side effects of this change (i.e. previously, if this were to race with a close() operation, it seems that we would still be able to prefix-truncate the entire log if necessary)

Contributor

I think this is fine, because if we're racing with a close, the log will have still persisted the start offset before getting here, and subsequent usage should be quite limited given the log is closing.

Contributor Author

Is this PR overall too complex, then? Hitting a vassert is bad, but if we have your PR to downgrade those to a throw, are we happy to accept that there may be concurrency issues during shutdown which are fairly benign after all?

oleiman
oleiman previously approved these changes Nov 13, 2025
Member

@oleiman oleiman left a comment

looks pretty much legit to me. might want a second set of eyes

…ve()`...

...in `disk_log_impl`

Without any sort of safety guarantees around mutual exclusion
of these functions, we can easily end up with an assert failure like

```
ERROR 2025-11-13 15:50:46,593 [shard 0:main] assert - Assert failure: (src/v/storage/disk_log_impl.cc:2084) '!_closed' flush on closed log - {offsets: {start_offset:850, committed_offset:849, committed_offset_term:-9223372036854775808, dirty_offset:849, dirty_offset_term:-9223372036854775808}, is_closed: true, segments: [{size: 0, []}], config: {ntp: {test.log_builder/OOvVPTBb/0}, base_dir: /home/willem/.cache/bazel/_bazel_willem/6fc2d9663f4e6fbf4046ab3d95678674/sandbox/linux-sandbox/12837/execroot/_main/_tmp/efa90c48b51a675ce0846aa066568b84/test.dir_8IlPIe, overrides: nullptr, revision: 0, topic_revision: -9223372036854775808, remote_revision: 0}}
```

Due to the call to `flush()` in `truncate_prefix()`.

Wait for and break the `_segment_rewrite_lock()` in `disk_log_impl::close()`
and `remove()` to avoid this issue.
…x_full_segments()`

We will be calling `remove_segment_permanently()` on what is potentially
the active segment in `remove_prefix_full_segments()`. If there is a concurrent
append occurring, we may get into a race with the `disk_log_appender`, since we
return the units in `lock_holder` (i.e. the segment's write lock) while the
future for `remove_segment_permanently()` is still unresolved.

In this case, we need to obtain units from `_segments_rolling_lock` to
prevent any concurrency issues, which manifest as either a thrown exception
due to gate closure, or an inability to add batches to a closed `segment`.
For use in a non-seastar thread context.
A test which validates potential races between `disk_log_appender`,
a `truncate_prefix()` operation, and a `close()` operation in a concurrent
context.

Without the previous two commits, this test would occasionally
demonstrate assert failures around operations on closed logs.
@WillemKauf WillemKauf force-pushed the storage_batch_write_error_logging branch from 6059c5b to 799ff14 Compare November 13, 2025 17:00
```cpp
  && !_eviction_monitor->promise.get_future().available()) {
    _eviction_monitor->promise.set_exception(segment_closed_exception());
}
```

Contributor

nit: not from this PR (maybe as a follow up?), but let's move the "closing log" line up top, before we do any awaits (I can imagine this being extra hard to debug without realizing we're even in this call). and maybe update the text here to indicate that we've waited for locks

Contributor Author

Agreed, this is a good follow up

```cpp
}

ss::abort_source as;
mutex log_mutex{"e2e_test::log_mutex"};
```
Contributor

I'm a little confused about the purpose of the mutex here -- is it meant to emulate a specific call pattern (like certain calls must not overlap), or is it just another layer of delaying the calls in funky ways?

Contributor Author

The mutex here is just to prevent the append fiber from trying to make_appender on a closed log, as well as to avoid a prefix truncation on a closed log (both of which vassert(), at least before your PR lands)

@WillemKauf WillemKauf merged commit 013d951 into redpanda-data:dev Nov 14, 2025
23 checks passed
@vbotbuildovich
Collaborator

/backport v25.3.x

@vbotbuildovich
Collaborator

/backport v25.2.x

@WillemKauf
Contributor Author

CI looks green, no new flakes or suspicious storage related regressions. Merging. Let's see how these changes do in CI before finalizing backports.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v25.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-28472-v25.2.x-121 remotes/upstream/v25.2.x
git cherry-pick -x f7ca2ec7da df9089acbf 485b24d8ad e31ae0bdca a0ae9625a2 799ff148b7

Workflow run logs.
