Skip to content

[v25.2.x] [CORE-14558] storage: prevent race between disk_log_appender and remove_prefix_full_segments() (MANUAL BACKPORT)#28563

Closed
WillemKauf wants to merge 6 commits intoredpanda-data:v25.2.xfrom
WillemKauf:manual-backport-28472-v25.2.x-580
Closed

[v25.2.x] [CORE-14558] storage: prevent race between disk_log_appender and remove_prefix_full_segments() (MANUAL BACKPORT)#28563
WillemKauf wants to merge 6 commits intoredpanda-data:v25.2.xfrom
WillemKauf:manual-backport-28472-v25.2.x-580

Conversation

@WillemKauf
Copy link
Copy Markdown
Contributor

Manual backport of #28472. cherry-pick conflict due to missing includes.

Fixes #28546.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

Bug Fixes

  • Prevent a slew of races between operations in the storage layer that could lead to attempting to append to a closed segment or a failed assert

Instead, log batch append errors in the `disk_log_appender` at `ERROR`.

(cherry picked from commit f7ca2ec)
A useful sanity check to prevent any potential concurrency issues.

(cherry picked from commit df9089a)
…ve()`...

...in `disk_log_impl`

Without any sort of safety guarantees around mutual exclusion
of these functions, we can easily end up with an assert failure like

```
ERROR 2025-11-13 15:50:46,593 [shard 0:main] assert - Assert failure: (src/v/storage/disk_log_impl.cc:2084) '!_closed' flush on closed log - {offsets: {start_offset:850, committed_offset:849, committed_offset_term:-9223372036854775808, dirty_offset:849, dirty_offset_term:-9223372036854775808}, is_closed: true, segments: [{size: 0, []}], config: {ntp: {test.log_builder/OOvVPTBb/0}, base_dir: /home/willem/.cache/bazel/_bazel_willem/6fc2d9663f4e6fbf4046ab3d95678674/sandbox/linux-sandbox/12837/execroot/_main/_tmp/efa90c48b51a675ce0846aa066568b84/test.dir_8IlPIe, overrides: nullptr, revision: 0, topic_revision: -9223372036854775808, remote_revision: 0}}
```

Due to the call to `flush()` in `truncate_prefix()`.

Wait for and break the `_segment_rewrite_lock()` in `disk_log_impl::close()`
and `remove()` to avoid this issue.

(cherry picked from commit 485b24d)
…x_full_segments()`

We will be calling `remove_segment_permanently()` on what is potentially
the active segment in `remove_prefix_full_segments()`. If there is a concurrent
append occuring, we may get into a race with the `disk_log_appender` since we
return the units in `lock_holder` (i.e the segment's write lock) while the
future for `remove_segment_permanently()` is still unresolved.

In this case, we need to obtain units from `_segments_rolling_lock` to
prevent any concurrency issues, which manifest as either a thrown exception
due to gate closure, or an inability to add batches to a closed `segment`.

(cherry picked from commit e31ae0b)
For use in a non-seastar thread context.

(cherry picked from commit a0ae962)
A test which validates potential races between `disk_log_appender`,
a `truncate_prefix()` operation, and a `close()` operation in a concurrent
context.

Without the previous two commits, this test would occasionally
demonstrate assert failures around operations on closed logs.

(cherry picked from commit 799ff14)
Copy link
Copy Markdown
Contributor Author

@WillemKauf WillemKauf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please DNM yet. Leaving these changes to marinate in dev before backporting.

@WillemKauf WillemKauf requested a review from andrwng December 2, 2025 03:43
@WillemKauf
Copy link
Copy Markdown
Contributor Author

This should be okay to merge at this point.

@WillemKauf WillemKauf closed this Jan 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant