Skip to content

[v25.3.x] [CORE-14558] storage: prevent race between disk_log_appender and remove_prefix_full_segments()#28545

Merged
piyushredpanda merged 6 commits intoredpanda-data:v25.3.xfrom
vbotbuildovich:backport-pr-28472-v25.3.x-983
Dec 19, 2025
Merged

[v25.3.x] [CORE-14558] storage: prevent race between disk_log_appender and remove_prefix_full_segments()#28545
piyushredpanda merged 6 commits intoredpanda-data:v25.3.xfrom
vbotbuildovich:backport-pr-28472-v25.3.x-983

Conversation

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

Backport of PR #28472

Instead, log batch append errors in the `disk_log_appender` at `ERROR`.

(cherry picked from commit f7ca2ec)
A useful sanity check to prevent any potential concurrency issues.

(cherry picked from commit df9089a)
…ve()`...

...in `disk_log_impl`

Without any sort of safety guarantees around mutual exclusion
of these functions, we can easily end up with an assert failure like

```
ERROR 2025-11-13 15:50:46,593 [shard 0:main] assert - Assert failure: (src/v/storage/disk_log_impl.cc:2084) '!_closed' flush on closed log - {offsets: {start_offset:850, committed_offset:849, committed_offset_term:-9223372036854775808, dirty_offset:849, dirty_offset_term:-9223372036854775808}, is_closed: true, segments: [{size: 0, []}], config: {ntp: {test.log_builder/OOvVPTBb/0}, base_dir: /home/willem/.cache/bazel/_bazel_willem/6fc2d9663f4e6fbf4046ab3d95678674/sandbox/linux-sandbox/12837/execroot/_main/_tmp/efa90c48b51a675ce0846aa066568b84/test.dir_8IlPIe, overrides: nullptr, revision: 0, topic_revision: -9223372036854775808, remote_revision: 0}}
```

Due to the call to `flush()` in `truncate_prefix()`.

Wait for and break the `_segment_rewrite_lock()` in `disk_log_impl::close()`
and `remove()` to avoid this issue.

(cherry picked from commit 485b24d)
…x_full_segments()`

We will be calling `remove_segment_permanently()` on what is potentially
the active segment in `remove_prefix_full_segments()`. If there is a concurrent
append occuring, we may get into a race with the `disk_log_appender` since we
return the units in `lock_holder` (i.e the segment's write lock) while the
future for `remove_segment_permanently()` is still unresolved.

In this case, we need to obtain units from `_segments_rolling_lock` to
prevent any concurrency issues, which manifest as either a thrown exception
due to gate closure, or an inability to add batches to a closed `segment`.

(cherry picked from commit e31ae0b)
For use in a non-seastar thread context.

(cherry picked from commit a0ae962)
A test which validates potential races between `disk_log_appender`,
a `truncate_prefix()` operation, and a `close()` operation in a concurrent
context.

Without the previous two commits, this test would occasionally
demonstrate assert failures around operations on closed logs.

(cherry picked from commit 799ff14)
@vbotbuildovich vbotbuildovich added this to the v25.3.x-next milestone Nov 14, 2025
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label Nov 14, 2025
@vbotbuildovich
Copy link
Copy Markdown
Collaborator Author

CI test results

test results on build#76288
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
ShadowLinkTopicFailoverTests test_link_failover {"source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}, "with_failures": false} integration https://buildkite.com/redpanda/redpanda/builds/76288#019a80b4-9450-4d96-884b-39aaa1867ce4 FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkTopicFailoverTests&test_method=test_link_failover
MountUnmountIcebergTest test_simple_remount {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/76288#019a80ba-9723-48c2-8f69-2104f1f8b9fd FLAKY 19/21 upstream reliability is '83.95967002749771'. current run reliability is '90.47619047619048'. drift is -6.51652 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MountUnmountIcebergTest&test_method=test_simple_remount

Copy link
Copy Markdown
Contributor

@WillemKauf WillemKauf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving these changes to marinate in dev before backporting.

@BenPope BenPope requested a review from WillemKauf December 1, 2025 10:37
Copy link
Copy Markdown
Contributor

@WillemKauf WillemKauf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine for backporting to me.

@piyushredpanda piyushredpanda merged commit 0329c6e into redpanda-data:v25.3.x Dec 19, 2025
17 checks passed
@piyushredpanda piyushredpanda modified the milestones: v25.3.x-next, v25.3.3 Dec 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/redpanda kind/backport PRs targeting a stable branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants