Skip to content

cloud_storage: include revision_id in spillover manifest cache key#29068

Merged
nvartolomei merged 2 commits intoredpanda-data:devfrom
nvartolomei:nv/spillover-cache-collision
Dec 21, 2025
Merged

cloud_storage: include revision_id in spillover manifest cache key#29068
nvartolomei merged 2 commits intoredpanda-data:devfrom
nvartolomei:nv/spillover-cache-collision

Conversation

@nvartolomei
Copy link
Copy Markdown
Contributor

@nvartolomei nvartolomei commented Dec 19, 2025

Fix cache collision where spillover manifests from a deleted topic could
be applied to a newly created topic with the same ntp. The cache key now
includes revision_id to distinguish between different topic incarnations.

Adds test that reproduces the issue by creating a topic, applying
spillover, deleting the topic, and recreating it with the same name.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

Bug Fixes

  • Fix a cache collision in in-memory spillover manifest cache which could lead to new topic reading data from the old topic with the same name for a short period of time.

@nvartolomei nvartolomei requested review from Lazin and Copilot and removed request for Copilot December 19, 2025 12:43
@vbotbuildovich
Copy link
Copy Markdown
Collaborator

vbotbuildovich commented Dec 19, 2025

CI test results

test results on build#78196
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
PartitionBalancerTest test_rack_awareness null integration https://buildkite.com/redpanda/redpanda/builds/78196#019b36dc-9f81-4f4a-9ffc-8cac1d602bb4 FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=PartitionBalancerTest&test_method=test_rack_awareness
ScalingUpTest test_fast_node_addition null integration https://buildkite.com/redpanda/redpanda/builds/78196#019b36dc-9f86-4ab3-8951-e2b31e7e5f2d FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0318, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ScalingUpTest&test_method=test_fast_node_addition
test results on build#78253
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
MountUnmountIcebergTest test_simple_remount {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/78253#019b3aff-63e9-4047-9d2c-bca9f2ecd3e4 FLAKY 9/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.1620, p0=0.8291, reject_threshold=0.0100. adj_baseline=0.4114, p1=0.0399, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MountUnmountIcebergTest&test_method=test_simple_remount

oleiman
oleiman previously approved these changes Dec 19, 2025
Copy link
Copy Markdown
Member

@oleiman oleiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. awesome find.

Comment on lines +1545 to +1547
// Test a scenario where after a topic recreation the spillover manifests from
// the previous incarnation could be applied to the new topic, causing errors or
// reading wrong data.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for posterity, this test would consistently fail before your fix?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Every single time. The test was written before the fix.

I need to an assert though to make sure that spillover did actually happen to future proof it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every single time

😞

assert ... to make sure that spillover did actually happen

good idea 👍

Fix cache collision where spillover manifests from a deleted topic could
be applied to a newly created topic with the same ntp. The cache key now
includes revision_id to distinguish between different topic incarnations.

Adds test that reproduces the issue by creating a topic, applying
spillover, deleting the topic, and recreating it with the same name.
Copilot AI review requested due to automatic review settings December 20, 2025 07:59
@nvartolomei nvartolomei force-pushed the nv/spillover-cache-collision branch from 9c7f258 to 379d427 Compare December 20, 2025 07:59
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a cache collision bug in the spillover manifest cache where manifests from a deleted topic could be incorrectly applied to a newly created topic with the same ntp. The fix adds revision_id to the cache key to distinguish between different topic incarnations.

Key Changes:

  • Updated manifest cache key from tuple<ntp, offset> to tuple<ntp, revision_id, offset>
  • Added test reproducing the collision scenario (create topic, apply spillover, delete, recreate)
  • Modified test utilities to support sequential record generation across topic recreations

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/v/cloud_storage/materialized_manifest_cache.h Updated cache key type definition to include revision_id
src/v/cloud_storage/materialized_manifest_cache.cc Updated cache key construction and formatting to include revision_id
src/v/cloud_storage/async_manifest_materializer.cc Updated all cache lookups to include revision_id in the key
src/v/cloud_storage/tests/materialized_manifest_cache_test.cc Updated test helper functions to construct keys with revision_id
src/v/cloud_storage/tests/cloud_storage_e2e_test.cc Added regression test for spillover cache collision bug
src/v/cloud_storage/tests/produce_utils.h Enhanced generator to support continuing sequence across invocations

Comment on lines +28 to +31
remote_segment_generator& start_ix(size_t ix) {
_current_ix = ix;
return *this;
}
Copy link

Copilot AI Dec 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new start_ix method lacks documentation explaining its purpose and when it should be used. Add a comment describing that this method sets the starting index for record generation, useful for continuing sequences across multiple invocations or topic incarnations.

Copilot uses AI. Check for mistakes.
vlog(e2e_test_log.info, "Running iteration {}", iteration);

add_topic({model::kafka_namespace, topic_name}, 1, props).get();
wait_for_leader(ntp).get();
Copy link

Copilot AI Dec 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment 'Seeding partition data' is correct, but consider making it more descriptive for this specific iteration context, such as 'Seeding partition data for iteration N' to improve test output clarity.

Suggested change
wait_for_leader(ntp).get();
SCOPED_TRACE(
fmt::format("Seeding partition data for iteration {}", iteration));

Copilot uses AI. Check for mistakes.
// Consume all data.
tests::kafka_consume_transport consumer(make_kafka_client().get());
auto deferred_c_close = ss::defer(
[&consumer] { consumer.stop().get(); });
Copy link

Copilot AI Dec 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test verifies only the first record's key matches the expected iteration value. Consider also checking the last record or a sample of records to ensure all consumed data is from the correct iteration and not mixed with data from the previous topic incarnation.

Copilot uses AI. Check for mistakes.
@nvartolomei nvartolomei merged commit 9a7a69b into redpanda-data:dev Dec 21, 2025
19 checks passed
@vbotbuildovich
Copy link
Copy Markdown
Collaborator

/backport v25.3.x

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

/backport v25.2.x

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

/backport v25.1.x

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

Failed to create a backport PR to v25.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-29068-v25.2.x-603 remotes/upstream/v25.2.x
git cherry-pick -x dfceec48d9 379d42743c

Workflow run logs.

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

Failed to create a backport PR to v25.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-29068-v25.1.x-83 remotes/upstream/v25.1.x
git cherry-pick -x dfceec48d9 379d42743c

Workflow run logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants