CORE-14617 kafka: fix fetch session retry losing partition inclusion#29304
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a bug (CORE-14617) where Kafka fetch sessions could incorrectly exclude partitions with changed metadata (high watermark, log start offset, last stable offset) when fetches retry internally due to min_bytes not being satisfied.
Changes:
- Separated the logic for checking if a partition has changes from the logic for updating the session cache
- Re-enabled source_high_watermark check in cluster linking tests that was previously disabled due to this bug
- Added a regression test to verify partitions with changed metadata are included in fetch responses even during retries
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/v/kafka/server/handlers/fetch.cc | Split update_fetch_partition() into partition_has_changes() (check only) and update_session_cache() (mutation only); moved cache updates to send_response() to avoid premature updates during retries |
| src/v/kafka/server/tests/fetch_test.cc | Added regression test verifying partitions with changed log_start_offset are included in incremental fetch responses even when min_bytes triggers internal retry |
| tests/rptest/tests/cluster_linking_e2e_test.py | Re-enabled source_high_watermark validation check that was previously disabled due to this bug |
michael-redpanda
left a comment
There was a problem hiding this comment.
looks good - just request that you rerun that one re-enabled test a bunch
|
Does this work if a fetch response gets lost on the wire + client retries? |
|
/ci-repeat 10 |
|
/ci-repeat 10 |
830bda4 to
e70ac4d
Compare
|
force-push: noop (only comment changes + variable renaming) to addresses review feedback |
Done, both tests that got this check reenabled are green: |
Good question, there are a couple of points here:
So the two expected scenarios are really:
|
When fetches retry internally due to min_bytes not being satisfied, partitions with changed metadata were incorrectly excluded from the response. The issue was that update_fetch_partition() both checked whether a partition should be included AND mutated the session cache. On retry, the cache already reflected the response, so no change was detected. Split into partition_has_changes() for the check in set() and update_session_partition() for the mutation in send_response(). Also re-enables the source_high_watermark check in cluster linking tests.
e70ac4d to
b7c6970
Compare
|
force-push: rebase to dev to fix the failure of the PR rendering CI check (pulls in this commit) |
Retry command for Build#79312please wait until all jobs are finished before running the slash command |
|
/ci-repeat 1 |
|
/backport v25.3.x |
|
/backport v25.2.x |
|
/backport v25.1.x |
|
Failed to create a backport PR to v25.2.x branch. I tried: |
|
Failed to create a backport PR to v25.1.x branch. I tried: |
When fetches retry internally due to min_bytes not being satisfied, partitions with changed metadata were incorrectly excluded from the response. The issue was that update_fetch_partition() both checked whether a partition should be included AND mutated the session cache. On retry, the cache already reflected the response, so no change was detected.
Split into partition_has_changes() for the check in set() and update_session_partition() for the mutation in send_response().
Also re-enables the source_high_watermark check in cluster linking tests.
Fixes https://redpandadata.atlassian.net/browse/CORE-14617
Fixes https://redpandadata.atlassian.net/browse/CORE-14653
Backports Required
Release Notes
Bug Fixes