docs: document single-batch-per-partition Kafka constraint #529
- ProducerOptions.LingerMs default: 5ms → 0ms (matches Java producer default)
- ForHighThroughput() linger: 5ms → 100ms (matches Kafka Streams default)
Kafka's LogValidator rejects multiple record batches per partition in
v2 format ("Compressed outer record has more than one batch"), even
for uncompressed batches. The Java client also enforces this in
RecordAccumulator.drain(). Add comment explaining the constraint so
future developers don't attempt multi-batch-per-partition coalescing.
Code Review
Overall Assessment
Good PR with a mix of excellent defensive documentation and a few changes that deserve discussion.
BrokerSender.cs — ✅ Excellent improvement
The improved comment on CoalesceBatch is exactly the right kind of defensive documentation. The old comment (// Ensure at most one batch per partition per coalesced request) described what without explaining why, which is the kind of comment that invites future "optimization" that breaks things. The new comment:
- Names the specific error that would occur
- Clarifies it applies to ALL v2 batches (not just compressed, which is non-obvious)
- Cites the Java client precedent for validation
- Ends with an explicit "Do not remove this constraint" warning
This is the right pattern for non-obvious correctness constraints.
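As a sketch of the pattern (type and member names here are illustrative, not the project's actual code), the guarded coalescing step might look like:

```csharp
using System.Collections.Generic;

// Illustrative types; the real project's names may differ.
record TopicPartition(string Topic, int Partition);
record RecordBatch(TopicPartition Partition, byte[] Payload);

static class CoalesceSketch
{
    // Kafka's LogValidator rejects multiple record batches per partition in the
    // v2 format ("Compressed outer record has more than one batch"), even for
    // uncompressed batches. The Java client enforces the same invariant in
    // RecordAccumulator.drain(), which returns at most one batch per partition
    // per node. Do not remove this constraint.
    public static Dictionary<TopicPartition, RecordBatch> CoalesceBatch(
        IEnumerable<RecordBatch> ready)
    {
        var perPartition = new Dictionary<TopicPartition, RecordBatch>();
        foreach (var batch in ready)
        {
            // First batch for a partition wins; any further batches stay queued
            // for a later request instead of being merged into this one.
            perPartition.TryAdd(batch.Partition, batch);
        }
        return perPartition;
    }
}
```

Keeping the rationale and the "do not remove" warning adjacent to the guard itself is what makes the constraint hard to "optimize" away later.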
ProducerOptions.cs — ⚠️ Behavioral change + lost history
Changing LingerMs default from 5 → 0 is a behavioral breaking change for existing users who rely on the default. More critically, the PR removes the historical note that explained why 5ms was chosen:
"This default was changed from 0 to 5 in v1.0.0 to improve batching for awaited ProduceAsync callers."
That institutional knowledge is now gone. If a developer later sees linger=0 and wonders "should we add a small default linger for batching?"—they won't find the history of why it was tried and apparently reverted.
Suggestion: Even if aligning with Java's 0ms default is the right call (and it's a reasonable call for API predictability), preserve the reasoning:
```csharp
/// Default is 0 (immediate send), matching the Java Kafka producer default.
/// A non-zero linger (e.g. 5ms) improves batching for concurrent <c>ProduceAsync</c>
/// callers at the cost of added latency. Use <see cref="ProducerBuilder{TKey,TValue}.ForHighThroughput"/>
/// to set 100ms for throughput-oriented workloads.
```

Also worth noting: the PR description doesn't explain why this default is being reverted. A commit message or PR body explaining the trade-off ("aligning with Java default for predictability; users needing batching should opt in via ForHighThroughput") would help reviewers and future maintainers.
Builders.cs — ⚠️ 100ms may be surprising for ForHighThroughput
Bumping ForHighThroughput linger from 5ms → 100ms is a 20x increase. Users calling ForHighThroughput() expect latency trade-offs in exchange for throughput, but 100ms is aggressive:
- Every message will wait up to 100ms before sending if the batch isn't full
- For a producer sending at steady high volume this is fine; for bursty or low-frequency producers this introduces unexpected tail latency
The justification "matches Kafka Streams default" is a bit of a category mismatch: Kafka Streams ties batching to its commit intervals as part of stream-processing semantics, whereas a general-purpose producer sends at arbitrary rates.
Suggestion: Either:
- Document the latency implication explicitly in the XML doc: "Note: 100ms linger means messages may be held up to 100ms before dispatch. Only suitable for sustained high-throughput workloads."
- Consider a more moderate default (e.g. 20–50ms) that still meaningfully improves batching without the tail latency cliff, or make it configurable via the builder.
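If the configurable route is taken, a minimal sketch could look like the following (the builder and option names are hypothetical, patterned on the names appearing in this PR, not the project's actual API):

```csharp
using System;

// Hypothetical option bag; real property names may differ.
public class ProducerOptions
{
    public int LingerMs { get; set; }
    public int BatchSizeBytes { get; set; } = 16 * 1024;
}

public class ProducerBuilder
{
    private readonly ProducerOptions _options = new();

    // High-throughput preset with an overridable linger. The 100ms default
    // trades up to 100ms of added tail latency for larger batches, so callers
    // with bursty traffic can opt into a smaller value.
    public ProducerBuilder ForHighThroughput(int lingerMs = 100)
    {
        if (lingerMs < 0)
            throw new ArgumentOutOfRangeException(nameof(lingerMs));
        _options.LingerMs = lingerMs;
        _options.BatchSizeBytes = 256 * 1024;
        return this;
    }
}
```

Usage would then be `new ProducerBuilder().ForHighThroughput(lingerMs: 20)` for workloads that want batching without the full 100ms cliff, while the zero-argument call keeps the preset's documented default.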
Summary
| Change | Assessment |
|---|---|
| CoalesceBatch comment | ✅ Excellent — preserve this pattern |
| LingerMs default 5ms→0ms | ⚠️ Valid alignment with Java defaults, but loses important historical context |
| ForHighThroughput 5ms→100ms | ⚠️ The latency implication (100ms tail) should be documented explicitly |
Summary
Documents the single-batch-per-partition Kafka constraint in CoalesceBatch, preventing future developers from attempting multi-batch-per-partition coalescing.
Background
Profiling showed 7.5% of CPU time blocked in ReserveMemorySync (backpressure). Investigation into multi-batch-per-partition coalescing to improve drain throughput revealed that Kafka's LogValidator rejects multiple record batches per partition in v2 format ("Compressed outer record has more than one batch"), even for uncompressed batches. The Java Kafka client also enforces this (RecordAccumulator.drain returns at most one batch per partition per node). The constraint is not just for simplicity; it is a broker requirement.
Test plan
- InvalidRecord error occurs when sending multiple batches per partition (integration test)
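A rough shape for that integration test is sketched below (xUnit-style; `TestBroker`, `ErrorCode`, and the raw-request helper are hypothetical stand-ins for whatever harness the project actually uses):

```csharp
using System.Threading.Tasks;
using Xunit;

public class MultiBatchConstraintTests
{
    [Fact]
    public async Task Broker_rejects_two_batches_for_one_partition()
    {
        // Hand-craft a ProduceRequest carrying two v2 record batches for the
        // same partition, deliberately bypassing CoalesceBatch.
        // TestBroker and SendRawProduceRequestAsync are hypothetical helpers.
        var response = await TestBroker.SendRawProduceRequestAsync(
            partition: 0, batchCount: 2);

        // v2 brokers reject this even when the batches are uncompressed.
        Assert.Equal(ErrorCode.InvalidRecord, response.Error);
    }
}
```

Asserting on the specific error code (rather than just "request failed") pins the test to the broker-side constraint this PR documents.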