docs: document single-batch-per-partition Kafka constraint #529
- ProducerOptions.LingerMs default: 5ms → 0ms (matches Java producer default)
- ForHighThroughput() linger: 5ms → 100ms (matches Kafka Streams default)
Kafka's LogValidator rejects multiple record batches per partition in
v2 format ("Compressed outer record has more than one batch"), even
for uncompressed batches. The Java client also enforces this in
RecordAccumulator.drain(). Add comment explaining the constraint so
future developers don't attempt multi-batch-per-partition coalescing.
Code Review
Overall Assessment
Good PR with a mix of excellent defensive documentation and a few changes that deserve discussion.
BrokerSender.cs — ✅ Excellent improvement
The improved comment on CoalesceBatch is exactly the right kind of defensive documentation. The old comment (// Ensure at most one batch per partition per coalesced request) described what without explaining why, which is the kind of comment that invites future "optimization" that breaks things. The new comment:
- Names the specific error that would occur
- Clarifies it applies to ALL v2 batches (not just compressed, which is non-obvious)
- Cites the Java client precedent for validation
- Ends with an explicit "Do not remove this constraint" warning
This is the right pattern for non-obvious correctness constraints.
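As a sketch of the pattern (type and member names here are illustrative, not the project's actual code), the guarded coalescing step might look like:

```csharp
using System.Collections.Generic;

// Illustrative types; the real project's names may differ.
record TopicPartition(string Topic, int Partition);
record RecordBatch(TopicPartition Partition, byte[] Payload);

static class CoalesceSketch
{
    // Kafka's LogValidator rejects multiple record batches per partition in the
    // v2 format ("Compressed outer record has more than one batch"), even for
    // uncompressed batches. The Java client enforces the same invariant in
    // RecordAccumulator.drain(), which returns at most one batch per partition
    // per node. Do not remove this constraint.
    public static Dictionary<TopicPartition, RecordBatch> CoalesceBatch(
        IEnumerable<RecordBatch> ready)
    {
        var perPartition = new Dictionary<TopicPartition, RecordBatch>();
        foreach (var batch in ready)
        {
            // First batch for a partition wins; any further batches stay queued
            // for a later request instead of being merged into this one.
            perPartition.TryAdd(batch.Partition, batch);
        }
        return perPartition;
    }
}
```

Keeping the rationale and the "do not remove" warning adjacent to the guard itself is what makes the constraint hard to "optimize" away later.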
ProducerOptions.cs — ⚠️ Behavioral change + lost history
Changing LingerMs default from 5 → 0 is a behavioral breaking change for existing users who rely on the default. More critically, the PR removes the historical note that explained why 5ms was chosen:
"This default was changed from 0 to 5 in v1.0.0 to improve batching for awaited ProduceAsync callers."
That institutional knowledge is now gone. If a developer later sees linger=0 and wonders "should we add a small default linger for batching?"—they won't find the history of why it was tried and apparently reverted.
Suggestion: Even if aligning with Java's 0ms default is the right call (and it's a reasonable call for API predictability), preserve the reasoning:
```csharp
/// Default is 0 (immediate send), matching the Java Kafka producer default.
/// A non-zero linger (e.g. 5ms) improves batching for concurrent <c>ProduceAsync</c>
/// callers at the cost of added latency. Use <see cref="ProducerBuilder{TKey,TValue}.ForHighThroughput"/>
/// to set 100ms for throughput-oriented workloads.
```

Also worth noting: the PR description doesn't explain why this default is being reverted. A commit message or PR body explaining the trade-off ("aligning with Java default for predictability; users needing batching should opt in via ForHighThroughput") would help reviewers and future maintainers.
Builders.cs — ⚠️ 100ms may be surprising for ForHighThroughput
Bumping ForHighThroughput linger from 5ms → 100ms is a 20x increase. Users calling ForHighThroughput() expect latency trade-offs in exchange for throughput, but 100ms is aggressive:
- Every message will wait up to 100ms before sending if the batch isn't full
- For a producer sending at steady high volume this is fine; for bursty or low-frequency producers this introduces unexpected tail latency
The justification "matches Kafka Streams default" is a bit of a category mismatch: Kafka Streams ties batching to its commit intervals as part of stream-processing semantics, whereas a general-purpose producer sends at arbitrary rates.
Suggestion: Either:
- Document the latency implication explicitly in the XML doc: "Note: 100ms linger means messages may be held up to 100ms before dispatch. Only suitable for sustained high-throughput workloads."
- Consider a more moderate default (e.g. 20–50ms) that still meaningfully improves batching without the tail latency cliff, or make it configurable via the builder.
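If the configurable route is taken, a minimal sketch could look like the following (the builder and option names are hypothetical, patterned on the names appearing in this PR, not the project's actual API):

```csharp
using System;

// Hypothetical option bag; real property names may differ.
public class ProducerOptions
{
    public int LingerMs { get; set; }
    public int BatchSizeBytes { get; set; } = 16 * 1024;
}

public class ProducerBuilder
{
    private readonly ProducerOptions _options = new();

    // High-throughput preset with an overridable linger. The 100ms default
    // trades up to 100ms of added tail latency for larger batches, so callers
    // with bursty traffic can opt into a smaller value.
    public ProducerBuilder ForHighThroughput(int lingerMs = 100)
    {
        if (lingerMs < 0)
            throw new ArgumentOutOfRangeException(nameof(lingerMs));
        _options.LingerMs = lingerMs;
        _options.BatchSizeBytes = 256 * 1024;
        return this;
    }
}
```

Usage would then be `new ProducerBuilder().ForHighThroughput(lingerMs: 20)` for workloads that want batching without the full 100ms cliff, while the zero-argument call keeps the preset's documented default.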
Summary
| Change | Assessment |
|---|---|
| CoalesceBatch comment | ✅ Excellent — preserve this pattern |
| LingerMs default 5ms→0ms | ⚠️ Valid alignment with Java defaults, but loses important historical context |
| ForHighThroughput 5ms→100ms | ⚠️ The latency implication (100ms tail) should be documented explicitly |
Summary
Documents the single-batch-per-partition Kafka constraint in CoalesceBatch, preventing future developers from attempting multi-batch-per-partition coalescing.
Background
Profiling showed 7.5% of CPU time blocked in ReserveMemorySync (backpressure). Investigation into multi-batch-per-partition coalescing to improve drain throughput revealed that Kafka's LogValidator rejects multiple record batches per partition in v2 format ("Compressed outer record has more than one batch"), even for uncompressed batches. The Java Kafka client also enforces this (RecordAccumulator.drain returns at most one batch per partition per node). The constraint is not just for simplicity; it is a broker requirement.
Test plan
- InvalidRecord error occurs when sending multiple batches per partition (integration test)
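A rough shape for that integration test is sketched below (xUnit-style; `TestBroker`, `ErrorCode`, and the raw-request helper are hypothetical stand-ins for whatever harness the project actually uses):

```csharp
using System.Threading.Tasks;
using Xunit;

public class MultiBatchConstraintTests
{
    [Fact]
    public async Task Broker_rejects_two_batches_for_one_partition()
    {
        // Hand-craft a ProduceRequest carrying two v2 record batches for the
        // same partition, deliberately bypassing CoalesceBatch.
        // TestBroker and SendRawProduceRequestAsync are hypothetical helpers.
        var response = await TestBroker.SendRawProduceRequestAsync(
            partition: 0, batchCount: 2);

        // v2 brokers reject this even when the batches are uncompressed.
        Assert.Equal(ErrorCode.InvalidRecord, response.Error);
    }
}
```

Asserting on the specific error code (rather than just "request failed") pins the test to the broker-side constraint this PR documents.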