Prevent request corruption by avoiding failing entire bulk request on single shard rejection from shard backpressure#20727
Conversation
PR Reviewer Guide 🔍(Review updated until commit 52b3c0e)Here are some key observations to aid the review process:
|
09b6c8b to
7e90117
Compare
PR Code Suggestions ✨Latest suggestions up to 52b3c0e Explore these optional code suggestions:
Previous suggestionsSuggestions up to commit 1c2ad51
Suggestions up to commit 7e90117
Suggestions up to commit 09b6c8b
|
|
Persistent review updated to latest commit 7e90117 |
|
❌ Gradle check result for 7e90117: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❌ Gradle check result for 7e90117: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❌ Gradle check result for 7e90117: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❌ Gradle check result for 7e90117: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
7e90117 to
acabbad
Compare
|
Persistent review updated to latest commit acabbad |
|
❌ Gradle check result for acabbad: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
Persistent review updated to latest commit 1c2ad51 |
|
❌ Gradle check result for 1c2ad51: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❌ Gradle check result for 1c2ad51: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #20727 +/- ##
============================================
- Coverage 73.28% 73.26% -0.03%
+ Complexity 72066 72047 -19
============================================
Files 5786 5786
Lines 329620 329683 +63
Branches 47568 47572 +4
============================================
- Hits 241574 241533 -41
- Misses 68700 68766 +66
- Partials 19346 19384 +38 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…rejection Signed-off-by: Varun Bharadwaj <varunbharadwaj1995@gmail.com>
1c2ad51 to
52b3c0e
Compare
|
Persistent review updated to latest commit 52b3c0e |
|
❌ Gradle check result for 52b3c0e: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
For batch style APIs in general, the best practice is that the overall request only fails with a non-200 response for things like failing authentication or invalid request syntax. Shard backpressure rejections are clearly shard-level failures and as such only apply to part of the batch _bulk request, so it seems that it is clearly a bug to fail the entire request as opposed to sending a 200 with individual document-level failures for the documents that were rejected due to backpressure. I mean, obviously request corruption is a bug but the behavior is wrong even if corruption wasn't possible. Do I understand that correctly? However, tuning clients to handle backpressure properly is often a delicate balance so I do worry about the impact if today the service sends 429s and after this fix it starts sending 200s with the backpressure details in the response. @msfroh @Bukhtawar what do you think? |
Yeah, that was the idea behind this approach.
This is a good point, agree. I was thinking if someone is relying on the 429s from shard level back-pressure mechanism, they are very likely impacted by the request corruption issue. Another alternative tested was to first apply the shard level checks on all shards (phase 1) and later send the shard requests (phase 2). So we fail the bulk request if any shard is overloaded. This will maintain better compatibility with current behavior. Open to suggestions on which approach might be preferred here. |
Honestly this seems like a less risky approach, but I'd like to get some opinions from others familiar with the indexing path, e.g. @Bukhtawar @mgodwan @gbbafna It seems like there might be other aspects to this behavior change as well. With this change, in case of a single slow shard then indexing traffic will proceed a full throughput to the other shards. This seems good but could lead to data skew across shards if the slowness persists. That could be a big change to a system relying on the current behavior. |
| indicesService.addDocStatusStats(docStatusStats); | ||
| listener.onResponse( | ||
| new BulkResponse(responses.toArray(new BulkItemResponse[responses.length()]), buildTookInMillis(startTimeNanos)) | ||
| ); |
There was a problem hiding this comment.
Can we avoid duplicating finishHim()?
| throw e; | ||
| // Treat execution failures as shard-level failures rather than failing the entire bulk request | ||
| onShardFailure.accept(e); |
There was a problem hiding this comment.
Looks like we need a test for this, unless an existing test already covers this?
Do you think this can be enabled behind a bulk query param while still keeping the existing behavior as default? |
Description
Shard backpressure rejections result in failing entire bulk request, releasing unsafe buffer which can cause request corruption. More details can be found in #20724. This PR updates the logic to treat shard level pressure check failures as regular shard failures, without failing the entire bulk request. This way, we continue to process other shards, and return shard level errors in the bulk response.
This change also solves the request corruption problem when shard level pressure check is enabled.
Related Issues
Resolves #20724
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.