Prevent dropping valid peer connections on failure#20961
Prevent dropping valid peer connections on failure#20961gagandhakrey wants to merge 3 commits intoopensearch-project:mainfrom
Conversation
Signed-off-by: Gagan Dhakrey <gagandhakrey@Gagans-MacBook-Pro.local>
PR Reviewer Guide 🔍(Review updated until commit c25ccc6)Here are some key observations to aid the review process:
|
|
❌ Gradle check result for 729e31d: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
Failed to generate code suggestions for PR |
|
❌ Gradle check result for 9f7b607: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
Persistent review updated to latest commit c25ccc6 |
PR Code Suggestions ✨Explore these optional code suggestions:
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #20961 +/- ##
============================================
- Coverage 73.26% 73.18% -0.09%
- Complexity 72743 72801 +58
============================================
Files 5862 5871 +9
Lines 332558 332666 +108
Branches 48010 48014 +4
============================================
- Hits 243643 243451 -192
- Misses 69343 69768 +425
+ Partials 19572 19447 -125 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@cwperks @andrross @sandeshkr419 please review this , i think it is critical race condition issue |
Description
Fixed a race condition in cluster discovery where a stale connection failure could wipe out a newer, valid connection to the same address.
The issue was in the onFailure callback — when an older Peer connection attempt failed, it would blindly call peersByAddress.remove(transportAddress), not caring whether a newer connection to that same address had already been established. This caused unnecessary disconnects and flapping.
The fix is simple: swap the unconditional remove(transportAddress) for remove(transportAddress, Peer.this), which only removes the entry if it's still pointing to this specific Peer instance. That way, a failed old connection can't accidentally clean up someone else's active one.
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.