
Prevent dropping valid peer connections on failure #20961

Open
gagandhakrey wants to merge 3 commits into opensearch-project:main from gagandhakrey:fix/peerfinder-race-condition


@gagandhakrey gagandhakrey commented Mar 22, 2026

Description

Fixes a race condition in cluster discovery where a stale connection failure could remove a newer, valid connection to the same address.
The issue was in the onFailure callback. When an older Peer connection attempt failed, it unconditionally called peersByAddress.remove(transportAddress), without checking whether a newer connection to that same address had already been registered. This caused unnecessary disconnects and flapping.
The fix replaces the unconditional remove(transportAddress) with remove(transportAddress, Peer.this), which removes the entry only if it still maps to this specific Peer instance. A failed old connection attempt can therefore no longer remove the entry belonging to a newer, active one.
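
To make the description above concrete, here is a minimal sketch of the sequence it describes, replayed on a single thread. `FakePeer`, the `String` address, and `simulate` are hypothetical stand-ins for illustration only; they are not the real `PeerFinder.Peer` or `TransportAddress` types, and the real callback runs under PeerFinder's own locking.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins: FakePeer and the String address are illustrative,
// not the real PeerFinder.Peer or TransportAddress types.
final class StalePeerRemovalDemo {
    static final class FakePeer {
        final String label;
        FakePeer(String label) { this.label = label; }
    }

    // Returns whichever peer is left in the map after a stale failure callback runs.
    static FakePeer simulate(boolean conditionalRemove) {
        Map<String, FakePeer> peersByAddress = new HashMap<>();
        String address = "10.0.0.1:9300";

        // First connection attempt is registered.
        FakePeer oldPeer = new FakePeer("old");
        peersByAddress.put(address, oldPeer);

        // A newer attempt to the same address replaces the old entry.
        FakePeer newPeer = new FakePeer("new");
        peersByAddress.put(address, newPeer);

        // The old attempt's failure callback fires late.
        if (conditionalRemove) {
            // No-op here: the entry now maps to newPeer, not oldPeer.
            peersByAddress.remove(address, oldPeer);
        } else {
            // Removes whatever is mapped, including newPeer.
            peersByAddress.remove(address);
        }
        return peersByAddress.get(address);
    }

    public static void main(String[] args) {
        System.out.println(simulate(false) == null); // true: the newer peer was dropped
        System.out.println(simulate(true).label);    // new: the newer peer survived
    }
}
```

With the unconditional remove the newer entry disappears; with the two-argument form the stale callback leaves it in place.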

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Gagan Dhakrey <gagandhakrey@Gagans-MacBook-Pro.local>
@gagandhakrey gagandhakrey requested a review from a team as a code owner March 22, 2026 19:44
@github-actions
Contributor

github-actions bot commented Mar 22, 2026

PR Reviewer Guide 🔍

(Review updated until commit c25ccc6)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Correctness Validation

The fix uses peersByAddress.remove(transportAddress, Peer.this) which relies on the map's conditional remove (key-value pair match). This is correct only if peersByAddress is a ConcurrentHashMap or similar map that supports this two-argument remove(key, value) method. Verify that the map type used for peersByAddress supports this operation and that Peer implements proper equals/hashCode (or identity comparison is sufficient) for the conditional remove to work as intended.

peersByAddress.remove(transportAddress, Peer.this);
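
On the equals/hashCode question raised above: `Map.remove(key, value)` compares the stored value with the argument using `equals`. A class that does not override `equals` inherits `Object.equals`, which compares references, so only the very same instance matches. The sketch below illustrates the difference; `IdentityPeer` and `ValuePeer` are hypothetical classes, and whether the real `Peer` class overrides `equals` is not confirmed here and would need checking in PeerFinder.java.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical classes for illustration; neither is the real PeerFinder.Peer.
final class ConditionalRemoveEqualityDemo {
    // No equals/hashCode override: Object.equals compares by reference.
    static final class IdentityPeer {
        final String address;
        IdentityPeer(String address) { this.address = address; }
    }

    // Overrides equals/hashCode: two instances with the same address are equal.
    static final class ValuePeer {
        final String address;
        ValuePeer(String address) { this.address = address; }
        @Override public boolean equals(Object o) {
            return o instanceof ValuePeer && ((ValuePeer) o).address.equals(address);
        }
        @Override public int hashCode() { return address.hashCode(); }
    }

    // Stores one object, then tries to remove the entry using a different object.
    // Returns true if the lookalike was accepted as a match and the entry was removed.
    static boolean lookalikeRemoves(Object stored, Object lookalike) {
        Map<String, Object> peersByAddress = new HashMap<>();
        peersByAddress.put("10.0.0.1:9300", stored);
        return peersByAddress.remove("10.0.0.1:9300", lookalike);
    }

    public static void main(String[] args) {
        // false: distinct instances, reference comparison
        System.out.println(lookalikeRemoves(new IdentityPeer("a"), new IdentityPeer("a")));
        // true: distinct instances, but equals() says they match
        System.out.println(lookalikeRemoves(new ValuePeer("a"), new ValuePeer("a")));
    }
}
```

For this fix, reference comparison is the desired behavior: a stale attempt should match only its own entry, never a newer attempt to the same address.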

@github-actions
Contributor

❌ Gradle check result for 729e31d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Failed to generate code suggestions for PR

@github-actions
Contributor

❌ Gradle check result for 9f7b607: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit c25ccc6

@github-actions
Contributor

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General
Impact: Low

Suggestion: Verify atomic key-value removal semantics

Using Map.remove(key, value) only removes the entry if the key is currently mapped to the specified value, which is the correct fix to prevent dropping valid peer connections. However, it's worth verifying that peersByAddress is a ConcurrentHashMap or similar map that supports this two-argument remove method with atomic semantics, since the operation is performed inside a synchronized block and the behavior should be consistent.

server/src/main/java/org/opensearch/discovery/PeerFinder.java [445]

+peersByAddress.remove(transportAddress, Peer.this);

Suggestion importance[1-10]: 3

Why: The suggestion asks to verify that peersByAddress supports the two-argument remove method, but the existing_code and improved_code are identical, meaning no actual change is proposed. Additionally, since the operation is already inside a synchronized block, atomic semantics of the map type are less critical here.
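
As context for the point about map type: since Java 8, `remove(Object key, Object value)` is a default method on the `java.util.Map` interface, so every `Map` implementation accepts the call. `ConcurrentHashMap` additionally makes it atomic by itself, while a plain map needs external locking. The sketch below is a hypothetical wrapper (`GuardedRemoveDemo`, its `mutex`, and the `String` peer ids are all invented for illustration) showing the same conditional removal under a lock with different backing maps. It does not claim which map type PeerFinder actually uses.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical wrapper for illustration; not the real PeerFinder.
final class GuardedRemoveDemo {
    private final Object mutex = new Object();
    private final Map<String, String> peersByAddress;

    GuardedRemoveDemo(Map<String, String> backingMap) {
        this.peersByAddress = backingMap;
    }

    void register(String address, String peerId) {
        synchronized (mutex) {
            peersByAddress.put(address, peerId);
        }
    }

    // Removes the entry only if it still belongs to the failing peer.
    boolean onFailure(String address, String peerId) {
        synchronized (mutex) {
            return peersByAddress.remove(address, peerId);
        }
    }

    String current(String address) {
        synchronized (mutex) {
            return peersByAddress.get(address);
        }
    }

    // Replays: register peer-1, replace with peer-2, then peer-1 fails late.
    static boolean staleFailureKeepsNewer(Map<String, String> backingMap) {
        GuardedRemoveDemo demo = new GuardedRemoveDemo(backingMap);
        demo.register("addr", "peer-1");
        demo.register("addr", "peer-2");
        boolean removed = demo.onFailure("addr", "peer-1");
        return !removed && "peer-2".equals(demo.current("addr"));
    }

    public static void main(String[] args) {
        System.out.println(staleFailureKeepsNewer(new HashMap<>()));
        System.out.println(staleFailureKeepsNewer(new LinkedHashMap<>()));
        System.out.println(staleFailureKeepsNewer(new ConcurrentHashMap<>()));
    }
}
```

Because the lock already serializes access in this sketch, the check-and-remove is consistent regardless of which of the three backing maps is used.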

@github-actions
Contributor

✅ Gradle check result for c25ccc6: SUCCESS

@codecov

codecov bot commented Mar 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.18%. Comparing base (15fcc08) to head (c25ccc6).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20961      +/-   ##
============================================
- Coverage     73.26%   73.18%   -0.09%     
- Complexity    72743    72801      +58     
============================================
  Files          5862     5871       +9     
  Lines        332558   332666     +108     
  Branches      48010    48014       +4     
============================================
- Hits         243643   243451     -192     
- Misses        69343    69768     +425     
+ Partials      19572    19447     -125     


@gagandhakrey
Author

gagandhakrey commented Mar 31, 2026

@cwperks @andrross @sandeshkr419 please review this. I think it is a critical race condition issue.
Apologies for bumping it up.
