Skip to content

Conversation

@easwars
Copy link
Contributor

@easwars easwars commented Sep 23, 2025

Fixes #8594

The above issue clearly describes the condition under which the race manifests. The changes in this PR are as follows:

  • Remove the readyCh field in the flow control that was previously used to block when waiting for flow control. Instead use a condition variable.
  • Have two bits of state inside the flow control:
    • One to indicate if there is a pending update that is waiting consumption by all watchers
    • One to indicate that the stream is closed
  • The flow control objects no longer needs to be recreated every time a new stream is created
  • The flow control object is stopped when the adsStreamImpl is stopped

This PR also makes other minor changes:

  • Fix a flaky test by ensuring that the test stream implementation unblocks from a Recv call when the underlying stream context is cancelled
  • Couple of logging improvements

RELEASE NOTES:

  • xdsclient: fix a race in the ADS stream implementation that could result in resource-not-found errors, causing the gRPC client channel to move to TransientFailure

@easwars easwars added Type: Bug Area: xDS Includes everything xDS related, including LB policies used with xDS. labels Sep 23, 2025
@easwars easwars added this to the 1.77 Release milestone Sep 23, 2025
@codecov
Copy link

codecov bot commented Sep 23, 2025

Codecov Report

❌ Patch coverage is 95.83333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.08%. Comparing base (7235bb7) to head (320dd81).
⚠️ Report is 14 commits behind head on master.

Files with missing lines Patch % Lines
internal/xds/clients/xdsclient/ads_stream.go 97.87% 1 Missing ⚠️
internal/xds/clients/xdsclient/xdsclient.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8605      +/-   ##
==========================================
+ Coverage   81.86%   82.08%   +0.21%     
==========================================
  Files         415      415              
  Lines       40694    40709      +15     
==========================================
+ Hits        33316    33416     +100     
+ Misses       5993     5910      -83     
+ Partials     1385     1383       -2     
Files with missing lines Coverage Δ
internal/xds/clients/xdsclient/ads_stream.go 84.15% <97.87%> (+1.19%) ⬆️
internal/xds/clients/xdsclient/xdsclient.go 81.65% <0.00%> (+1.31%) ⬆️

... and 26 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

if s.fc.pending.Load() {
bufferRequest := func() {
select {
case state.bufferedRequests <- struct{}{}:
Copy link
Contributor

@arjan-bal arjan-bal Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to the changes in this PR, in all the reads/writes of bufferedRequests, there's a default case. If I understand correctly, this means that none of the references are blocking, they're just checking/changing if the channel is empty (true/false). Does it make sense to replace the channel with an atomic boolean? This can be done in a separate PR to keep this PR focused on the bug fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually that is true. We can replace it with an atomic boolean and it will definitely simplify the code more. Thanks. I will do it in a follow-up.

@arjan-bal arjan-bal assigned easwars and unassigned arjan-bal Sep 30, 2025
@easwars easwars assigned arjan-bal and unassigned easwars Sep 30, 2025
fc.cond.Wait()
}

return fc.stopped
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should this should return !fc.stopped to keep the previous behaviour, or the check above if !s.fc.wait() needs to be inverted. Previously fc.wait() was returning false when the context expired. Now, fc.wait() is returning true if the flow control object was stopped.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds right to me; good catch. Let's try to make sure we have a test for this scenario if we can.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks for catching this.

All tests in https://github.com/grpc/grpc-go/blob/master/internal/xds/clients/xdsclient/test/ads_stream_flow_control_test.go fail without fixing this. Also, TestADSFlowControl_ResourceUpdates_SingleResource specifically checks for this case in

// At this point, the xDS client is shut down (and the associated transport

I think I just forgot to run the tests after making the previous set of changes and forgot to check the CI on the PR.

@arjan-bal arjan-bal assigned easwars and unassigned arjan-bal Oct 1, 2025
@dfawley dfawley assigned easwars and unassigned easwars and dfawley Oct 1, 2025
@easwars easwars assigned dfawley and arjan-bal and unassigned easwars Oct 1, 2025
@dfawley dfawley assigned easwars and unassigned dfawley Oct 2, 2025
@easwars easwars merged commit 1b8b98e into grpc:master Oct 2, 2025
15 checks passed
@easwars easwars deleted the ads_stream_flow_control branch October 2, 2025 18:31
Pranjali-2501 pushed a commit to Pranjali-2501/grpc-go that referenced this pull request Oct 6, 2025
…cking (grpc#8605)

Fixes grpc#8594

The above issue clearly describes the condition under which the race
manifests. The changes in this PR are as follows:
- Remove the `readyCh` field in the flow control that was previously
used to block when waiting for flow control. Instead use a condition
variable.
- Have two bits of state inside the flow control:
- One to indicate if there is a pending update that is waiting
consumption by all watchers
  - One to indicate that the stream is closed
- The flow control objects no longer needs to be recreated every time a
new stream is created
- The flow control object is stopped when the `adsStreamImpl` is stopped

This PR also makes other minor changes:
- Fix a flaky test by ensuring that the test stream implementation
unblocks from a `Recv` call when the underlying stream context is
cancelled
- Couple of logging improvements

RELEASE NOTES:
- xdsclient: fix a race in the ADS stream implementation that could
result in resource-not-found errors, causing the gRPC client channel to
move to `TransientFailure`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: xDS Includes everything xDS related, including LB policies used with xDS. Type: Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

xdsclient: ads stream flow control can block forever, resulting in resource-not-found errors

3 participants