-
Notifications
You must be signed in to change notification settings - Fork 4.6k
xdsclient: fix race in ADS stream flow control causing indefinite blocking #8605
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…stream to block forever
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8605 +/- ##
==========================================
+ Coverage 81.86% 82.08% +0.21%
==========================================
Files 415 415
Lines 40694 40709 +15
==========================================
+ Hits 33316 33416 +100
+ Misses 5993 5910 -83
+ Partials 1385 1383 -2
🚀 New features to boost your workflow:
|
| if s.fc.pending.Load() { | ||
| bufferRequest := func() { | ||
| select { | ||
| case state.bufferedRequests <- struct{}{}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to the changes in this PR, in all the reads/writes of bufferedRequests, there's a default case. If I understand correctly, this means that none of the references are blocking, they're just checking/changing if the channel is empty (true/false). Does it make sense to replace the channel with an atomic boolean? This can be done in a separate PR to keep this PR focused on the bug fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually that is true. We can replace it with an atomic boolean and it will definitely simplify the code more. Thanks. I will do it in a follow-up.
| fc.cond.Wait() | ||
| } | ||
|
|
||
| return fc.stopped |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should this should return !fc.stopped to keep the previous behaviour, or the check above if !s.fc.wait() needs to be inverted. Previously fc.wait() was returning false when the context expired. Now, fc.wait() is returning true if the flow control object was stopped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds right to me; good catch. Let's try to make sure we have a test for this scenario if we can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thanks for catching this.
All tests in https://github.com/grpc/grpc-go/blob/master/internal/xds/clients/xdsclient/test/ads_stream_flow_control_test.go fail without fixing this. Also, TestADSFlowControl_ResourceUpdates_SingleResource specifically checks for this case in
| // At this point, the xDS client is shut down (and the associated transport |
I think I just forgot to run the tests after making the previous set of changes and forgot to check the CI on the PR.
…cking (grpc#8605) Fixes grpc#8594 The above issue clearly describes the condition under which the race manifests. The changes in this PR are as follows: - Remove the `readyCh` field in the flow control that was previously used to block when waiting for flow control. Instead use a condition variable. - Have two bits of state inside the flow control: - One to indicate if there is a pending update that is waiting consumption by all watchers - One to indicate that the stream is closed - The flow control objects no longer needs to be recreated every time a new stream is created - The flow control object is stopped when the `adsStreamImpl` is stopped This PR also makes other minor changes: - Fix a flaky test by ensuring that the test stream implementation unblocks from a `Recv` call when the underlying stream context is cancelled - Couple of logging improvements RELEASE NOTES: - xdsclient: fix a race in the ADS stream implementation that could result in resource-not-found errors, causing the gRPC client channel to move to `TransientFailure`
Fixes #8594
The above issue clearly describes the condition under which the race manifests. The changes in this PR are as follows:
readyChfield in the flow control that was previously used to block when waiting for flow control. Instead use a condition variable.adsStreamImplis stoppedThis PR also makes other minor changes:
Recvcall when the underlying stream context is cancelledRELEASE NOTES:
TransientFailure