feat(frontend): Allow poll requests to wait for a rate limit token #7571
natemort merged 11 commits into cadence-workflow:master
Conversation
natemort
left a comment
Thanks for your work on this, very excited to see it. This will be very impactful for us to help smooth out load.
{{- if has $method.Name $pollAPIs}}
h.allowDomain({{(index $method.Params 0).Name}}, {{$ratelimitType}}, {{$domain}})
{{- else}}
if ok := h.allowDomain({{$ratelimitType}}, {{$domain}}); !ok {
Having complex logic in the templates makes them harder to maintain, and we've introduced some really bad bugs before from it.
I think it would be better structured if we:
- Add a new ratelimitTypeWorkerPoll to wrappers/ratelimited/ratelimit.go.
- Update the ratelimitTypeMap to map the poll APIs to ratelimitTypeWorkerPoll.
- Update allowDomain to return an error, rather than (bool, error). Currently we're returning the same thing in two different ways: either allowed = false or an error indicates that the request was limited, so we can convey that just through the error.
- In cases where we return allowed = false we can instead return the ServiceBusyError directly (and introduce a constant for it).
This would make the template the same for the two cases and leave the complexity in the handler:
if limitErr := h.allowDomain({{(index $method.Params 0).Name}}, {{$ratelimitType}}, {{$domain}}); limitErr != nil {
err = limitErr
return
}
case ratelimitTypeUser:
-	return h.userRateLimiter.Allow(quotas.Info{Domain: domain})
+	policy = h.userRateLimiter
case ratelimitTypeWorker:
We'll want to add another case here for ratelimitTypeWorkerPoll to use the workerRateLimiter as the policy, just like ratelimitTypeWorker.
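To make the suggestion concrete, here is a minimal sketch of that extended switch. The `policy`, `handler`, and `ratelimitType` definitions below are hypothetical stand-ins for illustration, not the actual types in wrappers/ratelimited/ratelimit.go:

```go
package main

import "fmt"

// policy is a stand-in for the handler's limiter policy type.
type policy string

// handler is a stand-in for apiHandler with its per-type limiters.
type handler struct {
	userRateLimiter   policy
	workerRateLimiter policy
}

type ratelimitType int

const (
	ratelimitTypeUser ratelimitType = iota
	ratelimitTypeWorker
	ratelimitTypeWorkerPoll
)

// policyFor routes both ratelimitTypeWorker and the new
// ratelimitTypeWorkerPoll to the worker limiter; only the wait
// behavior differs downstream.
func (h *handler) policyFor(t ratelimitType) policy {
	switch t {
	case ratelimitTypeUser:
		return h.userRateLimiter
	case ratelimitTypeWorker, ratelimitTypeWorkerPoll:
		return h.workerRateLimiter
	}
	return ""
}

func main() {
	h := &handler{userRateLimiter: "user", workerRateLimiter: "worker"}
	fmt.Println(h.policyFor(ratelimitTypeWorkerPoll)) // prints "worker"
}
```

The key point is that poll requests share the worker limiter's quota; the new constant only changes whether the handler waits for a token.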
// If context has a deadline, use Wait() to potentially wait for a token
// Otherwise, use Allow() for an immediate check
if _, hasDeadline := ctx.Deadline(); hasDeadline {
The incoming context likely always has a deadline imposed by the RPC mechanism (gRPC or TChannel), so I think this would apply the wait time to more than we'd like.
I think we want to instead apply it based on the ratelimitType:
if requestType == ratelimitTypeWorkerPoll {
waitTime := h.maxWorkerPollWait(domain)
if waitTime > 0 {
return h.waitForPolicy(ctx, waitTime, policy)
}
}
if !policy.Allow(info) {
return ErrRateLimited
}
return nil
where waitForPolicy is a new method that includes the logic you have here along with the context creation logic that's currently in the template.
and ErrRateLimited is just a global variable equal to &types.ServiceBusyError{Message: "Too many outstanding requests to the cadence service"}.
common/quotas/limiter_test.go (Outdated)
check("")
// allow bucket to refill
time.Sleep(time.Second)
These tests are clever and demonstrate the behavior, but relying on a real clock both slows them down considerably (they'd run in <1 ms otherwise) and in some scenarios can introduce flakiness when run on a particularly slow machine. We've been trying to slowly remove any tests that use the real clock or need sleeping to pass.
I'd recommend either using a mock clock and manually advancing the time, or using policies that will always fail/succeed. Then there are just 4 cases to consider with the (globalLimiter, domainLimiter) as ((ok,ok), (ok,err), (err, ok), (err,err)). You could consider a table test with the policy, similar to the PR you made for task_list_limiter.
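A sketch of the suggested table test over always-pass/always-fail policies. `stubLimiter` and `allow` are simplified stand-ins for the MultiStageRateLimiter behavior (both stages must grant a token), not the real types:

```go
package main

import "fmt"

// stubLimiter either always grants or always denies a token, so the
// two-stage logic can be exercised without a real clock or sleeping.
type stubLimiter bool

func (s stubLimiter) Allow() bool { return bool(s) }

// allow mimics the two-stage check: both the global and the domain
// limiter must grant a token for the request to proceed.
func allow(global, domain stubLimiter) bool {
	return global.Allow() && domain.Allow()
}

func main() {
	cases := []struct {
		name           string
		global, domain stubLimiter
		want           bool
	}{
		{"ok,ok", true, true, true},
		{"ok,err", true, false, false},
		{"err,ok", false, true, false},
		{"err,err", false, false, false},
	}
	for _, c := range cases {
		fmt.Printf("%s => %v (want %v)\n", c.name, allow(c.global, c.domain), c.want)
	}
}
```

With deterministic stubs like these the four (global, domain) combinations run in microseconds and cannot flake on a slow machine.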
)

-func (h *apiHandler) allowDomain(requestType ratelimitType, domain string) bool {
+func (h *apiHandler) allowDomain(ctx context.Context, requestType ratelimitType, domain string) (bool, error) {
This function currently has no tests. Especially as we're adding complex new logic here, it would be great if you could add something.
I think the best way to test it would be to:
- Call a few of the RPC methods like PollForDecisionTask to confirm that they use the correct policy and do the correct thing based on the config with regards to using Allow vs Wait.
- We can use a mock Limiter to ensure that it does exactly what we want and enforce that the right method is called.
- This would also crucially test the generated code in the template
To scaffold a new test file you just need to create wrappers/ratelimited/ratelimit_test.go, and then I'd use wrappers/metered/metered_test.go as a reference. It creates similar mocks to what we'd need here and tests the wrapper behavior by invoking a few of the specific RPC methods.
GlobalRatelimiterUpdateInterval
// FrontendMaxWorkerPollWait is the maximum duration a worker poll request (PollForActivityTask, PollForDecisionTask)
// will wait for a rate limit token before being rejected
// KeyName: frontend.maxWorkerPollWait
Naming things is hard, and I don't have a great name for this, but I think this name may be confusing. Worker poll requests are expected to wait for quite some time (see matching.longPollExpirationInterval; they hold the connection open for up to a minute by default), and this doesn't impact that behavior.
Maybe frontend.maxWorkerPollDelay better describes it? I'm honestly not sure.
Force-pushed 6fcfd2e to 98e4feb
I am unsure how to resolve the failing tests because …
c-warren
left a comment
Looks good, great work! And thank you for contributing!
FrontendMaxWorkerPollDelay: {
	KeyName: "frontend.maxWorkerPollDelay",
	Filters: []Filter{DomainName},
	Description: "FrontendMaxWorkerPollDelay is the maximum duration a worker poll request will wait for a rate limit token before being rejected",
I'd recommend making the units explicit in the description (and in the comment on the flag above). Without explicit units we leave it open to interpretation for users and future developers, and people may make incorrect assumptions.
It may also be worth documenting that this setting doesn't completely control how long a request can take - the request will complete after the minimum of the configured request timeout or this value.
Our dynamic config duration values aren't unit-less: you have to specify a unit when defining them. They follow the spec here: https://pkg.go.dev/time#ParseDuration
Ah ok - so the default value is the already parsed value, while the value specified in the dynamic config is the pre-parsed value (the duration string)?
Yeah, the default value is of type time.Duration which when written as a literal is just nanos. But it is strongly typed and has a unit. If you set it in the config you'll specify something like "1s".
}

func (h *apiHandler) waitForPolicy(ctx context.Context, waitTime time.Duration, policy quotas.Policy, domain string) error {
	ctx, cancel := context.WithTimeout(ctx, waitTime)
Following back up to the top level comments, it looks like if someone specifies the FrontendMaxWorkerPollDelay it will be in nanoseconds.
)

-func (h *apiHandler) allowDomain(requestType ratelimitType, domain string) bool {
+var ErrRateLimited = &types.ServiceBusyError{Message: "Too many outstanding requests to the cadence service"}
NIT: Making this a pointer to a type rather than the type itself means that multiple threads/requests can modify the value. This go playground shows what would happen if multiple threads were to use ErrRateLimited and modify the value: https://go.dev/play/p/AH9jnSP6kKp
In general this is unlikely! And we're somewhat hamstrung by the type making its members (the Message) accessible. However it is always better to prevent misuse of an api/type than to hope users of the type use it properly.
var ErrRateLimited = types.ServiceBusyError...
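A runnable illustration of the nit, using a local stand-in for types.ServiceBusyError: declaring the sentinel as a value means each caller works on a copy, so a local mutation cannot leak into other requests.

```go
package main

import "fmt"

// ServiceBusyError is a local stand-in for the cadence type; it has
// the same publicly mutable Message field discussed above.
type ServiceBusyError struct{ Message string }

func (e ServiceBusyError) Error() string { return e.Message }

// Declared as a value, not a pointer: assignments copy the struct.
var ErrRateLimited = ServiceBusyError{Message: "Too many outstanding requests to the cadence service"}

func main() {
	e := ErrRateLimited
	e.Message = "mutated locally" // only changes this request's copy
	fmt.Println(ErrRateLimited.Message == "mutated locally") // prints "false"
}
```

With the pointer form, the same mutation would be visible to every goroutine holding ErrRateLimited, which is the data race the go.dev/play link demonstrates.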
Signed-off-by: Joanna Lau <118241363+joannalauu@users.noreply.github.com>
Force-pushed be67312 to 65dd724
{
	name: "WorkerPoll Wait timeout falls back to Allow",
	requestType: ratelimitTypeWorkerPoll,
This test was removed because mocking the race condition caused flaky behaviour
…Error and add dependencies to template
For the test failures, I think we're missing an initialization. You correctly added it in the initial version of the pull request: 77b4f11#diff-4ecd1b6059c2d9d94a2fe2dc4091260f435ea0acfa90cd4d67ad21aae1b4f608R85. I think you just accidentally removed it.

Overall looks good to me once the tests are passing. Thanks again for working on this!
…cy timeout calculation
Force-pushed 70026eb to 63b8a18
Force-pushed 63b8a18 to dacb248
Force-pushed b719269 to de819f9
What changed?
Allow poll requests to wait until their reservation is satisfied rather than being immediately rejected, by defining a maximum wait threshold via DynamicConfig (default 0 = no waiting).
Why?
Poll requests can easily exceed rate limits when there's a backlog of work to complete. When poll requests are frequently rejected, this creates unnecessary noise as metrics indicate errors being returned. Although the phenomenon is typically harmless, these error logs may confuse users.
How did you test it?
Potential risks
With Wait() implemented in MultiStageRateLimiter, if the domain limiter allows a request but the global limiter subsequently rejects it, the token from the domain limiter cannot be returned, potentially leading to inaccurate rate limiting for that domain. However, this should not affect throughput since the global limiter is the bottleneck in this situation.

Release notes
Documentation Changes
Fixes: #7555