
feat(frontend): Allow poll requests to wait for a rate limit token #7571

Merged
natemort merged 11 commits into cadence-workflow:master from
joannalauu:feat/slow-poll-requests
Jan 7, 2026

Conversation

Contributor

@joannalauu joannalauu commented Dec 29, 2025

What changed?
Allow poll requests to wait until their reservation is satisfied rather than rejecting them immediately, by defining a maximum wait threshold via DynamicConfig (default 0 = no waiting).

Why?
Poll requests can easily exceed rate limits when there's a backlog of work to complete. When poll requests are frequently rejected, this creates unnecessary noise as metrics indicate errors being returned. Although the phenomenon is typically harmless, these error logs may confuse users.

How did you test it?

  • Unit tests

Potential risks

  • In Wait() as implemented in MultiStageRateLimiter, if the domain limiter allows a request but the global limiter subsequently rejects it, the token taken from the domain limiter cannot be returned, potentially leading to inaccurate rate limiting for that domain. However, this should not affect throughput, since the global limiter is the bottleneck in that situation.

Release notes

Documentation Changes

Fixes: #7555

Member

@natemort natemort left a comment


Thanks for your work on this, very excited to see it. This will be very impactful for us to help smooth out load.

h.allowDomain({{(index $method.Params 0).Name}}, {{$ratelimitType}}, {{$domain}})
{{- else}}
if ok := h.allowDomain({{$ratelimitType}}, {{$domain}}); !ok {
{{- if has $method.Name $pollAPIs}}
Member


Having complex logic in the templates makes them harder to maintain, and we've introduced some really bad bugs before from it.

I think it would be better structured if we:

  • Add a new ratelimitTypeWorkerPoll to wrappers/ratelimited/ratelimit.go.
  • Update the ratelimitTypeMap to map the poll apis to ratelimitTypeWorkerPoll.
  • Update allowDomain to return an error, rather than (bool, error).
    • Currently we're returning the same thing in two different ways. Either allowed = false or an error both indicate that they were limited, we can convey that just through the error.
    • In cases where we return allowed = false we can instead return the ServiceBusyError directly (and introduce a constant for it).

This would make the template the same for the two cases and leave the complexity in the handler:

if limitErr := h.allowDomain({{(index $method.Params 0).Name}}, {{$ratelimitType}}, {{$domain}}); limitErr != nil {
  err = limitErr
  return
}
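The restructuring described in the bullet points above might be sketched roughly as follows. Everything here is a simplified stand-in, not the real Cadence code: the policy interface stands in for quotas.Policy, and a plain error stands in for the ServiceBusyError constant proposed above.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRateLimited stands in for the proposed ServiceBusyError constant.
var ErrRateLimited = errors.New("too many outstanding requests to the cadence service")

type ratelimitType int

const (
	ratelimitTypeUser ratelimitType = iota
	ratelimitTypeWorker
	ratelimitTypeWorkerPoll
)

// policy is a simplified stand-in for quotas.Policy.
type policy interface {
	Allow(domain string) bool
}

type apiHandler struct {
	userRateLimiter   policy
	workerRateLimiter policy
}

// allowDomain returns only an error instead of (bool, error): a nil
// result means the request may proceed, and ErrRateLimited conveys
// "limited" without a separate boolean.
func (h *apiHandler) allowDomain(requestType ratelimitType, domain string) error {
	var p policy
	switch requestType {
	case ratelimitTypeUser:
		p = h.userRateLimiter
	case ratelimitTypeWorker, ratelimitTypeWorkerPoll:
		// ratelimitTypeWorkerPoll shares the worker policy, as suggested below.
		p = h.workerRateLimiter
	default:
		return fmt.Errorf("unknown ratelimit type: %v", requestType)
	}
	if !p.Allow(domain) {
		return ErrRateLimited
	}
	return nil
}
```

With this shape the template only ever checks `limitErr != nil`, keeping the generated code identical for both cases.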

case ratelimitTypeUser:
return h.userRateLimiter.Allow(quotas.Info{Domain: domain})
policy = h.userRateLimiter
case ratelimitTypeWorker:
Member


We'll want to add another case here for ratelimitTypeWorkerPoll to use the workerRateLimiter as the policy, just like ratelimitTypeWorker.


// If context has a deadline, use Wait() to potentially wait for a token
// Otherwise, use Allow() for an immediate check
if _, hasDeadline := ctx.Deadline(); hasDeadline {
Member


The incoming context likely always has a deadline imposed by the RPC mechanism (gRPC or TChannel), so I think this would apply the wait time to more than we'd like.

I think we want to instead apply it based on the ratelimitType:

if requestType == ratelimitTypeWorkerPoll {
  waitTime := h.maxWorkerPollWait(domain)
  if waitTime > 0 {
      return h.waitForPolicy(ctx, waitTime, policy)
  }
}
if !policy.Allow(info) {
  return ErrRateLimited
}
return nil

where waitForPolicy is a new method that includes the logic you have here along with the context creation logic that's currently in the template.

and ErrRateLimited is just a global variable equal to &types.ServiceBusyError{Message: "Too many outstanding requests to the cadence service"}.


check("")
// allow bucket to refill
time.Sleep(time.Second)
Member


These tests are clever and demonstrate the behavior, but relying on a real clock both slows them down considerably (they'd run in <1 ms otherwise) and in some scenarios can introduce flakiness when run on a particularly slow machine. We've been trying to slowly remove any tests that use the real clock or need sleeping to pass.

I'd recommend either using a mock clock and manually advancing the time, or using policies that will always fail/succeed. Then there are just 4 cases to consider with the (globalLimiter, domainLimiter) as ((ok,ok), (ok,err), (err, ok), (err,err)). You could consider a table test with the policy, similar to the PR you made for task_list_limiter.
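The always-fail/always-succeed variant could be sketched as a table test like this. The limiter type and multiStageAllow are illustrative stand-ins, not the real MultiStageRateLimiter API; the point is only that no real clock or sleeping is needed to cover the four combinations:

```go
package main

import "fmt"

// limiter is a stand-in for a single-stage rate limiter policy.
type limiter func() bool

var (
	allowAlways limiter = func() bool { return true }
	denyAlways  limiter = func() bool { return false }
)

// multiStageAllow mimics the two-stage check: the domain limiter is
// consulted first, then the global limiter.
func multiStageAllow(domain, global limiter) bool {
	return domain() && global()
}

// runCases exercises the four (domain, global) combinations from the
// review comment and reports any mismatch.
func runCases() error {
	cases := []struct {
		name           string
		domain, global limiter
		want           bool
	}{
		{"ok,ok", allowAlways, allowAlways, true},
		{"ok,err", allowAlways, denyAlways, false},
		{"err,ok", denyAlways, allowAlways, false},
		{"err,err", denyAlways, denyAlways, false},
	}
	for _, c := range cases {
		if got := multiStageAllow(c.domain, c.global); got != c.want {
			return fmt.Errorf("%s: got %v, want %v", c.name, got, c.want)
		}
	}
	return nil
}
```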

)

func (h *apiHandler) allowDomain(requestType ratelimitType, domain string) bool {
func (h *apiHandler) allowDomain(ctx context.Context, requestType ratelimitType, domain string) (bool, error) {
Member


This function currently has no tests. Especially as we're adding complex new logic here, it would be great if you could add something.

I think the best way to test it would be to:

  • Call a few of the RPC methods like PollForDecisionTask to confirm that they use the correct policy and do the correct thing based on the config with regards to using Allow vs Wait.
    • We can use a mock Limiter to ensure that it does exactly what we want and enforce that the right method is called.
    • This would also crucially test the generated code in the template

To scaffold a new test file you just need to create wrappers/ratelimited/ratelimit_test.go, and then I'd use wrappers/metered/metered_test.go as a reference. It creates similar mocks to what we'd need here and tests the wrapper behavior by invoking a few of the specific RPC methods.
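A hand-rolled sketch of the kind of assertion such a test would make; in the real test file the recording limiter would be a gomock-generated mock as in metered_test.go, and checkPoll would instead be an actual RPC method like PollForDecisionTask on the wrapped handler:

```go
package main

// recordingLimiter records which method each call used so a test can
// assert on it; it stands in for a gomock-generated limiter mock.
type recordingLimiter struct {
	calls []string
}

func (r *recordingLimiter) Allow() bool { r.calls = append(r.calls, "Allow"); return true }
func (r *recordingLimiter) Wait() error { r.calls = append(r.calls, "Wait"); return nil }

// checkPoll mimics the behavior under test: with a positive configured
// delay the handler should Wait for a token, otherwise use Allow.
// (pollDelayNanos and both method shapes are illustrative.)
func checkPoll(l *recordingLimiter, pollDelayNanos int64) {
	if pollDelayNanos > 0 {
		_ = l.Wait()
	} else {
		_ = l.Allow()
	}
}
```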

GlobalRatelimiterUpdateInterval
// FrontendMaxWorkerPollWait is the maximum duration a worker poll request (PollForActivityTask, PollForDecisionTask)
// will wait for a rate limit token before being rejected
// KeyName: frontend.maxWorkerPollWait
Member


Naming things is hard, and I don't have a great name for this, but I think this name may be confusing. Worker poll requests are expected to wait for quite some time (see matching.longPollExpirationInterval; they hold the connection open for up to a minute by default), and this doesn't impact that behavior.

Maybe frontend.maxWorkerPollDelay better describes it? I'm honestly not sure.

@joannalauu joannalauu force-pushed the feat/slow-poll-requests branch from 6fcfd2e to 98e4feb Compare January 2, 2026 09:47
@joannalauu joannalauu requested a review from natemort January 2, 2026 18:49
@joannalauu
Contributor Author

I am unsure how to resolve the failing tests, because make test and make test_e2e both pass locally.

Contributor

@c-warren c-warren left a comment


Looks good, great work! And thank you for contributing!

FrontendMaxWorkerPollDelay: {
KeyName: "frontend.maxWorkerPollDelay",
Filters: []Filter{DomainName},
Description: "FrontendMaxWorkerPollDelay is the maximum duration a worker poll request will wait for a rate limit token before being rejected",
Contributor


I'd recommend making the units explicit in the description (and in the comment on the flag above). Without explicit units we leave it open to interpretation for users & future developers, and people may make incorrect assumptions.

It may also be worth documenting that this setting doesn't completely control how long a request can take - the request will complete after the minimum of the configured request timeout or this value.

Member


Our dynamic config duration values are unit-less. You have to specify a unit when defining them. They follow the spec here: https://pkg.go.dev/time#ParseDuration

Contributor


Ah ok - so the default value is the already parsed value, while the value specified in the dynamic config is the pre-parsed value (the duration string)?

Member


Yeah, the default value is of type time.Duration which when written as a literal is just nanos. But it is strongly typed and has a unit. If you set it in the config you'll specify something like "1s".

}

func (h *apiHandler) waitForPolicy(ctx context.Context, waitTime time.Duration, policy quotas.Policy, domain string) error {
ctx, cancel := context.WithTimeout(ctx, waitTime)
Contributor


Following back up to the top level comments, it looks like if someone specifies the FrontendMaxWorkerPollDelay it will be in nanoseconds.

)

func (h *apiHandler) allowDomain(requestType ratelimitType, domain string) bool {
var ErrRateLimited = &types.ServiceBusyError{Message: "Too many outstanding requests to the cadence service"}
Contributor


NIT: Making this a pointer to a type rather than the type itself means that multiple threads/requests can modify the value. This go playground shows what would happen if multiple threads were to use ErrRateLimited and modify the value: https://go.dev/play/p/AH9jnSP6kKp

In general this is unlikely! And we're somewhat hamstrung by the type making its members (the Message) accessible. However it is always better to prevent misuse of an api/type than to hope users of the type use it properly.

var ErrRateLimited = types.ServiceBusyError...
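The aliasing concern can be demonstrated with a small stand-in for types.ServiceBusyError: declared as a value, each caller gets a copy, so one request mutating its copy cannot change what other requests see (with a pointer, the mutation would be shared, as in the linked playground):

```go
package main

// ServiceBusyError is a simplified stand-in for types.ServiceBusyError.
type ServiceBusyError struct {
	Message string
}

func (e ServiceBusyError) Error() string { return e.Message }

// Declared as a value, not a pointer: callers receive a copy.
var ErrRateLimited = ServiceBusyError{Message: "Too many outstanding requests to the cadence service"}

// mutateCopy simulates a careless caller changing the message on the
// value it received; only its local copy is affected.
func mutateCopy(e ServiceBusyError) string {
	e.Message = "changed"
	return e.Message
}
```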

Signed-off-by: Joanna Lau <118241363+joannalauu@users.noreply.github.com>
Signed-off-by: Joanna Lau <118241363+joannalauu@users.noreply.github.com>
Signed-off-by: Joanna Lau <118241363+joannalauu@users.noreply.github.com>
… type to variable

Signed-off-by: Joanna Lau <118241363+joannalauu@users.noreply.github.com>
Signed-off-by: Joanna Lau <118241363+joannalauu@users.noreply.github.com>
@joannalauu joannalauu force-pushed the feat/slow-poll-requests branch from be67312 to 65dd724 Compare January 5, 2026 20:09
Comment on lines -84 to -86
{
name: "WorkerPoll Wait timeout falls back to Allow",
requestType: ratelimitTypeWorkerPoll,
Contributor Author


This test was removed because mocking the race condition caused flaky behaviour

Member


That makes sense, thanks!

Signed-off-by: Joanna Lau <118241363+joannalauu@users.noreply.github.com>
…Error and add dependencies to template

Signed-off-by: Joanna Lau <118241363+joannalauu@users.noreply.github.com>
@natemort
Member

natemort commented Jan 6, 2026

For the test failures, I think we're missing the initialization of apiHandler's field maxWorkerPollWait.

You correctly added it in the initial version of the pull request: 77b4f11#diff-4ecd1b6059c2d9d94a2fe2dc4091260f435ea0acfa90cd4d67ad21aae1b4f608R85 . I think you just accidentally removed it.

Overall looks good to me once the tests are passing. Thanks again for working on this!

@c-warren c-warren self-assigned this Jan 6, 2026
…cy timeout calculation

Signed-off-by: Joanna Lau <118241363+joannalauu@users.noreply.github.com>
@joannalauu joannalauu force-pushed the feat/slow-poll-requests branch from 70026eb to 63b8a18 Compare January 6, 2026 21:43
Signed-off-by: Joanna Lau <118241363+joannalauu@users.noreply.github.com>
@joannalauu joannalauu force-pushed the feat/slow-poll-requests branch from 63b8a18 to dacb248 Compare January 6, 2026 21:48
Signed-off-by: Joanna Lau <118241363+joannalauu@users.noreply.github.com>
@joannalauu joannalauu force-pushed the feat/slow-poll-requests branch from b719269 to de819f9 Compare January 7, 2026 01:12
Signed-off-by: Joanna Lau <118241363+joannalauu@users.noreply.github.com>
@joannalauu joannalauu requested a review from c-warren January 7, 2026 02:55
@natemort natemort merged commit 54f1f01 into cadence-workflow:master Jan 7, 2026
41 checks passed
@joannalauu joannalauu deleted the feat/slow-poll-requests branch January 7, 2026 23:09

Development

Successfully merging this pull request may close these issues.

Slow Rather than Rate Limit Poll Requests