feat(frontend): Allow poll requests to wait for a rate limit token #7571
natemort merged 11 commits into cadence-workflow:master
Conversation
natemort
left a comment
Thanks for your work on this, very excited to see it. This will be very impactful for us to help smooth out load.
{{- if has $method.Name $pollAPIs}}
h.allowDomain({{(index $method.Params 0).Name}}, {{$ratelimitType}}, {{$domain}})
{{- else}}
if ok := h.allowDomain({{$ratelimitType}}, {{$domain}}); !ok {
Having complex logic in the templates makes them harder to maintain, and we've introduced some really bad bugs before from it.
I think it would be better structured if we:
- Add a new ratelimitTypeWorkerPoll to wrappers/ratelimited/ratelimit.go.
- Update the ratelimitTypeMap to map the poll APIs to ratelimitTypeWorkerPoll.
- Update allowDomain to return an error, rather than (bool, error). Currently we're returning the same thing in two different ways: either allowed = false or an error indicates that the request was limited, so we can convey that just through the error.
- In cases where we return allowed = false we can instead return the ServiceBusyError directly (and introduce a constant for it).
This would make the template the same for the two cases and leave the complexity in the handler:
if limitErr := h.allowDomain({{(index $method.Params 0).Name}}, {{$ratelimitType}}, {{$domain}}); limitErr != nil {
err = limitErr
return
}
case ratelimitTypeUser:
-	return h.userRateLimiter.Allow(quotas.Info{Domain: domain})
+	policy = h.userRateLimiter
case ratelimitTypeWorker:
We'll want to add another case here for ratelimitTypeWorkerPoll to use the workerRateLimiter as the policy, just like ratelimitTypeWorker.
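To make the suggestion concrete, here is a minimal sketch of that extended switch. The `policy`, `handler`, and `ratelimitType` definitions below are hypothetical stand-ins for illustration, not the actual types in wrappers/ratelimited/ratelimit.go:

```go
package main

import "fmt"

// policy is a stand-in for the handler's limiter policy type.
type policy string

// handler is a stand-in for apiHandler with its per-type limiters.
type handler struct {
	userRateLimiter   policy
	workerRateLimiter policy
}

type ratelimitType int

const (
	ratelimitTypeUser ratelimitType = iota
	ratelimitTypeWorker
	ratelimitTypeWorkerPoll
)

// policyFor routes both ratelimitTypeWorker and the new
// ratelimitTypeWorkerPoll to the worker limiter; only the wait
// behavior differs downstream.
func (h *handler) policyFor(t ratelimitType) policy {
	switch t {
	case ratelimitTypeUser:
		return h.userRateLimiter
	case ratelimitTypeWorker, ratelimitTypeWorkerPoll:
		return h.workerRateLimiter
	}
	return ""
}

func main() {
	h := &handler{userRateLimiter: "user", workerRateLimiter: "worker"}
	fmt.Println(h.policyFor(ratelimitTypeWorkerPoll)) // prints "worker"
}
```

The key point is that poll requests share the worker limiter's quota; the new constant only changes whether the handler waits for a token.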
// If context has a deadline, use Wait() to potentially wait for a token
// Otherwise, use Allow() for an immediate check
if _, hasDeadline := ctx.Deadline(); hasDeadline {
The incoming context likely always has a deadline imposed by the RPC mechanism (gRPC or TChannel), so I think this would apply the wait time to more than we'd like.
I think we want to instead apply it based on the ratelimitType:
if requestType == ratelimitTypeWorkerPoll {
waitTime := h.maxWorkerPollWait(domain)
if waitTime > 0 {
return h.waitForPolicy(ctx, waitTime, policy)
}
}
if !policy.Allow(info) {
return ErrRateLimited
}
return nil
where waitForPolicy is a new method that includes the logic you have here along with the context creation logic that's currently in the template.
and ErrRateLimited is just a global variable equal to &types.ServiceBusyError{Message: "Too many outstanding requests to the cadence service"}.
common/quotas/limiter_test.go (Outdated)
check("")
// allow bucket to refill
time.Sleep(time.Second)
These tests are clever and demonstrate the behavior, but relying on a real clock both slows them down considerably (they'd run in <1 ms otherwise) and in some scenarios can introduce flakiness when run on a particularly slow machine. We've been trying to slowly remove any tests that use the real clock or need sleeping to pass.
I'd recommend either using a mock clock and manually advancing the time, or using policies that will always fail/succeed. Then there are just 4 cases to consider with the (globalLimiter, domainLimiter) as ((ok,ok), (ok,err), (err, ok), (err,err)). You could consider a table test with the policy, similar to the PR you made for task_list_limiter.
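A sketch of the suggested table test over always-pass/always-fail policies. `stubLimiter` and `allow` are simplified stand-ins for the MultiStageRateLimiter behavior (both stages must grant a token), not the real types:

```go
package main

import "fmt"

// stubLimiter either always grants or always denies a token, so the
// two-stage logic can be exercised without a real clock or sleeping.
type stubLimiter bool

func (s stubLimiter) Allow() bool { return bool(s) }

// allow mimics the two-stage check: both the global and the domain
// limiter must grant a token for the request to proceed.
func allow(global, domain stubLimiter) bool {
	return global.Allow() && domain.Allow()
}

func main() {
	cases := []struct {
		name           string
		global, domain stubLimiter
		want           bool
	}{
		{"ok,ok", true, true, true},
		{"ok,err", true, false, false},
		{"err,ok", false, true, false},
		{"err,err", false, false, false},
	}
	for _, c := range cases {
		fmt.Printf("%s => %v (want %v)\n", c.name, allow(c.global, c.domain), c.want)
	}
}
```

With deterministic stubs like these the four (global, domain) combinations run in microseconds and cannot flake on a slow machine.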
)

-func (h *apiHandler) allowDomain(requestType ratelimitType, domain string) bool {
+func (h *apiHandler) allowDomain(ctx context.Context, requestType ratelimitType, domain string) (bool, error) {
This function currently has no tests. Especially as we're adding complex new logic here, it would be great if you could add something.
I think the best way to test it would be to:
- Call a few of the RPC methods like PollForDecisionTask to confirm that they use the correct policy and do the correct thing based on the config with regards to using Allow vs Wait.
- We can use a mock Limiter to ensure that it does exactly what we want and enforce that the right method is called.
- This would also crucially test the generated code in the template
To scaffold a new test file you just need to create wrappers/ratelimited/ratelimit_test.go, and then I'd use wrappers/metered/metered_test.go as a reference. It creates similar mocks to what we'd need here and tests the wrapper behavior by invoking a few of the specific RPC methods.
GlobalRatelimiterUpdateInterval
// FrontendMaxWorkerPollWait is the maximum duration a worker poll request (PollForActivityTask, PollForDecisionTask)
// will wait for a rate limit token before being rejected
// KeyName: frontend.maxWorkerPollWait
Naming things is hard, and I don't have a great name for this, but I think this name may be confusing. Worker poll requests are expected to wait for quite some time (see matching.longPollExpirationInterval; they hold the connection open for up to a minute by default), and this doesn't impact that behavior.
Maybe frontend.maxWorkerPollDelay better describes it? I'm honestly not sure.
Force-pushed 6fcfd2e to 98e4feb
I am unsure how to resolve the failing tests because …
c-warren
left a comment
Looks good, great work! And thank you for contributing!
FrontendMaxWorkerPollDelay: {
	KeyName: "frontend.maxWorkerPollDelay",
	Filters: []Filter{DomainName},
	Description: "FrontendMaxWorkerPollDelay is the maximum duration a worker poll request will wait for a rate limit token before being rejected",
I'd recommend making the units explicit in the description (and in the comment on the flag above). Without explicit units we leave it open to interpretation for users and future developers, and people may make incorrect assumptions.
It may also be worth documenting that this setting doesn't completely control how long a request can take - the request will complete after the minimum of the configured request timeout or this value.
Our dynamic config duration values aren't unit-less: you have to specify a unit when defining them. They follow the spec here: https://pkg.go.dev/time#ParseDuration
Ah ok - so the default value is the already parsed value, while the value specified in the dynamic config is the pre-parsed value (the duration string)?
Yeah, the default value is of type time.Duration which when written as a literal is just nanos. But it is strongly typed and has a unit. If you set it in the config you'll specify something like "1s".
}

func (h *apiHandler) waitForPolicy(ctx context.Context, waitTime time.Duration, policy quotas.Policy, domain string) error {
	ctx, cancel := context.WithTimeout(ctx, waitTime)
Following back up to the top level comments, it looks like if someone specifies the FrontendMaxWorkerPollDelay it will be in nanoseconds.
)

-func (h *apiHandler) allowDomain(requestType ratelimitType, domain string) bool {
+var ErrRateLimited = &types.ServiceBusyError{Message: "Too many outstanding requests to the cadence service"}
NIT: Making this a pointer to a type rather than the type itself means that multiple threads/requests can modify the value. This go playground shows what would happen if multiple threads were to use ErrRateLimited and modify the value: https://go.dev/play/p/AH9jnSP6kKp
In general this is unlikely! And we're somewhat hamstrung by the type making its members (the Message) accessible. However it is always better to prevent misuse of an api/type than to hope users of the type use it properly.
var ErrRateLimited = types.ServiceBusyError...
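A runnable illustration of the nit, using a local stand-in for types.ServiceBusyError: declaring the sentinel as a value means each caller works on a copy, so a local mutation cannot leak into other requests.

```go
package main

import "fmt"

// ServiceBusyError is a local stand-in for the cadence type; it has
// the same publicly mutable Message field discussed above.
type ServiceBusyError struct{ Message string }

func (e ServiceBusyError) Error() string { return e.Message }

// Declared as a value, not a pointer: assignments copy the struct.
var ErrRateLimited = ServiceBusyError{Message: "Too many outstanding requests to the cadence service"}

func main() {
	e := ErrRateLimited
	e.Message = "mutated locally" // only changes this request's copy
	fmt.Println(ErrRateLimited.Message == "mutated locally") // prints "false"
}
```

With the pointer form, the same mutation would be visible to every goroutine holding ErrRateLimited, which is the data race the go.dev/play link demonstrates.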
Signed-off-by: Joanna Lau <118241363+joannalauu@users.noreply.github.com>
Force-pushed be67312 to 65dd724
{
	name: "WorkerPoll Wait timeout falls back to Allow",
	requestType: ratelimitTypeWorkerPoll,
This test was removed because mocking the race condition caused flaky behaviour
…Error and add dependencies to template
For the test failures, I think we're missing an initialization. You correctly added it in the initial version of the pull request: 77b4f11#diff-4ecd1b6059c2d9d94a2fe2dc4091260f435ea0acfa90cd4d67ad21aae1b4f608R85. I think you just accidentally removed it.

Overall looks good to me once the tests are passing. Thanks again for working on this!
…cy timeout calculation
Force-pushed 70026eb to 63b8a18
Force-pushed 63b8a18 to dacb248
Force-pushed b719269 to de819f9
What changed?
Allow poll requests to wait until their reservation is satisfied rather than being immediately rejected, by defining a maximum wait threshold via DynamicConfig (default 0 = no waiting).
Why?
Poll requests can easily exceed rate limits when there's a backlog of work to complete. When poll requests are frequently rejected, this creates unnecessary noise as metrics indicate errors being returned. Although the phenomenon is typically harmless, these error logs may confuse users.
How did you test it?
Potential risks
With Wait() implemented in MultiStageRateLimiter, if the domain limiter allows a request but the global limiter subsequently rejects it, the token from the domain limiter cannot be returned, potentially leading to inaccurate rate limiting for that domain. However, this should not affect throughput since the global limiter is the bottleneck in this situation.

Release notes
Documentation Changes
Fixes: #7555