feat: improve the default setup for hub agent leader election to allow better scalability/stability #414

Merged
michaelawyu merged 5 commits into kubefleet-dev:main from michaelawyu:feat/tweak-leader-election-settings
Mar 4, 2026
michaelawyu (Member) commented Jan 15, 2026

Description of your changes

This PR makes the following changes:

  • Use a separate config (rate limiter setup) for leader election, so that lease renewals will not be starved by regular controllers under heavy load.
  • Apply a default lease duration of 60s (increased from 15s), a default renew deadline of 45s (increased from 10s), and a default retry period of 5s (increased from 2s). This helps the hub agent hold onto leadership longer, send fewer renewal requests, and have a better chance at renewing leases (9 attempts in 45 seconds vs. ~5 attempts in 10 seconds). Recent performance tests showed that unexpected leader election failures can lead to frequent agent restarts under heavy load.
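
To illustrate the first bullet, here is a minimal sketch of why a dedicated rate limiter helps. It does not reproduce the hub agent's client setup: the `bucket` type and the `renewAfterBurst` helper are hypothetical, and the token bucket is simplified (no refill) to keep the behavior deterministic.

```go
package main

import "fmt"

// bucket is a deliberately simplified token bucket without refill. It stands
// in for a client-side rate limiter and is NOT the limiter the hub agent uses.
type bucket struct{ tokens int }

// tryAcquire consumes one token if available and reports whether it succeeded.
func (b *bucket) tryAcquire() bool {
	if b.tokens == 0 {
		return false
	}
	b.tokens--
	return true
}

// renewAfterBurst simulates controllers sending `burst` requests through the
// controllers bucket, followed by one lease renewal through the lease bucket.
// Passing the same bucket for both arguments models a shared client config.
func renewAfterBurst(controllers, lease *bucket, burst int) bool {
	for i := 0; i < burst; i++ {
		controllers.tryAcquire()
	}
	return lease.tryAcquire()
}

func main() {
	// Shared limiter: the controller burst drains the tokens the renewal needs.
	shared := &bucket{tokens: 10}
	fmt.Println("shared limiter, renewal sent:", renewAfterBurst(shared, shared, 50))

	// Separate limiters: the renewal has its own budget and is not starved.
	ctrl, lease := &bucket{tokens: 10}, &bucket{tokens: 10}
	fmt.Println("separate limiters, renewal sent:", renewAfterBurst(ctrl, lease, 50))
}
```

With a shared bucket the renewal is rejected after the burst; with its own bucket it goes through regardless of controller traffic.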

I have:

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

N/A

Special notes for your reviewer

Please refer to the Jan 2026 performance test report for more information.

Signed-off-by: michaelawyu <chenyu1@microsoft.com>
codecov bot commented Jan 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Signed-off-by: michaelawyu <chenyu1@microsoft.com>
Signed-off-by: michaelawyu <chenyu1@microsoft.com>
  &o.LeaseDuration.Duration,
  "leader-lease-duration",
- 15*time.Second,
+ 90*time.Second,
Member

Have we tested this value? It seems a bit high.

Member Author

Hi Ryan! In the performance test we used an even higher value (180s), but if 90 seconds is a concern, would you like me to lower this to 60 seconds?

The requirement is basically that the lease duration must be higher than the renew deadline.

Member Author

Or, to be on the safer side, would it be better to roll this out in multiple stages?

e.g.,

for now, set things to 30 seconds/25 seconds/5 seconds; once that rolls out fully,
next, set things to 45 seconds/40 seconds/5 seconds; ...

It shouldn't affect our performance test plan; I could apply an override in that specific environment.

Member

Let's try 60/45/5?

Member Author

Got it. I have tweaked the numbers.

Signed-off-by: michaelawyu <chenyu1@microsoft.com>
michaelawyu merged commit 3d066d8 into kubefleet-dev:main on Mar 4, 2026
16 checks passed
michaelawyu deleted the feat/tweak-leader-election-settings branch on March 4, 2026 22:24