fix: do not register event handler for the same GVR multiple times#396
Merged
weng271190436 merged 2 commits intokubefleet-dev:mainfrom Jan 12, 2026
Merged
Conversation
added 2 commits
January 6, 2026 20:33
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Member
ryanzhang-oss
left a comment
There was a problem hiding this comment.
good catch, I though AddEventHandler was idempotent.
jwtty
reviewed
Jan 7, 2026
ryanzhang-oss
approved these changes
Jan 12, 2026
1 task
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description of your changes
@michaelawyu found that I introduced a regression with #380 where I split resource informer creation and event handler addition. There used to be deduplication logic guarding both. After I split I didn't add deduplication logic to event handler addition.
How did we figure out the issue
While load testing, @michaelawyu observed that the memory consumption of hub agent climbed to 4Gi after a few hours -> with pprof and prometheus he found that there are 140k goroutines and determined that this is a goroutine leak -> he then determined the last commit where the issue didn't exist, which was the commit right before #380 -> he then noticed that the majority of the leaked goroutines are from processorListeners, which is due to me periodically registering new event handlers
I then re-deployed latest main branch to the test clusters, and with pprof I determined the goroutine that spun off 7000+ goroutines within 30 minutes
goroutine 2488 [select]: k8s.io/apimachinery/pkg/util/wait.BackoffUntilWithContext({0x2f96270, 0xc00047f3b0}, 0xc000c07f70, {0x2f771e0, 0xc004ecb5c0}, 0x1) /go/pkg/mod/k8s.io/apimachinery@v0.34.1/pkg/util/wait/backoff.go:267 +0x185 k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x2f96270, 0xc00047f3b0}, 0xc003115770, 0x6fc23ac00, 0x0, 0x1) /go/pkg/mod/k8s.io/apimachinery@v0.34.1/pkg/util/wait/backoff.go:223 +0x8f k8s.io/apimachinery/pkg/util/wait.UntilWithContext(...) /go/pkg/mod/k8s.io/apimachinery@v0.34.1/pkg/util/wait/backoff.go:172 github.com/kubefleet-dev/kubefleet/pkg/resourcewatcher.(*ChangeDetector).discoverAPIResourcesLoop(0x2c22316174656231?, {0x2f96270?, 0xc00047f3b0?}, 0x446b636f6c423156?, {0x2f92fb0?, 0xc0020c38c0?}) /workspace/pkg/resourcewatcher/change_dector.go:134 +0x5b created by github.com/kubefleet-dev/kubefleet/pkg/resourcewatcher.(*ChangeDetector).Start in goroutine 1668 /workspace/pkg/resourcewatcher/change_dector.go:112 +0x355Every go routine spun off from 2488 look like either of below 2 with "in goroutine 2488". So I was able to correlate the 7000 goroutines 2488 and
discoverAPIResourcesLoop.After knowing that
discoverAPIResourcesLoopfrom change_dector.go is the issue, I noticed that discoverAPIResourcesLoop calls discoverResources which calls AddEventHandlerToInformer on the same resourcesToWatch repeatedly.This evidence supports @michaelawyu 's theory.
So the fix is to make
AddEventHandlerToInformertrack resources where handlers were registered and do not register duplicate handlers.How I tested the fix
Deployed the fix to AKS cluster where the issue was originally discovered and wait for 1 hour to verify goroutine count doesn't grow.
Fixes #
I have:
make reviewableto ensure this PR is ready for review.How has this code been tested
Special notes for your reviewer