Skip to content

fix: do not register event handler for the same GVR multiple times#396

Merged
weng271190436 merged 2 commits intokubefleet-dev:mainfrom
weng271190436:weiweng/fix-event-handler-registering
Jan 12, 2026
Merged

fix: do not register event handler for the same GVR multiple times#396
weng271190436 merged 2 commits intokubefleet-dev:mainfrom
weng271190436:weiweng/fix-event-handler-registering

Conversation

@weng271190436
Copy link
Member

@weng271190436 weng271190436 commented Jan 6, 2026

Description of your changes

@michaelawyu found that I introduced a regression with #380 where I split resource informer creation and event handler addition. There used to be deduplication logic guarding both. After I split I didn't add deduplication logic to event handler addition.

How did we figure out the issue

While load testing, @michaelawyu observed that the memory consumption of hub agent climbed to 4Gi after a few hours -> with pprof and prometheus he found that there are 140k goroutines and determined that this is a goroutine leak -> he then determined the last commit where the issue didn't exist, which was the commit right before #380 -> he then noticed that the majority of the leaked goroutines are from processorListeners, which is due to me periodically registering new event handlers

I then re-deployed latest main branch to the test clusters, and with pprof I determined the goroutine that spun off 7000+ goroutines within 30 minutes

goroutine 2488 [select]:
k8s.io/apimachinery/pkg/util/wait.BackoffUntilWithContext({0x2f96270, 0xc00047f3b0}, 0xc000c07f70, {0x2f771e0, 0xc004ecb5c0}, 0x1)
	/go/pkg/mod/k8s.io/apimachinery@v0.34.1/pkg/util/wait/backoff.go:267 +0x185
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x2f96270, 0xc00047f3b0}, 0xc003115770, 0x6fc23ac00, 0x0, 0x1)
	/go/pkg/mod/k8s.io/apimachinery@v0.34.1/pkg/util/wait/backoff.go:223 +0x8f
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(...)
	/go/pkg/mod/k8s.io/apimachinery@v0.34.1/pkg/util/wait/backoff.go:172
github.com/kubefleet-dev/kubefleet/pkg/resourcewatcher.(*ChangeDetector).discoverAPIResourcesLoop(0x2c22316174656231?, {0x2f96270?, 0xc00047f3b0?}, 0x446b636f6c423156?, {0x2f92fb0?, 0xc0020c38c0?})
	/workspace/pkg/resourcewatcher/change_dector.go:134 +0x5b
created by github.com/kubefleet-dev/kubefleet/pkg/resourcewatcher.(*ChangeDetector).Start in goroutine 1668
	/workspace/pkg/resourcewatcher/change_dector.go:112 +0x355

Every go routine spun off from 2488 look like either of below 2 with "in goroutine 2488". So I was able to correlate the 7000 goroutines 2488 and discoverAPIResourcesLoop.

goroutine 2331 [chan receive, 17 minutes]:
k8s.io/client-go/tools/cache.(*processorListener).run(0xc00176e460)
	/go/pkg/mod/k8s.io/client-go@v0.34.1/tools/cache/shared_informer.go:1062 +0x4e
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/go/pkg/mod/k8s.io/apimachinery@v0.34.1/pkg/util/wait/wait.go:72 +0x4c
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 2488
	/go/pkg/mod/k8s.io/apimachinery@v0.34.1/pkg/util/wait/wait.go:70 +0x73
goroutine 2332 [select, 17 minutes]:
k8s.io/client-go/tools/cache.(*processorListener).pop(0xc00176e460)
	/go/pkg/mod/k8s.io/client-go@v0.34.1/tools/cache/shared_informer.go:1031 +0x129
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/go/pkg/mod/k8s.io/apimachinery@v0.34.1/pkg/util/wait/wait.go:72 +0x4c
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 2488
	/go/pkg/mod/k8s.io/apimachinery@v0.34.1/pkg/util/wait/wait.go:70 +0x73

After knowing that discoverAPIResourcesLoop from change_dector.go is the issue, I noticed that discoverAPIResourcesLoop calls discoverResources which calls AddEventHandlerToInformer on the same resourcesToWatch repeatedly.

This evidence supports @michaelawyu 's theory.

So the fix is to make AddEventHandlerToInformer track resources where handlers were registered and do not register duplicate handlers.

How I tested the fix

Deployed the fix to AKS cluster where the issue was originally discovered and wait for 1 hour to verify goroutine count doesn't grow.

Screenshot 2026-01-06 at 4 10 27 PM

Fixes #

I have:

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

Special notes for your reviewer

Wei Weng added 2 commits January 6, 2026 20:33
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
Signed-off-by: Wei Weng <Wei.Weng@microsoft.com>
@codecov
Copy link

codecov bot commented Jan 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Member

@ryanzhang-oss ryanzhang-oss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, I though AddEventHandler was idempotent.

@weng271190436 weng271190436 merged commit 52aa16c into kubefleet-dev:main Jan 12, 2026
21 of 22 checks passed
@weng271190436 weng271190436 deleted the weiweng/fix-event-handler-registering branch January 12, 2026 17:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants