node: refresh CNI kubeconfig promptly on SA token rotation#12595
Open
skoryk-oleksandr wants to merge 4 commits intoprojectcalico:masterfrom
Open
node: refresh CNI kubeconfig promptly on SA token rotation#12595skoryk-oleksandr wants to merge 4 commits intoprojectcalico:masterfrom
skoryk-oleksandr wants to merge 4 commits intoprojectcalico:masterfrom
Conversation
The cni-config-monitor subprocess wakes up on its own timer between refreshes (default 6-12 hours). If the CNI kubeconfig token is externally invalidated during that sleep — for example by a ServiceAccount signing-key rotation — every pod sandbox create fails with "error getting ClusterInformation: connection is unauthorized: Unauthorized" until the next timer fires or someone restarts the calico-node pod. Two changes close this gap: 1. TokenRefresher.Run now installs an fsnotify watcher on the directory containing the pod's projected ServiceAccount token. Kubelet re-projects this token whenever it rotates, and the watcher wakes the refresh loop immediately so UpdateToken runs and the on-disk kubeconfig is rewritten with a freshly-minted token. If watcher setup fails, the refresher falls back to timer-only behaviour (unchanged semantics). 2. writeKubeconfig now writes atomically via a temp file in the same directory + rename. The previous os.WriteFile was truncate-then-write, so a CNI plugin invocation concurrent with a refresh could in principle read a partially-written kubeconfig. Broadened the clientset parameter on NewTokenRefresher from *kubernetes.Clientset to kubernetes.Interface to make the type testable with fake clients. Existing callers are unaffected. Added unit tests for both changes. Fixes CORE-12727. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR improves the reliability of Calico node’s CNI kubeconfig refresh loop by reacting promptly to projected ServiceAccount token rotations (instead of waiting hours for the next scheduled refresh) and by making kubeconfig updates atomic to avoid partial reads during concurrent CNI executions.
Changes:
- Add an
fsnotifywatch on the projected ServiceAccount token directory to wake the refresh loop immediately on token rotation. - Switch kubeconfig writes to an atomic temp-file +
fsync+ rename flow. - Add unit tests covering the atomic write behavior and the watcher-driven wake-up / fallback behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
node/pkg/cni/token_watch.go |
Adds token-directory watching to accelerate refresh on SA token rotation; introduces atomic kubeconfig write helper. |
node/pkg/cni/token_watch_internal_test.go |
Adds unit tests for atomic writes and watcher/timer behavior in TokenRefresher.Run(). |
Per fsnotify's contract, both Events and Errors channels must be drained. Leaving Errors unread can block the watcher's internal goroutine and stop further events from being delivered — which would silently break the fast path added in the previous commit. Spawn a drainer goroutine on startup that logs errors and exits when the watcher is closed. cleanup() now waits for it to finish so there's no goroutine leak when Run() returns. Spotted by Copilot reviewer on projectcalico#12595. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Added five tests addressing gaps surfaced during review: - TestWriteFileAtomic_ConcurrentReaderNeverSeesPartialContent: the property test for the atomic write fix. A reader goroutine reads the kubeconfig continuously while a writer rewrites it 100 times with varying content length; the reader must always see one of the known valid revisions (never empty, truncated, or mid-write). - TestWriteFileAtomic_ErrorsWhenDirectoryMissing: error path of os.CreateTemp when the parent dir does not exist. - TestDrainEvents_IsNonBlockingAndDrainsBurst: direct unit test on the drainEvents helper for both full and empty channels. - TestTokenRefresher_DegradesGracefullyWhenWatcherChannelCloses: covers the "!ok" branch of Run's select — if the fsnotify watcher dies and its events channel closes, the timer loop must continue operating. Uses an injectable watcher factory with a pre-closed channel so the condition is triggered deterministically. - TestTokenRefresher_RecoversAfterTransientUpdateTokenError: the first UpdateToken call fails, the second succeeds; verifies the existing 5-10s retry path actually delivers a token after recovery. To make the watcher branch testable without real fsnotify plumbing, the TokenRefresher now exposes a watcherFactory field that defaults to the existing startTokenFileWatcher method. Production behaviour unchanged. All tests stable under `go test -count=20 -race` and `-count=50` (non-race). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The atomic write change was defensive hygiene, not a fix for the reported bug. The race it prevents — a CNI plugin reading a partially-written kubeconfig during the microsecond window between truncate and write — has no evidence of affecting the reported failure mode (which is a 401 from a stale token, not a parse error). Keep this PR focused on the fsnotify wake-up path that addresses the customer symptom. Revert to the original os.WriteFile and drop the five TestWriteFileAtomic_* tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a multi-hour blind spot in
cni-config-monitorwhere an externally-invalidated CNI kubeconfig token goes undetected between scheduled refreshes, leading toplugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorizeduntil a calico-node pod restart.What this changes
TokenRefresher.Runnow installs anfsnotifywatcher on the directory containing the pod's projected ServiceAccount token. When kubelet rotates that token, the refresh loop wakes up immediately, mints a fresh token viaTokenRequest, and rewrites/etc/cni/net.d/calico-kubeconfig. If watcher setup fails the refresher falls back to the original timer-only behaviour. BothEventsandErrorschannels are drained per fsnotify's contract.Why now
The
TokenRefresher.Runloop sleeps for(tokenExp - now) / 4+ up to 100% jitter between refreshes. With the default 24h token validity that is 6-12 hours. During that sleep there is no mechanism that notices a token has become invalid externally. The two common triggers in practice are:End-to-end verification on a kind cluster
Brought up a kind cluster running this branch and ran through the scenario the fix targets.
Fix activation confirmed — on pod startup, calico-node logs show the fsnotify watcher is installed:
fsnotify fast-path demonstration — triggered a projected-token change in the watched directory on a healthy cluster:
jti59030a44-…45043310-…(new token)Log line, verbatim:
Same trigger on master (without this branch) produced no response — no log, no kubeconfig rewrite, no token change. That's the blind spot the fix closes.
Underlying bug reproduction — also reproduced the original failure end-to-end on the kind cluster by regenerating the SA signing key and restarting the apiserver. Pod creation fails with the exact error from the original report:
Honest scope
This is a necessary-but-not-sufficient fix. On its own it changes worst-case automatic recovery from "up to 6-12 hours (or manual pod restart)" to "at most one kubelet projection cycle". With default Kubernetes settings the projected SA token lifetime is ~1 hour and kubelet refreshes at 80 % of that, so the practical upper bound after this fix is ~48 minutes (average ~24 minutes, best case seconds).
That is still slow for time-sensitive workloads. Operators who need a tighter bound should additionally:
expirationSecondson the calico-node DaemonSet (e.g. 600s → refresh at ~8 minutes). That change belongs in the operator/Helm chart, not here.UpdateTokenerrors so kubelet restarts the pod. Not included in this PR.What this fix does NOT do:
Unauthorizederrors when the SA signing key rotates. Every workload holding a projected SA token sees 401s until kubelet re-projects.expirationSeconds, set by the operator.Tests
Added
node/pkg/cni/token_watch_internal_test.gowith:TestTokenRefresher_WakesUpOnTokenFileChange— configures a 24h token (normally 6-12h sleep), writes a file in the watched dir, asserts a secondUpdateTokenfires within 3s. Without the fsnotify fast path this would time out.TestTokenRefresher_FallsBackWhenWatcherSetupFails— verifies the graceful timer-only fallback when the watched directory doesn't exist.TestTokenRefresher_DegradesGracefullyWhenWatcherChannelCloses— uses an injected watcher factory returning a pre-closed channel, verifies the timer continues to drive refreshes.TestTokenRefresher_RecoversAfterTransientUpdateTokenError— firstUpdateTokencall fails, second succeeds; verifies the existing 5-10s retry path delivers a token after recovery.TestDrainEvents_IsNonBlockingAndDrainsBurst— exercises thedrainEventshelper on both full and empty channels.All pass under
-racewith-count=20and stable under-count=50.Release note: