node: refresh CNI kubeconfig promptly on SA token rotation by skoryk-oleksandr · Pull Request #12595 · projectcalico/calico

skoryk-oleksandr · 2026-04-22T22:21:57Z

Summary

Fixes a multi-hour blind spot in cni-config-monitor where an externally-invalidated CNI kubeconfig token goes undetected between scheduled refreshes, leading to plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized until a calico-node pod restart.

What this changes

TokenRefresher.Run now installs an fsnotify watcher on the directory containing the pod's projected ServiceAccount token. When kubelet rotates that token, the refresh loop wakes up immediately, mints a fresh token via TokenRequest, and rewrites /etc/cni/net.d/calico-kubeconfig. If watcher setup fails the refresher falls back to the original timer-only behaviour. Both Events and Errors channels are drained per fsnotify's contract.

Why now

The TokenRefresher.Run loop sleeps for (tokenExp - now) / 4 + up to 100% jitter between refreshes. With the default 24h token validity that is 6-12 hours. During that sleep there is no mechanism that notices a token has become invalid externally. The two common triggers in practice are:

ServiceAccount signing-key rotation (reproduced on kind; the failure is immediate and persistent until calico-node restarts).
Any event that makes kubelet re-project the SA token before the scheduled refresh — picking it up promptly keeps the CNI kubeconfig as fresh as possible.

End-to-end verification on a kind cluster

Brought up a kind cluster running this branch and ran through the scenario the fix targets.

Fix activation confirmed — on pod startup, calico-node logs show the fsnotify watcher is installed:

[INFO] node-services/token_watch.go: Watching service account token directory for rotation

fsnotify fast-path demonstration — triggered a projected-token change in the watched directory on a healthy cluster:

	Before	After
CNI token `jti`	`59030a44-…`	`45043310-…` (new token)
CNI kubeconfig mtime	20:56:07	21:01:47
Latency trigger → new kubeconfig written		~30 ms

Log line, verbatim:

[INFO] node-services/token_watch.go: Projected service account token changed; refreshing CNI kubeconfig immediately

Same trigger on master (without this branch) produced no response — no log, no kubeconfig rewrite, no token change. That's the blind spot the fix closes.

Underlying bug reproduction — also reproduced the original failure end-to-end on the kind cluster by regenerating the SA signing key and restarting the apiserver. Pod creation fails with the exact error from the original report:

Failed to create pod sandbox: ... plugin type="calico" failed (add):
  error getting ClusterInformation: connection is unauthorized: Unauthorized

Honest scope

This is a necessary-but-not-sufficient fix. On its own it changes worst-case automatic recovery from "up to 6-12 hours (or manual pod restart)" to "at most one kubelet projection cycle". With default Kubernetes settings the projected SA token lifetime is ~1 hour and kubelet refreshes at 80 % of that, so the practical upper bound after this fix is ~48 minutes (average ~24 minutes, best case seconds).

That is still slow for time-sensitive workloads. Operators who need a tighter bound should additionally:

Shorten the projected SA token's expirationSeconds on the calico-node DaemonSet (e.g. 600s → refresh at ~8 minutes). That change belongs in the operator/Helm chart, not here.
Optionally add a health check that fails liveness after N minutes of UpdateToken errors so kubelet restarts the pod. Not included in this PR.

What this fix does NOT do:

Eliminate transient Unauthorized errors when the SA signing key rotates. Every workload holding a projected SA token sees 401s until kubelet re-projects.
Speed up kubelet's projected-token refresh cadence. That is controlled by the pod spec's expirationSeconds, set by the operator.

Tests

Added node/pkg/cni/token_watch_internal_test.go with:

TestTokenRefresher_WakesUpOnTokenFileChange — configures a 24h token (normally 6-12h sleep), writes a file in the watched dir, asserts a second UpdateToken fires within 3s. Without the fsnotify fast path this would time out.
TestTokenRefresher_FallsBackWhenWatcherSetupFails — verifies the graceful timer-only fallback when the watched directory doesn't exist.
TestTokenRefresher_DegradesGracefullyWhenWatcherChannelCloses — uses an injected watcher factory returning a pre-closed channel, verifies the timer continues to drive refreshes.
TestTokenRefresher_RecoversAfterTransientUpdateTokenError — first UpdateToken call fails, second succeeds; verifies the existing 5-10s retry path delivers a token after recovery.
TestDrainEvents_IsNonBlockingAndDrainsBurst — exercises the drainEvents helper on both full and empty channels.

All pass under -race with -count=20 and stable under -count=50.

Release note:

calico/node now refreshes the CNI plugin's kubeconfig immediately when the pod's projected ServiceAccount token is rotated, closing a 6-12h window where an externally-invalidated token could cause CNI ADD to fail with "Unauthorized" until the calico-node pod was restarted.

The cni-config-monitor subprocess wakes up on its own timer between refreshes (default 6-12 hours). If the CNI kubeconfig token is externally invalidated during that sleep — for example by a ServiceAccount signing-key rotation — every pod sandbox create fails with "error getting ClusterInformation: connection is unauthorized: Unauthorized" until the next timer fires or someone restarts the calico-node pod. Two changes close this gap: 1. TokenRefresher.Run now installs an fsnotify watcher on the directory containing the pod's projected ServiceAccount token. Kubelet re-projects this token whenever it rotates, and the watcher wakes the refresh loop immediately so UpdateToken runs and the on-disk kubeconfig is rewritten with a freshly-minted token. If watcher setup fails, the refresher falls back to timer-only behaviour (unchanged semantics). 2. writeKubeconfig now writes atomically via a temp file in the same directory + rename. The previous os.WriteFile was truncate-then-write, so a CNI plugin invocation concurrent with a refresh could in principle read a partially-written kubeconfig. Broadened the clientset parameter on NewTokenRefresher from *kubernetes.Clientset to kubernetes.Interface to make the type testable with fake clients. Existing callers are unaffected. Added unit tests for both changes. Fixes CORE-12727. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR improves the reliability of Calico node’s CNI kubeconfig refresh loop by reacting promptly to projected ServiceAccount token rotations (instead of waiting hours for the next scheduled refresh) and by making kubeconfig updates atomic to avoid partial reads during concurrent CNI executions.

Changes:

Add an fsnotify watch on the projected ServiceAccount token directory to wake the refresh loop immediately on token rotation.
Switch kubeconfig writes to an atomic temp-file + fsync + rename flow.
Add unit tests covering the atomic write behavior and the watcher-driven wake-up / fallback behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
`node/pkg/cni/token_watch.go`	Adds token-directory watching to accelerate refresh on SA token rotation; introduces atomic kubeconfig write helper.
`node/pkg/cni/token_watch_internal_test.go`	Adds unit tests for atomic writes and watcher/timer behavior in `TokenRefresher.Run()`.

Per fsnotify's contract, both Events and Errors channels must be drained. Leaving Errors unread can block the watcher's internal goroutine and stop further events from being delivered — which would silently break the fast path added in the previous commit. Spawn a drainer goroutine on startup that logs errors and exits when the watcher is closed. cleanup() now waits for it to finish so there's no goroutine leak when Run() returns. Spotted by Copilot reviewer on projectcalico#12595. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Added five tests addressing gaps surfaced during review: - TestWriteFileAtomic_ConcurrentReaderNeverSeesPartialContent: the property test for the atomic write fix. A reader goroutine reads the kubeconfig continuously while a writer rewrites it 100 times with varying content length; the reader must always see one of the known valid revisions (never empty, truncated, or mid-write). - TestWriteFileAtomic_ErrorsWhenDirectoryMissing: error path of os.CreateTemp when the parent dir does not exist. - TestDrainEvents_IsNonBlockingAndDrainsBurst: direct unit test on the drainEvents helper for both full and empty channels. - TestTokenRefresher_DegradesGracefullyWhenWatcherChannelCloses: covers the "!ok" branch of Run's select — if the fsnotify watcher dies and its events channel closes, the timer loop must continue operating. Uses an injectable watcher factory with a pre-closed channel so the condition is triggered deterministically. - TestTokenRefresher_RecoversAfterTransientUpdateTokenError: the first UpdateToken call fails, the second succeeds; verifies the existing 5-10s retry path actually delivers a token after recovery. To make the watcher branch testable without real fsnotify plumbing, the TokenRefresher now exposes a watcherFactory field that defaults to the existing startTokenFileWatcher method. Production behaviour unchanged. All tests stable under `go test -count=20 -race` and `-count=50` (non-race). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The atomic write change was defensive hygiene, not a fix for the reported bug. The race it prevents — a CNI plugin reading a partially-written kubeconfig during the microsecond window between truncate and write — has no evidence of affecting the reported failure mode (which is a 401 from a stale token, not a parse error). Keep this PR focused on the fsnotify wake-up path that addresses the customer symptom. Revert to the original os.WriteFile and drop the five TestWriteFileAtomic_* tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings April 22, 2026 22:21

skoryk-oleksandr requested a review from a team as a code owner April 22, 2026 22:21

marvin-tigera added this to the Calico v3.33.0 milestone Apr 22, 2026

marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Apr 22, 2026

Copilot started reviewing on behalf of skoryk-oleksandr April 22, 2026 22:22 View session

Copilot AI reviewed Apr 22, 2026

View reviewed changes

Comment thread node/pkg/cni/token_watch.go

skoryk-oleksandr and others added 3 commits April 24, 2026 12:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

node: refresh CNI kubeconfig promptly on SA token rotation#12595

node: refresh CNI kubeconfig promptly on SA token rotation#12595
skoryk-oleksandr wants to merge 4 commits intoprojectcalico:masterfrom
skoryk-oleksandr:fix-cni-token-refresh-on-rotation

skoryk-oleksandr commented Apr 22, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

skoryk-oleksandr commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What this changes

Why now

End-to-end verification on a kind cluster

Tests

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

skoryk-oleksandr commented Apr 22, 2026 •

edited

Loading