Skip to content

node: refresh CNI kubeconfig promptly on SA token rotation#12595

Open
skoryk-oleksandr wants to merge 4 commits intoprojectcalico:masterfrom
skoryk-oleksandr:fix-cni-token-refresh-on-rotation
Open

node: refresh CNI kubeconfig promptly on SA token rotation#12595
skoryk-oleksandr wants to merge 4 commits intoprojectcalico:masterfrom
skoryk-oleksandr:fix-cni-token-refresh-on-rotation

Conversation

@skoryk-oleksandr
Copy link
Copy Markdown
Contributor

@skoryk-oleksandr skoryk-oleksandr commented Apr 22, 2026

Summary

Fixes a multi-hour blind spot in cni-config-monitor where an externally-invalidated CNI kubeconfig token goes undetected between scheduled refreshes, leading to plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized until a calico-node pod restart.

What this changes

TokenRefresher.Run now installs an fsnotify watcher on the directory containing the pod's projected ServiceAccount token. When kubelet rotates that token, the refresh loop wakes up immediately, mints a fresh token via TokenRequest, and rewrites /etc/cni/net.d/calico-kubeconfig. If watcher setup fails the refresher falls back to the original timer-only behaviour. Both Events and Errors channels are drained per fsnotify's contract.

Why now

The TokenRefresher.Run loop sleeps for (tokenExp - now) / 4 + up to 100% jitter between refreshes. With the default 24h token validity that is 6-12 hours. During that sleep there is no mechanism that notices a token has become invalid externally. The two common triggers in practice are:

  • ServiceAccount signing-key rotation (reproduced on kind; the failure is immediate and persistent until calico-node restarts).
  • Any event that makes kubelet re-project the SA token before the scheduled refresh — picking it up promptly keeps the CNI kubeconfig as fresh as possible.

End-to-end verification on a kind cluster

Brought up a kind cluster running this branch and ran through the scenario the fix targets.

Fix activation confirmed — on pod startup, calico-node logs show the fsnotify watcher is installed:

[INFO] node-services/token_watch.go: Watching service account token directory for rotation

fsnotify fast-path demonstration — triggered a projected-token change in the watched directory on a healthy cluster:

Before After
CNI token jti 59030a44-… 45043310-… (new token)
CNI kubeconfig mtime 20:56:07 21:01:47
Latency trigger → new kubeconfig written ~30 ms

Log line, verbatim:

[INFO] node-services/token_watch.go: Projected service account token changed; refreshing CNI kubeconfig immediately

Same trigger on master (without this branch) produced no response — no log, no kubeconfig rewrite, no token change. That's the blind spot the fix closes.

Underlying bug reproduction — also reproduced the original failure end-to-end on the kind cluster by regenerating the SA signing key and restarting the apiserver. Pod creation fails with the exact error from the original report:

Failed to create pod sandbox: ... plugin type="calico" failed (add):
  error getting ClusterInformation: connection is unauthorized: Unauthorized

Honest scope

This is a necessary-but-not-sufficient fix. On its own it changes worst-case automatic recovery from "up to 6-12 hours (or manual pod restart)" to "at most one kubelet projection cycle". With default Kubernetes settings the projected SA token lifetime is ~1 hour and kubelet refreshes at 80 % of that, so the practical upper bound after this fix is ~48 minutes (average ~24 minutes, best case seconds).

That is still slow for time-sensitive workloads. Operators who need a tighter bound should additionally:

  • Shorten the projected SA token's expirationSeconds on the calico-node DaemonSet (e.g. 600s → refresh at ~8 minutes). That change belongs in the operator/Helm chart, not here.
  • Optionally add a health check that fails liveness after N minutes of UpdateToken errors so kubelet restarts the pod. Not included in this PR.

What this fix does NOT do:

  • Eliminate transient Unauthorized errors when the SA signing key rotates. Every workload holding a projected SA token sees 401s until kubelet re-projects.
  • Speed up kubelet's projected-token refresh cadence. That is controlled by the pod spec's expirationSeconds, set by the operator.

Tests

Added node/pkg/cni/token_watch_internal_test.go with:

  • TestTokenRefresher_WakesUpOnTokenFileChange — configures a 24h token (normally 6-12h sleep), writes a file in the watched dir, asserts a second UpdateToken fires within 3s. Without the fsnotify fast path this would time out.
  • TestTokenRefresher_FallsBackWhenWatcherSetupFails — verifies the graceful timer-only fallback when the watched directory doesn't exist.
  • TestTokenRefresher_DegradesGracefullyWhenWatcherChannelCloses — uses an injected watcher factory returning a pre-closed channel, verifies the timer continues to drive refreshes.
  • TestTokenRefresher_RecoversAfterTransientUpdateTokenError — first UpdateToken call fails, second succeeds; verifies the existing 5-10s retry path delivers a token after recovery.
  • TestDrainEvents_IsNonBlockingAndDrainsBurst — exercises the drainEvents helper on both full and empty channels.

All pass under -race with -count=20 and stable under -count=50.

Release note:

calico/node now refreshes the CNI plugin's kubeconfig immediately when the pod's projected ServiceAccount token is rotated, closing a 6-12h window where an externally-invalidated token could cause CNI ADD to fail with "Unauthorized" until the calico-node pod was restarted.

The cni-config-monitor subprocess wakes up on its own timer between refreshes
(default 6-12 hours). If the CNI kubeconfig token is externally invalidated
during that sleep — for example by a ServiceAccount signing-key rotation —
every pod sandbox create fails with "error getting ClusterInformation:
connection is unauthorized: Unauthorized" until the next timer fires or
someone restarts the calico-node pod.

Two changes close this gap:

1. TokenRefresher.Run now installs an fsnotify watcher on the directory
   containing the pod's projected ServiceAccount token. Kubelet re-projects
   this token whenever it rotates, and the watcher wakes the refresh loop
   immediately so UpdateToken runs and the on-disk kubeconfig is rewritten
   with a freshly-minted token. If watcher setup fails, the refresher falls
   back to timer-only behaviour (unchanged semantics).

2. writeKubeconfig now writes atomically via a temp file in the same
   directory + rename. The previous os.WriteFile was truncate-then-write,
   so a CNI plugin invocation concurrent with a refresh could in principle
   read a partially-written kubeconfig.

Broadened the clientset parameter on NewTokenRefresher from
*kubernetes.Clientset to kubernetes.Interface to make the type testable
with fake clients. Existing callers are unaffected.

Added unit tests for both changes.

Fixes CORE-12727.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 22, 2026 22:21
@skoryk-oleksandr skoryk-oleksandr requested a review from a team as a code owner April 22, 2026 22:21
@marvin-tigera marvin-tigera added this to the Calico v3.33.0 milestone Apr 22, 2026
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Apr 22, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the reliability of Calico node’s CNI kubeconfig refresh loop by reacting promptly to projected ServiceAccount token rotations (instead of waiting hours for the next scheduled refresh) and by making kubeconfig updates atomic to avoid partial reads during concurrent CNI executions.

Changes:

  • Add an fsnotify watch on the projected ServiceAccount token directory to wake the refresh loop immediately on token rotation.
  • Switch kubeconfig writes to an atomic temp-file + fsync + rename flow.
  • Add unit tests covering the atomic write behavior and the watcher-driven wake-up / fallback behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
node/pkg/cni/token_watch.go Adds token-directory watching to accelerate refresh on SA token rotation; introduces atomic kubeconfig write helper.
node/pkg/cni/token_watch_internal_test.go Adds unit tests for atomic writes and watcher/timer behavior in TokenRefresher.Run().

Comment thread node/pkg/cni/token_watch.go
skoryk-oleksandr and others added 3 commits April 24, 2026 12:38
Per fsnotify's contract, both Events and Errors channels must be drained.
Leaving Errors unread can block the watcher's internal goroutine and stop
further events from being delivered — which would silently break the fast
path added in the previous commit.

Spawn a drainer goroutine on startup that logs errors and exits when the
watcher is closed. cleanup() now waits for it to finish so there's no
goroutine leak when Run() returns.

Spotted by Copilot reviewer on projectcalico#12595.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Added five tests addressing gaps surfaced during review:

- TestWriteFileAtomic_ConcurrentReaderNeverSeesPartialContent: the
  property test for the atomic write fix. A reader goroutine reads the
  kubeconfig continuously while a writer rewrites it 100 times with
  varying content length; the reader must always see one of the known
  valid revisions (never empty, truncated, or mid-write).
- TestWriteFileAtomic_ErrorsWhenDirectoryMissing: error path of
  os.CreateTemp when the parent dir does not exist.
- TestDrainEvents_IsNonBlockingAndDrainsBurst: direct unit test on the
  drainEvents helper for both full and empty channels.
- TestTokenRefresher_DegradesGracefullyWhenWatcherChannelCloses: covers
  the "!ok" branch of Run's select — if the fsnotify watcher dies and
  its events channel closes, the timer loop must continue operating.
  Uses an injectable watcher factory with a pre-closed channel so the
  condition is triggered deterministically.
- TestTokenRefresher_RecoversAfterTransientUpdateTokenError: the first
  UpdateToken call fails, the second succeeds; verifies the existing
  5-10s retry path actually delivers a token after recovery.

To make the watcher branch testable without real fsnotify plumbing, the
TokenRefresher now exposes a watcherFactory field that defaults to the
existing startTokenFileWatcher method. Production behaviour unchanged.

All tests stable under `go test -count=20 -race` and `-count=50`
(non-race).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The atomic write change was defensive hygiene, not a fix for the
reported bug. The race it prevents — a CNI plugin reading a
partially-written kubeconfig during the microsecond window between
truncate and write — has no evidence of affecting the reported
failure mode (which is a 401 from a stale token, not a parse
error).

Keep this PR focused on the fsnotify wake-up path that addresses
the customer symptom. Revert to the original os.WriteFile and drop
the five TestWriteFileAtomic_* tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs-pr-required Change is not yet documented release-note-required Change has user-facing impact (no matter how small)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants