Skip to content

fix: preserve gateway token across helm upgrades#2

Open
aehrt55 wants to merge 2 commits intofeat/extensibility-fieldsfrom
fix/gateway-token-stability
Open

fix: preserve gateway token across helm upgrades#2
aehrt55 wants to merge 2 commits intofeat/extensibility-fieldsfrom
fix/gateway-token-stability

Conversation

@aehrt55
Copy link
Copy Markdown

@aehrt55 aehrt55 commented Mar 28, 2026

Summary

Use Helm lookup to retain existing gateway token Secret instead of regenerating on every helm upgrade.

Problem

randAlphaNum 32 is re-evaluated on every template render, overwriting the Secret on each upgrade. The helm.sh/resource-policy: keep annotation only prevents deletion on uninstall, not overwrites on upgrade.

This causes all paired Control UI sessions to lose their token after every helm upgrade.

Fix

Check if the Secret already exists via lookup. If it does, reuse the existing token. Only generate a new random token on first install.

This is the standard Helm pattern used by bitnami, ingress-nginx, and other charts for generated secrets.

🤖 Generated with Claude Code

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the OpenClaw Helm chart to version 1.4.1, introducing support for standard Kubernetes scheduling fields (nodeSelector, affinity, and tolerations) and adding documentation for tool persistence. It also implements a mechanism to reuse existing gateway tokens during upgrades and syncs them into the configuration via an init container. However, a critical bug was identified in the deployment template where the secret key reference does not match the key defined in the secret manifest, which will prevent the token from being correctly applied.

{{- else }}
name: {{ .Values.gateway.token.secretName }}
{{- end }}
key: token
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The secretKeyRef.key is set to token, but the corresponding Secret generated by templates/secret.yaml uses the key OPENCLAW_GATEWAY_TOKEN. This mismatch will cause the OPENCLAW_GATEWAY_TOKEN environment variable in the init-config container to be empty, as optional: true is set. Consequently, the script to sync the token into openclaw.json will not execute, which undermines the goal of persisting the token correctly across upgrades.

              key: OPENCLAW_GATEWAY_TOKEN

Copy link
Copy Markdown
Author

@aehrt55 aehrt55 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: fix: preserve gateway token across helm upgrades

🔴 Critical Issues

1. secretKeyRef key mismatch — init-config token sync is a silent no-op

deployment.yaml references key: token, but secret.yaml stores the value under key: OPENCLAW_GATEWAY_TOKEN.

# deployment.yaml (init-config container)
secretKeyRef:
  key: token          # ← does not exist in the Secret
  optional: true      # ← silently swallows the missing key

# secret.yaml
data:
  OPENCLAW_GATEWAY_TOKEN: ...   # ← actual key name

Because optional: true is set, Kubernetes won't error — the env var is simply empty. The if [ -n "$OPENCLAW_GATEWAY_TOKEN" ] guard evaluates to false, and the entire token sync is silently skipped on every pod start. The fix this PR intends to deliver is completely non-functional.

Fix: Change key: token to key: OPENCLAW_GATEWAY_TOKEN in the init-config secretKeyRef.


2. PR branch is stale — needs rebase onto main

The branch predates PRs thepagent#19, thepagent#21, and thepagent#23, which are already merged. As a result:

  • 5 of 6 changed files already exist identically on main
  • Chart.yaml will conflict (1.3.14 → 1.4.1, but main is already at 1.4.1)
  • 17 commits in history, most already merged via other PRs

After rebasing, the only net-new change should be the lookup pattern in secret.yaml. Bump version to 1.4.2 or appropriate.


🟡 High Severity

3. Node.js token sync script has zero error handling

node -e "
  const c = JSON.parse(fs.readFileSync(p));
  ...
  fs.writeFileSync(p, JSON.stringify(c, null, 2));
"

Multiple silent failure modes:

  • openclaw.json doesn't exist yet → ENOENT → script exits non-zero but shell continues
  • Invalid JSON in file → SyntaxError → same
  • PVC full or read-only → ENOSPC/EACCES → same

No set -e, no try/catch, no error logging. The init container will succeed regardless, and the token mismatch persists silently.

Suggestion: Wrap in try/catch with process.exit(1) on failure, and add || { echo "ERROR: token sync failed"; exit 1; } after the node -e call.

4. optional: true hides missing Secret misconfiguration

If gateway.token.generate: false and the user forgets to create the external Secret, the pod starts without any gateway token — potentially leaving the gateway unauthenticated. Consider making optional conditional:

optional: {{ not .Values.gateway.token.generate }}

5. lookup returns empty in GitOps/dry-run contexts

helm template, --dry-run, ArgoCD diff, and Flux render all return {} from lookup. This causes rendered output to always show a new random token, which may cause constant drift detection in GitOps workflows.

This is a known limitation of the standard Helm pattern and is acceptable, but should be documented in a template comment:

{{- /* lookup returns empty during helm template / --dry-run; token regenerated in output only */ -}}

📝 Documentation Issues

6. persistent-tools.md minor inaccuracies

Line Issue Fix
5 Uses ~/.openclaw without clarifying absolute path Add "(i.e., /home/node/.openclaw inside the container)" on first mention
25 "no apt-get" — apt-get exists but requires root → "runs as non-root user and cannot install system packages"
53 # binary lands at ~/.openclaw/workspace/bin/<package> Actual binary name depends on package's bin field, not package name

7. Missing template comments

  • secret.yaml: The lookup pattern needs a comment explaining intent: # Reuse existing token if Secret already exists (prevents regeneration on upgrade)
  • deployment.yaml: The checksum/config annotation needs: # Force pod restart when config values change

8. values.yaml scheduling comments are generic

# nodeSelector constrains the pod to nodes with matching labels just restates Kubernetes docs. Either remove or make chart-specific.


✅ What Works Well

  • The core lookup pattern in secret.yaml is correct and follows the standard bitnami/ingress-nginx approach
  • Defensive triple-check ($existingSecret && .data && index .data "KEY") handles edge cases properly
  • helm.sh/resource-policy: keep retained as belt-and-suspenders
  • The env-first precedence comment in init-config explains "why" well
  • ASCII diagrams in persistent-tools.md effectively communicate complex concepts

Action Items Summary

Priority Item
Must Fix key: tokenkey: OPENCLAW_GATEWAY_TOKEN in secretKeyRef
Must Rebase onto main, resolve conflicts, bump to 1.4.2+
Should Add error handling to init-config Node.js script
Should Add explanatory comments to lookup pattern and checksum annotation
Could Make optional: true conditional on gateway.token.generate
Could Fix persistent-tools.md minor inaccuracies

@aehrt55
Copy link
Copy Markdown
Author

aehrt55 commented Mar 29, 2026

🤖 PR Review

Branch: fix/gateway-token-stabilityfeat/extensibility-fields
Round: 1 | Updated: 2026-03-29
Reviewer: claude-opus-4-6 via pr-review-toolkit


📊 Summary

Category Total Fixed Remaining
🔴 Critical 0 0 0
🟡 Important 4 0 4
💡 Suggestions 7 0 7

Status: ✅ Ready to merge (all important issues reviewed and deferred to backlog)


🟡 Important Issues

1. ⏭️ lookup silently returns empty on RBAC failure

File: charts/openclaw-helm/templates/secret.yaml:3

Problem: Helm lookup returns empty dict both when Secret doesn't exist and when RBAC denies access. On RBAC failure, a new token is silently generated on every upgrade.

Decision: Deferred — this is a known Helm limitation shared by all charts using lookup (bitnami, ingress-nginx, etc.). Not specific to this chart.

2. ⏭️ optional: true silently skips token injection when Secret is missing

File: charts/openclaw-helm/templates/deployment.yaml:67

Problem: When Secret is expected but missing, optional: true silently unsets the env var, token sync is skipped, and gateway may start without authentication.

Decision: Deferred — currently works correctly in normal operation. Conditional optional or log line is a future improvement.

3. ⏭️ node -e failures produce raw stack traces

File: charts/openclaw-helm/templates/deployment.yaml:47-55

Problem: set -e correctly crashes init container on failure, but error messages lack actionable guidance (e.g., "delete corrupt config and restart").

Decision: Deferred — set -e ensures failure is visible (not silent). Improved error messages are a UX enhancement.

4. ⏭️ No validation when generate=false and secretName is empty

File: charts/openclaw-helm/templates/deployment.yaml:60-64

Problem: gateway.token.generate: false with empty secretName renders invalid secretKeyRef and skips envFrom — gateway starts without authentication.

Decision: Deferred — will be addressed together with values.schema.json in a future PR.


💡 Suggestions

View 7 suggestions (0 addressed)
# Suggestion Status
1 Add helm-unittest tests for secretKeyRef key, init-config script, Secret rendering ⏭️
2 Extract secret name into helper template openclaw-helm.gatewayTokenSecretName ⏭️
3 Document required Secret key name (OPENCLAW_GATEWAY_TOKEN) in values.yaml ⏭️
4 Add ` quote` to lookup branch in secret.yaml for consistency
5 Use secretKeyRef instead of secretRef in main container to prevent env var leakage ⏭️
6 Add token trim in node script for whitespace-padded values ⏭️
7 Add comment to `[ -f ]

✨ Strengths

  • lookup-based token preservation is the standard Helm pattern — well-implemented with proper nil guards
  • Correct fix: gateway.auth.token (not remote.token) matches verified config-first precedence behavior
  • set -e ensures init-config fails fast on errors instead of silently continuing
  • secretKeyRef key fix (tokenOPENCLAW_GATEWAY_TOKEN) makes token sync actually functional
  • Comments accurately describe config-first precedence — correcting the previous misleading "env-first" claim
  • Clean, minimal diff focused on the exact bugs being fixed
  • No simplification needed — code is already concise and well-structured

📋 Type Design Ratings

Type Encap. Express. Useful. Enforce. Overall
Gateway Token Config 5 4 7 3 4.8

🎯 Action Plan

Before Merge:

  • All issues reviewed — 4 deferred to backlog with explicit rationale

After Merge (Backlog):

  • Add RBAC requirement comment to secret.yaml lookup
  • Add log line or conditional optional for missing token
  • Add try-catch with actionable error message in node -e script
  • Add fail guard for generate=false + empty secretName
  • Add helm-unittest tests
  • Extract gateway token secret name into helper template
  • Document required key name in values.yaml

Generated by pr-review-and-document skill | Round 1 | [View edit history](click edited)

@aehrt55 aehrt55 force-pushed the fix/gateway-token-stability branch from 29f7bc6 to 3ca3480 Compare March 29, 2026 05:06
@aehrt55 aehrt55 changed the base branch from main to feat/extensibility-fields March 29, 2026 05:06
@aehrt55 aehrt55 force-pushed the fix/gateway-token-stability branch from 3ca3480 to be428df Compare March 29, 2026 05:20
@aehrt55 aehrt55 force-pushed the feat/extensibility-fields branch from 5c9e4a8 to 3702451 Compare March 29, 2026 06:02
aehrt55 and others added 2 commits March 29, 2026 14:12
Use Helm lookup to retain existing gateway token Secret instead of
regenerating on every helm upgrade. The randAlphaNum call is only used
for initial creation when no Secret exists yet.

Without this fix, every helm upgrade generates a new random token,
invalidating all paired Control UI sessions and requiring users to
re-enter the gateway token.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The token sync in init-config was silently non-functional due to two
compounding issues that happened to cancel out:

1. secretKeyRef used `key: token` but Secret stores `OPENCLAW_GATEWAY_TOKEN`,
   so the env var was always empty and sync was always skipped
2. Since config had no gateway.auth.token, the gateway fell back to the
   env var from the main container's secretRef — masking the issue

This worked by accident, but breaks if auth.token is ever set in config.

Fixes:
- secretKeyRef key: `token` → `OPENCLAW_GATEWAY_TOKEN` to match Secret
- Sync gateway.auth.token (not remote.token), because the gateway process
  uses config-first precedence when OPENCLAW_SERVICE_KIND=gateway
- Add `set -e` so init-config fails loudly on JSON parse or write errors
- Correct misleading comment that claimed env-first precedence

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@aehrt55 aehrt55 force-pushed the fix/gateway-token-stability branch 3 times, most recently from d912e9b to aee7a62 Compare March 29, 2026 06:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant