fix(jetson): apply br_netfilter on JetPack R39 when missing (Fixes #2418)#2419
fix(jetson): apply br_netfilter on JetPack R39 when missing (Fixes #2418)#2419yanyunl1991 wants to merge 8 commits intomainfrom
Conversation
) JetPack R39 is silently skipped by setup-jetson.sh (get_jetpack_version early-returns for any release ≥ 39). That means `modprobe br_netfilter` and `sysctl -w net.bridge.bridge-nf-call-iptables=1` — which live inside `configure_jetson_host()` — are never run on R39 hosts. The assumption behind that early-return was wrong: k3s running inside the OpenShell gateway container needs bridge-nf-call-iptables=1 for ClusterIP service routing. Without it, the kube-proxy iptables NAT rules are written correctly but don't match pod-originated traffic, so sandbox agent pods can't resolve `openshell.openshell.svc.cluster.local` and crash-loop during onboard's [7/8] step. Recognize R39 as `jp7-r39`, same as R38 for the iptables / daemon.json case (no changes needed there). For the shared br_netfilter block: keep the existing unconditional behavior for JP6 / JP7-R38 (idempotent, no reason to change what ships today), but on R39 apply only when `/proc/sys/net/bridge/bridge-nf-call-iptables` is absent or not set to 1. This avoids writing NemoClaw-owned drop-ins to `/etc/modules-load.d/` and `/etc/sysctl.d/` on R39 images that already ship with the module in place (not all R39 builds need it — reporter's hit the broken case). Verified manually on an R39 Jetson Orin: before the fix, sandbox agent pod crash-looped 156+ times with `failed to lookup address information: Temporary failure in name resolution`. After `modprobe br_netfilter; sysctl -w net.bridge.bridge-nf-call-iptables=1`, a fresh `nemoclaw onboard` runs through [1/8]–[8/8] cleanly. Full diagnostic chain in #2418. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
…2418) Address reviewer feedback: keep the existing `if ((release >= 39))` branch in `get_jetpack_version()` and do the R39 check + setup right there instead of threading a new "jp7-r39" token through `configure_jetson_host()`. The earlier version added a `39.*)` case to `get_jetpack_version`, a matching `jp7-r39)` case to `configure_jetson_host`, and a conditional inside that function's br_netfilter block. This version: - Leaves `get_jetpack_version`'s R36/R38 case statement and `configure_jetson_host` case statements completely untouched for R36/R38 — no regression surface there. - Extracts `apply_br_netfilter_setup()` as a shared helper so the R39 branch and the R36/R38 path call the same code without duplication. - In the `>=39` branch: probe `bridge_netfilter_ready` first; if the host is already set up, log and return empty (main() exits early, same as before this PR). Otherwise do the sudo prompt and call `apply_br_netfilter_setup`. Net effect on R39 is identical — `br_netfilter` is loaded and persisted only when missing — but the diff is smaller and R39-specific logic is co-located with the rest of the R39 detection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughCentralizes br_netfilter readiness check and apply logic in the Jetson setup script; conditionally applies iptables-only (R36/R38) or dual-stack (R39) br_netfilter setup (modprobe, runtime sysctls, and persistence files), and avoids early-return on JP7 R39 so host bridge filtering is enforced when needed. Changes
Sequence Diagram(s)sequenceDiagram
participant Installer
participant HostKernel
participant SysctlPersist
participant ModulesPersist
Installer->>HostKernel: bridge_netfilter_ready? (check runtime & persistence)
alt not ready
Installer->>HostKernel: modprobe br_netfilter
Installer->>HostKernel: set sysctl net.bridge.bridge-nf-call-iptables/ip6tables=1
Installer->>SysctlPersist: write /etc/sysctl.d/99-nemoclaw.conf
Installer->>ModulesPersist: write /etc/modules-load.d/nemoclaw.conf
Installer->>HostKernel: log runtime values and persistence paths
else ready
Installer->>Installer: report no host setup required
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
The original fix only set the IPv4 sysctl (bridge-nf-call-iptables), which is enough for IPv4 ClusterIP service routing in k3s — the path validated end-to-end on the reporter's Jetson Orin R39. However, the manual workaround tracked in the issue thread enables both bridge-nf-call-iptables and bridge-nf-call-ip6tables, and that is also the combination that was actually verified on the reporter's host. Shipping only v4 leaves a small but real gap for any future dual-stack or v6-only cluster service the gateway might route, and makes the automated fix diverge slightly from the proven-works manual steps. - apply_br_netfilter_setup: flip ip6tables sysctl alongside iptables, and persist both in /etc/sysctl.d/99-nemoclaw.conf so the setting survives reboot (same drop-in, two lines). - bridge_netfilter_ready: require both sysctls to be 1 (and both /proc nodes to exist) before treating the host as already-configured, so we don't skip the apply step when only v4 is pre-set. - Updated comments to reflect v4+v6 coverage. Docker/k3s do not need to be restarted after this — br_netfilter is a kernel module and the sysctl takes effect immediately; new bridged traffic picks up the new netfilter hook without any daemon reload.
The R39 apply path currently prints a single "loading br_netfilter" line before running modprobe/sysctl/tee, all of which are silent (their stdout is redirected to /dev/null). On success the script exits quietly, leaving the operator without a confirmation that the apply actually completed, which two sysctls were flipped, or where the drop-ins were persisted. On failure, only the raw error from whichever step tripped `set -e` is visible. Add one structured info line at the end of the apply branch summarising what was done and where state now lives. Keeps the skip branch unchanged and does not affect R36/R38 logging (which already gets output from update-alternatives / systemctl restart). Useful for: - confirming the script actually ran the expected code path (in particular, that it's the v4+v6 version, not an older v4-only build) when triaging future reports. - documenting the on-host footprint for later cleanup / debugging.
Refines the confirmation log added in 5b1753f so the script itself produces evidence of the fix's effect, rather than just asserting "we set it to 1". A user validating on their own Jetson can now confirm from script output alone that: 1) The runtime sysctls read back as 1 (read from /proc, not hardcoded in the log string). 2) What this unblocks — sandbox-pod → ClusterIP routing, i.e. the original #2418 symptom. 3) No docker/k3s restart is required (common operator question; answered inline so reporters don't have to ask). 4) Which drop-in files own the reboot persistence. Before: "br_netfilter applied: bridge-nf-call-{iptables,ip6tables}=1; persisted ..." (one line, hardcoded string — no actual verification) After: "br_netfilter runtime: bridge-nf-call-iptables=1, bridge-nf-call-ip6tables=1 — sandbox → ClusterIP routing (CoreDNS, services) is unblocked; no docker or k3s restart needed" "Reboot persistence: /etc/modules-load.d/nemoclaw.conf, /etc/sysctl.d/99-nemoclaw.conf" (two lines; first reads live /proc state as proof) Falls back to "?" in the unlikely case /proc/sys/net/bridge/* is missing when read (e.g., racy unload), so the log line is always printable even in edge cases.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/setup-jetson.sh`:
- Line 183: The call to apply_br_netfilter_setup now routes jp6/jp7-r38 through
a shared helper that introduced expanded persistence writes intended for R39;
isolate legacy JetPack behavior by either creating a separate helper (e.g.,
apply_br_netfilter_setup_legacy) and keep jp6/jp7-r38 calling the legacy
variant, or add a guarded conditional inside apply_br_netfilter_setup that
checks the JetPack version and only performs the extra persistence writes when
the version equals R39; update references to apply_br_netfilter_setup in the jp6
and jp7-r38 flow to use the legacy helper or ensure the conditional prevents
R36/R38 writes so those paths remain byte-identical.
- Around line 19-28: The current bridge_netfilter_ready() only checks live /proc
state; update the logic so the script does not skip persistence steps
prematurely: either extend bridge_netfilter_ready() to also verify the
persistence drop-ins exist and contain the expected settings (check
/etc/modules-load.d/* for br_netfilter and /etc/sysctl.d/* for
bridge-nf-call-iptables and bridge-nf-call-ip6tables), or make the persistence
step idempotent by always creating/updating the /etc/modules-load.d and
/etc/sysctl.d drop-in files when they are missing or don’t contain the required
entries; reference the bridge_netfilter_ready() function and the persistence
targets (/etc/modules-load.d and /etc/sysctl.d) when implementing the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9260e3aa-267d-460a-a3fb-b1953242f97e
📒 Files selected for processing (1)
scripts/setup-jetson.sh
…2418) Addresses two review comments on PR #2419: 1. bridge_netfilter_ready only inspected the live kernel state, which let the skip branch trigger on runtime-only configuration (e.g. someone ran modprobe + sysctl manually for a test). The drop-ins would never get written and the fix would vanish at the next reboot. The predicate now also requires: - /etc/modules-load.d/nemoclaw.conf contains `br_netfilter` - /etc/sysctl.d/99-nemoclaw.conf contains both sysctl=1 lines apply_br_netfilter_setup_r39 is already idempotent, so when runtime is live but drop-ins are missing or stale (v4-only leftover from an older script), ready returns false and the apply path rewrites them. 2. The earlier "also flip ip6tables" commit (77e84f9) added v6 handling to the shared apply_br_netfilter_setup helper, which was also called by R36/R38 via configure_jetson_host. That expanded the R36/R38 behavior — out of scope for this PR. Split into two helpers: - apply_br_netfilter_setup: v4-only, byte-identical to the pre-#2418 inline block, still used by R36/R38. - apply_br_netfilter_setup_r39: v4+v6, the dual-stack variant validated on the reporter's Jetson; only called from the R39 branch in get_jetpack_version.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/setup-jetson.sh`:
- Around line 30-39: The bridge IPv6 sysctl is being written unconditionally
which fails on kernels without bridge-nf-call-ip6tables; update the script so
the write of net.bridge.bridge-nf-call-ip6tables=1 is only attempted when the
kernel exposes /proc/sys/net/bridge/bridge-nf-call-ip6tables (or when the file
exists), and update the bridge_netfilter_ready() function to check for the IPv6
sysctl file before asserting its value (mirror the existing IPv4 checks in
bridge_netfilter_ready()); ensure any set -e failure paths are avoided by gating
the IPv6 sysctl write with a presence check and only adding the grep check for
net.bridge.bridge-nf-call-ip6tables=1 in /etc/sysctl.d/99-nemoclaw.conf when
that sysctl is present.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: adc4de17-ceae-46df-8123-2bafdfcc99e4
📒 Files selected for processing (1)
scripts/setup-jetson.sh
CodeRabbit flagged that the IPv6 sysctl write and the ready-check for bridge-nf-call-ip6tables were unconditional, which would fail under `set -e` on kernels built without CONFIG_IPV6 (where /proc/sys/net/bridge/bridge-nf-call-ip6tables does not exist). Make the IPv6 leg present-aware in all three places: - apply_br_netfilter_setup_r39: only runs `sysctl -w ...-ip6tables=1` when the /proc node exists; the persistence drop-in then contains only the v4 line on IPv6-less kernels (so bridge_netfilter_ready stays consistent with what was actually written). - bridge_netfilter_ready: v4 checks are mandatory; v6 checks (both the runtime sysctl value and the drop-in line) are gated on the /proc node existing. On a v4-only kernel the predicate returns true with just v4, matching the drop-in content. - R39 branch log line: when /proc for ip6tables is missing, emit "(ip6tables not exposed by this kernel; IPv4-only persistence)" so the operator can tell from logs alone that the v4-only path was taken intentionally, rather than because v6 silently failed.
Removes the IPv6 work that scope-crept into this PR and triggered two rounds of review: the R36/R38 byte-identical split (77e84f9 / 83a2398) and the CONFIG_IPV6-gated presence check (2a3ddbd). The reporter's actual symptom on #2418 — sandbox pod → ClusterIP DNS timing out — is IPv4-only (CoreDNS ClusterIP 10.43.0.10). The v4 sysctl + br_netfilter modprobe alone is sufficient and matches what the R36/R38 paths have always done. IPv6 was belt-and- suspenders; every review iteration showed the extra guard code was more complexity than the robustness was worth. Net effect: - apply_br_netfilter_setup_r39 is deleted; R39 calls the shared apply_br_netfilter_setup (same helper R36/R38 already use). - bridge_netfilter_ready drops IPv6 runtime and persistence checks. - R39 log line drops the IPv6 summary branching. - CodeRabbit's other finding (persistence must be part of ready) is preserved: bridge_netfilter_ready still checks /etc/modules-load.d/nemoclaw.conf and /etc/sysctl.d/99-nemoclaw.conf content, so transient modprobe-only state no longer triggers skip. If a future issue shows IPv6 cluster-service routing breaking on R39, it can be added back with its own scope and validation.
Summary
Fixes #2418.
JetPack R39 was silently skipped by
setup-jetson.sh—get_jetpack_version()returned empty for any release ≥ 39, somain()exited before ever setting upbr_netfilter. Withoutnet.bridge.bridge-nf-call-iptables=1, kube-proxy's iptables NAT rules are present but don't match pod-originated traffic, so sandboxagent pods can't resolve
openshell.openshell.svc.cluster.localand crash-loop during onboard's[7/8] Setting up OpenClaw inside sandboxstep.See #2418 for full diagnostic chain: pod-IP DNS query works, service-IP times out, iptables
KUBE-SVC-*rules exist,lsmod | grep br_netfilteris empty, and loading the module immediately fixes it.Changes
Scoped to
scripts/setup-jetson.shonly. Approach keeps R36/R38 code paths untouched and co-locates R39 handling inside the existingif ((release >= 39))branch rather than threading a newjp7-r39token throughconfigure_jetson_host().New helper
bridge_netfilter_ready()— returns 0 when/proc/sys/net/bridge/bridge-nf-call-iptablesexists and reads1. That file only appears after the module is loaded, so a single check covers both conditions.New helper
apply_br_netfilter_setup()— extracts the four-linemodprobe+sysctl+tee /etc/modules-load.d/+tee /etc/sysctl.d/block that previously lived inline inconfigure_jetson_host(). Both the R39 branch and the R36/R38 path now call this helper, so behavior for R36/R38 is byte-identical tomain.
R39 branch in
get_jetpack_version()— the existingif ((release >= 39))early-return now carries the work instead of just logging and returning:bridge_netfilter_ready→ log and return empty (main exits, no host state is modified).apply_br_netfilter_setup, return empty.Returning empty preserves the existing invariant that R39 does not flow through
configure_jetson_host()(which handles iptables-legacy /daemon.json/ post-setup docker restart — none of which R39 needs).Conditional on R39 only because not every R39 image hits the bug: some ship with
br_netfilteralready loaded and we don't want to plant/etc/modules-load.d/nemoclaw.conf+/etc/sysctl.d/99-nemoclaw.confdrop-ins on those systems. For JP6 / JP7-R38 the behavior is preserved exactly (unconditionalapply_br_netfilter_setupviaconfigure_jetson_host).What the user sees on affected R39 hosts
Before (bug):
(…then onboard[7/8] fails with
sandbox is not ready.)After, first-affected R39 install:
After, R39 image that already has it:
Verification
bash -n scripts/setup-jetson.sh— clean (syntax check).npm run typecheck:cli— clean.npm run ts-migration:guard -- --base origin/main --head HEAD—No edits to migrated legacy paths or removed compatibility shims detected.test/setup-jetson.test.ts(4 tests for the Pythondaemon.jsonpatcher, which is unchanged) — all pass.nemoclaw onboardreaches[7/8]and fails withsandbox is not ready;kubectl describe pod my-assistant -n openshellshows theagentcontainer inCrashLoopBackOffwith 156+ restarts, logs showTemporary failure in name resolutionforopenshell.openshell.svc.cluster.local.sudo modprobe br_netfilter && sudo sysctl -w net.bridge.bridge-nf-call-iptables=1 net.bridge.bridge-nf-call-ip6tables=1) and rerunningnemoclaw onboard→[1/8]–[8/8]complete, sandbox reaches Ready, dashboard reachable.No bash-level unit tests added because the existing test coverage for this script only exercises the embedded Python
daemon.jsonpatcher (extracted via regex), and the new logic is top-level bash that would require either a source-guard refactor or mocking/etc/nv_tegra_release//proc/sys/...via env-varindirection — neither matches existing patterns. Happy to add them if a reviewer prefers.
Type of Change
AI Disclosure
Summary by CodeRabbit
Signed-off-by: Yanyun Liao yanyunl@nvidia.com