Skip to content

fix(jetson): apply br_netfilter on JetPack R39 when missing (Fixes #2418)#2419

Open
yanyunl1991 wants to merge 8 commits intomainfrom
fix/setup-jetson-r39-br-netfilter-2418
Open

fix(jetson): apply br_netfilter on JetPack R39 when missing (Fixes #2418)#2419
yanyunl1991 wants to merge 8 commits intomainfrom
fix/setup-jetson-r39-br-netfilter-2418

Conversation

@yanyunl1991
Copy link
Copy Markdown
Contributor

@yanyunl1991 yanyunl1991 commented Apr 24, 2026

Summary

Fixes #2418.

JetPack R39 was silently skipped by setup-jetson.shget_jetpack_version() returned empty for any release ≥ 39, so main() exited before ever setting up br_netfilter. Without net.bridge.bridge-nf-call-iptables=1, kube-proxy's iptables NAT rules are present but don't match pod-originated traffic, so sandbox
agent pods can't resolve openshell.openshell.svc.cluster.local and crash-loop during onboard's [7/8] Setting up OpenClaw inside sandbox step.

See #2418 for full diagnostic chain: pod-IP DNS query works, service-IP times out, iptables KUBE-SVC-* rules exist, lsmod | grep br_netfilter is empty, and loading the module immediately fixes it.

Changes

Scoped to scripts/setup-jetson.sh only. Approach keeps R36/R38 code paths untouched and co-locates R39 handling inside the existing if ((release >= 39)) branch rather than threading a new jp7-r39 token through configure_jetson_host().

  1. New helper bridge_netfilter_ready() — returns 0 when /proc/sys/net/bridge/bridge-nf-call-iptables exists and reads 1. That file only appears after the module is loaded, so a single check covers both conditions.

  2. New helper apply_br_netfilter_setup() — extracts the four-line modprobe + sysctl + tee /etc/modules-load.d/ + tee /etc/sysctl.d/ block that previously lived inline in configure_jetson_host(). Both the R39 branch and the R36/R38 path now call this helper, so behavior for R36/R38 is byte-identical to
    main.

  3. R39 branch in get_jetpack_version() — the existing if ((release >= 39)) early-return now carries the work instead of just logging and returning:

    • If bridge_netfilter_ready → log and return empty (main exits, no host state is modified).
    • Otherwise → check sudo, call apply_br_netfilter_setup, return empty.

    Returning empty preserves the existing invariant that R39 does not flow through configure_jetson_host() (which handles iptables-legacy / daemon.json / post-setup docker restart — none of which R39 needs).

Conditional on R39 only because not every R39 image hits the bug: some ship with br_netfilter already loaded and we don't want to plant /etc/modules-load.d/nemoclaw.conf + /etc/sysctl.d/99-nemoclaw.conf drop-ins on those systems. For JP6 / JP7-R38 the behavior is preserved exactly (unconditional
apply_br_netfilter_setup via configure_jetson_host).

What the user sees on affected R39 hosts

Before (bug):

Jetson detected (L4T 39.1) — this version does not require any host setup

(…then onboard[7/8] fails with sandbox is not ready.)

After, first-affected R39 install:

Jetson detected (L4T 39.1) — loading br_netfilter (required by k3s inside the OpenShell gateway; see #2418)

After, R39 image that already has it:

Jetson detected (L4T 39.1) — br_netfilter already configured; no host setup needed

Verification

  • bash -n scripts/setup-jetson.sh — clean (syntax check).
  • npm run typecheck:cli — clean.
  • npm run ts-migration:guard -- --base origin/main --head HEADNo edits to migrated legacy paths or removed compatibility shims detected.
  • Existing test/setup-jetson.test.ts (4 tests for the Python daemon.json patcher, which is unchanged) — all pass.
  • Manual end-to-end on a JetPack R39 Jetson Orin (the setup-jetson.sh skips br_netfilter on JetPack R39, breaking k3s service DNS and onboard [7/8] #2418 reporter's affected machine):
    • Before: nemoclaw onboard reaches [7/8] and fails with sandbox is not ready; kubectl describe pod my-assistant -n openshell shows the agent container in CrashLoopBackOff with 156+ restarts, logs show Temporary failure in name resolution for openshell.openshell.svc.cluster.local.
    • After applying the equivalent of this fix manually (sudo modprobe br_netfilter && sudo sysctl -w net.bridge.bridge-nf-call-iptables=1 net.bridge.bridge-nf-call-ip6tables=1) and rerunning nemoclaw onboard[1/8][8/8] complete, sandbox reaches Ready, dashboard reachable.

No bash-level unit tests added because the existing test coverage for this script only exercises the embedded Python daemon.json patcher (extracted via regex), and the new logic is top-level bash that would require either a source-guard refactor or mocking /etc/nv_tegra_release / /proc/sys/... via env-var
indirection — neither matches existing patterns. Happy to add them if a reviewer prefers.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

AI Disclosure

  • AI-assisted — tool: Claude Code

Summary by CodeRabbit

  • Chores
    • Centralized bridge/netfilter setup with a readiness check so host changes are applied only when needed.
    • Newer releases get a conditional dual‑stack (IPv4+IPv6) setup; older paths use iptables‑only configuration.
    • Ensures elevated privileges when required, persists bridge/IPv4/IPv6 settings across reboots, and logs runtime values and persistence locations.

Signed-off-by: Yanyun Liao yanyunl@nvidia.com

yanyunl1991 and others added 2 commits April 24, 2026 14:04
)

JetPack R39 is silently skipped by setup-jetson.sh (get_jetpack_version
early-returns for any release ≥ 39). That means `modprobe br_netfilter`
and `sysctl -w net.bridge.bridge-nf-call-iptables=1` — which live inside
`configure_jetson_host()` — are never run on R39 hosts.

The assumption behind that early-return was wrong: k3s running inside
the OpenShell gateway container needs bridge-nf-call-iptables=1 for
ClusterIP service routing. Without it, the kube-proxy iptables NAT
rules are written correctly but don't match pod-originated traffic, so
sandbox agent pods can't resolve `openshell.openshell.svc.cluster.local`
and crash-loop during onboard's [7/8] step.

Recognize R39 as `jp7-r39`, same as R38 for the iptables / daemon.json
case (no changes needed there). For the shared br_netfilter block: keep
the existing unconditional behavior for JP6 / JP7-R38 (idempotent, no
reason to change what ships today), but on R39 apply only when
`/proc/sys/net/bridge/bridge-nf-call-iptables` is absent or not set to
1. This avoids writing NemoClaw-owned drop-ins to `/etc/modules-load.d/`
and `/etc/sysctl.d/` on R39 images that already ship with the module in
place (not all R39 builds need it — reporter's hit the broken case).

Verified manually on an R39 Jetson Orin: before the fix, sandbox agent
pod crash-looped 156+ times with
`failed to lookup address information: Temporary failure in name
resolution`. After `modprobe br_netfilter; sysctl -w
net.bridge.bridge-nf-call-iptables=1`, a fresh `nemoclaw onboard` runs
through [1/8]–[8/8] cleanly. Full diagnostic chain in #2418.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
…2418)

Address reviewer feedback: keep the existing `if ((release >= 39))`
branch in `get_jetpack_version()` and do the R39 check + setup right
there instead of threading a new "jp7-r39" token through
`configure_jetson_host()`.

The earlier version added a `39.*)` case to `get_jetpack_version`, a
matching `jp7-r39)` case to `configure_jetson_host`, and a conditional
inside that function's br_netfilter block. This version:

- Leaves `get_jetpack_version`'s R36/R38 case statement and
  `configure_jetson_host` case statements completely untouched for
  R36/R38 — no regression surface there.
- Extracts `apply_br_netfilter_setup()` as a shared helper so the R39
  branch and the R36/R38 path call the same code without duplication.
- In the `>=39` branch: probe `bridge_netfilter_ready` first; if the
  host is already set up, log and return empty (main() exits early,
  same as before this PR). Otherwise do the sudo prompt and call
  `apply_br_netfilter_setup`.

Net effect on R39 is identical — `br_netfilter` is loaded and
persisted only when missing — but the diff is smaller and R39-specific
logic is co-located with the rest of the R39 detection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Centralizes br_netfilter readiness check and apply logic in the Jetson setup script; conditionally applies iptables-only (R36/R38) or dual-stack (R39) br_netfilter setup (modprobe, runtime sysctls, and persistence files), and avoids early-return on JP7 R39 so host bridge filtering is enforced when needed.

Changes

Cohort / File(s) Summary
Jetson setup script
scripts/setup-jetson.sh
Adds bridge_netfilter_ready() readiness check and two apply functions (iptables-only for R36/R38 and dual-stack for R39). Moves br_netfilter/module/sysctl persistence into helpers, enforces sudo when required, logs runtime sysctl values and persistence paths, and removes inline br_netfilter setup from older configure_jetson_host() branches so R39 no longer early-returns. Focus review on readiness logic, conditional branching for release versions, sysctl/module persistence paths (/etc/sysctl.d/99-nemoclaw.conf, /etc/modules-load.d/nemoclaw.conf), and sudo enforcement/logging.

Sequence Diagram(s)

sequenceDiagram
    participant Installer
    participant HostKernel
    participant SysctlPersist
    participant ModulesPersist

    Installer->>HostKernel: bridge_netfilter_ready? (check runtime & persistence)
    alt not ready
        Installer->>HostKernel: modprobe br_netfilter
        Installer->>HostKernel: set sysctl net.bridge.bridge-nf-call-iptables/ip6tables=1
        Installer->>SysctlPersist: write /etc/sysctl.d/99-nemoclaw.conf
        Installer->>ModulesPersist: write /etc/modules-load.d/nemoclaw.conf
        Installer->>HostKernel: log runtime values and persistence paths
    else ready
        Installer->>Installer: report no host setup required
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I nudged the bridge, gave nets a nudge,
Loaded a module, then wrote a judge,
Sysctls tucked in a file with care,
Boots will remember each little prayer,
R39 hops happy, DNS in the air.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely describes the primary change: fixing br_netfilter application on JetPack R39 when missing, with direct reference to the issue it addresses.
Linked Issues check ✅ Passed The PR implements all coding requirements from #2418: br_netfilter and bridge sysctls are now applied on R39 hosts with proper readiness checks, persistence handling, and conditional sudo enforcement.
Out of Scope Changes check ✅ Passed All changes in scripts/setup-jetson.sh are directly scoped to fixing the br_netfilter issue on R39; no unrelated modifications to iptables-legacy or daemon.json are included, maintaining byte-identical behavior for R36/R38.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/setup-jetson-r39-br-netfilter-2418

Comment @coderabbitai help to get the list of available commands and usage tips.

The original fix only set the IPv4 sysctl (bridge-nf-call-iptables),
which is enough for IPv4 ClusterIP service routing in k3s — the path
validated end-to-end on the reporter's Jetson Orin R39.

However, the manual workaround tracked in the issue thread enables
both bridge-nf-call-iptables and bridge-nf-call-ip6tables, and that
is also the combination that was actually verified on the reporter's
host. Shipping only v4 leaves a small but real gap for any future
dual-stack or v6-only cluster service the gateway might route, and
makes the automated fix diverge slightly from the proven-works
manual steps.

- apply_br_netfilter_setup: flip ip6tables sysctl alongside iptables,
  and persist both in /etc/sysctl.d/99-nemoclaw.conf so the setting
  survives reboot (same drop-in, two lines).
- bridge_netfilter_ready: require both sysctls to be 1 (and both
  /proc nodes to exist) before treating the host as already-configured,
  so we don't skip the apply step when only v4 is pre-set.
- Updated comments to reflect v4+v6 coverage.

Docker/k3s do not need to be restarted after this — br_netfilter is
a kernel module and the sysctl takes effect immediately; new bridged
traffic picks up the new netfilter hook without any daemon reload.
The R39 apply path currently prints a single "loading br_netfilter"
line before running modprobe/sysctl/tee, all of which are silent
(their stdout is redirected to /dev/null). On success the script
exits quietly, leaving the operator without a confirmation that the
apply actually completed, which two sysctls were flipped, or where
the drop-ins were persisted. On failure, only the raw error from
whichever step tripped `set -e` is visible.

Add one structured info line at the end of the apply branch
summarising what was done and where state now lives. Keeps the
skip branch unchanged and does not affect R36/R38 logging (which
already gets output from update-alternatives / systemctl restart).

Useful for:
- confirming the script actually ran the expected code path (in
  particular, that it's the v4+v6 version, not an older v4-only
  build) when triaging future reports.
- documenting the on-host footprint for later cleanup / debugging.
Refines the confirmation log added in 5b1753f so the script itself
produces evidence of the fix's effect, rather than just asserting
"we set it to 1". A user validating on their own Jetson can now
confirm from script output alone that:

  1) The runtime sysctls read back as 1 (read from /proc, not
     hardcoded in the log string).
  2) What this unblocks — sandbox-pod → ClusterIP routing, i.e. the
     original #2418 symptom.
  3) No docker/k3s restart is required (common operator question;
     answered inline so reporters don't have to ask).
  4) Which drop-in files own the reboot persistence.

Before: "br_netfilter applied: bridge-nf-call-{iptables,ip6tables}=1; persisted ..."
  (one line, hardcoded string — no actual verification)

After: "br_netfilter runtime: bridge-nf-call-iptables=1, bridge-nf-call-ip6tables=1 — sandbox → ClusterIP routing (CoreDNS, services) is unblocked; no docker or k3s restart needed"
       "Reboot persistence: /etc/modules-load.d/nemoclaw.conf, /etc/sysctl.d/99-nemoclaw.conf"
  (two lines; first reads live /proc state as proof)

Falls back to "?" in the unlikely case /proc/sys/net/bridge/* is
missing when read (e.g., racy unload), so the log line is always
printable even in edge cases.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/setup-jetson.sh`:
- Line 183: The call to apply_br_netfilter_setup now routes jp6/jp7-r38 through
a shared helper that introduced expanded persistence writes intended for R39;
isolate legacy JetPack behavior by either creating a separate helper (e.g.,
apply_br_netfilter_setup_legacy) and keep jp6/jp7-r38 calling the legacy
variant, or add a guarded conditional inside apply_br_netfilter_setup that
checks the JetPack version and only performs the extra persistence writes when
the version equals R39; update references to apply_br_netfilter_setup in the jp6
and jp7-r38 flow to use the legacy helper or ensure the conditional prevents
R36/R38 writes so those paths remain byte-identical.
- Around line 19-28: The current bridge_netfilter_ready() only checks live /proc
state; update the logic so the script does not skip persistence steps
prematurely: either extend bridge_netfilter_ready() to also verify the
persistence drop-ins exist and contain the expected settings (check
/etc/modules-load.d/* for br_netfilter and /etc/sysctl.d/* for
bridge-nf-call-iptables and bridge-nf-call-ip6tables), or make the persistence
step idempotent by always creating/updating the /etc/modules-load.d and
/etc/sysctl.d drop-in files when they are missing or don’t contain the required
entries; reference the bridge_netfilter_ready() function and the persistence
targets (/etc/modules-load.d and /etc/sysctl.d) when implementing the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9260e3aa-267d-460a-a3fb-b1953242f97e

📥 Commits

Reviewing files that changed from the base of the PR and between 5b1753f and 9aa281e.

📒 Files selected for processing (1)
  • scripts/setup-jetson.sh

Comment thread scripts/setup-jetson.sh Outdated
Comment thread scripts/setup-jetson.sh
…2418)

Addresses two review comments on PR #2419:

1. bridge_netfilter_ready only inspected the live kernel state,
   which let the skip branch trigger on runtime-only configuration
   (e.g. someone ran modprobe + sysctl manually for a test). The
   drop-ins would never get written and the fix would vanish at
   the next reboot. The predicate now also requires:
     - /etc/modules-load.d/nemoclaw.conf contains `br_netfilter`
     - /etc/sysctl.d/99-nemoclaw.conf contains both sysctl=1 lines
   apply_br_netfilter_setup_r39 is already idempotent, so when
   runtime is live but drop-ins are missing or stale (v4-only
   leftover from an older script), ready returns false and the
   apply path rewrites them.

2. The earlier "also flip ip6tables" commit (77e84f9) added v6
   handling to the shared apply_br_netfilter_setup helper, which
   was also called by R36/R38 via configure_jetson_host. That
   expanded the R36/R38 behavior — out of scope for this PR.
   Split into two helpers:
     - apply_br_netfilter_setup: v4-only, byte-identical to the
       pre-#2418 inline block, still used by R36/R38.
     - apply_br_netfilter_setup_r39: v4+v6, the dual-stack variant
       validated on the reporter's Jetson; only called from the
       R39 branch in get_jetpack_version.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/setup-jetson.sh`:
- Around line 30-39: The bridge IPv6 sysctl is being written unconditionally
which fails on kernels without bridge-nf-call-ip6tables; update the script so
the write of net.bridge.bridge-nf-call-ip6tables=1 is only attempted when the
kernel exposes /proc/sys/net/bridge/bridge-nf-call-ip6tables (or when the file
exists), and update the bridge_netfilter_ready() function to check for the IPv6
sysctl file before asserting its value (mirror the existing IPv4 checks in
bridge_netfilter_ready()); ensure any set -e failure paths are avoided by gating
the IPv6 sysctl write with a presence check and only adding the grep check for
net.bridge.bridge-nf-call-ip6tables=1 in /etc/sysctl.d/99-nemoclaw.conf when
that sysctl is present.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: adc4de17-ceae-46df-8123-2bafdfcc99e4

📥 Commits

Reviewing files that changed from the base of the PR and between 9aa281e and 83a2398.

📒 Files selected for processing (1)
  • scripts/setup-jetson.sh

Comment thread scripts/setup-jetson.sh Outdated
CodeRabbit flagged that the IPv6 sysctl write and the ready-check
for bridge-nf-call-ip6tables were unconditional, which would fail
under `set -e` on kernels built without CONFIG_IPV6 (where
/proc/sys/net/bridge/bridge-nf-call-ip6tables does not exist).

Make the IPv6 leg present-aware in all three places:

- apply_br_netfilter_setup_r39: only runs `sysctl -w ...-ip6tables=1`
  when the /proc node exists; the persistence drop-in then contains
  only the v4 line on IPv6-less kernels (so bridge_netfilter_ready
  stays consistent with what was actually written).

- bridge_netfilter_ready: v4 checks are mandatory; v6 checks (both
  the runtime sysctl value and the drop-in line) are gated on the
  /proc node existing. On a v4-only kernel the predicate returns
  true with just v4, matching the drop-in content.

- R39 branch log line: when /proc for ip6tables is missing, emit
  "(ip6tables not exposed by this kernel; IPv4-only persistence)"
  so the operator can tell from logs alone that the v4-only path
  was taken intentionally, rather than because v6 silently failed.
Removes the IPv6 work that scope-crept into this PR and triggered
two rounds of review: the R36/R38 byte-identical split (77e84f9 /
83a2398) and the CONFIG_IPV6-gated presence check (2a3ddbd).

The reporter's actual symptom on #2418 — sandbox pod → ClusterIP
DNS timing out — is IPv4-only (CoreDNS ClusterIP 10.43.0.10). The
v4 sysctl + br_netfilter modprobe alone is sufficient and matches
what the R36/R38 paths have always done. IPv6 was belt-and-
suspenders; every review iteration showed the extra guard code was
more complexity than the robustness was worth.

Net effect:
- apply_br_netfilter_setup_r39 is deleted; R39 calls the shared
  apply_br_netfilter_setup (same helper R36/R38 already use).
- bridge_netfilter_ready drops IPv6 runtime and persistence checks.
- R39 log line drops the IPv6 summary branching.
- CodeRabbit's other finding (persistence must be part of ready)
  is preserved: bridge_netfilter_ready still checks
  /etc/modules-load.d/nemoclaw.conf and /etc/sysctl.d/99-nemoclaw.conf
  content, so transient modprobe-only state no longer triggers skip.

If a future issue shows IPv6 cluster-service routing breaking on
R39, it can be added back with its own scope and validation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

setup-jetson.sh skips br_netfilter on JetPack R39, breaking k3s service DNS and onboard [7/8]

1 participant