Skip to content

feat(orchestrator): ORCHESTRATOR_HOST env var for worker callback address#2506

Open
ymcrcat wants to merge 1 commit into
nearai:mainfrom
ymcrcat:fix/orchestrator-host
Open

feat(orchestrator): ORCHESTRATOR_HOST env var for worker callback address#2506
ymcrcat wants to merge 1 commit into
nearai:mainfrom
ymcrcat:fix/orchestrator-host

Conversation

@ymcrcat
Copy link
Copy Markdown
Contributor

@ymcrcat ymcrcat commented Apr 15, 2026

Summary

When the orchestrator spawns a worker container, it hardcodes host.docker.internal as the callback address. This works on Docker Desktop (macOS/Windows), where host.docker.internal resolves to the host and Docker publishes ports on all interfaces. It does not work on Linux with Nomad (and similar orchestrators).

The failure mode: Nomad publishes Docker ports on the node's advertised IP (e.g. 10.128.0.17), not 0.0.0.0. Inside the worker container, host.docker.internal resolves via host-gateway to the Docker bridge gateway (172.17.0.1) — a different IP from where Nomad published the orchestrator port. The worker can't connect back, hangs, and fails with the 10-minute container execution timed out error.

Change

Add an ORCHESTRATOR_HOST env var read at job-dispatch time. Operators set it to the IP where Docker published the orchestrator port (from docker port <container>).

When unset, defaults stay Docker-Desktop-friendly:

  • Linux → 172.17.0.1 (Docker bridge gateway, same behavior as before for standalone Docker)
  • Other platforms → host.docker.internal (unchanged)

In a Nomad job spec:

env {
  ORCHESTRATOR_HOST = "\${attr.unique.network.ip-address}"
}

Test plan

  • cargo test --lib orchestrator::job_manager::tests::resolve_orchestrator_host — 3 unit tests, all pass
  • Verified end-to-end on a live Nomad cluster: worker container spawns, connects back to 10.128.0.17:<orchestrator_port>, receives job, dispatches shell tool, returns output. Prior to this change the same setup hit the 10-minute timeout.

Backwards compatibility

Zero impact on existing deployments:

  • Docker Desktop macOS/Windows: env var unset → default host.docker.internal (unchanged)
  • Standalone Linux Docker with published-on-all-interfaces: env var unset → 172.17.0.1 (bridge gateway, same as previous hardcoded fallback via extra_hosts: host.docker.internal:host-gateway)
  • Nomad / orchestrator with node-IP port binding: set ORCHESTRATOR_HOST explicitly to fix a previously-broken path

🤖 Generated with Claude Code

@github-actions github-actions Bot added scope: channel/web Web gateway channel scope: tool/wasm WASM tool sandbox scope: orchestrator Container orchestrator scope: sandbox Docker sandbox size: L 200-499 changed lines risk: medium Business logic, config, or moderate-risk modules contributor: regular 2-5 merged PRs labels Apr 15, 2026
@ymcrcat ymcrcat requested a review from henrypark133 April 15, 2026 20:53
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces slim Docker images for production and worker environments, significantly reducing image size for high-density deployments. Key architectural changes include a configurable Tokio runtime that supports single-worker multi-threaded execution to preserve blocking task semantics, and lazy initialization of the WASM epoch ticker thread to improve resource efficiency. Additionally, Docker detection was updated to support environments where only the socket is available, and a new orchestrator host resolution mechanism was added for better compatibility with Linux-based orchestrators like Nomad. Feedback is provided regarding the optimization of atomic memory ordering in the lazy initialization logic.

Comment thread src/tools/wasm/runtime.rs Outdated
@ymcrcat ymcrcat force-pushed the fix/orchestrator-host branch from 9fc2120 to 4940b87 Compare April 15, 2026 20:58
@github-actions github-actions Bot added size: M 50-199 changed lines and removed size: L 200-499 changed lines labels Apr 15, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9fc2120fb2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/orchestrator/job_manager.rs Outdated
Comment thread src/bootstrap.rs Outdated
let mut builder = tokio::runtime::Builder::new_multi_thread();
builder.enable_all();
if let Some(threads) = worker_threads {
builder.worker_threads(threads);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate TOKIO_WORKER_THREADS is at least 1

TOKIO_WORKER_THREADS is parsed as usize, so 0 is accepted and passed to builder.worker_threads(threads). Tokio’s worker_threads(0) panics (rather than returning an error), so a misconfigured env var will crash startup instead of producing a recoverable config error from build_runtime_from_env.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gonna be resolved by PR #2418

@ymcrcat ymcrcat enabled auto-merge (squash) April 15, 2026 20:59
…ress

On Docker Desktop (macOS/Windows), host.docker.internal routes to the
host and ports are published on all interfaces — works fine. On Linux
with Nomad or similar orchestrators, Docker ports are published on the
node's advertised IP (e.g. 10.128.0.17), NOT on 0.0.0.0.
host.docker.internal resolves to the bridge gateway (172.17.0.1), a
different IP — worker containers can't reach the orchestrator and fail
with a 10-minute timeout.

Read ORCHESTRATOR_HOST env var to override the callback host. Falls
back to 172.17.0.1 on Linux and host.docker.internal elsewhere to keep
existing Docker Desktop behavior unchanged.

Operators set ORCHESTRATOR_HOST to match where Docker publishes the
orchestrator port (visible in `docker port <container>`). In Nomad
HCL: `ORCHESTRATOR_HOST = "\${attr.unique.network.ip-address}"`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ymcrcat ymcrcat force-pushed the fix/orchestrator-host branch from 4940b87 to 9a9c60b Compare April 15, 2026 21:29
@ymcrcat ymcrcat requested a review from serrrfirat April 15, 2026 22:08
Copy link
Copy Markdown
Collaborator

@henrypark133 henrypark133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: empty ORCHESTRATOR_HOST now produces a broken callback URL

The direction is right, but I found one verified config-handling bug in the new host-resolution helper.

Concerning: empty ORCHESTRATOR_HOST is accepted verbatim and yields an invalid callback URL

File: src/orchestrator/job_manager.rs:210, src/orchestrator/job_manager.rs:1022
resolve_orchestrator_host() now returns the raw env-var value for any Ok(...), and the new test explicitly locks in "" as a valid result. That value is fed straight into format!("http://{}:{}", orchestrator_host, ...), so an empty env var produces http://:PORT. Worker containers then get a malformed/unreachable IRONCLAW_ORCHESTRATOR_URL and will fail the callback path this PR is trying to fix. This is easy to hit in container schedulers where an env var is present but expands to an empty string.
Suggested fix: treat empty or all-whitespace ORCHESTRATOR_HOST as unset and fall back to the default host, or reject it explicitly with a startup error.

Summary:

  • Recommended verdict: Request changes
  • Prior feedback status: no prior blocker review
  • Residual risk: the new helper is otherwise small, but this empty-value path is now documented by test as accepted behavior.

Copy link
Copy Markdown
Collaborator

@henrypark133 henrypark133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The earlier blocker still appears unresolved.

  1. High – src/orchestrator/job_manager.rs:210 still treats ORCHESTRATOR_HOST="" as a valid configured host instead of falling back. Combined with the URL assembly path, that yields http://:PORT and breaks worker callbacks for environments that inject the variable but leave it empty.
  2. The regression test at src/orchestrator/job_manager.rs:1022 still locks in that broken behavior by asserting the empty string is preserved.

I don't see a compensating change elsewhere in the PR, so I can't clear the prior request-changes yet.

@henrypark133 henrypark133 changed the base branch from staging to main May 1, 2026 06:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: regular 2-5 merged PRs risk: medium Business logic, config, or moderate-risk modules scope: channel/web Web gateway channel scope: orchestrator Container orchestrator scope: sandbox Docker sandbox scope: tool/wasm WASM tool sandbox size: M 50-199 changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants