feat(orchestrator): ORCHESTRATOR_HOST env var for worker callback address#2506
feat(orchestrator): ORCHESTRATOR_HOST env var for worker callback address#2506ymcrcat wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces slim Docker images for production and worker environments, significantly reducing image size for high-density deployments. Key architectural changes include a configurable Tokio runtime that supports single-worker multi-threaded execution to preserve blocking task semantics, and lazy initialization of the WASM epoch ticker thread to improve resource efficiency. Additionally, Docker detection was updated to support environments where only the socket is available, and a new orchestrator host resolution mechanism was added for better compatibility with Linux-based orchestrators like Nomad. Feedback is provided regarding the optimization of atomic memory ordering in the lazy initialization logic.
9fc2120 to
4940b87
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9fc2120fb2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let mut builder = tokio::runtime::Builder::new_multi_thread(); | ||
| builder.enable_all(); | ||
| if let Some(threads) = worker_threads { | ||
| builder.worker_threads(threads); |
There was a problem hiding this comment.
Validate TOKIO_WORKER_THREADS is at least 1
TOKIO_WORKER_THREADS is parsed as usize, so 0 is accepted and passed to builder.worker_threads(threads). Tokio’s worker_threads(0) panics (rather than returning an error), so a misconfigured env var will crash startup instead of producing a recoverable config error from build_runtime_from_env.
Useful? React with 👍 / 👎.
…ress
On Docker Desktop (macOS/Windows), host.docker.internal routes to the
host and ports are published on all interfaces — works fine. On Linux
with Nomad or similar orchestrators, Docker ports are published on the
node's advertised IP (e.g. 10.128.0.17), NOT on 0.0.0.0.
host.docker.internal resolves to the bridge gateway (172.17.0.1), a
different IP — worker containers can't reach the orchestrator and fail
with a 10-minute timeout.
Read ORCHESTRATOR_HOST env var to override the callback host. Falls
back to 172.17.0.1 on Linux and host.docker.internal elsewhere to keep
existing Docker Desktop behavior unchanged.
Operators set ORCHESTRATOR_HOST to match where Docker publishes the
orchestrator port (visible in `docker port <container>`). In Nomad
HCL: `ORCHESTRATOR_HOST = "\${attr.unique.network.ip-address}"`.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4940b87 to
9a9c60b
Compare
henrypark133
left a comment
There was a problem hiding this comment.
Review: empty ORCHESTRATOR_HOST now produces a broken callback URL
The direction is right, but I found one verified config-handling bug in the new host-resolution helper.
Concerning: empty ORCHESTRATOR_HOST is accepted verbatim and yields an invalid callback URL
File: src/orchestrator/job_manager.rs:210, src/orchestrator/job_manager.rs:1022
resolve_orchestrator_host() now returns the raw env-var value for any Ok(...), and the new test explicitly locks in "" as a valid result. That value is fed straight into format!("http://{}:{}", orchestrator_host, ...), so an empty env var produces http://:PORT. Worker containers then get a malformed/unreachable IRONCLAW_ORCHESTRATOR_URL and will fail the callback path this PR is trying to fix. This is easy to hit in container schedulers where an env var is present but expands to an empty string.
Suggested fix: treat empty or all-whitespace ORCHESTRATOR_HOST as unset and fall back to the default host, or reject it explicitly with a startup error.
Summary:
- Recommended verdict: Request changes
- Prior feedback status: no prior blocker review
- Residual risk: the new helper is otherwise small, but this empty-value path is now documented by test as accepted behavior.
henrypark133
left a comment
There was a problem hiding this comment.
The earlier blocker still appears unresolved.
- High –
src/orchestrator/job_manager.rs:210still treatsORCHESTRATOR_HOST=""as a valid configured host instead of falling back. Combined with the URL assembly path, that yieldshttp://:PORTand breaks worker callbacks for environments that inject the variable but leave it empty. - The regression test at
src/orchestrator/job_manager.rs:1022still locks in that broken behavior by asserting the empty string is preserved.
I don't see a compensating change elsewhere in the PR, so I can't clear the prior request-changes yet.
Summary
When the orchestrator spawns a worker container, it hardcodes
host.docker.internalas the callback address. This works on Docker Desktop (macOS/Windows), wherehost.docker.internalresolves to the host and Docker publishes ports on all interfaces. It does not work on Linux with Nomad (and similar orchestrators).The failure mode: Nomad publishes Docker ports on the node's advertised IP (e.g.
10.128.0.17), not0.0.0.0. Inside the worker container,host.docker.internalresolves viahost-gatewayto the Docker bridge gateway (172.17.0.1) — a different IP from where Nomad published the orchestrator port. The worker can't connect back, hangs, and fails with the 10-minutecontainer execution timed outerror.Change
Add an
ORCHESTRATOR_HOSTenv var read at job-dispatch time. Operators set it to the IP where Docker published the orchestrator port (fromdocker port <container>).When unset, defaults stay Docker-Desktop-friendly:
172.17.0.1(Docker bridge gateway, same behavior as before for standalone Docker)host.docker.internal(unchanged)In a Nomad job spec:
Test plan
cargo test --lib orchestrator::job_manager::tests::resolve_orchestrator_host— 3 unit tests, all pass10.128.0.17:<orchestrator_port>, receives job, dispatches shell tool, returns output. Prior to this change the same setup hit the 10-minute timeout.Backwards compatibility
Zero impact on existing deployments:
host.docker.internal(unchanged)172.17.0.1(bridge gateway, same as previous hardcoded fallback viaextra_hosts: host.docker.internal:host-gateway)ORCHESTRATOR_HOSTexplicitly to fix a previously-broken path🤖 Generated with Claude Code