feat: Add Headscale support for WAN peer discovery #1023
realies wants to merge 3 commits into exo-explore:main from
This actually looks really solid, but I'm uncertain about our long-term support for Headscale. Our net code is not in a great state and I have rewritten it in #959, so I'd only want to support this in an experimental capacity. Would you mind if I merged this with the understanding that it may be rewritten in the future?

Thanks for the review! Yes, that sounds completely reasonable - I'm happy for this to be merged as experimental with the understanding that it may be rewritten when the networking code is refactored in #959. I've pushed a fix for the CI failure (the Python stubs needed to be updated for the new HeadscaleConfig class). I'm also happy to help adapt this to any new networking architecture if needed in the future.
Evanev7 left a comment
...yeah this looks solid. I don't have Headscale set up to test, and long-term support will be a future conversation, but the code looks good.
This comment was marked as spam.
pub struct PyHeadscaleConfig {
    /// Base URL for the Headscale API (e.g., "https://headscale.example.com")
    #[pyo3(get, set)]
    pub api_base_url: String,

    /// API key for authenticating with the Headscale API
    #[pyo3(get, set)]
    pub api_key: String,

    /// How often to poll the Headscale API for peer updates (in seconds)
    #[pyo3(get, set)]
    pub poll_interval_secs: u64,

    /// The port that exo is listening on
    #[pyo3(get, set)]
    pub exo_port: u16,
}
# Build Headscale config if API URL and key are provided
headscale_config = None
if args.headscale_api_url and args.headscale_api_key:
    logger.info(f"Headscale discovery enabled with API URL: {args.headscale_api_url}")
    headscale_config = HeadscaleConfig(
        api_base_url=args.headscale_api_url,
        api_key=args.headscale_api_key,
        poll_interval_secs=args.headscale_poll_interval,
        exo_port=args.api_port,
    )
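To make the flag-to-config wiring in the snippet above easier to follow, here is a minimal, self-contained sketch. It is not the PR's code: the `HeadscaleConfig` dataclass is a pure-Python stand-in for the PyO3 binding, `build_parser` and `headscale_config_from_args` are hypothetical helper names, and the `--api-port` flag with its `52415` default is an assumption. The three `--headscale-*` flags and the 5-second default come from the PR description.

```python
import argparse
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeadscaleConfig:
    """Pure-Python stand-in for the PyO3 HeadscaleConfig binding (illustrative only)."""

    api_base_url: str
    api_key: str
    poll_interval_secs: int
    exo_port: int


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the flags relevant to Headscale discovery."""
    parser = argparse.ArgumentParser(prog="exo")
    parser.add_argument("--headscale-api-url", default=None)
    parser.add_argument("--headscale-api-key", default=None)
    parser.add_argument("--headscale-poll-interval", type=int, default=5)
    # Assumed flag name and default; the real exo CLI may differ.
    parser.add_argument("--api-port", type=int, default=52415)
    return parser


def headscale_config_from_args(args: argparse.Namespace) -> Optional[HeadscaleConfig]:
    """Return a config only when both the API URL and the API key are set."""
    if not (args.headscale_api_url and args.headscale_api_key):
        return None  # Headscale discovery stays disabled
    return HeadscaleConfig(
        api_base_url=args.headscale_api_url,
        api_key=args.headscale_api_key,
        poll_interval_secs=args.headscale_poll_interval,
        exo_port=args.api_port,
    )
```

Returning `None` when either value is missing mirrors the snippet's behaviour of leaving `headscale_config = None`, so mDNS-only setups are unaffected.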
@@ -101,6 +101,58 @@ mod exception {
}
// #![feature(negative_impls)]

pub mod discovery;
pub mod headscale;
pub mod keep_alive;
Wow, spam reviews. There's a new one. Bear with me while I figure out how to dismiss that. Update: GitHub does not have this feature. Great.
This adds support for discovering exo peers across different networks using a Headscale coordination server. Headscale is an open-source implementation of the Tailscale control server.

Changes:
- Add headscale.rs module in rust/networking for Headscale API integration
- Implement HeadscaleConfig Python binding for configuration
- Add --headscale-api-url and --headscale-api-key CLI flags
- Integrate Headscale discovery as optional Toggle behaviour in swarm
- Handle Headscale connection events in networking task

Usage:
    exo --headscale-api-url https://headscale.example.com \
        --headscale-api-key YOUR_API_KEY

Nodes running exo with Headscale will:
1. Poll the Headscale API for online peers
2. Discover peers with exo peer_id tags
3. Dial discovered peers using their Tailscale IPs
4. Work alongside local mDNS discovery for hybrid setups
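The poll, filter, and dial steps described in this commit message can be illustrated with a short Python sketch. The actual implementation lives in the Rust `headscale.rs` module. Here the JSON field names (`online`, `forcedTags`, `validTags`, `ipAddresses`) are assumptions about the shape of the Headscale node list, and `extract_exo_peers` is a hypothetical helper. Only the `tag:exo_peer_id=<peer_id>` tag convention comes from the PR description.

```python
from typing import Any

# Tag convention taken from the PR description.
PEER_ID_TAG_PREFIX = "tag:exo_peer_id="


def extract_exo_peers(nodes: list[dict[str, Any]], exo_port: int) -> dict[str, list[str]]:
    """Map each online, exo-tagged node's peer ID to the multiaddrs it could be dialed on.

    The dictionary keys used to read each node are assumed, not confirmed
    against the Headscale API.
    """
    peers: dict[str, list[str]] = {}
    for node in nodes:
        if not node.get("online", False):
            continue  # only online peers are considered
        tags = list(node.get("forcedTags", [])) + list(node.get("validTags", []))
        peer_id = next(
            (t[len(PEER_ID_TAG_PREFIX):] for t in tags if t.startswith(PEER_ID_TAG_PREFIX)),
            None,
        )
        if not peer_id:
            continue  # not an exo node
        addrs = []
        for ip in node.get("ipAddresses", []):
            proto = "ip6" if ":" in ip else "ip4"
            addrs.append(f"/{proto}/{ip}/tcp/{exo_port}")
        if addrs:
            peers[peer_id] = addrs
    return peers
```

Grouping all addresses per peer also fits the later change that batches dial requests instead of dialing each address individually.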
f63b1b2 to 8957772
- Add environment variable support for HEADSCALE_API_URL and HEADSCALE_API_KEY (recommended for security as CLI args are visible in ps)
- Add documentation for fetch_nodes error conditions
- Improve TODO comment for async blocking issue with recommended fix approach
- Document catch-all pattern in on_swarm_event
- Batch dial requests instead of individual dials per address (perf)
- Add URL validation for Headscale API URL with HTTPS warning
- Move reqwest to workspace dependencies for consistency
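As a rough illustration of the environment-variable fallback and URL validation listed above, the following sketch shows one way they could look on the Python side. The helper names `resolve_credential` and `validate_api_url` are hypothetical, and the precedence (CLI value first, then environment) is an assumption; the commit message only names the variables `HEADSCALE_API_URL` and `HEADSCALE_API_KEY` and mentions an HTTPS warning.

```python
import logging
import os
from typing import Optional
from urllib.parse import urlparse

logger = logging.getLogger(__name__)


def resolve_credential(cli_value: Optional[str], env_name: str) -> Optional[str]:
    """Use the CLI value if given, else fall back to the environment (assumed precedence)."""
    return cli_value or os.environ.get(env_name) or None


def validate_api_url(url: str) -> str:
    """Reject URLs without an http(s) scheme and host; warn when the scheme is not HTTPS."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid Headscale API URL: {url!r}")
    if parsed.scheme != "https":
        logger.warning("Headscale API URL is not HTTPS; the API key would be sent unencrypted")
    # Normalise so that request paths can be appended without doubling slashes.
    return url.rstrip("/")
```

Reading the key from the environment keeps it out of the process list, which is the security concern the commit message raises about CLI arguments.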
8957772 to 586b07f
Rebased onto latest main and resolved the merge conflicts. Also addressed some of the valid points from the automated review (env var support for API credentials, batched dial requests, URL validation, etc.) and marked them as resolved. CI should pass now - ready to merge when you get a chance!

Thank you for your contribution! The exo codebase has undergone significant architectural changes since this PR was opened (event sourcing, new placement system, runner rewrite), and this PR now has merge conflicts that would require a substantial rewrite to resolve. We're closing this to keep the PR list focused, but the idea is still welcome; if you'd like to revisit this, please feel free to open a fresh PR against the current
**Enabling peers to be discovered in environments where mDNS is unavailable (SSH sessions, headless servers, Docker).**

## Motivation

Exo discovers peers exclusively via mDNS, which works great on a local network but breaks once you move beyond a single L2 broadcast domain:

- SSH sessions on macOS: TCC blocks mDNS multicast from non-GUI sessions (#1488)
- Headless servers/rack machines: #1682 ("DGX Spark does not find other nodes")
- Docker Compose: mDNS is often unavailable across container networks; e.g. #1462 (E2E test framework) needs an alternative

Related work: #1488 (working implementation by @AlexCheema, closed because SSH had a GUI workaround), #1023 (Headscale WAN, closed due to merge conflicts), #1656 (discovery cleanup, open).

This PR introduces an optional bootstrap mechanism for peer discovery while leaving the existing mDNS behavior unchanged.

## Changes

Adds two new CLI flags:

- `--bootstrap-peers` (env: `EXO_BOOTSTRAP_PEERS`): comma-separated libp2p multiaddrs to dial on startup and retry periodically
- `--libp2p-port`: fixed TCP port for libp2p to listen on (default: OS-assigned). Required when using bootstrap peers, so other nodes know which port to dial.

8 files:

- `rust/networking/src/discovery.rs`: Store bootstrap addrs, dial in existing retry loop
- `rust/networking/src/swarm.rs`: Thread `bootstrap_peers` parameter to `Behaviour`
- `rust/networking/examples/chatroom.rs`: Updated call site for new `create_swarm` signature
- `rust/networking/tests/bootstrap_peers.rs`: Integration tests
- `rust/exo_pyo3_bindings/src/networking.rs`: Accept optional `bootstrap_peers` in PyO3 constructor
- `rust/exo_pyo3_bindings/exo_pyo3_bindings.pyi`: Update type stub
- `src/exo/routing/router.py`: Pass peers to `NetworkingHandle`
- `src/exo/main.py`: `--bootstrap-peers` CLI arg + `EXO_BOOTSTRAP_PEERS` env var

## Why It Works

Bootstrap peers are dialed in the existing retry loop, the same path taken by mDNS-discovered peers. The swarm handles connection, Noise handshake, and gossipsub mesh joining from there. PeerId is intentionally not required in the multiaddr; the Noise handshake discovers it.

Docker Compose example:

```yaml
services:
  exo-1:
    environment:
      EXO_BOOTSTRAP_PEERS: "/ip4/exo-2/tcp/30000"
  exo-2:
    environment:
      EXO_BOOTSTRAP_PEERS: "/ip4/exo-1/tcp/30000"
```

## Test Plan

### Manual Testing

<details>
<summary>Docker Compose config</summary>

```
services:
  exo-node1:
    build:
      context: .
      dockerfile: Dockerfile.bootstrap-test
    container_name: exo-bootstrap-node1
    hostname: exo-node1
    command: ["-q", "--libp2p-port", "30000", "--bootstrap-peers", "/ip4/172.30.20.3/tcp/30000"]
    environment:
      - EXO_LIBP2P_NAMESPACE=bootstrap-test
    ports:
      - "52415:52415"
    networks:
      bootstrap-net:
        ipv4_address: 172.30.20.2
    deploy:
      resources:
        limits:
          memory: 4g

  exo-node2:
    build:
      context: .
      dockerfile: Dockerfile.bootstrap-test
    container_name: exo-bootstrap-node2
    hostname: exo-node2
    command: ["-q", "--libp2p-port", "30000", "--bootstrap-peers", "/ip4/172.30.20.2/tcp/30000"]
    environment:
      - EXO_LIBP2P_NAMESPACE=bootstrap-test
    ports:
      - "52416:52415"
    networks:
      bootstrap-net:
        ipv4_address: 172.30.20.3
    deploy:
      resources:
        limits:
          memory: 4g

networks:
  bootstrap-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.30.20.0/24
```

</details>

Two containers on a bridge network (`172.30.20.0/24`), fixed IPs, `--libp2p-port 30000`, cross-referencing `--bootstrap-peers`. Both nodes found each other, established a connection, and then ran the election protocol.

### Automated Testing

4 Rust integration tests in `rust/networking/tests/bootstrap_peers.rs` (`cargo test -p networking`):

| Test | What it verifies | Result |
|------|-----------------|--------|
| `two_nodes_connect_via_bootstrap_peers` | Node B discovers Node A via bootstrap addr (real TCP connection) | PASS |
| `create_swarm_with_empty_bootstrap_peers` | Backward compatibility: no bootstrap peers works | PASS |
| `create_swarm_ignores_invalid_bootstrap_addrs` | Invalid multiaddrs silently filtered | PASS |
| `create_swarm_with_fixed_port` | `listen_port` parameter works | PASS |

All 4 pass. The connection test takes ~6s.

---------

Signed-off-by: DeepZima <deepzima@outlook.com>
Co-authored-by: Evan <evanev7@gmail.com>
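The comma-separated `EXO_BOOTSTRAP_PEERS` format and the "invalid multiaddrs silently filtered" behaviour can be sketched in Python as follows. This is an illustrative approximation only: `parse_bootstrap_peers` is a hypothetical helper that accepts just the `/ip4/<addr>/tcp/<port>` shape used in the manual test, whereas the real code parses multiaddrs through libp2p in Rust and accepts more forms.

```python
import ipaddress


def parse_bootstrap_peers(raw: str) -> list[str]:
    """Split a comma-separated list and keep entries shaped like /ip4/<addr>/tcp/<port>.

    Entries that do not match are dropped silently, loosely mirroring the
    filtering the integration tests describe.
    """
    peers: list[str] = []
    for entry in raw.split(","):
        entry = entry.strip()
        # "/ip4/1.2.3.4/tcp/30000".split("/") gives ["", "ip4", "1.2.3.4", "tcp", "30000"]
        parts = entry.split("/")
        if len(parts) != 5 or parts[0] != "" or parts[1] != "ip4" or parts[3] != "tcp":
            continue
        try:
            ipaddress.IPv4Address(parts[2])  # raises a ValueError subclass if malformed
            port = int(parts[4])
        except ValueError:
            continue
        if 0 < port < 65536:
            peers.append(entry)
    return peers
```

For example, the value passed to node 1 in the Compose file above, `/ip4/172.30.20.3/tcp/30000`, is kept unchanged, while an entry such as `not-a-multiaddr` is dropped.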
This PR adds support for discovering exo peers across different networks using Headscale, an open-source, self-hosted implementation of the Tailscale control server.
This enables running exo clusters across WAN (different networks/locations) while still using the existing local mDNS discovery for LAN peers.
Motivation
While exo's mDNS discovery works great for local networks, there's demand for running clusters across geographically distributed machines. Headscale provides a self-hosted coordination layer that can be used to discover peers without exposing machines directly to the internet.
This is a re-implementation of #739 for the new v2 codebase architecture.
Changes
Rust Networking (rust/networking/)
- headscale.rs module implementing libp2p NetworkBehaviour for Headscale discovery
- tag:exo_peer_id=<peer_id> tags to identify exo nodes
Python Bindings (rust/exo_pyo3_bindings/)
- HeadscaleConfig class for configuration
- NetworkingHandle to accept optional Headscale config
CLI (src/exo/)
- --headscale-api-url flag
- --headscale-api-key flag
- --headscale-poll-interval flag (default: 5 seconds)
Usage
Setup Requirements
- tag:exo_peer_id=<peer_id> and tag:exo_port=52415
How It Works
- When --headscale-api-url and --headscale-api-key are provided, Headscale discovery is enabled alongside mDNS
- GET /api/v1/node to list all nodes in the Headscale network
- Nodes with tag:exo_peer_id=* tags are identified as exo peers
Testing
Tested with:
Future Improvements