
feat: Add Headscale support for WAN peer discovery #1023

Closed
realies wants to merge 3 commits into exo-explore:main from realies:feature/headscale-support

Conversation

@realies

@realies realies commented Dec 28, 2025

This PR adds support for discovering exo peers across different networks using Headscale, an open-source, self-hosted implementation of the Tailscale control server.

This enables running exo clusters across WAN (different networks/locations) while still using the existing local mDNS discovery for LAN peers.

Motivation

While exo's mDNS discovery works great for local networks, there's demand for running clusters across geographically distributed machines. Headscale provides a self-hosted coordination layer that can be used to discover peers without exposing machines directly to the internet.

This is a re-implementation of #739 for the new v2 codebase architecture.

Changes

Rust Networking (rust/networking/)

  • Added headscale.rs module implementing libp2p NetworkBehaviour for Headscale discovery
  • Polls Headscale API periodically to discover online peers
  • Parses tag:exo_peer_id=<peer_id> tags to identify exo nodes
  • Dials discovered peers using their Tailscale IPs
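
The tag convention above can be illustrated with a small Python sketch. This is not the PR's Rust code: the helper name `parse_exo_tags` and the sample peer ID are hypothetical, and the sketch assumes tags arrive as plain strings of the form `tag:key=value`.

```python
def parse_exo_tags(tags: list[str]) -> dict[str, str]:
    """Collect `tag:key=value` entries whose key starts with `exo_`."""
    parsed: dict[str, str] = {}
    for tag in tags:
        if not tag.startswith("tag:"):
            continue  # not a tag entry at all
        key, sep, value = tag[len("tag:"):].partition("=")
        # Keep only well-formed exo tags that carry a non-empty value.
        if sep and key.startswith("exo_") and value:
            parsed[key] = value
    return parsed


# Hypothetical tag list for one node; the peer ID is a placeholder.
tags = ["tag:exo_peer_id=12D3KooWExamplePeer", "tag:exo_port=52415", "tag:server"]
info = parse_exo_tags(tags)
print(info.get("exo_peer_id"), info.get("exo_port"))
```

Nodes without an `exo_peer_id` entry would simply be skipped by the caller.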

Python Bindings (rust/exo_pyo3_bindings/)

  • Added HeadscaleConfig class for configuration
  • Updated NetworkingHandle to accept optional Headscale config

CLI (src/exo/)

  • Added --headscale-api-url flag
  • Added --headscale-api-key flag
  • Added --headscale-poll-interval flag (default: 5 seconds)
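
As a rough illustration of how these flags could be declared, here is a minimal `argparse` sketch. The flag names and the 5-second default come from the list above; the helper names `build_parser` and `headscale_enabled`, and the choice of an integer number of seconds, are assumptions rather than the PR's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="exo")
    parser.add_argument("--headscale-api-url", default=None,
                        help="Base URL of the Headscale API")
    parser.add_argument("--headscale-api-key", default=None,
                        help="API key for the Headscale API")
    # Assumed to be whole seconds; the PR states a default of 5.
    parser.add_argument("--headscale-poll-interval", type=int, default=5,
                        help="Seconds between Headscale API polls")
    return parser


def headscale_enabled(args: argparse.Namespace) -> bool:
    """Discovery is enabled only when both the URL and the key are set."""
    return bool(args.headscale_api_url and args.headscale_api_key)


args = build_parser().parse_args([
    "--headscale-api-url", "https://headscale.example.com",
    "--headscale-api-key", "YOUR_API_KEY",
])
print(headscale_enabled(args), args.headscale_poll_interval)
```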

Usage

# Start exo with Headscale discovery enabled
exo --headscale-api-url https://headscale.example.com \
    --headscale-api-key YOUR_API_KEY

# Optionally customize poll interval
exo --headscale-api-url https://headscale.example.com \
    --headscale-api-key YOUR_API_KEY \
    --headscale-poll-interval 10

Setup Requirements

  1. Run a Headscale server
  2. Connect machines to Headscale using Tailscale client
  3. Create an API key in Headscale
  4. Tag each exo node with its peer ID and port: tag:exo_peer_id=<peer_id> and tag:exo_port=52415

How It Works

  1. When --headscale-api-url and --headscale-api-key are provided, Headscale discovery is enabled alongside mDNS
  2. The node polls GET /api/v1/node to list all nodes in the Headscale network
  3. Online nodes with tag:exo_peer_id=* tags are identified as exo peers
  4. libp2p dials these peers using their Tailscale IP addresses
  5. Once connected, normal gossipsub communication proceeds
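
Steps 2 to 4 can be sketched in Python against an in-memory response, with no network access. The JSON field names used here (`nodes`, `online`, `ipAddresses`, `forcedTags`) are assumptions about the shape of the Headscale node listing, and `discover_dial_addrs` is a hypothetical helper, not the PR's Rust implementation. Only IPv4 addresses are used, since IPv6 is listed under future improvements.

```python
import ipaddress


def discover_dial_addrs(response: dict, default_port: int) -> dict[str, list[str]]:
    """Map peer ID -> /ip4/.../tcp/... multiaddrs for online exo nodes."""
    peers: dict[str, list[str]] = {}
    for node in response.get("nodes", []):
        if not node.get("online"):
            continue  # step 3: only online nodes are considered
        # Turn "tag:exo_key=value" strings into {"exo_key": "value"}.
        tags = dict(
            t[len("tag:"):].split("=", 1)
            for t in node.get("forcedTags", [])
            if t.startswith("tag:exo_") and "=" in t
        )
        peer_id = tags.get("exo_peer_id")
        if not peer_id:
            continue  # not an exo node
        port = int(tags.get("exo_port", default_port))
        addrs = [
            f"/ip4/{ip}/tcp/{port}"
            for ip in node.get("ipAddresses", [])
            if ipaddress.ip_address(ip).version == 4
        ]
        if addrs:
            peers[peer_id] = addrs  # step 4: addresses to hand to the dialer
    return peers


# Hypothetical response body; all values are placeholders.
sample = {"nodes": [
    {"online": True, "ipAddresses": ["100.64.0.1", "fd7a:115c:a1e0::1"],
     "forcedTags": ["tag:exo_peer_id=peerA", "tag:exo_port=52415"]},
    {"online": False, "ipAddresses": ["100.64.0.2"],
     "forcedTags": ["tag:exo_peer_id=peerB"]},
    {"online": True, "ipAddresses": ["100.64.0.3"], "forcedTags": ["tag:server"]},
]}
print(discover_dial_addrs(sample, default_port=52415))
```

In this sample only `peerA` qualifies: `peerB` is offline and the third node carries no exo tag.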

Testing

Tested with:

  • Headscale server (self-hosted)
  • Multiple machines on different networks connected via Tailscale client

Future Improvements

  • Auto-register peer_id tag on startup via Headscale API
  • Support for Tailscale (managed) in addition to Headscale
  • IPv6 support for Tailscale addresses

@exo-explore exo-explore deleted 6 comments Dec 28, 2025
@Evanev7
Member

Evanev7 commented Dec 28, 2025

This actually looks really solid, but I'm uncertain about our long-term support for Headscale. Our net code is not in a great state and I have rewritten it in #959, so I'd only want to support this in an experimental capacity. Would you mind if I merged this with the understanding that it may be rewritten in the future?

@realies
Author

realies commented Dec 28, 2025

Thanks for the review! Yes, that sounds completely reasonable - I'm happy for this to be merged as experimental with the understanding that it may be rewritten when the networking code is refactored in #959.

I've pushed a fix for the CI failure (the Python stubs needed to be updated for the new HeadscaleConfig class).

I'm also happy to help adapt this to any new networking architecture if needed in the future.

Member

@Evanev7 Evanev7 left a comment

...yeah this looks solid. I don't have Headscale set up to test, and long-term support will be a future conversation, but the code looks good.

@diffray-bot

This comment was marked as spam.

Comment on lines +108 to +124
pub struct PyHeadscaleConfig {
    /// Base URL for the Headscale API (e.g., "https://headscale.example.com")
    #[pyo3(get, set)]
    pub api_base_url: String,

    /// API key for authenticating with the Headscale API
    #[pyo3(get, set)]
    pub api_key: String,

    /// How often to poll the Headscale API for peer updates (in seconds)
    #[pyo3(get, set)]
    pub poll_interval_secs: u64,

    /// The port that exo is listening on
    #[pyo3(get, set)]
    pub exo_port: u16,
}

This comment was marked as spam.

Comment on lines +46 to +55
# Build Headscale config if API URL and key are provided
headscale_config = None
if args.headscale_api_url and args.headscale_api_key:
    logger.info(f"Headscale discovery enabled with API URL: {args.headscale_api_url}")
    headscale_config = HeadscaleConfig(
        api_base_url=args.headscale_api_url,
        api_key=args.headscale_api_key,
        poll_interval_secs=args.headscale_poll_interval,
        exo_port=args.api_port,
    )

This comment was marked as spam.

@@ -101,6 +101,58 @@ mod exception {
}

This comment was marked as spam.

Comment on lines 15 to 19
// #![feature(negative_impls)]

pub mod discovery;
pub mod headscale;
pub mod keep_alive;

This comment was marked as spam.

@diffray-bot

This comment was marked as spam.

@Evanev7
Member

Evanev7 commented Dec 30, 2025

Wow, spam reviews. There's a new one. Bear with me while I figure out how to dismiss that.

Update: GitHub does not have this feature. Great.

This adds support for discovering exo peers across different networks using
a Headscale coordination server. Headscale is an open-source implementation
of the Tailscale control server.

Changes:
- Add headscale.rs module in rust/networking for Headscale API integration
- Implement HeadscaleConfig Python binding for configuration
- Add --headscale-api-url and --headscale-api-key CLI flags
- Integrate Headscale discovery as optional Toggle behaviour in swarm
- Handle Headscale connection events in networking task

Usage:
  exo --headscale-api-url https://headscale.example.com \
      --headscale-api-key YOUR_API_KEY

Nodes running exo with Headscale will:
1. Poll the Headscale API for online peers
2. Discover peers with exo peer_id tags
3. Dial discovered peers using their Tailscale IPs
4. Work alongside local mDNS discovery for hybrid setups
@realies realies force-pushed the feature/headscale-support branch from f63b1b2 to 8957772 Compare January 10, 2026 17:23
- Add environment variable support for HEADSCALE_API_URL and HEADSCALE_API_KEY
  (recommended for security as CLI args are visible in ps)
- Add documentation for fetch_nodes error conditions
- Improve TODO comment for async blocking issue with recommended fix approach
- Document catch-all pattern in on_swarm_event
- Batch dial requests instead of individual dials per address (perf)
- Add URL validation for Headscale API URL with HTTPS warning
- Move reqwest to workspace dependencies for consistency
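
A minimal Python sketch of two of the items above: the environment-variable fallback and the URL validation with an HTTPS warning. The variable name `HEADSCALE_API_URL` comes from the commit message; the helper names `resolve_setting` and `validate_api_url`, the rule that a CLI value takes precedence over the environment, and the warning text are assumptions.

```python
import os
from urllib.parse import urlparse


def resolve_setting(cli_value, env_name, environ=None):
    """Prefer the CLI value, else fall back to the environment variable."""
    environ = os.environ if environ is None else environ
    return cli_value or environ.get(env_name) or None


def validate_api_url(url: str) -> list[str]:
    """Return a list of warnings; raise ValueError for a malformed URL."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"invalid Headscale API URL: {url!r}")
    warnings = []
    if parsed.scheme != "https":
        warnings.append("Headscale API URL is not HTTPS; "
                        "the API key would be sent unencrypted")
    return warnings


# Hypothetical environment; no CLI value is given, so the env var is used.
env = {"HEADSCALE_API_URL": "https://headscale.example.com"}
url = resolve_setting(None, "HEADSCALE_API_URL", env)
print(url, validate_api_url(url))
```

Passing the key through the environment keeps it out of the process argument list, which is the security concern the commit message raises.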
@realies realies force-pushed the feature/headscale-support branch from 8957772 to 586b07f Compare January 10, 2026 17:34
@realies
Author

realies commented Jan 10, 2026

Rebased onto latest main and resolved the merge conflicts. Also addressed some of the valid points from the automated review (env var support for API credentials, batched dial requests, URL validation, etc.) and marked them as resolved. CI should pass now - ready to merge when you get a chance!

@AlexCheema
Contributor

Thank you for your contribution! The exo codebase has undergone significant architectural changes since this PR was opened (event sourcing, new placement system, runner rewrite), and this PR now has merge conflicts that would require a substantial rewrite to resolve.

We're closing this to keep the PR list focused, but the idea is still welcome — if you'd like to revisit this, please feel free to open a fresh PR against the current main branch. Thanks again for your time and effort!

@AlexCheema AlexCheema closed this Feb 17, 2026
Evanev7 added a commit that referenced this pull request Mar 25, 2026
**Enabling peers to be discovered in environments where mDNS is
unavailable (SSH sessions, headless servers, Docker).**

## Motivation
Exo discovers peers exclusively via mDNS, which works great on a local
network but breaks once you move beyond a single L2 broadcast domain:

- SSH sessions on macOS — TCC blocks mDNS multicast from non-GUI
sessions (#1488)
- Headless servers/rack machines — #1682 ("DGX Spark does not find other
nodes")
- Docker Compose — mDNS is often unavailable across container networks;
e.g. #1462 (E2E test framework) needs an alternative

Related work:
#1488 (working implementation by @AlexCheema, closed because SSH
had a GUI workaround),
#1023 (Headscale WAN discovery, closed due to merge conflicts),
#1656 (discovery cleanup, open).

This PR introduces an optional bootstrap mechanism for peer discovery
while leaving the existing mDNS behavior unchanged.

## Changes
Adds two new CLI flags:

- `--bootstrap-peers` (env: `EXO_BOOTSTRAP_PEERS`) — comma-separated
libp2p multiaddrs to dial on startup and retry periodically
- `--libp2p-port` — fixed TCP port for libp2p to listen on (default:
OS-assigned). Required when using bootstrap peers, so other nodes know
which port to dial.
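
As a rough sketch of how a comma-separated `EXO_BOOTSTRAP_PEERS` value could be split and filtered, the Python below accepts only the `/ip4/<addr>/tcp/<port>` form and silently drops anything else. This is a simplification: the helper name `parse_bootstrap_peers` is hypothetical, and the real parsing is done by the libp2p multiaddr parser on the Rust side, which understands many more protocols.

```python
import ipaddress


def parse_bootstrap_peers(raw: str) -> list[str]:
    """Split a comma-separated list and keep well-formed /ip4/.../tcp/... entries."""
    peers: list[str] = []
    for item in raw.split(","):
        addr = item.strip()
        parts = addr.split("/")
        # Expected shape: ["", "ip4", "<addr>", "tcp", "<port>"]
        if len(parts) != 5 or parts[0] != "" or parts[1] != "ip4" or parts[3] != "tcp":
            continue
        try:
            ipaddress.IPv4Address(parts[2])  # must be a literal IPv4 address
            port = int(parts[4])
        except ValueError:
            continue  # invalid entries are skipped rather than reported
        if 0 < port <= 65535:
            peers.append(addr)
    return peers


print(parse_bootstrap_peers("/ip4/172.30.20.3/tcp/30000, not-a-multiaddr"))
```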

8 files: 
- `rust/networking/src/discovery.rs`: Store bootstrap addrs, dial in
existing retry loop
- `rust/networking/src/swarm.rs`: Thread `bootstrap_peers` parameter to
`Behaviour`
- `rust/networking/examples/chatroom.rs`: Updated call site for new
create_swarm signature
- `rust/networking/tests/bootstrap_peers.rs`: Integration tests
- `rust/exo_pyo3_bindings/src/networking.rs`: Accept optional
`bootstrap_peers` in PyO3 constructor
- `rust/exo_pyo3_bindings/exo_pyo3_bindings.pyi`: Update type stub
- `src/exo/routing/router.py`: Pass peers to `NetworkingHandle`
- `src/exo/main.py`: `--bootstrap-peers` CLI arg +
`EXO_BOOTSTRAP_PEERS` env var

## Why It Works

Bootstrap peers are dialed in the existing retry loop, the same path
taken by mDNS-discovered peers. The swarm handles connection, Noise
handshake, and gossipsub mesh joining from there.

PeerId is intentionally not required in the multiaddr; the Noise
handshake discovers it.

Docker Compose example:

```yaml
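services:
  exo-1:
    environment:
      # /dns4 is used because /ip4 only accepts a literal IPv4 address;
      # this assumes DNS resolution is enabled in the libp2p transport.
      EXO_BOOTSTRAP_PEERS: "/dns4/exo-2/tcp/30000"
  exo-2:
    environment:
      EXO_BOOTSTRAP_PEERS: "/dns4/exo-1/tcp/30000"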
```

## Test Plan

### Manual Testing
<details>
<summary>Docker Compose config</summary>

```yaml
services:
  exo-node1:
    build:
      context: .
      dockerfile: Dockerfile.bootstrap-test
    container_name: exo-bootstrap-node1
    hostname: exo-node1
    command: ["-q", "--libp2p-port", "30000", "--bootstrap-peers", "/ip4/172.30.20.3/tcp/30000"]
    environment:
      - EXO_LIBP2P_NAMESPACE=bootstrap-test
    ports:
      - "52415:52415"
    networks:
      bootstrap-net:
        ipv4_address: 172.30.20.2
    deploy:
      resources:
        limits:
          memory: 4g

  exo-node2:
    build:
      context: .
      dockerfile: Dockerfile.bootstrap-test
    container_name: exo-bootstrap-node2
    hostname: exo-node2
    command: ["-q", "--libp2p-port", "30000", "--bootstrap-peers", "/ip4/172.30.20.2/tcp/30000"]
    environment:
      - EXO_LIBP2P_NAMESPACE=bootstrap-test
    ports:
      - "52416:52415"
    networks:
      bootstrap-net:
        ipv4_address: 172.30.20.3
    deploy:
      resources:
        limits:
          memory: 4g

networks:
  bootstrap-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.30.20.0/24
```
</details> 

Two containers on a bridge network (`172.30.20.0/24`), fixed IPs,
`--libp2p-port 30000`, cross-referencing `--bootstrap-peers`.

Both nodes found each other, established a connection, and then ran the
election protocol.

### Automated Testing

4 Rust integration tests in `rust/networking/tests/bootstrap_peers.rs`
(`cargo test -p networking`):

| Test | What it verifies | Result |
|------|------------------|--------|
| `two_nodes_connect_via_bootstrap_peers` | Node B discovers Node A via bootstrap addr (real TCP connection) | PASS |
| `create_swarm_with_empty_bootstrap_peers` | Backward compatibility: no bootstrap peers works | PASS |
| `create_swarm_ignores_invalid_bootstrap_addrs` | Invalid multiaddrs silently filtered | PASS |
| `create_swarm_with_fixed_port` | `listen_port` parameter works | PASS |

All 4 pass. The connection test takes ~6s.

---------

Signed-off-by: DeepZima <deepzima@outlook.com>
Co-authored-by: Evan <evanev7@gmail.com>

4 participants