P2P Mesh Fails to Re-establish Direct Thunderbolt Link After Exo Crash/Restart #1723

@singletapcoder

Summary

After an Exo process crash (SIGABRT in networking layer), the auto-restarted Exo processes fail to re-establish the direct Thunderbolt 5 peer-to-peer link between nodes. Instead, they fall back to a slower network path (Tailscale/WAN), causing inter-node latency to jump from <1ms to 40-70ms and token throughput to drop from 24 tok/s to 2.5-7.9 tok/s. The cluster appears "connected" but inference is severely degraded or fails entirely with connection refused errors.

Environment

  • Exo Version: 1.0.68 (macOS App)
  • macOS: 26.3.1 (25D2128)
  • Hardware: 2× Mac Studio M3 Ultra (Mac15,14), 512GB RAM each
  • Interconnect: Thunderbolt 5 cable, 80 Gbps symmetric, Receptacle 4 on both machines
  • Exo Settings: Sharding Strategy: Tensor, Interconnect: TCP/IP
  • Model: mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit (252GB, split across both nodes)
  • Networking: Tailscale VPN on both machines (100.x.x.x), Thunderbolt bridge on 192.168.4.x

Reproduction Steps

  1. Set up two Mac Studios connected via Thunderbolt 5 cable
  2. Launch Exo on both, form cluster, load 480B model
  3. Verify inference works at ~24 tok/s (this confirms the healthy state)
  4. Wait for an Exo crash (SIGABRT in the networking layer; see crash logs below), OR kill the Exo process manually
  5. Let Exo auto-restart via LaunchAgent
  6. Observe: nodes reconnect but inter-node latency is now 40-70ms instead of <1ms
  7. Inference throughput drops to 2.5-7.9 tok/s or fails with connection refused

Expected Behavior

After crash/restart, Exo should:

  1. Detect available Thunderbolt bridge interface (192.168.4.x)
  2. Preferentially establish P2P connections over the direct link
  3. Resume inference at normal throughput (~24 tok/s)
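
The preference order described above could be sketched roughly as follows. This is only an illustration, not Exo's actual discovery code: the function names are hypothetical, the Thunderbolt subnet is taken from the Environment section, and it assumes Tailscale addresses fall in the 100.64.0.0/10 CGNAT range and that any other address ranks between the two.

```python
import ipaddress

# Assumed subnets: Thunderbolt bridge range from this report's Environment section.
THUNDERBOLT_NET = ipaddress.ip_network("192.168.4.0/24")
# Assumption: Tailscale node addresses come from the CGNAT range 100.64.0.0/10.
TAILSCALE_NET = ipaddress.ip_network("100.64.0.0/10")


def link_priority(addr: str) -> int:
    """Return a sort key for a peer address; lower means more preferred."""
    ip = ipaddress.ip_address(addr)
    if ip in THUNDERBOLT_NET:
        return 0  # direct Thunderbolt bridge
    if ip in TAILSCALE_NET:
        return 2  # VPN overlay, expected to be the slowest path
    return 1      # any other address (for example ordinary LAN)


def rank_peer_addresses(addrs):
    """Order candidate peer addresses with the direct link first."""
    # sorted() is stable, so addresses with equal priority keep their input order.
    return sorted(addrs, key=link_priority)


if __name__ == "__main__":
    print(rank_peer_addresses(["100.101.5.7", "10.0.0.5", "192.168.4.21"]))
```

Ranking by subnet alone is not sufficient if the bridge interface is missing after a crash, so a real implementation would presumably also confirm that the preferred address is reachable with low latency.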

Actual Behavior

After crash/restart, Exo:

  1. Reconnects to peer nodes via Tailscale (100.x.x.x) instead of Thunderbolt (192.168.4.x)
  2. Inter-node latency jumps from <1ms to 40-70ms
  3. Token throughput degrades to 10-33% of baseline
  4. Sustained inference requests fail with connection refused after initial attempts
  5. /v1/models endpoint responds (lightweight metadata) but /v1/chat/completions fails under load

Diagnostic Data

Crash Frequency

Both nodes crashed on three of the four days (Mar 10, 12, and 13), which suggests a shared networking event rather than independent failures:

Node         | Crash Dates                          | Total (4 days)
Mac Studio 1 | Mar 10 (×2), Mar 11, Mar 12, Mar 13  | 5 crashes
Mac Studio 2 | Mar 10, Mar 12, Mar 13               | 3 crashes

Latest near-simultaneous crash (2026-03-13): Mac Studio 2 at 07:28:26, Mac Studio 1 at 07:28:40, 14 seconds apart.

Crash Log Excerpt (Mac Studio 1 — 2026-03-13-072840.ips)

{"app_name":"exo","timestamp":"2026-03-13 07:28:40.00 -0700","app_version":"",
 "slice_uuid":"82a0f60a-edf8-4094-31f3-db131e28b491","platform":1,
 "os_version":"macOS 26.3.1 (25D2128)","name":"exo"}
  • Process: /Applications/EXO.app/Contents/Resources/exo/exo (PID 12556)
  • Launch time: 2026-03-13 05:47:56 → Crash at 07:28:38 (uptime ~1.5 hours)
  • Bug type: 309 (SIGABRT)
  • Coalition: exolabs.EXO

Watchdog Alerts (Automated Monitoring)

[15:03:47] CRITICAL crash: mac-studio-1: 5 NEW crash(es) detected
[15:03:47] CRITICAL crash: mac-studio-2: 3 NEW crash(es) detected
[15:17:51] WARNING network: mac-studio-2: High inter-node latency: 49.66ms (expected <1ms)
[15:30:06] WARNING throughput: Token throughput degraded: 7.9 tok/s (baseline: 24.0, ratio: 33%)
[15:34:15] WARNING network: mac-studio-2: High inter-node latency: 70.51ms (expected <1ms)
[15:36:15] WARNING throughput: Token throughput degraded: 2.5 tok/s (baseline: 24.0, ratio: 10.5%)
[15:49:02] WARNING network: mac-studio-2: High inter-node latency: 40.33ms (expected <1ms)
[16:06:53] WARNING throughput: Token throughput degraded: 2.9 tok/s (baseline: 24.0, ratio: 12%)
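
For reference, the checks behind alerts like these can be approximated with a small function. This is a sketch of the idea rather than the actual watchdog: `check_sample` is a hypothetical name, the 24.0 tok/s baseline and the 1 ms latency expectation come from this report, and the 80% throughput cutoff is an assumed value.

```python
def check_sample(latency_ms, tokens_per_s, baseline_tps=24.0,
                 max_latency_ms=1.0, min_ratio=0.8):
    """Return a list of warning strings for one monitoring sample.

    min_ratio=0.8 is an assumed cutoff; the other defaults come from the report.
    """
    alerts = []
    if latency_ms >= max_latency_ms:
        alerts.append(
            f"High inter-node latency: {latency_ms:.2f}ms "
            f"(expected <{max_latency_ms:g}ms)"
        )
    ratio = tokens_per_s / baseline_tps
    if ratio < min_ratio:
        alerts.append(
            f"Token throughput degraded: {tokens_per_s:.1f} tok/s "
            f"(baseline: {baseline_tps:.1f}, ratio: {ratio:.0%})"
        )
    return alerts


if __name__ == "__main__":
    # Values taken from the 15:17:51 and 15:30:06 alerts above.
    for line in check_sample(49.66, 7.9):
        print(line)
```

With the assumed cutoff, the 22 tok/s measured after manual recovery (about 92% of baseline) would not raise a throughput warning.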

Network State After Crash (Mac Studio 2)

# bridge0 interface missing
ifconfig bridge0 → "No bridge0 interface"

# Thunderbolt ports show "No device connected" on buses 4 & 5
# despite physical cable being connected

# Exo connections show TCP over Thunderbolt IPs but with high latency:
TCP 192.168.4.22:57838 -> 192.168.4.21:61098 (ESTABLISHED)
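
Because the socket endpoints alone do not show which path the traffic actually takes, a per-interface latency check can help. The sketch below times a TCP handshake while binding the local end to a chosen source address. `tcp_connect_rtt_ms` is a hypothetical helper, and the peer port has to be whatever port Exo is listening on in your setup (not stated in this report).

```python
import socket
import time


def tcp_connect_rtt_ms(host, port, source_ip=None, timeout=2.0):
    """Time a TCP connect to host:port in milliseconds.

    If source_ip is given, the local end is bound to that address so the
    connection leaves through the interface that owns it.
    """
    source = (source_ip, 0) if source_ip else None
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout,
                                  source_address=source):
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0


if __name__ == "__main__":
    # Self-contained demo against a local listener; replace with the peer's
    # Thunderbolt address and port to test the real link.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    demo_port = server.getsockname()[1]
    print(f"{tcp_connect_rtt_ms('127.0.0.1', demo_port):.3f} ms")
    server.close()
```

On Mac Studio 2 the equivalent call would use host 192.168.4.21 and source_ip 192.168.4.22. A connect time in the tens of milliseconds, or a bind failure because the address no longer exists, would both point at the bridge not being healthy.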

Throughput Comparison

State                                  | Throughput     | Latency
Healthy (before crash)                 | 24.0 tok/s     | <1ms
After crash/auto-restart               | 2.5-7.9 tok/s  | 40-70ms
After manual cable reconnect + restart | 22.0 tok/s     | <1ms

Recovery Procedure (Current Workaround)

  1. Quit Exo on both Mac Studios
  2. Physically disconnect Thunderbolt cable
  3. Wait 10 seconds, reconnect cable
  4. Start Exo on node 2 first, then node 1
  5. Launch model from GUI
  6. Throughput returns to ~22 tok/s (92% of baseline)

Additional Note: macOS Kernel Panic on Hot-Unplug

When disconnecting the Thunderbolt cable while Exo has an active P2P link, macOS crashes with:

panic(cpu 0): ACIO3 NMI FIQ - ACIO main workloop(2) - failed to transition to state 3 (_iopStatus=7)

This appears to be an Apple firmware issue (RTKit/AppleCIOFirmwareV2) rather than an Exo issue. Still, Exo should ideally release the Thunderbolt connection cleanly before shutdown to avoid triggering it.

Suggested Improvements

  1. Prefer direct Thunderbolt link during peer discovery: after restart, actively probe for low-latency interfaces (192.168.4.x) and prefer them over higher-latency paths
  2. Connection health monitoring: detect latency degradation (e.g., >5ms on a Thunderbolt link) and trigger P2P mesh renegotiation
  3. Pre-flight validation: before starting inference, verify that inter-node connection quality meets a minimum threshold
  4. Graceful Thunderbolt cleanup: in the shutdown path and crash handler, release Thunderbolt bridge resources cleanly
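
As a rough illustration of the health monitoring suggestion, the sketch below flags a link for renegotiation once several consecutive latency samples exceed a threshold. `LinkHealthMonitor` is a hypothetical class, not an existing Exo API. The 5 ms threshold matches the example in the suggestion, and the window of 3 samples is an assumption chosen to avoid reacting to a single spike.

```python
from collections import deque


class LinkHealthMonitor:
    """Track recent latency samples and decide when to renegotiate the link."""

    def __init__(self, threshold_ms=5.0, window=3):
        self.threshold_ms = threshold_ms
        # Only the most recent `window` samples are kept.
        self.samples = deque(maxlen=window)

    def record(self, latency_ms):
        """Add one sample; return True if renegotiation should be triggered."""
        self.samples.append(latency_ms)
        window_full = len(self.samples) == self.samples.maxlen
        # Trigger only when every sample in a full window is above the threshold.
        return window_full and all(s > self.threshold_ms for s in self.samples)


if __name__ == "__main__":
    monitor = LinkHealthMonitor()
    # One healthy sample followed by the latencies from the watchdog alerts.
    for sample in (0.4, 49.66, 70.51, 40.33):
        print(sample, monitor.record(sample))
```

The same check could serve as the pre-flight validation step: collect a full window of samples before loading the model and refuse to start inference if `record` returns True.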

System Info

Exo: 1.0.68
macOS: 26.3.1 (25D2128)
Kernel: Darwin 25.3.0 xnu-12377.91.3~2/RELEASE_ARM64_T6031
Hardware: Mac Studio M3 Ultra (Mac15,14), 512GB unified memory each
Thunderbolt: 5 (80 Gbps symmetric)
Interconnect setting: TCP/IP
Sharding: Tensor
