Summary
After an Exo process crash (SIGABRT in networking layer), the auto-restarted Exo processes fail to re-establish the direct Thunderbolt 5 peer-to-peer link between nodes. Instead, they fall back to a slower network path (Tailscale/WAN), causing inter-node latency to jump from <1ms to 40-70ms and token throughput to drop from 24 tok/s to 2.5-7.9 tok/s. The cluster appears "connected" but inference is severely degraded or fails entirely with connection refused errors.
Environment
- Exo Version: 1.0.68 (macOS App)
- macOS: 26.3.1 (25D2128)
- Hardware: 2× Mac Studio M3 Ultra (Mac15,14), 512GB RAM each
- Interconnect: Thunderbolt 5 cable, 80 Gbps symmetric, Receptacle 4 on both machines
- Exo Settings: Sharding Strategy: Tensor, Interconnect: TCP/IP
- Model: mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit (252GB, split across both nodes)
- Networking: Tailscale VPN on both machines (100.x.x.x), Thunderbolt bridge on 192.168.4.x
Reproduction Steps
- Set up two Mac Studios connected via Thunderbolt 5 cable
- Launch Exo on both, form cluster, load 480B model
- Verify inference works at ~24 tok/s — this confirms healthy state
- Wait for an Exo crash (SIGABRT in networking layer — see crash logs below), OR kill the Exo process manually
- Let Exo auto-restart via LaunchAgent
- Observe: nodes reconnect but inter-node latency is now 40-70ms instead of <1ms
- Inference throughput drops to 2.5-7.9 tok/s or fails with connection refused
Expected Behavior
After crash/restart, Exo should:
- Detect available Thunderbolt bridge interface (192.168.4.x)
- Preferentially establish P2P connections over the direct link
- Resume inference at normal throughput (~24 tok/s)
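To make the expected preference concrete, here is a minimal sketch of how candidate peer addresses could be ranked. It is not Exo's actual discovery code. The function names are hypothetical, the `/24` prefix for the Thunderbolt bridge is an assumption based on the `192.168.4.x` addresses above, and the Tailscale range is the `100.64.0.0/10` CGNAT block that Tailscale assigns node addresses from.

```python
import ipaddress

# Assumption: the Thunderbolt bridge uses 192.168.4.0/24 (the report only says 192.168.4.x).
THUNDERBOLT_NET = ipaddress.ip_network("192.168.4.0/24")
# Tailscale node addresses come from the 100.64.0.0/10 CGNAT range.
TAILSCALE_NET = ipaddress.ip_network("100.64.0.0/10")


def path_rank(addr: str) -> int:
    """Lower rank means a more preferred path (hypothetical helper)."""
    ip = ipaddress.ip_address(addr)
    if ip in THUNDERBOLT_NET:
        return 0  # direct Thunderbolt link
    if ip in TAILSCALE_NET:
        return 2  # VPN overlay, last resort
    return 1      # any other LAN address


def order_peer_addresses(addrs):
    """Return addresses sorted so the direct link is tried first."""
    return sorted(addrs, key=path_rank)


print(order_peer_addresses(["100.101.102.103", "192.168.4.21", "10.0.0.5"]))
# ['192.168.4.21', '10.0.0.5', '100.101.102.103']
```

A static subnet check like this only expresses the preference order. It does not confirm that the bridge interface is actually up after the restart.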
Actual Behavior
After crash/restart, Exo:
- Reconnects to peer nodes via Tailscale (100.x.x.x) instead of Thunderbolt (192.168.4.x)
- Inter-node latency jumps from <1ms to 40-70ms
- Token throughput degrades to 10-33% of baseline
- Sustained inference requests fail with connection refused after initial attempts
- The `/v1/models` endpoint responds (lightweight metadata), but `/v1/chat/completions` fails under load
Diagnostic Data
Crash Frequency
On several occasions both nodes crashed within seconds of each other, suggesting a shared networking event. Mac Studio 1 also crashed at times when Mac Studio 2 did not:
| Node | Crash Dates | Total (4 days) |
|---|---|---|
| Mac Studio 1 | Mar 10 (×2), Mar 11, Mar 12, Mar 13 | 5 crashes |
| Mac Studio 2 | Mar 10, Mar 12, Mar 13 | 3 crashes |
Latest simultaneous crash: Mac Studio 1 at 07:28:40, Mac Studio 2 at 07:28:26 on 2026-03-13.
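For anyone who wants to check the correlation on their own cluster, the sketch below pairs crash timestamps from two nodes that fall within a short window. It assumes crash report names end in a `YYYY-MM-DD-HHMMSS.ips` stamp, as in the excerpt heading below. The Mac Studio 2 file name is hypothetical, built from the 07:28:26 time reported above, and the 30-second window is an arbitrary choice.

```python
from datetime import datetime, timedelta


def parse_ips_stamp(name: str) -> datetime:
    """Parse the trailing timestamp of a name like '2026-03-13-072840.ips'."""
    stem = name.rsplit(".", 1)[0]
    # The stamp 'YYYY-MM-DD-HHMMSS' is always 17 characters long.
    return datetime.strptime(stem[-17:], "%Y-%m-%d-%H%M%S")


def paired_crashes(node_a, node_b, window=timedelta(seconds=30)):
    """Return (a, b) timestamp pairs that are at most `window` apart."""
    return [(a, b) for a in node_a for b in node_b if abs(a - b) <= window]


studio1 = [parse_ips_stamp("2026-03-13-072840.ips")]
# Hypothetical file name, derived from the 07:28:26 crash time in this report.
studio2 = [parse_ips_stamp("2026-03-13-072826.ips")]

for a, b in paired_crashes(studio1, studio2):
    print(a, b, abs(a - b).total_seconds())  # gap of 14.0 seconds
```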
Crash Log Excerpt (Mac Studio 1 — 2026-03-13-072840.ips)
{"app_name":"exo","timestamp":"2026-03-13 07:28:40.00 -0700","app_version":"",
"slice_uuid":"82a0f60a-edf8-4094-31f3-db131e28b491","platform":1,
"os_version":"macOS 26.3.1 (25D2128)","name":"exo"}
- Process: `/Applications/EXO.app/Contents/Resources/exo/exo` (PID 12556)
- Launch time: 2026-03-13 05:47:56 → Crash at 07:28:38 (uptime ~1 h 41 min)
- Bug type: 309 (SIGABRT)
- Coalition: exolabs.EXO
Watchdog Alerts (Automated Monitoring)
[15:03:47] CRITICAL crash: mac-studio-1: 5 NEW crash(es) detected
[15:03:47] CRITICAL crash: mac-studio-2: 3 NEW crash(es) detected
[15:17:51] WARNING network: mac-studio-2: High inter-node latency: 49.66ms (expected <1ms)
[15:30:06] WARNING throughput: Token throughput degraded: 7.9 tok/s (baseline: 24.0, ratio: 33%)
[15:34:15] WARNING network: mac-studio-2: High inter-node latency: 70.51ms (expected <1ms)
[15:36:15] WARNING throughput: Token throughput degraded: 2.5 tok/s (baseline: 24.0, ratio: 10.5%)
[15:49:02] WARNING network: mac-studio-2: High inter-node latency: 40.33ms (expected <1ms)
[16:06:53] WARNING throughput: Token throughput degraded: 2.9 tok/s (baseline: 24.0, ratio: 12%)
Network State After Crash (Mac Studio 2)
# bridge0 interface missing
ifconfig bridge0 → "No bridge0 interface"
# Thunderbolt ports show "No device connected" on buses 4 & 5
# despite physical cable being connected
# Exo connections show TCP over Thunderbolt IPs but with high latency:
TCP 192.168.4.22:57838 -> 192.168.4.21:61098 (ESTABLISHED)
Throughput Comparison
| State | Throughput | Latency | Status |
|---|---|---|---|
| Healthy (before crash) | 24.0 tok/s | <1ms | ✅ |
| After crash/auto-restart | 2.5-7.9 tok/s | 40-70ms | ❌ |
| After manual cable reconnect + restart | 22.0 tok/s | <1ms | ✅ |
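The ratios quoted in this report follow directly from the table. A minimal sketch of the arithmetic is below. The 24.0 tok/s baseline comes from the healthy state above, while the `min_ratio` alert threshold of 0.8 is an assumption for illustration and not the value used by the watchdog.

```python
BASELINE_TOK_S = 24.0  # healthy throughput reported above


def throughput_ratio(observed: float, baseline: float = BASELINE_TOK_S) -> float:
    """Observed throughput as a fraction of the baseline."""
    return observed / baseline


def is_degraded(observed: float, baseline: float = BASELINE_TOK_S,
                min_ratio: float = 0.8) -> bool:
    """True when throughput is below min_ratio of baseline (threshold is assumed)."""
    return throughput_ratio(observed, baseline) < min_ratio


for tok_s in (24.0, 7.9, 2.5, 22.0):
    pct = 100 * throughput_ratio(tok_s)
    print(f"{tok_s:5.1f} tok/s -> {pct:5.1f}% of baseline, degraded={is_degraded(tok_s)}")
```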
Recovery Procedure (Current Workaround)
- Quit Exo on both Mac Studios
- Physically disconnect Thunderbolt cable
- Wait 10 seconds, reconnect cable
- Start Exo on node 2 first, then node 1
- Launch model from GUI
- Throughput returns to ~22 tok/s (92% baseline)
Additional Note: macOS Kernel Panic on Hot-Unplug
When disconnecting the Thunderbolt cable while Exo has an active P2P link, macOS crashes with:
panic(cpu 0): ACIO3 NMI FIQ - ACIO main workloop(2) - failed to transition to state 3 (_iopStatus=7)
This appears to be an Apple firmware issue (RTKit/AppleCIOFirmwareV2) rather than an Exo issue. Even so, Exo should ideally release the Thunderbolt connection cleanly before shutdown so that a later cable disconnect does not hit an active link.
Suggested Improvements
- Prefer direct Thunderbolt link during peer discovery — After restart, actively probe for low-latency interfaces (192.168.4.x) and prefer them over higher-latency paths
- Connection health monitoring — Detect latency degradation (e.g., >5ms on a Thunderbolt link) and trigger P2P mesh renegotiation
- Pre-flight validation — Before starting inference, verify inter-node connection quality meets a minimum threshold
- Graceful Thunderbolt cleanup — On shutdown/crash handler, release Thunderbolt bridge resources cleanly
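As an illustration of the pre-flight validation and health monitoring ideas, the sketch below times TCP connects to a peer and compares the median against the 5 ms threshold suggested above. It is not Exo code: the function names are hypothetical, the peer port must be supplied by the caller, and TCP connect time is only a rough proxy for link latency.

```python
import socket
import statistics
import time


def tcp_connect_ms(host: str, port: int, timeout: float = 1.0) -> float:
    """Time one TCP connect to host:port, in milliseconds. Raises OSError on failure."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; close it immediately
    return (time.perf_counter() - start) * 1000.0


def link_is_healthy(host: str, port: int, threshold_ms: float = 5.0,
                    samples: int = 5) -> bool:
    """True if the median connect time is within threshold_ms.

    An unreachable peer (refused connection, timeout) counts as unhealthy.
    """
    try:
        times = [tcp_connect_ms(host, port) for _ in range(samples)]
    except OSError:
        return False
    return statistics.median(times) <= threshold_ms
```

A caller could run `link_is_healthy("192.168.4.21", port)` before loading a model and refuse to start, or trigger renegotiation, when it returns `False`. Using the median rather than the mean keeps a single slow sample from flagging an otherwise healthy link.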
System Info
Exo: 1.0.68
macOS: 26.3.1 (25D2128)
Kernel: Darwin 25.3.0 xnu-12377.91.3~2/RELEASE_ARM64_T6031
Hardware: Mac Studio M3 Ultra (Mac15,14), 512GB unified memory each
Thunderbolt: 5 (80 Gbps symmetric)
Interconnect setting: TCP/IP
Sharding: Tensor