P2P Mesh Fails to Re-establish Direct Thunderbolt Link After Exo Crash/Restart

## Summary

After an Exo process crash (SIGABRT in networking layer), the auto-restarted Exo processes fail to re-establish the direct Thunderbolt 5 peer-to-peer link between nodes. Instead, they fall back to a slower network path (Tailscale/WAN), causing inter-node latency to jump from <1ms to 40-70ms and token throughput to drop from 24 tok/s to 2.5-7.9 tok/s. The cluster appears "connected" but inference is severely degraded or fails entirely with connection refused errors.

## Environment

- **Exo Version:** 1.0.68 (macOS App)
- **macOS:** 26.3.1 (25D2128)
- **Hardware:** 2× Mac Studio M4 Ultra (Mac15,14), 512GB RAM each
- **Interconnect:** Thunderbolt 5 cable, 80 Gbps symmetric, Receptacle 4 on both machines
- **Exo Settings:** Sharding Strategy: Tensor, Interconnect: TCP/IP
- **Model:** mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit (252GB, split across both nodes)
- **Networking:** Tailscale VPN on both machines (100.x.x.x), Thunderbolt bridge on 192.168.4.x

## Reproduction Steps

1. Set up two Mac Studios connected via Thunderbolt 5 cable
2. Launch Exo on both, form cluster, load 480B model
3. Verify inference works at ~24 tok/s — this confirms healthy state
4. Wait for an Exo crash (SIGABRT in networking layer — see crash logs below), OR kill the Exo process manually
5. Let Exo auto-restart via LaunchAgent
6. Observe: nodes reconnect but inter-node latency is now 40-70ms instead of <1ms
7. Inference throughput drops to 2.5-7.9 tok/s or fails with connection refused

## Expected Behavior

After crash/restart, Exo should:
1. Detect available Thunderbolt bridge interface (192.168.4.x)
2. Preferentially establish P2P connections over the direct link
3. Resume inference at normal throughput (~24 tok/s)

## Actual Behavior

After crash/restart, Exo:
1. Reconnects to peer nodes via Tailscale (100.x.x.x) instead of Thunderbolt (192.168.4.x)
2. Inter-node latency jumps from <1ms to 40-70ms
3. Token throughput degrades to 10-33% of baseline
4. Sustained inference requests fail with connection refused after initial attempts
5. `/v1/models` endpoint responds (lightweight metadata) but `/v1/chat/completions` fails under load

## Diagnostic Data

### Crash Frequency
Both nodes crash simultaneously, suggesting a shared networking event:

| Node | Crash Dates | Total (4 days) |
|------|------------|----------------|
| Mac Studio 1 | Mar 10 (×2), Mar 11, Mar 12, Mar 13 | 5 crashes |
| Mac Studio 2 | Mar 10, Mar 12, Mar 13 | 3 crashes |

Latest simultaneous crash: Mac Studio 1 at 07:28:40, Mac Studio 2 at 07:28:26 on 2026-03-13.

### Crash Log Excerpt (Mac Studio 1 — 2026-03-13-072840.ips)
```json
{"app_name":"exo","timestamp":"2026-03-13 07:28:40.00 -0700","app_version":"",
 "slice_uuid":"82a0f60a-edf8-4094-31f3-db131e28b491","platform":1,
 "os_version":"macOS 26.3.1 (25D2128)","name":"exo"}
```
- Process: `/Applications/EXO.app/Contents/Resources/exo/exo` (PID 12556)
- Launch time: 2026-03-13 05:47:56 → Crash at 07:28:38 (uptime ~1.5 hours)
- Bug type: 309 (SIGABRT)
- Coalition: exolabs.EXO

### Watchdog Alerts (Automated Monitoring)
```
[15:03:47] CRITICAL crash: mac-studio-1: 5 NEW crash(es) detected
[15:03:47] CRITICAL crash: mac-studio-2: 3 NEW crash(es) detected
[15:17:51] WARNING network: mac-studio-2: High inter-node latency: 49.66ms (expected <1ms)
[15:30:06] WARNING throughput: Token throughput degraded: 7.9 tok/s (baseline: 24.0, ratio: 33%)
[15:34:15] WARNING network: mac-studio-2: High inter-node latency: 70.51ms (expected <1ms)
[15:36:15] WARNING throughput: Token throughput degraded: 2.5 tok/s (baseline: 24.0, ratio: 10.5%)
[15:49:02] WARNING network: mac-studio-2: High inter-node latency: 40.33ms (expected <1ms)
[16:06:53] WARNING throughput: Token throughput degraded: 2.9 tok/s (baseline: 24.0, ratio: 12%)
```

### Network State After Crash (Mac Studio 2)
```
# bridge0 interface missing
ifconfig bridge0 → "No bridge0 interface"

# Thunderbolt ports show "No device connected" on buses 4 & 5
# despite physical cable being connected

# Exo connections show TCP over Thunderbolt IPs but with high latency:
TCP 192.168.4.22:57838 -> 192.168.4.21:61098 (ESTABLISHED)
```

### Throughput Comparison
| State | Throughput | Latency | Status |
|-------|-----------|---------|--------|
| Healthy (before crash) | 24.0 tok/s | <1ms | ✅ |
| After crash/auto-restart | 2.5-7.9 tok/s | 40-70ms | ❌ |
| After manual cable reconnect + restart | 22.0 tok/s | <1ms | ✅ |

### Recovery Procedure (Current Workaround)
1. Quit Exo on both Mac Studios
2. Physically disconnect Thunderbolt cable
3. Wait 10 seconds, reconnect cable
4. Start Exo on node 2 first, then node 1
5. Launch model from GUI
6. Throughput returns to ~22 tok/s (92% baseline)

## Additional Note: macOS Kernel Panic on Hot-Unplug

When disconnecting the Thunderbolt cable while Exo has an active P2P link, macOS crashes with:
```
panic(cpu 0): ACIO3 NMI FIQ - ACIO main workloop(2) - failed to transition to state 3 (_iopStatus=7)
```
This is an Apple firmware issue (RTKit/AppleCIOFirmwareV2) not an Exo issue, but it's worth noting that Exo should ideally release the Thunderbolt connection cleanly before shutdown to prevent this.

## Suggested Improvements

1. **Prefer direct Thunderbolt link during peer discovery** — After restart, actively probe for low-latency interfaces (192.168.4.x) and prefer them over higher-latency paths
2. **Connection health monitoring** — Detect latency degradation (e.g., >5ms on a Thunderbolt link) and trigger P2P mesh renegotiation
3. **Pre-flight validation** — Before starting inference, verify inter-node connection quality meets a minimum threshold
4. **Graceful Thunderbolt cleanup** — On shutdown/crash handler, release Thunderbolt bridge resources cleanly

## System Info
```
Exo: 1.0.68
macOS: 26.3.1 (25D2128)
Kernel: Darwin 25.3.0 xnu-12377.91.3~2/RELEASE_ARM64_T6031
Hardware: Mac Studio M4 Ultra (Mac15,14), 512GB unified memory each
Thunderbolt: 5 (80 Gbps symmetric)
Interconnect setting: TCP/IP
Sharding: Tensor
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

P2P Mesh Fails to Re-establish Direct Thunderbolt Link After Exo Crash/Restart #1723

Summary

Environment

Reproduction Steps

Expected Behavior

Actual Behavior

Diagnostic Data

Crash Frequency

Crash Log Excerpt (Mac Studio 1 — 2026-03-13-072840.ips)

Watchdog Alerts (Automated Monitoring)

Network State After Crash (Mac Studio 2)

Throughput Comparison

Recovery Procedure (Current Workaround)

Additional Note: macOS Kernel Panic on Hot-Unplug

Suggested Improvements

System Info

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Node	Crash Dates	Total (4 days)
Mac Studio 1	Mar 10 (×2), Mar 11, Mar 12, Mar 13	5 crashes
Mac Studio 2	Mar 10, Mar 12, Mar 13	3 crashes

State	Throughput	Latency	Status
Healthy (before crash)	24.0 tok/s	<1ms	✅
After crash/auto-restart	2.5-7.9 tok/s	40-70ms	❌
After manual cable reconnect + restart	22.0 tok/s	<1ms	✅

P2P Mesh Fails to Re-establish Direct Thunderbolt Link After Exo Crash/Restart #1723

Description

Summary

Environment

Reproduction Steps

Expected Behavior

Actual Behavior

Diagnostic Data

Crash Frequency

Crash Log Excerpt (Mac Studio 1 — 2026-03-13-072840.ips)

Watchdog Alerts (Automated Monitoring)

Network State After Crash (Mac Studio 2)

Throughput Comparison

Recovery Procedure (Current Workaround)

Additional Note: macOS Kernel Panic on Hot-Unplug

Suggested Improvements

System Info

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions