
[BUG] 2-node Mac cluster with Qwen3.5-397B-A17B-4bit becomes unstable after a few prompts: peer queues full → broken pipe → worker teardown #1726

@tniccum21

Description


Environment

  • EXO commit on both Macs: 7ed46395
  • Both nodes on same build
  • Model: mlx-community/Qwen3.5-397B-A17B-4bit
  • 2-node Mac cluster
  • Not intentionally using RDMA
  • EXO_FAST_SYNCH=off tested
  • Thunderbolt Bridge removed/disabled on MSM2 for latest tests

Hardware

  • Node 1: MacBook Pro M5 MAX (128GB)
  • Node 2: Mac Studio M2 Ultra (192GB)

What works

  • Both nodes start cleanly by themselves
  • Model downloads complete successfully on both nodes
  • /instance/previews returns valid placements
  • Best/only semi-stable manual launch settings so far:
    • Pipeline
    • TCP/IP
    • MLX Ring

With those settings, the model launches and can answer a few prompts.
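The healthy startup path can be checked programmatically by polling /instance/previews (the endpoint mentioned above) until it returns placements. This is a hedged sketch: the base URL/port and the assumption that the response is a JSON list are mine, not documented EXO behavior.

```python
import json
import time
from urllib.error import URLError
from urllib.request import urlopen

def wait_for_previews(base_url: str, timeout: float = 60.0) -> list:
    """Poll /instance/previews until it returns a non-empty JSON list.

    base_url is hypothetical, e.g. "http://<node>:<api-port>".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urlopen(f"{base_url}/instance/previews", timeout=5) as resp:
                previews = json.loads(resp.read())
            if previews:
                return previews
        except (URLError, ValueError):
            pass  # node not up yet, or response not yet valid JSON
        time.sleep(1)
    raise TimeoutError(f"no placements from {base_url} within {timeout}s")

# Usage (address and port are placeholders):
# previews = wait_for_previews("http://msm2.local:52415")
```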

What fails

After a few prompts, the cluster becomes unstable and the instance fails.

Symptoms:

  • Answer gets cut off mid-generation
  • Dashboards eventually show FAILED
  • MSM2 starts flapping / re-electing
  • API on MSM2 goes down

Key observation

It looks like a runtime / control-plane collapse on MSM2 during sustained distributed inference.

Logs from the failing node (MSM2)

[ 2026-03-14 13:40:30.888 | WARNING  | exo.routing.router:_networking_publish:230 ] All peer queues full, dropping message on local_events
...
[ 2026-03-14 13:40:31.234 | WARNING  | logging:handle:1680 ] Failure while closing connection: i/o error: Broken pipe (os error 32)
[ 2026-03-14 13:40:31.439 | INFO     | exo.shared.election:_campaign:197 ] Waiting for other campaign to finish
[ 2026-03-14 13:40:32.881 | INFO     | exo.shared.election:_campaign:194 ] Cancelling other campaign
[ 2026-03-14 13:40:33.164 | INFO     | exo.routing.event_router:_nack_request:150 ] Nack attempt 1: Requesting Event Log from 6539
[ 2026-03-14 13:40:35.120 | INFO     | exo.routing.event_router:_nack_request:150 ] Nack attempt 2: Requesting Event Log from 6539
[ 2026-03-14 13:40:35.883 | INFO     | exo.main:_elect_loop:199 ] Node elected Master - promoting self
[ 2026-03-14 13:40:35.886 | INFO     | exo.master.api:reset:257 ] Resetting API State
[ 2026-03-14 13:40:35.900 | INFO     | exo.master.main:run:97 ] Starting Master
[ 2026-03-14 13:40:35.902 | INFO     | exo.worker.main:run:83 ] Starting Worker
[ 2026-03-14 13:40:35.946 | INFO     | exo.worker.main:run:97 ] Stopping Worker
[ 2026-03-14 13:40:35.946 | INFO     | exo.worker.runner.runner_supervisor:shutdown:118 ] Runner supervisor shutting down
[ 2026-03-14 13:40:35.947 | INFO     | exo.worker.runner.bootstrap:entrypoint:72 ] bye from the runner
[ 2026-03-14 13:40:37.168 | INFO     | exo.worker.runner.runner_supervisor:shutdown:134 ] Runner process succesfully terminated
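The "All peer queues full" warning suggests a bounded per-peer outbox that drops an event when no peer can accept it. A minimal sketch of that back-pressure pattern, with hypothetical names (this is not EXO's actual `exo.routing.router` code):

```python
import queue

class PeerQueue:
    """Bounded outbox for one peer (illustrative only)."""
    def __init__(self, maxsize: int = 4) -> None:
        self._q: queue.Queue = queue.Queue(maxsize=maxsize)

    def try_put(self, msg) -> bool:
        """Enqueue without blocking; return False if the queue is full."""
        try:
            self._q.put_nowait(msg)
            return True
        except queue.Full:
            return False

def networking_publish(peers: dict, msg, topic: str = "local_events") -> bool:
    """Offer msg to every peer's queue; warn and drop when all queues are full."""
    delivered = [pq.try_put(msg) for pq in peers.values()]
    if peers and not any(delivered):
        print(f"All peer queues full, dropping message on {topic}")
        return False
    return True
```

Under this pattern, a consumer that stops draining its queue (e.g. behind a stalled TCP connection) makes every subsequent publish hit the drop path, which would match MSM2 flooding this warning right before the broken pipe.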
Comparison with the other node (MBPM5)

In the same timestamp window, MBPM5 shows no queue flood; it only logs the aftermath of the election transition:

[ 2026-03-14 13:40:32.884 | INFO     | exo.shared.election:_campaign:197 ] Waiting for other campaign to finish
[ 2026-03-14 13:40:35.887 | INFO     | exo.main:_elect_loop:194 ] Node elected Master
[ 2026-03-14 13:40:35.888 | INFO     | exo.master.api:unpause:270 ] Unpausing API

This suggests MSM2 is the node that fails first.

Additional details

/instance/previews showed valid placements for:

  • Pipeline / MlxRing
  • Pipeline / MlxJaccl
  • Tensor / MlxRing
  • Tensor / MlxJaccl

The JACCL previews referenced rdma_en* interfaces, and earlier previews also showed a 169.254.x.x (link-local) address on one node. Auto mode seemed to choose these bad paths; manually forcing Pipeline / TCP/IP / MLX Ring got the furthest.
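As a quick sanity check on what auto mode is picking: 169.254.x.x addresses are IPv4 link-local (self-assigned), which usually means the interface never obtained a routable address and is a poor candidate for a ring path. A small diagnostic helper, not part of EXO; the candidate addresses below are made up for illustration:

```python
import ipaddress

def is_link_local(addr: str) -> bool:
    """True for self-assigned addresses (IPv4 169.254.0.0/16, IPv6 fe80::/10)."""
    try:
        return ipaddress.ip_address(addr).is_link_local
    except ValueError:
        return False  # not a parseable IP address

# Example: filter candidate interface addresses before choosing a path
candidates = ["192.168.1.20", "169.254.83.107", "10.0.0.5"]
usable = [a for a in candidates if not is_link_local(a)]
```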

Repro summary

  1. Start EXO on both Macs (same commit)
  2. Use model mlx-community/Qwen3.5-397B-A17B-4bit
  3. Launch manually with Pipeline / TCP/IP / MLX Ring
  4. Send a few prompts
  5. After a few requests, MSM2 starts logging "All peer queues full, dropping message on local_events" and "Broken pipe"
  6. MSM2 self-elects, resets API state, and tears down the worker/runner
  7. The cluster ends up FAILED
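To catch the collapse on MSM2 as soon as it starts, the two signatures from the log excerpts can be watched for directly. A minimal scanner; the patterns are copied from the logs in this report, and feeding it a live log file is left as an exercise:

```python
import re

# Signatures taken verbatim from the MSM2 log excerpt
FAILURE_PATTERNS = [
    re.compile(r"All peer queues full, dropping message"),
    re.compile(r"Broken pipe"),
]

def scan_log(lines):
    """Return the log lines matching any known failure signature."""
    return [ln for ln in lines if any(p.search(ln) for p in FAILURE_PATTERNS)]
```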

Expected

The 2-node instance should remain stable across multiple prompts.

Actual

The cluster works briefly, then control-plane / peer communication appears to collapse on MSM2 under sustained load.
