Closed
Labels: bug (Something isn't working)
Description
Environment
- EXO commit on both Macs: 7ed46395 (both nodes on the same build)
- Model: mlx-community/Qwen3.5-397B-A17B-4bit (2-node Mac cluster)
- Not intentionally using RDMA
- EXO_FAST_SYNCH=off tested
- Thunderbolt Bridge removed/disabled on MSM2 for the latest tests
Hardware
- Node 1: MacBook Pro M5 MAX (128GB)
- Node 2: Mac Studio M2 Ultra (192GB)
What works
- Both nodes start cleanly by themselves
- Model downloads complete successfully on both nodes
- /instance/previews returns valid placements
- Best/only semi-stable manual launch settings so far: Pipeline / TCP/IP / MLX Ring
With those settings, the model launches and can answer a few prompts.
What fails
After a few prompts, the cluster becomes unstable and the instance fails.
Symptoms:
- Answer gets cut off mid-generation
- Dashboards eventually show FAILED
- MSM2 starts flapping / re-electing
- API on MSM2 goes down
Key observation
It looks like a runtime / control-plane collapse on MSM2 during sustained distributed inference.
Logs from the failing node (MSM2)
```
[ 2026-03-14 13:40:30.888 | WARNING | exo.routing.router:_networking_publish:230 ] All peer queues full, dropping message on local_events
...
[ 2026-03-14 13:40:31.234 | WARNING | logging:handle:1680 ] Failure while closing connection: i/o error: Broken pipe (os error 32)
[ 2026-03-14 13:40:31.439 | INFO | exo.shared.election:_campaign:197 ] Waiting for other campaign to finish
[ 2026-03-14 13:40:32.881 | INFO | exo.shared.election:_campaign:194 ] Cancelling other campaign
[ 2026-03-14 13:40:33.164 | INFO | exo.routing.event_router:_nack_request:150 ] Nack attempt 1: Requesting Event Log from 6539
[ 2026-03-14 13:40:35.120 | INFO | exo.routing.event_router:_nack_request:150 ] Nack attempt 2: Requesting Event Log from 6539
[ 2026-03-14 13:40:35.883 | INFO | exo.main:_elect_loop:199 ] Node elected Master - promoting self
[ 2026-03-14 13:40:35.886 | INFO | exo.master.api:reset:257 ] Resetting API State
[ 2026-03-14 13:40:35.900 | INFO | exo.master.main:run:97 ] Starting Master
[ 2026-03-14 13:40:35.902 | INFO | exo.worker.main:run:83 ] Starting Worker
[ 2026-03-14 13:40:35.946 | INFO | exo.worker.main:run:97 ] Stopping Worker
[ 2026-03-14 13:40:35.946 | INFO | exo.worker.runner.runner_supervisor:shutdown:118 ] Runner supervisor shutting down
[ 2026-03-14 13:40:35.947 | INFO | exo.worker.runner.bootstrap:entrypoint:72 ] bye from the runner
[ 2026-03-14 13:40:37.168 | INFO | exo.worker.runner.runner_supervisor:shutdown:134 ] Runner process succesfully terminated
```
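The "All peer queues full, dropping message" warning suggests a bounded-queue backpressure policy that drops events instead of blocking the publisher, so once the peer link stalls, control-plane events are silently lost. As a hypothetical illustration of that failure mode (made-up names; this is not EXO's actual router code):

```python
import queue

# Illustrative drop-on-full publish pattern: a non-blocking put to each
# peer queue, with the message dropped when every peer is backed up.
class PeerQueues:
    def __init__(self, n_peers: int, maxsize: int = 4):
        self.queues = [queue.Queue(maxsize=maxsize) for _ in range(n_peers)]
        self.dropped = 0

    def publish(self, msg) -> bool:
        delivered = False
        for q in self.queues:
            try:
                q.put_nowait(msg)      # non-blocking: never stalls the sender
                delivered = True
            except queue.Full:
                pass                   # this peer is not draining its queue
        if not delivered:
            self.dropped += 1          # "All peer queues full, dropping message"
        return delivered

pq = PeerQueues(n_peers=2, maxsize=1)
pq.publish("event-1")                   # fills both single-slot queues
assert pq.publish("event-2") is False   # nothing drained -> message dropped
print(pq.dropped)                       # -> 1
```

If election and event-log traffic share these queues with inference traffic, sustained load could starve the control plane exactly as the log sequence above shows (drops, then broken pipes, then re-election).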
Comparison with the other node (MBPM5)
Over the same time window, MBPM5 does not show the queue flood; it only shows the aftermath / election transition:
```
[ 2026-03-14 13:40:32.884 | INFO | exo.shared.election:_campaign:197 ] Waiting for other campaign to finish
[ 2026-03-14 13:40:35.887 | INFO | exo.main:_elect_loop:194 ] Node elected Master
[ 2026-03-14 13:40:35.888 | INFO | exo.master.api:unpause:270 ] Unpausing API
```
This suggests MSM2 is the node that fails first.
Additional details
/instance/previews showed valid placements for:
- Pipeline / MlxRing
- Pipeline / MlxJaccl
- Tensor / MlxRing
- Tensor / MlxJaccl
The JACCL previews referenced rdma_en* interfaces, and earlier previews also showed a 169.254.x.x address on one node. Auto mode seemed to choose bad paths. Manually forcing Pipeline / TCP/IP / MLX Ring got the furthest.
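The 169.254.x.x address is notable: that range is IPv4 link-local (self-assigned when no DHCP lease is available), so a placement built on it usually has no real route to the peer. A minimal sketch of the kind of filter auto mode could apply when ranking candidate addresses (the address list here is illustrative, not taken from the actual cluster):

```python
import ipaddress

# Drop candidate peer addresses that are unlikely to carry real traffic:
# 169.254.0.0/16 is IPv4 link-local, 127.0.0.0/8 is loopback.
def usable_candidates(addrs):
    out = []
    for a in addrs:
        ip = ipaddress.ip_address(a)
        if ip.is_link_local or ip.is_loopback:
            continue
        out.append(a)
    return out

candidates = ["192.168.1.20", "169.254.31.7", "127.0.0.1"]
print(usable_candidates(candidates))   # -> ['192.168.1.20']
```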
Repro summary
1. Start EXO on both Macs (same commit).
2. Use model mlx-community/Qwen3.5-397B-A17B-4bit.
3. Launch manually with Pipeline / TCP/IP / MLX Ring.
4. Send a few prompts.
5. After a few requests, MSM2 starts logging "All peer queues full, dropping message on local_events" and broken-pipe errors.
6. MSM2 self-elects, resets API state, and tears down its worker/runner.
7. The cluster ends up in a FAILED state.
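To timestamp the collapse more precisely when reproducing, it may help to poll the node's API between prompts and record when it stops answering. A hedged sketch of such a watch loop; the probe here is a stand-in callable (in practice it would be an HTTP GET against the MSM2 API, e.g. the /instance/previews endpoint mentioned above):

```python
import time

# Poll a health probe and declare the API down after N consecutive failures.
def watch(probe, interval_s=1.0, max_failures=3):
    failures = 0
    while failures < max_failures:
        if probe():
            failures = 0               # healthy response resets the counter
        else:
            failures += 1
            print(f"probe failed ({failures}/{max_failures})")
        time.sleep(interval_s)
    return "api considered down"

# Simulated probe: healthy at first, then dead -- mirroring "works for a
# few prompts, then collapses".
responses = iter([True, True, False, False, False])
print(watch(lambda: next(responses), interval_s=0.0))   # -> api considered down
```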
Expected
The 2-node instance should remain stable across multiple prompts.
Actual
The cluster works briefly, then control-plane / peer communication appears to collapse on MSM2 under sustained load.