
[BUG] 2-node Mac cluster with Qwen3.5-397B-A17B-4bit becomes unstable after a few prompts: peer queues full → broken pipe → worker teardown #1726

@tniccum21

Description


Environment

  • EXO commit on both Macs: 7ed46395
  • Both nodes on same build
  • Model: mlx-community/Qwen3.5-397B-A17B-4bit
  • 2-node Mac cluster
  • Not intentionally using RDMA
  • EXO_FAST_SYNCH=off tested
  • Thunderbolt Bridge removed/disabled on MSM2 for latest tests

Hardware

  • Node 1: MacBook Pro M5 MAX (128GB)
  • Node 2: Mac Studio M2 Ultra (192GB)

What works

  • Both nodes start cleanly by themselves
  • Model downloads complete successfully on both nodes
  • /instance/previews returns valid placements
  • Best/only semi-stable manual launch settings so far:
    • Pipeline
    • TCP/IP
    • MLX Ring

With those settings, the model launches and can answer a few prompts.
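The healthy startup path can be checked programmatically by polling /instance/previews (the endpoint mentioned above) until it returns placements. This is a hedged sketch: the base URL/port and the assumption that the response is a JSON list are mine, not documented EXO behavior.

```python
import json
import time
from urllib.error import URLError
from urllib.request import urlopen

def wait_for_previews(base_url: str, timeout: float = 60.0) -> list:
    """Poll /instance/previews until it returns a non-empty JSON list.

    base_url is hypothetical, e.g. "http://<node>:<api-port>".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urlopen(f"{base_url}/instance/previews", timeout=5) as resp:
                previews = json.loads(resp.read())
            if previews:
                return previews
        except (URLError, ValueError):
            pass  # node not up yet, or response not yet valid JSON
        time.sleep(1)
    raise TimeoutError(f"no placements from {base_url} within {timeout}s")

# Usage (address and port are placeholders):
# previews = wait_for_previews("http://msm2.local:52415")
```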

What fails

After a few prompts, the cluster becomes unstable and the instance fails.

Symptoms:

  • Answer gets cut off mid-generation
  • Dashboards eventually show FAILED
  • MSM2 starts flapping / re-electing
  • API on MSM2 goes down

Key observation

It looks like a runtime / control-plane collapse on MSM2 during sustained distributed inference.

Logs from the failing node (MSM2)

[ 2026-03-14 13:40:30.888 | WARNING  | exo.routing.router:_networking_publish:230 ] All peer queues full, dropping message on local_events
...
[ 2026-03-14 13:40:31.234 | WARNING  | logging:handle:1680 ] Failure while closing connection: i/o error: Broken pipe (os error 32)
[ 2026-03-14 13:40:31.439 | INFO     | exo.shared.election:_campaign:197 ] Waiting for other campaign to finish
[ 2026-03-14 13:40:32.881 | INFO     | exo.shared.election:_campaign:194 ] Cancelling other campaign
[ 2026-03-14 13:40:33.164 | INFO     | exo.routing.event_router:_nack_request:150 ] Nack attempt 1: Requesting Event Log from 6539
[ 2026-03-14 13:40:35.120 | INFO     | exo.routing.event_router:_nack_request:150 ] Nack attempt 2: Requesting Event Log from 6539
[ 2026-03-14 13:40:35.883 | INFO     | exo.main:_elect_loop:199 ] Node elected Master - promoting self
[ 2026-03-14 13:40:35.886 | INFO     | exo.master.api:reset:257 ] Resetting API State
[ 2026-03-14 13:40:35.900 | INFO     | exo.master.main:run:97 ] Starting Master
[ 2026-03-14 13:40:35.902 | INFO     | exo.worker.main:run:83 ] Starting Worker
[ 2026-03-14 13:40:35.946 | INFO     | exo.worker.main:run:97 ] Stopping Worker
[ 2026-03-14 13:40:35.946 | INFO     | exo.worker.runner.runner_supervisor:shutdown:118 ] Runner supervisor shutting down
[ 2026-03-14 13:40:35.947 | INFO     | exo.worker.runner.bootstrap:entrypoint:72 ] bye from the runner
[ 2026-03-14 13:40:37.168 | INFO     | exo.worker.runner.runner_supervisor:shutdown:134 ] Runner process succesfully terminated
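The "All peer queues full" warning suggests a bounded per-peer outbox that drops an event when no peer can accept it. A minimal sketch of that back-pressure pattern, with hypothetical names (this is not EXO's actual `exo.routing.router` code):

```python
import queue

class PeerQueue:
    """Bounded outbox for one peer (illustrative only)."""
    def __init__(self, maxsize: int = 4) -> None:
        self._q: queue.Queue = queue.Queue(maxsize=maxsize)

    def try_put(self, msg) -> bool:
        """Enqueue without blocking; return False if the queue is full."""
        try:
            self._q.put_nowait(msg)
            return True
        except queue.Full:
            return False

def networking_publish(peers: dict, msg, topic: str = "local_events") -> bool:
    """Offer msg to every peer's queue; warn and drop when all queues are full."""
    delivered = [pq.try_put(msg) for pq in peers.values()]
    if peers and not any(delivered):
        print(f"All peer queues full, dropping message on {topic}")
        return False
    return True
```

Under this pattern, a consumer that stops draining its queue (e.g. behind a stalled TCP connection) makes every subsequent publish hit the drop path, which would match MSM2 flooding this warning right before the broken pipe.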
Comparison with the other node (MBPM5)

In the same timestamp window, MBPM5 shows no queue flood; it only logs the aftermath of the election transition:

[ 2026-03-14 13:40:32.884 | INFO     | exo.shared.election:_campaign:197 ] Waiting for other campaign to finish
[ 2026-03-14 13:40:35.887 | INFO     | exo.main:_elect_loop:194 ] Node elected Master
[ 2026-03-14 13:40:35.888 | INFO     | exo.master.api:unpause:270 ] Unpausing API

This suggests MSM2 is the node that fails first.

Additional details

/instance/previews showed valid placements for:

  • Pipeline / MlxRing
  • Pipeline / MlxJaccl
  • Tensor / MlxRing
  • Tensor / MlxJaccl

The JACCL previews referenced rdma_en* interfaces, and earlier previews also showed a 169.254.x.x (link-local) address on one node. Auto mode seemed to choose these bad paths; manually forcing Pipeline / TCP/IP / MLX Ring got the furthest.
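As a quick sanity check on what auto mode is picking: 169.254.x.x addresses are IPv4 link-local (self-assigned), which usually means the interface never obtained a routable address and is a poor candidate for a ring path. A small diagnostic helper, not part of EXO; the candidate addresses below are made up for illustration:

```python
import ipaddress

def is_link_local(addr: str) -> bool:
    """True for self-assigned addresses (IPv4 169.254.0.0/16, IPv6 fe80::/10)."""
    try:
        return ipaddress.ip_address(addr).is_link_local
    except ValueError:
        return False  # not a parseable IP address

# Example: filter candidate interface addresses before choosing a path
candidates = ["192.168.1.20", "169.254.83.107", "10.0.0.5"]
usable = [a for a in candidates if not is_link_local(a)]
```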

Repro summary

  1. Start EXO on both Macs (same commit)
  2. Use model mlx-community/Qwen3.5-397B-A17B-4bit
  3. Launch manually with Pipeline / TCP/IP / MLX Ring
  4. Send a few prompts
  5. After a few requests, MSM2 starts logging "All peer queues full, dropping message on local_events" and "Broken pipe"
  6. MSM2 self-elects, resets API state, and tears down the worker/runner
  7. The cluster ends up FAILED
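To catch the collapse on MSM2 as soon as it starts, the two signatures from the log excerpts can be watched for directly. A minimal scanner; the patterns are copied from the logs in this report, and feeding it a live log file is left as an exercise:

```python
import re

# Signatures taken verbatim from the MSM2 log excerpt
FAILURE_PATTERNS = [
    re.compile(r"All peer queues full, dropping message"),
    re.compile(r"Broken pipe"),
]

def scan_log(lines):
    """Return the log lines matching any known failure signature."""
    return [ln for ln in lines if any(p.search(ln) for p in FAILURE_PATTERNS)]
```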

Expected

The 2-node instance should remain stable across multiple prompts.

Actual

The cluster works briefly, then control-plane / peer communication appears to collapse on MSM2 under sustained load.
