Dora supports deploying dataflows across multiple machines for multi-robot fleets, edge AI pipelines, and distributed robotics systems. This guide covers cluster management, node scheduling, binary distribution, auto-recovery, and operational best practices.
- Overview
- Quick Start
- Features at a Glance
- Cluster Configuration Reference
- Cluster Commands Reference
- Node Scheduling
- Binary Distribution
- systemd Service Management
- Auto-Recovery
- Rolling Upgrade
- Use Cases
- Operations Runbook
- Deployment YAML Reference
- Best Practices
Dora's distributed architecture has three tiers:

```
CLI --> Coordinator --> Daemon(s) --> Nodes / Operators
        (one)           (per machine)  (user code)
```
- CLI sends control commands (build, start, stop) to the coordinator.
- Coordinator orchestrates daemons, resolves node placement, and manages dataflow lifecycle.
- Daemons run on each machine, spawning and supervising node processes.
- Nodes communicate via shared memory (same machine) or Zenoh pub-sub (cross-machine).
There are two paths to distributed deployment:
Ad-hoc -- manually start dora daemon on each machine, then use the coordinator for control. Good for development and testing. See Distributed Deployments in the CLI reference.
Managed (cluster.yml) -- define your cluster topology in a YAML file, then use dora cluster commands for SSH-based lifecycle management. This guide focuses on the managed path.
- Create a `cluster.yml`:

```yaml
coordinator:
  addr: 10.0.0.1

machines:
  - id: robot
    host: 10.0.0.2
    user: ubuntu
  - id: gpu-server
    host: 10.0.0.3
    user: ubuntu
```

- Bring up the cluster:

```
dora cluster up cluster.yml
```

- Start a dataflow:

```
dora start dataflow.yml --name my-app --attach
```

- Check cluster health:

```
dora cluster status
```

- Tear down:

```
dora cluster down
```

| Feature | Command / Config | Description |
|---|---|---|
| Cluster lifecycle | `dora cluster up/status/down` | SSH-based daemon management from a single machine |
| Label scheduling | `_unstable_deploy.labels` | Route nodes to daemons by key-value labels |
| Binary distribution | `_unstable_deploy.distribute` | `local`, `scp`, or `http` strategies |
| systemd services | `dora cluster install/uninstall` | Persistent daemon services that survive reboots |
| Auto-recovery | Automatic | Re-spawn nodes when a daemon reconnects |
| Rolling upgrade | `dora cluster upgrade` | SCP binary + restart per machine, sequentially |
| Dataflow restart | `dora cluster restart` | Restart a running dataflow by name or UUID |
A cluster.yml file defines the coordinator address and the set of machines in the cluster.
```yaml
coordinator:
  addr: 10.0.0.1   # IP address the coordinator binds to (required)
  port: 6013       # WebSocket port (default: 6013)

machines:
  - id: edge-01      # Unique machine identifier (required)
    host: 10.0.0.2   # SSH-reachable hostname or IP (required)
    user: ubuntu     # SSH user (optional, defaults to current user)
    labels:          # Key-value labels for scheduling (optional)
      gpu: "true"
      arch: arm64
  - id: edge-02
    host: 10.0.0.3
    labels:
      arch: arm64
```

### coordinator
| Field | Type | Default | Description |
|---|---|---|---|
| `addr` | IP address | (required) | Address the coordinator binds to |
| `port` | u16 | `6013` | WebSocket port |
### machines[]

| Field | Type | Default | Description |
|---|---|---|---|
| `id` | string | (required) | Unique machine identifier, used in `_unstable_deploy.machine` |
| `host` | string | (required) | SSH-reachable hostname or IP address |
| `user` | string | current user | SSH username |
| `labels` | map | empty | Key-value pairs for label-based scheduling |
- At least one machine must be defined.
- Machine IDs must be non-empty and unique.
- Machine hosts must be non-empty.
- Unknown fields are rejected (`deny_unknown_fields`).
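These rules can be sketched as a small checker (illustrative Python, not dora's actual implementation; `Machine` and `validate_machines` are hypothetical names):

```python
from dataclasses import dataclass, field

@dataclass
class Machine:
    id: str
    host: str
    user: str = ""                              # empty means "current user"
    labels: dict = field(default_factory=dict)  # key-value scheduling labels

def validate_machines(machines: list) -> list:
    """Return a list of validation errors; an empty list means the config is valid."""
    errors = []
    if not machines:
        errors.append("at least one machine must be defined")
    seen = set()
    for m in machines:
        if not m.id:
            errors.append("machine id must be non-empty")
        elif m.id in seen:
            errors.append(f"duplicate machine id: {m.id!r}")
        seen.add(m.id)
        if not m.host:
            errors.append(f"machine {m.id!r} has an empty host")
    return errors
```

A config that fails any of these checks is rejected before any SSH connection is attempted.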
```yaml
coordinator:
  addr: 192.168.1.1

machines:
  - id: coordinator-host
    host: 192.168.1.1
    labels:
      role: control
  - id: gpu-a100
    host: 192.168.1.10
    user: ml
    labels:
      gpu: a100
      arch: x86_64
  - id: jetson-01
    host: 192.168.1.20
    user: nvidia
    labels:
      gpu: jetson
      arch: arm64
```

All `dora cluster` commands operate on a `cluster.yml` file and use SSH to manage remote machines.
SSH options used: `BatchMode=yes`, `ConnectTimeout=10`, `StrictHostKeyChecking=accept-new`.
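As a sketch of how such an invocation might be assembled (illustrative Python; `ssh_command` is a hypothetical helper, not dora's code):

```python
def ssh_command(user: str, host: str, remote_cmd: str) -> list:
    """Build an ssh argv with the options the cluster commands use."""
    target = f"{user}@{host}" if user else host
    return [
        "ssh",
        "-o", "BatchMode=yes",                      # never prompt for a password
        "-o", "ConnectTimeout=10",                  # fail fast on unreachable hosts
        "-o", "StrictHostKeyChecking=accept-new",   # auto-accept unknown host keys
        target,
        remote_cmd,
    ]
```

`BatchMode=yes` is why passwordless SSH keys are a prerequisite: any host that would prompt for a password fails immediately instead of hanging.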
Bring up a multi-machine cluster from a cluster.yml file. Starts the coordinator locally, then SSH-es into each machine to start a daemon.
```
dora cluster up <PATH>
```
Arguments:
| Argument | Description |
|---|---|
| `PATH` | Path to the cluster configuration file |
Behavior:
- Loads and validates the cluster config.
- Starts the coordinator locally on `addr:port`.
- For each machine, SSH-es in and runs `nohup dora daemon --machine-id <id> --coordinator-addr <addr> --coordinator-port <port> [--labels k1=v1,k2=v2] --quiet`.
- Polls until all expected daemons register with the coordinator (30s timeout).
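The remote launch command could be rendered roughly like this (illustrative Python; `daemon_command` is a hypothetical helper, not dora's code):

```python
def daemon_command(machine_id: str, addr: str, port: int, labels: dict) -> str:
    """Render the remote daemon launch command described above."""
    cmd = (
        f"nohup dora daemon --machine-id {machine_id} "
        f"--coordinator-addr {addr} --coordinator-port {port}"
    )
    if labels:
        # Labels are passed as a single comma-separated k=v list.
        pairs = ",".join(f"{k}={v}" for k, v in labels.items())
        cmd += f" --labels {pairs}"
    return cmd + " --quiet"
```

The `--labels` flag is only emitted when the machine has labels in `cluster.yml`.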
Example:
```
$ dora cluster up cluster.yml
Starting coordinator on 10.0.0.1:6013...
Starting daemon on robot (ubuntu@10.0.0.2)... OK
Starting daemon on gpu-server (ubuntu@10.0.0.3)... OK
All 2 daemons connected.
```

Show the current status of the cluster. Displays connected daemons and the active dataflow count.
```
dora cluster status [--coordinator-addr ADDR] [--coordinator-port PORT]
```
Flags:
| Flag | Default | Description |
|---|---|---|
| `--coordinator-addr` | `localhost` | Coordinator hostname or IP |
| `--coordinator-port` | `6013` | Coordinator WebSocket port |
Example:
```
$ dora cluster status
DAEMON ID     LAST HEARTBEAT
robot         2s ago
gpu-server    1s ago
Active dataflows: 1
```

Tear down the cluster (coordinator and all daemons).
```
dora cluster down [--coordinator-addr ADDR] [--coordinator-port PORT]
```
Terminates all daemons and the coordinator process.
Install dora-daemon as a systemd service on each machine. SSH-es into each machine, writes a systemd unit file, and enables the service.
```
dora cluster install <PATH>
```
Arguments:
| Argument | Description |
|---|---|
| `PATH` | Path to the cluster configuration file |
Behavior:
For each machine, creates and enables a systemd service named dora-daemon-<id>. The unit file:
```ini
[Unit]
Description=Dora Daemon (<id>)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=dora daemon --machine-id <id> --coordinator-addr <addr> --coordinator-port <port> --labels k1=v1,k2=v2 --quiet
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Example:
```
$ dora cluster install cluster.yml
Installing dora-daemon-robot on ubuntu@10.0.0.2... OK
Installing dora-daemon-gpu-server on ubuntu@10.0.0.3... OK
2/2 succeeded.
```

Uninstall `dora-daemon` systemd services from each machine. Stops, disables, and removes the systemd unit.
```
dora cluster uninstall <PATH>
```
Behavior:
For each machine, runs:
```
sudo systemctl stop dora-daemon-<id>
sudo systemctl disable dora-daemon-<id>
sudo rm -f /etc/systemd/system/dora-daemon-<id>.service
sudo systemctl daemon-reload
```

Rolling upgrade: SCP the local `dora` binary to each machine and restart the daemons. Machines are processed sequentially to maintain availability.
```
dora cluster upgrade <PATH>
```
Behavior:
For each machine sequentially:
- SCP the local `dora` binary to `/usr/local/bin/dora` on the target machine.
- Restart the systemd service via `sudo systemctl restart dora-daemon-<id>`.
- Poll the coordinator until the daemon reconnects (30s timeout, 500ms intervals).
Nodes on other machines continue running while each machine is being upgraded.
Example:
```
$ dora cluster upgrade cluster.yml
Upgrading robot (ubuntu@10.0.0.2)...
  SCP binary... OK
  Restart service... OK
  Waiting for reconnect... OK (3.2s)
Upgrading gpu-server (ubuntu@10.0.0.3)...
  SCP binary... OK
  Restart service... OK
  Waiting for reconnect... OK (2.8s)
2/2 succeeded.
```

Restart a running dataflow by name or UUID. Stops the dataflow and immediately re-starts it using the stored descriptor (no YAML path needed).
```
dora cluster restart <DATAFLOW>
```
Arguments:
| Argument | Description |
|---|---|
| `DATAFLOW` | Name or UUID of the dataflow to restart |
Example:
```
$ dora cluster restart my-app
Restarting dataflow `my-app`
dataflow restarted: a1b2c3d4-... -> e5f6a7b8-...
```

When the coordinator receives a dataflow, it decides which daemon runs each node based on the `_unstable_deploy` section in the dataflow YAML. Resolution priority: `machine` > `labels` > unnamed.
Assign a node to a specific machine by its id from cluster.yml:
```yaml
nodes:
  - id: camera
    _unstable_deploy:
      machine: robot
    path: ./camera-driver
    outputs:
      - frames
```

The coordinator looks up the daemon whose machine-id matches. If no matching daemon is connected, the deployment fails with: `no matching daemon for machine id "robot"`.
Assign a node by requiring specific labels on the target daemon:
```yaml
nodes:
  - id: inference
    _unstable_deploy:
      labels:
        gpu: "true"
    path: ./ml-model
    inputs:
      frames: camera/frames
    outputs:
      - predictions
```

The coordinator finds the first connected daemon whose labels are a superset of the required labels. All required key-value pairs must match exactly. If no daemon satisfies the requirements, deployment fails with: `no daemon matches labels {"gpu": "true"}`.
Nodes without an _unstable_deploy section (or with an empty one) are assigned to the first unnamed daemon -- one that connected without a --machine-id flag.
The coordinator resolves node placement in coordinator/run/mod.rs:
```
resolve_daemon(connections, deploy) -> DaemonId
  1. If deploy.machine is Some(id):
       -> look up daemon by machine-id
  2. Else if deploy.labels is non-empty:
       -> find first daemon where all required labels match
  3. Else:
       -> pick first unnamed daemon
```
The label matching function iterates over all connected daemons and checks that every required key-value pair exists in the daemon's label set (conn.labels.get(k) == Some(v)). This is a superset check: a daemon with {gpu: "true", arch: "arm64", role: "edge"} satisfies the requirement {gpu: "true"}.
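A minimal Python sketch of this superset check (illustrative only; the actual matching lives in the coordinator's Rust code, and `matches`/`find_daemon` are hypothetical names):

```python
from typing import Optional

def matches(daemon_labels: dict, required: dict) -> bool:
    """Superset check: every required key-value pair must be present exactly."""
    return all(daemon_labels.get(k) == v for k, v in required.items())

def find_daemon(daemons: dict, required: dict) -> Optional[str]:
    """Return the id of the first connected daemon whose labels satisfy `required`.

    `daemons` maps daemon id -> label dict, mirroring the coordinator's
    iteration over connected daemons.
    """
    for daemon_id, labels in daemons.items():
        if matches(labels, required):
            return daemon_id
    return None
```

Note that the match is exact on values: a daemon labeled `gpu: "false"` does not satisfy `gpu: "true"`, and an empty requirement matches any daemon.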
Control how node binaries are delivered to remote daemons via the distribute field.
Each daemon builds from source on its own machine. This is the current default behavior.
```yaml
nodes:
  - id: my-node
    _unstable_deploy:
      machine: edge-01
      distribute: local
    path: ./my-node
```

The CLI pushes the locally-built binary to the target machine via SSH/SCP before spawning.
```yaml
nodes:
  - id: my-node
    _unstable_deploy:
      machine: edge-01
      distribute: scp
    path: ./my-node
```

The coordinator runs an artifact store. Daemons pull binaries from the coordinator via HTTP before spawning.
```yaml
nodes:
  - id: my-node
    _unstable_deploy:
      machine: edge-01
      distribute: http
    path: ./my-node
```

Artifacts are served from `GET /api/artifacts/{build_id}/{node_id}` on the coordinator's WebSocket port. The endpoint requires authentication (Bearer token) and sanitizes node IDs to prevent path traversal.
| Strategy | Best for | Tradeoffs |
|---|---|---|
| `local` | Homogeneous clusters, CI builds | Requires a build toolchain on every machine |
| `scp` | Heterogeneous clusters, cross-compiled binaries | Requires SSH access from the CLI to all machines |
| `http` | Air-gapped daemons, firewalled networks | Requires coordinator reachability from all daemons |
For production deployments, install daemons as systemd services so they survive reboots and auto-restart on failure.
```
dora cluster install cluster.yml
```

Creates a systemd unit file on each machine (see `dora cluster install` above for the full unit template). Key properties:
- `Restart=on-failure` with `RestartSec=5`: the daemon auto-restarts if it crashes.
- `After=network-online.target`: waits for the network before starting.
- `WantedBy=multi-user.target`: starts on boot.
```
dora cluster uninstall cluster.yml
```

Stops, disables, and removes the unit file from each machine, then reloads the systemd daemon.
After install, check services directly:
```
ssh ubuntu@10.0.0.2 sudo systemctl status dora-daemon-robot
```

When a daemon disconnects and reconnects (e.g., after a network blip, machine reboot, or service restart), the coordinator automatically re-spawns any missing dataflows on that daemon.
- The daemon reconnects and sends a `StatusReport` listing its currently running dataflows.
- The coordinator compares the report against its expected state (dataflows that should have nodes on this daemon).
- For each running dataflow with nodes assigned to this daemon that the daemon did not report, the coordinator sends a `SpawnDataflowNodes` command to re-spawn the missing nodes.
To prevent crash loops (e.g., a node that immediately crashes on spawn), recovery uses a per-daemon, per-dataflow backoff:
- After a recovery attempt, the coordinator records the timestamp.
- Subsequent recovery for the same daemon/dataflow pair is skipped until 30 seconds have elapsed.
- The backoff clears when the daemon reports the dataflow as running again.
This means a node that crashes immediately will only be re-spawned once every 30 seconds, not in a tight loop.
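The recovery decision and backoff can be sketched as follows (illustrative Python with an explicit clock parameter so the timing is easy to follow; all names are hypothetical, not dora's actual types):

```python
def missing_dataflows(expected: set, reported: set) -> set:
    """Dataflows that should run on this daemon but were absent from its StatusReport."""
    return expected - reported

class RecoveryBackoff:
    """Throttle recovery attempts to one per (daemon, dataflow) pair per interval."""

    def __init__(self, interval: float = 30.0):
        self.interval = interval
        self.last_attempt = {}  # (daemon_id, dataflow_id) -> timestamp

    def should_recover(self, daemon_id: str, dataflow_id: str, now: float) -> bool:
        key = (daemon_id, dataflow_id)
        last = self.last_attempt.get(key)
        if last is not None and now - last < self.interval:
            return False  # still in backoff; skip this attempt
        self.last_attempt[key] = now
        return True

    def clear(self, daemon_id: str, dataflow_id: str) -> None:
        """Called when the daemon reports the dataflow as running again."""
        self.last_attempt.pop((daemon_id, dataflow_id), None)
```

The per-pair key is what keeps one crash-looping dataflow from throttling recovery of healthy dataflows on the same daemon.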
- Auto-recovery only applies to dataflows started via `dora start` (coordinator-managed). Local `dora run` dataflows are not tracked by the coordinator.
- Recovery re-spawns all nodes assigned to the reconnecting daemon, not individual nodes. For per-node restart on crash, use restart policies.
Upgrade the dora binary on all cluster machines with zero downtime using sequential per-machine upgrades.
```
dora cluster upgrade cluster.yml
```

For each machine, sequentially:
- SCP the local `dora` binary to `/usr/local/bin/dora` on the target.
- Restart the systemd service (`systemctl restart dora-daemon-<id>`).
- Poll the coordinator until the daemon reconnects (30s timeout).
Because machines are upgraded one at a time, nodes on other machines continue running. After the daemon reconnects, auto-recovery re-spawns any dataflow nodes that were running on that machine.
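The reconnect polling amounts to a loop like this (illustrative Python; `wait_for_reconnect` and its `is_connected` probe are hypothetical names, not dora's code):

```python
import time

def wait_for_reconnect(is_connected, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll `is_connected()` until it returns True or `timeout` seconds elapse.

    Returns True if the daemon reconnected in time, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_connected():
            return True
        time.sleep(interval)  # 500ms between coordinator status checks
    return False
```

If this returns False for a machine, the upgrade stops there rather than proceeding to the next machine with a daemon in an unknown state.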
- Daemons must be installed as systemd services (`dora cluster install`).
- The local `dora` binary must be compatible with the cluster's coordinator version.
- SSH access with `sudo` permissions is required on all target machines.
A camera node runs on the robot, sends frames to a GPU server for inference, and results flow back to an actuator on the robot.
cluster.yml:
```yaml
coordinator:
  addr: 192.168.1.1

machines:
  - id: robot
    host: 192.168.1.10
    user: ubuntu
    labels:
      role: edge
  - id: gpu-server
    host: 192.168.1.20
    user: ml
    labels:
      gpu: "true"
```

`dataflow.yml`:
```yaml
nodes:
  - id: camera
    _unstable_deploy:
      machine: robot
    path: ./camera-driver
    outputs:
      - frames

  - id: inference
    _unstable_deploy:
      labels:
        gpu: "true"
    path: ./ml-model
    inputs:
      frames: camera/frames
    outputs:
      - predictions

  - id: actuator
    _unstable_deploy:
      machine: robot
    path: ./actuator-driver
    inputs:
      commands: inference/predictions
```

A central coordinator manages N robots with heterogeneous hardware. Label scheduling routes nodes to the right machines without hardcoding machine IDs.
cluster.yml:
```yaml
coordinator:
  addr: 10.0.0.1

machines:
  - id: bot-01
    host: 10.0.0.11
    user: robot
    labels:
      fleet: warehouse
      lidar: "true"
  - id: bot-02
    host: 10.0.0.12
    user: robot
    labels:
      fleet: warehouse
      camera: rgbd
  - id: bot-03
    host: 10.0.0.13
    user: robot
    labels:
      fleet: warehouse
      lidar: "true"
      camera: rgbd
```

`dataflow.yml`:
```yaml
nodes:
  - id: lidar-driver
    _unstable_deploy:
      labels:
        lidar: "true"
    path: ./lidar-driver
    outputs:
      - scans

  - id: camera-driver
    _unstable_deploy:
      labels:
        camera: rgbd
    path: ./camera-driver
    outputs:
      - frames
```

With this configuration, `lidar-driver` runs on bot-01 or bot-03, and `camera-driver` runs on bot-02 or bot-03.
Automate cluster management in CI:
```
# Setup
dora cluster install cluster.yml

# Deploy new version
dora cluster upgrade cluster.yml

# Run integration tests
dora start test-dataflow.yml --name integration-test --attach

# Monitor
dora cluster status
dora top

# Cleanup
dora stop integration-test
```

| Stage | Approach | Command |
|---|---|---|
| Local dev | Single-process, no coordinator | `dora run dataflow.yml` |
| Staging | Ad-hoc daemons, manual setup | `dora up` + `dora daemon` on each machine |
| Production | Managed cluster, systemd services | `dora cluster install cluster.yml` |
- SSH keys: Distribute SSH keys so the CLI machine can reach all cluster machines without a password (`BatchMode=yes`).
- Dora binary: Install the `dora` binary on all machines (same version everywhere).
- Network: Ensure the coordinator port (default 6013) is reachable from all machines, and that Zenoh ports are open between daemons for cross-machine node communication.
- cluster.yml: Create the cluster configuration with correct IPs, users, and labels.
```
# Start a dataflow
dora start dataflow.yml --name my-app --attach

# List running dataflows
dora list

# Monitor resource usage
dora top

# View node logs
dora logs my-app <node-id> --follow

# Stop a dataflow
dora stop my-app

# Check cluster health
dora cluster status
```

- Build or download the new `dora` binary locally.
- Run `dora cluster upgrade cluster.yml`.
- Verify with `dora cluster status` that all daemons reconnected.
- Running dataflows are automatically re-spawned via auto-recovery.
Daemon not connecting
- Verify the coordinator is running and reachable: `curl http://<addr>:6013/api/health` (or check the coordinator logs).
- Check the daemon logs: `journalctl -u dora-daemon-<id> -f` (systemd) or the daemon's stderr output (ad-hoc).
- Confirm that `--coordinator-addr` and `--coordinator-port` match the coordinator's actual bind address.
SSH failures during cluster commands
- Ensure `ssh -o BatchMode=yes <user>@<host> echo ok` works from the CLI machine.
- Check that `StrictHostKeyChecking=accept-new` is acceptable for your environment (the first connection auto-accepts the host key).
- Verify that the `user` field in `cluster.yml` matches a valid SSH user on the target.
Label mismatch errors
- Error: `no daemon matches labels {"gpu": "true"}`.
- Check that the daemon was started with the correct `--labels` flag.
- Run `dora cluster status` to see connected daemons. Labels are set at daemon startup from `cluster.yml` and cannot be changed at runtime.
Auto-recovery not triggering
- Auto-recovery only applies to coordinator-managed dataflows (`dora start`), not `dora run`.
- Check the coordinator logs for `auto-recovery: re-spawning` messages.
- If a node crashes immediately, recovery is throttled to once every 30 seconds per daemon per dataflow.
The _unstable_deploy section on each node controls placement and distribution. All fields are optional.
```yaml
nodes:
  - id: my-node
    _unstable_deploy:
      machine: edge-01          # Target machine ID from cluster.yml
      labels:                   # Label requirements (superset match)
        gpu: "true"
        arch: arm64
      distribute: local         # local | scp | http
      working_dir: /opt/my-app  # Working directory on the target machine
    path: ./my-node
```

| Field | Type | Default | Description |
|---|---|---|---|
| `machine` | string | none | Target machine ID. Takes priority over `labels`. |
| `labels` | map | empty | Required daemon labels. All key-value pairs must match. |
| `distribute` | string | `local` | Binary distribution strategy: `local`, `scp`, or `http`. |
| `working_dir` | path | none | Working directory on the target machine. |
- machine -- if set, the node is assigned to the daemon with that machine ID.
- labels -- if set (and machine is not), the node is assigned to the first daemon whose labels are a superset of the required labels.
- Fallback -- if neither is set, the node is assigned to the first unnamed (no machine-id) daemon.
- Use labels over machine IDs for flexibility. Labels decouple your dataflow from specific machines, making it easier to add, remove, or replace hardware.
- Use systemd install for production. Daemon services survive reboots and auto-restart on failure with `Restart=on-failure`.
- Use coordinator persistence (`dora coordinator --store redb`) with clusters so the coordinator survives restarts. See Coordinator State Persistence.
- Set restart policies on nodes for per-node resilience. Combine with auto-recovery for defense in depth. See Restart Policies.
- Monitor with multiple tools: `dora cluster status` for daemon health, `dora top` for resource usage, `dora logs` for node output.
- Test locally first. Develop with `dora run dataflow.yml`, then deploy to a cluster. The same dataflow YAML works in both modes -- `_unstable_deploy` fields are ignored in local mode.
- Use rolling upgrades instead of stopping the entire cluster. `dora cluster upgrade` processes one machine at a time to maintain availability.
- Keep `cluster.yml` in version control alongside your dataflow definitions.