Description
Environment
- Consul image: hashicorp/consul (latest at time of writing)
  - Consul v1.22.4
  - Revision c32a5a6c
  - Build Date 2026-02-18T15:07:06Z
  - Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
- Orchestrator: Docker Swarm
- Network driver: overlay
- Endpoint mode: dnsrr
- Swarm nodes: 2 (1 manager, 1 worker)
Summary
In a Docker Swarm setup using DNS round-robin (--endpoint-mode dnsrr) and -retry-join="tasks.consulserver", Consul agents cache server container IPs.
When a server task is rescheduled (e.g., container restart, failure, or replica recreation), it receives a new overlay IP. However, agents continue attempting to reconnect to the old, no-longer-valid IP addresses.
Even with retry_join configured, the agent does not re-resolve tasks.consulserver dynamically after startup.
The cluster only recovers after manually restarting the agent container, which forces a fresh DNS resolution.
Expected Behavior
When using either the CLI flag:

```
-retry-join="tasks.consulserver"
```

or the config-file equivalent:

```json
"retry_join": ["tasks.consulserver"]
```

Consul should:
- Re-resolve the DNS name periodically
- Discover updated IPs of rescheduled server tasks
- Stop attempting reconnections to stale IPs
- Recover automatically without requiring agent restart
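
For reference, both forms map to the same agent configuration. A minimal config-file fragment is sketched below; `retry_interval` and `retry_max` are Consul's existing knobs for the delay between join attempts and the number of attempts (`0` means retry forever), shown here with assumed values. Note that these only control the join loop itself; nothing in this fragment forces re-resolution after a successful join, which is exactly the gap this issue describes.

```json
{
  "retry_join": ["tasks.consulserver"],
  "retry_interval": "30s",
  "retry_max": 0
}
```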
Actual Behavior
When a server task fails and is rescheduled:
- Swarm assigns a new IP (e.g., old: 10.10.10.2, new: 10.10.10.11)
- Agent logs show the server marked as failed
- Agent repeatedly attempts to reconnect to the old IP
- Agent enters the "No known Consul servers" state
- Cluster does not recover automatically
- Restarting the agent resolves the issue immediately
Relevant Logs (Failure State)
```
memberlist: Suspect 2a6a100c8864 has failed
Marking 2a6a100c8864 as failed
removing server: tcp/10.10.10.2:8300
serf: attempting reconnect to 2a6a100c8864 10.10.10.2:8301
ERROR agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
ERROR agent: Coordinate update error: error="No known Consul servers"
```
The agent continues attempting reconnect to the old IP (10.10.10.2), which is no longer assigned to any task.
After Agent Restart
Immediately after restarting the agent container:
```
agent: (LAN) joining: lan_addresses=["tasks.consulserver"]
serf: EventMemberJoin: d30310218559 10.10.10.11
serf: EventMemberJoin: d80292989ac5 10.10.10.13
serf: EventMemberJoin: 03fab6e4075c 10.10.10.12
agent.client: adding server: tcp/10.10.10.11:8300
agent.client: adding server: tcp/10.10.10.13:8300
agent.client: adding server: tcp/10.10.10.12:8300
```
Cluster becomes healthy again.
This indicates that DNS resolution works correctly, but only at agent startup.
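
This can be confirmed from outside the agent: a fresh lookup of the name always returns the current task IPs, so the stale addresses come from the agent's member cache, not from DNS. A minimal sketch of that check (Python; `tasks.consulserver` is a placeholder that only resolves inside the overlay network, and the `before` set is an assumed example of what the agent cached at startup):

```python
import socket

def resolve_ips(name: str) -> set[str]:
    """Return the current set of IPv4 addresses behind a DNS name."""
    try:
        infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
    except socket.gaierror:
        # Name not resolvable from this host (e.g., outside the overlay network).
        return set()
    return {info[4][0] for info in infos}

# Comparing two resolutions reveals rescheduled tasks: addresses that
# disappear are stale, addresses that appear belong to new containers.
before = {"10.10.10.2", "10.10.10.3", "10.10.10.4"}  # assumed: cached at agent start
after = resolve_ips("tasks.consulserver")            # fresh lookup (placeholder name)
stale = before - after
new = after - before
```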
Steps to Reproduce
1️⃣ Create overlay network

```shell
docker network create \
  --driver overlay \
  --subnet 10.10.10.0/24 \
  consul
```

2️⃣ Create Consul server service
```shell
docker service create \
  --name consulserver \
  --network consul \
  --replicas 3 \
  --constraint 'node.role == manager' \
  --endpoint-mode dnsrr \
  -e 'CONSUL_LOCAL_CONFIG={"leave_on_terminate": true}' \
  hashicorp/consul agent \
    -server \
    -bootstrap-expect=3 \
    -bind='{{ GetInterfaceIP "eth0" }}' \
    -client=0.0.0.0 \
    -retry-join="tasks.consulserver" \
    -data-dir=/tmp
```

3️⃣ Create Consul agent service
```shell
docker service create \
  --name consulagent \
  --network consul \
  --replicas 1 \
  --constraint 'node.role != manager' \
  --publish "8500:8500" \
  -e 'CONSUL_BIND_INTERFACE=eth0' \
  -e 'CONSUL_LOCAL_CONFIG={"leave_on_terminate": true, "retry_join":["tasks.consulserver"]}' \
  hashicorp/consul agent \
    -data-dir=/tmp \
    -client=0.0.0.0
```

4️⃣ Trigger the problem
Force-update the service so its tasks are rescheduled:

```shell
docker service update --force consulserver
```

or scale the service down and back up:

```shell
docker service scale consulserver=2
docker service scale consulserver=3
```
Observe:
- New server IP assigned
- Agent continues reconnecting to the old IP
- "No known Consul servers" errors
- Recovery only after agent restart
Additional Notes
- leave_on_terminate = true is enabled.
- Using DNSRR (not VIP mode).
- retry_join appears to be evaluated only during startup.
- Agent does not re-resolve DNS names after the initial join.
- This behavior makes Swarm task rescheduling unsafe without external supervision or health-based restarts.
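
As one possible form of external supervision, a sidecar could periodically re-resolve the DNSRR name and tell the agent to join any newly appeared addresses via the standard `consul join` CLI command. A minimal sketch, assuming this setup's service name and a 30-second check interval (both assumptions, not Consul defaults):

```python
import socket
import subprocess

DNS_NAME = "tasks.consulserver"  # Swarm DNSRR service name (from this setup)

def resolve_ips(name: str) -> set[str]:
    """Return the current IPv4 addresses behind a DNS name (empty if unresolvable)."""
    try:
        return {i[4][0] for i in socket.getaddrinfo(name, None, family=socket.AF_INET)}
    except socket.gaierror:
        return set()

def join_new_servers(known: set[str], current: set[str],
                     join_cmd: tuple[str, ...] = ("consul", "join")) -> set[str]:
    """Run `consul join <ip>` for every address not seen before; return the updated set."""
    for ip in sorted(current - known):
        subprocess.run([*join_cmd, ip], check=False)
    return known | current

# One supervision cycle; in practice run this in a loop (e.g., every 30 s)
# alongside the agent container, where tasks.consulserver is resolvable.
known = join_new_servers(set(), resolve_ips(DNS_NAME))
```

This does not fix the underlying caching behavior, but it would restore contact with rescheduled servers without restarting the agent container.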
Question
Is this expected behavior?
If so, what is the recommended approach in Docker Swarm environments where container IPs are ephemeral?
Should Consul agents periodically re-resolve DNS names used in retry_join?