
Consul agents in Docker Swarm keep reconnecting to stale server IPs after task rescheduling (DNSRR + overlay network) #23239

@masoudniki

Description


Environment

  • Consul image: hashicorp/consul (latest at time of writing)

    Consul v1.22.4
    Revision c32a5a6c
    Build Date 2026-02-18T15:07:06Z
    Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)


  • Orchestrator: Docker Swarm

  • Network driver: overlay

  • Endpoint mode: dnsrr

  • Swarm nodes: 2 nodes

    • 1 manager
    • 1 worker

Summary

In a Docker Swarm setup using DNS round-robin (--endpoint-mode dnsrr) and -retry-join="tasks.consulserver", Consul agents resolve the DNS name once and cache the resulting server container IPs.

When a server task is rescheduled (e.g., container restart, failure, or replica recreation), it receives a new overlay IP. However, agents continue attempting to reconnect to the old, no-longer-valid IP addresses.

Even with retry_join configured, the agent does not re-resolve tasks.consulserver dynamically after startup.

The cluster only recovers after manually restarting the agent container, which forces a fresh DNS resolution.


Expected Behavior

When using:

-retry-join="tasks.consulserver"

or

"retry_join": ["tasks.consulserver"]

Consul should:

  • Re-resolve the DNS name periodically
  • Discover updated IPs of rescheduled server tasks
  • Stop attempting reconnections to stale IPs
  • Recover automatically without requiring agent restart

Actual Behavior

When a server task fails and is rescheduled:

  1. Swarm assigns a new IP (e.g., old: 10.10.10.2, new: 10.10.10.11)
  2. Agent logs show server marked as failed
  3. Agent repeatedly attempts reconnect to old IP
  4. Agent enters the "No known Consul servers" state
  5. Cluster does not recover automatically
  6. Restarting the agent resolves the issue immediately

Relevant Logs (Failure State)

memberlist: Suspect 2a6a100c8864 has failed
Marking 2a6a100c8864 as failed
removing server: tcp/10.10.10.2:8300

serf: attempting reconnect to 2a6a100c8864 10.10.10.2:8301

ERROR agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
ERROR agent: Coordinate update error: error="No known Consul servers"

The agent continues attempting reconnect to the old IP (10.10.10.2), which is no longer assigned to any task.


After Agent Restart

Immediately after restarting the agent container:

agent: (LAN) joining: lan_addresses=["tasks.consulserver"]

serf: EventMemberJoin: d30310218559 10.10.10.11
serf: EventMemberJoin: d80292989ac5 10.10.10.13
serf: EventMemberJoin: 03fab6e4075c 10.10.10.12

agent.client: adding server: tcp/10.10.10.11:8300
agent.client: adding server: tcp/10.10.10.13:8300
agent.client: adding server: tcp/10.10.10.12:8300

Cluster becomes healthy again.

This indicates that DNS resolution works correctly, but only at agent startup.
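A quick way to confirm this while the agent is still stuck (a sketch; the container name is a placeholder) is to resolve the DNS name from inside the agent container and compare the result with the IPs the agent keeps retrying:

```shell
#!/bin/sh
# Extract and sort the A-record IPs from `getent hosts` output
# (getent prints one "IP  name" pair per record).
ips_from_getent() { awk '{print $1}' | sort; }

# From inside the still-stuck agent container (name is a placeholder),
# this should already list the NEW server task IPs, even while the agent
# is still reconnecting to 10.10.10.2:
#   docker exec <agent-container> getent hosts tasks.consulserver | ips_from_getent
```

If the new IPs appear here while the agent still retries the old one, the problem is in the agent's reconnect logic, not in Swarm's DNS.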


Steps to Reproduce

1️⃣ Create overlay network

docker network create \
  --driver overlay \
  --subnet 10.10.10.0/24 \
  consul

2️⃣ Create Consul server service

docker service create \
  --name consulserver \
  --network consul \
  --replicas 3 \
  --constraint 'node.role == manager' \
  --endpoint-mode dnsrr \
  -e 'CONSUL_LOCAL_CONFIG={"leave_on_terminate": true}' \
  hashicorp/consul agent \
    -server \
    -bootstrap-expect=3 \
    -bind='{{ GetInterfaceIP "eth0" }}' \
    -client=0.0.0.0 \
    -retry-join="tasks.consulserver" \
    -data-dir=/tmp

3️⃣ Create Consul agent service

docker service create \
  --name consulagent \
  --network consul \
  --replicas 1 \
  --constraint 'node.role != manager' \
  --publish "8500:8500" \
  -e 'CONSUL_BIND_INTERFACE=eth0' \
  -e 'CONSUL_LOCAL_CONFIG={"leave_on_terminate": true, "retry_join":["tasks.consulserver"]}' \
  hashicorp/consul agent \
    -data-dir=/tmp \
    -client=0.0.0.0

4️⃣ Trigger the problem

  • Force remove one server task:

    docker service update --force consulserver
    

    or

    docker service scale consulserver=2
    docker service scale consulserver=3
    
  • Observe:

    • New server IP assigned
    • Agent continues reconnecting to old IP
    • No known Consul servers errors
    • Recovery only after agent restart
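While the failure is in progress, the stale reconnect attempts can be watched from the manager node. The service name matches the reproduction above; the helper is just a grep filter over the agent logs:

```shell
#!/bin/sh
# Count the stale "attempting reconnect" lines in whatever log stream is
# piped in; usable in a watch loop to see the retry count keep climbing:
#   docker service logs --since 2m consulagent | stale_reconnects
stale_reconnects() { grep -c 'attempting reconnect'; }
```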

Additional Notes

  • leave_on_terminate = true is enabled.
  • Using DNSRR (not VIP mode).
  • retry_join appears to be evaluated only at startup.
  • The agent does not re-resolve DNS names after the initial join.
  • This behavior makes Swarm task rescheduling unsafe without external supervision or health-based restarts.
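One possible mitigation (a sketch, not an official recommendation) is a small supervisor loop beside the agent that re-joins through the DNS name whenever no live server remains. `consul members` and `consul join` are standard CLI commands; the helper names and the 30-second interval are assumptions:

```shell
#!/bin/sh
# True (exit 0) when the `consul members` output piped in contains no line
# for a server in the "alive" state, i.e. a fresh DNS-based join is needed.
needs_rejoin() { ! grep -q 'alive.*server'; }

# Not invoked here; run it as a sidecar or as the agent's wrapper entrypoint.
rejoin_loop() {
  while sleep 30; do
    if consul members | needs_rejoin; then
      # Re-resolves tasks.consulserver on every attempt.
      consul join tasks.consulserver || true
    fi
  done
}
```

Run as a wrapper entrypoint (or as a healthcheck that triggers a container restart), this keeps the agent within one interval of a rescheduled server, at the cost of external supervision that retry_join was supposed to make unnecessary.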

Question

Is this expected behavior?

If so, what is the recommended approach in Docker Swarm environments where container IPs are ephemeral?

Should Consul agents periodically re-resolve DNS names used in retry_join?
