
Consul agents in Docker Swarm keep reconnecting to stale server IPs after task rescheduling (DNSRR + overlay network) #23239

@masoudniki

Description


Environment

  • Consul image: hashicorp/consul (latest at time of writing)

    Consul v1.22.4
    Revision c32a5a6c
    Build Date 2026-02-18T15:07:06Z
    Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)


  • Orchestrator: Docker Swarm

  • Network driver: overlay

  • Endpoint mode: dnsrr

  • Swarm nodes: 2 nodes

    • 1 manager
    • 1 worker

Summary

In a Docker Swarm setup using DNS round-robin (--endpoint-mode dnsrr) and -retry-join="tasks.consulserver", Consul agents resolve the DNS name once and cache the resulting server container IPs.

When a server task is rescheduled (e.g., container restart, failure, or replica recreation), it receives a new overlay IP. However, agents continue attempting to reconnect to the old, no-longer-valid IP addresses.

Even with retry_join configured, the agent does not re-resolve tasks.consulserver dynamically after startup.

The cluster only recovers after manually restarting the agent container, which forces a fresh DNS resolution.


Expected Behavior

When using:

-retry-join="tasks.consulserver"

or

"retry_join": ["tasks.consulserver"]

Consul should:

  • Re-resolve the DNS name periodically
  • Discover updated IPs of rescheduled server tasks
  • Stop attempting reconnections to stale IPs
  • Recover automatically without requiring agent restart

Actual Behavior

When a server task fails and is rescheduled:

  1. Swarm assigns a new IP (e.g., old: 10.10.10.2, new: 10.10.10.11)
  2. Agent logs show server marked as failed
  3. Agent repeatedly attempts reconnect to old IP
  4. Agent enters the "No known Consul servers" state
  5. Cluster does not recover automatically
  6. Restarting the agent resolves the issue immediately

Relevant Logs (Failure State)

memberlist: Suspect 2a6a100c8864 has failed
Marking 2a6a100c8864 as failed
removing server: tcp/10.10.10.2:8300

serf: attempting reconnect to 2a6a100c8864 10.10.10.2:8301

ERROR agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
ERROR agent: Coordinate update error: error="No known Consul servers"

The agent continues attempting reconnect to the old IP (10.10.10.2), which is no longer assigned to any task.


After Agent Restart

Immediately after restarting the agent container:

agent: (LAN) joining: lan_addresses=["tasks.consulserver"]

serf: EventMemberJoin: d30310218559 10.10.10.11
serf: EventMemberJoin: d80292989ac5 10.10.10.13
serf: EventMemberJoin: 03fab6e4075c 10.10.10.12

agent.client: adding server: tcp/10.10.10.11:8300
agent.client: adding server: tcp/10.10.10.13:8300
agent.client: adding server: tcp/10.10.10.12:8300

Cluster becomes healthy again.

This indicates that DNS resolution works correctly, but only at agent startup.
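A quick way to confirm this while the agent is still stuck (a sketch; the container name is a placeholder) is to resolve the DNS name from inside the agent container and compare the result with the IPs the agent keeps retrying:

```shell
#!/bin/sh
# Extract and sort the A-record IPs from `getent hosts` output
# (getent prints one "IP  name" pair per record).
ips_from_getent() { awk '{print $1}' | sort; }

# From inside the still-stuck agent container (name is a placeholder),
# this should already list the NEW server task IPs, even while the agent
# is still reconnecting to 10.10.10.2:
#   docker exec <agent-container> getent hosts tasks.consulserver | ips_from_getent
```

If the new IPs appear here while the agent still retries the old one, the problem is in the agent's reconnect logic, not in Swarm's DNS.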


Steps to Reproduce

1️⃣ Create overlay network

docker network create \
  --driver overlay \
  --subnet 10.10.10.0/24 \
  consul

2️⃣ Create Consul server service

docker service create \
  --name consulserver \
  --network consul \
  --replicas 3 \
  --constraint 'node.role == manager' \
  --endpoint-mode dnsrr \
  -e 'CONSUL_LOCAL_CONFIG={"leave_on_terminate": true}' \
  hashicorp/consul agent \
    -server \
    -bootstrap-expect=3 \
    -bind='{{ GetInterfaceIP "eth0" }}' \
    -client=0.0.0.0 \
    -retry-join="tasks.consulserver" \
    -data-dir=/tmp

3️⃣ Create Consul agent service

docker service create \
  --name consulagent \
  --network consul \
  --replicas 1 \
  --constraint 'node.role != manager' \
  --publish "8500:8500" \
  -e 'CONSUL_BIND_INTERFACE=eth0' \
  -e 'CONSUL_LOCAL_CONFIG={"leave_on_terminate": true, "retry_join":["tasks.consulserver"]}' \
  hashicorp/consul agent \
    -data-dir=/tmp \
    -client=0.0.0.0

4️⃣ Trigger the problem

  • Force remove one server task:

    docker service update --force consulserver
    

    or

    docker service scale consulserver=2
    docker service scale consulserver=3
    
  • Observe:

    • New server IP assigned
    • Agent continues reconnecting to old IP
    • No known Consul servers errors
    • Recovery only after agent restart
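While the failure is in progress, the stale reconnect attempts can be watched from the manager node. The service name matches the reproduction above; the helper is just a grep filter over the agent logs:

```shell
#!/bin/sh
# Count the stale "attempting reconnect" lines in whatever log stream is
# piped in; usable in a watch loop to see the retry count keep climbing:
#   docker service logs --since 2m consulagent | stale_reconnects
stale_reconnects() { grep -c 'attempting reconnect'; }
```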

Additional Notes

  • leave_on_terminate = true is enabled.
  • Using DNSRR (not VIP mode).
  • retry_join appears to be evaluated only at startup.
  • The agent does not re-resolve DNS names after the initial join.
  • This behavior makes Swarm task rescheduling unsafe without external supervision or health-based restarts.
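One possible mitigation (a sketch, not an official recommendation) is a small supervisor loop beside the agent that re-joins through the DNS name whenever no live server remains. `consul members` and `consul join` are standard CLI commands; the helper names and the 30-second interval are assumptions:

```shell
#!/bin/sh
# True (exit 0) when the `consul members` output piped in contains no line
# for a server in the "alive" state, i.e. a fresh DNS-based join is needed.
needs_rejoin() { ! grep -q 'alive.*server'; }

# Not invoked here; run it as a sidecar or as the agent's wrapper entrypoint.
rejoin_loop() {
  while sleep 30; do
    if consul members | needs_rejoin; then
      # Re-resolves tasks.consulserver on every attempt.
      consul join tasks.consulserver || true
    fi
  done
}
```

Run as a wrapper entrypoint (or as a healthcheck that triggers a container restart), this keeps the agent within one interval of a rescheduled server, at the cost of external supervision that retry_join was supposed to make unnecessary.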

Question

Is this expected behavior?

If so, what is the recommended approach in Docker Swarm environments where container IPs are ephemeral?

Should Consul agents periodically re-resolve DNS names used in retry_join?
