Describe the bug
In a leader-follower setup the leader hangs indefinitely on SIGTERM after closing connections.
To Reproduce
docker-compose.yaml:

    services:
      leader:
        image: "docker.io/tile38/tile38:1.36.3"
        command: ["tile38-server", "-vv", "-l", "json", "-d", "/data"]
      follower:
        image: "docker.io/tile38/tile38:1.36.3"

Scenario A:
- Single instance without follower exits on SIGTERM without issues
Scenario B:
- Issue FOLLOW leader 9851 on follower
- Send SIGTERM to leader
- leader shuts down (closes connections) but process hangs indefinitely
Scenario C:
- Issue FOLLOW leader 9851 on follower
- Issue FOLLOW no one on follower
- Send SIGTERM to leader
- leader shuts down (closes connections) but process hangs indefinitely
Scenario D:
- Issue FOLLOW leader 9851 on follower
- Issue FOLLOW no one on follower
- Shut down follower
- Send SIGTERM to leader
- leader shuts down without issues
Logs on leader:
{"level":"info","ts":1759422156.1645777,"msg":"Server started, Tile38 version 1.36.3, git 01db1d1b"}
{"level":"debug","ts":1759422156.1664107,"msg":"Geom indexing: QuadTree (64 points)"}
{"level":"debug","ts":1759422156.1664326,"msg":"Multi indexing: RTree (64 points)"}
{"level":"info","ts":1759422156.1666253,"msg":"AOF loaded 0 commands: 0.00s, 0/s, 0 bytes/s"}
{"level":"info","ts":1759422156.1668293,"msg":"Ready to accept connections at [::]:9851"}
{"level":"debug","ts":1759422177.476115,"msg":"Opened connection: 10.89.10.21:57322"}
{"level":"debug","ts":1759422177.4773784,"msg":"Closed connection: 10.89.10.21:57322"}
{"level":"debug","ts":1759422177.4780068,"msg":"Opened connection: 10.89.10.21:57330"}
{"level":"debug","ts":1759422177.4785671,"msg":"Detached connection: 10.89.10.21:57330"}
{"level":"info","ts":1759422177.4785929,"msg":"live 10.89.10.21:57330"}
{"level":"warn","ts":1759422186.8873913,"msg":"signal: terminated"}
{"level":"warn","ts":1759422186.8874192,"msg":"Shutting down..."}
{"level":"debug","ts":1759422186.887454,"msg":"Closing client connections..."}

I think the issue is twofold:
- If I issue FOLLOW no one on the follower, the TCP connection between the leader and the follower remains open (verified with ss -a). The leader also still thinks the follower is connected when I issue the ROLE command:

    {"ok":true,"role":{"role":"master","offset":0,"slaves":[{"ip":"10.89.10.21","port":"9851","offset":"0"}]},"elapsed":"48.622µs"}

- Even though the leader handles SIGTERM correctly and closes all connections (this hanging TCP connection as well), it doesn't terminate the process.
I am not a Go expert sadly, but my hunch is that closing the "detached connection" created by the follower doesn't decrement the sync.WaitGroup counter that the shutdown sequence waits on (it gets stuck somewhere between the "Closing client connections..." and "Client connection closed" logs):
tile38/internal/server/server.go, lines 571 to 582 in 01db1d1:

    var wg sync.WaitGroup
    defer func() {
        log.Debug("Closing client connections...")
        s.connsmu.RLock()
        for _, c := range s.conns {
            c.closer.Close()
        }
        s.connsmu.RUnlock()
        wg.Wait()
        ln.Close()
        log.Debug("Client connection closed")
    }()
I think it gets stuck on the wg.Wait() call.
Expected behavior
The leader's tile38-server process should exit on SIGTERM once follower connections are closed. Additionally, in my opinion, issuing FOLLOW no one on the follower should close the TCP connection between the leader and the follower.
Operating System (please complete the following information):
- OS: Linux
- CPU: amd64
- Version: 1.36.3
- Container: Docker