Graceful shutdown in a leader-follower setup #783

@dobiadi

Describe the bug
In a leader-follower setup the leader hangs indefinitely on SIGTERM after closing connections.

To Reproduce
docker-compose.yaml:

services:
  leader:
    image: "docker.io/tile38/tile38:1.36.3"
    command: ["tile38-server", "-vv", "-l", "json", "-d", "/data"]
  follower:
    image: "docker.io/tile38/tile38:1.36.3"

Scenario A:

  • Single instance without follower exits on SIGTERM without issues

Scenario B:

  • Issue FOLLOW leader 9851 on follower
  • Send SIGTERM to leader
  • leader shuts down (closes connections) but process hangs indefinitely

Scenario C:

  • Issue FOLLOW leader 9851 on follower
  • Issue FOLLOW no one on follower
  • Send SIGTERM to leader
  • leader shuts down (closes connections) but process hangs indefinitely

Scenario D:

  • Issue FOLLOW leader 9851 on follower
  • Issue FOLLOW no one on follower
  • Shutdown follower
  • Send SIGTERM to leader
  • leader shuts down without issues

Logs on leader when it is stuck:

{"level":"info","ts":1759422156.1645777,"msg":"Server started, Tile38 version 1.36.3, git 01db1d1b"}
{"level":"debug","ts":1759422156.1664107,"msg":"Geom indexing: QuadTree (64 points)"}
{"level":"debug","ts":1759422156.1664326,"msg":"Multi indexing: RTree (64 points)"}
{"level":"info","ts":1759422156.1666253,"msg":"AOF loaded 0 commands: 0.00s, 0/s, 0 bytes/s"}
{"level":"info","ts":1759422156.1668293,"msg":"Ready to accept connections at [::]:9851"}
{"level":"debug","ts":1759422177.476115,"msg":"Opened connection: 10.89.10.21:57322"}
{"level":"debug","ts":1759422177.4773784,"msg":"Closed connection: 10.89.10.21:57322"}
{"level":"debug","ts":1759422177.4780068,"msg":"Opened connection: 10.89.10.21:57330"}
{"level":"debug","ts":1759422177.4785671,"msg":"Detached connection: 10.89.10.21:57330"}
{"level":"info","ts":1759422177.4785929,"msg":"live 10.89.10.21:57330"}
{"level":"warn","ts":1759422186.8873913,"msg":"signal: terminated"}
{"level":"warn","ts":1759422186.8874192,"msg":"Shutting down..."}
{"level":"debug","ts":1759422186.887454,"msg":"Closing client connections..."}

I think the issue is twofold:

  • If I issue FOLLOW no one on the follower, the TCP connection between the leader and the follower remains open (verified with ss -a).
    The leader also still reports the follower as connected when I issue the ROLE command: {"ok":true,"role":{"role":"master","offset":0,"slaves":[{"ip":"10.89.10.21","port":"9851","offset":"0"}]},"elapsed":"48.622µs"}
  • Even though the leader handles SIGTERM correctly and closes all connections (including this lingering TCP connection), the process doesn't terminate.

I am not a Go expert, sadly, but my hunch is that closing the "detached connection" created by the follower doesn't decrement the sync.WaitGroup counter that the shutdown sequence waits on (it gets stuck somewhere between the "Closing client connections..." and "Client connection closed" logs):

var wg sync.WaitGroup
defer func() {
	log.Debug("Closing client connections...")
	s.connsmu.RLock()
	for _, c := range s.conns {
		c.closer.Close()
	}
	s.connsmu.RUnlock()
	wg.Wait()
	ln.Close()
	log.Debug("Client connection closed")
}()

I think it gets stuck on the wg.Wait() call.
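The hunch can be illustrated outside Tile38 with a small standalone program. It is a sketch of one way such a hang could arise, not a trace of the actual server code: it assumes a handler goroutine that is counted in the WaitGroup but blocks waiting for new data to stream rather than on a read from the connection, so closing the connection alone never wakes it. The names stream, updates, quit and simulateShutdown are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

// waitTimeout reports whether wg.Wait returned within d.
func waitTimeout(wg *sync.WaitGroup, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}

// stream is a hypothetical stand-in for a detached follower handler.
// It only touches conn when there is data to send, so it does not
// notice that conn was closed unless quit is signalled.
func stream(conn net.Conn, updates <-chan []byte, quit <-chan struct{}) {
	for {
		select {
		case b := <-updates:
			if _, err := conn.Write(b); err != nil {
				return
			}
		case <-quit:
			return
		}
	}
}

// simulateShutdown closes the connection, optionally signals quit, and
// reports whether the WaitGroup drained before the timeout.
func simulateShutdown(signalQuit bool) bool {
	var wg sync.WaitGroup
	srv, peer := net.Pipe()
	defer peer.Close()
	updates := make(chan []byte) // nothing is written during shutdown
	quit := make(chan struct{})

	wg.Add(1)
	go func() {
		defer wg.Done()
		stream(srv, updates, quit)
	}()

	srv.Close() // what the shutdown loop does for each connection
	if signalQuit {
		close(quit)
	}
	ok := waitTimeout(&wg, 200*time.Millisecond)
	if !signalQuit {
		close(quit) // release the goroutine so the demo does not leak it
	}
	return ok
}

func main() {
	fmt.Println("close only:", simulateShutdown(false))         // prints "close only: false"
	fmt.Println("close + quit signal:", simulateShutdown(true)) // prints "close + quit signal: true"
}
```

If something like this is what happens on the leader, the process would stay blocked in wg.Wait() even though ss shows the TCP connection as closed, which matches scenarios B and C.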

Expected behavior

The leader tile38-server process should exit on SIGTERM once the follower connections are closed. Additionally, in my opinion, issuing FOLLOW no one on the follower should close the TCP connection between the leader and the follower.

Operating System (please complete the following information):

  • OS: Linux
  • CPU: amd64
  • Version: 1.36.3
  • Container: Docker
