
Improved cluster replica tracking and management#1537

Open
carlhoerberg wants to merge 10 commits into main from osr2

Conversation

@carlhoerberg
Member

@carlhoerberg carlhoerberg commented Dec 7, 2025

Summary

  • Track replicas using individual etcd keys instead of comma-separated ISR list
  • Show all known replicas in Nodes page with Connected/In-sync status columns
  • Add ability to forget disconnected replicas via UI and CLI
  • Generate sequential cluster node IDs using atomic etcd transactions

Changes

Use individual etcd keys for ISR tracking

Instead of maintaining a comma-separated list at a single "isr" key, each replica now has its own key at {prefix}/replica/{id}/insync with value "1" for in-sync and "0" for out-of-sync.
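A minimal sketch of the key scheme described above (illustrative only; the `etcd.put` call is assumed to match the project's etcd client wrapper, and the prefix and ID are examples):

```crystal
# Illustrative only: per-replica insync keys, with IDs encoded in base 36.
prefix = "lavinmq" # default clustering_etcd_prefix
id = 10
key = "#{prefix}/replica/#{id.to_s(36)}/insync"
puts key # "lavinmq/replica/a/insync"
# etcd.put(key, "1") # mark replica in-sync
# etcd.put(key, "0") # mark replica out-of-sync
```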

Show all known replicas in nodes page

  • Add known_replicas method to Clustering::Server that queries etcd
  • Merge etcd replicas with connected followers in NodesController
  • Add Connected and In-sync columns to followers table
  • Show leader role in replicas table, sorted first

Add forget button for disconnected replicas

  • Allow removing disconnected replicas via Nodes page UI or DELETE /api/nodes/:id
  • Prevent forgetting leader node
  • Add lavinmqctl forget_cluster_node <id> CLI command

Generate sequential cluster node IDs

  • New nodes get sequential IDs (1, 2, 3, ...) instead of random IDs
  • Uses atomic etcd transactions to ensure IDs don't collide
  • Handles manually created nodes by checking for existing replica keys
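The allocation loop might look roughly like this. Note that `put_if_absent` is a hypothetical stand-in for an atomic etcd transaction (create-if-not-exists), not the actual client API:

```crystal
# Sketch, assuming a hypothetical atomic create helper on the etcd client.
# Because the create is atomic, two nodes racing for the same ID cannot
# both succeed; the loser simply tries the next ID.
id = 1
loop do
  key = "#{prefix}/replica/#{id.to_s(36)}/insync"
  break if etcd.put_if_absent(key, "0") # true only if the key did not exist
  id += 1 # taken, possibly by a manually created node; try the next ID
end
```

Trying IDs in order also naturally skips over manually created replica keys, which is how collisions with existing nodes are avoided.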

Test plan

  • Start a 3-node cluster and verify nodes get IDs 1, 2, 3
  • Verify Nodes page shows all replicas with correct Connected/In-sync status
  • Stop a follower and verify it shows as disconnected
  • Use Forget button to remove disconnected replica
  • Verify lavinmqctl forget_cluster_node works
  • Restart forgotten node and verify it gets a new sequential ID

🤖 Generated with Claude Code

@carlhoerberg carlhoerberg requested a review from a team as a code owner December 7, 2025 04:37
@claude

claude bot commented Dec 7, 2025

Claude finished @carlhoerberg's task


Code Review Complete

  • Read and analyze changed files for bugs, security issues, performance problems, and Crystal anti-patterns

Issues Found

1. Missing Error Handling in Etcd Operations

File: src/lavinmq/clustering/server.cr:149-161

  • The known_replicas method calls @etcd.get_prefix(prefix) but doesn't handle potential etcd communication failures
  • Impact: Network failures or etcd outages could cause unhandled exceptions
  • Recommendation: Add exception handling around etcd operations

2. Potential Integer Overflow in Sequential ID Generation

File: src/lavinmq/clustering/controller.cr:81-89

  • The next_cluster_node_id method increments id indefinitely without bounds checking
  • Impact: In theory, could overflow after Int32::MAX iterations (though unlikely in practice)
  • Recommendation: Add reasonable upper bound check or use Int64

3. Missing Validation in Node Forgetting API

File: src/lavinmq/http/controller/nodes.cr:163-184

  • The delete "/api/nodes/:id" endpoint converts string to Int32 but only checks for nil, not parsing errors
  • Invalid base-36 strings could cause exceptions
  • Recommendation: Add proper error handling for malformed ID strings
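One hedged way to harden that parsing, using Crystal's `String#to_i?`, which returns nil on malformed input instead of raising (the surrounding handler shape is assumed):

```crystal
# Sketch: tolerate malformed base-36 IDs instead of raising ArgumentError.
if id = params["id"]?.try(&.to_i?(36))
  # look up the replica by id and forget it
else
  # respond with 400 Bad Request: missing or malformed node ID
end
```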

4. Race Condition in Follower Connection Handling

File: src/lavinmq/clustering/server.cr:194-202

  • When a follower with existing ID connects, the old follower is removed from @followers array without synchronization with etcd state updates
  • Impact: Brief window where etcd may still show old follower as in-sync while new follower is connecting
  • Recommendation: Ensure etcd state is updated atomically with follower replacement

5. Performance: Inefficient String Operations

File: src/lavinmq/clustering/server.cr:152-158

  • Multiple string splits and repeated array operations when parsing etcd keys
  • Impact: Performance degradation with many replicas
  • Recommendation: Use more efficient parsing or cache parsed results

No Issues Found For:

  • Security vulnerabilities
  • Major Crystal anti-patterns
  • Memory management issues

@carlhoerberg
Member Author

Fixes #1425

@carlhoerberg carlhoerberg changed the title Show all known replicas in nodes page with Connected/In-sync columns Improved cluster replica tracking and management Dec 8, 2025

@snichme snichme left a comment


I looked at the code in Clustering::Server, but I haven't worked with that code so I can't give a thorough review there. I will run the code locally as well.

Could we add some more specs? It feels like a lot of logic was added without many specs around it.

```diff
  run_queue: 0,
  sockets_used: @amqp_server.vhosts.sum { |_, v| v.connections.size },
- followers: @amqp_server.followers,
+ followers: merged_replicas,
```


Does this mean that we change the response, i.e. the keys for each follower are not the same? If so, this is a breaking change. We could handle it like that, or add merged_replicas as additional data in the response like so:

Suggested change

```diff
- followers: merged_replicas,
+ followers: @amqp_server.followers,
+ merged_replicas: merged_replicas
```

```crystal
  end
end

# Add any connected followers not yet in etcd (shouldn't happen but be safe)
```


How can a connected follower not be in etcd?

```crystal
@id.to_s(str, 36)

private def mark_insync(id : Int32)
  key = "#{@config.clustering_etcd_prefix}/replica/#{id.to_s(36)}/insync"
  @etcd.put(key, "1")
```

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should rescue Etcd::Error and re-raise with something more specific here to improve the error handling.
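A sketch of what that could look like. `Etcd::Error` is the exception type named above; the wrapping `Error` class is a placeholder for whatever project-specific exception fits:

```crystal
private def mark_insync(id : Int32)
  key = "#{@config.clustering_etcd_prefix}/replica/#{id.to_s(36)}/insync"
  @etcd.put(key, "1")
rescue ex : Etcd::Error
  # Placeholder exception class: re-raise with replica context attached
  raise Error.new("failed to mark replica #{id.to_s(36)} in-sync", cause: ex)
end
```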

```crystal
if isr = @etcd.get(old_key)
  legacy_isr_exists = true
  if isr.split(",").map(&.to_i(36)).includes?(@id)
    Log.info { "In sync via legacy ISR key" }
```


If we are in sync via the legacy ISR key, shouldn't we also write that we are in sync with the new key?

@snichme snichme linked an issue Dec 19, 2025 that may be closed by this pull request
@viktorerlingsson viktorerlingsson requested a review from a team February 10, 2026 12:47
carlhoerberg and others added 10 commits February 19, 2026 15:45
Instead of maintaining a comma-separated list at a single "isr" key,
each replica now has its own key at `{prefix}/replica/{id}/insync`
with value "1" for in-sync and "0" for out-of-sync. Followers are
only marked out-of-sync when a change arrives that they can't receive.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add known_replicas method to Clustering::Server that queries etcd
- Merge etcd replicas with connected followers in NodesController
- Add Connected and In-sync columns to followers table in nodes.ecr
- Update nodes.js to render new columns with checkmarks
- Simplify followers API: remove syncing_followers/all_followers, keep single followers method
- Fix etcd.get_prefix to use map instead of map! (type mismatch fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Prevents a reconnecting follower from being incorrectly marked as
out-of-sync. Also moves full_sync into begin block so ensure runs
on sync failure.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Allow administrators to remove disconnected replicas from etcd via the
Nodes page UI or DELETE /api/nodes/:id endpoint. The button only appears
for disconnected followers and includes a confirmation dialog.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add role field (leader/follower) to replica API response
- Display "Leader" in Connected column for leader node
- Sort replicas with leader first
- Prevent forgetting leader node via API and hide Forget button

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Allows removing disconnected nodes from the cluster via CLI:
  lavinmqctl forget_cluster_node <id>

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Instead of generating random cluster node IDs, new nodes now get
sequential IDs (1, 2, 3, ...) by atomically creating the replica
insync key in etcd. This ensures IDs don't collide with existing
nodes, including manually created ones.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Support rolling upgrades from v2.6.x by checking both the new individual
replica keys and the old comma-separated ISR key format. When a new leader
takes over, it deletes the legacy key to clean up.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Preserve backwards-compatible `followers` API field, add `merged_replicas` as new field
- Remove unnecessary defensive code for followers not yet in etcd
- Add error handling (rescue Etcd::Error) in mark_insync/mark_out_of_sync
- Migrate legacy ISR key to new per-replica format when detected as in-sync
- Add specs for known_replicas, forget_replica, and sequential ID generation
- Remove openssl_sni.cr monkey-patch, use Crystal stdlib on_server_name

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Older LavinMQ versions read the comma-separated isr key to determine
sync status. Continue writing it alongside the new per-replica keys
so rolling upgrades work correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude

claude bot commented Feb 19, 2026

PR Review: Sequential cluster node IDs and per-replica insync tracking

1. Fiber and connection leak in wait_to_be_insync (controller.cr:159-186)

Two fibers are spawned to concurrently watch the new per-replica key and the legacy comma-separated ISR key. When one detects the insync state, it closes the channel and breaks out of its watch loop. However, the other fiber remains blocked inside @etcd.watch(), which holds an open streaming TCP connection to etcd. There is no cancellation mechanism, so this leaked fiber and its HTTP connection persist for the lifetime of the process.

Consider sharing a cancellation signal (e.g., closing the underlying socket or using an Atomic flag checked inside the watch callback) so the losing fiber can also exit.
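One possible shape for that cancellation, assuming the watch block can observe a shared flag and that breaking out of the block tears down the watch stream (both are assumptions about the etcd client, not verified behavior):

```crystal
# Sketch: a shared Atomic flag lets the losing fiber exit its watch loop.
done = Atomic(Int32).new(0)
[new_insync_key, legacy_isr_key].each do |key|
  spawn do
    @etcd.watch(key) do |value|
      break if done.get == 1 # the other watcher already saw insync
      if value == "1"
        done.set(1)
        break # assumed to tear down the watch stream
      end
    end
  end
end
```

Caveat: the flag is only checked when a watch event arrives, so a fully idle stream would still need its underlying socket closed to unblock.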

2. Etcd network calls in hot path under lock (server.cr:287-289)

each_follower is invoked by append, replace_file, and delete_file, which are the message publishing and delivery hot paths. The new code calls mark_out_of_sync for each entry in @disconnected_followers while holding @lock. Each mark_out_of_sync call makes 2 HTTP round-trips to etcd (one put for the replica key, one put for the legacy ISR update). With N disconnected followers, that is 2N synchronous network calls blocking all message replication under the mutex.

The old code performed at most 1 etcd call (update_isr) in the same position.

Suggestion: drain @disconnected_followers into a local variable inside the lock, then perform the etcd calls outside the lock, or batch the updates.
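Sketched below, with the `Follower` type and the surrounding method shape assumed from the description above:

```crystal
# Sketch: collect disconnected followers under the lock, then do the
# slow etcd round-trips after the mutex is released.
to_mark = [] of Follower
@lock.synchronize do
  to_mark.concat(@disconnected_followers)
  @disconnected_followers.clear
  # ... existing replication work on connected followers ...
end
to_mark.each { |f| mark_out_of_sync(f.id) } # network I/O outside the lock
```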

3. known_replicas key parsing breaks if prefix contains slashes (server.cr:153-157)

The code splits the full etcd key by / with a limit of 4 and assumes parts[2] is the replica ID. The clustering_etcd_prefix is user-configurable (default "lavinmq"). If it contains a slash (e.g., "my/prefix"), the key "my/prefix/replica/1a/insync" splits to ["my", "prefix", "replica", "1a/insync"]. Then parts[3] is "1a/insync" (not "insync"), so no replicas are ever matched.

Fix: strip the known prefix from the key before parsing the remainder.
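A possible implementation of that fix (the helper name is illustrative):

```crystal
# Sketch: strip the configured prefix first, then parse the remainder,
# so prefixes containing "/" (e.g. "my/prefix") still work.
def replica_id_from_key(key : String, prefix : String) : Int32?
  rest = key.lchop?("#{prefix}/replica/") || return nil
  id_str, _, tail = rest.partition('/')
  return nil unless tail == "insync"
  id_str.to_i?(36)
end

# replica_id_from_key("my/prefix/replica/1a/insync", "my/prefix") # => 46
```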



Development

Successfully merging this pull request may close these issues.

OSR (out of sync replicas) for better nodes interface