Improved cluster replica tracking and management#1537
Improved cluster replica tracking and management#1537carlhoerberg wants to merge 10 commits intomainfrom
Conversation
|
Claude finished @carlhoerberg's task —— View job Code Review Complete
Issues Found1. Missing Error Handling in Etcd OperationsFile:
2. Potential Integer Overflow in Sequential ID GenerationFile:
3. Missing Validation in Node Forgetting APIFile:
4. Race Condition in Follower Connection HandlingFile:
5. Performance: Inefficient String OperationsFile:
No Issues Found For:
|
|
Fixes #1425 |
snichme
left a comment
There was a problem hiding this comment.
I looked at the code in Clustering::Server but haven't worked with that code so cannot give a good review there. Will run the code locally as well.
Possible to add some more specs, feel like there is a lot of logic added but not much specs around that
src/lavinmq/http/controller/nodes.cr
Outdated
| run_queue: 0, | ||
| sockets_used: @amqp_server.vhosts.sum { |_, v| v.connections.size }, | ||
| followers: @amqp_server.followers, | ||
| followers: merged_replicas, |
There was a problem hiding this comment.
Does this mean that we change the response? Meaning the keys for each follower is not the same? If so this is a breaking change. We could handle it like that or add merged_replicas as additional data in the response like so
| followers: merged_replicas, | |
| followers: @amqp_server.followers, | |
| merged_replicas: merged_replicas |
src/lavinmq/http/controller/nodes.cr
Outdated
| end | ||
| end | ||
|
|
||
| # Add any connected followers not yet in etcd (shouldn't happen but be safe) |
There was a problem hiding this comment.
How can a connected follower not be in etcd?
| @id.to_s(str, 36) | ||
| private def mark_insync(id : Int32) | ||
| key = "#{@config.clustering_etcd_prefix}/replica/#{id.to_s(36)}/insync" | ||
| @etcd.put(key, "1") |
There was a problem hiding this comment.
We should rescue Etcd::Error and re-raise with something more specific here to improve the error handling.
src/lavinmq/clustering/controller.cr
Outdated
| if isr = @etcd.get(old_key) | ||
| legacy_isr_exists = true | ||
| if isr.split(",").map(&.to_i(36)).includes?(@id) | ||
| Log.info { "In sync via legacy ISR key" } |
There was a problem hiding this comment.
If we are insync vid legacy ISR key, shouldn't we write that we are insync in with the new key as well?
Instead of maintaining a comma-separated list at a single "isr" key,
each replica now has its own key at `{prefix}/replica/{id}/insync`
with value "1" for in-sync and "0" for out-of-sync. Followers are
only marked out-of-sync when a change arrives that they can't receive.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add known_replicas method to Clustering::Server that queries etcd - Merge etcd replicas with connected followers in NodesController - Add Connected and In-sync columns to followers table in nodes.ecr - Update nodes.js to render new columns with checkmarks - Simplify followers API: remove syncing_followers/all_followers, keep single followers method - Fix etcd.get_prefix to use map instead of map! (type mismatch fix) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Prevents a reconnecting follower from being incorrectly marked as out-of-sync. Also moves full_sync into begin block so ensure runs on sync failure. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Allow administrators to remove disconnected replicas from etcd via the Nodes page UI or DELETE /api/nodes/:id endpoint. The button only appears for disconnected followers and includes a confirmation dialog. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add role field (leader/follower) to replica API response - Display "Leader" in Connected column for leader node - Sort replicas with leader first - Prevent forgetting leader node via API and hide Forget button 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Allows removing disconnected nodes from the cluster via CLI: lavinmqctl forget_cluster_node <id> 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Instead of generating random cluster node IDs, new nodes now get sequential IDs (1, 2, 3, ...) by atomically creating the replica insync key in etcd. This ensures IDs don't collide with existing nodes, including manually created ones. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Support rolling upgrades from v2.6.x by checking both the new individual replica keys and the old comma-separated ISR key format. When a new leader takes over, it deletes the legacy key to clean up. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Preserve backwards-compatible `followers` API field, add `merged_replicas` as new field - Remove unnecessary defensive code for followers not yet in etcd - Add error handling (rescue Etcd::Error) in mark_insync/mark_out_of_sync - Migrate legacy ISR key to new per-replica format when detected as in-sync - Add specs for known_replicas, forget_replica, and sequential ID generation - Remove openssl_sni.cr monkey-patch, use Crystal stdlib on_server_name Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Older LavinMQ versions read the comma-separated isr key to determine sync status. Continue writing it alongside the new per-replica keys so rolling upgrades work correctly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR Review: Sequential cluster node IDs and per-replica insync tracking1. Fiber and connection leak in
|
Summary
Changes
Use individual etcd keys for ISR tracking
Instead of maintaining a comma-separated list at a single "isr" key, each replica now has its own key at
{prefix}/replica/{id}/insyncwith value "1" for in-sync and "0" for out-of-sync.Show all known replicas in nodes page
known_replicasmethod to Clustering::Server that queries etcdAdd forget button for disconnected replicas
DELETE /api/nodes/:idlavinmqctl forget_cluster_node <id>CLI commandGenerate sequential cluster node IDs
Test plan
lavinmqctl forget_cluster_nodeworks🤖 Generated with Claude Code