Replies: 3 comments · 14 replies
-
Channels are not supposed to "claim" unacknowledged deliveries. Unacknowledged deliveries are not "picked up" by other channels. Those deliveries are automatically requeued when their channels are closed. In order to reason about the unacknowledged deliveries metric of a queue that has undergone a leader election, we need an executable way to reproduce plus logs.
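To make the ownership model concrete, here is a minimal Python sketch of the semantics described above (an illustration only, not RabbitMQ internals): each channel tracks its own unacked deliveries, and closing a channel requeues them rather than handing them to another channel.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Channel:
    # Unacked deliveries are tracked per channel, keyed by delivery tag.
    unacked: dict = field(default_factory=dict)

@dataclass
class Queue:
    ready: deque = field(default_factory=deque)

    def deliver(self, ch: Channel, tag: int) -> None:
        # A delivery moves from ready into *this channel's* unacked map;
        # no other channel knows about it.
        ch.unacked[tag] = self.ready.popleft()

    def ack(self, ch: Channel, tag: int) -> None:
        del ch.unacked[tag]  # settled; never requeued

    def close_channel(self, ch: Channel) -> None:
        # On channel close, its own unacked deliveries go back to ready.
        # They are never "claimed" by a surviving channel.
        for msg in ch.unacked.values():
            self.ready.append(msg)
        ch.unacked.clear()

q = Queue(ready=deque(["m1", "m2"]))
a, b = Channel(), Channel()
q.deliver(a, 1)
q.deliver(b, 1)
q.ack(b, 1)          # b settles its delivery
q.close_channel(a)   # a's delivery "m1" is requeued, not handed to b
print(list(q.ready))  # ['m1']
```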
-
Certainly, here's my rabbitmq.conf (the referenced definitions.json defines only users and access controls). I'm not setting … Got it; next time it happens, I'll run the …
-
A stroke of luck: it happened again and I was able to grab some diagnostic information. First, the rabbitmq-diagnostics member_overview output for db_loader. I was also able to find another example of the connection pattern. This time, I had exactly 80 unacked messages (two Celery workers' prefetch count) "stuck" and confirmed that around the time they appeared in metrics, rabbitmq-01 and rabbitmq-02 lost connection to one another. 02 was the leader with one consumer connection; the other two consumers were connected to 01. It appears that the number of unacked messages correlates with the number of Celery workers connected to the quorum queue follower when the connection between the follower and leader is lost. In the previous instance it was one consumer, so 40 unacked messages; this time it was two, so 80 unacked messages. Here are the additional consumer connection details, the rabbitmq-01 logs from the event, and the rabbitmq-02 logs from the event.
-
@bkienker what does "example of the connection pattern" mean specifically? The Celery prefetch of 40 multiplied by the number of connections, or something else?
-
I meant to answer this question specifically: "Did you notice any pattern regarding where the consumer is connected?" It's only two occurrences, but in both cases, the number of lingering unacked messages matches the number of consumers that were connected through a follower when the follower lost its connection to the leader. In the first case, it was one Celery worker (40 unacked messages). In the second case, it was two Celery workers (80 unacked messages). I'm thinking that this is perhaps the beginning of a pattern where:
-
I've had this happen again, and I believe it reinforces the pattern I outlined. This time, the three Celery workers were each connected through a separate RabbitMQ instance: one through 01, one through 02, and one through the leader, 03. rabbitmq-01 lost its connection to the leader, and 40 unacked messages appeared and lingered in the queue. Notably, rabbitmq-02 did not lose its connection to the leader, which means that once again the number of unacked messages matches the combined prefetch of the Celery workers connected to RabbitMQ instances that lost connection to the current leader.
-
When comparing HTTP API metrics to what … reports: there is a way to clear the management plugin's stats DB.
-
@bkienker can you retest with #13885, please? It introduces consumer timeouts in the queue, which will return messages to the queue after the timeout (30 minutes by default, but configurable per queue and even per consumer). See https://github.com/rabbitmq/rabbitmq-server/blob/ra-v3/consumer_timeouts.md
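As a sketch of the behavior that PR describes (illustrative only: the 30-minute default comes from the comment above, while the data structures and function here are hypothetical, not the server's implementation), a delivery that stays unacked past the timeout is returned to the ready queue:

```python
from dataclasses import dataclass

@dataclass
class Delivery:
    msg: str
    delivered_at: float  # seconds on some monotonic clock

# 30 minutes by default per the PR description; configurable per queue
# and per consumer (exact knobs are in the linked doc).
TIMEOUT = 30 * 60

def requeue_timed_out(unacked: dict, ready: list, now: float,
                      timeout: float = TIMEOUT) -> None:
    """Move deliveries pending longer than `timeout` back to the ready queue."""
    expired = [t for t, d in unacked.items() if now - d.delivered_at >= timeout]
    for tag in expired:
        ready.append(unacked.pop(tag).msg)

ready: list = []
unacked = {1: Delivery("m1", delivered_at=0.0),
           2: Delivery("m2", delivered_at=1000.0)}
requeue_timed_out(unacked, ready, now=1800.0)  # 30 minutes after first delivery
print(ready, sorted(unacked))  # ['m1'] [2]
```

Under this model the orphaned deliveries described in this thread would self-heal after the timeout instead of requiring a worker restart.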
-
Absolutely, thank you for the suggestion. I will give that a try tomorrow. I will also make another attempt to reproduce the problem in my test environment.
-
Thank you. You can use the following Docker image to test this work-in-progress PR in your test environment:
-
Describe the bug
After a network disruption triggers a leader election on a quorum queue used by Celery workers, the queue reports unacknowledged messages that no channel claims ownership of. These messages are never redelivered and persist until the specific consumer whose connection is associated with the orphaned state is restarted. While orphaned, the messages prevent Raft log truncation, causing segment files to accumulate indefinitely until disk fills.
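To illustrate why a single orphaned delivery can block truncation indefinitely, here is a loose sketch (the function and indexing are illustrative, not the actual Ra implementation): a Raft log can only be truncated below the oldest entry that is still pending, so one stuck unacked delivery pins the snapshot index while new segment files keep accumulating behind it.

```python
def truncatable_up_to(applied_index: int, pending_indices: set) -> int:
    # The log can only be truncated below the oldest entry that is still
    # pending (e.g. an unacknowledged delivery); one orphan pins everything.
    return min(pending_indices, default=applied_index + 1) - 1

# Healthy: nothing pending, so the log can be truncated to the applied index.
print(truncatable_up_to(100_000, set()))          # 100000

# One orphaned unacked delivery at index 1234 pins truncation there,
# no matter how far the queue has advanced since.
print(truncatable_up_to(100_000, {1_234, 99_000}))  # 1233
```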
Reproduction steps
I have struggled to reproduce the problem manually. It seems to happen only during sustained network instability, which I have not been able to replicate in development. Here's what seems to happen:
A network event causes brief inter-node connectivity loss followed by a leader election on our high-traffic quorum queue. After the cluster recovers, `list_queues` shows `messages_unacknowledged = 40` with `messages_ready = 0`. However, `list_channels` shows `messages_unacknowledged = 0` across every channel in the cluster. The management API confirms all three consumers are active and healthy with `activity_status = up`.
The 40 messages correspond exactly to one consumer's prefetch (`worker_prefetch_multiplier=4` x `concurrency=10` = 40). In previous incidents on busier clusters the count was 80, corresponding to two consumers' prefetch. The count is always an exact multiple of the prefetch configuration.
The queue continues accepting and delivering new messages normally. The orphaned messages sit in the unacked state permanently. Restarting the Celery workers one at a time, I found that only one specific worker's restart clears the phantom messages and triggers segment truncation; the others have no effect.
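One way to spot this state programmatically is to compare the queue's unacked counter against the sum across channels. The field names below match the management HTTP API objects (`GET /api/queues/<vhost>/<name>` and `GET /api/channels`), but the payloads are trimmed, hypothetical samples shaped like the incident data:

```python
# Trimmed sample payloads; in practice these would come from the
# management HTTP API via an authenticated GET request.
queue = {"name": "db_loader", "messages_ready": 0, "messages_unacknowledged": 40}
channels = [
    {"name": "ch1", "messages_unacknowledged": 0},
    {"name": "ch2", "messages_unacknowledged": 0},
    {"name": "ch3", "messages_unacknowledged": 0},
]

# Messages the queue counts as unacked but no channel accounts for.
channel_total = sum(c["messages_unacknowledged"] for c in channels)
orphaned = queue["messages_unacknowledged"] - channel_total
print(orphaned)  # 40
```

A nonzero `orphaned` value while all consumers report healthy is the signature described in this report.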
Expected behavior
RabbitMQ will reconcile `messages_ready` with `messages_unacknowledged` during elections.
Additional context
RabbitMQ version: 4.1.4 and 4.2.3 (I upgraded to 4.2.3 to get into community support and see if it would fix the problem, but I'm still running into it)
Erlang version: OTP 27 (erts-15.2.7.5)
Client: py-amqp 5.3.1 / Kombu 5.5.4 / Celery 5.5.3
Cluster: 3-node quorum queue cluster, Khepri metadata store
OS: RHEL 8.10, rootless Podman containers
Here's a snapshot of the state of the cluster when this happens:
Queue state (db_loader is a quorum queue, leader on rmq-03):
Consumers on the affected queue (3 consumers, prefetch 40 each):
Channel unacked counts: every channel across all three nodes reports 0. There are 200+ channels total. Sample:
Management API output for the queue (trimmed to relevant fields):
Quorum status shows all nodes fully caught up with identical snapshot indices:
Despite matching snapshot indices, segment files were accumulating on all nodes (taken simultaneously):
Please let me know if there is any other data from my environment that would be helpful in troubleshooting.