QQ DLX: use a pipeline command for checkout#15548
Conversation
07f2161 to
c59ef3c
Compare
|
Relevant logs of the flake: |
| handle_cast({queue_event, QRef, Evt}, | ||
| #state{dlx_client_state = DlxState0, | ||
| queue_type_state = QTypeState0} = State0) -> | ||
| case maybe_handle_dlx_event(Evt, DlxState0) of | ||
| {ok, DlxState, Actions} -> | ||
| State1 = State0#state{dlx_client_state = DlxState}, | ||
| State = handle_queue_actions(Actions, State1), | ||
| {noreply, State}; | ||
| {reject, _DlxState} -> | ||
| {stop, {shutdown, not_leader}, State0}; | ||
| ignore -> | ||
| case rabbit_queue_type:handle_event(QRef, Evt, QTypeState0) of | ||
| {ok, QTypeState1, Actions} -> | ||
| State1 = State0#state{queue_type_state = QTypeState1}, | ||
| State = handle_queue_actions(Actions, State1), | ||
| {noreply, State}; | ||
| {eol, Actions} -> | ||
| State = handle_queue_actions(Actions, State0), | ||
| remove_queue(QRef, State); | ||
| {protocol_error, _Type, _Reason, _Args} -> | ||
| {noreply, State0} | ||
| end | ||
| end; |
There was a problem hiding this comment.
Can we simplify this?
Conceptually after the worker sent its initial {dlx, #checkout{}} message with a correlation, it's in a separate state registering. The only message it should receive in this state is a reply with that given correlation. Maybe we can make this more apparent by introducing a rabbit_fifo_dlx_client:handle_registration function? If successful, the state changes to registered. Only thereafter will the worker handle other messages including dlx deliveries from its source queue and publisher confirms from target queues.
There was a problem hiding this comment.
I've refactored the code. Please take a look if that's what you had in mind
There was a problem hiding this comment.
Thank you. The code is much simpler now.
The only message it should receive in this state is a reply with that given correlation.
@kjnilsson gave me a hint that I was wrong on this one. Apparently, the applied notifications in Ra (including that the dlx checkout command was applied) are delivered after Ra effects (which may already include dlx deliveries). So, we have 3 options I suppose:
- Stash any early (pre-registartion) dlx deliveries that the dlx worker proc receives in its state, or
- Send the dlx worker checkout command for pre evaluation to the aux machine (similar as done in ) The leader will then only append this Ra command to its log if the command was sent from a local dlx worker, or
rabbitmq-server/deps/rabbit/src/rabbit_fifo_client.erl
Lines 981 to 982 in 5388ef9
- Modify Ra to have an option so that ra:process_command() won't redirect to the leader.
What do you think? Which approach is the simplest and/or safest?
Before this change, with two leader elections in short succession, we could end up in a situation where no DLX worker is running. single_dlx_worker test appeared to be flaky, but it actually triggered a real race condition: A node that briefly was a leader, started a DLX worker which registered itself as a consumer via ra:process_command. When that node loses leadership a moment later, process_command on the now-follower Ra server redirects the checkout command to the actual leader, where it gets committed to the Raft log. This overwrites the new/real leader's worker registration, terminating the leader's worker. The result is that the leader has no DLX worker: the old leader's worker is stopped to allow the new leader to take over, but the new leader's worker exits due to a stale checkout.
22f4b07 to
e0443bd
Compare
Before this change, with two leader elections in short succession, we could end up in a situation where no DLX worker is running.
single_dlx_worker test appeared to be flaky, but it actually triggered a real race condition:
A node that briefly was a leader, started a DLX worker which registered itself as a consumer via ra:process_command.
When that node loses leadership a moment later, process_command on the now-follower Ra server redirects the checkout command to the actual leader, where it gets committed to the Raft log. This overwrites the new/real leader's worker registration, terminating the leader's worker.
The result is that the leader has no DLX worker: the old leader's worker is stopped to allow the new leader to take over, but the new leader's worker exits due to a stale checkout.
This PR should address this failure:
https://github.com/rabbitmq/rabbitmq-server/actions/runs/22343794026/job/64653112641?pr=15502