-
Notifications
You must be signed in to change notification settings - Fork 2.3k
[Bugfix] Broken Heartbeat system in Row Streamer #18390
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bugfix] Broken Heartbeat system in Row Streamer #18390
Conversation
Signed-off-by: siddharth16396 <[email protected]>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
|
Nice, and good catch! Is this an issue in older versions of vitess as well? Do you think there's a test case you could write that fails before / passes after? |
|
@arthurschreiber : Yes this seems to be an issue on atleast v20,21,22 (so i'm guessing its on all) WRT the test cases, i'll try to take a stab at it, since it was just a very minor change i didn't bother diving into it. |
|
@siddharth16396 thanks for the fix. Is there a way we can ensure a stream gets continuous heartbeats in a unit test? I'm hoping there is an existing test we can add more validation to |
Signed-off-by: siddharth16396 <[email protected]>
|
@timvaillancourt @arthurschreiber : I've added a UT which should work. |
Signed-off-by: siddharth16396 <[email protected]>
timvaillancourt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siddharth16396 thanks for the new test. LGTM
rohit-nayak-ps
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch!
Signed-off-by: siddharth16396 <[email protected]>
Signed-off-by: siddharth16396 <[email protected]>
Signed-off-by: siddharth16396 <[email protected]>
…) (#18397) Signed-off-by: siddharth16396 <[email protected]> Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
Signed-off-by: siddharth16396 <[email protected]>
Signed-off-by: siddharth16396 <[email protected]>
Signed-off-by: siddharth16396 <[email protected]>
…) (#18398) Signed-off-by: siddharth16396 <[email protected]> Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
…tests * origin/master: (32 commits) test: Fix race condition in TestStreamRowsHeartbeat (vitessio#18414) VReplication: Improve permission check logic on external tablets on SwitchTraffic (vitessio#18348) Perform post copy actions in atomic copy (vitessio#18411) Update `operator.yaml` (vitessio#18364) Feature(onlineddl): Add shard-specific completion to online ddl (vitessio#18331) Set parsed comments in operator for subqueries (vitessio#18369) `vtorc`: move shard primary timestamp to time type (vitessio#18401) `vtorc`: rename `isClusterWideRecovery` -> `isShardWideRecovery` (vitessio#18351) `vtorc`: remove dupe keyspace/shard in replication analysis (vitessio#18395) Topo: Add NamedLock test for zk2 and consul and get them passing (vitessio#18407) Handle MySQL 9.x as New Flavor in getFlavor() (vitessio#18399) Add support for sending grpc server backend metrics via ORCA (vitessio#18282) asthelpergen: add design documentation (vitessio#18403) `vtorc`: add keyspace/shard labels to recoveries stats (vitessio#18304) `vtorc`: cleanup `database_instance` location fields (vitessio#18339) avoid derived tables for UNION when possible (vitessio#18393) [Bugfix] Broken Heartbeat system in Row Streamer (vitessio#18390) Update MAINTAINERS.md (vitessio#18394) move vmg to emeritus (vitessio#18388) Make sure to check if the server is closed in etcd2topo (vitessio#18352) ...
…) (#18396) Signed-off-by: siddharth16396 <[email protected]> Signed-off-by: Shlomi Noach <[email protected]> Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com> Co-authored-by: Shlomi Noach <[email protected]>
Signed-off-by: siddharth16396 <[email protected]> Signed-off-by: Tanjin Xu <[email protected]>
* [Bugfix] Broken Heartbeat system in Row Streamer (vitessio#18390) Signed-off-by: siddharth16396 <[email protected]> Signed-off-by: Tanjin Xu <[email protected]> * test: Fix race condition in TestStreamRowsHeartbeat (vitessio#18414) Signed-off-by: siddharth16396 <[email protected]> Signed-off-by: Tanjin Xu <[email protected]> * fix spaces Signed-off-by: Tanjin Xu <[email protected]> * undo golangci Signed-off-by: Tanjin Xu <[email protected]> * undo golangci Signed-off-by: Tanjin Xu <[email protected]> --------- Signed-off-by: siddharth16396 <[email protected]> Signed-off-by: Tanjin Xu <[email protected]> Co-authored-by: siddharth16396 <[email protected]>
Signed-off-by: siddharth16396 <[email protected]>
Signed-off-by: siddharth16396 <[email protected]> Co-authored-by: siddharth16396 <[email protected]>
This PR implements the fix for the bug mentioned #18391
TL;DR
VStreamRows uses a broken heartbeat mechanism that only sends one heartbeat and then exits. In contrast, VStream (used in normal binlog replication) sends heartbeats continuously. This PR fixes the issue by introducing a proper loop in the rowstreamer.go heartbeat goroutine to ensure heartbeats are sent throughout the copy phase.
We found this out because Some of our long-running schema changes failed due to missing heartbeats during the copy phase of VReplication.
Description
This is already patched in our internal fork of vitess and has been tested to work correctly for all the failing cases we saw.
We've identified a critical architectural inconsistency in Vitess VReplication that explains this seemingly contradictory behavior. The reason some long-running schema changes don't get heartbeats while others do is because different phases of the operation use different heartbeat mechanisms.
The Two Different Heartbeat Systems
1. Working Heartbeat System (vstreamer.go)
Used by normal VReplication (binlog streaming phase):
This system works correctly - it continuously sends heartbeats every 900ms.
2. Broken Heartbeat System (rowstreamer.go)
Used by table copying phase (VStreamRows):
This system is broken - it only sends one heartbeat and then the goroutine dies.
Which Operations Use Which System
Operations That DO Get Heartbeats:
These use
vplayer.go→VStream()→vstreamer.go(working heartbeats)Operations That DON'T Get Heartbeats:
These use
vcopier.go→VStreamRows()→rowstreamer.go(broken heartbeats)The Code Flow
Copy Phase (Broken Heartbeats):
Normal Replication (Working Heartbeats):
Why You See This Pattern
VStreamRowsThe Fix
The immediate fix is to add a
forloop to the rowstreamer heartbeat goroutine:This explains the inconsistent behavior we observed - it's not random, it's deterministic based on which code path the operation takes. The bug only affects the table copying component, not the binlog streaming component.
Checklist