Introduction
This document focuses on reintroducing the atomic distributed transaction implementation and addressing the shortcomings with improved and robust support.
Background
Existing System Overview
Vitess has three transaction modes: Single, Multi, and TwoPC.
In Single Mode, any transaction that spans more than one shard is rolled back immediately. This mode keeps the transaction to a single shard and provides ACID-compliant transactions.
In Multi Mode, a commit on a multi-shard transaction is handled with a best-effort commit. Any commit failure on a shard rolls back the shard transactions that have not yet committed; the shards that already committed, along with the failed shard, are left for the application to handle.
In TwoPC Mode, a commit on a multi-shard transaction follows a sequence of steps to achieve an atomic distributed commit. The existing design document is extensive and explains all the component interactions needed to support it. It also highlights the different failure scenarios and how they should be handled.
Existing Implementation
A Two-Phase commit protocol requires a Transaction Manager (TM) and Resource Managers (RMs).
Resource Managers are the participating VTTablets for the transaction. Their role is to prepare the transaction and return a success or failure response. During the prepare phase, RMs store all the queries executed on that transaction in recovery logs as statements. If an RM fails, upon coming back online, it prepares all the transactions using the transaction recovery logs by executing the statements before accepting any further transactions or queries.
The Transaction Manager role is handled by VTGate. On commit, VTGate creates a transaction record and stores it in one of the participating RMs, designating it as the Metadata Manager (MM). VTGate then issues a prepare request to the other involved RMs. If any RM responds with a failure, VTGate decides to roll back the transaction and stores this decision in the MM. VTGate then issues a rollback prepared request to all the involved RMs.
If all RMs respond successfully, VTGate decides to commit the transaction. It issues a start commit to the MM, which commits the ongoing transaction and stores the commit decision in the transaction record. VTGate then issues a commit prepared request to the other involved RMs. After committing on all RMs, VTGate concludes by removing the transaction record from the MM.
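A minimal Go sketch of this orchestration is shown below. The MetadataManager and ResourceManager interfaces and their method names are hypothetical stand-ins for the actual VTGate-to-VTTablet RPCs, and error handling is reduced to the protocol decisions described above.

```go
// Hypothetical stand-ins for the VTGate -> VTTablet RPCs used in the commit flow.
type MetadataManager interface {
	CreateTransactionRecord(dtid string, participants []string) error
	StoreRollbackDecision(dtid string) error
	StartCommit(dtid string) error // commits the MM's own transaction and records the commit decision
	DeleteTransactionRecord(dtid string) error
}

type ResourceManager interface {
	PrepareTransaction(dtid string) error
	CommitPrepared(dtid string) error
	RollbackPrepared(dtid string) error
}

func commit2PC(dtid string, participants []string, mm MetadataManager, rms []ResourceManager) error {
	// 1. Create the transaction record on the MM.
	if err := mm.CreateTransactionRecord(dtid, participants); err != nil {
		return err
	}
	// 2. Prepare on every other participating RM.
	for _, rm := range rms {
		if err := rm.PrepareTransaction(dtid); err != nil {
			// Any prepare failure flips the decision to rollback.
			_ = mm.StoreRollbackDecision(dtid)
			for _, r := range rms {
				_ = r.RollbackPrepared(dtid)
			}
			return err
		}
	}
	// 3. Commit on the MM and store the commit decision in the transaction record.
	if err := mm.StartCommit(dtid); err != nil {
		return err // the watcher service resolves the prepared transactions later
	}
	// 4. Commit the prepared transactions on the RMs.
	for _, rm := range rms {
		if err := rm.CommitPrepared(dtid); err != nil {
			return err // the watcher service finishes the remaining commits later
		}
	}
	// 5. Conclude by removing the transaction record.
	return mm.DeleteTransactionRecord(dtid)
}
```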
All MMs have a watcher service that monitors unresolved transactions and transmits them to the TM for resolution.
Benefits of the Existing Approach:
- The application does not have to communicate upfront about transactions going cross-shard.
- The TM stays stateless by storing the transaction metadata on one of the RMs.
- Designating one of the participating RMs as the MM avoids a separate Prepare phase for that shard.
- A transaction that spans only a single shard is committed through the normal, non-2PC workflow.
Problem Statement
The existing implementation of atomic distributed commit is a modified version of the Two-Phase Commit (2PC) protocol that addresses its inherent issues while making practical trade-offs. This approach efficiently handles single-shard transactions and adopts a realistic method for managing transactions across multiple shards. However, there are issues with the watchdog design, as well as other reliability concerns. Additionally, there are workflow improvements and performance enhancements that need to be addressed. This document will highlight these issues and provide solutions with the rework.
Existing Issues and Proposal
1. Distributed Transaction Identifier (DTID) Generation
The Transaction Manager (TM) designates the first participant of the transaction as the MM and generates the DTID from the MM's shard prefix and the transaction ID. While this ensures uniqueness across shards, it introduces potential collisions because the auto-increment transaction ID resets when a VTTablet restarts.
Impact:
- Additional Recovery Workflow: To prevent collisions, on startup all VTTablets must ensure their last transaction ID is set to the maximum value of the in-flight distributed transactions.
- Risk of DTID Collision: If synchronization is missed or fails, VTGate might generate DTIDs that collide with existing in-flight transactions. This could result in incorrect transactions being modified or committed, leading to data corruption and loss of transactional integrity.
- Exhaustion of ID Space: ID ranges will keep being skipped on restarts, and over time there is a risk of exhausting the auto-increment ID range.
Proposals:
- Proposal 1: Use a centralized sequence generator that ensures a unique ID across keyspaces and shards. The sequence value is then mapped to a shard using a hash function, and that shard's primary becomes the MM for that transaction.
- Proposal 2: The TM creates the DTID using UUIDv7 or a Nano ID and follows the same mapping as Proposal 1. The generated DTID might not be unique, in which case the Create Transaction Record API on the MM fails and the transaction is rolled back.
- Proposal 3: The first participant of the transaction becomes the MM. It creates the DTID from a 32-byte keyspace-shard name prefix along with a 14-byte Nano ID when the TM invokes the Create Transaction Record API.
- Proposal 4: The TM creates the DTID as in Proposal 3 and sends it to the MM to store as part of the Create Transaction Record API. The generated DTID might not be unique, in which case the Create Transaction Record API on the MM fails and the transaction is rolled back.
Conclusion:
Proposal 1 is sound but adds a dependency on a new system to provide the DTID. Proposal 2 removes that dependency by having the TM generate the DTID, but it risks generating a duplicate DTID that fails the Create Transaction Record API, leading to a transaction rollback. Proposal 3 guarantees a unique DTID but produces a long DTID key. Proposal 4 carries the same collision risk, causing a transaction rollback on the Create Transaction Record API call.
Proposals 1 & 2 can map the DTID to a non-participating RM, making it the MM; the additional network calls increase the system's commit latency. Proposals 3 & 4 avoid this extra hop but significantly increase the DTID size. In the overall commit process, the efficiency gain from using one of the participating RMs as the MM outweighs the cost of the larger DTID.
Proposal 3 looks like the most balanced and reliable option here.
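A minimal sketch of Proposal 3's DTID construction, assuming a Nano ID library such as github.com/matoous/go-nanoid; the separator, lengths, and library choice here are illustrative rather than final:

```go
import (
	"fmt"

	gonanoid "github.com/matoous/go-nanoid/v2" // assumed Nano ID library
)

// newDTID builds a DTID per Proposal 3: the keyspace/shard of the first
// participant (which becomes the MM) as a prefix, plus a short random Nano ID.
func newDTID(keyspace, shard string) (string, error) {
	id, err := gonanoid.New(14) // 14-character Nano ID; length is illustrative
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s/%s:%s", keyspace, shard, id), nil
}
```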
2. Transaction Resolution Design
The MM is currently given a fixed IP address for the TM at startup, which it uses to invoke the TM's ResolveTransaction API for unresolved transactions.
Impact:
- Operational Overhead: If the TM's IP address changes, the MM cannot contact the TM without a manual update and a restart of the MM.
- Single Point of Failure: If the TM is down or unreachable, the MM has no way to get its unresolved transactions resolved.
- Scalability: A single TM resolving all transactions limits the ability to scale horizontally.
Proposals:
- Proposal 1: The MM notifies VTGates of pending unresolved transactions via the vttablet health stream. A VTGate then invokes the MM to retrieve the unresolved transaction details. The MM manages the concurrent invocations from multiple VTGates and hands each one a different set of unresolved transactions. Each VTGate then resolves its transactions based on their current state.
- Proposal 2: VTOrc tracks the unresolved transactions from the MM and notifies the VTGates to resolve them. This would require service discovery for VTGates.
Conclusion:
Proposal 1 is the more practical choice: it uses existing infrastructure that is proven and already serves other purposes such as real-time stats and schema tracking, whereas Proposal 2 requires building a full VTGate service discovery system.
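A rough sketch of how Proposal 1 could look; the struct fields, RPC name, and batching behaviour below are assumptions, not the actual Vitess health-stream or query-service definitions:

```go
// Hypothetical shape of Proposal 1: the MM piggybacks an unresolved-transaction
// signal on its health stream, and a VTGate that sees it asks the MM for a
// batch of transactions to resolve.
type healthStreamUpdate struct {
	UnresolvedTransactions bool // set by the MM when it has pending distributed transactions
}

type metadataManager interface {
	// FetchUnresolved hands each calling VTGate a disjoint batch of DTIDs,
	// so concurrent VTGates do not resolve the same transaction twice.
	FetchUnresolved(limit int) ([]string, error)
}

func onHealthUpdate(u healthStreamUpdate, mm metadataManager, resolve func(dtid string) error) {
	if !u.UnresolvedTransactions {
		return
	}
	dtids, err := mm.FetchUnresolved(10)
	if err != nil {
		return // retried on the next health-stream update
	}
	for _, dtid := range dtids {
		_ = resolve(dtid) // resolution follows the watcher flow in the Rework Design section
	}
}
```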
3. Connection Settings
The current implementation does not store connection setting changes in the transaction recovery logs. This omission risks the integrity and consistency of the distributed transaction during failure recovery.
Impact:
- Data Integrity Risk: When the transaction is recovered using the recovery logs, the session state may not reflect the connection settings that were in effect when the transaction was originally executed. This can lead to inconsistencies in transaction behaviour such as differences in character sets, time zones, isolation levels and other session-specific settings.
- Failure Recovery Issue: Without the stored connection settings, some statements may fail to apply during recovery, leading to failed prepared transactions and loss of transaction atomicity.
Proposal: Along with the redo statement logs, the connection settings will be stored as SET statements in the order in which they were executed. On recovery, the same sequence will be used to prepare the transaction.
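An illustrative shape for the redo log under this proposal, with a hypothetical entry type showing how SET statements keep their execution order relative to the DMLs:

```go
// Hypothetical redo-log layout: SET statements occupy the position at which
// they were executed, so replay reproduces the session state that was in
// effect for each DML.
type redoEntry struct {
	Sequence  int
	IsSetting bool   // true for a session SET statement
	Statement string
}

// Example: a transaction that changes the time zone between two inserts.
var exampleRedoLog = []redoEntry{
	{Sequence: 0, IsSetting: true, Statement: "set time_zone = '+00:00'"},
	{Sequence: 1, IsSetting: false, Statement: "insert into t(id, created_at) values (1, now())"},
	{Sequence: 2, IsSetting: true, Statement: "set time_zone = '+05:30'"},
	{Sequence: 3, IsSetting: false, Statement: "insert into t(id, created_at) values (2, now())"},
}
```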
4. Prepared Transactions Connection Stability
The current implementation assumes a stable MySQL connection after preparing a transaction on an RM. Any connection disruption will roll back the transaction and may cause data inconsistency due to modifications by other concurrent transactions.
Impact:
- Transaction Atomicity Loss: Recovery of the prepared transaction can fail because the underlying data may have changed; the transaction is then deemed unrecoverable, and the atomicity of the transaction is lost.
Proposals:
- Proposal 1: All MySQL connections participating in a distributed transaction must use a Unix socket. The Prepare Transaction step will fail if a network connection is used.
- Proposal 2: Move the locking mechanism into MySQL. This ensures the rows touched by the transaction remain locked against modification by other transactions even if the connection drops, and MySQL should provide a recovery mechanism for such locked rows. MySQL solves this problem via the XA protocol.
Conclusion:
Proposal 1 is recommended for immediate adoption to enhance connection stability and prevent unreliable TCP connections. If testing identifies issues with Unix socket stability, Proposal 2 will be implemented to leverage MySQL's XA protocol for transactional integrity and recovery.
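A minimal sketch of the Proposal 1 guard, assuming a hypothetical connection-parameter struct rather than the actual Vitess type:

```go
import "fmt"

// connParams is an illustrative connection-parameter struct.
type connParams struct {
	Host       string // set for TCP connections
	UnixSocket string // set when connecting over a Unix socket
}

// prepareAllowed rejects a prepare request unless the MySQL connection was
// opened over a Unix socket.
func prepareAllowed(p connParams) error {
	if p.UnixSocket == "" {
		return fmt.Errorf("2PC prepare requires a Unix socket connection to MySQL, got TCP host %q", p.Host)
	}
	return nil
}
```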
5. Transaction Recovery Logs Application Reliability
The current implementation stores the transaction recovery logs as DML statements. Applying these statements during transaction recovery is not expected to fail, since the current shutdown and startup workflow ensures that no other DMLs leak into the database. Still, there remains a risk of a statement failing during redo log application, potentially resulting in lost modifications without clear tracking of the modified rows.
Impact:
- Lack of Visibility: In case of a failed statement application, it becomes unclear which rows were originally modified, complicating data recovery.
Proposals:
- Proposal 1: Implement a copy-on-write approach, where raw mutations are stored in a separate shadow table. Upon commit, these mutations are materialized to the primary data tables. This will ensure that any failure during redo log application can be traced back to the original modifications, improving data recovery and maintaining transactional integrity.
- Proposal 2: MySQL's XA protocol support handles the transaction recovery logs for distributed transactions, which would eliminate the need for Vitess to store them.
Conclusion:
Currently, neither proposal will be implemented, as the expectation is that redo log application should not fail during recovery. Should any recovery test fail due to redo log application issues, Proposal 2 will be prioritized for its inherent advantages over Proposal 1.
6. Unsupported Consistent Lookup Vindex
The current implementation disallows the use of consistent lookup vindexes and upfront rejects any distributed transaction commit involving them.
Impact:
- Operational Disruption: Existing Vitess clusters using consistent lookup vindexes must drop them before enabling distributed commits, causing operational disruption and requiring significant changes to the existing setup.
Proposal: Continue to allow consistent lookup vindexes. The pre-transaction continues to work as-is, and any failure on the pre-transaction commit rolls back the main transaction. The post-transaction proceeds only once the distributed transaction has completed; otherwise, it is rolled back.
7. Resharding, Move Tables, and Online Schema Change Not Accounted For
The current implementation does not handle the complications of running a resharding, move tables, or online schema change workflow in parallel with in-flight prepared distributed transactions.
Impact:
- Transaction Atomicity Loss: Running these workflows in parallel with distributed transactions can destabilize prepared transactions, potentially compromising the atomicity guarantee.
Proposals:
- Proposal 1: All VReplication-related workflows need to account for ongoing distributed transactions during their cut-over. A new tablet manager API will be added to check whether it is safe to continue the cut-over. This API will take a lock on the involved tables, which the workflow must release at the end. VTOrc will handle unlocking the tables if the workflow is abandoned for any reason.
- Proposal 2: Examine each kind of workflow to determine whether a safe cut-over is possible, and implement case-by-case support for distributed transactions across the cut-over.
Conclusion:
Proposal 1 is easier to reason about because all workflows use the same strategy, and the new API can be extended to other flows as well.
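A rough sketch of what the Proposal 1 tablet manager API could look like; the interface, method names, and lock-identifier scheme are assumptions, not an existing Vitess RPC:

```go
import "context"

// Hypothetical tablet manager RPCs: before cut-over, a VReplication workflow
// checks that the involved tables carry no in-flight prepared transactions and
// takes a lock that it (or VTOrc, if the workflow is abandoned) releases later.
type cutoverSafety interface {
	// LockTablesForCutover fails if a prepared transaction currently touches
	// any of the tables; otherwise it blocks new prepares on them and returns
	// a lock identifier.
	LockTablesForCutover(ctx context.Context, tables []string) (lockID string, err error)

	// UnlockTablesForCutover releases the lock taken above, re-enabling
	// prepares on the tables.
	UnlockTablesForCutover(ctx context.Context, lockID string) error
}
```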
Exploratory Work
MySQL XA was considered as an alternative to having RMs manage the transaction recovery logs and hold up the row locks until a commit or rollback occurs.
There are currently 20 open bugs on XA. On MySQL 8.0.33, reproduction steps were followed for all the bugs, and 8 still persist. Out of these 8 bugs, 4 have patches attached that resolve the issues when applied. For the remaining 4 issues, changes will need to be made either in the code or the workflow to ensure they are resolved.
MySQL’s XA seems a probable candidate if we encounter issues with our implementation of handling distributed transactions that XA can resolve. XA’s usage is currently neither established nor ruled out in this design.
Rework Design
Commit Phase Interaction
The component interactions for the different cases are shown below.
Any error in the commit phase is indicated to the application with a warning flag. If an application's transaction receives a warning signal, it can execute `show warnings` to obtain the distributed transaction ID for that transaction, and it can watch the transaction status with `show transaction status for <dtid>`.
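For illustration, an application using a MySQL client library could follow up on the warning roughly as below; the two statements come from the design above, while the Go wrapper and the way the DTID is extracted from the warning message are assumptions:

```go
import (
	"database/sql"
	"fmt"
)

// followUpOnCommitWarning reads the DTID from `show warnings` after a commit
// that carried a warning, then checks `show transaction status for <dtid>`.
func followUpOnCommitWarning(db *sql.DB) error {
	var level, code, message, dtid string
	rows, err := db.Query("show warnings")
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		if err := rows.Scan(&level, &code, &message); err != nil {
			return err
		}
		dtid = message // assumption: the warning message carries the DTID
	}
	if err := rows.Err(); err != nil || dtid == "" {
		return err
	}
	status, err := db.Query(fmt.Sprintf("show transaction status for '%s'", dtid))
	if err != nil {
		return err
	}
	defer status.Close()
	cols, _ := status.Columns()
	fmt.Println("transaction status columns:", cols) // column layout not specified here
	return status.Err()
}
```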
Case 1: All components respond with success.
```mermaid
sequenceDiagram
participant App as App
participant G as VTGate
participant MM as VTTablet/MM
participant RM1 as VTTablet/RM1
participant RM2 as VTTablet/RM2
App ->>+ G: Commit
G ->> MM: Create Transaction Record
MM -->> G: Success
par
G ->> RM1: Prepare Transaction
G ->> RM2: Prepare Transaction
RM1 -->> G: Success
RM2 -->> G: Success
end
G ->> MM: Store Commit Decision
MM -->> G: Success
par
G ->> RM1: Commit Prepared Transaction
G ->> RM2: Commit Prepared Transaction
RM1 -->> G: Success
RM2 -->> G: Success
end
opt Any failure here does not impact the response to the application
G ->> MM: Delete Transaction Record
end
G ->>- App: OK Packet
```
Case 2: When a Commit Prepared Transaction call to an RM responds with an error. In this case, the watcher service needs to resolve the transaction and commit the pending prepared transactions.
```mermaid
sequenceDiagram
participant App as App
participant G as VTGate
participant MM as VTTablet/MM
participant RM1 as VTTablet/RM1
participant RM2 as VTTablet/RM2
App ->>+ G: Commit
G ->> MM: Create Transaction Record
MM -->> G: Success
par
G ->> RM1: Prepare Transaction
G ->> RM2: Prepare Transaction
RM1 -->> G: Success
RM2 -->> G: Success
end
G ->> MM: Store Commit Decision
MM -->> G: Success
par
G ->> RM1: Commit Prepared Transaction
G ->> RM2: Commit Prepared Transaction
RM1 -->> G: Success
RM2 -->> G: Failure
end
G ->>- App: Err Packet
```
Case 3: When the Store Commit Decision call to the MM responds with an error. In this case, the watcher service needs to resolve the transaction, as it is not certain whether the commit decision was persisted.
```mermaid
sequenceDiagram
participant App as App
participant G as VTGate
participant MM as VTTablet/MM
participant RM1 as VTTablet/RM1
participant RM2 as VTTablet/RM2
App ->>+ G: Commit
G ->> MM: Create Transaction Record
MM -->> G: Success
par
G ->> RM1: Prepare Transaction
G ->> RM2: Prepare Transaction
RM1 -->> G: Success
RM2 -->> G: Success
end
G ->> MM: Store Commit Decision
MM -->> G: Failure
G ->>- App: Err Packet
```
Case 4: When a Prepare Transaction fails, the TM decides to roll back the transaction. If any rollback returns a failure, the watcher service will resolve the transaction.
```mermaid
sequenceDiagram
participant App as App
participant G as VTGate
participant MM as VTTablet/MM
participant RM1 as VTTablet/RM1
participant RM2 as VTTablet/RM2
App ->>+ G: Commit
G ->> MM: Create Transaction Record
MM -->> G: Success
par
G ->> RM1: Prepare Transaction
G ->> RM2: Prepare Transaction
RM1 -->> G: Failure
RM2 -->> G: Success
end
par
G ->> MM: Store Rollback Decision
G ->> RM1: Rollback Prepared Transaction
G ->> RM2: Rollback Prepared Transaction
MM -->> G: Success / Failure
RM1 -->> G: Success / Failure
RM2 -->> G: Success / Failure
end
opt Rollback success on MM and RMs
G ->> MM: Delete Transaction Record
end
G ->>- App: Err Packet
```
Case 5: When Create Transaction Record fails, the TM rolls back the transaction.
```mermaid
sequenceDiagram
participant App as App
participant G as VTGate
participant MM as VTTablet/MM
participant RM1 as VTTablet/RM1
participant RM2 as VTTablet/RM2
App ->>+ G: Commit
G ->> MM: Create Transaction Record
MM -->> G: Failure
par
G ->> RM1: Rollback Transaction
G ->> RM2: Rollback Transaction
RM1 -->> G: Success / Failure
RM2 -->> G: Success / Failure
end
G ->>- App: Err Packet
```
Transaction Resolution Watcher
If there are long-pending distributed transactions on the MM, this watcher service ensures that the TM is invoked to resolve them.
```mermaid
sequenceDiagram
participant G1 as VTGate
participant G2 as VTGate
participant MM as VTTablet/MM
participant RM1 as VTTablet/RM1
participant RM2 as VTTablet/RM2
MM -) G1: Unresolved Transaction
MM -) G2: Unresolved Transaction
Note over G1,MM: MM sends this over health stream.
loop till no more unresolved transactions
G1 ->> MM: Provide Transaction details
G2 ->> MM: Provide Transaction details
MM -->> G2: Distributed Transaction ID details
Note over G2,MM: This VTGate receives the transaction to resolve.
end
alt Transaction State: Commit
G2 ->> RM1: Commit Prepared Transaction
G2 ->> RM2: Commit Prepared Transaction
else Transaction State: Rollback
G2 ->> RM1: Rollback Prepared Transaction
G2 ->> RM2: Rollback Prepared Transaction
else Transaction State: Prepare
G2 ->> MM: Store Rollback Decision
MM -->> G2: Success
opt Only when Rollback Decision is stored
G2 ->> RM1: Rollback Prepared Transaction
G2 ->> RM2: Rollback Prepared Transaction
end
end
opt Commit / Rollback success on MM and RMs
G2 ->> MM: Delete Transaction Record
end
```
Improvements and Enhancements
- Track which shards have had DML applied, since not every shard with an open transaction will have executed a DML. This can reduce the participating RMs in the distributed transaction: any shard with an open transaction but no DML applied can be closed without impacting the atomicity of the transaction (see the sketch after this list).
- All atomic transaction related APIs will be idempotent, so that if the TM tries to resolve the same DTID multiple times, the outcome is not affected.
- During commit, if a DTID has been generated and any error occurs in the commit flow, the application is notified via a warning message. The application can track the provided DTID via the `show transaction status for <dtid>` command.
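As a sketch of the first item above, the commit path could partition shard sessions by whether they executed a DML; the shardSession type and its fields are illustrative:

```go
// Only shards that actually executed a DML take part in the 2PC commit;
// shards holding an open but read-only transaction can be closed with a
// plain rollback without affecting atomicity.
type shardSession struct {
	Target        string // keyspace/shard
	TransactionID int64
	DMLCount      int // incremented for every DML executed on this shard
}

func splitParticipants(sessions []shardSession) (twoPC, readOnly []shardSession) {
	for _, s := range sessions {
		if s.DMLCount > 0 {
			twoPC = append(twoPC, s)
		} else {
			readOnly = append(readOnly, s)
		}
	}
	return twoPC, readOnly
}
```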
Implementation Plan
Task Breakdown:
- Modify the commit phase component interaction in TM based on new flows.
- Implement the new transaction resolution design
- Add support for tracking DTID state
- Modify existing API implementation to be idempotent
- Add prepared transaction protection from VTGate Restart
- Add prepared transaction protection from VTTablet Restart
- Add prepared transaction protection from MySQL Restart
- Add prepared transaction protection from PRS & ERS
- Add prepared transaction protection from Online DDL
- Add prepared transaction protection from Move Tables
- Add prepared transaction protection from Resharding - Making Reshard work smoothly with Atomic Transactions #16844
- In flight Distributed transaction visibility and actions through VTAdmin
- Record and store connection settings to the transaction recovery log - Support settings changes with atomic transactions #16974
- Record and store savepoints to the transaction recovery log
- Reject distributed commit on a network connection
- Tracking DMLs on shard transactions and using them for improved commit phase. #17386
- Add support for consistent lookup vindex with distributed transaction - Consistent lookup vindex tests for atomic distributed transactions #17393
- Emitting metrics and documenting the use case for them
- Troubleshooting tools - Add RPC to read the statements to be executed in an unresolved prepared transaction #17131
- User experience with how to handle unresolved transactions
- Benchmarking commit time
- Document Basic Guide
- Document User Guide
- Document Troubleshooting Guide - Add troubleshooting docs for atomic transactions website#1883
- Multi-Table DMLs in Fuzzer - Add multi table updates in the 2pc fuzzer testing #17293
- Lookup Vindex in fuzzer. - Add test for vindexes in atomic transactions package #17308
- Implement new DTID generator logic
Testing Strategy
This is the most important piece: it ensures all cases are covered and the APIs are tested thoroughly for correctness and scalability.
Test Plan
Basic Tests
Commit or rollback of transactions, and handling prepare failures leading to transaction rollbacks.
| Test Case | Expectation |
|---|---|
| Distributed Transaction - Commit | Transaction committed, transaction record cleared, and metrics updated |
| Distributed Transaction - Rollback | Transaction rolled back, transaction record cleared, and metrics updated |
| Distributed Transaction - Commit (Prepare to Fail on MM) | Transaction rolled back, metrics updated |
| Distributed Transaction - Commit (Prepare to Fail on RM) | Transaction rolled back, transaction record updated, metrics updated |
Resilient Tests
Handling failures of components like VTGate, VTTablet, or MySQL during the commit or recovery steps.
| Test Case | Expectation |
|---|---|
| Distributed Transaction - Store Commit Decision fails on MM | Transaction recovered based on transaction state. |
| Distributed Transaction - Commit Prepared fails on RM | Transaction recovered and committed |
| Distributed Transaction - Delete Transaction Record fails | Recovery and transaction record removed |
| Distributed Transaction - VTGate restart on commit received | Transaction rolled back on timeout |
| Distributed Transaction - VTGate restart after transaction record created on MM | Recovery and transaction rolled back |
| Distributed Transaction - VTGate restart after transaction prepared on a subset of RMs | Recovery and transaction rolled back |
| Distributed Transaction - VTGate restart after transaction prepared on all RMs | Recovery and transaction rolled back |
| Distributed Transaction - VTGate restart after storing the commit decision on MM | Recovery and transaction committed |
| Distributed Transaction - VTGate restart after transaction prepared commit on a subset of RMs | Recovery and transaction committed |
| Distributed Transaction - VTGate restart after transaction prepared commit on all RMs | Recovery and transaction record removed |
Failures on the MM and RMs include the VTTablet and MySQL interruption cases.
System Tests
Tests involving multiple moving parts, such as distributed transactions with Reparenting (PRS & ERS), Resharding, Online DDL, and MoveTables.
Stress Tests
Tests will run conflicting transactions (single and multi-shard) and validate the error metrics related to distributed transaction failures.
Reliability tests
A continuous stream of transactions (single and distributed) will be executed, with all successful commits recorded along with the expected rows. The binary log events will be streamed continuously and validated against the ordering of the change stream and the successful transactions.
This test should run over an extended period, potentially lasting a few days or a week, and must endure various scenarios including:
- Failure of different components (e.g., VTGate, VTTablets, MySQL)
- Reparenting (PRS & ERS)
- Resharding
- Online DDL operations
Deployment Plan
The existing implementation has remained experimental; therefore, no compatibility guarantees will be maintained with the new design changes.
Monitoring
The existing monitoring support will continue as per the old design.
Future Enhancements
1. Read Isolation Guarantee
The current system lacks isolation guarantees, placing the burden on the application to manage it. Implementing read isolation will enable true cross-shard ACID transactions.
2. Distributed Deadlock Avoidance
The current system can encounter cross-shard deadlocks, which are only resolved when one of the transactions times out and is rolled back. Implementing distributed deadlock avoidance will address this issue more efficiently.