Two Phase Commit Protocol
07 Mar 2025
[
design
development
]
Context
- Distributed transaction updates data across multiple systems or databases.
- Data needs to be atomically stored on multiple cluster nodes
- Nodes cannot make the data accessible to clients until the decision of other cluster nodes is known.
- Each node needs to know if other nodes successfully stored the data or if they failed.
- Assume that each site logs actions at that site, but there is no global log.
Problem
- Keep data accurate across all systems (consistency)
- Make sure all parts of a transaction succeed or fail together (atomicity)
Solution
- To ensure atomicity property Transaction must either commit at all the sites, or it must abort at all sites.
- The protocol has two phases: Prepare and Commit/Abort
- Phase 1 - Coordinator sends a Prepare message to all the sites where the Transaction is executed. Each site must eventually send a response
- Phase 2 - Response abort or commit received by the Coordinator from all the sites that are collaboratively executing the transaction.
Concepts
- Transaction - Prepare, Commit, Abort
- Coordinator - First asks sites to prepare their part. Then asks to commit when all agreed, asks to abort when all disagreed. Waits for all to respond.
- Sites - Responds to prepare by lock resources, prepare data changes and check if they can finish.
Benefits
- Guarantees data consistency
- Guarantees data atomicity
- Handles failures
Drawbacks
- Coordinator site failure may result in blocking
- Impossible to determine what decision has been made if the Coordinator fails and the active sites keep no additional log-record
- Can negatively impact system performance
Challenges
- Blocking Problem - the locked data items remain inaccessible for other transactions until the Coordinator is restored or fixed
- Keeping data in sync across systems
- Dealing with network problems
- Handling partial failures
- Coordinating all systems involved
Design considerations
- Consider coordinator failure after prepare message - set a timeout, then check other sites.
- Consider site failure - should others continue?
- Set timeouts for responses
- Keep good logs
- Have a manual fix plan
- Use query messages for the status of coordinator and sites
- Handle system / network failures
References