TiDB 5.0 was officially released last week. One important focus of this major release is improving the cross-datacenter deployment capability of TiDB clusters. At the consensus algorithm level, the most exciting change is support for Joint Consensus, which lets TiDB 5.0 fully tolerate the loss of a minority of AZs during cross-AZ scheduling. This article first reviews the history of membership changes in TiDB, then introduces the design of the new feature, and finally discusses the problems we ran into during implementation and how we solved them.
Membership change
As the storage layer of TiDB, TiKV is responsible for managing data and serving reads and writes. TiKV splits data into shards of roughly equal size. Each shard has multiple replicas stored in different AZs (Availability Zones), and the Raft algorithm guarantees strongly consistent reads and writes. When balancing is needed, PD, TiDB's metadata management component, selects the shards that need to be adjusted and issues commands to TiKV to relocate their replicas. Since the Raft algorithm is designed with online membership changes in mind, replica relocation is naturally carried out through the membership change algorithm.
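To make the model concrete, here is a minimal sketch in Rust (the language TiKV and raft-rs are written in) of a shard and its replica placement. The type and field names are illustrative only; they are not TiKV's actual data structures.

```rust
// Illustrative only: a simplified model of a data shard (a "Region" in TiKV
// terms) and its replicas spread across AZs. Not TiKV's real types.

#[derive(Debug)]
struct Replica {
    store_id: u64,    // which TiKV instance holds this replica
    az: &'static str, // which Availability Zone the store is in
}

#[derive(Debug)]
struct Region {
    id: u64,
    start_key: Vec<u8>, // shard boundaries: [start_key, end_key)
    end_key: Vec<u8>,
    replicas: Vec<Replica>,
}

fn main() {
    // One shard with three replicas, one per AZ, forming a Raft group.
    let region = Region {
        id: 1,
        start_key: b"a".to_vec(),
        end_key: b"m".to_vec(),
        replicas: vec![
            Replica { store_id: 1, az: "az0" },
            Replica { store_id: 2, az: "az1" },
            Replica { store_id: 3, az: "az2" },
        ],
    };
    println!("{:#?}", region);
}
```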
TiKV's Raft implementation was originally ported from etcd's. etcd does not implement the complete Joint Consensus algorithm; instead, it implements a special single-step change (similar to, but not exactly the same as, the single-step change described in Diego Ongaro's doctoral thesis). As a result, relocating a replica has to be completed in multiple steps.
For example, when PD decides to move a replica from TiKV 2 to TiKV 3, it first adds a new replica on TiKV 3 with the AddNode command, and then removes the replica on TiKV 2 with the RemoveNode command to complete the change. In theory the change could also be done with RemoveNode first and AddNode second, but that order passes through an intermediate state with only 2 replicas, which cannot tolerate any node failure and is therefore more dangerous.
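As a rough illustration of this add-then-remove sequence, here is a minimal sketch. The `ConfChange` enum and the store IDs are simplified stand-ins for the commands PD actually issues, not real TiKV or raft-rs types.

```rust
// A rough sketch of the add-then-remove plan described above.
// Simplified stand-ins only, not real TiKV or raft-rs types.

#[derive(Debug, Clone, Copy)]
enum ConfChange {
    AddNode { store: u64 },
    RemoveNode { store: u64 },
}

/// Move one replica from `from` to `to`, one single-step change at a time.
/// Adding before removing keeps the group at 4 replicas (quorum = 3) instead
/// of dropping to 2 replicas, which could not tolerate any failure.
fn move_replica(from: u64, to: u64) -> Vec<ConfChange> {
    vec![
        ConfChange::AddNode { store: to },
        ConfChange::RemoveNode { store: from },
    ]
}

fn main() {
    // Move the replica from TiKV 2 to TiKV 3, as in the example above.
    for step in move_replica(2, 3) {
        println!("propose and wait for commit: {:?}", step);
    }
}
```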
Although the add-then-remove order only produces a 4-replica intermediate state, which can tolerate a single node failure, it is still not 100% available. In a cross-AZ scenario, PD may need to relocate a replica to another TiKV within the same AZ. As shown in the figure above, if AZ 2 becomes unavailable while the Raft group is in the 4-replica state, only the 2 replicas in AZ 0 and AZ 1 remain, which is not enough to form a quorum, so the whole Raft group becomes unavailable. Before 5.0, we introduced the learner role to mitigate this: before entering the 4-voter state, the replica to be added first joins the Raft group as a learner via the AddLearner command, and only after it has caught up with enough data do we perform the add-then-remove steps. This greatly shrinks the time window in which 4 voters exist (down to milliseconds if everything goes well), but it is still not 100% available.
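The availability gap can be seen with a toy quorum check. The store-to-AZ mapping below just mirrors the example above (the new replica lands in the same AZ as the one being replaced) and is otherwise hypothetical.

```rust
use std::collections::HashMap;

/// `true` if a majority of `voters` is still alive after `down_az` fails.
fn has_quorum(voters: &[u64], az_of: &HashMap<u64, &str>, down_az: &str) -> bool {
    let alive = voters
        .iter()
        .filter(|&&store| az_of[&store] != down_az)
        .count();
    alive > voters.len() / 2
}

fn main() {
    // Hypothetical placement mirroring the example above: the replica being
    // added (store 4) is in the same AZ as the one being replaced (store 3).
    let az_of = HashMap::from([(1, "az0"), (2, "az1"), (3, "az2"), (4, "az2")]);

    // Intermediate state: 4 voters, so the quorum is 3.
    let voters = [1, 2, 3, 4];

    // Losing AZ 2 takes out stores 3 and 4, leaving only 2 alive.
    assert!(!has_quorum(&voters, &az_of, "az2"));
    println!("the 4-replica intermediate state cannot survive losing AZ 2");
}
```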
Joint Consensus
In fact, the Raft paper already provides a membership change algorithm that is 100% available: Joint Consensus. Let C(a, b, c) denote a Raft group with three replicas a, b, and c. To change from C(a, b, c) to C(a, b, d), an intermediate joint state C(a, b, c) & C(a, b, d) is introduced. While the group is in the joint state, a log entry counts as committed only if it has been replicated to a quorum of both member lists. To make the change, the group first switches from C(a, b, c) to C(a, b, c) & C(a, b, d); each node switches its local membership to the joint state as soon as it receives this command. Once the command is committed, a second command is proposed to exit the joint state, changing from C(a, b, c) & C(a, b, d) to C(a, b, d). The correctness proof of this algorithm is beyond the scope of this article; if you are interested, refer to the Raft paper.
Since quorum is calculated against both 3-replica member lists, the intermediate joint state, like the multi-step single changes above, can tolerate any single node failure. Unlike multiple single-step operations, however, Joint Consensus achieves 100% availability. Take the 4-replica state in the figure as an example: losing AZ 2 takes down 2 nodes, but from the perspective of each 3-replica member list only a single node is lost, so the group remains available in the joint state.
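A minimal sketch of joint quorum counting makes the difference concrete. This is only an illustration of the rule described above, not the raft-rs implementation.

```rust
use std::collections::HashSet;

/// In the joint state C(old) & C(new), an entry is committed (and an election
/// won) only with a majority of *both* member lists.
fn joint_has_quorum(old: &[&str], new: &[&str], alive: &HashSet<&str>) -> bool {
    let majority = |cfg: &[&str]| {
        cfg.iter().copied().filter(|n| alive.contains(n)).count() > cfg.len() / 2
    };
    majority(old) && majority(new)
}

fn main() {
    // C(a, b, c) & C(a, b, d): c is the outgoing replica in AZ 2 and d is the
    // incoming replica placed in the same AZ.
    let old = ["a", "b", "c"];
    let new = ["a", "b", "d"];

    // AZ 2 fails, taking down both c and d.
    let alive: HashSet<&str> = HashSet::from(["a", "b"]);

    // Each 3-member list loses only one node, so the joint group stays available.
    assert!(joint_has_quorum(&old, &new, &alive));
    println!("the joint state survives losing AZ 2");
}
```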
Implementation
As mentioned above, etcd's Raft implementation differs from the single-step change described in Diego Ongaro's thesis; in fact, etcd's implementation predates the thesis. The main difference is that etcd applies a membership change log entry only after it is committed, whereas the paper's approach applies it immediately upon receipt. We started investigating the feasibility of Joint Consensus back in TiDB 3.0. Our initial plan was to follow the paper exactly, but that would have required too many compatibility fixes and adjustments. Around the same time, CockroachDB began adding Joint Consensus support to etcd. In the end, we decided to embrace the community, stay consistent with etcd, and optimize and test it together.
etcd's Joint Consensus implementation is not fully consistent with the paper either; it keeps the apply-after-commit behavior described above. The advantage is that a membership change log entry is handled just like an ordinary entry, so a single unified code path can be used, and since committed entries are never replaced, there is no need for the special change-rollback handling required by the apply-on-receipt approach, which makes the implementation simpler. However, because only the leader knows the commit status first-hand, availability bugs can appear in corner cases where this information is not propagated in time. Interested readers can check the related issues we submitted to the etcd project (1, 2). Here is a simple example. Suppose the group is in the joint state C(a, b, c) & C(a, b, d) and a is the leader. The command to exit the joint state is replicated to a, b, and c, so a considers it committed, synchronizes the commit index to c, and then crashes. c applies the committed entry, learns that the joint state has been exited, and removes itself. Meanwhile, b and d do not know that the exit command has been committed, so when they start an election they still try to collect votes from both quorums. But a has crashed and c has removed itself, so b and d can never obtain a quorum from (a, b, c), and the group becomes unavailable.
These problems eventually led to two adjustments in the Joint Consensus implementation compared with the original paper:
1. A voter must be demoted to a learner before it can be removed.
2. A mechanism to synchronize the commit index during elections was added.
Revisiting the example above: since a voter is no longer removed directly, c does not delete itself but becomes a learner instead. When b requests a vote from c, it learns the latest commit index and discovers that the joint state has already been exited, so it only needs a quorum from C(a, b, d), and the election succeeds.
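The sketch below illustrates the second adjustment in spirit: election messages carry the sender's commit index, so a peer that has fallen behind can learn that the joint state has already been exited before it counts quorums. The message and node types here are simplified stand-ins, not raft-rs's actual structures.

```rust
// Illustrative sketch only; these are not raft-rs's real message or node types.

#[derive(Debug)]
struct VoteResponse {
    vote_granted: bool,
    commit: u64, // commit index known to the responding peer
}

#[derive(Debug)]
struct Candidate {
    id: &'static str,
    commit: u64,
    leave_joint_index: u64, // log index of the "exit joint" entry
    in_joint: bool,         // still tracking both member lists?
}

impl Candidate {
    /// When handling a vote response, first fast-forward the local commit
    /// index. If that commits the "exit joint" entry, drop the outgoing
    /// member list so quorum is only counted against the new one.
    fn handle_vote_response(&mut self, resp: &VoteResponse) {
        if resp.commit > self.commit {
            self.commit = resp.commit;
            if self.in_joint && self.commit >= self.leave_joint_index {
                self.in_joint = false;
            }
        }
        println!(
            "{}: granted={}, commit={}, in_joint={}",
            self.id, resp.vote_granted, self.commit, self.in_joint
        );
    }
}

fn main() {
    // b still believes it is in C(a, b, c) & C(a, b, d).
    let mut b = Candidate { id: "b", commit: 10, leave_joint_index: 11, in_joint: true };
    // c, now a learner, cannot grant a vote but reports that index 11 is committed.
    b.handle_vote_response(&VoteResponse { vote_granted: false, commit: 11 });
    // b now only needs a quorum from C(a, b, d).
    assert!(!b.in_joint);
}
```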
Summary
In 5.0 we added Joint Consensus support, which fully tolerates the unavailability of a minority of AZs during cross-AZ scheduling. The Raft algorithm itself is clear and simple, but engineering it involves various adjustments and trade-offs. If you are interested in solving similar problems in distributed systems, you are welcome to contribute to our projects TiKV and raft-rs, or send your resume to jay@pingcap.com to join us directly.