Raft or Not? Why Consensus-Based Replication Is Not a Silver Bullet for Distributed Database Log Replication
Background
While chatting with teammates recently, I heard this point of view:
"Is our team's current technical route wrong? Log storage built on Kafka/Pulsar feels more like middleware than a database!"
In discussions with many friends, I also found widespread misunderstanding of the basic concepts of consistency, consensus, and replication; many people believe that log replication based on a distributed consensus algorithm such as Paxos or Raft is the only correct solution for a distributed database.
To clarify this issue, we first need to clarify the following basic concepts.
Replication
Replication refers to copying data to multiple locations (different disks, processes, machines, or clusters). It usually serves two purposes:
- Improve data reliability: recover from failures such as bad disks, failed physical machines, or abnormal clusters.
- Speed up queries: multiple replicas can serve reads concurrently to improve performance.
Replication comes in many forms. To be clear up front, what we are discussing today is how incremental logs are synchronized, not how full data sets are copied.
Common ways of classifying replication include synchronous vs. asynchronous, strongly consistent vs. eventually consistent, and leader-based vs. leaderless.
The choice of replication mode affects the availability and consistency of the system. This is what the famous CAP theorem describes: when network partitions are unavoidable, the system designer must make a trade-off between consistency and availability.
Consistency
A simple way to understand consistency is whether reads and writes against multiple replicas at the same time return consistent data.
To be clear first, we are talking about the "C" in CAP, not the "C" in ACID. For a description of consistency levels, I find this Cosmos DB documentation the most reliable: https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels (I have to admit that Azure's product documentation is much stronger than AWS's, and it has a distinctly academic style).
OLTP databases usually require strong consistency, i.e. linearizability, which means:
- Any read returns the most recent write of a data item.
- Once a read has returned the new value, all subsequent reads (on the same or other clients) must also return the new value.
The essence of linearizability is a recency guarantee across replicas: once a new value has been written or read, all subsequent reads see that value until it is overwritten again. In a distributed system that provides linearizability, users do not need to care about how the replicas are implemented, and every operation appears to take effect atomically in a single total order.
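To make this recency guarantee concrete, here is a minimal, purely illustrative Go sketch of my own (the `versioned` and `checker` types are assumptions, not from any real system): once any client has observed version N of a value, a linearizable store must never again return an older version.

```go
// A purely illustrative sketch of the recency guarantee (the versioned and
// checker types are my own, not from any real system): once any client has
// observed version N, a linearizable store must never again return an older version.
package main

import "fmt"

type versioned struct {
	value   string
	version int64
}

// checker remembers the newest version any client has observed so far.
type checker struct{ highestSeen int64 }

// observe validates a read result against the recency guarantee.
func (c *checker) observe(r versioned) error {
	if r.version < c.highestSeen {
		return fmt.Errorf("stale read: version %d returned after version %d was observed",
			r.version, c.highestSeen)
	}
	c.highestSeen = r.version
	return nil
}

func main() {
	c := &checker{}
	_ = c.observe(versioned{"v1", 1})
	_ = c.observe(versioned{"v2", 2})
	if err := c.observe(versioned{"v1", 1}); err != nil {
		fmt.Println(err) // a linearizable system must never produce this history
	}
}
```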
Consensus
People want to use a distributed system as if it were a single machine, which inevitably introduces the "distributed consensus" problem.
Simply put, once a process proposes a value, all processes in the system must be able to reach agreement on that value.
The earliest consensus algorithms come from Viewstamped Replication, published in 1988, and the Paxos algorithm proposed by Leslie Lamport in 1989.
In recent years, the Raft algorithm has also been widely adopted in industry because it is relatively easy to implement (mainstream NewSQL databases such as CockroachDB, TiDB, and OceanBase are almost all built on Raft or Paxos).
There is also another family of consensus algorithms: leaderless consensus protocols.
Such algorithms are widely used in blockchains, for example the PoW algorithm used by Bitcoin. Due to their efficiency problems, they are rarely used in highly concurrent database systems.
It is worth noting that adopting a distributed consensus algorithm does not by itself mean the system supports linearizability.
When designing a system, we need to carefully consider the commit order between the log and the state machine, and carefully maintain the lease of the Raft/Paxos leader to avoid split-brain under network partitions.
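As a hedged illustration of the lease point (a sketch of my own; `leaderLease`, `maxSkew`, and the error message are assumed names, not a real library API): a leader should refuse linearizable reads once its lease may have expired, otherwise a partitioned old leader could serve stale data even though the log itself was replicated by consensus.

```go
// A sketch of a leader-side lease check (illustrative; leaderLease and maxSkew
// are assumed names, not a real library API). A leader whose lease may have
// expired must stop serving linearizable reads, otherwise a partitioned old
// leader could return stale data even though its log was replicated by consensus.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errLeaseExpired = errors.New("lease expired: step down and redirect to the new leader")

type leaderLease struct {
	expiresAt time.Time     // renewed whenever a quorum of heartbeats succeeds
	maxSkew   time.Duration // assumed bound on clock drift between nodes
}

// serveLinearizableRead answers only while the lease is still safely valid.
func (l *leaderLease) serveLinearizableRead(now time.Time, read func() ([]byte, error)) ([]byte, error) {
	if !now.Add(l.maxSkew).Before(l.expiresAt) {
		return nil, errLeaseExpired
	}
	return read()
}

func main() {
	l := &leaderLease{expiresAt: time.Now().Add(2 * time.Second), maxSkew: 250 * time.Millisecond}
	_, err := l.serveLinearizableRead(time.Now(), func() ([]byte, error) { return []byte("value"), nil })
	fmt.Println(err) // <nil> while the lease is still valid
}
```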
Milvus's tunable consistency is actually very similar to the Follower Read implementation in distributed consensus algorithms. Follower Read means using follower replicas to serve reads while still providing strongly consistent reads, thereby improving cluster throughput and reducing the load on the leader. It is usually implemented by asking the leader for the latest committed log index, waiting until all entries up to that index have been applied to the local state machine, and only then serving the query.
When designing Milvus, we did not ask the producer for the commit index on every query. Instead, the query nodes are periodically notified of the commit index position through a mechanism similar to watermarks in Flink. This design choice was made because Milvus users generally do not have strict consistency requirements: they usually accept slightly delayed data visibility in exchange for higher performance, so determining the commit index for every single query is unnecessary.
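The following Go sketch illustrates the follower-read mechanism described above (my own illustration; the field and function names are assumptions, not Milvus or Raft APIs): learn the leader's commit index, or receive it via a periodically pushed watermark as Milvus does, wait until the local state machine has applied at least that far, then answer the query from the follower replica.

```go
// A minimal sketch of follower reads (illustrative only; field and function
// names are assumptions, not Milvus or Raft APIs): learn the leader's commit
// index, wait until the local state machine has applied at least that far,
// then answer the query from the follower replica.
package main

import (
	"context"
	"fmt"
	"time"
)

type follower struct {
	appliedIndex func() uint64                             // how far the local state machine has applied
	leaderCommit func(ctx context.Context) (uint64, error) // ask the leader for its commit index
}

func (f *follower) readConsistent(ctx context.Context, query func() ([]byte, error)) ([]byte, error) {
	// In Milvus this target would instead arrive via a periodically pushed
	// watermark rather than a per-query round trip to the producer.
	target, err := f.leaderCommit(ctx)
	if err != nil {
		return nil, err
	}
	for f.appliedIndex() < target {
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(time.Millisecond): // wait for the apply loop to catch up
		}
	}
	return query()
}

func main() {
	applied := uint64(0)
	f := &follower{
		appliedIndex: func() uint64 { applied++; return applied }, // pretend apply progress
		leaderCommit: func(context.Context) (uint64, error) { return 5, nil },
	}
	data, err := f.readConsistent(context.Background(), func() ([]byte, error) { return []byte("result"), nil })
	fmt.Println(string(data), err)
}
```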
Why Consensus-based replication is so popular
Simply put, people love Linearizability.
Whether it is Raft, ZAB, or Aurora's quorum-based log protocol, each is to some extent a variant of the Paxos algorithm. Linearizability is often the foundation for implementing ACID in distributed databases, which makes this technology hard to replace in transactional systems.
There are three other reasons why consensus-based replication is so popular:
First, compared with traditional master-slave replication, consensus-based replication leans toward the "CP" side of CAP, yet it still provides good availability: process crashes and server restarts can usually be recovered from within seconds.
Second, Raft greatly simplified the implementation of consensus algorithms, and more and more databases choose to write their own Raft implementation or adapt an existing one. By incomplete statistics, there are more than 15 Raft implementations with over 1,000 stars on GitHub; the best known are the Raft library in etcd and Ant Group's open-source sofa-jraft, and countless projects build on them. In China, with the popularity of open-source projects such as TiKV and etcd, more and more people are paying attention to this field.
Third, consensus-based replication does meet current business needs in terms of performance. In particular, high-performance SSDs and 10-gigabit NICs have greatly reduced the network and disk overhead of multi-replica synchronization, making Paxos/Raft the mainstream choice.
What's wrong with Consensus-based replication
Ironically, consensus-based replication is not a panacea that solves every problem in distributed systems. Precisely because of its challenges in availability, complexity, and performance, it cannot be the only de facto standard for replication.
1) Availability
Compared with weakly consistent systems and quorum-based distributed systems, optimized Paxos/Raft implementations tend to depend heavily on the leader replica, which weakens their ability to tolerate gray failures. Leader re-election usually relies on the leader failing to respond for a long time, and this strategy handles a slow or flapping leader poorly: in real production, we have seen plenty of system jitter caused by a broken fan, failing memory, or a NIC that keeps dropping packets.
2) Complexity
Although there are already many reference implementations, building a consensus algorithm correctly is not easy. With the emergence of variants such as Multi-Raft and Parallel Raft, coordinating the log and the state machine requires even more theoretical analysis and test verification. By contrast, I appreciate replication protocols such as PacificA and Kafka's ISR, which hand leader election and membership management to a small, separate component (such as a small Raft group or a coordination service), greatly reducing design complexity.
3) Performance cost
In the cloud-native era, shared storage such as EBS and S3 is increasingly replacing local disks, and such storage already provides good guarantees of data reliability and consistency. Building multiple replicas with distributed consensus is no longer a hard requirement on top of it, and this replication style leads to redundant copies on disk: consensus itself requires multiple replicas, while EBS already keeps multiple replicas underneath.
For cross-datacenter or cross-cloud replication, the price of pursuing consistency is not only reduced availability but also higher request latency (see the PACELC theorem: https://en.wikipedia.org/wiki/PACELC_theorem), which causes a significant performance drop. Therefore, in most business scenarios, linearizability is not a hard requirement for cross-datacenter disaster recovery.
In the cloud-native era, what log replication strategy should be adopted?
Having covered the pros and cons of consensus-based replication, what replication strategy should databases adopt in the cloud-native era?
It is undeniable that Raft- and Paxos-based algorithms will still be used by many OLTP databases, but we can already see some new trends in the PacificA protocol, Socrates, Aurora, and Rockset.
Before introducing the concrete implementations, let me first propose two principles I have summarized:
1) Replication as a service
Use a microservice dedicated to synchronizing data instead of tightly coupling the synchronization module and the storage module in one process.
2) "Matryoshka"
As mentioned earlier, avoiding a single point of failure seems to force us into the limitations of Paxos. But if we narrow the problem and hand only leader election over to an existing Raft/Paxos implementation (for example, a service built on Chubby, ZooKeeper, or etcd), then log replication itself can be greatly simplified and the performance cost problem is solved as well; a sketch of this idea follows below. Let me then share a few of my favorite designs.
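Before the concrete examples, here is a hedged sketch of what "handing election to an existing service" can look like in practice, using etcd's election recipe from the `clientv3/concurrency` package (the endpoint, key prefix, and node name are placeholders, and error handling is trimmed):

```go
// A hedged sketch of the "Matryoshka" idea: delegate leader election to etcd's
// off-the-shelf election recipe and keep the log replication path simple.
// The endpoint and key prefix below are placeholders.
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The session's lease doubles as the primary's lease: if this process
	// hangs or is partitioned away, the lease expires and another node can campaign.
	session, err := concurrency.NewSession(cli, concurrency.WithTTL(5))
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	election := concurrency.NewElection(session, "/replication/primary")
	if err := election.Campaign(context.Background(), "node-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("elected primary; start accepting writes and shipping the log to secondaries")
}
```

With election outsourced like this, the process that wins only needs to implement the much simpler primary-to-secondary log shipping path.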
The first example, and my favorite protocol, is Microsoft's PacificA. The paper was published in 2008; compared with the logical rigor of Paxos, PacificA focuses more on engineering practicality. When we designed a strongly consistent solution for the self-developed Lindorm database at Alibaba Cloud, we found that our design was very similar to PacificA.
PacificA's implementation is also very simple. On the data synchronization path, the primary replicates each request to the secondaries, and the request is considered committed only after all replicas have acknowledged it, so linearizability is easy to guarantee.
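A minimal sketch of this all-ack commit rule (my own illustration of the idea, not PacificA's actual code; the interfaces are assumed):

```go
// A minimal sketch of the all-ack commit rule (illustrative only): the primary
// ships each log entry to every secondary in the current configuration and
// acknowledges the client only after all replicas ack. A failed replica is
// handled by reconfiguration (removing it from the group), not by a quorum.
package main

import (
	"context"
	"fmt"
)

type secondaryClient interface {
	Append(ctx context.Context, entry []byte) error
}

type primary struct {
	secondaries []secondaryClient
}

func (p *primary) replicate(ctx context.Context, entry []byte) error {
	for _, s := range p.secondaries {
		if err := s.Append(ctx, entry); err != nil {
			// In a PacificA-style protocol, the primary would now ask the
			// configuration manager to remove the failed secondary before
			// the entry can be considered committed.
			return fmt.Errorf("replica failed, reconfiguration required: %w", err)
		}
	}
	return nil // every replica holds the entry: committed
}

type inMemorySecondary struct{ log [][]byte }

func (s *inMemorySecondary) Append(_ context.Context, entry []byte) error {
	s.log = append(s.log, entry)
	return nil
}

func main() {
	p := &primary{secondaries: []secondaryClient{&inMemorySecondary{}, &inMemorySecondary{}}}
	fmt.Println(p.replicate(context.Background(), []byte("put k=v"))) // <nil> => committed
}
```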
PacificA's availability mechanism is also simpler: a system such as ZooKeeper or etcd is used to elect the primary and to perform membership changes. If the primary hangs and its lease expires, a secondary applies to take over and the old primary is removed from the configuration; if a secondary fails, the primary applies to remove that secondary from the replica group.
The second example, a system I greatly admire, is Socrates; I have to say that Microsoft's work really matches my personal technical taste.
The core idea of Socrates is decoupling compute, log, and storage: logging is separated from storage and implemented as a standalone service.
The log is implemented as a separate service (the XLog service) that uses low-latency storage for durability: a component called the landing zone (LZ) provides fast three-replica persistence. Its capacity is limited and it serves only as a circular buffer, but it reminds me of EMC's use of capacitor-backed memory for data persistence.
The primary node asynchronously forwards logs to the log brokers, which persist the data in XStore (a lower-cost data store) and cache it on local SSDs to accelerate reads. Once the data has been durably written, the corresponding buffer in the LZ can be reclaimed. The log is thus spread across three tiers: the LZ, the local SSD cache, and XStore. Near-line data makes full use of the hot tiers, which improves failure recovery speed and log tailing efficiency.
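A rough sketch of reading across the three log tiers just described (my own illustration; the tier names follow the paper's terminology, but the interfaces and behavior are assumptions):

```go
// A rough sketch of a tiered log read in the spirit of Socrates (illustrative
// only): try the hot landing-zone buffer first, then the local SSD cache, and
// finally fall back to the cold XStore tier.
package main

import "fmt"

// tier returns the requested log range and whether this tier holds it.
type tier func(fromLSN, toLSN uint64) (data []byte, ok bool)

type tieredLog struct {
	landingZone tier // small, fast, three-replica circular buffer
	localSSD    tier // per-node cache of recently destaged log
	xstore      tier // cheap, durable long-term log storage
}

// Read tries the hottest tier first and falls back to colder ones.
func (t tieredLog) Read(fromLSN, toLSN uint64) ([]byte, bool) {
	for _, tr := range []tier{t.landingZone, t.localSSD, t.xstore} {
		if data, ok := tr(fromLSN, toLSN); ok {
			return data, true
		}
	}
	return nil, false
}

func main() {
	l := tieredLog{
		landingZone: func(_, _ uint64) ([]byte, bool) { return nil, false },          // already reclaimed from the LZ
		localSSD:    func(_, _ uint64) ([]byte, bool) { return []byte("log"), true }, // served from the warm cache
		xstore:      func(_, _ uint64) ([]byte, bool) { return []byte("log"), true },
	}
	fmt.Println(l.Read(100, 200))
}
```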
The third example worth mentioning is AWS Aurora, which has become almost synonymous with cloud-native databases.
Aurora also adopts a classic separation of storage and compute. Its storage layer is a service customized for MySQL: it persists redo logs and pages, and converts redo logs into pages.
Aurora's strength is highly available writes based on a six-replica NWR (quorum) protocol. Unlike DynamoDB's NWR, Aurora has only a single writer, which can generate monotonically increasing log sequence numbers; this sidesteps conflict arbitration, the problem that is hardest to avoid in traditional NWR systems.
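The following sketch shows a single-writer quorum write in the spirit of Aurora's protocol (my own illustration, not Aurora's code; per the paper, the write quorum is 4 of 6 and the read quorum is 3 of 6):

```go
// A simplified sketch of a single-writer quorum write (illustrative only, not
// Aurora's code): the single writer assigns a monotonically increasing LSN,
// ships the log record to all six storage nodes, and treats it as durable once
// four of them acknowledge.
package main

import (
	"context"
	"errors"
	"fmt"
)

var errQuorumNotReached = errors.New("write quorum not reached")

type storageNode interface {
	AppendLog(ctx context.Context, lsn uint64, record []byte) error
}

type writer struct {
	nodes   []storageNode // six storage replicas spread across three AZs
	nextLSN uint64
}

func (w *writer) write(ctx context.Context, record []byte) (uint64, error) {
	w.nextLSN++ // single writer => LSNs are assigned without any conflict arbitration
	lsn := w.nextLSN

	acks := make(chan error, len(w.nodes))
	for _, n := range w.nodes {
		go func(n storageNode) { acks <- n.AppendLog(ctx, lsn, record) }(n)
	}

	const writeQuorum = 4
	ok := 0
	for range w.nodes {
		if err := <-acks; err == nil {
			ok++
			if ok >= writeQuorum {
				return lsn, nil // durable on a write quorum; stragglers catch up later
			}
		}
	}
	return 0, errQuorumNotReached
}

type okNode struct{}

func (okNode) AppendLog(context.Context, uint64, []byte) error { return nil }

func main() {
	w := &writer{nodes: []storageNode{okNode{}, okNode{}, okNode{}, okNode{}, okNode{}, okNode{}}}
	fmt.Println(w.write(context.Background(), []byte("redo record")))
}
```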
In effect, the cooperation between the single DB instance and the quorum-based storage layer implements a Paxos-like protocol. The Aurora paper does not discuss how failure of the upper-layer DB instance is detected; my guess is that failover still relies on a component based on Paxos, Raft, or ZAB, and that during switchover the new instance recovers leases on the underlying storage to ensure there can never be two primaries at the same time (purely personal speculation; corrections are welcome).
The final example is an interesting product called Rockset, an analytical database built by the original Facebook RocksDB team.
I hope to discuss Rockset and its RocksDB-Cloud architecture separately some other time; in my opinion, it is the most cloud-native OLAP product apart from Snowflake. What matters here is that, like Milvus, Rockset uses Kafka/Kinesis directly as its distributed log, S3 as storage, and local SSDs as a cache to improve query performance. Even more interestingly, Kafka's own replication protocol, ISR, shares many similarities with PacificA and is itself a "Matryoshka". Building cloud-native services by leveraging other services to reduce implementation complexity has become a required course for architects.
Summary
Today, more and more cloud databases are turning log replication into a standalone service. This greatly reduces the cost of adding read-only or heterogeneous replicas and makes it easier to optimize the cost and performance of the log storage service. The microservice design also makes it possible to quickly reuse mature cloud infrastructure, something unimaginable for traditionally tightly coupled database systems. Such an independent log service may itself rely on consensus-based replication, or it may adopt the "Matryoshka" strategy and combine various replication protocols with Paxos/Raft to achieve linearizability.
Finally, let me share one more example.
Many years ago, when I first heard how Google Colossus stores its metadata, I couldn't help admiring the design: Colossus stores all of its metadata on GFS, GFS data is in turn stored on Colossus, and the most primitive metadata at the bottom of the chain is small enough to be stored directly in Chubby. Isn't that a "natural" use of a Paxos-based, ZooKeeper-like coordination service?
No matter how technology evolves and how its outward form changes, understanding and thinking about what lies behind its development is what technical people should spend more time on, and it is more in line with first-principles thinking.
References
- Lamport L. Paxos made simple[J]. ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001), 2001: 51-58.
- Ongaro D, Ousterhout J. In search of an understandable consensus algorithm[C]//2014 USENIX Annual Technical Conference (USENIX ATC 14). 2014: 305-319.
- Oki B M, Liskov B H. Viewstamped replication: A new primary copy method to support highly-available distributed systems[C]//Proceedings of the Seventh Annual ACM Symposium on Principles of Distributed Computing. 1988: 8-17.
- Lin W, Yang M, Zhang L, et al. PacificA: Replication in log-based distributed storage systems[J]. 2008.
- Verbitski A, Gupta A, Saha D, et al. Amazon Aurora: On avoiding distributed consensus for I/Os, commits, and membership changes[C]//Proceedings of the 2018 International Conference on Management of Data. 2018: 789-796.
- Antonopoulos P, Budovski A, Diaconu C, et al. Socrates: The new SQL Server in the cloud[C]//Proceedings of the 2019 International Conference on Management of Data. 2019: 1743-1756.
About the author
Luan Xiaofan
Zilliz Partner and Technical Director
Member of the Technical Advisory Board of the LF AI & Data Foundation. A graduate of Cornell University's Department of Computer Engineering, he has worked at Oracle's US headquarters, at Hedvig, a next-generation software-defined storage startup, and on the Alibaba Cloud database team, where he was responsible for Alibaba Cloud's open-source HBase offering and led the team that built the self-developed NoSQL database Lindorm.
With a vision of redefining data science, Zilliz is committed to building a global leader in open-source technology innovation and to unlocking the hidden value of unstructured data for enterprises through open-source and cloud-native solutions.
Zilliz built the Milvus vector database to accelerate the development of next-generation data platforms. The Milvus database is a graduate project of the LF AI & Data Foundation; it can manage massive unstructured datasets and is widely used in areas such as new drug discovery, recommendation systems, and chatbots.