This article mainly introduces transactions, consistency, and consensus. I will first explain how they work in distributed systems, then describe the internal relationships between them, so that readers can see that there are recognizable patterns ("routines") to be found when designing distributed systems. Finally, I will introduce some industry tools and frameworks for verifying distributed algorithms. I hope it helps or inspires you.
1. A recap of the previous article
In the previous article, we introduced the common replication models in distributed systems, described the advantages, disadvantages, and usage scenarios of each, and discussed some technical challenges unique to distributed systems. There are three common replication models: master-slave replication, multi-master replication, and masterless replication. In addition, from the client's perspective on timeliness, replication is divided into synchronous replication && asynchronous replication. Asynchronous replication lags, which may cause data inconsistency, and this inconsistency brings various problems.
The first article also used the example of "the boss schedules people to work" as an analogy for the challenges unique to distributed systems, namely partial failures and unreliable clocks. These bring a lot of trouble to distributed system design; it seems that a plain distributed system can hardly do anything without mechanisms to guarantee correctness.
At the end of the last article, we made some assumptions about the system model of a distributed system, and these assumptions are important for the solutions that follow. First, for partial failures, we need assumptions about timeouts; we generally assume a semi-synchronous model, meaning delays are usually normal but can become very large once a failure occurs. For node failures, we usually design the system assuming a crash-recovery model. Finally, facing the two guarantees of a distributed system, Safety and Liveness, we give priority to Safety, while Liveness can usually only be satisfied under certain preconditions.
2. Introduction to this article
Through the first article we know what problems are left to solve. In this article we address those challenges under the assumptions above. The guarantees involved include transactions, consistency, and consensus. We will introduce their functions and internal connections, then go back and review the design of Kafka's replication component to see whether a real system can actually apply those patterns directly, and finally introduce some industry tools and frameworks for verifying distributed algorithms. Now, let's continue our data replication journey!
3. Transactions & External Consistency
When it comes to transactions, I believe everyone can say a thing or two. The first instinctive reaction is probably the so-called "ACID" properties, followed by the various isolation levels. Yes, these are indeed the problems that transactions need to solve.
In this chapter we will build a more coherent picture of how they are connected and take a closer look at what a transaction is actually trying to solve. The book "DDIA" contains many implementation details of database transactions, but this article will downplay them. After all, the goal here is not to describe how to design a database; we only need to explore the problem itself, and the detailed designs will be easier to appreciate when we actually need them. Now let's officially start with transactions.
3.1 Why transactions exist
The system may face the following problems:
- The operating-system layer and hardware layer that the program relies on may fail at any time (including halfway through an operation).
- Applications can fail at any time (including halfway through an operation).
- Network outages can occur at any time, cutting the link between the client and the server, or between database nodes.
- Multiple clients may access the server at the same time and update the same batch of data, causing their writes to overwrite each other (a critical section).
- The client may read outdated data because, as mentioned above, the application may crash halfway through an operation.
If the above problems can appear whenever we access the storage system (or database), we would need to spend a lot of extra effort handling them while developing our own applications. The core mission of transactions is to help us solve these problems by providing safety guarantees at the storage layer, so that when accessing the storage system we can focus on our own write and query logic rather than on this additional exception handling. The solution is embodied in the famous ACID properties.
3.2 Revisiting the familiar: the ACID properties
These four letters have probably become an instinctive reflex, but the definitions given in "DDIA" are more helpful for understanding the relationships between them, as explained below:
A: Atomicity: Atomicity describes a constraint across multiple operations from the same client. "Atomic" here means indivisible. The effect of atomicity is that, given a set of operations {A, B, C, D, E}, the result after execution should look the same as if a single client had executed one single operation. From this constraint we know:
- For the operations themselves, even if a failure occurs we must never observe an intermediate result of this set of operations. For example, if the failure happens while executing C, the transaction should either be retried until all operations have completed, or the state should be rolled back to what it was before A was executed.
- For the server, if any failure occurs, our operations should leave no side effects on the server. Only then can the client retry safely; otherwise, if every retry left side effects behind, the client could never retry with confidence.
Therefore, what the book describes for atomicity is really abortability: when an exception occurs during execution, the transaction can be terminated directly, leaves no side effects on the server, and can be retried safely.
C: Consistency: This term is heavily overloaded; it has very different (though perhaps related) meanings in different contexts, which easily leads to confusion, for example:
- When data is replicated, there is consistency between replicas, which refers to the consistency of the state of different replicas mentioned in the previous article.
- Consistent hashing is a partitioning algorithm; my personal understanding is that it is "consistent" in the sense that it keeps functioning in a consistent manner under various circumstances.
- The consistency in the CAP theorem refers to a special kind of internal consistency introduced later, called "linearizability" (linear consistency).
- The consistency in ACID, covered here, refers to some "invariant" or "good state" of the program.
We need to distinguish what consistency means in each of these contexts, and I hope that after reading today's sharing you will remember these differences better. The consistency here means that a set of specific statements about the data must always hold, i.e. the "invariant", somewhat like the "loop invariant" in algorithms: as the external environment changes, this invariant must continue to hold.
The book emphasizes that consistency in this sense mostly needs to be guaranteed by the user's application, because only the user knows what the invariant actually is. Here is a simple example: when we append messages to Kafka, suppose two of the messages have identical content. Without additional information, we cannot tell whether the client retried because of a failure and sent the same message twice, or whether there really are two identical pieces of data.
If you want to distinguish them, you can implement custom deduplication logic after the user program consumes the data, or start from Kafka itself by adding a "sequence number" when the client sends, to mark the uniqueness of each message (this is the general idea behind the idempotent/transactional producer in newer Kafka versions). In this way the engine itself gains some ability to maintain an "invariant" on its own. For more complex situations, however, the invariant still has to be maintained by the user program and the calling service.
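As a concrete illustration, below is a minimal sketch of enabling the idempotent producer in the Kafka Java client (the broker address and topic name are made up for the example); with this setting the broker uses a producer ID plus per-partition sequence numbers to deduplicate retried sends:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class IdempotentProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Ask the broker to deduplicate retried sends using a producer id + sequence number,
        // so a network-level retry does not turn into a duplicate record.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // required by idempotence

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "counter", "2"));
            // A second application-level send with the same payload is still two logical
            // records: only internal retries are deduplicated, which is exactly the
            // "only the user knows the invariant" point above.
            producer.send(new ProducerRecord<>("demo-topic", "counter", "2"));
        }
    }
}
```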
I: Isolation: Isolation is the highlight of transactions, and also the part with the most material, because the problem it solves is concurrency among multiple transactions acting on the same piece (or the same batch) of data. When it comes to concurrency, we know it is never a simple problem, because the essence of concurrency is uncertainty of timing. When operations with uncertain timing contend over the same objects (a race), all kinds of problems can result. This is similar to multi-threaded programming, except that the operations here are much longer than a single CPU instruction, so the problems are more serious and more varied.
Here is a concrete example to get an intuitive feel. The figure below shows two clients concurrently modifying a counter in the DB. Because User2's get counter happens while User1's update is in progress, the counter User2 reads is stale; the same goes for User2's update. So although the expected final value is 44, both users end up seeing 43 (similar to two threads doing value++ at the same time).
Under perfect transaction isolation, each transaction sees the system as if only it were running, and from the system's point of view these concurrent transactions execute one after another, as if there were only one transaction at a time. Such isolation is called "serializability". Of course, this isolation level carries a huge overhead, so various weaker isolation levels have emerged to fit different scenarios. The problems solved by the different isolation levels are described in detail later.
D: Durability: This one seems easy to understand: once a transaction has completed, no data should be lost no matter what problems occur. Theoretically, for a single-node database this means the data has at least been written to non-volatile storage (at least the WAL); for a distributed system it means the data has been replicated to the replicas and acknowledged by them. In practice, however, 100% durability cannot always be guaranteed. The book covers this in detail and we won't repeat it here; in short, the durability a transaction guarantees is generally the result of trade-offs.
Among the four properties above, isolation is probably the most varied and the most complex, because blindly insisting on "serializability" may bring unacceptable performance overhead. The following therefore focuses on isolation levels weaker than serializable.
3.3 Single-object vs. multi-object transactions && safe commit retries
Before the following content, two things need to be emphasized in advance: the objects a transaction operates on (single object vs. multiple objects), and how transaction commits are retried.
Single-object writes: The book presents two cases.
The first is a single transaction performing a long write, for example writing a 20KB JSON object. What happens if the connection breaks after 10KB have been written?
a. Will the database contain 10KB of unparseable dirty data?
b. Can the write continue from where it left off after recovery?
c. If another client reads this document, will it see the complete latest value after recovery, or a pile of garbled bytes?
The second case is an atomic read-modify-write, such as the counter increment in the figure above.
Solutions for this kind of single-object transaction are generally log replay (atomicity), locks (isolation), and CAS (isolation).
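As a rough illustration of the CAS approach on a single object, the sketch below retries a read-modify-write on an in-memory counter until no concurrent writer has slipped in between the read and the write. It is only a stand-in for a storage-level compare-and-set, not how any particular engine implements it:

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasCounter {
    private final AtomicLong value = new AtomicLong(0);

    // Increment via compare-and-set: if another writer changed the value between
    // our read and our write, the CAS fails and we retry with the fresh value.
    public long increment() {
        while (true) {
            long current = value.get();
            long next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CasCounter counter = new CasCounter();
        Runnable task = () -> { for (int i = 0; i < 1000; i++) counter.increment(); };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter.value.get()); // always 2000: no lost updates
    }
}
```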
Multi-object transactions: This type of transaction is more complex. In some distributed systems, for example, the objects being operated on may span threads, processes, partitions, or even systems. This means we face the problems specific to distributed systems mentioned in the previous article, and handling them is clearly harder. Some systems simply hand this "pot" to the user and let the application deal with it by itself, which means the application may have to handle both the intermediate results caused by the lack of atomicity and the concurrency problems caused by the lack of isolation. Of course, some systems do implement these so-called distributed transactions; the specific implementations are introduced later.
Another point that needs emphasis is retries. A core feature of transactions is that when an error occurs, the client can retry safely without leaving any side effects on the server, and traditional databases that truly implement ACID should follow this design semantics. But in practice, how do we make sure a retry really is safe? The book lists some possible problems and mitigations:
- Suppose the transaction was committed successfully, but a network failure occurred while the server was sending the acknowledgement. If the client retries without any additional means, duplicate data will appear. This requires the server or the application to provide extra attributes that distinguish message uniqueness (a transaction ID built into the server, or attribute fields of the business itself).
- The commit failed because the system is overloaded; blindly retrying only increases the burden. The client can apply some restrictions, such as exponential backoff, limiting the number of retries, or queueing requests on the client side (a minimal backoff sketch follows this list).
- Judge before retrying, and retry only on transient errors; if the application has already violated some defined constraint, retrying is pointless.
- If the transaction is a multi-object operation and may already have side effects in the system, a mechanism like "two-phase commit" is needed to implement the commit.
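A minimal client-side sketch of the exponential backoff idea mentioned above; the commit callable and the transient-error type are placeholders for whatever the real client actually exposes:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class RetryWithBackoff {
    // Retry a commit only on transient errors, doubling the wait each time, up to maxAttempts.
    static <T> T retry(Callable<T> commit, int maxAttempts) throws Exception {
        long backoffMillis = 100;
        for (int attempt = 1; ; attempt++) {
            try {
                return commit.call();
            } catch (TransientFailureException e) { // hypothetical transient-error type
                if (attempt >= maxAttempts) throw e;
                // Add jitter so many clients do not retry in lockstep.
                long sleep = backoffMillis + ThreadLocalRandom.current().nextLong(backoffMillis);
                Thread.sleep(sleep);
                backoffMillis *= 2;
            }
            // Non-transient exceptions (e.g. constraint violations) propagate immediately.
        }
    }

    static class TransientFailureException extends Exception {}
}
```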
3.4 Weak isolation levels
Transaction isolation has to solve concurrency problems, and concurrency boils down to two issues: timing and contention. Because the objects operated on by different transactions may conflict, and the timing between concurrent transactions is uncertain, operations on these contended objects can produce all sorts of strange results.
The different isolation levels essentially pay different costs to meet different degrees of strictness on ordering. We may not need to know exactly how each isolation level is implemented, but we should be very clear about which problems each level solves, so that we can make trade-offs between isolation levels and overhead with confidence. Rather than enumerating the isolation levels directly as the book does, we first describe the problems that concurrent transactions can cause, and then introduce which of those problems each isolation level solves.
Dirty reads
A dirty read means that a user can see the result of a transaction that has not yet been committed. The figure below shows what should hold when dirty reads are prevented: User1's transaction sets x=3 and y=3, but before that transaction commits, User2's get x must still return 2, because User1's transaction has not committed yet.
Why preventing dirty reads matters:
- For a single-object transaction, the client may see a value that is rolled back a moment later; if a decision is made based on that value, the decision is likely to be wrong.
- For a multi-object transaction, when the client accesses different systems, some of the data may already be updated while some is not, leaving the user at a loss.
Dirty writes
If one client overwrites another client's uncommitted write, we call this phenomenon a dirty write.
Here is an example. A used-car sale requires two database updates, but two users transact concurrently. If dirty writes are not prevented, as shown in the figure, the sales listing may show that the car was sold to Bob while the invoice is sent to Alice, because the two transactions overwrote each other's writes to the same two records.
Read skew (non-repeatable reads)
Straight to the example: Alice has 1000 yuan in total across two bank accounts, 500 in each. She now transfers 100 yuan from one account to the other and keeps watching both accounts to see whether the transfer succeeded. Unfortunately, when she checks the first account the transfer has not happened yet, but by the time she checks the other account the transfer has completed and that balance is exactly 100 less, so in the end it looks to her as if she has lost 100 yuan.
If this were the whole story, it would only be a temporary anomaly, and a later query would return the correct values. But if something else is done based on such a query, there can be real problems, for example selecting these records to make a backup in case the DB crashes. Unfortunately, if a crash really does happen later and the backup was made from this skewed read, the 100 yuan may be lost forever. In such cases, non-repeatable reads are not acceptable.
Lost updates
Here we can directly reuse the earlier example of two users updating a counter at the same time based on the old value; that is the typical lost update problem:
Write skew && phantom reads
This class of problem arises when a transaction's write depends on the result of an earlier read, and that result may be modified by other concurrent transactions.
In the example, two doctors, Alice and Bob, decide whether to go off call. The premise for the decision is that at least two doctors are currently on duty; if so, either one can safely go off call and then update the on-call record. However, because the snapshot isolation mechanism is used (described later), both transactions read the count as 2, both enter the modification phase, and the final result actually violates the premise of having two doctors on duty.
The root cause is a phenomenon called "phantom reads": of two concurrent transactions, one changes the query result of the other, and the query is usually an aggregate such as the count above, or max, min, etc. This kind of problem causes trouble in scenarios such as the following (a small sketch of the check-then-act race follows the list):
- Booking a meeting room
- Multiple players updating the same position in a game
- Claiming a unique username
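A toy sketch of the check-then-act race behind write skew, using the on-call example; the in-memory map stands in for rows read under a snapshot, and a real database would fix this by locking the rows the check depends on or by materializing the conflict:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class WriteSkewDemo {
    // doctor -> currently on call?
    static final Map<String, Boolean> onCall =
            new ConcurrentHashMap<>(Map.of("alice", true, "bob", true));

    // Each "transaction" checks the invariant (>= 2 doctors on call), then writes.
    // Both threads can pass the check on the same view, so both may go off call.
    static void goOffCall(String doctor) {
        long stillOnCall = onCall.values().stream().filter(b -> b).count(); // read phase
        if (stillOnCall >= 2) {
            onCall.put(doctor, false); // write phase: the invariant can now be violated
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> goOffCall("alice"));
        Thread t2 = new Thread(() -> goOffCall("bob"));
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Depending on interleaving this may print {alice=false, bob=false}: nobody is on call.
        System.out.println(onCall);
    }
}
```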
Above we have listed the problems that concurrent transactions may cause; now let's look at which of them each isolation level solves (Y = the anomaly is prevented at that level).

| Isolation level (typical implementation) | Dirty reads | Dirty writes | Read skew | Lost updates | Write skew (phantom reads) |
| --- | --- | --- | --- | --- | --- |
| Read committed (row locks, or remembering the old value) | Y | Y | N | N | N |
| Repeatable read (snapshot isolation, CAS) | Y | Y | Y | Maybe | N |
| Serializable (2PL pessimistic locking, or SSI optimistic locking) | Y | Y | Y | Y | Y |
3.5 Chapter Summary
Transactions use their ACID properties to shield users from a whole class of error handling. Atomicity gives the user an environment where retries are safe and leave no side effects on the system; consistency lets the program satisfy its invariants to a certain extent; isolation resolves, through different isolation levels, the various anomalies caused by transaction concurrency in different scenarios, where each level solves different problems at a different cost and the choice should be made on demand; finally, durability lets users write data into the system we design with peace of mind.
In general, a transaction guarantees consistency across different operations: a perfectly implemented transaction makes it look as if only one transaction is ever at work, performing a single atomic operation at a time, so what transactions provide is consistency of operations. In this chapter we mostly talked about single-node transactions. Next we broaden the scope: distributed systems have the same problems, plus the replication lag described earlier, so even an operation that appears to target a single object may involve multiple replicas, which makes the problems we face more complicated. In the next chapter we focus on another kind of consistency problem and its solutions.
4. Internal consistency and consensus
4.1 The problem of replication lag
Let's first return to the replication lag mentioned in the previous article. The most intuitive problem caused by lag is that if a client issues a read while replication is in progress, different clients may read different data. The book describes three different types of consistency problems; let's look at each case:
The first figure shows a user updating and then checking the result of the update. For example, a user posts a comment on a blog. The database behind the service uses purely asynchronous replication: the data is written to the master node and the comment is reported as successful, then the user refreshes the page to see how much resonance or follow-up his comment attracts. Because this read hits a slave node, he finds that the comment he just wrote has "disappeared". If the system avoids this situation, we say it provides "read-after-write consistency" (read-your-writes consistency).
The above is a user viewing his own update; the next figure shows another situation. A user also writes a comment, and the module is still implemented with purely asynchronous replication. Another user sees User1234's comment on the first page refresh, but on the next refresh the comment disappears again, as if the clock had been turned back. If the system can guarantee that this does not happen, it provides "monotonic reads" consistency (for example, Tencent Sports score and detail pages).
In addition to these two cases, there is one more case, as shown in the following figure:
This case looks even more absurd than the previous examples. There are two writing clients: Poons asks a question and MrsCake answers it. In terms of order, MrsCake answered after seeing Poons's question, but the question and the answer happen to fall into two different partitions of the database. For the Observer below, the replication lag of Partition1's leader is much larger than that of Partition2, so the Observer sees the answer before the question exists, which obviously violates causality. If this problem is avoided, the system is said to provide "consistent prefix reads".
In the previous article we introduced a way to detect this kind of causality. In summary, one consequence of replication lag is that the system only has eventual consistency, and this eventual consistency can significantly hurt the user experience. Although the three examples above represent different consistency guarantees, they have one thing in common: the problems all stem from replication lag, i.e. from multiple clients (or even a single client) reading and writing multiple replicas. We call this type of consistency problem "internal consistency" (memory consistency); it characterizes the data inconsistencies caused by the timing of reads and writes across multiple replicas.
4.2 Overview of Internal Consistency
In fact, internal consistency is not a problem unique to distributed systems. In the multi-core field it is also called memory consistency, and it is about agreeing on how processors cooperate: if certain consistency guarantees hold among multiple processors, then certain promises can be made about the data they process and the order of operations, and application developers can make assumptions about their systems based on those promises. As shown below:
Each CPU logical core has its own independent registers and L1/L2 caches. As a result, in concurrent programming, when each thread modifies a variable at some main-memory address, it may first modify its own cache, and reads also hit the cache first. This is quite similar to multiple clients reading and writing multiple replicas in a distributed system, except that in a distributed system the granularity is an operation, while for processors the granularity is an instruction. There are several common models of multiprocessor memory consistency.
As can be seen, the core distinction between these consistency models is how strictly they constrain ordering under concurrency. In more formal terms, linearizability requires a "total order", while the other consistency levels require only some kind of "partial order", meaning that some concurrent operations are allowed to execute in any interleaving without their relative order being fixed.
4.3 By analogy: internal consistency in distributed systems
As shown below:
Internal consistency in distributed systems is mainly divided into four categories: linearizability --> sequential consistency --> causal consistency --> processor consistency; and from the perspective of partial vs. total order, it is divided into strong consistency (linearizability) and eventual consistency.
Note, however, that the levels in the figure that are weaker than strong consistency are not the same as plain eventual consistency, which offers no partial-order guarantee at all; they are constrained by some partial order and therefore give stronger guarantees than mere eventual consistency. For concrete examples of the other levels, refer to Chapter 2 of "Big Data Daily Knowledge", which explains them fairly clearly; in this chapter we focus on strong consistency.
4.4 What we call "strong consistency": linearizability
A system that satisfies linearizability appears to have only one replica, so we can safely read data from any replica and carry on with our application. Here is an example to illustrate the constraints of linearizability, as shown in the following figure:
Three clients operate on the key x concurrently; the book calls this key a register. The operations on the register are:
- write(x, v) => r: try to set the value of x to v, returning the result r of the update.
- read(x) => v: read x and return its value v.
As shown in the figure, while C updates the value of x, A and B repeatedly query the latest value of x. The unambiguous results are: ClientA's first read(x) happens before ClientC updates x, so it must return 0; ClientA's last read happens after ClientC has successfully updated x, so it must return 1. The remaining reads, whose order relative to write(x,1) is uncertain (concurrent), may return either 0 or 1. For linearizability, we add the following requirement:
In a linearizable system, there must be a point in time between the invocation of the write and its return at which reads start returning the new value; once any client has read the new value, all subsequent reads must also return the new value. (The operations in the figure then form a strict order: ClientA read -> ClientB read -> ClientC write -> ClientA read -> ClientB read -> ClientA read.) To make this clearer, the book refines the example further by adding one more operation:
- cas(x, v_old, v_new) => r: if the current value of x is v_old, set x to v_new, returning the result of the update.
As shown in the figure, each number marks the point in time at which the corresponding operation takes effect. Linearizability requires that if these points are connected, the resulting line must only move forward in time and never jump back (read(x)=2 does not satisfy linearizability, because x=2 lies to the left of x=4).
4.5 When is linearizability required?
If it is just the ordering of comments in a forum, or scores jumping back and forth as a sports page refreshes, the consequences are not fatal. But in some scenarios, a non-linearizable system can have much more serious consequences.
- Locking && leader election: In the master-slave replication model there must be exactly one master node to receive all write requests. Leader election is generally implemented with a lock, and if the lock service we depend on is not backed by linearizable storage, the lock may flip-flop and lead to the "split brain" phenomenon, which is absolutely unacceptable. Therefore, the storage module of a distributed lock service used for leader election must be linearizable (and, generally speaking, metadata storage also needs to be linearizable).
- Constraints and uniqueness guarantees: This scenario is also obvious, e.g. unique IDs, primary keys, names, etc. Without the strict ordering promise of linearizable storage, it is easy to break uniqueness constraints and cause many strange phenomena and consequences.
- Timing dependencies across channels (systems): Besides ordering within one system, operations may span different systems, and there is also a need to constrain the ordering between them. The book gives such an example.
For example, when a user uploads a picture, a back-end storage service may generate a low-resolution version from the full-size picture to improve the user experience. Since MQ is not suitable for carrying large byte streams such as pictures, the full-size picture is sent directly to the back-end storage service (step 2), while the resize job is triggered asynchronously in the background via MQ. This requires the file storage written in step 2 to be linearizable; otherwise, when the low-resolution image is generated, the original may not be found, or only half of it may be read, which is definitely not what we want.
Linearizability is not the only way to avoid races. As with transaction isolation levels, the required strictness of concurrent ordering differs by scenario, which gives rise to internal consistency levels of different strengths; different levels also carry different overheads, and users must make their own trade-offs.
4.6 Implementing a linearizable system
Having explained what a linearizable system is good for, let's consider how to implement one.
By the definition of linearizability above, the system must look to the outside world as if it had only one replica, so the simplest idea is to literally use only one replica. But that defeats the purpose of a distributed system: a large part of the reason for having multiple replicas is fault tolerance, and multiple replicas are maintained through replication. So let's check whether the replication methods from the previous sharing can give us a linearizable system:
- Master-slave replication (partially achievable): With synchronous replication the system can indeed be linearizable, but some extreme cases may still violate linearizability, for example split brain during membership changes; and if asynchronous replication is used, failover can violate both durability (a transactional property) and linearizability (an internal consistency property).
- Consensus algorithms (linearizable): Consensus algorithms, introduced later, are similar to master-slave replication but use a stricter negotiation mechanism that avoids the split-brain problems master-slave replication may run into, so linearizable storage can be implemented relatively safely on top of them.
- Multi-master replication (cannot be linearized).
- Masterless replication (probably not linearizable): It depends mainly on the specific Quorum configuration. Under the definition of strong consistency, the following figure shows an example that satisfies a strict Quorum yet still fails to be linearizable.
The cost of linearizability: enter the CAP theorem
In the last sharing we talked about the unreliability of the network in distributed systems. Once the network is partitioned (P), the state of the replicas can no longer stay linearizable, and we must choose: either keep serving and possibly return stale values (A), or wait for the network to recover in order to preserve linearizable consistency of state (C). This is the famous CAP.
In practice, though, the definition of the CAP theorem is rather narrow: its C means only linearizability, and its P means only network partition (a complete disconnection, not mere delay). There is actually plenty of room for compromise that can fully satisfy our systems, so don't be superstitious about this theorem; you still need to analyze your actual situation concretely.
Step by step: building a linearizable system
From the definition of linearizability we know that determining order is the key to a linearizable system. Following the book's line of thought, let's see step by step how we can define the order of these concurrent operations.
a. Capturing causality
Similar to what we shared last time, concurrent operations come in two kinds: some have a natural logical causal relationship, while for others no order can be determined. Here we first try to capture the operations that do have a causal relationship, achieving causal consistency. Capturing causality means storing all the causal relationships among operations in the database (system), which can be done with something like version vectors (if you have forgotten, look back at the example of two people concurrently operating a shopping cart in the previous article).
b. Turning passive into active: defining order up front
Passively capturing causality without any restriction brings huge runtime overhead (memory, disk). Even though these relationships can be persisted to disk, they still have to be loaded into memory for analysis. This leads to another thought: can we mark each operation so as to define such a causal order directly?
The simplest way is to have a global sequence-number issuer generate sequence numbers that define the causal order between operations. For example, if A must happen before B, make sure A's sequence number is smaller than B's. Other concurrent operations have no hard ordering constraint, but the relative order of operations on each node remains unchanged; this way we not only achieve causal consistency but actually strengthen it.
c. Lamport timestamps
The idea above is appealing, but reality always exceeds our imagination. The approach is easy to implement in the master-slave replication mode, but with a multi-master or masterless replication model it is hard to build this global sequence-number generator. The book lists some possible approaches whose goal is simply to generate unique sequence numbers, such as:
- Each node generates its own sequence numbers.
- A timestamp is attached to each operation.
- Pre-allocate to each partition the range of sequence numbers it is responsible for generating.
In fact, all of the above may break the partial-order promise of causality, because different nodes have different loads, different clocks, and different frames of reference. Here our concurrency god Lamport takes the stage: he single-handedly created the Lamport logical timestamp, which neatly solves all of the problems above. As shown below:
I first met Lamport timestamps in a graduate distributed-systems course. Picking them up again today, with context, I understand them a bit better. Simply put, the definition uses logical variables to define dependencies: each timestamp is a pair <Counter, NodeId>, with the following comparison rules:
- Compare the Counter first; the larger Counter is defined to happen later (this promises a strict partial order).
- If the Counters are equal, compare the NodeId directly; the larger one is defined to happen later (a concurrent relationship).
If there were only these two comparison rules, the causal-order violations above could not be fixed. What makes this algorithm different is that the node's Counter value is embedded in the response of every request. For example, for A in the figure, when the second "update max" request reaches Node2, the current c=5 is returned, so the client updates its local Counter to 5 and increments it by 1 for the next request. In this way, the Counter carried along with requests preserves the causal partial order of the variables on each replica; writes that go to two different nodes are simply defined as concurrent, and NodeId is used to define their order.
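A minimal sketch of the Lamport-clock rules just described (node IDs and the message shapes are made up for illustration):

```java
public class LamportClock {
    private final int nodeId;
    private long counter = 0;

    public LamportClock(int nodeId) { this.nodeId = nodeId; }

    // A local event or a send: bump the counter and stamp the event.
    public synchronized long[] tick() {
        counter++;
        return new long[] { counter, nodeId };
    }

    // On receiving a message (or a response carrying the server's counter):
    // take the max of the local and remote counters, then increment.
    public synchronized long[] onReceive(long remoteCounter) {
        counter = Math.max(counter, remoteCounter) + 1;
        return new long[] { counter, nodeId };
    }

    // Comparison: the larger counter is "later"; ties are broken by node id,
    // which orders concurrent events arbitrarily but consistently everywhere.
    public static int compare(long[] a, long[] b) {
        if (a[0] != b[0]) return Long.compare(a[0], b[0]);
        return Long.compare(a[1], b[1]);
    }
}
```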
d. Can we achieve linearizability now? Total order broadcast
At this point we can confirm that with Lamport timestamps we can achieve causal consistency, but we still cannot achieve linearizability, because we also need to make this total order known to all nodes; otherwise we may not be able to make decisions.
Take the unique-username case: suppose A, B, and C try to register the same username at the same time. With Lamport timestamps, among the three concurrent requests the one submitted first succeeds and the others fail. But we only know about A, B, and C because we have a "God's-eye view"; an individual request does not know, at the time it is sent, that other requests exist (different requests may be sent to different nodes), so the system has to do this collection work, e.g. a coordinator that keeps asking every node whether such a request exists. If one node fails during that query, the system cannot safely determine the response for each request. It is therefore better for the system to broadcast this order to every node, so that each node truly knows the total order and can make decisions directly.
With only a single-core CPU the order is naturally total, but now we need to achieve this total order broadcast across multiple cores and multiple machines in a distributed setting, which brings challenges. The two main ones are:
- Multiple machines
- Distribution (partial failures)
For multiple machines, the easiest way to implement total order broadcast is to use the master-slave replication mode: let the master node define the order of all operations and then broadcast them to each slave node in that same order. For a distributed environment, partial failures must be handled, i.e. if the master node fails, the membership change has to be dealt with. Let's look at how the book solves this problem.
The total order here generally means total order within a partition; total order across partitions requires additional work.
For total order broadcast, the book gives two invariants:
- Reliable delivery: messages must be delivered all-or-nothing (recall the previous chapter).
- Strict ordering: messages must be delivered to every node in exactly the same order.
Implementation ideas
Let's sketch simple implementation ideas against the two invariants above. First, reliable delivery, which has two aspects:
- Messages cannot be lost
- A message cannot be delivered to only some of the nodes
"Messages cannot be lost" means that if some nodes fail, we need to retry; and for a retry to be safe, a failed broadcast must leave no side effects on the system, otherwise we end up with a message delivered to only part of the nodes. The atomicity of transactions from the previous chapter solves exactly this problem, so transactions are needed here; but unlike before, this scenario is a distributed system where the message is sent to multiple nodes, so it has to be a distributed transaction (the familiar 2PC is indispensable).
The other invariant is strict ordering. Essentially we need a data structure that preserves order: since operations are appended in time order, a log solves this. Here comes another frequently mentioned technique, the replicated state machine, a concept I first saw in the Raft paper: starting from the same initial value a, if the operations A, B, C, D, E are applied in the same order, the final result must be the same on every replica. It is therefore easy to imagine that a real implementation of total order broadcast will use the log data structure.
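A toy sketch of the replicated state machine idea: every replica applies the same append-only log of operations in the same order and therefore ends in the same state (the operations here are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class ReplicatedStateMachine {
    // The totally ordered, append-only log of operations (here: functions on a long).
    static final List<UnaryOperator<Long>> log = new ArrayList<>();

    // A replica deterministically replays the log from the same initial state.
    static long replay(long initialState) {
        long state = initialState;
        for (UnaryOperator<Long> op : log) {
            state = op.apply(state);
        }
        return state;
    }

    public static void main(String[] args) {
        log.add(s -> s + 5);   // operation A
        log.add(s -> s * 2);   // operation B
        log.add(s -> s - 3);   // operation C

        // Every replica that replays the same log in the same order gets the same value.
        System.out.println(replay(1)); // 9 on every replica
    }
}
```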
e. Implementing linearizable storage
Now assume we already have total order broadcast, and let's keep moving toward our goal of linearizable storage. First we need to clarify one thing: linearizability is not equivalent to total order broadcast, because in distributed system models we usually assume an asynchronous or semi-synchronous model, which makes no promise about when the totally ordered messages are successfully delivered to the other nodes. So some extra work is needed on top of total order broadcast to truly obtain a linearizable system.
The book again uses the unique-username example: it can be implemented with a linearizable CAS operation, where the user creates the username if and only if the old value is empty. To realize such a linearizable CAS, total order broadcast plus a log is used directly (a code sketch follows the read-linearization options below):
- Write a message to the log indicating the username you want to claim.
- Read the log: the message is broadcast to all nodes and you wait for it to be delivered back (synchronous replication).
- If the first claim for this username that comes back is your own message, commit it and return success; otherwise, if the first claim came from another node, return failure directly to the client.
These log entries are broadcast to all nodes in the same order, and if concurrent writes occur, every node makes the same decision about which claim wins the username. With this, we have implemented a linearizable CAS, i.e. linearizable writes. For read requests, however, because the log may be applied asynchronously, a client may still read a stale value, and some extra work is needed to linearize reads:
- Obtain the position of the current latest message in a linearizable way, make sure all messages up to that position have been applied, and then read (sync() in ZK).
- Append a marker message to the log and perform the actual read when that message is delivered back; the message's position in the log then determines the point in time at which the read takes effect.
- Read from a replica that is kept synchronously up to date.
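Putting the write path above together, here is a toy sketch of the username-claim decision over a totally ordered log (the in-memory list simply stands in for a real total-order-broadcast channel):

```java
import java.util.ArrayList;
import java.util.List;

public class UsernameClaimLog {
    record Claim(String username, String nodeId) {}

    // Stand-in for a log delivered to every node in the same total order.
    private final List<Claim> log = new ArrayList<>();

    public synchronized void append(Claim claim) {
        log.add(claim);
    }

    // Linearizable-CAS-style decision: scan the log in delivery order and
    // succeed only if our own claim is the first one for this username.
    public synchronized boolean decide(Claim myClaim) {
        for (Claim c : log) {
            if (c.username().equals(myClaim.username())) {
                return c.nodeId().equals(myClaim.nodeId());
            }
        }
        return false; // our claim has not been delivered yet
    }

    public static void main(String[] args) {
        UsernameClaimLog system = new UsernameClaimLog();
        Claim a = new Claim("alice", "node-1");
        Claim b = new Claim("alice", "node-2");
        system.append(a); // delivered first in the total order
        system.append(b);
        System.out.println(system.decide(a)); // true  -> node-1 wins the name
        System.out.println(system.decide(b)); // false -> node-2 must return failure
    }
}
```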
4.7 Consensus
When we implemented the linearizable system above, there was already a hint of consensus: multiple nodes need to agree on a proposal, and once agreement is reached it cannot be revoked. In reality, problems in many scenarios are equivalent to the consensus problem:
- Linearizable CAS
- Atomic transaction commit
- Total order broadcast
- Distributed locks and leases
- Membership coordination
- Uniqueness constraints
In fact, finding a solution to any of the above problems is equivalent to achieving consensus.
Two-phase commit
a. How it works
The book uses atomic commit as the entry point for discussing consensus. I won't dwell on that and will introduce two-phase commit directly. According to the book, two-phase commit can itself be regarded as a consensus algorithm, but in practice we prefer to treat it as a core building block for better consensus algorithms and for distributed transactions (consensus algorithms such as Raft actually contain semantics similar to two-phase commit).
The algorithm itself is relatively simple: there are two phases and a coordinator that collects information and makes the decision. The two phases are:
- The coordinator sends a prepare request to the participants, asking whether they can commit. If a participant answers "yes", it is promising that it will be able to commit the message or transaction.
- If the coordinator receives confirmations from all participants, the transaction is committed in the second phase; otherwise, if any party answers "no", the transaction is aborted.
This looks like a very plain algorithm, nothing more than one extra preparation phase before a normal commit, so why can it achieve atomic commit? The answer lies in the contractual promises within the algorithm. Let's break the process down further:
- When a distributed transaction starts, a transaction ID is requested from the coordinator.
- The application executes a single-node transaction on each participating node and attaches this ID to the operations. At this point everything is still single-node reads and writes, and the transaction can be aborted safely if a problem occurs (single-node transaction guarantee).
- When the application is ready to commit, the coordinator sends Prepare to all participants; if any of these requests fails or times out, the transaction is aborted.
- When a participant receives the request, it writes the transaction data to persistent storage and checks for constraint violations, etc. Here the first promise appears: once a participant replies "yes" to the coordinator, it must never back out of the transaction.
- When the coordinator has received replies from all participants, it makes the decision based on them. If it received only affirmative votes, it writes the "commit" decision to its own local persistent storage. Here the second promise appears: the coordinator must keep committing the transaction until it succeeds.
- If an exception occurs during the commit phase, the coordinator must keep retrying until the commit succeeds.
It is precisely because of these two promises that 2PC can achieve atomicity, and that is the essence of this protocol.
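A heavily simplified, in-memory sketch of the coordinator-side decision logic described above (no persistence, timeouts, or recovery, all of which a real 2PC implementation must have):

```java
import java.util.List;

public class TwoPhaseCommitCoordinator {
    interface Participant {
        boolean prepare(long txId);          // "yes" means: I promise I can commit txId
        void commit(long txId);
        void abort(long txId);
    }

    // Phase 1: ask everyone to prepare. Phase 2: commit only if all said yes.
    static boolean runTransaction(long txId, List<Participant> participants) {
        boolean allYes = true;
        for (Participant p : participants) {
            if (!p.prepare(txId)) {          // any "no" (or, in reality, a timeout) aborts
                allYes = false;
                break;
            }
        }
        // In a real coordinator the decision is persisted to local storage *before*
        // phase 2, and commit/abort is retried until every participant acknowledges.
        if (allYes) {
            for (Participant p : participants) p.commit(txId);
        } else {
            for (Participant p : participants) p.abort(txId);
        }
        return allYes;
    }
}
```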
b. Limitations
- The coordinator needs to keep state: since it must guarantee that the transaction is committed once it has decided to commit, its decision must be persisted.
- The coordinator is a single point of failure: if it fails and cannot recover, the system does not know whether to commit or roll back, and an administrator has to handle it.
- The prepare phase requires all participants to vote in favor before the commit can continue, so with too many participants the transaction is very likely to fail.
A simpler definition of consensus
After this special case, the book summarizes several properties of a consensus algorithm:
- Uniform agreement: all nodes accept the same proposal.
- Integrity: once nodes have decided, they cannot go back, and a proposal cannot have two different decisions.
- Validity: if a value v is decided, then v must have been proposed by some node.
- Termination: every node that does not crash must eventually reach the decision.
If we compare 2PC against these properties, it can indeed be considered a consensus algorithm, but that is not the important part; what matters is the inspiration these properties give us.
The first three properties specify safety. Without fault-tolerance requirements we could simply designate a Strong Leader and let it act as the coordinator; however, just as with the limitations of 2PC, a problem with the coordinator leaves the system unable to make progress, so an additional mechanism is needed to handle leader changes (which itself relies on consensus). The fourth property determines liveness. As mentioned in the previous article, safety must be guaranteed first, and liveness can only be guaranteed under preconditions; the book gives the conclusion directly: the precondition for termination is that a majority of nodes operate correctly.
Consensus Algorithms and Total Order Broadcasting
In practice, final algorithm designs do not reach consensus on every message separately according to the four properties above; instead they directly adopt total order broadcast, which promises that messages are delivered to every node in the same order, exactly once. This is equivalent to running multiple rounds of consensus: in each round, nodes propose the message they want to send next and then decide the next message in the total order. The advantage of implementing consensus via total order broadcast is higher efficiency than one-value-at-a-time consensus (ZAB, Raft, Multi-Paxos).
Discussion
A few more things are worth discussing here. First, from the implementation point of view, the master-slave replication model fits consensus algorithms particularly well, but as noted when master-slave replication was introduced, the model alone is not enough to solve the consensus problem. There are two main gaps:
- How to determine the new master if the master node hangs
- How to prevent split brain
These two problems are in turn solved by consensus itself. Consensus algorithms use an epoch to mark logical time, such as the Term in Raft or the Ballot number in Paxos; if two nodes both claim to be master after an election, the node with the newer epoch wins.
Similarly, before the master node makes a decision, it must also check whether a node with a higher epoch is making decisions at the same time, which would mean a conflict may occur (in lower Kafka versions only the Controller carries such a marker; in later versions data partitions carry a similar marker as well). At this point the node cannot decide anything based only on its own information: it must collect votes from a quorum of nodes. The master sends the proposal to all nodes and waits for responses from a quorum, and a node votes for the current proposal only if it is not aware of any master with a higher epoch.
Looking at this closely, it involves two rounds of voting, and using a quorum exploits the overlap between the two rounds: if a proposal is approved, the nodes that voted for it must also have participated in the most recent leader election, from which we can conclude that the master has not changed in the meantime, and it is safe to vote for this master's proposal.
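A minimal sketch of the epoch check an accepting node might perform (loosely modeled on the "promise" idea in Paxos-style protocols; the names and shapes here are made up):

```java
public class EpochAcceptor {
    private long highestEpochSeen = 0;

    // Leader-election phase: a candidate with a newer epoch wins our promise,
    // and from then on we refuse proposals from older epochs.
    public synchronized boolean promise(long candidateEpoch) {
        if (candidateEpoch > highestEpochSeen) {
            highestEpochSeen = candidateEpoch;
            return true;
        }
        return false;
    }

    // Proposal phase: vote for the proposal only if no leader with a higher
    // epoch has been seen in the meantime.
    public synchronized boolean accept(long proposerEpoch, String value) {
        if (proposerEpoch >= highestEpochSeen) {
            highestEpochSeen = proposerEpoch;
            return true; // in a real system the value would also be persisted here
        }
        return false;
    }
}
```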
Besides, at first glance consensus algorithms seem to be all upside, but there is always a price behind something that looks this good:
- Before a resolution is reached, the voting among nodes is essentially synchronous replication; consensus avoids losing messages at this cost, which is a trade-off between performance and linearizability (CAP).
- Most consensus systems assume a fixed set of nodes, which means members cannot be changed dynamically at will; dynamic membership changes are only possible after a deep understanding of the system (some systems outsource membership changes).
- Consensus is extremely sensitive to the network. Fault detection generally relies on timeouts, so network jitter can trigger pointless leader elections and may even make the system unavailable.
Outsourcing consensus
You could implement a consensus algorithm yourself following the description above, but the cost may be huge; the best approach is usually to outsource this capability and use a mature system. If you really must implement it yourself, use a proven algorithm rather than inventing your own. Systems such as ZK and etcd provide exactly this kind of service: they implement linearizable storage via consensus and also expose consensus semantics to the outside, so we can rely on them to meet all sorts of requirements (a small ZooKeeper example follows the list below):
- Linearizable CAS
- Total ordering of operations
- Failure detection
- Configuration changes
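For instance, a uniqueness or lock-style guarantee can be delegated to ZooKeeper by racing to create a znode: exactly one client's create succeeds, everyone else gets NodeExists. A rough sketch with the plain ZooKeeper Java client (the connection string and paths are placeholders, and the parent znode is assumed to exist):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkUniqueName {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> {});
        try {
            // The create is an atomic "claim if absent": exactly one client succeeds.
            // (Using EPHEMERAL instead of PERSISTENT is the usual basis for
            // distributed locks and leader election, since the claim dies with the session.)
            zk.create("/usernames/alice", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            System.out.println("username claimed");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("username already taken");
        } finally {
            zk.close();
        }
    }
}
```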
4.8 Chapter Summary
This chapter has spent a lot of effort explaining the other consistency problem of distributed systems: internal consistency, which is mainly caused by replication lag. We first introduced where the problem comes from, then mapped it into distributed systems and classified the different consistency levels.
We then discussed strong internal consistency in detail, including its definition, usage scenarios, and implementation, and from it derived total order vs. partial order, the capture and definition of causality (Lamport timestamps), total order broadcast, and 2PC, finally arriving at consensus — which is enough to show how involved this kind of consistency is.
5. Let’s talk about distributed systems again
So far, starting from the topic of replication, we have discussed the replication models, challenges, transactions, and consensus of distributed systems. Combining the content of the two articles, I will try to give a more detailed description of distributed systems: first the characteristics and problems, and then the corresponding solutions.
- Like a single-node system, a distributed system has multiple clients performing various operations on it at the same time. Each operation may involve one or more objects, and these concurrent client operations can cause correctness problems.
- To achieve fault tolerance, the data of a distributed system generally has multiple copies, which are kept in sync by replication between the different replicas.
Common replication models include:
- master-slave mode
- multi-master mode
- masterless mode
From the perspective of timeliness and linearizability, replication can be divided into:
- synchronous replication
- Asynchronous replication
- Asynchronous replication has lag, which causes various internal consistency issues.
Compared with single-node systems, distributed systems have two unique characteristics:
- Partial failures
- The lack of a global clock
Faced with so many problems, if an ideal distributed data system could ignore performance and other costs, the system we would like to build looks like this:
- The data of the whole system appears to the outside world as if there were only one copy, so users never have to worry about any inconsistency when changing some state (linearizability).
- The whole system appears as if only one client were operating it, so there is no need to worry about conflicts with other concurrent clients (serializability).
So linearizability and serializability are two orthogonal branches, representing the highest level of internal consistency and the highest level of external consistency respectively. If both were truly achieved, the system would be very easy for users to operate; unfortunately, reaching the highest level on both axes carries a very large cost, so various weaker internal and external consistency levels are derived from these two branches.
Using the definitions on Jepsen's website, internal consistency constrains single operations on a single object (which may have multiple replicas) to obey a total order in time, while external consistency constrains multiple operations on multiple objects. This is similar to concurrent programming in Java: internal consistency resembles volatile or Atomic variables constraining how multiple threads operate on the same variable, while external consistency resembles synchronized or the locks in AQS ensuring that multi-threaded access to a block of code (multiple operations on multiple objects) behaves as the programmer expects.
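A tiny Java illustration of that analogy (purely illustrative, not a claim about how any database implements either guarantee):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ConsistencyAnalogy {
    // "Internal consistency" analogy: a single shared variable whose reads and writes
    // are ordered for all threads, like a single linearizable register.
    private final AtomicInteger register = new AtomicInteger(0);

    private int balanceA = 500;
    private int balanceB = 500;

    public void writeRegister(int v) { register.set(v); }
    public int readRegister()        { return register.get(); }

    // "External consistency" analogy: a multi-step, multi-object operation made
    // atomic and isolated with a lock, like a (serializable) transaction.
    public synchronized void transfer(int amount) {
        balanceA -= amount;
        balanceB += amount;
    }
}
```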
Note, however, that in a distributed system these two kinds of consistency are not completely isolated. We generally use a consensus algorithm to achieve linearizability, and in the process of implementing consensus a single operation may itself involve multiple objects, because an operation in a distributed system often acts on multiple replicas. That is to say, distributed transactions like 2PC are also used in solving the consensus problem (although the book also calls it consensus, what it really provides is something like transaction atomicity) — just as in Java concurrent programming we may use some volatile variables inside a synchronized method.
Conversely, 2PC is not all there is to distributed transactions: some cross-partition transactions also need linearizable operations to keep operations on a single object consistent. In other words, to fully build a distributed system, these two kinds of consistency depend on and complement each other; only by thoroughly understanding their core roles can we apply these seemingly dry terms with ease in practice.