
foreword

Here's what happened: a reader saw one of my articles, disagreed with the views in it, and the following exchange took place.

Maybe it was the doge emoji I sent that made this reader feel I didn't respect him. So, in a fit of rage, he deleted me as a friend, and right before doing so he even told me to go home and farm.

To be honest, when you say I'm bad at this, I'll admit it. But telling me to go home and farm? That I don't understand. Why farming? Isn't raising pigs more profitable than farming?

I thought about it for a long time and couldn't figure it out. Then, when I saw a certain piece of news, I suddenly understood the reader's good intentions.

So I decided to write this article and properly analyze the points this reader raised.

Reader's point of view

The reader made the following points:

  1. The underlying implementation of the volatile keyword is the lock instruction
  2. The lock instruction triggers the cache coherence protocol
  3. JMM is guaranteed by the cache coherence protocol

I will give my opinion first:

  1. I think the first point is correct; as I said in my volatile article, the underlying implementation of volatile is the lock prefix instruction
  2. I think the second point is wrong
  3. I think the third point is wrong

As for why I think so, I'll give my reasons; after all, we're all reasonable people, right?

The reader's points all revolve around the "cache coherence protocol", so let's start with the cache coherence protocol!

Literally, the cache coherence protocol is "a protocol for solving the cache inconsistency problem of the CPU". Taking this sentence apart, there are several questions:

  1. Why does the CPU need a cache during operation?
  2. Why is the cache inconsistent?
  3. What are the ways to solve the problem of cache inconsistency?

We analyze them one by one.

Why is a cache needed?

The CPU is a computing unit, mainly responsible for calculations;

Memory is a storage medium responsible for storing data and instructions;

In the days of no cache, the CPU and memory worked together like this:

In one sentence: the CPU runs at high speed, but fetching data from memory is very slow, which seriously wastes the CPU's performance.

So what can be done?

In engineering, there are two main ways to solve a speed mismatch: physical adaptation and spatial buffering.

Physical adaptation is easy to understand, and multi-stage mechanical gears are a typical example of physical adaptation.

As for spatial buffering, it is widely used in both software and hardware, and the CPU's multi-level cache is a classic example.

What is the CPU multi-level cache?

To put it simply, based on the formula time = distance / speed, setting up several layers of cache between the CPU and memory shortens the distance data has to travel, so the speeds of the CPU and memory are better matched.

Because the cache sits close to the CPU and is better structured, the time it takes the CPU to fetch data is shortened, which improves CPU utilization.

At the same time, because the CPU's data and instruction accesses exhibit temporal locality and spatial locality, once a cache is present, repeated operations on the same data can keep intermediate results in the cache, further amortizing the cost in time = distance / speed and improving CPU utilization even more.

Temporal locality: if a piece of information is accessed, it is likely to be accessed again in the near future.

Spatial locality: if a memory location is referenced, locations near it are likely to be referenced in the near future as well.
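
To make the two kinds of locality concrete, here is a minimal C sketch (my own illustration, not from the original text): walking an array sequentially exploits spatial locality, and reusing the same accumulator on every iteration exploits temporal locality.

 #include <stddef.h>

/* Sums an array. The sequential walk over arr exploits spatial locality:
 * once arr[i] is loaded, its neighbors are already in the same cache line.
 * The accumulator sum is touched on every iteration, which is temporal
 * locality: it stays hot in a register or the L1 cache. */
long sum_array(const int *arr, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += arr[i];
    }
    return sum;
}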

Why is the cache inconsistent?

The emergence of the cache has greatly improved the utilization of the CPU.

In the single-core era, the CPU enjoyed the convenience brought by the cache without having to worry about data inconsistency. But all of this was premised on there being a single core.

The arrival of the multi-core era has disrupted this balance.

After entering multi-core, the first question that needs to be faced is: Do multiple CPUs share a set of caches or do they each have a set of caches?

The answer is "each has its own set of caches".

Why?

Let's make an assumption, what happens if multiple CPUs share a set of caches?

If a single set of caches were shared, then because the low-level cache (the cache closest to the CPU) is very small, the CPUs would spend most of their time waiting for their turn to use it, which means they would effectively work serially. And once they become serial, multi-core loses its essential point: parallelism.

Having shown by contradiction that multiple CPUs cannot share one set of caches, the only option is to give each CPU its own private set of caches.

So, the cache structure of multiple CPUs becomes like this (simplified multi-level cache):

Although this design solves the problem of multiple processors preempting the cache, it also brings a new problem, which is the headache of data consistency:

Specifically, if multiple CPUs use a certain piece of data at the same time, the data may be inconsistent due to the existence of multiple sets of caches.

We can see the following example:

  1. Suppose there is age=1 in memory
  2. CPU0 performs the age+1 operation
  3. CPU1 also performs the age+1 operation

If there are multiple sets of caches, in a concurrent scenario, the following situations may occur:

As you can see, the two CPUs each add one to age=1 at the same time. Because there are multiple sets of caches, the CPUs cannot see each other's modification; the data becomes inconsistent and the final result is not the expected value.
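
To see the symptom on real hardware, here is a minimal C/pthreads sketch (my own illustration; the variable and function names are made up). Two threads increment the shared age variable without any synchronization, so read-modify-write updates can be lost and the final value usually falls short of the expected one. Strictly speaking the lost updates come from the non-atomic read-modify-write rather than from the caches alone, but the observable effect matches the picture above.

 #include <pthread.h>
#include <stdio.h>

static int age = 1;  /* shared "age" variable, as in the example above */

/* Each thread increments age many times without any synchronization,
 * so increments from the two threads can be lost. */
static void *add_one(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        age = age + 1;   /* non-atomic read-modify-write */
    }
    return NULL;
}

int main(void) {   /* build with: cc -pthread race.c */
    pthread_t t0, t1;
    pthread_create(&t0, NULL, add_one, NULL);
    pthread_create(&t1, NULL, add_one, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    /* Expected 200001, but the printed value is usually smaller. */
    printf("age = %d\n", age);
    return 0;
}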

Data consistency problems can also occur without a cache, but they can become especially severe with a cache.

The problem of data inconsistency is fatal to the program. So there needs to be a protocol that can make multiple sets of caches look like there is only one set of caches.

Thus, the cache coherence protocol was born.

Cache Coherence Protocol

The cache coherence protocol was born to solve the cache inconsistency problem: it manages data consistency by maintaining a consistent view of each cache line across multiple caches.

Here's the concept of a cache line first:

A cache line is the smallest unit the cache reads. A cache line consists of a number of consecutive bytes that is a power of two, generally 32 to 256 bytes; the most common cache line size is 64 bytes.

On Linux, you can check the cache line size with the command cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size.

On macOS, you can check the cache line size with sysctl hw.cachelinesize.
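
If you prefer to query it from code, here is a small sketch for Linux with glibc (assuming the _SC_LEVEL1_DCACHE_LINESIZE extension is available; other platforms need different APIs):

 #include <stdio.h>
#include <unistd.h>

int main(void) {
    /* _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension; it may be
     * unavailable or return 0 on other C libraries. */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 data cache line size: %ld bytes\n", line);
    return 0;
}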

A cache line is also the smallest unit managed by the cache coherence protocol.

There are two main mechanisms for implementing a cache coherence protocol: directory-based and bus sniffing.

directory based

What is directory based?

To put it bluntly, a directory records how each cache line is being used; when a CPU wants to use a cache line, it first looks up the directory to learn the line's usage, and data consistency is ensured this way.

There are six formats for the directory:

  1. Full bit vector format
  2. Coarse bit vector format
  3. Sparse directory format
  4. Number-balanced binary tree format
  5. Chained directory format
  6. Limited pointer format

The names of these formats sound fancy, but they are not actually that complicated; they differ mainly in their data structures and optimization strategies.

For example, the full bit vector format uses one bit per CPU for each cache line to record whether that CPU has the line cached.

The remaining formats are essentially optimizations of this idea in terms of storage overhead and scalability.
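
As a rough picture of the full bit vector idea, here is a simplified C sketch (my own illustration, not the layout of any real hardware directory): each directory entry keeps one bit per CPU recording which private caches currently hold the line.

 #include <stdbool.h>
#include <stdint.h>

#define NUM_CPUS 8   /* assumed small system, for illustration only */

/* One directory entry per memory block: a full bit vector, one bit per CPU. */
struct dir_entry {
    uint8_t sharers;  /* bit i set => CPU i has the line in its cache */
    bool dirty;       /* true => exactly one sharer holds a modified copy */
};

/* Record that cpu now caches the line tracked by this entry. */
static void dir_add_sharer(struct dir_entry *e, int cpu) {
    e->sharers |= (uint8_t)(1u << cpu);
}

/* Before a write, the directory tells us which other CPUs must be
 * sent an invalidate message. */
static uint8_t dir_sharers_to_invalidate(const struct dir_entry *e, int writer) {
    return e->sharers & (uint8_t)~(1u << writer);
}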

Compared with direct message passing, going through a directory takes more time, so a directory-based cache coherence protocol has relatively high latency. But there is an advantage: the directory, as a third party, simplifies communication, so it uses less bus bandwidth.

Therefore, the directory-based approach is suitable for large systems with many CPU cores.

bus sniffing

Although a directory-based cache coherence protocol uses little bandwidth, its latency is high, so it is not a good fit as the cache coherence solution for small systems. Small systems mostly use a cache coherence protocol based on bus sniffing.

The bus is the bridge over which the CPU and memory exchange addresses and data. Bus sniffing means monitoring this bridge and noticing changes to the data in time.

When a CPU modifies data in its private cache, it sends an event message on the bus, telling the other listeners on the bus that the data has been modified.

When other CPUs notice that their own private cache holds a copy of the modified data, they can either update their copy or invalidate it.

Updating every cached copy would generate huge bus traffic and affect the normal operation of the system. Therefore, on hearing a modification event, the more common choice is to invalidate the private copy, that is, to discard the data copy.

This approach of invalidating modified data copies has a technical name: "write-invalidate". Cache coherence protocols implemented this way are called write-invalidate protocols; the common MSI, MESI, MOSI, MOESI, and MESIF protocols all fall into this category.

MESI

The MESI protocol is an invalidation-based cache coherence protocol. It is the most commonly used protocol for write-back caches and the most widely used cache coherence protocol overall. It marks each cache line's state with two bits and maintains transitions between the states to achieve cache coherence.

MESI states

MESI is an acronym for four words, each representing a state of the cache line:

  1. M: modified. The cache line has been modified and its value differs from main memory. If another CPU core wants to read this data from main memory, the cache line must first be written back to main memory, and its state then becomes shared (S).
  2. E: exclusive. The cache line exists only in the current cache and is consistent with main memory. When another cache reads it, the state becomes shared (S); when the data is written, it becomes modified (M).
  3. S: shared. The cache line also exists in other caches and is clean (consistent with main memory); it can be discarded at any time.
  4. I: invalid. The cache line is invalid.

MESI messages

In the MESI protocol, the switching of the cache line state depends on the transmission of messages. MESI has the following messages:

  1. Read: read the data at an address.
  2. Read Response: the response to a Read message, carrying the requested data.
  3. Invalidate: ask other CPUs to invalidate the cache line corresponding to an address.
  4. Invalidate Acknowledge: the response to an Invalidate message.
  5. Read Invalidate: a combination of the Read and Invalidate messages.
  6. Writeback: a message containing the address and data to be written back to memory.

Through these messages, MESI maintains a cache line state machine to implement coherent shared memory. I won't go into more detail here.
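
To tie the states and messages together, here is a deliberately simplified C sketch (my own illustration; a real protocol engine handles far more cases, including writebacks from the M state): it only shows how one cache line's state might change on a local read, a local write, and an incoming Invalidate message.

 /* The four MESI states of a cache line. */
enum mesi_state { MODIFIED, EXCLUSIVE, SHARED, INVALID };

/* Local write: obtain exclusive ownership first (sending Invalidate or
 * Read Invalidate on the bus if the line is S or I); the line ends up
 * Modified regardless of its starting state. */
enum mesi_state on_local_write(enum mesi_state s) {
    (void)s;            /* E can be upgraded silently; S/I must invalidate others */
    return MODIFIED;
}

/* Local read: a miss (Invalid) fetches the line; it becomes Exclusive if no
 * other cache holds a copy, Shared otherwise. Hits keep their current state. */
enum mesi_state on_local_read(enum mesi_state s, int other_caches_have_copy) {
    if (s == INVALID)
        return other_caches_have_copy ? SHARED : EXCLUSIVE;
    return s;
}

/* An Invalidate message for this line arrives from another CPU. */
enum mesi_state on_bus_invalidate(enum mesi_state s) {
    (void)s;
    return INVALID;
}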

If you are not familiar with MESI, I suggest trying this website for hands-on experiments. You can simulate various scenarios and watch the animations it generates in real time, which makes MESI much easier to understand.

If you can't open the website, don't worry: I have saved its source code for you. Reply "MESI" to get it, unzip it, and run it locally.

Here's a simple example using this website:

  1. CPU0 reads a0
  2. CPU1 writes a0

A brief analysis:

  1. CPU0 reads a0. After a0 is read into Cache0, the cache line's state is E, because CPU0 holds it exclusively.
  2. CPU1 writes a0. It first reads a0 into Cache1; because the line is now shared, its state is S. Then it modifies the value of a0, the state of its cache line becomes M, and CPU0 is notified to invalidate its copy of the cache line containing a0.

The existence of MESI guarantees cache coherence and lets multi-core CPUs exchange data properly. Does that mean the CPU's performance has been squeezed to the limit?

The answer is no, so let's move on.

The second half of this article builds on the first half. If my explanation so far hasn't made things click, you can jump straight to the summary section, where I have prepared an outline of the ideas.

If you're ready, let's keep going and see how else the CPU can be squeezed.

Store Buffer

As we saw above, if the CPU writes some data and that data is not in its private cache, the CPU sends a Read Invalidate message to read the data and invalidate the other cached copies.

But have you thought about this: between sending the message and receiving all the response messages, the wait in between is a long time from the CPU's point of view.

Can you reduce the time the CPU spends waiting for messages?

Yes! That is exactly what the store buffer does.

How does it do that?

The store buffer is a structure that sits between the CPU and the cache.

When the CPU writes, it can write directly into the store buffer without waiting for the other CPUs' responses; once the response messages arrive, the data in the store buffer is written into the cache line.

When the CPU reads data, it first checks whether the data is in the store buffer; if so, the store buffer's value is used first (this mechanism is called "store forwarding").

This improves CPU utilization, and it also ensures that, on the same CPU, reads and writes appear to execute in order.
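
Here is a minimal C sketch of the idea (my own illustration; a real store buffer is a tiny hardware structure, not a C array): a write is recorded in the buffer and returns immediately, and a read checks the buffer first (store forwarding) before falling back to the cache.

 #include <stdbool.h>
#include <stdint.h>

#define SB_ENTRIES 4   /* real store buffers are similarly tiny */

struct sb_entry { uintptr_t addr; uint64_t value; bool valid; };

static struct sb_entry store_buffer[SB_ENTRIES];

/* Write: record the store in the buffer and return immediately, without
 * waiting for other CPUs to acknowledge the invalidate. */
static void cpu_store(uintptr_t addr, uint64_t value) {
    for (int i = 0; i < SB_ENTRIES; i++) {
        if (!store_buffer[i].valid || store_buffer[i].addr == addr) {
            store_buffer[i] = (struct sb_entry){ addr, value, true };
            return;
        }
    }
    /* Buffer full: a real CPU would have to stall here. */
}

/* Read: check the store buffer first (store forwarding), then the cache. */
static uint64_t cpu_load(uintptr_t addr, uint64_t value_from_cache) {
    for (int i = 0; i < SB_ENTRIES; i++) {
        if (store_buffer[i].valid && store_buffer[i].addr == addr)
            return store_buffer[i].value;   /* forwarded from the buffer */
    }
    return value_from_cache;
}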

Note that the ordered execution of reads and writes here refers to the same CPU. Why emphasize "the same"?

Because introducing the store buffer no longer guarantees a globally ordered execution across multiple CPUs.

Let's look at the following example:

 // executed by CPU0
void foo() { 
    a = 1;
    b = 1;
}

// executed by CPU1
void bar() {
    while(b == 0) continue;
    assert(a == 1);
}

Assuming that CPU0 executes the foo method and CPU1 executes the bar method, if before execution, the cache is like this:

  1. CPU0 has b cached; because it holds it exclusively, the state is E.
  2. CPU1 has a cached; because it holds it exclusively, the state is E.

Well, after having a store buffer, it is possible to have this situation (simplifying the process of interacting with memory):

In words it is:

  1. CPU0 executes a=1. Because a is not in CPU0's cache and there is a store buffer, CPU0 writes a=1 directly into the store buffer and sends a read invalidate message at the same time.
  2. CPU1 executes while(b==0). Because b is not in CPU1's cache, CPU1 sends a read message for b.
  3. CPU0 executes b=1. Because CPU0 already owns the cache line containing b, it updates the cache line directly.
  4. CPU0 receives the read message from CPU1, knows that CPU1 wants to read b, returns a read response message carrying b=1, and changes the state of the corresponding cache line to S.
  5. CPU1 receives the read response message, learns that b=1, puts b=1 into its cache, and exits the while loop.
  6. CPU1 executes assert(a==1), reads a=0 from its own cache (the invalidate for a has not been processed yet), and the assertion fails.

Let's analyze this from different angles:

  1. From CPU0's own perspective: a=1 comes before b=1, so by the time b equals 1, a must already equal 1.
  2. From the perspective of observing CPU1: since a must already be 1 when b is 1, once CPU1 leaves the loop because b==1, the following assert should succeed. It actually fails, which means that, seen from the outside, CPU0's two writes appear to have been reordered.

So how to solve the global order problem caused by the introduction of store buffer?

Hardware designers provide developers with memory barrier instructions for exactly this. We only need to use a memory barrier to adjust the code: adding smp_mb() after a=1 eliminates the effect introduced by the store buffer.

 // executed by CPU0
void foo() { 
    a = 1;
    smp_mb();
    b = 1;
}

// executed by CPU1
void bar() {
    while(b == 0) continue;
    assert(a == 1);
}

How do memory barriers achieve global ordering?

There are two approaches: waiting for the store buffer to take effect and queuing into the store buffer.

Wait for the store buffer to take effect

Waiting for the store buffer to take effect means that writes issued after the memory barrier must wait until every value in the store buffer has received its response message and been written into the cache line.

Queue into store buffer

Queuing into the store buffer means that writes issued after the memory barrier are simply appended to the store buffer queue, and they are only written into the cache line after all the entries ahead of them in the store buffer have been written out.

As the animation shows, both approaches involve waiting. The difference is that in the first one the CPU itself stalls and waits, while in the second the waiting happens inside the store buffer and the CPU can keep going.

Therefore, queuing into the store buffer is relatively more efficient, and most systems use this approach.
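
Here is a rough C sketch of the "queue into the store buffer" strategy (my own illustration, reusing the toy store buffer layout from earlier): the barrier marks every entry currently in the buffer, and later stores have to queue behind the marked entries instead of stalling the CPU.

 #include <stdbool.h>
#include <stdint.h>

#define SB_ENTRIES 4

struct sb_entry { uintptr_t addr; uint64_t value; bool valid; bool marked; };
static struct sb_entry sb[SB_ENTRIES];

/* smp_mb(), "queue into the store buffer" flavor: mark every entry that is
 * currently in the buffer instead of stalling the CPU. */
static void barrier_mark(void) {
    for (int i = 0; i < SB_ENTRIES; i++)
        if (sb[i].valid)
            sb[i].marked = true;
}

/* A later store may go to the cache line directly only if no marked entry
 * is still waiting; otherwise it has to queue behind the marked entries. */
static bool store_must_queue(void) {
    for (int i = 0; i < SB_ENTRIES; i++)
        if (sb[i].valid && sb[i].marked)
            return true;
    return false;
}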

Invalidate Queue

A memory barrier can solve the global ordering problem caused by the store buffer. But there is another problem: the store buffer's capacity is very small. If other CPUs are busy and respond to messages slowly, the store buffer fills up easily, which directly hurts the CPU's efficiency.

What can be done?

The root of the problem is that slow response messages fill up the store buffer. Can the response speed be improved?

Yes! That is why the invalidate queue appeared.

The main function of invalidate queue is to improve the response speed of invalidate messages.

With an invalidate queue, when a CPU receives an invalidate message it can put the message into its invalidate queue and return the Invalidate Acknowledge immediately, without invalidating the corresponding cache line right away. The CPU only promises that, before it sends any message concerning that cache line, it will first check whether its invalidate queue contains an Invalidate message for the line and, if so, process that Invalidate message first.
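
Here is a rough C sketch of that behavior (my own illustration): an incoming invalidate message is only queued and acknowledged immediately; the actual invalidation is applied later, which is exactly the window that causes the problem shown next.

 #include <stdbool.h>
#include <stdint.h>

#define IQ_ENTRIES 8

static uintptr_t invalidate_queue[IQ_ENTRIES];
static int iq_len = 0;

/* An Invalidate message arrives: just remember it and acknowledge at once,
 * instead of invalidating the cache line right now. */
static bool on_invalidate_message(uintptr_t addr) {
    if (iq_len == IQ_ENTRIES)
        return false;               /* queue full: must process entries first */
    invalidate_queue[iq_len++] = addr;
    return true;                    /* send Invalidate Acknowledge immediately */
}

/* Later (for example when a read memory barrier executes), drain the queue
 * and actually invalidate the recorded cache lines. */
static void drain_invalidate_queue(void (*invalidate_line)(uintptr_t)) {
    for (int i = 0; i < iq_len; i++)
        invalidate_line(invalidate_queue[i]);
    iq_len = 0;
}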

Although the invalidate queue speeds up the response to invalidate messages, it also introduces a global ordering problem, similar to the one caused by the store buffer.

Consider the following example:

 // executed by CPU0
void foo() { 
    a = 1;
    smp_mb();
    b = 1;
}

// executed by CPU1
void bar() {
    while(b == 0) continue;
    assert(a == 1);
}

The code above still assumes that CPU0 executes foo and CPU1 executes bar. Suppose that before execution the caches look like this:

Then, after having invalidate queue, it is possible to have this kind of execution:

  1. CPU0 executes a=1. The corresponding cache line is in the read-only (shared) state in CPU0's cache, so CPU0 puts the new value a=1 into its store buffer and sends an invalidate message to flush the corresponding cache line from CPU1's cache.
  2. CPU1 executes while(b==0) continue, but the cache line containing b is not in its cache, so it sends a read message.
  3. CPU1 receives CPU0's invalidate message for a, puts it into its invalidate queue, and immediately sends an Invalidate Acknowledge back to CPU0. Note that the old value of a is still in CPU1's cache.
  4. CPU0 receives the acknowledgement, so it can move past the smp_mb() and move a=1 from its store buffer into its cache line.
  5. CPU0 executes b=1. It already owns this cache line, so it updates the cache line directly, changing b from 0 to 1.
  6. CPU0 receives the read message, sends the cache line containing b to CPU1, and changes that cache line's state to S.
  7. CPU1 receives the cache line containing b and writes it into its cache.
  8. CPU1 can now exit the while(b==0) loop, because it sees that b is 1, and it moves on to the next statement.
  9. CPU1 executes assert(a==1). Since the old value of a is still in CPU1's cache (the queued invalidate has not been processed yet), the assertion fails.
  10. CPU1 finally processes the Invalidate message in its queue and invalidates the cache line containing a in its own cache. But it is too late.

As can be seen from this example, after the invalidate queue is introduced, the global ordering cannot be guaranteed.

How to solve it? The same way as with the store buffer: add a memory barrier to the code:

 // executed by CPU0
void foo() { 
    a = 1;
    smp_mb();
    b = 1;
}

// executed by CPU1
void bar() {
    while(b == 0) continue;
    smp_mb();
    assert(a == 1);
}

I won't walk through the execution after this change step by step; the conclusion is that the memory barrier solves the global ordering problem introduced by the invalidate queue.

Memory Barriers and Lock Instructions

memory barrier

As we have seen, the memory barrier does two things: it deals with the store buffer and the invalidate queue, and thereby maintains global ordering.

But in many cases only one of the two, the store buffer or the invalidate queue, needs to be handled, so many systems subdivide memory barriers into read memory barriers and write memory barriers.

A read barrier deals with the invalidate queue, and a write barrier deals with the store buffer.

On the x86 architecture, the instructions corresponding to these memory barriers are:

  • Read barrier: lfence
  • Write barrier: sfence
  • Read-write barrier: mfence
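
For reference, here is one common way to reach these instructions from C with GCC or Clang inline assembly (a sketch; compiler intrinsics such as _mm_lfence, _mm_sfence, and _mm_mfence are an alternative). The "memory" clobber also keeps the compiler itself from reordering accesses across the barrier.

 /* Typical ways to emit the x86 barrier instructions from C with GCC/Clang. */
static inline void read_barrier(void)  { __asm__ __volatile__("lfence" ::: "memory"); }
static inline void write_barrier(void) { __asm__ __volatile__("sfence" ::: "memory"); }
static inline void full_barrier(void)  { __asm__ __volatile__("mfence" ::: "memory"); }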

Lock instruction

To recap: in my previous article on volatile, I mentioned that the underlying implementation of the volatile keyword is the lock prefix instruction.

What is the relationship between lock prefix instructions and memory barriers?

My view is that there isn't much of a relationship between them.

It's just that some effects of the lock prefix instruction happen to achieve what a memory barrier achieves.

This is also described in the IA-32 Architecture Software Developer's Manual.

The manual defines the lock prefix instruction as a bus lock: the lock prefix instruction guarantees visibility and forbids instruction reordering by locking the bus.

The term "bus lock" is rather dated; modern systems mostly lock the cache line instead. But the point I want to make is that the core idea of the lock prefix instruction is still "locking", which is fundamentally different from a memory barrier.
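
As a concrete illustration (a sketch, not something from the original article): an atomic increment written with the GCC/Clang __atomic built-ins is typically compiled, on x86-64, to a lock-prefixed instruction such as lock xadd, and that lock prefix is what provides both the atomicity and the ordering effect.

 #include <stdio.h>

static int counter = 0;

int main(void) {
    /* On x86-64, GCC/Clang typically compile this to something like:
     *     lock xaddl %eax, counter(%rip)
     * The lock prefix makes the read-modify-write atomic and also acts
     * as a full ordering point, similar to a read-write barrier. */
    __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
    printf("counter = %d\n", counter);
    return 0;
}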

Revisiting the questions

Let's review the reader's two points of view:

  1. Reader: The lock instruction triggers the cache coherence protocol
  2. Reader: JMM is guaranteed by the cache coherence protocol

For the first point, my take is:

The function of the lock prefix instruction is to lock the cache line, which can achieve the same effect as a read-write barrier, and what read-write barriers solve is the global ordering problem caused by the store buffer and the invalidate queue.

The cache coherence protocol, on the other hand, exists to solve the cache coherence problem in multi-core systems. It is guaranteed by hardware and transparent to software; it comes with the multi-core system itself. It is an objective fact, not something that needs to be "triggered".

For the second point, my opinion is:

The JMM is an abstract memory model. It abstracts the JVM's runtime behaviour so that Java developers can better understand how the JVM works, and it shields them from the CPU's underlying implementation so that they can develop without being tormented by low-level details.

What the JMM says is, roughly, that to some extent you can obtain stronger memory-consistency guarantees through certain Java keywords.

So the JMM and the cache coherence protocol are not tied together; there is no essential connection between them. It's like this: you can't conclude, just because you are single and Liu Yifei is single, that Liu Yifei is single because she is waiting for you.

Summary

This article may be a little hard to follow for readers without the background, so here is an outline of the whole article; it is enough to handle ordinary interview questions.

  1. Because the speed of the memory does not match the CPU, a multi-level cache is added between the memory and the CPU.
  2. With a single core using the cache exclusively there is no data inconsistency problem, but with multiple cores cache coherence problems appear.
  3. The cache coherence protocol is to solve the cache coherence problem caused by multiple sets of caches.
  4. There are two implementations of the cache coherence protocol, one is directory-based and the other is bus-sniffing.
  5. The directory-based method has high latency, but occupies less bus traffic and is suitable for systems with many CPU cores.
  6. The method based on bus sniffing has low latency, but occupies a large amount of bus traffic and is suitable for systems with a small number of CPU cores.
  7. The common MESI protocol is implemented based on bus sniffing.
  8. MESI solves the cache coherency problem, but still can't squeeze the CPU performance to the extreme.
  9. In order to further squeeze the CPU, store buffer and invalidate queue are introduced.
  10. The introduction of store buffer and invalidate queue results in the inability to satisfy global order, so write barriers and read barriers are required.
  11. On the x86 architecture, the read barrier instruction is lfence, the write barrier instruction is sfence, and the read-write barrier instruction is mfence.
  12. The lock prefix instruction directly locks the cache line and can also achieve the effect of a memory barrier.
  13. Under the x86 architecture, the underlying implementation of volatile is the lock prefix instruction.
  14. JMM is a model, an abstraction that makes development easier for Java developers.
  15. The cache coherence protocol is to solve the data consistency problem under the CPU multi-core system. It is an objective thing and does not need to be triggered.
  16. JMM has nothing to do with cache coherence protocols.
  17. JMM and MESI have nothing whatsoever to do with each other.

Written at the end

This article mainly draws on Wikipedia and on the papers and books of Linux kernel hacker Paul E. McKenney. If you want to study low-level concurrent programming in more depth, Paul E. McKenney's papers and books are well worth reading. If you need them, reply "MESI" in the background to get them.

Because the author's level is limited, mistakes in the article are inevitable. If you spot one, please point it out!

Well, today's article ends here, I'm Xiao Wang, see you next time!

Welcome to follow my personal official account: CoderW

References

  • "In-depth understanding of parallel programming"
  • Intel IA-32 Architectures Software Developer's Manual
  • "Memory Barriers: a Hardware View for Software Hackers"
  • "Is Parallel Programming Hard, And, If So, What Can You Do About It?"
