Since the Erlang/OTP team started a technical blog, many high-quality articles have given us the opportunity to understand the various mechanisms inside Erlang. For example, the recent https://www.erlang.org/blog/parallel-signal-sending-optimization/ describes how "N to 1" process-messaging performance was optimized in the Erlang virtual machine.
This article is simply a retelling of that post from this author's perspective. If anything is misunderstood or inadequately explained, please point it out in the comments.
The figure above gives a very intuitive picture of the optimization's effect. It is a performance comparison of many processes sending short messages to one process at the same time on a multi-core machine. The horizontal axis is the number of processes, and the vertical axis is the number of operations per second. After the optimization, the workload scales horizontally: the more processes there are, the more operations per second. Before the optimization, the more processes there were, the lower the performance.
Before we dive into how this optimization works, let's first look at the signal mechanism in the Erlang virtual machine.
In the Erlang virtual machine, entities are all the things that execute concurrently, including processes, ports, and so on. Ordinary process messages are also a kind of signal. Signal ordering follows this rule:
If entity A first sends signal S1 to B and then sends S2 to B, then S1 is guaranteed not to arrive after S2.
In layman's terms, imagine a highway with N lanes where overtaking within a lane is not allowed: within the same lane, the order of cars is fixed, while the relative order of cars in different lanes keeps changing.
The figure below is a simplified structure of a process before optimization.
The process of sending a message is as follows:
- Allocate a linked-list node that contains the signal
- Acquire the lock of the outer signal queue (OuterSignalQueue)
- Append the signal node to the tail of the outer signal queue
- Release the lock
The process of receiving messages is as follows:
- Acquire the lock of the outer signal queue
- Append the contents of the outer signal queue to the tail of the inner signal queue (InnerSignalQueue)
- Release the lock
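The two-queue scheme above can be sketched in Python (the real implementation is C inside the BEAM; the class and field names here are illustrative, not the VM's actual ones). Senders append to the outer queue under its lock, and the receiver splices the whole outer queue onto its private inner queue in one short critical section:

```python
import threading
from collections import deque

class ProcessMailbox:
    """Toy model of a process's signal queues under {message_queue_data, off_heap}."""
    def __init__(self):
        self.outer_lock = threading.Lock()   # lock protecting the outer signal queue
        self.outer = deque()                 # outer queue, shared with all senders
        self.inner = deque()                 # inner queue, private to the receiver

    def send(self, signal):
        # Sender: allocate a node, take the lock, append, release.
        with self.outer_lock:
            self.outer.append(signal)

    def fetch(self):
        # Receiver: splice the entire outer queue onto the inner queue
        # in one short critical section, then read without locking.
        with self.outer_lock:
            self.inner.extend(self.outer)
            self.outer.clear()
        return list(self.inner)

mbox = ProcessMailbox()
for i in range(3):
    mbox.send(("sender_a", i))
print(mbox.fetch())  # [('sender_a', 0), ('sender_a', 1), ('sender_a', 2)]
```

Because the receiver only holds the lock long enough to splice, senders are blocked for a very short time; the cost is that every sender still contends for the same single lock.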
The above is the mechanism when `{message_queue_data, off_heap}` is set. The default option is `{message_queue_data, on_heap}`. This optimization actually only applies to `off_heap`; that is, if we do not set `message_queue_data` to `off_heap`, the optimization has nothing to do with us. So what are the message-delivery steps by default? Although they have nothing to do with this optimization, this article still introduces them in detail:
Steps to send a message:
- Try to obtain the main process lock (MainProcessLock) with `try_lock`.

If successful:
1. Allocate space for the signal on the process's main heap and copy the signal there
2. Allocate a linked-list node containing a pointer to that signal's location
3. Acquire the outer signal queue lock
4. Append the signal node to the tail of the outer signal queue
5. Release the outer signal queue lock
6. Release the main process lock

If it fails:
1. Allocate a linked-list node that contains the signal
2. Acquire the outer signal queue lock
3. Append the signal node to the tail of the outer signal queue
4. Release the outer signal queue lock
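The branching above can be sketched as follows (again a Python stand-in for the C internals, with illustrative names): the sender first tries the main process lock, copies the signal onto the receiver's heap only if it gets the lock, and otherwise falls back to the off_heap-style path:

```python
import threading
from collections import deque

class OnHeapMailbox:
    """Toy model of the default {message_queue_data, on_heap} send path."""
    def __init__(self):
        self.main_lock = threading.Lock()    # MainProcessLock (holding it also keeps GC out)
        self.outer_lock = threading.Lock()
        self.heap = []                       # stands in for the receiver's main heap
        self.outer = deque()                 # nodes: heap references or inline signals

    def send(self, signal):
        if self.main_lock.acquire(blocking=False):   # try_lock
            try:
                # Lock acquired: copy the signal onto the receiver's heap
                # and enqueue a node that points at it.
                self.heap.append(signal)
                idx = len(self.heap) - 1
                with self.outer_lock:
                    self.outer.append(("heap_ref", idx))
            finally:
                self.main_lock.release()
        else:
            # Lock busy: fall back to the off_heap-style path and
            # enqueue a node carrying the signal itself.
            with self.outer_lock:
                self.outer.append(("inline", signal))

mbox = OnHeapMailbox()
mbox.send("hello")
print(mbox.outer[0])  # ('heap_ref', 0) -- the try_lock succeeded, so the signal went on-heap
```

The fallback branch is why `on_heap` still works under contention, but the happy path is exactly where senders pile up on the receiver's main process lock.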
As can be seen, with `on_heap`, when the main process lock is successfully acquired the signal data is copied directly onto the receiving process's main heap. The drawback is that the sender must hold the main process lock to prevent garbage collection from running in that process. Therefore, when many processes send messages to one process at the same time, `off_heap` scales better, because there is no contention for the receiver's main process lock.
Nevertheless, the outer signal queue lock is still a performance bottleneck.
Below we can talk about how to optimize.
Looking back at the signal-ordering requirement of the Erlang virtual machine mentioned earlier, what we need is an N-lane highway, but there is only one toll station (the receiver's outer signal queue lock), and all the cars are backed up there. The optimization follows naturally: add more "toll booths". By simply hashing the pid of the sending process, signals are distributed across 64 slot queues.
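The slotted scheme can be sketched as hashing the sender's pid into one of 64 slot queues (a Python illustration; the slot count of 64 is from the post, everything else here is an assumed simplification). Because all signals from one sender land in the same slot, per-sender FIFO order is preserved, while senders hashing to different slots no longer contend with each other:

```python
from collections import deque

NUM_SLOTS = 64  # number of "toll booths"

class SlottedOuterQueue:
    """Toy model of the parallel outer signal queue: one slot per hash bucket."""
    def __init__(self):
        self.slots = [deque() for _ in range(NUM_SLOTS)]

    def slot_for(self, sender_pid):
        # A simple hash of the sender pid picks the slot. All signals from
        # one sender hit the same slot, so the ordering rule still holds.
        return hash(sender_pid) % NUM_SLOTS

    def send(self, sender_pid, signal):
        # In the real VM each slot has its own lock, so senders that hash
        # to different slots take different locks and do not block each other.
        self.slots[self.slot_for(sender_pid)].append((sender_pid, signal))

q = SlottedOuterQueue()
q.send("pid_a", 1)
q.send("pid_a", 2)
q.send("pid_b", 1)
slot_a = q.slot_for("pid_a")
print([s for (p, s) in q.slots[slot_a] if p == "pid_a"])  # [1, 2]
```

The receiver then splices all 64 slots into its inner queue; signals from different senders may interleave differently than before, which the ordering rule explicitly permits.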
This optimization is only triggered when the number of processes simultaneously contending for the outer signal queue exceeds a certain threshold.
This optimization gives much better performance for N-to-1 mass messaging on multi-core machines. See the original post for more details.