Author: Zhang Beihai

Our online feature data service, DataService, suffered from low CPU utilization and non-linear long-tail latency (the P99/P999 curve was J-shaped) under the thread pool model. After replacing the thread pool with the Disruptor we saw very good performance results. This article briefly introduces my understanding of the Disruptor and the results of putting it into production.

Background

Disruptor is a high-performance framework for inter-thread messaging, designed and developed by LMAX (a British financial trading company) for building its own trading system. It has since been open-sourced and adopted by many well-known projects, such as Log4j2, which made headlines for a vulnerability not long ago.

Log4j2 uses the Disruptor to speed up log writing in its fully asynchronous mode. Log4j2 published a benchmark comparing synchronous logging (Sync), asynchronous logging backed by an ArrayBlockingQueue (Async), and fully asynchronous logging backed by the Disruptor (ALL_ASYNC). The results are here: https://logging.apache.org/log4j/2.x/manual/async.html

The throughput of the Disruptor mode is 12 times that of the JDK ArrayBlockingQueue mode and 68 times that of the synchronous mode.

On the P99 response-time metric the Disruptor mode also beats the BlockingQueue mode, especially once garbage-free logging and other tuning options are enabled.

As the Log4j2 example shows, the Disruptor can give your system higher throughput while keeping response times lower and more stable.

So why does the Disruptor bring these benefits, and what is wrong with the JDK thread pool model?

Description of Disruptor

LMAX is a financial trading company whose business logic contains a large number of producer-consumer patterns. Naturally, they put the data produced by producers into a queue (e.g. an ArrayBlockingQueue) and spun up multiple consumer threads to consume it concurrently.

Their tests then showed that transferring data through a queue costs roughly as much as a disk access (RAID, SSD). When the business logic requires several queues chained between different processing stages, the overhead of those serial queues becomes unbearable, so they started to analyze why the JDK queues have such serious performance problems.

Problems with BlockingQueue

Why is the BlockingQueue so expensive? Take Java's ArrayBlockingQueue as an example. The underlying structure is simply an array; a ReentrantLock guarantees thread safety when elements are enqueued and dequeued concurrently.

/**
 * ArrayBlockingQueue's timed enqueue implementation
 */
public boolean offer(E e, long timeout, TimeUnit unit)
    throws InterruptedException {
    checkNotNull(e);
    long nanos = unit.toNanos(timeout);
    final ReentrantLock lock = this.lock;
    // the single global lock
    lock.lockInterruptibly();
    try {
        while (count == items.length) {
            if (nanos <= 0)
                return false;
            nanos = notFull.awaitNanos(nanos);
        }
        enqueue(e);
        return true;
    } finally {
        lock.unlock();
    }
}

/**
 * ArrayBlockingQueue's dequeue implementation
 */
public E poll() {
    final ReentrantLock lock = this.lock;
    lock.lock();
    try {
        return (count == 0) ? null : dequeue();
    } finally {
        lock.unlock();
    }
}

/**
 * Inserts element at current put position, advances, and signals.
 * Call only when holding lock.
 */
private void enqueue(E x) {
    // assert lock.getHoldCount() == 1;
    // assert items[putIndex] == null;
    final Object[] items = this.items;
    items[putIndex] = x;
    if (++putIndex == items.length)
        putIndex = 0;
    count++;
    notEmpty.signal();
}

/**
 * Extracts element at current take position, advances, and signals.
 * Call only when holding lock.
 */
private E dequeue() {
    // assert lock.getHoldCount() == 1;
    // assert items[takeIndex] != null;
    final Object[] items = this.items;
    @SuppressWarnings("unchecked")
    E x = (E) items[takeIndex];
    items[takeIndex] = null;
    if (++takeIndex == items.length)
        takeIndex = 0;
    count--;
    if (itrs != null)
        itrs.elementDequeued();
    notFull.signal();
    return x;
}

As you can see, ArrayBlockingQueue uses a single ReentrantLock for mutual exclusion on both reads and writes, which causes two problems:

  1. Enqueueing and dequeueing exclude each other, so almost any workload will frequently run into lock contention.
  2. Each lock acquisition on a ReentrantLock may itself involve several CAS operations, and each CAS is not as cheap as one might think:

    1. Every lock state change triggers a CAS.
    2. Entering the wait queue after losing the lock race triggers a CAS.
    3. When the lock holder releases the lock and threads are coordinated through a Condition, removing the awakened thread from the wait queue also triggers a CAS.

In order to verify this conjecture, LMAX ran another test to measure how much various kinds of locks actually cost.

Their test case increments a 64-bit integer 100 million times; the runs differ only in whether the counter is incremented by a single thread with no lock, a single thread with a lock (synchronized or CAS), or multiple threads with a lock (synchronized or CAS). A rough sketch of such a benchmark is shown after the results table below.

The test results are as follows (see "The Cost of Locks" in https://lmax-exchange.github.io/disruptor/disruptor.html):

  1. A single thread with no lock finishes in only 300 ms.
  2. A single thread with a lock (with no actual contention) takes 10 s.
  3. A single thread using CAS does somewhat better than using a mutex.
  4. As more threads are added, both the mutex version and the CAS version take longer and longer.
  5. A volatile write behaves on the same order of magnitude as CAS.
Method                              Time (ms)
Single thread                       300
Single thread with lock             10,000
Two threads with lock               224,000
Single thread with CAS              5,700
Two threads with CAS                30,000
Single thread with volatile write   4,700
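
Below is a rough single-threaded sketch of such a counter benchmark. It is not the LMAX harness; the class and method names are made up for illustration, and the multi-threaded variants are omitted:

import java.util.concurrent.atomic.AtomicLong;

public class CounterBench {
    private static final long ITERATIONS = 100_000_000L;

    private long plain;                        // no synchronization at all
    private long locked;                       // guarded by an intrinsic lock
    private final AtomicLong cas = new AtomicLong();

    void plainLoop()  { for (long i = 0; i < ITERATIONS; i++) plain++; }

    void lockedLoop() {
        for (long i = 0; i < ITERATIONS; i++) {
            synchronized (this) { locked++; }  // even an uncontended lock pays for its fences
        }
    }

    void casLoop()    { for (long i = 0; i < ITERATIONS; i++) cas.incrementAndGet(); }

    public static void main(String[] args) {
        CounterBench b = new CounterBench();
        long t0 = System.nanoTime(); b.plainLoop();
        long t1 = System.nanoTime(); b.lockedLoop();
        long t2 = System.nanoTime(); b.casLoop();
        long t3 = System.nanoTime();
        System.out.printf("plain=%dms lock=%dms cas=%dms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, (t3 - t2) / 1_000_000);
    }
}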

The overhead of locks and CAS operations is clearly much higher than one might expect. So where does this cost come from?

In a concurrent environment (within the Java ecosystem) there are two ways to implement mutual exclusion: synchronized and CAS. The cost of each is analyzed below.

synchronized overhead

The JDK's introduction to synchronization: https://wiki.openjdk.java.net/display/HotSpot/Synchronization

In Java, a mutex takes the form of a code block guarded by the synchronized keyword (after lock inflation, synchronized is backed by an OS mutex, which on Linux is pthread_mutex_t).

  1. Kernel arbitration

    When a synchronized block is contended by multiple threads, execution has to switch between user mode and kernel mode so the kernel can arbitrate ownership of the contended resource. This switch is very expensive: registers, memory mappings and other pieces of thread context have to be saved and restored.

  2. Cache pollution

    Modern CPUs have multiple cores, and a core computes far faster than main memory can feed it. To bridge the speed gap between cores and memory, CPU caches were introduced. When a core needs data, it first looks in the L1 cache; on a miss it falls back to L2, then L3, and finally loads from main memory.

    When a thread context switch happens, the descheduled thread gives up the CPU so another thread can run, and the cache lines it had just loaded from main memory are polluted by the new thread. The next time it wins the CPU it has to load its data from main memory again. Contention makes this cache pollution worse and further degrades system performance.

    From CPU to                                     Approx. CPU cycles     Approx. time
    Main memory                                     -                      about 60-80 ns
    QPI bus transfer (between sockets, not drawn)   -                      about 20 ns
    L3 cache                                        about 40-45 cycles     about 15 ns
    L2 cache                                        about 10 cycles        about 3 ns
    L1 cache                                        about 3-4 cycles       about 1 ns
    Register                                        1 cycle                -
  3. False sharing

    A mutex can also have the side effect that variables which are not themselves protected by the lock end up being serialized as if they were.

    The basic unit of CPU cache management is the cache line. When the CPU loads data from main memory, it loads the surrounding memory block one whole cache line at a time. When the CPU modifies data, it modifies it directly in the cache, and the cache coherence protocol makes sure the change eventually reaches memory and the other cores.

    Consider the following example:

    class Eg {
        private int a;
        private int b;

        public synchronized void incr_a() {
            a++;
        }

        public void incr_b() {
            b++;
        }
    }

    The two fields a and b of this object will most likely be allocated in adjacent memory, so when the CPU loads them it is very likely they end up in the same cache line.

    When one thread calls incr_a while another thread calls incr_b, the cache line holding both a and b is contended along with the lock that protects incr_a. So even though incr_b takes no lock explicitly, it behaves as if it were locked. This phenomenon is called false sharing.

  4. Additional CAS overhead

    Internally, synchronized maintains state such as the reentrancy count and the id of the owning thread in the object header. When a thread re-enters the contended block, the count and the thread id in the object header are updated via CAS, so synchronized itself also carries CAS overhead.

CAS overhead

CAS is an atomic instruction supported by modern processors (for example lock cmpxchg on x86). Its semantics are: if the variable still holds the expected old value, update it to the new value; otherwise fail.

In Java, the various AtomicXXX classes are wrappers around the CAS instruction, and Java's ReentrantLock is also implemented on top of CAS.
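
As a minimal illustration (the class and method names here are invented for the example), the retry loop that such CAS-based classes boil down to looks like this:

import java.util.concurrent.atomic.AtomicLong;

public final class CasIncrement {
    private final AtomicLong value = new AtomicLong();

    long increment() {
        long prev, next;
        do {
            prev = value.get();                      // read the current value
            next = prev + 1;
        } while (!value.compareAndSet(prev, next));  // retry if another thread raced us
        return next;
    }
}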

As for the cost of CAS itself, consider the following example:

Suppose two threads on two different cores CAS the same variable a at the same time. When thread 1's CAS succeeds, the change is written into its core's cache. How, then, does thread 2 notice that the value of a has changed, so that its own CAS fails?

This is where the cache coherence protocol comes in: memory barriers have to be inserted around the CAS so the change is visible to all cores. This is also the real meaning of Java's volatile keyword.

That is why, in the table above, a CAS operation performs on the same order of magnitude as a volatile write: both require cache synchronization.

We should realize that although CAS performs better than a mutex, it is by no means free. When a large number of CAS operations fail and retry, causing a large number of cache invalidations, the consequences can sometimes be even more serious.

For details on cache coherence and memory barriers, please refer to this article: http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2009.04.05a.pdf

Optimization of Disruptor

After extensive testing and in-depth analysis, LMAX understood the cost of locks and abstracted a general-purpose lock-free concurrency framework out of their business.

Component Description

Before introducing the Disruptor in detail, here is a brief description of its core abstractions.

Component          Description
Ring Buffer        Circular queue that stores the event data flowing between producers and consumers
Sequence           An auto-incrementing sequence used to track cursors (published, consumed) between producers and consumers; can roughly be thought of as a custom AtomicLong
Sequencer          Holds the producers' and consumers' Sequences and coordinates the concurrency between the two sides; the core component of the Disruptor
Sequence Barrier   Created by the Sequencer; lets consumers track upstream producers to find consumable events
Wait Strategy      The strategy a consumer uses while waiting for consumable events; many implementations exist
Event              A business event
Event Processor    A business event consumer; can be thought of as an abstraction over a physical thread
Event Handler      The real business processing logic; each Processor holds one
Producer           The producer

Data production

Producing data is simple; there are two cases:

  1. Single producer

    With a single producer there is no contention when publishing data into the RingBuffer. The only thing to watch is the consumers' consumption capacity: the producer must not overwrite data that the slowest consumer has not yet consumed.

    To achieve this, the Sequencer tracks the progress of the slowest consumer. The code is below; note that the whole path involves only a single volatile write and no locks at all:

    // claim the next n slots for publishing
    public long next(int n)
    {
        if (n < 1)
        {
            throw new IllegalArgumentException("n must be > 0");
        }

        long nextValue = this.nextValue;

        long nextSequence = nextValue + n;
        long wrapPoint = nextSequence - bufferSize;
        long cachedGatingSequence = this.cachedValue;

        if (wrapPoint > cachedGatingSequence || cachedGatingSequence > nextValue)
        {
            // a single volatile write
            cursor.setVolatile(nextValue);  // StoreLoad fence

            long minSequence;
            while (wrapPoint > (minSequence = Util.getMinimumSequence(gatingSequences, nextValue)))
            {
                // the slowest consumer is behind the slots we need; try to wake consumers up
                // (behaviour depends on the wait strategy - many strategies never block and just spin)
                waitStrategy.signalAllWhenBlocking();
                // park for 1 ns and try again
                LockSupport.parkNanos(1L); // TODO: Use waitStrategy to spin?
            }
            // claim succeeded
            this.cachedValue = minSequence;
        }

        this.nextValue = nextSequence;

        return nextSequence;
    }
  2. Multiple producers

    What makes multiple producers more complicated is that the producer threads race with each other on writes, and CAS is needed to coordinate them. In other words, the producers' Sequence needs one extra CAS operation, but the whole path is still lock-free. The claim code is as follows:

    // claim the next n slots for publishing
    public long next(int n)
    {
        if (n < 1)
        {
            throw new IllegalArgumentException("n must be > 0");
        }

        long current;
        long next;

        do
        {
            current = cursor.get();
            next = current + n;

            long wrapPoint = next - bufferSize;
            long cachedGatingSequence = gatingSequenceCache.get();

            if (wrapPoint > cachedGatingSequence || cachedGatingSequence > current)
            {
                long gatingSequence = Util.getMinimumSequence(gatingSequences, current);

                if (wrapPoint > gatingSequence)
                {
                    // the slowest consumer is behind the slots we need; try to wake consumers up
                    // (behaviour depends on the wait strategy - many strategies never block and just spin)
                    waitStrategy.signalAllWhenBlocking();
                    LockSupport.parkNanos(1); // TODO, should we spin based on the wait strategy?
                    continue;
                }

                gatingSequenceCache.set(gatingSequence);
            }
            else if (cursor.compareAndSet(current, next))
            {
                // coordinate the multiple producers via spin + CAS
                break;
            }
        }
        while (true);

        return next;
    }

    Although this is only one extra CAS compared with the single-producer case, the Disruptor's core author has repeatedly emphasized that for higher throughput and stable latency the single-writer design principle really matters; otherwise long-tail latency grows non-linearly as throughput increases.

    See the author's own article: https://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html

Data consumption

Data consumption works the same way regardless of whether there is a single producer or multiple producers. The Disruptor supports running multiple Processors (i.e. threads), and each Processor pulls consumable events in a while(true)-style loop.

Like a thread pool, this model avoids the cost of repeatedly creating and destroying threads and the associated context-switching penalties (cache pollution and so on).

Contention between multiple consumers is coordinated through the Sequence Barrier abstraction. The code is below; apart from the wait strategy, which may be implemented with a lock, every step is lock-free:

while (true)
{
    try
    {
        // if previous sequence was processed - fetch the next sequence and set
        // that we have successfully processed the previous sequence
        // typically, this will be true
        // this prevents the sequence getting too far forward if an exception
        // is thrown from the WorkHandler
        if (processedSequence)
        {
            processedSequence = false;
            do
            {
                nextSequence = workSequence.get() + 1L;
                // a Store/Store barrier
                sequence.set(nextSequence - 1L);
            }
            while (!workSequence.compareAndSet(nextSequence - 1L, nextSequence));
            // coordinate consumer progress via spin + CAS
        }

        if (cachedAvailableSequence >= nextSequence)
        {
            // the batch of available slots is ahead of our progress, consume directly
            event = ringBuffer.get(nextSequence);
            workHandler.onEvent(event);
            processedSequence = true;
        }
        else
        {
            // nothing to consume: wait according to the configured strategy
            // (block, spin, block with timeout, ...)
            cachedAvailableSequence = sequenceBarrier.waitFor(nextSequence);
        }
    }
    catch (final TimeoutException e)
    {
        notifyTimeout(sequence.get());
    }
    catch (final AlertException ex)
    {
        if (!running.get())
        {
            break;
        }
    }
    catch (final Throwable ex)
    {
        // handle, mark as processed, unless the exception handler threw an exception
        exceptionHandler.handleEventException(ex, nextSequence, event);
        processedSequence = true;
    }
}

As you can see, the core of producer/consumer coordination is the WaitStrategy, and the framework ships with several implementations:

Name                          Mechanism                        Applicable scenario
BlockingWaitStrategy          synchronized                     CPU is scarce; throughput and latency are not critical
BusySpinWaitStrategy          busy spin (while true)           Constant retrying avoids the system calls of thread switching and minimizes latency; recommended when threads are pinned to dedicated CPU cores
PhasedBackoffWaitStrategy     spin + yield + custom strategy   CPU is scarce; throughput and latency are not critical
SleepingWaitStrategy          spin + parkNanos                 A good compromise between performance and CPU usage; latency is uneven
TimeoutBlockingWaitStrategy   synchronized, with a timeout     CPU is scarce; throughput and latency are not critical
YieldingWaitStrategy          spin + yield                     A good compromise between performance and CPU usage; latency is even

The above strategies fall into two categories:

  1. Willing to burn CPU in pursuit of extreme throughput and low latency:

    1. YieldingWaitStrategy: keeps spinning and yielding
    2. BusySpinWaitStrategy: a pure while(true) busy spin
    3. PhasedBackoffWaitStrategy: supports a custom fallback strategy
  2. Not chasing extreme performance:

    1. SleepingWaitStrategy: minimal impact on the main thread; used by Log4j2, for example
    2. BlockingWaitStrategy
    3. TimeoutBlockingWaitStrategy
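
As an illustration of how a strategy is chosen, it is passed in when the Disruptor is constructed. The sketch below follows the Disruptor 3.x DSL; the event class and handler are hypothetical:

import com.lmax.disruptor.YieldingWaitStrategy;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class WaitStrategyExample {
    // hypothetical event type carrying a single long value
    static class LongEvent { long value; }

    public static void main(String[] args) {
        Disruptor<LongEvent> disruptor = new Disruptor<>(
                LongEvent::new,                 // event factory, preallocates the ring buffer
                1024,                           // ring buffer size, must be a power of 2
                DaemonThreadFactory.INSTANCE,
                ProducerType.SINGLE,            // single-writer principle
                new YieldingWaitStrategy());    // spin + yield: low latency at some CPU cost

        disruptor.handleEventsWith(
                (event, sequence, endOfBatch) -> System.out.println(event.value));
        disruptor.start();
    }
}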

Other optimizations

  1. False sharing handling

    There is also a simple remedy for false sharing, i.e. for cache lines being invalidated by writes to unrelated variables.

    A cache line is typically 64 bytes, so the Disruptor pads its hot fields with unused fields before and after them. A hot field then occupies an entire cache line by itself, and the collateral invalidation caused by sharing a line with other data is avoided, as sketched below.
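
    For reference, here is a sketch of this padding pattern (modelled on, but not copied from, the padding around the Disruptor's Sequence value; the field names are illustrative):

    class LhsPadding { protected long p1, p2, p3, p4, p5, p6, p7; }                     // left padding
    class Value extends LhsPadding { protected volatile long value; }                   // the hot field
    class RhsPadding extends Value { protected long p9, p10, p11, p12, p13, p14, p15; } // right padding

    // with ~56 bytes of padding on each side, the hot field sits alone on its cache line
    class PaddedSequence extends RhsPadding { }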

  2. Memory preallocation

    In the Disruptor, event objects are preallocated in the RingBuffer; when a new event arrives, the key information is copied into the preallocated structure. This avoids the GC pressure that a large number of short-lived event objects would cause, as in the sketch below.
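
    A minimal sketch of this publish pattern, reusing the hypothetical LongEvent and disruptor instance from the wait-strategy example above:

    RingBuffer<LongEvent> ringBuffer = disruptor.getRingBuffer();

    // claim a slot, copy data into the preallocated event, then publish;
    // the event object in the slot is reused, so no allocation happens per message
    long seq = ringBuffer.next();
    try {
        LongEvent event = ringBuffer.get(seq);
        event.value = 42L;             // copy the key information into the preallocated structure
    } finally {
        ringBuffer.publish(seq);
    }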

  3. Batch slot claiming

    When multiple producers and multiple consumers contend, slots for publishing and consuming can be claimed in batches, further reducing the CAS overhead caused by contention; see the sketch below.
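
    A sketch of batch claiming with the same hypothetical ringBuffer:

    // claim n slots at once: a single sequence bump (one CAS in the multi-producer case)
    // covers the whole batch
    int n = 10;
    long hi = ringBuffer.next(n);
    long lo = hi - (n - 1);
    try {
        for (long seq = lo; seq <= hi; seq++) {
            ringBuffer.get(seq).value = seq;   // fill each preallocated event
        }
    } finally {
        ringBuffer.publish(lo, hi);            // publish the whole range at once
    }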

Practical application

Replacing the original JDK thread pool with the Disruptor in our feature service achieved very good performance results.

Test introduction

  1. Load-test machine configuration

    Configuration item    Value
    Machine               physical machine
    OS                    CentOS Linux release 7.3.1611 (Core)
    RAM                   256 GB
    CPU                   40 cores
  2. Test case

    1. An asynchronous client queries the feature service for several randomly chosen features; the features are stored in three external stores (Redis, Tair, HBase).
    2. The feature service is deployed as a single node on one physical machine.
    3. Measure throughput and response latency distribution for the two processing queues: thread pool vs. Disruptor.

Test Results

The load was increased gradually from 50k requests/s up to 100k requests/s.

  1. Response time

    At the same throughput, the Disruptor performs better than the thread pool model, with fewer long-tail responses.

  2. Timeout rate

    The timeout rate is also more stable after switching to the Disruptor.

References

  1. Disruptor User Guide: https://lmax-exchange.github.io/disruptor/user-guide/index.html
  2. Disruptor technical paper: https://lmax-exchange.github.io/disruptor/disruptor.html
  3. Single-writer principle discussion: https://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html
  4. Why Memory Barriers: http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2009.04.05a.pdf
  5. Log4j2 Asynchronous Loggers for Low-Latency Logging: https://logging.apache.org/log4j/2.x/manual/async.html
This article is published by the NetEase Cloud Music technical team; any form of reprint without authorization is prohibited. We recruit for various technical positions all year round. If you are ready for a change and happen to like Cloud Music, join us at grp.music-fe(at)corp.netease.com!
