Analysis of the Atomic Operation Classes in the Java Concurrency Package

Foreword

The JUC package provides a series of atomic operation classes. They are implemented with CAS, a non-blocking algorithm, which generally performs considerably better than implementing the same atomic operations with locks.

Since the atomic operation classes all work in roughly the same way, this article covers only the principles of the simple AtomicLong class and of the LongAdder class added in JDK 8.

Atomic variable operation classes

The JUC package contains atomic operation classes such as AtomicInteger, AtomicLong, and AtomicBoolean. They work in similar ways, so let's take AtomicLong as the example.

AtomicLong provides atomic increment and decrement operations on a long value and is implemented internally with Unsafe. Let's look at the code below.

public class AtomicLong extends Number implements java.io.Serializable {
    private static final long serialVersionUID = 1927816293512124184L;
    //1. Get the Unsafe instance
    private static final Unsafe unsafe = Unsafe.getUnsafe();
    //2. Holds the offset of the value field
    private static final long valueOffset;
    //3. Whether the JVM supports lock-free CAS on long values
    static final boolean VM_SUPPORTS_LONG_CAS = VMSupportsCS8();

    private static native boolean VMSupportsCS8();

    static {
        try {
            //4. Get the offset of value within AtomicLong
            valueOffset = unsafe.objectFieldOffset
                (AtomicLong.class.getDeclaredField("value"));
        } catch (Exception ex) { throw new Error(ex); }
    }

    //5. The actual value
    private volatile long value;

    public AtomicLong(long initialValue) {
        value = initialValue;
    }
   ......
}

Step 1 obtains an instance of the Unsafe class through Unsafe.getUnsafe().

Why is AtomicLong allowed to obtain the Unsafe instance? Because the AtomicLong class is itself in rt.jar, it is loaded by the Bootstrap class loader, and in JDK 8 Unsafe.getUnsafe() only accepts callers loaded by that class loader.

Steps 2 and 4 obtain the offset of the value field within the AtomicLong class.

The value field in step 5 is declared volatile to guarantee memory visibility across threads; it holds the actual counter value.
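Application code normally cannot call Unsafe.getUnsafe() itself, so the following is only a sketch of the same "volatile field plus CAS" pattern using the public AtomicLongFieldUpdater API instead. The class name FieldUpdaterCounter and its methods are hypothetical and chosen for illustration; this is not how AtomicLong itself is written.

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

// Hypothetical class for illustration; not part of the JDK.
class FieldUpdaterCounter {
    // Resolved once per class, playing a role similar to valueOffset in AtomicLong.
    private static final AtomicLongFieldUpdater<FieldUpdaterCounter> VALUE =
            AtomicLongFieldUpdater.newUpdater(FieldUpdaterCounter.class, "value");

    // The updater requires a volatile long field, mirroring AtomicLong.value.
    private volatile long value;

    long incrementAndGet() {
        return VALUE.incrementAndGet(this);
    }

    boolean compareAndSet(long expect, long update) {
        return VALUE.compareAndSet(this, expect, update);
    }

    long get() {
        return value;
    }

    public static void main(String[] args) {
        FieldUpdaterCounter counter = new FieldUpdaterCounter();
        counter.incrementAndGet();
        System.out.println(counter.get());
    }
}
```

The updater is looked up once in a static field, much as AtomicLong computes valueOffset once in its static initializer.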

`Increment and decrement operation code`

Next we look at the main functions in AtomicLong.

//Calls the unsafe method to atomically set value to the original value + 1 and returns the incremented value
public final long incrementAndGet() {
        return unsafe.getAndAddLong(this, valueOffset, 1L) + 1L;
}
//Calls the unsafe method to atomically set value to the original value - 1 and returns the decremented value
public final long decrementAndGet() {
        return unsafe.getAndAddLong(this, valueOffset, -1L) - 1L;
}
//Calls the unsafe method to atomically set value to the original value + 1 and returns the original value
public final long getAndIncrement() {
        return unsafe.getAndAddLong(this, valueOffset, 1L);
}
//Calls the unsafe method to atomically set value to the original value - 1 and returns the original value
public final long getAndDecrement() {
        return unsafe.getAndAddLong(this, valueOffset, -1L);
}

All of the methods above are implemented by calling Unsafe's getAndAddLong() method, which is an atomic operation. The first parameter is a reference to the AtomicLong instance, the second is the offset of the value field within AtomicLong, and the third is the amount to add to value.
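The difference between the "xxxAndGet" and "getAndXxx" variants is easy to mix up, so here is a minimal sketch that calls each of the four methods once on the public AtomicLong API. The class name ReturnValueDemo is made up for this example.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo class; only the AtomicLong calls are real JDK API.
class ReturnValueDemo {
    // Returns the four method results followed by the final counter value.
    static long[] run() {
        AtomicLong counter = new AtomicLong(10);
        long a = counter.incrementAndGet(); // value 10 -> 11, returns the new value
        long b = counter.getAndIncrement(); // value 11 -> 12, returns the old value
        long c = counter.decrementAndGet(); // value 12 -> 11, returns the new value
        long d = counter.getAndDecrement(); // value 11 -> 10, returns the old value
        return new long[] { a, b, c, d, counter.get() };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run()));
    }
}
```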

For comparison, this is how getAndIncrement() was implemented in JDK 1.7:

public final long getAndIncrement() {
   while (true) {
        long current = get();
        long next = current + 1;
        if (compareAndSet(current,next))
            return current;
    }
}

In this code, each thread first reads the current value of the variable (since value is volatile, it reads the latest value), adds 1 to it in its own working memory, and then uses CAS to update the variable. If the CAS fails, the thread keeps looping until it succeeds.
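The same read, compute, CAS, retry pattern works for any update, not only for adding 1. As a sketch, the hypothetical helper getAndUpdateMax below (not a JDK method) raises an AtomicLong to a candidate value if the candidate is larger, using only get() and compareAndSet().

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper class; a CAS retry loop in the JDK 1.7 style.
class CasLoopSketch {
    // Atomically raises target to candidate if candidate is larger; returns the previous value.
    static long getAndUpdateMax(AtomicLong target, long candidate) {
        while (true) {
            long current = target.get();              // volatile read of the latest value
            long next = Math.max(current, candidate); // compute the new value locally
            if (target.compareAndSet(current, next))  // succeeds only if nobody changed it meanwhile
                return current;
            // CAS failed: another thread won the race, so re-read and retry
        }
    }

    public static void main(String[] args) {
        AtomicLong max = new AtomicLong(10);
        System.out.println(getAndUpdateMax(max, 42)); // previous value
        System.out.println(max.get());                // value after the update
    }
}
```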

And this is the logic in JDK 8, which lives in Unsafe:

public final long getAndAddLong(Object var1, long var2, long var4) {
        long var6;
        do {
            var6 = this.getLongVolatile(var1, var2);
        } while(!this.compareAndSwapLong(var1, var2, var6, var6 + var4));

        return var6;
}

As you can see, the retry loop that lived in AtomicLong in JDK 1.7 has been built into the Unsafe class in JDK 8. The reason is presumably that the same logic is needed in other places as well, so building it into Unsafe improves reusability.

`compareAndSet(long expect, long update) method`

public final boolean compareAndSet(long expect, long update) {
        return unsafe.compareAndSwapLong(this, valueOffset, expect, update);
}

As the code shows, the method simply delegates to unsafe.compareAndSwapLong. If the current value of the atomic variable equals expect, it is replaced with update and the method returns true; otherwise the value is left unchanged and the method returns false.
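A minimal sketch of both outcomes, using only the public AtomicLong API (the class name CompareAndSetDemo is made up for the example):

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo class showing a successful and a failed compareAndSet.
class CompareAndSetDemo {
    // Returns {firstResult, secondResult, finalValue}, with booleans encoded as 1 or 0.
    static long[] run() {
        AtomicLong value = new AtomicLong(5);
        boolean first = value.compareAndSet(5, 6);  // expect matches, value becomes 6
        boolean second = value.compareAndSet(5, 7); // expect is stale, value stays 6
        return new long[] { first ? 1 : 0, second ? 1 : 0, value.get() };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run()));
    }
}
```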

Let's deepen our understanding with an example that uses AtomicLong to count the number of 0s from multiple threads.

/**
 * @author 神秘杰克
 * WeChat official account: Java菜鸟程序员
 * @date 2022/1/4
 * @Description Count the number of 0s
 */
public class AtomicTest {

    private static AtomicLong atomicLong = new AtomicLong();
    private static Integer[] arrayOne = new Integer[]{0, 1, 2, 3, 0, 5, 6, 0, 56, 0};
    private static Integer[] arrayTwo = new Integer[]{10, 1, 2, 3, 0, 5, 6, 0, 56, 0};

    public static void main(String[] args) throws InterruptedException {
        final Thread threadOne = new Thread(() -> {
            final int size = arrayOne.length;
            for (int i = 0; i < size; ++i) {
                if (arrayOne[i].intValue() == 0) {
                    atomicLong.incrementAndGet();
                }
            }
        });
        final Thread threadTwo = new Thread(() -> {
            final int size = arrayTwo.length;
            for (int i = 0; i < size; ++i) {
                if (arrayTwo[i].intValue() == 0) {
                    atomicLong.incrementAndGet();
                }
            }
        });
        threadOne.start();
        threadTwo.start();
        //Wait for both threads to finish
        threadOne.join();
        threadTwo.join();
        System.out.println("Total count: " + atomicLong.get()); //Total count: 7

    }
}

The code is simple: each thread calls AtomicLong's atomic increment method every time it finds a 0.

Without atomic classes, implementing the counter requires some form of synchronization, such as the synchronized keyword. Those are blocking algorithms and carry a performance cost, whereas AtomicLong uses the non-blocking CAS algorithm and performs better.
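For contrast, a lock-based counter might look like the sketch below. SynchronizedCounter and its run() helper are hypothetical names; the point is only that every increment has to acquire the monitor, so threads block one another instead of retrying a CAS.

```java
// Hypothetical class for illustration: a blocking, lock-based counter.
class SynchronizedCounter {
    private long count;

    synchronized void increment() { count++; }

    synchronized long get() { return count; }

    // Runs the given number of workers, each incrementing perThread times, and returns the final count.
    static long run(int threads, int perThread) {
        SynchronizedCounter counter = new SynchronizedCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10_000));
    }
}
```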

However, under high concurrency AtomicLong still has performance problems. JDK 8 provides the LongAdder class, which performs better under high contention.

`Introducing LongAdder`

As mentioned earlier, when AtomicLong is used under high concurrency, many threads compete to update the same atomic variable at the same time. Since only one thread's CAS can succeed at any moment, the many threads that lose the race keep retrying the CAS in a spin loop, which wastes CPU resources.

JDK 8 therefore added LongAdder, an atomic increment and decrement class that overcomes AtomicLong's weakness under high concurrency. Since AtomicLong's bottleneck comes from many threads competing to update a single variable, would the problem go away if that variable were split into several, so that the threads compete for several resources instead? Yes, and that is exactly the idea behind LongAdder.

LongAdder maintains multiple Cell variables internally, each holding a long with an initial value of 0. For the same level of concurrency, fewer threads compete to update any single variable, which in effect reduces the contention on the shared resource.

In addition, when several threads compete for the same Cell and one of them fails its CAS, it does not keep spinning on that Cell; it tries its CAS on another Cell instead, which raises the chance that the retry succeeds. Finally, when the current value of LongAdder is requested, the values of all Cells are added to base and the total is returned.
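From the caller's point of view LongAdder is used much like a counter. The following is a minimal usage sketch; the class name LongAdderDemo and the helper countConcurrently are hypothetical, while increment() and sum() are the real LongAdder methods.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical demo class: several threads increment one LongAdder.
class LongAdderDemo {
    // Starts the workers, waits for them, and returns the total once all updates are done.
    static long countConcurrently(int threads, int incrementsPerThread) {
        LongAdder adder = new LongAdder();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    adder.increment(); // lands on base or on one of the Cells
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return adder.sum(); // base plus all Cell values; exact here because no thread is still updating
    }

    public static void main(String[] args) {
        System.out.println(countConcurrently(8, 100_000));
    }
}
```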

LongAdder maintains a lazily initialized array of atomic update cells (the Cell array is null by default) and a base value, base. The Cell array is not created up front; it is created only when it is needed, that is, lazily.

At the start, while the Cell array is null and there is little contention, all accumulation is done on base. The size of the Cell array is kept a power of 2, it starts out with 2 elements, and its elements are of type Cell. Cell is an improved AtomicLong designed to reduce cache contention, that is, to solve the false sharing problem.

When multiple threads concurrently modify different variables that sit in the same cache line, only one thread can operate on that cache line at a time, which degrades performance. This problem is called false sharing.
A cache line is typically 64 bytes. A long is 8 bytes, so a long value plus 5 padding longs takes 48 bytes in total.
In Java, the object header occupies 8 bytes on a 32-bit system and 16 bytes on a 64-bit system when class pointers are not compressed, so on a 64-bit system padding with 5 longs brings the object to 64 bytes, exactly one cache line.
JDK 8 provides the sun.misc.Contended annotation (moved to jdk.internal.vm.annotation.Contended in JDK 9), and false sharing can be avoided by applying @Contended.
With @Contended, 128 bytes of padding are added by default; for classes outside the JDK, the annotation only takes effect when the JVM is started with -XX:-RestrictContended.
In LongAdder, false sharing is tackled at the Cell array: the Cell class is annotated with @Contended.

Padding most isolated atomic variables is wasteful, because they are scattered irregularly through memory (their addresses are not contiguous), so it is very unlikely that several of them end up in the same cache line. The elements of an array of atomic variables, however, tend to be placed next to each other in memory, so several elements can often share a cache line. That is why the Cell class is padded with the @Contended annotation: it prevents multiple elements of the array from sharing a cache line, which improves performance.
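Since @Contended is not generally available to application code, here is a sketch of a different, manual way to get a similar effect: spacing the counters inside an AtomicLongArray 8 longs apart. This swaps in index striding for the annotation and is not how LongAdder does it. The class PaddedSlotsSketch, the STRIDE constant, and the assumed 64-byte cache line are all assumptions for illustration.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch: counters spaced one assumed cache line apart.
class PaddedSlotsSketch {
    // 8 longs * 8 bytes = 64 bytes, the assumed cache-line size.
    private static final int STRIDE = 8;
    private final AtomicLongArray slots;
    private final int mask;

    PaddedSlotsSketch(int slotCount) {
        if (Integer.bitCount(slotCount) != 1) {
            throw new IllegalArgumentException("slotCount must be a power of two");
        }
        this.slots = new AtomicLongArray(slotCount * STRIDE);
        this.mask = slotCount - 1;
    }

    // Adds x to the slot selected by the caller-supplied probe value.
    void add(int probe, long x) {
        slots.addAndGet((probe & mask) * STRIDE, x);
    }

    // Sums the used positions only; the positions in between are pure padding.
    long sum() {
        long total = 0;
        for (int i = 0; i <= mask; i++) {
            total += slots.get(i * STRIDE);
        }
        return total;
    }

    public static void main(String[] args) {
        PaddedSlotsSketch counter = new PaddedSlotsSketch(4);
        counter.add(0, 1);
        counter.add(1, 2);
        System.out.println(counter.sum());
    }
}
```

Two used positions are exactly 64 bytes apart, so they cannot fall into the same 64-byte line, at the cost of wasting seven of every eight array elements.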

`LongAdder source code analysis`

Questions:

1. What is the structure of LongAdder?
2. Which Cell element in the Cell array should the current thread access?
3. How is the Cell array initialized?
4. How is the Cell array expanded?
5. How are conflicts handled when threads access their assigned Cell elements?
6. How is the atomicity of a thread's operation on its assigned Cell element guaranteed?

Next we look at the structure of LongAdder:

The LongAdder class extends Striped64, and Striped64 maintains the following three fields.

The real value of LongAdder is the sum of base and the values of all Cell elements in the Cell array. base is the base value and defaults to 0.
cellsBusy is used to implement a spin lock and only takes the values 0 and 1. When a Cell element is created, or the Cell array is initialized or expanded, this variable is updated with CAS so that only one thread performs one of these operations at a time.

transient volatile Cell[] cells;
transient volatile long base;
transient volatile int cellsBusy;

public class LongAdder extends Striped64 implements Serializable {
    ......
}

`Construction of Cell`

Let's take a look at the construction of Cell.

@sun.misc.Contended static final class Cell {
    volatile long value;
    Cell(long x) { value = x; }
    final boolean cas(long cmp, long val) {
        return UNSAFE.compareAndSwapLong(this, valueOffset, cmp, val);
    }

    // Unsafe mechanics
    private static final sun.misc.Unsafe UNSAFE;
    private static final long valueOffset;
    static {
        try {
            UNSAFE = sun.misc.Unsafe.getUnsafe();
            Class<?> ak = Cell.class;
            valueOffset = UNSAFE.objectFieldOffset
                (ak.getDeclaredField("value"));
        } catch (Exception e) {
            throw new Error(e);
        }
    }
}

As you can see, Cell holds a single field declared volatile; it is volatile to guarantee memory visibility, since no lock is used. The cas method uses a CAS operation to guarantee atomicity when the current thread updates the value of the Cell assigned to it. The Cell class is also annotated with @Contended to avoid false sharing.

So far we know the answers to questions 1 and 6.

`sum()`

The sum() method returns the current value. Internally it starts from base and adds the value of every Cell.

Since the Cell array is not locked while the sum is computed, other threads may modify Cell values or expand the array during the accumulation. The value returned by sum() is therefore not exact: it is not an atomic snapshot of the state at the moment sum() was called.

public long sum() {
    Cell[] as = cells; Cell a;
    long sum = base;
    if (as != null) {
        for (int i = 0; i < as.length; ++i) {
            if ((a = as[i]) != null)
                sum += a.value;
        }
    }
    return sum;
}

`reset()`

The reset() method resets the adder: it sets base to 0 and, if the Cell array exists, sets the value of every element to 0.

public void reset() {
    Cell[] as = cells; Cell a;
    base = 0L;
    if (as != null) {
        for (int i = 0; i < as.length; ++i) {
            if ((a = as[i]) != null)
                a.value = 0L;
        }
    }
}

`sumThenReset()`

The sumThenReset() method is a variant of sum(): as it adds each value into the sum, it resets base and that Cell to 0.

This method is not thread-safe under concurrent use. For example, if two threads call it at the same time and the first has already cleared a Cell, the second accumulates 0 for that Cell; an update that lands between the read and the reset of a Cell is lost as well.

public long sumThenReset() {
    Cell[] as = cells; Cell a;
    long sum = base;
    base = 0L;
    if (as != null) {
        for (int i = 0; i < as.length; ++i) {
            if ((a = as[i]) != null) {
                sum += a.value;
                a.value = 0L;
            }
        }
    }
    return sum;
}
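Given these caveats, sumThenReset() is easiest to reason about when no other thread is updating the adder. A minimal single-threaded sketch (the class name SumThenResetDemo is made up for the example):

```java
import java.util.Arrays;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical demo class: draining a LongAdder while no other thread is updating it.
class SumThenResetDemo {
    // Returns {drained total, sum right after the reset, sum after one more increment}.
    static long[] run() {
        LongAdder adder = new LongAdder();
        adder.add(5);
        adder.add(7);
        long drained = adder.sumThenReset(); // reads the total and zeroes base and all Cells
        long afterReset = adder.sum();
        adder.increment();
        return new long[] { drained, afterReset, adder.sum() };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run()));
    }
}
```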

`add(long x)`

Next, let's focus on the add() method, which answers the remaining questions.

public void add(long x) {
    Cell[] as; long b, v; int m; Cell a;
    //(1)
    if ((as = cells) != null || !casBase(b = base, b + x)) {
        boolean uncontended = true;
        //(2)
        if (as == null || (m = as.length - 1) < 0 ||
            //(3)
            (a = as[getProbe() & m]) == null ||
            //(4)
            !(uncontended = a.cas(v = a.value, v + x)))
            //(5)
            longAccumulate(x, null, uncontended);
    }
}

final boolean casBase(long cmp, long val) {
        return UNSAFE.compareAndSwapLong(this, BASE, cmp, val);
}

The method first checks whether cells is null; if it is, the update is attempted on base with a CAS (code 1). If cells is not null, or the CAS on base fails, execution moves on to code (2). Codes (2) and (3) determine which Cell element of the cells array the current thread should access; if the element mapped to the current thread exists, code (4) is executed.

Code (4) uses a CAS operation to update the value of the assigned Cell element. If the element mapped to the current thread does not exist, or it exists but the CAS fails, code (5) is executed.

The Cell element a thread should access is computed as getProbe() & m, where m is the number of elements in the current cells array minus 1 and getProbe() returns the value of the current thread's threadLocalRandomProbe variable. This value is 0 at first and is initialized inside code (5). The current thread guarantees atomic updates to the Cell's value through the cas method of the assigned Cell.
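The masking trick relies on the array length being a power of 2: probe & (length - 1) then keeps only the low bits of the probe and always yields a valid index, even for negative probe values. A small sketch, with the hypothetical helper indexFor standing in for the expression getProbe() & m:

```java
// Hypothetical helper class illustrating index selection by bit masking.
class ProbeIndexSketch {
    // length must be a power of two; the result is always in [0, length).
    static int indexFor(int probe, int length) {
        return probe & (length - 1);
    }

    public static void main(String[] args) {
        System.out.println(indexFor(13, 8));  // keeps the low 3 bits of 13
        System.out.println(indexFor(-1, 4));  // a negative probe still yields a valid index
    }
}
```

For a power-of-two length this gives the same result as Math.floorMod(probe, length), without a division.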

Now we have understood the second question.

Now let's look at the longAccumulate(x, null, uncontended) method, which is where the cells array is initialized and expanded.

final void longAccumulate(long x, LongBinaryOperator fn,
                          boolean wasUncontended) {
    //6. Initialize the current thread's threadLocalRandomProbe value
    int h;
    if ((h = getProbe()) == 0) {
        ThreadLocalRandom.current(); // force initialization
        h = getProbe();
        wasUncontended = true;
    }
    boolean collide = false;                // True if last slot nonempty
    for (;;) {
        Cell[] as; Cell a; int n; long v;
        //7.
        if ((as = cells) != null && (n = as.length) > 0) {
            //8.
            if ((a = as[(n - 1) & h]) == null) {
                if (cellsBusy == 0) {       // Try to attach new Cell
                    Cell r = new Cell(x);   // Optimistically create
                    if (cellsBusy == 0 && casCellsBusy()) {
                        boolean created = false;
                        try {               // Recheck under lock
                            Cell[] rs; int m, j;
                            if ((rs = cells) != null &&
                                (m = rs.length) > 0 &&
                                rs[j = (m - 1) & h] == null) {
                                rs[j] = r;
                                created = true;
                            }
                        } finally {
                            cellsBusy = 0;
                        }
                        if (created)
                            break;
                        continue;           // Slot is now non-empty
                    }
                }
                collide = false;
            }
            else if (!wasUncontended)       // CAS already known to fail
                wasUncontended = true;      // Continue after rehash
            //9. The Cell exists, so try to update it with CAS
            else if (a.cas(v = a.value, ((fn == null) ? v + x :
                                         fn.applyAsLong(v, x))))
                break;
            //10. The number of Cell elements has reached the number of CPUs
            else if (n >= NCPU || cells != as)
                collide = false;            // At max size or stale
            //11. Whether there was a collision
            else if (!collide)
                collide = true;
            //12. If the number of elements has not reached the number of CPUs and there is a collision, expand
            else if (cellsBusy == 0 && casCellsBusy()) {
                try {
                    if (cells == as) {      // Expand table unless stale
                      //12.1
                        Cell[] rs = new Cell[n << 1];
                        for (int i = 0; i < n; ++i)
                            rs[i] = as[i];
                        cells = rs;
                    }
                } finally {
                    //12.2
                    cellsBusy = 0;
                }
                //12.3
                collide = false;
                continue;                   // Retry with expanded table
            }
            //13. Recompute the hash to find a free Cell; xorshift is used to generate the random number
            h = advanceProbe(h);
        }
        //14. Initialize the Cell array
        else if (cellsBusy == 0 && cells == as && casCellsBusy()) {
            boolean init = false;
            try {                           // Initialize table
                if (cells == as) {
                    //14.1
                    Cell[] rs = new Cell[2];
                    //14.2
                    rs[h & 1] = new Cell(x);
                    cells = rs;
                    init = true;
                }
            } finally {
                //14.3
                cellsBusy = 0;
            }
            if (init)
                break;
        }
        else if (casBase(v = base, ((fn == null) ? v + x :
                                    fn.applyAsLong(v, x))))
            break;                          // Fall back on using base
    }
}

The method is fairly complex, so we focus on questions 3, 4, and 5:

How to initialize the Cell array?
How to expand the Cell array?
How to deal with conflicts when threads access the allocated Cell elements?

When a thread executes code (6) for the first time, it initializes the current thread's threadLocalRandomProbe variable. This value is used to compute which Cell element of the cells array the current thread is assigned to.

The cells array is initialized in code (14). cellsBusy is a flag: 0 means the cells array is not being initialized or expanded and no new Cell element is being created; 1 means the cells array is being initialized or expanded, or a new Cell element is being created. The flag is switched from 0 to 1 with CAS by calling casCellsBusy().

Suppose the current thread sets cellsBusy to 1 through CAS and starts the initialization; other threads then cannot expand the array at the same time. Code (14.1) creates the cells array with 2 elements, and code (14.2) uses h & 1 to compute which position the current thread should access, where h is the current thread's threadLocalRandomProbe value, and stores a new Cell there. The array is then published and marked as initialized, and finally code (14.3) resets the cellsBusy flag. Although no CAS is used for this reset, it is thread-safe: only the thread that holds the flag writes it, and cellsBusy is volatile, which guarantees memory visibility. Of the two elements in the newly initialized array, only the one at index h & 1 holds a Cell; the other is still null. Now we know the answer to question 3.
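The flag protocol can be sketched with public classes. The version below uses an AtomicInteger in place of the Unsafe-based casCellsBusy() and a long[] in place of the Cell array; BusyFlagSketch, tryInit, and tableLength are hypothetical names, and this is a simplified illustration rather than the Striped64 code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a 0/1 busy flag guarding lazy initialization.
class BusyFlagSketch {
    private final AtomicInteger busy = new AtomicInteger(0); // 0 = free, 1 = held
    private volatile long[] table;                            // stands in for the cells array

    // Returns true only if this call created the table.
    boolean tryInit() {
        if (table == null && busy.get() == 0 && busy.compareAndSet(0, 1)) {
            boolean init = false;
            try {
                if (table == null) {      // recheck while holding the flag
                    table = new long[2];  // initial size 2, as in the article
                    init = true;
                }
            } finally {
                busy.set(0);              // a plain volatile write is enough: only the holder releases
            }
            return init;
        }
        return false;                     // table already exists or another thread holds the flag
    }

    int tableLength() {
        long[] t = table;
        return t == null ? 0 : t.length;
    }

    public static void main(String[] args) {
        BusyFlagSketch sketch = new BusyFlagSketch();
        System.out.println(sketch.tryInit());
        System.out.println(sketch.tableLength());
    }
}
```

A thread that fails to take the flag does not wait for it; in longAccumulate it falls back to other branches, such as a CAS on base.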

The cells array is expanded in code (12). Expansion is conditional: it happens only when the conditions in codes (10) and (11) are not met. Specifically, expansion is performed only when the number of elements in cells is less than the number of CPUs on the machine and multiple threads have collided on the same element of cells, causing a thread's CAS to fail.

Why is the number of CPUs involved? Multithreading works best when each thread runs on its own CPU. When the number of elements in the cells array equals the number of CPUs, each Cell can be served by one CPU, and performance is at its best.

Code (12) first sets cellsBusy to 1 through CAS and only then expands. Assuming the CAS succeeds, code (12.1) doubles the capacity and copies the existing Cell elements into the new array. After the expansion, the cells array contains the copied elements plus new slots, and those new slots are still null. Now we know the answer to question 4.
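The doubling step on its own can be sketched as follows, with AtomicLong standing in for Cell and Arrays.copyOf replacing the manual copy loop. ExpandSketch and expand are hypothetical names, and the locking around the step is left out.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the table-doubling step only.
class ExpandSketch {
    // Doubles the table; existing elements keep their index, new slots are null.
    static AtomicLong[] expand(AtomicLong[] old) {
        return Arrays.copyOf(old, old.length << 1);
    }

    public static void main(String[] args) {
        AtomicLong[] grown = expand(new AtomicLong[] { new AtomicLong(7), null });
        System.out.println(grown.length);
        System.out.println(Arrays.toString(grown));
    }
}
```

The same element objects are carried over rather than copied by value, so updates made through the old array reference still reach the live cells.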

In codes (7) and (8), the current thread, having called add(), computes the index of the Cell element to access from its random value threadLocalRandomProbe and the number of elements in cells. If the element at that index is null, a new Cell is added to the cells array; before adding it, the thread must win the race to set cellsBusy to 1.

In code (13), the random value threadLocalRandomProbe is recomputed for a thread whose CAS failed, to reduce the chance of a collision the next time it accesses cells. Now we know the answer to question 5.
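As a sketch of what the rehash can look like, here is a plain xorshift step on an int. The method name advance and the shift constants 13, 17, and 5 (a commonly used xorshift triple) are assumptions for illustration; the real advanceProbe also writes the new value back to the current thread, which is omitted here.

```java
// Hypothetical sketch of a xorshift step for re-randomizing a probe value.
class XorShiftProbeSketch {
    // Returns the next pseudo-random probe; a zero input stays zero.
    static int advance(int probe) {
        probe ^= probe << 13;
        probe ^= probe >>> 17;
        probe ^= probe << 5;
        return probe;
    }

    public static void main(String[] args) {
        int probe = 1;
        for (int i = 0; i < 3; i++) {
            probe = advance(probe);
            System.out.println(probe & 7); // index into a hypothetical table of 8 cells
        }
    }
}
```

Because 0 maps to 0 under xorshift, the probe has to be initialized to a non-zero value before it is advanced, which matches the initialization in code (6).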

`Summary`

LongAdder uses its internal cells array to spread out the contention that arises when many threads update a single atomic variable under high concurrency, so that multiple threads can update different elements of the cells array in parallel. In addition, the array element type Cell is annotated with @Contended, which keeps multiple atomic variables in the cells array out of the same cache line, that is, it avoids false sharing.

Compared with LongAdder, LongAccumulator lets you supply a non-zero initial value for the accumulator, whereas LongAdder always starts from 0. LongAccumulator also lets you specify the accumulation rule, for example multiplication instead of addition, by passing a custom binary operator to its constructor, whereas LongAdder's rule (addition) is built in.
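A minimal sketch of both points, using the public LongAccumulator constructor and its accumulate() and get() methods (the class name LongAccumulatorDemo and its helpers are made up for the example):

```java
import java.util.concurrent.atomic.LongAccumulator;

// Hypothetical demo class: LongAccumulator with a custom rule and a non-zero identity.
class LongAccumulatorDemo {
    // Multiplies instead of adding, starting from the identity value 1.
    static long product() {
        LongAccumulator acc = new LongAccumulator((left, right) -> left * right, 1L);
        acc.accumulate(2);
        acc.accumulate(3);
        acc.accumulate(7);
        return acc.get();
    }

    // With Long::sum and identity 0 the accumulator behaves like a LongAdder.
    static long sum() {
        LongAccumulator acc = new LongAccumulator(Long::sum, 0L);
        acc.accumulate(2);
        acc.accumulate(3);
        return acc.get();
    }

    public static void main(String[] args) {
        System.out.println(product());
        System.out.println(sum());
    }
}
```

The operator should be free of side effects, since it may be applied again when a CAS has to be retried under contention.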
