Bi Sheng JDK: Why It Is a Super Easy-to-Use JDK on ARM

Abstract: Bi Sheng JDK is an open source version customized by Huawei based on OpenJDK. It is a high-performance OpenJDK release that can be used in a production environment.

This article is shared from the Huawei Cloud Community post "Bi Sheng JDK: 'Legend Reappearance': How does Huawei build the best JDK for ARM?". Original author: Bailu first commander.

Preface

Have you heard of or used Bi Sheng JDK? Do you work with Java, or on low-level JVM development? The vast majority of Java developers use Oracle JDK or OpenJDK. In this article, we introduce Huawei's Bi Sheng JDK and the technical optimizations we have made in it, in the hope of offering you a new choice beyond those two.

1. What is Bi Sheng JDK?

1.1 Development history of Bi Sheng JDK

Bi Sheng JDK is an open source version customized by Huawei based on OpenJDK. It is a high-performance OpenJDK release that can be used in a production environment. It is widely used within Huawei and runs stably on more than 500 products. The team has accumulated rich development experience and has solved many difficult problems encountered in actual business operations, such as crashes.

1.2 Architectures supported by Bi Sheng JDK

Bi Sheng JDK currently supports only the Linux/AArch64 architecture. Developers are welcome to download and use it.
Two LTS versions, 8 and 11, are currently supported, and both are open source.

1.3 The difference between Bi Sheng JDK, OpenJDK and Oracle JDK

We compare Bi Sheng JDK, OpenJDK and Oracle JDK below to help you make a better-informed choice of JDK.

As shown in the figure below, we use the blue area to represent OpenJDK, light yellow and red to represent Oracle JDK and Bisheng JDK respectively.

From the figure above, we can see that:

Like Oracle JDK, Bi Sheng JDK is customized from OpenJDK, but each adds its own commercial features. For example, OpenJDK 12 added a new garbage collection (GC) algorithm, Shenandoah, but it is not included in Oracle JDK releases.
Beyond the OpenJDK base, Bi Sheng JDK differs mainly in its product-function enhancements, bug fixes, and integration of upstream features.

2. Why do we need Bi Sheng JDK?

2.1 The Oracle JDK licensing model has changed

Apart from the "well-known" reasons, you may not know that Oracle JDK has required a paid license for commercial use since around version 8u212. For the company, weighing this together with the security vulnerabilities of the JDK itself and other commercial considerations, the conclusion was to develop a JDK that meets its own needs.

Note: The above data comes from Oracle's official website.

2.2 Demand for valuable features of newer JDK versions

A new JDK version is released every six months. There are many JDK versions, and different functions and features live in different versions. Programmers want to use the valuable features of newer versions on the JDK they know best. For example, in JDK 12 G1 GC gained the ability to return unused memory to the operating system, which is very valuable in cloud scenarios, yet JDK 8 is still the mainstream version in use. Backporting such features into a self-developed JDK meets this need quickly.

2.3 Application-specific optimization requirements

The hardware and scenarios an application runs on can have special demands, but such demands are hard to get accepted by the community in the short term. For example, big data applications are relatively demanding in mathematical computation. In a self-developed JDK, compiler optimization techniques such as loop unrolling and instruction optimization can be applied to mathematical calculations to speed them up.

3. The current status of Bi Sheng JDK

3.1 Bi Sheng JDK R&D status

Like Oracle JDK, Bi Sheng JDK is customized from the open source OpenJDK. At the same time, the team has contributed many valuable patches covering garbage collection, the JIT, the runtime, and more.
Bi Sheng JDK is open source under the GPLv2 license, and binaries can be downloaded for free from the official website.
Bi Sheng JDK is developed and operated as a community with bi-weekly meetings, and partners such as ARM, Powerland and Kylin currently take part. The Bi Sheng JDK community is not limited to the ARM platform: any question about the JDK can be discussed there and will be answered as soon as possible.
In the upstream community, the team currently has 1 Reviewer, 1 Committer and 8 Authors, with more than 10 colleagues in total submitting code.
Bi Sheng JDK has excellent performance and stability on ARM.

3.2 Bi Sheng JDK performance improvement examples

We analyzed the advantages of Bi Sheng JDK by running it in a test environment. The test environment is as follows:

Model: Taishan 2280V2
OS: openEuler 20.09
HW: Kunpeng 920-6426, 2600 MHz, 128 cores
JDK: JDK 8u262

Comparing the SPECjbb results, Bi Sheng JDK shows a significant improvement in both the critical and max scores: critical increased by 55% and max by 16%.

On SPECjvm the gain is less pronounced, but there is still an average improvement of 4.6%.

4. GC algorithm optimization in Bi Sheng JDK

4.1 The concept of the parallel copying algorithm

Copying is a very important part of GC algorithms, especially in young-generation collection, where live objects in the from space are copied to the to space. A serial copying algorithm has only one thread doing this work, which cannot meet our needs, so we use a parallel copying algorithm. What is a parallel copying algorithm?

In the parallel copying algorithm, objects A and B may be copied by different threads, for example because they are reached through different reference paths. To balance the workload, a thread can also steal copy tasks from other threads.
For example, two threads T1 and T2 copy objects A and B respectively, T1: A→A´; T2: B→B´.
Besides copying the object's content, the copy also needs a pointer (the forwarding pointer) that records the object's new address, so that the same object is not copied more than once.

4.2 The impact of architecture on the parallel copying algorithm

Multi-threaded parallel work has to take the memory models of different architectures into account. x86 is a strongly ordered memory architecture, while ARM is weakly ordered. Their memory ordering is shown in the following table:
For the parallel copying algorithm on a weakly ordered architecture, the memory ordering design means that other threads may observe the updated forwarding pointer before the object itself has been copied. To ensure consistency, a memory barrier (membar) must be inserted between copying the object and updating its object header. In the JVM, the object header update is abstracted uniformly as a CAS function.
CAS is implemented differently on different architectures: x86 uses the cmpxchgl instruction, while ARM uses the ldaxr/stlxr instructions.

4.3 The flow of the parallel copying algorithm

The flow chart of the parallel copying algorithm is shown in the figure below:

(1) Copy the object obj to its new location new_obj.
(2) Insert a memory barrier, then set the forwarding pointer of obj through CAS. If the CAS succeeds, go to (3); if it fails, go to (4).
(3) Push the reference to new_obj onto the stack and return new_obj.
(4) Discard the previously allocated copy and return the new_obj of the thread whose CAS succeeded.
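
The four steps above can be sketched in plain Java. This is only an illustrative model, not HotSpot code: the Obj class, the forwardee field and the copyToSurvivor method are hypothetical names, and an AtomicReference stands in for the object header. In Java, compareAndSet has volatile semantics, so it also plays the role of the barrier between the copy and the publication of the forwarding pointer.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicReference;

/** Toy heap object (hypothetical); forwardee models the forwarding pointer in the header. */
final class Obj {
    final int[] payload;
    // null until some GC thread wins the race to copy this object
    final AtomicReference<Obj> forwardee = new AtomicReference<>();

    Obj(int[] payload) {
        this.payload = payload;
    }
}

final class ParallelCopySketch {
    /** Copies obj at most once even when several threads race; returns the winning copy. */
    static Obj copyToSurvivor(Obj obj, Deque<Obj> localStack) {
        // (1) Speculatively copy the object to its new location.
        Obj newObj = new Obj(obj.payload.clone());
        // (2) Publish the forwarding pointer with CAS (acts as the barrier in this model).
        if (obj.forwardee.compareAndSet(null, newObj)) {
            // (3) This thread won: queue the copy so its fields can be scanned later.
            localStack.push(newObj);
            return newObj;
        }
        // (4) This thread lost: drop its own copy and use the winner's.
        return obj.forwardee.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Obj a = new Obj(new int[] {1, 2, 3});
        Obj[] seen = new Obj[2];
        Thread t1 = new Thread(() -> seen[0] = copyToSurvivor(a, new ArrayDeque<>()));
        Thread t2 = new Thread(() -> seen[1] = copyToSurvivor(a, new ArrayDeque<>()));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("both threads see the same copy: " + (seen[0] == seen[1]));
    }
}
```

Whichever thread loses the CAS simply abandons its speculative copy, which is why the forwarding pointer is enough to guarantee a single surviving copy.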

In hot spot analysis, we found that 60% of the CPU time of the copy operation is spent on the inserted memory barrier.

4.4 Q&A: optimizing the algorithm to reduce membar usage

Q: If the memory barrier is not inserted and multiple threads observe inconsistent memory, under what circumstances does this cause problems?
A:

T1: The object has not been fully copied yet, but it has already been pushed onto the stack.
T2: Steals the object from T1's thread stack, then copies and updates the member variables of an object whose copy is not yet complete, resulting in inconsistent data.

Q: For objects whose member variables do not need to be copied (for example: all member variables are of non-reference types; all reference-typed member variables are NULL; or the object itself is an array of a primitive type), is the memory barrier necessary?
A: No!

Q: How can these objects be recognized?
A:

Static analysis of the object: identify objects whose member variables are all of non-reference types, and arrays of primitive types. This part has been open sourced.
Dynamic analysis of the object: identify them through barrier technology.
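
The static check can be illustrated with Java reflection. This is a minimal sketch under an assumption: the JVM makes this decision from its own class metadata, not from reflection, and the ReferenceFreeCheck class with its nested Point and Node types is hypothetical and exists only for this example.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

final class ReferenceFreeCheck {
    /** Example type with primitive fields only (hypothetical). */
    static final class Point {
        int x;
        double y;
    }

    /** Example type with a reference field (hypothetical). */
    static final class Node {
        int value;
        Node next;
    }

    /** True if instances of type hold no references that a GC thread would have to follow. */
    static boolean isReferenceFree(Class<?> type) {
        if (type.isArray()) {
            // int[] or byte[] qualify; Object[] or int[][] do not
            return type.getComponentType().isPrimitive();
        }
        // Walk the class and all of its superclasses
        for (Class<?> c = type; c != null; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers())) {
                    continue; // static fields are not part of the instance
                }
                if (!f.getType().isPrimitive()) {
                    return false; // found an instance field of reference type
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("int[]  reference-free: " + isReferenceFree(int[].class));
        System.out.println("Point  reference-free: " + isReferenceFree(Point.class));
        System.out.println("Node   reference-free: " + isReferenceFree(Node.class));
    }
}
```

Objects that pass such a check have no fields for a stealing thread to follow, which is the condition under which the barrier can be skipped.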

With the optimized parallel copying algorithm, we achieved the expected improvements on SPECjbb and SPECjvm, as shown in the following figure:

4.5 G1 GC optimization

For G1 Full GC optimization, note that Full GC is divided into 4 phases:

Mark: Mark the live objects in the entire heap and record them.
Prepare: Calculate the position of each live object after in-place compaction.
Adjust: Update the references in object member variables according to the objects' new addresses.
Compact: Copy the memory data of the objects.

The Compact phase is generally the most time-consuming because it moves memory data. So, can we tolerate a certain amount of wasted space and skip or reduce the moving for regions that consist mostly of live objects, in order to improve the efficiency of the algorithm? The following figure shows the distribution of live objects:

We can see that:

The proportion of live objects per region follows a U-shaped distribution.
Research on benchmarks shows that in 41.27% of the regions, live objects account for 98% of the region.
To a certain extent, reducing the movement of objects is also consistent with the strong generational hypothesis.
Tests show that the performance of such applications can improve by 3% to 5%.

We have contributed the relevant code to the community, and you are welcome to check it out.
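
The idea of skipping dense regions can be sketched as a simple filter over per-region liveness data. This is an illustration only: the Region class, the regionsToCompact method and the threshold value used in main are hypothetical, and the actual contributed code may select regions differently.

```java
import java.util.ArrayList;
import java.util.List;

final class CompactionPlanSketch {
    /** A heap region with its capacity and the bytes occupied by live objects after marking. */
    static final class Region {
        final int id;
        final long capacity;
        final long liveBytes;

        Region(int id, long capacity, long liveBytes) {
            this.id = id;
            this.capacity = capacity;
            this.liveBytes = liveBytes;
        }

        double liveRatio() {
            return (double) liveBytes / capacity;
        }
    }

    /** Regions at or above skipThreshold stay in place; only the rest are compacted. */
    static List<Region> regionsToCompact(List<Region> regions, double skipThreshold) {
        List<Region> result = new ArrayList<>();
        for (Region r : regions) {
            if (r.liveRatio() < skipThreshold) {
                result.add(r);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Region> heap = new ArrayList<>();
        heap.add(new Region(1, 100, 99)); // almost full of live objects: left alone
        heap.add(new Region(2, 100, 10)); // mostly garbage: compacted
        heap.add(new Region(3, 100, 50)); // half live: compacted
        for (Region r : regionsToCompact(heap, 0.98)) { // 0.98 is an assumed threshold
            System.out.println("compact region " + r.id);
        }
    }
}
```

The trade-off is explicit in the threshold: a lower value moves fewer objects but leaves more unusable gaps inside the skipped regions.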

4.6 ZGC optimization

Bi Sheng JDK 11 is the first JDK to support ZGC on the ARM architecture.
The goal of ZGC is to manage terabytes of memory while keeping garbage collection pause times within 10 milliseconds. A ZGC cycle consists of 3 steps: concurrent marking (Mark), concurrent relocation (Relocate) and concurrent remapping (Remap). To make relocation efficient, a page takes part in relocation only when the proportion of reclaimable garbage in it reaches a certain threshold. In the current implementation, this threshold is controlled by the parameter ZFragmentationLimit, whose default value is 25.
How should ZFragmentationLimit be set? If it is too large, memory is wasted; if it is too small, reclamation efficiency is low.
Our approach is to collect relocation information (memory relocation rate, relocation time) while GC runs, predict how much memory can be relocated in the next GC, and use the predicted value to decide which pages take part in relocation. The steps are:
Calculate the memory relocation rate.
Predict the relocation rate of the current GC, using a normal distribution with a 99% confidence level.
Predict the relocation time of the current GC.
Predict the number of bytes the current GC can relocate.
Benchmark tests show an improvement of 3% to 5%. The code has been open sourced and is being synchronized with the community.
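
The prediction steps can be sketched as follows. The exact formulas used in Bi Sheng JDK are not reproduced here, so this sketch makes several labeled assumptions: the rate is taken as relocated bytes divided by relocation time for each past GC, the prediction is a conservative lower bound (mean minus 2.326 standard deviations, the one-sided 99% quantile of the standard normal distribution), and the relocation time of the current GC is passed in as a given budget. All class and method names are hypothetical.

```java
final class RelocationPredictorSketch {
    // One-sided 99% quantile of the standard normal distribution
    static final double Z_99 = 2.326;

    /** Predicted relocation rate in bytes per second, based on samples from past GCs. */
    static double predictRate(long[] relocatedBytes, double[] relocationSeconds) {
        int n = relocatedBytes.length;
        if (n == 0 || n != relocationSeconds.length) {
            throw new IllegalArgumentException("need one (bytes, seconds) pair per past GC");
        }
        double[] rates = new double[n];
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            rates[i] = relocatedBytes[i] / relocationSeconds[i];
            sum += rates[i];
        }
        double mean = sum / n;
        double squares = 0.0;
        for (double r : rates) {
            squares += (r - mean) * (r - mean);
        }
        double stdDev = Math.sqrt(squares / n);
        // Assumption: stay on the safe side by predicting a rate that past GCs
        // would have reached about 99% of the time, never below zero.
        return Math.max(0.0, mean - Z_99 * stdDev);
    }

    /** Predicted number of bytes that can be relocated within the given time budget. */
    static double predictBytes(long[] relocatedBytes, double[] relocationSeconds,
                               double timeBudgetSeconds) {
        return predictRate(relocatedBytes, relocationSeconds) * timeBudgetSeconds;
    }

    public static void main(String[] args) {
        long[] bytes = {100L << 20, 120L << 20, 90L << 20}; // bytes relocated in past GCs
        double[] seconds = {0.10, 0.11, 0.10};              // time each relocation took
        System.out.println("predicted bytes for a 0.1 s budget: "
                + predictBytes(bytes, seconds, 0.1));
    }
}
```

Pages would then be admitted to relocation until their total live bytes reach the predicted value, instead of relying on one fixed fragmentation percentage.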

5. JIT optimization: SVE algorithm optimization

5.1 Introduction to SVE algorithm optimization

SVE (Scalable Vector Extension) is the next-generation SIMD instruction set of the ARM AArch64 architecture.

Supports the SVE1 instruction set.
Automatically detects and adapts to SVE1 or NEON.
Supports the Z0~Z31 registers.
Supports all SVE register sizes from 128 to 2048 bits.
Supports the P0~P7 predicate registers.
Supports most auto-vectorization (SuperWord) nodes.

5.2 Results of SVE algorithm optimization

The nodes newly added for the Vector API have all been contributed to the upstream community and have not yet been merged into Bi Sheng JDK. So far, the SVE work has submitted a total of 11 patches to the upstream community, with more than 3000 lines of related code.

public static float sumReductionImplement(float[] a, float[] b, float[] c, float[] d, float total) {
    // Element-wise multiply-add followed by a sum reduction over d
    for (int i = 0; i < a.length; i++) {
        d[i] = (a[i] * b[i]) + (a[i] * c[i]) + (b[i] * c[i]);
        total += d[i];
    }
    return total;
}

The optimized NEON machine code is shown in the figure below:

The optimized SVE machine code is shown in the figure below:

6. Software and hardware collaboration: Kunpeng KAE hardware acceleration

KAE (Kunpeng Accelerator Engine) is a hardware accelerator provided by Huawei Kunpeng servers. The Kunpeng chip contains an independent I/O die that handles encryption and decryption.
Bi Sheng JDK provides KAEProvider to make full use of this hardware capability. Applications need only simple adaptation, with no code development, to use the hardware capabilities of Kunpeng servers and improve their operating efficiency.
The latest version of Bi Sheng JDK ships four encryption and decryption algorithms (AES, Digest, HMAC, RSA). In benchmark tests, some algorithms are accelerated by 40%, which greatly reduces running time in the security field. Joint development with Baolande is currently under way, and the second batch of supported algorithms will be released in Q2.
The encryption and decryption scheme is based on JCA (Java Cryptography Architecture), an important part of the Java platform. KAE provides its encryption and decryption services through JCA; in Bi Sheng JDK this is called KAEProvider. The process is shown in the figure below:
JCA provides two ways to select a provider, either in code or through a configuration file, as follows:
Method 1: Use the Security API to add the KAE Provider and set its priority.
Method 2: Modify the jre/lib/security/java.security file to add the KAE Provider and set its priority.
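
Method 1 can be sketched with the standard java.security.Security API. The provider class name below is an assumption and should be checked against your Bi Sheng JDK build; the sketch loads it reflectively so that it still compiles and runs on a JDK that does not ship the provider. The helper class and method names are hypothetical.

```java
import java.security.Provider;
import java.security.Security;

final class KaeProviderSetupSketch {
    // Assumed fully qualified class name of the KAE provider (verify for your JDK build)
    static final String ASSUMED_KAE_CLASS = "org.openeuler.security.openssl.KAEProvider";

    /** Tries to register the named provider at the given 1-based priority. */
    static boolean tryInstall(String className, int position) {
        try {
            Object instance = Class.forName(className).getDeclaredConstructor().newInstance();
            if (!(instance instanceof Provider)) {
                return false; // the class exists but is not a JCA provider
            }
            // insertProviderAt returns -1 when the provider is already installed
            return Security.insertProviderAt((Provider) instance, position) != -1;
        } catch (ReflectiveOperationException e) {
            return false; // provider class not available on this JDK
        }
    }

    public static void main(String[] args) {
        boolean installed = tryInstall(ASSUMED_KAE_CLASS, 1);
        System.out.println("KAE provider installed: " + installed);
        // List providers in priority order; JCA consults them from first to last
        for (Provider p : Security.getProviders()) {
            System.out.println(p.getName());
        }
    }
}
```

Method 2 achieves the same without code: the java.security file lists providers as numbered security.provider.N entries, and the number determines the priority, so placing the KAE provider's class name at a low number makes JCA try it first.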

7. What value can Bi Sheng JDK bring?

After evaluation and testing, Bi Sheng JDK has also backported a number of valuable features from the community.
G1 NUMA-Aware: this feature takes full advantage of NUMA and works even better on many-core hardware platforms. On top of the community version, Bi Sheng JDK also fixed several problems. For example, operating system thread scheduling could migrate threads across multiple nodes, and that migration, combined with the NUMA feature, could leave some memory regions unable to be reclaimed effectively. Bi Sheng JDK also enhanced the NUMA-Aware handling of large objects. The improvement is shown in the figure below:
AppCDS: this feature, introduced in JDK 10, stores String objects and class metadata objects in a shared file so that multiple JVM processes can share them, which reduces the loading and parsing of class metadata.
After porting this feature, Bi Sheng JDK achieved good results, and some big data scenarios improve by close to 10%.
G1 Uncommit: when memory usage is low, GC is triggered periodically and the reclaimed memory is returned to the operating system. This feature can significantly reduce the memory occupied in cloud scenarios. Building on the community version, Bi Sheng JDK changed memory release from serial to concurrent (the same approach is also adopted in the latest JDK 16).

After enabling G1 Uncommit, the figure below shows memory usage dropping steadily when the memory is not in use:

In the actual business scenario, the effect is even more obvious, as shown in the following figure:

Optimized parallel task stealing: in some applications, task stealing accounts for a high proportion of the work. Google has contributed a valuable design to the community that greatly optimizes parallel task stealing. In Bi Sheng JDK, PS, ParNew, G1, Shenandoah and others all benefit from it.
We are currently optimizing task stealing for many-core servers and will continue to open source the work once it is mature.
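
For readers unfamiliar with task stealing, the basic mechanism can be sketched as one queue per worker, where the owner takes its newest task and an idle worker steals the oldest task from someone else. This is a generic illustration with hypothetical names, not the design contributed by Google and not the Bi Sheng JDK implementation.

```java
import java.util.concurrent.ConcurrentLinkedDeque;

final class WorkStealingSketch {
    /** One task queue per GC worker thread; tasks are plain integers in this toy model. */
    static final class Worker {
        final ConcurrentLinkedDeque<Integer> queue = new ConcurrentLinkedDeque<>();

        void push(int task) {
            queue.addLast(task);
        }

        /** The owner takes its newest task from the tail (LIFO order). */
        Integer popLocal() {
            return queue.pollLast();
        }

        /** A thief takes the victim's oldest task from the head (FIFO order). */
        Integer stealFrom(Worker victim) {
            return victim.queue.pollFirst();
        }
    }

    /** Returns the next task for self: local work first, then stolen work, else null. */
    static Integer nextTask(Worker self, Worker[] all) {
        Integer task = self.popLocal();
        if (task != null) {
            return task;
        }
        for (Worker other : all) {
            if (other != self) {
                task = self.stealFrom(other);
                if (task != null) {
                    return task;
                }
            }
        }
        return null; // no work anywhere
    }

    public static void main(String[] args) {
        Worker busy = new Worker();
        Worker idle = new Worker();
        Worker[] all = {busy, idle};
        busy.push(1);
        busy.push(2);
        busy.push(3);
        System.out.println("idle worker steals task " + nextTask(idle, all));
        System.out.println("busy worker pops task " + nextTask(busy, all));
    }
}
```

Taking from opposite ends keeps the owner and the thieves from contending for the same task most of the time; how victims are chosen is where real implementations differ.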

8. The future development of Bi Sheng JDK

8.1 Features coming soon

Improved KAE hardware acceleration algorithms, expected to be released in Q2.
NUMA-Aware and parallel Full GC for G1 GC, to be implemented in Bi Sheng JDK 8 in Q2.
An enhanced jmap that can perform parallel dumps for CMS.

8.2 Future direction

Actively participate in the development and evolution of the SVE and Vector API features in the community. The code submitted so far exceeds 3000 lines.
Optimize memory management, with ongoing projects including generational ZGC, Thread Local GC and AOT.

9. How to get Bi Sheng JDK and get help

Download JDK 8 and JDK 11: https://kunpeng.huawei.com/#/developer/devkit/complier?data=JDK

9.1 JDK 8 code repository

https://gitee.com/openeuler/bishengjdk-8

9.2 JDK 11 code repository

https://gitee.com/openeuler/bishengjdk-11

Summary

In this article, we introduced what Bi Sheng JDK is, its overall development history, the circumstances under which Huawei decided to build it, and the low-level optimizations that have been made, along with the further value it has yet to deliver. As Peng Chenghan, a senior technical expert on Huawei's compiler team, said, it is our pursuit to bring the digital world to every person, every family and every organization, and to build an intelligent world in which all things are connected!
