Author: vivo Internet Server Team - Chen Dongxing, Li Haoxuan, Chen Jinxia

As business complexity grows, performance optimization has become a required skill for every engineer. Where should optimization start? How do you trace a performance bottleneck from the symptoms of a problem? How do you verify that an optimization actually works? This article shares the performance tuning practices from vivo's push recommendation project, in the hope of offering some lessons and references.

1. Background

In push recommendation, the online service consumes events from Kafka describing users to be reached, and then selects the most suitable articles to push to those target users. The service is written in Java and is CPU-intensive.

As the business grew, request concurrency and the amount of model computation kept increasing, creating performance bottlenecks in the service: Kafka consumption fell seriously behind, target users could not be reached in time, and business growth demands went unmet. A dedicated round of performance optimization was therefore urgently needed.


2. Optimization metrics and approach

Our performance metric is throughput (TPS), which follows from the classic formula TPS = concurrency / average response time (RT). For example, at a fixed concurrency of 100, cutting the average RT from 50 ms to 25 ms doubles TPS from 2,000 to 4,000. This gives two levers for increasing TPS:

  • Increase concurrency, for example by raising the number of parallel threads on a single machine or by scaling out to more machines;
  • Reduce the average response time (RT), which consists of application thread execution time (business logic) and the JVM's own GC time.

In practice, the CPU utilization of our machines was already above 80%, so the expected benefit of raising single-machine concurrency was limited; we therefore devoted our main effort to reducing RT.

The following sections explain in detail how we analyzed and located performance bottlenecks in hot code and JVM GC, and used three techniques to increase throughput by 100%.

3. Hot code optimization

How do you quickly find the most time-consuming hot code in an application? Using Alibaba's open-source Arthas tool, we captured a CPU flame graph of the online service.


About flame graphs: a flame graph is an SVG image generated from sampling results (such as perf) and is used to visualize CPU call stacks.

The y-axis represents the call stack, with each layer being a function. The deeper the call stack, the higher the flame, with the executing function at the top and its parent function below.

The x-axis represents the number of samples. The wider a function's frame, the more often it was sampled, i.e., the longer it ran. Note that the x-axis does not represent the passage of time: all call stacks are merged and sorted alphabetically.

To read a flame graph, look for the widest frames at the top level. Any wide "plateau" indicates a function that may have a performance issue.

Colors carry no special meaning; warm colors are conventionally chosen simply because the flame graph depicts how "hot" the CPU is.

3.1 Optimization 1: Avoid the native String.split method where possible

3.1.1 Performance Bottleneck Analysis

From the flame graph, we first noticed that 13% of CPU time was spent in the java.lang.String.split method.

Anyone experienced with performance optimization knows that the native split method is a notorious performance killer: it is inefficient, and frequent calls consume considerable resources.

Our feature processing, however, has to split strings frequently. How do we optimize this?

By analyzing the split source code and our project's usage scenarios, we found three optimization points:

(1) The business does not use regular expressions, yet when the delimiter is two or more characters, the native split treats it as a regular expression by default, and regular expression matching is comparatively slow.


(2) When the delimiter is a single character (and not a regex metacharacter), the native String.split takes an optimized fast path, but some of its internal conversions are redundant in our scenario and cost performance.

Specifically, it performs the split with the String.indexOf and String.substring methods, stores the pieces in an ArrayList, and finally converts the ArrayList to a String[] for output. Our business usually needs a List result anyway, so this adds two unnecessary conversions between List and String[].


(3) In the code path where split is called most often, only the first element of the result is actually needed. The native split method (and various utility classes) provide optimized overloads that accept a limit parameter and return early once the limit is reached; but the business code used str.split(delim)[0], which splits the entire string and is far from optimal.

3.1.2 Optimization scheme

For these business scenarios, we implemented a custom, performance-optimized split utility.

import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang3.StringUtils;

/**
 * Custom split utilities.
 */
public class SplitUtils {

    /**
     * Custom split that returns only the first element.
     *
     * @param str   the string to split
     * @param delim the delimiter
     * @return the first substring before the delimiter
     */
    public static String splitFirst(final String str, final String delim) {
        if (null == str || StringUtils.isEmpty(delim)) {
            return str;
        }

        int index = str.indexOf(delim);
        if (index < 0) {
            return str;
        }
        if (index == 0) {
            // The string starts with the delimiter: return an empty string.
            return "";
        }

        return str.substring(0, index);
    }

    /**
     * Custom split that returns all elements.
     *
     * @param str   the string to split
     * @param delim the delimiter
     * @return the list of split substrings
     */
    public static List<String> split(String str, final String delim) {
        if (null == str) {
            return new ArrayList<>(0);
        }

        if (StringUtils.isEmpty(delim)) {
            List<String> result = new ArrayList<>(1);
            result.add(str);

            return result;
        }

        final List<String> stringList = new ArrayList<>();
        while (true) {
            int index = str.indexOf(delim);
            if (index < 0) {
                stringList.add(str);
                break;
            }
            stringList.add(str.substring(0, index));
            str = str.substring(index + delim.length());
        }
        return stringList;
    }

}

Compared with the native String.split, our version makes several changes:

  1. Drops regular-expression support and splits only on a literal delimiter;
  2. Returns a List directly. The splitting itself mirrors the native single-character fast path, using String.indexOf and String.substring to collect results into a List, which is returned as-is, eliminating the conversions between List and String[];
  3. Provides a splitFirst method, which improves performance further when the business only needs the first substring before the delimiter.

3.1.3 Microbenchmarks

How do we verify the optimization? We chose JMH (Java Microbenchmark Harness) as the micro-benchmarking tool and compared our implementation against the native String.split and Apache Commons Lang's StringUtils.split.
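For reference, here is a minimal sketch of the kind of JMH harness we used. The input string, delimiter, and class names are illustrative placeholders rather than our production feature data:

import java.util.List;
import java.util.concurrent.TimeUnit;
import org.apache.commons.lang3.StringUtils;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
public class SplitBenchmark {

    // Non-final instance fields so the JIT cannot constant-fold the inputs.
    String input = "fieldA_fieldB_fieldC_fieldD_fieldE";
    String delim = "_";

    @Benchmark
    public String[] nativeSplit() {
        return input.split(delim);
    }

    @Benchmark
    public String[] apacheSplit() {
        return StringUtils.split(input, delim);
    }

    @Benchmark
    public List<String> customSplit() {
        return SplitUtils.split(input, delim);
    }

    @Benchmark
    public String customSplitFirst() {
        return SplitUtils.splitFirst(input, delim);
    }
}

The test results are as follows: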

Using a single-character delimiter


The native implementation performs about the same as the Apache utility class, while the custom implementation is roughly 50% faster.

Using a multi-character delimiter


With a two-character delimiter, the native implementation slows down dramatically, to about 1/3 of its single-character throughput; the Apache implementation also drops, to about 2/3 of its previous level; the custom implementation remains essentially unchanged.

Using a single-character delimiter and returning only the first result


When only the first result is needed with a single-character delimiter, the custom implementation is twice as fast as the native implementation, and five times as fast as splitting the complete string with the native implementation and taking the first element.

3.1.4 End-to-end optimization effect

After verifying the gains with micro-benchmarks, we deployed the optimization to the online service to validate the overall end-to-end benefit.

Collecting flame graphs with Arthas again showed the split method's share falling to about 2%. Overall, end-to-end latency fell by 31.77% and throughput rose by 45.24%, a particularly significant gain.

3.2 Optimization 2: Speed up map lookups

3.2.1 Performance Bottleneck Analysis

From the flame graph we also found that the HashMap.getOrDefault method consumed a lot of time, about 20%, mainly in lookups of the model weight map. This is because:

  1. The business genuinely calls it at high frequency: feature crossing expands the number of features, and a single machine performs roughly 10 million lookups per second;
  2. The weight map itself is very large, holding more than 10 million entries and occupying a lot of memory. The probability of hash collisions rises accordingly, and on a collision the lookup degrades from O(1) to O(n) (linked list) or O(log n) (red-black tree).

HashMap is already a very efficient map implementation. We first tried adjusting the load factor (loadFactor) and switching to other map implementations, but saw no significant gains.
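For reference, this is the kind of tuning we attempted (the capacity and load factor below are illustrative values, not our production configuration); a lower load factor enlarges the bucket table, trading memory for fewer collisions:

import java.util.HashMap;
import java.util.Map;

public class WeightMapTuning {
    public static void main(String[] args) {
        // Presize for tens of millions of entries and lower the load factor
        // from the default 0.75 so buckets stay shorter.
        Map<String, Float> weights = new HashMap<>(32 * 1024 * 1024, 0.5f);
        weights.put("user_tag_x_article_topic", 0.37f);
        System.out.println(weights.getOrDefault("user_tag_x_article_topic", 0.0f));
    }
}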

How can we improve the performance of the get method?

3.2.2 Optimization scheme

During the analysis we noticed that the lookup keys (feature keys after cross processing) are strings with an average length of more than 20 characters. String's equals method compares the underlying char[] element by element, so the longer the key, the slower each comparison:

public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String) anObject;
        int n = value.length;
        if (n == anotherString.value.length) {
            char v1[] = value;
            char v2[] = anotherString.value;
            int i = 0;
            while (n-- != 0) {
                if (v1[i] != v2[i])
                    return false;
                i++;
            }
            return true;
        }
    }
    return false;
}

Could we shorten the keys, or even replace them with a numeric type? A quick micro-benchmark suggested the idea would work.

So we spoke with the algorithm engineers. Coincidentally, they had the same need: while migrating to a new training framework, they had found string features particularly inefficient and wanted to replace them with numeric values.

The two needs aligned, and a plan was quickly settled:

  1. The algorithm team maps each feature key to a long value, using a custom hash implementation designed to minimize the collision probability (see the sketch after this list);
  2. The algorithm team trains and exports a new model whose weight map can hold more entries while matching the effectiveness metrics of the baseline model;
  3. Once the baseline metrics are matched, the new model is gradually rolled out to the online service with the weight map keyed by long, and the performance metrics are verified.
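To make the idea concrete, here is a sketch of the key change: the weight map goes from Map&lt;String, Float&gt; to Map&lt;Long, Float&gt;, so a lookup hashes the feature key once and then compares two longs instead of walking a 20+ character string. The FNV-1a hash below is only a stand-in; the team's actual custom hash is different:

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class FeatureKeyHasher {

    // Illustrative 64-bit FNV-1a hash; the production hash is a custom design
    // tuned to minimize collisions on our feature key distribution.
    public static long hash64(String featureKey) {
        long h = 0xcbf29ce484222325L;          // FNV offset basis
        for (byte b : featureKey.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xffL);
            h *= 0x100000001b3L;               // FNV prime
        }
        return h;
    }

    public static void main(String[] args) {
        Map<Long, Float> weights = new HashMap<>();
        weights.put(hash64("user_tag_x_article_topic_sports"), 0.37f);

        // Lookup by hashed key: long equality instead of String.equals.
        float w = weights.getOrDefault(hash64("user_tag_x_article_topic_sports"), 0.0f);
        System.out.println(w);
    }
}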

3.2.3 Optimization effect

With 30% more feature entries (and model effectiveness exceeding the baseline), the engineering side also achieved significant gains:

End-to-end latency dropped by 20.67% and throughput rose by 26.09%. Memory usage improved as well, with the weight map shrinking by 30%.


4. JVM GC optimization

Java was designed with automatic garbage collection to free developers from manual dynamic memory management: there is no need to worry about allocating and freeing memory, or about the lifetimes of allocated objects. This eliminates an entire class of memory-management bugs, at the cost of some runtime overhead.

When developing small systems, the performance overhead of GC is negligible; at scale (especially for applications with large data volumes, many threads, and high transaction rates), GC overhead cannot be ignored and may even become a significant performance bottleneck.

(Figure: throughput of an otherwise perfectly scalable system as processor count grows, for different fractions of time spent in GC)

The diagram above models an ideal system that scales perfectly except for garbage collection. The red line is an application that spends only 1% of its time in GC on a uniprocessor; on a 32-processor system this already costs over 20% of throughput. The magenta line shows that an application spending 10% of its time in GC (not an outlandish amount on a uniprocessor) loses more than 75% of its throughput when scaled to 32 processors.

JVM GC tuning is therefore also an important performance optimization measure.

Our recommendation service runs on high-spec machines (64 cores, 256 GB), so GC is a considerable factor. Monitoring of the online service's GC data showed the situation was quite bad: cumulative young-GC (YGC) time was about 10 seconds per minute.

Why was the GC overhead so large, and how could we reduce GC time?

4.1 Optimization 3: Use off-heap cache instead of on-heap cache

4.1.1 Performance Bottleneck Analysis


We dumped the service's live heap objects and analyzed them with the MAT (Eclipse Memory Analyzer) tool. Two objects turned out to be enormous, together accounting for 76.8% of the live heap:

  • The largest is the local cache, which stores fine-grained common data, tens of millions of entries per machine. It uses the Caffeine cache component with an automatic refresh cycle of 1 hour, the goal being to minimize IO queries;
  • The second largest is the model weight map itself, which resides in memory and is never updated in place; when a new model is loaded, the old model's map is unloaded.

4.1.2 Optimization scheme

How can we cache as much data as possible while avoiding excessive GC pressure?

We decided to move the cached objects off the heap, where they are no longer limited by the heap size; and because off-heap memory is not managed by the JVM GC, an oversized cache no longer affects GC. After some research we chose OHC, a mature open-source off-heap cache component.

(1) Introduction to OHC

OHC stands for off-heap cache. It began in 2015 as a caching framework developed for Apache Cassandra and was later split out of the Cassandra project into a standalone library. Its project page is https://github.com/snazy/ohc.

Features
  • Data is stored off-heap, with only a small amount of metadata on the heap, so the cache does not affect GC
  • Supports a per-entry expiration time
  • Supports configurable LRU and W-TinyLFU eviction policies
  • Can maintain a very large number of cache entries
  • Supports asynchronous cache loading
  • Microsecond-level read and write latency

(2) OHC usage

Quick start:

OHCache ohCache = OHCacheBuilder.newBuilder()
        .keySerializer(yourKeySerializer)
        .valueSerializer(yourValueSerializer)
        .build();

OHC also exposes a number of optional configuration items, such as total capacity, segment count, hash table size, and eviction policy.

In our service, capacity is set to 12 GB, segmentCount to 1024, and Kryo is used as the serialization protocol.
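For illustration, a minimal, self-contained sketch of wiring up such a cache is shown below. It assumes OHC's builder API in org.caffinitas.ohc and uses a simple String serializer for brevity; our service used Kryo serialization, and the key/value types here are illustrative:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import org.caffinitas.ohc.CacheSerializer;
import org.caffinitas.ohc.OHCache;
import org.caffinitas.ohc.OHCacheBuilder;

public class OffHeapCacheDemo {

    // Simple length-prefixed String serializer for the demo.
    static final CacheSerializer<String> STRING_SERIALIZER = new CacheSerializer<String>() {
        @Override
        public void serialize(String value, ByteBuffer buf) {
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            buf.putInt(bytes.length);
            buf.put(bytes);
        }

        @Override
        public String deserialize(ByteBuffer buf) {
            byte[] bytes = new byte[buf.getInt()];
            buf.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }

        @Override
        public int serializedSize(String value) {
            return 4 + value.getBytes(StandardCharsets.UTF_8).length;
        }
    };

    public static void main(String[] args) throws Exception {
        OHCache<String, String> cache = OHCacheBuilder.<String, String>newBuilder()
                .keySerializer(STRING_SERIALIZER)
                .valueSerializer(STRING_SERIALIZER)
                .capacity(12L * 1024 * 1024 * 1024) // 12 GB off-heap, as in our service
                .segmentCount(1024)
                .build();

        cache.put("user:123", "profile-blob");
        System.out.println(cache.get("user:123"));
        cache.close();
    }
}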

4.1.3 Optimization effect

After switching to the off-heap cache, the service's YGC time dropped to about 800 ms per minute, and overall end-to-end throughput increased by about 20%.

4.2 A follow-up question

In the JVM GC optimization above, we moved the local cache objects from the Java heap to off-heap memory and achieved good performance gains. Remember the other giant object mentioned earlier, the model weight map? Could it be moved out of the Java heap as well?

The answer is yes. We rewrote the model inference computation in C++, including weight map storage and lookup, scoring, and related logic; compiled the C++ code into a .so library; and invoke it from Java through native methods. This moves the weight map out of the JVM heap entirely, again yielding good performance gains.
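A sketch of what the Java side of such a native bridge could look like (the library name, method names, and signatures are illustrative assumptions, not the actual interface of our service):

public class NativeModelScorer {

    static {
        // Load the native inference library, e.g. libmodel_scorer.so on Linux.
        // The library name is hypothetical.
        System.loadLibrary("model_scorer");
    }

    // Hypothetical native entry points implemented in C++. The weight map lives
    // in native memory, entirely outside the JVM heap and invisible to the GC.
    public static native long loadModel(String modelPath);

    public static native float[] score(long modelHandle, long[] featureKeys);

    public static native void releaseModel(long modelHandle);
}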


5. Conclusion

Through the three measures described above, covering hot code optimization and JVM GC, we improved the service's load capacity and performance, doubling overall throughput and reaching our stage goal.

Performance tuning is never finished, though, and the circumstances of every business scenario and system differ greatly; no single article can cover every optimization case. We hope the practical experience shared here, on choosing an optimization direction, starting the analysis, and verifying the gains, offers some useful lessons and references.

