The CTR model is widely used in Internet search, recommendation, advertising, and other scenarios. In recent years, with the introduction of deep neural networks, the hardware compute required for CTR model inference has grown steadily. This article introduces Meituan's practice in CTR model optimization: by analyzing the structural characteristics of the model and combining them with the GPU hardware architecture, we designed a set of processes to customize and optimize the model, achieving the goals of reducing latency, increasing throughput, and saving costs.
1 Background
CTR (Click-Through Rate) refers to the click-through rate of online advertisements, that is, the actual number of clicks on an advertisement divided by the number of times it is displayed. The scoring model that serves the CTR metric is generally called the CTR model, and the concept can be extended to the various models used to estimate conversion rates in Internet applications. The CTR model is widely used in recommendation, search, and advertising scenarios. Compared with models in CV (computer vision) and NLP (natural language processing) scenarios, CTR models historically had relatively simple structures and small amounts of computation, so Meituan's CTR models have long been served with CPU inference. With the introduction of deep neural networks in recent years, however, CTR model structures have gradually become more complex and the amount of computation larger and larger, and the CPU can no longer meet the models' demand for compute.
A GPU has thousands of compute cores, can provide dense parallel computing capability in a single machine, and has demonstrated powerful capabilities in CV, NLP, and other fields. Through CUDA[1] and related APIs, NVIDIA has established a complete GPU ecosystem. Building on this, Meituan's basic R&D platform deploys the CTR model to the GPU through a set of solutions. Looking at the model prediction stage alone, the deeply optimized GPU solution we provide based on the NVIDIA T4 increases throughput by 10x compared with the CPU under the same cost constraints. At the same time, in a typical search fine-ranking scenario, end-to-end throughput more than doubled.
In addition to increasing throughput and reducing costs, the GPU solution also opens up new possibilities for applying the CTR model. For example, in the search-box auto-completion scenario, the naturally interactive character of the feature makes latency requirements very demanding, so complex models generally cannot be used. With GPU support, the average response time of a complex model was reduced from 15 ms to 6-7 ms, meeting the online requirement.
Next, this article discusses the GPU optimization approach, results, and trade-offs of the new generation of CTR prediction services provided by the Meituan machine learning platform, in the hope of helping or inspiring readers engaged in related work.
2 Challenges of GPU inference for CTR models
2.1 Challenges at the application layer
- CTR model structures vary greatly and contain many business-specific components, while new SOTA models keep emerging. With limited manpower, hardware vendors focus their optimization on commonly used classic structures such as ResNet; for structures that have not converged, there is no official end-to-end optimization tool.
- The CTR model usually contains a large Embedding table, and the possibility that the Embedding table cannot fit entirely in GPU memory must be considered.
- In typical recommendation scenarios, in order to expose new POIs faster, model freshness requirements are very demanding, so the online model service needs to support incremental model updates.
2.2 Challenges at the framework level
- Operator level: the current mainstream deep learning frameworks, such as TensorFlow and PyTorch, can be regarded as second-generation frameworks; they first had to solve the problems of the first-generation framework, Caffe. An obvious problem with Caffe was the coarse granularity of its layers, so algorithm developers of that era had to be able to "write their own custom layers". TensorFlow and PyTorch both prioritize expressiveness, which results in relatively fine-grained operators and brings considerable extra overhead on both CPU and GPU architectures.
- Framework level: TensorFlow and PyTorch are essentially training frameworks, friendly to algorithm developers but not deployment-friendly. They contain many designs that facilitate distributed training; for example, TensorFlow has a built-in PartitionedVariable design to make it easy to shard a Variable across different parameter servers. In a GPU single-machine prediction scenario, these structures also introduce extra overhead.
2.3 Challenges at the hardware layer
First, TensorFlow's operator granularity is fine, so a model is usually composed of thousands of operators, and executing these operators on the GPU translates into launching the corresponding GPU kernels, where a kernel is a function executed in parallel on the GPU.
The execution of a GPU kernel can be roughly divided into stages such as data transfer, kernel launch, and kernel computation, and each kernel launch takes about 10 μs. A large number of small operators means each kernel runs for a very short time, so kernel launches consume most of the time. In addition, adjacent kernels must read and write GPU memory to pass data, incurring a large amount of memory-access overhead. Since GPU memory-access throughput is far lower than compute throughput, the result is low performance and low GPU utilization.
Second, a GPU card contains multiple compute units. In theory, different compute units can run different kernels, but in practice, for simplicity of the programming model, CUDA by default executes the kernels in a single stream serially, one at a time. Although multiple streams can be used to run kernels concurrently, fine-grained coordination mechanisms between streams are lacking.
After thorough investigation and discussion, we decided to focus the first phase on solving the inefficient execution of common CTR model structures on NVIDIA GPUs under the TensorFlow framework. We narrowed the problem down to the following two sub-problems:
- The operator granularity is too fine, so GPU execution efficiency is low.
- The model structure varies widely, manual optimization requires a large investment, and generality is poor.
3 Optimization methods
To solve the above problems, we surveyed the industry's deep learning accelerators. The more mature inference optimization solutions in the industry are mainly TensorRT, XLA, and TVM. TensorRT relies on manual optimization, performs operator fusion for certain fixed model structures, and provides efficient tuning for compute-intensive operators (such as convolution). XLA is TensorFlow's built-in compiler-based optimization tool; it mainly targets memory-bound structures and fuses operators through compilation. TVM[2] has more comprehensive optimization capabilities: it fuses operators through compilation and can automatically tune compute-intensive operators via machine learning.
After extensive research and comparison, we finally chose TVM as the optimization tool. Through compilation, TVM can better cope with ever-changing model structures and address the poor generality of manual optimization. However, applying TVM to business models also raises a series of problems: the number of supported operators is small, and support for dynamic shapes is not yet good enough. To address these two problems, we combined TVM with TensorFlow and, based on the structural characteristics of the CTR model and the hardware characteristics of the GPU, developed a series of processes to optimize the CTR model.
3.1 Operator fusion
By fusing multiple small operators into a single, semantically equivalent large operator, the number of kernels launched on the GPU can be effectively reduced. On the one hand, fewer kernels directly reduces kernel-launch overhead; on the other hand, the fused kernel performs more computation, which avoids the frequent memory accesses caused by passing data between multiple kernels and improves the compute-to-memory-access ratio.
In the equivalent structures on the left and right of the figure above, the work performed by the 21 operators on the left can be completed by one equivalent operator on the right. In terms of GPU activity, the left side requires at least 21 kernel launches and 21 rounds of GPU memory reads and writes, while the right side requires only 1 kernel and 1 round of memory reads and writes. Each fused operator needs a corresponding kernel implementation; however, the possible combinations of operators in a model are essentially unlimited, so manually implementing a kernel for every fused operator is unrealistic. TVM can automatically perform operator fusion and device code generation through compilation, avoiding the burden of handwriting kernels one by one.
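To make the fusion step concrete, below is a minimal sketch (not the production pipeline) of compiling a frozen TensorFlow subgraph with TVM, assuming a recent TVM release; `relay.build` performs operator fusion and CUDA code generation automatically. The file name, input name, and shape are placeholders.

```python
import tensorflow as tf
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Load a frozen subgraph (hypothetical file produced by the partitioning step).
graph_def = tf.compat.v1.GraphDef()
with open("frozen_subgraph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

shape_dict = {"dense_input": (256, 128)}      # assumed input name and shape
mod, params = relay.frontend.from_tensorflow(graph_def, shape=shape_dict)

# relay.build fuses elementwise/injective operators into larger kernels and
# generates device code for them in one pass.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda", params=params)

dev = tvm.cuda(0)                              # tvm.gpu(0) on older TVM releases
module = graph_executor.GraphModule(lib["default"](dev))
```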
3.1.1 Automatic TF-TVM graph partitioning
If a TensorFlow model contains operators that TVM does not support, the TVM conversion cannot be performed. Our idea is to cut out the parts that can be optimized by TVM and convert them into TVM engines, while the remaining parts continue to use TensorFlow operators. XLA and TensorRT face similar problems when converting graphs, so we analyzed the implementations of TF-XLA and TF-TRT:
- In TF-XLA, after Grappler[4] optimizes the graph, there is a POST_REWRITE_FOR_EXEC stage (searchable in the source code by this keyword). In this stage, three passes run over the graph to mark operators, encapsulate subgraphs, and rewrite subgraphs to build LaunchOps.
- In TF-TRT, an optimizer is registered with Grappler. This optimizer finds connected subgraphs that TensorRT supports and replaces them with TRT engines.
In the final solution, we followed the design of TF-TRT. Compared with XLA, the advantage of this design is that XLA's graph partitioning is tightly coupled with the TensorFlow source code: its three passes are embedded directly in the main flow of Session startup. Since the partitioning strategy and optimization strategies will iterate frequently, we did not want to be too tightly coupled with TensorFlow's source code. We therefore extended the TF-TVM solution, and in actual use we run this graph partitioning as an independent process, triggered automatically when the model is deployed or updated.
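As a rough illustration of how such an offline partitioning pass can work (a simplified sketch, not Meituan's actual implementation), the snippet below marks nodes whose op types are assumed to be convertible by TVM and groups connected marked nodes into candidate clusters that would become TVM engines:

```python
import collections

# Illustrative subset; the real supported-op list comes from the TVM frontend.
TVM_SUPPORTED_OPS = {"MatMul", "BiasAdd", "Relu", "Concat", "Reshape"}

def find_tvm_clusters(graph_def):
    """graph_def: a tf.compat.v1.GraphDef. Returns lists of node names, one per
    connected cluster of TVM-convertible nodes."""
    supported = {n.name for n in graph_def.node if n.op in TVM_SUPPORTED_OPS}
    # Build an undirected adjacency over supported nodes only.
    adj = collections.defaultdict(set)
    for node in graph_def.node:
        if node.name not in supported:
            continue
        for inp in node.input:
            src = inp.split(":")[0].lstrip("^")   # drop output slot / control marker
            if src in supported:
                adj[node.name].add(src)
                adj[src].add(node.name)
    # Connected components = candidate TVM subgraphs.
    clusters, seen = [], set()
    for name in supported:
        if name in seen:
            continue
        comp, stack = [], [name]
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            comp.append(cur)
            stack.extend(adj[cur] - seen)
        clusters.append(comp)
    return clusters
```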
In the inference phase, the optimized subgraph is executed with TVM, and the rest of the computation graph is executed natively by TensorFlow; together they complete the model's inference. Because the TVM and TensorFlow runtimes each use independent memory management, passing data between the two frameworks causes extra performance overhead. To reduce this overhead, we bridged the underlying data structures of the two frameworks to avoid extra data copies as much as possible.
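One way to illustrate the zero-copy idea (under the assumption of a recent TensorFlow and TVM, and not necessarily the exact mechanism used in production) is the DLPack protocol, which lets the two runtimes share the same buffer without copying:

```python
import tensorflow as tf
import tvm

tf_tensor = tf.random.uniform((256, 128))              # lives on the TF side
capsule = tf.experimental.dlpack.to_dlpack(tf_tensor)  # export without copying data
tvm_array = tvm.nd.from_dlpack(capsule)                 # imported by the TVM runtime

# The TVM GraphModule can now consume tvm_array directly; results can be handed
# back the same way, e.g.:
#   tf_out = tf.experimental.dlpack.from_dlpack(tvm_result.to_dlpack())
```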
3.1.2 Equivalent replacement of computation-graph structures
If the TensorFlow model contains too many operators that TVM does not support, the TF-TVM partitioning becomes fragmented and the final optimization effect suffers. To make the TVM-converted subgraphs as large and complete as possible, and to allow greater fusion during TVM optimization, we detect certain complex structures in the model and replace them with equivalent structures that execute more efficiently or fuse more easily.
For example, TensorFlow's native EmbeddingLookup structure segments the Embedding table in order to support distributed training, generating dynamic operators such as DynamicPartition and ParallelDynamicStitch. TVM does not support these dynamic operators, so the TF-TVM partitioning becomes too fine-grained. To obtain a more complete subgraph, we rewrite this structure via graph replacement: by merging the Embedding sub-tables in advance, we obtain a simplified EmbeddingLookup structure.
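The sketch below illustrates the idea of that replacement under the assumption of TensorFlow's default "mod" partition strategy; the table sizes and names are made up. The partitioned lookup is reduced to a single merged table plus a plain Gather, which TVM handles directly:

```python
import numpy as np
import tensorflow as tf

num_shards, vocab, dim = 4, 1000, 16
# Pretend these are the trained sub-tables of a partitioned Embedding variable.
shards = [np.random.rand(vocab // num_shards, dim).astype(np.float32)
          for _ in range(num_shards)]

# Undo "mod" sharding: row i of the merged table comes from shard i % num_shards.
merged = np.empty((vocab, dim), dtype=np.float32)
for shard_id, shard in enumerate(shards):
    merged[shard_id::num_shards] = shard

merged_table = tf.constant(merged)
ids = tf.constant([3, 42, 977])
emb = tf.gather(merged_table, ids)   # simple Gather, no DynamicPartition/Stitch
```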
3.2 CPU-GPU data transfer optimization
The subgraph optimized by TVM is replaced by a single node, which executes on the GPU and usually has dozens or even hundreds of inputs. The upstream inputs of this node (such as Placeholders) are usually executed on the CPU, so multiple CPU-GPU transfers are involved, and frequent transfers of small amounts of data cannot make full use of the bandwidth. To solve this problem, we modify the model structure by adding merge and split nodes to the computation graph, controlling where the graph is cut and reducing the number of data transfers.
One possible way of merging is to group these inputs by shape and dtype, merge each group, split them again afterwards, and place the split nodes inside the TVM subgraph so that they are optimized together. This approach has some problems: on the one hand, some subgraphs fuse poorly; on the other hand, the parameter memory of a GPU kernel function is limited to 4 KB, so when a TVM node has many inputs (for example, more than 512), the generated code can become invalid.
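A hedged sketch of the merge-then-split idea is shown below; the grouping keys, names, and shapes are illustrative only. Each group of same-shape, same-dtype inputs crosses the CPU-GPU boundary as a single tensor and is split back apart inside the optimized subgraph:

```python
import tensorflow as tf

def merge_same_shape_inputs(inputs):
    """inputs: dict name -> tf.Tensor. Group by (dtype, shape) and concatenate
    each group so it needs only one host-to-device transfer."""
    groups = {}
    for name, tensor in inputs.items():
        groups.setdefault((tensor.dtype, tuple(tensor.shape)), []).append((name, tensor))
    merged = []
    for (_dtype, _shape), members in groups.items():
        names = [n for n, _ in members]
        merged.append((names, tf.concat([t for _, t in members], axis=0)))
    return merged

def split_inside_subgraph(names, packed):
    """Inside the TVM subgraph, split the packed tensor back into per-feature tensors."""
    return dict(zip(names, tf.split(packed, num_or_size_splits=len(names), axis=0)))
```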
3.3 Manual optimization of high-frequency subgraphs
For subgraphs that TVM cannot support, we abstracted the structures most frequently used in our business and implemented efficient GPU versions of them as handwritten custom operators.
For example, some of the sequence features in the model take String input: the input string is converted into a padded numeric Tensor, and the resulting int Tensor is used as indices for an Embedding lookup. The semantics of this subgraph are shown in the figure; we refer to it below as the SE structure (StringEmbedding):
For this structure, TensorFlow's native implementation has only a CPU version. With a large amount of data and a high degree of parallelism, its performance degrades severely and it becomes the bottleneck of the entire model. To optimize this part, we implemented an efficient equivalent operation on the GPU.
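For reference, the semantics of the SE structure can be sketched with stock TensorFlow 2.x ops as below (the separator, padding length, and table size are assumptions); this corresponds to the CPU-bound path that the custom GPU operators replace:

```python
import tensorflow as tf

def string_embedding_reference(raw, table, max_len=8):
    """raw: [batch] strings such as "3,17,5" encoding a behaviour sequence."""
    tokens = tf.strings.split(raw, sep=",")                               # ragged strings
    ids = tf.ragged.map_flat_values(tf.strings.to_number, tokens,
                                    out_type=tf.int32)                    # ragged int ids
    ids = ids.to_tensor(default_value=0, shape=[None, max_len])           # pad / truncate
    return tf.nn.embedding_lookup(table, ids)                             # [batch, max_len, dim]

table = tf.random.uniform((1000, 16))
emb = string_embedding_reference(tf.constant(["3,17,5", "42"]), table)
```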
As shown in the figure, the PadString operator pads multiple strings to the maximum length on the CPU side and concatenates them into a single uint8 Tensor with contiguous memory, which is transferred to the GPU in one go. After StringEmbedding receives the padded strings, it exploits the GPU's parallel computing capability, with a large number of cooperating threads completing string splitting and table lookup. For key steps such as reduction and prefix sum, it uses GPU Reduce/Scan algorithms, and the encoding process uses warp shuffle instructions so that threads exchange data through registers, avoiding the overhead of frequent memory accesses and achieving good performance.
With the GPU Scan algorithm, a prefix sum over 8 elements requires only 3 iteration rounds. For a model containing dozens of such operations, the GPU timeline before and after manual optimization is compared in the figure below: the time consumed by the H2D + StringEmbedding part is greatly reduced, from 42 ms to 1.83 ms.
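To illustrate why only three rounds are needed, the following NumPy sketch mimics a Hillis-Steele style inclusive scan, where each round corresponds to one warp-level step in which all lanes add a shifted value in parallel (the real kernels use register exchange via warp shuffle intrinsics such as __shfl_up_sync; this is only an analogy):

```python
import numpy as np

def parallel_inclusive_scan(x):
    """Inclusive prefix sum in ceil(log2(n)) parallel rounds."""
    x = np.asarray(x).copy()
    step, rounds = 1, 0
    while step < len(x):
        shifted = np.zeros_like(x)
        shifted[step:] = x[:-step]   # value each lane would receive from `step` lanes below
        x = x + shifted              # all lanes update simultaneously
        step *= 2
        rounds += 1
    return x, rounds

values, rounds = parallel_inclusive_scan([1, 2, 3, 4, 5, 6, 7, 8])
print(values, rounds)                # [ 1  3  6 10 15 21 28 36 ] 3
```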
In addition to the StringEmbedding structure, we implemented efficient fused versions of structures such as StringSplit + ToNumber + SparseSegmentSqrt and multi-way parallel StringEmbedding; during optimization, these are substituted in via structural pattern matching.
3.4 CPU-GPU traffic splitting
For actual online RPC requests, the number of samples in each request (hereinafter, the Batch) varies within the range [1, MaxValue], where MaxValue is relatively fixed, determined by upstream business systems and other basic system capabilities. As shown in the figure above, taking a search service as an example, we measured the online Batch distribution: requests with Batch = MaxValue account for about 45%, Batch = 45 for 7.4%, and Batch = 1 for 2.3%, with the remaining Batch values each accounting for 0.5% to 1%. For the GPU, increasing the batch size of a single request makes better use of hardware resources, exploits the GPU's parallel computing capability, and yields better latency and throughput relative to the CPU; when the batch is small, the GPU's advantage over the CPU is not obvious (the figure below shows how latency changes on the CPU/GPU when we tested the same model under fixed load).
Since most requests are handled by the GPU, the CPU resources have considerable headroom. Routing some small-batch requests to the CPU makes the resource utilization of the whole worker more balanced and improves overall system performance. Based on testing, we set a batch threshold and a decision rule for executing the computation graph on heterogeneous hardware: small-batch requests are executed directly on the CPU, and only requests whose Batch exceeds the threshold are inferred on the GPU. According to online statistics, 77% of the overall traffic runs on the GPU and 23% on the CPU.
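The routing decision itself is simple; a hedged sketch with an illustrative threshold is shown below (the threshold value and callables are assumptions, not the production configuration):

```python
BATCH_THRESHOLD = 16   # assumed value; in practice chosen from benchmark curves

def route_request(batch_size, run_on_cpu, run_on_gpu, request):
    """run_on_cpu / run_on_gpu are callables wrapping the two deployed graphs."""
    if batch_size < BATCH_THRESHOLD:
        return run_on_cpu(request)   # small batch: GPU gains are marginal
    return run_on_gpu(request)       # large batch: exploit GPU parallelism
```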
In the GPU's various optimization strategies and operations, batch size is a very important piece of information: the optimal kernel implementation may differ across batch sizes in order to achieve the best compute performance for the corresponding workload. Because of the characteristics of online traffic, the batch sizes of requests sent to the GPU are quite scattered, and optimizing a kernel implementation of the model for every possible batch is obviously neither economical nor general. We therefore designed a batch-bucketing strategy: generate optimized models for N fixed batch sizes, find the bucket closest to the actual request's batch when it arrives, and pad the request up to the corresponding Batch for computation, thereby improving GPU utilization.
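A hedged sketch of the bucketing-and-padding idea follows; the bucket sizes and helper names are illustrative, not the production settings:

```python
import bisect
import numpy as np

BUCKETS = [1, 8, 16, 32, 64, 128, 256]   # assumed; one optimized engine per bucket

def pad_to_bucket(batch_inputs):
    """batch_inputs: np.ndarray of shape [batch, ...]; returns (padded, real_batch, bucket)."""
    real_batch = batch_inputs.shape[0]
    assert real_batch <= BUCKETS[-1], "MaxValue is assumed to fit in the largest bucket"
    bucket = BUCKETS[bisect.bisect_left(BUCKETS, real_batch)]
    pad_rows = bucket - real_batch
    # Repeat the last row as padding; padded rows are discarded after inference.
    padded = np.concatenate([batch_inputs, np.repeat(batch_inputs[-1:], pad_rows, axis=0)])
    return padded, real_batch, bucket

def run_with_bucketing(engines, batch_inputs):
    """engines: dict mapping bucket size -> compiled model callable."""
    padded, real_batch, bucket = pad_to_bucket(batch_inputs)
    outputs = engines[bucket](padded)
    return outputs[:real_batch]          # drop the padded rows
```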
4 Stress test performance analysis
We selected one model for an online stress-test performance analysis.
- The CPU test environment is a 16-core Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz with 16 GB of memory.
- The GPU test environment is an 8-core Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz, one Tesla T4 GPU, and 16 GB of memory.
The figure below compares the inference latency (y-axis) of the GPU model at various BatchSizes under different QPS (x-axis). For the GPU model, below BatchSize = 128 the inference time differs little, and larger BatchSizes are more conducive to throughput. Comparing the GPU model at BatchSize = 256 with the CPU model at BatchSize = 25: when QPS is below 64, the inference times of the two are basically the same; when QPS exceeds 64, the GPU's inference latency is lower than the CPU's. Overall, the GPU's throughput is 10 times that of the CPU.
We can also see the difference in the steepness of the curves: once QPS exceeds 64, CPU latency rises rapidly, while the GPU remains stable and does not rise significantly until QPS exceeds 128, and even then it is still more stable than the CPU.
5 Overall architecture
Based on the structural characteristics of the CTR model, we abstracted a platform-level general optimization pipeline. By analyzing the model structure, appropriate optimization strategies are applied automatically, and performance evaluation and consistency checks are used to guarantee the effect of the optimized model.
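As an example of the consistency-check part of the pipeline (a minimal sketch with assumed tolerances, not the actual implementation), the original and optimized models are fed the same sampled requests and their outputs must agree before the optimized artifact goes online:

```python
import numpy as np

def consistency_check(run_original, run_optimized, sampled_requests,
                      rtol=1e-3, atol=1e-5):
    """run_original / run_optimized are callables wrapping the two models."""
    for req in sampled_requests:
        ref = np.asarray(run_original(req))
        opt = np.asarray(run_optimized(req))
        if not np.allclose(ref, opt, rtol=rtol, atol=atol):
            return False          # reject the optimized model
    return True
```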
6 Limitations and future planning
In terms of ease of use, the current solution takes the form of a set of online optimization scripts: after a user submits a model, it is optimized and deployed automatically. Because of the analysis and editing of the computation-graph structure and the TVM compilation, model optimization currently takes a long time, around 20 minutes for most models. We need to consider speeding up TVM compilation in the future.
In terms of generality, in our actual applications, TVM compilation optimization and high-performance handwritten operators are the main sources of gains. Manual optimization tests developers' understanding of the business model and their GPU programming skill: writing a high-performance fused operator is not easy, and making it portable and extensible to some degree is even harder.
In general, CTR model inference on GPUs still faces many issues. Besides providing better performance based on an understanding of the business, we also need to handle models whose scale exceeds GPU memory and to support online model updates.
About the Authors
Weilong, Xiaozhuo, Wenkui, Yunfei, Xiaoxin, and others are all from Meituan's basic R&D platform, the machine learning prediction engine group.
References
[1] CUDA C++ Programming Guide
[2] TVM Documentation
[3] Accelerating Inference In TF-TRT User Guide
[4] TensorFlow graph optimization with Grappler