On July 9, the GOTC 2021 Global Open Source Technology Summit (Shanghai) was held jointly with the WAIC World Artificial Intelligence Conference. The summit focused on AI and cloud native, two open source-driven cutting-edge technologies, and invited front-line technical experts from national research institutions and leading Internet companies to share in-depth, practical industry content with participating developers and technology enthusiasts, offering a rare platform for technical exchange.

At the summit, Tencent Cloud senior engineer Gao Ce gave a talk entitled "Exploration and Practice of Building Cloud Native AI Platforms on Public Clouds", introducing the current status of AI services on public clouds, the typical technology choices, and the problems they face. Finally, by analyzing trends in the open source community and the industry, he shared with the audience a vision of fully elastic AI infrastructure for the future.

This article is compiled from that talk so that readers can review its content.

Follow the [Tencent Cloud Native] official account and reply with the keyword [Cloud Native AI] to get the slides from the talk.

Background and current situation

Since the rise of deep learning, new model structures have emerged one after another. Since GPT-1 and BERT came out in 2018, model parameter counts have grown exponentially. Transformers and related structures now shine not only in natural language processing but also in computer vision and other fields, so the demand for compute power and GPU memory will only grow stronger. Hardware vendors, represented by NVIDIA, cannot improve hardware performance at the same pace. The figure above shows the gap between the two: the red line is the trend of model parameter scale, currently growing at roughly 120x per year, while the green line shows memory capacity growing at only about 2x per year.

Therefore, whether in computer vision, natural language processing, or the search/advertising/recommendation field widely deployed across the Internet industry, distributed training has become the mainstream training method.

Correspondingly, deep learning frameworks are also flourishing. Established frameworks such as TensorFlow, PyTorch, and Keras remain very popular, while newer frameworks such as Microsoft's DeepSpeed and Baidu's Paddle are gradually emerging.

In summary, AI has been widely adopted across the industry. Beyond the traditional search/advertising/recommendation field, deep learning-based methods have become the state of the art in computer vision and natural language processing, and in fields such as games and robotics, reinforcement learning is slowly moving toward production. To meet business demand for complex models, new hardware and frameworks keep emerging. There is also a very clear trend: many AI workloads are moving to the public cloud, hoping to use its elastic computing capabilities to reduce compute costs and improve efficiency.

AI adoption on public clouds

Next, we share some observations about cloud-native AI gathered while serving customers on public clouds.

Cloud-native AI on public clouds is gradually being adopted, covering both sparse workloads such as search/advertising/recommendation and dense workloads such as computer vision. Recommendation scenarios are especially common in the Internet sector. However, because search/advertising/recommendation business scenarios are complex and require low end-to-end latency, the cost of migration is relatively high, so most of these businesses, especially their offline training pipelines, still cannot make good use of the cloud's elasticity.

At the same time, from the perspective of deep learning frameworks, most businesses still use TensorFlow. This correlates with the previous observation: TensorFlow continues to dominate the search/advertising/recommendation business. However, PyTorch is being used more and more, especially in computer vision and natural language processing.

Tencent Cloud Native AI Service

Combining these observations with our own practice, the Tencent Cloud Native team has built a cloud-native AI product solution for Tencent Cloud Container Service around Kubeflow. A free beta has already started; you are welcome to contact us for a trial, and any suggestions you have will be valuable motivation for us.

Tencent Cloud's cloud-native AI service provides users with rapid delivery and management of AI environments, elastic Jupyter services, and distributed model services. Product features such as model management are also being built out.

In addition, to address bandwidth bottlenecks, we have worked with the Tencent COS team on the storage side to optimize with the GooseFS cache engine, and with Tencent Cloud Youtu Lab on the computing side, drawing on its years of training and inference experience, to prepare the launch of a highly optimized deep learning framework. We will make full use of cloud-native AI as a unified entry point, cooperate with multiple Tencent Cloud teams to build the platform, provide out-of-the-box product capabilities, and give back to customers and the community.

More best practices for cloud-native AI will be introduced in our follow-up "Cloud Native AI Standard Guide" and "Cloud Native AI Frontier Observation" series.

Implementation practice

Having introduced cloud-native AI on public clouds, let's look at a typical technology stack for AI workloads running there, starting with training. At the bottom is the compute layer, generally virtual machines or bare-metal servers provided by a cloud vendor. Most businesses now use Kubernetes, so the compute side usually organizes these servers into a Kubernetes cluster for resource management and scheduling. On top of that, training samples and models are stored in object storage, file storage, or block storage. Object storage is mostly used where read/write pressure is not too high; compared with the alternatives it supports tiered compression and archiving and is cost-effective. Where read/write pressure is higher, file storage and block storage are more commonly used.

To maximize data throughput, a compute-side cache is sometimes used for acceleration. Options include Alluxio and GooseFS, Tencent Cloud's cache acceleration product for object storage. By caching remote data in the compute-side cluster, the overhead of pulling data from the remote end is avoided, which can significantly improve training speed in some scenarios.

Built on top of the servers and storage is the distributed training infrastructure, where Kubeflow is currently the most widely used. With Kubeflow, users can easily create distributed training jobs for TensorFlow, PyTorch, Horovod, and other frameworks. Kubeflow also works well with Kubernetes features and supports schedulers such as Volcano.
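As a concrete illustration, the sketch below submits a Kubeflow PyTorchJob through the official Kubernetes Python client. It assumes the Kubeflow training operator is installed in the cluster; the namespace, job name, image, and replica counts are placeholders for illustration only.

```python
# A minimal sketch of submitting a Kubeflow PyTorchJob via the Kubernetes
# Python client. Image, namespace, and replica counts are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

container = {
    "name": "pytorch",                         # PyTorchJob expects this container name
    "image": "my-registry/train:latest",       # placeholder training image
    "command": ["python", "train.py"],
    "resources": {"limits": {"nvidia.com/gpu": 1}},
}

pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "demo-ddp", "namespace": "default"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {"replicas": 1, "restartPolicy": "OnFailure",
                       "template": {"spec": {"containers": [container]}}},
            "Worker": {"replicas": 3, "restartPolicy": "OnFailure",
                       "template": {"spec": {"containers": [container]}}},
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="default",
    plural="pytorchjobs", body=pytorch_job)
```

Once the custom resource is created, the operator launches the Master and Worker pods and injects the environment needed for distributed initialization, so the training script itself stays framework-standard.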

Although Kubeflow already lets users train and evaluate models, using it directly still has problems. Different data dependencies may live in different data systems, so data-processing logic can become very complicated. To simplify the workflow for algorithm engineers and improve the user experience, a pipeline system is usually built on top to combine and connect the stages of the machine learning process, along with a convenient programmable environment that helps algorithm engineers implement their business logic faster. Typical choices here include Jupyter, Argo Workflows, Airflow, and Kubeflow Pipelines. From the user's point of view, algorithm engineers only need to care about the topmost experimental environment and pipeline system; the infrastructure layers underneath are provided by the infrastructure team and the public cloud. This layering reduces the mental burden on engineers in different roles and improves efficiency.
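As a rough example of what this pipeline layer looks like, the sketch below uses the Kubeflow Pipelines 1.x SDK (kfp) to chain a preprocessing step and a training step; the images, data paths, and pipeline name are placeholders.

```python
# A minimal Kubeflow Pipelines sketch (kfp 1.x SDK): preprocess, then train.
import kfp
from kfp import dsl

@dsl.pipeline(name="train-pipeline", description="preprocess then train")
def train_pipeline(dataset_path: str = "cos://bucket/train-data"):
    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="my-registry/preprocess:latest",     # placeholder image
        arguments=["--input", dataset_path, "--output", "/data/tfrecords"],
    )
    train = dsl.ContainerOp(
        name="train",
        image="my-registry/train:latest",          # placeholder image
        arguments=["--data", "/data/tfrecords"],
    )
    train.after(preprocess)  # run training only after preprocessing finishes

if __name__ == "__main__":
    # Compile into a workflow spec that the Kubeflow Pipelines backend executes.
    kfp.compiler.Compiler().compile(train_pipeline, "train_pipeline.yaml")
```

In the 1.x SDK the compiled output is an Argo Workflow, which is why Argo Workflows and Kubeflow Pipelines so often appear together in this layer.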

Next, taking distributed training as an example, we introduce the problems that may be encountered in technology selection and their solutions. Distributed training can be divided, by how parameters are updated, into the Parameter Server (hereinafter PS) / Worker mode and the AllReduce mode. In PS mode there are two roles: PS and Worker. Workers do the main computation and send the computed gradients to the corresponding PS, which updates the corresponding parameters and sends them back to the Workers. In AllReduce mode, every worker holds a full copy of the model; different workers receive different data, exchange gradients with each other, and keep their parameters synchronized.
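The toy sketch below illustrates the AllReduce idea with torch.distributed: after the backward pass, every worker sums the gradients from all peers and divides by the world size, so all replicas apply the same averaged update. It assumes the process group has already been initialized (for example by torchrun); in practice one would normally use DistributedDataParallel or Horovod rather than hand-rolling this loop.

```python
# Toy illustration of AllReduce gradient averaging across workers.
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all workers; call after loss.backward()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size  # every worker ends up with the same averaged gradient
```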

Both training modes have problems. First, when a model has many parameters, gradient or parameter communication requires very high network bandwidth, and the network becomes a bottleneck during training; this is especially obvious when training dense models. Second, a cluster that runs deep learning tasks usually runs many of them at once, and because different tasks all need to access storage, storage bandwidth can also become a bottleneck. In short, both the network and the storage can run out of bandwidth.

On public clouds, cloud servers usually do not provide RDMA NICs, and intra-VPC bandwidth is typically around 20-50 Gbps. In such an environment, reducing the bandwidth pressure of gradient synchronization generally requires optimizations such as gradient compression, which shrinks the gradients exchanged in each synchronization. The AllReduce implementation can also be replaced with one that is friendlier to low-bandwidth environments, such as 2DReduce. These optimizations are implemented in Tencent Cloud's Ti-Horovod, which performs better than native Horovod under low-bandwidth conditions.
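Ti-Horovod itself is not open source, but the gradient-compression idea is available in open source Horovod, which ships an fp16 compressor that roughly halves the volume of each AllReduce. The sketch below enables it for a toy PyTorch model; the model and learning rate are placeholders.

```python
# Minimal sketch: fp16 gradient compression with open source Horovod (PyTorch).
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(1024, 10).cuda()                  # toy model, placeholder
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Compress gradients to fp16 before AllReduce to roughly halve sync traffic.
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    compression=hvd.Compression.fp16,
)

# Make sure all workers start from the same initial parameters.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
```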

If you train on bare-metal servers instead, you can use RDMA NICs to accelerate gradient synchronization. In such a training environment there is a VPC NIC for interacting with cloud products such as object storage, plus a RoCE NIC and the GPUs. Some adaptation is therefore needed so that training samples are pulled through the VPC NIC while gradient synchronization goes through the RDMA NIC.
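A minimal sketch of that split, assuming NCCL is the communication backend: sample downloads go over the default VPC interface while NCCL's gradient traffic is pinned to the RoCE device. The interface name eth0, the HCA name mlx5_0, and the GID index are assumptions that depend on the actual bare-metal image, and these variables must be set before the process group is initialized.

```python
# Steer NCCL gradient traffic onto the RDMA (RoCE) NIC while ordinary TCP
# traffic, such as pulling samples from object storage, stays on the VPC NIC.
# Set these before torch.distributed / Horovod initialization.
import os

os.environ["NCCL_IB_DISABLE"] = "0"        # allow NCCL to use IB/RoCE transports
os.environ["NCCL_IB_HCA"] = "mlx5_0"       # assumed RDMA HCA name on this machine
os.environ["NCCL_IB_GID_INDEX"] = "3"      # RoCE v2 commonly uses GID index 3 (assumption)
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # bootstrap/control traffic over the VPC NIC
```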

In this setup, the storage-bandwidth problem mentioned earlier becomes more likely: with gradient synchronization accelerated by high-bandwidth RDMA, the storage is more likely to become the bottleneck. To solve this, you can use a compute-side cache product on the public cloud, such as Tencent Cloud's GooseFS or the open source Alluxio, to cache data inside the cluster, avoiding pulling data from object storage during training and thus avoiding storage bottlenecks.

In inference scenarios the architecture is simpler. The bottom layer is still a Kubernetes cluster of cloud servers. Models are generally stored in object storage, and model services are exposed through TensorFlow Serving, Triton Inference Server, or an in-house serving framework.
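For reference, the sketch below calls a TensorFlow Serving instance through its REST API; the in-cluster hostname and the model name "demo" are placeholders, and 8501 is TensorFlow Serving's default REST port.

```python
# Minimal client sketch against the TensorFlow Serving REST predict API.
import requests

SERVING_URL = "http://tf-serving.default.svc.cluster.local:8501/v1/models/demo:predict"

def predict(instances):
    """Send a batch of instances and return the model's predictions."""
    resp = requests.post(SERVING_URL, json={"instances": instances}, timeout=5)
    resp.raise_for_status()
    return resp.json()["predictions"]

if __name__ == "__main__":
    print(predict([[1.0, 2.0, 3.0, 4.0]]))
```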

Because the end-to-end process of some businesses is complicated, with heavy pre-processing and post-processing, implementing it with TensorFlow Serving or Triton Inference Server can make the logic particularly convoluted. At the same time, model services are often coupled with internal infrastructure and need to connect to internal gateways and other services, so the demand for in-house serving frameworks is strong. Although TensorFlow Serving and Triton Inference Server receive wide attention in the open source community, a considerable share of businesses still use in-house serving frameworks.

Future outlook

Running AI workloads on public clouds still faces various problems. Beyond the bandwidth bottlenecks on the communication and storage sides, deep learning relies on many low-level NVIDIA libraries and a wide range of Python dependencies, and in interactive environments the GPU memory and compute utilization of Jupyter is far too low.

Infrastructure will evolve toward solving these problems. We believe future AI infrastructure must be fully elastic. In the training scenario, the traditional approach requires fixing the configuration of every role participating in training: a distributed training job with 5 Workers must be guaranteed exactly 5 Workers throughout the training process. Resource configuration can therefore only be specified statically, and the number of workers cannot be adjusted dynamically as cluster resources change.

We can already see more and more deep learning frameworks supporting elastic training. Take Horovod as an example: it introduces a Driver to manage the lifecycle of the Workers. When any Worker fails, the Driver catches the exception and re-establishes the communication ring according to the configuration so that training continues without interruption. This allows training jobs to scale out when cluster load is low and idle GPUs are available, and scale in when cluster load is high. Such an architecture can be combined with public cloud capabilities such as elastic instances to improve fault tolerance while reducing training costs.
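A condensed sketch of Horovod's elastic API, based on its public documentation, follows: the training loop is wrapped in hvd.elastic.run, and committed state lets the job roll back and continue when workers join or leave. The linear model, data, and epoch count are placeholders.

```python
# Sketch of elastic training with Horovod's public elastic API (PyTorch).
import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(1024, 10)                    # toy model, placeholder
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

@hvd.elastic.run
def train(state):
    # When the worker set changes, Horovod restores the last committed state
    # and resumes from state.epoch instead of restarting the whole job.
    for state.epoch in range(state.epoch, 100):
        data = torch.randn(32, 1024)                 # placeholder batch
        loss = model(data).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        state.commit()  # checkpoint state so a rescale can roll back safely

state = hvd.elastic.TorchState(model, optimizer, epoch=0)
train(state)
```

Launched with horovodrun's elastic flags (for example --min-np and --max-np plus a host discovery script), such a job can shrink onto fewer workers or expand onto newly available elastic instances without restarting.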

Jupyter can be made elastic in a similar way. In the original implementation, each kernel runs alongside the notebook and therefore occupies a full GPU card for a long time, which keeps GPU utilization low. If Jupyter could request GPUs on demand, it would further improve cluster resource utilization, reducing costs and increasing efficiency.

Summary

Finally, let's summarize the main points of this talk. At present, intranet bandwidth on public clouds is still a major constraint on moving AI businesses to the cloud. We have different mitigation methods for different scenarios, and RDMA options such as bare metal are also available. We believe that as public cloud network bandwidth gradually increases, this will no longer be a problem.

Second, the industry still lacks a de facto standard for AI infrastructure. There are many open source AI infrastructure projects today, of which Kubeflow is the most widely deployed; its deep integration with Kubernetes lets it work well with a company's existing infrastructure, which is a real advantage. Overall, however, the field still lacks a de facto standard, and the differences between systems are very large. This is one of the biggest problems in the field today: each company's AI infrastructure has its own characteristics, and it is hard for the community to join forces and push the industry forward the way Kubernetes did for cluster scheduling.

Finally, a fully elastic architecture is our next direction of evolution. Today, AI businesses cannot yet make good use of elasticity, which is the biggest dividend cloud computing offers. Only with a truly elastic architecture can applications be born and grow on the cloud, serving the ultimate goal of reducing costs and increasing efficiency for enterprises.

[Tencent Cloud Native] covers new cloud products, new cloud technologies, new cloud activities, and cloud news. Scan the QR code to follow the official account of the same name and get more practical content in time!
