Authors | Liu Dongyang, Wu Ziyang
At the end of 2018, to address the pain points of a unified high-performance training environment, large-scale distributed training, and efficient utilization and scheduling of computing resources, the vivo AI Research Institute began building an AI computing platform. After more than two years of continuous iteration, the platform has made great progress in construction and adoption and has become a core foundational platform for AI at vivo. From its original role of serving deep learning training, it has evolved into three modules, VTraining, VServing, and VContainer, providing model training, model inference, and containerization capabilities. The platform's container cluster has thousands of nodes and hundreds of PFLOPS of GPU computing power, with thousands of training tasks and hundreds of online services running in the cluster at the same time. This article, one of a series on hands-on practice with the vivo AI computing platform, shares the platform's experience building a hybrid cloud.
Background
Hybrid cloud is one of the newer directions in the cloud native field in recent years. It refers to solutions that combine private cloud and public cloud services. The major public cloud vendors all offer their own hybrid cloud products, such as AWS Outposts, Google's Anthos, and Alibaba Cloud's ACK hybrid cloud. Most vendors use Kubernetes and containers to shield the differences in the underlying infrastructure and provide unified services. The AI computing platform chose to build a hybrid cloud mainly for the following two reasons.
Elastic resources of public cloud
The platform's cluster uses bare metal servers in the company's self-built computer room. The procurement process for new resources is complicated and the cycle is long, so it cannot respond in time to large, temporary computing power needs from the business, such as training models with a large number of parameters or scaling out online services for holiday events. At the same time, given this year's strained server supply chain, hardware such as network cards, hard drives, and GPU cards is out of stock, and server procurement and delivery carry greater risk. Public cloud resources can be requested and released on demand, so using them through a hybrid cloud can meet the business's temporary computing power needs and effectively reduce costs.
Advanced features of public cloud
The public cloud offers advanced features, such as the AI high-performance storage CPFS, the high-performance RDMA network, and the deep learning acceleration engine AIACC. These features are not currently available in the company's private cloud, and deploying them privately would cost a great deal of time and money, whereas a hybrid cloud makes them usable quickly and at low cost.
Solution
Solution selection
Through preliminary research, the following three solutions were found to meet the hybrid cloud requirements:
Solution 1 has a low implementation cost, does not change the current resource application process, and can be implemented quickly, and the business can accept hour-level scale-out, so we chose Solution 1.
Overall architecture
The overall architecture of the hybrid cloud is shown in the figure below. The control plane of the K8s cluster is deployed in the company's self-built computer room, while the worker nodes include both physical machines in the computer room and Alibaba Cloud hosts. The computer room and Alibaba Cloud are connected by a dedicated line, so the physical machines and cloud hosts can reach each other. The solution is transparent to the platforms above it; for example, the VTraining training platform can use the cloud hosts' computing power without any modification.
Implementation practice
Registering the cluster
First, the self-built cluster needs to be registered with Alibaba Cloud. Note that the network segment of the VPC used must not conflict with the cluster's Service CIDR, otherwise registration will fail; and the CIDRs of the VPC vSwitch and the Pod vSwitch must not overlap with the network segments used in the computer room, otherwise routing conflicts will occur. After successful registration, the ACK Agent needs to be deployed. Its job is to actively establish a persistent connection from the computer room to Alibaba Cloud, receiving requests from the console and forwarding them to the apiserver. For environments without a dedicated line, this mechanism avoids exposing the apiserver to the public network. The link from the console to the apiserver is as follows:
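The CIDR constraints above can be checked before registration. Below is a minimal sketch using Python's `ipaddress` module; the CIDR values are hypothetical placeholders, not the platform's actual network plan:

```python
import ipaddress

def overlaps(cidr_a: str, cidr_b: str) -> bool:
    """Return True if two CIDR blocks share any addresses."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))

# Hypothetical values -- substitute the real network plan.
service_cidr = "172.21.0.0/20"   # cluster Service CIDR
vpc_cidr     = "192.168.0.0/16"  # Alibaba Cloud VPC
idc_cidr     = "10.0.0.0/8"      # self-built computer room segment

# The VPC must not conflict with the Service CIDR, and the vSwitch
# CIDRs must not overlap with the computer-room segments.
assert not overlaps(service_cidr, vpc_cidr)
assert not overlaps(vpc_cidr, idc_cidr)
print("no CIDR conflicts")
```

Running such a check in the provisioning pipeline catches conflicts early, since a conflicting VPC cannot be fixed after registration without re-planning.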
Alibaba Cloud ACK console <--> ACK Stub (deployed in Alibaba Cloud) <--> ACK Agent (deployed in the K8s cluster) <--> K8s apiserver
Requests from the console to the cluster are safe and controllable: when the Agent connects to the Stub, it carries the configured token and certificate; the link uses TLS 1.2 to encrypt data in transit; and the console's access permissions to K8s can be restricted through a ClusterRole.
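For illustration, a read-only ClusterRole limiting what the console can see might look like the sketch below. The role name and resource list are hypothetical; the actual permissions are whatever the cluster administrator chooses to bind:

```yaml
# Illustrative read-only role for console access.
# Name and resource list are placeholders, not ACK's actual role.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ack-console-readonly   # hypothetical name
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "services", "nodes", "deployments"]
    verbs: ["get", "list", "watch"]
```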
Container network configuration
The K8s container network requires that Pods can communicate with each other and with hosts. The platform adopts a Calico + Terway network solution. Worker nodes in the computer room use Calico BGP: the Route Reflector synchronizes Pod routes to the switches, so both physical machines and cloud hosts can reach Pod IPs. Worker nodes on Alibaba Cloud use Terway's shared network card mode: each Pod is assigned an IP from the segment configured on the Pod vSwitch, which is reachable from the computer room. The platform labels the cloud hosts and configures nodeAffinity on the calico-node component so that it is not scheduled onto cloud hosts, while configuring nodeAffinity on the Terway component so that it runs only on cloud hosts. In this way, physical machines and cloud hosts use different network components. While deploying and using Terway, we encountered and solved the following three problems:
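The nodeAffinity split can be sketched as follows; the label key and value are assumptions for illustration, since the article does not give the platform's actual label:

```yaml
# calico-node DaemonSet: stay off cloud hosts.
# (The Terway DaemonSet would use the inverse rule, operator: In.)
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type          # hypothetical label key
                    operator: NotIn
                    values: ["cloud-host"]  # hypothetical label value
```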
1. The Terway container fails to be created because the /opt/cni/bin directory does not exist.
This can be solved by changing the hostPath type for that path in the Terway DaemonSet from Directory to DirectoryOrCreate.
2. Business containers fail to be created, with an error that the loopback plugin cannot be found.
Unlike calico-node, Terway does not deploy the loopback plugin (which creates the loopback network interface) into the /opt/cni/bin/ directory. We added an initContainer to the Terway DaemonSet to deploy the loopback plugin, which solved the problem.
3. The IP assigned to a business container belongs to the network segment of the node vSwitch.
This happened because we added a new availability zone during use but did not add that zone's Pod vSwitch to the Terway configuration. Adding the zone's Pod vSwitch to the vswitches field of the Terway configuration solves the problem.
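Taken together, the three fixes touch the Terway DaemonSet and its configuration. A hedged sketch, where the image name and vSwitch IDs are placeholders:

```yaml
# Sketch of fixes 1 and 2 on the Terway DaemonSet pod template.
spec:
  template:
    spec:
      initContainers:
        # Fix 2: copy the loopback plugin into the CNI directory
        - name: install-loopback            # hypothetical name/image
          image: example/cni-plugins:latest
          command: ["cp", "/plugins/loopback", "/opt/cni/bin/loopback"]
          volumeMounts:
            - name: cni-bin
              mountPath: /opt/cni/bin
      volumes:
        # Fix 1: create /opt/cni/bin on the host if it is missing
        - name: cni-bin
          hostPath:
            path: /opt/cni/bin
            type: DirectoryOrCreate         # was: Directory
# Fix 3: the Terway configuration's vswitches field must list the
# Pod vSwitch of every availability zone in use, e.g.
#   "vswitches": {"cn-zone-a": ["vsw-aaa"], "cn-zone-b": ["vsw-bbb"]}
```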
Adding cloud hosts to the cluster
The process of adding a cloud host to the cluster is basically the same as for a physical machine: first apply for a cloud host through the company's cloud platform, then initialize it through VContainer's automation platform and add it to the cluster, and finally tag it with a label unique to cloud hosts. For an introduction to the automation platform, see "vivo AI computing platform cloud native automation practice".
Reducing pressure on the dedicated line
The dedicated line between the computer room and Alibaba Cloud is shared by all of the company's businesses; if the platform takes up too much of its bandwidth, the stability of other businesses will be affected. During rollout we found that deep learning training tasks pulling data from the storage cluster in the computer room did put pressure on the dedicated line. The platform therefore took the following measures:
1. Monitor the network usage of the cloud host, and the network team will assist in monitoring the impact on the dedicated line.
2. Use the tc tool to limit the downstream bandwidth of the cloud host eth0 network card.
3. Support businesses in using the cloud host's data disk to pre-load training data, avoiding repeated pulls of the same data from the computer room.
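The third measure amounts to a copy-if-absent cache on the cloud host's local data disk: only the first read crosses the dedicated line. A minimal sketch, in which the paths and function name are illustrative rather than the platform's actual implementation:

```python
import shutil
from pathlib import Path

def cached_path(remote_file: str, cache_dir: str = "/data/train-cache") -> Path:
    """Copy a file from the computer-room storage mount onto the cloud
    host's data disk once, then serve the local copy on later reads."""
    src = Path(remote_file)
    dst = Path(cache_dir) / src.name
    dst.parent.mkdir(parents=True, exist_ok=True)
    if not dst.exists():          # only the first read crosses the line
        shutil.copy2(src, dst)    # preserves timestamps/metadata
    return dst
```

A training job would call `cached_path()` on each dataset file before opening it, so repeated epochs read from the local disk instead of the dedicated line.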
Results
Several business teams temporarily needed a large amount of computing power for deep learning model training. Using the hybrid cloud capability, the platform added dozens of GPU cloud hosts to the cluster and made them available on the VTraining training platform, meeting the business's computing power needs in time, with a user experience exactly the same as before. Depending on the business, this batch of resources was used for periods ranging from one month to several months, and we estimate the cost was far lower than purchasing physical machines outright, effectively reducing expenditure.
Future outlook
The construction and rollout of the hybrid cloud has achieved initial results. Going forward, we will continue to improve its mechanisms and explore new features:
Support deploying AI online services to cloud hosts through hybrid cloud capabilities, to meet the temporary computing power needs of online businesses.
Establish a simple and effective process mechanism for resource application, release, and renewal to improve the efficiency of cross-team communication and collaboration.
Measure and assess the cost and utilization of cloud hosts to encourage business parties to make good use of resources.
Automate the entire process of cloud host application and joining the cluster, reducing manual operations and improving efficiency.
Explore advanced features on the cloud to improve the performance of large-scale distributed training.
Acknowledgments
Thanks to Huaxiang, Jianming, and Liusheng from the Alibaba Cloud container team, and to Yang Xin, Huang Haiting, and Wang Wei from the company's basic platform team, for their strong support in the design and implementation of the hybrid cloud solution.
About the authors:
Liu Dongyang, a senior engineer in the computing platform group of the vivo AI Research Institute, previously worked at companies including Kingdee and Ant Financial; he focuses on cloud native technologies such as K8s and containers.
Wu Ziyang, a senior engineer in the computing platform group of the vivo AI Research Institute, previously worked at companies including Oracle and Rancher; he is a contributor to projects such as kube-batch and tf-operator, and focuses on cloud native, machine learning systems, and related fields.
Related links in the article:
1) vivo AI computing platform practice series:
https://www.infoq.cn/theme/93
2) AWS Outposts:
https://aws.amazon.com/cn/outposts/
3) Anthos:
https://cloud.google.com/anthos
4) ACK hybrid cloud:
https://help.aliyun.com/document_detail/121609.html?spm=a2c4g.11186623.6.1038.6bfe3fd39IEFEt
5) AI high-performance storage CPFS:
https://www.alibabacloud.com/help/zh/doc-detail/111755.htm
6) Deep learning acceleration engine AIACC:
https://market.aliyun.com/products/57742013/cmjj032323.html
7) Vivo AI computing platform cloud native automation practice:
https://www.infoq.cn/article/9vB93vFIa9qHVMYOj6lD