[Editor's note: In the last issue, we introduced the background and driving forces behind algorithm development platforms, their two main categories (integrated machine learning platforms and AI basic software platforms), and their core value. In this issue, we analyze and compare the functions and technologies of the cloud vendors' integrated machine learning platforms mentioned in the previous issue.]
In recent years, cloud computing vendors have shifted toward cloud computing + AI. Baidu Cloud's "integration of cloud and intelligence", Alibaba Cloud's "big data + AI engineering", and Huawei Cloud's full-stack, all-scenario AI strategy are all strong manifestations of this trend. Building on cloud-native data and computing power, cloud vendors have extended their offerings into integrated machine learning platforms that cover the whole process of algorithm development, helping enterprises release the value of their data and accelerate intelligent transformation. Deep synergy with the vendor's own cloud is the foundation of a cloud vendor's machine learning platform, and it also shapes the platform's product and service architecture.
1. Cloud vendor integrated machine learning platform product and service architecture
Cloud vendors have acquired a large customer base through cloud services and have accumulated extensive machine learning practice while serving those customers. Building on these advantages, they usually offer products and services spanning three layers: the underlying cloud computing infrastructure, the machine learning platform, and application-layer industry solutions.
The cloud computing infrastructure layer mainly manages and schedules heterogeneous hardware resources in a unified manner through containers, helping customers to achieve flexible resource allocation in artificial intelligence services, and allowing the most suitable dedicated hardware to serve the most suitable business scenarios. At the same time, a big data computing engine is configured to provide infrastructure support for large-scale distributed computing.
The industry application layer usually provides targeted algorithm solutions for specific industries and scenarios, drawing on the vendor's own business or customer service experience. For example, Alibaba offers the algorithms behind its internal search, recommendation, and financial service systems through the PAI platform to empower enterprise customers in sectors such as retail and finance.
At the core machine learning platform layer, the integrated machine learning platform products are built on mainstream machine learning frameworks and are compatible with open source frameworks such as TensorFlow, PyTorch, and Caffe, which gives users more flexibility and reduces the cost of environment configuration. Functionally, they integrate products and services for data management and preparation, model development, computing and training, and inference deployment and operation and maintenance.
In addition, from the perspective of ecosystem building, cloud vendors rely on their own integrated machine learning platforms to build AI marketplaces that attract developers and algorithm buyers and promote the sharing and trading of algorithms and models. However, the AI marketplace is still at an early stage of development, and its commercial and ecosystem payback to the integrated algorithm development platform is still limited. The main challenges are that the industry value and application potential of the developed models have yet to be tapped, that buyers who can state clear requirements still need to be cultivated, and that the transaction and supply chain mechanisms still need to be improved (including matching the supply and demand of algorithms and models, and debugging services for optimizing models in production).
Figure 1 Cloud vendor integrated machine learning platform product and service architecture
2. Comparison of core functions and technologies of some platforms
For the core functions of the integrated machine learning platform, namely data management and preparation, model development, computing and training, inference deployment and operation and maintenance, we will take AWS SageMaker, Baidu BML, Alibaba Cloud PAI and Huawei ModelArts as examples for in-depth analysis.
2.1 Data management and preparation
The core value of the data management and preparation module of a machine learning platform is to let data scientists and algorithm engineers access data easily and understand it quickly. For data management and development preparation, the four integrated machine learning platforms mainly provide data access, data management, data processing, data annotation, data exploration and advanced exploration. Of these, data processing and data annotation carry the heaviest workload. In the absence of effective tools, these two tasks typically consume the most preparation time and effort for algorithm developers.
Data processing extracts or generates valuable data sets from large amounts of non-standard, messy data for subsequent data labeling and model training. Judging from the public information on the vendors' official websites, AWS SageMaker and Huawei Cloud ModelArts offer a richer range of data processing types, including data verification, data selection, data cleaning, and data augmentation. Alibaba Cloud presets data processing tools only for visual modeling; algorithm engineers and data scientists who use interactive modeling need to use the DataWorks product for data processing.
Model training requires a large amount of labelled data, which data labeling supplies. The four integrated machine learning platforms covered in this article all provide manual labeling, intelligent labeling and team labeling functions. However, intelligent labeling and team labeling cannot yet be used in all scenarios or at large scale. Taking Huawei Cloud ModelArts as an example (as shown in Figure 3), intelligent labeling supports only image classification and object detection; other task types are not supported.
Among the four integrated machine learning platforms, SageMaker is currently the only one that provides a feature store. SageMaker Feature Store is a fully managed machine learning feature repository that helps teams of data scientists and algorithm engineers efficiently and securely store, share, and retrieve feature data for training and prediction.
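The four processing types named above can be illustrated with a small pipeline. This is only a sketch in plain Python, not any platform's API: the record layout, the function names, and the specific rules (minimum word count, word-order reversal as a toy augmentation) are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]  # hypothetical layout: {"text": ..., "label": ...}


def validate(records: List[Record]) -> List[Record]:
    """Data verification: keep only records that have both required fields."""
    return [r for r in records if r.get("text") is not None and r.get("label") is not None]


def clean(records: List[Record]) -> List[Record]:
    """Data cleaning: normalize whitespace and case, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for r in records:
        text = " ".join(r["text"].split()).lower()
        key = (text, r["label"])
        if text and key not in seen:
            seen.add(key)
            cleaned.append({"text": text, "label": r["label"]})
    return cleaned


def select(records: List[Record], min_words: int = 2) -> List[Record]:
    """Data selection: keep samples long enough to be informative (assumed rule)."""
    return [r for r in records if len(r["text"].split()) >= min_words]


def augment(records: List[Record]) -> List[Record]:
    """Data augmentation (toy rule): add a copy of each sample with word order reversed."""
    extra = [{"text": " ".join(reversed(r["text"].split())), "label": r["label"]} for r in records]
    return records + extra


raw = [
    {"text": "Great   product", "label": "pos"},
    {"text": "great product", "label": "pos"},  # duplicate after cleaning
    {"text": "Bad", "label": "neg"},             # too short, dropped by select
    {"text": "arrived late", "label": None},     # missing label, dropped by validate
]
dataset = augment(select(clean(validate(raw))))
print(dataset)
```

Of the four raw records, only one survives verification, cleaning and selection, and augmentation then doubles it. Real platforms apply far richer rules, but the order of the stages is the point here.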
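To make the idea of a feature repository concrete, here is a minimal in-memory sketch. It is not the SageMaker Feature Store API; the class name, method names and integer timestamps are hypothetical. It only illustrates the two access patterns a feature store typically serves: the latest values for prediction, and point-in-time values for building training sets.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class InMemoryFeatureStore:
    """Toy feature repository (hypothetical): timestamped feature values per entity."""

    def __init__(self) -> None:
        # (feature_group, entity_id) -> list of (event_time, features)
        self._rows: Dict[Tuple[str, str], List[Tuple[int, Dict[str, float]]]] = defaultdict(list)

    def put(self, group: str, entity_id: str, event_time: int, features: Dict[str, float]) -> None:
        """Store one feature record and keep records ordered by event time."""
        rows = self._rows[(group, entity_id)]
        rows.append((event_time, dict(features)))
        rows.sort(key=lambda row: row[0])

    def get_latest(self, group: str, entity_id: str) -> Optional[Dict[str, float]]:
        """Online-style lookup: the newest features, as used at prediction time."""
        rows = self._rows.get((group, entity_id))
        return rows[-1][1] if rows else None

    def get_as_of(self, group: str, entity_id: str, as_of: int) -> Optional[Dict[str, float]]:
        """Offline-style lookup: the newest features at or before `as_of`,
        so that a training set does not leak future values."""
        rows = self._rows.get((group, entity_id), [])
        eligible = [features for event_time, features in rows if event_time <= as_of]
        return eligible[-1] if eligible else None


store = InMemoryFeatureStore()
store.put("customer", "c1", 100, {"orders_30d": 3.0})
store.put("customer", "c1", 200, {"orders_30d": 5.0})
print(store.get_latest("customer", "c1"))      # features written at time 200
print(store.get_as_of("customer", "c1", 150))  # features written at time 100
```

Sharing one repository like this is what lets the training and prediction paths read the same feature definitions instead of recomputing them separately.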
Figure 2 Comparison of data management and preparation functions/techniques of cloud vendors' integrated machine learning platforms
Figure 3 HUAWEI CLOUD ModelArts data annotation function
2.2 Model Development
In terms of model development, the functions of the four integrated machine learning platforms covered in this article are basically equivalent (see Figure 4). They serve not only professional data scientists and algorithm engineers but also business personnel and AI beginners. To meet the different needs of these two types of users, they provide interactive modeling and visual modeling environments respectively.
For interactive modeling, the four platforms integrate JupyterLab/Jupyter Notebook with a certain degree of plug-in optimization, and they put more effort into building visual modeling tools. The users targeted by visual modeling lack the ability to build models themselves and may know little about even the basic steps and concepts of model development. With visual modeling tools, such users can build models and make business predictions through simple clicking and dragging, without writing code or having any machine learning experience. Because visual modeling is tightly coupled with business applications, the core differentiator of visual modeling tools lies in industry expertise and in the richness and quality of built-in operators. At present, visual modeling on each platform is limited to a few focused scenarios. For example, the hundreds of mature machine learning algorithms built into Alibaba Cloud PAI mainly target high-frequency scenarios such as product recommendation, financial risk control, and advertising prediction, while AWS SageMaker's visual modeling currently focuses on churn prediction, price optimization, and inventory optimization.
In addition to the development environment, workflow scheduling and management are also important for improving model development efficiency. Judging from the information released on the official websites, SageMaker has a relatively complete workflow management tool, while Alibaba Cloud PAI's workflow is mainly based on the open source MLflow.
2.3 Computation and training
In computing and training, the core requirements are support for distributed training and flexible computing resource management, which improve the efficiency of large model training and save computing costs. The four platforms compared in this article all support these two requirements well.
For distributed training, AWS SageMaker and Alibaba Cloud PAI provide distributed training libraries, built on their own deep learning containers, that support data parallelism and model parallelism to improve training speed and throughput. Huawei Cloud ModelArts provides MoXing, Huawei's self-developed distributed training acceleration framework, which is built on top of open source deep learning frameworks such as TensorFlow, MXNet, and PyTorch to improve their training performance. According to test results disclosed by HUAWEI CLOUD, when training the ResNet-50 model on the ImageNet dataset with 128 V100 GPUs, MoXing acceleration shortened the training time from 18 minutes (the fast.ai result) to 10 minutes, saving users 44% in cost [1].
In terms of computing resource management, all four platforms build on their own cloud services and support automatic scaling out and in. In particular, SageMaker offers Managed Spot Training, which trains models on Amazon EC2 Spot Instances (spare compute capacity in AWS) rather than On-Demand Instances. Compared with training on on-demand resources, Spot training can greatly reduce computing costs. However, because Spot Instances can be interrupted, training may take longer. Spot Instances combined with checkpointing are therefore better suited to non-urgent training of complex large models.
In addition, model debugging and evaluation tools, such as hyperparameter optimization, are gradually being integrated into machine learning platforms, although these tools are still being improved.
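The data parallelism mentioned above can be sketched in a few lines. This is a conceptual illustration in plain Python, not the SageMaker, PAI or MoXing libraries: the one-parameter model, the equal-sized shards and the function names are assumptions, and the "workers" run sequentially here rather than on separate GPUs.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def gradient(w: float, shard: List[Sample]) -> float:
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w: float, batch: List[Sample], num_workers: int, lr: float) -> float:
    """One synchronous data-parallel SGD step.

    Assumes len(batch) is divisible by num_workers so all shards are equal-sized.
    """
    # Each worker holds a full copy of the model and an equal slice of the batch.
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    # Workers compute local gradients independently (in parallel on real hardware).
    local_grads = [gradient(w, shard) for shard in shards]
    # All-reduce: average the local gradients so every replica applies the same update.
    avg_grad = sum(local_grads) / num_workers
    return w - lr * avg_grad


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2x
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, batch, num_workers=2, lr=0.05)
print(round(w, 3))
```

With equal-sized shards, the averaged gradient equals the full-batch gradient, so splitting the batch changes where the work runs but not the result. Model parallelism, by contrast, splits the model itself across devices and is not shown here.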
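The reason checkpoints matter for interruptible capacity can be shown with a toy training loop. This is a simulation using only the Python standard library, not the SageMaker API: the `train` function, the JSON checkpoint format, the ten-step save interval and the `interrupt_at` parameter are all hypothetical.

```python
import json
import os
import tempfile


def train(total_steps: int, checkpoint_path: str, interrupt_at: int = -1) -> int:
    """Run a toy training loop that resumes from the last checkpoint.

    `interrupt_at` simulates the provider reclaiming the Spot capacity.
    Returns the step count reached when the function exits.
    """
    # Resume from the checkpoint if an earlier run was interrupted.
    step = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            step = json.load(f)["step"]

    while step < total_steps:
        if step == interrupt_at:
            return step  # instance reclaimed; progress since the last save is lost
        step += 1  # stand-in for one real training step
        if step % 10 == 0:
            # Save periodically so a restart repeats at most 9 steps.
            with open(checkpoint_path, "w") as f:
                json.dump({"step": step}, f)
    return step


with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "checkpoint.json")
    reached = train(100, ckpt, interrupt_at=37)  # interrupted mid-run
    resumed = train(100, ckpt)                   # restarts from step 30, not 0
    print(reached, resumed)
```

The interrupted run loses only the seven steps since its last save. Without the checkpoint, the second run would start again from step 0, which is why long jobs on interruptible capacity are usually paired with checkpointing.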
Figure 4 Comparison of model development, computing and training functions of cloud vendors' integrated machine learning platforms
2.4 Inference deployment and operation and maintenance
The ultimate goal of model development and training is to deploy the model into a production environment to empower the business. SDK release, API release and multi-version management are basic functions of every integrated machine learning platform.
In addition to the data processing and labeling mentioned above, another core difficulty in engineering machine learning models is optimizing inference performance. As production environments become more diverse and decentralized (they need to support a variety of algorithm frameworks and heterogeneous hardware and systems) and models become more complex, the need for inference performance optimization becomes more prominent. All four platforms have begun to provide inference optimization tools that encapsulate technologies such as compilation optimization and computational graph optimization, lowering the threshold for model optimization and improving user experience and production efficiency. Model conversion is another important means of improving production efficiency: converting the model format makes the model better suited to the target production environment. However, only some vendors explicitly mention support for model conversion. For example, Huawei Cloud ModelArts currently supports converting models whose original framework is Caffe or TensorFlow, with three types of target deployment chip: Ascend, ARM, or GPU. The model conversion function still has considerable room for improvement.
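As an illustration of what computational graph optimization means, the sketch below applies constant folding, one standard graph optimization, to a tiny expression graph. The article does not say which optimizations each platform applies, so this is only an example of the technique; the node classes and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Op:
    kind: str  # "add" or "mul"
    left: "Node"
    right: "Node"


Node = Union[Const, Var, Op]


def fold_constants(node: Node) -> Node:
    """Replace subgraphs whose inputs are all constants with a single constant."""
    if not isinstance(node, Op):
        return node
    left, right = fold_constants(node.left), fold_constants(node.right)
    if isinstance(left, Const) and isinstance(right, Const):
        value = left.value + right.value if node.kind == "add" else left.value * right.value
        return Const(value)
    return Op(node.kind, left, right)


def count_ops(node: Node) -> int:
    """Number of operations that would run at inference time."""
    return 1 + count_ops(node.left) + count_ops(node.right) if isinstance(node, Op) else 0


# x * (2 * 3) + (1 + 1): two of the four operations depend only on constants.
graph = Op(
    "add",
    Op("mul", Var("x"), Op("mul", Const(2.0), Const(3.0))),
    Op("add", Const(1.0), Const(1.0)),
)
optimized = fold_constants(graph)
print(count_ops(graph), count_ops(optimized))
```

The folded graph computes the same result as `x * 6 + 2` with half the operations. Production inference optimizers combine many such rewrites (operator fusion, layout changes, hardware-specific compilation), which is the work these platform tools aim to hide from the user.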
Figure 5 Comparison of cloud vendors' integrated machine learning platform deployment and operation and maintenance functions/technologies
3. Summary
The cloud vendors' integrated machine learning platforms now largely cover the tools required for the whole process of AI development and production. As AI applications are deployed at scale, operation and maintenance management of artificial intelligence systems (MLOps) will be the future development direction of such platforms. Standardized model development, deployment and operation and maintenance processes, together with continuous integration and continuous deployment, will further accelerate enterprises' model development and deployment while effectively guaranteeing model quality.
【References】
[1] HUAWEI CLOUD products and solutions, "HUAWEI CLOUD ModelArts achieves the ultimate performance! 128 GPUs, ImageNet training time 10 minutes"
Official website: https://baihai.co/
WeChat official account: Baihai IDP