Summary: The Haomo machine learning training scenario places high demands on data-read performance and on fine-grained control over metadata and data caching. With the caching capabilities of Fluid + JindoRuntime, the metadata and data of OSS training files can be cached flexibly, providing efficient metadata and data access. Based on this solution, we achieve fine-grained control over cached content and improve production resource utilization, which not only effectively relieves OSS bandwidth pressure but also greatly improves training efficiency.

Authors:

Li Fan: server development engineer at Haomo Zhixing, responsible for development and algorithm optimization of the AI automatic training platform

Tiewen: server development engineer at Haomo Zhixing, responsible for upper-layer R&D of the AI automatic training platform

Introduction: Fluid is a cloud-native data orchestration and acceleration project hosted by CNCF (Cloud Native Computing Foundation). It was jointly initiated and open sourced by Nanjing University, Alibaba Cloud, and the Alluxio community. This article introduces the production practice of the Haomo Zhixing machine learning platform in autonomous driving scenarios: how, based on Fluid + JindoFS, it breaks through the performance bottleneck caused by the original storage-compute separation architecture, thereby improving production resource utilization, effectively relieving OSS bandwidth pressure, and greatly improving training efficiency.

Commercial applications of autonomous driving are entering the fast lane

Haomo Zhixing is an artificial intelligence technology company dedicated to autonomous driving and intelligent logistics solutions. Its corporate mission, aiming at zero accidents, zero congestion, free travel, and efficient logistics, is to help customers reshape and comprehensively upgrade the way the whole society travels and moves goods.

Data intelligence is the core capability of Haomo Zhixing. Its three main business lines, passenger-car autonomous driving systems and solutions, low-speed unmanned vehicle ecosystems and solutions, and autonomous-driving-related product development and customization services, are served by and feed back into its data intelligence services, consolidating their absolute leading positions in their respective markets. Through nearly 10 years of accumulation, full-stack self-research, and more than 90% of R&D investment going into this area, the company has continuously accumulated data across passenger cars, low-speed unmanned vehicles, and smart hardware. It has so far incubated more than 10 mature products such as the Little Magic Box, Little Magic Camel, and Little Magic Plate.

The rapid development of Haomo Zhixing also shows that higher-level intelligent driving will play a role in a wider range of scenarios, and that commercial applications of autonomous driving are entering the fast lane.

1.jpeg

Traditional machine learning training runs into a performance bottleneck

With the widespread use of machine learning in autonomous driving business scenarios, the machine learning platform plays a central role. The platform adopts a storage-compute separation architecture, which decouples computing resources from storage resources, enabling flexible resource ratios and convenient storage expansion while reducing storage and operations costs.

2.png

However, this architecture also brings some challenges, the most critical of which concern data access performance and stability:

1. The storage-compute separation architecture leads to high data access latency and slow training:

The business teams' machine learning tasks frequently access data on OSS during training. When OSS bandwidth is limited or under heavy load, accessing data on OSS is much slower than accessing local files;

2. The Kubernetes scheduler is not aware of the data cache, so access to the same data source remains slow across repeated runs:

In real applications, deep learning tasks repeatedly access the same data, including tasks that train the same model with different hyperparameters, tasks that fine-tune a model on the same input, and AutoML tasks. This repeated access produces a data cache that could be reused. However, because the native Kubernetes scheduler is not cache-aware, applications are scheduled poorly, the cache cannot be reused, and performance is hard to improve;

3. OSS becomes the bottleneck for concurrent data access, posing a serious stability challenge:

A large number of machine learning tasks on the Haomo machine learning platform access the back-end OSS storage concurrently while training at the same time. The I/O pressure generated by this concurrent training is high, and the OSS service becomes a single point of performance failure: once OSS bandwidth becomes a bottleneck, all machine learning tasks are affected;

4. Training files are scattered, and metadata pressure is high:

The training data files of machine learning tasks are usually scattered across different paths, and list operations take a long time to enumerate them. Object storage has poor list performance, so OSS metadata comes under great pressure during large-scale list operations, and timeouts or list failures occur frequently.

In real applications, through monitoring and analysis of the Haomo machine learning platform, we found that I/O performance problems leave expensive computing resources such as GPUs underutilized. Machine learning training itself tends to access data files in a scattered way, which puts heavy pressure on metadata. If metadata and file data could be cached in a fine-grained way, we could improve cache efficiency and disk utilization on one hand, and relieve the metadata pressure caused by file listing operations on the other.

Production practice: accelerating model training and inference based on Fluid + JindoRuntime

To better meet the efficiency requirements of large-scale machine learning model training, the training process needs better data locality for data access. We therefore set the following goals:

  • Computing can make full use of localized data access: this avoids repeatedly reading data over the network, accelerating machine learning training and increasing the cluster's GPU utilization.
  • Reduce the load on OSS: by reading part of the data locally, reduce data access latency and the bandwidth pressure on the underlying OSS.
  • Give full play to the advantage of cache nodes for hot datasets: intelligently schedule tasks onto data cache nodes without users being aware of it, making commonly used model training programs faster.
  • Cache specified files via a custom file list: cache only the files needed for training, which greatly improves cache utilization and disk utilization.
  • Separate metadata caching from data caching: metadata can be cached on its own, and the caching strategy can be customized.
  • Read data through the POSIX interface: this removes the need for different data access interfaces in the model development and training stages, reducing the cost of developing machine learning model programs.

To achieve the above goals, we were eager to find a system platform with distributed cache acceleration capabilities on Kubernetes. We found that Fluid, a CNCF Sandbox project, meets exactly these demands. We therefore designed a new Fluid-based architecture and, after verification and comparison, chose JindoRuntime as the accelerated runtime.

3.png

3.1 Technical Solution

  • Fluid

Fluid is a scalable, distributed data orchestration and acceleration system running on Kubernetes. Through data orchestration and data-aware application scheduling, it addresses the pain points that cloud-native orchestration frameworks face when running data-intensive applications: high data access latency, difficulty in integrating and analyzing multiple data sources, and complicated data application processes.

  • JindoRuntime

JindoRuntime is an implementation of Fluid's distributed cache runtime, based on the JindoFS distributed cache acceleration engine. JindoFS is a big data storage optimization engine self-developed by Alibaba Cloud's open-source big data / data lake storage team. It is fully compatible with the Hadoop file system interface and gives customers more flexible and efficient computing and storage solutions. JindoRuntime uses JindoFS's cache mode for remote file access and caching, and supports access and cache acceleration for multiple storage products such as OSS, HDFS, and storage exposing the standard S3 protocol. Using and deploying JindoRuntime on Fluid is simple: it is compatible with the native Kubernetes environment and works out of the box. It deeply integrates object storage features, uses a native framework to optimize performance, and supports cloud data security features such as credential-free access and checksum verification.

We chose Fluid with JindoRuntime mainly for the following reasons:

  • Fluid can orchestrate datasets in a Kubernetes cluster to co-locate data and computation, and it provides a Persistent Volume Claim interface for seamless integration with applications on Kubernetes. At the same time, JindoRuntime provides access to and cache acceleration for data on OSS, and its FUSE-based POSIX file system interface lets massive files on OSS be used as easily as a local disk, so deep learning training tools such as PyTorch can read training data through the POSIX file interface (a minimal end-to-end sketch follows this list).
  • Provides distributed caching of both metadata and data, and the metadata cache can be warmed up separately.
  • Provides metadata cache warm-up, avoiding a large number of metadata operations on OSS for training files, and provides a data warm-up mechanism to avoid the data access contention caused by pulling data only at training time.
  • Provides customized data preheating in the form of a file list, so that warm-up is fine-grained.
  • With Fluid's data-aware scheduling capability, users can place tasks on nodes that hold cached data without knowing the cache node information, maximizing data access performance.
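
To make the moving parts concrete, below is a minimal end-to-end sketch based on Fluid's public API. All names, the OSS bucket, the endpoint, and the container image are placeholders chosen for illustration, not Haomo's actual configuration: a Dataset describes the OSS path, a JindoRuntime caches it, and a training Pod reads the data through the PersistentVolumeClaim that Fluid automatically creates with the same name as the Dataset.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset
spec:
  mounts:
    - mountPoint: oss://demo-bucket/training-data   # OSS path holding the training files
      name: training-data
      options:
        fs.oss.endpoint: oss-cn-beijing-internal.aliyuncs.com   # example endpoint
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: demo-dataset            # must match the Dataset name
spec:
  replicas: 2                   # number of cache workers
  tieredstore:
    levels:
      - mediumtype: SSD
        path: /mnt/disk1        # where cached blocks are stored on each cache node
        quota: 100Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-train
spec:
  containers:
    - name: train
      image: pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime
      command: ["python", "train.py", "--data", "/data/training-data"]
      volumeMounts:
        - name: training-data
          mountPath: /data      # training code reads files here via the POSIX interface
  volumes:
    - name: training-data
      persistentVolumeClaim:
        claimName: demo-dataset # PVC created automatically by Fluid for the Dataset
```

Once the Dataset is bound to its runtime, the training code reads /data/training-data like a local directory, and cache hits are served by the JindoFS workers instead of going back to OSS.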

3.2 Putting it into practice

  • Select the appropriate cache node:

Using JindoRuntime yields better data locality, but in actual production we found that not every node makes a good cache node, because the disk and network I/O performance of some nodes is poor. We therefore want to choose nodes with large disks and good network as cache nodes whenever possible. Fluid supports the schedulability of the dataset, in other words, the schedulability of its cache nodes: we specify the dataset's nodeAffinity to control where the dataset's cache workers are scheduled, ensuring that cache nodes can serve the cache efficiently, as sketched below.
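
As a sketch, assuming the suitable nodes are labeled in advance (the label key below is hypothetical, chosen only for illustration), the Dataset's nodeAffinity pins cache workers to those nodes:

```yaml
# Label the nodes that have large disks and good network I/O first, e.g.:
#   kubectl label node <node-name> demo/cache-node=true   # hypothetical label
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset
spec:
  mounts:
    - mountPoint: oss://demo-bucket/training-data
      name: training-data
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: demo/cache-node     # only nodes carrying this label run cache workers
              operator: In
              values: ["true"]
```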

  • Configure cache capacity and path:

The mount directory of the data can be set through the Dataset's mounts and the JindoRuntime's tieredstore. To avoid the cache growing too large when there is a lot of data, the JindoRuntime's tieredstore can be configured to constrain the maximum cache capacity and watermark (data exceeding the watermark is automatically evicted). The tieredstore also sets the cache storage path and the storage medium (SSD/MEM/HDD) to meet the needs of various scenarios. For scenarios with multiple datasets, setting the dataset's placement allows several datasets to be deployed on the same cluster.
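
A sketch of such a tieredstore configuration is shown below; the capacity, watermark values, and path are illustrative, not our production settings:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: demo-dataset
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: SSD        # storage medium for this tier: MEM / SSD / HDD
        path: /mnt/disk1       # cache storage path on each cache node
        quota: 200Gi           # maximum cache capacity of this tier
        high: "0.95"           # high watermark: usage ratio that triggers eviction
        low: "0.7"             # low watermark: eviction stops once usage falls below this
```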

  • Set the cache security policy:

When creating a Dataset in Fluid, we sometimes need to put sensitive information into the mounts, such as the accessKeyId and accessKeySecret of an OSS account. To keep this information secure, Fluid supports configuring it with a Secret: we create a Secret and reference its name in the Dataset's encryptOptions field to bind the sensitive information, as sketched below.
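
A sketch of this binding, with a placeholder Secret name and placeholder credentials, looks roughly like the following:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: oss-access-key
stringData:
  fs.oss.accessKeyId: <your-access-key-id>          # placeholder, never commit real keys
  fs.oss.accessKeySecret: <your-access-key-secret>  # placeholder
---
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset
spec:
  mounts:
    - mountPoint: oss://demo-bucket/training-data
      name: training-data
      options:
        fs.oss.endpoint: oss-cn-beijing-internal.aliyuncs.com
      encryptOptions:            # sensitive options come from the Secret, not written inline
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: oss-access-key
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: oss-access-key
              key: fs.oss.accessKeySecret
```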

  • Data preloading:

For a Dataset and JindoRuntime that have already been created, the first access to the mounted data goes through downloading every file in the data directory. This creates a problem: if the directory also contains other data that is not needed, it causes a meaningless waste of disk space and network resources. To avoid this, Fluid supports preloading of both data and metadata. By creating a DataLoad that specifies the data paths to preload, data can be injected on demand. A DataLoad supports caching metadata and skipping data that was not preloaded, which greatly improves the efficiency of data access. A sketch follows.
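
A sketch of such a DataLoad is shown below; the subdirectories under the mounted dataset are hypothetical examples of a file-list style warm-up:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: demo-dataset-warmup
spec:
  dataset:
    name: demo-dataset          # the Dataset to preheat
    namespace: default
  loadMetadata: true            # also warm up the metadata cache
  target:                       # only these paths are preloaded into the cache
    - path: /training-data/annotations
      replicas: 1               # number of cache replicas for this path
    - path: /training-data/images/batch-001
      replicas: 1
```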

3.3 Significant performance improvements

We ran inference and training on the same data with different models, both with and without JindoRuntime, and compared the time taken. The results show a significant performance improvement:

4.png

Test results for one model running inference in the cloud on 10,000 frames of images

5.png

Test results for another, slightly larger model running inference in the cloud on 10,000 frames of images

6.png

Time taken for one model to train on 10,000 frames in the cloud with 4 GPUs

After integrating Fluid + JindoRuntime, the efficiency of cloud training and inference improved significantly, especially for some smaller models: JindoRuntime effectively solves the I/O bottleneck for cloud training and inference, and training speed can increase by about 300%. It also greatly improves cloud GPU utilization and accelerates the iteration of data-driven work in the cloud.

Building the Fluid open source ecosystem to make more industries "smarter"

The Haomo machine learning training scenario places high demands on data-read performance and on fine-grained control of metadata and data caching. With the caching capabilities of Fluid + JindoRuntime, the metadata and data of OSS training files can be cached flexibly, providing efficient metadata and data access. Based on this solution, we can control cached content at a fine granularity and improve production resource utilization, which not only effectively relieves OSS bandwidth pressure but also greatly improves training efficiency.

Fluid + JindoRuntime currently meets the basic needs of our production environment; its acceleration effect on OSS is obvious, and the fine-grained caching strategies it provides make caching more efficient. We expect flexible data acceleration to become a differentiating strength of the Haomo machine learning platform, improving overall training task speed and compute resource utilization. In future work, we also hope to help the community continue to evolve and help more developers. Specifically, the features we plan to add to the platform later include:

  • Support scheduled tasks and dynamic scaling up and down
  • Provide performance monitoring console
  • Support full lifecycle management of multiple data sets in large-scale K8s clusters
  • Support for dynamic deletion of cached data and cached metadata

Acknowledgements

Thanks to Chenshan, Yangli, and Che Yang from the Alibaba Cloud JindoFS team for their tremendous help throughout the design and optimization of the solution, for providing customized support for our production needs, and for quickly helping us handle and resolve the various problems we encountered.

Related links

[1] Fluid: https://github.com/fluid-cloudnative/fluid

[2] JindoFS: https://github.com/aliyun/alibabacloud-jindodata

If you are interested in the Fluid project, please click on the link below to learn more:

https://github.com/fluid-cloudnative/fluid

"Kubernetes Difficulty Breaking Series: Container Persistent Storage Training Camp" countdown starts!

From September 22 to 24, we focused on a breakthrough in 3 days, starting with container persistent storage, and opening up the journey of breaking through the difficulties of Kubernetes. Complete all the check-in tasks, and there are rich training camp prizes waiting for you, such as Xiaomi headphones, Alibaba Cloud customized hoodies, and exquisite peripherals!

lALPDiCpwMvStuDNAyDNAyA_800_800.png

Copyright Notice: content of this article is contributed spontaneously by Alibaba Cloud real-name registered users, and the copyright belongs to the original author. The Alibaba Cloud Developer Community does not own its copyright and does not assume corresponding legal responsibilities. For specific rules, please refer to the "Alibaba Cloud Developer Community User Service Agreement" and the "Alibaba Cloud Developer Community Intellectual Property Protection Guidelines". If you find suspected plagiarism in this community, fill in the infringement complaint form to report it. Once verified, the community will immediately delete the suspected infringing content.

阿里云开发者
3.2k 声望6.3k 粉丝

阿里巴巴官方技术号,关于阿里巴巴经济体的技术创新、实战经验、技术人的成长心得均呈现于此。


引用和评论

0 条评论