
Autonomous driving has been a hot field in recent years. Start-ups, new carmakers, and traditional manufacturers working on autonomous driving technology have all invested heavily in it, hoping to bring L4 and L5 autonomous driving into our daily lives as soon as possible.

At the core of autonomous driving technology is the training of driving models. The training data consists of real road driving videos collected by vehicles, at a scale ranging from several petabytes to dozens of petabytes. Before training, these raw videos must be processed: key frames are extracted and saved as images. A professional annotation team then labels key information in each image, such as traffic lights and road markings. In the end, billions of labeled images and their annotations are what actually gets "fed" to the training framework.

Anyone familiar with distributed systems and distributed storage knows that LOSF (Lots of Small Files) is a hard problem in the storage field. In computer vision (CV), training on LOSF is a hard requirement, covering subfields such as autonomous driving, face recognition, and object detection.

This article comes from the architectural practice of a JuiceFS customer in the autonomous driving industry, who has made a series of successful explorations in training on tens of billions of small files. We hope it provides useful reference and inspiration for applications in related industries.

The ultimate challenge: managing tens of billions of small files

Training data sets for autonomous driving typically contain billions to tens of billions of small files (under 1 MiB each), and a single training run usually touches tens to hundreds of millions of them. Moreover, every time a training worker assembles a mini-batch, it issues frequent requests to the storage system, most of them metadata requests. Metadata performance therefore directly determines the efficiency of model training.
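
To make the access pattern concrete, here is a minimal PyTorch-style sketch; the manifest file and directory layout are hypothetical. Every sample fetched for a mini-batch opens one small file, which is why mini-batch assembly is dominated by metadata requests.

```python
from torch.utils.data import Dataset

class FrameDataset(Dataset):
    """Random access over millions of small image files on a POSIX mount."""

    def __init__(self, manifest):
        # `manifest` is a hypothetical index file listing one image path per line.
        with open(manifest) as f:
            self.paths = [line.strip() for line in f if line.strip()]

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        # Each sample triggers open()/read() against the storage system, so a
        # mini-batch of N images issues on the order of N metadata requests.
        with open(self.paths[i], "rb") as f:
            return f.read()
```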

The storage system therefore needs not only the capacity to manage tens of billions of files, but also the ability to sustain low-latency, high-throughput metadata performance under highly concurrent requests.

In terms of storage system selection, object storage can hold tens of billions of files, but its lack of native directory support, incomplete POSIX semantics, and weak metadata performance make it unsuitable for training on massive numbers of small files.

Among common distributed file system designs, HDFS is not suited to storing small files: even with scale-up NameNodes and federation it can accommodate data only up to a certain scale, and storing tens of billions of small files remains very difficult. CephFS's MDS does have scale-out capability, but the concurrency of a single MDS process is limited, and as the MDS cluster grows, inter-process coordination overhead increases, so overall performance scales sub-linearly.

TensorFlow supports the TFRecord format, which packs many small files into large ones to relieve the metadata pressure on the storage system during training. In the autonomous driving field, however, this approach reduces the precision of random sampling over the data set, and the format is not compatible with other training frameworks (such as PyTorch), which causes considerable inconvenience.
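
For reference, this is roughly what the packing approach looks like, as a minimal sketch (file paths and shard layout are hypothetical). Once small files are merged into shards, random sampling effectively happens at shard granularity, and non-TensorFlow frameworks need extra tooling to read the records back.

```python
import tensorflow as tf

def pack_shard(image_paths, shard_path):
    """Pack many small files into one TFRecord shard."""
    with tf.io.TFRecordWriter(shard_path) as writer:
        for p in image_paths:
            with open(p, "rb") as f:
                writer.write(f.read())  # one record per original small file
```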

How does JuiceFS solve this?

JuiceFS is an open-source distributed file system designed for cloud-native environments. Its key innovations are:

  • Any object storage can serve as the data persistence layer. In any public or private cloud environment, as long as an object storage service is available, JuiceFS can be used;
  • 100% compatibility with the three major access protocols, POSIX, HDFS, and S3, so it can connect to all kinds of applications;
  • A pluggable metadata engine that supports a variety of databases, including Redis, TiKV, and MySQL, as storage engines; a commercial metadata engine offering high performance and massive capacity is also available (a minimal example follows this list).
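
As an illustration of this pluggable design, here is a minimal sketch of creating a volume with the open-source client; the bucket URL, Redis address, and file system name are hypothetical placeholders (credentials omitted).

```python
import subprocess

# Create a file system whose metadata lives in Redis and whose data is
# persisted to an S3-compatible object store.
subprocess.run([
    "juicefs", "format",
    "--storage", "s3",
    "--bucket", "https://mybucket.s3.amazonaws.com",
    "redis://meta-host:6379/1",  # pluggable: a tikv:// or mysql:// URL also works
    "myjfs",
], check=True)
```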

The commercial metadata engine of JuiceFS forms a distributed cluster using the Raft algorithm to guarantee data reliability, consistency, and high availability. All metadata is held in node memory to ensure low-latency responses. The engine scales horizontally through a dynamic directory tree scheme: each shard is an independent Raft group, the file system directory tree can be partitioned arbitrarily and assigned to whichever shards are needed, automatic and manual rebalancing are combined, and the sharding mechanism is transparent to client access.

Flexible cache configuration greatly improves training efficiency

Because training tasks access the storage system frequently, the cost of every trip over the network adds up to considerable overhead. The industry is currently exploring cache acceleration schemes for architectures that separate storage and compute. JuiceFS has built-in caching: data accessed by a client is automatically cached on storage media designated on that node, so the next access hits the cache directly instead of going over the network. Likewise, metadata is automatically cached in the client's memory.

The caching mechanism is transparent in use, with no changes required to existing applications: just add a few parameters when mounting the JuiceFS client to specify the cache path, capacity, and so on. The effect of caching on training acceleration is very noticeable; see our other article, "How to use JuiceFS to speed up AI model training by 7 times". Caching not only speeds up training but also significantly reduces object storage API calls, thereby reducing costs.
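
As a minimal sketch of such a mount (the metadata URL, cache path, and mount point are hypothetical; `--cache-size` is in MiB in the open-source client):

```python
import subprocess

# Mount with a local data cache: --cache-dir chooses the cache location
# (e.g. a local NVMe disk) and --cache-size caps its capacity in MiB.
subprocess.run([
    "juicefs", "mount",
    "--cache-dir", "/nvme/jfs-cache",
    "--cache-size", "1048576",  # roughly 1 TiB of local cache
    "redis://meta-host:6379/1", "/jfs",
], check=True)
```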

On a distributed training platform, the same training data may be shared by different tasks, and those tasks are not necessarily scheduled onto the same node. If they are spread across nodes, can cached data still be shared? With the "cache data sharing" feature of JuiceFS, multiple training nodes together form a cache cluster, and training tasks within it share cached data. That is, when the node running a training task misses the cache, it can fetch the data from other nodes in the same cluster instead of requesting the remote object storage.

Training nodes, however, are not necessarily static resources; on container platforms in particular, their life cycles change quickly. Does that undermine cache data sharing? This leads to the next question.

Challenges of the caching mechanism in elastic clusters

Every company in the autonomous driving field has many algorithm researchers and engineers, and their algorithms share the company's computing resources for training and validation. From the platform's perspective, elastic scaling of resources is a good way to improve overall utilization: resources are allocated to each training task on demand to avoid waste.

But in such an elastically scaled cluster, the local cache data mentioned above is affected to some extent. Although the cache cluster uses consistent hashing to keep the amount of data that must migrate on membership changes as small as possible, in a large-scale training cluster this migration still hurts overall training efficiency.
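
To see why membership changes cause only limited, but nonzero, migration, here is a toy consistent-hashing sketch (an illustration, not JuiceFS's actual implementation): removing one node only remaps the keys that pointed at its virtual nodes, roughly 1/N of the key space, but in a large cluster even that slice of re-fetched cache data is noticeable.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._hashes = [h for h, _ in self._ring]

    def locate(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        i = bisect.bisect(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[i][1]
```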

Is there a way to satisfy the demand for elastic scaling of training cluster resources without significantly affecting model training efficiency?

This is where the unique "independent cache cluster" feature of JuiceFS comes in.

An independent cache cluster deploys the nodes responsible for storing cached data separately, as a resident cache service. It is therefore unaffected by dynamic changes in the training cluster, giving training tasks a higher and more stable cache hit rate.

The overall system architecture is shown in the figure below:

For example, suppose there is a dynamic training cluster A and a cluster B dedicated to caching. Both must be mounted with the same --cache-group=CACHEGROUP parameter to form one cache group, and the nodes of cluster A must additionally be mounted with the --no-sharing parameter. When an application in cluster A reads data that is in neither the memory nor the cache disk of the current node, a node in cluster B is selected by consistent hashing to serve the read.
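
A minimal sketch of the two mount invocations just described, using the flags named above (the file system name, mount point, and cache path are hypothetical):

```python
import subprocess

# Cluster B: dedicated cache nodes join the cache group and serve their
# cached blocks to the rest of the group.
subprocess.run([
    "juicefs", "mount", "myjfs", "/jfs",
    "--cache-group=CACHEGROUP",
    "--cache-dir", "/nvme/cache",
], check=True)

# Cluster A: elastic training nodes join the same group with --no-sharing,
# so they consume the shared cache but do not serve cached data themselves.
subprocess.run([
    "juicefs", "mount", "myjfs", "/jfs",
    "--cache-group=CACHEGROUP",
    "--no-sharing",
], check=True)
```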

At this point the system has three cache tiers: the training node's system cache, the training node's disk cache, and the cache in cluster B. Users can configure the cache media and capacity of each tier according to the access characteristics of the specific application.

To keep training tasks unaffected when a disk fails, JuiceFS also provides disaster tolerance for cached data: if a cache node's disk is accidentally damaged, JuiceFS automatically rebuilds the required cache data once the disk is replaced.

How to reduce costs and increase efficiency in a hybrid cloud?

Autonomous driving training requires a lot of GPU resources, and at high utilization, purchasing GPUs for your own data center is much cheaper than using public clouds; this is the choice many autonomous driving companies make. However, building a storage system in your own data center is not so simple, and it runs into two challenges:

  • Data grows quickly, so hardware purchases struggle to keep pace with expansion, while over-buying in one round causes waste;
  • Operating a large-scale storage cluster means facing problems such as disk failures, which drives operation and maintenance costs up and efficiency down.

Compared with self-built storage systems, object storage services on public clouds scale elastically and without limit, have low unit cost, and offer higher data reliability and service availability than self-built storage in a data center, making them a good choice for storing massive amounts of data.

JuiceFS is well suited to this hybrid cloud architecture of self-built IDC plus public cloud. The user connects the IDC to the public cloud over a dedicated line, and JuiceFS persists data to the public cloud's object storage. A cache cluster deployed in the IDC caches data and accelerates training; compared with fetching data from object storage on every access, this saves both dedicated-line bandwidth and object storage API call costs.

Combining the hybrid cloud architecture with JuiceFS delivers the convenience of cloud storage while reducing GPU costs through a self-built IDC. It is simple and convenient for both users and operators of the training platform, and it meets enterprises' diverse infrastructure requirements.

Data synchronization and management across multiple data centers

In this case, the customer has two IDCs, thousands of kilometers apart, and training tasks are scheduled to both, so the data sets must be accessible in both. The customer used to copy data sets between the two IDCs by hand; with JuiceFS, the "data mirroring" feature eliminates that manual work, synchronizing data in real time to support collaboration across sites.

Specifically, data mirroring requires deploying JuiceFS metadata clusters in both IDCs. When mirroring is enabled, the original file system automatically replicates its metadata to the mirror region. After the mirrored file system is mounted, clients pull data from the original file system's object storage and write it into the mirror's object storage. Reads from the mirrored file system go to the local object storage first; if a read fails because synchronization is incomplete, the client falls back to the object storage of the original file system.
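
The read fallback can be summarized in a few lines of toy Python (a simplified model, not the client's real logic):

```python
def read_block(key: str, mirror_store: dict, origin_store: dict) -> bytes:
    """Prefer the mirror region's object store; fall back to the origin
    file system's object store if the block has not been synced yet."""
    try:
        return mirror_store[key]
    except KeyError:
        return origin_store[key]
```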

Once mirroring is enabled, all data is automatically replicated to both object stores, which also provides remote disaster recovery. If remote disaster recovery is not required, the mirror region can be configured without object storage and replicate metadata only; data is then pre-warmed into an independent cache cluster in the mirror region to accelerate training. This saves the cost of a second copy in object storage, and the customer in this case chose that option.

Comprehensive data security protection

Whether for assisted driving or true autonomous driving, large volumes of road data must be collected daily by data collection vehicles. The data passes through a series of processing pipelines before finally landing in the enterprise's storage system. Autonomous driving companies have extremely high requirements for the safety and reliability of this data, so data protection is a critical issue.

Let's first look at security after an enterprise moves to the cloud. Enterprises often have data security concerns about the cloud, especially for sensitive data. The "data encryption" feature provided by JuiceFS supports both encryption in transit and encryption at rest, securing data at every stage in the cloud.

Next comes data governance. To prevent data leaks and misoperations, enterprises may need to control permissions for different teams and users. The JuiceFS hosted service can restrict read/write permissions and accessible subdirectories for a given IP range via "access tokens". After mounting, JuiceFS supports a permission model based on users and user groups, so permissions can be set flexibly for teams or individuals.

Even when a user is authorized to access certain data, the data still needs further protection, for example against accidental deletion or updates. For accidental deletion, the "recycle bin" feature of the JuiceFS hosted service ensures that deleted data can still be restored within a certain period of time.

But if data is updated by mistake or corrupted for some reason, even a recycle bin cannot help. This is where the "real-time data protection" feature of JuiceFS is very useful. It works by retaining the Raft log for a certain period, so that when data is wrongly updated, the current metadata can be restored by replaying historical logs. Meanwhile, because JuiceFS stores files in the object storage as blocks, updating a file does not modify historical blocks but generates new ones. As long as the historical blocks on the object storage have not been deleted, the data can be fully restored. It is like a "time machine" that can go back in time at any moment!

Summary

Complete architecture design

The figure below shows the overall architecture of this case. JuiceFS metadata clusters and corresponding independent cache clusters are deployed in both data centers A and B. During model training, data sets are read through the cache cluster first, and only cache misses go to the object storage. In actual testing, the cache hit rate was so high that data center B almost never had to access the object storage across data centers.

The next figure describes the data writing process. The customer writes data through the S3 gateway provided by JuiceFS. When new data is written, its metadata is replicated to the other data center following the mirroring process described earlier, while dedicated jobs in both data centers warm the independent cache clusters so that new data is cached promptly.
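
As a rough sketch of that write-then-warm flow (the gateway endpoint, bucket name, object key, and mount path are hypothetical placeholders):

```python
import subprocess

import boto3

# Write new data through the JuiceFS S3 gateway.
s3 = boto3.client("s3", endpoint_url="http://jfs-gateway:9000")
s3.put_object(Bucket="myjfs", Key="datasets/batch42/frame_0001.jpg", Body=b"...")

# Pre-warm the cache for the new directory; `juicefs warmup` populates the
# cache for a path under the mount point.
subprocess.run(["juicefs", "warmup", "/jfs/datasets/batch42"], check=True)
```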

Customer benefits

This solution has been running in the customer's production environment. Some key metrics are listed below:

  • Billions of files are already stored, and the number is still growing;
  • JuiceFS metadata still responds within 1 ms under pressure of hundreds of thousands of QPS;
  • Model training throughput reaches dozens of GiB/s;
  • The independent cache clusters achieve a hit rate of 95%+;
  • Average data synchronization latency between the two IDCs is on the order of tens of milliseconds.

By upgrading to a JuiceFS-based storage system, the customer not only manages massive data sets with ease but also keeps model training efficient with the help of JuiceFS's independent cache cluster feature. Operation and maintenance costs are significantly reduced, and the hybrid architecture of self-built data centers plus public cloud has a lower TCO than a pure public cloud architecture: it makes use of the cost-effective computing resources of the data center while drawing on the elastic storage resources of the public cloud.

Thanks to JuiceFS's full POSIX compatibility, the customer did not need to modify any training task code during the migration.

With the data mirroring feature of JuiceFS, data is automatically synchronized from one data center to the other, solving the problem of multi-site collaboration and meeting the enterprise's need for remote disaster recovery.


Project address: GitHub ( https://github.com/juicedata/juicefs ). If you find it helpful, welcome to follow us! (0ᴗ0✿)

