Introduction: Deep learning platforms play an important role in Weibo's social business. Under a compute-storage-separated architecture, the Weibo deep learning platform suffered from poor performance in data access and scheduling. This article introduces a new Fluid-based (with JindoRuntime) architecture designed and implemented at Weibo, which significantly improves the performance and stability of model training in scenarios with massive numbers of small files; in multi-machine, multi-GPU distributed training scenarios, model training speed can be increased by 18 times.
Authors:
Wu Tong Weibo Deep Learning Platform Engineer
Hao Li Weibo Deep Learning Platform Engineer
Background
Sina Weibo is China's largest social media platform, with hundreds of millions of pieces of content generated and disseminated every day on a social network with trillions of connections. The figure below shows Weibo's business ecosystem: high-quality users produce and disseminate high-quality content, ordinary users consume this content and then follow the bloggers they like, building connections and forming a closed-loop ecosystem.
The main function of the Weibo machine learning platform is to make this whole process flow more efficiently and smoothly: by understanding high-quality content and building user profiles, it pushes content users are interested in to them and lets them interact with content producers, thereby encouraging producers to create more and better content and achieving a win-win for information consumers and information producers. As multimedia content becomes mainstream, deep learning technology becomes ever more important: from understanding multimedia content to optimizing CTR tasks, the support of deep learning is indispensable.
Challenges of large-scale deep learning model training
With the widespread use of deep learning in Weibo's business scenarios, the Weibo deep learning platform plays a central role. The platform adopts an architecture that separates storage from computing, so that computing resources are decoupled from storage resources, enabling flexible resource ratios, convenient storage expansion, and lower storage costs.
However, this architecture also brings some challenges, the most critical of which concern data access performance and stability:
The compute-storage-separated architecture causes high data access latency and slow training: the deep learning tasks (image or speech models) used by the business teams access large numbers of small files. Experiments show that in massive-small-file scenarios, reading from HDFS is nearly ten or even a hundred times slower than reading locally.
The Kubernetes scheduler is not aware of the data cache, so repeated runs on the same data source are still slow: deep learning tasks such as the same model with different hyperparameters, fine-tuning with the same input, and AutoML repeatedly access the same data and could reuse a data cache. However, because the native Kubernetes scheduler cannot perceive the cache, applications are scheduled poorly, the cache cannot be reused, and performance does not improve.
Most deep learning frameworks do not support the HDFS interface, which makes development difficult: frameworks such as PyTorch and MXNet only support the POSIX interface, while the HDFS interface requires additional integration work. This means supporting the POSIX interface in the model development stage and the HDFS interface in the model training stage at the same time, and adapting the model code to different storage systems, which adds complexity.
HDFS has become a bottleneck for concurrent data access, posing a major stability challenge: hundreds of GPU machines on the Weibo machine learning platform access the HDFS cluster concurrently during training, and the IO pressure of deep learning training is high, so the HDFS service has become a performance single point. This poses a huge challenge to HDFS performance and stability: once one task slows down HDFS, other training tasks are also affected, and if HDFS stops working, the entire training cluster is affected.
Through monitoring and analysis of the Weibo deep learning platform, we found that, on the one hand, expensive computing resources such as GPUs cannot be fully utilized because of IO performance issues; on the other hand, the cluster's memory and local disks sit at low utilization with large, stable headroom, because most deep learning tasks do not use local disks and memory usage is not high. We therefore concluded that making full use of the cluster's own memory and disk resources to accelerate data access would be a better solution.
Fluid + JindoRuntime: Providing efficient support for the Weibo deep learning platform
To better meet the computational requirements of large-scale deep learning model training, we need to achieve better data locality. Therefore, we hope to achieve the following goals:
Computation makes full use of locally accessed data, so data no longer needs to be repeatedly read over the network, which accelerates deep learning model training and improves the cluster's GPU utilization.
Reduce the load on HDFS: by reading part of the data locally, lower data access latency and improve HDFS availability.
Make full use of cache nodes for hot datasets, intelligently scheduling tasks onto data cache nodes without users being aware of it, so that frequently used model training programs run faster and faster.
Read data through the POSIX interface, eliminating the need for different data access interfaces in the model development and training stages and reducing the cost of developing deep learning model programs.
To achieve these goals, we urgently needed software with distributed cache acceleration capabilities on Kubernetes. Fortunately, we found that the CNCF Sandbox project Fluid meets exactly these demands. We therefore designed a new Fluid-based architecture, and after verification and comparison chose JindoRuntime as the acceleration runtime.
- Introduction to Architecture Components
1) Fluid
Fluid[1] is a scalable distributed data orchestration and acceleration system running on Kubernetes. Through data orchestration and data-aware application scheduling, it solves the problems that cloud-native orchestration frameworks face when running data-intensive applications: high data access latency, difficulty in jointly analyzing multiple data sources, and complex processes for applications to use data.
2) JindoRuntime
JindoRuntime[2] is an implementation of Fluid's distributed cache runtime, based on the JindoFS distributed cache acceleration engine. JindoFS is a big data storage optimization engine developed in-house by the Alibaba Cloud EMR team; it is fully compatible with the Hadoop file system interface and gives customers more flexible and efficient computing and storage solutions. JindoRuntime uses JindoFS's cache mode for remote file access and caching, and supports access and cache acceleration for multiple storage products such as OSS, HDFS, and standard S3-compatible storage. Using and deploying JindoRuntime on Fluid is simple: it is compatible with the native Kubernetes environment and works out of the box. It is deeply integrated with object storage features, uses a native framework to optimize performance, and supports cloud data security functions such as credential-free access and checksum verification.
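To make this concrete, the following is a minimal sketch of how a Fluid Dataset backed by a JindoRuntime cache might be declared for an HDFS path. The HDFS address, resource names, cache medium, and sizes are illustrative placeholders rather than Weibo's actual configuration; the exact fields should be checked against the Fluid and JindoRuntime versions in use.

```yaml
# Sketch: declare an HDFS-backed dataset and a JindoRuntime cache for it.
# All names, addresses, and sizes are illustrative placeholders.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: train-data                 # hypothetical dataset name
spec:
  mounts:
    - mountPoint: hdfs://namenode:8020/path/to/rawframes   # placeholder HDFS path
      name: rawframes
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: train-data                 # must match the Dataset name
spec:
  replicas: 4                      # number of cache worker nodes
  tieredstore:
    levels:
      - mediumtype: SSD            # cache on local disk; MEM is also possible
        path: /mnt/disk1           # local cache directory on each cache worker
        quota: 300Gi               # cache capacity per node
```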
- Reasons for choosing Fluid with JindoRuntime
Fluid can orchestrate datasets in a Kubernetes cluster to co-locate data and computation, and it exposes datasets through a Persistent Volume Claim interface so that applications on Kubernetes can connect to them seamlessly. At the same time, JindoRuntime provides access to and cache acceleration of data on HDFS, and its FUSE-based POSIX file system interface lets massive files on HDFS be used as easily as a local disk, so deep learning training tools such as PyTorch can read training data through the POSIX file interface (a mounting sketch follows this list).
For the performance problem of accessing massive remote small files, JindoRuntime makes many targeted optimizations to small-file data organization, management, and access, and provides efficient small-file access performance far higher than reading HDFS directly.
It provides distributed, tiered caching of metadata and data, and efficient small-file retrieval.
It provides a data warm-up mechanism to avoid the data access contention caused by pulling data at training time.
It organizes file data with slab allocation to use cache space efficiently.
With Fluid's data-aware scheduling, tasks can be placed on nodes that hold cached data without users needing to know the cache node information, maximizing the data access performance advantage.
It provides different caching strategies and storage layouts for large and small files, and adapts well to small-file AI training scenarios without requiring user configuration.
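As mentioned above, Fluid exposes the dataset as a Persistent Volume Claim, so a training job only needs to mount that PVC to read the cached HDFS data through POSIX. Below is a minimal sketch of such a training pod; the PVC name follows Fluid's convention of matching the Dataset name from the earlier sketch, and the image, command, and paths are hypothetical.

```yaml
# Sketch: a training pod reading cached HDFS data through the Fluid-created PVC.
# Fluid creates a PV/PVC named after the Dataset ("train-data" in the earlier sketch).
apiVersion: v1
kind: Pod
metadata:
  name: mmaction-train                                   # hypothetical pod name
spec:
  containers:
    - name: trainer
      image: registry.example.com/mmaction-train:latest  # placeholder training image
      command: ["python", "train.py", "--data-root", "/data/rawframes"]
      volumeMounts:
        - name: train-data
          mountPath: /data                # the dataset appears as a local POSIX directory
  volumes:
    - name: train-data
      persistentVolumeClaim:
        claimName: train-data             # PVC automatically created by Fluid for the Dataset
```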
- Production practice
Choose appropriate cache nodes: using JindoRuntime yields better data locality, but in actual production we also found that not every node makes a good cache node, because the disk and network IO performance of some nodes is poor. We therefore need to select nodes with large-capacity disks and good network as cache nodes. Fluid supports the schedulability of datasets, in other words the schedulability of cache nodes: we specify the dataset's nodeAffinity to schedule its cache nodes, ensuring the cache nodes can provide the cache service efficiently (see the sketch below).
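The sketch below shows the idea of pinning cache workers through the dataset's nodeAffinity. The node label is an assumption: in practice you would label the chosen large-disk, well-networked hosts yourself and match that label here.

```yaml
# Sketch: restrict cache workers to nodes labeled as suitable cache hosts.
# Hypothetical label; apply it to the chosen nodes first, e.g.
#   kubectl label node <node-name> weibo/cache-node=true
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: train-data
spec:
  mounts:
    - mountPoint: hdfs://namenode:8020/path/to/rawframes
      name: rawframes
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: weibo/cache-node       # hypothetical label for large-disk, fast-network hosts
              operator: In
              values: ["true"]
```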
Specify a master scheduling strategy: JindoRuntime is composed of three parts: master, worker, and fuse. The master is the brain of the cluster, responsible for metadata and cluster cache management, so the master node must be highly reliable and recover from failures quickly. In production we found that even without multiple masters, a single master offers strong stability and fast failure recovery. The main factor affecting the stability of the master node is the health of the host, for example a full disk or communication failures. Based on this, we use a nodeSelector for the master node to pick a host with better performance as the environment for the master container, further ensuring the stability of the master's environment (see the sketch below).
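The sketch below illustrates pinning the JindoRuntime master to a stable host with a nodeSelector. The field layout (spec.master.nodeSelector) and the label are assumptions that should be verified against the Fluid/JindoRuntime version in use.

```yaml
# Sketch: schedule the JindoRuntime master onto a dedicated, reliable host.
# Field names and labels are illustrative; verify against your Fluid version.
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: train-data
spec:
  replicas: 4
  master:
    nodeSelector:
      weibo/jindo-master: "true"   # hypothetical label applied to a stable, well-provisioned host
  tieredstore:
    levels:
      - mediumtype: SSD
        path: /mnt/disk1
        quota: 300Gi
```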
Timed data warm-up: an important step before training is to warm up the metadata and data. Fluid provides metadata and data caching in the form of CRDs; caching the metadata and data of the training files locally before training starts greatly accelerates training. However, the training files stored on HDFS are updated once a day, so the data preheating process needs to run periodically. Based on the DataLoad CRD, we use a CronJob for periodic scheduling so that the metadata and data are ready before each training run, enabling efficient training. JindoRuntime itself also supports incremental synchronization, so each warm-up only needs to update the files that have changed, which greatly speeds up data preheating (see the sketch below).
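Below is a minimal sketch of the periodic warm-up: a DataLoad object preheats the dataset's metadata and data, and a CronJob recreates it every day before training. The schedule, namespace, service account, and image are hypothetical, and the RBAC permissions and the ConfigMap holding the DataLoad manifest are omitted for brevity.

```yaml
# Sketch: preheat the "train-data" dataset, re-triggered daily by a CronJob.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: train-data-warmup
spec:
  dataset:
    name: train-data               # the Dataset to preheat
    namespace: default
  loadMetadata: true               # also warm up file metadata
  target:
    - path: /                      # preheat the whole mount
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: train-data-warmup-cron
spec:
  schedule: "0 1 * * *"            # hypothetical: 01:00 every day, before training starts
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: dataload-operator    # hypothetical SA allowed to manage DataLoads
          restartPolicy: OnFailure
          containers:
            - name: trigger-warmup
              image: bitnami/kubectl:latest        # any image with kubectl available
              command:
                - /bin/sh
                - -c
                - |
                  # A completed DataLoad does not re-run, so delete and re-create it.
                  kubectl delete dataload train-data-warmup --ignore-not-found
                  kubectl apply -f /manifests/dataload.yaml   # manifest mounted from a ConfigMap (not shown)
```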
- Performance test plan
To verify the overall effect of the above scheme, we validated it from the perspectives of both stability and performance; here we focus on the performance test scheme. The trained models are all mmaction-based video understanding models using the rawframes_train approach, with a training dataset of 4 million images. The data is obtained by extracting frames from 400,000 videos taken from real business scenarios, drawing 10 frames from each video. Because video resolution varies, each image ranges from a few KB to over ten MB in size, for a total of about 780 GB; each cache node provides 300 GB of cache space. Based on experience, the model generally converges at around 50 epochs.
When we increased the test data volume to 1 million videos, for a total data size of 2 TB, the HDFS interface approach could not work at all because of the large data volume and long latency, whereas Fluid + JindoRuntime still met the business needs.
The test process is to preheat the data through Fluid's JindoRuntime and then run the model training.
- Performance test results
With the Fluid + JindoRuntime solution and data preheating in place, training speed increased very significantly. As the figure below shows, in the 3-machine, 12-GPU scenario, experiments that read data through the HDFS interface often could not be completed because of network communication and other problems; after adding exception handling, the waiting time between workers grew longer, so adding more GPUs did not speed up training but actually slowed it down. The overall training speed with 1 machine and 8 GPUs was roughly the same as with 3 machines and 12 GPUs, so the extra computing resources were wasted. With the new scheme, compared with the HDFS interface, 1 machine with 4 GPUs achieves a 5x speedup, 2 machines with 8 GPUs a 9x speedup, and 3 machines with 12 GPUs an 18x speedup.
With training speed and stability assured, end-to-end model training time has also improved significantly: total training time has been shortened from the original 389 hours (about 16 days) to 16 hours.
Summary: Training time drops from two weeks to 16 hours
After integrating Fluid + JindoRuntime, the performance and stability of model training in small-file scenarios improved significantly: in multi-machine, multi-GPU distributed training, model training speed increased by 18 times, and training that used to take two weeks now completes in 16 hours. Shorter training time and lower HDFS pressure also improved the stability of training tasks, raising the training success rate from 37.1% to 98.3%. The amount of data in our production environment is currently 4 TB and continues to grow as we iterate.
Weibo's AI training scenarios demand high data read performance, and the large number of small files makes them very sensitive to access latency. JindoRuntime's caching capability effectively caches and accelerates data on big data storage systems, providing stable, reliable, high-throughput, low-latency data access, while also relieving pressure on the back-end storage system and safeguarding its stability. Combined with optimizations for small-file reading and caching in our specific scenario, it not only alleviates the IO pressure on the HDFS cluster but also greatly improves training efficiency.
Outlook
At present, Fluid + JindoRuntime is more of a special weapon for accelerating small-file scenarios than a conventional weapon for accelerating and optimizing all datasets. We expect to turn elastic data acceleration into a differentiating capability of the Weibo deep learning platform, improving overall training speed and computing resource utilization; we also hope to help the community keep evolving and help more developers. Specifically:
Support scheduled tasks and dynamic scale-out and scale-in
Improve data preheating performance and provide a metadata backup mechanism to enable quick dataset rebuilding
Provide a performance monitoring console
Support high availability of the Runtime metadata and image upgrades
Support full lifecycle management of multiple datasets in large-scale Kubernetes clusters
Thanks
Thanks to Chenshan, Yangli, and Che Yang of the Alibaba Cloud JindoFS team for their great help throughout the design and optimization of the solution: they provided data acceleration capabilities to existing applications with almost no application changes, and offered timely, professional support for the needs and problems that arose in the testing and production environments.
Related Links
For more information about Fluid and JindoFS, please refer to the following link:
[1] Fluid:https://github.com/fluid-cloudnative/fluid
[2] JindoFS:https://github.com/aliyun/alibabacloud-jindofs