Serverless exploration of NetEase Cloud Music audio and video algorithms

Author: Liao Xiangli
Planning: Wang Chen
Review & proofreading: Xiaohang
Editing & Typesetting: Wen Yan

Most of NetEase Cloud Music's in-house audio and video technology is applied to processing the data in its music library. Drawing on their experience in turning audio and video algorithms into services, the Cloud Music library team and the audio and video algorithm team jointly built the NetEase Cloud Music audio and video algorithm processing platform, which serves as a unified processing platform for all of Cloud Music. This article shares how we used Serverless technology to optimize that platform.

The article has three parts:

Status quo: how audio and video technology is applied in NetEase Cloud Music, and the problems we ran into before introducing Serverless technology;
Selection: what we considered when investigating Serverless solutions;
Landing and prospects: what we changed, the results after rollout, and our future plans.

Status quo

As a music company, NetEase Cloud Music uses audio and video technology in many of its business scenarios. To make this more concrete, here are five common ones:

By default, users hear audio at a standard bitrate that we produce in advance with an audio transcoding algorithm. When their data allowance is limited, or when they want higher sound quality, they may switch to a lower or higher quality.
Users can use the song recognition feature in the Cloud Music app to identify music playing around them. This relies on audio fingerprint extraction and recognition.
For some VIP songs on the platform, we run chorus detection so that the preview starts directly at the climax of the song, which gives users a better preview experience.
In Cloud Music's karaoke scenario, we need to display the pitch of the audio and assist with scoring. Here we use a pitch generation algorithm to complete the basic data for karaoke songs.
To better serve listeners of less widely spoken languages on the Cloud Music platform, we provide transliterated lyrics for Japanese, Cantonese, and so on, produced by an automatic romanization algorithm.

As these scenarios show, audio and video technology is used widely across Cloud Music and plays an important role.

Our audio and video technology can be roughly divided into three categories: analysis and understanding, processing, and creation and production. Some of it runs on the client side in the form of SDKs. The larger part is turned into services through algorithm engineering and deployed and managed on back-end clusters, providing general audio and video capabilities as services. That part is the focus of this article.

To deploy audio and video algorithms as services, we need to understand the characteristics of each algorithm, such as its deployment environment, its execution time, and whether it supports concurrent processing. As more algorithms went into production, we summarized the following patterns:

Long execution time: the execution time is often proportional to the duration of the source audio, and in many Cloud Music scenarios the duration of audio and video varies widely. For this reason, we usually adopt an asynchronous model in the execution unit.
Multiple languages: Cloud Music's algorithms are written in C++, Python, and other languages, which makes preparing the runtime environment very troublesome. To solve this, we use standardized conventions and image-based delivery to decouple the various kinds of environment preparation. Whether a solution supports image deployment therefore became a key criterion in our later technology selection.
Growing demand for elasticity: the number of songs on the Cloud Music platform has grown from 5 million when I joined the company to more than 60 million online today, and the gap between incremental and stock processing keeps widening. When we roll out an algorithm quickly, we have to consider not only the increment but also the stock. To make rapid scaling easier, the system design separates out the execution unit at the smallest possible granularity.

Based on our understanding of engineering and the characteristics of audio and video algorithm processing, the overall architecture of Cloud Music's audio and video processing platform is as follows:

For the parts common to all audio and video algorithm processing, we made a unified design that covers visualization of algorithm processing, monitoring, quick trials, and processing statistics. We also designed a unified, configurable management mode for resource allocation, so that the common parts of the system can be abstracted and reused as much as possible.

The most critical part of the platform, and the focus of this article, is the interaction and design of the execution unit. Through a unified integration standard and image-based delivery, Cloud Music solved many efficiency problems in integration and deployment. As for resources, because we kept adding new algorithms and stock/incremental services, before moving to the cloud we relied on requesting and recycling cloud hosts in our internal private cloud, and on containerization.

To describe how the Cloud Music execution unit works, we break it down in more detail, as shown in the following figure:

The message queue decouples the execution unit from other systems. Inside the execution unit, we adapt to algorithms with different concurrency characteristics by controlling the concurrency of message consumption, and we try to limit the execution unit's main work to the algorithm computation itself. That way, when the system eventually needs to scale out, it can scale at the smallest granularity.
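As a rough illustration of the idea above, here is a minimal Python sketch of an execution unit that drains a queue with a bounded number of concurrent workers. It is not the platform's actual code: the in-process `queue.Queue` stands in for a real message queue, and the names `run_execution_unit`, `algorithm`, and `max_concurrency` are hypothetical.

```python
import queue
import threading


def run_execution_unit(tasks, algorithm, max_concurrency):
    """Process (task_id, payload) pairs with at most max_concurrency in flight.

    The in-process queue is only a stand-in for a real message queue;
    `algorithm` is any callable that takes a payload and returns a result.
    """
    task_queue = queue.Queue()
    for task in tasks:
        task_queue.put(task)

    results = {}
    results_lock = threading.Lock()

    def worker():
        # Each worker does nothing but pull a message and run the algorithm,
        # so the unit can be scaled simply by changing max_concurrency.
        while True:
            try:
                task_id, payload = task_queue.get_nowait()
            except queue.Empty:
                return
            output = algorithm(payload)
            with results_lock:
                results[task_id] = output

    workers = [threading.Thread(target=worker) for _ in range(max_concurrency)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    # An algorithm that cannot run concurrently would use max_concurrency=1.
    print(run_execution_unit([(1, "song-a"), (2, "song-b")], str.upper, 2))
```

An algorithm that tolerates parallel execution gets a higher `max_concurrency`, while one that does not is simply run with a value of 1; the surrounding system does not need to change.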

With this model, we have put more than 60 audio and video algorithms into production, half of which were turned into services in the past year alone. These algorithms provide service capabilities to more than 100 Cloud Music business scenarios. However, more complex algorithms and more business scenarios place higher demands on our service delivery efficiency, on operations and deployment, and on elasticity. Before moving to the cloud, we were already using more than 1,000 cloud hosts and physical machines of various specifications internally.

Selection

As business scenarios and algorithms grew more complex, we adopted many measures to simplify integration with internal business scenarios and algorithms. Even so, more and more algorithms involved both stock and incremental processing, business scenarios differed widely in traffic, and different scenarios could reuse the same type of algorithm. As a result, we spent far more time dealing with machine resources than on development.

This prompted us to look for other ways to solve these problems. Three pain points were the most pressing.

First, the gap between stock and increment keeps widening, and as more new algorithms are rolled out, we spend more and more time coordinating resources between stock and incremental processing. Second, as algorithms become more complex, we need to pay attention to overall machine specifications and utilization whenever we request or purchase machines. Finally, we want stock processing to be faster and to have sufficient resources when it runs, so that when we process massive amounts of audio and video data we can shorten the period during which stock and increment are inconsistent. In short, we want a large enough pool of elastic resources that audio and video algorithm services no longer need to pay much attention to machine management.

However, an actual migration cannot focus only on the final service capability. It also requires a comprehensive assessment of the return on investment (ROI). Specifically:

Cost: there are two aspects, the implementation cost of the migration and the cost of computing resources. The former can be evaluated against a concrete plan to obtain the required person-days; the flexibility for future expansion after the migration is also something we need to consider. The latter can be estimated by combining the cost calculation model published by the cloud vendor with our own execution data. Our key criterion is that, provided the migration cost is acceptable, future IT costs must not increase significantly.
Support for the runtime environment: as mentioned earlier, Cloud Music's runtime environments are diverse and are deployed through image delivery, and the team already has fairly complete CI/CD support. We therefore need to examine how future upgrades and deployments would work, for example whether specification configuration can be handled in a way that reduces how much attention developers must pay to machines. We hope that after the migration we will not need to spend much time and energy on such matters and can focus on algorithm execution itself.
Elasticity: in addition to the size of the computing resource pool offered by the cloud vendor, we also look at how quickly elastic capacity starts up, whether instances can be reserved for fixed scenarios, and whether the elasticity offered is flexible enough to match business requirements and better support business growth.

These requirements match the definition of serverless: applications are built and run without managing servers, and elasticity is excellent. Based on the considerations above, we chose Function Compute on the public cloud. It maps intuitively onto our existing computation and execution process, and it also leaves room for our later attempts to orchestrate algorithms through a schema. Below I focus on how we introduced Function Compute (FC).

Landing

We got a quick trial of Function Compute (FC) running within a week, but a complete and highly reliable architecture requires more thought. The focus of our migration was therefore to send only the computing tasks out to Function Compute, while keeping the system's external inputs and outputs unchanged. The system also has flow control capabilities and can fall back to processing in the private cloud under special circumstances, which ensures high reliability. The architecture after the migration is shown in the following figure:

Adapting Cloud Music's development environment to Function Compute was the core of the migration. We concentrated on deployment, monitoring, and hybrid cloud support. For deployment, we made full use of Function Compute's support for CI/CD and for image deployment to pull images automatically. For monitoring, we use the cloud's monitoring and alerting features on the one hand, and on the other hand we convert its metrics into the parameters of our existing internal monitoring system, so that development and operations are handled consistently. Finally, the code was designed to be compatible with hybrid cloud deployment. With that, we completed the serverless migration of our audio and video processing platform.
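The flow control and private cloud fallback described above can be pictured with a small Python sketch. This is only an assumption about how such a dispatcher might look, not our production code: `HybridDispatcher`, `run_on_fc`, `run_on_private`, and `fc_quota` are hypothetical names, and the quota is a deliberately crude stand-in for real flow control.

```python
class HybridDispatcher:
    """Send tasks to the public cloud first, degrade to the private cloud.

    run_on_fc and run_on_private are caller-supplied callables (hypothetical
    here); fc_quota is a simple budget of calls allowed to go to the cloud.
    """

    def __init__(self, run_on_fc, run_on_private, fc_quota):
        self.run_on_fc = run_on_fc
        self.run_on_private = run_on_private
        self.fc_remaining = fc_quota

    def dispatch(self, task):
        """Return (backend_name, result) so callers can see where a task ran."""
        if self.fc_remaining > 0:
            self.fc_remaining -= 1
            try:
                return "fc", self.run_on_fc(task)
            except Exception:
                # Any failure on the cloud side degrades to the private cloud.
                pass
        # Quota exhausted or the cloud call failed: process locally.
        return "private", self.run_on_private(task)


if __name__ == "__main__":
    dispatcher = HybridDispatcher(
        run_on_fc=lambda task: f"fc:{task}",
        run_on_private=lambda task: f"private:{task}",
        fc_quota=1,
    )
    print(dispatcher.dispatch("song-a"))  # first call fits within the quota
    print(dispatcher.dispatch("song-b"))  # quota used up, runs locally
```

Because the caller only sees `dispatch`, the external inputs and outputs of the system stay the same regardless of which backend actually ran the task.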

From Function Compute's billing strategy, three major factors affect the final cost: the memory specification, the number of invocations, and outbound public network traffic. Looking only at the technical architecture, people tend to focus on the first two. In fact, traffic is also a considerable expense, and it was one of our main concerns.
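To make the three cost factors concrete, the following Python sketch adds them up for a batch of invocations. It is a simplified model under stated assumptions, not the vendor's official formula: it assumes compute is billed by memory size multiplied by execution time, and every price passed in is a placeholder that must be replaced with the figures from the vendor's current price list.

```python
def estimate_cost(invocations, avg_duration_s, memory_gb, egress_gb_per_call,
                  price_per_gb_second, price_per_million_calls,
                  price_per_egress_gb):
    """Rough cost breakdown for a batch of function invocations.

    All prices are caller-supplied placeholders; the billing model
    (memory x duration, per-call fee, per-GB egress) is an assumption.
    """
    compute = invocations * avg_duration_s * memory_gb * price_per_gb_second
    calls = invocations / 1_000_000 * price_per_million_calls
    egress = invocations * egress_gb_per_call * price_per_egress_gb
    return {
        "compute": compute,
        "calls": calls,
        "egress": egress,
        "total": compute + calls + egress,
    }


if __name__ == "__main__":
    # Placeholder numbers chosen only to show how egress can dominate
    # when every call returns a large result over the public network.
    breakdown = estimate_cost(
        invocations=1_000_000,
        avg_duration_s=2.0,
        memory_gb=1.0,
        egress_gb_per_call=0.001,
        price_per_gb_second=0.00001,
        price_per_million_calls=0.2,
        price_per_egress_gb=0.1,
    )
    for item, amount in breakdown.items():
        print(f"{item}: {amount:.2f}")
```

With these made-up inputs the egress line is by far the largest, which is the situation the traffic concern above is about.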

Given these cost characteristics, in the first stage, while the storage system still sits on NetEase's private cloud, our first step was to select audio and video algorithms that generate little public network traffic. Audio feature extraction is an example: if an audio file goes in and a 256-dimensional array comes out, the result is only a 256-dimensional array, which is far smaller than the audio itself, so the cost of outbound public network traffic is relatively small.
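A quick back-of-the-envelope calculation shows the size gap in the feature extraction example. The numbers below are assumptions for illustration only: the 256 values are taken to be 32-bit floats, and the input is taken to be a 4-minute track encoded at 128 kbps.

```python
def feature_bytes(dims, bytes_per_value=4):
    """Size of a feature vector, assuming 4-byte (float32) values."""
    return dims * bytes_per_value


def audio_bytes(duration_s, bitrate_kbps):
    """Approximate size of a constant-bitrate audio file, ignoring headers."""
    return duration_s * bitrate_kbps * 1000 // 8


if __name__ == "__main__":
    result_size = feature_bytes(256)      # 1024 bytes
    input_size = audio_bytes(240, 128)    # 3,840,000 bytes
    print(input_size / result_size)       # 3750.0
```

Under these assumptions the result leaving the public cloud is thousands of times smaller than the audio that went in, which is why this class of algorithm was a natural first candidate.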

In this first stage of introducing Function Compute, feature extraction algorithms saw a tenfold improvement. For sparse algorithms, that is, algorithms with very low daily usage, costs dropped significantly. In addition, Function Compute's image cache acceleration shortened our node startup time, so that all services can be brought up within seconds. Together these changes removed a large share of the operations cost of running algorithms and let us focus more on the algorithms and the business itself.

The figure at the upper right shows one of Cloud Music's algorithms in operation. The load varies over a very wide range, and Function Compute handles this demand for elasticity very well.

In the future, we hope to use serverless technology to further reduce the human effort we put into operations, and we will start from storage to address the public network traffic problem, so that audio and video algorithms for more scenarios can be migrated naturally. Second, as algorithms become more complex, their use of computing resources becomes more complicated, and we hope to use GPU instances to optimize the computation. Finally, real-time audio and video processing scenarios are becoming more and more common in Cloud Music's business, and they too have obvious peaks and troughs. We hope to accumulate more experience with serverless services and ultimately support the development of Cloud Music's real-time audio and video technology.

Author: Liao Xiangli, joined NetEase Cloud Music in 2015 and is the head of Cloud Music Music Library R&D.

