Abstract: Why do we need a media network? How is the Huawei Cloud native media network architected? And how, in practice, can we guarantee the real-time audio and video service experience?
This article is shared from the Huawei Cloud Community article "Decrypting Huawei Cloud Native Media Network: How to Ensure Real-time Audio and Video Service Quality", original author: Audio and Video Manager.
Hello everyone, I am Huang Ting from Huawei Cloud, currently responsible for the design of Huawei Cloud's video architecture. Today I will share how the Huawei Cloud native media network guarantees the real-time audio and video service experience.
The talk has three parts: first, why we need a media network; second, the overall architecture of the Huawei Cloud native media network; and finally, our practice in improving the real-time audio and video experience.
01 Why do we need a media network
1.1 Content expression is becoming video-based, and industries of all kinds need video distribution
Why do we need a media network? I see three main reasons. The first is that video is clearly becoming the dominant form of content expression, and many industries have very strong demand for video distribution. A small example from my own experience: during Chinese New Year this year, a family member wanted to take off a ring worn for many years; the finger had thickened over time and the ring would not come off. Our first reaction was to go to a mall and ask a salesperson for help. Later, just to give it a try, I searched for "take off a ring" on Douyin and found a very simple method in the results. The video was short, the ring came off quickly and undamaged, and it didn't hurt at all. If you are interested, you can search for it yourself. This is knowledge being expressed as video, and the same trend has appeared in many fields beyond short video: e-commerce live streaming, online education, cloud gaming, and more.
1.2 New forms of media expression keep emerging, placing ever higher demands on audio and video technology
The second reason is the many new forms of media expression we see coming, such as VR and the recently popular free-viewpoint video. These new forms bring users a more immersive experience, but they demand across-the-board improvements in audio and video technology, chiefly in bandwidth, latency, and rendering complexity. Take VR as an example (see the figure on the left): to reach a true retinal experience in a VR headset, the required bit rate is very large; a simple calculation puts it around 2 Gbps. And compared with flat video, more factors affect the VR experience: refresh rate, field of view, resolution, low motion-to-photon (MTP) latency, pose tracking, eye tracking, and so on.
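As a rough sketch of where a figure like 2 Gbps can come from, here is a back-of-envelope calculation. All parameters below are my own illustrative assumptions (retinal acuity of ~60 pixels per degree, full spherical video, 60 fps, raw YUV 4:2:0, a 100:1 codec), not numbers from the talk:

```python
# Back-of-envelope estimate of the bitrate needed for "retinal" VR video.
# Every parameter here is an illustrative assumption.

PIXELS_PER_DEGREE = 60            # approximate limit of human visual acuity
H_FOV_DEG, V_FOV_DEG = 360, 180   # full spherical video
FPS = 60
BITS_PER_PIXEL = 12               # raw YUV 4:2:0
COMPRESSION_RATIO = 100           # optimistic modern codec

pixels_per_frame = (H_FOV_DEG * PIXELS_PER_DEGREE) * (V_FOV_DEG * PIXELS_PER_DEGREE)
raw_bps = pixels_per_frame * FPS * BITS_PER_PIXEL
compressed_gbps = raw_bps / COMPRESSION_RATIO / 1e9

print(f"raw: {raw_bps/1e9:.0f} Gbps, compressed: ~{compressed_gbps:.2f} Gbps")
```

Under these assumptions the raw stream is about 168 Gbps, and even after 100:1 compression it is still on the order of 2 Gbps, which is the scale the talk refers to.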
1.3 The Internet makes no promises to users about quality of service
We generally analyze a product from two dimensions, demand and supply. The first two points were demand-side analysis; now let's look at the supply side. The most important supply-side element of a real-time audio and video service is the Internet infrastructure, and we all know the Internet makes essentially no promises to users about quality of service. Why is that?

First, building the Internet is very expensive. Laying undersea optical cables, for example, costs enormous amounts of manpower and material; wireless spectrum for 3G, 4G, and 5G is another major cost. So the Internet had to be built for sharing, and sharing requires multiplexing and switching technologies. What does switching buy us? Look at the simple diagram below: suppose we connect 4 network nodes A, B, C, and D. Without switching, 6 wires are needed to interconnect them pairwise; with switching, only 4 wires are needed. So for cost reasons, switching is a necessity.

There are generally two switching technologies: circuit switching and packet switching. Circuit switching reserves capacity, which wastes resources: once a circuit is reserved, it occupies bandwidth even when no data is being transmitted. Packet switching shares link resources, so it achieves lower-cost switching. With cost in mind, the Internet's designers chose to evolve on packet switching, and packet switching, combined with best-effort forwarding, brought a series of problems: packet loss, duplicate packets, delay, and reordering.
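The wire-count argument above generalizes: a full mesh needs n*(n-1)/2 dedicated links, while a shared switch needs only n. A minimal sketch of the arithmetic:

```python
# Full mesh vs. switched interconnect: wires needed for n nodes.
# Mirrors the 4-node example in the text (6 direct wires vs. 4 via a switch).

def full_mesh_links(n):
    """Every pair of nodes gets a dedicated wire: n*(n-1)/2."""
    return n * (n - 1) // 2

def switched_links(n):
    """Each node needs only one wire to a shared switch."""
    return n

for n in (4, 10, 100):
    print(n, full_mesh_links(n), switched_links(n))
```

At 4 nodes the saving is modest (6 vs. 4), but at 100 nodes it is 4950 wires vs. 100, which is why sharing dominated the Internet's design.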
So we conclude that packet loss, duplication, delay, and reordering are inherent attributes of this generation of the Internet.
Here is a question worth thinking about: why didn't the Internet's designers try to solve these problems at the network layer? Or, put more broadly: if we redesigned the Internet today, what would we do? Would we try to make the network itself solve them? A second question: in your own application development, how do you handle packet loss, duplication, delay, and reordering?
1.4 Inspiration for us
The analysis above gives us three takeaways. First, we need to build a media network that bridges the gap between the supply side, the Internet infrastructure, and the demand side, the rapidly developing audio and video services. Second, this network must meet the strong demand for audio and video distribution across different industries. Third, it must be ready for the new technical challenges that will emerge in the future.
02 Introduction to the Huawei Cloud native media network architecture
Having explained why we need a media network, I will now introduce the Huawei Cloud native media network architecture.
2.1 Huawei Cloud Native Media Network
You can think of the Huawei Cloud native media network as the technical base of our cloud-native video services. On top of it we build a series of cloud-native video services spanning production, processing, distribution, and playback, such as CDN, live streaming, and RTC, and through these services we support customers across thousands of industries. The cloud-native media network has 7 major characteristics: flat, meshed, intelligent, low-latency, flexible, diverse, and device-cloud collaborative.
2.2 Wide coverage: supporting multiple access methods for global interconnection
Next, the three most important architecture design goals of the Huawei Cloud native media network. Because our users are all over the world, the network must first be globally deployed, and it must solve three problems: supporting multiple access methods, interconnecting the nodes, and providing redundant coverage for a high-availability design.
First, access. As a PaaS-type service we serve many customers from different industries. Take cloud meetings: many customers have very high security and quality requirements, so they want to access the network through a dedicated line from their enterprise campus. Other customers, such as Internet companies, want their users to reach the network and distribute services anytime and anywhere, so we must also support Internet access.

Second, interconnection. Because a large share of our traffic terminates at the edge, within China we mainly use single-line access through China Telecom, China Unicom, or China Mobile to save bandwidth cost, and use three-line equipment rooms or BGP resources to exchange traffic across operators; overseas we give priority to IXP nodes with rich network resources, and we achieve cross-border interconnection through Huawei Cloud infrastructure or high-quality Internet resources.

Third, high availability, whose common approach is to add redundancy. We plan both site redundancy and bandwidth redundancy: every user in a covered area has at least 3 sites able to serve them at the required quality, and we provision more than twice the bandwidth the business needs in order to absorb bursts.
2.3 All industries: meeting the different business requirements of entertainment, communication, industry video, and more
Because we are a PaaS service, we cannot break one customer's features while meeting another customer's needs, and we must satisfy different customers as quickly as possible. This places three requirements on the technology. First, since we must serve the differing business needs of different industries, agile application development is essential: we need to bring new functions online quickly at any edge node in the world, and, to reduce the risk of launching them, we need to support gray release of new features edge by edge. We call this development style "living on the edge".
The second requirement is also a very important design principle of ours: Edge Services are independent and autonomous. Edge Services are the set of microservices deployed around the media network's nodes. Each Edge Service must be independent, because this is a distributed media network and we do not want one node's failure (a network failure, say) to affect the business of the whole network. What does autonomous mean? When the network between an edge and the control center fails temporarily, the architecture must ensure the Edge Service can still provide its local services. The figure on the left lists four such microservices; local scheduling, for example, reduces the dependence on global scheduling, so the edge can keep serving during a temporary failure between the edge and the control center. Internally, each Edge Service is divided into microservices, the core purpose being to launch features quickly and flexibly. For example, the edge service contains protocol-adaptation microservices, so when we need to support new terminals by adapting new protocols, we can quickly launch a new protocol-adaptation microservice without affecting the terminals already online.
The third requirement is that the overlay network must support flexibly defined routing. Huawei Cloud Meeting, for example, must support a large number of high-profile government-level conferences with very high security and quality requirements, so all of that conference's packets entering our media network must travel on the Huawei Cloud backbone and avoid Internet transmission. Other customers are price-sensitive, and for them we forward packets over the most cost-effective network resources. This requires a programmable overlay network that can flexibly define routing and forwarding.
2.4 The whole process: providing end-to-end services for media production, processing, distribution, and playback
The third important design goal is that the architecture must provide end-to-end services from production through processing and distribution to playback. We divide customers into two broad categories. One is cloud-native: many Internet customers were born on the cloud and can easily consume our cloud services. The other needs to move from traditional offline systems to online. To serve the latter, our production and processing system is built on the unified Huawei Cloud Stack technology stack, which can be deployed flexibly and quickly both online and offline. We also provide a convenient SDK that helps customers cover more terminals, across platforms and with low power consumption.

The last technical requirement is that the whole real-time media processing pipeline can be flexibly orchestrated and dynamically managed. For example, in a joint innovation project with Douyu last year, we helped Douyu move its on-device special-effects algorithms into Edge Services, which brought three direct benefits. First, less development work: the special-effects algorithms no longer need to be adapted to different terminals and chips. Second, faster iteration: the customer only needs to update and deploy the algorithm in Edge Services for users to experience it. Third, broader device coverage: with traditional on-device effects, many low-end phones could not run them at all, but running them in our Edge Services quickly brings the experience to many low-end models.
2.5 Architecture layered design: adapting to the characteristics of the Internet
Finally, a very important design idea: architectural layering, borrowed from the design of computer network systems. Imagine application development without today's layered network stack: you might have to enumerate the whole network topology, find the optimal path to carry your packets from a to destination b, and handle every network abnormality yourself, including packet loss, retransmission, and reordering, which would obviously be very unfriendly to application developers.
Computer network design solves these problems, first of all through layering. At the bottom, the link layer hides the differences between link transmission technologies; when 5G arrived, for example, upper-layer applications needed no modification. Above it, the network layer has two major functions, forwarding and routing, so applications need not define forwarding paths themselves. Above that is the End-to-End layer, a collective name for the transport, presentation, and application layers. The purpose of layering is modularity: reduce coupling, and let each layer focus on solving its own problems.
The layering of our cloud-native media network architecture borrows this idea. We enhance the network layer to improve the latency and arrival rate of packet forwarding; in the End-to-End layer, our self-developed real-time transmission protocol makes upper-layer real-time audio and video development easier, so application developers can focus on business logic. We also abstract the media processing module, so that codec and pre/post-processing technologies can evolve independently and innovate quickly.
2.6 Architecture layered design: network layer
Before introducing our key designs in the network layer and the End-to-End layer, let's look at what is wrong with the network layer today. From the beginning, the Internet's most important quality attribute has been highly available interconnection: the Internet is composed of tens of thousands of ISPs, and if any one ISP fails, the network can still communicate. The BGP protocol is central to this design, but it considers only connectivity and has no awareness of service quality. In the figure on the left, user A sends a packet to user B across operators; it may well traverse many different ISPs, which causes many problems, such as aggravated packet loss. Many of the key issues are non-technical: an operator's routing policy for a given network is not necessarily quality-optimal but may be cost-optimal, as in the classic cold-potato and hot-potato routing strategies.
The second reason is operational: an operator may need to upgrade equipment overnight, requiring configuration changes by operations staff, and human error during those changes can cause link failures; likewise, a local hot-spot event can cause congestion.
To solve these problems, we decided to enhance the network layer, mainly through two technical means: underlay and overlay.
1) First, the underlay. We use Huawei Cloud's global network infrastructure to improve the quality of network access and interconnection. Once traffic enters our underlay network, it no longer competes with other Internet traffic for bandwidth, which improves quality and also strengthens security.
2) Next, the overlay. Besides building our own backbone, we deploy overlay nodes to optimize packet transmission paths and forward efficiently against different QoS goals, instead of letting packets be forwarded arbitrarily. At the network layer we also follow the classic principle of separating the control plane from the data plane: simply put, the control plane is responsible for routing and for controlling the operation of the whole network, and the data plane is responsible for forwarding.
To make data forwarding simpler, we adopted another classic network design idea, source routing, whose core purpose is to reduce the complexity of the forwarding devices. Concretely, when a packet enters the first forwarding node of our network, the system encapsulates in the packet header the complete list of forwarding nodes the packet will pass through, including the destination node; each subsequent node only needs to parse the header to know where to send the next hop, which greatly reduces forwarding-device complexity.
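The source-routing idea can be sketched in a few lines. This is a minimal illustration of the mechanism described above, not Huawei's actual packet format; all names are invented for the example:

```python
# Minimal sketch of source routing: the ingress node writes the full hop list
# into the packet header, and each forwarding node only pops the next hop.

def encapsulate(payload, path):
    """Ingress node: embed the whole forwarding path in the header."""
    return {"route": list(path), "payload": payload}

def forward(packet):
    """Each node pops its next hop from the header; no routing table lookup."""
    if not packet["route"]:
        return None  # the packet has reached its destination
    return packet["route"].pop(0)

pkt = encapsulate(b"media-frame", ["edge-A", "core-1", "edge-B"])
hops = []
while (nxt := forward(pkt)) is not None:
    hops.append(nxt)
print(hops)  # the packet visits edge-A, core-1, edge-B in order
```

Note how `forward` needs no routing state at all; the path decision was made once at the ingress, which is exactly the complexity reduction the text describes.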
One more important design principle: we make no reliability commitment at the network layer. Although reliability is not guaranteed, we still use redundant error correction, multi-path transmission, and other techniques to improve forwarding latency and arrival rate. This is why we still call it the network layer: it focuses on routing and forwarding, merely with some enhancements.
2.7 Architecture layered design: End-to-End layer
The enhanced network layer gives us lower forwarding delay and a higher arrival rate. Next, our End-to-End layer. First, a question worth pondering: as discussed earlier, the Internet has inherent properties, packet loss, reordering, duplication, that seem very unfriendly to developers, yet Internet development has flourished through generation after generation of applications: email, the web, IM, audio, video. Why?
My view is that a very important part of the answer is protocols. The End-to-End layer contains many important protocols that greatly lower the technical threshold for application developers; behind every generation of Internet applications, from TCP to HTTP to QUIC, stands a protocol. So the core design goal of the End-to-End layer is to define good protocols and development frameworks that make application development simple.
How do we do this? In the figure on the left, the middle part is a functional overview of our self-developed real-time transmission protocol. Northbound, it provides a unified interface through which we can develop not only real-time audio and video services but also reliable messaging services. Southbound, the protocol stack hides the differences between the underlying transports, whether UDP or ADNP, so application development becomes easier.
The purpose of the protocol stack design is to make application development simple. So we abstracted two modules, NQE and QoS, which provide callbacks that quickly feed network information back to the upper application; the encoding module, for example, can quickly adapt its encoding parameters to network conditions.
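The callback pattern described here can be sketched as follows. This is a hypothetical illustration of the idea (network estimates pushed up to an encoder that adapts its parameters); the class names, fields, thresholds, and headroom factor are all my assumptions, not the actual NQE/QoS interface:

```python
# Hypothetical sketch of the network-feedback callback pattern: the transport
# layer pushes estimates up, and the encoder adapts bitrate and resolution.

from dataclasses import dataclass

@dataclass
class NetworkEstimate:
    bandwidth_kbps: int
    loss_rate: float
    rtt_ms: int

class Encoder:
    def __init__(self):
        self.bitrate_kbps = 1500
        self.resolution = (1280, 720)

    def on_network_update(self, est):
        # Leave ~20% headroom below the estimated bandwidth (assumed policy).
        self.bitrate_kbps = int(est.bandwidth_kbps * 0.8)
        # Drop resolution under severe constraint (threshold is illustrative).
        self.resolution = (1280, 720) if self.bitrate_kbps >= 800 else (640, 360)

enc = Encoder()
enc.on_network_update(NetworkEstimate(bandwidth_kbps=600, loss_rate=0.05, rtt_ms=80))
print(enc.bitrate_kbps, enc.resolution)
```

The design point is the direction of the data flow: the transport layer does not make media decisions itself; it only reports, and the media module decides.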
Another very important design principle is efficiency. As mentioned earlier, there will be many IoT terminals in the future, and IoT terminals are highly sensitive to power consumption. We wanted this considered from the very start of the protocol stack design, so we avoid adding unnecessary copies at this layer. Here we follow the classic ALF (application level framing) design principle, the same principle RTP followed in its design.
Our protocol stack also borrows design ideas from QUIC, supporting multiplexing, network multi-path, Huawei LinkTurbo, priority management, and other functions. One bit of experience to share: for services such as free-viewpoint video and VR, which need very high bandwidth, enabling the multi-path function yields a large improvement in experience.
2.8 Target Architecture of Huawei Cloud Native Media Network
Finally, a brief summary of the target architecture of the whole media network.
1) Simply put, the layered design divides and conquers a complex problem, so that the layers are decoupled from each other and can each evolve quickly;
2) Each Edge Service is independent and autonomous, which improves the availability of the whole service;
3) Dividing Edge Services into microservices lets us adapt to customer needs more flexibly and launch at microservice granularity.
03 Real-time audio and video service quality assurance practice
In this third part I will share some of our practice in guaranteeing real-time audio and video service quality. The earlier parts were mainly thoughts on architecture; what follows are thoughts on algorithm design.
3.1 Video, audio, and network are the key system factors that affect the experience
As shown in the figure above, we analyzed the dimensions that affect experience, mapping objective metrics to subjective metrics and then to QoE. The analysis shows that the three systemic factors affecting real-time audio and video experience quality are video, audio, and network. Next, I will introduce the algorithm practice for each.
3.2 Video coding technology
First, video encoding. We classify video coding technologies by design goal. The first category aims to scientifically reduce coding redundancy and lessen the impact of coding distortion on human subjective perception. Since our real-time audio and video services are mainly centered on people, a classic optimization approach starts from the human: analyze the visual characteristics of the human eye and optimize the coding algorithm around them. The figure lists several classes of visual characteristics that correlate strongly with coding.
Another approach starts from the source, that is, from the content: we analyze the characteristics of different scene content to optimize the encoder. Computer-generated images, for example, are characterized by low noise and large flat areas.
The second design goal is to scientifically add redundancy to resist the impact of weak-network transmission on subjective perception. Several redundancy-adding coding modes are listed: extreme all-I-frame coding, intra-frame refresh, long-term reference frames, and SVC. In some spatial video services, to reduce the delay of spatial positioning, we combine all-I-frame coding with ordinary coding. In cloud gaming, to avoid bursts from large I-frames, we use intra-frame refresh. In real-time audio and video services, long-term reference frames and SVC are the most common coding methods.
3.3 PVC perceptual coding
Now some of our specific coding technologies. Our cloud video team worked with Huawei's 2012 Laboratories Central Media Technology Institute, starting from analysis of the human visual system, to improve the PVC perceptual coding algorithm. The algorithm has gone through several iterations; the latest perceptual coding 2.0 delivers a 1080p 30 fps high-definition experience at a 1 Mbps bit rate. The main ideas: first, use pre-analysis and encoder feedback to distinguish scenes and regions; in real-time call scenes the highly sensitive regions are mainly the face region and static regions. Different coding parameters and bit-rate allocation strategies are then applied per scene and region, for example allocating a lower bit rate to non-sensitive regions. On top of 1.0, the 2.0 algorithm adds AI to rate control: where we previously used fixed combinations of bit rate and resolution, AI-based perceptual rate control now finds the best bit-rate and resolution combination per scene, achieving a better subjective effect at low bandwidth.
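The region-based allocation idea can be sketched as a weighted split of each frame's bit budget. This is an illustrative toy, not the PVC algorithm itself; the sensitivity weights and area fractions are invented:

```python
# Illustrative sketch of region-weighted bit allocation: perceptually
# sensitive regions (e.g. faces) get a larger share of the frame budget.

def allocate_bits(frame_budget_kbit, regions):
    """Split one frame's bit budget across regions in proportion to weight."""
    total = sum(regions.values())
    return {name: frame_budget_kbit * w / total for name, w in regions.items()}

# weight = assumed perceptual sensitivity x fraction of frame area
regions = {
    "face":              4.0 * 0.10,  # small area, highly sensitive
    "static_background": 0.5 * 0.60,  # large area, insensitive
    "other":             1.0 * 0.30,
}
alloc = allocate_bits(33.3, regions)  # ~1 Mbps at 30 fps -> ~33.3 kbit/frame
print({k: round(v, 1) for k, v in alloc.items()})
```

Even though the face occupies only 10% of the frame here, it receives the largest share of bits, which is the behavior the text describes for highly sensitive regions.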
3.4 SCC encoding
The second coding technology is SCC (screen content coding), mainly used for computer-generated images such as screen sharing in education or meetings. Compared with x265's ultrafast preset, our algorithm improves compression performance by 65%; with the same computing resources, encoding speed increases by 50%. For screen sharing we also solved some scenario-specific problems. Shared content is often graphics and text, such as Word or PPT, which is largely static; the encoder then usually chooses a low frame rate and favors image quality. But users frequently switch from sharing slides to sharing video, and if we cannot perceive that switch, the viewing experience becomes a discontinuous, GIF-like slideshow.
To solve this, we adapt the video coding frame rate based on spatio-temporal complexity analysis: static text-and-graphics screens get high image quality at a low frame rate, while smoothness is preserved when the share switches to video.
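The decision can be sketched as a mapping from a temporal-complexity measure to a frame rate. The metric, thresholds, and output rates below are simplified assumptions, not the production algorithm:

```python
# Sketch of frame-rate adaptation from temporal complexity: static slides get
# a low frame rate (favoring quality), motion triggers a high frame rate.

def pick_frame_rate(temporal_complexity):
    """temporal_complexity: normalized mean absolute frame difference in [0, 1].
    Thresholds and rates are illustrative, not the production values."""
    if temporal_complexity < 0.01:   # essentially static slide or text
        return 5                     # low frame rate, spend bits on quality
    if temporal_complexity < 0.10:   # light motion: cursor, scrolling
        return 15
    return 30                        # embedded video playback: favor smoothness

print(pick_frame_rate(0.001), pick_frame_rate(0.05), pick_frame_rate(0.5))  # 5 15 30
```

A real implementation would also smooth the metric over time to avoid oscillating between modes on a single noisy frame.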
The second problem we solved is the color distortion caused by downsampling from YUV444 to YUV420. Much shared screen content is static text and graphics with high color-fidelity requirements, but downsampling from YUV444 to YUV420 strongly attenuates the signal in the chroma (UV) domain. The left image shows the effect before the new algorithm and the right image after: the fonts on the right are clearly sharper with less color distortion. The core is a low-complexity color correction algorithm.
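Why 4:2:0 hurts sharp-edged text can be shown in a few lines. A minimal sketch of one common 4:2:0 chroma downsampling (2x2 averaging), applied to a one-pixel-wide colored stroke; the pixel values are invented for illustration:

```python
# Each 2x2 block keeps only one (averaged) chroma sample, so a one-pixel-wide
# colored stroke gets smeared into its neighbors.

def subsample_420(chroma):
    """Average chroma over each 2x2 block (one common 4:2:0 downsampling)."""
    h, w = len(chroma), len(chroma[0])
    return [
        [(chroma[y][x] + chroma[y][x+1] + chroma[y+1][x] + chroma[y+1][x+1]) // 4
         for x in range(0, w, 2)]
        for y in range(0, h, 2)
    ]

# A vertical 1-pixel stroke (U value 200) on a neutral background (128):
u_plane = [
    [128, 200, 128, 128],
    [128, 200, 128, 128],
    [128, 200, 128, 128],
    [128, 200, 128, 128],
]
print(subsample_420(u_plane))
```

The stroke's chroma value 200 is diluted to 164 and spread across the block, which is exactly the kind of color smearing on thin fonts that the correction algorithm targets.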
3.5 Adaptive long-term reference frame coding
The first two coding technologies reduce redundancy; adaptive long-term reference frame coding scientifically adds it. To build up to it, first consider fixed long-term reference frames. In the upper-left figure, red is the I-frame, green the long-term reference frames, and blue the ordinary P-frames. This reference structure breaks the ordinary IPPPP chain of forward dependencies, so if P2 or P3 is lost, the later P5 can still be decoded, which improves fluency. But there are still shortcomings. First, if the green long-term reference frame P5 is lost, the subsequent P-frames that depend on it cannot be decoded. Second, the fixed long reference distance adds a certain amount of redundancy, which lowers quality at the same bandwidth. We therefore want to shrink the redundancy whenever the network is good, to improve image quality, and so we proposed the adaptive long-term reference frame method.
The adaptive long-term reference frame has two core ideas. First, a feedback mechanism: the decoder tells the encoder that it has received a given long-term reference frame, and only after the encoder knows the frame was received does it encode against it. Second, dynamically marked long-term reference frames: we dynamically tune the long-term reference step length according to the network's QoS, shortening it when the network is good and lengthening it when the network is poor.
But the feedback mechanism introduces a problem. Under network conditions with a long RTT, the feedback cycle is long, and a lost feedback packet must itself be resent, so the long-term reference step length can grow very long. Once it does, encoding quality declines, potentially to an unacceptable level. We accounted for this when optimizing the algorithm: when the long-term reference step length grows too long, we force the P-frame to reference the nearest long-term reference frame instead of relying entirely on feedback. This brings two improvements: picture fluency is better under bursty packet loss, and the network adaptivity is stronger, balancing fluency and image quality.
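The reference-selection logic described above can be sketched as follows. The frame numbering and the step-length cap are illustrative assumptions, not the production parameters:

```python
# Sketch of adaptive long-term reference (LTR) selection: prefer the newest
# acknowledged LTR, but if the acked LTR is too far behind, fall back to the
# nearest LTR regardless of acknowledgement.

MAX_STEP = 10  # force fallback if the acked LTR is > 10 frames behind (assumed)

def pick_reference(current_frame, ltr_frames, acked):
    """Return the LTR frame the next P-frame should reference."""
    acked_ltrs = [f for f in ltr_frames if f in acked]
    if acked_ltrs and current_frame - max(acked_ltrs) <= MAX_STEP:
        return max(acked_ltrs)   # normal path: newest acknowledged LTR
    return max(ltr_frames)       # fallback: nearest LTR, acked or not

# LTRs were emitted at frames 0, 8, 16; only frame 0 acknowledged so far.
print(pick_reference(18, [0, 8, 16], acked={0}))       # falls back to 16
print(pick_reference(18, [0, 8, 16], acked={0, 16}))   # uses acked 16
```

The fallback branch is what keeps encoding quality from collapsing when feedback is slow or lost; it trades a small loss-resilience risk for a bounded reference distance.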
3.6 Network transmission technology: seeking the optimal solution for interactivity and quality
That was our sharing on video coding technology; now let's look at our practice in network transmission. In our definition, the core goal of network transmission is to find the optimal balance between interactivity and quality. Network transmission technology mainly resists packet loss, delay, and jitter, with common techniques such as ARQ, FEC, and asymmetric protection, as well as jitter estimation and buffer scaling. Beyond anti-jitter and anti-packet-loss, congestion control is also needed: its core purpose is to make the sending rate approach the available rate as closely as possible while keeping delay low, because a mismatch between the sending rate and the available bandwidth causes packet loss, jitter, or low bandwidth utilization. Another very important piece is source-channel linkage. The dynamic long-term reference frame we saw earlier is one way of adjusting encoding parameters based on channel information, and this kind of linkage further improves the experience.
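As a rough illustration of the trade-offs just described, the sketch below picks a FEC redundancy ratio from the measured loss rate and RTT, then budgets the sending rate so that media plus repair packets still fit the estimated available bandwidth. All the constants and function names here are made up for illustration; they are not Huawei's production parameters.

```python
def fec_redundancy(loss_rate, rtt_ms):
    """Toy policy: with long RTT, ARQ retransmission arrives too late,
    so lean on FEC; with short RTT, ARQ is cheap and redundancy can
    stay small. The 1.5x factor and 50 ms threshold are illustrative."""
    base = min(max(loss_rate * 1.5, 0.0), 0.5)  # cap overhead at 50%
    if rtt_ms < 50:  # ARQ round trip is fast enough to carry more load
        base *= 0.5
    return round(base, 3)

def target_send_rate(available_kbps, redundancy):
    """Leave headroom for FEC overhead so the sending rate approaches,
    but does not exceed, the estimated available rate."""
    return int(available_kbps / (1.0 + redundancy))
```

For example, at 10% loss and 200 ms RTT the policy spends 15% overhead on FEC, and a 2000 kbps link with 25% redundancy leaves 1600 kbps for media.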
3.7 Based on reinforcement learning, improve bandwidth prediction accuracy and improve QoE experience quality
Whether for congestion control or source-channel linkage, the bandwidth prediction algorithm is critical. The traditional approach uses hand-tuned heuristics and decision-tree algorithms to predict bandwidth under different network models, but in complex scenarios the results are not ideal, so we turned to reinforcement learning to improve this.
The main idea is to learn from the network QoS reported by the receiving end, which feeds back four pieces of information: receiving rate, sending rate, packet loss rate, and delay jitter. A reinforcement learning model trained on this feedback improves the accuracy of bandwidth prediction. After this optimization, our HD ratio increased by 30% and the freeze rate dropped by 20%.
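To show the shape of such a feedback loop, here is a minimal sketch of the reward signal and action space an RL bandwidth predictor might use. The weights, scale factors, and function names are invented for illustration; the actual model and its training are not described in the talk and are not reproduced here.

```python
def qoe_reward(recv_kbps, loss_rate, jitter_ms):
    """Hypothetical reward shaping: reward throughput, penalise loss
    and jitter. The weights (10.0, 0.01) are illustrative only."""
    return recv_kbps / 1000.0 - 10.0 * loss_rate - 0.01 * jitter_ms

def next_bandwidth(predicted_kbps, action):
    """A small discrete action space: scale the previous bandwidth
    estimate down, hold it, or probe upward."""
    scale = {"down": 0.85, "hold": 1.0, "up": 1.08}[action]
    return predicted_kbps * scale
```

In a real agent, the state would be built from the four reported QoS signals (receiving rate, sending rate, loss rate, jitter), the policy would pick an action each report interval, and `qoe_reward` would drive learning.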
3.8 Audio 3A technology: improve audio clarity
Finally, I will share our technical practice in audio. A good 3A algorithm (echo cancellation, noise suppression, and automatic gain control) is essential for speech intelligibility, and we apply AI to the 3A algorithms to improve the voice experience.
First, we apply AI to echo cancellation, a very important step in the whole 3A pipeline. Traditional algorithms handle echo cancellation in a steady-state environment fairly well. But when the environment changes, for example when you walk from a room to the balcony at home during a hands-free call on a mobile phone, echo cancellation runs into many challenges, and AI handles these cases better. In particular, for double-talk scenarios, our new algorithm solves the problems of echo leakage and dropped words.
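For context on what the AI improves upon, here is the textbook core of linear echo cancellation: an NLMS adaptive filter that models the echo path and subtracts the estimated echo from the microphone signal. This is the classic steady-state baseline only; the AI enhancements described in the talk are not shown.

```python
import numpy as np

def nlms_echo_cancel(far, mic, taps=64, mu=0.5, eps=1e-8):
    """Normalized LMS adaptive filter (classic linear AEC baseline).

    far: far-end reference signal (what the loudspeaker plays)
    mic: microphone signal containing the echo of `far`
    Returns the residual signal after echo subtraction.
    """
    w = np.zeros(taps)           # adaptive echo-path estimate
    buf = np.zeros(taps)         # recent far-end samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far[n]
        est_echo = w @ buf
        err = mic[n] - est_echo  # residual after echo removal
        # NLMS update: step normalized by input power for stability
        w += mu * err * buf / (buf @ buf + eps)
        out[n] = err
    return out
```

On a stationary echo path this filter converges quickly; the environment changes and double-talk mentioned above are exactly where it struggles and the AI approach takes over.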
The second is noise reduction. Stationary noises such as fans and air conditioners are already suppressed fairly well by traditional methods. Our AI-based noise reduction not only handles stationary noise better, but also quickly suppresses sudden noises such as keyboard and mouse clicks, drinking water, or coughing.
Another important part of 3A is automatic gain control. In a call scenario, automatic gain is driven mainly by recognizing the human voice, so voice activity detection (VAD) is crucial. Here, too, we use AI to improve the accuracy of voice detection, which in turn improves the automatic gain.
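The interplay between VAD and gain control can be sketched very simply: only frames the VAD marks as speech are allowed to drive the gain, so noise-only frames never pump the volume up. The energy-threshold VAD, target level, and step size below are toy stand-ins, not the AI detector from the talk.

```python
import numpy as np

def agc_with_vad(frames, target_rms=0.1, vad_thresh=0.01, step=0.1):
    """Sketch of VAD-gated automatic gain control.

    frames: iterable of numpy arrays (one audio frame each).
    Gain adapts toward target_rms / frame_rms, but only on frames the
    crude energy-based VAD classifies as speech.
    """
    gain = 1.0
    out = []
    for f in frames:
        rms = np.sqrt(np.mean(f ** 2))
        if rms > vad_thresh:  # toy VAD: energy above threshold = speech
            gain += step * (target_rms / max(rms, 1e-9) - gain)
        out.append(f * gain)
    return out, gain
```

With a better (AI-based) VAD in place of the energy check, the gain tracks the speaker instead of background noise, which is exactly the improvement described above.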
3.9 Audio packet loss recovery technology: reduce the impact of packet loss on audio experience
Another area where audio differs from video is packet loss recovery. The picture on the left is a classic map of packet loss recovery technologies, divided into two categories: active packet loss recovery and passive packet loss recovery.
Active packet loss recovery mainly includes the familiar FEC, ARQ, and so on; passive recovery has three main methods: interpolation, insertion, and regeneration. The optimization idea is the same as for video: just as video coding studies the visual characteristics of the human eye, audio studies the human vocal mechanism. Fundamental-frequency information reflects, to a degree, the vibration frequency of the vocal cords, and envelope information reflects, to a degree, the shape of the mouth. Combining these two signals with an AI vocoder can recover around 100 milliseconds of lost audio. Pronouncing a Chinese character generally takes 150 to 200 milliseconds; traditional signal-based PLC can generally recover about 50 ms of audio, while our AI-based method reaches 100 ms.
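For reference, here is the classic signal-domain baseline mentioned above: repetition-based packet loss concealment, which patches a lost frame by repeating the last pitch period of received audio with a fade-out. This is the traditional ~50 ms approach, not the AI vocoder method; the function name and fade constants are illustrative.

```python
import numpy as np

def conceal_lost_frame(history, frame_len, pitch_period):
    """Classic repetition-based PLC for one lost frame.

    history: received samples so far (1-D array)
    frame_len: number of samples to synthesize
    pitch_period: estimated pitch period in samples
    """
    period = history[-pitch_period:]           # last full pitch cycle
    reps = int(np.ceil(frame_len / pitch_period))
    patch = np.tile(period, reps)[:frame_len]  # repeat to fill the gap
    # Fade toward half amplitude to avoid a buzzy, robotic artifact
    # when several consecutive frames are lost.
    fade = np.linspace(1.0, 0.5, frame_len)
    return patch * fade
```

Because repetition only sounds natural for a few tens of milliseconds, longer gaps need the pitch-plus-envelope-driven vocoder regeneration described above.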
3.10 Case 1: Huawei Changlian, the world's first full-scene audio and video call product
Finally, two cases. Our products serve not only external customers but also support many of Huawei's own products and services internally. I often joke that internal customers are actually harder to support than external ones, because their requirements are very high. Today we support the Changlian (MeetTime) calling service on Huawei phones, the world's first full-scenario real-time audio and video call product (besides phones, it also runs on Huawei large screens, tablets, notebooks, watches, and bands), delivering high-quality 1080p 30 fps calls even under constrained bit-rate conditions.
3.11 Case 2: Webinar: integrated conference and live broadcast experience, making large meetings easier
Even harder than supporting one Huawei internal customer is supporting two. Our second internal customer is Huawei Cloud Meeting, whose webinar scenario is built on our real-time audio and video services. A single webinar can now support an audience of three thousand participants, one hundred of whom can interact. In the second half of this year, the cloud conference product will support a single webinar with an audience of ten thousand, with five hundred interactive participants.
04Summary
Finally, a summary of today's sharing. First, we can clearly see that video services are driving the development of Internet technology as a whole, including audio and video coding and transmission, as well as edge computing and edge networking. We therefore need a service or system to bridge the gap between the Internet infrastructure (supply side) and rapidly developing video services (demand side).
Second, today's sharing is only a beginning. As real-time audio and video technology reaches more application scenarios, our cloud-native media network architecture and its algorithms will continue to be optimized, driven by data.
Finally, I hope that Huawei Cloud's native video service can join you in entering the "new era" of video.
Thank you all.