Audio and video conferencing, live streaming, and short videos have become part of how people work, learn, and relax, and all of them depend on key technologies such as real-time audio and video communication. On the audio side, diverse customer business models, complex environments, and differences between access devices cause a whole series of problems, and techniques and strategies tuned for a single scenario can no longer handle the issues increasingly exposed online. The only way out is a set of audio preprocessing 3A algorithms (AEC, ANS, AGC) that adapts to all scenarios. To address noise in complex environments, we launched AliCloudDenoise, a speech enhancement algorithm that makes up for the weakness of traditional noise reduction in suppressing non-stationary noise. To address volume, we launched AliAGC, an automatic gain control algorithm that greatly reduces volume inconsistency across environments, devices, and scenarios, and is more intelligent than the traditional WebRTC AGC.

What's wrong with existing WebRTC AGC algorithms?

The article "Explanation of WebRTC's High Quality and Low Latency - AGC (Automatic Gain Control)" gives an in-depth explanation of the core principles behind the different WebRTC AGC modes. The digital and analog adaptive modes, which extend the fixed gain mode, all suffer from stability problems such as overreaction and delayed response, as well as inaccurate estimation of the compensation gain. The technical details are not repeated here. In terms of direction, WebRTC AGC's pursuit of adaptivity is correct, and what we need to do is optimize it. First, let's look at the pain points encountered online:

(1) Inconsistent volume
In a multi-person conference, the captured audio is affected by many factors, such as device differences, the environment, and the speakers themselves. With only a fixed gain scheme, different speakers sound inconsistent in volume: some are persistently quiet and others suddenly loud, and the listener can only cope by repeatedly adjusting the device's playback volume. Similar problems are inevitable when switching between live rooms or short videos.

(2) Over-amplification of background voices in noisy environments
In open environments such as offices and stores, when the speaker has the microphone on but is not speaking, the surrounding babble is easily mistaken for the speaker's voice. A traditional adaptive scheme then triggers gain compensation, so clearly audible noise persists throughout, which seriously degrades the conference or live-stream experience.

(3) Fluctuating background music volume in entertainment scenarios such as live streaming
Playing background music is extremely common in entertainment scenarios such as live streaming, and many streamers use sound cards. The business layer generally chooses to turn off AGC and hand volume control over to the streamer. Overall, this cannot solve the large volume differences between live rooms described in (1), and streamers can hardly notice popping or low volume on their own stream. AGC therefore needs to be enabled in such scenarios. However, a traditional gain compensation strategy does not distinguish voice from background music, which inevitably makes the music volume fluctuate, and that is unacceptable to the audience. Handling scenes with music is the biggest challenge for AGC.

As we can see, robust adaptive analog/digital gain is only the foundation and solves just the inconsistent volume problem in (1). Other methods or modules are needed to handle the volume problems of specific scenarios.

AliAGC algorithm optimization direction

In order to pursue the ultimate audio and video call experience, the Alibaba Cloud Video Cloud Audio technical team puts forward the following requirements for AGC as the last link in the audio 3A algorithm:

① Gain compensation and adaptive adjustment strategies respond quickly, achieving convergence in seconds;

② The gain range is large, which can cover most mobile and PC devices;

③ In complex scenes such as noise and music, it has good stability and does not trigger misadjustment;

④ Low power consumption and lossless sound quality.

In order to achieve the above goals, we have made the following main optimizations based on the AGC framework in WebRTC (for details, please refer to " Explanation of WebRTC's High Quality and Low Latency - AGC (Automatic Gain Control) "):

① Adaptive digital gain: a new VAD/envelope detection module computes the audio signal level in real time and quickly determines the upper limit on gain, which guides the current digital gain adjustment;

② Adaptive analog gain: the detected voice level and noise floor guide the analog gain adjustment, keeping the captured noise floor and voice level within the target range;

③ Scene adaptation: new multi-task detection modules for voice, babble, music, and so on dynamically estimate the current noise level, music state, and other conditions, activate the corresponding adjustment strategy, and adapt the algorithm to most current application scenarios;

④ Audio statistics: new statistics such as voice and noise levels, together with event detection, provide accurate data for the other modules. They also improve instrumentation through the data reporting channel and enrich the backend dashboard.
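
As a rough illustration of item ①, the sketch below tracks a peak envelope over voice frames and turns it into a ceiling for the digital gain. It is only a minimal sketch, not AliAGC's implementation: the class name `EnvelopeGainCap`, the -3 dBFS target, the 30 dB limit, the release rate, and the per-frame `is_voice` flag (standing in for the VAD decision) are all assumptions made for this example.

```python
import math

TARGET_PEAK_DBFS = -3.0      # assumed target peak level
MAX_DIGITAL_GAIN_DB = 30.0   # assumed hard limit on digital gain


def frame_peak_dbfs(frame):
    """Peak level of one frame of float samples in [-1, 1], in dBFS."""
    peak = max(abs(s) for s in frame)
    return 20.0 * math.log10(max(peak, 1e-6))  # floor at -120 dBFS


class EnvelopeGainCap:
    """Hypothetical helper: envelope follower that yields a gain ceiling."""

    def __init__(self, release_db_per_frame=0.05):
        self.envelope_db = -120.0
        self.release_db_per_frame = release_db_per_frame

    def update(self, frame, is_voice):
        # Only frames flagged as voice move the envelope, so noise-only
        # frames cannot change the gain ceiling.
        if is_voice:
            peak_db = frame_peak_dbfs(frame)
            # Instant attack, slow release.
            self.envelope_db = max(
                peak_db, self.envelope_db - self.release_db_per_frame
            )
        headroom_db = TARGET_PEAK_DBFS - self.envelope_db
        return min(max(headroom_db, 0.0), MAX_DIGITAL_GAIN_DB)


cap = EnvelopeGainCap()
print(cap.update([0.1, -0.05], is_voice=True))
```

A voice frame that peaks at 0.1 (about -20 dBFS) yields a ceiling of about 17 dB. Until a voice frame arrives, the ceiling simply sits at the assumed 30 dB limit.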

AliAGC algorithm effect

With these pain points in mind, let's look at how AliAGC performs after optimization:

(1) Fast convergence
When the captured volume is extremely low, going from -30 dB to -3 dB takes 5 s to 8 s; under normal conditions, going from -20 dB to -3 dB takes only 3 s to 5 s.

Conversely, when the captured volume is high and the digital gain is far too large, the gain is also brought down quickly. In most scenarios it converges in roughly the time it takes to say one sentence.
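
To get a feel for these numbers, the sketch below simulates a gain loop whose output level can move by at most a fixed number of dB per second. The function name `frames_to_converge`, the 10 ms frame size, the 0.5 dB tolerance, and the 5 dB/s slew rate are assumptions for illustration; the slew rate was picked only because it lands inside the ranges quoted above, and it is not a published AliAGC parameter.

```python
def frames_to_converge(start_db, target_db, slew_db_per_s,
                       frame_ms=10, tol_db=0.5):
    """Count frames until a slew-limited level is within tol_db of target."""
    level = start_db
    step = slew_db_per_s * frame_ms / 1000.0  # max change per frame, in dB
    frames = 0
    while abs(target_db - level) > tol_db:
        delta = target_db - level
        # Move toward the target, but never by more than one step.
        level += max(-step, min(step, delta))
        frames += 1
    return frames


for start in (-30.0, -20.0):
    seconds = frames_to_converge(start, -3.0, slew_db_per_s=5.0) * 0.010
    print(f"{start:.0f} dB -> -3 dB: about {seconds:.1f} s")
```

With these assumed values the loop needs roughly 5.3 s for the low-volume case and roughly 3.3 s for the normal case. The same loop handles downward adjustment, because the step is clamped in both directions.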

(2) Adaptive digital gain updates
In the previous case, the volume in the first stage is extremely low (< -34 dB), while the middle and later stages are relatively loud. The output shows that the final volume stays basically within the target range of [-3 dB, -1 dB], with no audible difference.

Let's look at a more extreme case: the voice alternates between loud and quiet. If the adaptive gain adjustment is not timely, the peaks get flattened by the compressor and the quiet passages are not raised in time (for a plain-language explanation, see this article). After optimization, the overall output volume is stable and the waveform remains intact.
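
The flattened peaks are easy to reproduce. The sketch below applies a gain that suits a quiet passage to a loud one and counts how many samples hit the ceiling. A hard limiter is used here as a simplified stand-in for the compressor, and the helper names and signal levels are assumptions for this example.

```python
import math


def sine(amplitude, n=64, period=32):
    """Test tone with the given peak amplitude."""
    return [amplitude * math.sin(2 * math.pi * k / period) for k in range(n)]


def apply_gain_with_limiter(samples, gain_db, ceiling=1.0):
    """Apply a gain in dB, hard-limit at the ceiling, count limited samples."""
    gain = 10 ** (gain_db / 20)
    out, limited = [], 0
    for s in samples:
        y = s * gain
        if abs(y) > ceiling:
            y = math.copysign(ceiling, y)  # peak gets flattened here
            limited += 1
        out.append(y)
    return out, limited


quiet, loud = sine(0.05), sine(0.5)
_, n_quiet = apply_gain_with_limiter(quiet, 20.0)  # gain suited to quiet part
_, n_stale = apply_gain_with_limiter(loud, 20.0)   # same gain, loud part
_, n_adapt = apply_gain_with_limiter(loud, 6.0)    # gain reduced in time
print(n_quiet, n_stale, n_adapt)
```

Only the stale-gain case flattens samples. Once the gain has dropped to 6 dB, the loud passage fits under the ceiling and the waveform shape is preserved.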

We also recorded the locally played audio of participant F in a multi-person conference. The final published volume of participants A through E was equalized to around -3 dB, so from participant F's point of view they all sounded about equally loud.

(3) Gain control in noisy environments
Similarly, we selected a piece of published stream audio recorded in a real meeting. Before the speaker starts talking, other colleagues are speaking nearby. Because the traditional adaptive scheme does not monitor for a noisy environment, the colleagues' voices are also boosted considerably. The optimized scheme avoids this: the adaptive logic is activated only when the speaker starts to talk, so surrounding babble is not over-amplified.

Likewise, for a raw capture with a high noise floor and background voices, the gain is held well before the speaker talks, and the noise floor is not greatly amplified by the AGC. Once the speaker starts to talk, adaptive gain adjustment is triggered and the gain settles at a suitable level.
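
The hold-then-adapt behavior can be sketched as a gain loop that only updates on frames where the main speaker is judged active. This is a minimal sketch under stated assumptions: the function name `adapt_gain`, the 0.5 dB per-frame step, and the `speaker_active` flag are hypothetical, and how that flag is actually detected is not covered here.

```python
def adapt_gain(frames, target_db=-3.0, step_db=0.5, start_gain_db=0.0):
    """frames: iterable of (level_dbfs, speaker_active) pairs.

    Returns the gain in dB after each frame.
    """
    gain_db = start_gain_db
    history = []
    for level_db, speaker_active in frames:
        if speaker_active:
            # Adapt toward the target, limited to one step per frame.
            error_db = target_db - (level_db + gain_db)
            gain_db += max(-step_db, min(step_db, error_db))
        # Otherwise hold the gain: babble and noise floor are not boosted.
        history.append(gain_db)
    return history


babble = [(-40.0, False)] * 20   # colleagues talking, speaker silent
speech = [(-20.0, True)] * 50    # main speaker starts talking
history = adapt_gain(babble + speech)
print(history[19], history[-1])  # 0.0 17.0
```

The gain stays at its starting value through the babble and only climbs once the speaker is active, ending at the 17 dB needed to bring -20 dBFS speech to the assumed -3 dBFS target.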

(4) Gain control in entertainment live streaming
We selected material in which the streamer's voice and background music alternate. A traditional gain compensation scheme treats voice and music alike and boosts both, so the background music fluctuates. In the optimized scheme, the music detection module performs well and guides the AGC to control the gain on the music segments, so the output matches expectations: the overall gain adapts only to the streamer's voice.
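
A small worked example makes the difference concrete. It compares a scheme that normalizes every segment to the target with one that updates the gain only on voice segments and holds it over music. The segment labels, levels, and function names are assumptions for illustration; in practice the labels would come from a music detection module.

```python
def traditional_gains(segments, target_db=-3.0):
    """Every segment is pushed to the target, whatever its content."""
    return [target_db - level_db for _, level_db in segments]


def music_aware_gains(segments, target_db=-3.0):
    """Gain is updated on voice segments only and held during music."""
    gains, gain_db = [], 0.0
    for kind, level_db in segments:
        if kind == "voice":
            gain_db = target_db - level_db
        gains.append(gain_db)
    return gains


# (label, input level in dBFS); the music sits 10 dB below the voice.
segments = [("voice", -20.0), ("music", -30.0), ("voice", -20.0)]
print(traditional_gains(segments))   # [17.0, 27.0, 17.0]
print(music_aware_gains(segments))   # [17.0, 17.0, 17.0]
```

In the first scheme the music is boosted by an extra 10 dB and ends up as loud as the voice, which the audience hears as the music level pumping. In the second, the music keeps its original 10 dB distance below the voice.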

Full-scene adaptation: the next optimization goal for the AliAGC algorithm

The audio 3A algorithms (and not only 3A) provided by the Alibaba Cloud Video Cloud Audio technical team safeguard the audio quality on the publishing side of AliRTC, so no audio metric should show an obvious weakness. In complex application scenarios all three are indispensable, and together they determine audio quality and subjective experience. We cannot optimize one algorithm in isolation. For example, if the AGC gain is too large, it not only over-amplifies noise but also increases the nonlinear components of the echo captured at the far end, which degrades echo cancellation. Likewise, weak noise reduction limits the maximum gain the AGC can reach. And in a noisy environment, we cannot rely on the AGC alone to control background voices, since false detections are always possible. If intelligent noise reduction is enabled by default, the burden on the AGC in such scenarios is greatly reduced.
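
The coupling between noise reduction and AGC can be expressed as simple dB arithmetic: the residual noise floor after gain should stay below some ceiling, so the usable gain shrinks as the noise floor rises. The function name `max_gain_db`, the -50 dBFS noise ceiling, and the 30 dB hard limit are assumptions for this sketch.

```python
def max_gain_db(noise_floor_dbfs, noise_ceiling_dbfs=-50.0,
                hard_limit_db=30.0):
    """Largest gain that keeps the amplified noise floor under the ceiling."""
    allowed_db = noise_ceiling_dbfs - noise_floor_dbfs
    return max(0.0, min(hard_limit_db, allowed_db))


raw_floor = -60.0                 # noise floor without noise reduction
denoised_floor = raw_floor - 15   # assumed 15 dB of noise suppression
print(max_gain_db(raw_floor))       # 10.0
print(max_gain_db(denoised_floor))  # 25.0
```

Under these assumed numbers, 15 dB of extra noise suppression raises the gain the AGC may apply from 10 dB to 25 dB, which is the sense in which weak noise reduction caps the AGC.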

In subsequent optimization, the 3A configuration will be gradually refined per scenario, and 3A will be judged by its overall end result. For single-algorithm optimization, the gap between major vendors keeps narrowing, so personalized and differentiated innovation is particularly important. On the one hand, the AliAGC algorithm needs to actively dig into online bad cases and keep strengthening stability; on the other hand, it needs to deepen the exploration and application of machine learning, microphone arrays, and other technologies to enrich product highlights.

"Video Cloud Technology", your most noteworthy public account of audio and video technology, pushes practical technical articles from the frontline of Alibaba Cloud every week, where you can communicate with first-class engineers in the audio and video field. Reply to [Technology] in the background of the official account, you can join the Alibaba Cloud video cloud product technology exchange group, discuss audio and video technology with industry leaders, and obtain more latest industry information.

CloudImagine