Previously, we covered the basic principles and optimization directions of acoustic echo cancellation (AEC: Acoustic Echo Cancellation), one of the "3A" algorithms in WebRTC audio. In this chapter we discuss another "A": automatic gain control (AGC: Automatic Gain Control). This article analyzes the basic framework of WebRTC AGC with examples and explores its basic principles, the differences between its modes, its existing problems, and directions for optimization.

Author|Luo Shen
Review|Taiyi

Preface

Automatic gain control (AGC: Automatic Gain Control) is, in my view, the audio algorithm module with the longest processing chain and the greatest impact on sound quality and subjective listening experience. On the one hand, AGC must act on the sending end to cope with the wide variety of capture devices on mobile and PC; on the other hand, it is often used as a compressor on the receiving end to balance the mixed signal and prevent clipping. The most direct manifestation of device diversity is the difference in capture volume: excessive volume leads to clipping, while a capture volume that is too low makes the audio very hard to hear on the other end.

In real audio and video calls, different participants speak at different volumes, forcing listeners to frequently adjust the playback volume, and users wearing headsets risk having their ears "blasted" by sudden loud sounds. Equalizing the sender's volume is therefore particularly important in these scenarios. A good automatic gain control algorithm unifies the audio volume, greatly alleviating fluctuations caused by differences in capture devices, speaker loudness, speaking distance, and other factors.

The position of AGC in WebRTC

Before discussing the AGC audio processing framework, let's look at AGC's position in real-time audio and video communication. Figure 1 shows, for the same device, the flow of audio data from capture to encoding on the sending side and from decoding to playback on the receiving side. On the sending end, AGC acts as both an equalizer and a compressor to adjust the volume of the published stream; on the receiving end, it serves only as a compressor to prevent clipping after mixing. In theory, once the sending-side AGC is robust enough, a compressor on the receiving side is sufficient, although some vendors apply another AGC there to further reduce the volume differences between voices after mixing.

Figure 1 Block diagram of the uplink and downlink audio processing flow in WebRTC

The core parameters of AGC

First, let's look at the relationship between sample values and decibels (dB). Taking 16-bit quantized audio samples as an example: dB = 20 * log10(Sample / 32768.0), which is consistent with the right-hand scale in Adobe Audition.

Amplitude representation: for 16-bit samples, the minimum value is 0 and the maximum absolute value is 32768 (right column of the figure).

Decibel representation: the maximum value is 0 dB (right column of the figure). A level of -3 dB is already quite loud, so 3 is often chosen as the AGC target level.

The core parameters are:

typedef struct {
  int16_t targetLevelDbfs;    // target level
  int16_t compressionGaindB;  // gain capability
  uint8_t limiterEnable;      // limiter switch
} AliyunAgcConfig;

Target level - targetLevelDbfs: the target value of the volume equalization result; if set to 1, the target output level is -1 dBFS;

Gain capability - compressionGaindB: the maximum gain AGC can apply; if set to 12 dB, the signal can be boosted by at most 12 dB;

Limiter switch - limiterEnable: generally used together with targetLevelDbfs. compressionGaindB determines how much low-volume signals are boosted, while the limiter caps any part that exceeds targetLevelDbfs to avoid clipping.

The core modes of AGC

In addition to the three core parameters above, WebRTC AGC provides the following three modes for different capture devices:

enum {
  kAgcModeUnchanged,
  kAgcModeAdaptiveAnalog,  // adaptive analog mode
  kAgcModeAdaptiveDigital, // adaptive digital gain mode
  kAgcModeFixedDigital     // fixed digital gain mode
};

Below, we describe these three modes with examples, covering their basic function, applicable scenarios, signal flow diagrams, and existing problems.

Fixed Digital Gain-FixedDigital

Fixed digital gain is the most basic mode and the core of AGC; the other two modes are extensions of it. It amplifies the signal with a fixed gain whose maximum never exceeds the configured gain capability compressionGaindB; when used together with the limiter, the output never exceeds the configured target level targetLevelDbfs.

In fixed digital gain mode, only the core function WebRtcAgc_ProcessDigital is used to equalize the input signal's volume. Since there is no feedback mechanism, the processing is extremely simple: once the parameters are set, the signal goes through the following flow:

Fixed digital gain is the core mode, and two aspects of it are worth studying in depth:

The basic idea of the voice detection module WebRtcAgc_ProcessVad

In real-time communication, the near-end signal captured by the microphone contains components of the far-end signal. During processing, the far-end signal is analyzed by the WebRtcAgc_ProcessVad function; when estimating the actual near-end signal envelope, the far-end interference must be removed so that residual echo does not distort statistics such as the near-end envelope. The most traditional VADs distinguish speech from non-speech segments using indicators such as energy, zero-crossing rate, and noise thresholds. WebRTC AGC offers a fresh approach to roughly identifying speech segments:

  1. Compute the short-term mean and variance, which describe the instantaneous changes of the voice envelope and can track it accurately, as in the red curve on the left of Figure 2;
// update short-term estimate of mean energy level (Q10)
tmp32 = state->meanShortTerm * 15 + dB;
state->meanShortTerm = (int16_t)(tmp32 >> 4);
  
// update short-term estimate of variance in energy level (Q8)
tmp32 = (dB * dB) >> 12;
tmp32 += state->varianceShortTerm * 15;
state->varianceShortTerm = tmp32 / 16;
  
// update short-term estimate of standard deviation in energy level (Q10)
tmp32 = state->meanShortTerm * state->meanShortTerm;
tmp32 = (state->varianceShortTerm << 12) - tmp32;
state->stdShortTerm = (int16_t)WebRtcSpl_Sqrt(tmp32);
  2. Compute the long-term mean and variance, which describe the signal's slowly changing overall trend and outline its "center of gravity"; using them with a threshold as the detection condition is smoother, as in the blue curve on the left of Figure 2;
// update long-term estimate of mean energy level (Q10)
tmp32 = state->meanLongTerm * state->counter + dB;
state->meanLongTerm = WebRtcSpl_DivW32W16ResW16(tmp32, WebRtcSpl_AddSatW16(state->counter, 1));
// update long-term estimate of variance in energy level (Q8)
tmp32 = (dB * dB) >> 12;
tmp32 += state->varianceLongTerm * state->counter;
state->varianceLongTerm = WebRtcSpl_DivW32W16(tmp32, WebRtcSpl_AddSatW16(state->counter, 1));
  3. Compute the standard score, which describes how far the short-term average deviates from the "center of gravity"; the part well above the center can be considered very likely to contain voice activity;
tmp32 = tmp16 * (int16_t)(dB - state->meanLongTerm);
tmp32 = WebRtcSpl_DivW32W16(tmp32, state->stdLongTerm);
state->logRatio = (int16_t)(tmp32 >> 6);


Figure 2 Left: long-term and short-term mean and variance. Right: input signal and VAD detection threshold

How WebRtcAgc_ProcessDigital applies gain to audio data

The three core parameters all revolve around the fixed digital gain mode. What we need to figure out is how WebRTC AGC's core function, WebRtcAgc_ProcessDigital, applies gain to the audio data.

  1. Compute the gain table gainTable from the specified targetLevelDbfs and compressionGaindB;
/* Compute the gain table gainTable from the configured target level and gain capability */
if (WebRtcAgc_CalculateGainTable(&(stt->digitalAgc.gainTable[0]),
                                 stt->compressionGaindB, stt->targetLevelDbfs,
                                 stt->limiterEnable, stt->analogTarget) == -1) {
  return -1;
}

In this step, the gain table gainTable can be understood as a quantization of signal energy (the square of the amplitude). If we fix targetLevelDbfs and set compressionGaindB to 3 dB through 15 dB in turn, we get the gain-table curves below: the larger the configured gain capability, the higher the curve, as shown in the figure below.

You may wonder why gainTable has a length of 32. It actually corresponds to the 32 bits of an int: the energy of short data lies in [0, 32768^2] and can be represented by an unsigned int. Scanning from high bit to low, the position of the highest 1 bit gives the order of magnitude, called the integer part (intPart); the following bits form the fractional part (fracPart). Thus any number in [0, 32768] maps to one gain value in the digital gain table. Next we discuss how to look up the table and apply the gain value to complete volume equalization.

/** Selected key source code */
/** Extract the integer part and the fractional part */
intPart = (uint16_t)(absInLevel >> 14);          // extract the integral part
fracPart = (uint16_t)(absInLevel & 0x00003FFF);  // extract the fractional part
......
/** Build the digital gain table from the integer and fractional parts */
gainTable[i] = (1 << intPart) + WEBRTC_SPL_SHIFT_W32(fracPart, intPart - 14);
  2. Look up the gain value in gainTable according to the input signal envelope, and apply the gain to the input signal;

Based on the human ear's hearing curve, AGC applies gain in segments. A frame of 160 samples is split into 10 segments of 16 samples each, which introduces the segmented gain array gains. The following code describes the relationship between the digital gain table and the gain array and directly reflects the table-lookup process. The idea mirrors the gain-table computation: compute the integer part and the fractional part first, then obtain the new gain value by combining adjacent table entries, including compensation for the fractional part.

// Translate signal level into gain, using a piecewise linear approximation
// find number of leading zeros
zeros = WebRtcSpl_NormU32((uint32_t)cur_level);
if (cur_level == 0) {
  zeros = 31;
}
tmp32 = (cur_level << zeros) & 0x7FFFFFFF;
frac = (int16_t)(tmp32 >> 19);  // Q12.
tmp32 = (stt->gainTable[zeros - 1] - stt->gainTable[zeros]) * frac;
gains[k + 1] = stt->gainTable[zeros] + (tmp32 >> 12);

The following code applies the segmented gain array gains to the output signal. The actual gain value is obtained by shifting right by 16 bits (the table entries are based on sample energy; shifting right by 16 bits can be understood as finding the integer α such that the sample amplitude multiplied by α comes closest to 32768) and multiplying it directly into the output signal (which was copied from the input signal at the start of the function).

/** Apply the gain array gains to the output signal to complete volume equalization */
for (k = 1; k < 10; k++) {
  delta = (gains[k + 1] - gains[k]) * (1 << (4 - L2));
  gain32 = gains[k] * (1 << 4);
  // iterate over samples
  for (n = 0; n < L; n++) {
    for (i = 0; i < num_bands; ++i) {
      tmp32 = out[i][k * L + n] * (gain32 >> 4);
      out[i][k * L + n] = (int16_t)(tmp32 >> 16);
    }
    gain32 += delta;
  }
}

Take the curve for compressionGaindB = 12 dB as an example. The upper figure shows the actual values of the computed digital gain table gainTable, and the lower figure shows the actual gain multiplier obtained by shifting right by 16 bits. You can see that with compressionGaindB = 12 dB, the maximum integer-part gain is 3. In theory a 12 dB gain amplifies by 4x; here the integer part contributes at most 3x, and the fractional part supplies the remaining 0~1.0x, which helps prevent clipping. Two simple examples:

A. For data with amplitude 8000, the envelope is cur_level = 8000^2 = 0x3D09000. WebRtcSpl_NormU32((uint32_t)cur_level) finds 6 leading zeros, so the integer-part gain is stt->gainTable[6] = 3; that is, 8000 can safely be multiplied by 3, and the sub-1.0x remainder of the gain multiplier is then determined by fracPart;

B. For data with amplitude 16000, the envelope is cur_level = 16000^2 = 0xF424000. WebRtcSpl_NormU32((uint32_t)cur_level) finds 4 leading zeros, so the integer-part gain is stt->gainTable[4] = 2. Note that 16000 * 2 = 32000; from there, equalizing to the target level is handled by the limiter, whose details we won't expand on here.

Simply put, for any number in [0, 32768], if we want to apply a specified decibel gain without the result exceeding 32768, we can find entries in the digital gain table gainTable that meet this requirement.

The application of the target level targetLevelDbfs and the limiter is reflected in WebRtcAgc_ProcessDigital and related functions, so we won't elaborate here; read the source code to learn more.

Let's use a few cases to examine the effects and problems of the fixed digital gain mode. First, set targetLevelDbfs = 1 and compressionGaindB = 12.

1. When the capture volume is low, the improvement after equalization is not obvious;

The device's capture volume is -24 dB; after equalization it reaches only -12 dB, and the overall volume still sounds somewhat low.

2. When the capture volume is high, the noise floor is significantly amplified;

The device's capture volume is -9 dB, and after equalization it reaches -1 dB. The overall volume sounds normal, but the fluctuation between frames is reduced, mainly because the noise in non-speech segments is greatly amplified. The main problem in this case: when the capture volume is already high, if the ambient noise is loud and noise suppression is weak, then once compressionGaindB is set to a large value, the speech part is limited to targetLevelDbfs but the noise floor of the non-speech part is fully amplified, and participants at the far end hear obvious noise.

3. When the captured volume fluctuates heavily (taking an artificially spliced loud-to-quiet recording as an example), equalization still cannot smooth it out;

Adaptive Analog Gain-AdaptiveAnalog

Before discussing adaptive analog gain, we need to clarify the PC-side features that affect capture volume:

  1. The PC supports adjusting the capture volume in the range 0~1.0, which the WebRTC client code internally maps to 0~255;
/** Taking Mac as an example: the microphone sensitivity is mapped to 0~255 */
int32_t AudioMixerManagerMac::MicrophoneVolume(uint32_t& volume) const {
  ......
    // vol 0.0 to 1.0 -> convert to 0 - 255
    volume = static_cast<uint32_t>(volFloat32 * 255 + 0.5);
    ......
  return 0;
}
  2. Most Windows laptops have a built-in microphone array and provide a microphone-array enhancement algorithm that, besides reducing noise, also applies an extra 0~10 dB of gain (the range varies by model; some Lenovo devices apply as much as 36 dB), as shown in Figure 3;


Figure 3 Left: analog gain adjustment on Mac. Right: built-in gain of the microphone array on Windows

Because so many modules affect the volume, the AGC algorithm is more delicate on the PC side. Unreasonable default values set by many customers in production directly degrade the audio and video call experience:

  1. An excessive capture volume noticeably amplifies noise and clips the human voice;

  2. An excessive capture volume also causes large nonlinear distortion when the playback signal feeds back into the microphone, which is a big challenge for the echo cancellation algorithm;

  3. A capture volume that is too low, combined with limited digital gain capability, leaves the far end unable to hear clearly;

Even after noticing abnormal audio, most users do not know that PC devices allow manually adjusting the capture gain, and it is nearly impossible to rely on end users (especially the many elementary school students in education scenarios) to adjust the analog gain themselves. The ability to adjust the gain value dynamically makes the AGC algorithm more practical, with the digital gain stage equalizing the near-end signal to the ideal level. Therefore, the WebRTC engineers developed the adaptive analog gain mode, which adjusts the original capture volume through a feedback mechanism. The goal is to cooperate with the digital gain module to find the most suitable microphone gain value and feed it back to the device layer, so that the near-end data reaches the target level after digital gain. The audio data flow block diagram is as follows:

There are two main additions on top of fixed digital gain:

  1. After the digital gain stage, a new analog gain update module, WebRtcAgc_ProcessAnalog, computes from the current analog gain value inMicLevel (mapped to 0~255 in WebRTC) and other intermediate parameters the analog gain value to apply next, outMicLevel, and feeds it back to the device layer.
// Scale from VoE to ADM level range.
uint32_t new_voe_mic_level = shared_->transmit_mixer()->CaptureLevel();
if (new_voe_mic_level != voe_mic_level) {
    // Return the new volume if AGC has changed the volume.
    new_mic_volume = static_cast<int>((new_voe_mic_level * max_volume +
                                       static_cast<int>(kMaxVolumeLevel / 2)) /
                                      kMaxVolumeLevel);
    return new_mic_volume;
}
  2. Some device manufacturers' microphone arrays default to a low level; even with the analog gain at maximum, the capture is still very quiet. In that case the digital compensation module WebRtcAgc_AddMic steps in, amplifying the raw capture by 1.0~3.16x, as shown in Figure 4. So how do we know the amplification is insufficient? The final output of the analog gain update module in the previous step is actually the smaller of micVol and the maximum value maxAnalog (255):
*outMicLevel = WEBRTC_SPL_MIN(stt->micVol, stt->maxAnalog) >> stt->scale;

That is, the micVol computed by the rules may exceed the specified maximum maxAnalog, which means that even setting the analog gain to its maximum cannot reach the target volume. WebRtcAgc_AddMic monitors this event and provides extra compensation via table lookup.

Gain table kGainTableAnalog:

static const uint16_t kGainTableAnalog[GAIN_TBL_LEN] = {
    4096, 4251, 4412, 4579,  4752,  4932,  5118,  5312,  5513,  5722, 5938,
    6163, 6396, 6638, 6889,  7150,  7420,  7701,  7992,  8295,  8609, 8934,
    9273, 9623, 9987, 10365, 10758, 11165, 11587, 12025, 12480, 12953};
// apply gain
sample = (in_mic[j][i] * gain) >> 12;  // after the right shift, the gain is quantized to 1.0~3.16

Figure 4 Gain curve of the gain table

The compensation steps through the table toward the target with a fixed step of 1 each time; gainTableIdx = 0 means a gain of 1.0x, i.e., nothing is done.

/* Increment through the table towards the target gain.
 * If micVol drops below maxAnalog, we allow the gain
 * to be dropped immediately. */
if (stt->gainTableIdx < targetGainIdx) {
    stt->gainTableIdx++;
} else if (stt->gainTableIdx > targetGainIdx) {
    stt->gainTableIdx--;
}
gain = kGainTableAnalog[stt->gainTableIdx];
// apply gain
sample = (in_mic[j][i] * gain) >> 12;

Existing problems:

  1. The analog value is adjusted upward even when there is no speech;

  2. The adjustment step is too large, causing audible volume fluctuations;

  3. Frequent calls to the operating system API cause unnecessary performance overhead and, in severe cases, thread blocking;
  4. The digital stage's gain capability is limited and cannot fully complement the analog gain;
  5. Clipping detection is not sensitive enough, so the analog gain cannot be turned down in time;
  6. The AddMic module lacks precision, and there is a risk of clipping during compensation.

Adaptive Digital Gain-AdaptiveDigital

Entertainment, social networking, online education, and other fields built on audio and video communication rely on a wide variety of smartphones and tablets. However, these mobile devices offer no interface like the PC's for adjusting analog gain. The distance from the sound source to the device, the loudness of the source, and the hardware's capture capability all affect the captured volume, and relying on fixed digital gain alone has very limited effect. In a multi-person meeting in particular, you will clearly notice that different speakers' volumes are inconsistent and fluctuate considerably.

To solve this problem, the WebRTC engineers imitated the PC's analog gain adjustment and added a virtual microphone adjustment module, WebRtcAgc_VirtualMic, on top of the analog gain framework. It uses two arrays of length 128, the gain curve kGainTableVirtualMic and the suppression curve kSuppressionTableVirtualMic, to simulate the PC's analog gain (the gain part is a monotonically increasing straight line, the suppression part a monotonically decreasing concave curve). The former provides 1.0~3.0x gain capability, the latter 1.0~0.1x attenuation.

Figure 5 Gain curve and suppression curve

The core logic is consistent with adaptive analog gain:

  1. As in adaptive analog mode, WebRtcAgc_ProcessAnalog is still used to update micVol;
  2. In the WebRtcAgc_VirtualMic module, update the gain index gainIdx according to micVol, and look up the table to get the new gain;
/* Set the desired volume level */
  gainIdx = stt->micVol;
  if (gainIdx > 127) {
    gain = kGainTableVirtualMic[gainIdx - 128];
  } else {
    gain = kSuppressionTableVirtualMic[127 - gainIdx];
  }
  3. Apply the gain; whenever saturation is detected, gainIdx is stepped down;
/* Saturation detection: update the gain */
if (tmpFlt > 32767) {
    tmpFlt = 32767;
    gainIdx--;
    if (gainIdx >= 127) {
        gain = kGainTableVirtualMic[gainIdx - 127];
    } else {
        gain = kSuppressionTableVirtualMic[127 - gainIdx];
    }
}
if (tmpFlt < -32768) {
    tmpFlt = -32768;
    gainIdx--;
    if (gainIdx >= 127) {
        gain = kGainTableVirtualMic[gainIdx - 127];
    } else {
        gain = kSuppressionTableVirtualMic[127 - gainIdx];
    }
}
  4. The gained data is passed to WebRtcAgc_AddMic, which checks whether micVol exceeds the maximum value maxAnalog to decide whether extra compensation must be activated.

The audio data flow block diagram is as follows:

The problems are similar to those of adaptive analog gain. One issue worth stating clearly: the adaptive digital gain adjustment is not very responsive. When the input volume fluctuates, it is prone to over-suppression or over-amplification. A concrete example: when the volume is high, the suppression curve is invoked; if a quiet passage follows, it is suppressed further before the gain is raised again; and if a loud passage then follows the quiet one, it clips, requiring the limiter to step in and compress, which distorts the sound quality.

Summary and optimization direction

For a better listening experience, the goal of the AGC algorithm is to ignore differences in device capture and still equalize the audio at the sending end to the ideal level, solving the core issues of low volume, clipping, and the volume fluctuations among different speakers after mixing.

Given the problems of the modes discussed in the chapters above, we draw the following lessons:

  1. Analog gain adjustment needs fixes for issues such as overly frequent adjustments and overly large steps;
  2. The AddMic stage lacks precision; predict ahead of time rather than waiting for clipping to be detected before reacting;
  3. The PC's digital gain and analog gain modules are independent of each other, but their effects should complement one another;
  4. AGC's volume equalization should not hurt MOS; do not sacrifice MOS in pursuit of responsiveness.
In addition, the many bit operations in the code can be off-putting on first reading. I hope that, once you have grasped the overall framework, you can focus on the core code, practice with it, and digest it gradually.

Finally, let's look at the optimized results:

  1. When the analog gain is adjusted, the captured signal's volume fluctuates; after the digital stage equalizes it, the audio envelope is well preserved and the volume is consistent overall;

  2. With speech in ambient noise, the volume fluctuation of the speech part decreases after AGC, while the noise part is not noticeably amplified;

  3. In a more extreme case, the quiet passages are boosted by up to 35 dB, and the convergence time stays within 10 s.

"Video Cloud Technology": your most noteworthy audio and video technology public account, pushing practical technical articles from the front line of Alibaba Cloud every week and exchanging ideas with first-class engineers in the audio and video field. Reply [Technology] in the official account backstage to join the Alibaba Cloud Video Cloud technology exchange group, discuss audio and video technology with the author, and get more up-to-date industry information.
