Introduction: This article comprehensively analyzes the basic framework of WebRTC AGC with examples, exploring its basic principles, mode differences, existing problems, and optimization directions.

In a previous article we covered the basic principles and optimization directions of acoustic echo cancellation (AEC: Acoustic Echo Cancellation), one of the "A"s in WebRTC audio 3A processing. In this chapter we turn to another "A": automatic gain control (AGC: Auto Gain Control).

Author|Luo Shen

Review|Taiyi

Preface

Automatic Gain Control (AGC: Auto Gain Control) is, in my view, the audio algorithm module with the longest processing chain and the greatest impact on sound quality and subjective listening. On the one hand, AGC must act on the sending end to cope with the diverse capture hardware of mobile and PC devices; on the other hand, AGC is often used as a compressor on the receiving end to balance the mixed signal and prevent clipping. The most direct manifestation of device diversity is the difference in audio capture: a capture volume that is too high causes clipping, while one that is too low is very hard to hear on the other end.

In real audio and video calls, different participants speak at different volumes, forcing listeners to adjust the playback volume frequently; users wearing headsets can be "hit in the ears" by sudden loud bursts. Equalizing the sender's volume is therefore particularly important in these scenarios. An excellent automatic gain control algorithm can unify the audio volume and greatly alleviate the fluctuations caused by differences in capture devices, speaker loudness, and speaker-to-microphone distance.

The position of AGC in WebRTC

Before discussing the AGC audio processing framework, let's look at where AGC sits in real-time audio and video communication. Figure 1 shows, for a single device, the audio path from capture to encoding on the sending side and from decoding to playback on the receiving side. On the sending end, AGC acts as both an equalizer and a compressor to adjust the volume of the pushed stream; on the receiving end it serves only as a compressor to prevent clipping after mixing. In theory, if the sender-side AGC is robust enough, a compressor on the receiving side is sufficient; some vendors nevertheless run another AGC there to further reduce the volume differences between voices after mixing. image.png Figure 1 Block diagram of the uplink and downlink audio processing flow in WebRTC

The core parameters of AGC

First, let's look at the relationship between a sample value and decibels (dB), taking 16-bit quantized audio samples as an example: dB = 20 * log10(Sample / 32768.0), which is consistent with the dB scale on the right side of Adobe Audition.

Expressed as an amplitude: the minimum absolute value of a 16-bit sample is 0 and the maximum is 32768 (see the amplitude ordinate in the right column below). image.png

Expressed in decibels: the maximum value is 0 dB (see the decibel ordinate in the right column below); a fairly loud signal reaches about -3 dB, so 3 is often used as the AGC target volume. image.png
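As a quick sanity check of the formula, here is a minimal standalone C snippet (the sample values are arbitrary):

#include <math.h>
#include <stdio.h>

int main(void) {
  /* map a few 16-bit amplitudes to dB via dB = 20 * log10(sample / 32768.0) */
  short samples[] = {32767, 23170, 16384, 3277};
  for (int i = 0; i < 4; i++) {
    printf("sample %6d -> %6.2f dB\n", samples[i],
           20.0 * log10(samples[i] / 32768.0));
  }
  return 0; /* prints roughly 0.00, -3.01, -6.02, -20.00 */
}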

The core parameters are:

typedef struct {
  int16_t targetLevelDbfs;    // target volume
  int16_t compressionGaindB;  // gain capability
  uint8_t limiterEnable;      // limiter switch
} AliyunAgcConfig;

Target volume, targetLevelDbfs: the target value of the volume equalization result; if set to 1, the target output volume is -1 dB;

Gain capability, compressionGaindB: the maximum audio gain; if set to 12 dB, the signal can be amplified by at most 12 dB;

Limiter switch, limiterEnable: generally used together with targetLevelDbfs. compressionGaindB determines how much a quiet signal can be amplified, while the limiter clamps the part that exceeds targetLevelDbfs to avoid clipping.
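Putting the three parameters together, a minimal configuration sketch might look like this (the values are illustrative, not recommendations):

/* Equalize toward -3 dBFS, allow at most 9 dB of amplification, and enable
 * the limiter so anything pushed past the target level is clamped. */
AliyunAgcConfig config;
config.targetLevelDbfs = 3;    /* target output level: -3 dBFS */
config.compressionGaindB = 9;  /* maximum gain applied to quiet input */
config.limiterEnable = 1;      /* clamp overshoot above the target */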

The core modes of AGC

In addition to the above three core parameters, WebRTC AGC provides the following three modes for different capture devices:

enum {
  kAgcModeUnchanged,
  kAgcModeAdaptiveAnalog,  // adaptive analog mode
  kAgcModeAdaptiveDigital, // adaptive digital gain mode
  kAgcModeFixedDigital     // fixed digital gain mode
};

Below, we describe these three modes in terms of basic function, applicable scenarios, signal flow diagrams, and existing problems, in conjunction with examples.

Fixed Digital Gain-FixedDigital

Fixed digital gain is the most basic gain mode and the core of AGC; the other two modes are extensions of it. It amplifies the signal with a fixed gain, capped by the configured gain capability compressionGaindB; used together with the limiter, the output is capped at the configured target volume targetLevelDbfs.

In fixed digital gain mode, only the core function WebRtcAgc\_ProcessDigital is used to equalize the input signal volume. Since there is no feedback mechanism, the processing flow is very simple: once the parameters are set, the signal goes through the following process:

image.png

Fixed digital gain mode is the core model; two aspects in particular are worth studying in depth:

The basic idea of the speech detection module WebRtcAgc\_ProcessVad

In real-time communication, the near-end signal picked up by the microphone contains components of the far-end signal. When WebRtcAgc\_ProcessVad tracks the envelope of the actual near-end signal, the interference from the far-end signal must be eliminated, to prevent residual echo from contaminating the statistics of the near-end envelope and other parameters. The most traditional VADs distinguish speech segments from non-speech segments using metrics such as energy, zero-crossing rate, and noise thresholds. WebRTC AGC provides a new way to roughly identify speech segments:

  1. Calculate the short-term mean and variance, which describe the instantaneous changes of the voice envelope and track it accurately, as shown by the red curve on the left of Figure 2;
// update short-term estimate of mean energy level (Q10)
tmp32 = state->meanShortTerm * 15 + dB;
state->meanShortTerm = (int16_t)(tmp32 >> 4);
// update short-term estimate of variance in energy level (Q8)
tmp32 = (dB * dB) >> 12;
tmp32 += state->varianceShortTerm * 15;
state->varianceShortTerm = tmp32 / 16;
// update short-term estimate of standard deviation in energy level (Q10)
tmp32 = state->meanShortTerm * state->meanShortTerm;
tmp32 = (state->varianceShortTerm << 12) - tmp32;
state->stdShortTerm = (int16_t)WebRtcSpl_Sqrt(tmp32);

2. Calculate the long-term mean and variance, which describe the slow overall trend of the signal and outline its "center of gravity"; this estimate is relatively smooth, so a threshold on it makes a convenient detection condition, as shown by the blue curve on the left of Figure 2;

// update long-term estimate of mean energy level (Q10)
tmp32 = state->meanLongTerm * state->counter + dB;
state->meanLongTerm = WebRtcSpl_DivW32W16ResW16(tmp32, WebRtcSpl_AddSatW16(state->counter, 1));
// update long-term estimate of variance in energy level (Q8)
tmp32 = (dB * dB) >> 12;
tmp32 += state->varianceLongTerm * state->counter;
state->varianceLongTerm = WebRtcSpl_DivW32W16(tmp32, WebRtcSpl_AddSatW16(state->counter, 1));

3. Calculate the standard score, which describes how far the short-term mean deviates from the "center of gravity"; frames well above the center are very likely to contain voice activity;

tmp16 = 3 << 12;  // Q12 scale factor (defined earlier in the WebRTC source)
tmp32 = tmp16 * (int16_t)(dB - state->meanLongTerm);
tmp32 = WebRtcSpl_DivW32W16(tmp32, state->stdLongTerm);
state->logRatio = (int16_t)(tmp32 >> 6);

image.png

image.gif Figure 2 Left: long- and short-term mean and variance; Right: input and ...
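To summarize the three steps, here is a compact floating-point sketch of the same idea (illustrative only; the real WebRtcAgc\_ProcessVad works in fixed point, as in the Q-format code shown above):

#include <math.h>

typedef struct {
  double mean_short, var_short; /* fast envelope statistics */
  double mean_long, var_long;   /* slow "center of gravity" statistics */
  int counter;                  /* number of frames seen so far */
} VadSketch;

/* Returns a standard score; large positive values suggest voice activity. */
double vad_score(VadSketch* s, double dB) {
  /* step 1: short-term stats, exponential average with weight 15/16 */
  s->mean_short = (15.0 * s->mean_short + dB) / 16.0;
  s->var_short = (15.0 * s->var_short + dB * dB) / 16.0;
  /* step 2: long-term stats, cumulative average over all frames */
  s->mean_long = (s->mean_long * s->counter + dB) / (s->counter + 1);
  s->var_long = (s->var_long * s->counter + dB * dB) / (s->counter + 1);
  s->counter++;
  /* step 3: standardized deviation of the current level from the center */
  double std_long = sqrt(s->var_long - s->mean_long * s->mean_long);
  return std_long > 0.0 ? (dB - s->mean_long) / std_long : 0.0;
}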

How WebRtcAgc\_ProcessDigital applies gain to audio data

The three core parameters all revolve around the fixed digital gain mode. What we need to figure out is how the core function of WebRTC AGC, WebRtcAgc\_ProcessDigital, applies gain to the audio data.

  1. Calculate the gain table gainTable from the specified targetLevelDbfs and compressionGaindB;

/* Calculate the gain table gainTable from the configured target level and gain capability */
if (WebRtcAgc_CalculateGainTable(&(stt->digitalAgc.gainTable[0]), stt->compressionGaindB, stt->targetLevelDbfs, stt->limiterEnable, stt->analogTarget) == -1) {
    return -1;
}

In this step, the gain table gainTable can be understood as quantizing the signal energy value (the square of the amplitude). If we fix targetLevelDbfs and set compressionGaindB to 3 dB through 15 dB, the corresponding gain-table curves are as follows; you can see that the larger the configured gain capability, the higher the curve, as shown in the figure below.

image.png

You may be wondering why the length of gainTable is 32. It actually corresponds to the 32 bits of an int: the energy value of 16-bit data lies in [0, 32768^2] and can be represented by an unsigned int. From high to low, the position of the highest set bit gives the order of magnitude, called the integer part (intPart), and the bits that follow form the fractional part (fracPart). Therefore any amplitude in [0, 32768] maps to a gain value in the digital gain table. Next we discuss how to look up the table and apply the gain value to complete the volume equalization.

/** Selected key source code */
/** Extract the integer part and the fractional part */
intPart = (uint16_t)(absInLevel >> 14);          // extract the integral part
fracPart = (uint16_t)(absInLevel & 0x00003FFF);  // extract the fractional part
......
/** Build the digital gain table entry from the integer and fractional parts */
gainTable[i] = (1 << intPart) + WEBRTC_SPL_SHIFT_W32(fracPart, intPart - 14);
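To make the "highest set bit = integer part" decomposition above concrete, here is a small standalone sketch (plain C; the GCC/Clang builtin __builtin_clz stands in for WebRtcSpl\_NormU32):

#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint32_t amplitudes[] = {8000, 16000, 32768};
  for (int i = 0; i < 3; i++) {
    uint32_t energy = amplitudes[i] * amplitudes[i];
    int zeros = __builtin_clz(energy);   /* leading zeros, like WebRtcSpl_NormU32 */
    int intpart = 31 - zeros;            /* position of the highest set bit */
    uint32_t fracpart = energy & ((1u << intpart) - 1); /* the remaining low bits */
    printf("amp %5u: energy 0x%08X, zeros %2d, intpart %2d, fracpart 0x%X\n",
           (unsigned)amplitudes[i], (unsigned)energy, zeros, intpart,
           (unsigned)fracpart);
  }
  return 0; /* e.g. amp 8000 -> 6 leading zeros, so the lookup uses gainTable[6] */
}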

2. Look up the gainTable according to the input signal envelope, and apply the gain to the input signal;

Based on the hearing curve of the human ear, AGC applies gain piecewise. A frame of 160 samples is divided into 10 segments of 16 samples each, so a segment gain array gains is introduced. The following code describes the relationship between the digital gain table and the gain array, directly reflecting the table-lookup process. The idea is similar to the gain-table calculation: first split the level into integer and fractional parts, then combine gain-table entries to compute the new gain value, including compensation for the fractional part.

// Translate signal level into gain, using a piecewise linear approximation
    // find number of leading zeros
    zeros = WebRtcSpl_NormU32((uint32_t)cur_level);
    if (cur_level == 0) {
      zeros = 31;
    }
    tmp32 = (cur_level << zeros) & 0x7FFFFFFF;
    frac = (int16_t)(tmp32 >> 19);  // Q12.
    tmp32 = (stt->gainTable[zeros - 1] - stt->gainTable[zeros]) * frac;
    gains[k + 1] = stt->gainTable[zeros] + (tmp32 >> 12);

The following code applies the segment gain array gains to the output signal. Each gain is shifted right by 16 bits to obtain the actual gain factor (the table entries are based on sample energy; shifting right by 16 bits here can be understood as finding the integer α such that the sample amplitude multiplied by α gets closest to 32768), and the result is multiplied directly into the output signal (which was copied from the input signal at the beginning of the function).

/** Apply the gain array gains to the output signal to complete the volume equalization */
  for (k = 1; k < 10; k++) {
    delta = (gains[k + 1] - gains[k]) * (1 << (4 - L2));
    gain32 = gains[k] * (1 << 4);
    // iterate over samples
    for (n = 0; n < L; n++) {
      for (i = 0; i < num_bands; ++i) {
        tmp32 = out[i][k * L + n] * (gain32 >> 4);
        out[i][k * L + n] = (int16_t)(tmp32 >> 16);
      }
      gain32 += delta;
    }
  }

Take the curve for compressionGaindB = 12 dB as an example. The upper figure shows the actual values of the computed digital gain table gainTable, and the lower figure shows the actual gain factors obtained after shifting right by 16 bits. You can see that with compressionGaindB = 12 dB, the maximum integer-part gain is 3. In theory a 12 dB gain amplifies by a factor of about 4; here the integer part contributes at most a factor of 3, and the fractional part supplies the remaining 0~1.0, which helps prevent clipping. Two simple examples:

image.gif image.png

A. For data with amplitude 8000, the envelope is cur\_level = 8000^2 = 0x3D09000. WebRtcSpl\_NormU32((uint32\_t)cur\_level) yields 6 leading zeros, and the table lookup gives an integer-part gain of stt->gainTable[6] = 3; that is, 8000 can safely be multiplied by 3, with the sub-1.0 remainder of the gain determined by fracPart;

B. For data with amplitude 16000, the envelope is cur\_level = 16000^2 = 0xF424000. WebRtcSpl\_NormU32((uint32\_t)cur\_level) yields 4 leading zeros, and the table lookup gives an integer-part gain of stt->gainTable[4] = 2. Note that 16000 * 2 = 32000, so the remaining equalization toward the target volume is handled by the limiter; the details are not expanded here.

This means that for any amplitude in [0, 32768], a gain of the specified number of decibels whose result does not exceed 32768 can be realized by selecting the appropriate elements of the digital gain table gainTable.

As for the target volume targetLevelDbfs and the limiter, their application is reflected in WebRtcAgc\_ProcessDigital and related functions; we won't elaborate here, but you can read the source code to learn more.

Let's use a few cases to look at the effects and problems of the fixed digital gain mode. First, set targetLevelDbfs = 1, compressionGaindB = 12.

  1. The capture volume is low, and the improvement after equalization is not obvious;

The device captures at -24 dB; after equalization the volume is only -12 dB, which still sounds low overall. image.png

2. The capture volume is high, and the noise floor is noticeably amplified;

The device captures at -9 dB, and after equalization the volume reaches -1 dB. The overall volume sounds normal, but the dynamic contrast between speech frames is reduced, mainly because the noise in non-speech segments is boosted substantially. The main problem in this case: when the capture volume is already high, if the environmental noise is strong and noise suppression is weak, then once compressionGaindB is set to a large value, the speech part is limited to targetLevelDbfs, but the noise floor of the non-speech part receives the full boost, and participants at the far end hear obvious noise. image.png

3. The captured volume fluctuates a lot (take an artificially spliced loud-to-quiet audio clip as an example), and equalization still cannot smooth it out;

image.png

Adaptive Analog Gain-AdaptiveAnalog

Before discussing adaptive analog gain, we need to clarify the PC-side functions that affect the capture volume:

1. The PC supports adjusting the capture volume in the range 0~1.0, which the WebRTC client code internally maps to 0~255;

/** Taking Mac as an example: the microphone sensitivity is converted to 0~255 */
int32_t AudioMixerManagerMac::MicrophoneVolume(uint32_t& volume) const {
  ......
    // vol 0.0 to 1.0 -> convert to 0 - 255
    volume = static_cast<uint32_t>(volFloat32 * 255 + 0.5);
    ......
  return 0;
}

2. Most Windows notebooks have built-in microphone arrays and provide microphone-array enhancement algorithms that reduce noise while also providing an extra 0~10 dB of gain (ranges vary by model; some Lenovo devices gain as much as 36 dB), as shown in Figure 3;
image.png Figure 3 Left: analog gain adjustment on Mac. Right: built-in gain capability of the microphone array on Windows

Because so many modules control the volume, the AGC algorithm on the PC side is more delicate. Many online customers set unreasonable default values, which directly degrades the audio and video call experience:

1. An excessive capture volume amplifies the noise significantly and makes the voice clip;

image.png

2. An excessive capture volume also causes large nonlinear distortion when the playback signal is picked up again by the microphone, which is no small challenge for the echo cancellation algorithm;

image.png

3. A capture volume that is too low, combined with limited digital gain, leaves the far end unable to hear clearly; image.gif image.png

Most users do not know that PC devices offer manual adjustment of the capture gain, and it is almost impossible to rely on online users (especially the many primary school students in education scenarios) to adjust the analog gain themselves after hearing abnormal audio. A mechanism that adjusts the gain value dynamically makes the AGC algorithm far more practical, with the digital gain stage equalizing the near-end signal to the ideal level. Therefore, the WebRTC engineers developed the adaptive analog gain mode, which adjusts the raw capture volume through a feedback mechanism. The goal is to work with the digital gain module to find the most suitable microphone gain value and feed it back to the device layer, so that the near-end data reaches the target level after digital gain. The audio data flow block diagram is as follows: image.gif image.png

There are two main additions on top of the fixed digital gain:

1. After the digital gain, a new analog gain update module, WebRtcAgc\_ProcessAnalog, is added. Based on the current analog gain value inMicLevel (mapped to 0~255 in WebRTC) and other intermediate parameters, it calculates the analog gain value outMicLevel to be applied next and feeds it back to the device layer.

// Scale from VoE to ADM level range.
uint32_t new_voe_mic_level = shared_->transmit_mixer()->CaptureLevel();
if (new_voe_mic_level != voe_mic_level) {
    // Return the new volume if AGC has changed the volume.
    new_mic_volume = static_cast<int>((new_voe_mic_level * max_volume +static_cast<int>(kMaxVolumeLevel / 2)) / kMaxVolumeLevel);
    return new_mic_volume;
}

2. Some device manufacturers set the default microphone-array volume quite low, so the capture remains very quiet even with the analog gain at maximum. In that case the digital compensation module WebRtcAgc\_AddMic is needed, which can amplify the raw capture by a further 1.0~3.16 times, as shown in Figure 4. So how do we know that the amplification is insufficient? The analog gain update module in the previous step actually outputs the smaller of micVol and the maximum value maxAnalog (255):

*outMicLevel = WEBRTC_SPL_MIN(stt->micVol, stt->maxAnalog) >> stt->scale;

That is, the value micVol calculated by the update rules may exceed the specified maximum maxAnalog, which means the target volume cannot be reached even with the analog gain at its maximum. WebRtcAgc\_AddMic monitors this event and provides extra compensation via a table lookup.

Gain table kGainTableAnalog:

static const uint16_t kGainTableAnalog[GAIN_TBL_LEN] = {
    4096, 4251, 4412, 4579,  4752,  4932,  5118,  5312,  5513,  5722, 5938,
    6163, 6396, 6638, 6889,  7150,  7420,  7701,  7992,  8295,  8609, 8934,
    9273, 9623, 9987, 10365, 10758, 11165, 11587, 12025, 12480, 12953};
// apply gain
sample = (in_mic[j][i] * gain) >> 12; // after the Q12 right shift, the gain factor lies in 1.0~3.16

image.png

Figure 4 Gain curve of the gain table

The gain index is stepped toward the target gain by 1 each time; gainTableIdx = 0 means a gain factor of 1, i.e., no compensation is applied.

/* Increment through the table towards the target gain.
 * If micVol drops below maxAnalog, we allow the gain
 * to be dropped immediately. */
if (stt->gainTableIdx < targetGainIdx) {
    stt->gainTableIdx++;
} else if (stt->gainTableIdx > targetGainIdx) {
    stt->gainTableIdx--;
}
gain = kGainTableAnalog[stt->gainTableIdx];
// apply gain
sample = (in_mic[j][i] * gain) >> 12;
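As a quick check on kGainTableAnalog above: the entries are in Q12, so the linear factor is the entry divided by 4096. A short illustrative sketch confirming the 1.0~3.16 range, which corresponds to roughly 10 dB of extra headroom:

#include <math.h>
#include <stdio.h>

int main(void) {
  /* endpoints of kGainTableAnalog, interpreted as Q12 gain factors */
  double lo = 4096 / 4096.0;   /* index 0: 1.00x, i.e. no compensation */
  double hi = 12953 / 4096.0;  /* last index: ~3.16x */
  printf("range %.2fx .. %.2fx, span %.1f dB\n", lo, hi, 20.0 * log10(hi / lo));
  return 0; /* prints: range 1.00x .. 3.16x, span 10.0 dB */
}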

The existing problems of this mode:

1. The analog value is adjusted upward during periods with no voice; image.png

2. The adjustment step is too large, causing obvious volume fluctuations; image.png

3. Frequent calls to the operating system API cause unnecessary performance overhead and, in severe cases, thread blocking;

4. The gain capability of the digital part is limited and cannot complement the analog gain;

5. Clipping detection is not sensitive enough, so the analog gain cannot be turned down in time;

6. The accuracy of the AddMic module is insufficient, and there is a risk of clipping.

Adaptive Digital Gain-AdaptiveDigital

Entertainment, social networking, online education, and other fields built on audio and video communication rely on a wide variety of smartphones and tablets. However, these mobile devices do not expose an interface for adjusting the analog gain the way PCs do. The distance between the sound source and the device, the loudness of the source, and the hardware capture capability all affect the capture volume, so relying on fixed digital gain alone is of very limited benefit. Especially in multi-party meetings, you will clearly notice that different speakers have inconsistent volumes, which feels like large volume fluctuations.

To solve this problem, the WebRTC engineers imitated the PC-side analog gain adjustment: on top of the analog gain framework they added a virtual microphone adjustment module, WebRtcAgc\_VirtualMic, which uses two arrays of length 128 — the gain curve kGainTableVirtualMic and the suppression curve kSuppressionTableVirtualMic — to simulate PC-side analog gain (the gain part is a monotonically increasing straight line; the suppression part is a monotonically decreasing concave curve). The former provides 1.0~3.0 times amplification, and the latter 1.0~0.1 times attenuation.

image.png Figure 5 Gain curve and suppression curve
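A floating-point sketch of the mapping the two tables implement (illustrative only; the real tables are fixed-point arrays indexed by micVol):

#include <math.h>

/* micVol in 0..255: values above 127 amplify (1.0x..3.0x along a straight
 * line); values at or below 127 attenuate (1.0x down to ~0.1x on a curve). */
float virtual_mic_gain(int micVol) {
  if (micVol > 127) {
    return 1.0f + 2.0f * (micVol - 128) / 127.0f;  /* like kGainTableVirtualMic */
  }
  return powf(10.0f, -(127 - micVol) / 127.0f);    /* like kSuppressionTableVirtualMic */
}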

The core logic is consistent with the adaptive analog gain:

  1. As in the adaptive analog mode, WebRtcAgc\_ProcessAnalog is still used to update micVol;
  2. WebRtcAgc\_VirtualMic then updates the gain index gainIdx from micVol and looks up the table to obtain the new gain;
/* Set the desired volume level */
  gainIdx = stt->micVol;
  if (gainIdx > 127) {
    gain = kGainTableVirtualMic[gainIdx - 128];
  } else {
    gain = kSuppressionTableVirtualMic[127 - gainIdx];
  }

3. Apply the gain; if saturation is detected during this step, gainIdx is stepped down and the gain re-looked-up;

/* On saturation, step the gain index down and update the gain */
if (tmpFlt > 32767) {
    tmpFlt = 32767;
    gainIdx--;
    if (gainIdx >= 127) {
        gain = kGainTableVirtualMic[gainIdx - 127];
    } else {
        gain = kSuppressionTableVirtualMic[127 - gainIdx];
    }
}
if (tmpFlt < -32768) {
    tmpFlt = -32768;
    gainIdx--;
    if (gainIdx >= 127) {
        gain = kGainTableVirtualMic[gainIdx - 127];
    } else {
        gain = kSuppressionTableVirtualMic[127 - gainIdx];
    }
}

4. The gained data is passed to WebRtcAgc\_AddMic, which checks whether micVol exceeds the maximum value maxAnalog to decide whether additional compensation needs to be activated.

The audio data flow block diagram is as follows: image.png

The existing problems are similar to those of the adaptive analog mode. One issue worth stating clearly is that the adaptive digital gain adjustment is not very responsive: when the input volume fluctuates, it is prone to over-boosting or over-compressing. A concrete example: when the volume is high, the suppression curve is invoked; if a quiet passage follows, it is compressed even further before the gain is raised again; if a loud passage then follows the quiet one, it clips, requiring the limiter to step in and clamp it, which distorts the sound. image.png

image.png
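A toy numeric illustration of the pumping effect just described (completely illustrative; the constants do not come from WebRTC): a slowly adapting gain keeps correcting for the previous loudness, so a loud passage arriving right after a quiet one overshoots and has to be clamped by the limiter.

#include <stdio.h>

int main(void) {
  double target = 0.7;  /* desired envelope, full scale = 1.0 */
  double gain = 1.0;    /* slowly adapted digital gain */
  double in[] = {0.9, 0.9, 0.1, 0.1, 0.1, 0.9, 0.9}; /* loud -> quiet -> loud */
  for (int i = 0; i < 7; i++) {
    double out = in[i] * gain;
    if (out > 1.0) out = 1.0;       /* limiter clamps the overshoot (distortion) */
    gain += 0.3 * (target - out);   /* slow feedback toward the target level */
    printf("in=%.1f out=%.2f gain=%.2f\n", in[i], out, gain);
  }
  return 0; /* by the time the loud frames return, the gain has crept above 1.4 */
}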

Summary and optimization direction

For a better listening experience, the goal of the AGC algorithm is to mask differences in capture devices and still equalize the audio volume at the sending end to the ideal level, solving the core problems of low volume, clipping, and fluctuating loudness of different speakers after multi-party mixing.

In view of the problems of each mode discussed in the chapters above, several lessons emerge:

  1. For analog gain adjustment, issues such as overly frequent adjustments and overly large steps need to be fixed;
  2. The accuracy of the AddMic part is insufficient; predict ahead of time instead of waiting for clipping to be detected before backing off;
  3. The PC-side digital gain and analog gain modules are independent of each other, but their effects should compensate each other;
  4. AGC's volume equalization should not hurt MOS; do not sacrifice MOS in pursuit of responsiveness.
In addition, the many bit operations in the code can be off-putting on a first read. I hope everyone can grasp the core code, practice after forming an overall picture of the framework, and then absorb and digest it.

Finally, let's look at the optimized results:

1. When analog gain adjustments cause the captured signal volume to fluctuate, the digital equalization stage preserves the audio envelope well and keeps the overall volume consistent. image.png

2. With voice in a noisy environment, after AGC the volume of the voice part is equalized while the noise part is not significantly boosted; image.gif image.png

3. In a more extreme case, the quiet voice segment is boosted by up to 35 dB, with the convergence time kept within 10 s;

image.png


"Video Cloud Technology" Your most noteworthy audio and video technology public account, pushes practical technical articles from the front line of Alibaba Cloud every week, and exchanges and exchanges with first-class engineers in the audio and video field. The official account backstage reply [Technology] You can join the Alibaba Cloud Video Cloud Technology Exchange Group, discuss audio and video technologies with the author, and get more industry latest information.
Copyright Statement: content of this article is contributed spontaneously by Alibaba Cloud real-name registered users, and the copyright belongs to the original author. The Alibaba Cloud Developer Community does not own its copyright and does not assume corresponding legal responsibilities. For specific rules, please refer to the "Alibaba Cloud Developer Community User Service Agreement" and the "Alibaba Cloud Developer Community Intellectual Property Protection Guidelines". If you find suspected plagiarism in this community, fill in the infringement complaint form to report it. Once verified, the community will immediately delete the suspected infringing content.

阿里云开发者
3.2k 声望6.3k 粉丝

阿里巴巴官方技术号,关于阿里巴巴经济体的技术创新、实战经验、技术人的成长心得均呈现于此。