In real-time audio interaction, sound quality and experience depend not only on the codec discussed in our previous article, but also on device-side modules such as noise reduction, echo cancellation, and automatic gain control. In this article we focus on the echo cancellation and noise reduction modules: the technical challenges they face in real-time interactive scenarios, and our solution ideas and practices.
Optimizing the three algorithm modules of echo cancellation
In voice communication systems, echo cancellation has always been a core algorithm. Generally speaking, its effectiveness is affected by many factors, including:
- Acoustic environment, including reflection, reverberation, etc.;
- The acoustic design of the communication device itself, including the sound cavity design and the non-linear distortion of the device, etc.;
- System performance, including processor computing power and the operating system's thread-scheduling behavior.
Agora's echo cancellation algorithm was designed from the outset with performance, robustness, and universality as its optimization goals, qualities that are essential to an excellent audio and video SDK.
First of all, where does echo come from? Simply put, your voice is played out of the other party's loudspeaker, picked up by their microphone, and transmitted back to your end, where you hear it as an echo. To eliminate the echo, we need to design an algorithm that removes this sound from the microphone signal.
So how does the Acoustic Echo Cancellation (AEC) module cancel the echo? The specific steps are shown in the diagram below:
- The first step is to find the delay between the reference signal (the loudspeaker signal, the blue curve) and the microphone signal (the red curve), that is, delay = T in the figure.
- The second step is to estimate the linear echo component in the microphone signal from the reference signal and subtract it from the microphone signal, yielding the residual signal (the black curve).
- The third step is to suppress the remaining residual echo in the residual signal through nonlinear processing.
Corresponding to these three steps, echo cancellation consists of three major algorithm modules:
- Delay Estimation
- Linear Adaptive Filter
- Nonlinear Processing
Among them, "delay estimation" determines the lower bound of AEC performance, the "linear adaptive filter" determines its upper bound, and "nonlinear processing" determines the final call experience, especially the balance between echo suppression and double-talk quality.
Note: Double talk means that two or more parties in an interactive session speak at the same time. The voice of one party may be suppressed, producing choppy audio. This is caused by "overcorrection" in the echo cancellation algorithm, which removes audio that should have been kept.
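Before diving into each module, the overall flow can be pictured with a toy end-to-end example. The sketch below chains the three stages on offline NumPy signals: a naive correlation-based delay search, an NLMS filter, and a deliberately crude stand-in for nonlinear processing. All names and thresholds are illustrative; this is not the Agora implementation, which runs per frame with streaming state.

```python
import numpy as np

def toy_aec(ref, mic, taps=256, mu=0.5):
    """Chain the three AEC stages on offline signals (illustrative only)."""
    # Stage 1: delay estimation via a naive cross-correlation peak search.
    corr = np.correlate(mic, ref, mode="full")
    delay = max(int(np.argmax(corr)) - (len(ref) - 1), 0)
    aligned = np.concatenate([np.zeros(delay), ref])[:len(mic)]

    # Stage 2: linear adaptive filter (NLMS) subtracts the linear echo.
    w = np.zeros(taps)
    out = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = aligned[n - taps + 1:n + 1][::-1]   # recent reference history
        e = mic[n] - w @ x                      # residual after echo estimate
        w += mu * e * x / (x @ x + 1e-8)        # normalized LMS update
        out[n] = e

    # Stage 3: nonlinear processing, crudest possible stand-in:
    # duck the residual wherever the loudspeaker signal is active.
    out[np.abs(aligned) > 0.01] *= 0.1
    return out
```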
Next, we will discuss the technical challenges and optimization ideas around each of these three algorithm modules.
1. Delay estimation
Affected by the specific system implementation, when the reference signal and the microphone signal are delivered to the AEC module for processing, there is a time offset between their data buffers; this is the "delay = T" we saw in the figure above. Suppose the device producing the echo is a mobile phone: after the sound leaves the loudspeaker, part of it travels through the device body to the microphone, and part may also return to the microphone through the external environment. The delay therefore includes the lengths of the device's capture and playback buffers, the time the sound spends traveling through the air, and the startup time difference between the playback and capture threads. Because so many factors contribute, the delay differs across systems, devices, and SDK implementations. It may stay fixed during a call, or it may change midway (so-called overrun and underrun). This is why an AEC algorithm may work well on device A but perform worse on another device. Accurate delay estimation is a prerequisite for AEC to work at all: a large estimation error can cause AEC performance to drop sharply or fail entirely, and the inability to track delay changes quickly is a major cause of intermittent echo.
Enhancing the robustness of delay estimation
Traditional algorithms usually determine the delay by computing the correlation between the reference and microphone signals. The correlation can be computed in the frequency domain; a typical method is the Binary Spectrum. By checking whether the signal energy at each frequency point exceeds a threshold, the reference and microphone signals are mapped into two-dimensional 0/1 arrays, and the delay is found by repeatedly shifting the array offset. The newer WebRTC AEC3 algorithm runs multiple NLMS linear filters in parallel to find the delay; this achieves good detection speed and robustness, but at a very high computational cost. When computing the cross-correlation of the two signals in the time domain, an obvious problem is that speech contains many harmonic components and is time-varying. Its correlation often exhibits multiple peaks, some of which do not represent the true delay, and the algorithm is easily disturbed by noise.
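To make the Binary Spectrum idea concrete, here is a simplified flavor of it. The thresholding and scoring details are illustrative; real estimators track thresholds adaptively and validate candidate delays before reporting them.

```python
import numpy as np

def binary_spectra(frames, fft_size=128):
    """Map frames to 0/1 words: 1 where band energy exceeds a threshold.
    A fixed mean threshold here; real estimators track it adaptively."""
    spec = np.abs(np.fft.rfft(frames, n=fft_size, axis=1)) ** 2
    return (spec > spec.mean(axis=0, keepdims=True)).astype(np.uint8)

def binary_delay(ref_frames, mic_frames, max_lag=64):
    """Delay (in frames) = offset where the binary spectra agree most."""
    ref_bits = binary_spectra(ref_frames)
    mic_bits = binary_spectra(mic_frames)
    n = min(len(ref_bits), len(mic_bits)) - max_lag
    scores = [np.mean(ref_bits[:n] == mic_bits[lag:lag + n])
              for lag in range(max_lag)]
    return int(np.argmax(scores))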
Agora's delay estimation algorithm suppresses these spurious local maxima by de-correlating the signals, which greatly enhances the robustness of the algorithm. In the figure below, the left side shows the cross-correlation of the original signals and the right side shows the cross-correlation after the Agora SDK's preprocessing; the de-correlation clearly strengthens the delay estimate:
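As an illustration of de-correlation in general (not our production preprocessing), a classic embodiment of the same idea is the phase transform (PHAT) weighting in generalized cross-correlation: whitening the cross-spectrum flattens harmonic side peaks so the peak at the true delay dominates.

```python
import numpy as np

def gcc_phat(ref, mic, fs, max_delay_s=0.5):
    """Delay estimate via GCC-PHAT: whiten the cross-spectrum, then
    search the (now much sharper) correlation for its peak."""
    n = len(ref) + len(mic)
    cross = np.fft.rfft(mic, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_lag = int(fs * max_delay_s)
    cc = np.concatenate([cc[-max_lag:], cc[:max_lag + 1]])
    return (np.argmax(np.abs(cc)) - max_lag) / fs   # delay in seconds
```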
Making the algorithm adaptive while reducing computation
Generally, to reduce computation, delay estimation algorithms assume that the echo signal appears in a lower frequency band, so the signals can be down-sampled before being sent to the delay estimation module, lowering the algorithm's computational complexity. But across the tens of thousands of device models and the various audio routes on the market, this assumption often fails. The figure below shows the spectrum of the microphone signal of a Vivo X20 in headset mode; the echo is concentrated above 4 kHz, and traditional algorithms fail to cancel the echo in such cases. Agora's delay estimation algorithm searches the entire frequency band for the region where the echo appears and adaptively selects the region used to calculate the delay, ensuring an accurate delay estimate on any device and any audio route.
Figure: Microphone signal spectrum of a Vivo X20 in headset mode
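As an illustration of such a full-band search (not our production code), one could pick the sub-band where the reference and microphone signals are most coherent and run the delay search there, instead of assuming the echo sits below a fixed cutoff:

```python
import numpy as np
from scipy.signal import coherence

def pick_echo_band(ref, mic, fs, n_bands=8):
    """Return the (lo, hi) band with the strongest ref/mic coherence."""
    f, coh = coherence(ref, mic, fs=fs, nperseg=1024)
    edges = np.linspace(0, fs / 2, n_bands + 1)
    scores = [coh[(f >= lo) & (f < hi)].mean()
              for lo, hi in zip(edges[:-1], edges[1:])]
    best = int(np.argmax(scores))
    return edges[best], edges[best + 1]   # run the delay search here only
```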
Dynamically updating the audio algorithm database to improve device coverage
To ensure continuous iterative improvement of the algorithms, Agora maintains an audio algorithm database. We use a large number of different test devices to record combinations of reference and microphone signals in different acoustic environments, with the delay between each pair calibrated offline. Besides real recordings, the database also contains a large amount of simulated data covering different loudspeakers, reverberation intensities, noise floor levels, and types of nonlinear distortion. To measure the performance of the delay estimation algorithm, the delay between the reference and microphone signals can be changed randomly, so we can observe how the algorithm responds to sudden delay changes.
Therefore, to judge the quality of a delay estimation algorithm, we need to examine whether it can:
1. Adapt to as many devices and acoustic environments as possible, matching the appropriate algorithm in the shortest possible time for the device and environment at hand;
2. Promptly adjust its strategy after sudden, random delay changes.
The following compares the delay estimation performance of the Agora SDK against a competing SDK, using a total of 8,640 test data sets from the database. As the figure shows, the Agora SDK finds the initial delay of most test data in a shorter time: in 96% of the test data it finds the correct delay within 1 s, versus 89% for the competitor.
The second test covers random delay jitter during a call, where the delay estimation algorithm should recover the correct delay in the shortest possible time. As the figure shows, in 71% of the test data the Agora SDK finds the correct post-change delay within 3 s, versus 44% for the competitor.
2. Linear adaptive filter
A large body of literature covers the principles and practice of linear filters. Applied to echo cancellation, the main considerations are convergence rate, steady-state misalignment, and tracking capability. These metrics often conflict: for example, a larger step size improves convergence speed but causes larger steady-state misalignment. This is the "no free lunch" theorem of adaptive filters.
As for filter types, besides the most commonly used NLMS filter (model-independent), RLS filters (least-squares model) or Kalman filters (state-space model) can also be used. Setting aside the assumptions, approximations, and optimizations in their respective theoretical derivations, the performance of these filters ultimately comes down to how the optimal step-size factor is computed (in the Kalman filter, the step size is folded into the Kalman gain). When the filter has not yet converged, or the environment's transfer function changes abruptly, the step size needs to be large enough to track the change; when the filter has converged and the transfer function is changing slowly, the step size should be reduced as much as possible to reach the smallest achievable steady-state misalignment. Computing the step size requires the energy ratio of the residual echo to the residual signal after the adaptive filter, modeled as the system's leakage coefficient. This quantity is essentially equivalent to the difference between the filter coefficients and the true transfer function (the state-vector error in the Kalman filter), and estimating it is the hardest part of the whole algorithm. In addition, filter divergence during double talk must be considered; generally, this can be addressed by adjusting the filter structure and using two echo-path models.
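As a concrete baseline, here is an NLMS filter with a naive energy-based variable step size. It is a textbook sketch, not our production controller, but it illustrates the step-size trade-off and, in its failure mode, exactly why the two-path schemes mentioned above exist.

```python
import numpy as np

class NlmsAec:
    """NLMS filter whose step shrinks as the residual-to-microphone
    energy ratio drops, i.e. as convergence improves. Note this naive
    rule also enlarges the step during double talk, which is the
    divergence problem that two-path filter schemes address."""

    def __init__(self, taps=512, mu_max=0.8, smooth=0.99):
        self.w = np.zeros(taps)
        self.mu_max, self.smooth = mu_max, smooth
        self.e_pow = self.d_pow = 1e-8   # smoothed residual / mic powers

    def step(self, x, d):
        """x: latest `taps` reference samples (newest first); d: mic sample."""
        e = d - self.w @ x
        a = self.smooth
        self.e_pow = a * self.e_pow + (1 - a) * e * e
        self.d_pow = a * self.d_pow + (1 - a) * d * d
        mu = self.mu_max * min(self.e_pow / self.d_pow, 1.0)
        self.w += mu * e * x / (x @ x + 1e-8)
        return e
```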
Agora's adaptive filter algorithm does not use a single filter type; it combines the advantages of different filters and computes the optimal step size adaptively. In addition, the algorithm estimates the environment's transfer function in real time from the linear filter coefficients and automatically adjusts the filter length to cover high-reverberation, strong-echo scenarios such as communication devices connected to HDMI peripherals. Here is an example: in a medium-sized conference room in Agora's office (about 20 m², with three glass walls), a MacBook Pro is connected to a Xiaomi TV over HDMI. The figure shows the evolution of the linear filter's time-domain coefficients: the algorithm automatically calculates and matches the length of the actual environment transfer function (a strong-reverberation environment is detected around frame 1400), optimizing the linear filter's performance.
Similarly, we used a large amount of test data from the database to compare the Agora SDK's performance against the competitor. The metrics are steady-state misalignment (the degree of echo suppression after the filter converges) and convergence speed (the time required for the filter to reach steady state). The first figure shows steady-state misalignment: in 47% of the test data, the Agora SDK achieves more than 20 dB of echo suppression, versus 39% for the competitor.
The figure below shows the convergence speed of the adaptive filter. In 51% of the test samples, the Agora SDK converges to steady state within 3 s at the start of the call, versus 13% for the competitor.
3. Nonlinear processing
Nonlinear processing aims to suppress the echo components the linear filter fails to predict. It usually computes the correlations among the reference signal, the microphone signal, the linear echo estimate, and the residual signal, and then either maps the correlation directly to a suppression gain, or uses it to estimate the power spectrum of the residual echo, which is then suppressed with traditional noise-reduction methods such as a Wiener filter.
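A minimal sketch of this correlation-to-gain mapping (illustrative parameters, not any particular production implementation). The spectra must be smoothed across frames: without smoothing, per-bin coherence is trivially 1.

```python
import numpy as np

class CoherenceNlp:
    """Map reference/residual coherence to a per-bin suppression gain."""

    def __init__(self, bins, alpha=0.9, floor=0.1):
        self.alpha, self.floor = alpha, floor
        self.s_rr = np.full(bins, 1e-10)            # ref auto-spectrum
        self.s_ee = np.full(bins, 1e-10)            # residual auto-spectrum
        self.s_re = np.zeros(bins, dtype=complex)   # cross-spectrum

    def gain(self, ref_fft, res_fft):
        a = self.alpha
        self.s_rr = a * self.s_rr + (1 - a) * np.abs(ref_fft) ** 2
        self.s_ee = a * self.s_ee + (1 - a) * np.abs(res_fft) ** 2
        self.s_re = a * self.s_re + (1 - a) * ref_fft * np.conj(res_fft)
        coh = np.abs(self.s_re) ** 2 / (self.s_rr * self.s_ee)
        # High coherence: the bin is still mostly echo, so attenuate it.
        return np.clip(1.0 - coh, self.floor, 1.0)

# Per frame: out_fft = res_fft * nlp.gain(ref_fft, res_fft)
```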
As the last module in the echo cancellation chain, the nonlinear processing unit not only suppresses residual echo but also monitors whether the whole system is working properly. For example: has delay jitter prevented the linear filter from working normally? Has the hardware echo canceller that runs before the Agora SDK's echo canceller left residual echo it could not handle?
Here is a simple example: internal parameters such as the echo energy estimated by the adaptive filter can detect delay changes faster and prompt the NLP to take corresponding action:
As the scenarios covered by the Agora SDK have broadened, transmitting music signals has become an important use case, and the Agora SDK has made many optimizations to the echo cancellation experience for music. A typical example is the improvement of the comfort noise (CNG) estimation algorithm. The traditional algorithm estimates the noise floor of a signal based on the Minimum Statistics principle. When applied to music, which is more stationary than speech, it overestimates the noise power; reflected in echo cancellation, this makes the processed noise floor (background noise) unstable between echo and non-echo periods, an extremely poor experience. Through signal classification and module fusion, the Agora SDK completely eliminates the noise floor fluctuations caused by CNG estimation.
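The failure mode is easy to reproduce with a bare-bones minimum-tracking noise estimator, a heavily simplified cousin of the full Minimum Statistics method (which additionally applies bias compensation and adaptive smoothing):

```python
import numpy as np

def min_stats_noise(frame_pows, win=100, alpha=0.9):
    """Noise floor = minimum of smoothed frame power over a sliding window.
    On near-stationary music the signal rarely dips, so the minimum creeps
    up and the noise (hence the comfort noise) is overestimated."""
    frame_pows = np.asarray(frame_pows, dtype=float)
    smoothed = np.empty_like(frame_pows)
    s = frame_pows[0]
    for i, p in enumerate(frame_pows):
        s = alpha * s + (1 - alpha) * p      # recursive power smoothing
        smoothed[i] = s
    return np.array([smoothed[max(0, i - win):i + 1].min()
                     for i in range(len(smoothed))])
```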
In addition, the Agora SDK has been heavily optimized for all kinds of extreme situations, including non-causal systems, device frequency offsets, capture signal overflow, and sound cards with built-in system signal processing, to ensure that the algorithm works in all communication scenarios.
A noise reduction strategy that puts sound quality first
The noise reduction module affects the sound quality of the signal more than the echo cancellation module does. This stems from the a priori assumption built into noise reduction algorithms from the start: that the noise floor is a stationary signal (at least short-term stationary). Under this assumption, music is far harder to distinguish from the noise floor than speech is.
The Agora SDK places a signal classification module in front of the noise reduction module. It accurately detects the signal type and adjusts the type and parameters of the noise reduction algorithm accordingly; common signal types include ordinary speech, a cappella singing, and music. The figure below shows signal fragments processed by two noise reduction algorithms. The first is a mixture of speech and music: the first 15 s are noisy speech, the next 40 s are music, and the final 10 s are noisy speech again. The spectrograms from top to bottom are the original signal, a competitor's result, and the Agora SDK's result. With comparable noise reduction on the speech segments, the competitor severely damages the music portion, while the Agora SDK's processing does not degrade the music's sound quality.
In the second example, the audio is an a cappella recording in which the singer repeatedly sings an "ah" sound. In the spectrograms below, from top to bottom are the original signal, the competitor's result, and the Agora SDK's result. The competitor's noise reduction severely damages the spectral components of the original voice, while the Agora SDK completely preserves its harmonic components, ensuring the sound quality of a cappella singing.
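As a toy illustration of how a front-end classifier can steer the noise reducer (this is not our classifier; the threshold and window are purely illustrative): spectral flux tends to be lower and steadier for sustained music than for speech, so a flag derived from it can freeze the noise-floor update on music-like frames.

```python
import numpy as np

def music_flag(mag, prev_mag, flux_hist, thresh=0.1, hist=50):
    """Crude music/speech flag from normalized spectral flux."""
    flux = np.sum((mag - prev_mag) ** 2) / (np.sum(mag ** 2) + 1e-10)
    flux_hist.append(flux)
    return float(np.mean(flux_hist[-hist:])) < thresh   # True => music-like

# The flag can gate the noise tracker: freeze or slow the noise-floor
# update on music-like frames so sustained tones are not absorbed into
# the noise estimate and then subtracted away.
```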
Concluding remarks
Since M. M. Sondhi of Bell Labs pioneered the use of adaptive filters for echo cancellation in 1967, countless studies and practical efforts have been devoted to this most fundamental problem of voice communication. Solving the echo problem well requires not only strong algorithms as a foundation but also a great deal of engineering optimization. Agora will continue to improve the echo cancellation experience across application scenarios.
In the next article in this series, we will follow the audio signal from the device side into the real network environment, touring Shanghai for on-site testing, and discuss optimization strategies for latency, jitter, and packet-loss resistance in audio interaction. (A small spoiler in the picture below, so stay tuned.)