Improving the RTC audio experience: start by understanding the hardware

Preface

The rapid development of RTC (real-time audio and video communication) technology has helped popularize interactive entertainment such as live streaming and short video. With the pandemic continuing to spread worldwide, demand for cloud conferencing has grown explosively, which has pushed the RTC industry forward even faster. To provide customers with stable and reliable service, the network layer needs to keep raising the channel connection rate, lowering the mid-meeting drop rate, and improving resilience to weak networks. The video side needs to improve clarity and reduce the freeze rate. The audio side, while pursuing a high end-to-end MOS (mean opinion score), must also pay attention to how well the audio 3A algorithms perform. These are the "internal strengths" every vendor has to build up, and they are what ultimately becomes its core competitiveness. This article focuses on how much the quality of the audio captured by hardware devices matters to the end-to-end RTC audio experience.

What is the impact of poor collection quality?

In an RTC architecture, the end-to-end audio signal processing flow is roughly as shown in the figure below. On the uplink, the signal goes through capture, audio 3A (AEC: echo cancellation, ANS: adaptive noise reduction, AGC: automatic gain control), and encoding. On the downlink, it goes through packet loss recovery, decoding, mixing, and playback.

End-to-end audio signal processing flow

As the figure shows, the audio signal goes through analog-to-digital conversion, then through the audio signal processing chip built into the device, and only then reaches the RTC SDK. Because capture solutions vary widely from one hardware manufacturer to another, the quality of the captured audio directly affects how usable the raw material handed to the 3A algorithms is, and it also sets the upper limit on the audio quality the end user receives. Based on audio problems encountered in practice, the problems caused by device capture fall into a few broad categories.

A few examples:

(1) Abnormal capture

Abnormal capture shows up mainly as a "blurred" frequency spectrum. Speech can become unintelligible, which disrupts normal communication, as in the spectrogram below.

In addition, when capture is abnormal, the played-out signal is also abnormal once the microphone picks it up again. This introduces severe nonlinear distortion and degrades echo cancellation, as shown in the figure below.

(2) Capture jitter

The most common form is dropped capture data. Audibly, it shows up as a lot of high-frequency noise (the figure below is a zoomed-in view of the noise in the figure above). It seriously degrades the accuracy of delay estimation in the AEC algorithm and can make the far-end and near-end signals non-causal; in severe cases, echo leaks through.
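One way to look for dropped capture data offline is to search a recording for abrupt sample-to-sample jumps, since a splice left by missing samples is what produces the high-frequency noise. The following is only a minimal sketch and not the article's own method: the function name `find_discontinuities`, the median-based threshold, and the `ratio` and `floor` values are all assumptions chosen for illustration.

```python
import math
from statistics import median


def find_discontinuities(samples, ratio=8.0, floor=1e-4):
    """Return indices i where samples[i] jumps abnormally from samples[i - 1].

    A jump counts as abnormal when it exceeds `ratio` times the median
    absolute sample-to-sample difference (both values are illustrative).
    """
    diffs = [abs(b - a) for a, b in zip(samples, samples[1:])]
    if not diffs:
        return []
    threshold = max(ratio * median(diffs), floor)
    return [i + 1 for i, d in enumerate(diffs) if d > threshold]


def make_tone(freq_hz, sample_rate, n):
    """Generate n samples of a unit-amplitude sine wave."""
    step = 2.0 * math.pi * freq_hz / sample_rate
    return [math.sin(step * i) for i in range(n)]


# A clean 440 Hz tone at 48 kHz, and a copy with 50 samples removed
# to simulate a block of capture data that was lost.
tone = make_tone(440.0, 48000, 4800)
damaged = tone[:1000] + tone[1050:]

clean_hits = find_discontinuities(tone)
damaged_hits = find_discontinuities(damaged)
```

On a pure tone this flags only the splice point. Real speech has legitimate sharp transients, so a production check would need a more careful threshold or a comparison against capture timestamps.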

(3) Popping and low volume

Popping in captured audio occurs mainly on PCs, and it is the most important problem for PC devices to avoid because its impact is large: beyond the spectral distortion caused by clipping, the resulting severe nonlinear distortion degrades echo cancellation. Popping has to be handled by the AGC algorithm, which adaptively adjusts the analog gain of the PC and the microphone.
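Clipping of this kind can be spotted by counting how many samples sit at or very near full scale. The sketch below assumes 16-bit PCM input; the function names and the 99% margin are illustrative choices, not values from the article.

```python
import math


def clipping_ratio(pcm16, full_scale=32767, margin=0.99):
    """Fraction of 16-bit samples whose magnitude is at or near full scale."""
    if not pcm16:
        return 0.0
    limit = int(full_scale * margin)
    clipped = sum(1 for s in pcm16 if abs(s) >= limit)
    return clipped / len(pcm16)


def sine_pcm16(amplitude, freq_hz=440.0, sample_rate=48000, n=4800):
    """Sine wave scaled by `amplitude` (1.0 = full scale), hard-limited to int16."""
    out = []
    for i in range(n):
        value = round(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        out.append(max(-32768, min(32767, value)))
    return out


clean = sine_pcm16(0.5)        # comfortably below full scale
overdriven = sine_pcm16(1.5)   # analog gain too high: peaks are flattened

clean_ratio = clipping_ratio(clean)
overdriven_ratio = clipping_ratio(overdriven)
```

A ratio that stays above a small fraction for several consecutive frames is a reasonable trigger for lowering the analog gain.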

(4) Missing spectrum

Missing spectrum occurs mainly when the audio sampling rate reported by the hardware callback does not match the actual spectral content. Even if the encoder is given a high bitrate, the result does not sound high quality. In the figure below, the captured signal has a sampling rate of 48 kHz, but its spectrum only extends to 8 kHz.
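A mismatch between the nominal sampling rate and the real spectral content can be checked by measuring how much energy lies above a cutoff frequency. The sketch below uses only the standard library, with a small recursive radix-2 FFT and a Hann window. The function names, the 1024-sample frame, and the 8 kHz cutoff used in the example are assumptions for illustration.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def high_band_energy_ratio(samples, sample_rate, cutoff_hz, n=1024):
    """Share of spectral energy at or above cutoff_hz in the first n samples."""
    if len(samples) < n:
        raise ValueError("need at least n samples")
    # Hann window to limit spectral leakage.
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(samples[:n])]
    spectrum = fft(windowed)
    power = [abs(c) ** 2 for c in spectrum[: n // 2 + 1]]
    total = sum(power)
    if total == 0.0:
        return 0.0
    first_high_bin = math.ceil(cutoff_hz * n / sample_rate)
    return sum(power[first_high_bin:]) / total


def tones(freqs_hz, sample_rate, n):
    """Sum of unit-amplitude sine waves at the given frequencies."""
    return [sum(math.sin(2 * math.pi * f * i / sample_rate) for f in freqs_hz)
            for i in range(n)]


# Both signals are nominally 48 kHz, but only one has content above 8 kHz.
narrow = tones([1000.0, 3000.0], 48000, 1024)
wide = tones([1000.0, 12000.0], 48000, 1024)

narrow_ratio = high_band_energy_ratio(narrow, 48000, 8000.0)  # close to 0
wide_ratio = high_band_energy_ratio(wide, 48000, 8000.0)      # roughly one half
```

If a stream labeled 48 kHz keeps a ratio near zero across speech that contains sibilants, the capture path is probably band-limited upstream of the SDK.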

What can we do at the hardware level to improve captured sound quality?

Hardware devices with RTC capabilities are now part of every aspect of daily life. Beyond mobile phones and PCs, even children's phone watches, Tmall Genie, and various high-end fingerprint and password locks now support RTC. This diversity of devices directly translates into differences in capture capability. Setting aside differences in acoustic component design, on Android alone the differences in chips and software systems mean that even phones of the same brand cannot all be served by a single configuration.

In addition, most mobile devices now have their own hardware audio signal processing (hereinafter "hardware 3A"), and its quality varies a lot from chip to chip. More seriously, the spectrum of the signal after hardware processing is often truncated. For example, with hardware 3A turned on, the signal called back to the RTC SDK may only contain content up to 8 kHz, which is equivalent to audio sampled at 16 kHz. Especially in entertainment scenarios, this cannot satisfy the pursuit of high sound quality. Adapting well to the hardware layer is therefore the foundation of a high-quality RTC audio experience.

Android side

(1) You need to understand the difference between the two capture modes, the Java audio classes (javaaudioclass) and OpenSL ES (opensles), know which parameters each one needs adapted, and know how to configure each so that hardware 3A is turned off.

(2) When you see capture jitter or abnormal volume, try changing the requested sampling rate. The commonly used 48 kHz setting does not work on every Android device.
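The idea of falling back to another sampling rate can be sketched as a simple ordered probe. This is a language-neutral illustration in Python, not Android code: `pick_sample_rate` and the `probe` callback are hypothetical. On a real device the probe would open the recorder at the given rate, capture briefly, and run health checks on the frames.

```python
def pick_sample_rate(candidates, probe):
    """Return the first rate for which probe(rate) reports healthy capture.

    `probe` is a hypothetical callback: it should return True when a short
    test capture at that rate looks healthy, and may raise OSError when the
    device rejects the rate outright.
    """
    for rate in candidates:
        try:
            if probe(rate):
                return rate
        except OSError:
            continue  # device refused this rate, try the next one
    return None


def fake_probe(rate):
    """Stand-in for a device that rejects 48 kHz and is healthy at two rates."""
    if rate == 48000:
        raise OSError("device rejected 48000 Hz")
    return rate in (44100, 16000)


chosen = pick_sample_rate([48000, 44100, 32000, 16000], fake_probe)
```

Ordering the candidates from highest to lowest keeps the best available bandwidth while still letting problematic devices settle on a rate that captures cleanly.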

Windows side

(1) Many Windows devices now have a microphone array built into the top of the screen to provide audio enhancement. It is enabled as shown in the figure below. By default, this feature treats an angular region directly in front of the screen as the pickup zone; the microphone array enhances the speaker's voice inside that zone and "isolates" the "noise" outside it. The main drawback is that, once the feature is on, the spectrum only extends to 8 kHz. The enhancement algorithms also differ between manufacturers, and their results are uneven. The software therefore needs to be able to bypass the hardware's built-in audio enhancement in order to guarantee high sound quality.

The dual-microphone array built into a Windows device (image from the internet)

Enhancement switch in the audio settings

Missing spectrum after audio enhancement is turned on

(2) In terms of volume, PC devices all support analog gain adjustment, and most Windows devices with microphone arrays have an additional microphone boost setting (as shown in the figure below). The software algorithm (the AGC in 3A) needs to adjust both adaptively, keeping the capture volume stable and the captured noise level under control. A poor initial value or poor adaptive adjustment leads to problems such as low volume and popping, which seriously degrade echo cancellation and noise reduction and put usability at risk.
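The adaptive adjustment described above can be sketched as a controller that backs off quickly when the signal approaches clipping and raises the gain slowly when the level is too low. This is a minimal illustration, not a production AGC: the normalized 0 to 1 gain range, the dBFS thresholds, and the step sizes are all assumed values.

```python
import math


def peak_dbfs(pcm16):
    """Peak level of a 16-bit PCM frame in dB relative to full scale."""
    peak = max((abs(s) for s in pcm16), default=0)
    if peak == 0:
        return float("-inf")
    return 20.0 * math.log10(peak / 32768.0)


def next_analog_gain(gain, level_dbfs, target_dbfs=-9.0, clip_dbfs=-1.0,
                     step_up=0.02, step_down=0.10):
    """Return the analog gain (0.0 to 1.0) to apply for the next frame.

    Near clipping: reduce gain in a large step to stop popping quickly.
    Well below target: raise gain in a small step to avoid pumping.
    Otherwise: leave the gain alone (a 6 dB dead zone below the target).
    """
    if level_dbfs >= clip_dbfs:
        gain -= step_down
    elif level_dbfs < target_dbfs - 6.0:
        gain += step_up
    return min(1.0, max(0.0, gain))


# Example: a frame that nearly clips, a quiet frame, and a frame in range.
after_loud = next_analog_gain(0.8, -0.5)
after_quiet = next_analog_gain(0.5, -30.0)
after_ok = next_analog_gain(0.5, -10.0)
half_scale_level = peak_dbfs([0, 16384, -16384])
```

The asymmetric steps reflect the priority stated above: popping damages echo cancellation immediately, so it is corrected faster than low volume.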

Analog gain and microphone boost

Apple devices

(1) iOS requires less adaptation work. You do need to know how to turn off hardware 3A, because the spectrum of hardware 3A output on iOS devices only reaches 10 kHz to 12 kHz.

(2) Mac notebooks are relatively simple and only offer analog gain adjustment. One thing to note: when RTC plays back in two channels, the microphone sits on the same side as one of the speakers, so the nearby microphone captures popping while audio is playing. In general this can only be addressed by optimizing the software AEC algorithm.

Summary

Now that 48 kHz high-quality audio has become a hard requirement, guaranteeing the quality of the capture path calls for two things. On the one hand, it takes time to master the rules of Android parameter adaptation; and as more and more customized Android devices (watches, smart speakers, and so on) appear on the market, determining their configuration parameters up front is also indispensable. On the other hand, turning off the device's hardware audio processing and enabling the pure-software 3A algorithms that ship with the RTC SDK is also a trend, on the premise that the software 3A algorithms are well optimized. Their overall quality and power consumption are items customers always test when comparing the audio experience of different vendors, and they are part of each vendor's core competitiveness.