Voice-based social interaction has been around for decades, and the recent "interactive podcast" boom has put audio interaction back at the center of the industry. How do you deliver a good audio interaction experience? How do you optimize sound quality? How do you handle the network challenges of global transmission? And once the sound quality is high, how do you make the sound even more pleasant? Starting today, we will answer these questions one by one in our "Low Latency and High Sound Quality Explained" series.
Following Elon Musk, Bill Gates has also hosted an "interactive podcast". Many teams have since started adding audio social scenarios to their products. The scenario may look simple to implement, but giving users in different countries the same high-quality sound experience is far from easy.
In this series we will walk through the technical principles and design ideas behind high sound quality and low latency, covering codecs, noise suppression and echo cancellation algorithms, network transmission, and sound quality optimization.
Let's start with the voice codec. Before diving in, though, it helps to understand how audio codecs work in general, so that it is easier to see what actually affects the sound quality experience.
Speech coding and music coding
Audio coding is the process of converting an audio signal into a digital stream (as shown in the figure below). The encoder analyzes the audio signal and produces a set of parameters, which are then written into a binary stream according to agreed rules; this binary stream is what we usually call the bitstream. After receiving the bitstream, the decoder restores the parameters according to the same rules and uses them to reconstruct the audio signal.
(Figure: the audio encoding and decoding pipeline)
Audio codecs have a long history. The core algorithm of early codecs was non-linear quantization, a fairly simple algorithm by today's standards. Its compression efficiency is not high, but it suits the vast majority of audio types, speech and music alike. Later, as the technology developed and the field specialized, codecs evolved along two separate paths: speech codecs and music codecs.
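To make non-linear quantization concrete, here is a minimal sketch of mu-law companding, the scheme standardized in G.711 and typical of those early codecs. It is illustrative only: real G.711 uses a segmented approximation and a different 8-bit code layout.

```python
# A minimal sketch of non-linear (mu-law) quantization, the idea behind
# early codecs such as G.711. Illustrative only.
import numpy as np

MU = 255.0  # mu-law compression constant used by G.711

def mulaw_encode(x: np.ndarray) -> np.ndarray:
    """Compress samples in [-1, 1], then quantize to 8-bit codes."""
    compressed = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((compressed + 1) / 2 * 255).astype(np.uint8)

def mulaw_decode(codes: np.ndarray) -> np.ndarray:
    """Expand 8-bit codes back to samples in [-1, 1]."""
    y = codes.astype(np.float64) / 255 * 2 - 1
    return np.sign(y) * (np.power(1 + MU, np.abs(y)) - 1) / MU

# Quiet samples get finer quantization steps than loud ones, matching
# how the ear perceives loudness.
t = np.linspace(0, 1, 8000, endpoint=False)      # 1 s at 8 kHz
quiet_tone = 0.1 * np.sin(2 * np.pi * 440 * t)
restored = mulaw_decode(mulaw_encode(quiet_tone))
print("max error:", np.max(np.abs(quiet_tone - restored)))
```

Because quiet samples get finer steps, the same 8 bits sound noticeably better than uniform quantization would.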
Speech codecs, used mainly to encode speech signals, gradually evolved toward a time-domain linear prediction framework. Modeled on how the vocal tract produces sound, this kind of codec decomposes the speech signal into a primary part, the linear prediction coefficients, and a secondary part, the residual signal. The linear prediction coefficients take very few bits to encode, yet they efficiently build the "skeleton" of the speech signal; the residual signal is the "flesh and blood" that fills in its details. This framework greatly improves the compression efficiency of speech signals, but within a limited complexity budget a time-domain linear prediction framework cannot encode music signals well.
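To give a feel for the linear prediction idea, here is a minimal sketch using the classic autocorrelation method. It is a simplification of what real speech codecs do (they add pitch prediction, quantization, perceptual weighting, and more); the synthetic frame is an illustrative stand-in for voiced speech.

```python
# A minimal sketch of time-domain linear prediction: each sample is
# predicted from the previous `order` samples. The coefficients are the
# "skeleton"; the prediction error (residual) is the "flesh and blood".
import numpy as np

def lpc_coefficients(frame: np.ndarray, order: int = 10) -> np.ndarray:
    """Solve the normal equations built from the frame's autocorrelation."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

# A synthetic "voiced" frame: a decaying resonance plus a little noise.
rng = np.random.default_rng(0)
n = np.arange(320)                               # 20 ms frame at 16 kHz
frame = np.sin(2 * np.pi * 0.05 * n) * np.exp(-n / 200)
frame += 0.01 * rng.standard_normal(len(n))

a = lpc_coefficients(frame)
predicted = np.zeros_like(frame)
for k, ak in enumerate(a, start=1):              # x_hat[i] = sum_k a_k * x[i-k]
    predicted[k:] += ak * frame[:-k]
residual = frame - predicted
print("frame energy:", (frame ** 2).sum(), "residual energy:", (residual ** 2).sum())
```

The residual energy comes out far below the frame energy, which is exactly why the coefficients-plus-residual split compresses speech so well.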
Music codecs, which encode music signals, took a different evolutionary path. Compared with the time-domain signal, the information in the frequency-domain representation is concentrated in a small fraction of the frequency points, which makes it more amenable to analysis and compression. Music codecs therefore generally encode the signal in the frequency domain.
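A quick numerical check of this energy-concentration argument, using a toy tonal signal (the three-tone "chord" below is an idealized stand-in for music):

```python
# After a Fourier transform, a tonal signal packs most of its energy
# into a few frequency bins, so a frequency-domain codec can spend its
# bits there and code the rest coarsely.
import numpy as np

fs = 32000
t = np.arange(fs) / fs                           # 1 second of signal
x = (np.sin(2 * np.pi * 262 * t) + np.sin(2 * np.pi * 330 * t)
     + np.sin(2 * np.pi * 392 * t) + 0.01 * np.random.randn(fs))

energy = np.abs(np.fft.rfft(x)) ** 2
top = np.sort(energy)[::-1]
print("share of energy in the top 10 of %d bins: %.1f%%"
      % (len(energy), 100 * top[:10].sum() / energy.sum()))
# In the time domain, by contrast, the energy is spread over every sample.
```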
Later, as the technology matured, the two architectures converged again in the form of hybrid speech/music codecs. Opus, the default codec in WebRTC, is of this type: it combines the two coding frameworks and automatically switches to whichever suits the signal. Opus is used in well-known products at home and abroad, such as Discord.
What affects the interactive experience in speech coding?
Discussions of voice codecs usually revolve around technical indicators such as sampling rate, bit rate, complexity, and packet-loss resistance. What do these indicators actually represent, and how do they affect the audio experience?
You may have seen claims like "the higher the sampling rate, the better the sound quality" and "the higher the encoding complexity, the better", but it is not that simple!
1. Sampling rate
Converting the analog signal that the human ear hears into a digital signal that a computer can process requires sampling. Sound can be decomposed into a superposition of sine waves of different frequencies and intensities, and sampling can be thought of as picking points along the sound wave. The sampling rate is the number of points sampled per second: the higher it is, the less information is lost in the conversion, and the closer the result is to the original sound.
The sampling rate determines the resolution of the audio signal. Within the range the human ear can perceive, the higher the sampling rate, the more high-frequency components are retained, and the clearer and brighter the signal sounds. For example, on a traditional phone call the other party often sounds dull. That is because traditional telephony samples at 8 kHz, preserving only the low-frequency information needed for intelligibility; many high-frequency components are lost. So for a better audio interaction experience, the sampling rate should be raised as far as is useful within the perceptible range of the human ear.
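A small sketch of the ceiling behind this: a sampling rate of fs can only represent content below fs/2 (the Nyquist limit), which is what makes 8 kHz telephony sound dull. The 6 kHz tone here is an illustrative stand-in for the high-frequency energy of sibilants like "s":

```python
import numpy as np

f_sibilant = 6000                                 # Hz

# At 32 kHz this tone is below Nyquist (16 kHz) and is captured cleanly.
# At 8 kHz it is above Nyquist (4 kHz); sampled naively it aliases down
# to 8000 - 6000 = 2000 Hz. (A real system low-pass filters before
# sampling, so the component is simply removed rather than aliased.)
fs_phone = 8000
n = np.arange(fs_phone)                           # 1 s at 8 kHz
tone = np.sin(2 * np.pi * f_sibilant * n / fs_phone)
peak_bin = np.argmax(np.abs(np.fft.rfft(tone)))   # bins are 1 Hz apart here
print("a 6 kHz tone sampled at 8 kHz shows up at ~%d Hz" % peak_bin)
```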
2. Bit rate
After sampling, the sound has been converted from an analog signal into a digital signal. The bit rate is the amount of data this digital signal carries per unit time.
The bit rate determines how much detail of the audio signal survives encoding and decoding. The codec allocates the available bit rate to the parameters produced by each analysis module according to priority: when the bit rate is limited, parameters that matter most to voice quality are encoded first, and some parameters with less impact are dropped. At the decoder, because the parameters are incomplete, the reconstructed speech signal inevitably suffers some damage. In general, for the same codec, the higher the bit rate, the smaller the damage after encoding and decoding. But higher is not always better. On the one hand, bit rate and codec quality are not linearly related: beyond the "quality sweet spot", further increases in bit rate bring no significant quality gain. On the other hand, in real-time interaction an excessive bit rate can squeeze the available bandwidth and cause network congestion, which leads to packet loss and in turn ruins the user experience.
About the quality sweet spot: in video, the quality sweet spot refers to the best subjective video quality obtained by choosing a sensible resolution and frame rate for a given bit rate and screen size. A similar effect exists in audio.
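As a back-of-the-envelope illustration of what a codec's bit rate buys (the numbers are typical values, not tied to any particular codec):

```python
# Raw PCM versus a coded stream, in bits per second.
sample_rate = 32000      # samples per second
bit_depth = 16           # bits per sample

raw_bps = sample_rate * bit_depth   # mono PCM: 512,000 bit/s = 512 kbps
coded_bps = 24_000                  # a plausible speech-codec bit rate

print("raw: %d kbps, coded: %d kbps, compression ~%.0f:1"
      % (raw_bps / 1000, coded_bps / 1000, raw_bps / coded_bps))
```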
3. Coding complexity
Coding complexity is generally concentrated in the signal analysis modules on the encoder side. Broadly speaking, the more thoroughly the speech signal is analyzed, the higher the potential compression rate, so coding efficiency and complexity are correlated. As with bit rate, complexity and codec quality are not linearly related; there is a "quality sweet spot" here too. Whether a high-quality coding and decoding algorithm can be designed within a limited complexity budget often directly determines whether a codec is usable in practice.
4. Packet-loss resistance
First of all, what is the principle behind packet-loss resistance? Packets inevitably get lost when transmitting audio data. If the current packet is lost, we would like some way to guess at, or otherwise obtain, approximate information about the current frame, and then use this imperfect information to decode a speech frame similar to the original. Guessing out of thin air rarely ends well, of course; it helps greatly if the previous or the next packet can tell the decoder some key information about the lost one. The more such information there is, the easier it is for the decoder to recover the lost speech frame. This "key information" carried in the previous or next packet is the "inter-frame redundancy" we refer to below. (We discussed packet-loss resistance in more detail in an earlier article.)
Packet-loss resistance and coding efficiency are therefore somewhat at odds. Improving coding efficiency usually means removing as much inter-frame redundancy as possible, while packet-loss resistance depends on keeping a certain amount of it, so that when the current packet is lost, the current speech frame can be recovered from the preceding or following frames. In real-time interactive scenarios, the user's network is unreliable: someone may walk into an elevator or ride in a fast-moving car, and on such networks packet loss and delay jitter are everywhere, so packet-loss resistance is indispensable. Balancing coding efficiency against packet-loss resistance therefore also takes careful algorithm design, refinement, and verification.
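Here is a minimal sketch of the redundancy idea: every packet also carries a coarse, low-bit-rate copy of the previous frame, so a single lost packet can be patched from its successor. This is the spirit of in-band FEC (Opus offers something similar); the packet layout and byte strings here are hypothetical, not any specific codec's format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Packet:
    seq: int
    main: bytes   # full-quality encoding of frame `seq`
    fec: bytes    # coarse encoding of frame `seq - 1`

def decode_frame(received: dict, seq: int) -> Optional[bytes]:
    """Decode frame `seq`, falling back to the redundancy in packet seq + 1."""
    if seq in received:
        return received[seq].main          # normal path
    nxt = received.get(seq + 1)
    if nxt is not None:
        return nxt.fec                     # recovered, at reduced quality
    return None                            # both lost: conceal/extrapolate

# Packet 1 is lost in transit, but packet 2 carries a coarse copy of frame 1.
received = {0: Packet(0, b"frame0", b""),
            2: Packet(2, b"frame2", b"frame1-coarse")}
print(decode_frame(received, 1))           # b'frame1-coarse'
```

The trade-off described above is visible directly: the `fec` field spends extra bits on information the receiver usually already has.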
How to balance audio experience and technical indicators?
So how does Agora do it? Our engineers weighed all of the points above and built Agora Nova (hereafter Nova), a high-definition voice codec designed specifically for real-time communication.
32 kHz sampling rate
First, for the sampling rate, Nova did not adopt the 8 kHz or 16 kHz rates used by other voice codecs, but a higher rate of 32 kHz. This gives Nova a substantial head start on call sound quality. The 16 kHz sampling rate common in the industry (WeChat, for example, uses 16 kHz) meets the basic requirements of speech intelligibility, but some voice details can only be captured at a higher sampling rate. We wanted to deliver higher-definition voice calls that not only guarantee intelligibility but also improve clarity, and that is why we chose 32 kHz.
Optimizing coding complexity
The higher the sampling rate, the higher the speech intelligibility, but also the more sample points there are to analyze, encode, and transmit per unit time, so the bit rate and complexity must rise accordingly. Higher bit rate and complexity inevitably put pressure on users' bandwidth, device performance, and power consumption, which is not what we want. To address this, after theoretical derivation and extensive experimental verification, we designed a streamlined coding scheme for the high-frequency components of speech. With only a small increase in analysis complexity, it can encode the high-frequency signal at as little as 0.8 kbps (with earlier techniques, representing the high-frequency signal generally required more than 1-2 kbps), which markedly improves the clarity of the speech signal.
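Agora has not published Nova's high-band algorithm, but the following generic sketch (in the spirit of spectral-band-replication-style bandwidth extension) shows why sub-1 kbps high-band coding is plausible: transmit only a handful of sub-band gains and regenerate the high band's fine structure from the low band. The band split and all names here are illustrative assumptions, not Nova's actual design.

```python
import numpy as np

def encode_high_band(spectrum: np.ndarray, split: int, n_bands: int = 4) -> np.ndarray:
    """Reduce the high band to `n_bands` energies (a few bits each on the wire)."""
    bands = np.array_split(spectrum[split:], n_bands)
    return np.array([np.sqrt(np.mean(b ** 2)) for b in bands])

def decode_high_band(low: np.ndarray, gains: np.ndarray, size: int) -> np.ndarray:
    """Rebuild the high band: reuse low-band fine structure, shaped by the gains."""
    src = np.resize(low, size)                    # tile the low band upward
    bands = np.array_split(src, len(gains))
    for band, g in zip(bands, gains):
        rms = np.sqrt(np.mean(band ** 2)) or 1.0
        band *= g / rms                           # match the transmitted energy
    return np.concatenate(bands)

spec = np.abs(np.fft.rfft(np.random.randn(512)))  # 257 magnitude bins
gains = encode_high_band(spec, split=128)         # whole high band in 4 numbers
rebuilt = decode_high_band(spec[:128], gains, size=len(spec) - 128)
```

At, say, five bits per gain and 50 frames per second, four gains come to about 1 kbps, the same order of magnitude as the 0.8 kbps figure above.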
Balancing packet-loss resistance and coding efficiency
For packet-loss resistance, we likewise chose the most balanced scheme that still preserves coding efficiency. Experiments show that it maintains compression efficiency while also ensuring a good recovery rate under packet loss. In addition to Nova, for unstable network environments we have developed and launched the voice codec Solo and the hybrid speech/music codec SoloX, which are even more resistant to packet loss.
Agora Nova vs. Opus
Nova offers a rich set of modes for different scenarios, such as an adaptive mode, a high-quality mode, a low-power high-quality mode, an ultra-high-frequency mode, and an ultra-low bit rate mode.
Comparing Nova with the advanced open-source codec Opus: thanks to Nova's efficient signal-processing algorithms, at typical speech coding bit rates Nova preserves 30% more effective spectral information than Opus at the same bit rate. Under both subjective and objective evaluation, Nova's speech coding quality is higher than Opus's:
- At the objective level, scoring the encoded-and-decoded corpora of both codecs with the objective quality assessment algorithm defined in ITU-T P.863, Nova consistently scores slightly higher than Opus;
- At the subjective level, speech that has passed through Nova is closer to the original than speech that has passed through Opus, which is heard as less quantization noise.
Thanks to this high-definition voice codec, the Agora SDK delivers a consistently high-quality audio interaction experience to users around the world. That said, the quality of a voice call depends not only on the codec's coding quality but also, to a large degree, on other modules such as echo cancellation, noise suppression, and network transmission. In the next installment, we will share Agora's best practices in echo cancellation and noise suppression algorithms.