This article is the third installment in our "Low Latency and High Sound Quality Explained" series. This time we zoom in on the entire audio engine link and discuss how to balance sound quality against real-time interactivity over time-varying networks in different application scenarios.
When we discuss low latency and high sound quality in real-time interactive scenarios, what we actually face is the end-to-end sound quality problem across the entire audio engine link. The first article briefly described the audio transmission process. Refining it further, the entire audio engine link includes the following steps:
1. The capture device samples the acoustic signal, forming a discrete audio signal the computer can operate on;
2. Because audio signals have short-term correlation, the signal is split into frames, and 3A processing then deals with echo, environmental noise, automatic gain, and related acoustic problems;
3. The encoder compresses the audio signal in real time to form an audio code stream;
4. The sender assembles packets in the IP + UDP + RTP + audio payload format and sends them to the network; they travel through the network and are reassembled on arrival at the receiving end;
5. A de-jitter buffer and the decoder reconstruct a continuous audio stream, which the playback device then plays.
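The five steps above can be sketched as a toy pipeline. Everything here is a stand-in for illustration only: the function names are invented (none come from the Agora SDK), the "codec" does no real compression, and 3A processing is omitted.

```python
# Toy end-to-end sketch of the audio engine link described above.

def capture(analog):                      # step 1: sampling / quantization
    return [round(x * 100) for x in analog]

def frame(pcm, size):                     # step 2: framing (3A would run per frame)
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]

def encode(fr):                           # step 3: "compression" (identity stand-in)
    return tuple(fr)

def packetize(seq, payload):              # step 4: IP+UDP+RTP+payload packet
    return {"seq": seq, "payload": payload}

def receive(packets):                     # step 5: reorder, decode, play out
    ordered = sorted(packets, key=lambda p: p["seq"])
    return [s for p in ordered for s in p["payload"]]

analog = [0.0, 0.1, 0.2, 0.3, 0.1, 0.0]
pcm = capture(analog)
packets = [packetize(i, encode(f)) for i, f in enumerate(frame(pcm, 2))]
playout = receive(list(reversed(packets)))   # packets arrive out of order
assert playout == pcm                        # the receiver restores the stream
```

The reordering in `receive` previews the de-jitter buffering discussed later in this article.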
Each of these processing stages inevitably damages the audio signal to some degree. We call the damage introduced by capture and playback "device damage", the damage introduced in step 2 "signal-processing damage", the damage introduced by encoding and decoding "coding damage", and the damage introduced in step 4 "network damage".
To provide users with a full-band, high-quality audio interactive experience, the entire audio engine link must support full-band processing and, under constraints such as device capability, network bandwidth, and the acoustic environment, the damage introduced by each stage must be minimized as far as possible.
What does audio data encounter when it enters the network?
If the network is a highway for information, then audio data packets are like cars driving on it. Picture a car driving from Beijing to Shanghai: the expressway it takes is the backbone network, and the rugged mountain roads are weak-network environments. Suppose a car sets off from Beijing every minute; the fleet will then run into the three common problems of real-time transmission: packet loss, delay, and jitter.
Packet loss
"Packet loss" means that some cars cannot reach the destination within the allowed time, or may never reach it at all. Some cars may be stuck on Beijing's Third Ring Road forever, and some may have accidents along the way. If 5 of our 100 cars fail to arrive in Shanghai on time for whatever reason, the "packet loss rate" of this trip is 5%. Internet transmission is the same: it is not 100% reliable, and some data will always fail to reach its destination on time.
Delay
"Delay" refers to the average time it takes for each car to drive from Beijing Bird's Nest to Shanghai. Obviously, the motorcade must take the expressway much faster than the various small roads, and the route along the expressway from the bird’s nest also has a great influence. If it gets stuck in the third ring road, it will take several more hours. NS. So this value is related to the driving route chosen by the fleet. The same is true for Internet transmission. There are often many alternative paths between two points that need to transmit data, and the delays of these paths are often very different.
Jitter
"Jitter" refers to variation in the order and spacing of the cars' arrivals. Although our 100 cars leave Beijing at regular one-minute intervals, they do not arrive in Shanghai once a minute in sequence; a car that left later may even arrive before one that left earlier. Internet transmission is the same: if audio and video data were simply played back in the order received, distortion would occur.
To summarize:
1. Real-time audio interaction is carried out over the network: the encoded audio stream is assembled into data packets according to a real-time transport protocol, and the packets travel from sender to receiver along their own routes through the network.
2. On a global scale, in different regions or different time periods, the service quality of user network connections is sometimes very poor and unreliable.
For these reasons, data packets often do not arrive at the receiving end in their exact order; they arrive out of order and at irregular times, or are lost altogether. This causes the problems usually discussed in the real-time transmission field: network jitter, packet loss, and delay.
Packet loss, delay, and jitter are the three unavoidable problems of real-time transmission based on the Internet, whether it is transmission in a local area network, a single country or region, or a transnational or trans-regional transmission.
These network problems are distributed differently across regions. According to the actual network conditions monitored by Agora, even in China, where networks are relatively good, 99% of audio interactions need to deal with packet loss, jitter, and network delay. Among these audio sessions, 20% see more than 3% packet loss due to network problems, and 10% see more than 8%. India performs quite differently: packet loss affects roughly 40% of sessions across 80% of audio interactions, and optimizing service quality over India's 2G/3G networks remains a focus of providing audio services there.
Jitter, delay, and bandwidth impose many constraints as well. These network problems cause audio quality to decline sharply; worse, they can hurt the intelligibility of the audio signal, failing the essential communication requirement of conveying information. Therefore, whether for a team building its own stack on WebRTC or for an SDK that provides real-time services, repairing the damage these problems introduce into the audio signal is a compulsory subject.
Packet loss control
To ensure reliable real-time interaction, packet loss must be dealt with. If continuous audio data cannot be delivered, users hear glitches and gaps, which degrade call quality and user experience.
The packet loss problem can be abstracted as: how do we achieve reliable transmission over an unreliable network? Two error-correction mechanisms, forward error correction (FEC) and automatic repeat request (ARQ), are usually used, with the corresponding strategies driven by accurate channel-state estimation.
With FEC, the sending end applies channel coding and transmits redundant information; the receiving end detects packet loss and recovers most lost packets from the redundancy without retransmission. In other words, extra channel bandwidth is the price paid for recovering lost packets. Compared with ARQ, FEC recovers losses with lower delay, but because redundant packets are sent, it consumes more channel bandwidth.
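A minimal FEC sketch, using a single XOR parity packet per group of equal-size data packets. Production schemes (e.g. Reed-Solomon codes) can recover multiple losses per group; the names here are illustrative, not any real library's API.

```python
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def add_parity(group):
    """Sender: append one XOR parity packet to a group of equal-size packets."""
    return group + [reduce(xor_bytes, group)]

def recover(received):
    """Receiver: rebuild at most one lost packet (marked None) from the parity."""
    lost = [i for i, p in enumerate(received) if p is None]
    if len(lost) > 1:
        raise ValueError("a single XOR parity recovers at most one loss")
    if lost and lost[0] < len(received) - 1:      # a data packet was lost
        present = [p for p in received if p is not None]
        received = list(received)
        received[lost[0]] = reduce(xor_bytes, present)
    return received[:-1]                          # drop the parity packet

group = [b"aaaa", b"bbbb", b"cccc"]
sent = add_parity(group)
sent[1] = None                   # one packet lost in transit
assert recover(sent) == group    # recovered without any retransmission
```

Note the trade-off the article describes: the parity packet adds 1/N overhead in bandwidth, but no round trip is needed to recover the loss.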
ARQ uses acks (acknowledgement signals sent back by the receiver to indicate that a packet was correctly received) and timeouts: if the sender does not receive an ack before the timeout, a sliding-window protocol helps it decide whether to retransmit the packet, repeating until an ack arrives or a predefined retransmission limit is exceeded. ARQ therefore adds delay (it must wait for acks or keep retransmitting), and its bandwidth utilization is modest.
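A stop-and-wait ARQ sketch. Real senders use sliding windows and RTT-based timeout timers; here a "timeout" is simply a delivery attempt that returns no ack, and all names are invented for illustration.

```python
def send_with_arq(packets, channel, max_retries=3):
    """Retransmit each packet until an ack arrives or the retry budget
    is spent; returns delivered sequence numbers and total send attempts."""
    delivered, attempts = [], 0
    for seq, payload in enumerate(packets):
        for _ in range(1 + max_retries):
            attempts += 1
            if channel(seq, payload):    # True iff packet and its ack survive
                delivered.append(seq)
                break
    return delivered, attempts

# Deterministic lossy channel: the first attempt for packet 1 is "lost".
calls = {}
def flaky(seq, payload):
    calls[seq] = calls.get(seq, 0) + 1
    return not (seq == 1 and calls[seq] == 1)

delivered, attempts = send_with_arq([b"a", b"b", b"c"], flaky)
assert delivered == [0, 1, 2]    # everything arrived...
assert attempts == 4             # ...at the cost of one extra transmission
```

The extra attempt is exactly the delay/bandwidth cost described above: each recovery via ARQ costs at least one round trip that FEC avoids.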
Simply put, the FEC and ARQ methods used in packet loss control recover lost packets at the cost of extra channel bandwidth and delay. That is the status quo of traditional anti-packet-loss methods; what feasible alternatives exist?
Take SOLO, the codec Agora open-sourced earlier, as an example. Usually a codec's job is compression, i.e. removing redundancy, while resisting packet loss is to some extent channel processing: an extension of error-correction algorithms that works by adding redundancy. Agora SOLO's strategy is to combine the two, adding redundancy to key information while removing more redundancy from non-key information, thereby achieving the effect of joint source and channel coding.
Delay and jitter control
Packets acquire delay and jitter as they are transmitted and queued on the network, and the packets we recover through packet loss control introduce additional delay and jitter of their own. An adaptive de-jitter buffer is usually used to counter this and keep audio and other media streams playing continuously.
As mentioned above, variation in packet delay, which we call jitter, is the difference in end-to-end one-way delay between packets of an audio or other media stream. The adaptive logic is based on delay estimates derived from packet inter-arrival times (IAT). Stuttering occurs when packets are not restored by packet loss control, or when jitter, delay, or burst loss exceeds what the adaptive buffer can absorb. At that point the receiving end generally uses a PLC (Packet Loss Concealment) module to predict new audio data and fill the discontinuity left by the missing audio.
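The IAT-based delay estimate and the PLC fallback can be sketched as follows. The estimator below is a simplification of the smoothed statistics real adaptive jitter buffers maintain, and the "concealment" merely repeats the last good frame, whereas real PLC predicts new audio from pitch and spectral cues.

```python
def jitter_estimate(arrival_ms, send_interval_ms):
    """Mean absolute deviation of inter-arrival times (IAT) from the
    nominal send interval; a larger value means the adaptive buffer
    must hold more audio before playback."""
    iats = [b - a for a, b in zip(arrival_ms, arrival_ms[1:])]
    return sum(abs(iat - send_interval_ms) for iat in iats) / len(iats)

def play_out(frames, last_good=None):
    """Trivial PLC: fill a missing frame (None) by repeating the
    previous good one."""
    out = []
    for f in frames:
        if f is None:
            f = last_good                 # conceal the gap
        out.append(f)
        last_good = f
    return out

# Packets sent every 20 ms arrive unevenly at the receiver.
arrivals = [0, 21, 45, 58, 83, 100]
assert jitter_estimate(arrivals, 20) == 4.0
assert play_out([b"f0", b"f1", None, b"f3"]) == [b"f0", b"f1", b"f1", b"f3"]
```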
To sum up, dealing with network damage means using packet loss, delay, and jitter control on an unreliable channel to output packets in order as far as possible, and using PLC prediction to fill in whatever audio data is still missing.
To minimize network damage, the following five measures must be combined to push back the weak-network boundary:
1. Estimate the network channel state accurately, and dynamically adjust and apply packet loss control strategies;
2. Pair this with a matching de-jitter buffer that learns quickly and accurately enough to adapt to network instability (a good network getting worse, a bad network getting better, sudden bursts of weak network), keeping the buffer just above the equivalent steady-state delay; this keeps the listener's sound quality good and the delay low in the instantaneous network environment, gradually approaching the theoretical optimum;
3. When the recoverable weak-network boundary is exceeded, reduce the bitrate (a measure also commonly used against channel congestion) to free channel bandwidth for more redundant data or more retransmissions;
4. Use the PLC's ability to adapt to the input signal so that, for different speakers and under time-varying background noise, perceptible artifacts are reduced as much as possible;
5. Under tighter bandwidth, use the encoder to produce low-bitrate, high-quality speech, combined with point 3, to increase robustness against weak networks when service quality is poor.
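How points 1, 3, and 5 interact can be sketched as a toy controller that trades source bitrate against FEC redundancy within a fixed bandwidth budget. All numbers and thresholds here are invented for illustration; they are not Agora's actual policy.

```python
def adapt(loss_rate, bandwidth_kbps, min_bitrate=6, max_bitrate=64):
    """As measured loss grows, spend more of the fixed bandwidth budget
    on FEC redundancy and less on the source bitrate."""
    redundancy = min(0.5, 2 * loss_rate)         # cap redundancy at 50%
    bitrate = bandwidth_kbps / (1 + redundancy)  # source share of the budget
    return max(min_bitrate, min(max_bitrate, round(bitrate))), redundancy

# Clean network: nearly all bandwidth goes to the codec.
assert adapt(0.0, 40) == (40, 0.0)
# Heavy loss: the bitrate drops so that redundancy fits in the same budget.
bitrate, redundancy = adapt(0.30, 40)
assert redundancy == 0.5 and bitrate < 40
```

This is where a codec that stays intelligible at low bitrates (point 5) pays off: the controller can push the source bitrate down without losing the conversation.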
With the above countermeasures against packet loss, delay, and jitter, we can provide a better real-time audio interactive experience over the Internet. As noted before, network delay, jitter, and packet loss differ across regions, time periods, and networks. The Agora SDK provides high-quality audio interaction worldwide, letting users everywhere interact in real time online while the audio engine brings the experience as close to offline acoustics as possible. We have therefore run many field tests, observing the SDK's MoS scores (ITU-T P.863) and delay figures.
Below are the MoS scores and delay data of the Agora RTC SDK and other vendors' products, tested at the same time on the Shanghai Middle Ring (Zhonghuan) loop, with the same devices, on the same operator's network. Statistically, the real-time audio service provided by the Agora SDK delivers higher sound quality at lower latency.
Figure: MoS score comparison
Figure: delay data comparison
The MoS comparison chart shows that the Agora SDK's scores are concentrated in the high [4.5, 4.7] interval, while competitors' scores mostly fall in [3.4, 3.8]. One more data point may make this more intuitive: although WeChat is not the same kind of product as an RTC SDK, it also provides voice calls, and the highest MoS score we measured for WeChat in a non-weak-network environment was 4.19.
The audio quality actually experienced by users is shown by the color of the dots in the quality maps below: green means a MoS score above 4.0, yellow a score in [3.0, 4.0], and red a score in [1.0, 3.0).
Figure: audio quality of the Agora SDK on the Shanghai Middle Ring loop
Figure: audio quality of a competitor's SDK on the Shanghai Middle Ring loop
Summary
Packet loss, delay, and jitter are inevitable problems in real-time interactive scenarios. Moreover, these problems not only keep shifting with factors such as the network environment, time period, and user devices, but will also change in new ways as underlying technology develops (for example, the large-scale rollout of 5G). Our optimization strategies must therefore be iterated as well.
Having followed the audio signal from the sending end through the network to the receiving end, the optimization of the audio experience is still not over. In pursuit of a "high sound quality experience", we further optimize sound quality on the device side. In the next article we will explore that tip of the iceberg in detail, so stay tuned.
Related reading
Detailed explanation of low latency and high sound quality: Codec articles
Detailed explanation of low latency and high sound quality: echo cancellation and noise reduction