Social anxiety is common, yet people love to chat. Voice social products, typified by voice chat rooms, have opened up new ways of socializing with strangers and keep reaching audiences beyond their original niche.

However, audio that stutters, cuts in and out, or plays back too fast or too slow seriously degrades the user experience and directly drives users away. These are common problems caused by weak networks.
This article analyzes the commonly used techniques for coping with weak networks from the perspective of audio applications, mainly:

  • Forward Error Correction (FEC, RED, etc.)
  • Backward error correction techniques (ARQ, PLC, etc.)
  • Encoder anti-weak network characteristics (this article focuses on the characteristics of OPUS encoder)
  • Anti-jitter technology (JitterBuffer)

The topic is split across two articles. Drawing on the weak-network audio techniques used or supported in WebRTC, they analyze the categories above, with the goal of keeping audio communication services highly available on weak networks.

The first part covers forward error correction, backward error correction, and the weak-network features of the OPUS codec; the second part covers NetEQ, the anti-jitter module used by WebRTC.

Forward Error Correction

FEC

The most typical forward error correction technique is FEC.
Sender: generates redundant packets to counter packet loss during transmission.
Receiver: uses the redundant packets and the source packets it did receive to recover the packets lost in transit.

FEC is divided into in-band and out-of-band. In WebRTC, video generates redundant packets through out-of-band FEC (ULPFEC[1], FLEXFEC[2]), and audio generates redundant packets through OPUS in-band FEC.

Because in-band FEC consumes part of the encoding bit rate, it lowers audio quality. Out-of-band FEC does not affect audio quality, but it takes up additional network bandwidth. Each approach has its own trade-offs.

Typical FEC coding schemes are XOR and Reed-Solomon [3]. The out-of-band FEC in WebRTC (ULPFEC and FLEXFEC) uses XOR coding, which needs relatively little computation but offers limited protection against packet loss.

In WebRTC, out-of-band FEC, whether ULPFEC or FLEXFEC, uses a mask to determine which source RTP packets each FEC packet protects. Two types of mask are defined, RandMask and BurstMask. The former protects better against random packet loss; the latter works better for consecutive packet loss caused by bursts. Either type has its shortcomings. Take kMaskBursty7_4 (7 original packets generate 4 redundant packets) as an example:

#define kMaskBursty7_4 \
0x38, 0x00, \
0x8a, 0x00, \
0xc4, 0x00, \
0x62, 0x00

Expand the above hexadecimal into binary as follows:

Packet: S1 S2 S3 S4 S5 S6 S7
R1:      0  0  1  1  1  0  0   original packets S3, S4, S5 are protected by redundant packet R1
R2:      1  0  0  0  1  0  1   original packets S1, S5, S7 are protected by redundant packet R2
R3:      1  1  0  0  0  1  0   original packets S1, S2, S6 are protected by redundant packet R3
R4:      0  1  1  0  0  0  1   original packets S2, S3, S7 are protected by redundant packet R4

The mask above means that, for the 7 original packets S1-S7, the sender generates 4 redundant packets R1-R4, where:

R1 protects the three original packets S3, S4, S5
R2 protects the three original packets S1, S5, S7
R3 protects the three original packets S1, S2, S6
R4 protects the three original packets S2, S3, S7
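The expansion from mask bytes to protection sets can be checked mechanically. The following is a minimal Python sketch, not WebRTC code: the helper name `protected_packets` is hypothetical, and it assumes each mask row is two bytes whose most significant bits map to S1, S2, ... in order, as in the binary expansion above.

```python
# Rows of kMaskBursty7_4, copied from the macro above (two bytes per row).
K_MASK_BURSTY_7_4 = [
    (0x38, 0x00),
    (0x8A, 0x00),
    (0xC4, 0x00),
    (0x62, 0x00),
]

def protected_packets(row, num_media=7):
    """Return the 1-based indices of the source packets covered by one mask row."""
    bits = (row[0] << 8) | row[1]          # 16-bit row, most significant bit is S1
    return [i + 1 for i in range(num_media) if bits & (0x8000 >> i)]

protection = {f"R{r + 1}": protected_packets(row)
              for r, row in enumerate(K_MASK_BURSTY_7_4)}
for name, covered in protection.items():
    print(name, "protects", ["S%d" % i for i in covered])
```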

As the list shows, every original packet is protected by at least one redundant packet, so a lost packet can usually be recovered from a redundant packet and the original packets that did arrive. For example, the sender sends S1-S7 and R1-R4, 11 packets in total. The receiver gets 8 of them (S1, S3, S5, S7, R1, R2, R3, R4) and loses S2, S4, and S6. These three packets are then repaired as follows:

S2 can be repaired from R4, S3, and S7: S2 = R4 XOR S3 XOR S7

S4 can be repaired from R1, S3, and S5: S4 = R1 XOR S3 XOR S5

S6 can be repaired from R3, S1, and S2 (once S2 has been repaired): S6 = R3 XOR S1 XOR S2
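The three repairs can be reproduced on toy data. This is an illustrative Python sketch only: the payloads are made up, `xor_bytes` is a hypothetical helper, and real FEC packets also protect RTP header fields, which are ignored here.

```python
from functools import reduce

# Protection relationships from the mask above (1-based source packet indices).
PROTECTION = {"R1": [3, 4, 5], "R2": [1, 5, 7], "R3": [1, 2, 6], "R4": [2, 3, 7]}

def xor_bytes(*blocks):
    """XOR equal-length byte strings together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Made-up equal-length payloads standing in for S1..S7.
source = {i: bytes([i * 17 % 256] * 4) for i in range(1, 8)}
# Each redundant packet is the XOR of the source packets it protects.
redundant = {name: xor_bytes(*(source[i] for i in covered))
             for name, covered in PROTECTION.items()}

# The receiver got S1, S3, S5, S7 and R1..R4; S2, S4, S6 were lost.
received = {i: source[i] for i in (1, 3, 5, 7)}
received[2] = xor_bytes(redundant["R4"], received[3], received[7])
received[4] = xor_bytes(redundant["R1"], received[3], received[5])
received[6] = xor_bytes(redundant["R3"], received[1], received[2])  # needs the repaired S2
```

Note that S6 can only be repaired after S2, because R3 covers both of them.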

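However, some loss patterns cannot be repaired. For example, if S1, S2, and S7 are all lost, none of them can be recovered, for the following reasons:

According to the mask, S1 can be recovered from R2, S5, S7 or from R3, S2, S6. Because S7 and S2 are lost, recovering S1 requires recovering S2 or S7 first.

Likewise, S2 can be recovered from R3, S1, S6 or from R4, S3, S7, but because S1 and S7 are lost, one of them must be recovered first.

In the same way, S7 can be recovered from R2, S1, S5 or from R4, S2, S3, but S1 or S2 must be recovered first.

So, from the analysis above, none of S1, S2, and S7 can be recovered.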

Similarly, if S3, S5, and S7 are lost, they cannot be recovered. This is the technical weakness of using masks to define which original packets each redundant packet protects in WebRTC.

That is, for a group of (M original packets + N redundant packets), even when N or fewer packets are lost, it may not be possible to recover them.
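Whether a given loss pattern is repairable can be checked with a small simulation. The sketch below is an assumption-laden model, not the WebRTC decoder: it only tracks packet indices, the function name `unrecoverable` is hypothetical, and it repairs one packet at a time whenever a redundant packet covers exactly one missing source packet.

```python
PROTECTION = {"R1": {3, 4, 5}, "R2": {1, 5, 7}, "R3": {1, 2, 6}, "R4": {2, 3, 7}}

def unrecoverable(lost_sources, lost_redundant=()):
    """Return the source packets that stay lost after iterative XOR repair."""
    missing = set(lost_sources)
    usable = [covered for name, covered in PROTECTION.items()
              if name not in lost_redundant]
    progress = True
    while missing and progress:
        progress = False
        for covered in usable:
            gap = covered & missing
            if len(gap) == 1:          # exactly one unknown: XOR recovers it
                missing -= gap
                progress = True
    return missing

print(unrecoverable({2, 4, 6}))   # everything is repaired
print(unrecoverable({1, 2, 7}))   # all three stay lost
print(unrecoverable({3, 5, 7}))   # all three stay lost
```

In both failing cases only three packets are lost while four redundant packets arrive, which illustrates the limitation stated above.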

Reed-Solomon (RS) coding, by contrast, guarantees that for a group of (M original packets + N redundant packets), the lost packets can be recovered whenever N or fewer packets are lost.

RS FEC mainly encodes and decodes with a Vandermonde matrix or a Cauchy matrix [4]. The Cauchy matrix requires less computation than the Vandermonde matrix and therefore performs better. Either way, the matrices used share one property: they are invertible, and so is any square submatrix, which is what ensures that RS can recover the data when N or fewer packets are lost.

The following briefly describes the Vandermonde approach. Take (7, 4) as an example: 7 original packets generate 4 redundant packets, the original packets are (S1, S2, S3, S4, S5, S6, S7), and the redundant packets are (R1, R2, R3, R4). The relationship between the original packets and the redundant packets is as follows:
[Figure: Equation 1, relating the original packets to the redundant packets through the identity matrix and the Vandermonde matrix]

where the Vandermonde matrix in the equation above is denoted A:
[Figure: the Vandermonde matrix A]

The identity matrix is:
[Figure: the identity matrix]

Suppose packets S2 and S4 are lost. Deleting the corresponding rows of the identity matrix in Equation 1 gives:
[Figure: Equation 2, Equation 1 with the rows for S2 and S4 removed]

The matrix on the left side of Equation 2 is denoted B:
[Figure: the matrix B]

Because of the invertibility property of the Vandermonde matrix, B is also invertible. Denote its inverse by B'; recovering the packets then amounts to computing B'. Rearranging Equation 2 gives the original packets:
[Figure: the original packets expressed as B' multiplied by the received packets]

That is, any packet in (S1, S2, S3, S4, S5, S6, S7) can be recovered from the matrix B' and the received packets. RS therefore provides stronger protection than the XOR masks.
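The matrix-based recovery can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a production RS codec: it uses a Cauchy matrix rather than a Vandermonde matrix, works over the prime field GF(257) instead of the GF(2^8) arithmetic real implementations normally use, and treats each packet as a single integer symbol. The names `encode`, `solve`, and `decode` are hypothetical, and `pow(x, -1, P)` requires Python 3.8 or later.

```python
P = 257          # small prime field, chosen only to keep the arithmetic simple
M, N = 7, 4      # 7 source packets, 4 redundant packets

# Cauchy matrix A[i][j] = 1 / (x_i - y_j) mod P, with all x_i and y_j distinct.
X = [M + i for i in range(N)]      # 7, 8, 9, 10
Y = list(range(M))                 # 0 .. 6
A = [[pow(x - y, -1, P) for y in Y] for x in X]

def encode(source):
    """Return the N redundant symbols R = A * S (mod P)."""
    return [sum(a * s for a, s in zip(row, source)) % P for row in A]

def solve(matrix, rhs):
    """Solve a square linear system mod P by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] % P)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, P)
        aug[col] = [v * inv % P for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(v - factor * w) % P for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def decode(received_source, received_redundant):
    """Rebuild all M source symbols from any M received symbols.

    Both arguments map a 0-based packet index to the received symbol.
    """
    rows, rhs = [], []
    for j, s in received_source.items():          # surviving identity rows
        rows.append([1 if k == j else 0 for k in range(M)])
        rhs.append(s)
    for i, r in received_redundant.items():       # rows of the coding matrix
        rows.append(A[i])
        rhs.append(r)
    return solve(rows[:M], rhs[:M])               # plays the role of B and B'

source = [10, 20, 30, 40, 50, 60, 70]             # S1..S7, one symbol each
redundant = encode(source)                        # R1..R4
# Lose S1, S2, S7 (a pattern the XOR mask could not repair) and also R4.
survivors = {j: source[j] for j in (2, 3, 4, 5)}
parity = {i: redundant[i] for i in (0, 1, 2)}
recovered = decode(survivors, parity)
```

A Cauchy matrix is used here because every square submatrix of it is invertible, which makes the "any M received packets suffice" property easy to rely on in a short example.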

RED[5]

RED is also a form of forward error correction. The sender proactively sends redundant data, which to some extent counters packet loss in the transport network; the decoding end can then recover lost packets from the redundancy. RED is specified in RFC2198 and can be used to generate redundant packets for both video and audio. WebRTC enabled RED for audio in M96.

A RED payload contains not only the current packet but also earlier packets, so the payload carries a degree of redundancy that protects against packet loss.

The following briefly introduces the RED encapsulation format, starting with the RED block header:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F|   block PT  |  timestamp offset         |   block length    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   F: 1 means more blocks follow the current block; 0 means the current block is the last one
   block PT: the payload type of the current block
   timestamp offset: the offset of this block's timestamp relative to the timestamp in the RTP header
   block length: the length of the current block, not including the block header

    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+
   |0|   Block PT  |
   +-+-+-+-+-+-+-+-+
   Header of the last block
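To make the bit layout concrete, here is a minimal Python sketch that parses the block headers at the start of a RED payload. It follows the field widths shown above (1 + 7 + 14 + 10 bits, then a 1-byte final header); the function name `parse_red_headers`, the dictionary keys, and the sample values are hypothetical, and no error handling is attempted.

```python
def parse_red_headers(payload):
    """Parse RED block headers; return (headers, offset of the first block's data)."""
    headers, pos = [], 0
    while True:
        first = payload[pos]
        follow = first >> 7                # F bit
        block_pt = first & 0x7F            # 7-bit block payload type
        if not follow:                     # final 1-byte header
            headers.append({"pt": block_pt, "ts_offset": 0, "length": None})
            return headers, pos + 1
        word = int.from_bytes(payload[pos:pos + 4], "big")
        headers.append({
            "pt": block_pt,
            "ts_offset": (word >> 10) & 0x3FFF,   # 14 bits
            "length": word & 0x3FF,               # 10 bits
        })
        pos += 4

# Sample payload: one redundant block (PT 7, 14 bytes) and a primary block (PT 5).
first_header = (1 << 31) | (7 << 24) | (160 << 10) | 14
payload = first_header.to_bytes(4, "big") + bytes([5]) + bytes(14) + bytes(84)
headers, data_offset = parse_red_headers(payload)
```

The primary block's length is not stored in its header; it is whatever remains after the redundant blocks, which is why the sketch records it as `None`.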

The following is an example of a RED packet:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X| CC=0  |M|      PT     |   sequence number of primary  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              timestamp of primary encoding                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1| block PT=7  |  timestamp offset         |   block length    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| block PT=5  |                                               |
   +-+-+-+-+-+-+-+-+                                               +
   |                                                               |
   +                LPC encoded redundant data (PT=7)              +
   |                (14 bytes)                                     |
   +                                               +---------------+
   |                                               |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               +
   |                                                               |
   +                                                               +
   |                                                               |
   +                                                               +
   |                                                               |
   +                                                               +
   |                DVI4 encoded primary data (PT=5)               |
   +                (84 bytes, not to scale)                       +
   /                                                               /
   +                                                               +
   |                                                               |
   +                                                               +
   |                                                               |
   +                                               +---------------+
   |                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This RTP packet is a RED encapsulation containing two blocks, one with block PT 7 and one with block PT 5; that is, the RTP packet carries two types of payload.

WebRTC uses RED encapsulation to generate redundant audio packets. The principle is roughly as follows:
[Figure: the sender transmits RED packets, each carrying the current packet and the previous packet]

In the figure above, each packet the sender transmits carries the current packet plus the previous packet as redundancy. If RED4 (which carries packets 4 and 3) is lost, the receiver can still recover both: packet 4 arrives in the following RED5 packet (which carries 5 and 4), and packet 3 was already delivered by the earlier RED3 packet (which carries 3 and 2).
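The behaviour described for the figure can be simulated directly. The Python sketch below is a toy model with one level of redundancy; the function names `red_stream` and `receive` and the frame labels are hypothetical.

```python
def red_stream(frames):
    """Pair every frame with the previous one, as the RED packets in the figure do."""
    packets, previous = [], None
    for seq, frame in enumerate(frames, start=1):
        blocks = {seq: frame}
        if previous is not None:
            blocks[seq - 1] = previous     # redundant copy of the previous frame
        packets.append(blocks)
        previous = frame
    return packets

def receive(packets, lost):
    """Collect every frame carried by the RED packets that were not lost."""
    frames = {}
    for number, blocks in enumerate(packets, start=1):
        if number not in lost:
            frames.update(blocks)
    return frames

packets = red_stream(["f1", "f2", "f3", "f4", "f5", "f6"])
after_single_loss = receive(packets, lost={4})      # RED4 lost
after_double_loss = receive(packets, lost={3, 4})   # RED3 and RED4 lost
```

With a single previous frame as redundancy, one isolated loss is fully repaired, while two consecutive losses leave frame 3 missing.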


Backward Error Correction

ARQ

ARQ is a packet loss retransmission technique: the receiver recovers lost packets by asking the sender to resend them.

Compared with forward error correction, ARQ introduces more delay, so it is the more appropriate choice when the network delay is small.

The principle is as follows:
[Figure: the receiver requests retransmission of lost packet 3 and the sender resends it]

When packet 3 is first sent, the receiver does not receive it, so it sends the sender a retransmission request for packet 3 (WebRTC uses the NACK RTCP packet for this). On receiving the request, the sender retransmits packet 3.

The following briefly introduces the NACK[6] RTCP format used in WebRTC. NACK RTCP is defined in RFC4585 and is a type of feedback message. The format is as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|   FMT   |       PT      |          length               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  SSRC of packet sender                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  SSRC of media source                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :            Feedback Control Information (FCI)                 :
   :                                                               :

   Figure 3: Common Packet Format for Feedback Messages

There are two main values of PT:

 Name | Value | Brief Description 
 ----------+-------+------------------------------------ 
 RTPFB | 205 | Transport layer FB message 
 PSFB | 206 | Payload-specific FB message

The format of the FCI message corresponding to NACK is as follows:

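    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            PID                |             BLP               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 4: Syntax for the Generic NACK message

For NACK, PT=RTPFB and FMT=1.
PID is the first seqnum covered by the current retransmission request.
BLP is 16 bits wide and describes the 16 consecutive seqnums that follow the one PID points to: 1 means the seqnum for that bit is lost and the receiver is requesting its retransmission; 0 means no retransmission is requested for that seqnum.
A NACK can carry multiple FCI entries.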
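The PID/BLP encoding can be illustrated with a short Python sketch. It only builds and parses the FCI entries, not the full RTCP packet; the function names `build_nack_fci` and `parse_nack_fci` are hypothetical, and sequence number wraparound at 65535 is deliberately ignored to keep the example short.

```python
import struct

def build_nack_fci(lost_seqs):
    """Pack lost RTP sequence numbers into (PID, BLP) pairs, 4 bytes each."""
    fci = b""
    remaining = sorted(set(lost_seqs))
    while remaining:
        pid = remaining[0]
        blp = 0
        rest = []
        for seq in remaining[1:]:
            distance = seq - pid
            if 1 <= distance <= 16:
                blp |= 1 << (distance - 1)   # least significant bit means PID + 1
            else:
                rest.append(seq)             # needs its own FCI entry
        fci += struct.pack("!HH", pid, blp)
        remaining = rest
    return fci

def parse_nack_fci(fci):
    """Expand (PID, BLP) pairs back into the requested sequence numbers."""
    lost = []
    for offset in range(0, len(fci), 4):
        pid, blp = struct.unpack_from("!HH", fci, offset)
        lost.append(pid)
        lost.extend(pid + i + 1 for i in range(16) if blp & (1 << i))
    return lost

fci = build_nack_fci([3, 5, 19, 40])
requested = parse_nack_fci(fci)
```

In this example packets 3, 5, and 19 fit into one entry (19 is exactly 16 after the PID), while 40 falls outside the bitmask and gets a second entry.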

PLC

PLC stands for packet loss concealment. It runs at the receiving end, that is, the decoding end: the decoder analyzes the historical speech frames and builds an LPC model from linear prediction coefficients to predict the lost speech frame. The technique is feasible because speech is similar to itself over short time spans. Its advantage is that it uses no additional bandwidth; PLC can handle small packet loss rates (<15%).

Packet loss concealment in NetEQ builds a model from the linear prediction coefficients (LPC) of the previous speech frame, reconstructs the speech signal from the historical speech signal, and then adds a certain amount of random noise.

When concealment runs over several consecutive lost packets, the same LPC coefficients are reused to reconstruct the speech signal. The correlation between consecutive reconstructed signals needs to be reduced, so the energy of the packets generated by concealment decreases.

Finally, the signal must be smoothed to keep the speech continuous. When packet loss compensation is needed, the most recent frame is taken from the speech buffer that holds the last 70 ms of audio, and the LPC coefficients of that frame are computed.

Both the NetEQ module of WebRTC and the OPUS decoder provide PLC. If the decoder supports PLC, the decoder's PLC is used in preference; otherwise NetEQ's PLC is used. The next article introduces the NetEQ module in more detail.


Encoder OPUS anti-weak network characteristics [7]

OPUS is not only an open-source, patent-free codec; it also outperforms other codecs, which is why WebRTC audio usually uses it.

Some features of OPUS that are very helpful on weak networks are described below.

Supports full frequency bandwidth

The bit rates supported by OPUS range from narrow-band 6kbps up to high-quality stereo at 510kbps. The figure below shows that OPUS covers everything from narrow-band to high-quality full-band audio, with higher quality at the same bit rate.
[Figure: quality versus bit rate for OPUS and other codecs]

OPUS supports dynamic bit rate adjustment

The bit rate can be adjusted seamlessly, and at the same bit rate OPUS delivers higher sound quality. In addition, when FEC is enabled and the packet loss rate exceeds a threshold (the expression (128-voice_est)>>4 in the code below), the encoder switches to SILK mode, the low-bit-rate mode, to adapt to network conditions.

// Bit rate setter; the bit rate can be adjusted dynamically through this interface
WebRTCOPUS_SetBitRate

/* When FEC is enabled and there's enough packet loss, use SILK */
if (st->silk_mode.useInBandFEC && st->silk_mode.packetLossPercentage > (128-voice_est)>>4)
      st->mode = MODE_SILK_ONLY;

OPUS lower latency

OPUS combines two codec technologies, SILK (for speech) and CELT (for music), with the advantage of low latency.

This is essential for use as part of a low-latency audio communication link, and OPUS can reduce algorithmic latency to 5 ms at the expense of voice quality.

Existing music codecs such as MP3, Vorbis, and HE-AAC have latencies of 100ms or more, while OPUS has much lower latency yet comparable quality for a given bit rate, as shown in the following figure:
[Figure: latency versus quality and bit rate for OPUS and other codecs]

OPUS supports in-band FEC

OPUS supports in-band FEC. With FEC enabled, redundant data is generated according to the packet loss rate to improve the audio's resistance to packet loss.

In-band FEC in OPUS works much like RED: each packet that is sent also carries the content of the previous packet, except that the previous packet is re-encoded at a low bit rate to form the redundant data, roughly as follows:

|1| | -> |2|1| -> |3|2| -> |4|3| -> |5|4| -> |6|5|

The following are several interfaces related to OPUS and FEC:

// Enable the built-in FEC
WebRTCOPUS_EnableFec
// Pass the packet loss rate to OPUS
WebRTCOPUS_SetPacketLossRate

// Decide, from the packet loss rate and useInBandFEC, whether to turn on
// low-bit-rate coding, i.e. re-encode the previous speech frame at a low
// bit rate to generate the redundant data
st->silk_mode.LBRR_coded = decide_fec(st->silk_mode.useInBandFEC,
             st->silk_mode.packetLossPercentage, st->silk_mode.LBRR_coded, st->mode, &st->bandwidth, equiv_rate);

// Allocate the SILK rate depending on whether FEC is in use
static int compute_silk_rate_for_hybrid(int rate, int bandwidth, int frame20ms, int vbr, int fec)


/* Low-Bitrate Redundancy (LBRR) encoding.
Reuse all parameters but encode excitation at lower bitrate */
static OPUS_INLINE void silk_LBRR_encode_FLP(
    silk_encoder_state_FLP *psEnc, /* I/O Encoder state FLP */
    silk_encoder_control_FLP *psEncCtrl, /* I/O Encoder control FLP */
    const silk_float xfw[], /* I Input signal */
    OPUS_int condCoding /* I The type of conditional coding used so far for this frame */
)

Note that the built-in FEC data of OPUS is generated only in SILK mode; no redundant data is generated in CELT-only encoding mode.

if (st->mode == MODE_CELT_ONLY)
    redundancy = 0;

if (redundancy)
{
    redundancy_bytes = compute_redundancy_bytes(max_data_bytes, st->bitrate_bps, frame_rate, st->stream_channels);
    if (redundancy_bytes == 0)
        redundancy = 0;
}

The FEC function in WebRTC is enabled through SDP negotiation, as follows:

 a=rtpmap:111 OPUS/48000/2 
a=fmtp:111 minptime=10;useinbandfec=1
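Whether in-band FEC was negotiated can be read straight from the `a=fmtp` line. The following is a minimal Python sketch with a hypothetical helper name `parse_fmtp`; it handles only the simple `key=value;key=value` form shown above and is not a full SDP parser.

```python
def parse_fmtp(line):
    """Split an a=fmtp line into (payload type, {parameter: value})."""
    header, _, params = line.strip().partition(" ")
    payload_type = int(header.split(":", 1)[1])
    pairs = (item.split("=", 1) for item in params.split(";") if "=" in item)
    return payload_type, {key.strip(): value.strip() for key, value in pairs}

payload_type, params = parse_fmtp("a=fmtp:111 minptime=10;useinbandfec=1")
fec_enabled = params.get("useinbandfec") == "1"
```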

The following figure compares OPUS with FEC turned on and with FEC turned off [8]:
[Figure: audio MOS versus packet loss rate for OPUS with and without FEC]

As the figure shows, with FEC turned on the audio MOS value still improves markedly even at 20% packet loss.

OPUS decoder supports PLC

The OPUS decoder supports packet loss concealment. The principle is to analyze the speech signal of the previous frame, whether it was received normally or recovered, and, relying on the short-term similarity of speech signals, to reconstruct and predict the lost speech frame.

 int WebRTCOPUS_Decode(OPUSDecInst* inst, const uint8_t* encoded, 
                      size_t encoded_bytes, int16_t* decoded, 
                      int16_t* audio_type) { 
  int decoded_samples; 

  if (encoded_bytes == 0) { 
    *audio_type = DetermineAudioType(inst, encoded_bytes); 
    decoded_samples = WebRTCOPUS_DecodePlc(inst, decoded, 1); 
  } else { 
   ... 
  }

OPUS voice function supports DTX

When not in music mode, that is, in VoIP mode, DTX can be turned on to save bandwidth whenever no speech is detected for a certain period of time.

In that state, while no speech is detected, OPUS sends a silence packet periodically, once every 400ms, to reduce bandwidth. WebRTC does not enable this feature by default. To enable DTX, add usedtx=1 to the a=fmtp line during SDP negotiation.

 WebRTCOPUS_EnableDtx  
WebRTCOPUS_
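As a rough illustration of the SDP change, the Python sketch below appends `usedtx=1` to the `a=fmtp` line of a given payload type. The function name `enable_dtx` and the default payload type 111 are assumptions made for the example; a real application would normally use its SDP library rather than string editing.

```python
def enable_dtx(sdp, payload_type=111):
    """Append usedtx=1 to the a=fmtp line of the given payload type, if absent."""
    prefix = f"a=fmtp:{payload_type} "
    lines = []
    for line in sdp.splitlines():
        if line.startswith(prefix) and "usedtx=" not in line:
            line += ";usedtx=1"
        lines.append(line)
    return "\r\n".join(lines) + "\r\n"

sdp = "a=rtpmap:111 OPUS/48000/2\r\na=fmtp:111 minptime=10;useinbandfec=1\r\n"
munged = enable_dtx(sdp)
```

Calling the function a second time leaves the SDP unchanged, so it is safe to apply repeatedly.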

OPUS itself has many features for coping with weak networks. Combined with packet loss retransmission, these features give audio strong resilience on weak networks.


Drawing on practical experience with weak-network handling, this article has briefly explained and summarized some common weak-network techniques for audio, covering forward error correction, backward error correction, and the characteristics of the OPUS encoder itself.

Weak-network handling also relies on another key technique, anti-jitter, which is covered in detail in the next article of the series.

References:

[1]: https://datatracker.ietf.org/doc/html/rfc5109
[2]: https://datatracker.ietf.org/doc/html/draft-ietf-payload-flexible-fec-scheme-03
[3]: https://tex2e.github.io/rfctranslater/html/rfc5510.html
[4]: https://www.scirp.org/pdf/6-2.16.pdf
[5]: https://datatracker.ietf.org/doc/html/rfc2198
[6]: https://tex2e.github.io/rfc-translater/html/rfc4585.html
[7]: https://ja.wikipedia.org/wiki/OPUS_(%E9%9F%B3%E5%A3%B0%E5%9C%A7%E7%B8%AE)
[8]: https://www.OPUScodec.org/static/presentations/OPUS_voice_aes135.pdf

融云RongCloud