This article is part of the "Dev for Dev Column" series. The author is Wang Rui of the Network Experience team at Shengwang.
01 Background
In real-time audio and video calls, quality is strongly affected by network packet loss, and video suffers the most.

Why is video more sensitive to packet loss? Generally, the original bit rate of audio is much lower than that of video, so audio encoders compress far less aggressively than video encoders. Audio frames are also usually encoded and decoded independently, so the loss of one frame does not affect the decoding of other frames.

Video, in order to achieve a higher compression ratio, typically uses residual (predictive) coding to remove large amounts of spatial and temporal redundancy, which means correct decoding depends on reference frames. The loss of a single frame can leave the subsequent frames that depend on it undecodable. It is this inter-frame dependency that makes video more sensitive to transmission packet loss.
Common packet loss includes congestion loss inside the IP transport network and wireless loss near the user side. Congestion loss is generally caused by bottleneck nodes with limited processing capacity and may appear as random or bursty loss. Wireless loss is generally caused by channel interference and often appears as bursty loss. In practice there are many causes of packet loss, and their patterns differ.

Congestion control algorithms can avoid congestion-induced loss to some extent, but they cannot solve the packet loss problem in all scenarios.
After packet loss occurs, the following two remedies are usually used:
(1) Retransmission (Automatic Repeat-reQuest, ARQ)

Retransmission is an efficient remedy, but it relies on a feedback channel, and its efficiency is very sensitive to the network RTT; under heavy packet loss, multiple retransmissions may be required.
(2) Forward Error Correction (FEC)
Forward error correction does not require a feedback channel, and its efficiency is unaffected by the network RTT.
In one-to-one interconnection scenarios, retransmission is an appropriate choice if the network RTT is low. However, when the RTT is large, the efficiency of retransmission will be significantly reduced. In addition, in large-scale multicast and broadcast applications, excessive retransmission requests may exacerbate network congestion and packet loss.
In these cases, FEC is the more appropriate choice: the sender adds redundant encoding, and the receiver automatically recovers lost packets from the redundant data. Because FEC does not depend on a feedback channel and its error-correction efficiency is unaffected by RTT, it effectively avoids the high delay that retransmission causes in high-RTT environments. This article briefly introduces FEC technology in real-time video transmission and presents Shengwang's best practices.
02 Introduction to basic concepts
1. Erasure channel

If data sent by the sender is transmitted over a channel in which the positions of lost data are known to the receiver, the channel model is called an erasure channel. An Internet-style packet-switched network is a typical erasure channel; under this model, all data that is received is considered correct.
2. Error detection code, error correction code and erasure correction code
In channel coding, error-control codes can be divided into three types according to purpose: error detection codes, error correction codes, and erasure codes.

An error detection code checks whether data was corrupted during transmission; common examples include simple parity checks and cyclic redundancy checks (CRC). An error correction code can not only detect transmission errors but also correct them: from its perspective, the data received over an error channel is unreliable and must be decoded to locate and fix errors. Common error correction codes include Hamming codes and BCH codes (see Reference 5).

An erasure code can be considered a special error correction code whose channel is an erasure channel: the receiver knows the positions of the lost data, and everything it does receive is considered correct, so encoding and decoding are simpler than for general error correction codes.

Traditional error correction coding is a mature technology and is usually applied at the physical and data-link layers of network protocols to provide a reliable link-level channel. Erasure codes are mostly built on error-correction theory and are often used for FEC at the application layer.
3. RS code
In application-layer FEC for video transmission, the RS (Reed-Solomon) code is a common algorithm. RS codes are an important class of linear block codes with excellent performance at short code lengths, operating over finite fields. Linear block codes are usually denoted (n, k). In bit-level coding, k is the number of information bits and n is the code length in bits; in application-layer FEC, coding is applied per packet, so k and n count packets instead. For an RS erasure code of length n, receiving any k of the n packets is enough to recover all n.
For an (n, k) linear erasure code, encoding can be expressed as the matrix operation y = x · G, where x is the information vector, y is the codeword vector, and G is the generator matrix.
In RS codes, the common generator matrices are the Vandermonde matrix and the Cauchy matrix, shown in the figures below. Their key property is that any k rows form an invertible k × k matrix, so any k received codeword symbols can always be used to recover the remaining n − k.
■ Vandermonde matrix

■ Cauchy matrix
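To make y = x · G concrete, here is a minimal sketch of an (n, k) erasure code with a Vandermonde generator matrix. For readability it works over the small prime field GF(11) with plain mod-p arithmetic rather than GF(2^8); the field size, (n, k) values, and message are illustrative choices, not parameters from any real codec.

```python
# Toy (5, 3) erasure code over GF(11): encode with y = x * G, then recover
# the message from ANY 3 of the 5 encoded symbols (here, positions 0, 2, 4).
p = 11          # prime field size, so arithmetic is simply "mod p"
k, n = 3, 5     # 3 information symbols, 5 encoded symbols

# Vandermonde generator: G[i][j] = x_i ** j (mod p), with distinct x_i,
# so any k rows form an invertible k x k matrix.
xs = [1, 2, 3, 4, 5]
G = [[pow(x, j, p) for j in range(k)] for x in xs]   # n rows, k columns

def encode(msg):
    # y_i = sum_j msg[j] * G[i][j]  (mod p)
    return [sum(m * g for m, g in zip(msg, row)) % p for row in G]

def solve_mod_p(A, b):
    # Gaussian elimination mod p: solve A * x = b.
    A = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(k):
        piv = next(r for r in range(col, k) if A[r][col] % p != 0)
        A[col], A[piv] = A[piv], A[col]
        inv = pow(A[col][col], p - 2, p)             # Fermat inverse
        A[col] = [a * inv % p for a in A[col]]
        for r in range(k):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [(a - f * c) % p for a, c in zip(A[r], A[col])]
    return [row[k] for row in A]

msg = [4, 7, 2]
code = encode(msg)                                   # -> [2, 4, 10, 9, 1]
survivors = [0, 2, 4]                                # erase symbols 1 and 3
A = [G[i] for i in survivors]
b = [code[i] for i in survivors]
print(solve_mod_p(A, b))                             # -> [4, 7, 2]
```

Erasing any other two positions works the same way: the surviving rows of G always form an invertible system, which is exactly the decodability property described above.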
4. Finite field operations
A field is a set on which addition, subtraction, multiplication, and division are defined and always yield results within the set. If the number of elements is finite, it is called a finite field, or Galois field. Finite fields are widely used in cryptography and coding theory.

RS codes are non-binary erasure codes defined over the finite field GF(2^q); all RS code symbols are taken from GF(2^q). In general, finite fields can be constructed in the following two ways:

Theorem 1: For the set Zp = {0, 1, …, p−1}, if p is a prime number, then Zp with addition and multiplication modulo p forms a finite field.

Thus Z2, Z3, and Z5 are all finite fields, but Z8 is not, because 8 is not prime. To construct finite fields whose size is not a prime, primitive polynomials are needed.
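A quick check makes the prime requirement tangible: in a field every nonzero element needs a multiplicative inverse. The small helper below (an illustrative sketch, not part of any codec) shows that every nonzero element of Z7 has one, while 2, 4, and 6 in Z8 do not.

```python
# For each nonzero a in Z_n, find b with a * b == 1 (mod n), or None.
def inverses(n):
    return {a: next((b for b in range(1, n) if a * b % n == 1), None)
            for a in range(1, n)}

print(inverses(7))  # every element has an inverse -> Z_7 is a field
print(inverses(8))  # 2, 4, 6 map to None -> Z_8 is not a field
```

Intuitively, 2 · x mod 8 is always even, so it can never equal 1; division by 2 is simply undefined in Z8.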
Theorem 2: If p is a prime number and m is a positive integer, then the polynomials over GF(p), taken modulo a primitive polynomial of degree m, form a finite field with p^m elements, denoted GF(p^m).

A primitive polynomial is a monic irreducible polynomial whose roots generate the multiplicative group of the field. In practice, considering performance and algorithmic complexity, the RS algorithm is often implemented over GF(2^8). Several degree-8 primitive polynomials exist; one commonly used example is x^8 + x^4 + x^3 + x^2 + 1.
03 FEC encoding scheme for real-time video streaming
1. Convolutional codes and block codes
A block code requires the encoder to group the input data in advance: encoding starts only once a full block has accumulated. A convolutional code takes continuous input and produces continuous output, with no pre-grouping.

The output of a block code depends only on the current block: an (n, k) block code takes k inputs as a group and encodes n outputs that depend only on those k inputs. The output of an (n, k, L) convolutional code depends not only on the current k inputs but also on the previous L inputs; L is called the constraint length, or memory depth.

To improve resistance to bursty packet loss, the block length (that is, k) of a block code is usually increased. But since block-code decoding delay grows linearly with the block, increasing k increases decoding delay. This creates a block-length trade-off: a longer block provides better error-correction capability but adds system delay. Convolutional codes have no block-length problem and support on-the-fly coding, making them better suited to real-time streaming scenarios.
2. Several common coding schemes
For real-time video streams, the following coding schemes are common: frame-level coding, GOP-level coding, expanding-window coding, and sliding-window coding. The first two are block codes; the latter two are applications of convolutional codes.

Frame-level FEC uses a single frame as the grouping unit (as shown in the figure below). It achieves the minimum decoding delay, but at low bit rates the groups are so small that decoding easily fails under bursty packet loss. GOP-level coding uses a whole GOP as the grouping unit. This greatly improves decoding stability under bursty loss, but introduces a large decoding delay that is hard to tolerate in real-time scenarios.

Expanding-window and sliding-window coding are specific applications of convolutional codes: since there is no grouping, they can in principle encode at any position in the video stream. The difference is that when encoding frame X, expanding-window coding covers frames 1 through X within the current GOP, while sliding-window coding covers only frames X−T through X, where T is the maximum window length. Both improve the decoding probability under bursty loss without increasing decoding delay; however, because expanding-window coding has performance problems with large GOPs, sliding-window coding is the more practical solution. The figure below shows an example of sliding-window FEC with T = 3.
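The sliding-window idea can be sketched in a few lines: after each source packet, emit one repair packet covering the last T source packets. Plain XOR is used here for brevity, so each window can repair only a single loss; real sliding-window codes (e.g. the RLC scheme of RFC 8681) draw coefficients from GF(2^8) so that several losses per window can be recovered.

```python
# Minimal sliding-window FEC sketch with window length T = 3:
# one XOR repair packet is emitted after every source packet.
T = 3

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def repair(window):
    # XOR of the (up to T) most recent source packets
    out = window[0]
    for pkt in window[1:]:
        out = xor(out, pkt)
    return out

src = [bytes([i] * 4) for i in range(1, 6)]          # 5 source packets
repairs = [repair(src[max(0, i - T + 1): i + 1]) for i in range(len(src))]

# Suppose src[3] is lost. The repair emitted at step 3 covered src[1..3],
# so XORing it with the surviving packets of that window recovers src[3].
recovered = xor(xor(repairs[3], src[1]), src[2])
print(recovered == src[3])  # -> True
```

Note that no grouping decision was needed: the window simply slides forward with the stream, which is the on-the-fly property that makes this family of codes attractive for real-time video.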
3. Source and channel joint convolutional coding
In Shengwang's practice, we not only deployed the convolutional-code scheme at scale to verify its advantages for real-time video streaming, but also combined source coding with channel coding into a new scheme named DMEC (Dense Matrix Erasure Coding). DMEC fully exploits the performance advantages of convolutional codes and achieves optimal video QoE across different scenarios.

In information theory, coding is divided into source coding and channel coding. Source coding removes redundant information to improve communication efficiency; the widely used H.264 video encoder is a source encoder. Channel coding combats noise and attenuation in the channel by adding redundancy to improve robustness and error correction; the RS code discussed above is a channel coding algorithm.

For a video source encoder, taking H.264 as an example, frames are usually referenced frame by frame: each frame references the previous frame within the same GOP. This is not the case with layered (scalable) coding, where a frame is split into a base layer and one or more enhancement layers. In multi-downlink scenarios, layered coding serves adaptive streams at different levels to terminals with different devices and network conditions. But it also complicates the dependency structure between frames: if plain sliding-window FEC is still applied, video quality will not be optimal.

Compared with the traditional sliding-window scheme, DMEC uses the frame reference relationships output by the source encoder as coding constraints for the convolutional encoder. This eliminates the influence of non-reference frames on the decoding probability and maximizes the playable frame rate (PFR); see Reference 1 for the theoretical basis of this scheme.

During DMEC encoding, the encoding window of the current frame contains only the frame itself and its reference frames. The more important a frame is, the more often it is referenced, so it is included in more FEC encodings and has a higher probability of being recovered at the decoder. Through this mechanism, DMEC automatically achieves unequal error protection (UEP), without explicitly allocating extra FEC bit rate to high-priority frames.

Take the two-layer temporal, two-layer spatial SVC shown in the figure below as an example: with T = 6, the coding window of frame P31 includes I00, I01, P20, P21, P30, and P31.
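The window-selection rule can be illustrated as a walk over the frame reference graph. The graph below is a guess consistent with the article's two-layer temporal, two-layer spatial SVC example (the real encoder exports the actual reference relationships), and the function is a sketch of the idea, not Shengwang's implementation.

```python
# Sketch: a frame's FEC encoding window = the frame itself plus the
# transitive closure of its references, capped at window length T.
# Frame names follow the article's SVC example; the edges are assumptions.
refs = {
    "I00": [], "I01": ["I00"],
    "P10": ["I00"], "P11": ["I01", "P10"],
    "P20": ["I00"], "P21": ["I01", "P20"],
    "P30": ["P20"], "P31": ["P21", "P30"],
}

def encoding_window(frame, T):
    # breadth-first walk over the reference graph, newest frames first
    window, queue = {frame}, list(refs[frame])
    while queue and len(window) < T:
        f = queue.pop(0)
        if f not in window:
            window.add(f)
            queue.extend(refs[f])
    return sorted(window)

print(encoding_window("P31", 6))
# -> ['I00', 'I01', 'P20', 'P21', 'P30', 'P31']
```

Frames that nothing references (here P10 and P11) never enter another frame's window, which is exactly how the scheme removes non-reference frames from the protection budget, while heavily referenced frames such as I00 appear in many windows and thus receive stronger implicit protection.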
4. Performance comparison
The figure below shows PFR test results for different FEC schemes on standard sequences in the laboratory. PFR, the playable frame rate, is an intuitive indicator of the final video experience delivered by an FEC algorithm. Red is frame-level FEC, blue is sliding-window FEC, and purple is Shengwang's DMEC. DMEC shows a clear advantage over the other two schemes.

Beyond laboratory data, we also ran large-scale online A/B tests and obtained a large amount of comparative data through data mining (the table below shows A/B test results for several customers). These also verified that, compared with traditional FEC solutions, Shengwang's self-developed DMEC reduces the video stutter rate.
| Customer | Scenario | Gray-release minutes | Stutter-rate reduction |
| --- | --- | --- | --- |
| Customer A | 1v1 | 1.3 million | 22.22% |
| Customer B | Meeting | 5.6 million | 10.94% |
| Customer C | Live streaming | 0.4 million | 12.73% |
References
1. R. Wang, L. Si and B. He, "Sliding-Window Forward Error Correction Based on Reference Order for Real-Time Video Streaming," IEEE Access, vol. 10, pp. 34288-34295, 2022, doi: 10.1109/ACCESS.2022.3162217.
< https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9741773 >
2. [RFC8680] Roca, V. and A. Begen, "Forward Error Correction (FEC) Framework Extension to Sliding Window Codes", RFC 8680, DOI 10.17487/RFC8680, January 2020.
< https://www.rfc-editor.org/info/rfc8680 >
3. Vincent Roca, Belkacem Teibi, Christophe Burdinat, Tuan Tran-Thai, Cédric Thienot. "Block or Convolutional AL-FEC Codes? A Performance Comparison for Robust Low-Latency Communications." 2017. hal-01395937v2.
< https://hal.inria.fr/hal-01395937v2/document >
4. "Sliding Window Selective Linear Code (SLC) Forward Error Correction (FEC) Scheme for FECFRAME", draft-wang-tsvwg-sw-slc-fec-scheme-03.
< https://datatracker.ietf.org/doc/html/draft-wang-tsvwg-sw-slc-fec-scheme-03 >
5. Department of Electrical and Computer Engineering - University of New Brunswick, Fredericton, NB, Canada.
About Dev for Dev
The full name of the Dev for Dev column is Developer for Developer. The column is a developer co-creation initiative jointly launched by Shengwang and the RTC Developer Community.

Through technology sharing, exchange of ideas, and collaborative projects from an engineer's perspective, it gathers the power of developers, mines and delivers the most valuable technical content and projects, and fully releases the creativity of technology.