Last week, we covered forward error correction, backward error correction, and the anti-weak-network features of the OPUS codec in our series on audio weak-network countermeasures. This article shares the anti-jitter module NetEQ used by WebRTC. Follow [Rongyun Global Internet Communication Cloud] to learn more.
Definition and Elimination of Jitter
Jitter refers to the uneven arrival of data at the receiver across different time periods due to network conditions; in other words, the intervals between received packets vary, sometimes long and sometimes short.
WebRTC evaluates jitter through the change in the packet arrival interval, roughly J_i = E(T) - T_i, where:
J_i is the jitter measured at time i, E(T) is the mean packet arrival interval, and T_i is the interval between the packet received at time i and the previously received packet.
J_i > 0 means the packet arrived early, so data accumulates in the jitter buffer, which can easily overflow the buffer and increase the end-to-end delay; J_i < 0 means the packet arrived late or was lost, which also increases the delay. Whether packets arrive early or late, the result can be packet loss and added delay.
NetEQ predicts the network transmission delay of packets by measuring the intervals between packet arrivals, and estimates the jitter from the amount of buffered voice data that has been received but not yet played.
In principle, the network delay is measured, the jitter buffer is sized according to the maximum of that delay, and each packet is held so that its network delay plus its time in the jitter buffer stays constant. This removes the jitter, so that audio data can be played out of the jitter buffer at a steady pace.
The following figure [1] illustrates the core idea of jitter removal:
(The core idea of jitter elimination)
Packets A, B, C, and D are sent at 30ms intervals, i.e., at 30ms, 60ms, 90ms, and 120ms; their network delays are 10ms, 30ms, 10ms, and 10ms respectively, so they arrive at 40ms, 90ms, 100ms, and 130ms, and the arrival intervals are 50ms, 10ms, and 30ms; that is, jitter.
Therefore, packets A, C, and D can each be held in the jitter buffer for 20ms before playback, so that A, B, C, and D are played at 60ms, 90ms, 120ms, and 150ms, keeping the playback interval stable.
NetEQ estimates the network transmission delay and adjusts the jitter buffer according to the 95th percentile of that delay; this lets NetEQ keep the added delay as small as possible while absorbing jitter.
The following figure [2] is the official comparison of the delay introduced by NetEQ and by other jitter-elimination technologies; it shows that NetEQ maintains a very low delay while eliminating jitter.
(Comparison of NetEQ and other technologies to eliminate jitter delay)
NetEQ and its related modules
Figure [1] below outlines the WebRTC voice engine architecture; the red area is the NetEQ part. NetEQ sits at the receiving end and contains the jitter buffer, the decoder, PLC, and other related modules. After a voice packet is received from the network, it first enters the NetEQ module for jitter elimination, packet loss concealment, decoding, and other operations, and the resulting audio data is finally sent to the sound card for playback.
(WebRTC's speech engine architecture diagram)
The modules included in NetEQ are shown in the following figure [1]:
(NetEQ module)
The core of NetEQ consists of the MCU module and the DSP module. The MCU module is responsible for inserting packets into and taking packets out of the jitter buffer, and for deciding which operation the DSP module should perform; the DSP module is responsible for processing the voice signal, including decoding, acceleration, deceleration, merge, PLC, and so on.
At the same time, when the MCU module takes packets from the jitter buffer, it is influenced by feedback from the DSP module.
The MCU module covers inserting audio packets into the buffer, taking audio packets out of the buffer, estimating the network transmission delay from the arrival intervals of the voice packets, and deciding which operation (acceleration, deceleration, PLC, merge) the DSP module should perform.
The DSP module covers decoding, performing the related operations on the decoded PCM voice data, evaluating the network jitter level, and delivering playable data to the play buffer. The MCU- and DSP-related modules are analyzed in detail below.
MCU
Insert received packets into packet_buffer_
After audio packets are received from the network, they are inserted into the jitter buffer packet_buffer_, which holds up to 200 packets; when the buffer is full, all packets in it are flushed. The buffer holds at most 5 seconds of audio data, and packets older than that are cleared periodically.
If the received packet is a RED packet, each individual packet inside it is parsed and stored in the buffer queue. Packets are stored in the queue in ascending timestamp order, i.e., the newest audio packets are placed at the back of the queue and older packets at the front.
After an audio packet is received, the NackTracker module is also updated. This module detects packet loss from the packet sequence numbers; if loss is detected and retransmission requests are enabled, it sends a NACK request to the sender asking it to retransmit the missing audio packets.
Estimating network transmission delays
When packets are inserted into the jitter buffer, the network delay is estimated from the packet arrival intervals. WebRTC mainly considers the following points when calculating the audio network delay:
The packet arrival intervals are collected into a statistic, and the 95th percentile is taken as the network delay estimate.
Using the time difference between the current packet and the previous packet, together with the difference in their sequence numbers, the timestamp span covered by each packet (packet_per_time) is computed; from this, the receiving interval iat_ms between the current packet and the previous packet is converted into an interval delay measured in packets.
For a packet that arrives on time, iat = 1; iat = 0 means the packet arrived early, and iat = 2 means it was delayed by one packet. The delay is ultimately measured in units of packets.
Each computed iat is inserted into a histogram covering 0-64 packets, i.e., the histogram records the delay probability distribution over 0-64 packets, and all the delay probabilities sum to 1.
The arrival interval iat of the current packet is computed and inserted into the histogram, and the probability of every delay bucket is updated: each bucket is first multiplied by a forgetting factor f (f < 1), and then the bucket of the current iat is increased, roughly:

P(k) = f * P(k) for every bucket k
P(iat) = P(iat) + (1 - f)

The effect is that each time a packet arrival interval is computed, the probability of that delay is increased while the probabilities of the other delays are gradually forgotten, so the distribution still sums to 1. Finally, the 95th-percentile delay is taken as the target delay, i.e., 95% of the observed delays are smaller than this target delay.
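As a minimal sketch of this update and the 95% quantile lookup (the bucket count follows the text; the forgetting-factor value, structure, and names below are illustrative assumptions, not the actual WebRTC implementation):

```cpp
#include <array>

// Illustrative inter-arrival-time histogram: bucket i holds the probability
// that a packet is delayed by i packet times.
struct IatHistogram {
  static constexpr int kNumBuckets = 65;   // delays of 0-64 packets
  double forgetting_factor = 0.9993;       // f < 1, value is illustrative
  std::array<double, kNumBuckets> prob{};  // delay probability distribution

  void Update(int iat) {
    if (iat < 0) iat = 0;
    if (iat >= kNumBuckets) iat = kNumBuckets - 1;
    // Decay every bucket, then give the forgotten mass to the observed bucket,
    // so the distribution keeps summing to (at most) 1.
    for (double& p : prob) p *= forgetting_factor;
    prob[iat] += 1.0 - forgetting_factor;
  }

  // Smallest delay (in packets) whose cumulative probability reaches 95%.
  int Quantile95() const {
    double cumulative = 0.0;
    for (int i = 0; i < kNumBuckets; ++i) {
      cumulative += prob[i];
      if (cumulative >= 0.95) return i;
    }
    return kNumBuckets - 1;
  }
};
```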
Tracking the maximum peaks of the packet arrival interval
The peaks of the packet arrival intervals are also tracked, up to 8 of them; in certain cases these peaks are used as the network delay estimate. A peak check is performed every time the packet arrival interval iat and the 95th-percentile target delay (denoted target_level) are calculated.
The conditions for considering an iat to be a delay peak are:
- iat > target_level + threshold, or iat > 2 * target_level, where the threshold is 3;
- the time since the last peak is less than the 5s threshold.
When an iat is judged to be a delay peak, it is added to the peak container; each element in the peak container records two values, the peak iat value and the time interval between this peak and the previous one (denoted period_ms).
When the peak container holds more than two peaks, and the time elapsed since a peak was last found (denoted elapse_time) is less than twice the maximum period_ms recorded in the container, the peaks are considered valid; in that case the largest iat in the container (denoted max_iat) is taken and the target delay is set to max_iat.
When target_level takes the value max_iat, the peak can theoretically remain in effect for more than 40s, so there is room for optimization here.
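The peak handling can be sketched as follows; the thresholds come from the description above, while the class layout, names, and the way the peak list is pruned are assumptions of this sketch rather than the actual NetEQ code:

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>

class DelayPeakDetector {
 public:
  // Returns the target level to use, given the current arrival interval iat
  // (in packets), the 95%-quantile target level, and the current time in ms.
  int Update(int iat, int target_level, int64_t now_ms) {
    const int kPeakThreshold = 3;        // "the threshold is 3"
    const int64_t kMaxPeakGapMs = 5000;  // a new peak must follow within 5s
    const bool is_peak =
        (iat > target_level + kPeakThreshold) || (iat > 2 * target_level);
    if (is_peak && now_ms - last_peak_ms_ < kMaxPeakGapMs) {
      peaks_.push_back({iat, now_ms - last_peak_ms_});
      if (peaks_.size() > 8) peaks_.pop_front();  // keep at most 8 peaks
    }
    if (is_peak) last_peak_ms_ = now_ms;

    // While the peaks are still "fresh" (elapsed time < 2 * max period), use
    // the largest peak iat as the target level instead of the 95% quantile.
    if (peaks_.size() > 2) {
      int64_t max_period_ms = 0;
      int max_iat = 0;
      for (const auto& p : peaks_) {
        max_period_ms = std::max(max_period_ms, p.period_ms);
        max_iat = std::max(max_iat, p.iat);
      }
      if (now_ms - last_peak_ms_ < 2 * max_period_ms) return max_iat;
    }
    return target_level;
  }

 private:
  struct Peak { int iat; int64_t period_ms; };
  std::deque<Peak> peaks_;
  int64_t last_peak_ms_ = 0;
};
```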
Minimum delay limit
The target delay estimate is adjusted against the configured minimum delay to ensure it is not lower than that minimum; the minimum delay comes from the audio-video synchronization calculation, and if the target delay were smaller than it, audio and video would fall out of sync.
Target latency cannot exceed 0.75 * maximum jitter buffer size
The default maximum jitter buffer size in WebRTC is 200 packets, so:

target_level <= 0.75 * 200 = 150 packets

That is, target_level never exceeds 150 packets, which at 20ms per packet corresponds to a target delay of at most 3s.
Fetch packets from packet_buffer_
Each time, the playback thread tries to fetch 10ms of data to play; if there is no data in the jitter buffer, it plays mute data.
The data fetched from packet_buffer_ first has to be decoded, and then the corresponding DSP operations (acceleration, deceleration, packet loss concealment/PLC, merge, etc.) are performed according to the feedback. Finally, the DSP-processed data is filled into sync_buffer_ to await playback, and 10ms of audio data is taken from sync_buffer_ for playback.
Calculate Jitter Delay
The jitter buffer delay filtered_level is updated from the total delay total_delay of the unplayed audio data in the jitter buffer packet_buffer_ and in sync_buffer_.
If an acceleration or deceleration operation has been performed, the delay change introduced by that operation must be removed from filtered_level.
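As a rough sketch, the filtering can be thought of as exponential smoothing of the buffer level plus an immediate correction for samples deliberately removed or added by time stretching; the coefficient, names, and sign convention below are assumptions, not the actual NetEQ formula:

```cpp
// Illustrative jitter-buffer level filter: the unplayed data in packet_buffer_
// and sync_buffer_ is smoothed exponentially, and the change deliberately
// caused by acceleration/deceleration is applied at once so the smoothed
// estimate does not lag behind that intentional change.
class BufferLevelFilter {
 public:
  // total_delay_samples: unplayed samples in packet_buffer_ + sync_buffer_.
  // time_stretched_samples: >0 if samples were removed (accelerate),
  //                         <0 if samples were added (decelerate).
  void Update(int total_delay_samples, int time_stretched_samples) {
    const double kAlpha = 0.97;  // smoothing factor, illustrative value
    filtered_level_ =
        kAlpha * filtered_level_ + (1.0 - kAlpha) * total_delay_samples;
    filtered_level_ -= time_stretched_samples;
    if (filtered_level_ < 0.0) filtered_level_ = 0.0;
  }

  double filtered_level() const { return filtered_level_; }

 private:
  double filtered_level_ = 0.0;
};
```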
Get the corresponding action
The amount of unplayed data in packet_buffer_ and sync_buffer_ in the figure below can be understood as the size of the jitter buffer. As the figure shows, after data is taken out of packet_buffer_, it is decoded, processed by the DSP, and finally placed into sync_buffer_ to wait for playback.
low_limit = 3/4 * target_level
high_limit = max(target_level, low_limit + window_20ms)
(jitter buffer)
But which DSP operation should be performed?
NetEQ looks at the total delay of the unplayed audio data in packet_buffer_ and sync_buffer_ (denoted total_delay), the timestamp of the last audio data to be played in sync_buffer_ (denoted endtimestamp), the timestamp of the first packet in packet_buffer_ (denoted availableTimestamp), and the relationship between target_level and filtered_level, and from these it decides which DSP operation to perform next.
The operating conditions of several core DSP operations are briefly described below.
Normal operation conditions
The normal operation is performed when one of the following conditions is met:
- An expand operation was originally planned, but the amount of data to be played in sync_buffer_ is greater than 10ms.
- The number of consecutive expand operations exceeds the threshold.
- Both the current frame and the previous frame arrived normally, and filtered_level is between low_limit and high_limit.
- The last operation was PLC and the current frame arrived normally, but the current frame came too early, so the normal operation is performed.
- Both the current frame and the previous frame arrived normally; filtered_level > high_limit would normally call for acceleration, but the audio data to be played in sync_buffer_ is less than 30ms.
Expand operation conditions
When the current packet is lost or has not yet arrived, and the audio data to be played in sync_buffer_ is less than 10ms, the expand operation is performed if any of the following four conditions is met:
- packet_buffer_ has no audio data available.
- The last operation was an expand, and the current total_delay < 0.5 * target_level.
- The last operation was an expand, the current packet arrived too early (i.e., availableTimestamp - endtimestamp is greater than a certain threshold), and the current filtered_level is less than target_level.
- The last operation was not an expand (it was accelerate, decelerate, or normal), and availableTimestamp - endtimestamp > 0, i.e., there was packet loss in between.
Accelerate operation conditions
Both the previous packet and the current packet arrived normally, filtered_level is greater than high_limit, and the amount of data to be played in sync_buffer_ is greater than 30ms.
Decelerate operation conditions
Both the previous packet and the current packet arrived normally, and filtered_level is less than low_limit.
Merge operation conditions
The last operation was an expand, and the current packet arrived normally.
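Tying these conditions together, a condensed and purely illustrative version of the decision logic might look like the following; it omits several of the finer conditions listed above, and all names and thresholds are assumptions of this sketch:

```cpp
enum class Operation { kNormal, kExpand, kAccelerate, kPreemptiveExpand, kMerge };

// Condensed decision logic based on the conditions listed above.
// Durations are in milliseconds.
Operation DecideOperation(bool current_packet_available,
                          bool last_was_expand,
                          double filtered_level,
                          double low_limit,
                          double high_limit,
                          int sync_buffer_ms) {
  if (!current_packet_available) {
    // Packet lost or not yet arrived: conceal with expand, unless enough data
    // is still waiting in sync_buffer_ to bridge the gap.
    return (sync_buffer_ms > 10) ? Operation::kNormal : Operation::kExpand;
  }
  if (last_was_expand) {
    // Previous output was concealment; smooth the transition back to real data.
    return Operation::kMerge;
  }
  if (filtered_level > high_limit && sync_buffer_ms > 30) {
    return Operation::kAccelerate;        // buffer too full: speed up playback
  }
  if (filtered_level < low_limit) {
    return Operation::kPreemptiveExpand;  // buffer too empty: slow down playback
  }
  return Operation::kNormal;
}
```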
DSP
Pitch [3]
The fundamental tone (pitch) corresponds to the periodicity produced by vocal-cord vibration in voiced speech, and the pitch period is the reciprocal of the vocal-cord vibration frequency. A sound is generally composed of a series of vibrations of different frequencies and amplitudes emitted by the sounding body; the vibration with the lowest frequency produces the fundamental tone, and the rest are overtones. The fundamental tone carries most of the energy and determines the pitch.
The pitch period is usually extracted with the short-time autocorrelation function, because the autocorrelation function reaches a maximum at multiples of the period. In NetEQ's DSP signal processing, pitch extraction is a crucial step.
Stretching of speech [4]
Time stretching (speed changing) of speech can be done with a time-domain method or a frequency-domain method. The time-domain method is computationally cheaper and suits scenarios such as VoIP; the frequency-domain method suits signals whose spectrum changes drastically, such as music.
In NetEQ, the stretching (acceleration or deceleration) of speech uses the WSOLA algorithm, which changes the playback speed by overlap-adding similar waveforms; this changes the speed without changing the pitch. As a time-domain algorithm, it works well for speech. The following figure shows its general principle and process:
(Principle of WSOLA algorithm)
The general process of the WSOLA algorithm:
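A highly simplified sketch of the overlap-add idea behind WSOLA is given below; the frame size, overlap length, search tolerance, and raw-dot-product similarity measure are illustrative, and the real NetEQ implementation works on pitch periods rather than fixed frames:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Stretch a mono signal by `rate` (>1 speeds up, <1 slows down) by repeatedly
// picking, within a small search region, the analysis frame most similar to
// the current output tail and cross-fading it in.
std::vector<float> WsolaStretch(const std::vector<float>& in, double rate) {
  const int kFrame = 480;    // 10ms at 48kHz
  const int kOverlap = 120;  // cross-fade length
  const int kSearch = 160;   // tolerance searched around the nominal position
  std::vector<float> out(in.begin(),
                         in.begin() + std::min<size_t>(kFrame, in.size()));
  double analysis_pos = (kFrame - kOverlap) * rate;
  while (analysis_pos + kFrame + kSearch < in.size()) {
    // Find the offset whose frame start best matches the current output tail.
    int best_offset = 0;
    float best_corr = -1e30f;
    const float* tail = &out[out.size() - kOverlap];
    for (int off = -kSearch; off <= kSearch; ++off) {
      const int start = static_cast<int>(analysis_pos) + off;
      if (start < 0) continue;
      float corr = 0.f;
      for (int i = 0; i < kOverlap; ++i) corr += tail[i] * in[start + i];
      if (corr > best_corr) { best_corr = corr; best_offset = off; }
    }
    const int start = static_cast<int>(analysis_pos) + best_offset;
    // Cross-fade the overlap region, then append the rest of the frame.
    for (int i = 0; i < kOverlap; ++i) {
      const float w = static_cast<float>(i) / kOverlap;
      out[out.size() - kOverlap + i] =
          (1.f - w) * out[out.size() - kOverlap + i] + w * in[start + i];
    }
    out.insert(out.end(), in.begin() + start + kOverlap, in.begin() + start + kFrame);
    // Advance the analysis position by the stretched synthesis hop.
    analysis_pos += (kFrame - kOverlap) * rate;
  }
  return out;
}
```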
Decoding
Get data from packet_buffer_, decode normally, and store the decoded data in decoded_buffer_.
Accelerate
In NetEQ, when the total delay of the data waiting to be played in sync_buffer_ and packet_buffer_ has accumulated too much, and both the previous frame and the current frame arrived normally, playback needs to be accelerated to reduce the amount of data waiting in the jitter buffer and thus the delay; otherwise the jitter buffer can easily overflow and drop packets. Acceleration must change the speed without changing the pitch, and the WSOLA algorithm achieves this by finding similar waveforms and overlap-adding them.
Here is a brief description of the acceleration process in NetEQ:
First, acceleration requires at least 30ms of data, which generally comes from decoded_buffer_. If less than 30ms is available in decoded_buffer_, some data is borrowed from sync_buffer_ to make up the 30ms.
The 30ms of data corresponds to 240 samples, which are downsampled to 110 samples. The 110 downsampled samples are split into two parts, seq1 and seq2: seq2 is a fixed segment of 50 samples, namely [60, 110]; seq1 is also 50 samples, sliding over the range [0, 100]; the correlation between seq1 and seq2 is computed.
From the correlation results, parabolic fitting is used to find the correlation peak and its position. The position of the maximum is the offset in the figure above, i.e., the distance the seq1 window has slid, so the pitch period is T = offset + 10.
In the 30ms of samples to be accelerated, take the two pitch-period signals located at 15ms and at 15ms - T, denote them X and Y, and compute how well these two pitch-period signals match (best_correlation).
When best_correlation is greater than 0.9, the two pitch-period signals are merged into a single pitch period, which achieves the acceleration. The accelerated data (the output in the figure below) is stored in algorithm_buffer_. If data was borrowed from sync_buffer_ because decoded_buffer_ did not contain enough, the borrowed portion must be copied from algorithm_buffer_ back to the corresponding position in sync_buffer_ (to ensure a smooth audio transition), and the remaining data in algorithm_buffer_ is appended to the end of sync_buffer_, as shown in the following figure:
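The core period-merging step can be sketched as shown below; the linear cross-fade and index handling are simplifications, and the function assumes anchor - T >= 0 and anchor + T <= buf30ms.size():

```cpp
#include <vector>

// Merge two adjacent pitch periods into one: when the two periods are similar
// enough (best_correlation > 0.9 in NetEQ's check), they are cross-faded into
// a single period, shortening the signal by T samples without changing pitch.
// `anchor` is the sample index corresponding to the 15ms point in the text.
std::vector<float> AccelerateOnePeriod(const std::vector<float>& buf30ms,
                                       int T /* pitch period in samples */,
                                       int anchor /* index of the 15ms point */) {
  std::vector<float> out;
  out.reserve(buf30ms.size() - T);
  // Keep everything before the first of the two periods unchanged.
  out.insert(out.end(), buf30ms.begin(), buf30ms.begin() + (anchor - T));
  // Cross-fade period X = [anchor - T, anchor) with period Y = [anchor, anchor + T).
  for (int i = 0; i < T; ++i) {
    const float w = static_cast<float>(i) / T;  // 0 -> 1 ramp
    out.push_back((1.f - w) * buf30ms[anchor - T + i] + w * buf30ms[anchor + i]);
  }
  // Keep everything after the second period unchanged; total length shrinks by T.
  out.insert(out.end(), buf30ms.begin() + anchor + T, buf30ms.end());
  return out;
}
```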
Decelerate (slow down)
When the delay level of the data waiting to be played in the NetEQ jitter buffer is below the lower limit of the target delay, the amount of data waiting to be played is small. To get the best playback quality, the existing data is stretched so that the amount of data available for playback increases appropriately. This is the opposite of acceleration, but the underlying technique is the same. A brief analysis:
The first four steps are essentially the same as for acceleration and are not repeated here.
In the final step, the pitch period obtained by cross-fading the 15ms - T and 15ms pitch periods is inserted between those two periods, adding one extra pitch period of data and thereby increasing the amount of data available for playback; the data borrowed from sync_buffer_ must then be returned from the algorithm buffer to sync_buffer_.
Specifically as shown in the figure below:
Packet loss concealment (PLC)
When the current packet is lost, NetEQ triggers packet loss concealment to predict the lost packet. There are two approaches: one lets the codec predict and reconstruct the lost packet, and the other uses the NetEQ expand module to predict and reconstruct it; they correspond to kModeCodecPlc and kModeExpand respectively.
In either mode, the precondition is that the amount of audio data waiting to be played in sync_buffer_ is less than the amount of data currently requested for playback. Packet loss concealment builds PLC-related parameters from recent historical data, performs linear prediction on that data with the PLC parameters to recover the lost packet, and finally adds a certain amount of noise.
When the PLC runs many times in a row, distortion increases, so after repeated operations the energy of the PLC voice is reduced.
The figure below shows the core steps of the NetEQ expand operation.
(Steps to build PLC parameters)
(Building the PLC packet)
In the PLC operation, the primary task is to compute the pitch period. Two methods are used, autocorrelation and signal distortion; the figure below gives a brief explanation.
- A is the last 60 samples of the signal.
- B is a sliding window that also contains 60 samples; the start of the window slides over the range [0, 54].
- The autocorrelation of A and B is computed, giving 54 correlation results, and parabolic fitting is applied to them to find the positions of the three largest values.
- Since the fundamental tone is assumed to be a periodic signal, the positions where these extreme values appear lie on the period.
- The pitch periods are therefore:
T1 = peak_idx3 + 10
T2 = peak_idx2 + 10
T3 = peak_idx1 + 10
- From the three extreme values found by autocorrelation and parabolic fitting, three candidate pitch periods are obtained.
- Next, the last 20 samples are taken as A; the sliding window B is also 20 samples (2.5ms), and B slides within a range of 4 samples (0.5ms) before and after one candidate pitch period away from A.
- There are therefore three window sliding ranges, and the distortion between window A and window B is computed within each of them.
- The three minima of the distortion give the pitch periods T'1, T'2, T'3 computed from minimum distortion.
- The distortion is measured as the sum of the absolute differences between the corresponding elements of A and B; the smaller the sum, the better the match.
- After obtaining the three pitch periods T1, T2, T3 from autocorrelation and the three pitch periods T'1, T'2, T'3 from minimum distortion, the ratio = correlation value / distortion is compared for the three pairs, and the pair with the largest ratio is taken as the best pitch period.
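A simplified sketch of this two-stage pitch search is shown below; the correlation candidates and their correlation values are assumed to be precomputed, parabolic fitting is omitted, and the window sizes follow the text rather than the actual WebRTC code:

```cpp
#include <cmath>
#include <vector>

// Refine correlation-based pitch candidates with a sum-of-absolute-differences
// (distortion) search, then pick the candidate with the largest
// correlation / distortion ratio.
int EstimatePitchPeriod(const std::vector<float>& x,
                        const std::vector<int>& corr_candidates,   // T1, T2, T3
                        const std::vector<float>& corr_values) {   // their correlations
  const int n = static_cast<int>(x.size());
  const int kWin = 20;  // distortion window: 20 samples (2.5ms at 8kHz)
  const int kTol = 4;   // search +/- 4 samples around each candidate
  int best_period = corr_candidates.front();
  float best_ratio = -1.f;
  for (size_t c = 0; c < corr_candidates.size(); ++c) {
    float min_distortion = 1e30f;
    int refined = corr_candidates[c];
    for (int d = -kTol; d <= kTol; ++d) {
      const int lag = corr_candidates[c] + d;
      if (lag <= 0 || n - kWin - lag < 0) continue;
      // Distortion = sum of absolute differences between the last kWin samples
      // and the window located one candidate period earlier.
      float distortion = 0.f;
      for (int i = 0; i < kWin; ++i) {
        distortion += std::fabs(x[n - kWin + i] - x[n - kWin - lag + i]);
      }
      if (distortion < min_distortion) { min_distortion = distortion; refined = lag; }
    }
    // The pair with the largest correlation/distortion ratio wins.
    const float ratio = corr_values[c] / (min_distortion + 1e-6f);
    if (ratio > best_ratio) { best_ratio = ratio; best_period = refined; }
  }
  return best_period;
}
```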
In the PLC operation, expand_vector0 and expand_vector1 are constructed as follows:
(Construct expand_vector0, expand_vector1 from historical data)
The AR filter parameters are constructed by obtaining a group of (7) autocorrelation values and running the Levinson-Durbin algorithm on them to estimate the coefficients.
The AR filter is a linear predictor: it predicts the current data from historical data. When constructing PLC packets, AR filtering is applied to historical data to predict the information of the lost packet.
The following figure is a brief illustration of this process:
The AR filter formula is, roughly:

x_pred(n) = c1*x(n-1) + c2*x(n-2) + ... + ck*x(n-k), with e(n) = x(n) - x_pred(n)

Here k = 7; e(n) is the error between the predicted value and the actual value. AR filtering predicts the data at the current moment from the most recent historical data, and the Levinson-Durbin algorithm estimates the coefficients ck from the autocorrelation values so that e(n) is minimized.
The larger the autocorrelation, the smaller e(n) can be made.
The Levinson-Durbin algorithm thus derives the AR filter parameters from the autocorrelation values.
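For reference, a textbook Levinson-Durbin recursion is sketched below; it is a generic version (order 7 for NetEQ's PLC, per the text), not the WebRTC implementation:

```cpp
#include <vector>

// Given autocorrelation values r[0..order], compute AR predictor coefficients
// a[1..order] such that x_pred(n) = sum_{i=1..order} a[i] * x(n - i)
// minimizes the prediction error e(n).
std::vector<double> LevinsonDurbin(const std::vector<double>& r, int order) {
  std::vector<double> a(order + 1, 0.0);
  a[0] = 1.0;
  double err = r[0];  // prediction error power
  for (int m = 1; m <= order; ++m) {
    // Reflection coefficient k_m.
    double acc = r[m];
    for (int i = 1; i < m; ++i) acc -= a[i] * r[m - i];
    const double k = (err != 0.0) ? acc / err : 0.0;
    // Update coefficients a[1..m] in place.
    std::vector<double> prev(a.begin(), a.begin() + m);
    a[m] = k;
    for (int i = 1; i < m; ++i) a[i] = prev[i] - k * prev[m - i];
    err *= (1.0 - k * k);
  }
  return a;
}
```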
Merge (fusion)
The merge operation is generally used when the previous frame was lost but the current frame arrived normally.
The previous frame is a PLC packet generated by prediction, so the current frame needs to be smoothed against it to avoid an obvious discontinuity at the junction; the merge operation performs this smoothing. Its main process is as follows:
- 202 samples of expand signal are required, mainly borrowed from the unplayed signal in sync_buffer_; if the unplayed signal is shorter than 202 samples, enough PLC data is generated to make up the difference.
- The pitch period is obtained by computing the autocorrelation of the expand signal and the input signal and applying parabolic fitting.
- A Ramp transformation is applied to the input data, the expand signal and input signal segments are mixed, and the borrowed signal data is returned from the algorithm buffer to sync_buffer_.
- The remaining smoothed signal in the algorithm buffer is appended to sync_buffer_.
The following figure provides a brief description of the fusion process in NetEQ:
(Pitch period and mute_factor)
(Mixing the expand and input signals)
Normal
The current frame can be played normally, i.e., the signal could be sent directly to sync_buffer_; but because the previous frame may have been an expand, some smoothing is still required.
The main steps are as follows:
- Copy the input signal to the algorithm buffer and generate a PLC data packet.
- Calculate mute_factor from the energy ratio of the background noise and the input speech signal.
- Using mute_factor, correct the algorithm buffer data so that its energy ramps from weak to strong.
- Smoothly mix the expand signal with the first 1ms of audio data in the algorithm buffer and store the result in the algorithm buffer; the mixing is a weighted cross-fade controlled by mute_factor (see the sketch after this list).
- Append the final algorithm buffer data to sync_buffer_.
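The exact mixing formula is not reproduced here; assuming a plain linear cross-fade scaled by mute_factor, the smoothing step can be sketched as:

```cpp
#include <algorithm>
#include <vector>

// Fade the expand (concealment) tail out while fading the normally decoded
// input in over the first millisecond, scaled by mute_factor.
void SmoothExpandToNormal(const std::vector<float>& expand_tail,  // concealment samples
                          std::vector<float>& algorithm_buffer,   // decoded input (in/out)
                          float mute_factor,                      // 0.0 .. 1.0
                          int fade_samples /* e.g. 48 for 1ms at 48kHz */) {
  const int n = std::min({fade_samples,
                          static_cast<int>(algorithm_buffer.size()),
                          static_cast<int>(expand_tail.size())});
  for (int i = 0; i < n; ++i) {
    const float w = static_cast<float>(i + 1) / fade_samples;  // 0 -> 1 ramp
    algorithm_buffer[i] =
        (1.f - w) * expand_tail[i] + w * mute_factor * algorithm_buffer[i];
  }
}
```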
The following figure supplements the normal operation process:
(Normal operation process)
NetEQ related buffers
To eliminate jitter, decode audio data, perform DSP processing on the decoded data (acceleration/deceleration, PLC, merge, smoothing), and play back smoothly, NetEQ uses several buffers, briefly described below:
packet_buffer_: stores the audio packets received from the network and is also known as the jitter buffer. It periodically deletes packets more than 5 seconds older than the current time and holds at most 200 packets, i.e., about 4 seconds of audio (counting 20ms per packet).
decoded_buffer_: each time the playback thread fetches audio for playback, it decides, based on the target delay and the jitter buffer delay, whether audio data needs to be fetched from packet_buffer_ and decoded; the decoded data is stored in decoded_buffer_, which can hold up to 120ms of data.
algorithm_buffer_: after the data in decoded_buffer_ has been processed by the DSP, it is stored in this buffer, which is emptied after each round of processing.
sync_buffer_: generally holds the data copied from the algorithm buffer, i.e., the data to be played. It has two notable variables: next_index, which marks the current playback position (data before next_index has already been played, data after it is waiting to be played), and endtimestamp, which marks the timestamp of the last data to be played in sync_buffer_, i.e., the newest audio data.
sync_buffer_ is a circular buffer that can hold up to 180ms of data; the playback thread takes 10ms of data from it for each playback.
When data is taken from packet_buffer_ for decoding:
- Under normal circumstances, 10ms of data is taken from packet_buffer_ for decoding each time.
- When an expand operation is to be performed but there is more than 10ms of data in sync_buffer_, no data is decoded from packet_buffer_.
- When an accelerate operation is to be performed, if there is more than 30ms of data in sync_buffer_, or there is more than 10ms of data to be played in sync_buffer_ and 30ms of data was decoded last time, no data is fetched from packet_buffer_.
- When an accelerate operation is to be performed, if the data waiting to be played in sync_buffer_ is less than 20ms and less than 30ms was decoded last time, 20ms of data is fetched from packet_buffer_ for decoding.
- When a decelerate operation is to be performed, if the data waiting to be played in sync_buffer_ is greater than 30ms, or it is less than 10ms but more than 30ms was decoded last time, no data is fetched from packet_buffer_.
- When a decelerate operation is to be performed, if the data waiting to be played in sync_buffer_ is less than 20ms and less than 30ms was decoded last time, 20ms of data is fetched for decoding.
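The fetch rules above can be summarized in a small illustrative helper; the enum, parameter names, and the 10ms fallback values are assumptions of this sketch:

```cpp
// How many milliseconds of audio to pull from packet_buffer_ for decoding,
// given the pending operation and the current buffer state.
// A return value of 0 means "do not decode now".
enum class PendingOp { kNormal, kExpand, kAccelerate, kDecelerate };

int MsToFetchForDecoding(PendingOp op, int sync_buffer_ms, int last_decoded_ms) {
  switch (op) {
    case PendingOp::kNormal:
      return 10;  // normal case: fetch 10ms per playout request
    case PendingOp::kExpand:
      return (sync_buffer_ms > 10) ? 0 : 10;
    case PendingOp::kAccelerate:
      if (sync_buffer_ms > 30) return 0;
      if (sync_buffer_ms > 10 && last_decoded_ms >= 30) return 0;
      if (sync_buffer_ms < 20 && last_decoded_ms < 30) return 20;
      return 10;
    case PendingOp::kDecelerate:
      if (sync_buffer_ms > 30) return 0;
      if (sync_buffer_ms < 10 && last_decoded_ms > 30) return 0;
      if (sync_buffer_ms < 20 && last_decoded_ms < 30) return 20;
      return 10;
  }
  return 10;
}
```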
NetEQ tracks network jitter very well while keeping the delay introduced by jitter elimination as small as possible, which significantly improves the audio experience. Combined with the weak-network countermeasure technologies covered in the previous article, it can significantly improve the audio experience in weak-network environments.
References: