
Online video and voice calls have become part of daily life, but complex and changeable network conditions can prevent some audio packets from reaching the receiver, causing brief interruptions or stutter in the voice signal and seriously degrading the call experience. To address this, the Alibaba Cloud Video Cloud Audio technical team weighed compensation quality, performance overhead, real-time behavior and other factors, and developed AliPLC (Ali Packet Loss Concealment), a real-time causal intelligent packet loss compensation algorithm that uses an end-to-end generative adversarial network to conceal packet loss during speech transmission.

In real-time communication, what can be done when the signal is poor?

With the rapid development of Internet technology, emerging forms of interaction such as live streaming, online education, audio and video conferencing, social entertainment, and interactive games are changing people's lives. Their rise is inseparable from the development of Real-Time Communication (RTC). Figure 1 shows a simplified audio pipeline in RTC, which mainly includes capture, pre-processing (3A), encoding, transmission, decoding, packet loss compensation, mixing, and playback.

Figure 1. Schematic diagram of audio link in RTC

Voice signals are encoded, compressed and transmitted over the network frame by frame. Under poor network conditions, however, some audio packets never reach the receiver, causing brief interruptions or stutter in the voice signal, which in turn hurts sound quality and intelligibility over the course of a call. Packet Loss Concealment (PLC) algorithms were developed to address this. A PLC algorithm uses all the information available at the receiver to fill in the lost audio packets so that the loss is hard to notice, preserving the clarity and fluency of the audio on the receiving side and giving users a better call experience.

Research Status of Audio Compensation Algorithms in the Industry

Packet loss is common when data is transmitted over a network, and it is one of the main causes of degraded voice quality in VoIP (Voice over Internet Protocol) calls. Traditional PLC solutions are mainly based on signal analysis [1-2] and can be roughly divided into sender-side and receiver-side compensation schemes. The former uses redundant information added at encoding time to recover the content of lost packets.
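As a rough illustration of sender-side redundancy, the sketch below protects a group of packets with a single XOR parity packet, one simple form of forward error correction. The article does not say which redundancy scheme is meant, so the XOR scheme, the function names (`xor_bytes`, `make_parity`, `recover`) and the 4-byte payloads are all assumptions made for illustration.

```python
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets: List[bytes]) -> bytes:
    """Build one parity packet for a group of equal-length packets."""
    parity = bytes(len(packets[0]))  # all zero bytes
    for p in packets:
        parity = xor_bytes(parity, p)
    return parity


def recover(received: List[Optional[bytes]], parity: bytes) -> List[Optional[bytes]]:
    """Rebuild at most one lost packet (marked None) from the parity packet."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        return list(received)  # nothing lost, or too many losses to repair
    rebuilt = parity
    for p in received:
        if p is not None:
            rebuilt = xor_bytes(rebuilt, p)
    out = list(received)
    out[missing[0]] = rebuilt
    return out


# Hypothetical payloads standing in for encoded audio frames.
group = [b"frm0", b"frm1", b"frm2"]
parity = make_parity(group)
restored = recover([group[0], None, group[2]], parity)
print(restored == group)
```

In this sketch the sender transmits one extra packet for every three, and a single parity packet can repair only one loss per group.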

However, this method requires extra bandwidth and suffers from codec incompatibility. The latter uses the decoded parameter information from before the loss to reconstruct the lost speech signal. The biggest advantage of traditional PLC methods is that they are computationally simple and can run online. Their weakness is long bursts of consecutive packet loss, where they produce mechanical-sounding noise and rapid waveform attenuation and cannot compensate effectively. As a result, traditional PLC methods cannot meet the needs of today's network services.
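A minimal sketch of receiver-side concealment in the traditional style: the last good frame is repeated and attenuated a little more for each consecutive loss. This is not any particular codec's algorithm; the sample rate, frame length, decay factor and function names are assumptions chosen to keep the example small.

```python
import math
from typing import List, Optional

SAMPLE_RATE = 8000   # assumed sample rate in Hz
FRAME_LEN = 160      # assumed frame length: 20 ms at 8 kHz
ATTENUATION = 0.5    # assumed gain decay per consecutive lost frame


def conceal(last_good: List[float], lost_in_a_row: int) -> List[float]:
    """Repeat the last good frame, scaled down once per consecutive loss."""
    gain = ATTENUATION ** lost_in_a_row
    return [s * gain for s in last_good]


def receive(frames: List[Optional[List[float]]]) -> List[List[float]]:
    """Replace every lost frame (None) using only earlier frames."""
    out = []
    last_good = [0.0] * FRAME_LEN
    lost = 0
    for frame in frames:
        if frame is None:
            lost += 1
            out.append(conceal(last_good, lost))
        else:
            lost = 0
            last_good = frame
            out.append(frame)
    return out


# A 200 Hz tone whose period (40 samples) divides the frame length,
# so repeating a frame keeps the waveform phase-continuous.
tone = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(FRAME_LEN)]
output = receive([tone, None, None, None, tone])  # 60 ms burst loss
peaks = [max(abs(s) for s in frame) for frame in output]
print(peaks)
```

With three consecutive losses the peak amplitude of the repeated frame falls to one eighth of the original, which illustrates why long bursts fade toward silence under this kind of scheme.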

In recent years, hardware and algorithms have advanced significantly, and more and more deep learning methods have been applied to speech signal processing. PLC algorithms are no exception. Existing deep PLC methods use a deep learning model at the receiving end to generate the lost audio packets, and they fall roughly into two working frameworks:

The first is a real-time causal processing framework, which uses only past frames that were received successfully for post-processing. Depending on how frames are generated, real-time methods fall roughly into two types: autoregressive methods based on recurrent neural networks [3-4] and parallel methods based on generative adversarial networks [5-6]. parameters and calculations.

The second is an offline non-causal processing framework, which may use broader context, including future frames, in addition to past received frames [7-8]. Offline methods usually focus on how to fill gaps in the speech signal and generally ignore computational complexity, which makes them difficult to deploy in practical applications.

Intelligent packet loss compensation algorithm: AliPLC

1. Algorithm principle

After weighing business usage scenarios, compensation quality, performance overhead, real-time behavior and other factors, the Alibaba Cloud Video Cloud Audio technical team developed AliPLC (Ali Packet Loss Concealment), a real-time causal intelligent packet loss compensation algorithm that uses a low-complexity end-to-end generative adversarial network to conceal packet loss during speech transmission. The algorithm has the following advantages:
• It introduces no delay;
• It can process audio as a real-time stream;
• It can generate high-quality speech;
• It needs no separate smoothing step to keep the audio smooth and coherent before and after a packet loss.
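The streaming, no-lookahead behavior listed above can be sketched as a per-frame loop that keeps a fixed window of past frames and emits exactly one output frame per input frame. AliPLC's network is not published in this article, so the generator here is a placeholder that simply repeats the last frame; the class name, the context length, the frame length and the choice to feed generated frames back into the history are all assumptions.

```python
from collections import deque
from typing import Callable, Deque, List, Optional

Frame = List[float]


class CausalConcealer:
    """Per-frame concealer that only ever looks at past frames."""

    def __init__(self, generate: Callable[[List[Frame]], Frame],
                 context_frames: int = 4, frame_len: int = 320):
        self.generate = generate
        # Fixed-size history, pre-filled with silence.
        self.history: Deque[Frame] = deque(
            ([0.0] * frame_len for _ in range(context_frames)),
            maxlen=context_frames,
        )

    def push(self, frame: Optional[Frame]) -> Frame:
        """Take one received frame (or None if lost) and return one frame."""
        if frame is None:
            frame = self.generate(list(self.history))
        self.history.append(frame)  # oldest frame is dropped automatically
        return frame


def repeat_last(history: List[Frame]) -> Frame:
    """Placeholder for a trained generator: copy the most recent frame."""
    return list(history[-1])


concealer = CausalConcealer(repeat_last, context_frames=2, frame_len=4)
good = [0.1, 0.2, 0.3, 0.4]
stream_out = [concealer.push(f) for f in (good, None, None)]
print(stream_out)
```

Because `push` never waits for a future frame, the loop adds no buffering latency of its own; any remaining cost is the time the generator takes to run.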

2. Algorithm performance

AliPLC has 590k parameters. Compensating one 20 ms frame of audio takes 1.5 ms on a quad-core Intel Core i5 machine clocked at 2 GHz, and inference introduces no delay.
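To put these figures in perspective, the short calculation below derives the real-time factor from the stated frame length and compute time. The model size estimate assumes 32-bit floating-point weights, which the article does not state.

```python
FRAME_MS = 20.0      # frame duration, from the text
COMPUTE_MS = 1.5     # time to compensate one frame, from the text
PARAMS = 590_000     # parameter count, from the text
BYTES_PER_PARAM = 4  # assumption: weights stored as 32-bit floats

# Fraction of each frame period spent on compensation.
real_time_factor = COMPUTE_MS / FRAME_MS

# How many frames could be compensated within one frame period.
headroom = FRAME_MS / COMPUTE_MS

# Approximate weight storage under the float32 assumption.
model_size_mb = PARAMS * BYTES_PER_PARAM / 1e6

print(f"real-time factor: {real_time_factor:.3f}")
print(f"headroom: {headroom:.1f}x")
print(f"model size: {model_size_mb:.2f} MB")
```

A real-time factor of 0.075 means compensation uses 7.5% of the 20 ms frame budget on the stated machine.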

3. Application scenarios

4. Effect display

The following examples compare audio before and after packet loss compensation on test recordings of Chinese male and female speakers. Subjectively, the compensated speech stutters less and is noticeably more fluent and clear.

Chinese male voice, fixed continuous packet loss of 60 ms:

Audio with packet loss

Audio compensated by WebRTC NetEQ PLC

Audio compensated by Opus PLC

Audio compensated by AliPLC

Waveform comparison of different methods:

The figure clearly shows that with a fixed 60 ms packet loss, the audio processed by AliPLC is more coherent and does not exhibit attenuation that leaves the gap uncompensated.

Chinese female voice, fixed continuous packet loss of 120 ms:

Audio with packet loss

Audio compensated by WebRTC NetEQ PLC

Audio compensated by Opus PLC

Audio compensated by AliPLC

The figure clearly shows that with a fixed 120 ms packet loss, AliPLC compensates better than the other algorithms. The neteq_plc algorithm conceals loss by simply repeating and attenuating the pitch period; under long losses it produces a heavy mechanical sound and also affects the waveform of the parts that were not lost. The opus_plc algorithm has limited compensation capability: it can effectively compensate only about 40 ms, and anything lost beyond 40 ms is attenuated to silence.

AliPLC objective benchmark evaluation

We use two objective indicators, POLQA and STOI, to evaluate the compensation effects of different PLC algorithms. Their scores under different packet loss rates are shown in the figure below. The abscissa represents the packet loss rate, and the ordinate represents the score. The value range of the POLQA score is 0-4.5, and the value range of the STOI score is 0-1. The higher the scores of the two objective indicators, the better the quality of the compensated speech signal and the higher the intelligibility.
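Evaluating a PLC algorithm at a given packet loss rate needs a loss pattern to apply to the test audio. The article does not describe how losses were generated for this benchmark, so the sketch below is only one plausible setup: each frame is dropped independently with the target probability, using a seeded generator for repeatability. The function names and the frame count are assumptions.

```python
import random
from typing import Dict, List


def loss_mask(n_frames: int, loss_rate: float, seed: int = 0) -> List[bool]:
    """Return one flag per frame; True marks a lost frame."""
    rng = random.Random(seed)
    return [rng.random() < loss_rate for _ in range(n_frames)]


def longest_burst(mask: List[bool]) -> int:
    """Length of the longest run of consecutive lost frames."""
    best = run = 0
    for lost in mask:
        run = run + 1 if lost else 0
        best = max(best, run)
    return best


observed: Dict[float, float] = {}
for rate in (0.1, 0.2, 0.3):
    mask = loss_mask(10_000, rate)
    observed[rate] = sum(mask) / len(mask)
    print(f"target {rate:.0%}, observed {observed[rate]:.1%}, "
          f"longest burst {longest_burst(mask)} frames")
```

Independent drops produce mostly short bursts; a test that targets long consecutive losses would need a bursty loss model instead.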

The figure clearly shows that AliPLC outperforms the other PLC algorithms on both POLQA and STOI. Compared with neteq_plc, AliPLC improves POLQA by 0.54 points and STOI by 21.7% on average; compared with opus_plc, it improves POLQA by 0.45 points and STOI by 3.4% on average. AliPLC at 30% packet loss scores better than neteq_plc at 20% packet loss; that is, AliPLC lets the receiving side withstand an additional 10%-20% of packet loss.

Subsequent innovations to AliPLC compensation algorithms

As part of the audio solution of the Alibaba Cloud Video Cloud Audio technical team, AliPLC makes full use of the ability of GANs to generate high-quality audio, innovates in method, and provides the ability to compensate for continuous packet loss, improving the call experience in weak network environments. In the future, the team will continue to explore audio technologies that combine deep learning with signal processing to create the best possible audio experience for a wider range of users.

References
[1] S. M. Kay and S. L. Marple, "Spectrum analysis: A modern perspective," Proceedings of the IEEE, vol. 69, no. 11, pp. 1380–1419, 1981.
[2] C. A. Rodbro, M. N. Murthi, S. V. Andersen, and S. H. Jensen, "Hidden Markov model-based packet loss concealment for voice over IP," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1609–1623, 2006.
[3] M. M. Mohamed and B. W. Schuller, "ConcealNet: An End-to-end Neural Network for Packet Loss Concealment in Deep Speech Emotion Recognition," arXiv:2005.07777 [cs, eess], May 2020.
[4] F. Stimberg et al., "WaveNetEQ - Packet Loss Concealment with WaveRNN," 2020 54th Asilomar Conference on Signals, Systems, and Computers, 2020, pp. 672–676.
[5] S. Pascual, J. Serra, and J. Pons, "Adversarial Auto-Encoding for Packet Loss Concealment," arXiv:2107.03100 [cs, eess], Jul. 2021.
[6] J. Wang, Y. Guan, C. Zheng, R. Peng, and X. Li, "A temporal-spectral generative adversarial network based end-to-end packet loss concealment for wideband speech transmission," The Journal of the Acoustical Society of America, vol. 150, no. 4, pp. 2577–2588, Oct. 2021.
[7] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," arXiv:1505.04597 [cs], May 2015.
[8] A. Marafioti, N. Perraudin, N. Holighaus, and P. Majdak, "A context encoder for audio inpainting," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2362–2372, 2019.


CloudImagine