
Speech super-resolution makes online meetings sound brighter. Online meetings have become a common way to communicate at work, and there are now many ways to join one: from a computer, from a mobile phone, or by dialing in over the telephone.

Xueya, Yaochen | Authors

Audio signals with a high sampling rate and wide bandwidth are rich in frequency components and give listeners a more immersive experience. In an online meeting, however, the captured bandwidth is sometimes too low because of the device or other factors, and the other party's voice then sounds dull and muffled, which seriously degrades the meeting experience. In signal processing, speech super-resolution (also known as bandwidth extension) addresses this situation: it reconstructs the missing high-frequency components from the low-bandwidth audio signal as faithfully as possible, so that the speech sounds more "bright and lifelike" and customers get a higher-quality calling experience.

Here is a simple demonstration of the effect:
https://www.youku.com/video/XNTg1MzgwMjczNg==

Author's Note: The first half of the video is the narrowband signal, and the second half is the wideband signal after super-resolution.

Early research on speech super-resolution mostly built on traditional signal processing theory, such as the source-filter model, and predicted the high-band spectral envelope through codebook mapping or linear mapping [1, 2]. In recent years, as deep learning has been applied to signal processing, the quality of speech super-resolution has improved significantly.
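As a rough illustration of the codebook-mapping idea, the sketch below pairs each narrowband envelope codeword with a high-band envelope and looks up the nearest pair. It is a minimal toy, not the method of [1, 2]: the function names, the three-value and two-value envelope vectors, and the codebook entries are all hypothetical and chosen only to make the lookup visible.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_envelope(narrow_env, codebook):
    """Return the high-band envelope paired with the nearest narrowband codeword."""
    nearest = min(codebook, key=lambda entry: squared_distance(narrow_env, entry[0]))
    return nearest[1]


# Hypothetical toy codebook: (narrowband envelope, paired high-band envelope).
# A real codebook would be learned from wideband training speech.
TOY_CODEBOOK = [
    ((1.0, 0.8, 0.2), (0.10, 0.05)),  # low-frequency-heavy frame, weak high band
    ((0.2, 0.4, 0.9), (0.70, 0.60)),  # high-frequency-heavy frame, strong high band
]

predicted = map_envelope((0.3, 0.5, 0.8), TOY_CODEBOOK)
print(predicted)
```

The input frame is closest to the second codeword, so the lookup returns that entry's high-band envelope. A linear mapping would replace the nearest-neighbor lookup with a matrix multiplication.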

At first, the traditional signal processing framework was retained and a neural network replaced part of it, predicting the spectral envelope or magnitude spectrum of the high band [3, 4], while phase extension still followed traditional methods to keep the computational complexity low. However, phase information has a non-negligible impact on perceived quality.

Subsequently, inspired by image super-resolution algorithms, end-to-end neural network models were applied to the speech super-resolution task [5, 6]. These models predict the signal directly in the time domain, which avoids the phase problem, and are trained by minimizing a loss function such as the L2 loss. Later, GAN training was introduced, combining the original loss function with an adversarial loss to achieve better results [7, 8].
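One generic way to write such a combined objective is shown below. The notation is an assumption made for illustration: x is the narrowband input, y the wideband target of T samples, G the generator, and λ a weighting hyperparameter. The exact reconstruction and adversarial terms differ between the cited works.

```latex
\hat{y} = G(x), \qquad
\mathcal{L}_{G} =
  \underbrace{\frac{1}{T}\sum_{t=1}^{T}\bigl(y_t - \hat{y}_t\bigr)^2}_{\text{L2 reconstruction loss}}
  + \lambda\,\mathcal{L}_{\text{adv}}(\hat{y})
```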

Currently, we mainly focus on the case where the sampling rate is increased from 8 kHz to 16 kHz (the spectral bandwidth is extended from 4 kHz to 8 kHz).
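The link between sampling rate and bandwidth is the Nyquist limit: a signal sampled at 8 kHz can only carry content up to 4 kHz. The sketch below is not part of AliSSR. It uses only the Python standard library and a hypothetical 1 kHz test tone to show what the simplest possible upsampling does, namely inserting a zero after every sample. The 4 to 8 kHz band is then filled with a mirror image of the low band rather than with real speech content, and this is the gap that super-resolution is meant to fill more faithfully.

```python
import cmath
import math


def dft_magnitude(x):
    """Naive O(n^2) DFT magnitude for bins 0..n/2; fine for short test signals."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def zero_insert_upsample(x):
    """Double the sampling rate by inserting a zero after every sample."""
    y = []
    for sample in x:
        y.extend((sample, 0.0))
    return y


FS_NARROW = 8000   # narrowband sampling rate in Hz
FS_WIDE = 16000    # target sampling rate in Hz
TONE_HZ = 1000     # hypothetical test tone
N = 160            # 20 ms at 8 kHz, a whole number of tone periods

narrow = [math.sin(2 * math.pi * TONE_HZ * t / FS_NARROW) for t in range(N)]
wide = zero_insert_upsample(narrow)

mag = dft_magnitude(wide)
bin_hz = FS_WIDE / len(wide)  # frequency resolution of the DFT
peaks = sorted(k * bin_hz for k, m in enumerate(mag) if m > 0.5 * max(mag))
print(peaks)  # [1000.0, 7000.0]
```

The 7 kHz component is the 1 kHz tone mirrored around the old 4 kHz Nyquist frequency. A conventional resampler would filter it out and leave the upper band empty.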

Speech Super-Resolution Algorithm: AliSSR

Neural network-based speech super-resolution algorithms have achieved good results in recent years, but many of them are neither real-time nor causal, and they often involve a large number of parameters and computations, which makes them difficult to deploy in practical application scenarios. To address these problems, the Alibaba Cloud Video Cloud Audio technical team has developed two real-time, causal speech super-resolution algorithms: AliSSR (e2e version) and AliSSR (lightweight version). Both are designed to maintain high-quality speech super-resolution under these constraints.
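To make the "real-time causal" requirement concrete: a causal system computes each output sample from the current and past input samples only, so it can run frame by frame on a live stream without waiting for future audio. The sketch below is not the AliSSR model. It uses a toy FIR filter with hypothetical taps as a stand-in, and checks that processing in 5-sample frames gives the same output as processing the whole signal at once.

```python
class CausalFIR:
    """Toy causal FIR filter that keeps its state between frames."""

    def __init__(self, taps):
        self.taps = list(taps)
        # Past input samples needed to compute the next output sample.
        self.history = [0.0] * (len(self.taps) - 1)

    def process(self, frame):
        """Filter one frame using only current and past samples."""
        k = len(self.taps)
        buf = self.history + list(frame)
        out = [
            sum(self.taps[j] * buf[i + k - 1 - j] for j in range(k))
            for i in range(len(frame))
        ]
        # Carry the last k - 1 input samples over to the next frame.
        self.history = buf[len(buf) - (k - 1):]
        return out


TAPS = [0.5, 0.3, 0.2]                      # hypothetical filter taps
signal = [float(i % 7) for i in range(20)]  # hypothetical input samples

# Offline: the whole signal in one call.
offline = CausalFIR(TAPS).process(signal)

# Streaming: the same signal delivered in 5-sample frames.
stream = CausalFIR(TAPS)
streamed = []
for start in range(0, len(signal), 5):
    streamed.extend(stream.process(signal[start:start + 5]))

print(streamed == offline)  # True
```

A non-causal model, by contrast, would need samples from future frames, which adds latency or rules out streaming altogether.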

1. Introduction to Algorithm Principles

A. AliSSR (e2e version): based on an end-to-end encoder-decoder model. To match practical application scenarios, the model takes into account the degradation caused by encoding, decoding, and downsampling, and it uses GAN-related training techniques to improve the bandwidth extension result;

B. AliSSR (lightweight version): an algorithm model that combines traditional signal processing with deep learning. The model is simple, easy to extend, and consumes few resources.

The neural network-based speech super-resolution algorithms developed by the audio technology team require no additional data transmission and can perform high-quality bandwidth extension of narrowband speech signals in real-time streaming.

2. Algorithm performance

3. Application scenarios

In some low-bandwidth scenarios, such as PSTN calls, the other party's voice is often perceived as "stuffy". This is mainly because the speech signal transmitted by the sender has a low sampling rate and therefore carries no high-frequency components. By reconstructing the high-frequency components of the voice, speech super-resolution provides customers with higher sound quality and a better listening experience. The following table shows the common usage scenarios of speech super-resolution.

4. Super-resolution results

The AliSSR real-time super-resolution algorithms support multiple languages and both male and female speakers. The samples below show the effect before and after super-resolution on test speech from a male English speaker and a female Chinese speaker. Subjectively, the speech after super-resolution is noticeably "brighter" than the narrowband audio, and the output of AliSSR (e2e version) is brighter than that of AliSSR (lightweight version).

Sample 1: English

https://www.youku.com/video/XNTg1MzgwNDUzNg==
The video contains three audio clips: the narrowband speech, the real-time output of AliSSR (e2e version), and the real-time output of AliSSR (lightweight version).

Sample 2: Chinese

https://www.youku.com/video/XNTg1MzgwNDcwOA==
The video contains three audio clips: the narrowband speech, the real-time output of AliSSR (e2e version), and the real-time output of AliSSR (lightweight version).

Speech super-resolution has a wide range of applications in PSTN, online meetings, restoration of old audio, and media production. With the help of neural networks, the AliSSR speech super-resolution algorithms can give users a more "bright and realistic" sound quality experience in real time with very little resource consumption. In the future, the audio technology team will continue to strengthen its super-resolution capabilities and explore super-resolution technologies that cover all scenarios, from narrowband to fullband and from speech to music to all types of audio.

The Alibaba Cloud Video Cloud Audio technical team will continue to explore audio technologies that combine deep learning with signal processing, to provide a clearer and better audio experience for scenarios such as online meetings.

References

[1] J. Makhoul, M. Berouti, “High-frequency regeneration in speech coding systems”, in Proceedings of ICASSP, 1979, vol. 4, pp. 428–431.
[2] B. Iser, G. Schmidt, “Neural networks versus codebooks in an application for bandwidth extension of speech signals,” in Proc. of Interspeech, 2003.
[3] Kehuang Li, Chin-Hui Lee, “A deep neural network approach to speech bandwidth expansion”, in Proceedings of ICASSP, 2015, pp. 4395–4399.
[4] J. Abel, T. Fingscheidt, “Artificial speech bandwidth extension using deep neural networks for wideband spectral envelope estimation”, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 1, pp. 71–83, 2017.
[5] V. Kuleshov, S.Z. Enam, and S. Ermon, “Audio super resolution using neural nets”, in Workshop of ICLR, 2017.
[6] Heming Wang, Deliang Wang, "Time-frequency loss for CNN based speech super-resolution", in Proceedings of ICASSP, 2020.
[7] Eskimez, Sefik Emre et al. “Adversarial Training for Speech Super-Resolution.” IEEE Journal of Selected Topics in Signal Processing 13 (2019): 347-358.
[8] Li, Y., Tagliasacchi, M., Rybakov, "Real-Time Speech Frequency Bandwidth Extension", ICASSP, 2021.