
From the invention of the telephone in 1860 to voice interaction over the Internet today, voice has always been the most natural and fundamental form of real-time interaction. In recent years, real-time voice interaction has become part of daily life for more and more people. But everyone runs into weak-network conditions, and they directly degrade the voice call experience. Agora has therefore kept applying cutting-edge technology to improve call quality, and we are now the first in China to officially launch a machine-learning-based voice codec (voice AI codec): Agora Silver. It delivers ultra-wideband sound quality at a 32 kHz sampling rate at an ultra-low bit rate, and further improves sound quality and the naturalness of speech through an AI noise suppression algorithm.

Why introduce AI into traditional codecs?

Every user of voice interaction runs into weak networks. Sometimes the cause is poor network infrastructure in the region; sometimes the infrastructure is fine, but congestion during peak hours still shrinks the effective bandwidth allocated to each user. No one can guarantee a network that is stable around the clock, so weak-network conditions will be with us for a long time.

The usual response to a weak network is to lower the bit rate, reducing bandwidth consumption to avoid choppy audio. This solves the problem of stalls and dropouts, but it introduces a new one.

At extremely low bit rates, traditional codecs can only preserve a certain level of speech intelligibility (you can still make out what the other party is saying); other information, such as timbre, is hard to retain. For example, at 6 kbps Opus can only encode narrowband speech, with an effective spectral bandwidth of just 4 kHz. What does that mean in practice?

Opus is currently the most widely used audio codec in the industry, and it is the default codec in WebRTC. To adapt to different network conditions, its bit rate can be adjusted between 6 kbps and 510 kbps. So when you hit a weak network or limited bandwidth, the bit rate can be pushed as low as 6 kbps. At that rate only narrowband speech coding is possible, and by industry convention narrowband speech is sampled at 8 kHz. According to the sampling theorem, also known as the Nyquist theorem, a digital signal can faithfully represent the original sound only if the sampling frequency is more than twice the highest frequency in the signal. In other words, at an 8 kHz sampling rate the effective spectral bandwidth is only 4 kHz. Voices sound muffled because much of the high-frequency content is lost.
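To make the Nyquist limit concrete, here is a minimal NumPy/SciPy sketch (the test signal is illustrative, not from the article): after resampling to 8 kHz, nothing above 4 kHz survives.

```python
import numpy as np
from scipy.signal import resample_poly

fs_orig = 32000                      # ultra-wideband sampling rate (Hz)
t = np.arange(fs_orig) / fs_orig     # 1 second of audio
# Toy "speech": a low-frequency plus a high-frequency component.
x = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 6000 * t)

# Downsample to 8 kHz, as a narrowband codec path would.
fs_low = 8000
y = resample_poly(x, up=1, down=4)   # 32000 / 4 = 8000 Hz

# By the Nyquist theorem, y can only represent frequencies below 4 kHz.
freqs = np.fft.rfftfreq(len(y), d=1 / fs_low)
print("Highest representable frequency: %.0f Hz" % freqs.max())  # 4000 Hz
# The 6 kHz component is gone: the anti-aliasing filter removed it.
```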

After so many years of development, it is hard for traditional codecs to break through this bottleneck with further algorithmic tuning. But with the continued progress of AI speech synthesis, especially WaveRNN-based speech generation, people have found that combining AI with audio codecs can restore speech far more completely under low-bit-rate encoding conditions.

What is a voice AI codec?

The industry has explored many ways to combine AI with audio codecs. Some optimize low-bit-rate sound quality with WaveRNN on the decoding side; others use AI to improve compression efficiency on the encoding side. Broadly speaking, any codec that uses machine learning or deep learning to compress or decode speech counts as a voice AI codec.
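As a schematic illustration of these two integration points, here is a hypothetical Python sketch; every function is a placeholder of our own, not any shipping codec's API:

```python
import numpy as np

def learned_analysis(frame: np.ndarray) -> bytes:
    """Encoder-side AI (placeholder): a learned model picks a compact
    representation, improving compression efficiency over hand-designed
    features. Here we just fake a tiny feature payload."""
    return frame[:16].astype(np.float16).tobytes()

def generative_decode(payload: bytes, frame_len: int) -> np.ndarray:
    """Decoder-side AI (placeholder): a generative model, e.g. WaveRNN-style,
    synthesizes a natural-sounding waveform from the sparse description."""
    _ = np.frombuffer(payload, dtype=np.float16)
    return np.zeros(frame_len, dtype=np.float32)  # real model: autoregression

frame = np.random.randn(640).astype(np.float32)   # 20 ms at 32 kHz
payload = learned_analysis(frame)                 # a few bytes on the wire
audio = generative_decode(payload, len(frame))    # AI restores what was lost
```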

Challenges facing voice AI codecs today

Although many codec standards have begun to explore AI in their design and development, voice AI codecs are still making their way from academia and standards into real business scenarios. For example, Google's recently released Lyra can restore 16 kHz wideband speech at a rate of 3 kbps. Its approach is to have a machine learning model at the decoder reconstruct a high-quality signal from the received low-bit-rate speech data, so the reconstruction sounds like higher-quality audio. Microsoft's Satin is a similar voice AI codec: it can restore ultra-wideband speech at a 32 kHz sampling rate from a bit rate of 6 kbps.

However, compared with traditional vocoders, voice AI codecs still have several difficulties to overcome in practice:

Noise robustness

By Shannon's rate-distortion theory, the fewer bits you spend, the more distortion you must accept, so at a low bit rate there are no spare bits to describe noise; the input signal therefore needs a high signal-to-noise ratio to begin with. Since voice AI codecs mostly use speech generation models to synthesize the audio at the decoder, noisy input tends to come out as unnatural, speech-like artifacts, which badly hurts the listening experience. Combined with low-bit-rate compression, noise can make intelligibility drop off sharply: the person on the other end sounds thick-tongued and mumbling. In practice, an excellent noise suppression module is therefore needed as a preprocessing step before encoding.
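A back-of-the-envelope calculation shows how thin the bit budget is. It assumes the classic rate-distortion bound for a memoryless Gaussian source, roughly 6.02 dB of best-case reconstruction SNR per bit per sample; real codecs exploit speech structure to do far better, but the trend holds:

```python
def max_waveform_snr_db(bitrate_bps: float, sample_rate_hz: float) -> float:
    """Best-case reconstruction SNR for naive sample-by-sample coding of a
    memoryless Gaussian source: ~6.02 dB per bit per sample."""
    bits_per_sample = bitrate_bps / sample_rate_hz
    return 6.02 * bits_per_sample

# Coding 8 kHz narrowband speech at 6 kbps:
print(max_waveform_snr_db(6000, 8000))   # ~4.5 dB: input noise dominates
# The same budget spread over 32 kHz samples is even thinner:
print(max_waveform_snr_db(6000, 32000))  # ~1.1 dB
```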

Algorithm models optimized for mobile devices

AI models often demand enormous computing power at decode time. The speech generation model used for decoding is computationally expensive, yet real-time interaction requires the model to run in real time on most mobile devices, because that is where most real-time interaction happens. For example, in measurements of Google's open-source Lyra on the Kirin 960 chip, decoding an audio packet containing 40 ms of speech takes 40-80 ms. On a phone with that chip, such as the Huawei Honor 9, Lyra cannot be used for real-time interaction. And that is single-channel decoding; multi-channel decoding (a multi-person real-time conversation) multiplies the required computing power, which ordinary devices may not sustain. So for a voice AI codec to work in real-time interactive scenarios, it must also be optimized for mobile devices to meet real-time and latency requirements.
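The constraint can be stated as a real-time factor: decode time divided by frame duration must stay below 1.0. Here is a tiny illustration of our own (not Lyra's API), taking the 60 ms midpoint of the 40-80 ms range quoted above:

```python
def real_time_factor(decode_ms: float, frame_ms: float) -> float:
    """RTF < 1.0 means the decoder keeps up with playback."""
    return decode_ms / frame_ms

def max_realtime_channels(decode_ms: float, frame_ms: float) -> int:
    """How many concurrent streams fit in one frame's time budget."""
    return int(frame_ms // decode_ms)

# Lyra on a Kirin 960, per the measurement quoted above:
print(real_time_factor(decode_ms=60, frame_ms=40))   # 1.5 -> cannot keep up
print(max_realtime_channels(60, 40))                 # 0 channels in real time
```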

Speech naturalness and computing power

A natural-sounding voice generally requires a model with more computing power, which puts this challenge in direct tension with the second one above.

A model with less computing power tends to produce audible distortion and unnatural artifacts in the generated speech. For example, today's most natural-sounding sample-by-sample generation models typically require 3-20 GFLOPS of computation. Speech intelligibility and naturalness are generally evaluated with MUSHRA (a subjective evaluation methodology for streaming and communication codecs, scored out of 100). A 20 GFLOPS model such as WaveRNN can reach a MUSHRA score of about 85, while a lighter model such as the 3 GFLOPS LPCNet only reaches about 75.
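The cost of sample-by-sample generation follows directly from the arithmetic: every output sample requires a full pass through the network. Using the figures quoted above (our calculation, not a published benchmark):

```python
def flops_per_sample(model_gflops: float, sample_rate_hz: int) -> float:
    """Per-sample cost when the whole model runs once per output sample."""
    return model_gflops * 1e9 / sample_rate_hz

# A 20 GFLOPS WaveRNN-class model generating 16 kHz audio:
print(f"{flops_per_sample(20, 16000):,.0f} FLOPs per sample")  # 1,250,000
# A 3 GFLOPS LPCNet-class model:
print(f"{flops_per_sample(3, 16000):,.0f} FLOPs per sample")   # 187,500
```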

Silver's features and head-to-head test results

In the Silver codec, we solved the above three problems with self-developed algorithms. As shown in the figure below, Silver first applies a real-time full-band AI noise suppression algorithm to provide noise robustness. On the decoding side, Silver achieves speech decoding with minimal computing power through a deeply optimized WaveRNN model.

[Figure: Silver codec flowchart]
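To make the flowchart concrete, here is a hypothetical end-to-end sketch. The three stage functions are placeholders of our own (Silver's actual components are proprietary); only their ordering, denoise first and generative decode last, follows the article:

```python
import numpy as np

def ai_denoise(frame: np.ndarray) -> np.ndarray:
    """Stage 1 (sender): full-band AI noise suppression before encoding."""
    return frame  # placeholder: a real model would suppress noise here

def encode(frame: np.ndarray) -> bytes:
    """Stage 2 (sender): compress to an ultra-low-bit-rate payload."""
    return frame[:8].astype(np.float16).tobytes()  # placeholder features

def wavernn_decode(payload: bytes, frame_len: int) -> np.ndarray:
    """Stage 3 (receiver): generative reconstruction of 32 kHz audio."""
    return np.zeros(frame_len, dtype=np.float32)   # placeholder waveform

def send_frame(mic_frame: np.ndarray) -> bytes:
    return encode(ai_denoise(mic_frame))  # clean first, then spend the bits

frame = np.random.randn(1280).astype(np.float32)   # 40 ms at 32 kHz
restored = wavernn_decode(send_frame(frame), len(frame))
```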

Features of Silver include:

1. Solves the noise robustness problem by integrating a self-developed real-time full-band AI noise suppression algorithm.

2. Runs its machine learning model on mobile devices: a deeply optimized WaveRNN model achieves speech decoding at very low computational cost. In our measurements on a single Qualcomm Snapdragon 855 core, decoding a 40 ms speech signal takes only 5 ms of computation, which comfortably supports all kinds of real-time interactive scenarios (see the quick calculation after this list).

3. Ultra-low bit rate: down to 2.7 kbps, saving bandwidth.

4. High sound quality: supports a 32 kHz sampling rate with ultra-wideband coded sound quality and a full, natural tone.
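A quick calculation on the figure in item 2 (our arithmetic, assuming decode cost scales linearly with concurrent streams) shows how much headroom that measurement leaves:

```python
# 5 ms of compute per 40 ms frame leaves ample single-core headroom.
decode_ms, frame_ms = 5, 40
print(f"Real-time factor: {decode_ms / frame_ms}")              # 0.125
print(f"Concurrent streams per core: {frame_ms // decode_ms}")  # up to ~8
```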

We compared the intelligibility and naturalness of Silver, Opus (6 kbps), and Lyra under the MUSHRA methodology, as shown below. REF is the full-score anchor and Anchor35 is the low-score anchor: the original speech (full-score anchor) and deliberately degraded synthetic data (low-score anchor) are mixed into the test corpus to calibrate the scores. We tested three languages, and Silver scored higher than the other codecs in each.
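For readers unfamiliar with how such results are summarized, here is a generic sketch in the spirit of ITU-R BS.1534; the listener scores are made up for illustration and are not Agora's data:

```python
import numpy as np

# Hypothetical per-listener MUSHRA scores (0-100) for one test item:
scores = {
    "REF":      np.array([100, 98, 97, 100, 99]),  # hidden reference
    "Anchor35": np.array([30, 35, 28, 40, 33]),    # low-quality anchor
    "CodecA":   np.array([80, 85, 78, 88, 82]),    # system under test
}
for system, s in scores.items():
    mean = s.mean()
    ci95 = 1.96 * s.std(ddof=1) / np.sqrt(len(s))  # normal-approx 95% CI
    print(f"{system:9s} {mean:5.1f} +/- {ci95:.1f}")
```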

[Figure: MUSHRA scores of Silver, Opus (6 kbps), and Lyra in three languages]

We also compared the three codecs in different noise environments; the scores are shown below. With the support of its AI noise suppression algorithm, Silver provides users with a more natural voice interaction experience.

[Figure: MUSHRA scores of the three codecs under different noise conditions]

For both noisy and noise-free environments, we prepared audio samples that let you compare the original sound with the result after passing through each codec. Since this platform cannot host audio, interested developers can click here to listen.

Due to space limitations, only a limited number of audio samples can be shared. If you want to learn more about Silver, visit the RTE developer community and leave a message on the forum to talk with us.

