With the development of information technology, the demand for real-time communication keeps growing, and it has gradually become an indispensable part of work and daily life. The enormous number of audio and video call minutes placed each year poses a huge challenge to Internet infrastructure. Although the vast majority of Internet users today enjoy good network conditions, there are still many regions where connectivity is extremely poor, and even in well-connected regions weak-network conditions still occur. How to provide a high-quality audio experience under limited bandwidth has therefore become a very important research direction.
Over the past few decades, coding techniques for speech and audio have relied on a large amount of domain-specific knowledge, such as speech generation models. In recent years, with the rapid development of deep learning, a variety of neural-network-based audio processing algorithms have emerged. After a comprehensive analysis of the common problems in real business scenarios, the Alibaba Cloud video cloud technical team began to explore data-driven methods to improve audio coding efficiency and proposed the intelligent audio codec AliIAC (Ali Intelligent Audio Codec), which can deliver a higher-quality audio call experience under limited bandwidth conditions.
What is an audio codec?
Even if you have never heard of the term audio codec, you have certainly used the technology in daily life: from watching TV programs to making phone calls, from short videos to live streaming, audio encoding and decoding is involved everywhere.
The goal of audio coding is to compress the input audio signal into a bit stream that occupies far less storage or transmission space than the original signal, and then to restore the signal from the received bit stream at the decoding end, with the reconstructed signal sounding as subjectively close to the original as possible. The encoding process is shown in the following formula:
$$ \boldsymbol{h} \leftarrow \mathcal{F}_{\mathrm{enc}}(\boldsymbol{x}) $$
where \( \boldsymbol{x} \in \mathbb{R}^{T} \) is the time-domain speech signal of length \( T \), and the latent representation \( \boldsymbol{h} \) is further converted into a bit stream \( \tilde{\boldsymbol{h}} \in \mathbb{R}^{N} \), with \( N \) much smaller than \( T \). The decoding process is shown in the following formula:
$$ \boldsymbol{x} \approx \hat{\boldsymbol{x}} \leftarrow \mathcal{F}_{\mathrm{dec}}(\tilde{\boldsymbol{h}}) $$
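To make the two formulas above concrete, here is a minimal interface sketch; the names `encode`, `decode`, `f_enc`, `quantize`, and `f_dec` are hypothetical placeholders and not AliIAC's actual API.

```python
import numpy as np

def encode(x: np.ndarray, f_enc, quantize) -> bytes:
    """Map a time-domain signal x (length T) to a compact bit stream.

    f_enc and quantize stand in for the codec's analysis / neural
    encoder and its quantizer, respectively.
    """
    h = f_enc(x)                      # latent representation, h = F_enc(x)
    h_tilde = quantize(h)             # discrete codes, far fewer than T values
    return np.asarray(h_tilde, dtype=np.uint8).tobytes()

def decode(bitstream: bytes, f_dec) -> np.ndarray:
    """Reconstruct an approximation x_hat of the original signal x."""
    h_tilde = np.frombuffer(bitstream, dtype=np.uint8)
    return f_dec(h_tilde)             # x_hat = F_dec(h_tilde) ≈ x
```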
Traditional audio codecs can be divided into two categories: waveform codecs and parametric codecs.
Waveform Codec
The goal of a waveform codec is to produce, at the decoder side, a reconstruction of the input audio samples that is as faithful as possible.
In most cases, waveform codecs rely on transform coding: the incoming time-domain waveform is mapped to the time-frequency domain, the transform coefficients are quantized, and an entropy coding module finally turns them into a bit stream suitable for transmission. At the decoder side, the time-domain waveform is reconstructed by the corresponding inverse transform.
In general, waveform codecs make few or no assumptions about the type of audio being encoded (e.g., speech, music), so they can handle a wide range of content. They produce very high-quality audio at medium and high bit rates, but tend to introduce coding artifacts at low bit rates, which degrades the perceived quality.
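As a rough illustration of the transform-coding pipeline described above, the sketch below encodes a single frame with a DCT and uniform quantization; real waveform codecs add overlapping MDCT windows, perceptual bit allocation, and entropy coding, so this is only a minimal teaching example.

```python
import numpy as np
from scipy.fft import dct, idct

def encode_frame(frame: np.ndarray, step: float = 0.02) -> np.ndarray:
    coeffs = dct(frame, norm='ortho')                 # time -> frequency domain
    return np.round(coeffs / step).astype(np.int16)   # coarse uniform quantization

def decode_frame(codes: np.ndarray, step: float = 0.02) -> np.ndarray:
    coeffs = codes.astype(np.float64) * step          # dequantize
    return idct(coeffs, norm='ortho')                 # inverse transform -> waveform

sr = 16000
t = np.arange(320) / sr                               # one 20 ms frame at 16 kHz
frame = 0.5 * np.sin(2 * np.pi * 440 * t)
reconstructed = decode_frame(encode_frame(frame))
print("max reconstruction error:", np.max(np.abs(reconstructed - frame)))
```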
Parametric Codec
The core idea of parametric codecs is to make specific assumptions about the audio to be encoded (e.g., speech) and to build that prior knowledge into the encoding process in the form of a parametric model.
The encoder first estimates the model parameters and then quantizes them for further compression; the decoder uses the quantized parameters to drive a synthesis model that reconstructs the time-domain waveform. Unlike waveform codecs, the goal of a parametric codec is not sample-by-sample waveform similarity, but audio that is perceptually close to the original.
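For a concrete, drastically simplified example of the parametric idea, the sketch below uses linear prediction (LPC), the classic speech model behind codecs such as MELP, to reduce a frame to a small set of model parameters plus an excitation signal; in a real parametric codec only the quantized parameters and a compact excitation description would be transmitted.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_coeffs(frame: np.ndarray, order: int = 16) -> np.ndarray:
    """Estimate LPC coefficients from the frame's autocorrelation."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    return solve_toeplitz(r[:order], r[1:order + 1])

sr = 16000
t = np.arange(320) / sr                              # one 20 ms frame
rng = np.random.default_rng(0)
frame = (np.sin(2 * np.pi * 200 * t) + 0.3 * np.sin(2 * np.pi * 400 * t)
         + 0.01 * rng.standard_normal(t.size))       # noise keeps r well conditioned

a = lpc_coeffs(frame)                                # model parameters to transmit
analysis = np.concatenate(([1.0], -a))               # A(z) = 1 - sum_k a_k z^-k
residual = lfilter(analysis, [1.0], frame)           # prediction error (excitation)
resynth = lfilter([1.0], analysis, residual)         # decoder: drive 1/A(z)
# Exact here only because the full residual is kept; a real parametric codec
# would describe the excitation with just a few coarse parameters.
print("resynthesis error:", np.max(np.abs(resynth - frame)))
```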
Challenges Facing Traditional Audio Codecs
Thanks to the vigorous development of the WebRTC ecosystem, the Opus audio codec is widely used in the industry. It targets a wide range of interactive audio applications, including VoIP (Voice over Internet Protocol), video conferencing, in-game chat, and even remote live music.
Opus combines two underlying codecs: SILK for speech and CELT for music. Although Opus and other traditional audio codecs (such as EVS, AMR-WB, Speex, and MELP) perform very well in general, they are limited to varying degrees under constrained bandwidth, low signal-to-noise ratio, and heavy reverberation, and cannot always deliver a smooth, clear call experience in today's complex and changeable application scenarios.
AliIAC Smart Audio Codec
Considering the excellent performance of traditional audio codecs at high bit rates and their mainstream status in the industry, the Alibaba Cloud video cloud audio technical team has proposed two versions of the intelligent audio codec: the E2E version and the Ex version.
The E2E version can directly replace traditional codec modules such as Opus, working at 6 kbps ~ 18 kbps to encode and decode 16 kHz audio; the Ex version works as a post-processing stage on top of a traditional codec, repairing and enhancing 16 kHz audio decoded at 6 kbps ~ 8 kbps to improve intelligibility and sound quality.
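For a sense of scale, here is a back-of-the-envelope calculation, assuming 16 kHz, 16-bit mono PCM as the uncompressed reference and a 20 ms frame size (the frame size is our assumption, not something stated above):

```python
pcm_kbps = 16_000 * 16 / 1000                 # 256 kbps of uncompressed PCM
for kbps in (6, 18):
    frame_bits = kbps * 1000 * 0.020          # bits available per 20 ms frame
    print(f"{kbps} kbps: ~{pcm_kbps / kbps:.0f}x compression, "
          f"{frame_bits:.0f} bits per frame")
```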
Algorithm principle
1. E2E version: built on an end-to-end encoder-decoder model that accounts for the spectral damage, reverberation, and residual noise encountered in real application scenarios, and trained with a GAN-style strategy to further improve decoding quality. For ease of deployment, a residual quantization module allows a single model to support variable bit rates from 6 kbps to 18 kbps (a toy sketch of residual quantization follows this list).
2. Ex version: a deep model that performs frequency-domain repair/enhancement on audio decoded at 6 kbps ~ 8 kbps by traditional codecs such as Opus. Loss compensation of the magnitude spectrum is performed in the 0-4 kHz band, and spectral prediction compensation is performed in the 4-8 kHz band; the repaired/enhanced audio shows a clear subjective improvement in intelligibility and sound quality (a skeletal repair pipeline is sketched after this list).
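The residual quantization mentioned for the E2E version can be illustrated with a toy residual vector quantization (RVQ) sketch; this is a generic NumPy/SciPy illustration of the technique, not AliIAC's implementation.

```python
# Each RVQ stage quantizes the residual left by the previous stages, so the
# codec can drop trailing stages to run at a lower bit rate with one model.
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
dim, n_codes, n_stages = 8, 256, 3              # 8 bits per stage per vector

train = rng.standard_normal((10_000, dim))      # toy stand-in for encoder outputs

codebooks, residual = [], train.copy()
for _ in range(n_stages):                       # greedy, stage-by-stage training
    cb, labels = kmeans2(residual, n_codes, minit='++')
    codebooks.append(cb)
    residual = residual - cb[labels]            # what the next stage must encode

# Using more stages spends more bits but reconstructs the latent more closely.
test = rng.standard_normal((1_000, dim))
residual = test.copy()
for k, cb in enumerate(codebooks, start=1):
    idx = np.argmin(((residual[:, None, :] - cb[None]) ** 2).sum(-1), axis=1)
    residual = residual - cb[idx]               # remaining reconstruction error
    print(f"{k} stage(s): mean residual norm "
          f"{np.linalg.norm(residual, axis=1).mean():.3f}")
```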
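And here is a skeletal version of an Ex-style frequency-domain repair pipeline: analyze the decoded audio with an STFT, let a model adjust the magnitude spectrum, and resynthesize with the decoded phase. `enhance_magnitude` is a hypothetical placeholder, not Alibaba's model.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance_magnitude(mag: np.ndarray) -> np.ndarray:
    """Placeholder for a network that would compensate the 0-4 kHz band
    and predict missing 4-8 kHz content."""
    return mag  # identity here; a real model would modify the spectrum

def repair(decoded: np.ndarray, sr: int = 16000) -> np.ndarray:
    _, _, Z = stft(decoded, fs=sr, nperseg=512)
    mag, phase = np.abs(Z), np.angle(Z)
    mag = enhance_magnitude(mag)                       # frequency-domain repair
    _, out = istft(mag * np.exp(1j * phase), fs=sr, nperseg=512)
    return out[:len(decoded)]

audio = np.random.default_rng(1).standard_normal(16000)  # 1 s stand-in signal
print(repair(audio).shape)
```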
Algorithm performance
Algorithm effect
Scene 1: Real-world recording with spectral damage and reverberation
Spectrogram comparison of different methods:
Judging from both subjective listening and the spectrograms, the opus 6k Ex, E2E 6k, and E2E 18k versions all show a significant improvement over the opus 6k version. Where the first half of the spectrum is obviously damaged, the opus 6k Ex and E2E 6k outputs still retain a little residual noise after decoding, while the E2E 18k version is basically close to the original audio.
Scene 2: Real-world recording with noise
Spectrogram comparison of different methods:
Judging from both subjective listening and the spectrograms, the opus 6k Ex, E2E 6k, and E2E 18k versions all show a significant improvement over the opus 6k version, with timbre and pitch close to those of the original audio.
AliIAC smart audio codec will continue to evolve
AliIAC, as part of Alibaba Cloud's video cloud audio solution, aims to make full use of data-driven ideas to improve audio coding efficiency, so that a better audio call experience can be obtained at a lower bandwidth cost.
At present, AliIAC is still at the stage of balancing computing power, bit rate, and quality, and issues such as real-time performance and the stability of the generated audio need further work, but it already delivers excellent results: in most practical scenarios, E2E at 18 kbps matches the quality of Opus at 24 kbps and E2E at 6 kbps matches Opus at 12 kbps, saving an average of 25% ~ 50% of bandwidth; and without consuming additional bandwidth, the average subjective MOS score can be improved by 0.2 ~ 0.4. In the future, the Alibaba Cloud video cloud audio technical team will continue to explore audio technologies that combine deep learning with signal processing to create the ultimate audio experience.
"Video Cloud Technology", your most noteworthy public account of audio and video technology, pushes practical technical articles from the frontline of Alibaba Cloud every week, where you can communicate with first-class engineers in the audio and video field. Reply to [Technology] in the background of the official account, you can join the Alibaba Cloud video cloud product technology exchange group, discuss audio and video technology with industry leaders, and obtain more latest industry information.