WebRTC (Web Real-Time Communication) is an open-source technology that enables real-time voice and video communication directly in web browsers. It removes much of the technical barrier to Internet audio and video communication and is gradually becoming a global standard.

Over the past decade, thanks to the contributions of many developers, the application scenarios of this technology have grown ever broader and richer. In the era of artificial intelligence, where will WebRTC go next? This article shares the main directions for combining WebRTC with artificial intelligence and the innovative practice of Rongyun in this area.

WebRTC + artificial intelligence: more natural sound, higher-definition video

Artificial intelligence is being applied to audio and video ever more widely. In audio, it is mainly used for noise suppression and echo cancellation; in video, it is mostly used for virtual backgrounds, video super-resolution, and similar features.

AI voice noise reduction

Voice noise reduction has a long history; analog-circuit methods were the first to be used. With the development of digital circuits, noise-reduction algorithms replaced traditional analog circuits and greatly improved quality. These classic algorithms estimate noise from statistical theory and can remove steady-state noise fairly cleanly. Against non-steady-state noise, however, such as keyboard clicks, knocks on a desk, or cars passing on the road, the classic algorithms are largely powerless.

AI voice noise reduction arose to fill this gap. Built on large speech corpora, carefully designed algorithms, and continuous training, it does away with the tedious and ambiguous parameter-tuning process. AI noise reduction has a natural advantage with non-steady-state noise: it can learn the characteristics of such sounds and suppress them in a targeted way.

Echo cancellation

Echo arises when the sound played by the loudspeaker is attenuated, delayed, and then picked up again by the microphone. Before audio is sent, the unwanted echo must be removed from the voice stream. WebRTC's linear filter uses frequency-domain block adaptive processing but does not carefully handle multi-party calls; the non-linear echo suppression stage uses Wiener filtering.
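To make the linear stage concrete, below is a minimal time-domain NLMS (normalized least mean squares) adaptive-filter sketch of linear echo cancellation. It illustrates the general principle only: WebRTC's AEC3 actually works on frequency-domain blocks, which this simplified version does not reproduce, and the function name and parameters here are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Minimal NLMS adaptive echo canceller (time domain).
// x: far-end reference (what the speaker played)
// d: microphone signal (near-end speech plus echo)
// Returns the echo-suppressed signal e = d - predicted echo.
std::vector<float> nlmsEchoCancel(const std::vector<float>& x,
                                  const std::vector<float>& d,
                                  std::size_t taps = 256,
                                  float mu = 0.5f,
                                  float eps = 1e-6f) {
    std::vector<float> w(taps, 0.0f);   // adaptive filter estimating the echo path
    std::vector<float> e(d.size(), 0.0f);
    for (std::size_t n = 0; n < d.size(); ++n) {
        float y = 0.0f, energy = eps;
        for (std::size_t k = 0; k < taps && k <= n; ++k) {
            y += w[k] * x[n - k];            // predicted echo
            energy += x[n - k] * x[n - k];   // reference energy for normalization
        }
        e[n] = d[n] - y;                     // error = mic minus predicted echo
        float step = mu * e[n] / energy;     // normalized step size
        for (std::size_t k = 0; k < taps && k <= n; ++k)
            w[k] += step * x[n - k];         // LMS weight update
    }
    return e;
}
```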

Combined with artificial intelligence, linear and non-linear echo can instead be removed directly with deep-learning methods, treating echo cancellation as a speech-separation problem solved by a carefully designed neural network.

Virtual background

A virtual background relies on segmentation: the foreground of each frame is segmented out and the background is replaced with another picture. The main application scenarios include live streaming, real-time communication, and interactive entertainment; the technologies involved are mainly image segmentation and video segmentation. A typical example is shown in Figure 1.

(Figure 1 The black background in the upper image is replaced with the purple background in the lower image)
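Once a segmentation model has produced a foreground mask, the background replacement itself is simple alpha blending. Below is a minimal OpenCV sketch; the segmentation model is out of scope, and the function name and mask format are illustrative assumptions rather than any specific product's implementation.

```cpp
#include <opencv2/opencv.hpp>

// Composite a virtual background given a soft foreground mask.
// `mask` is a single-channel float image in [0,1], as a
// portrait-segmentation model would produce.
cv::Mat replaceBackground(const cv::Mat& frame,      // BGR camera frame
                          const cv::Mat& mask,       // CV_32FC1 foreground alpha
                          const cv::Mat& background) // BGR replacement image
{
    cv::Mat bg;
    cv::resize(background, bg, frame.size());        // match the frame size

    cv::Mat frameF, bgF;
    frame.convertTo(frameF, CV_32FC3);
    bg.convertTo(bgF, CV_32FC3);

    cv::Mat alpha3;
    cv::cvtColor(mask, alpha3, cv::COLOR_GRAY2BGR);  // broadcast alpha to 3 channels

    // out = background + alpha * (foreground - background)
    cv::Mat diff = frameF - bgF;
    cv::Mat out = bgF + alpha3.mul(diff);
    out.convertTo(out, CV_8UC3);
    return out;
}
```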

Video super-resolution

Video super-resolution restores blurry, low-quality video to a sharper picture. When bandwidth is limited, a lower-quality video can be transmitted at a low bit rate and then restored to high definition with image super-resolution, which makes the technique very valuable in WebRTC. A typical example is shown in Figure 2: even with limited bandwidth, high-resolution video can still be obtained from a low-resolution video stream.

(Figure 2 Original low-resolution image vs processed high-resolution image)
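As a rough illustration of the idea, the sketch below upscales a decoded low-resolution frame with OpenCV's dnn_superres module (from opencv_contrib). The article does not say which super-resolution model is used in practice; ESPCN with a 2x factor is shown here purely as an example of a publicly available pre-trained model.

```cpp
#include <opencv2/dnn_superres.hpp>
#include <opencv2/opencv.hpp>

// Upscale a decoded low-resolution frame with a pre-trained SR model.
int main() {
    cv::dnn_superres::DnnSuperResImpl sr;
    sr.readModel("ESPCN_x2.pb");   // path to the pre-trained model file
    sr.setModel("espcn", 2);       // algorithm name and upscale factor

    cv::Mat low = cv::imread("low_res_frame.png");
    cv::Mat high;
    sr.upsample(low, high);        // 2x super-resolved output

    cv::imwrite("high_res_frame.png", high);
    return 0;
}
```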

Rongyun's innovative practice

WebRTC is an open-source technology stack; achieving the best results in real scenarios requires a great deal of optimization. Drawing on its own business characteristics, Rongyun modified the source code of WebRTC's audio processing and video compression to implement deep-learning-based noise suppression and more efficient video compression.

Audio processing

On top of WebRTC's built-in AEC3, ANS, and AGC, Rongyun added an AI voice noise-reduction module for speech-only scenarios such as meetings and teaching, and optimized the AEC3 algorithm to greatly improve sound quality in music scenarios.

AI voice noise reduction: most of the industry uses time-domain and frequency-domain mask methods that combine traditional algorithms with deep neural networks. A deep neural network estimates the signal-to-noise ratio, from which gains for different frequency bands are computed; after transforming back to the time domain, a further time-domain gain is calculated and applied. This removes as much noise as possible while preserving speech.
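The sketch below shows the core of such a mask method for a single STFT frame: per-band SNR estimates (which would come from the neural network, stubbed out here) are turned into Wiener-style gains and applied to the spectrum. The gain formula and floor value are common textbook choices, not Rongyun's actual parameters.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Apply per-band gains derived from DNN-estimated SNR to one STFT frame.
// `snrDb` would come from the noise-estimation network (not shown);
// the Wiener-style gain snr/(snr+1) is one common choice of mask.
void applySpectralGains(std::vector<float>& re,        // real part per bin
                        std::vector<float>& im,        // imag part per bin
                        const std::vector<float>& snrDb,
                        float floorGain = 0.1f) {      // limits musical noise
    for (std::size_t k = 0; k < re.size(); ++k) {
        float snr = std::pow(10.0f, snrDb[k] / 10.0f); // dB -> linear power ratio
        float gain = snr / (snr + 1.0f);               // Wiener gain
        gain = std::max(gain, floorGain);              // keep a gain floor
        re[k] *= gain;
        im[k] *= gain;
    }
}
```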

Because deep-learning noise-reduction models rely heavily on RNNs (recurrent neural networks), the model continues to treat the signal as speech for a while after the speech has actually ended, and the delay is long enough that the speech can no longer mask the residual noise. Rongyun therefore added a prediction module on top of the existing model: it anticipates the end of speech from the amplitude envelope and the rate at which the SNR falls, and removes the otherwise audible residual noise at the end of each utterance.

(Figure 3 Residual noise tail before optimization)

(Figure 4 No residual noise tail after optimization)
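As a rough sketch of what such a prediction module might look like, the heuristic below flags the end of an utterance when the amplitude envelope decays and the frame-level SNR falls steadily. All thresholds are illustrative assumptions; Rongyun's actual module is model-based and not published.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Heuristic end-of-speech predictor: flag the end of an utterance when
// the amplitude envelope decays and the frame SNR drops steadily.
bool predictSpeechEnd(const std::vector<float>& envelope, // per-frame RMS
                      const std::vector<float>& snrDb,    // per-frame SNR (dB)
                      float envDecay = 0.5f,   // envelope below 50% of recent peak
                      float snrSlope = -3.0f)  // dB per frame, sustained drop
{
    const std::size_t n = envelope.size();
    if (n < 4 || snrDb.size() < n) return false;
    // Has the envelope decayed relative to its recent peak?
    float peak = 0.0f;
    for (std::size_t i = n - 4; i < n; ++i) peak = std::max(peak, envelope[i]);
    bool envFalling = envelope[n - 1] < envDecay * peak;
    // Has the SNR been declining over the last three frames?
    bool snrFalling = (snrDb[n - 1] - snrDb[n - 2] < snrSlope) &&
                      (snrDb[n - 2] - snrDb[n - 3] < snrSlope);
    return envFalling && snrFalling;
}
```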

Video processing

In the WebRTC source code, video encoding mainly uses the open-source OpenH264, VP8, and VP9 encoders, repackaged behind a unified interface. Rongyun modified the OpenH264 source code to implement background modeling and region-of-interest coding.

Background modeling: to keep video encoding real-time, background modeling must run on the GPU. Investigation showed that OpenCV's background-modeling algorithms support GPU acceleration. In practice, we convert the original YUV image captured by the camera or other capture device into an RGB image and upload the RGB image to the GPU. The background frame is then computed on the GPU and transferred back to the CPU. Finally, the background frame is added to OpenH264's long-term reference frame list to improve compression efficiency. The flow is shown in Figure 5.


(Figure 5 Background modeling flow chart)
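A minimal sketch of this pipeline, using the CUDA MOG2 background subtractor from OpenCV's cudabgsegm module (opencv_contrib), is shown below. Inserting the background frame into OpenH264's long-term reference list requires the modified encoder and is not shown; the class and method names here are illustrative.

```cpp
#include <opencv2/cudabgsegm.hpp>
#include <opencv2/opencv.hpp>

// Per-frame background extraction on the GPU with CUDA MOG2.
class GpuBackgroundModel {
public:
    GpuBackgroundModel()
        : mog2_(cv::cuda::createBackgroundSubtractorMOG2()) {}

    cv::Mat update(const cv::Mat& yuvI420) {
        cv::Mat bgr;
        cv::cvtColor(yuvI420, bgr, cv::COLOR_YUV2BGR_I420); // YUV -> RGB space
        gpuFrame_.upload(bgr);                              // CPU -> GPU
        mog2_->apply(gpuFrame_, gpuMask_, -1.0);            // update the model
        mog2_->getBackgroundImage(gpuBackground_);          // background on GPU
        cv::Mat background;
        gpuBackground_.download(background);                // GPU -> CPU
        return background;  // candidate for the long-term reference list
    }

private:
    cv::Ptr<cv::cuda::BackgroundSubtractorMOG2> mog2_;
    cv::cuda::GpuMat gpuFrame_, gpuMask_, gpuBackground_;
};
```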

Region-of-interest extraction: the region-of-interest coding stage uses the yolov4-tiny model for object detection, and the detections are fused with the foreground region extracted by background modeling. Part of the code is shown in Figure 6 below: after the network is loaded, CUDA is selected for acceleration and the input size is set to 416*416.


(Figure 6 Part of the program that loads the network to the GPU)
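Since the original listing in Figure 6 is not reproduced here, the sketch below shows the equivalent steps with OpenCV's dnn module: load yolov4-tiny, select the CUDA backend, and feed it 416*416 input. The model file names are the standard Darknet release names and are assumptions about the exact setup.

```cpp
#include <vector>
#include <opencv2/dnn.hpp>
#include <opencv2/opencv.hpp>

// Load yolov4-tiny and select CUDA for acceleration.
cv::dnn::Net loadYoloV4Tiny() {
    cv::dnn::Net net = cv::dnn::readNetFromDarknet("yolov4-tiny.cfg",
                                                   "yolov4-tiny.weights");
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA); // run on the GPU
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);
    return net;
}

// Run one frame through the detector at 416x416.
std::vector<cv::Mat> detect(cv::dnn::Net& net, const cv::Mat& frame) {
    // Scale to [0,1] and swap BGR -> RGB as YOLO expects.
    cv::Mat blob = cv::dnn::blobFromImage(frame, 1.0 / 255.0,
                                          cv::Size(416, 416),
                                          cv::Scalar(), /*swapRB=*/true,
                                          /*crop=*/false);
    net.setInput(blob);
    std::vector<cv::Mat> outs;
    net.forward(outs, net.getUnconnectedOutLayersNames());
    return outs; // raw detections; NMS and ROI fusion happen downstream
}
```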

Experimental results of the modified encoder in WebRTC: to verify the effect, we used the video loopback test program in WebRTC to exercise the modified OpenH264. Figure 7 shows video captured live by a camera, with background modeling at 1920*1080 resolution; Figure 8 shows the output. To guarantee real-time performance, WebRTC discards frames that, for whatever reason, are not actually encoded within the allotted time. Figure 8 shows that our algorithm did not add significant encoding time and did not cause the encoder to drop frames.

(Figure 7 Current frame and background frame)

(Figure 8 The actual effect of the encoder)

In summary, AI-based noise reduction can significantly improve the existing voice-call experience, although model predictions are not yet accurate enough and the computational load is relatively high; as models continue to improve and data sets continue to grow, AI voice noise reduction will bring an ever better call experience. On the video side, background modeling adds background frames to the long-term reference frame list, effectively improving coding efficiency in surveillance-style scenes, while combining object detection with background modeling and an efficient bit-rate allocation scheme raises the coding quality of regions of interest, improving the viewing experience for users on weak networks.

Technology keeps changing, and we have entered an era of pervasive intelligence in which artificial intelligence is deeply applied across scenarios. In the audio and video industry, combining these advances with WebRTC is equally promising, and service optimization never ends. Rongyun will keep pace with technology trends, continue to explore innovative techniques, and distill them into low-level capabilities that developers can use conveniently, empowering developers over the long term.

