
Requirements

With the development of RTC technology, the barrier to audio and video communication has dropped very low: mobile, PC, Web or mini program, you can pick up whatever device is at hand and complete a high-quality audio and video call. And as the mobile Internet (4G, 5G) and AI technology have evolved, people's demands on audio and video communication are no longer limited to being heard and seen; they have begun to pursue more interactive and novel ways of communicating, such as beauty filters, props and interactive graffiti. The directions in which audio and video communication can expand are endless, especially in ToC scenarios.

From a technical point of view, native video processing is nothing new. Libraries such as OpenCV have long open-sourced capabilities like face detection and image processing, and with a few interface calls you can spin up a new project that does some simple video processing. The Web side, however, has always lagged behind here. No matter how loudly front-end technology advertises its performance, the best it can claim is to be close to native; the bottleneck is plain to see (JavaScript was never designed for raw speed).

Technology selection

ActiveX solution

Around 2000, in order to beat the then-emerging Netscape browser, Microsoft wanted a way to let its flagship product, Office, run inside IE. The result was ActiveX: a technology that sounds fantastic, letting native code interact seamlessly with the browser. The combination of ActiveX and Office did eventually curb Netscape's growth and kept Internet Explorer dominant for a long time.

ActiveX is essentially a COM component built on the COM standard. During installation it writes its own GUID and install path into the registry; JavaScript can then load the native object by its GUID and invoke it with plain dot syntax. Because it is a COM component, interface calls run directly in memory, no different from calling a dynamic library (DLL) inside a native project. Even more remarkably, ActiveX can render a native UserControl directly inside the browser. MFC, Qt, WinForms, WPF: all of the mainstream Windows UI frameworks can be used to build ActiveX controls. (I have to admit that with the vigorous growth of the mobile Internet, PC development technology has begun to decline; these terms are far less familiar today than flutter or vue.) After using WPF to build and call an ActiveX plug-in, we were stunned: how did a technology that sounds this omnipotent become so unpopular? The answer: security.

Because of its high privileges and flexibility, ActiveX can do whatever it wants on the user's PC: arbitrarily adding or modifying local files, reading login information, launching external executables directly from the browser, and so on. Just listing these makes one's skin crawl. At the beginning of the 21st century, when the Internet was just emerging, most people had no real understanding of computers or the Web; who knows how many game accounts were stolen simply because users clicked "allow" when an ActiveX plug-in asked to load.

So Chrome, Firefox and other browsers gradually dropped support for ActiveX, and even Microsoft itself no longer supports it in Edge. Only the old, creaking IE still does, and unfortunately IE has itself stopped being maintained and will soon be removed from the list of components pre-installed with Windows. Against this tide, the ActiveX route is destined to be submerged by the trend of technological development.

ActiveX still serves organizations such as banks and government agencies that run on private networks well, where its security problems are not so fatal. But we cannot design a new solution around a dying technology; ActiveX may remain a fallback for certain specific scenarios, but it will never be our first choice.

WebAssembly solution

With the decline of ActiveX, a new solution was urgently needed to fill the gap between native code and the front end, and this is where WebAssembly came in.

Through Emscripten, C, C++ and Rust code can be compiled into WebAssembly; the compiled .wasm file is bytecode that JavaScript can call.
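To make this concrete, here is a minimal, hedged sketch of the kind of C++ that Emscripten turns into a .wasm module callable from JavaScript. The function name and the build command are illustrative assumptions, not something taken from the article.

```cpp
// invert.cpp -- built with something like:
//   emcc invert.cpp -O3 -s EXPORTED_FUNCTIONS=_invert_rgba -o invert.js
#include <emscripten/emscripten.h>
#include <cstdint>

extern "C" {

// EMSCRIPTEN_KEEPALIVE keeps the symbol exported so JavaScript can call it
// through the generated module (e.g. on pixel data copied into the wasm heap).
EMSCRIPTEN_KEEPALIVE
void invert_rgba(uint8_t* pixels, int pixelCount)
{
    // Invert every color channel, leave alpha untouched.
    for (int i = 0; i < pixelCount * 4; i += 4) {
        pixels[i]     = 255 - pixels[i];
        pixels[i + 1] = 255 - pixels[i + 1];
        pixels[i + 2] = 255 - pixels[i + 2];
    }
}

} // extern "C"
```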

This sounded exciting, so we set out to build our own WebAssembly module. The more mature frameworks with WebAssembly support currently include Unity and Qt. Compiling to WebAssembly from either is straightforward, test demos are easy to put together, and the native UI renders nicely on the front end, which reminded me of the glory days of ActiveX!

Next we wanted to turn on the camera and do some simple video processing. We wrote the code full of anticipation and tried to run it on the front end, only to get stuck. A look at the official Qt for WebAssembly documentation explains why:

The Qt Multimedia framework is explicitly marked as unusable under WebAssembly, and even the Qt team has not fully worked out which modules are available and which are not. From where we stood, the road ahead was riddled with pitfalls.

To ensure security, WebAssembly runs in a sandboxed environment, so its permissions are inevitably limited. We joked that, for developers, WebAssembly is a step backwards from ActiveX (for users it is undoubtedly a step forward).

In a scientific and rigorous spirit, we decided to verify the approach from another angle: capture video on the front end and process it with WebAssembly, to check both its overall feasibility and the near-native speed advertised online.

Fortunately, OpenCV provides a WebAssembly build, which is exactly what we needed for some simple verification. We built a native project integrating the C++ version of OpenCV; for the WebAssembly version, OpenCV officially provides a test page, which saved us a good deal of work.

Take bilateral filtering as an example and pick a suitable set of parameters for comparison: a diameter of 15 and a sigma of 30.
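For reference, these parameters correspond to the following native OpenCV call; this is only a sketch of the measured operation, not the article's actual test harness.

```cpp
// Bilateral filter with the parameters used in the comparison:
// diameter = 15, sigmaColor = sigmaSpace = 30.
#include <opencv2/imgproc.hpp>

void applyBilateral(const cv::Mat& src, cv::Mat& dst)
{
    cv::bilateralFilter(src, dst, 15, 30.0, 30.0);
}
```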

The performance of WebAssembly is as follows:

The video frame rate drops to around 4 FPS, and playback is visibly choppy.

Native performance is as follows:

The video frame rate stays around 16 FPS. Although the experience is affected, this still meets RTC transmission requirements (RTC generally treats 13~30 FPS as normal).

Continuing on the native side, we added Gaussian filtering on top, with a Gaussian kernel of 3 in both width and height. The performance is as follows:

The video frame rate stays around 14 FPS; the extra step has a negligible impact on performance and still meets RTC transmission requirements (RTC generally treats 13~30 FPS as normal).
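Under the same assumptions as the sketch above, the second measurement corresponds to chaining a 3x3 Gaussian blur after the bilateral filter, roughly like this:

```cpp
// Bilateral filter followed by a 3x3 Gaussian blur (sigma 0 lets OpenCV
// derive it from the kernel size).
#include <opencv2/imgproc.hpp>

void applyBilateralThenGaussian(const cv::Mat& src, cv::Mat& dst)
{
    cv::Mat tmp;
    cv::bilateralFilter(src, tmp, 15, 30.0, 30.0);
    cv::GaussianBlur(tmp, dst, cv::Size(3, 3), 0.0);
}
```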

Other parameter combinations behaved roughly the same as this set of tests: at least for this kind of video processing, WebAssembly performs far below native. It may simply be that OpenCV's WebAssembly support is not mature enough, but this comparison, together with WebAssembly's permission limitations, left us somewhat disappointed.

WebSocket local connection solution

There is no established name for this approach. The idea is to run the native program as a server and let the front end talk to it through a localhost port: HTTP for small payloads (broader browser support) and WebSocket for large ones (IE10 and above). For RTC, if sending happens on the front end, the WebSocket channel may have to carry several megabytes per second to push video frames from the native process to the front end, which then has to render them via WebGL.
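A rough back-of-the-envelope estimate of that payload, under assumed (not measured) resolution and frame rate, shows the scale of traffic the local channel would have to sustain:

```cpp
// Uncompressed RGB24 frames at 640x480 and 15 FPS -- assumptions for illustration only.
#include <cstdio>

int main()
{
    const double bytesPerSecond = 640.0 * 480.0 * 3.0 * 15.0;
    std::printf("%.1f MB/s uncompressed\n", bytesPerSecond / (1024.0 * 1024.0));
    // Prints roughly 13.2 MB/s for raw frames -- clearly WebSocket territory, not HTTP.
    return 0;
}
```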

Even though the communication never leaves the machine, we worried about its performance: capturing audio and video in two separate processes risks frame-rate overruns and audio/video synchronization problems, so we did not pursue this approach much further.

Virtual camera solution

With the other solutions ruled out one by one, let us not forget ActiveX. COM has a performance advantage the other options simply cannot match: they are either slower than native or merely claim to be close to native, whereas COM delivers genuinely native performance.

After some research around COM, we found another path that meets our needs: a COM component combined with DirectShow that feeds processed video into a virtual camera, quietly swapping in our output at the capture level. If this solution works, the final product will not be limited to our current scenarios; any application that uses DirectShow to access the camera will be able to use our packaged video processing.

We built a COM project, encapsulated the AI digital human rendering, called the DirectShow interfaces to register the virtual camera and push the video stream, and wrote a batch script to register our COM component on the system. After this series of work we tested with a number of camera testing tools, and the results were surprisingly good.

The following shows a virtual camera with AR-mask processing being used in NetEase Meeting:

Final solution

After verifying all these candidates, we settled on the virtual camera approach as our final solution; in terms of performance and coupling it is hard to fault.

Solution architecture

Key implementation

1. First, create a new dynamic library (DLL) project named WebCamCOM, and use the CoCreateInstance and RegisterFilter interfaces to register our object as a DirectShow filter.

2. Use the memoryapi.h interfaces (shared memory) to pass our custom data. Besides the raw video data, we also pass the video width, height and timestamp (see the sketches after this list).

3. Use CreateMutex to guarantee safe access while the memory is shared.

4. Create another dynamic library project named SharedImageWrapper, which exposes only one external interface.

5. Based on the shouldRotate input parameter, decide whether the frame needs to be flipped vertically (used to adapt to Unity).

6. After this simple processing, the video data is likewise passed to our DirectShow filter through the memoryapi.h interfaces.

7. The upper layer can then call the integrated SendImage interface to send the captured RGB data to DirectShow.

8. Write a batch script that uses the regsvr32 command to register WebCamCOM into the system registry with administrator privileges.
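For steps 1 and 8, a hedged sketch of what the registration might look like: a helper that registers the filter under the video capture device category through IFilterMapper2::RegisterFilter (the minimal REGFILTER2 and the filter name are assumptions for illustration; the article does not show its actual code). regsvr32 would reach code like this through the DLL's exported DllRegisterServer.

```cpp
// Sketch only: register a filter CLSID under "Video Capture Sources" so that apps
// enumerating cameras via DirectShow can see it. Link with ole32.lib and strmiids.lib.
#include <dshow.h>

HRESULT RegisterVirtualCamera(const CLSID& clsidFilter)
{
    IFilterMapper2* mapper = nullptr;
    HRESULT hr = CoCreateInstance(CLSID_FilterMapper2, nullptr, CLSCTX_INPROC_SERVER,
                                  IID_IFilterMapper2, reinterpret_cast<void**>(&mapper));
    if (FAILED(hr)) return hr;

    REGFILTER2 rf2 = {};
    rf2.dwVersion = 1;               // use the cPins/rgPins layout
    rf2.dwMerit   = MERIT_DO_NOT_USE;
    rf2.cPins     = 0;
    rf2.rgPins    = nullptr;

    // The category determines where the filter shows up; CLSID_VideoInputDeviceCategory
    // is what camera-consuming applications enumerate.
    hr = mapper->RegisterFilter(clsidFilter, L"WebCamCOM Virtual Camera", nullptr,
                                &CLSID_VideoInputDeviceCategory, nullptr, &rf2);
    mapper->Release();
    return hr;
}
```

For steps 2~7, a minimal sketch of the shared-memory hand-off between SharedImageWrapper and the filter, assuming a frame layout invented for illustration (a small header with width, height and timestamp, followed by raw RGB24 bytes) and hypothetical mapping/mutex names. A real implementation would create the mapping and mutex once and reuse them rather than per call.

```cpp
// Sketch of a SendImage-style producer using memoryapi.h shared memory plus a mutex.
#include <windows.h>
#include <cstdint>
#include <cstring>

struct FrameHeader {
    uint32_t width;
    uint32_t height;
    uint64_t timestampMs;
};

// Hypothetical names; producer and consumer only need to agree on them.
static const wchar_t* kMappingName = L"Local\\WebCamCOM_Frame";
static const wchar_t* kMutexName   = L"Local\\WebCamCOM_Mutex";

bool SendImage(const uint8_t* rgb, uint32_t width, uint32_t height,
               uint64_t timestampMs, bool shouldRotate)
{
    const size_t pixelBytes = static_cast<size_t>(width) * height * 3;   // RGB24
    const size_t totalBytes = sizeof(FrameHeader) + pixelBytes;

    // Page-file-backed named mapping shared with the DirectShow filter's process.
    HANDLE mapping = CreateFileMappingW(INVALID_HANDLE_VALUE, nullptr, PAGE_READWRITE,
                                        0, static_cast<DWORD>(totalBytes), kMappingName);
    if (!mapping) return false;

    HANDLE mutex = CreateMutexW(nullptr, FALSE, kMutexName);
    if (!mutex) { CloseHandle(mapping); return false; }

    void* view = MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, totalBytes);
    if (!view) { CloseHandle(mutex); CloseHandle(mapping); return false; }

    WaitForSingleObject(mutex, INFINITE);            // step 3: serialize access

    auto* header = static_cast<FrameHeader*>(view);  // step 2: width/height/timestamp
    header->width = width;
    header->height = height;
    header->timestampMs = timestampMs;

    uint8_t* dst = reinterpret_cast<uint8_t*>(header + 1);
    const size_t stride = static_cast<size_t>(width) * 3;
    if (shouldRotate) {
        // Step 5: Unity textures are bottom-up, so copy rows in reverse order.
        for (uint32_t y = 0; y < height; ++y)
            std::memcpy(dst + y * stride, rgb + (height - 1 - y) * stride, stride);
    } else {
        std::memcpy(dst, rgb, pixelBytes);
    }

    ReleaseMutex(mutex);
    UnmapViewOfFile(view);
    CloseHandle(mutex);
    CloseHandle(mapping);
    return true;
}
```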

Problems

  1. Unity reads Texture data bottom-up, so using it directly produces an upside-down image; a vertical flip is required.
  2. Unity can render with either OpenGL or Direct3D, and parsing the native texture handle differs between the two, so two sets of interfaces are needed (sketched below).

OpenGL:
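A hedged sketch of the read-back when Unity uses the OpenGL renderer. The handle is assumed to be the value of Texture.GetNativeTexturePtr() interpreted as a GL texture name; this is an illustration, not the plug-in's actual code.

```cpp
// Desktop OpenGL read-back of a texture into CPU memory (link with opengl32.lib).
#include <windows.h>
#include <GL/gl.h>
#include <vector>
#include <cstdint>

std::vector<uint8_t> ReadTextureGL(GLuint nativeTex, int width, int height)
{
    std::vector<uint8_t> rgba(static_cast<size_t>(width) * height * 4);
    glBindTexture(GL_TEXTURE_2D, nativeTex);
    // Pull mip level 0 back as RGBA8; must run on the rendering thread/context.
    glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, rgba.data());
    glBindTexture(GL_TEXTURE_2D, 0);
    return rgba;
}
```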

D3D:
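The equivalent hedged sketch for the Direct3D 11 renderer: the handle is assumed to be an ID3D11Texture2D*, and since a GPU texture cannot be mapped directly, it is first copied into a CPU-readable staging texture. Again an illustration, not the plug-in's actual code.

```cpp
// Direct3D 11 read-back through a staging texture (link with d3d11.lib).
#include <d3d11.h>
#include <vector>
#include <cstdint>
#include <cstring>

std::vector<uint8_t> ReadTextureD3D11(ID3D11Device* device, ID3D11DeviceContext* ctx,
                                      ID3D11Texture2D* tex)
{
    D3D11_TEXTURE2D_DESC desc = {};
    tex->GetDesc(&desc);

    // Re-describe the texture as a CPU-readable staging copy.
    desc.Usage          = D3D11_USAGE_STAGING;
    desc.BindFlags      = 0;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
    desc.MiscFlags      = 0;

    ID3D11Texture2D* staging = nullptr;
    if (FAILED(device->CreateTexture2D(&desc, nullptr, &staging))) return {};
    ctx->CopyResource(staging, tex);

    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (FAILED(ctx->Map(staging, 0, D3D11_MAP_READ, 0, &mapped))) {
        staging->Release();
        return {};
    }

    // Copy row by row, since RowPitch may be larger than width * 4.
    std::vector<uint8_t> rgba(static_cast<size_t>(desc.Width) * desc.Height * 4);
    const auto* src = static_cast<const uint8_t*>(mapped.pData);
    for (UINT y = 0; y < desc.Height; ++y)
        std::memcpy(rgba.data() + static_cast<size_t>(y) * desc.Width * 4,
                    src + static_cast<size_t>(y) * mapped.RowPitch, desc.Width * 4);

    ctx->Unmap(staging, 0);
    staging->Release();
    return rgba;
}
```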

Outlook

Although DirectShow is still the mainstream framework for driving cameras, using the Media Foundation framework is becoming the trend, so we are considering adapting the interfaces to Media Foundation in the future (developing a USB camera driver is another feasible route).

At present the video processing capabilities mainly revolve around digital human avatars, beautification and virtual backgrounds. Building on the existing framework, they could be combined with many more playful video processing techniques.

The plug-in itself can be combined with the WebSocket (HTTP) approach to expose a few interfaces, such as the beauty parameters or the digital human's appearance, so that the front end can configure the plug-in silently.

The plug-in could also ship with a practical settings UI, letting users drag controls and preview the effect in real time.

Summary

This article has introduced some of NetEase's research on PC Web video processing solutions, compared the pros and cons of the candidate solutions from several angles, and finally outlined the implementation ideas behind the virtual camera solution. Perhaps you do not work in audio and video, and perhaps you do not care about PC development; we still hope this article gives you a few different perspectives on these technologies. Due to space limitations we unfortunately could not dig into the core COM mechanism; if you are interested, you can swim against the tide and play with this piece of "black technology" in PC development.



Welcome to follow NetEase Yunxin on GitHub: