Requirements

With the development of RTC technology, the barrier to audio and video communication has become very low. On mobile, on the PC, on the Web or in a mini program, any device at hand can carry a high-quality audio and video call. In addition, with the growth of the mobile Internet (4G, 5G) and advances in AI, people no longer just want to hear and see each other; they want more interactive and novel ways to communicate, such as beauty filters, props and interactive doodling. The scope of audio and video communication keeps expanding, especially in consumer-facing (ToC) scenarios.

From a technical point of view, video processing on native platforms is nothing new. Open-source libraries such as OpenCV provide face detection, image processing and similar capabilities, so a new project can achieve simple video processing by calling a few interfaces. The Web, however, has always lagged behind native in this area. The bottleneck is performance: even the best front-end technologies are advertised as merely "close to native" (JavaScript was not designed for speed).

Technology selection

ActiveX scheme

Around the turn of the century, in its battle with the rival browser Netscape, Microsoft developed a way to run its flagship product, Office, inside Internet Explorer: ActiveX. It sounded like a fantastic technology, one that let native code interact seamlessly with the browser. The combination of ActiveX and Office did eventually help squeeze out Netscape, allowing Internet Explorer to dominate for a long time.

An ActiveX control is a component built on the COM standard. At installation time it writes its GUID to the registry along with its installation path. JavaScript can load this native object using the GUID and call it with simple dot syntax. Because it is a COM component, interface calls are made directly in memory, just as a native project calls into a dynamic library (DLL). Even more striking, ActiveX supports rendering a native UserControl directly in the browser. MFC, Qt, WinForms, WPF: any mainstream Windows UI framework can be used to develop ActiveX controls. (Admittedly, with the boom of the mobile Internet and the decline of PC development, these names are far less familiar than Flutter and Vue.) We were blown away after building an ActiveX plug-in with WPF and calling it from the browser. If it is such an apparently all-powerful solution, why has it become so unpopular? The answer is security.

Because of its high privileges and flexibility, ActiveX can do whatever it wants on the user's PC. Adding or modifying local files, reading login information and launching external executables, all directly from the browser: it sounds alarming. In the early 2000s, when the Internet was just taking off and many people did not yet understand computers or the Internet, countless game accounts were stolen because users clicked to install an ActiveX plug-in.

As a result, Chrome, Firefox and other browsers do not support ActiveX, Microsoft itself dropped it in Edge, and only the aging Internet Explorer still stubbornly supports it. Sadly, Internet Explorer has also been taken out of maintenance and will soon be removed from the list of pre-installed Windows browsers. As the tide turned, ActiveX was destined to be swept away by technological change.

ActiveX still has its uses, especially for banks, government bodies and others on private networks, for whom its security issues are less of a threat. However, we cannot design a new solution around a dying technology. ActiveX may serve as a fallback in particular scenarios, but it will never be our first choice.

WebAssembly scheme

With the demise of ActiveX, a new solution was needed to let native code and the front end interact, and WebAssembly came into being.

C and C++ code can be compiled into WebAssembly with Emscripten, and Rust has its own WebAssembly toolchain. The resulting .wasm file is bytecode that can be called from JavaScript.
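As a rough illustration of that workflow, the sketch below shows a C++ routine that could be compiled with Emscripten and called from JavaScript. The function name `invertRgba` and the choice of an RGBA buffer are assumptions made for this example, not part of our project. `EMSCRIPTEN_KEEPALIVE` is the Emscripten macro that keeps the symbol exported; the fallback definition exists only so that the same file also builds natively.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#ifdef __EMSCRIPTEN__
#include <emscripten/emscripten.h>
#else
#define EMSCRIPTEN_KEEPALIVE  // no-op so the file also compiles as native code
#endif

// Hypothetical per-frame routine: inverts the colour channels of an RGBA
// buffer in place and leaves the alpha channel untouched.
extern "C" EMSCRIPTEN_KEEPALIVE
void invertRgba(std::uint8_t* pixels, std::size_t byteCount) {
    assert(pixels != nullptr || byteCount == 0);
    for (std::size_t i = 0; i + 3 < byteCount; i += 4) {
        pixels[i]     = static_cast<std::uint8_t>(255 - pixels[i]);      // R
        pixels[i + 1] = static_cast<std::uint8_t>(255 - pixels[i + 1]);  // G
        pixels[i + 2] = static_cast<std::uint8_t>(255 - pixels[i + 2]);  // B
    }
}
```

Built with a command along the lines of `emcc invert.cpp -O3 -o invert.js` (the file names are placeholders), the function is typically reachable from JavaScript as `Module._invertRgba`, once the pixel data has been copied into the module's heap.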

With that in mind, the solution looked promising, so we set out to build our own WebAssembly module. Mature frameworks such as Unity and Qt already support WebAssembly. Compiling to WebAssembly in Unity or Qt is simple: a test demo is easy to build, and the native interface renders well in the front end. It reminded us of ActiveX's glory days!

Next, we tried to bring up the camera and do some simple video processing. We wrote the code full of anticipation and tried to run it in the front end, but never got it to work. A look at the Qt for WebAssembly website explains why:

The QtMultimedia framework is listed as not working with WebAssembly, and the Qt team has not even fully determined which modules are available and which are not, so we could see plenty of pitfalls on the road ahead.

To ensure security, WebAssembly runs in a sandbox environment and its permissions are bound to be limited. We joked that WebAssembly was a step back from ActiveX for developers (and definitely for users).

In the spirit of scientific rigor, we took a different approach to verify this solution: capture video in the front end and process it with WebAssembly, to check both its feasibility and the near-native speed touted online.

Fortunately, OpenCV provides a WebAssembly build that we can use for some simple verification. We built a native project that integrates the C++ version of OpenCV; for the WebAssembly version, OpenCV officially provides a test page, which saved us a lot of work.

Taking bilateral filtering as an example, we chose one set of parameters for the comparison: diameter 15 and sigma 30.

WebAssembly behaves as follows:

The video frame rate drops to about 4 FPS, and the video visibly stutters.

Native performance is as follows:

The video frame rate stays at about 16 FPS, which meets RTC transmission requirements even if the experience is not ideal. (For RTC transmission, 13 to 30 FPS is generally considered normal.)

We then added Gaussian filtering on the native side, using a Gaussian kernel with a width and height of 3. The result is as follows:

The video frame rate stays at about 14 FPS. The extra filter has a negligible impact on performance, and the result still meets RTC transmission requirements (13 to 30 FPS is generally considered normal).

Other parameter sets performed roughly the same as this test. At least for video processing in this kind of scenario, WebAssembly performed far below native. Of course, it could be that OpenCV's support for WebAssembly is not yet good enough, but this comparison, together with WebAssembly's limited permissions, left us somewhat disappointed.
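The frame rates quoted in this comparison can be collected with a small timing harness like the one below. It is a minimal sketch, not the code used for the test: `measureFps` and `meetsRtcFrameRate` are hypothetical helper names, and `processFrame` stands for whatever per-frame work is being measured. In the native test that would be an OpenCV call such as `cv::bilateralFilter` with diameter 15; passing 30 for both sigma arguments is an assumption, since only one sigma value is given. The 13 FPS threshold is the lower bound of the range quoted above.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs processFrame `frames` times and returns the achieved frames per second.
double measureFps(const std::function<void()>& processFrame, int frames) {
    assert(frames > 0);
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (int i = 0; i < frames; ++i) {
        processFrame();
    }
    const std::chrono::duration<double> elapsed = Clock::now() - start;
    // Guard against a zero reading when the work finishes within one clock tick.
    return elapsed.count() > 0.0 ? frames / elapsed.count() : 0.0;
}

// 13 FPS is the lower bound of the "normal" RTC range cited in the article.
bool meetsRtcFrameRate(double fps) {
    return fps >= 13.0;
}
```

Measuring over a few hundred frames rather than a handful smooths out the fluctuation that the "about" in the figures above refers to.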

WebSocket local connection solution

This solution has no formal definition. The idea is to run the native project as a server and have the front end talk to it through a port on localhost. For small amounts of data, HTTP can be used (more browsers support it); for large amounts of data, WebSocket can be used (IE10 and above). For RTC, if the stream is sent from the front end, the WebSocket may have to carry megabytes of data per second to deliver video frames from the native process to the front end, where they also need to be rendered with WebGL.
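To get a feel for the volume involved, the raw data rate of uncompressed video can be estimated from the frame size and the frame rate. The sketch below does that arithmetic; the function name is hypothetical, and 1280x720 RGBA at 30 FPS is an assumed example rather than a figure from our test.

```cpp
#include <cstdint>

// Bytes per second needed to move uncompressed frames over the local socket.
constexpr std::uint64_t rawVideoBytesPerSecond(std::uint32_t width,
                                               std::uint32_t height,
                                               std::uint32_t bytesPerPixel,
                                               std::uint32_t fps) {
    return static_cast<std::uint64_t>(width) * height * bytesPerPixel * fps;
}

// Assumed example: 720p RGBA (4 bytes per pixel) at 30 frames per second.
static_assert(rawVideoBytesPerSecond(1280, 720, 4, 30) == 110592000ULL,
              "1280 * 720 * 4 * 30 bytes per second");
```

Under those assumptions the socket would carry 110,592,000 bytes per second, roughly 105 MiB/s, before any framing overhead.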

Although the communication is local, we were worried about its performance: frames could be captured faster than they can be transferred, and audio and video would have to be synchronized across two processes. We therefore did not pursue it much further.

Virtual Camera scheme

None of these solutions worked, leaving us clinging to ActiveX. COM has a huge performance advantage over the other solutions: they are either slower than native or merely advertised as close to native, while COM truly runs at native performance.

Researching around COM, we found another way to meet our needs: a COM component combined with DirectShow that sends video to a virtual camera, so that the scene is changed completely at the capture level! If this works, the final product will serve not just our current scenario: any application that opens cameras through DirectShow can use our packaged video processing.

We built a COM project, encapsulated the AI digital human implementation, called the DirectShow interfaces to register the virtual camera and stream video, and wrote a batch script to register our COM component with the system. After this series of steps, we tested with a variety of camera testing tools, and the results were surprisingly good.

The following shows the virtual camera, with AR mask processing applied, connected to a NetEase meeting:

Final plan

After verifying many schemes, we chose the virtual camera as our final solution. It performs excellently in terms of both performance and coupling.

Solution architecture

Key implementation points

1. First, create a new dynamic library project named WebCamCOM and register our object as a DirectShow filter using CoCreateInstance and the RegisterFilter interface.

2. Use the memoryapi.h interfaces to pass our own data structure through shared memory. Besides the raw video data, we pass the video width, height and timestamp.

3. Use CreateMutex to keep access to the shared memory safe.

4. Create another dynamic library project named SharedImageWrapper that exposes only one external interface.

5. Use the shouldRotate parameter to decide whether a vertical flip is needed (for Unity).

6. After simple data processing, use the memoryapi.h interfaces again to pass the video data to our DirectShow filter.

7. The upper layer integrates the SendImage interface to send the captured RGB data to DirectShow.

8. Write a batch script that runs the regsvr32 command as an administrator to register WebCamCOM in the system registry.
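Steps 2, 5 and 6 can be sketched in portable C++ as follows. This is an illustration under stated assumptions, not the project's code: `FrameHeader`, `flipVertically` and `packFrame` are hypothetical names, the pixel format is assumed to be tightly packed RGB24, and a plain byte vector stands in for the shared-memory view. In the real plug-in the copy would target the memory obtained through memoryapi.h and would run while holding the mutex created with CreateMutex.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical header stored in front of the pixel data.
struct FrameHeader {
    std::uint32_t width;
    std::uint32_t height;
    std::uint64_t timestampMs;
};

constexpr std::size_t kBytesPerPixel = 3;  // assumption: packed RGB24

// Reverses the row order in place (needed for bottom-up sources such as Unity).
void flipVertically(std::vector<std::uint8_t>& pixels,
                    std::uint32_t width, std::uint32_t height) {
    const std::size_t stride = static_cast<std::size_t>(width) * kBytesPerPixel;
    assert(pixels.size() == stride * height);
    for (std::size_t y = 0; y < height / 2; ++y) {
        std::uint8_t* top = pixels.data() + y * stride;
        std::uint8_t* bottom = pixels.data() + (height - 1 - y) * stride;
        std::swap_ranges(top, top + stride, bottom);
    }
}

// Builds the byte layout a consumer would read: header first, then pixels.
std::vector<std::uint8_t> packFrame(std::vector<std::uint8_t> rgb,
                                    std::uint32_t width, std::uint32_t height,
                                    std::uint64_t timestampMs, bool shouldRotate) {
    assert(rgb.size() == static_cast<std::size_t>(width) * height * kBytesPerPixel);
    if (shouldRotate) {
        flipVertically(rgb, width, height);
    }
    const FrameHeader header{width, height, timestampMs};
    std::vector<std::uint8_t> buffer(sizeof(header) + rgb.size());
    std::memcpy(buffer.data(), &header, sizeof(header));
    if (!rgb.empty()) {
        std::memcpy(buffer.data() + sizeof(header), rgb.data(), rgb.size());
    }
    return buffer;
}
```

Writing the width, height and timestamp ahead of the pixels lets the DirectShow filter on the other side validate and size each frame before it reads the payload.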

Problems encountered

  1. Unity collects texture data bottom-up. Used directly, the image comes out upside down, so a vertical flip is required.

  2. Unity can render with either OpenGL or Direct3D. Parsing the texture handle requires a separate set of interfaces for each of the two rendering methods.

OpenGL:

D3D:

Outlook

  • Although DirectShow is the mainstream framework for camera access, Media Foundation is becoming the trend, so we are considering adapting the interface to Media Foundation in the future (developing a USB camera driver is also a feasible option).
  • At present, the supported video processing capabilities mainly revolve around digital humans, beauty filters and virtual backgrounds. More interesting video processing techniques can be added on top of the existing framework.
  • The plug-in itself can be combined with the WebSocket (HTTP) solution to expose some interfaces, such as beauty parameters and digital human appearance, so that the front end can configure the plug-in silently.
  • The plug-in can ship with a handy settings interface in which users drag controls and see the effect in a preview.

Conclusion

This article introduces some of NetEase's exploration of video processing solutions for the PC Web, compares the advantages and disadvantages of several alternatives, and outlines the implementation of the virtual camera solution. Maybe you do not work in audio and video development, or maybe you do not care about PC development; we hope this article still gives you a different perspective on these technologies. For reasons of space, we could not describe the core COM component mechanism in detail, which is a small regret. If you are interested, go against the trend and try some of the "black magic" of PC development.