Interactive live streaming, online conferencing, online medical care, and online education are important application scenarios for real-time audio and video technology, and all of them impose strict requirements on availability, reliability, and latency. Teams building audio and video products run into many recurring problems. Fluency: if the video stalls frequently, good interaction is all but impossible. Echo cancellation: sound reflected by the environment, picked back up by the microphone, and transmitted again also degrades interaction. Cross-border interconnection: more and more products are going overseas, and connecting users at home and abroad is another problem that must be solved technically. Massive concurrency: a great challenge to the load resilience of audio and video products.

On May 29, at the QCon Global Software Development Conference in Beijing, Feng Yue, VP of Technology at Agora, acting as track producer, organized the Real-Time Audio and Video track and invited technical experts from New Oriental, Palfish, and Agora. They shared talks on next-generation video engine architecture, the difficulties and leaps of implementing audio and video systems at large scale, voice evaluation and its localization practice, and the research and practice of front-end audio and video players.

Exploring Agora's Next-Generation Video Engine Architecture

With the rapid development of audio and video technology, real-time audio and video interaction has been widely applied in many fields (social entertainment, live streaming, medical care, and so on). At the same time, as AI techniques for image processing mature rapidly, advanced video pre-processing functions that integrate AI algorithms are being used more and more. These rich and fast-changing scenarios place high demands on the flexibility and extensibility of the next-generation video engine.

Li Yaqi, the architect in charge of the architecture design of Agora's next-generation video engine, opened the track with "Exploration and Practice of the Architecture of Agora's Next-Generation Video Engine".

To better meet the demands of rich video experiences, diverse users, and live-streaming quality, Agora summarizes the design principles and objectives of its next-generation video processing engine as follows:

1. Meet the differentiated integration needs of different users;

2. Be flexible and extensible, so that new businesses and new technology scenarios can be supported and shipped quickly;

3. Be fast and reliable, providing rich and powerful core-system capabilities for the video processing engine and greatly reducing developers' mental burden;

4. Deliver superior, continuously optimized performance with improved monitoring, making quality data transparent.

Given these four design objectives, what specific software design approaches did Agora adopt?

The engine's users are naturally stratified. Some want low-code integration and fast time to market, and need the engine to provide APIs as close to their business functions as possible; others want the engine to expose more of its core video processing capabilities, on top of which they can customize the video processing logic to their own needs. Matching this user profile, Agora adopts a layered design of business compositions on top of core functions: a High Level API provides ease of use for business scenarios, while a Low Level API provides core capabilities and flexibility. To open up the engine's composition capabilities to developers, so that they can flexibly arrange processing for different business needs through free API combination, the core of Agora's video processing engine adopts the microkernel architecture pattern, which separates the variant and invariant parts of the engine. The microkernel pattern achieves the flexibility and extensibility goals: module functions can be extended rapidly, and video processing pipelines can be arranged like building blocks.
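The idea can be illustrated with a minimal sketch. All names below (VideoPipeline, VideoNode, and so on) are hypothetical and do not reflect Agora's actual SDK API:

```typescript
// Hypothetical microkernel-style pipeline: the kernel only knows how to
// connect nodes and move frames; all processing lives in pluggable modules.
interface VideoFrame {
  width: number;
  height: number;
  data: Uint8Array; // e.g. I420 pixel data
}

// The stable "invariant": every extension implements this one contract.
interface VideoNode {
  name: string;
  process(frame: VideoFrame): VideoFrame;
}

// The "variant": business features arrive as composable nodes.
class VideoPipeline {
  private nodes: VideoNode[] = [];

  add(node: VideoNode): this {
    this.nodes.push(node);
    return this;
  }

  run(frame: VideoFrame): VideoFrame {
    return this.nodes.reduce((f, node) => node.process(f), frame);
  }
}

// A Low Level user arranges nodes freely...
const pipeline = new VideoPipeline()
  .add({ name: 'crop', process: f => f /* crop logic */ })
  .add({ name: 'beauty', process: f => f /* beauty filter */ })
  .add({ name: 'encoderAdapter', process: f => f /* format conversion */ });

// ...while a High Level API would simply ship presets of such arrangements.
```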

Without a stable and reliable core system, a developer building, say, a beauty filter plug-in from scratch for the video processing pipeline would need to solve many problems beyond the business logic itself: module placement, data format conversion, the threading model, memory management, property configuration, and so on. Agora solidifies the solutions to this series of engineering and integration problems into the underlying core system, providing users with rich and powerful basic functions. This core system covers basic video processing units, pipeline construction and control, support for basic video formats and algorithms, and system infrastructure. With it, integration becomes very simple: a plug-in merely implements the relevant interfaces defined by the core system's interface protocol. A rich and powerful core system greatly reduces module developers' mental burden and thereby improves overall R&D efficiency.
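Continuing the hypothetical sketch above (reusing its VideoFrame type), the key point is that a plug-in author declares what the module needs rather than implementing the plumbing; the descriptor fields here are invented for illustration:

```typescript
// Hypothetical plugin contract: the author declares requirements; the core
// system supplies format conversion, threading, and buffer management.
interface PluginDescriptor {
  name: string;
  inputFormat: 'I420' | 'NV12' | 'RGBA'; // core inserts converters as needed
  threadAffinity: 'render' | 'worker';   // core handles the threading model
  process(frame: VideoFrame): VideoFrame;
}

const beautyPlugin: PluginDescriptor = {
  name: 'beauty',
  inputFormat: 'RGBA',      // the core converts I420 -> RGBA before calling us
  threadAffinity: 'worker', // the core schedules us off the render thread
  process(frame) {
    // Only the beauty algorithm itself lives here.
    return frame;
  },
};
```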

On the performance and monitoring objective, Agora optimizes the data processing link on mobile, separates the control plane from the data plane, and improves the overall efficiency of data and video transfer. In addition, memory pools tailored to the characteristics of video processing are built to reduce system resource consumption. Finally, a full-link video quality monitoring mechanism provides closed-loop feedback for video performance optimization.
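A frame-buffer memory pool can be sketched in a few lines; this is a generic illustration of the technique, not Agora's implementation:

```typescript
// Minimal buffer pool: reuse fixed-size frame buffers instead of allocating
// (and garbage-collecting) a fresh one for every frame.
class FramePool {
  private free: Uint8Array[] = [];

  constructor(private frameBytes: number, prealloc = 8) {
    for (let i = 0; i < prealloc; i++) {
      this.free.push(new Uint8Array(frameBytes));
    }
  }

  acquire(): Uint8Array {
    // Fall back to allocation only when the pool is exhausted.
    return this.free.pop() ?? new Uint8Array(this.frameBytes);
  }

  release(buf: Uint8Array): void {
    if (buf.byteLength === this.frameBytes) this.free.push(buf);
  }
}

// A 1280x720 I420 frame occupies width * height * 3 / 2 bytes.
const pool = new FramePool(1280 * 720 * 1.5);
const frame = pool.acquire();
// ...fill and process the frame...
pool.release(frame);
```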

Difficulties and challenges of building large-scale real-time audio and video systems in-house

Dong Haibing, an industry architect from Agora and a long-time heavy practitioner in the RTC field, walked the audience through the basic concepts of RTC, analyzed in detail the characteristics of RTC scenarios and the architectural design and difficulties of building such a system in-house, and finally shared his views on the future direction of RTC.

Compared with the mature techniques the traditional Internet uses to handle large scale and high concurrency (caching, asynchrony, and distribution), the challenges in real-time audio and video are more complex. To be real-time, you have to keep latency within one second; caching, by contrast, operates at the level of seconds or even minutes, rarely milliseconds. When handling large-scale, highly concurrent scenarios, real-time audio and video (RTC) must simultaneously account for audio and video quality, fluency, low latency, scalability, availability, and more. These requirements differ greatly from the traditional Internet, which also means the solutions are more complex.

During development, common challenges include development cost, network setup, quality control, audio handling, and final testing. In his talk, Dong Haibing took audio as an example. First, there are three key problems to solve in audio transmission: silence or low volume, echo, and noise. Second, resilience to weak networks is also crucial: when network conditions change, how should bit rate and frame rate be adjusted to absorb the change, and how should the intelligent routing algorithm select and use the optimal transmission path? Another challenge is multi-dimensional quality assessment, performed in real time and closed into a loop with dynamic adjustment, which is what makes weak-network countermeasures truly effective. Dong Haibing also discussed and compared several common open-source solutions (Jitsi/Jitsi Videobridge, Kurento, Licode/Erizo, Pion, Janus).
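As a toy illustration of the bit-rate and frame-rate adjustment described above (the tiers and thresholds below are invented for the example, not taken from the talk):

```typescript
// Toy weak-network adaptation: step the encoder down (or back up) through
// preset tiers based on observed packet loss and round-trip time.
interface Tier { bitrateKbps: number; fps: number }

const tiers: Tier[] = [
  { bitrateKbps: 1500, fps: 30 },
  { bitrateKbps: 800, fps: 24 },
  { bitrateKbps: 400, fps: 15 },
  { bitrateKbps: 150, fps: 10 },
];

let current = 0;

function onNetworkStats(lossRate: number, rttMs: number): Tier {
  if (lossRate > 0.1 || rttMs > 400) {
    current = Math.min(current + 1, tiers.length - 1); // degrade
  } else if (lossRate < 0.02 && rttMs < 150) {
    current = Math.max(current - 1, 0);                // recover
  }
  return tiers[current]; // feed back into the encoder configuration
}
```

In a real system this feedback loop runs continuously against live quality metrics, which is exactly the "closed loop with dynamic adjustment" the talk refers to.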

Beyond server-side development, the operations and quality monitoring of real-time audio and video also differ somewhat from traditional Internet practice. For example, in addition to the usual disaster recovery planning, containerized deployment, automated operations, performance analysis, and logging systems, operating a real-time audio and video service also has to face challenges such as a global network (cross-region, cross-carrier) and last-mile strategy.

Teams that choose the self-development route may also face problems such as large-scale connectivity, RTC recording/playback schemes, and control of operating costs. But even with so many difficulties and challenges to face and solve, we cannot ignore that real-time audio and video technology is being applied in more and more scenarios and holds more and more possibilities.

The metaverse is a recently popular concept. In real life we can understand it as a role switch; in the virtual world it is a new kind of experience, switching roles across a variety of virtual worlds. VRChat is similar, using VR for social interaction and entertainment to help people interact better online, and this may well be a direction of the Internet's future development and exploration. Dong Haibing noted that a self-development team should keep a finger on the pulse of the times and the industry's development trends, concentrate its strength on its core business and the things it is good at, and work together to do better in the field of real-time audio and video.

New Oriental Cloud Classroom Web audio and video player practice

Online education is probably the real-time audio and video scenario people have become most familiar with over the past two years. For this track, we invited Li Benru, a front-end interaction architect from New Oriental Cloud Classroom, to share how New Oriental moved its classes from offline to online so quickly.

New Oriental started building its own cloud classroom at the end of 2018. During Chinese New Year 2020, it leaped within one week from supporting concurrency in the tens of thousands to supporting 300,000 concurrent users.

New Oriental Cloud Classroom is a complete online class solution offered as a SaaS service. One of its notable traits is a very fast update and iteration pace. Doing native development on every platform (PC and Windows, plus Android and iOS on mobile) could never keep up with that rhythm, so the team decided to embed H5 pages in the client: apart from real-time audio and video, the interactive features are basically implemented in H5. The Web adapts to every platform, which is the fastest way to develop.

Real-time audio and video (RTC) latency is in the hundreds of milliseconds, at most no more than 500 ms, which is basically imperceptible to the human ear. Online education has two distinct scenarios: small classes and large classes. Small classes require highly real-time, low-latency interaction; but for university courses and lectures, or large-class scenes where famous teachers speak publicly, RTC would be comparatively expensive. For large classes, New Oriental Cloud Classroom adopts an H5 mega-class approach that supports millions of people attending at the same time: the teacher pushes the stream over RTMP, while students pull the stream over HTTP.
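A minimal student-side player for such an HTTP-FLV pull stream can be built with the open-source flv.js library. This is only a generic illustration (the talk does not name New Oriental's exact player stack, and the URL is a placeholder):

```typescript
import flvjs from 'flv.js';

// Guard: flv.js relies on Media Source Extensions support in the browser.
if (flvjs.isSupported()) {
  const video = document.querySelector('video')!;

  // The teacher pushes RTMP to the CDN; students pull the transmuxed
  // HTTP-FLV stream, which flv.js remuxes for the <video> element.
  const player = flvjs.createPlayer({
    type: 'flv',
    isLive: true,
    url: 'https://cdn.example.com/live/classroom.flv', // placeholder URL
  });

  player.attachMediaElement(video);
  player.load();
  player.play();
}
```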

(Figure: architecture of the Web live-streaming player)

As for future extensibility: if the cloud classroom adopts the H.265 standard for video encoding, the bitrate at comparable quality can drop to less than half that of H.264, greatly reducing network pressure. H5 has the advantage of broad reach and cross-platform support: the same solution can be adapted to different clients, and a product can be developed and launched quickly. A custom universal player makes it possible to swap the input stream, customize, or develop new features quickly.
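A back-of-the-envelope calculation shows why halving the bitrate matters at this scale (the numbers are illustrative assumptions, not figures from the talk):

```typescript
// Illustrative egress estimate for a mega-class, H.264 vs H.265.
const viewers = 1_000_000;     // "millions attending at the same time"
const h264Kbps = 2000;         // assumed bitrate of an H.264 stream
const h265Kbps = h264Kbps / 2; // H.265 at comparable quality

const toTbps = (kbps: number) => (kbps * viewers) / 1e9;

console.log(`H.264 egress: ${toTbps(h264Kbps)} Tbps`); // 2 Tbps
console.log(`H.265 egress: ${toTbps(h265Kbps)} Tbps`); // 1 Tbps
```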

Voice evaluation and localization

To provide better educational services, online education platforms have shipped many new features built on deep learning over the past two years, and voice evaluation is one of them. In English education in particular, there is huge demand for assessing children's spoken English. How do you reduce evaluation latency and improve the service experience while also reducing server load and cost? Huang Zhichao, head of the AI algorithm middle platform at Palfish, shared "Voice Evaluation and Localization", on running the evaluation locally on the device.

Voice evaluation uses machines instead of humans to intelligently score children's spoken pronunciation. Palfish's voice evaluation practice mainly covers the selection of algorithm and framework, acoustic model training, and the optimization of accuracy and speed. For the algorithm, Palfish chose a deep neural network plus hidden Markov model, mainly because deep learning frameworks are now very mature. The framework is Kaldi, which has the largest user base in the speech industry and comprehensive documentation.

The evaluation process of the deep neural network plus hidden Markov model (DNN + HMM) approach is shown in the figure above. The first step is to train a DNN acoustic model and the HMM topology parameters. After training, the input text is compiled into a decoding graph, audio features are extracted and passed through the acoustic model, and a scoring model then produces a sentence-level score.
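The talk does not spell out Palfish's exact scoring model; a common choice in this DNN+HMM family is the GOP (Goodness of Pronunciation) score, sketched below on made-up posteriors:

```typescript
// GOP for one phone: average, over its aligned frames, of
// log P(phone | frame) minus the best competing log-posterior.
function gop(
  framePosteriors: number[][], // [frame][phoneId] -> posterior probability
  phoneId: number,
  startFrame: number,
  endFrame: number, // exclusive
): number {
  let sum = 0;
  for (let t = startFrame; t < endFrame; t++) {
    const row = framePosteriors[t];
    const best = Math.max(...row);
    sum += Math.log(row[phoneId]) - Math.log(best);
  }
  return sum / (endFrame - startFrame); // 0 is perfect; more negative is worse
}

// Toy example: 3 frames, 3 phones; the target phone has id 1.
const posts = [
  [0.1, 0.8, 0.1],
  [0.2, 0.6, 0.2],
  [0.3, 0.3, 0.4],
];
console.log(gop(posts, 1, 0, 3)); // close to 0 => well pronounced
```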

In this process, data screening, acoustic model training, and evaluation accuracy optimization are key. Huang Zhichao also shared in detail the problems encountered and lessons learned during the localization of Palfish's voice evaluation: optimizing model size, keeping the evaluation service robust, and analyzing abnormal cases.

To give you a more convenient and in-depth understanding of what happens behind the scenes of real-time audio and video development, we will organize and interpret the full content of this track in more detail. For more details, please click [Read the original] and follow the latest developments in the community!