This time we invited Xue Di from the Tencent Cloud real-time audio and video (TRTC) backend R&D team to share Tencent Cloud TRTC's experience in architecture upgrades and product practice. He explains in detail where the mixing engine came from, the problems found during the optimization process, and their solutions, work that laid a good foundation for Tencent Meeting and the cloud call center.

Text / Xue Di

Edited / LiveVideoStack

Hello everyone, I am Xue Di, head of R&D for the Tencent Cloud real-time audio and video (TRTC) backend. It is my great honor to share with you Tencent Cloud TRTC's experience in architecture upgrades and product practice.

Our team previously served QQ's audio and video products, which came in three forms: two-person calls, multi-person calls, and group video shows. The first two are closer to real-time calls, while group video shows are more like live streaming. Of these, two-person video calling was by far the most used and the largest, an order of magnitude bigger than multi-person scenes and group video shows combined; WeChat shows almost the same pattern today.

01 Audio and Video Product Forms

1.1 Two-person audio and video

From an architectural point of view, the two-person audio and video system is relatively simple and clear. The red dot represents the room signaling service, whose main functions are managing room information, performing capability negotiation, and linking upstream and downstream quality control. For example, when the downstream channel is congested, the upstream bitrate and resolution are reduced as well. At the transport level, our strategy is to prefer the direct connection; in cross-region and cross-carrier cases, we choose single or dual relay channels. The policy keeps the direct and relay channels alive at the same time, and when the quality of one channel degrades, the system automatically cuts the flow over to the other.
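The dual-channel failover strategy above can be sketched as follows. This is a minimal illustration, not TRTC's actual implementation: the channel names, quality metric, and loss threshold are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Channel:
    name: str
    loss_rate: float   # observed packet loss, 0.0-1.0
    rtt_ms: float      # round-trip time in milliseconds

def channel_quality(ch: Channel) -> float:
    """Lower score = better quality (simple loss/RTT penalty, illustrative)."""
    return ch.loss_rate * 100 + ch.rtt_ms / 10

def pick_channel(direct: Channel, relay: Channel,
                 loss_threshold: float = 0.10) -> Channel:
    """Prefer the direct channel; cut over to the relay only when the
    direct channel's loss exceeds the threshold and the relay is better."""
    if direct.loss_rate <= loss_threshold:
        return direct
    return relay if channel_quality(relay) < channel_quality(direct) else direct
```

In practice both channels keep exchanging probe traffic so that the quality scores stay fresh and the cutover can happen immediately.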

1.2 Multi-person audio and video

The product form of the multi-person video call is a room of no more than 50 people, with an average of about 4.x people per room. The maximum on-screen layout is one large video plus three small videos (four tiles). Given these limits, we adopted a typical small-room SFU design.

The red dots in the figure above represent the room signaling service, which handles room management and state synchronization. Room management mainly maintains the user list: which users have turned on video/audio, whom I am watching, and who is watching me. Based on this information, the room signaling service synchronizes state to the media transport service for data distribution. The other function of the room service is room-level capability negotiation and quality control. For example, if everyone in the room initially supports H.265 and at some point a user joins who only supports H.264, then every uplink host in the room must switch from H.265 to H.264. Similarly, when a certain proportion of users in the room have poor downlink quality, the uplink quality of the room is degraded. At the transport level, we adopt a single-layer distributed media transmission network, use relay mode without distinguishing two-person from multi-person calls, and push all data with a full-mesh mechanism. For example, even if the users on a node are not all watching the other two users' videos, those videos are still pushed to that node.
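The room-level codec negotiation described above (downgrading the whole room to H.264 when an H.264-only user joins) amounts to picking the best codec in the intersection of every participant's capability set. A hedged sketch, with hypothetical function and field names:

```python
def negotiate_room_codec(participants: list[set[str]]) -> str:
    """Pick the best codec supported by every participant in the room.

    participants: one capability set per user, e.g. {"H.264", "H.265"}.
    Preference order is H.265 first; H.264 is the baseline fallback that
    every client is assumed to support.
    """
    common = set.intersection(*participants) if participants else set()
    for codec in ("H.265", "H.264"):  # preference order, best first
        if codec in common:
            return codec
    return "H.264"  # baseline every client must support
```

When the chosen codec changes, the room service would notify all uplink hosts to re-encode, which matches the H.265 to H.264 cutover in the text.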

02 Mixing engine

Another feature of our product at the time was that the proportion of rooms with video on was not high, but audio-only rooms were very active. One case impressed me: the number of voice rooms skyrocketed after a popular game released a new skin, because many players used QQ multi-person voice to team up. Based on this phenomenon and cost considerations, we developed a mixing engine. When the number of people in a room exceeds the cost line, we divert the streams to the mixing engine, which selects routes based on volume, mixes and re-encodes, and pushes the stream to the downstream media platform. This architecture is no longer a typical SFU but closer to an MCU. Although the starting point was saving cost, it laid a good foundation for Tencent Meeting and the cloud call center.
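The volume-based route selection in the mixing engine can be sketched as picking only the N loudest streams to decode and mix in each interval. The `top_n` parameter and the volume representation are illustrative assumptions, not the real engine's API:

```python
def select_streams_to_mix(streams: dict[str, float], top_n: int = 3) -> list[str]:
    """Return the IDs of the top_n loudest streams.

    streams: stream ID -> measured volume (e.g. in dBFS-like units,
    larger = louder). Streams outside the top_n are simply not decoded
    for this mix interval, which is where the cost saving comes from.
    """
    ranked = sorted(streams.items(), key=lambda kv: kv[1], reverse=True)
    return [stream_id for stream_id, _ in ranked[:top_n]]
```

Selecting before decoding is the key difference from a naive MCU: CPU cost scales with `top_n`, not with the room size.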

03 Deeply optimized cloud PaaS service — TRTC

Tencent Cloud's real-time audio and video product, TRTC, is a cloud PaaS service built by deeply optimizing and adapting the QQ multi-person audio and video platform for To B scenarios. First, it provides a full-platform SDK that inherits the model, hardware, and system compatibility and experience tuning accumulated while serving QQ at massive scale, so it performs stably across platforms. Second, this SDK has been integrated into WeChat: WeChat video live, WeChat group live, and WeChat Work all use the TRTC SDK and backend services. External customers can also use the live-pusher and live-player tags of Mini Programs to achieve high-quality communication between Mini Programs and native applications. For media processing, we cooperate very closely with Tencent Cloud Multimedia Lab, the Tencent Meeting Teana Lab, QQ, and WeChat; mature technologies used in Tencent Meeting, QQ, and WeChat are also brought into TRTC.

3.1 TRTC video optimization practice

In terms of video, we adopt temporal (time-domain) layered coding to handle downlink bandwidth-limited scenarios. General live-streaming products usually handle these with transcoding: encoding the original stream into several specifications (original, HD, SD) and switching between them according to network quality. For latency and cost reasons, RTC generally does not transcode in the room, so we layer the frames within a GOP. When bandwidth is tight, we discard some enhancement-layer frames, keep the base-layer frames, and add extra redundancy protection to the base layer to preserve the viewing experience.

Another technique often used together with temporal layered coding is RPS cross-frame reference. The RPS policy uses only frames acknowledged (ACKed) by the receiver as reference frames, trading some image quality for guaranteed fluency. To minimize the quality loss, we designed several reference models for different network conditions. When network quality is good, we predict that the next 2-3 frames will very likely arrive; in that case we do not strictly follow ACK mode, and the encoder still references nearby frames within small groups of the GOP to preserve clarity. As the network deteriorates, we switch models in steps and reduce the number of nearby references; when it keeps deteriorating, we strictly follow ACK confirmation.

For coding efficiency, we filter and denoise to a certain extent before encoding, which improves the compression ratio and makes better use of bandwidth. With limited resources, we also use ROI coding to tilt the limited bitrate toward the areas users care about most.
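The temporal layering described above can be sketched with a simplified 2-layer GOP: even-indexed frames form the base layer and odd-indexed frames the enhancement layer, and under downlink pressure only base-layer frames are sent (those would additionally get redundancy protection). The layer assignment here is an illustrative simplification, not TRTC's exact GOP structure:

```python
def temporal_layer(frame_index: int) -> int:
    """Layer 0 = base (every other frame), layer 1 = enhancement.
    A real encoder would use a deeper hierarchy, e.g. 3 layers."""
    return frame_index % 2

def frames_to_send(gop: list[int], bandwidth_limited: bool) -> list[int]:
    """Drop enhancement-layer frames when the downlink is constrained.

    Dropping layer-1 frames is safe because, by construction, no
    base-layer frame references an enhancement-layer frame."""
    if not bandwidth_limited:
        return gop
    return [f for f in gop if temporal_layer(f) == 0]
```

Halving the frame rate this way roughly halves the bitrate on the constrained downlink without any transcoding in the room.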

3.2 TRTC audio optimization practice

In audio processing we introduced Tencent Meeting's pre-processing and a source-based loss-resistance enhancement strategy. The usual clatter of typing during a meeting, or the less common sound of rain hitting a window pane, can be eliminated easily. In addition, our self-developed cPLC, a context-based packet loss concealment technology, can recover continuous packet loss within 120 ms. Our self-developed cFEC also recovers better than Opus's native in-band FEC, so under normal network conditions it can handle sudden packet loss without adding a lot of out-of-band FEC. Of course, these techniques are not effective in isolation; we must combine codec, source and channel loss resistance, bandwidth prediction, congestion control, and media transmission to guarantee low-delay, high-quality calls. This is one of the biggest differences between RTC and standard live streaming, and also its charm.
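The idea of leaning on in-band redundancy under normal loss and only adding out-of-band FEC as sustained loss grows could be sketched as a simple policy ladder. The thresholds and strategy names below are illustrative assumptions, not TRTC's actual control logic:

```python
def choose_loss_protection(loss_rate: float) -> str:
    """Escalate redundancy with the observed sustained loss rate.

    loss_rate: smoothed packet loss over a recent window, 0.0-1.0.
    Concealment (cPLC-style) is assumed always available as a last
    resort for bursts the FEC layers miss."""
    if loss_rate < 0.05:
        return "in-band FEC only"          # handles occasional bursts cheaply
    if loss_rate < 0.20:
        return "in-band + out-of-band FEC" # spend extra bandwidth on protection
    return "in-band + out-of-band FEC + aggressive concealment"
```

The point of the ladder is cost: out-of-band FEC consumes real bandwidth, so it is only switched on when the cheaper in-band redundancy is no longer enough.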

3.3 Cloud-based intelligent control

On the control side, we developed a cloud-based control engine. Putting the control system in the cloud reduces terminal version compatibility problems and makes A/B testing of algorithms much easier to verify. Second, during operations we accumulated a large number of rules through bad-case analysis, so scene identification is more accurate and regulation more targeted. At the same time, we extracted rich configuration parameters from the rule model and tune them for different customer needs. For example, some calling customers want fluency first, while some live-streaming customers need clarity first; both can be realized by adjusting the configuration. In addition, we keep improving the algorithms: when BBR came out, we drew on BBR's approach to bandwidth estimation and also adapted the control to very large rooms.
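The per-customer tuning described above can be pictured as swapping parameter sets in the same control engine. The profile fields and values below are hypothetical, just to show the shape of the configuration:

```python
# "Fluency first": allow resolution to drop so frame rate stays high.
FLUENCY_FIRST = {"min_fps": 15, "allow_resolution_drop": True,  "min_bitrate_kbps": 200}
# "Clarity first": hold resolution and bitrate, accept a lower frame rate.
CLARITY_FIRST = {"min_fps": 10, "allow_resolution_drop": False, "min_bitrate_kbps": 500}

def control_profile(customer_mode: str) -> dict:
    """Select a control parameter set by customer preference.
    Anything other than "clarity" falls back to the fluency profile."""
    return CLARITY_FIRST if customer_mode == "clarity" else FLUENCY_FIRST
```

Because the profiles live in the cloud, changing a customer's trade-off is a configuration push rather than a client release, which is exactly the advantage the text describes.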

04 TRTC Architecture Evolution

4.1 System Bottlenecks and Scenario Requirements Upgrade

On the question of scale: the multi-person audio and video calling system described at the beginning is based on a small-room SFU architecture, and system bottlenecks appear as the number of people in a room grows. The first is control computation and state synchronization, which are completed in a single process on a single machine; as the user count grows, the computation grows with it and the bottleneck becomes obvious. Second, media transmission uses a single-layer distribution structure; as rooms and nodes multiply, bandwidth becomes a clear bottleneck. Finally, to reduce Intranet traversal, access scheduling uses an aggregation allocation policy: users in the same room, on the same carrier, or in the same region are preferentially assigned to the same machine, and to the next machine once it is full. If a large number of people enter a room within a short period, the service's data reporting and state synchronization may not update in time, causing node overload. Meanwhile, as RTC scenarios expand, many customers demand rooms of 100,000 people, for example large classes in education, super small classes, large chat rooms, and e-commerce live streaming, so we upgraded the original architecture to meet these needs.

4.2 Set transformation

First, we applied a Set transformation to the large cluster, decomposing it into fixed-size Sets by country, region, and carrier. Within a Set, diffusion agents are selected automatically and traffic is converged at stream granularity to relieve the media distribution pressure on upstream nodes. Most Sets are connected over the Intranet, so remote overseas regions or domestic edge nodes need only one hop to get back onto the Intranet and communicate with other nodes.

4.3 Intranet Transmission

The Intranet transmission is partly realized through Tencent Cloud's cloud networking. Tencent Cloud has currently opened 27 geographical regions and 61 availability zones around the world, with multiple switchable Intranet private lines and sufficient bandwidth reserves. During high-profile Tencent Meeting events, switching drills are often carried out; the private lines can be cut over to one another, and Intranet quality has proved stable in practice.

4.4 Room Management

For room management, we upgraded from centralized management to distributed room management and signaling channels. RoomSvc saves only the basic information of the user list and the video user list, greatly reducing the burden on the control system. The original centralized architecture had an obvious bottleneck; under the new architecture we partition rooms by subscription relationship and add a layer of convergence inside the cluster, so the computation on each node is very small and the room size can scale out. All RoomSvc information can be dynamically expanded and quickly restored. For example, if a core node goes down, its information is also stored on other nodes; you only need to swap in a machine and transfer the information to it to recover the complete state. Likewise, if a small node in a Set goes down while the information on the media nodes is intact, switching to a new machine and rebuilding the data restores the room. Under the new architecture we conservatively estimate that a single room can support one million people, with high availability, scalability, and reliability.
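The subscription-relationship convergence above can be sketched as follows: instead of a central service computing every viewer-to-publisher edge, each Set collapses its members' subscriptions so that a publisher's stream is pulled into the Set once, no matter how many local viewers want it. All names here are illustrative:

```python
from collections import defaultdict

def converge_subscriptions(subs: list[tuple[str, str, str]]) -> dict[str, set[str]]:
    """Converge per-viewer subscriptions to per-Set pulls.

    subs: (viewer_set, viewer_id, publisher_id) triples.
    Returns, for each Set, the distinct publishers the Set must pull,
    one inbound stream each, regardless of how many viewers it has."""
    per_set: dict[str, set[str]] = defaultdict(set)
    for viewer_set, _viewer_id, publisher_id in subs:
        per_set[viewer_set].add(publisher_id)
    return dict(per_set)
```

This is why the cross-Set bandwidth scales with the number of Sets times the number of publishers rather than with the total viewer count, which is what makes million-person rooms plausible.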

4.5 Practical achievements

This framework played an important role in the early stage of the epidemic last year, when the number of simultaneous online users of Tencent Meeting and Tencent Classroom doubled every day. Relying on this framework, we expanded core capacity by one million users in seven days and supported tens of millions of simultaneous users across these two products.

Large-scale video conferencing is becoming more and more common; Tencent Meeting, for example, has grown from a 300-person version to a 2,000-person enterprise edition and will soon offer even larger conference formats. Some institutions are now trying to use RTC instead of standard live streaming to improve the class experience, and large-room scenarios will only become more common.

05 TRTC Technology Optimization Practice: the Media Processing Subsystem of the RTC Platform

5.1 Advantages and problems of the media processing subsystem

Finally, the media processing subsystem of the RTC platform. Media processing in RTC, such as recording, content moderation, relayed live broadcasting, and mixed-stream pushing, is essentially a bypass system. The common industry practice is to have a Linux SDK robot enter the room as a simulated participant, pull the streams to a server, process them, and bypass them out. The robot does not affect the main business flow of the room, so changes to media processing requirements usually do not touch the core system, and very flexible, complex business logic can be implemented, such as aligning whiteboard with video, or accurately aligning audio and timestamps in karaoke scenes. This mode isolates complexity. However, we hit two difficulties when building Tencent Meeting and the call center. The Tencent Meeting scenario requires phone dial-in: the initial idea was to use a robot to enter the room, pull the stream back, transcode it into G.711 and G.729, and put it onto the PSTN. But since an IVR announcement may target the whole room or an individual, the IVR needs exclusive playback; in an SFU forwarding architecture this precise control is not easy, and both the new control logic and the identification of media become complicated and hard to manage. In addition, for voice recording, financial-industry customers cannot accept extra or missing words in a recording, just as a missing digit when reading out an ID-card number is absolutely unacceptable. If this is done with the robot approach, integrating two systems introduces inter-system jitter and delay, causing content that should not be recorded to be recorded, or content that should be recorded to be missed; for demanding users these situations are absolutely unacceptable.

5.2 The mixing engine solves the problem

Because of these problems, we adopted the mixing engine: streams are processed centrally through it, which not only saves bandwidth but also makes the IVR and recording functions easy to implement, and connects TRTC with the PSTN telephone system to achieve converged communication. We also cooperated with Tencent Meeting on PSTN narrowband audio band extension and on line noise detection and elimination to further improve call quality. PSTN converged communication is a distinctive capability of TRTC, widely used in customer service, call center, and conference scenarios.

06 Looking to the Future

Although the epidemic has objectively accelerated the real-time audio and video industry, RTC is still on the eve of its real outbreak. Important RTC scenarios such as social networking, cloud gaming, remote real-time control, and online chorus are already in use, but the experience is not yet optimal, or there are various limitations. In the future, as network infrastructure and terminal hardware and software improve and the average end-to-end delay reaches 50-80 ms, the experience of RTC products will improve qualitatively, and scenarios and applications will become far richer. That's my share for today. Thank you.