Technology Deep-Dive Series | NetEase Yunxin's Online 10,000-Person LianMai (Co-Streaming) Technology Revealed

This article is based on "MCtalk Live #5: NetEase Yunxin's Online 10,000-Person LianMai Technology Revealed", an online live session by Chen Ce, senior audio/video server development engineer at NetEase Yunxin.

Introduction

Hello, I'm Chen Ce from NetEase Yunxin. LianMai (multi-person co-streaming) is a highly interactive scenario, and high concurrency within a single room has always been a hard problem. Here I will share NetEase Yunxin's exploration and practice in the 10,000-person LianMai scenario.

Typical demand scenarios include video conference seminars, low-latency live streaming, large online classes, Clubhouse-style audio rooms, and so on. The common solution on the market is "RTC + CDN": a small number of hosts (around 50) go on mic and interact over RTC, their streams are forwarded to a CDN, and the large audience watches via CDN live streaming. This approach caps the number of users who can go on mic, and the audience sees a large audio/video delay relative to the hosts, which cannot meet Yunxin's business requirements. In game voice scenarios, for example, large numbers of users open their mics at once, so everyone must be able to go on mic, with no cap on the number of speakers. This article shares our exploration of how NetEase Yunxin solves this problem.

Technical difficulties of signaling

Signaling concurrency, weak-network, and high-availability problems

The discussion is divided into several aspects: signaling, audio, video, and server-to-server network transmission. Let's look at the signaling implementation first.

RTC signaling uses long-lived connections for bidirectional notification. Suppose a room has 10,000 people on mic: whenever one of them publishes or stops publishing a stream, or joins or leaves the room, signaling notifications are triggered to the other 9,999 people. Such random user behavior can put enormous instantaneous pressure on the server.

A traditional centralized single-point server obviously cannot support that much concurrency in a 10,000-person room, so a distributed architecture must be used to spread the load. But once a 10,000-person room is distributed across several servers, the user states and publish/subscribe relationships must be synchronized in real time across N×(N-1) links between the servers. Meanwhile, to achieve high availability, data must also be re-synchronized after a server crashes and restarts.

Media servers, on the other hand, tend to be distributed in a mesh architecture to reduce point-to-point latency. But if the signaling servers also maintained a mesh in which every node is equal, each node would have to maintain cascade relationships with the full set of nodes, and the resulting tangle of message synchronization would be extremely hard to maintain.

NetEase Yunxin's distributed tree architecture

To achieve high concurrency and high availability, and given the business characteristic that signaling stays within a room and there is no signaling interaction between rooms, we designed the signaling service as a distributed tree, with each room an independent tree, as shown in the figure below:

  • Root node: the room management server, which manages and stores all user states and subscription relationships.
  • Child nodes: edge servers responsible for nearby access for users.

This tree structure effectively spreads the broadcast pressure across the nodes. The root node is allocated as close to the room's users as possible, following a nearest-allocation principle, to avoid overly long links between it and its child nodes. Meanwhile, the child nodes act as pure message proxies and do not participate in business logic; the business is concentrated on the root node, which avoids problems such as out-of-order signaling.

The root node uses a cache plus a database to guarantee both the performance and the reliability of business data storage. Since child nodes hold no user business state, after a crash and restart they only need to re-establish the clients' long-lived signaling connections; nothing like rejoining the room is required, so the recovery is invisible to users. When a child node goes down, the client's timeout mechanism reschedules the request to another node. Through this series of measures, high availability of the service is achieved.
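To make this division of labor concrete, here is a minimal Go sketch of the tree fan-out, with the root holding all state and the children acting as pure proxies. All types and names (RoomEvent, EdgeNode, RootNode) are illustrative assumptions, not Yunxin's actual implementation:

```go
package main

import "fmt"

// RoomEvent is a simplified signaling notification (join/leave/publish/unpublish).
type RoomEvent struct {
	RoomID string
	UserID string
	Kind   string // "join", "leave", "publish", "unpublish"
}

// EdgeNode is a child node: a pure message proxy that relays events to the
// clients attached to it and holds no business state of its own.
type EdgeNode struct {
	ID      string
	clients map[string]chan RoomEvent // userID -> client's long-lived connection
}

func (e *EdgeNode) Relay(ev RoomEvent) {
	for userID, ch := range e.clients {
		if userID == ev.UserID {
			continue // do not echo the event back to its originator
		}
		select {
		case ch <- ev: // fan out to each attached client
		default: // slow client: drop here and rely on a later state resync
		}
	}
}

// RootNode is the room management server: it owns all user state and
// subscription relationships, and emits each event once per edge node
// rather than once per client.
type RootNode struct {
	users map[string]string    // userID -> edge node ID (authoritative state)
	edges map[string]*EdgeNode // edge node ID -> node
}

func (r *RootNode) Broadcast(ev RoomEvent) {
	for _, edge := range r.edges {
		edge.Relay(ev) // one message per edge, not per client
	}
}

func main() {
	edge := &EdgeNode{ID: "edge-1", clients: map[string]chan RoomEvent{
		"alice": make(chan RoomEvent, 8),
	}}
	root := &RootNode{
		users: map[string]string{"alice": "edge-1"},
		edges: map[string]*EdgeNode{"edge-1": edge},
	}
	root.Broadcast(RoomEvent{RoomID: "room-1", UserID: "bob", Kind: "join"})
	fmt.Println(<-edge.clients["alice"]) // alice receives bob's join event
}
```

The point of the design is visible in Broadcast: the root emits one message per edge node rather than one per client, so a 1-to-9,999 fan-out is split across the tree instead of landing on a single box.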

Signaling weak network problem

In RTC architectures, because signaling messages are numerous and their interactions complex, signaling and media are generally carried over separate channels. This introduces a problem: the two channels differ in how well they resist weak networks. Media is typically delivered over UDP, where smooth playback is still achievable at 30% packet loss; signaling, however, generally travels over TCP, which is essentially unusable at 30% packet loss.

To solve this, we use QUIC as a signaling acceleration channel, so that signaling matches the media channel's resistance to weak networks. When the QUIC connection is blocked (some users' networks are unfriendly to certain UDP ports), signaling falls back to WebSocket, keeping it highly available even on weak networks.
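The fallback logic can be sketched as follows. SignalTransport, dialQUIC, and dialWebSocket are hypothetical placeholders (a real client would wrap libraries such as quic-go and gorilla/websocket behind them), and the 2-second probe timeout is an assumed value:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// SignalTransport abstracts the signaling channel so client logic does not
// care whether QUIC or WebSocket carries the bytes.
type SignalTransport interface {
	Send(msg []byte) error
	Recv() ([]byte, error)
	Close() error
}

// dialQUIC and dialWebSocket are stubs for this sketch; a real client would
// wrap e.g. quic-go and gorilla/websocket behind SignalTransport here.
func dialQUIC(ctx context.Context, addr string) (SignalTransport, error) {
	return nil, errors.New("QUIC stub: not wired up in this sketch")
}

func dialWebSocket(ctx context.Context, addr string) (SignalTransport, error) {
	return nil, errors.New("WebSocket stub: not wired up in this sketch")
}

// dialSignaling tries QUIC first (UDP-based, so it degrades under loss like
// the media channel) and falls back to WebSocket when QUIC is blocked,
// e.g. on networks that drop unfamiliar UDP ports.
func dialSignaling(ctx context.Context, addr string) (SignalTransport, error) {
	quicCtx, cancel := context.WithTimeout(ctx, 2*time.Second) // assumed probe timeout
	defer cancel()
	if t, err := dialQUIC(quicCtx, addr); err == nil {
		return t, nil
	}
	return dialWebSocket(ctx, addr)
}

func main() {
	_, err := dialSignaling(context.Background(), "sig.example.com:443")
	fmt.Println(err) // both stubs fail in this sketch
}
```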

The technical difficulties of audio

Drawbacks of mixing and audio routing

The most complicated problem in the multi-person on-mic scenario is audio concurrency when many users speak at the same time.

Suppose all the hosts in a 10,000-person room are on mic. Because of the nature of voice, each host should in theory subscribe to everyone else (video can be subscribed on demand, but audio in principle requires full subscription). If every client subscribed to the other N-1 (9,999) audio streams, it would not only waste traffic; no client could support that much downlink, and in practice, once more than about 3 people speak at the same time, hardly anyone else can hear anything clearly.

There are two common ways to solve this problem: audio routing (selecting a few streams) and server-side mixing. However, both have drawbacks in a 10,000-person room:

  • Audio routing selects the loudest few streams (generally 2–3) out of the N streams and forwards only those. This does solve the problems above, but with one precondition: the full set of audio streams must converge on the edge server directly connected to the client before the selection can be made, and even a 48-core server can hardly support 10,000 concurrent streams.
  • Server-side mixing either decodes all N streams and mixes them into one, or first selects, say, 3 streams and mixes those into one. The problem with the former is that the MCU server can hardly withstand that much transcoding pressure, and the mixing delay is too long; the latter still suffers from the audio-routing drawback described above. In fact, the biggest disadvantage of an MCU is the single point of failure: a crash affects every user in the room, and it easily becomes the system bottleneck.

NetEase Yunxin's distributed audio routing

To solve these audio problems, we adopted a scheme of pre-selection before server cascading. Suppose a 10,000-person room is evenly distributed across 20 edge servers, with 500 streams on each. Without pre-selection, once the servers are fully cascaded, each of them has to pull the full 10,000 streams in order to perform downlink selection.

With cascading pre-selection (3 streams by default), each server only needs to pull 3×(20-1) streams from its peers. It then performs a second-stage selection over its 500 local streams plus the 3×(20-1) cascaded streams, and sends the 3 loudest to the client. In this way, full audio subscription is achieved. For N audio streams spread over M servers, the order of magnitude of data transferred between servers drops from roughly N² to M².
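The two-stage selection can be sketched as follows; the stream type, the volume field, and the server layout are illustrative assumptions rather than Yunxin's actual data structures:

```go
package main

import (
	"fmt"
	"sort"
)

// AudioStream is one uplink audio stream with its current volume level
// (e.g. the audio-level value carried in RTP header extensions).
type AudioStream struct {
	UserID string
	Volume int
}

// topN returns the n loudest streams from the given set.
func topN(streams []AudioStream, n int) []AudioStream {
	sorted := append([]AudioStream(nil), streams...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Volume > sorted[j].Volume })
	if len(sorted) < n {
		n = len(sorted)
	}
	return sorted[:n]
}

// Stage 1: each edge server pre-selects its 3 loudest local streams, and
// only those cross the cascade links, so with M servers each peer pulls
// 3x(M-1) streams instead of the full N.
// Stage 2: an edge server merges its local streams with the cascaded
// pre-selections and picks the final 3 loudest for its clients.
func finalSelection(local []AudioStream, cascaded [][]AudioStream) []AudioStream {
	merged := append([]AudioStream(nil), local...)
	for _, pre := range cascaded {
		merged = append(merged, pre...)
	}
	return topN(merged, 3)
}

func main() {
	local := []AudioStream{{"u1", 40}, {"u2", 90}, {"u3", 10}}
	remote := [][]AudioStream{{{"u4", 70}, {"u5", 65}, {"u6", 20}}}
	fmt.Println(finalSelection(local, remote)) // [{u2 90} {u4 70} {u5 65}]
}
```

With M = 20 servers and a 3-stream pre-selection, each cascade link carries 3 streams instead of 500, which is where the drop from roughly N² to M² inter-server traffic comes from.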

Technical difficulties with video

Limitations of Simulcast/SVC and bitrate suppression in large rooms

Limited by client performance, the number of video streams that can be decoded and rendered simultaneously is generally small, so the media concurrency problem described above can be avoided by subscribing to streams on demand.

The video difficulties in a 10,000-person room lie mainly in QoS. The server has two main QoS means in RTC:

  • Switching among layered streams via Simulcast/SVC
  • Suppressing the sender's bitrate via RTCP

The essence of Simulcast and SVC is to layer the stream so that each user's downlink bandwidth can be matched to a corresponding layer for distribution. But in a 10,000-person room, user bandwidth tends to be widely dispersed, and mechanical layering does not give most users the best experience. RTCP bitrate suppression feeds the receiver's bandwidth back to the sender's encoder to pick the most suitable bitrate; its best-fit scenario is 1v1, and in a 10,000-person room it is hard to make a decision that suits everyone.

NetEase Yunxin's QoS strategy

To let as many users as possible receive a video stream that matches their network, we developed the following QoS strategy. First, users are divided into four levels, from high to low, according to their downlink bandwidth. The sender encodes with Simulcast + SVC simultaneously, for example 720p/30fps, 720p/15fps, 720p/8fps, and 180p/30fps, and the server distributes the stream corresponding to each user's level.

The advantage of this method is that every user matches the video stream corresponding to their bandwidth; the obvious disadvantage is that the distribution is unequal. For example, if in a 10,000-person room most users' bandwidth lands in the 720p/15fps level and a few users are scattered across the other three, then the video experience of the majority of the room is actually not optimal.

To solve this, bitrate suppression must be combined with the layered encoding: first sort users' bandwidth from high to low, then feed the lowest bandwidth among the top N% of users back to the sender to guide the encoding bitrate of the highest layer (720p/30fps), so that the top N% of users all hit the best stream. N can be set by the customer, or adjusted dynamically according to the downlink bandwidth of the users in the room.

The following figure takes Top60% as an example:
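As a complement to the figure, here is a minimal sketch of the Top-N% feedback computation; the function and variable names, and the kbps unit, are assumptions for illustration:

```go
package main

import (
	"fmt"
	"sort"
)

// targetTopLayerBitrate returns the bitrate (kbps) to which the sender's
// highest Simulcast/SVC layer should be suppressed: the lowest downlink
// bandwidth among the top percent% of receivers, so that all of them can
// take the best layer.
func targetTopLayerBitrate(downlinksKbps []int, percent float64) int {
	if len(downlinksKbps) == 0 {
		return 0
	}
	sorted := append([]int(nil), downlinksKbps...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted))) // high to low
	n := int(float64(len(sorted)) * percent / 100)
	if n < 1 {
		n = 1
	}
	return sorted[n-1] // the weakest receiver inside the top percent%
}

func main() {
	bw := []int{4000, 3500, 3200, 2800, 2500, 1800, 1200, 900, 600, 300}
	// Top 60% of 10 receivers = 6 receivers; the weakest of them has
	// 1800 kbps, so the 720p/30fps layer is encoded at up to 1800 kbps.
	fmt.Println(targetTopLayerBitrate(bw, 60)) // 1800
}
```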

Another scenario is when a user's uplink bandwidth is insufficient, say only 1.2 Mbps, so Simulcast + SVC cannot be carried at all. In this case we have the client encode only a single 720p stream, introduce an MCU into the room, and send that single stream to it; the MCU then transcodes into Simulcast + SVC layers and pushes them back to the SFU, matching our downlink QoS strategy.
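A minimal sketch of that uplink decision follows; the 1700 kbps threshold and the type names are assumed values for illustration only:

```go
package main

import "fmt"

// Assumed uplink cost of sending all 720p Simulcast+SVC layers yourself.
const simulcastMinUplinkKbps = 1700

type PublishPlan struct {
	Simulcast bool // client encodes and sends all Simulcast+SVC layers itself
	ViaMCU    bool // client sends one 720p stream; an MCU re-encodes the layers
}

func planUplink(uplinkKbps int) PublishPlan {
	if uplinkKbps >= simulcastMinUplinkKbps {
		return PublishPlan{Simulcast: true}
	}
	// e.g. a 1200 kbps uplink cannot carry all layers, so the room's MCU
	// takes over layered encoding and pushes the layers back to the SFU.
	return PublishPlan{ViaMCU: true}
}

func main() {
	fmt.Println(planUplink(1200)) // {false true}: single 720p stream + MCU transcode
	fmt.Println(planUplink(3000)) // {true false}: client-side Simulcast+SVC
}
```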

Technical difficulties of network transmission between servers

Cross-carrier and cross-country delivery

During actual development we also ran into some server-to-server network transmission problems, which are worth sharing here.

For example, to shorten the last mile, we use single-carrier data centers as edge nodes. If single-carrier rooms of different operators cascade with each other directly, the network transmission between them is clearly uncontrollable. Solving this at the architectural level requires introducing a multi-line/BGP data center into the cascade network as a relay, which in turn raises questions of relay placement and relay single points of failure, and undoubtedly greatly increases scheduling complexity.

Another case is server cascading across countries. Public-network routes between servers are not necessarily optimal and may jitter badly; this middle-mile network is completely uncontrollable.

WE-CAN

To solve problems like these, we abstracted the server-to-server transmission module and introduced our own public-network infrastructure, a large-scale distributed real-time transmission network: WE-CAN (Communications Acceleration Network). Nodes are deployed in all major regions of the world and continuously probe and report the network quality between one another. A central decision module combines these reports with carrier information, real-time network quality, and bandwidth cost to compute the shortest path between any two nodes, generates routing tables, and delivers them to each node as its next-hop reference. WE-CAN is business-agnostic and purely a transport-layer solution: for media cascading, a packet only needs its target address in the header and can be handed to WE-CAN, with no further concern for how it is delivered.
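Here is a minimal sketch of the centralized next-hop computation, assuming the probes have already been blended into a single scalar cost per link; it is ordinary Dijkstra over that cost, not WE-CAN's actual algorithm:

```go
package main

import (
	"container/heap"
	"fmt"
)

// edge cost: assume probe reports (delay, loss, carrier, bandwidth price)
// are already blended into one comparable scalar per link.
type edge struct {
	to   string
	cost float64
}

type item struct {
	node string
	dist float64
}

type pq []item

func (p pq) Len() int           { return len(p) }
func (p pq) Less(i, j int) bool { return p[i].dist < p[j].dist }
func (p pq) Swap(i, j int)      { p[i], p[j] = p[j], p[i] }
func (p *pq) Push(x any)        { *p = append(*p, x.(item)) }
func (p *pq) Pop() any {
	old := *p
	x := old[len(old)-1]
	*p = old[:len(old)-1]
	return x
}

// nextHops runs Dijkstra from src and returns, for every reachable node,
// the first hop on the cheapest path, i.e. src's routing-table row.
func nextHops(graph map[string][]edge, src string) map[string]string {
	dist := map[string]float64{src: 0}
	first := map[string]string{}
	h := &pq{{src, 0}}
	for h.Len() > 0 {
		cur := heap.Pop(h).(item)
		if cur.dist > dist[cur.node] {
			continue // stale queue entry
		}
		for _, e := range graph[cur.node] {
			nd := cur.dist + e.cost
			if old, ok := dist[e.to]; !ok || nd < old {
				dist[e.to] = nd
				if cur.node == src {
					first[e.to] = e.to // direct neighbor is its own first hop
				} else {
					first[e.to] = first[cur.node] // inherit the first hop
				}
				heap.Push(h, item{e.to, nd})
			}
		}
	}
	return first
}

func main() {
	g := map[string][]edge{
		"SH": {{"HK", 4}, {"SG", 9}},
		"HK": {{"SG", 3}},
		"SG": {},
	}
	fmt.Println(nextHops(g, "SH")) // map[HK:HK SG:HK]: reach SG via HK (4+3 < 9)
}
```

The center pushes each node its own row of next hops; a packet then carries only its target address in the header, and every hop forwards toward it, which is what makes WE-CAN transparent to the media layer above.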

Conclusion

The technical solutions above are all of what I have to share this time. With Yunxin's 10,000-person LianMai technology, the service is upgraded to be stateless: there is no need to cap the maximum number of people in a room or the number simultaneously on mic; it supports horizontal elastic scale-out, easily copes with traffic bursts, and matches a user's network within seconds.

Of course, no system is built in a day; behind every point above we stepped into countless pitfalls. NetEase Yunxin will keep polishing its audio and video technology to bring better services to the industry.

About the author

Chen Ce is a senior audio/video server development engineer at NetEase Yunxin, responsible for the construction and core development of Yunxin's global RTC network, with rich experience in media data transmission and RTC full-stack architecture design.