Extending GRTN: The evolution of the RTC architecture in the context of cloud native trends

Introduction: At the Shanghai stop of the LiveVideoStackCon 2021 audio and video technology conference, in the special session on new cloud and edge architecture models, Yang Chengli (also known as Wangli), RTC transmission expert at Alibaba Cloud Video Cloud, gave the keynote "RTC Service Architecture Evolution Based on Edge Cloud Native". He shared with industry peers the challenges and lessons Video Cloud met while evolving its RTC service architecture. The full talk follows.

The back-end transmission network is the core capability of an RTC system; examples are Alibaba Cloud's GRTN and Agora's SD-RTN. This article describes how Alibaba Cloud Video Cloud keeps improving its RTC architecture, extends the GRTN network, and uses cloud-native technology to gain the power of the cloud.

Personal introduction

Hello, everyone. I'm Yang Chengli. I am now in charge of the RTC transmission network at Alibaba Cloud, and before that I was in charge of the live-streaming transmission network at the CDN vendor ChinaCache (Lanxun). I have worked on video back-end services for about ten years, and I am also the author of the open source video server SRS, currently one of the top open source video servers in the world.

Back-end services are now all built on the cloud, and CDN is trending toward the edge cloud, because cloud computing has become the infrastructure for all kinds of services, video back ends included. Developers can use the SDKs and video cloud services of cloud vendors directly, or build their own systems on the cloud with open source solutions.

I happen to be active in both open source video and cloud services, and friends keep asking me about the differences between the two. I would like to take this opportunity to talk about that. I hope this talk shows you the path from an open source server to a commercial system that can actually provide service.

Introduction to RTC services

Since some of you do not work on servers, I will first explain what an RTC service is.

After years of live-streaming development, everyone understands that a back-end service is needed. A stream pushed from OBS, for example, cannot go directly to players; it is forwarded through a CDN, and that CDN is the back-end service of live streaming.

RTC is quite different. WebRTC itself was designed for calls, and most call scenarios are one-to-one, so WebRTC supports several transmission modes: direct P2P, relaying through a TURN server, or forwarding through an SFU or MCU.

To just run a demo, no RTC server is needed; direct P2P works. In real production, traffic must go through servers, and SFU forwarding is now the most widely used. The main reasons are:

P2P failure: some networks use symmetric NAT, so two clients on their respective internal networks cannot connect P2P and must relay traffic through a server.
Cross-network, long-distance transmission: on cross-border or cross-operator links, even when clients can connect directly by P2P the quality is poor, whereas a server can optimize the transmission path.
Conference to live streaming: if the media needs processing, for example turning an RTC session into a live stream for a larger audience, transcoding and protocol conversion are required, and these must also be done on a server.
Recording and editing: today's recording, editing, and other content-processing systems on the Internet mostly integrate over RTMP, so the stream is delivered to the recording or editing system as RTMP.
Differing client networks: some clients have good networks and some have bad ones. A server can accurately measure each client's network conditions and send each client a stream of suitable quality.
Compatibility with old clients and protocols: many client versions are live at the same time, and protocol support changes across iterations, so the server must handle compatibility.
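
The point about differing client networks can be sketched in a few lines. The following is only an illustration, assuming the sender publishes several quality layers (as with WebRTC simulcast) and the server already has a bandwidth estimate per subscriber. The names Layer, LAYERS, pick_layer and the bitrates are hypothetical and are not taken from SRS or GRTN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    bitrate_kbps: int


# Hypothetical layers published by one sender, ordered from lowest to highest bitrate.
LAYERS = [Layer("low", 150), Layer("mid", 500), Layer("high", 1500)]


def pick_layer(estimated_kbps, layers=LAYERS, headroom=0.8):
    """Return the highest layer that fits in a fraction of the estimated bandwidth.

    `layers` must be sorted by ascending bitrate. The lowest layer is the
    fallback when nothing fits, so every subscriber still receives a stream.
    """
    budget = estimated_kbps * headroom
    chosen = layers[0]
    for layer in layers:
        if layer.bitrate_kbps <= budget:
            chosen = layer
    return chosen


# Three subscribers to the same sender, each with a different estimate.
for kbps in (2000, 700, 100):
    print(kbps, "->", pick_layer(kbps).name)
```

With P2P the sender would have to make this decision itself for every peer; with an SFU it is made once per subscriber on the server.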

Each cloud vendor has its own back-end service, such as Alibaba Cloud's GRTN and Agora's SD-RTN. A transmission network is not just a transmission server but a whole transmission system, including scheduling, routing, protocol processing, release and operations, troubleshooting, data analysis and so on.

The AliRTC (Alibaba Cloud RTC) transmission network uses GRTN for transport. In addition to the GRTN (CDN) network, we have extended it with GRTN (Tenfold), which adds BGP dedicated lines, ENS, proprietary cloud networks, third-party clouds that support K8S, and so on, to meet the transmission requirements of many different scenarios.

GRTN (Tenfold) is based on SRS and adds many capabilities, some of which have been contributed back to the SRS community.

Why SRS

Here is a look at SRS and why we chose it.

The main problems with video servers are a high threshold, a wide field, fast updates, and open source being out of step with cloud services.

High threshold: video technology has a deep stack. Signal processing, codecs, transmission, and client platforms each go deep, so a dedicated video server is required. Nginx, well known in the industry, is not built for video; web and video are very different.

Wide field: live streaming and RTC are the large-scale Internet applications, but surveillance and IoT are also growing fast, and public cloud, proprietary cloud and edge cloud all differ. Dedicated conferencing servers such as Janus have structural problems at very large scale (or rather, that is a live-streaming problem, so Janus has no need to solve it).

Fast updates, and open source out of step with cloud services: video developed earlier than cloud services, and open source video servers cannot meet many requirements of cloud services. Many open source projects do not consider cloud architecture at all, so migrating a self-built, open-source-based system to cloud services is very difficult.

Why is this important?

It affects the adoption of video in various fields and hinders the development of new scenarios. New scenarios are bound to be cross-field; there will be no such thing as live streaming only or RTC only. The new fields come not simply from live streaming spreading, but from Internet video as a whole spreading. Only cross-field open source projects can promote the development and adoption of new scenarios.

Cloud service capabilities cannot be used. The best thing about cloud architecture is elasticity, and a standard elasticity across clouds at that. If an open source project does not consider cloud architecture, it cannot move to the cloud and cannot be elastic. Open source on a cloud architecture does not mean open source software running on a cloud host; it means the software itself is built for cloud architecture.

Multi-cloud migration is difficult. The direction of the cloud is to standardize the application layer on K8S, so that applications can migrate seamlessly between clouds, which gives them very high reliability. If an open source project does not make the changes K8S requires, it cannot migrate across clouds.

How does SRS solve this? SRS is positioned as a cloud-native video server and has been reworked extensively for cloud native, so it is very convenient to move to the cloud and to migrate between clouds.

Besides its cloud-native capabilities, SRS is also a very high-performance open source server. Of course, performance is never the highest, only higher: each major release does performance optimization, and then spends that performance on functionality and user experience.

For the record, this is not to say that Nginx and Janus cannot reach SRS's concurrency, only that these are the measured figures for the current versions. Performance depends heavily on the industry context. For example, 2012 was mostly the gigabit network era, so Nginx's performance was enough to fill the bandwidth. Servers similar to Janus are in roughly the same order of magnitude as Janus. SRS has always focused on performance because the Internet cares so much about cost, and cost savings have to be bought with performance.

This year is SRS's eighth year. Last year SRS became the Top 1 open source video server, mainly because of the rapid development of the domestic video industry and the relatively small number of active open source video servers.

The epidemic has had a huge impact on the global economy. It has also led to an explosion of Internet video: live streaming, education, conferencing, cloud gaming, IoT and so on. People had to stay at home, so the Internet became a very important way to communicate, and open source projects such as Janus grew a lot in early 2020.

This is probably the only black swan we will ever experience. I have always wondered whether Internet video will fall back to where it started once the epidemic is over. Judging from Janus, its growth rate was back to the pre-epidemic level half a year later. This may show that even open source cannot rely on this kind of event.

The rapid growth of SRS began at the end of 2019, the same time SRS added support for WebRTC, SRT and GB28181. So it is not clear how much of the growth is due to the epidemic and how much to SRS's own efforts. The good news is that SRS's growth is not slowing down, and it is the fastest-growing open source video server project. Sustained growth and a global Top 1 ranking are not the end, but a new beginning.

We believe that only when more than 100K developers subscribe to the official account will we have the chance to raise the creativity of developers across the whole video industry. Only at 100K stars can SRS be called the standard open source server for Internet video. Only by continually promoting demos and exploration of new scenarios can we keep expanding the boundaries of video.

SRS is an ambitious open source project, and its ten-year OKR is a big goal. But looking 30 years ahead, with three generations of new developers in the video industry and video becoming part of the Internet's infrastructure, the goal is not that big. The bigger question may be whether SRS can survive for 30 years.

What is cloud native

Back to today's topic: what still has to be solved to go from open source SRS to a commercial service.

Long sessions: the longest RTC meetings last 48 hours or more. Live streams are sometimes pushed for a very long time too. For example, yesterday's live stream on Lei Jun's video channel folded Xiaomi's foldable-screen phone continuously for three days. How do you upgrade the service during a three-day live stream?

Central, edge, and proprietary cloud SLAs vary widely: the central cloud has good network conditions and good infrastructure, so migrating sessions is relatively easy. The SLAs of edge and proprietary clouds are very different, and sessions there cannot be migrated with the same mechanism.

Port and IP reuse: traditional RTC is generally an intranet application, where ports can be used freely and tens of thousands of random ports can be allocated. In the cloud this is a security risk, and public IPv4 addresses cannot be used at will, so scaling out is difficult.

Many correlated streams, plus network switching: live streams are independent of each other, so when a server is heavily loaded, new sessions can be scheduled to other servers. RTC streams are correlated and sometimes cannot be scheduled freely, which makes load balancing hard.

Performance optimization is hard: RTC must be encrypted, UDP performs poorly on the kernel protocol stack, and constantly iterated QoS algorithms consume performance. The RTC service is therefore no longer a purely IO-intensive server; performance is the foundation of the whole system and affects every other aspect.

Regression testing is hard: there are so many client versions and algorithms that it is hard to know whether a change will break some client, or whether any of the major and minor versions in production will run into problems.

The first four of these problems are closely tied to cloud native. The remaining ones are each a big topic; because of time limits, we will share them on a future occasion.

Whether for central cloud, edge cloud or proprietary cloud, the direction of cloud development is cloud native. The cloud is already up in the clouds, and cloud native is foggier still, so let's look at how cloud native itself thinks.

It is fair to say that if an open source project is transformed and redesigned for cloud native and gains cloud architecture capabilities, a major problem of commercial service is solved. Let's see what needs to be done.

Long sessions, difficult to upgrade

Problem: sessions are long, with meetings of up to 48 hours, which makes upgrades difficult.

Why it's important: an online system that provides real service is either being upgraded or about to be upgraded, all day long. It is not possible to stop completely, suspend service, and resume only after a full upgrade. Long sessions mean that upgrades without interruption must be supported; otherwise upgrades cause unavailability and outages that seriously hurt the customer experience.

Scaling out and in is also affected by long sessions. When business volume grows, machines must be added, but existing long sessions cannot be migrated to the new machines, so the added capacity only serves new traffic. When volume drops, capacity should be reduced to cut cost, but if a long session outlasts the business cycle, capacity cannot be reduced.

Live-streaming service quality is measured in percentages; for example, N percent stalling is acceptable. In RTC, if one person in a meeting is unavailable, the whole meeting cannot continue. Every meeting matters; one meeting is not necessarily less important than a hundred others.
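
A small worked calculation shows why a per-user percentage is not the right measure for meetings. It is only a sketch under a simplifying assumption that is not from the talk: each participant's connection succeeds independently with the same probability. The function name is hypothetical.

```python
def meeting_success_probability(per_participant, participants):
    """Probability that every participant in a meeting is served.

    Assumes each participant succeeds independently with the same
    probability, which is a simplification used only for illustration.
    """
    return per_participant ** participants


# A 99% per-participant success rate looks fine for live streaming,
# but a 50-person meeting needs all 50 participants to succeed.
p = meeting_success_probability(0.99, 50)
print(f"{p:.3f}")
```

Under this assumption, a rate that is acceptable per viewer leaves roughly four in ten 50-person meetings with at least one affected participant.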

Current state and future: open source SRS has improved its exit logic so that it can wait for a period of time before exiting. SRS cannot yet do stateless upgrades, because that requires a storage layer for state, and open source does not need such a high SLA.

GRTN (Tenfold) already supports stateless upgrades and can be upgraded at any time (including, of course, during peak business hours). With stateless restart we have also solved recovery from crashes. C++ applications are bound to crash, just as mobile apps have a crash rate.

In the future, GRTN (Tenfold) will also do state migration and K8S rolling upgrades.
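
The "wait a certain amount of time, then exit" behaviour can be sketched as follows. This is a minimal illustration and not SRS's actual implementation; the Server class, graceful_exit, and the polling approach are all hypothetical.

```python
import time


class Server:
    """Hypothetical in-memory stand-in for a media server process."""

    def __init__(self):
        self.accepting = True
        self.sessions = set()

    def add_session(self, sid):
        # New sessions are refused once the server has started draining.
        if not self.accepting:
            return False
        self.sessions.add(sid)
        return True

    def end_session(self, sid):
        self.sessions.discard(sid)


def graceful_exit(server, grace_seconds, poll_seconds=0.01):
    """Stop accepting sessions, then wait up to grace_seconds for them to end.

    Returns the number of sessions still alive at the deadline, which are
    the ones that would be interrupted by the restart.
    """
    server.accepting = False
    deadline = time.monotonic() + grace_seconds
    while server.sessions and time.monotonic() < deadline:
        time.sleep(poll_seconds)
    return len(server.sessions)


server = Server()
server.add_session("meeting-1")
interrupted = graceful_exit(server, grace_seconds=0.05)
print("sessions interrupted:", interrupted)
```

A grace period works when sessions are short. With a 48-hour meeting the grace period would have to be 48 hours, which is why upgrades of long sessions need state kept outside the process instead.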

SLAs are different, and migration is difficult

Problem: no SLA is 100%, so the underlying infrastructure is bound to fail sooner or later: downtime, IO hangs, network unavailability. SLAs differ greatly between central, edge, and proprietary clouds, so migration is difficult.

Why it matters: infrastructure failures are unlikely, but when one happens the service becomes unavailable. When a server goes down, it affects not only the sessions on that server but every meeting that touches it, and a meeting typically spans multiple servers.
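
The reach of a single server failure can be made concrete with a short sketch. The placement data and the function name are hypothetical; the only idea taken from the text is that a meeting is affected as soon as any one of its servers fails.

```python
from typing import Dict, Set


def affected_meetings(placement: Dict[str, Set[str]], failed_server: str) -> Set[str]:
    """Return the meetings that have at least one participant on failed_server.

    `placement` maps a meeting id to the set of servers carrying that meeting.
    """
    return {meeting for meeting, servers in placement.items() if failed_server in servers}


# Hypothetical placement: three meetings spread over three servers.
placement = {
    "m1": {"edge-1", "edge-2"},
    "m2": {"edge-1"},
    "m3": {"edge-2", "edge-3"},
}
print(sorted(affected_meetings(placement, "edge-2")))  # ['m1', 'm3']
```

Here one failed server out of three affects two meetings out of three, although most of their participants are connected to healthy servers.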

For migration in the central cloud, the available infrastructure is fairly complete. In edge and proprietary clouds, network conditions and infrastructure reliability are worse than in the central cloud, so migration is harder.

Current state and future: SRS does not support migration. Open source tolerates a looser SLA, and similar open source servers have no migration capability either. The plan is to support migration through a reconnect scheme, at the cost of a poorer experience.

GRTN (Tenfold) has the basic migration capability and is expected to support migration in the central cloud this year. After that, migration needs continuous optimization to support edge clouds and proprietary clouds. Planned migrations, such as traffic rebalancing, also need to be considered.

Port and IP multiplexing, expansion is difficult

Problem: security requirements allow only fixed ports to be opened; enterprise firewalls allow only specific ports; a port range such as 10000 to 20000 cannot be opened.

Why it's important: otherwise the service fails security specifications and cannot pass security audits. More open ports mean a larger attack surface, and a security breach is more serious than a single service becoming unavailable, which is why WebRTC is doing E2E (end-to-end) encryption.

Some users sit behind a corporate firewall and cannot reach arbitrary ports on the public network; their traffic must converge on a few IPs and ports. Without port reuse, the service cannot be used in these enterprise scenarios.

A port is essentially state: it identifies a user, in that an IP plus a port can be treated as one client. This also complicates service migration, because more state has to be migrated.

Current state and future: the standard cloud-native practice is to hide services behind an SLB/Service, which forwards traffic to the real Pods. SRS already supports this approach.

In production there is also mobile network switching, which affects how the SLB locates clients. At present SRS identifies the client with the ICE ping-pong (connectivity check) messages. In the future QUIC would be better: the QUIC protocol itself accounts for session identification, so the problem can be solved at the SLB layer.

GRTN (Tenfold) also supports SLB forwarding for the TURN protocol. In the future, port reuse must also be solved in the edge cloud. Unlike the central cloud, the edge cloud may be split by operator, so the entry IP has to be changed after the client switches networks.
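
The idea of identifying a client by something other than its IP and port can be sketched as a small lookup table. This is an illustration only and not SRS's code: it assumes the identifier is the ICE username fragment (ufrag) carried in connectivity checks, and it leaves out STUN parsing and message authentication. All names are hypothetical.

```python
class SessionTable:
    """Maps packets arriving on one shared UDP port to sessions."""

    def __init__(self):
        self.by_ufrag = {}  # ufrag -> session state
        self.by_addr = {}   # (ip, port) -> ufrag, fast path for media packets

    def create(self, ufrag):
        self.by_ufrag[ufrag] = {"ufrag": ufrag, "addr": None}

    def on_binding_request(self, ufrag, addr):
        """Handle an ICE connectivity check and rebind if the address changed."""
        session = self.by_ufrag.get(ufrag)
        if session is None:
            return None
        old = session["addr"]
        if old != addr:
            if old is not None:
                self.by_addr.pop(old, None)
            session["addr"] = addr
            self.by_addr[addr] = ufrag
        return session

    def on_media_packet(self, addr):
        """Look up the session for a media packet by its source address."""
        ufrag = self.by_addr.get(addr)
        return self.by_ufrag.get(ufrag) if ufrag is not None else None


table = SessionTable()
table.create("u1")
table.on_binding_request("u1", ("10.0.0.5", 5000))   # first check, e.g. over Wi-Fi
table.on_binding_request("u1", ("172.16.9.9", 6000))  # after switching to a mobile network
print(table.on_media_packet(("172.16.9.9", 6000))["ufrag"])  # u1
```

Because the session is keyed by an identifier rather than by address, a client that switches networks is recognized again at its next connectivity check, and the server needs only one listening port.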

Many correlated streams, load balancing is difficult

Problem: streams are correlated, so load can spike while the number of sessions on a server stays constant. Stream correlation, bitrate fluctuation, and dynamic changes in QoS algorithms make estimates of the water level (the server's load level) inaccurate, and CPU and bandwidth consumption vary even when the number of sessions does not increase.

Why it's important: if the water level cannot be assessed accurately, the only option is to reserve more resources and run at a low water level, to avoid a surge overwhelming the server. Running at a low water level makes the overall cost high.
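
The relationship between water level, reserve, and cost can be sketched as below. This is a rough illustration under assumptions of my own: the water level is taken as the tighter of CPU and bandwidth utilization, and the capacities and the reserve value are made up. It is not how SRS or GRTN computes load.

```python
def water_level(cpu_used, cpu_capacity, kbps_used, kbps_capacity):
    """Load level as the tighter of CPU and bandwidth utilization."""
    return max(cpu_used / cpu_capacity, kbps_used / kbps_capacity)


def can_admit(level, reserve=0.4):
    """Admit new sessions only while the reserved headroom is untouched."""
    return level < 1.0 - reserve


# Same number of sessions at two moments; only bitrate and CPU use changed.
quiet = water_level(cpu_used=2, cpu_capacity=8, kbps_used=400_000, kbps_capacity=1_000_000)
busy = water_level(cpu_used=4, cpu_capacity=8, kbps_used=900_000, kbps_capacity=1_000_000)
print(can_admit(quiet), can_admit(busy))
```

The less accurately the level can be predicted, the larger the reserve has to be, and a 40% reserve means paying for capacity that is idle most of the time.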

Current state and future: SRS has not solved this problem and is working on QUIC cascading. In the future it may expose the server's water level, but it will not do traffic scheduling or load balancing, which is the job of the surrounding system.

GRTN (Tenfold) already supports multi-level cascading, cross-region cascading, and rough water level assessment. Accurate water level assessment is in progress, and traffic balancing will be considered in the future.

SRS and cloud native

To sum up, cloud native takes on the dirty work, and it is dirty work that never ends. Cloud native is a big step forward: it lets the infrastructure keep developing along standardized lines, and as long as an application follows the cloud-native specifications, it can run on multiple clouds with peace of mind.

The threshold of video really is very, very high. I still remember eleven years ago, when I first came into contact with Flash and FFmpeg: the sheer variety of concepts and protocols was bewildering. SRS wants to keep the bar for video low. Staying easy to use lets developers feel less anxiety and stress, and keep a full head of hair.

However, this is not the whole challenge of RTC services. There will never be a day when back-end services are finished.
