What keeps changing is the gameplay; what never changes is the need for ultra-low latency. A real-time architecture is the cornerstone of ultra-low latency. How do we build a real-time architecture across the whole chain of source coding, channel coding and real-time transmission? And on top of that architecture, how do we reduce latency by optimizing the key links of capture, encoding, transmission, decoding and rendering? This article shares our thinking and practice on these questions.

From live streaming to online doll grabbing

Figure 1

Figure 1 shows two different applications of real-time audio and video: interactive live streaming and the online doll machine (claw machine). Both are interactive, but their real-time requirements differ. Interactive live streaming (mic-connect) is about the real-time interaction of the voice and video streams themselves: one person says something, another hears it and replies, so the latency requirement falls mainly on the audio and video streams. Online doll grabbing, by contrast, puts stricter requirements on signaling latency: the operator does not need to speak, but must see the video stream returned by the doll machine respond to each control command.

If interactive live streaming is about the latency of real-time audio and video, then online doll grabbing is about the latency of signaling plus the video stream. Over time, our definition of real-time audio and video will keep evolving, and there may be more to consider in the future.

Figure 2

Figure 2 is the real-time architecture for interactive live streaming. We split the live interaction into two parts: on one side are the hosts, who need lower latency; on the other side is the general audience, who are less sensitive to delay but very sensitive to fluency. Between the two sides sit some bypass services that bridge the two clusters (one we call the ultra-low-latency cluster, the other the onlooker cluster).

In the ultra-low-latency part, we provide services such as stream status updates and room management, as well as streaming media services, mainly for distribution. Real-time distribution is done by a cluster of ultra-low-latency servers (not quite the same as those on the audience side).

In addition, it provides dynamic scheduling services that help us find better links within the existing resource network. The audience cluster behind it is a separate cluster, split off for business reasons and for our own cost considerations, and it adds storage, CDN acceleration and distribution functions.

The intermediate bypass services include stream mixing, format conversion (mainly transcoding) and protocol conversion. Why mix streams? A simple example: when nine hosts are connected on mic on the host side, without a stream mixing service the audience side would have to pull nine audio and video streams at the same time, which puts great pressure on bandwidth.

Ordinary viewers pull these streams through the onlooker server cluster (the cluster with relatively large delay), whose control over delay is relatively weak, so the nine pictures may end up out of sync with each other. With the stream mixing service, the audience pulls a single, already-mixed audio and video stream, so there is no synchronization problem between the individual streams.
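
As a rough illustration of what the mixing service does on the audio side, here is a minimal sketch (hypothetical helper, assuming 16-bit PCM frames of equal length) that sums several decoded streams into one frame with clipping; a production mixer also aligns timestamps and composites the video pictures.

```python
import numpy as np

def mix_pcm_frames(frames, sample_dtype=np.int16):
    """Mix several decoded PCM frames (one per participant) into a single frame.

    frames: list of 1-D numpy arrays of equal length, 16-bit PCM samples.
    Returns one frame of the same length, clipped to the valid sample range.
    """
    acc = np.zeros_like(frames[0], dtype=np.int32)  # wider accumulator to avoid overflow
    for f in frames:
        acc += f.astype(np.int32)
    info = np.iinfo(sample_dtype)
    return np.clip(acc, info.min, info.max).astype(sample_dtype)

# Example: 9 hosts, 20 ms frames at 48 kHz mono -> 960 samples per frame
rng = np.random.default_rng(0)
frames = [rng.integers(-3000, 3000, size=960, dtype=np.int16) for _ in range(9)]
mixed = mix_pcm_frames(frames)
print(mixed.shape)  # (960,) -- the audience pulls one stream instead of nine
```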

There are also format conversion (transcoding) services. The front cluster provides very low latency services, and some of its output, such as the bitstream format, cannot be distributed directly on a traditional CDN network. There is also protocol conversion: the front part provides a lower-latency service over a private protocol, while the back part has to be distributed over the CDN network, so the protocol must be converted as well.

Figure 3

Figure 3 is the architecture of the APP version of the online doll machine. The feature here is that the doll machine pushes two video streams in real time, and the player can switch between the two views at any time. The two video streams first pass through our ultra-low-latency server cluster. The player can also push one stream up, which lets the audience see the player's expressions, reactions and speech while grabbing the doll, adding another layer of interaction. In addition, the player needs to control the doll machine remotely from a mobile phone, so real-time signaling distribution is also required.

Figure 4

Next is the H5 (HTML5) architecture of the doll machine. There is no big difference on the push side compared with the APP version; the doll machine still pushes with our proprietary protocol. The difference is that H5 cannot pull the proprietary protocol directly, so a media gateway is added in the middle to translate our private protocol into a stream format that H5 can recognize, and H5 then pulls the stream over WebSocket. The media gateway here has to do this conversion with ultra-low delay.

In a nutshell, the gateway server just does a distribution job, which might not seem to introduce latency, but it does.

WebSocket pulls the stream over TCP, but we push over UDP. When a video frame is large, one frame has to be split into many UDP packets. The server needs to buffer these UDP packets and reassemble a complete frame before sending it to H5, so that no frame is broken and playback stays smooth. This frame reassembly introduces delay. The signaling part is basically similar to the APP version.
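
A minimal sketch of the reassembly step described above (hypothetical packet format: each UDP packet carries frame_id, index, count and payload); the gateway holds fragments per frame and only forwards a frame once every fragment has arrived:

```python
from collections import defaultdict

class FrameAssembler:
    """Collects UDP fragments and yields complete frames for the WebSocket side."""

    def __init__(self):
        self._fragments = defaultdict(dict)  # frame_id -> {index: payload}

    def on_udp_packet(self, frame_id, index, count, payload):
        frags = self._fragments[frame_id]
        frags[index] = payload
        if len(frags) == count:                       # all fragments present
            frame = b"".join(frags[i] for i in range(count))
            del self._fragments[frame_id]             # free memory for this frame
            return frame                              # ready to push over WebSocket
        return None                                   # still waiting -> this wait is the added delay

assembler = FrameAssembler()
# Fragments may arrive out of order; the frame is only emitted when the last one lands.
print(assembler.on_udp_packet(1, 1, 2, b"world"))     # None
print(assembler.on_udp_packet(1, 0, 2, b"hello "))    # b'hello world'
```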

Some thoughts on real-time architecture

We have just introduced two real-time audio and video scenarios. Here is a question worth thinking about: what are the characteristics of real-time audio and video, and how should a real-time audio and video system be built?

It is a matter of opinion; you can structure the system in a number of ways that will work reasonably well. But I think, no matter how you build it, real-time audio and video always revolves around the following points, and only by doing them well can you build solid technical reserves and a good reputation in the industry.

The first point is that real-time audio and video cannot wait: if it waits, it is no longer real time. But there is a paradox here. If you look at real-time audio and video as a producer-consumer model, should data be produced in advance or on demand? The literal answer seems simple: it must be produced on demand, produced exactly when it is needed, because producing early introduces delay. But producing on demand at every point does not actually work.

For example, to play a piece of audio, the ideal is for the system or driver to tell you it needs data, and you then fill in one frame: that is on-demand production. So why does early production exist at all? Because when the system asks for data, it also has a response-time requirement.

If you only start producing the frame at that moment, will you make it in time? If everything downstream is fast enough, you might. But that means a lot of work has to be finished in a very short time, which places high demands on hardware performance and is generally not desirable. This is just one simple example in real-time audio and video. Early production introduces delay, so how far ahead should we produce, and how do we dynamically estimate when we should produce? This is an open question, and one that should be considered when designing a system.
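
A minimal sketch of the trade-off (all numbers hypothetical): the render callback must be answered within a deadline, producing a frame takes longer than that deadline, so we keep just enough frames produced ahead to cover it; the pre-produced depth is exactly the delay we pay.

```python
import queue
import threading
import time

FRAME_MS = 20          # one audio frame covers 20 ms of playback
PRODUCE_MS = 8         # assumed time to produce (decode/process) one frame
DEADLINE_MS = 5        # assumed response-time budget of the render callback

# Producing on demand (8 ms) cannot meet the callback deadline (5 ms), so we
# must produce ahead; one frame of look-ahead is enough here, and that one
# frame (20 ms of audio) is the latency we pay for producing early.
frames_ahead = 0 if PRODUCE_MS <= DEADLINE_MS else 1
buffer = queue.Queue(maxsize=max(frames_ahead, 1))

def producer():
    for frame_no in range(10):
        time.sleep(PRODUCE_MS / 1000)          # simulate production cost
        buffer.put(f"frame-{frame_no}")        # blocks once the look-ahead is full

def render_callback():
    # The "driver" asks for data; we must answer within DEADLINE_MS,
    # so we only pop an already-produced frame instead of producing now.
    return buffer.get(timeout=DEADLINE_MS / 1000)

threading.Thread(target=producer, daemon=True).start()
for _ in range(10):
    time.sleep(FRAME_MS / 1000)                # playback clock ticks every 20 ms
    print(render_callback())
```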

Second, real-time audio and video has to wait: some waiting is unavoidable. For example, audio encoding has to be done in 20 ms or 40 ms frames; you cannot encode a single sample point. Since some delay and waiting are inevitable, we want the processing granularity of the system to be as small as possible, which tends to give lower latency. But the smaller the granularity, the more frequently the whole system has to run: think of it as a loop, and when each iteration handles very little, the loop has to run many more times, which is a significant overhead and burden for the system.

So when we have to wait, we want the wait to be small. Another consideration for processing granularity is that there is no guarantee that every link in the system uses the same granularity. One node may work in 10 ms units and the next in 15 ms units; this comes from algorithm constraints and often cannot be avoided. If you choose a relatively small granularity throughout the system, then when granularities are spliced together, for example 10 ms feeding 15 ms, two 10 ms pieces make one 15 ms piece with only 5 ms left over, so the leftover is small.

If the granularity is very coarse, the leftover can be large. In this splicing, the leftover represents delay on the link. So we want the granularity as small as possible, but not so small that the system cannot afford the overhead.
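
The splicing argument lends itself to a tiny simulation (all numbers hypothetical): samples arrive once per millisecond, pass through a stage that releases data only in fixed-size blocks, and are then consumed in another block size; the worst extra delay any sample suffers grows with the block sizes.

```python
def worst_added_delay_ms(stage_a_ms, stage_b_ms, sim_ms=10_000):
    """Worst extra delay a sample suffers passing through two framed stages.

    Samples arrive once per ms. Stage A releases data only in stage_a_ms blocks;
    stage B then processes only in stage_b_ms blocks. Returns the worst
    (consume_time - arrival_time) over the simulation, in ms.
    """
    released = []     # arrival times of samples already released by stage A
    pending_a = []    # arrival times of samples still inside stage A's current block
    worst = 0
    for t in range(1, sim_ms + 1):
        pending_a.append(t)
        if len(pending_a) == stage_a_ms:       # stage A block complete -> release
            released.extend(pending_a)
            pending_a = []
        while len(released) >= stage_b_ms:     # stage B consumes whole blocks
            block, released = released[:stage_b_ms], released[stage_b_ms:]
            worst = max(worst, t - block[0])
    return worst

print(worst_added_delay_ms(10, 15))   # fine granularity in both stages
print(worst_added_delay_ms(40, 60))   # coarse granularity -> much larger worst-case delay
```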

Third, real-time audio and video cannot wait forever. For example, if you are waiting for a network packet and it arrives late, you cannot wait for it indefinitely, so the waiting comes with a timeout mechanism: if an audio packet has not arrived for too long, I skip it and do loss concealment for that frame, and when the packet finally does arrive I can only throw it away rather than use it.
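
A minimal sketch of that timeout rule (hypothetical frame-level jitter buffer): each frame gets a playout deadline; if its packet has not arrived by then we conceal, and a packet that shows up after its deadline is dropped.

```python
class TimedJitterBuffer:
    """Frame-level receive buffer with a hard playout deadline per sequence number."""

    def __init__(self, max_wait_ms=60):
        self.max_wait_ms = max_wait_ms
        self.frames = {}          # seq -> payload
        self.given_up = set()     # seqs we already concealed; late arrivals are dropped

    def on_packet(self, seq, payload):
        if seq in self.given_up:
            return                # arrived after its deadline -> useless, throw away
        self.frames[seq] = payload

    def get_frame(self, seq, first_wanted_at_ms, now_ms):
        if seq in self.frames:
            return self.frames.pop(seq)
        if now_ms - first_wanted_at_ms >= self.max_wait_ms:
            self.given_up.add(seq)
            return b"<concealed>"   # stand-in for packet loss concealment output
        return None                 # keep waiting, deadline not reached yet

buf = TimedJitterBuffer(max_wait_ms=60)
t0 = 0
print(buf.get_frame(7, first_wanted_at_ms=t0, now_ms=t0 + 20))   # None: still waiting
print(buf.get_frame(7, first_wanted_at_ms=t0, now_ms=t0 + 60))   # b'<concealed>'
buf.on_packet(7, b"late")                                         # too late, discarded
```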

Figure 5

In addition, real-time audio and video on the server side needs to consider several issues: first, load balancing; second, nearby access; third, quality assessment; fourth, dynamic routing; and fifth, algorithmic flow control.

First, load balancing means letting every node in the server network carry a relatively even load, so that an overloaded node does not start dropping packets or inflating network round trips; any such network damage causes a relatively large increase in latency for real-time audio and video.

Second is nearby access, where "nearby" does not mean geographically nearby but nearby in the network sense. A simple example: we push streams from Shenzhen, and Hong Kong is geographically very close, so we could push to a server in Hong Kong. But that is a cross-border network with unstable factors, so we would rather push somewhere farther away. "Nearby" here means nearby in terms of network quality: the round-trip time is small and stable, the delay distribution has no significant probability mass at large values, and the packet loss rate is very low.
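
A minimal sketch of choosing an access node by network quality rather than geography (the weights and probe data are hypothetical): each candidate is probed, and the score combines mean RTT, RTT jitter and loss rate.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class Probe:
    name: str
    rtt_ms: list      # recent RTT samples to this candidate node
    loss_rate: float  # recent packet loss rate, 0.0 - 1.0

def score(p: Probe) -> float:
    """Lower is better: combine mean RTT, RTT jitter and loss (hypothetical weights)."""
    return mean(p.rtt_ms) + 2.0 * pstdev(p.rtt_ms) + 800.0 * p.loss_rate

candidates = [
    # Geographically close but crossing an unstable border link
    Probe("hongkong", rtt_ms=[15, 18, 120, 16, 95], loss_rate=0.03),
    # Farther away but stable
    Probe("shanghai", rtt_ms=[38, 40, 41, 39, 42], loss_rate=0.001),
]
best = min(candidates, key=score)
print(best.name)   # "shanghai": network-nearby even though it is geographically farther
```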

To achieve nearby access in this sense, you need a good quality assessment system. There are two kinds of quality assessment:

  1. After-the-fact quality assessment. For example, after the network has run for a month, review the quality of the whole month. This can be considered a relatively offline evaluation; it tells us whether the network improved this month compared with last month, and we can learn from it, for example the differences between this month's and last month's scheduling policies. It is a systematic way to accumulate experience and optimize.
  2. Real-time quality assessment. This is the more important one. For example, while I am pushing a stream, I monitor its current quality in real time, which is what makes real-time dynamic routing possible (see the sketch after this list). Suppose someone pushes a stream from Beijing to Dubai; there are many links to choose from, and past experience may say that pushing directly to Dubai works well, but that is not always the case. With real-time quality assessment we know whether the direct path to Dubai is good right now; if not, it can be swapped out without the user noticing, and nodes on the route can be added or removed at any time. That is the idea of dynamic routing.
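
A minimal sketch of real-time quality monitoring driving a reroute decision (the EWMA smoothing, thresholds and route names are hypothetical):

```python
class LinkQualityMonitor:
    """Real-time quality assessment for one route: smoothed RTT and loss via EWMA."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.rtt_ms = None
        self.loss = 0.0

    def report(self, rtt_ms, lost):
        """Feed one probe/ACK result: measured RTT and whether the packet was lost."""
        if self.rtt_ms is None:
            self.rtt_ms = rtt_ms
        else:
            self.rtt_ms = (1 - self.alpha) * self.rtt_ms + self.alpha * rtt_ms
        self.loss = (1 - self.alpha) * self.loss + self.alpha * (1.0 if lost else 0.0)

    def degraded(self, rtt_limit_ms=250, loss_limit=0.05):
        if self.rtt_ms is None:
            return False
        return self.rtt_ms > rtt_limit_ms or self.loss > loss_limit

# Hypothetical usage: monitor the direct Beijing -> Dubai route while streaming.
direct = LinkQualityMonitor()
for rtt, lost in [(180, False), (190, False), (400, True), (420, True), (450, True)]:
    direct.report(rtt, lost)
    if direct.degraded():
        print("switch to a relay route without the user noticing")
        break
```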

In practice, the four points above are combined to select the best, or approximately best, link within our network and server resource set to support the real-time audio and video service. But the resource set is finite, and nobody can guarantee that it always contains a link with good enough characteristics. That is where the fifth point comes in: even if the link I picked is the best available in the whole resource set, its quality may still fall short, and algorithms are needed to make up the difference. These algorithms include techniques for transmitting audio and video reliably over an unreliable network, which we will share a little later, as well as congestion control over the entire link.

Thinking about source coding

Figure 6

Source coding compresses a large amount of media data into much less network data to reduce the burden on the network. There are many compression modes. Let us look at audio first; several curves are drawn in Figure 6. We focus on the Opus encoder, whose hybrid mode mixes two coding modes together and chooses between them according to the situation.

Figure 6 plots bit rate on the horizontal axis, quality on the vertical axis, and the curves of various audio codecs in between. You will find that the linear-prediction approach gives good quality at low bit rates, but its curve stops at around 20 kbps because it does not support higher bit rates. MDCT coding, on the other hand, can reach nearly transparent quality at relatively high bit rates. Audio encoders have different principles inside: LP mode, for instance, models the human voice, and because there is a mathematical model behind it, it can provide reliable quality at a relatively low bit rate.

But it has the characteristic of saturating easily: give it a higher bit rate and the result hardly improves, because it is after all a parametric coder. So it depends on the business scenario; when you need very high sound quality, or need to handle music, it is obviously not the right choice. MDCT mode has no model inside; it transforms the signal into the frequency domain and quantizes it directly. Because it is not model-based, it consumes more bits, and while it provides good quality at higher bit rates, its performance at low bit rates is far inferior to the model-based approach.

Figure 7

To summarize, audio includes both speech and music, so there are codecs suited to speech and codecs suited to music. The first kind is suited to speech: speech can be modeled, so such a codec provides good quality at low bit rates and a relatively high compression ratio, but it saturates easily and cannot approach transparent quality. The other kind works on a different principle and encodes both music and speech well, but it does not provide a high compression ratio and cannot deliver high quality at low bit rates.
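
A minimal sketch of the kind of mode decision a hybrid codec such as Opus makes (the thresholds and the content classifier here are hypothetical, not Opus internals):

```python
def pick_audio_mode(target_bitrate_bps, content):
    """Choose a coding mode for one stream.

    content: "speech" or "music" (in practice this comes from a signal classifier).
    Returns "LP" (model-based, efficient for speech at low rates),
            "MDCT" (transform coding, near-transparent at high rates),
            or "HYBRID" in between.
    """
    if content == "speech" and target_bitrate_bps < 24_000:
        return "LP"        # parametric speech model: good quality from few bits
    if target_bitrate_bps >= 48_000:
        return "MDCT"      # plenty of bits: transform coding approaches transparency
    return "HYBRID"        # middle ground: combine both, as a hybrid codec does

print(pick_audio_mode(16_000, "speech"))   # LP
print(pick_audio_mode(64_000, "music"))    # MDCT
print(pick_audio_mode(32_000, "music"))    # HYBRID
```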

Figure 8

For video coding, the simplest concepts are I frames, P frames and B frames. An I frame references only itself; a P frame uses forward reference, encoding against features of historical frames; a B frame is bidirectional, referencing both previous and following frames. B frames give a higher compression ratio and better quality, but because they reference future frames they introduce latency, so we rarely use B frames in real-time audio and video systems.

If you want to build a good real-time audio and video system, flow control is a must. What does that require of the video codec? At the very least, its rate control must be stable. Why? Suppose I have a very good congestion control strategy and a flawless bandwidth estimate, and at a given moment the bandwidth allocated to video is 500 kbps, so I set the video encoder to 500 kbps. If the rate control is not stable, the encoder may actually run at 600 kbps, causing congestion and delay. So we want to choose a codec with a good rate control strategy.

In fact, some open-source codecs do implement rate control, but using it directly may not suit your scenario, because the open-source implementation probably has other scenarios in mind, not just real-time audio and video. For example, a codec used for offline compression only needs to hit the target bit rate over half an hour or an hour, regardless of what any individual second looks like, whereas real-time audio and video needs the rate to hold within a very small time window.
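
A minimal sketch of checking rate-control stability over a short window, as real-time use demands (window length and target are hypothetical): measure the encoder's actual output over the last second and compare it with the target the congestion controller set.

```python
from collections import deque

class ShortWindowRateMeter:
    """Measures encoder output bitrate over a sliding window of ~1 second."""

    def __init__(self, window_ms=1000):
        self.window_ms = window_ms
        self.frames = deque()   # (timestamp_ms, encoded_bytes)

    def on_encoded_frame(self, ts_ms, nbytes):
        self.frames.append((ts_ms, nbytes))
        while self.frames and ts_ms - self.frames[0][0] > self.window_ms:
            self.frames.popleft()

    def bitrate_bps(self):
        if not self.frames:
            return 0.0
        total_bits = 8 * sum(n for _, n in self.frames)
        return total_bits * 1000.0 / self.window_ms

meter = ShortWindowRateMeter()
target_bps = 500_000
# 30 fps encoder: if every frame is ~2500 bytes, the 1 s window holds ~600 kbps,
# overshooting the 500 kbps target even if a long-term average would look fine.
for i in range(30):
    meter.on_encoded_frame(ts_ms=i * 33, nbytes=2500)
print(meter.bitrate_bps(), "bps vs target", target_bps)
```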

In addition, we want the codec to support layered (scalable) coding. What is layered coding, and why do we need it? There are two kinds: temporal layering and spatial layering. The former organizes the inter-frame reference structure so that not every frame depends on the immediately preceding frame; the latter can be thought of as first encoding a small picture at a low bit rate, then using the remaining bit rate to encode the incremental part and obtain a higher-resolution picture.

Why do this? Few real-time audio and video scenarios are one-to-one. When it is not one-to-one, flow control cannot simply push down the host's uplink bit rate just because one viewer's downlink is bad; there may be a thousand viewers whose networks are very good, and they would then also get a less clear picture because of individual bad networks. So the stream must be layered, and the server can choose which layers to send to each user. With a layered strategy, if a viewer's line is bad, we can send only the smaller layers, for example the core layer; the viewer can still reconstruct the whole video from the core layer, perhaps with less detail or a lower frame rate, but on the whole it remains usable.
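
A minimal sketch of per-viewer layer selection on the forwarding server (layer bit rates and headroom are hypothetical): each viewer gets the largest set of layers that fits its estimated downlink.

```python
# Cumulative bit rate needed to forward up to each layer (hypothetical values).
LAYER_BITRATE_BPS = [200_000, 500_000, 1_200_000]   # core only, +mid, +full quality

def pick_layer(downlink_bps, headroom=0.8):
    """Return the highest layer index whose cumulative rate fits the viewer's downlink."""
    budget = downlink_bps * headroom
    chosen = 0                                        # the core layer is always sent
    for i, rate in enumerate(LAYER_BITRATE_BPS):
        if rate <= budget:
            chosen = i
    return chosen

viewers = {"good_wifi": 4_000_000, "weak_4g": 700_000, "edge_of_coverage": 250_000}
for name, bps in viewers.items():
    print(name, "-> forward layers 0 ..", pick_layer(bps))
# The host keeps pushing the full layered stream; only the per-viewer forwarding changes.
```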

Finally, a lot of people assume that because video carries much more data, video must have higher latency than audio, but that is not the case. Much of the latency is the codec's own algorithmic delay: if the codec uses no B frames, you can treat video encoding as having essentially no algorithmic delay. Audio encoding, however, always looks ahead at some future data, so there is inherently some delay in the audio encoder. Generally speaking, then, the audio delay is higher than the video delay.

Thinking about channel coding technology

Figure 9

Channel coding falls into several parts. One is redundant coding based on prior knowledge of the network, that is, forward error correction (FEC). Take RS(4,6) coding as an example: I send a group of six packets, four of which carry actual media data and two of which are redundant. As long as any four of the six packets arrive, the decoder can fully recover all of the media content.

For example, if packets 2 and 3 are lost but 1, 4, R1 and R2 are received, packets 2 and 3 can be fully recovered. That looks great: any two losses can be recovered. But the algorithm has weaknesses and is not suited to bursty packet loss, because the group cannot be too large; a large group means a large delay, and if the group is small, a burst can wipe out the whole group.

In that case the redundancy achieves nothing. So it is not suited to bursty loss, and it is, after all, redundancy based on prior knowledge: you are always predicting the next moment's network from the last moment's state. The network changes in real time, and such predictions are not entirely reliable.

As a result, its recovery efficiency in real networks is relatively low, and the algorithm complexity is relatively high. Of course, it also has advantages: the redundant packets are computed in advance and sent together, so there is no need to wait until a loss is detected, which means it is unaffected by the network round-trip time. In addition, the amount of redundancy can be adjusted freely, so it suits scenarios with uniform packet loss.
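
To make the redundancy idea concrete, here is a minimal sketch using a single XOR parity packet, which can rebuild any one lost packet in the group; real RS(4,6) uses Reed-Solomon codes over GF(2^8) so that two parity packets can recover any two losses, but the send-ahead principle is the same.

```python
def xor_parity(packets):
    """Build one redundant packet as the XOR of all data packets (equal length)."""
    parity = bytearray(len(packets[0]))
    for p in packets:
        for i, b in enumerate(p):
            parity[i] ^= b
    return bytes(parity)

def recover_one(received, parity):
    """received: dict index -> packet for the packets that arrived (exactly one missing)."""
    missing = bytearray(parity)
    for p in received.values():
        for i, b in enumerate(p):
            missing[i] ^= b
    return bytes(missing)

data = [b"pkt1", b"pkt2", b"pkt3", b"pkt4"]       # media packets in one FEC group
red = xor_parity(data)                            # sent with the group, ahead of any loss
arrived = {0: data[0], 1: data[1], 3: data[3]}    # the packet at index 2 was lost
print(recover_one(arrived, red))                  # b'pkt3' rebuilt without any retransmission
```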

Figure 10

Another technique is retransmission after packet loss. Compared with RS coding, retransmission is more targeted, so its recovery efficiency is relatively high. The first variant, Go-Back-N, is a transmission scheme similar to TCP: the sender keeps sending packets, and the receiver keeps telling the sender the sequence number up to which everything has been received consecutively. If the sender has sent 10 frames and the receiver has only correctly received up to 8, then regardless of whether frames 9 and 10 actually arrived, the sender goes back and retransmits from there. So Go-Back-N is somewhat targeted: it maintains loss state and knows roughly which packets were not received, but not precisely.

Next is Selective ARQ, or selective retransmission. Here the receiver discovers exactly which packet is missing and asks the sender to resend that packet. It sounds very good and very efficient: whichever packet is lost gets retransmitted. The weakness is that it assumes packets are being sent frequently. Suppose the sender sends packets 1, 2, 3 and 4, but only one packet per second. When does the receiver notice that 2 is missing? Only when 3 arrives. And if 2 happens to be the last packet, the receiver never notices at all. In other words, when packets are infrequent it takes at least a second to detect the loss, by which time retransmission is too late.
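
A minimal sketch of the receiver side of selective retransmission (gap detection only; a real implementation also needs NACK pacing, RTT-aware retries and the timer required for the "last packet" case discussed above):

```python
class SelectiveArqReceiver:
    """Detects sequence-number gaps on arrival and requests exactly the missing packets."""

    def __init__(self, send_nack):
        self.send_nack = send_nack     # callback: ask the sender to resend one seq
        self.expected = 0              # next sequence number we have not yet seen
        self.missing = set()

    def on_packet(self, seq):
        if seq in self.missing:
            self.missing.discard(seq)  # a retransmission filled the hole
            return
        if seq < self.expected:
            return                     # duplicate / stale packet
        for gap in range(self.expected, seq):   # everything between was skipped
            self.missing.add(gap)
            self.send_nack(gap)        # only the lost packets are requested again
        self.expected = seq + 1

rx = SelectiveArqReceiver(send_nack=lambda s: print("NACK", s))
for seq in [0, 1, 3, 6, 2]:           # 2, 4 and 5 go missing; 2 later arrives via retransmit
    rx.on_packet(seq)
print("still missing:", rx.missing)   # {4, 5}
```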

So in a real system, selective retransmission is preferred, because most audio and video traffic is dense, but some Go-Back-N-style acknowledgement mechanism may also be needed so that retransmission works more completely. In addition, any retransmission has to wait at least one network round trip, because confirming the loss or feeding back reception takes a round trip. So its weakness is that it is heavily affected by the round-trip time, and if not controlled properly it can trigger retransmission storms. The advantages are that the algorithm is relatively simple and easy to implement, that its strong targeting means few useless retransmitted packets, and that it handles bursty packet loss well.

Forward error correction and retransmission, the two transmission techniques for unreliable networks, each have their advantages and disadvantages. A good network distribution technology in fact combines the two, depending on the channel conditions.

Figure 11

Figure 11 comes from the internet. In the blue region at the lower left, retransmission is used when the network round trip is small and the packet loss rate is not high; but when the RTT is high, the figure abandons the retransmission strategy. Personally I do not think that is entirely reasonable. As mentioned earlier, FEC is untargeted redundancy based on prior knowledge. Retransmission is indeed slow when RTT is very high, but without it you must add a great deal of redundancy to recover all the losses, which wastes a lot of resources, and at high loss rates you still may not recover every packet. Video stutters badly as soon as frames are lost, so the residual video packet loss rate needs to be kept down to a few per mille for playback to stay smooth and watchable.
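
A minimal sketch of one way to combine the two techniques along the lines of the discussion above (the thresholds are hypothetical and not taken from Figure 11): lean on retransmission whenever the latency budget allows at least one extra round trip, and scale FEC redundancy with the loss rate rather than relying on FEC alone just because RTT is high.

```python
def channel_coding_plan(rtt_ms, loss_rate, latency_budget_ms=400):
    """Pick a mix of retransmission and FEC for the current channel (hypothetical policy)."""
    plan = {"arq": False, "fec_redundancy": 0.0}
    # Retransmission needs at least one extra round trip inside the latency budget.
    if rtt_ms * 2 < latency_budget_ms:
        plan["arq"] = True
    # FEC redundancy grows with loss; keep some even when ARQ is on, to catch bursts early.
    if loss_rate > 0.01:
        plan["fec_redundancy"] = min(2.0 * loss_rate, 0.5)
    return plan

print(channel_coding_plan(rtt_ms=30, loss_rate=0.02))    # ARQ plus a little FEC
print(channel_coding_plan(rtt_ms=300, loss_rate=0.10))   # RTT exceeds the budget: heavy FEC only
print(channel_coding_plan(rtt_ms=150, loss_rate=0.10))   # both: ARQ plus substantial FEC
```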

Figure 12

Some thoughts on channel coding. Channel coding trades against network throughput: both retransmission and redundancy occupy bandwidth and reduce the throughput available to actual media data. In real life channels are limited, so transmission strategies must be based on the channel's characteristics, and when the channel is congested we need a congestion control algorithm to estimate how to allocate it properly.

In addition, when building a system it is very important to think clearly about how to evaluate it. For channel coding, the key indicator is its effectiveness, and there are two kinds. The first is whether retransmission or redundancy actually fills in the lost packets: even with a channel coding strategy in place, some loss may remain after recovery.

For example, if the original loss rate is 20% and 1% remains after recovery, then in our evaluation the retransmission has effectively failed, because 1% residual loss makes no difference to audio but makes video stutter badly. Under such an evaluation system, 1% of packets are still missing, so all that coding has achieved little; and if the channel is also congested at the same time, the channel coding clearly is not producing a good result. Should all channel coding be stopped at that point?

The second measure of channel coding effectiveness is how much of the redundancy is actually put to use. In the earlier example, if the group of six carried two redundant packets, exactly two packets were lost, and both were recovered, then 100% of the redundancy was effective. If only one of the four data packets is lost but two redundant packets were sent, one of the redundant packets did nothing. So if you want to build a good system, first think about how you will evaluate it.
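
A minimal sketch of computing the two effectiveness measures above from per-group counters (the field names are hypothetical):

```python
def channel_coding_effectiveness(groups):
    """groups: list of dicts with data_lost, data_recovered, redundancy_sent, redundancy_used."""
    lost = sum(g["data_lost"] for g in groups)
    recovered = sum(g["data_recovered"] for g in groups)
    red_sent = sum(g["redundancy_sent"] for g in groups)
    red_used = sum(g["redundancy_used"] for g in groups)
    return {
        # Measure 1: how much of the original loss was left unrepaired.
        "residual_loss_ratio": (lost - recovered) / lost if lost else 0.0,
        # Measure 2: how much of the redundancy we paid for did real work.
        "redundancy_utilization": red_used / red_sent if red_sent else 0.0,
    }

stats = [
    {"data_lost": 2, "data_recovered": 2, "redundancy_sent": 2, "redundancy_used": 2},  # fully used
    {"data_lost": 1, "data_recovered": 1, "redundancy_sent": 2, "redundancy_used": 1},  # one wasted
    {"data_lost": 3, "data_recovered": 2, "redundancy_sent": 2, "redundancy_used": 2},  # loss remains
]
print(channel_coding_effectiveness(stats))
# {'residual_loss_ratio': 0.1666..., 'redundancy_utilization': 0.8333...}
```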

Where delay is introduced, and ideas for reducing it

Delay is introduced mainly in three places. The first is capture and rendering. This may look like the easy part, but it probably introduces the biggest delay, likely bigger than the whole distribution process.

Many people do not realize this, but on the existing network the round-trip latency can be controlled within 50 milliseconds, whereas for rendering and capture, especially rendering, almost no mobile system can guarantee 50 milliseconds one hundred percent of the time; that is a hardware limitation. How can these delays be reduced? I have already given you the producer-consumer way of thinking: consider carefully whether to produce on demand or ahead of time.

There is also latency associated with the codec, especially audio. Some of this delay is unavoidable, and we reduce it according to the actual usage scenario; these are trade-offs among coding modes. Processing granularity is another consideration that affects overall system latency.

Finally there is the delay everyone can see: network distribution delay. How do we reduce it? Besides finding an optimal subset within the resource set, there is channel coding: building a good channel coding system, and knowing how to evaluate it. With these ideas in mind, we can guide the development work that comes next.

About the author

Guan Xu is a core expert on the audio and video engine at Jigou Technology. He holds a master's degree from the Department of Mathematics of Nankai University and has worked at ZTE, Tencent and other companies on audio and video research and development. He has many years of experience in real-time audio and video technology and is currently responsible for core development of Jigou Technology's audio and video engine.

Thanks to Xu Chuan for correcting this article.
