  • Data encoding
  • Data transfer
  • Multi-terminal caching
  • Conclusion
  • Further reading

Why’s THE Design is a series of articles about design decisions in the field of computing. In each article, we pose a specific question and discuss the pros and cons of the design, and its impact on implementation, from different perspectives. If there is a question you would like to know more about, leave a comment below.

The development of communications technology has fueled the rise of video-on-demand (VoD) and live streaming services, and advances in 4G and 5G networks have made streaming media technology ever more important. Network technology alone, however, cannot solve the problem of high latency in live streaming. This article will not discuss the impact of the network on live streaming services. Instead, it analyzes a phenomenon familiar to anyone watching a live broadcast: the noticeable delay between the host and the audience. Apart from the business requirement of deliberately delayed broadcasts, what factors can cause such high latency in live video?

Figure 1 – Live streaming

When a viewer interacts with the host through bullet comments (danmaku), it may take 5 seconds or even longer before the viewer sees the host's response. Although the host sees a comment at roughly the same time as the other viewers do, the live streaming system needs a long time to deliver the host's audio and video data to the client or browser. The time it takes to transfer the data from the host to the viewer is usually called the end-to-end audio and video latency.

A live stream travels a very long path from audio and video capture and encoding to decoding and playback. It involves the host side, the streaming media server, and the viewer side, each of which is responsible for different functions:

  • Host side: audio/video capture, audio/video encoding, and stream publishing (push);
  • Streaming media server: stream ingestion, audio/video transcoding, and stream distribution;
  • Viewer side: stream pulling, audio/video decoding, and playback.

Along this long capture-and-distribution pipeline, different techniques are used at different stages to guarantee the quality of the broadcast. Together, these measures for ensuring reliability and reducing bandwidth are what cause the high latency of live streaming. This article analyzes why the end-to-end latency of live streaming is so high from the following three aspects:

  • The encoding format used for audio and video means the client can only start decoding from specific frames;
  • The slice size used by the network protocol that transmits the audio and video determines how often the client receives data;
  • The server and the client keep caches to guarantee user experience and broadcast quality.

Data encoding

Live video broadcasting cannot do without audio and video encoding. The mainstream encoding formats today are Advanced Audio Coding (AAC) [1] and Advanced Video Coding (AVC) [2], with AVC better known as H.264. Rather than discussing audio codec algorithms, we will take a closer look at why H.264 encoding is needed and how it affects live streaming latency. Suppose we want to watch a 2-hour 1080p, 60 FPS movie. If each pixel takes 2 bytes of storage, the entire movie would occupy the following amount of space:
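
Reconstructing this from the stated assumptions (taking 1080p to mean 1920 × 1080 pixels per frame), a rough back-of-the-envelope estimate is:

```
1920 × 1080 pixels/frame × 2 bytes/pixel × 60 frames/s × 7,200 s
  ≈ 1.79 × 10^12 bytes ≈ 1.8 TB
```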

In reality, however, a movie occupies only a few hundred MB or a few GB of disk space, several orders of magnitude less than the figure calculated above. Audio and video encoding is the key technology for compressing the data and reducing the disk space and network bandwidth it consumes.

H.264 is the industry standard for video compression. Because a video is composed of frame-by-frame pictures that are strongly correlated with one another, H.264 uses intra-coded pictures (I-frames) to carry the full picture data and compresses the rest as incremental modifications of that data, using predicted pictures (P-frames) and bidirectional predicted pictures (B-frames).

Figure 2 – H.264 compressed video data

H.264 uses I-, P-, and B-frames to compress the video into the sequence of frames shown in the figure above. The three kinds of frames serve different purposes [3]:

  Video frame   Role
  I-frame       A complete image in JPG or BMP format
  P-frame       Can be compressed using data from the previous video frames
  B-frame       Can be compressed using data from both the preceding and the following video frames

The compressed video is a sequence of consecutive frames. When decoding it, the client looks for the first keyframe and then applies incremental modifications to it. If the first frame the client receives is a keyframe, it can start playing immediately; if it has just missed a keyframe, it has to wait for the next one before playback can begin, as the sketch below illustrates.
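
A minimal sketch of that behavior, using a made-up in-memory frame representation rather than a real decoder API:

```python
# A player that joins a live stream mid-GOP has to discard frames until it
# sees an I-frame, because P- and B-frames only describe changes relative
# to other frames. The frame dictionaries here are illustrative placeholders.

def playable_frames(frames):
    """Yield frames starting from the first keyframe (I-frame)."""
    seen_keyframe = False
    for frame in frames:
        if frame["type"] == "I":
            seen_keyframe = True
        if seen_keyframe:
            yield frame

# A viewer joins just after a keyframe was sent: everything before the
# next I-frame has to be thrown away.
stream = [{"type": t, "index": i} for i, t in enumerate("PBBPBBIPBBPBB")]
print([f["index"] for f in playable_frames(stream)])  # [6, 7, 8, 9, 10, 11, 12]
```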

Figure 3 – Video encoding GOP

A Group of Pictures (GOP) specifies how video frames are organized: an encoded video stream consists of successive GOPs. Because each GOP begins with a keyframe, the GOP size affects latency on the player side, and it is also closely tied to the network bandwidth the video occupies. For live streaming on mobile devices, the GOP is generally set to 1 ~ 4 seconds, though a longer GOP can be used to reduce bandwidth consumption [4].

The GOP chosen when encoding the video determines the keyframe interval, and therefore how long a client may have to wait before it finds its first keyframe and can start playing. This directly affects the latency of the live stream, and second-level delays have an obvious impact on a live streaming business. The GOP setting is ultimately a trade-off between video quality, bandwidth, and latency.
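
As a rough illustration, assuming the viewer joins at a uniformly random point within a GOP (an assumption made here only for the sake of the example):

```
worst-case wait for the first keyframe ≈ GOP length
average wait at a random join point    ≈ GOP length / 2

GOP = 4 s  →  up to 4 s of extra start-up delay, about 2 s on average
GOP = 1 s  →  up to 1 s of extra start-up delay, but more keyframes and a higher bitrate
```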

Data transfer

Different application-layer protocols can be used to transmit audio and video data. The two most common are Real Time Messaging Protocol (RTMP) [5] and HTTP Live Streaming (HLS) [7], which deliver audio and video in different ways: RTMP can be thought of as distributing data as a stream, while HLS distributes audio and video as files.

Figure 4 – Streaming media data transfer protocol

RTMP is an application-layer protocol built on top of TCP. It splits audio and video streams into fragments for transmission; by default, audio fragments are 64 bytes and video fragments are 128 bytes. With RTMP, all data is transferred as chunks:

Figure 5 – RTMP chunks

Each RTMP chunk carries a protocol header of 1 to 18 bytes, consisting of a Basic Header, a Message Header, and an Extended Timestamp. Apart from the Basic Header, which carries the chunk stream ID and the chunk type, the other two parts can be omitted, so a chunk in transit may need only a 1-byte header, which keeps the overhead very low [6].
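
A sketch of how the Basic Header described above might be parsed, based on the chunking rules in Adobe’s RTMP specification [6]; this is a simplified illustration rather than a complete RTMP implementation:

```python
# The top 2 bits of the first byte select the chunk type (which Message
# Header form follows) and the low 6 bits carry the chunk stream ID;
# the special values 0 and 1 signal the longer 2- and 3-byte forms.

def parse_basic_header(data: bytes):
    """Return (fmt, chunk_stream_id, header_length) for one chunk."""
    first = data[0]
    fmt = first >> 6                      # chunk type
    csid = first & 0x3F                   # chunk stream ID
    if csid == 0:                         # 2-byte form: IDs 64..319
        return fmt, 64 + data[1], 2
    if csid == 1:                         # 3-byte form: IDs 64..65599
        return fmt, 64 + data[1] + (data[2] << 8), 3
    return fmt, csid, 1                   # common case: a 1-byte Basic Header

print(parse_basic_header(bytes([0xC3])))  # (3, 3, 1): a minimal 1-byte header
```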

HLS is a bitrate-adaptive streaming protocol built on HTTP that Apple released in 2009. When a player receives a pull-stream address that uses HLS, it first fetches an M3U8 file like the one below from that address:

```
#EXTM3U
#EXT-X-TARGETDURATION:10
#EXTINF:9.009,
http://media.example.com/first.ts
#EXTINF:9.009,
http://media.example.com/second.ts
#EXTINF:3.003,
http://media.example.com/third.ts
```

M3U8 is a multimedia playlist file format [8]. The file lists a series of video slices, and the player plays each slice in turn according to the descriptions in the file. HLS splits the live stream into small files and uses M3U8 to organize them; when playing the stream, the player downloads and plays the split TS files in the order the M3U8 file describes, roughly as the sketch below shows.
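
A minimal sketch of how a player might consume such a playlist: parse the #EXTINF entries, then fetch and play each TS segment in order. The parsing helper and the print stand-in for actual downloading are illustrative assumptions, not a real player:

```python
def parse_m3u8(text: str):
    """Return (duration, url) pairs from the #EXTINF entries of a playlist."""
    segments, duration = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#EXTINF:"):
            duration = float(line[len("#EXTINF:"):].rstrip(","))
        elif line and not line.startswith("#"):
            segments.append((duration, line))
    return segments

playlist = """#EXTM3U
#EXT-X-TARGETDURATION:10
#EXTINF:9.009,
http://media.example.com/first.ts
#EXTINF:9.009,
http://media.example.com/second.ts
"""

for duration, url in parse_m3u8(playlist):
    print(f"fetch {url}, play for {duration}s")  # a real player would download and decode here
```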

Figure 6 – M3U8 and TS files

The size of the TS slices produced by HLS affects the end-to-end latency of the broadcast. Apple’s official documentation recommends 6-second TS slices, which means the delay from the host to the audience grows by at least 6 seconds. Shorter slices are not impossible, but they bring considerable extra overhead and storage pressure.
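
Players also commonly hold several segments in the buffer before starting playback (Apple’s HLS guidance is often summarized as keeping about three target durations buffered), so the latency contributed by segmentation is a multiple of the slice length. As a rough, assumed calculation:

```
latency added by HLS segmentation ≈ buffered segments × segment duration
e.g. 3 segments × 6 s per segment ≈ 18 s, before encoding, transcoding and CDN delays are counted
```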

Although every application-layer protocol is limited by the MTU [9] of the underlying physical devices and can only transmit audio and video in pieces, the granularity with which a protocol segments the data determines the end-to-end network delay. Stream-based protocols such as RTMP and HTTP-FLV use fine-grained segments and achieve delays below 3 s, so they can be regarded as real-time transmission protocols. HLS is based on file distribution with very coarse slices, which can lead to delays of 20 to 30 seconds in practice.

Note that file distribution is not synonymous with high latency; the slice size is the key factor that determines the delay. A real-time streaming protocol must keep its slices small while also keeping the extra overhead they introduce low.

Multi-terminal caching

The pipeline in a live streaming architecture is very long, and we cannot guarantee that every link in it is stable. To provide smooth data delivery and a good user experience, both the server and the client add caches to cope with audio and video stuttering.

The server usually caches part of the live data before forwarding it to clients: when the network suddenly jitters, it can fall back on the cached data to keep the stream smooth, and it refills the cache once the network recovers. Clients likewise use a read-ahead buffer to improve playback quality. We can shrink the buffers to improve real-time performance, but the viewer’s experience then suffers badly whenever the network becomes unstable [10].
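
A simplified sketch of such a client-side read-ahead (playout) buffer; the class name, its parameters, and the numbers are assumptions made for illustration, not part of any real player:

```python
from collections import deque

# Playback starts only once a few seconds of media are buffered, and it
# stalls when the buffer runs dry.

class PlayoutBuffer:
    def __init__(self, prebuffer_seconds: float):
        self.prebuffer = prebuffer_seconds
        self.segments = deque()      # (duration, data) pairs waiting to play
        self.buffered = 0.0          # seconds of media currently buffered
        self.started = False

    def push(self, duration: float, data: bytes):
        """Called whenever a piece of the stream arrives from the network."""
        self.segments.append((duration, data))
        self.buffered += duration
        if not self.started and self.buffered >= self.prebuffer:
            self.started = True      # enough data buffered: start playback

    def pop_for_playback(self):
        """Return the next piece to play, or None if playback must stall."""
        if not self.started or not self.segments:
            return None              # still prebuffering, or a stutter
        duration, data = self.segments.popleft()
        self.buffered -= duration
        return data

# A larger prebuffer absorbs more network jitter but adds the same number
# of seconds to the end-to-end latency; a smaller one is closer to real
# time but stalls as soon as delivery hiccups.
buffer = PlayoutBuffer(prebuffer_seconds=2.0)
```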

Conclusion

High latency in live streaming is a problem of systems engineering. Compared with one-to-one real-time communication such as a WeChat video call, the link between the producer and the consumers of a live stream is very long, and many factors affect what the host and the audience experience. Because of bandwidth costs, historical inertia, and unreliable networks, we can only tackle these problems with different techniques, and what gets sacrificed is the user experience:

  1. The full audio and video data is far too large – audio and video encoding compresses it into keyframes plus incremental modifications, and the keyframe interval (GOP) determines the longest time a client may have to wait before rendering its first picture;
  2. Browsers have limited support for real-time streaming protocols – HLS distributes slices of the live stream over HTTP, which can introduce 20 to 30 seconds of delay between the host and the audience;
  3. Long links introduce uncertainty – the server and the client use caches to reduce the impact of network jitter on broadcast quality.

All of the factors above affect the end-to-end delay of a live streaming system. In a typical system, RTMP and HTTP-FLV can achieve a delay of less than 3 s, but the GOP and multi-terminal caching push this figure up, and a delay within 10 s is considered normal. To finish, here are some more open-ended questions for interested readers to ponder:

  • How much extra overhead do file-based streaming protocols introduce?
  • What are the compression ratios of different video encoding formats?

If you have questions about the content of this article, or want to learn the reasons behind other design decisions in software engineering, you can leave a comment below. The author will respond to the questions and select suitable topics for future articles.

Further reading

  • Why does TCP require three handshakes to establish a connection?
  • Why does TCP have performance problems?
  1. Wikipedia: Advanced Audio Coding, en.wikipedia.org/wiki/Advanc… ↩
  2. Wikipedia: Advanced Video Coding, en.wikipedia.org/wiki/Advanc… ↩
  3. Wikipedia: Video compression picture types, en.wikipedia.org/wiki/Video_… ↩
  4. Reduce Bandwidth Consumption by GOP Settings, www2.acti.com/support_old… ↩
  5. Wikipedia: Real-Time Messaging Protocol, en.wikipedia.org/wiki/Real-T… ↩
  6. H. Parmar, Ed., M. Thornburgh, Ed. December 2012. “Adobe’s Real Time Messaging Protocol”, 5.3. Chunking, wwwimages2.adobe.com/content/dam… ↩
  7. Wikipedia: HTTP Live Streaming, en.wikipedia.org/wiki/HTTP_L… ↩
  8. Wikipedia: M3U, en.wikipedia.org/wiki/M3U ↩
  9. Why does the TCP/IP protocol split data? draveness.me/whys-the-de… ↩
  10. Live delay time, YouTube, support.google.com/youtube/ans… ↩
