Today, the challenges of mobile live streaming go far beyond those of traditional broadcast equipment or PC-based streaming. The complete processing pipeline includes, but is not limited to: audio/video capture, beauty/filter/effects processing, encoding, packaging (muxing), stream pushing, transcoding, distribution, decoding/rendering/playback, and so on.

Even as the technology matures, mobile live streaming still has plenty of problems. Rather than list them all here, I will first lay out a way of thinking about them, so that the solutions follow naturally.

What is video?

First we need to understand a basic concept: video. Intuitively, a video is an engaging piece of footage; it can be a film or a short clip, a coherent stream of rich images and sound. Viewed rationally, however, a video is structured data that can be broken down, in engineering terms, into the following structure:


Content elements: image and audio, plus the accompanying Metadata

Encoding format (Codec): video: H.264, H.265, …; audio: AAC, HE-AAC, …

Container: MP4, MOV, FLV, RM, RMVB, …

Structurally, any video file is composed this way: image and audio make up the most basic content elements; the image data is compressed with a video encoding format (usually H.264); the audio data is compressed with an audio encoding format (e.g. AAC); the corresponding Metadata describes them;

finally, everything is packaged into a Container (such as MP4) to form a complete video file.

If that’s hard to understand, think of it as a bottle of ketchup. The outermost bottle is like the Container. The information on the bottle, such as the raw material and processing place, is like Metadata. After the bottle cap is opened (unsealed), the ketchup itself is like the coding content after compression. Tomatoes and spices are the original Content.
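To make this structure concrete, here is a minimal Objective-C sketch (mine, not from the original text) that uses AVFoundation to peek inside such a "bottle": the container, its tracks, their codecs, and the metadata. The file URL is assumed to point at some local MP4.

#import <AVFoundation/AVFoundation.h>

// Inspect a local video file: container -> tracks (content elements) -> codec -> metadata.
static void InspectVideoFile(NSURL *fileURL) {
    AVURLAsset *asset = [AVURLAsset URLAssetWithURL:fileURL options:nil];   // the Container (e.g. MP4)

    // Content elements: the video and audio tracks inside the container.
    AVAssetTrack *videoTrack = [asset tracksWithMediaType:AVMediaTypeVideo].firstObject;
    if (videoTrack == nil) return;   // not a playable video file

    // Codec: the format description says how the track was compressed (e.g. 'avc1' for H.264).
    CMFormatDescriptionRef videoDesc =
        (__bridge CMFormatDescriptionRef)videoTrack.formatDescriptions.firstObject;
    FourCharCode codec = CMFormatDescriptionGetMediaSubType(videoDesc);

    // Metadata: descriptive information carried alongside the content.
    for (AVMetadataItem *item in asset.metadata) {
        NSLog(@"metadata %@ = %@", item.commonKey, item.value);
    }

    NSLog(@"duration %.1fs, video codec '%c%c%c%c'",
          CMTimeGetSeconds(asset.duration),
          (char)(codec >> 24), (char)(codec >> 16), (char)(codec >> 8), (char)codec);
}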

1. Real-time video transmission

In short, a rational understanding of the structure of video helps us understand live video. If video is a kind of "structured data", then live video is simply a way of transmitting that "structured data" in real time.

So the obvious question is: how do you transmit this "structured data" (video) in real time?

Here lies a paradox: a video encapsulated in a Container is necessarily an immutable, finished file. It is already a finished product and, to borrow from relativity, can never be precisely "real time"; by the time you watch it, it is already a memory of another time and place.

Therefore, live video must be a process of "producing, transmitting and consuming at the same time". This means we need to take a closer look at the intermediate step (encoding) that turns the original content elements (images and audio) into the finished product (a video file).

2. Video encoding and compression

Let's understand video encoding and compression in a nutshell.

To make video content easier to store and transmit, its volume usually has to be reduced, that is, the original content elements (image and audio) must be compressed; the compression algorithm is what we call the encoding format. For example, the raw image data in a video is typically compressed with the H.264 encoding format, and the raw audio samples with the AAC encoding format.
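As a concrete, hedged illustration on iOS, the hardware H.264 encoder can be driven through Apple's VideoToolbox. This is only a minimal sketch of creating a compression session; the resolution is an arbitrary example, and the output callback only hints at where packaging and pushing would happen.

#import <VideoToolbox/VideoToolbox.h>

// Called by VideoToolbox with each encoded (compressed) frame.
static void DidCompressFrame(void *refcon, void *sourceFrameRefCon, OSStatus status,
                             VTEncodeInfoFlags infoFlags, CMSampleBufferRef sampleBuffer) {
    if (status == noErr && sampleBuffer != NULL) {
        // Here the compressed H.264 data would be packaged (e.g. into an RTMP/FLV stream) and pushed.
    }
}

static VTCompressionSessionRef CreateH264Session(void) {
    VTCompressionSessionRef session = NULL;
    // 1280x720 is just an example resolution.
    OSStatus status = VTCompressionSessionCreate(kCFAllocatorDefault, 1280, 720,
                                                 kCMVideoCodecType_H264,
                                                 NULL, NULL, NULL,
                                                 DidCompressFrame, NULL, &session);
    if (status != noErr) return NULL;
    // Live streaming usually asks the encoder for real-time behaviour.
    VTSessionSetProperty(session, kVTCompressionPropertyKey_RealTime, kCFBooleanTrue);
    VTCompressionSessionPrepareToEncodeFrames(session);
    return session;
}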

Once the video content has been encoded and compressed, it really is easier to store and transmit. But to watch it, a corresponding decoding process is required, so between encoding and decoding there obviously has to be a convention that both the encoder and the decoder understand. For video images, the convention is simple:

the encoder encodes multiple images to produce a GOP (Group of Pictures), and the decoder reads one GOP at a time, decodes it, and then renders the resulting pictures for display.





A GOP (Group of Pictures) is a group of consecutive pictures, consisting of one I frame and several B/P frames. It is the basic access unit for video encoders and decoders, and the pattern repeats until the end of the stream.

An I frame is an intra-coded frame (also known as a key frame); a P frame is a forward-predicted frame (forward reference frame); a B frame is a bidirectionally predicted frame (bidirectional reference frame). Simply put, an I frame is a complete picture, while P and B frames record changes relative to the I frame.

P and B frames cannot be decoded without an I frame.

To sum up, the image data in a video is a set of GOPs, and a single GOP is a set of I/P/B frames.

In such a geometric relationship, Video is like an “object”, GOP is like a “molecule”, and I/P/B frame image is like an “atom”.

Imagine if, instead of transmitting the whole "object", we could transmit the "atoms" one by one, sending the smallest particles out as fast as light travels. What would our eyes perceive?

What is live video?

It is not hard to imagine that live streaming is exactly that experience: a technology that takes the smallest particles of video content (I/P/B frames, audio frames, …) and transmits them, in timestamp order, as fast as the network allows.

In short, live streaming is the continuous, streaming transmission of every frame of data (video/audio/data frames), each carrying a time-sequence label (Timestamp). The sender continuously captures audio and video, encodes, packages and pushes the stream; the relay/distribution network spreads it; and the player continuously downloads, decodes and plays the data in time order. This is how "producing, transmitting and consuming at the same time" is realized.
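To make the "time-stamped frames" idea concrete, here is a small Objective-C sketch (my own illustration) showing how an encoded frame coming out of an iOS encoder, delivered as a CMSampleBuffer, can be inspected for its timestamp and for whether it is a key frame (I frame); the packaging/pushing step is only hinted at in the final comment.

#import <Foundation/Foundation.h>
#import <CoreMedia/CoreMedia.h>

// Inspect one encoded frame before it is packaged and pushed.
static void HandleEncodedFrame(CMSampleBufferRef sampleBuffer) {
    // Every frame carries a presentation timestamp (PTS): the "time sequence label".
    CMTime pts = CMSampleBufferGetPresentationTimeStamp(sampleBuffer);

    // A frame is a key frame (I frame) if it is NOT marked "not sync".
    BOOL isKeyFrame = YES;
    CFArrayRef attachments = CMSampleBufferGetSampleAttachmentsArray(sampleBuffer, false);
    if (attachments != NULL && CFArrayGetCount(attachments) > 0) {
        CFDictionaryRef attachment = (CFDictionaryRef)CFArrayGetValueAtIndex(attachments, 0);
        isKeyFrame = !CFDictionaryContainsKey(attachment, kCMSampleAttachmentKey_NotSync);
    }

    NSLog(@"frame pts=%.3fs keyFrame=%d", CMTimeGetSeconds(pts), isKeyFrame);
    // Next step: wrap the frame and its timestamp into the streaming format and send it out.
}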

Having understood these two basic concepts, video and live streaming, we can now look at the business logic of live streaming.

1. Business logic of live streaming

The following is a simplified one-to-many live streaming business model, together with the protocols used between the layers.


The differences among protocols are compared as follows:



These are some of the basic concepts behind live streaming technology. Next, let's look at the performance indicators that shape the viewer's experience.

3. Live streaming performance indicators that affect the viewing experience

1. The first performance indicator: latency

Latency is the time it takes for data to be sent from an information source to a destination.


According to Einstein's special theory of relativity, the speed of light is the maximum speed at which energy, matter and information can travel, and that puts an upper bound on transmission speed. So even what the naked eye perceives as "real time" actually carries some delay.

Since RTMP and HLS are application-layer protocols carried over TCP, every connection pays for the TCP three-way handshake, the four-way teardown, and the round-trip times (RTT) incurred during slow start. All of these interactions add latency.

Secondly, because of TCP's loss-and-retransmit behaviour, network jitter can cause packet loss and retransmission, which also indirectly increases latency.

A complete live streaming pipeline includes, but is not limited to, the following steps: capture, processing, encoding, packaging, stream pushing, transmission, transcoding, distribution, stream pulling, decoding and playback. The lower the end-to-end latency from pushing the stream to playing it back, the better the user experience.

2. The second performance indicator: stuttering

Stuttering means the picture freezes during playback, so that viewers clearly feel the stream is "stuck". The number of stutters per unit of playback time is called the stutter rate.

Stuttering may be caused by an interruption of data transmission at the push end, by congestion or jitter on the public network, or by poor decoding performance on the playback device. The less stuttering, or none at all, the better the user experience.

3. The third performance indicator: first-screen time

This is the time the viewer waits between first tapping "play" and seeing a picture. Technically, it is the time the player needs to decode and render the first frame. The commonly mentioned "instant start" means the picture appears within one second of tapping play. The faster the first screen appears, the better the user experience.

These three performance indicators correspond to the user-experience demands of low latency, smooth HD playback and instant start, respectively. Understanding them is crucial for optimizing the user experience of a mobile live streaming app.

What are the common pitfalls in mobile live streaming?

Based on practical experience, the pitfalls of live video on mobile platforms fall into two areas: device differences, and the technical challenges posed by the network environment.

Pitfalls and workarounds in mobile live streaming

1. Encoding differences across chip platforms

On iOS, whether you use hardware or software encoding, there are almost no encoding differences across chip platforms, since every device is made by Apple.

On Android, however, the MediaCodec encoder provided by the Android Framework SDK varies greatly across chip platforms. Different manufacturers use different chips, MediaCodec behaves slightly differently on each of them, and achieving full platform compatibility is usually not cheap.

In addition, the H.264 produced by Android MediaCodec hardware encoding is typically limited to the baseline profile, so the picture quality is usually mediocre. On Android it is therefore recommended to use software encoding; the advantages are adjustable picture quality and better compatibility.

2. How to capture and encode efficiently on low-end devices

For example, the camera outputs raw frames, and a single frame is not small. If the capture frequency is high and the encoding frame rate is high, and every frame is fed through the encoder, the encoder may be overloaded.

In that case we can consider selectively dropping frames before encoding, without visibly affecting picture quality (recall the earlier discussion of what frame rate means at the micro level), so as to reduce the load and power consumption of the encoding stage; a rough sketch follows below.
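A minimal sketch of the idea, assuming an AVFoundation capture pipeline on iOS; the keep-one-in-two ratio and the encoder hook are placeholders, not part of the original text.

#import <AVFoundation/AVFoundation.h>

// Drop frames before encoding: e.g. keep 15 fps out of a 30 fps capture.
static const NSUInteger kKeepEveryNthFrame = 2;   // placeholder ratio

@interface CaptureHandler : NSObject <AVCaptureVideoDataOutputSampleBufferDelegate>
@property (nonatomic, assign) NSUInteger frameCounter;
@end

@implementation CaptureHandler
- (void)captureOutput:(AVCaptureOutput *)output
didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer
       fromConnection:(AVCaptureConnection *)connection {
    self.frameCounter += 1;
    if (self.frameCounter % kKeepEveryNthFrame != 0) {
        return;   // dropped: this frame never reaches the encoder
    }
    // [self.encoder encodeFrame:sampleBuffer];   // hypothetical encoder hook
}
@end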

3. How to push a smooth HD stream over a weak network

On mobile networks it is easy to run into unstable connections, connection resets, and disconnect/reconnect cycles. On one hand, frequent reconnection pays the overhead of re-establishing connections; on the other, bandwidth can become a bottleneck, especially while switching between GPRS / 2G / 3G / 4G. When bandwidth is insufficient, content with a high frame rate/bit rate is hard to deliver, so variable bit rate support is required.

In other words, the push end can detect the network state, run a simple speed test, and dynamically switch the bit rate so that the stream stays smooth across network changes.

Secondly, the encoding, packaging and pushing logic can also be fine-tuned. We can selectively drop frames, for example non-reference video frames (never I frames or audio frames), which reduces the amount of data to transmit while still preserving picture quality and audio-visual fluency. A rough sketch of bit rate adaptation follows below.
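A minimal sketch of the idea, assuming the push end encodes with a VideoToolbox session as sketched earlier; the queue-length thresholds and bit rate steps are made-up numbers for illustration.

#import <VideoToolbox/VideoToolbox.h>

// Adjust the encoder's target bit rate based on how backed up the send queue is.
static void AdjustBitrateForSendQueue(VTCompressionSessionRef session, NSUInteger pendingPackets) {
    int32_t bitrate;
    if (pendingPackets > 200) {          // badly congested: drop to a low bit rate
        bitrate = 300 * 1000;
    } else if (pendingPackets > 50) {    // mildly congested
        bitrate = 600 * 1000;
    } else {                             // healthy network
        bitrate = 1200 * 1000;
    }
    CFNumberRef value = CFNumberCreate(kCFAllocatorDefault, kCFNumberSInt32Type, &bitrate);
    VTSessionSetProperty(session, kVTCompressionPropertyKey_AverageBitRate, value);
    CFRelease(value);
}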

4. Distinguish the state of the live stream from the state of the business

The live stream is a media stream, while the app's interactions form an API signaling flow. The two states must not be confused; in particular, the state of the live stream cannot be inferred from the state of the app's API interactions.


Those are some common pitfalls and workarounds in mobile live streaming.

Other optimizations for mobile live streaming

1. How to optimize startup speed and reach the legendary "instant start"?

You may notice that some mobile live streaming apps on the market start playing almost instantly, while others take several seconds after the play button is tapped. What accounts for such a huge difference?

Most players can only start decoding and playing after receiving a complete GOP, and players ported from FFmpeg may even wait for audio/video timestamp synchronization before playing (if a stream has no audio, the video starts only after the audio wait times out).

Instant start can be approached from the following angles:

1) Rewrite the player logic so that it renders the first key frame as soon as it receives it.

The first frame of a GOP is usually a key frame; because the amount of data that has to be loaded for it is small, this achieves "first-frame instant start".

If the live streaming server supports GOP caching, the player can receive data immediately after establishing a connection with the server, saving the time of going back to the origin across regions and carriers.

The GOP embodies the key frame period, i.e. the distance between two key frames, which is the maximum number of frames in a frame group. If a video has a constant frame rate of 24 fps (24 frames per second) and a key frame period of 2 s, then one GOP contains 48 frames. Generally speaking, you need at least one key frame per second of video.

Increasing the number of key frames improves picture quality (the GOP length is usually a multiple of the FPS), but it also increases bandwidth and network load. It also means that the player has to download a whole GOP, which has a non-trivial size; if the player's network is poor, it may not download the GOP fast enough to start within a second, hurting the visual experience. (A VideoToolbox sketch for configuring this key frame period appears at the end of this point.)

If the player's behaviour cannot be changed to render the first frame immediately, the live streaming server can also help, for example by caching two key frames instead of a whole GOP (reducing the number of images), which greatly shrinks the amount of content the player must load.
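For reference, here is a minimal sketch of setting that key frame period with VideoToolbox hardware encoding on iOS, expressing the "24 fps, 2 s period, GOP of 48 frames" example above; the session is assumed to have been created as sketched earlier.

#import <VideoToolbox/VideoToolbox.h>

// 24 fps with one key frame every 2 seconds => a GOP of 48 frames.
static void ConfigureKeyFramePeriod(VTCompressionSessionRef session) {
    int32_t gopFrames = 48;       // frames per GOP
    int32_t gopSeconds = 2;       // seconds between key frames
    CFNumberRef frames = CFNumberCreate(kCFAllocatorDefault, kCFNumberSInt32Type, &gopFrames);
    CFNumberRef seconds = CFNumberCreate(kCFAllocatorDefault, kCFNumberSInt32Type, &gopSeconds);
    VTSessionSetProperty(session, kVTCompressionPropertyKey_MaxKeyFrameInterval, frames);
    VTSessionSetProperty(session, kVTCompressionPropertyKey_MaxKeyFrameIntervalDuration, seconds);
    CFRelease(frames);
    CFRelease(seconds);
}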

2) Optimize the app's business logic.

For example, resolve DNS ahead of time (saving tens of milliseconds) and pre-select the best line with a speed test. After such preprocessing, download performance improves greatly the moment the play button is tapped. (A small DNS pre-resolution sketch follows below.)

On one hand, performance can be optimized at the transport level; on the other, the business logic around the user's playback behaviour can be optimized. The two complement each other as avenues for instant-start optimization.
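A minimal sketch of DNS pre-resolution using the standard getaddrinfo call on a background queue; the host name and the RTMP port are placeholders, and a real app would typically also keep the resolved address for later use.

#import <Foundation/Foundation.h>
#include <netdb.h>
#include <string.h>

// Resolve the streaming host before the user taps play, so the lookup cost is not paid at startup.
static void PreresolveHost(NSString *host) {
    dispatch_async(dispatch_get_global_queue(QOS_CLASS_UTILITY, 0), ^{
        struct addrinfo hints;
        memset(&hints, 0, sizeof(hints));
        hints.ai_socktype = SOCK_STREAM;

        struct addrinfo *result = NULL;
        if (getaddrinfo(host.UTF8String, "1935", &hints, &result) == 0) {   // 1935: RTMP port
            freeaddrinfo(result);   // the lookup itself is what we wanted to do ahead of time
        }
    });
}

// Usage with a hypothetical host: PreresolveHost(@"rtmp-live.example.com");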

Beyond optimizations on the mobile client, the server-side architecture of the live streaming service can also reduce latency. For example, the ingest server can proactively push GOPs to edge nodes, and the edge nodes cache them, so the player can load a GOP quickly without the delay of going back to the origin.





Secondly, processing and distribution can be done close to the end user.


2. How to deal with beauty filters?

In mobile live streaming this is a must-have: a live streaming app without beauty features is not popular with streamers. After a frame is captured, the raw data can be passed to a filter-processing callback before it reaches the encoder; once the filter has processed the raw data, it is handed on to the encoder for encoding. A minimal sketch follows below.
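This sketch uses Core Image only to illustrate the "filter before encoder" hook; the specific filter and its parameters are arbitrary examples, and real beauty filters usually run custom GPU shaders (e.g. via GPUImage).

#import <CoreImage/CoreImage.h>
#import <CoreVideo/CoreVideo.h>

// Filter a captured frame before it is handed to the encoder.
// sourceBuffer comes from the camera; destinationBuffer would come from a pixel buffer pool.
static void ApplyFilterBeforeEncoding(CVPixelBufferRef sourceBuffer,
                                      CVPixelBufferRef destinationBuffer,
                                      CIContext *context) {
    CIImage *input = [CIImage imageWithCVPixelBuffer:sourceBuffer];

    // A simple colour adjustment standing in for a "beauty" effect.
    CIFilter *filter = [CIFilter filterWithName:@"CIColorControls"];
    [filter setValue:input forKey:kCIInputImageKey];
    [filter setValue:@(1.05) forKey:kCIInputSaturationKey];
    [filter setValue:@(0.02) forKey:kCIInputBrightnessKey];

    // Render the filtered image into the destination buffer, which then goes to the encoder.
    [context render:filter.outputImage toCVPixelBuffer:destinationBuffer];
}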

3. How to keep playback continuously smooth, without stuttering?

"Instant start" addresses the first-load experience. How do we keep picture and sound smooth for the whole duration of a live session? After all, a live broadcast is not a one-shot request like HTTP; it is a long-lived, socket-level connection that lasts until the host stops pushing the stream.

We defined stuttering above: frames freeze during playback and the eye notices immediately. Setting aside differences in device performance, let's look at how to keep a live stream free of stuttering in the face of network transmission problems.

This is really a fault-tolerance problem, because the transmission network is unreliable during a live session. For example, the player may briefly lose its network connection and then quickly recover. In such a scenario, if the player has no fault tolerance, a black screen or a full reload is hard to avoid.

To tolerate such network errors without the end user noticing, the client player can maintain a FIFO (first-in, first-out) buffer queue: the decoder reads data from the playback buffer, while the buffer keeps downloading data from the live streaming server. The buffer capacity is usually measured in time (for example 3 s). When the player's network becomes unreliable, the client buffer bridges the gap as if nothing had happened. A minimal sketch follows below.
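A minimal sketch of such a time-based FIFO buffer; the frame type is left opaque, the 3-second target is the example above, and a real implementation would also need thread safety between the download and decode sides.

#import <Foundation/Foundation.h>

// A simple FIFO of downloaded frames; capacity is measured in seconds of content.
@interface FrameBuffer : NSObject
@property (nonatomic, strong) NSMutableArray *frames;          // frames in arrival order
@property (nonatomic, assign) NSTimeInterval bufferedSeconds;  // how much content is queued
@property (nonatomic, assign) NSTimeInterval targetSeconds;    // e.g. 3.0
@end

@implementation FrameBuffer
- (instancetype)init {
    if ((self = [super init])) {
        _frames = [NSMutableArray array];
        _targetSeconds = 3.0;
    }
    return self;
}

// Network side: append a frame of `duration` seconds as soon as it has been downloaded.
- (void)push:(id)frame duration:(NSTimeInterval)duration {
    [self.frames addObject:frame];
    self.bufferedSeconds += duration;
}

// Decoder side: pop the oldest frame; nil means the buffer ran dry (stutter risk).
- (id)popWithDuration:(NSTimeInterval)duration {
    if (self.frames.count == 0) return nil;
    id frame = self.frames.firstObject;
    [self.frames removeObjectAtIndex:0];
    self.bufferedSeconds -= duration;
    return frame;
}
@end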

Obviously, this is only a delaying tactic. If an edge node of the live streaming server fails while the client player stays connected to it, then no matter how large the client buffer is, it is useless: it never receives a disconnect signal from the peer, so the client's business logic has to step in and reschedule.

The important thing is that client and server can schedule precisely. For example, before playback starts, the edge node with the best line quality can be assigned based on the IP's geographic location and carrier; during playback, quality metrics such as the frame rate can be reported in real time, and the line dynamically adjusted according to the stream's quality.

6 Q & A

1. How often should key frames be set? Is the interval adjusted dynamically based on access conditions? If it is too long, isn't instant start of the first screen hard to achieve?

Xu Li: The longer the key frame interval, i.e. the longer the GOP, in theory the better the picture quality. However, when generating HLS the minimum segmentation granularity is one GOP, so for interactive live streaming it is generally not recommended to make the GOP too long. A key frame roughly every 2 seconds is typical for live streaming: for example, at a frame rate of 24 fps, two key frames are 48 frames apart and the GOP is 2 s.

2. Does Qiniu's live streaming use CDN acceleration? Did you run into any pitfalls?

Xu Li: For live streaming, Qiniu mainly builds its own nodes, but it also supports integrating many third-party CDN providers, offering customers better service through a diversified combination of lines. The problems encountered while working with third-party CDNs can be discussed and shared in more detail separately.

3. Besides optimizing the line, are there other ways to speed up RTMP live streams?

Xu Li: Physically, optimize the line; logically, optimize the strategy, for example selective frame dropping, which reduces the transmitted volume without affecting coding quality.

4. When pushing with OBS and playing HLS on the player side, if audio and video are out of sync, which link is the problem? How do you optimize it?

Xu Li: It may be a problem at the capture end. If the capture end's encoding pipeline is out of sync, the audio and video timestamps can be re-aligned on the ingest server, which acts as a global correction. If the player's decoding performance is the problem, the playback logic needs adjusting, for example selectively dropping frames while keeping the audio and video timestamps strictly consistent.

5. Isn't it the case that an I frame is not necessarily a key frame, while an IDR frame is? An IDR frame is an I frame, but an I frame is not necessarily an IDR frame, and only IDR frames are re-entry points.

Xu Li: "I frame" is usually just translated as "key frame", but since IDR frames were mentioned, let me expand a little. All IDR frames are I frames, but not all I frames are IDR frames; IDR frames are a subset of I frames. Strictly, an I frame is an intra-coded frame, a fully self-contained compressed picture, which is why "I frame" is commonly used to mean "key frame". An IDR frame is an I frame with extra control logic: when the decoder reaches an IDR frame, it immediately empties the reference frame queue, outputs or discards all decoded data, looks up the parameter sets again, and starts a new sequence. This provides an opportunity to resynchronize after a serious error in the previous sequence. Frames after an IDR frame never reference data from frames before it.

6. Have you investigated the nginx RTMP module? Why don't you use it?

Xu Li: We have looked into it. nginx_rtmp_module is single-process and multi-threaded; it is not written in something like Go, whose lightweight threads/goroutines give naturally concurrent semantics for writing streaming business logic. Nginx itself is a large code base (roughly 160,000 lines, of which little relates to live streaming), multi-tenant configuration means editing nginx.conf, which is workable for a single tenant but not flexible, and its business extensibility is limited: it meets basic needs but not advanced ones.

7. Which open source software do you use? Do you encode with x264? Did you develop the live streaming server yourselves, or is it open source?

Xu Li: The live streaming server is developed in Go. On mobile, hardware encoding is preferred; software encoding uses x264.

8. When pushing a stream with OBS to nginx_rtmp_module, is the video already compressed, or does something need to be developed on top of OBS?

Xu Li: OBS already does the encoding and compression; no further development is needed.

9. We want to seamlessly insert a TS file containing an advertisement into an HLS stream, so I have some questions: 1. Does the resolution of this TS stream have to match the previous video stream? 2. Does its PTS timestamp have to keep incrementing from the previous TS?

Xu Li: 1. It does not have to match. In that case the two videos are completely independent and have no relationship to each other; just insert a Discontinuity tag, and once the player recognizes the tag it resets its decoder parameters, so playback continues seamlessly and the picture switches smoothly. 2. The PTS does not have to keep incrementing. For example, video A is live and has played to a PTS of 5 s; insert a Discontinuity tag, then B; when B ends, insert another Discontinuity tag and return to A. At that point A's PTS can either continue incrementing from where it left off or be offset by the duration of the inserted B. In general, for VOD and time-shifted playback the PTS keeps incrementing continuously, while for live streaming the duration of B is taken into account.

Practically every player we can use here is based on FFmpeg; the AVPlayer that Apple provides does not support this kind of live stream. Bilibili has wrapped FFmpeg into ijkplayer, which we can use in an object-oriented way.

1. Search for ijkplayer on GitHub; you can see that both the Android and iOS versions are based on it.

2. Find the "Build iOS" section and open a terminal.

3. In the terminal, run:

cd Desktop/

git clone github.com/Bilibili/ij… ijkplayer-ios      (download ijkplayer-ios to the desktop)

cd Desktop/ijkplayer-ios

./init-ios.sh            (download FFmpeg)

cd ios                   (enter the ios folder to compile FFmpeg)

./compile-ffmpeg.sh clean

./compile-ffmpeg.sh all

4. After compiling, run the Demo. If the following interface appears, the Demo build succeeded:





We need to integrate Bilibili's demo into our own project, and there are two ways to do it:

1. The official way is project-in-project integration, which is complicated;

2. Package all of the IJK source code into a framework, so we only need to integrate that framework.

The steps for generating the framework are as follows:

1. Open the project under IJKMediaPlayer and click Edit Scheme…

2. In the window that pops up, change Debug to Release and click Close.

3. We need to build both a simulator version and a device version, then merge them.

Build the simulator version first: choose any simulator.

4. Press Command+B. When the product turns from red to black, the file has been generated; choose Show in Finder to locate it:


5. Then build the device version: with no phone connected, select Generic iOS Device.

Press Command+B again to generate the device version.


What I really want to merge is this file:


6. Merge method
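Merging the device build and the simulator build of the framework binary is typically done with lipo, roughly as follows (the paths are illustrative):

lipo -create \
  Release-iphoneos/IJKMediaFramework.framework/IJKMediaFramework \
  Release-iphonesimulator/IJKMediaFramework.framework/IJKMediaFramework \
  -output IJKMediaFramework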


7. When the merge is complete, replace the binary inside the device (iphoneos) IJKMediaFramework.framework with the merged file, then copy the IJKMediaFramework.framework one directory level up to the desktop for later use.

Playing a live stream really means playing a URL, and we already have the URL in our data source. To try the system player first, go to Target -> General -> Linked Frameworks and Libraries, click the plus sign, add AVFoundation.framework together with MediaPlayer.framework (which provides MPMoviePlayerViewController), and import the corresponding header (#import <MediaPlayer/MediaPlayer.h>):

MPMoviePlayerViewController *movieVC = [[MPMoviePlayerViewController alloc] initWithContentURL:[NSURL URLWithString:live.streamAddr]];

[self presentViewController:movieVC animated:YES completion:nil];

So far, tapping a cell brings up the playback interface, but nothing plays because the stream cannot be decoded; so we have to abandon this MPMoviePlayerViewController approach.

This is where the IJKMediaFramework.framework we prepared earlier comes in.

8. Drag the merged IJKMediaFramework.framework into our project, and be sure to tick the Copy option when adding it.


9. It also relies on some system frameworks. Go to GitHub, search for ijkplayer, and you will find the frameworks that need to be imported, as follows:

Start importing: go to Target -> General -> Linked Frameworks and Libraries and click the plus sign.

10. Then we need to create our own playback interface; create a new view controller:


11. Copy the contents of IJKMediaDemo to YKPlayerViewController

Import the header file: #import <IJKMediaFramework/IJKMediaFramework.h>

Add a property for the player, typed with the playback protocol: @property (atomic, retain) id<IJKMediaPlayback> player;

Then register the notification observers and remove them again (copy the corresponding code from the demo; a rough sketch is shown below). If a yellow warning remains, a required method has not been implemented yet, so copy that method over as well. What is left is creating the actual player (configuring the live stream).
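For reference, the observer registration copied from the demo looks roughly like this (the notification names come from IJKMediaPlayback.h; only two are shown, and loadStateDidChange: / moviePlayBackDidFinish: are handler methods implemented in this view controller):

- (void)installMovieNotificationObservers {
    [[NSNotificationCenter defaultCenter] addObserver:self
                                             selector:@selector(loadStateDidChange:)
                                                 name:IJKMPMoviePlayerLoadStateDidChangeNotification
                                               object:self.player];
    [[NSNotificationCenter defaultCenter] addObserver:self
                                             selector:@selector(moviePlayBackDidFinish:)
                                                 name:IJKMPMoviePlayerPlaybackDidFinishNotification
                                               object:self.player];
}

- (void)removeMovieNotificationObservers {
    [[NSNotificationCenter defaultCenter] removeObserver:self];
}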

So, in viewDidLoad, call [self initPlayer];

Then write the initialization method:

- (void)initPlayer {
    IJKFFOptions *options = [IJKFFOptions optionsByDefault];
    IJKFFMoviePlayerController *player = [[IJKFFMoviePlayerController alloc] initWithContentURLString:self.live.streamAddr withOptions:options];
    self.player = player;
    self.player.view.frame = self.view.bounds;
    self.player.shouldAutoplay = YES;
    [self.view addSubview:self.player.view];
}

Then pass the live object to it:

YKPlayerViewController *playerVC = [[YKPlayerViewController alloc] init];

playerVC.live = live;

[self.navigationController pushViewController:playerVC animated:YES];

Now you’re ready to play.
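One more detail worth noting, following the demo: the player still has to be prepared, and shut down when leaving the page (prepareToPlay and shutdown are part of the IJKMediaPlayback protocol); a rough sketch:

- (void)viewWillAppear:(BOOL)animated {
    [super viewWillAppear:animated];
    [self.player prepareToPlay];          // start loading/decoding; with shouldAutoplay it then plays
}

- (void)viewWillDisappear:(BOOL)animated {
    [super viewWillDisappear:animated];
    [self.player shutdown];               // stop playback and release the decoder
    [self.player.view removeFromSuperview];
}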