Introduction: Live video is on the radar of many technical teams and architects. In terms of real-time performance, most live video today is only quasi-real-time, with a delay of 1-3 seconds. In this article for "High Availability Architecture", Yuan Rongxi introduces the implementation behind controlling live video delay to around 500 ms.

Yuan Rongxi is an engineer at Xuebajun (学霸君), which he joined in 2015. He is responsible for Xuebajun's real-time network transmission and the architecture design and implementation of its distributed systems. He focuses on foundational technologies and has working knowledge of network transmission, database kernels, distributed systems, and concurrent programming.

Recently, driven by the company's business needs, we needed an architecture and technical solution for real-time interactive ultra HD video over the public Internet. As is well known, CDN + RTMP can satisfy most one-way live streaming services. We contacted and tested solutions from several CDN providers; for one-way broadcast they worked fine, but as soon as multi-party interaction was involved, the delay became very large and normal interactive conversation was impossible. For our online education company, live streaming without interaction is meaningless, so we decided to build an ultra HD (1080P) real-time video transmission solution ourselves.

First, let's explain what real-time video is. Real-time video means the whole process from image capture to consumption introduces no perceptible delay; any video service that meets this requirement can be called real-time. The real-time performance of video can be grouped into three levels:

  • Pseudo real-time: consumption delay above 3 seconds; one-way viewing only. The typical architecture is CDN + RTMP + HLS, and essentially all live streaming today uses this kind of technology.

  • Quasi real-time: consumption delay of 1-3 seconds; two-way interaction is possible but awkward. Some live streaming sites achieve this with TCP/UDP + FLV; YY Live belongs to this category.

  • Real time: consumption delay under 1 second, around 500 ms on average. This is true real-time: there is no noticeable delay when talking to someone. QQ, WeChat, Skype, and WebRTC all implement this kind of technology.

Most real-time video on the market is transmitted at 480P or below, which is often inadequate for online teaching, and even then fluency can be a big problem. We did a lot of exploratory research while implementing ultra HD real-time video, and we share most of the details here.

To shorten real-time delay, we first need to know where the delay comes from. Every link from capture, encoding, and transmission to final consumption of the video introduces delay, generally summarized in the following figure:

(Figure: sources of delay from capture to playback)

Imaging (capture) delay is largely beyond software's control, since it depends on CCD hardware. The best CCDs on the market run at 50 frames per second, giving an imaging delay of about 20 ms; ordinary CCDs run at only 20-25 fps, for 40-50 ms of imaging delay.

Encoding delay depends on the encoder and is discussed in a later section; there is relatively little room for optimization there.

Our design therefore focuses on network delay and playback buffer delay. Before diving into the technical details, let's review the relevant characteristics of video coding and network transmission.

1. About video coding

We know that the image captured from a CCD is generally in RGB (bitmap) format, which takes a great deal of space: three bytes describe the color of each pixel. At 1080P resolution a single frame is 1920 x 1080 x 3 ≈ 6 MB; even converted to JPEG it is nearly 200 KB, and at 12 frames per second JPEG would still require close to 2.4 MB/s of bandwidth, which is unacceptable for transmission over the public Internet.

Video encoders solve this problem. They perform motion detection on the changes between images and use various compression techniques to send only the changes to the other side; 1080P encoded with H.264 needs roughly 200-300 KB/s of bandwidth. Our solution uses H.264 as the default encoder (we are also working on H.265).

1.1 H.264 coding

As mentioned above, a video encoder compresses selectively based on how the image changes over time. Since the receiver has no image at all at the beginning, the encoder must start with a fully compressed frame: in H.264 this is the I frame. Subsequent images are compressed incrementally against this I frame; these incremental frames are called P frames. H.264 also introduces a bidirectionally predicted frame type, the B frame, to further reduce bandwidth. To limit the damage when intermediate P frames are lost, H.264 uses group-of-pictures (GOP) coding: a full I frame is sent at intervals, and the frames between one I frame and the next form a GOP. Their relationship is shown below:

PS: B frames are best avoided in live video. Because a B frame is bidirectionally predicted, it must be encoded with reference to subsequent frames, which adds codec delay.

1.2 Mosaics, stuttering, and fast start

As mentioned above, if a P frame within a GOP is lost, the decoded image will be wrong: this error shows up as mosaics. Because the motion information in between has been lost, the H.264 decoder fills in data based on the previous reference frame, but that data does not reflect the real motion, so color artifacts appear. This is the so-called mosaic phenomenon, as shown in the figure:

This is not what we want to see. To avoid it, once a P frame or I frame is found missing, none of the frames in that GOP are displayed until the next I frame arrives and the image is refreshed. But I frames are generated on a fixed period, which may be relatively long; if no image is shown until the next I frame, the video freezes, the so-called stuttering phenomenon. Losing so many consecutive frames that the decoder has nothing left to decode also causes severe stuttering. Both stuttering and mosaics at the decoding end are ultimately caused by frame loss, so the best remedy is to make frames reach the receiver as reliably as possible.

With H.264's GOP coding in mind, the so-called fast-start (instant playback) technique is straightforward: the sender simply starts sending from the most recent I frame of the current GOP, so the receiver can immediately decode a complete image and display it. However, this means a few extra frames at the start of the connection, which adds playback delay; the receiver handles this by decoding, but not displaying, expired frames until the current video frame falls within the playback time window.

1.3 Encoding delay and bit rate

Among the delays listed at the beginning, we mentioned encoding delay. Encoding delay is the time from the CCD's RGB data entering the H.264 encoder to the encoded frame coming out. We tested the latency of the latest version of x264 at various resolutions on an ordinary 8-core client machine, as follows:

As can be seen, the encoding delay of ultra HD video reaches about 50 ms. The only way to reduce it further is to optimize the encoder kernel so that encoding runs faster, which we are also working on.

At 1080P the encoded bit rate reaches about 300 KB/s; a single I frame can be as large as 80 KB and a single P frame up to 30 KB, which poses a severe challenge for real-time network transmission.

2. Network transmission quality factors

A key link in real-time interactive video is the network transmission technology. Whether early VoIP or today's popular video live streaming, the main means of communication is the TCP/IP protocol family. IP networks, however, are inherently unreliable, and video transmitted over them is prone to stuttering and delay. Let's look at the key factors that affect IP network transmission quality.

2.1 TCP and UDP

Anyone who has looked at live streaming will assume the preferred transport is TCP + RTMP. That view is rather one-sided: neither TCP nor RTMP dominates in large-scale real-time multimedia transmission. TCP is a congestion-fair transport protocol; its congestion control is designed to keep the network fair rather than to deliver data as fast as possible. As we know, the TCP layer only notifies the application to read data when packets arrive in order; if packets are reordered or lost, the application has to wait inside TCP. TCP's send-window buffering and retransmission mechanism therefore produce uncontrollable delays on an unstable network, and the more hops in the transmission path, the larger the delay.

About the principles of TCP:

TCP 的那些事儿(上)

About TCP retransmission delay:

http://weibo.com/p/1001603821691477346388

It is more reasonable to use UDP for real-time transmission. UDP avoids TCP's three-way handshake, four-way teardown, and various heavyweight transport features; all that is needed on top of UDP is simple link QoS monitoring and a packet retransmission mechanism, and its real-time behavior is better than TCP's. This is borne out by protocols such as RTP and DCCP, which we referenced when designing our own communication protocol.

2.2 Delay

An important factor in evaluating network communication quality and latency is the round-trip time (RTT). Evaluating the RTT between two endpoints is simple, roughly as follows:

  1. The sender sends a ping packet with the local timestamp T1 to the receiver.

  2. On receiving the ping packet, the receiver constructs a PONG packet carrying the same T1 and sends it back to the sender.

  3. When the sender receives the PONG, it reads its local timestamp T2; T2 - T1 is the RTT for this probe.

The schematic diagram is as follows:

(Figure: RTT measurement via PING/PONG)

The probe in the steps above can be run, say, once per second. To absorb sudden spikes in network delay, we smooth the RTT with an exponential forgetting (decay) filter like the one TCP uses. Suppose the current smoothed value is rtt and the RTT measured by this probe is keep_rtt; then the new rtt is:

new_rtt = (7 * rtt + keep_rtt) / 8

Since the keep_rtt measured each time may differ, we also maintain an RTT deviation value, rtt_var, computed as:

new_rtt_var = (3 * rtt_var + abs(rtt - keep_rtt)) / 4

rtt_var reflects the magnitude of the network jitter.
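A minimal sketch of this smoothing in C, following the two formulas above (the first-probe initialization is an assumption, not from the article):

```c
#include <stdlib.h>  /* abs */

/* Smoothed RTT state, updated once per PING/PONG probe. */
typedef struct {
    int rtt;      /* smoothed round-trip time, in ms */
    int rtt_var;  /* smoothed RTT deviation (jitter), in ms */
} rtt_state_t;

/* keep_rtt is the RTT measured by the latest PING/PONG exchange. */
static void update_rtt(rtt_state_t *s, int keep_rtt)
{
    if (s->rtt == 0) {               /* assumed first-probe handling */
        s->rtt = keep_rtt;
        s->rtt_var = keep_rtt / 2;
        return;
    }
    /* Update the deviation against the old smoothed rtt first, as TCP does. */
    s->rtt_var = (3 * s->rtt_var + abs(s->rtt - keep_rtt)) / 4;
    s->rtt     = (7 * s->rtt + keep_rtt) / 8;
}
```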

If RTT is too large, network latency is too high. We maintain several end-to-end network paths at the same time and probe their status in real time; if RTT exceeds the acceptable delay range, we switch transmission paths (unless the congestion is in the local access network).

2.3 Jitter and out-of-order delivery

Besides latency, UDP transmission also suffers from network jitter. What is jitter? Suppose we send 10 video frames per second, the one-way delay between sender and receiver is 50 ms, and each frame is carried by a single UDP packet. The sender then sends one packet every 100 ms: the first packet at 0 ms, the second at 100 ms, and so on. Ideally the receiver would see the packets at 50 ms, 150 ms, 250 ms, 350 ms, ...; in practice the arrival times might be 50 ms, 120 ms, 240 ms, 360 ms, .... The difference between the actual and ideal arrival times is jitter. The schematic diagram is as follows:

(Figure: network jitter)

We know video must be played strictly according to its timestamps, otherwise it visibly speeds up or slows down; if we played every frame the moment it was received, such speed-ups and slow-downs would be frequent and obvious. In other words, network jitter seriously degrades playback quality. To solve this, a playback buffer is introduced: received video frames are buffered and then played according to the timestamps carried inside the frames.

Besides small-scale jitter, UDP packets can also be reordered over a larger range: a packet sent later arrives before one sent earlier. Reordering can put video frames out of sequence; it is generally handled by a sequencing function in the playback buffer, so that packets sent first are played first.

The design of the playback buffer takes some care. If it holds too many frames, it adds unnecessary delay; if it holds too few, jitter and reordering leave nothing to play and cause some stuttering. The internal design of the playback buffer is covered in a later section.

2.4 Packet loss

UDP packets may be lost in transit for many reasons: insufficient uplink bandwidth, congestion at intermediate routers, socket send/receive buffers that are too small, hardware problems, signal loss, and so on. Packet loss is very common in UDP-based video transmission. It causes the decoder to lose frames, which leads to playback stuttering; this is exactly why most live streaming uses TCP and RTMP, since TCP's built-in retransmission guarantees that, under normal network conditions, no video data is lost in transit. The ways to compensate for UDP packet loss are as follows:

Packet redundancy

Packet redundancy is easy to understand: each packet is simply sent two or more times. The advantage is simplicity and low latency; the disadvantage is that it needs N times the bandwidth (N being the number of copies sent).

FEC

Forward error correction, commonly implemented with erasure codes, is widespread in distributed storage systems. In the simplest case, two packets A and B are XORed (exclusive-or) to produce C, and all three are sent to the receiver; if the receiver gets only A and C, it can recover B by XORing A and C. This method is usually used for real-time voice, where the bit rate is low, loss can be masked, and latency stays small. For 1080P ultra HD video at 300 KB/s, however, even 20% extra bandwidth is unacceptable, so we do not recommend an FEC mechanism for video transmission.
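A toy sketch of the XOR idea with two equal-length packets (packet framing and unequal lengths are ignored; this is only an illustration, not a production FEC scheme):

```c
#include <stddef.h>
#include <stdint.h>

/* Build the redundancy packet C = A XOR B (a, b, c must have the same length). */
static void fec_xor_encode(const uint8_t *a, const uint8_t *b,
                           uint8_t *c, size_t len)
{
    for (size_t i = 0; i < len; i++)
        c[i] = a[i] ^ b[i];
}

/* If B is lost but A and C arrive, B is recovered as A XOR C. */
static void fec_xor_recover(const uint8_t *a, const uint8_t *c,
                            uint8_t *b, size_t len)
{
    for (size_t i = 0; i < len; i++)
        b[i] = a[i] ^ c[i];
}
```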

Packet retransmission

Packet loss retransmission comes in two modes: push and pull. In push mode the sender keeps retransmitting a packet on a timer until the receiver acknowledges it; TCP works this way. In pull mode the receiver sends a retransmission request for the lost packet to the sender. Loss retransmission is on-demand, which suits video transmission well: it adds little extra bandwidth, but each loss costs at least one additional RTT of delay.

2.5 MTU and maximum UDP payload

Each link-layer technology defines a maximum transmission unit (MTU), the largest IP packet it can carry. Typical MTU values are:

  Network                       MTU (bytes)
  Hyperchannel                  65535
  16 Mb/s token ring            17914
  4 Mb/s token ring             4464
  FDDI                          4352
  Ethernet                      1500
  IEEE 802.3/802.2              1492
  X.25                          576
  Point-to-Point (low delay)    296

The entries highlighted in the original table are the access methods actually used on the Internet; X.25 is an older one, mainly ISDN or telephone-line access, and we cannot rule out that some home routers still follow the X.25 standard. We therefore have to know each client's MTU explicitly; a simple method is to probe with UDP packets of various sizes during initialization. The MTU determines the video frame fragment size, i.e. the maximum payload a single UDP packet carries.

Fragment size = MTU – IP header size – UDP header size – Protocol header size.

IP header size = 20 bytes; UDP header size = 8 bytes.

To play to routers' preference for small packets, if the computed fragment size exceeds 800 bytes we simply cap it at 800.
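A small sketch of the fragment-size calculation above; the 16-byte protocol header is an assumed example value, while the IP/UDP header sizes and the 800-byte cap follow the text:

```c
#define IP_HEADER_SIZE     20
#define UDP_HEADER_SIZE    8
#define PROTO_HEADER_SIZE  16   /* assumption: size of our own packet header */
#define MAX_FRAGMENT_SIZE  800  /* cap to favor routers that prefer small packets */

/* mtu is the value negotiated (or probed) for this client. */
static int fragment_size(int mtu)
{
    int size = mtu - IP_HEADER_SIZE - UDP_HEADER_SIZE - PROTO_HEADER_SIZE;
    if (size > MAX_FRAGMENT_SIZE)
        size = MAX_FRAGMENT_SIZE;
    return size;
}
```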

3. Transmission model

Based on the characteristics of video coding and network transmission above, we designed a transmission model for real-time 1080P ultra HD video. It consists of a codec whose bit rate adapts automatically to network conditions, a network send module, a network receive module, and a reliable-delivery UDP protocol model. The relationship between the modules is as follows:

(Figure: transmission model modules)

3.1 Communication Protocol

First, the communication protocol. The protocol we define has three stages: an access negotiation stage, a transmission stage, and a disconnect stage.

Access negotiation stage:

The sender initiates a video transmission access request carrying the current state of the local video, the starting frame ID, the timestamp, the MTU size, and so on. On receiving the request, the receiver initializes its local receive channel according to the video information in the request, compares its own MTU with the sender's, and returns the smaller of the two to the sender, so that the sender fragments frames according to the negotiated MTU. The schematic diagram is as follows:

(Figure: access negotiation)

Transmission stage:

Several protocols are used in the transmission stage: a PING/PONG protocol for measuring RTT, a data protocol carrying video frame fragments, a data feedback protocol, and a send-window synchronization protocol on the sender side. The feedback protocol is sent from the receiver to the sender; it carries the IDs of the most recent contiguously received frame and packet, plus the list of packet IDs requested for retransmission. In the synchronization protocol, after the sender has proactively discarded packets from its send-window buffer, it asks the receiver to synchronize to the current position of the send window, so that the receiver will not keep asking the sender to resend data that has already been discarded. The schematic diagram is as follows:

(Figure: transmission-stage protocols)
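The field names below are hypothetical; they merely illustrate what the fragment (segment) message and the feedback message described above might carry:

```c
#include <stdint.h>

/* Video frame fragment (segment) message -- field names are illustrative. */
typedef struct {
    uint32_t packet_id;     /* globally increasing packet ID              */
    uint32_t frame_id;      /* ID of the video frame this fragment belongs to */
    uint16_t fragment_idx;  /* index of this fragment within the frame    */
    uint16_t fragment_cnt;  /* total number of fragments in the frame     */
    uint8_t  frame_type;    /* I or P frame                               */
    uint16_t data_size;
    uint8_t  data[800];     /* payload, at most the negotiated fragment size */
} segment_msg_t;

/* Receiver -> sender feedback -- carries the contiguous high-water marks
 * and the packet IDs requested for retransmission. */
typedef struct {
    uint32_t ack_frame_id;   /* highest contiguously received frame ID   */
    uint32_t ack_packet_id;  /* highest contiguously received packet ID  */
    uint16_t nack_count;
    uint32_t nack_ids[64];   /* packet IDs to retransmit (capacity assumed) */
} feedback_msg_t;
```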

Disconnect stage:

This consists of a disconnect request and a disconnect acknowledgement; either the sender or the receiver can initiate the disconnect.

3.2 Sending

Sending covers the video frame fragmentation algorithm, the send-window buffer, the congestion-judgment algorithm, the expired-frame discarding algorithm, and the retransmission algorithm. We introduce them one by one.

Frame fragmentation

At 1080P, most video frames are larger than the UDP MTU, so frames must be fragmented. The method is simple: the fragment size is determined from the MTU negotiated during connection setup (the fragment-size algorithm is described in the MTU section above); the frame data is then split into several fragments of that size, and each fragment is sent to the receiver as a segment packet.
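A rough sketch of splitting one encoded frame into segments; send_segment() is a hypothetical send routine, not part of the article:

```c
#include <stdint.h>

/* Hypothetical: wraps one fragment into a segment packet and hands it to
 * the UDP socket. */
void send_segment(uint32_t frame_id, int index, int count,
                  const uint8_t *data, int size);

/* Split one encoded frame into fragments of at most frag_size bytes and
 * send each fragment as a segment packet. */
static void send_frame(const uint8_t *frame, int frame_len,
                       uint32_t frame_id, int frag_size)
{
    int count = (frame_len + frag_size - 1) / frag_size;  /* number of fragments */
    for (int i = 0; i < count; i++) {
        int offset = i * frag_size;
        int size = frame_len - offset;
        if (size > frag_size)
            size = frag_size;
        send_segment(frame_id, i, count, frame + offset, size);
    }
}
```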

Retransmission

Retransmission is relatively simple. We use the pull method: when the receiver detects a missing packet, and the moment T1 at which the gap was detected satisfies T1 + rtt_var < T2 (the receiver's current time), the packet is considered lost. The receiver then constructs a segment ACK covering all missing packets that satisfy this condition and sends it back to the sender; on receiving the feedback, the sender resends the corresponding packets from its send-window buffer by ID.

Why wait an interval of rtt_var before declaring a loss? Packets may simply be arriving out of order, so we wait one jitter period before treating a gap as a real loss. If the receiver requested retransmission the instant it noticed a gap, the sender could end up pushing out extra data, worsening bandwidth usage and congestion.

Send window buffer

The send-window buffer holds every packet that has been sent but not yet covered by a contiguous-ID acknowledgement from the receiver. When the receiver feeds back its latest contiguous packet ID, the send-window buffer deletes all packets whose IDs are covered by (i.e. not greater than) that ID; everything the buffer keeps exists solely for retransmission. A word about the contiguous packet ID reported by the receiver: for example, if the sender sends packets 1, 2, 3, 4, 5 and the receiver has received 1, 2, 4, 5, the latest contiguous ID is 2; once packet 3 arrives, the latest contiguous ID becomes 5.
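A minimal sketch of pruning the send window when a contiguous-ID acknowledgement arrives; the singly linked list layout is an assumption:

```c
#include <stdint.h>
#include <stdlib.h>

/* One cached packet in the send window, kept until its ID is acknowledged.
 * The list is ordered by increasing packet ID. */
typedef struct swnd_packet {
    uint32_t id;
    struct swnd_packet *next;
    /* ... payload, send timestamp, etc. ... */
} swnd_packet_t;

/* Drop every cached packet whose ID is covered by the latest contiguous ID
 * reported by the receiver; the rest stay available for retransmission. */
static void swnd_ack(swnd_packet_t **head, uint32_t contiguous_id)
{
    while (*head != NULL && (*head)->id <= contiguous_id) {
        swnd_packet_t *p = *head;
        *head = p->next;
        free(p);
    }
}
```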

Congestion judgment

Denote the current timestamp as curr_T and the send timestamp of the oldest packet still in the window buffer as oldest_T; the interval between them is delay, so

delay = curr_T – oldest_T

When the encoder module asks to send a new video frame, if delay > the congestion threshold Tn, we consider the network congested. We then compute a bandwidth estimate from the amount of data the receiver has confirmed over the most recent 20 seconds and feed that bandwidth value back to the encoder, which adjusts its encoding bit rate accordingly. If the bit rate has to be reduced repeatedly, we lower the image resolution to keep the video smooth and real-time. The value of Tn is determined from rtt and rtt_var.

Of course, the network may be congested for a while and then recover. We therefore added a timer that periodically checks the sender's retransmission count and delay; once they return to normal, the encoder's bit rate is gradually raised again until the video is restored to the target resolution and clarity.
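A sketch of the congestion check performed on each new frame; the exact form of Tn and the feedback callback are assumptions, only the overall flow follows the text:

```c
#include <stdint.h>

/* Hypothetical: tells the encoder to adapt its bit rate to the estimate. */
void notify_encoder_bitrate(int64_t bytes_per_second);

/* Called when the encoder wants to send a new frame.  curr_t is the current
 * time, oldest_t the send time of the oldest unacknowledged packet. */
static void check_congestion(int64_t curr_t, int64_t oldest_t,
                             int rtt, int rtt_var,
                             int64_t acked_bytes_last_20s)
{
    int64_t delay = curr_t - oldest_t;
    int64_t tn = 2 * (rtt + rtt_var);   /* assumed form of the threshold Tn */

    if (delay > tn) {
        /* Network congested: estimate bandwidth from the data the receiver
         * acknowledged over the last 20 seconds and feed it back. */
        notify_encoder_bitrate(acked_bytes_last_20s / 20);
    }
}
```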

Expired frame discarding

Under network congestion there may be many packets waiting in the send-window buffer. To relieve congestion and reduce delay, we scan the whole buffer; if the packets of an H.264 GOP have been there longer than a threshold time, we remove every packet of that GOP from the window buffer and synchronize the frame ID and packet ID of the next GOP's I frame to the receiver via the WND Sync protocol. On receiving it, the receiver sets its latest contiguous ID to the synchronized ID. Note that if expired-frame discarding happens frequently enough to cause stuttering, the current network is simply not suitable for high-resolution video, and the resolution should be lowered directly.

3.3 Receiving

Receiving covers packet loss management, the playback buffer, buffer time evaluation, and playback control, all built around the playback buffer. We introduce them one by one.

Packet loss management

Packet loss management consists of loss detection and lost-packet ID bookkeeping. Loss detection works roughly as follows. Suppose the largest packet ID seen by the playback buffer is max_id and a newly received packet has ID new_id. If max_id + 1 < new_id, packets may have been lost, and every ID in [max_id + 1, new_id - 1] is added to the loss manager as a key/value pair together with the current time. If new_id is not greater than max_id, the entry for new_id is deleted from the loss manager, meaning the previously missing packet has arrived. When the feedback conditions are met, the whole loss manager is scanned, the IDs of the lost packets are put into a segment ACK feedback message, and a retransmission request is sent to the sender; for each ID requested again, its timestamp is reset to the current time and its per-packet retransmission counter is incremented. During the scan we also record resend_count, the largest number of retransmissions requested for any single packet in the manager.
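A rough sketch of the loss-detection bookkeeping described above; a flat fixed-size table is used here for brevity, and the real data structure is an assumption (the feedback scan with resend_count is omitted):

```c
#include <stdint.h>

#define LOSS_MAP_SIZE 4096   /* assumed capacity of the loss table */

typedef struct {
    uint32_t ids[LOSS_MAP_SIZE];   /* packet IDs currently considered missing */
    int64_t  ts[LOSS_MAP_SIZE];    /* time each gap was first observed        */
    int      count;
    uint32_t max_id;               /* largest packet ID seen so far           */
} loss_mgr_t;

/* Called for every received packet. */
static void loss_on_packet(loss_mgr_t *m, uint32_t new_id, int64_t now)
{
    if (new_id > m->max_id + 1) {        /* gap: record the missing IDs */
        for (uint32_t id = m->max_id + 1;
             id < new_id && m->count < LOSS_MAP_SIZE; id++) {
            m->ids[m->count] = id;
            m->ts[m->count]  = now;
            m->count++;
        }
    } else if (new_id <= m->max_id) {    /* a previously missing packet arrived */
        for (int i = 0; i < m->count; i++) {
            if (m->ids[i] == new_id) {   /* remove it from the table */
                m->ids[i] = m->ids[m->count - 1];
                m->ts[i]  = m->ts[m->count - 1];
                m->count--;
                break;
            }
        }
    }
    if (new_id > m->max_id)
        m->max_id = new_id;
}
```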

Buffer time evaluation

In the section on jitter and out-of-order delivery we mentioned that the player keeps a buffer: too large and the delay grows, too small and stuttering occurs. We designed a buffer time evaluation algorithm to deal with this. The evaluation first computes a cache timer, derived from the resend_count obtained by scanning the loss manager and from the RTT. Since one request-retransmit-receive cycle takes one RTT, the cache timer is computed as follows:

cache_timer = (2 * resend_count + 1) * (rtt + rtt_var) / 2

If the computed cache timer is very small (smaller than the frame timer, the interval between video frames), then cache_timer = frame_timer; that is, no matter how good the network is, the buffer holds at least one frame of video, otherwise buffering is pointless.

If no loss or retransmission occurs within a unit of time, the cache timer is reduced a little. The benefit is that after an intermittent network fluctuation has inflated the cache timer, it can fall back to a relatively small value once the network recovers, avoiding unnecessary buffering delay.
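A sketch of the buffer-time evaluation; frame_timer is the nominal inter-frame interval (e.g. 40 ms at 25 fps), and the periodic shrink step is left out as it is described only qualitatively above:

```c
/* resend_count: largest number of times any single packet was re-requested
 * during the last scan of the loss manager.
 * frame_timer:  nominal interval between video frames, in ms. */
static int eval_cache_timer(int resend_count, int rtt, int rtt_var,
                            int frame_timer)
{
    int timer = (2 * resend_count + 1) * (rtt + rtt_var) / 2;
    if (timer < frame_timer)      /* always buffer at least one frame */
        timer = frame_timer;
    return timer;
}
```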

Playback buffer

The playback buffer we designed is an ordered circular array indexed by frame ID; each array slot holds the details of one video frame: frame ID, number of fragments, frame type, and so on. The buffer has two states, waiting and playing. Waiting means the buffer is still filling and cannot feed the player until the number of buffered frames reaches a threshold; playing means the playback module can take frames out of the buffer for decoding and display. The switching between the two states works as follows:

  1. The buffer is initialized to waiting when it is created.

  2. When the timestamp interval between the newest and oldest frames in the buffer exceeds the cache timer, the buffer enters the playing state, and the absolute playback start timestamp play_ts is set to the current time.

  3. If the buffer is in the playing state and runs out of frames, it returns to waiting until step 2 is triggered again.

The purpose of the playback buffer is to absorb jitter and cope with loss and retransmission, so that the video stream is played back at the rate at which it was captured. Its design is quite intricate and has to balance many factors, so the implementation deserves care.

Playback control

The last link on the receiving side is playback control: taking valid video frames out of the buffer for decoding and display. But how, and when? We know the video must be played according to the relative timestamps the sender stamps into each frame, so the difference in ts between two frames tells us the interval between them. Let prev_ts be the relative timestamp of the previously played frame, prev_play_ts the system time at which it was played, curr_play_ts the current system time, and frame_ts the relative timestamp of the lowest-numbered frame currently in the buffer. Then, as long as:

prev_play_ts + (frame_ts - prev_ts) < curr_play_ts, and all packets of that frame have been received, the frame can be taken out of the buffer, decoded, and displayed.

After playing the frame we set prev_play_ts = curr_play_ts. Updating prev_ts is a little more subtle; to keep the buffer from accumulating delay we handle it specially:

If frame_ts + cache_timer < the ts of the newest frame in the buffer, the buffering delay has grown too long, and we set prev_ts = (ts of the newest frame in the buffer) - cache_timer; otherwise prev_ts = frame_ts.
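A sketch of the playback timing decision; frame lookup and decoding are left out, and the names follow the variables above:

```c
#include <stdint.h>

/* Decide whether the lowest-numbered complete frame in the buffer
 * should be handed to the decoder now. */
static int should_play(int64_t prev_play_ts,  /* system time of previous playout      */
                       int64_t curr_play_ts,  /* current system time                  */
                       int64_t prev_ts,       /* stream timestamp of previous frame   */
                       int64_t frame_ts,      /* stream timestamp of candidate frame  */
                       int frame_complete)    /* all fragments of the frame received? */
{
    return frame_complete &&
           prev_play_ts + (frame_ts - prev_ts) < curr_play_ts;
}

/* After a playout, prev_play_ts = curr_play_ts; prev_ts is normally frame_ts,
 * but is pulled forward when the buffer has accumulated too much delay. */
static int64_t next_prev_ts(int64_t frame_ts, int64_t newest_ts, int cache_timer)
{
    if (frame_ts + cache_timer < newest_ts)   /* buffer delay too long */
        return newest_ts - cache_timer;
    return frame_ts;
}
```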

4. Measurement

Even the best model needs to be validated with sensible measurements, especially in the delay-sensitive field of multimedia transmission. In the lab we generally use netem to simulate various public-network conditions; once the results in the simulated environment are satisfactory, we organize people to test over the public Internet. Here is how we tested the whole transmission model.

4.1 netem simulation test

netem is a network emulation tool provided by the Linux kernel. It can impose latency, packet loss, jitter, reordering, and packet corruption, and can simulate most public-network conditions.

You can visit netem’s official website:

https://wiki.linuxfoundation.org/networking/netem

We set up a server/client test environment in the lab; the topology of the test environment is shown below:

We turned a Linux box into a router; the server, the receiver, and the sender are all connected to it. The server handles client registration, data forwarding, data buffering, and so on, acting as a simple streaming media server; the sender encodes and sends media, and the receiver receives and plays it. To measure delay, we ran the sender and the receiver on the same PC: the sender stamps the time when it grabs an RGB image from the CCD and records it in that frame's messages sent to the server and the receiver; when the receiver has received and decoded the frame, the recorded timestamp gives the end-to-end delay. Our test cases used a 1080P video stream at 300 KB/s and simulated the following network conditions on the router with netem:

  1. Loop delay 10ms, no packet loss, no jitter, no out-of-order delivery

  2. Loop delay 30ms, packet loss 0.5%, jitter 5ms, 2% out of order

  3. Loop delay 60ms, packet loss 1%, jitter 20ms, 3% out of order, 0.1% packet damage

  4. Loop delay 100ms, packet loss 4%, jitter 50ms, 4% out of order, 0.1% packet damage

  5. Loop delay 200ms, packet loss 10%, jitter 70ms, 5% out of order, 0.1% packet damage

  6. Loop delay 300ms, packet loss 15%, jitter 100ms, 5% out of order, 0.1% packet damage

Since the transmission mechanism is designed for reliable delivery, the key metric for evaluating it is video delay. We recorded the maximum delay over a 2-minute window for each case; the delay curves are as follows:

As the figure shows, as long as the network loop delay is within 200 ms and packet loss is below 10%, the video delay can be kept under 500 ms, which is not an especially demanding network condition. In deploying the back-end media service we therefore try to ensure the network between the client and the media server meets this condition. Even with a loop delay of 300 ms and 15% packet loss, the delay stays under 1 second, which basically satisfies two-way interactive communication.

4.2 Public Network Test

The public network test was relatively simple. We deployed the server on UCloud; the sender used 100 Mbps Shanghai Telecom broadband and the receiver used 20 Mbps Hebei Unicom community broadband, with a loop delay of about 60 ms. Over the whole test, 1080P playback at the receiving end was smooth and natural, with no jitter and no stuttering, and an average delay of about 180 ms.

5. Pitfalls

In the process of implementing 1080P ultra HD video transmission we ran into quite a few pitfalls, roughly as follows:

Socket buffer problem

Early in development we used the default socket buffer size. Because 1080P frame data is large (key frames exceed 80 KB), during intranet tests with no packet loss configured we still saw serious loss at the receiver. It turned out the loss was caused by the socket send/receive buffers being too small; after we set the socket buffers to 128 KB, the problem went away.
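A minimal sketch of enlarging the socket buffers to 128 KB as described above (error handling omitted):

```c
#include <sys/socket.h>

/* Enlarge the UDP socket's send and receive buffers to 128 KB. */
static void set_socket_buffers(int fd)
{
    int size = 128 * 1024;
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));
}
```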

H.264 B frame delay problem

Early on we enabled B-frame encoding to save transmission bandwidth. Since a B frame is predicted bidirectionally from frames before and after it, encoding lags behind by several frame intervals, adding more than 100 ms of encoding delay. We later simply removed the B-frame option for the sake of real-time performance.

Packet loss and retransmission in Push mode

At the design stage we used sender-driven push mode for loss retransmission. Testing showed that under frequent packet loss it consumed at least 20% more bandwidth and easily introduced delay and congestion. After several rounds of evaluation we switched to the current pull mode.

Segment memory problem

At the design stage we dynamically allocated a memory object for the frame information of every packet in the video buffers. Since 1080P transmission sends 400-500 UDP packets per second, long runs on a PC easily produce memory fragmentation, and on the server side we saw inexplicable glibc pseudo memory leaks and concurrency issues. We solved this frequent allocate/free problem by implementing a memory slab allocator.

Audio and video data transmission problems

In the early design we transmitted audio and video together, FLV style, with the same transmission algorithm. That was easy to implement, but under network fluctuation the sound stuttered easily, and the transmission could not be optimized for audio's characteristics. We later separated the audio, designed a low-delay, high-quality audio transmission path tailored to audio, and made targeted optimizations for audio transmission.

Our follow-up work focuses on multi-point media distribution, concurrent multi-path transmission, and exploring P2P distribution algorithms, to minimize delay and server bandwidth cost and make transmission more efficient and cheaper.

Q&A

Question: what was the most critical piece in getting down to 500 ms?

Yuan Rongxi: The coordination among loss retransmission, congestion control, and the playback buffer is the most critical part; delay control and video smoothness both have to be taken into account.

Question: how does multi-party video differ from one-to-one video? Did you use CDN to push the stream?

Yuan Rongxi: Our company does online education, and many scenarios require teachers and students to talk to each other, so the delay of CDN streaming is far too large. This system mainly solves the delay problem of multi-party communication. For pure viewing we still also use CDN push streaming. We are also developing a UDP-based distribution protocol for the viewing side, but it is not finished yet.

Further reading

  • Optimization experience of mobile live broadcasting technology in seconds (including PPT)

  • Reveal a Facebook live video watched by millions
