Right now, most working front-end developers learn by going straight to Stack Overflow to search for code or by reading scattered blog posts. This is fast, but each piece ends up as an isolated bit of knowledge: you may forget it by the afternoon, or remember only a fragment that doesn't help much on its own. Is there a better way?

Of course there is. You need three things: time, guidance, and the right materials.

In fact, it is the same as choosing a book: what matters most is not the author's fame but whether the table of contents covers what you need and whether it genuinely helps you. Don't waste your time on vague books with no technical substance.

This article therefore gives an overview of the range of technologies involved in current HTML5 live streaming. If we want to study each technology in depth, we can continue the discussion in later articles.

Live streaming protocols

Live streaming took off around 2016 alongside short video. Its business scenarios are many: game streaming, talent shows, online teaching, group experiments (a while ago someone even live-streamed stock trading driven by the audience), and so on. From a technical standpoint, however, it boils down to low-latency versus high-latency live streaming, which is mainly a matter of protocol selection.

There are several common live streaming protocols, such as RTMP, HLS, and HTTP-FLV. The most widely used is HLS, thanks to its broad support and technical simplicity, but it suffers from severe latency, which is painful for highly real-time scenarios such as live sports. Let's take a closer look at each protocol.

HLS

HLS stands for HTTP Live Streaming, the live streaming protocol proposed by Apple. (Apple also had a large hand in the demise of Adobe's Flash player.)

HLS consists of two parts: a .m3u8 index file and .ts video files (TS is a video container format). The browser first requests the .m3u8 index file, parses it to find the corresponding .ts file links, and starts downloading them. For a more detailed illustration, please refer to this image:

The basic usage is:

<video controls autoplay>  
    <source src="http://devimages.apple.com/iphone/samples/bipbop/masterplaylist.m3u8" type="application/vnd.apple.mpegurl" /> 
    <p class="warning">Your browser does not support HTML5 video.</p>  
</video>

You can simply put the .m3u8 URL into src and leave the parsing to the browser. Of course, we can also use fetch to download and parse the relevant files manually. A full HLS setup has one more level than the simple example above: a master playlist. The master playlist points to different .m3u8 files according to network conditions, for example 3G/4G/Wi-Fi speeds. A master file looks like this:

#EXTM3U
#EXT-X-VERSION:6
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=2855600,CODECS="avc1.4d001f,mp4a.40.2",RESOLUTION=960x540
live/medium.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=5605600,CODECS="avc1.640028,mp4a.40.2",RESOLUTION=1280x720
live/high.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1755600,CODECS="avc1.42001f,mp4a.40.2",RESOLUTION=640x360
live/low.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=545600,CODECS="avc1.42001e,mp4a.40.2",RESOLUTION=416x234
live/cellular.m3u8

Just focus on the BANDWIDTH field; the rest are mostly self-explanatory. If the high.m3u8 file is selected here, its contents will look like this:

#EXTM3U
#EXT-X-VERSION:6
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:26
#EXTINF:9.901,
http://media.example.com/wifi/segment26.ts
#EXTINF:9.901,
http://media.example.com/wifi/segment27.ts
#EXTINF:9.501,
http://media.example.com/wifi/segment28.ts

Note that the links ending in .ts are the actual video segments we play during the live broadcast. This second-level m3u8 file is also called a media playlist, and it comes in three types:

  • Live PlayList: dynamic playlist. As the name suggests, the list changes dynamically, with TS files updated in real time and out-of-date TS indexes removed. By default, dynamic lists are used.
#EXTM3U
#EXT-X-VERSION:6
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:26
#EXTINF:9.901,
http://media.example.com/wifi/segment26.ts
#EXTINF:9.901,
http://media.example.com/wifi/segment27.ts
#EXTINF:9.501,
http://media.example.com/wifi/segment28.ts
  • Event PlayList: static list. The main difference from the dynamic list is that old TS entries are never deleted; the list keeps being updated and the file gradually grows. It is marked simply by adding #EXT-X-PLAYLIST-TYPE:EVENT to the file.
#EXTM3U
#EXT-X-VERSION:6
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-PLAYLIST-TYPE:EVENT
#EXTINF:9.9001,
http://media.example.com/wifi/segment0.ts
#EXTINF:9.9001,
http://media.example.com/wifi/segment1.ts
#EXTINF:9.9001,
http://media.example.com/wifi/segment2.ts
  • VOD PlayList: full list. It simply lists every .ts file; using it is no different from playing back an entire recorded video. It uses #EXT-X-ENDLIST to mark the end of the file.
#EXTM3U
#EXT-X-VERSION:6
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-PLAYLIST-TYPE:VOD
#EXTINF:9.9001,
http://media.example.com/wifi/segment0.ts
#EXTINF:9.9001,
http://media.example.com/wifi/segment1.ts
#EXTINF:9.9001,
http://media.example.com/wifi/segment2.ts
#EXT-X-ENDLIST

For details about the related fields, see Apple's HLS documentation.
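As mentioned above, besides handing the .m3u8 URL straight to the video tag, you can also fetch and parse the playlist yourself. Below is a minimal sketch, assuming a simple media playlist like the ones shown; the URL is hypothetical, and a real player would also refresh a live playlist periodically and feed the segments into a player/MSE pipeline.

// Minimal sketch: download a media playlist and pull out the .ts segment URLs.
fetch('http://media.example.com/live/high.m3u8')
  .then(function (res) { return res.text(); })
  .then(function (text) {
    var segments = text
      .split('\n')
      .map(function (line) { return line.trim(); })
      // keep only the segment lines, i.e. non-empty lines that are not tags
      .filter(function (line) { return line && line[0] !== '#'; });
    console.log(segments); // e.g. ["http://media.example.com/wifi/segment26.ts", ...]
  });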

HLS defects

HLS is fine except for its large latency, which Apple probably didn't care much about when the protocol was first designed. The sources of delay in HLS include:

  • TCP handshake
  • m3u8 file download
  • Downloading all the .ts segments listed in the m3u8 file

Suppose each .ts segment plays for 5 s and each m3u8 lists 3-8 segments; the worst-case delay is then 40 s, because playback cannot begin until all the .ts files listed in one m3u8 have been downloaded. And that does not yet include the TCP handshake, DNS resolution, or downloading the m3u8 itself, so the total HLS latency is pretty hopeless. Is there a fix? Yes, and it is simple: either shorten each segment's playback time or put fewer segments in each m3u8. Push past the sweet spot, though, and every extra request for a new m3u8 adds its own delay, so the right policy has to be chosen for the business. These days, of course, thanks to MediaSource, it is not too hard to build a custom player that keeps a live stream running smoothly while keeping latency in check.
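As a quick sanity check of the arithmetic above, the worst-case startup delay is just segment duration times playlist length, ignoring TCP, DNS, and the m3u8 download itself; the numbers below are the ones assumed in this article.

// Rough worst-case HLS startup latency, in seconds.
function hlsWorstCaseLatency(segmentSeconds, segmentsPerPlaylist) {
  return segmentSeconds * segmentsPerPlaylist;
}

console.log(hlsWorstCaseLatency(5, 8)); // 40, matching the estimate above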

RTMP

RTMP stands for Real-Time Messaging Protocol. It was built around the FLV format, so your first reaction is probably: oh no, the browser won't play this either!!

Indeed, since current browsers do not support FLV natively, RTMP is basically never used directly on the Web. However, with the advent of MSE (Media Source Extensions), bringing an RTMP-sourced stream to the Web is no longer impossible: the basic idea is to open a long-lived WebSocket connection for data exchange and monitoring (a rough sketch follows the list below). We won't go into details here; the goal is to cover concepts and the overall framework. The RTMP protocol comes in several flavours:

  • Plain RTMP: connects directly over TCP, using port 1935
  • RTMPS: RTMP + TLS/SSL, for secure communication
  • RTMPE: RTMP + encryption, using Adobe's own encryption on top of the standard RTMP protocol
  • RTMPT: RTMP + HTTP, wrapping the RTMP stream in HTTP so it can pass through firewalls, at the cost of higher latency
  • RTMFP: RTMP + UDP, often used in P2P scenarios with extremely strict latency requirements
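As promised above, here is a rough sketch of the WebSocket-plus-MSE idea. Everything concrete in it is an assumption: the ws:// endpoint and the MIME string are hypothetical, and a real player must demux and remux the incoming RTMP/FLV data (for example into fragmented MP4) before MSE will accept it.

// Sketch only: binary frames arrive over a WebSocket long connection and are
// queued into a SourceBuffer. The demuxing/remuxing step is omitted here.
var video = document.querySelector('video');
var ms = new MediaSource();
video.src = URL.createObjectURL(ms);

ms.addEventListener('sourceopen', function () {
  var sb = ms.addSourceBuffer('video/mp4; codecs="avc1.42E01E, mp4a.40.2"'); // hypothetical MIME
  var queue = [];

  function flush() {
    if (queue.length && !sb.updating) sb.appendBuffer(queue.shift());
  }
  sb.addEventListener('updateend', flush);

  var ws = new WebSocket('ws://example.com/live'); // hypothetical endpoint
  ws.binaryType = 'arraybuffer';
  ws.onmessage = function (e) {
    queue.push(e.data); // already-remuxed segments, in this sketch
    flush();
  };
});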

RTMP transmits data over a long-lived TCP connection, so its latency is very low. The protocol is also flexible (and therefore complex): data can be partitioned by message stream ID as well as by chunk stream ID, and either can be used to separate streams. The stream content itself is divided into video, audio, and protocol control packets.

The detailed transmission process is shown as follows:

If you want to use the RTMP protocol later, you can refer to it directly

HTTP-FLV

This protocol is not much different from RTMP; the difference lies in the final delivery to the client:

RTMP delivers the stream directly over the RTMP protocol, whereas HTTP-FLV inserts a repackaging step between the RTMP source and the client, namely:

Since the FLV content is fetched over HTTP and has no fixed length, the response must use chunked transfer encoding. The captured response headers look like this:

Content-Type:video/x-flv
Expires:Fri, 10 Feb 2017 05:24:03 GMT
Pragma:no-cache
Transfer-Encoding:chunked

It’s relatively easy to use, but the back-end implementation is more difficult than using RTMP directly.
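To make the "easy to use" part concrete, here is a minimal sketch of pulling a chunked HTTP-FLV response on the front end; the URL is hypothetical, and in practice each received chunk would be handed to an FLV demuxer and then to MSE, as discussed below.

// Read a chunked HTTP-FLV response chunk by chunk with the Streams API.
fetch('http://example.com/live/stream.flv')
  .then(function (response) {
    var reader = response.body.getReader();
    function pump() {
      return reader.read().then(function (result) {
        if (result.done) return;
        // result.value is a Uint8Array holding the next piece of the FLV stream
        console.log('received ' + result.value.byteLength + ' bytes');
        return pump();
      });
    }
    return pump();
  });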

The above briefly introduced three protocols. Which one to choose must be tied firmly to your specific business requirements; otherwise you will only be digging a hole for yourself…

Here’s a simple comparison

Protocol comparison

| Protocol | Advantages | Drawbacks | Latency |
| --- | --- | --- | --- |
| HLS | Very widely supported | Very high latency | 10 s or more |
| RTMP | Low latency, flexible | Heavy data volume, high load | 1 s or more |
| HTTP-FLV | Low latency, commonly used for game streaming | Can only be played in mobile apps | 2 s or more |

Front-end audio and video streaming

Because browsers have ganged up on FLV, its prospects in the browser are grim. Yet the FLV format is simple and efficient to process, so back-end video developers are reluctant to give it up: switching away would mean transcoding the existing video, for example to MP4, which is rather heavy both for playback and for stream processing. With the advent of MSE this awkwardness is completely resolved; the front end can implement its own custom Web player, which is just about perfect. (Apple doesn't think that's necessary, though, so it won't work on iOS.)

MSE

MSE stands for Media Source Extensions. It is shorthand for a set of video streaming technologies comprising a series of APIs: MediaSource, SourceBuffer, and so on. Before MSE appeared, front-end work on video was limited to operating on whole video files; nothing could be done with a video stream. MSE now provides a set of interfaces that let developers feed a media stream to the player directly.

Let’s take a look at how MSE does the basic flow processing.

var vidElement = document.querySelector('video');

if (window.MediaSource) {
  // Create the MediaSource and hook it up to the <video> element via a blob URL
  var mediaSource = new MediaSource();
  vidElement.src = URL.createObjectURL(mediaSource);
  mediaSource.addEventListener('sourceopen', sourceOpen);
} else {
  console.log("The Media Source Extensions API is not supported.");
}

function sourceOpen(e) {
  URL.revokeObjectURL(vidElement.src);
  var mime = 'video/webm; codecs="opus, vp9"';
  var mediaSource = e.target;
  // Create a SourceBuffer of the given MIME type and feed it the fetched bytes
  var sourceBuffer = mediaSource.addSourceBuffer(mime);
  var videoUrl = 'droid.webm';
  fetch(videoUrl)
    .then(function(response) {
      return response.arrayBuffer();
    })
    .then(function(arrayBuffer) {
      sourceBuffer.addEventListener('updateend', function(e) {
        // Once the append has finished, close the stream
        if (!sourceBuffer.updating && mediaSource.readyState === 'open') {
          mediaSource.endOfStream();
        }
      });
      sourceBuffer.appendBuffer(arrayBuffer);
    });
}

The code above covers both parts of the job: fetching the stream and processing it, which is done mainly through MS and SourceBuffer. Let's go into the details:

MediaSource

MS (MediaSource) is really just a set of tools for managing video streams: it exposes the audio/video stream fully to the Web developer for manipulation and processing. By itself, it doesn't introduce undue complexity.

In total, MS exposes only 4 properties, 3 methods, and 1 static test method. They are:

Four attributes:

  • sourceBuffers: returns the list of SourceBuffer objects that have been created
  • activeSourceBuffers: returns the SourceBuffer objects that are currently active
  • readyState: returns the current state of the MS: closed, open, or ended
  • duration: gets or sets the duration of the current media

Three methods:

  • addSourceBuffer(): creates a SourceBuffer of the type specified by the given MIME string
  • removeSourceBuffer(): removes the specified SourceBuffer from the MS
  • endOfStream(): ends the stream directly

1 static test method:

  • MediaSource.isTypeSupported(): checks whether the given audio/video MIME type is supported

The most basic usage is to call addSourceBuffer to obtain a SourceBuffer of the specified type.

var sourceBuffer = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.42E01E, mp4a.40.2"');

Source Buffer

Once a SourceBuffer has been created via MS, the next step is to feed it the buffers to be played. For this, SourceBuffer provides two basic operations, appendBuffer and remove; after that, we can pass an ArrayBuffer straight to appendBuffer.

SourceBuffer also provides an emergency abort() method that can be used to bail out if something goes wrong with the stream.
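Because appendBuffer is asynchronous and a SourceBuffer rejects new data while its updating flag is true, appends are normally queued. Here is a minimal sketch of that pattern, assuming the sourceBuffer created above; the queueing logic is mine, not from the original article.

// Queue incoming ArrayBuffers and only append when the SourceBuffer is idle.
var bufferQueue = [];

function enqueue(arrayBuffer) {
  bufferQueue.push(arrayBuffer);
  flushQueue();
}

function flushQueue() {
  if (!bufferQueue.length || sourceBuffer.updating) return;
  sourceBuffer.appendBuffer(bufferQueue.shift());
}

sourceBuffer.addEventListener('updateend', flushQueue);

// remove(start, end) evicts a time range; abort() bails out of a bad append,
// e.g. sourceBuffer.remove(0, 10); or sourceBuffer.abort();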

Therefore, the whole flow chart is:

(Figure: obtaining the ArrayBuffer of the audio/video data and appending it through SourceBuffer into MediaSource.)

Of course, the concepts above are only the essentials; to actually write the code you will need to learn more. Interested readers can dig deeper in my other post: Full Advanced H5 Live.

If the opportunity comes up later, we can go on to the actual coding. This article mainly introduces the technologies and background knowledge live streaming requires; only with that in place can we get through the coding part without obstacles.

Stream processing

Above, we explained how MSE gets an actual stream into the page for live playback. Now let's get down to earth and look at how the stream itself is operated on and processed, because demuxing and protocol parsing mostly come down to grouping packets, modifying fields, splitting packets, and similar operations.

Before we get started, we need a few basic concepts about streams:

Binary

A binary stream is nothing more than a stream of bits. On the Web, though, there are several shorthand notations for writing binary values: binary, octal, and hexadecimal literals.

  • Binary: the 0b prefix marks a binary literal. Each digit represents 1 bit (2^1).
  • Octal: the 0o prefix marks an octal literal. Each digit represents 3 bits (2^3).
  • Hexadecimal: the 0x prefix marks a hexadecimal literal. Each digit represents 4 bits (2^4).

The bits-per-digit tell you the real length of the shorthand: 0xFF31, for example, is 4 hex digits, i.e. 16 bits or 2 bytes long.
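A quick illustration of those literals and of the 0xFF31 example above:

const b = 0b1010;  // binary literal: each digit is 1 bit -> 10
const o = 0o17;    // octal literal: each digit is 3 bits -> 15
const h = 0xFF31;  // hex literal: 4 digits x 4 bits = 16 bits = 2 bytes

console.log(b, o, h);              // 10 15 65329
console.log(h.toString(2).length); // 16 -> the value indeed spans 2 bytes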

Bit operations

Bitwise operations are very important when processing streams, but because the front end's buffer objects don't come with a rich set of helpers, we have to build a few wheels ourselves.

I won’t go into detail here, but some common bit operators on the Web are:

  • &
  • |
  • ~
  • ^
  • <<

See Web Bit Operations for details.

The overall priority is:

~ > (<<, >>, >>>) > & > ^ > |
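As a small, self-contained example of the kind of bit twiddling stream parsing needs (the helper name is mine): reading a 24-bit big-endian integer from three consecutive bytes, a field size that shows up often in FLV/TS headers.

// Read a 24-bit big-endian unsigned integer starting at `offset`.
function readUint24BE(bytes, offset) {
  return (bytes[offset] << 16) | (bytes[offset + 1] << 8) | bytes[offset + 2];
}

const data = new Uint8Array([0x00, 0x01, 0x02, 0x03]);
console.log(readUint24BE(data, 1)); // 0x010203 === 66051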

Byte order

Byte order is simply the order in which the bytes of a value are laid out. For historical reasons, there are two conventions:

  • BigEndian: bytes are placed from most significant to least significant, i.e. the first byte is the highest-order one (the "natural" order).
  • LittleEndian: bytes are placed from least significant to most significant, i.e. the first byte is the lowest-order one.

This concept will come up constantly as we write code. How do we know whether a machine is big-endian or little-endian? (Most are little-endian.) A simple IIFE check will do:

const LE = (function () {
    let buf = new ArrayBuffer(2);
    (new DataView(buf)).setInt16(0, 256, true);  // little-endian write
    return (new Int16Array(buf))[0] === 256;  // platform-spec read, if equal then LE
})();

On the front end, what we really manipulate is the ArrayBuffer; it is our direct channel to the raw buffer.

ArrayBuffer

AB (ArrayBuffer) is not an all-in-one stream-processing tool like Node.js's Buffer object. It is only a container for a stream, allocated by the underlying V8 implementation. The basic usage is to allocate a fixed memory area at instantiation:

new ArrayBuffer(length)

This creates a memory block of length bytes. At this point it is just empty memory; you need the other two objects, TypedArray and DataView, to write to and modify it. AB does, however, provide one very important method: slice().

slice(), like the slice method on Array, returns a new copy of a portion of the buffer. Why is this useful?

The buffer underlying a TypedArray or DataView cannot be swapped out, so if you want to apply different operations to the same data, for example set bytes 4-8 of the AB to 0 and, separately, to 1, and keep both results, you have to make a copy manually, and that is what slice is for.
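A small sketch of that scenario (the byte range is just illustrative):

const original = new ArrayBuffer(16);
const copy = original.slice(0); // full copy; slice(4, 8) would copy only that range

new Uint8Array(original).fill(1, 4, 8); // set bytes 4-8 of the original to 1
new Uint8Array(copy).fill(0, 4, 8);     // the copy can be set to 0 independently

console.log(new Uint8Array(original).slice(4, 8)); // Uint8Array [1, 1, 1, 1]
console.log(new Uint8Array(copy).slice(4, 8));     // Uint8Array [0, 0, 0, 0]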

AB's other properties and methods are not covered here; interested readers can refer to MDN's ArrayBuffer documentation.

Next, let's look at TypedArray and DataView, the two objects that actually do the day-to-day work with AB.

TypedArray

TA (TypedArray) is a family of typed views for working with an ArrayBuffer. How to put it? It is subdivided into:

Int8Array();
Uint8Array();
Uint8ClampedArray();
Int16Array();
Uint16Array();
Int32Array();
Uint32Array();
Float32Array();
Float64Array();

Why are there so many?

Because a TA splits the underlying buffer into elements of a fixed size, determined by the type you pick. For example:

var buf = new Uint8Array(arrayBuffer);

buf[0]
buf[1]
...

You read values out of the buffer by index, just like an array. In the Uint8Array above, each element is 1 byte.

Apart from the element size, the various typed arrays behave essentially the same, and you can more or less treat a TA as an ordinary Array. Why? Take a look at the methods it provides and you'll get the picture:

reverse()
set()
slice()
some()
sort()
subarray()
...

For compatibility, some methods need a polyfill, but that doesn't get in the way of our study; and since MSE itself targets modern (mobile) browsers, we don't need to worry too much about browser compatibility when building a Web player.

The most common operation on TypedArray is to write directly to index:

// Example of packing bit fields into successive bytes (this snippet looks like
// part of an RTMP-style chunk header; the variables come from the surrounding player code)
buf[0] = fmt << 6 | 1;              // fmt in the top 2 bits, flag 1 in the low 6 bits
buf[1] = chunkID % 256 - 64;        // chunk id split across two bytes...
buf[2] = Math.floor(chunkID / 256); // ...low part (offset by 64) and high part

Note that a TypedArray such as Uint32Array() reads from the buffer using the platform's native byte order, and most platforms default to little-endian.
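A quick sketch of that platform dependence, and of how DataView (covered next) sidesteps it:

const abuf = new ArrayBuffer(4);
new Uint8Array(abuf).set([0x01, 0x00, 0x00, 0x00]);

// Uint32Array uses the platform's byte order:
// 1 on little-endian machines, 16777216 (0x01000000) on big-endian ones.
console.log(new Uint32Array(abuf)[0]);

// DataView lets you choose the byte order explicitly, regardless of platform.
const dv = new DataView(abuf);
console.log(dv.getUint32(0, true));  // little-endian read -> 1
console.log(dv.getUint32(0, false)); // big-endian read    -> 16777216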

DataView

DV (DataView) is, like TypedArray, used to modify the underlying buffer. Bluntly put, the two of them together cover what Node.js's Buffer does; for whatever reason the functionality was split into two objects. The API DataView provides is simple: a set of get/set methods. Basic usage:

new DataView(arrayBuffer [, byteOffset [, byteLength]])

Note that the first parameter must be an ArrayBuffer; you cannot pass in a TypedArray instead, otherwise… well, try it and see.

Also note that a DataView modifies the buffer by reference, like any object: if you create several DataViews over one buffer, all their modifications land in that single buffer.

The great advantage of DV is that it makes writing values in either byte order very easy, which is far more convenient than doing a manual swap() with a TypedArray. Byte order, of course, only matters for values wider than 8 bits.

Here is an example of setUint32, whose basic signature is:

setUint32(byteOffset, value [, littleEndian])

littleEndian is a boolean that selects the byte order to write in; the default is big-endian. From the reference:

It is big-endian by default and can be set to little-endian in the getter/setter methods.

So if you want little-endian order, you have to pass true explicitly!

For example:

let view = new DataView(buffer);
view.setUint32(0, arr[0] || 1, BE);

// Using TypedArray you would have to construct the byte swap manually
buf = new Uint8Array(11);
buf[3] = byteLength >>> 16 & 0xFF;
buf[4] = byteLength >>> 8 & 0xFF;
buf[5] = byteLength & 0xFF;

Of course, if you are unsure of the platform's byte order, you can just use the IIFE from earlier:

const LE = (function () {
    let buf = new ArrayBuffer(2);
    (new DataView(buf)).setInt16(0, 256, true); // little-endian write
    return (new Int16Array(buf))[0] === 256; // platform-spec read, if equal then LE
})();

To give you a better feel for the difference between handling buffers on the front end and on the back end as a JS developer, here is a brief look at how Node.js handles buffers.

Node Buffer

Node's Buffer is actually the most convenient buffer tool a JS developer gets: it combines ArrayBuffer, TypedArray, and DataView into one object with all the processing methods mounted on it. For details, see the Node Buffer documentation.

It creates buffers of a given size directly through the alloc and from methods; the old new Buffer() approach is no longer recommended (as Stack Overflow will tell you), so I won't go into it. Importantly, Node.js buffers now interoperate directly with the front end's ArrayBuffer: with the from method you can wrap an ArrayBuffer as a Node.js Buffer.

The signature is:

 Buffer.from(arrayBuffer[, byteOffset[, length]])

See the demo provided by NodeJS:

const arr = new Uint16Array(2);
arr[0] = 5000;
arr[1] = 4000;

// Shares memory with `arr`
const buf = Buffer.from(arr.buffer);

// Prints: <Buffer 88 13 a0 0f>
console.log(buf);

// Changing the original Uint16Array changes the Buffer as well
arr[1] = 6000;

// Prints: <Buffer 88 13 70 17>
console.log(buf);

The Node Buffer object also mounts the following:

buf.readInt16BE(offset[, noAssert])
buf.readInt16LE(offset[, noAssert])
buf.readInt32BE(offset[, noAssert])
buf.readInt32LE(offset[, noAssert])

Slightly differently from DataView, the byte order here is decided directly by the method name:

  • BE stands for BigEndian
  • LE stands for LittleEndian

After that, we can write and read using the specified methods.

const buf = Buffer.from([0, 5]);

// Prints: 5
console.log(buf.readInt16BE());

// Prints: 1280
console.log(buf.readInt16LE());

In practice, we usually just consult the official Node documentation, which is very detailed.

Basic concepts of audio and video

To make learning a bit less painful, this section introduces the basic audio and video concepts, so that the next time someone tries to bluff you, you can just smile. First, the basics: video (container) formats and video compression (encoding) formats.

Container formats hardly need an introduction: .mp4, .flv, .ogv, .webm, and so on. Essentially, a container is a box that holds the actual video stream in a defined order so that playback is orderly and complete.

A compression format, as opposed to a container format, is what turns the raw video stream into a usable encoded form. The raw stream is huge: record audio directly on your phone and you'll find that a few minutes of it is far larger than an off-the-shelf MP3, and shrinking that is exactly the compression format's job. The overall flow is:

First the original digital device supplies the raw signal stream; a video compression algorithm then drastically reduces its size; the container then wraps it, attaching the corresponding DTS and PTS fields; and finally a usable video file is produced. Common container and compression formats are:

For container formats, the ISO specification documents are the main reference to study, and the codec specifications can be consulted from there.

Here I mainly want to introduce the compression algorithms, because they are important for understanding the concepts you will meet when actually decoding.

Let’s first look at what video coding is.

Video coding

A video is essentially a sequence of pictures stitched together and played frame by frame. Each picture can itself be compressed, for example by removing duplicate pixels or merging pixel blocks. There is also another kind of compression, motion estimation and motion compensation: adjacent pictures are bound to share large similar regions, so the redundancy between different pictures can be removed as well.

So, in general, there are three commonly used encoding methods:

  • Transform coding: eliminates intra-frame redundancy within an image
  • Motion estimation and motion compensation: eliminates inter-frame redundancy
  • Entropy coding: improves compression efficiency

Transform coding

Two concepts are involved here: the spatial domain and the frequency domain. The spatial domain is the physical picture itself; the frequency domain maps the picture, according to its colour values and so on, into numbers. The purpose of transform coding is to use the frequency domain to de-correlate the signal and concentrate its energy. Common orthogonal transforms include the discrete Fourier transform, the discrete cosine transform, and so on.

Entropy coding

Entropy coding mainly optimizes the length of the code itself. The principle is to assign short codes to high-probability symbols in the source and long codes to low-probability ones, minimizing the average code length. Variable-length coding methods include Huffman coding, arithmetic coding, run-length encoding, and so on.
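As a toy illustration of the idea behind run-length encoding (nothing like a real codec, just the principle of giving repeated symbols a shorter representation):

// Replace runs of identical symbols with (symbol, count) pairs.
function runLengthEncode(str) {
  let out = '';
  for (let i = 0; i < str.length; ) {
    let j = i;
    while (j < str.length && str[j] === str[i]) j++;
    out += str[i] + (j - i);
    i = j;
  }
  return out;
}

console.log(runLengthEncode('AAAABBBCCD')); // "A4B3C2D1"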

Motion estimation and motion compensation

The two methods above mainly remove correlation within a picture. Video also has temporal correlation: in many shots the background doesn't change and only some objects in the picture move, so it is enough to encode only the parts that change between adjacent frames.

Next, let's look at the I, B, and P frames associated with motion estimation and motion compensation.

I, B, P frame

I, B, and P frames actually come out of motion compensation; they are introduced here because we will need them later.

  • I frame: formally an Intra-coded picture, also called a key frame or independent frame. It is a reference image chosen by the encoder; an I frame is, by itself, a complete still image, and it serves as the reference point for B and P frames. It is compressed only with intra-frame techniques, entropy coding and transform coding, so there is essentially no motion compensation involved.
  • P frame: a Predicted picture, i.e. a forward-predicted frame. It is encoded relative to a previous image using inter-frame motion compression, so its compression ratio is somewhat higher than an I frame's.
  • B frame: a Bi-predictive picture, i.e. bidirectionally predicted. On top of what a P frame does, it also predicts from the following image, so its compression ratio is higher still.

Take a look at Dr. Ray’s diagram:

That's fine on paper, but not when it comes to actual encoding and decoding. In a real video you are very likely to meet the frame order I, B, B, P. That's fine in itself, but it causes a problem at decode time: the B frames reference both the frame before them and the frame after them, and while the I frame is available, the P frame they depend on has not been decoded yet (B frames can only reference I/P frames). So, to solve this, the frames are reordered for decoding into I, P, B, B, which guarantees correct decoding.

How is that order managed? Through DTS and PTS, the two timestamps (plus a CTS) we end up using when encoding and decoding video frames.

Explain:

  • PTS (presentation time stamp): when the frame should be displayed, i.e. its position in playback order.
  • DTS (decoding time stamp): when the frame should be decoded, i.e. its position in the decoded stream.

So the sequence of video frames can be simply expressed as:

   PTS: 1 4 2 3
   DTS: 1 2 3 4
Stream: I P B B

As you can see, we decode in DTS order and play back in PTS order. That about covers the basics needed for Web live streaming; if there is another chance later, we can move on to hands-on audio and video decoding.