Note: Based on bilibili flv.js implementation

Github address of flv.js: github.com/Bilibili/fl…

MP4 File Format

Overview

In the MP4 file format, the whole container is composed of boxes and sub-boxes. By type, the top-level boxes fall into three main categories: file type (ftyp), media data (mdat), and media information (moov). The media information (moov) describes the media data (mdat). (Note: there is another top-level box, moof, which is used only in streaming MP4; regular MP4 describes its data with moov alone. In streaming MP4, the order of the boxes and the body content of some boxes differ from regular MP4. For details, see the extension section.)

The main sub-box of moov is the track (trak). Each track is a media sequence that changes over time, and its time unit is the sample, which can be a frame of video or audio data. (Note that one audio frame can be split into several audio samples, so audio is generally counted in samples rather than frames.) Samples are arranged in time order. Each sample in a track is associated with a sample description by reference. The sample description defines how to decode the sample, for example which compression algorithm is used. (Note: in the current implementation, this reference value is always 1.)

Note: This article mainly covers the ordinary MP4 file format.

MP4.box

In JavaScript, every MP4 box is built as a Uint8Array (new Uint8Array()).

The first 8 bytes of a box are the box header. The first 4 bytes are the box size. A size of 0 means the box is the last box in the file and extends to the end of the file (this only occurs in the mdat box). A size of 1 means the actual size is stored in the 8-byte largesize field that follows the type (again, only in the mdat box). The last 4 bytes of the header are the ASCII codes of the box type. When the type is uuid, the data in the box is a user-defined extension type.

A box consists of a header and a body. The size is stored as a 32-bit (4-byte) big-endian integer: the first 4 bytes (32 bits) are the box size and the following 4 bytes are the box type. A box body can consist of data or of child boxes.
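
The header rules above can be sketched as a small reader. This is only an illustration, not code from flv.js; the function name readBoxHeader and the returned field names are assumptions made for this example.

```javascript
// Read one box header from a Uint8Array, starting at the given byte offset.
// Returns the four-character type, the total box size, and the header length.
function readBoxHeader(bytes, offset = 0) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let size = view.getUint32(offset); // 32-bit big-endian size
  const type = String.fromCharCode(
    bytes[offset + 4], bytes[offset + 5], bytes[offset + 6], bytes[offset + 7]);
  let headerLength = 8;

  if (size === 1) {
    // The real size is in the 64-bit largesize field after the type.
    const high = view.getUint32(offset + 8);
    const low = view.getUint32(offset + 12);
    size = high * 2 ** 32 + low; // exact only for sizes below 2^53
    headerLength = 16;
  } else if (size === 0) {
    // The box runs to the end of the file.
    size = bytes.byteLength - offset;
  }
  return { type, size, headerLength };
}

// An empty 'free' box consists of its 8-byte header only.
const freeBox = new Uint8Array([0x00, 0x00, 0x00, 0x08, 0x66, 0x72, 0x65, 0x65]);
const freeHeader = readBoxHeader(freeBox);
console.log(freeHeader.type, freeHeader.size); // free 8
```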

The structure of a box is as follows:

Video and audio have different parameters. Generally, an MP4 file has two tracks, one video trak and one audio trak. Each track has a track ID: 1 for video and 2 for audio.

The entire MP4 file format is shown below

FTYP box

The ftyp box holds four-character codes that indicate the encoding type, compatible specifications, or intended use of the media file.

In an ordinary MP4 file, there is exactly one ftyp box, at the beginning of the file.

Using the Mp4Reader tool, you can see the structure of the ftyp box:

Box size (4 bytes): 0x00000024: the box length is 36 bytes;

Box type (4 bytes): 0x66747970: the ASCII codes of “ftyp”, the type of the box;

Major_brand (4 bytes): 0x69736F6D: the ASCII codes of “isom”;

Minor_version (4 bytes): 0x00000200: the isom version number;

Compatible_brands (12 bytes): this file is compatible with the isom, iso2, avc1, and mp41 specifications.
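
As a rough sketch of how these fields can be read back, the following parses an ftyp box held in a Uint8Array. The sample bytes are hand-built for this example (a 32-byte box with four compatible brands), not the exact dump shown by the tool, and parseFtyp is a hypothetical helper name.

```javascript
// Read four bytes as an ASCII four-character code.
function fourCC(bytes, offset) {
  return String.fromCharCode(
    bytes[offset], bytes[offset + 1], bytes[offset + 2], bytes[offset + 3]);
}

// Parse size, type, major brand, minor version, and the compatible brands.
function parseFtyp(box) {
  const view = new DataView(box.buffer, box.byteOffset, box.byteLength);
  const size = view.getUint32(0);
  const compatibleBrands = [];
  // Compatible brands fill the rest of the box, 4 bytes each.
  for (let pos = 16; pos + 4 <= size; pos += 4) {
    compatibleBrands.push(fourCC(box, pos));
  }
  return {
    size,
    type: fourCC(box, 4),
    majorBrand: fourCC(box, 8),
    minorVersion: view.getUint32(12),
    compatibleBrands,
  };
}

// Hand-built sample: header (8) + major brand (4) + minor version (4) + 4 brands (16).
const ascii = (text) => Array.from(text, (ch) => ch.charCodeAt(0));
const sampleFtyp = new Uint8Array([
  0x00, 0x00, 0x00, 0x20,
  ...ascii('ftyp'), ...ascii('isom'),
  0x00, 0x00, 0x02, 0x00,
  ...ascii('isom'), ...ascii('iso2'), ...ascii('avc1'), ...ascii('mp41'),
]);
const ftypInfo = parseFtyp(sampleFtyp);
console.log(ftypInfo.majorBrand, ftypInfo.compatibleBrands.join(','));
```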

More ftyp brands are listed at www.ftyps.com/

Mdat box

The mdat box contains the media data of the MP4 file. It can be placed before or after moov in the file. Since the goal here is to write MP4 files, we need to calculate the offset of each frame's media data in the file.

The mdat box has a simple layout with no sub-boxes. The box header stores the box size and the box type (mdat). The box body stores all the media data, with the sample as the data unit.

Here, each sample in the video data is one video frame. Before a sample is stored, the frame has to be assembled according to its frame data type.

H.264 Video frame data types are as follows:

Note: 1. In the current implementation, the sequence parameter set (SPS) and picture parameter set (PPS) are not included in the I-frame data.

2. The frame data types above apply only to video frames.

In ordinary MP4, the location of each frame must be resolved before its data can be read. The frame data itself is stored in mdat, while the information about the frames is stored in the stbl box. Therefore, for an MP4 file to play correctly, the information for every frame must be written into the stbl box when the file is written.

In the mdat box, largesize may be needed when the data is too large for its size to fit in four bytes. When reading an MP4 file, a size field of 1 in the mdat box means the real box size is in largesize. Likewise, when writing an MP4 file that needs largesize, the size field must be set to 1.
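
A minimal sketch of the writing side, assuming the payload length is known in advance: the helper below (mdatHeader is a hypothetical name, not part of flv.js) emits an 8-byte header when the total size fits in 32 bits and a 16-byte header with largesize otherwise.

```javascript
// Build the header of an mdat box for a payload of the given length in bytes.
function mdatHeader(payloadLength) {
  const type = [0x6d, 0x64, 0x61, 0x74]; // 'mdat'

  if (payloadLength + 8 <= 0xFFFFFFFF) {
    // Normal case: 32-bit size covers header plus payload.
    const shortHeader = new Uint8Array(8);
    new DataView(shortHeader.buffer).setUint32(0, payloadLength + 8);
    shortHeader.set(type, 4);
    return shortHeader;
  }

  // Large case: size field is 1 and the real size goes into largesize.
  const longHeader = new Uint8Array(16);
  const view = new DataView(longHeader.buffer);
  const total = payloadLength + 16;
  view.setUint32(0, 1);
  longHeader.set(type, 4);
  view.setUint32(8, Math.floor(total / 2 ** 32)); // high 32 bits
  view.setUint32(12, total % 2 ** 32);            // low 32 bits
  return longHeader;
}

console.log(mdatHeader(100).byteLength);     // 8
console.log(mdatHeader(2 ** 32).byteLength); // 16
```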

Moov box

The moov box stores the media information. The frame information mentioned above is stored in stbl, which is part of the media information and therefore also lives inside the moov box. The moov box describes the media data.

The moov box mainly contains mvhd, trak, and mvex.

Mvhd box

The mvhd box defines the characteristics of the entire file.


| Field | Length (bytes) | Description |
|---|---|---|
| Size | 4 | The number of bytes in this movie header atom |
| Type | 4 | mvhd |
| Version | 1 | The version of this movie header atom |
| Flags | 3 | Extended movie header flags; 0 here |
| Creation time | 4 | The creation time of the movie atom, measured from midnight, January 1, 1904 |
| Modification time | 4 | The modification time of the movie atom, measured from midnight, January 1, 1904 |
| Time scale | 4 | The time unit used for all times in this file |
| Duration | 4 | The playback duration of the media |
| Rate | 4 | The speed at which this movie plays; 1.0 is normal playback speed |
| Volume | 2 | The playback volume of this movie; 1.0 is the maximum volume |
| Reserved | 10 | Set to 0 |
| Matrix structure | 36 | The matrix that defines the mapping between the two coordinate spaces in this movie |
| Preview time | 4 | The time at which the movie preview starts; 0 when the file is written |
| Preview duration | 4 | The duration of the preview, in movie time scale units; 0 when the file is written |
| Poster time | 4 | The time value of the movie poster |
| Selection time | 4 | The time value of the start of the current selection |
| Selection duration | 4 | The duration of the current selection, in movie time scale units |
| Current time | 4 | The current time |
| Next track ID | 4 | The ID of the next track to be added; 0 is not a valid ID value |

When writing MP4, you need to pass in the Time scale and Duration parameters; the other parameters use default values.
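
To make the field layout concrete, here is a sketch that fills the 100-byte mvhd body (everything after the 8-byte size and type header) from a time scale and a duration. The default values are assumptions for this example: rate and volume are written as fixed-point 1.0, the matrix is the identity, and the next track ID is set to 0xFFFFFFFF. The function name mvhdBody is hypothetical.

```javascript
// Build the mvhd body: 100 bytes following the 8-byte box header.
function mvhdBody(timescale, duration) {
  const body = new Uint8Array(100); // all fields start as 0
  const view = new DataView(body.buffer);
  // Bytes 0-3: version and flags; 4-7: creation time; 8-11: modification time.
  view.setUint32(12, timescale);
  view.setUint32(16, duration);
  view.setUint32(20, 0x00010000); // rate 1.0 (16.16 fixed point)
  view.setUint16(24, 0x0100);     // volume 1.0 (8.8 fixed point)
  // Bytes 26-35: reserved.
  // Bytes 36-71: 3x3 matrix, identity here.
  view.setUint32(36, 0x00010000);
  view.setUint32(52, 0x00010000);
  view.setUint32(68, 0x40000000);
  // Bytes 72-95: preview, poster, selection, and current time fields, all 0.
  view.setUint32(96, 0xFFFFFFFF); // next track ID (assumed default)
  return body;
}

// A 5-second movie expressed in a time scale of 1000 units per second.
const movieHeader = mvhdBody(1000, 5000);
console.log(movieHeader.byteLength + 8); // 108, the sum of the table's lengths
```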

Trak box

A trak box defines one track in a movie. A movie can contain one or more tracks that are independent of each other, each with its own time and space information. Each trak box contains an associated media box (mdia) that describes its media data.

Tracks are mainly used for the following purposes:

  1. To contain media data references and descriptions (media tracks)

  2. To contain modifier tracks

  3. To serve as hint tracks for a streaming protocol, which reference or copy the corresponding media sample data. Hint tracks and modifier tracks cannot stand alone: a movie must contain at least one media track. In other words, even if a hint track holds copies of the media sample data, the media tracks cannot be removed from a hinted movie.

When writing MP4, only the first purpose is used, so only references to and descriptions of the media data are covered here.

A trak box generally contains a tkhd box, an edts box, and an mdia box.

Tkhd box

The tkhd box is the header of a trak box; it defines the time, space, and volume information of the trak.


| Field | Length (bytes) | Description |
|---|---|---|
| Size | 4 | The number of bytes in this atom |
| Type | 4 | tkhd |
| Version | 1 | The version of this atom |
| Flags | 3 | Valid flags are: 0x0001: the track is enabled; 0x0002: the track is used in the movie; 0x0004: the track is used in the movie's preview; 0x0008: the track is used in the movie's poster |
| Creation time | 4 | The creation time of this atom, measured from midnight, January 1, 1904 |
| Modification time | 4 | The modification time of this atom, measured from midnight, January 1, 1904 |
| Track ID | 4 | A non-zero value that uniquely identifies the track |
| Reserved | 4 | Set to 0 |
| Duration | 4 | If the trak is a video trak, its duration comes from elst; if there is no elst, it takes the mvhd duration |
| Reserved | 8 | Set to 0 |
| Layer | 2 | The track's spatial priority in its movie. The QuickTime Movie Toolbox uses this value to determine how tracks overlay one another. Tracks with lower layer values are displayed in front of tracks with higher layer values. |
| Alternate group | 2 | A collection of movie tracks that contain alternate data for one another |
| Volume | 2 | The playback volume of this track; 1.0 is normal volume |
| Reserved | 2 | Set to 0 |
| Matrix structure | 36 | The matrix that defines the mapping between the two coordinate spaces in this track |
| Width | 4 | For a video track, the width of the image; for an audio track, 0 |
| Height | 4 | For a video track, the height of the image; for an audio track, 0 |
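
The flag values in the table are bit masks, so they can be combined with a bitwise OR and written as the 3 bytes that follow the version byte. The sketch below assumes a track that is enabled and used in both the movie and its preview; the constant and function names are made up for this illustration.

```javascript
const TRACK_ENABLED = 0x0001;
const TRACK_IN_MOVIE = 0x0002;
const TRACK_IN_PREVIEW = 0x0004;
const TRACK_IN_POSTER = 0x0008;

// Pack the version (1 byte) followed by the flags (3 bytes, big-endian).
function versionAndFlags(version, flags) {
  return new Uint8Array([
    version & 0xFF,
    (flags >>> 16) & 0xFF,
    (flags >>> 8) & 0xFF,
    flags & 0xFF,
  ]);
}

const tkhdPrefix = versionAndFlags(0, TRACK_ENABLED | TRACK_IN_MOVIE | TRACK_IN_PREVIEW);
console.log(Array.from(tkhdPrefix).join(' ')); // 0 0 0 7
```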

Elst box

The elst box is the only child of the edts box. Not all MP4 files have an edts box. The elst box offsets the timestamps of the corresponding trak box. No offset is needed here, so the box is not written during encoding.

Mdia box

This box defines the media type of the trak and information about its samples.

The header box, mdhd, defines the timescale and duration of the media. (If there is only one video trak, the timescale of mvhd is 1000, and the duration of one sample is 40, then the timescale here is 1000/40; the duration here is derived in the same way.)

The hdlr box defines the media handler component of the trak. The following figure explains the box more clearly:

Minf box

This box is a child of the mdia box above; it describes the content specific to the trak's media handler component.

There are two types of header box, vmhd and smhd, depending on the trak type. They carry no special data and simply identify the type of handler.

The dinf box defines how the media handler component retrieves the media data, and the dref box defines how the data is referenced. They are not used here, so they are not explained in detail. However, these boxes are mandatory when encoding an MP4 file; when unused, the number of references in dref defaults to 0, and the reference information defaults to an empty url entry.

Stbl box

The Sample Table Box (stbl) is one of the sub-boxes of minf. It defines the mapping between time and storage offset. All of its data is held in its sub-boxes:

stts: Time to Sample Box, the mapping table between timestamps and sample numbers.

stsd: Sample Description Box, which describes the data format. For example, the video format is AVC and the audio format is AAC.

stsz, stz2: Sample Size Boxes, the table of each sample's size. stz2 is an alternative, more compact way to store sample sizes. Only one of the two is needed; stsz is used here simply because it is easier to implement.

stsc: the Sample to Chunk mapping table. The encoding is clever, and it becomes more complex when there are multiple chunks. This application does not handle multiple chunks; the whole file is treated as a single chunk.

stco, co64: the table of each chunk's offset. The offset of each sample can be calculated with the help of the other boxes. co64 holds 64-bit chunk offsets; only 32-bit chunk offsets are needed for now, so stco is used here.

stss: the sample numbers of the key frames. This box exists only in the video trak: the audio trak is counted in samples, and several samples together make up one audio frame, so the audio trak does not need it.

The sub-boxes above are particularly important when writing MP4, and they are explained through an example.
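
As a rough illustration of what these tables hold, the sketch below derives their logical contents from a list of samples stored as one chunk. It produces plain JavaScript values rather than encoded boxes, and the function name and the sample fields (size, duration, isKeyframe) are assumptions for this example.

```javascript
// Derive the logical contents of stts, stsz, stss, stsc, and stco
// for one track whose samples are stored back to back in a single chunk.
function buildSampleTables(samples, chunkOffset) {
  const stts = [];          // run-length entries: { count, delta }
  const stsz = [];          // size of each sample
  const stss = [];          // 1-based numbers of the key frames
  const sampleOffsets = []; // file offset of each sample, derived from stco + stsz
  let offset = chunkOffset;

  samples.forEach((sample, index) => {
    const last = stts[stts.length - 1];
    if (last && last.delta === sample.duration) {
      last.count += 1; // same duration as the previous sample: extend the run
    } else {
      stts.push({ count: 1, delta: sample.duration });
    }
    stsz.push(sample.size);
    if (sample.isKeyframe) stss.push(index + 1);
    sampleOffsets.push(offset);
    offset += sample.size;
  });

  // One chunk holding every sample, using sample description 1.
  const stsc = [{ firstChunk: 1, samplesPerChunk: samples.length, sampleDescriptionIndex: 1 }];
  return { stts, stsz, stss, stsc, stco: [chunkOffset], sampleOffsets };
}

const tables = buildSampleTables([
  { size: 1000, duration: 40, isKeyframe: true },
  { size: 200, duration: 40, isKeyframe: false },
  { size: 300, duration: 40, isKeyframe: false },
], 40);
console.log(tables.sampleOffsets.join(' ')); // 40 1040 1240
```

With a single chunk, stco has one entry, and every sample offset is that entry plus the sizes of the samples before it.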

The structure diagram is as follows:

Example:

The example analyzes a segment of video data received from a URL after it has been demuxed.

The demuxing method is _parseChunks.

The demuxed data is as follows:

The data above is video data; most of it comes from the SPS in the FLV video stream.

Id: hard-coded during demuxing; Id = 1 for video data and Id = 2 for audio.

ChromaFormat: Color sampling format

BitDepth: Image grayscale

8: 256-color bitmap

24: True color

Level: level_idc, the level that the bitstream conforms to

Profile: profile_idc, the profile that the bitstream conforms to

MP41.types = {
	avc1: [], avcC: [], btrt: [], dinf: [],
	dref: [], esds: [], ftyp: [], hdlr: [],
	mdat: [], mdhd: [], mdia: [], mfhd: [],
	minf: [], moof: [], moov: [], mp4a: [],
	mvex: [], mvhd: [], sdtp: [], stbl: [],
	stco: [], stsc: [], stsd: [], stsz: [],
	stts: [], tfdt: [], tfhd: [], traf: [],
	trak: [], trun: [], trex: [], tkhd: [],
	vmhd: [], smhd: [], '.mp3': [], free: [],
	edts: [], elst: [], stss: []
};

An MP4 file uses the types above. Each entry in MP41.types is filled with the character codes of the four characters of its type name, for use in the later re-encapsulation. See the MP4.box section for more information on boxes.

Note: Because demuxing and remuxing are performed per FLV tag, the audio and video data are handled separately.

A sample parsed from FLV looks like this:
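
One way to fill such a table is to loop over its keys and store the four character codes of each name. The sketch below uses a small stand-in object (boxTypes, a hypothetical name) rather than the full MP41.types.

```javascript
// A reduced stand-in for the type table; each value starts as an empty array.
const boxTypes = { ftyp: [], moov: [], mdat: [], free: [], '.mp3': [] };

// Replace each empty array with the character codes of the 4-character name.
for (const name of Object.keys(boxTypes)) {
  boxTypes[name] = [
    name.charCodeAt(0),
    name.charCodeAt(1),
    name.charCodeAt(2),
    name.charCodeAt(3),
  ];
}

// 'ftyp' becomes 0x66 0x74 0x79 0x70, matching the ftyp box type bytes.
console.log(boxTypes.ftyp.map((code) => code.toString(16)).join(' ')); // 66 74 79 70
```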

{
	dts: dts,
	pts: pts,
	cts: cts,
	units: units,
	size: sample.length,
	isKeyframe: isKeyframe,
	duration: sampleDuration,
	originalDts: originalDts,
	flags: {
		isLeading: 0,
		dependsOn: isKeyframe ? 2 : 1,
		isDependedOn: isKeyframe ? 1 : 0,
		hasRedundancy: 0,
		isNonSync: isKeyframe ? 0 : 1
	}
}

The data written to mdat comes from the units of each sample. When storing the samples, do not keep only a reference to the original units; otherwise the units data will be empty when recording stops. Each unit is therefore copied with ES6 Object.assign, which copies the unit's own properties into a new object (the data typed array itself is still shared with the original unit):

Object.assign({}, sample.units[i])

units is an array, so it is copied element by element in a loop.
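
The difference between keeping a reference and copying each element can be seen in a small experiment. The sample object here is made up for the illustration, and the loop that empties the array only simulates a consumer draining the units.

```javascript
const recordedSample = {
  units: [
    { type: 5, data: new Uint8Array([1, 2, 3]) },
    { type: 1, data: new Uint8Array([4, 5]) },
  ],
};
const firstData = recordedSample.units[0].data;

// Keeping a reference: both names point at the same array.
const referenced = recordedSample.units;

// Copying element by element: a new array holding new unit objects.
const copied = [];
for (let i = 0; i < recordedSample.units.length; i++) {
  copied[i] = Object.assign({}, recordedSample.units[i]);
}

// Simulate the original units being drained later on.
while (recordedSample.units.length) {
  recordedSample.units.shift();
}

console.log(referenced.length);            // 0, the reference sees the drained array
console.log(copied.length);                // 2, the copies survive
console.log(copied[0].data === firstData); // true, the byte data itself is shared
```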

Frame the unit data before copying it:

let DRFlag = new Uint8Array(5);
if (singleSample.isKeyframe === true) {
	// Each flag is a start code (00 00 00 01) followed by one NAL header byte
	let spsFlag = new Uint8Array([0x00, 0x00, 0x00, 0x01, 0x67]); // SPS
	let ppsFlag = new Uint8Array([0x00, 0x00, 0x00, 0x01, 0x68]); // PPS
	let IDRFlag = new Uint8Array([0x00, 0x00, 0x00, 0x01, 0x65]); // IDR slice
	let spsFlagLen = 5, ppsFlagLen = 5, IDRFlagLen = 5, spsMetaLen = this.spsMeta.byteLength, ppsMetaLen = this.ppsMeta.byteLength;
	// Keyframe prefix: SPS flag, SPS, PPS flag, PPS, IDR flag
	DRFlag = new Uint8Array(spsFlagLen + spsMetaLen + ppsFlagLen + ppsMetaLen + IDRFlagLen);
	DRFlag.set(spsFlag, 0);
	DRFlag.set(this.spsMeta, spsFlagLen);
	DRFlag.set(ppsFlag, spsFlagLen + spsMetaLen);
	DRFlag.set(this.ppsMeta, spsFlagLen + spsMetaLen + ppsFlagLen);
	DRFlag.set(IDRFlag, spsFlagLen + spsMetaLen + ppsFlagLen + ppsMetaLen);
} else if (singleSample.isKeyframe === false) {
	DRFlag = new Uint8Array([0x00, 0x00, 0x00, 0x01, 0x61]); // non-IDR slice
} // todo audio

// Use the real prefix length: the keyframe prefix is longer than 5 bytes,
// so a fixed offset of 5 would overwrite part of it
let unitData = new Uint8Array(DRFlag.byteLength + units[i].data.byteLength);
unitData.set(DRFlag, 0);
unitData.set(units[i].data, DRFlag.byteLength);
units[i].data = unitData;

Finally, to produce the MP4 file, all of this data has to be converted into boxes with 4-byte (32-bit) fields by the box method. Two inputs are needed: the video parameters above and the sample list. Because a typed array in JavaScript must be allocated with its length before data can be written into it, the total length of the unit data after framing also has to be passed in; this length is accumulated while the sample list is being stored.

let mdatbox = new Uint8Array(mdatBytes + 8);

Therefore, there are three parameters:

meta, mdatDataList, mdatBytes
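
For illustration, the third parameter can be obtained by summing the byte lengths of all framed units. The helper name totalUnitBytes and the demo data are assumptions for this sketch; the text above accumulates the length while storing the samples instead.

```javascript
// Sum the byteLength of every unit in every sample.
function totalUnitBytes(samples) {
  let total = 0;
  for (const sample of samples) {
    for (const unit of sample.units) {
      total += unit.data.byteLength;
    }
  }
  return total;
}

const demoSamples = [
  { units: [{ data: new Uint8Array(10) }, { data: new Uint8Array(6) }] },
  { units: [{ data: new Uint8Array(4) }] },
];
const demoMdatBytes = totalUnitBytes(demoSamples);
// The mdat box needs 8 extra bytes for its size and type header.
const demoMdatBox = new Uint8Array(demoMdatBytes + 8);
console.log(demoMdatBytes, demoMdatBox.byteLength); // 20 28
```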

Box:

static box(type) {
	let size = 8;
	let result = null;
	// Every argument after the first is a piece of the box body
	let datas = Array.prototype.slice.call(arguments, 1);
	let arrayCount = datas.length;

	for (let i = 0; i < arrayCount; i++) {
		size += datas[i].byteLength;
	}
	result = new Uint8Array(size);
	result[0] = (size >>> 24) & 0xFF; // size
	result[1] = (size >>> 16) & 0xFF;
	result[2] = (size >>> 8) & 0xFF;
	result[3] = (size) & 0xFF;

	result.set(type, 4); // type

	let offset = 8;
	for (let i = 0; i < arrayCount; i++) { // data body
		result.set(datas[i], offset);
		offset += datas[i].byteLength;
	}

	return result;
}

type is the type of the box. The third line of the method collects every argument after the first into datas. All arguments of box other than the first must be binary data with a byteLength, such as a Uint8Array.

The method that generates the MP4 file data to be written as a Blob:
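
Because the result of box is itself a Uint8Array, boxes can be nested by passing child boxes as the body of a parent. The sketch below re-implements the same idea as a standalone function so that it runs on its own; the empty placeholder boxes only demonstrate the size arithmetic and do not form a playable structure.

```javascript
// Standalone version of the box idea: size (4 bytes), type (4 bytes), then payloads.
function makeBox(type, ...payloads) {
  let size = 8;
  for (const payload of payloads) size += payload.byteLength;
  const result = new Uint8Array(size);
  new DataView(result.buffer).setUint32(0, size); // big-endian size
  result.set(type, 4);
  let offset = 8;
  for (const payload of payloads) {
    result.set(payload, offset);
    offset += payload.byteLength;
  }
  return result;
}

const codes = (name) => Array.from(name, (ch) => ch.charCodeAt(0));

// Innermost box first: each parent size includes its children.
const mdiaDemo = makeBox(codes('mdia'));           // 8 bytes
const trakDemo = makeBox(codes('trak'), mdiaDemo); // 8 + 8 = 16 bytes
const moovDemo = makeBox(codes('moov'), trakDemo); // 8 + 16 = 24 bytes
console.log(moovDemo.byteLength); // 24
```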

static generateInitSegment(meta, mdatDataList, mdatBytes) {

	let ftyp = MP41.box(MP41.types.ftyp, MP41.constants.FTYP);
	let free = MP41.box(MP41.types.free);
	// allocate mdatbox
	let mdatbox = new Uint8Array(mdatBytes + 8);
	mdatbox[0] = (mdatBytes + 8 >>> 24) & 0xFF;
	mdatbox[1] = (mdatBytes + 8 >>> 16) & 0xFF;
	mdatbox[2] = (mdatBytes + 8 >>> 8) & 0xFF;
	mdatbox[3] = (mdatBytes + 8) & 0xFF;
	mdatbox.set(MP41.types.mdat, 4);
	let offset = 8;
	// Write samples into mdatbox
	for (let i = 0; i < mdatDataList.length; i++) {
		// The file layout is ftyp, free, mdat, moov, so mdat starts after ftyp and free
		mdatDataList[i].chunkOffset = ftyp.byteLength + free.byteLength + offset;
		let units = [], unitLen = mdatDataList[i].units.length;
		for (let j = 0; j < unitLen; j++) {
			units[j] = Object.assign({}, mdatDataList[i].units[j]);
		}
		while (units.length) {
			let unit = units.shift();
			let data = unit.data;
			mdatbox.set(data, offset);
			offset += data.byteLength;
		}
	}
	let moov = MP41.moov(meta, mdatDataList);
	let result = new Uint8Array(ftyp.byteLength + moov.byteLength +
	mdatbox.byteLength + free.byteLength);
	result.set(ftyp, 0);
	result.set(free, ftyp.byteLength);
	result.set(mdatbox, ftyp.byteLength + free.byteLength);
	result.set(moov, ftyp.byteLength + mdatbox.byteLength +
	free.byteLength);
	return result;
}

The data is then saved to an .mp4 file as a Blob. The key points here are the download attribute of the HTML5 a tag (not supported by IE) and a synthetic mouse event created through the window (event.initMouseEvent):

_finishRecord(recordMate) {
	let blob = new Blob([recordMate.recordBuffer], {'type': 'application/octet-stream'});
	let url = window.URL.createObjectURL(blob);
	let aLink = window.document.createElement('a');
	aLink.download = recordMate.filename;
	aLink.href = url;
	// Create a synthetic click event and dispatch it on the link
	let evt = window.document.createEvent('MouseEvents');
	evt.initMouseEvent('click', true, true, window, 0, 0, 0, 0, 0, false, false, false, false, 0, null);
	aLink.dispatchEvent(evt);
}

With that, the entire MP4 file is complete.

The MP41.moov method follows the MP4 file format described above; for details, see the moov method itself. Because the FLV video stream used here has no audio stream, only video data is encoded by this encapsulation method for now; the audio part will be added once there is an audio stream.

Existing problems:

  1. Currently, only video encoding is supported

Extension

Streaming MP4

Streaming MP4 files are also called fMP4 (fragmented MP4) files. Compared with ordinary MP4 files, fMP4 files have the following features:

  1. Content is saved separately from metadata

  2. Tracks are independent of each other

  3. Video and audio can be requested separately

  4. Video quality can change on the fly

  5. Tracks can be provided in multiple languages

  6. Transmission can proceed without loading the complete file

Every fragment in a streaming MP4 file is complete MP4 data. The ftyp box is paired with the moov box to describe the data type, compatible specifications, and video parameters. When the video parameters change, the ftyp box and moov box appear again. The mdat box stores the data of a video fragment, and moof describes that mdat. In fMP4, each mdat box is paired with a moof.
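
The difference in top-level layout can be observed by walking the boxes of a file and listing their types. The sketch below is an illustration under simple assumptions: the two test buffers are built from header-only boxes, and the function names are hypothetical.

```javascript
// Encode a four-character type name as bytes.
function typeBytes(name) {
  return Array.from(name, (ch) => ch.charCodeAt(0));
}

// Build a header-only box (size 8) for demonstration purposes.
function emptyBox(name) {
  return [0x00, 0x00, 0x00, 0x08, ...typeBytes(name)];
}

// Walk the top-level boxes and collect their types in file order.
function listTopLevelBoxes(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const types = [];
  let offset = 0;
  while (offset + 8 <= bytes.byteLength) {
    let size = view.getUint32(offset);
    const type = String.fromCharCode(...bytes.subarray(offset + 4, offset + 8));
    if (size === 0) {
      size = bytes.byteLength - offset;          // box runs to the end of the file
    } else if (size === 1) {
      if (offset + 16 > bytes.byteLength) break; // truncated largesize
      size = view.getUint32(offset + 8) * 2 ** 32 + view.getUint32(offset + 12);
    }
    if (size < 8) break;                         // malformed box, stop walking
    types.push(type);
    offset += size;
  }
  return types;
}

const regularFile = new Uint8Array([
  ...emptyBox('ftyp'), ...emptyBox('mdat'), ...emptyBox('moov'),
]);
const fragmentedFile = new Uint8Array([
  ...emptyBox('ftyp'), ...emptyBox('moov'),
  ...emptyBox('moof'), ...emptyBox('mdat'),
  ...emptyBox('moof'), ...emptyBox('mdat'),
]);
console.log(listTopLevelBoxes(regularFile).join(' '));    // ftyp mdat moov
console.log(listTopLevelBoxes(fragmentedFile).join(' ')); // ftyp moov moof mdat moof mdat
```

A file whose top-level list contains moof can be treated as streaming MP4; one with only ftyp, mdat, and moov is an ordinary MP4.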

The streaming MP4 file format is as follows:

Appendix:

MP4 file format information: http://www.52rd.com/Blog/wqyuwss/559/

MP4 structure analysis tool (Mp4Reader) : http://jchblog.u.qiniudn.com/software/MP4Reader_v0.9.0.6.zip