FLV protocol Overview

Flash Video (FLV) is a streaming media format. Because its files are small and the format is relatively simple, it quickly became popular and is widely supported.

The common HTTP-FLV live streaming protocol uses HTTP to stream audio and video data encapsulated in FLV. If you want to understand HTTP-FLV, you first need to understand the FLV format.

In a nutshell, an FLV file consists of an FLV header and an FLV file body, and the file body consists of multiple FLV tags.

FLV = FLV header + FLV file body
FLV file body = PreviousTagSize0 + Tag1 + PreviousTagSize1 + Tag2 + … + PreviousTagSizeN-1 + TagN + PreviousTagSizeN

FLV tags are divided into three types:

  • Video Tag: stores video data;
  • Audio Tag: stores audio data;
  • Script Tag: stores audio and video metadata.

Before describing the FLV protocol itself, here are the data type conventions used in the tables below:

Type Definition
0x… Hexadecimal data
SI8 Signed 8-bit integer
SI16 Signed 16-bit integer
SI24 Signed 24-bit integer
SI32 Signed 32-bit integer
STRING Sequence of Unicode 8-bit characters (UTF-8), terminated with 0x00 (unless otherwise specified)
UI8 Unsigned 8-bit integer
UI16 Unsigned 16-bit integer
UI24 Unsigned 24-bit integer
UI32 Unsigned 32-bit integer
UB[n] Bit field of n bits, holding an unsigned integer
xxx [ ] An array of type xxx
xxx [n] An array of type xxx with n elements
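
As a rough illustration of these types, the sketch below reads the integer types and STRING from a byte buffer. The helper names (`read_uint`, `read_sint`, `read_string`) are made up for this example, and big-endian byte order is assumed, which is what FLV uses.

```python
def read_uint(buf: bytes, offset: int, size: int) -> int:
    """Read a big-endian unsigned integer of `size` bytes (UI8, UI16, UI24, UI32)."""
    return int.from_bytes(buf[offset:offset + size], "big", signed=False)


def read_sint(buf: bytes, offset: int, size: int) -> int:
    """Read a big-endian signed integer of `size` bytes (SI8, SI16, SI24, SI32)."""
    return int.from_bytes(buf[offset:offset + size], "big", signed=True)


def read_string(buf: bytes, offset: int):
    """Read a STRING: UTF-8 bytes terminated by 0x00.

    Returns (text, offset of the byte after the terminator).
    """
    end = buf.index(0, offset)  # position of the 0x00 terminator
    return buf[offset:end].decode("utf-8"), end + 1
```

Python has no native 24-bit type, which is why `int.from_bytes` on a 3-byte slice is used here rather than a fixed `struct` format.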

FLV header

The FLV header consists of the following fields:

  1. The first three bytes are always the signature FLV;
  2. The last four bytes (DataOffset) are fixed at 9 for FLV version 1.

Field Type Meaning
Signature UI8 Signature, fixed to ‘F’ (0x46)
Signature UI8 Signature, fixed to ‘L’ (0x4C)
Signature UI8 Signature, fixed to ‘V’ (0x56)
Version UI8 Version; for example, 0x01 indicates FLV version 1
TypeFlagsReserved UB[5] Must be 0
TypeFlagsAudio UB[1] 1 indicates that audio tags are present; 0 indicates that there are none
TypeFlagsReserved UB[1] Must be 0
TypeFlagsVideo UB[1] 1 indicates that video tags are present; 0 indicates that there are none
DataOffset UI32 Size of the FLV header in bytes (9 for FLV version 1)
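
The header layout above can be sketched in Python as follows. This is a minimal example rather than a complete parser; the function name `parse_flv_header` and the keys of the returned dictionary are chosen for illustration only.

```python
import struct


def parse_flv_header(buf: bytes) -> dict:
    """Parse the 9-byte FLV header (hypothetical helper, FLV version 1 layout)."""
    if len(buf) < 9:
        raise ValueError("an FLV header needs at least 9 bytes")
    # ">" = big-endian: 3-byte signature, UI8 version, UI8 flags, UI32 DataOffset
    signature, version, flags, data_offset = struct.unpack(">3sBBI", buf[:9])
    if signature != b"FLV":
        raise ValueError("missing FLV signature")
    return {
        "version": version,
        "has_audio": bool(flags & 0x04),  # TypeFlagsAudio is bit 2
        "has_video": bool(flags & 0x01),  # TypeFlagsVideo is bit 0
        "data_offset": data_offset,
    }
```

For example, the bytes `46 4C 56 01 05 00 00 00 09` describe a version 1 file that contains both audio and video tags.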

FLV file body

The FLV file body has a regular layout: a sequence of PreviousTagSize fields alternating with tags:

  1. PreviousTagSize0 is always 0;
  2. A tag consists of a tag header and a tag body;
  3. For FLV version 1, the tag header is fixed at 11 bytes, so each PreviousTagSize (except the first) is 11 + the size of the previous tag body.

Field Type Meaning
PreviousTagSize0 UI32 Always 0
Tag1 FLVTAG The first tag
PreviousTagSize1 UI32 Size of the previous tag (Tag1), including its tag header
Tag2 FLVTAG The second tag
...
PreviousTagSizeN-1 UI32 Size of tag N-1, including its tag header
TagN FLVTAG The Nth (last) tag
PreviousTagSizeN UI32 Size of tag N, including its tag header
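
To make the alternating layout concrete, here is a small sketch that walks a file body and checks each PreviousTagSize against the rule above. The generator name `iter_tags` is an assumption of this example, and the strict size check is a design choice: real-world streams sometimes carry incorrect PreviousTagSize values, so a production parser might only warn.

```python
def iter_tags(body: bytes):
    """Yield (tag_type, tag_body) pairs from an FLV file body.

    `body` is assumed to start at PreviousTagSize0, i.e. right after the header.
    """
    pos = 0
    expected_prev = 0  # PreviousTagSize0 is always 0
    while pos + 4 <= len(body):
        prev_size = int.from_bytes(body[pos:pos + 4], "big")
        if prev_size != expected_prev:
            raise ValueError(f"PreviousTagSize is {prev_size}, expected {expected_prev}")
        pos += 4
        if pos + 11 > len(body):
            break  # that was the final PreviousTagSize; no tag follows
        tag_type = body[pos]
        data_size = int.from_bytes(body[pos + 1:pos + 4], "big")
        tag_body = body[pos + 11:pos + 11 + data_size]
        if len(tag_body) < data_size:
            raise ValueError("truncated tag body")
        yield tag_type, tag_body
        pos += 11 + data_size
        expected_prev = 11 + data_size  # 11-byte tag header + tag body
```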

FLV tags

An FLV tag consists of a tag header and a tag body.

The tag header is 11 bytes:

Field Type Meaning
TagType UI8 The tag type:
    8: audio
    9: video
    18: script data
    Others: reserved
DataSize UI24 Size of the tag body, in bytes
Timestamp UI24 Timestamp in milliseconds, relative to the first tag; the first tag always has a timestamp of 0
TimestampExtended UI8 Extension of the Timestamp field: the upper 8 bits of a 32-bit timestamp whose lower 24 bits are stored in Timestamp
StreamID UI24 Always 0
Data Depends on TagType The tag body:
    If TagType = 8, it is AUDIODATA
    If TagType = 9, it is VIDEODATA
    If TagType = 18, it is SCRIPTDATAOBJECT

In playback, the time sequencing of FLV tags depends on the FLV timestamps only. Any timing mechanisms built into the payload data format are ignored.
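
The 11-byte tag header, including the way TimestampExtended supplies the upper 8 bits of the timestamp, can be sketched like this. The function name `parse_tag_header` and the dictionary keys are illustrative assumptions, not part of the specification.

```python
def parse_tag_header(buf: bytes) -> dict:
    """Parse an 11-byte FLV tag header (hypothetical helper)."""
    if len(buf) < 11:
        raise ValueError("a tag header needs 11 bytes")
    timestamp_low = int.from_bytes(buf[4:7], "big")  # Timestamp (UI24)
    timestamp_high = buf[7]                          # TimestampExtended (UI8)
    return {
        "tag_type": buf[0],                              # 8, 9 or 18
        "data_size": int.from_bytes(buf[1:4], "big"),    # size of the tag body
        "timestamp_ms": (timestamp_high << 24) | timestamp_low,
        "stream_id": int.from_bytes(buf[8:11], "big"),   # always 0
    }
```

With Timestamp = 0x000001 and TimestampExtended = 0x01, the combined timestamp is 0x01000001 milliseconds.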

Audio tags

The definition is as follows:

Field Type Meaning
SoundFormat UB[4] Audio format; the one that matters most here is 10 = AAC:
    0 = Linear PCM, platform endian
    1 = ADPCM
    2 = MP3
    3 = Linear PCM, little endian
    4 = Nellymoser 16-kHz mono
    5 = Nellymoser 8-kHz mono
    6 = Nellymoser
    7 = G.711 A-law logarithmic PCM
    8 = G.711 mu-law logarithmic PCM
    9 = reserved
    10 = AAC
    11 = Speex
    14 = MP3 8-kHz
    15 = Device-specific sound
SoundRate UB[2] Sampling rate; for AAC it is always 3:
    0 = 5.5-kHz
    1 = 11-kHz
    2 = 22-kHz
    3 = 44-kHz
SoundSize UB[1] Sample size; compressed formats always decode to 16 bits:
    0 = snd8Bit
    1 = snd16Bit
SoundType UB[1] Channel type; Nellymoser is always mono, and for AAC the field is always 1 (stereo):
    0 = sndMono
    1 = sndStereo
SoundData UI8[size of sound data] AACAUDIODATA if the format is AAC; for other formats, refer to the specification

Remark:

If the SoundFormat indicates AAC, the SoundType should be set to 1 (stereo) and the SoundRate should be set to 3 (44 kHz). However, this does not mean that AAC audio in FLV is always stereo, 44 kHz data. Instead, the Flash Player ignores these values and extracts the channel and sample rate data encoded in the AAC bitstream.
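
The four bit fields share the first byte of the audio tag body, most significant bits first. A minimal sketch of unpacking them (the function name `parse_audio_flags` is an assumption of this example):

```python
def parse_audio_flags(first_byte: int) -> dict:
    """Split the first byte of an audio tag body into its four bit fields."""
    return {
        "sound_format": (first_byte >> 4) & 0x0F,  # UB[4], 10 = AAC
        "sound_rate": (first_byte >> 2) & 0x03,    # UB[2], 3 = 44-kHz
        "sound_size": (first_byte >> 1) & 0x01,    # UB[1], 1 = snd16Bit
        "sound_type": first_byte & 0x01,           # UB[1], 1 = sndStereo
    }
```

For AAC this byte is therefore normally 0xAF: format 10, rate 3, size 1, type 1.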

AACAUDIODATA

When SoundFormat is 10, the audio is encoded as AAC, and SoundData is defined as follows:

Field Type Meaning
AACPacketType UI8 The packet type:
    0: AAC sequence header
    1: AAC raw
Data UI8[n] Depends on AACPacketType:
    If AACPacketType is 0, it is an AudioSpecificConfig
    If AACPacketType is 1, it is raw AAC frame data

The AudioSpecificConfig is explained in ISO 14496-3. Note that it is not the same as the contents of the esds box from an MP4/F4V file. This structure is more deeply embedded.

About AudioSpecificConfig

The pseudocode is as follows:

5 bits: object type
if (object type == 31)
    6 bits + 32: object type
4 bits: frequency index
if (frequency index == 15)
    24 bits: frequency
4 bits: channel configuration
var bits: AOT Specific Config

The definition is as follows:

Field Type Meaning
AudioObjectType UB[5] Audio object type (the AAC profile); for example, 2 indicates AAC-LC
SamplingFrequencyIndex UB[4] Sampling frequency index; for example, 4 indicates 44100 Hz
ChannelConfiguration UB[4] Channel configuration; for example, 2 indicates two channels (front left and front right)
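
The pseudocode above translates fairly directly into Python. The sketch below is a simplified reading of the first few fields only; it ignores the AOT-specific configuration that follows. `BitReader` and `parse_audio_specific_config` are names invented for this example, and the frequency table lists the standard MPEG-4 sampling frequency indices 0 to 12.

```python
# Sampling frequencies for SamplingFrequencyIndex 0..12 (13 and 14 are reserved).
SAMPLE_RATES = (96000, 88200, 64000, 48000, 44100, 32000,
                24000, 22050, 16000, 12000, 11025, 8000, 7350)


class BitReader:
    """Reads unsigned bit fields, most significant bit first."""

    def __init__(self, data: bytes):
        self.value = int.from_bytes(data, "big")
        self.remaining = len(data) * 8

    def read(self, n: int) -> int:
        if n > self.remaining:
            raise ValueError("not enough bits")
        self.remaining -= n
        return (self.value >> self.remaining) & ((1 << n) - 1)


def parse_audio_specific_config(data: bytes) -> dict:
    bits = BitReader(data)
    object_type = bits.read(5)
    if object_type == 31:                 # escape value: extended object type
        object_type = 32 + bits.read(6)
    freq_index = bits.read(4)
    if freq_index == 15:                  # escape value: explicit 24-bit frequency
        frequency = bits.read(24)
    elif freq_index < len(SAMPLE_RATES):
        frequency = SAMPLE_RATES[freq_index]
    else:
        frequency = None                  # reserved index
    channel_config = bits.read(4)
    return {"object_type": object_type, "frequency": frequency,
            "channel_config": channel_config}
```

For instance, the two bytes `12 10` decode to object type 2 (AAC-LC), 44100 Hz, and channel configuration 2.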

Video tags

The definition is as follows:

Field Type Meaning
FrameType UB[4] Frame type; the ones that matter most are 1 and 2:
    1: keyframe (for AVC, a seekable frame), i.e. an H.264 IDR frame
    2: inter frame (for AVC, a non-seekable frame), i.e. an H.264 non-IDR frame such as a P or B frame
    3: disposable inter frame (H.263 only)
    4: generated keyframe (reserved for server use only)
    5: video info/command frame
CodecID UB[4] Codec ID; the one that matters most here is 7 (AVC):
    1: JPEG (currently unused)
    2: Sorenson H.263
    3: Screen video
    4: On2 VP6
    5: On2 VP6 with alpha channel
    6: Screen video version 2
    7: AVC
VideoData Depends on CodecID The actual media payload; the one that matters most here is 7 (AVCVIDEOPACKET):
    2: H263VIDEOPACKET
    3: SCREENVIDEOPACKET
    4: VP6FLVVIDEOPACKET
    5: VP6FLVALPHAVIDEOPACKET
    6: SCREENV2VIDEOPACKET
    7: AVCVIDEOPACKET
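
FrameType and CodecID share the first byte of the video tag body, with FrameType in the upper four bits. A minimal sketch (the function name `parse_video_flags` and the `is_keyframe` key are assumptions of this example):

```python
def parse_video_flags(first_byte: int) -> dict:
    """Split the first byte of a video tag body into FrameType and CodecID."""
    frame_type = (first_byte >> 4) & 0x0F  # UB[4], upper nibble
    codec_id = first_byte & 0x0F           # UB[4], lower nibble
    return {
        "frame_type": frame_type,
        "codec_id": codec_id,
        "is_keyframe": frame_type == 1,
    }
```

An H.264 keyframe therefore starts with 0x17, and an H.264 inter frame with 0x27.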

AVCVIDEOPACKET

When CodecID is 7, VideoData is an AVCVIDEOPACKET, which carries H.264 media data.

AVCVIDEOPACKET is defined as follows:

Field Type Meaning
AVCPacketType UI8 The packet type:
    0: AVC sequence header
    1: AVC NALU
    2: AVC end of sequence
CompositionTime SI24 If AVCPacketType = 1, the composition time (CTS) offset; otherwise 0
Data UI8[n] Depends on AVCPacketType:
    If AVCPacketType = 0, an AVCDecoderConfigurationRecord
    If AVCPacketType = 1, one or more NALUs
    If AVCPacketType = 2, empty

Here are a few things to explain:

  1. NALU: In H.264, encoded data is packaged according to specific rules into Network Abstraction Layer Units (NALUs). This data includes not only the encoded video itself but also the parameter sets (SPS, PPS) needed to decode it.
  2. AVCDecoderConfigurationRecord: carries the parameter sets (SPS, PPS) required for H.264 decoding.
  3. CTS: When B frames are present, the decoding timestamp (DTS) and presentation timestamp (PTS) of a frame may differ. CTS = (PTS - DTS) / 90, in milliseconds, assuming PTS and DTS are counted in 90 kHz clock ticks. If there are no B frames, CTS is always 0.

SPS and PPS are not covered in detail here.
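
As a sketch of how the packet header and the CTS relate, the example below reads AVCPacketType and the signed 24-bit CompositionTime, and computes a CTS from PTS and DTS values. Both function names are hypothetical, and `cts_ms` assumes PTS and DTS are given in 90 kHz clock ticks, as is common when remuxing from MPEG-TS.

```python
def parse_avc_video_packet(video_data: bytes) -> dict:
    """Parse the AVC packet that follows the FrameType/CodecID byte."""
    if len(video_data) < 4:
        raise ValueError("an AVC video packet needs at least 4 bytes")
    return {
        "avc_packet_type": video_data[0],
        # CompositionTime is SI24: a signed, big-endian, 3-byte integer
        "composition_time_ms": int.from_bytes(video_data[1:4], "big", signed=True),
        "data": video_data[4:],
    }


def cts_ms(pts_ticks: int, dts_ticks: int) -> int:
    """CTS in milliseconds from PTS and DTS in 90 kHz ticks (90 ticks per ms)."""
    return (pts_ticks - dts_ticks) // 90
```

When playing back, the presentation time of a frame is the tag timestamp (the DTS) plus the composition time.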

Script Data Tags

Script Data Tags are usually used to store metadata about the audio and video in the FLV (onMetaData), such as duration, width, and height. Their definition is relatively complex: AMF (Action Message Format) is used to encapsulate a series of data types, such as strings, numbers, and arrays.

Field Type Meaning
Objects SCRIPTDATAOBJECT[] Any number of SCRIPTDATAOBJECT structures
SCRIPTDATAOBJECTEND UI24 Always 9, marking the end of the script data

A SCRIPTDATAOBJECT is defined as follows:

Field Type Meaning
ObjectName SCRIPTDATASTRING Name of the object
ObjectData SCRIPTDATAVALUE Value of the object

The definition of SCRIPTDATAVALUE is as follows:

Field Type Meaning
Type UI8 Type of the variable:
    0 = Number type
    1 = Boolean type
    2 = String type
    3 = Object type
    4 = MovieClip type
    5 = Null type
    6 = Undefined type
    7 = Reference type
    8 = ECMA array type
    10 = Strict array type
    11 = Date type
    12 = Long string type
ECMAArrayLength UI32 if Type == 8 Length of the ECMA array
ScriptDataValue Depends on Type The value of the variable:
    If Type == 0, DOUBLE
    If Type == 1, UI8
    If Type == 2, SCRIPTDATASTRING
    ... (the full list is long; refer to the specification)
ScriptDataValueTerminator Depends on Type Terminator of an Object or ECMA array:
    If Type == 3, SCRIPTDATAOBJECTEND
    If Type == 8, SCRIPTDATAVARIABLEEND

As you can see, the definition of Script Data Tag is relatively complex.

onMetaData

onMetaData carries audio- and video-related metadata. It is encapsulated in a Script Data tag that contains two AMF values.

The first AMF value:

  • Byte 1: 0x02, indicating a string;
  • Bytes 2-3: a UI16 with value 0x000A, the string length of 10 (the length of onMetaData);
  • Bytes 4-13: the ASCII bytes of the string onMetaData (0x6F 0x6E 0x4D 0x65 0x74 0x61 0x44 0x61 0x74 0x61).

The second AMF value:

  • Byte 1: 0x08, indicating the ECMA array type;
  • Bytes 2-5: a UI32 giving the length of the array. The set of onMetaData properties is not fixed;
  • Byte 6 onward: the properties. Taking duration as an example:
    • Bytes 6-7: 0x0008, the length of the property name (8 bytes);
    • Bytes 8-15: 0x64 0x75 0x72 0x61 0x74 0x69 0x6F 0x6E, the property name duration;
    • Byte 16: 0x00, indicating a Number type;
    • Bytes 17-24: 0x…, the duration value as a DOUBLE.
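
The byte walk-through above can be turned into a small reader. This is only a sketch: it assumes property names are prefixed with a UI16 length (like the onMetaData string itself), handles just the Number, Boolean and String value types, and stops at the first value type it does not know. The names `read_amf_string` and `parse_on_metadata` are invented for this example.

```python
import struct


def read_amf_string(buf: bytes, pos: int):
    """Read a UI16 length followed by that many UTF-8 bytes; return (text, next_pos)."""
    length = int.from_bytes(buf[pos:pos + 2], "big")
    start = pos + 2
    return buf[start:start + length].decode("utf-8"), start + length


def parse_on_metadata(body: bytes) -> dict:
    """Parse the body of an onMetaData script tag into a dict (partial sketch)."""
    if body[0] != 0x02:
        raise ValueError("first AMF value is not a string")
    name, pos = read_amf_string(body, 1)
    if name != "onMetaData":
        raise ValueError("not an onMetaData tag")
    if body[pos] != 0x08:
        raise ValueError("second AMF value is not an ECMA array")
    pos += 5  # skip the type byte and the UI32 array length
    props = {}
    # The array ends with an empty name followed by the end marker 9: 00 00 09.
    while pos + 3 <= len(body) and body[pos:pos + 3] != b"\x00\x00\x09":
        key, pos = read_amf_string(body, pos)
        value_type = body[pos]
        pos += 1
        if value_type == 0x00:    # Number: big-endian DOUBLE
            props[key] = struct.unpack(">d", body[pos:pos + 8])[0]
            pos += 8
        elif value_type == 0x01:  # Boolean: UI8
            props[key] = body[pos] != 0
            pos += 1
        elif value_type == 0x02:  # String: SCRIPTDATASTRING
            props[key], pos = read_amf_string(body, pos)
        else:
            break                 # other value types are not handled here
    return props
```

The array length is skipped rather than trusted because the terminator, not the count, is what reliably marks the end of the properties in this sketch.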

More onMetaData definitions:

Field Type Meaning
duration DOUBLE Duration of the file (seconds)
width DOUBLE Video width (pixels)
height DOUBLE Video height (pixels)
videodatarate DOUBLE Video bit rate (kbit/s)
framerate DOUBLE Video frame rate (frames/s)
videocodecid DOUBLE Video codec ID (see Video tags)
audiosamplerate DOUBLE Audio sampling rate
audiosamplesize DOUBLE Audio sample size (see Audio tags)
stereo BOOL Whether the audio is stereo
audiocodecid DOUBLE Audio codec ID (see Audio tags)
filesize DOUBLE Total file size (bytes)

Closing remarks

The FLV protocol itself is not complicated. The difficulty in understanding it more often comes from the audio and video codec knowledge involved, such as H.264 and AAC; it is best to look these up whenever something is unclear. In addition, FLV uses big-endian byte order, which must be kept in mind when parsing.

For ease of explanation, parts of this article may not be fully rigorous. If you find any mistakes, please point them out.

Related links

video_file_format_spec_v10.pdf www.adobe.com/content/dam…

MPEG-4 Part 3 en.wikipedia.org/wiki/MPEG-4…

FLV file analysis www.jianshu.com/p/e290dca02…

H.264 NALU syntax blog.csdn.net/qq_29350001…