This article is shared by Ye Xu of the Taobao multimedia front-end team.



In 2020, live-streaming e-commerce took the whole internet by storm. Want to see the front-end technology behind Taobao Live? This article takes you into the world of Taobao Live front-end technology. For most front-end engineers, audio and video is a field they rarely touch. This article covers streaming media theory spanning text, graphics, images, audio, and video, and introduces players, web technologies, and the mainstream media frameworks. With just a little time, you can step into the field of front-end multimedia.

1. Audio and video basics

1.1 Video

1.1.1 Basic Concepts
  • Bit rate: the higher the sampling rate per unit time, the higher the accuracy and the closer the processed file is to the original.
  • Frame rate: determines how smooth the video looks. The higher the frame rate, the smoother the playback; the lower the frame rate, the choppier it appears.
  • Compression ratio: compressed file size / original file size × 100%. A smaller compressed file is better, but heavier compression generally means longer decompression time.
  • Resolution: a parameter that measures the amount of data in an image; it is closely related to video clarity.
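The formulas above can be sketched in a few lines of JavaScript; the numbers used below are illustrative, not Taobao Live values:

```javascript
// Compression ratio = compressed file size / original file size × 100%
function compressionRatio(compressedBytes, originalBytes) {
  return (compressedBytes / originalBytes) * 100;
}

// Rough file size from an average bit rate: size (bytes) = bitrate × seconds / 8
function estimatedSizeBytes(bitrateKbps, seconds) {
  return (bitrateKbps * 1000 * seconds) / 8;
}

compressionRatio(25 * 1024 * 1024, 100 * 1024 * 1024); // → 25 (%)
estimatedSizeBytes(2000, 60); // a 60 s clip at 2000 kbps ≈ 15 MB
```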

1.1.2 Video container formats

MP4, AVI, FLV, TS/M3U8, WebM, OGV, MOV…

1.1.3 Video encoding formats
  • H.264: currently the most popular encoding format.
  • H.265: High Efficiency Video Coding, a newer format intended to succeed the H.264/AVC standard.
  • VP9: the next-generation video coding format developed by the WebM Project. VP9 supports all web and mobile use cases, from low-bitrate compression to high-quality UHD, with additional support for 10/12-bit encoding and HDR.
  • AV1: an open-source, royalty-free video coding format developed by AOM (the Alliance for Open Media). AV1 is the successor to Google's VP9 and a strong competitor to H.265.

1.2 Audio

1.2.1 Basic Concepts
  • Sampling rate: the number of samples a recording device takes from the audio signal per second. The higher the sampling rate, the more authentic and natural the sound.
  • Sampling size (bit depth): the number of bits of information in each sample, i.e. the sampling accuracy, measured in bits. (The number of samples per second is the sampling rate; the number of bits per second is the bit rate.)
  • Bit rate: the number of bits transmitted per second, also called the data signal rate, in bit/s, kbit/s, or Mbit/s. The higher the bit rate, the more data is transmitted per unit time.
  • Compression ratio: the ratio between the size of the raw audio data (e.g. PCM) and the size of the data after compression encoding.
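To make sampling rate, bit depth, and channel count concrete, here is the standard size calculation for uncompressed PCM audio (a generic formula, nothing Taobao-specific):

```javascript
// Uncompressed PCM size in bytes:
// sampleRate (Hz) × bitDepth (bits) × channels × seconds / 8
function pcmSizeBytes(sampleRate, bitDepth, channels, seconds) {
  return (sampleRate * bitDepth * channels * seconds) / 8;
}

// One second of CD-quality audio: 44.1 kHz, 16-bit, stereo
pcmSizeBytes(44100, 16, 2, 1); // → 176400 bytes (~172 KB)
```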

1.2.2 Audio container formats

Common audio container formats include: WAV, AIFF, AMR, MP3, Ogg…

1.2.3 Audio encoding formats
  • PCM (Pulse Code Modulation): one of the encoding methods used in digital communication; it represents the raw, uncompressed audio signal.
  • AAC-LC (MPEG AAC Low Complexity): a high-performance audio codec that delivers high-quality audio at low bit rates.
  • AAC-LD (AAC Low Delay, the MPEG-4 low-latency audio coder): a low-latency audio codec tailored for teleconferencing and OTT services.
  • FLAC (Free Lossless Audio Codec): a well-known free audio compression codec whose defining feature is lossless compression; since 2012 it has been supported by many software and hardware audio products.

2. Live streaming technology


First, let's look at an intuitive diagram of the live-streaming pipeline, from the anchor pushing the stream to the user pulling it.



Streaming protocols

Every piece of video or audio media you watch on the network relies on specific network protocols for data transmission, mostly at the session, presentation, and application layers. Common protocols include RTMP, RTP/RTCP/RTSP, HTTP-FLV, HLS, and DASH. Each protocol has its own advantages and disadvantages.

Push and pull streaming process

The host starts a live broadcast on a device; the capture device records the anchor's voice and picture and pushes them to a streaming media server over the corresponding protocol. The viewing side (the pull side) can then pull the stream data from the streaming media server over a streaming protocol and play it.

3. Player


This section focuses on the technology inside the player, briefly describing how a player works once the streams are available.



3.1 Stream pulling

The first step is pulling the stream: you must obtain the video stream before you can play it. For example, FLV-format stream data can be pulled with the Fetch API and Streams API provided by the browser.
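A minimal sketch of that idea; the URL and chunk handler are placeholders, and a real player would also need error handling and backpressure:

```javascript
// Pull a stream chunk by chunk with the Fetch and Streams APIs.
// Each chunk is a Uint8Array of raw bytes (e.g. FLV data) that would be
// handed on to the demuxer.
async function pullStream(url, onChunk) {
  const response = await fetch(url);
  const reader = response.body.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break; // stream ended
    onChunk(value);
  }
}
```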

3.2 Demux (decapsulation)


After the stream data is obtained, it must be decapsulated. Before playback, the images, sound, subtitles (which may not exist), and so on have to be separated from the pulled stream data. This act and process of separation is demuxing (demux).



After decapsulation we obtain elementary streams for images, sound, and subtitles, which can then be decoded by a decoder.

3.3 Decode

From the decapsulation step above, we know that the separated elementary streams must next be decoded to produce data the audio and video player can play. Decoding yields various kinds of data; let's pick out a few important ones:

3.3.1 SPS and PPS

These two determine the maximum resolution, frame rate, and many other parameters needed to play the video. The PPS is usually stored at the beginning of the bitstream together with the SPS.

  • The SPS and PPS hold a set of global parameters for the coded video sequence; if they are lost, decoding is very likely to fail.

3.3.2 I, B, and P frames




  • I-frame: key frame. The I-frame is the first frame of each GOP (group of pictures, a video compression technique used by MPEG). It is moderately compressed and can be treated as a static image that serves as a reference point for random access.
  • B-frame: bidirectionally predicted frame. It is predicted from a preceding I- or P-frame and a following I- or P-frame: decoding needs both the previously cached picture and the picture decoded after it, and the final picture is obtained by combining the frame's data with both.
  • P-frame: predictive frame. A coded picture that removes temporal redundancy relative to previously coded frames in the sequence, greatly reducing the amount of data to transmit.

3.3.3 SEI (Supplemental Enhancement Information)
  • A video stream can be decoded without SEI, but SEI can carry extra business data alongside the video. For a simple example, the once-popular live quiz shows transmitted much of their question-related business data through SEI and used it to keep the question display in sync with the audience's audio and video.

3.3.4 PTS and DTS
  • DTS (Decoding Time Stamp): tells the player when to decode this frame's data.
  • PTS (Presentation Time Stamp): tells the player when to display this frame's data. In short, these two largely determine whether your audio and video stay in sync.
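A tiny illustration of why the two timestamps differ: with B-frames, decode (DTS) order and display (PTS) order are not the same, and the renderer must reorder frames by PTS. The frame list below is made up for illustration:

```javascript
// A GOP fragment in decode (DTS) order; pts values are display indices.
const decodeOrder = [
  { type: 'I', pts: 0 },
  { type: 'P', pts: 3 }, // decoded early because the B-frames reference it
  { type: 'B', pts: 1 },
  { type: 'B', pts: 2 },
];

// The renderer presents frames sorted by PTS, not in decode order.
function toDisplayOrder(frames) {
  return [...frames].sort((a, b) => a.pts - b.pts);
}

toDisplayOrder(decodeOrder).map(f => f.type).join(' '); // → "I B B P"
```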

Decoding also yields various other products that we will not introduce here; interested readers can check the references at the end of this article.

3.4 Remux (remultiplexing)

Where there is demux, there is remux. Remuxing combines the elementary streams, audio ES, video ES, subtitle ES, and so on, back into a complete multimedia file. To change a video's container format or its encoding, remux and demux have to work together. We will not expand on this here.

3.5 Rendering

Rendering means playing the decoded data on the machine's hardware (display, speakers). The module responsible for rendering is called the renderer. Mainstream video renderers include EVR (Enhanced Video Renderer) and madVR (Madshi Video Renderer); web players generally embed the picture with the video tag. Custom rendering is also possible: taking our H.265 player as an example, we use browser-provided interfaces to implement a simulated video tag, rendering through canvas and Audio.
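A minimal sketch of such a simulated video tag's render loop, assuming a hypothetical `getNextFrame()` that returns `ImageData` from a JS/WASM decoder (audio would be played separately through the Web Audio API):

```javascript
// Draw decoded frames onto a <canvas> on every animation tick.
function startRenderLoop(canvas, getNextFrame) {
  const ctx = canvas.getContext('2d');
  function draw() {
    const frame = getNextFrame(); // ImageData, or null if none is ready yet
    if (frame) ctx.putImageData(frame, 0, 0);
    requestAnimationFrame(draw);
  }
  requestAnimationFrame(draw);
}
```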

4. Web media technology

4.1 WebRTC


WebRTC (Web Real-Time Communication) allows a web application or site to establish peer-to-peer connections between browsers to transmit video and/or audio streams, or any other data, quickly and without an intermediary.



Composition form:

VideoEngine, VoiceEngine, Session Management, iSAC, VP8, APIs (Native C++ API, Web API)
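A minimal publish-side sketch of the Web API; signaling (exchanging the offer/answer and ICE candidates) is application-specific and omitted, and `sendOfferToPeer` is a hypothetical callback:

```javascript
// Capture camera and microphone, attach the tracks to a peer connection,
// and produce an SDP offer to send over your signaling channel.
async function startPublish(sendOfferToPeer) {
  const pc = new RTCPeerConnection();
  const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
  for (const track of stream.getTracks()) pc.addTrack(track, stream);
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  sendOfferToPeer(offer); // hand the offer to the remote peer
  return pc;
}
```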



4.2 MSE

Anyone who has worked on a player will be familiar with MSE. The Media Source Extensions API (MSE) provides plugin-free, web-based streaming media functionality. With MSE, media streams can be created in JavaScript and played through audio and video elements, which greatly extends the browser's media playback capabilities. It is used in adaptive bitrate streaming and time-shifted live streaming, among other applications. We will not describe its full usage here; if you are interested, searching for MSE will turn up plenty of material.
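A minimal MSE sketch; the `mimeCodec` string and segment URLs are placeholders supplied by the caller, and real players manage buffering far more carefully:

```javascript
// Create a MediaSource, attach it to a <video>, and append fetched
// segments into a SourceBuffer one at a time.
function playWithMse(video, mimeCodec, segmentUrls) {
  const mediaSource = new MediaSource();
  video.src = URL.createObjectURL(mediaSource);
  mediaSource.addEventListener('sourceopen', async () => {
    const sourceBuffer = mediaSource.addSourceBuffer(mimeCodec);
    for (const url of segmentUrls) {
      const data = await (await fetch(url)).arrayBuffer();
      await new Promise(resolve => {
        sourceBuffer.addEventListener('updateend', resolve, { once: true });
        sourceBuffer.appendBuffer(data);
      });
    }
    mediaSource.endOfStream();
  });
}
```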

4.3 WebXR

XR is short for Extended Reality, which covers VR (virtual reality), AR (augmented reality), and MR (mixed reality). WebXR supports a wide variety of XR devices and allows developers to create immersive content that runs across VR/AR devices for a web-based VR/AR experience.

4.4 WebGL

WebGL (Web Graphics Library) is a 3D drawing standard that allows user interaction. By binding JavaScript to OpenGL ES 2.0, WebGL provides hardware-accelerated 3D rendering for the HTML5 canvas, so web developers can use the system's graphics card to display 3D scenes and models smoothly in the browser and build complex navigation and data visualizations. WebGL renders on top of canvas. As we saw in the "Player" chapter, a player can render its picture through canvas, and WebGL can be used to improve the smoothness of that rendering.
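For example, a frame of a playing video element can be uploaded as a WebGL texture on each tick, letting shaders post-process the picture; a fragment of that idea (the full pipeline of shaders and draw calls is omitted):

```javascript
// Upload the current frame of a <video> element into a WebGL texture.
// An HTMLVideoElement is a valid pixel source for texImage2D in WebGL.
function uploadVideoFrame(gl, texture, video) {
  gl.bindTexture(gl.TEXTURE_2D, texture);
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, video);
}
```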

4.5 WebAssembly






WebAssembly (WASM) is a new format that is portable, compact, fast to load, and web-compatible. It is a new specification created by a W3C community group that includes the major browser vendors.

Those who are interested can go to webassembly.org/ to learn more.

Based on WASM, a player can work with FFmpeg to decode H.265 video, which browsers currently cannot play natively.
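Loading such a decoder module generally looks like the sketch below; the file name and export names are hypothetical:

```javascript
// Stream-compile a WASM module (e.g. an FFmpeg-based H.265 decoder) and
// return its exports. The import object supplies any host functions the
// module expects.
async function loadWasmDecoder(wasmUrl, imports = {}) {
  const { instance } = await WebAssembly.instantiateStreaming(fetch(wasmUrl), imports);
  return instance.exports; // e.g. hypothetical decodeFrame(), flush()
}
```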

5. Open source products and frameworks

There are many popular open-source products and frameworks on the market; we have collected some excellent mainstream ones for you.

5.1 flv.js


flv.js is an HTML5 FLV player open-sourced by Bilibili. Built on the HTTP-FLV streaming protocol, it remuxes FLV in pure JavaScript so that FLV-format files can be played on the web.





Official GitHub: Github.com/bilibili/fl…
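Typical flv.js usage, following the project README; the video element and stream URL are placeholders:

```javascript
// Create an flv.js player, bind it to a <video> element, and start playback.
// Assumes flv.js is loaded and exposes the global `flvjs`.
function createFlvPlayer(videoElement, url) {
  if (!flvjs.isSupported()) return null; // needs MSE support
  const player = flvjs.createPlayer({ type: 'flv', isLive: true, url });
  player.attachMediaElement(videoElement);
  player.load();
  player.play();
  return player;
}
```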

5.2 hls.js


hls.js is a JS library built on the HTTP Live Streaming protocol that uses Media Source Extensions to play HLS on the web.

It is worth mentioning that, because the HLS protocol was proposed by Apple and is widely supported on mobile devices, it is widely used in live-streaming scenarios.





Official GitHub: Github.com/video-dev/h…
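Typical hls.js usage; the element and URL are placeholders. Note the native fallback for Safari, which plays HLS directly on the video element:

```javascript
// Attach an HLS stream to a <video> element, using hls.js where MSE is
// available and falling back to native HLS support otherwise.
function createHlsPlayer(videoElement, src) {
  if (Hls.isSupported()) {
    const hls = new Hls();
    hls.loadSource(src);
    hls.attachMedia(videoElement);
    return hls;
  }
  videoElement.src = src; // Safari plays HLS natively
  return null;
}
```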

5.3 video.js






video.js is an HTML5-based player that supports both HTML5 and Flash playback. It has more than 100 plugins, covering needs such as HLS and DASH playback, theme customization, and subtitle extensions, and it is used in a large number of scenarios worldwide.

Official GitHub: Github.com/videojs/vid…

Official documentation: videojs.com/
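A minimal video.js setup; the element id and the source URL are placeholders:

```javascript
// Initialize video.js on an existing <video id="my-video"> element with an
// HLS source.
function initPlayer() {
  return videojs('my-video', {
    controls: true,
    autoplay: false,
    sources: [{ src: 'https://example.com/live/index.m3u8', type: 'application/x-mpegURL' }],
  });
}
```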

5.4 FFmpeg






FFmpeg is the leading multimedia framework: an open-source, cross-platform multimedia solution providing audio and video encoding, decoding, transcoding, muxing, demuxing, streaming, filtering, playback, and more.

Official website: ffmpeg.org/



For the front end, FFmpeg can be used:

  • JS player: based on FFmpeg and WebAssembly you can implement a browser-side JS player, or extend other browser-side audio and video capabilities.
  • Node module fluent-ffmpeg: a very practical Node.js module that wraps FFmpeg's complex command-line operations; it pairs well with file uploads and video stream processing. For more details, see fluent-ffmpeg.
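A small fluent-ffmpeg sketch (assumes the fluent-ffmpeg package and an ffmpeg binary are installed; the paths are placeholders):

```javascript
// Transcode any input file to H.264/AAC in an MP4 container.
function transcodeToMp4(inputPath, outputPath, done) {
  const ffmpeg = require('fluent-ffmpeg'); // required lazily; only runs when called
  ffmpeg(inputPath)
    .videoCodec('libx264')
    .audioCodec('aac')
    .format('mp4')
    .on('end', () => done(null))
    .on('error', err => done(err))
    .save(outputPath);
}
```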

5.5 OBS

OBS (Open Broadcaster Software) is a free and open-source software package for recording and live streaming. Written in C and C++, OBS provides real-time source and device capture, scene composition, encoding, recording, and broadcasting. Data transfer is done primarily over the Real-Time Messaging Protocol (RTMP) and can be sent to any RTMP-enabled destination, including streaming sites such as YouTube, Twitch.tv, Instagram, and Facebook. For video encoding, OBS can encode video streams into H.264/MPEG-4 AVC and H.265/HEVC using the x264 free software library, Intel Quick Sync Video, Nvidia NVENC, and AMD's video coding engines. Audio can be encoded with the MP3 or AAC codecs. Advanced users can choose any codec and container in libavcodec/libavformat, and can also output the stream to a custom FFmpeg URL.

5.6 MLT






MLT is a non-linear video editing engine that can be used in many types of applications, not only on the desktop but also on Android, iOS, and other platforms.

Official website: www.mltframework.org/

GitHub: Github.com/mltframewor…

About us

Due to space constraints, this article only introduces the technology behind Taobao Live; for more technical details, refer to the Taobao Live knowledge documentation: www.yuque.com/webmedia/ha…

Who we are

We are the multimedia front-end team, mainly responsible for multimedia businesses such as Taobao Live and short video. We carry out continuous research and practice in areas such as low-latency live push/pull streaming, video playback, openness, intelligence, and interactivity in multimedia streaming, focusing on audio/video and web media technology, and we are committed to building China's top multimedia front-end technical team.

Contact us


If you are interested in our team, please write to us: [email protected]