At the end of June, Qiniu Cloud released LiveNet, a real-time streaming network for live video, together with a complete live streaming cloud solution. Many developers are interested in the details and usage scenarios of this network and solution. Drawing on our practice with LiveNet and the live streaming cloud solution, we are using a series of seven articles to systematically introduce the key technologies behind today's popular live streaming, to help live streaming entrepreneurs gain a more comprehensive and in-depth understanding of the technology and make better technology choices.

The outline of this series is as follows: (1) acquisition; (2) processing; (3) encoding and packaging; (4) streaming and transmission; (5) latency optimization; (6) how modern players work; (7) SDK performance test model.

In the last article, on latency optimization, we shared some simple and practical tuning tips. This is part six of the live video technology series: how modern players work.

In recent years, the growing demand for multi-platform support has driven the rise of adaptive bitrate streaming, forcing web and mobile developers to rethink how video playback works. At first, the giants each released their own protocols, such as HLS, HDS, and Smooth Streaming, hiding all the details inside their proprietary SDKs. Developers could not freely change the logic of the media engine inside the player: you could not change the adaptive bitrate rules or the cache size, or even the length of your segments. These players may be simple to use, but they offer little room for customization, and you have to put up with their shortcomings.

However, as application scenarios multiplied, the demand for customizable functionality grew stronger. Between live and on-demand alone, there are differences in buffer management, ABR policies, and cache policies. These requirements led to a set of lower-level APIs for manipulating multimedia: NetStream on Flash, Media Source Extensions on HTML5, MediaCodec on Android, and a standard HTTP-based streaming format called MPEG-DASH. These lower-level capabilities give developers greater flexibility to build players and multimedia engines tailored to their business needs.

Today we are going to share how to build a modern player and what key components are needed to build one. In general, a typical player can be broken down into three parts: the UI, the multimedia engine, and the decoder, as shown in Figure 1:




Figure 1. Modern player architecture

User Interface (UI): This is the top layer of the player. It defines the end user's viewing experience through three distinct parts: the skin (the look and feel of the player), the UI logic (all user-facing, customizable features such as playlists and social sharing), and the business logic (features specific to your business, such as advertising, device compatibility logic, and authentication management).

Multimedia engine: This handles all the logic related to playback control, such as parsing description files, pulling video segments, and setting and switching adaptive bitrate rules, which will be explained in more detail below. Because these engines are typically tightly bound to a platform, several different engines may be needed to cover all platforms.

Decoder and DRM Manager: The lowest layer of the player is the decoder and DRM manager, which directly call the APIs exposed by the operating system. The main function of the decoder is to decode and render the video content, while the DRM manager controls playback through the decryption process.

We’ll use examples to illustrate the different roles each layer plays.

I. User Interface (UI)

The UI layer is the top layer of the player and controls what your users can see and interact with, while customizing it with your own brand to provide a unique user experience for your users. This layer is closest to what we call front-end development. Inside the UI, we also include the business logic components that make up the uniqueness of your playback experience, even though the end user can’t directly interact with this functionality.

The UI consists of three main components:

1. The skin

Skins are a general term for the visually relevant parts of a player: progress bar, buttons, animated icons, and so on, as shown in Figure 2. Like most design-related components, this part is implemented with CSS and can be easily integrated by designers or developers (even if you are using JW Player or Bitdash as an all-in-one solution).




Figure 2. Player skin

2. The UI logic

The logic part of the UI defines all the visible interactions between playback and the user: playlists, thumbnails, channel selection, social media sharing, and so on. Many other features can be added to this part, depending on the playback experience you want to build, and many of them already exist as plug-ins; for inspiration: github.com/videojs/vid… The logic part contains many functions. Rather than going through them all in detail, we will take the UI of the Eurosport player as an example to get a direct feel for them.




Figure 3. Eurosport player user interface

As can be seen from Figure 3, in addition to the traditional UI elements, there is another very interesting feature: when a user watches DVR streaming media, the live broadcast is shown in a small window through which the viewer can return to the live stream at any time. Since the UI layout and the multimedia engine are completely separate, such features can be implemented in HTML5 using dash.js in just a few lines of code. For the UI part, the best approach is to add the various features to the UI core module as plug-ins/modules.
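To give a rough sense of how little glue code the UI layer needs once the engine is separate, here is a minimal sketch of attaching dash.js to a plain HTML5 video element. The element ID and the stream URL are hypothetical; real projects will wire this into their own UI plug-in system.

```typescript
import * as dashjs from "dashjs";

// Hypothetical element ID and manifest URL; the point is that the UI layer
// only needs a few lines to hand an MPEG-DASH manifest to the engine.
const liveView = document.getElementById("live-pip") as HTMLVideoElement;
const player = dashjs.MediaPlayer().create();

// The engine takes care of manifest parsing, segment download, and ABR;
// the UI just decides where (and how small) the video element is rendered.
player.initialize(liveView, "https://example.com/live/stream.mpd", true);
```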

3. Business logic

In addition to the two “visible” parts above, there is an invisible part that makes your business unique: authentication and payment, channel and playlist retrieval, advertising, and so on. There are also some purely technical pieces, such as the A/B test module and the device-related configuration used to select among multiple media engines on different types of devices.

To get a sense of the underlying complexity, let's take a closer look at these modules:

Device detection and configuration logic: This is one of the most important features, because it separates playback from rendering. For example, depending on your browser version, the player may automatically choose the HTML5 MSE-based multimedia engine hls.js, or the Flash-based playback engine flashls, to play HLS video streams. The best part is that no matter which underlying engine you use, you can use the same JavaScript or CSS on top to customize your UI or business logic.

The ability to detect the user's device lets you configure the end-user experience on demand: if you are playing on a mobile device rather than a 4K screen, you might want to start at a lower bitrate.
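A minimal sketch of such capability detection is shown below. The fallback order, the stream URL, and the flashls embed (not shown) are assumptions for illustration, not a prescribed implementation; hls.js does expose the `Hls.isSupported()` check used here.

```typescript
import Hls from "hls.js";

type EngineChoice = "native-hls" | "mse-hlsjs" | "flash-flashls";

// Pick a playback engine for an HLS stream based on what the device supports.
function chooseEngine(video: HTMLVideoElement): EngineChoice {
  // Safari / iOS can play HLS natively without any JS engine.
  if (video.canPlayType("application/vnd.apple.mpegurl")) {
    return "native-hls";
  }
  // hls.js exposes a static check for Media Source Extensions support.
  if (Hls.isSupported()) {
    return "mse-hlsjs";
  }
  // Last resort: a Flash-based engine such as flashls (embed code not shown).
  return "flash-flashls";
}

const video = document.querySelector("video")!;
if (chooseEngine(video) === "mse-hlsjs") {
  const hls = new Hls();
  hls.loadSource("https://example.com/live/stream.m3u8"); // hypothetical URL
  hls.attachMedia(video);
}
```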

A/B testing logic: A/B testing lets you roll out changes to a subset of users in production (a grayscale release). For example, you might give some Chrome users a new button or a new multimedia engine, and still be able to verify that everything works as planned.
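One simple way to implement such bucketing is to hash a stable user identifier into a percentage, as in the sketch below. The experiment names, threshold, and hash are purely illustrative.

```typescript
// Deterministically assign a user to an experiment bucket so the same
// user always sees the same variant across sessions.
function bucketFor(
  userId: string,
  experiment: string,
  rolloutPercent: number
): "variant" | "control" {
  const input = `${experiment}:${userId}`;
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    hash = (hash * 31 + input.charCodeAt(i)) >>> 0; // simple 32-bit rolling hash
  }
  return hash % 100 < rolloutPercent ? "variant" : "control";
}

// Example: give 5% of users the new multimedia engine.
const useNewEngine = bucketFor("user-42", "new-engine", 5) === "variant";
```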

Advertising (optional): Handling advertising on the client side is one of the most complex pieces of business logic. As the flowchart of the videojs-contrib-ads plug-in shows, inserting an ad involves several steps. For HTTP video streaming, you will more or less rely on existing standards such as VAST, VPAID, or Google IMA, which pull video ads (often in outdated, non-adaptive formats) from ad servers and play them before, during, and after the video, without the possibility of skipping.

Conclusion:

Depending on your customization needs, you may choose JW Player, which ships with all the classic features (and also lets you customize some of them), or an open source player like Video.js and build your own features on top. To unify the user experience between the browser and native players, you can even consider using React Native for UI or skin development and Haxe for business logic development; these excellent libraries let you share the same code base across many different types of devices.

Figure 4. Flow chart of business logic

II. Multimedia engine

In recent years, the multimedia engine has emerged as a new, independent component in the player architecture. In the MP4 era, the platform handled all playback-related logic and exposed only a few multimedia-processing features (play, pause, seek, full-screen mode, and so on) to developers.

However, the new HTTP-based streaming formats require a whole new set of components to handle and control the new complexity: parsing declaration files, downloading video segments, adaptive bitrate monitoring and decision-making, and more. Initially, the complexity of ABR was handled by the platform or device provider. However, with the growing need for fine-grained control and player customization, lower-level APIs slowly opened up on new platforms (such as Media Source Extensions on the Web, NetStream on Flash, and MediaCodec on Android), which quickly gave rise to powerful and robust multimedia engines built on top of them.




Figure 5. Data flow diagram of Shaka Player, a multimedia processing engine provided by Google

We’ll go through the details of the components of a modern multimedia processing engine:

1. Declaration file interpreter and parser

In an HTTP-based video stream, everything starts with a description file. This declaration file contains the meta information the player needs to understand what is available on the media server: how many different video qualities, languages, subtitles, and so on there are, and what they are. The parser extracts this information from XML files (or, in the case of HLS, a special M3U8 file) and derives the correct video information from it. Of course, there are many kinds of media servers, and not all of them implement the specification correctly, so the parser may also have to work around implementation bugs.

Once the video information has been extracted, the parser builds from it an abstract representation of the stream and works out how to retrieve the different video segments. In some multimedia engines, this representation takes the form of an abstract multimedia graph onto which the specific characteristics of the different HTTP streaming formats are mapped.

In a live streaming scenario, the parser must also periodically re-fetch the declaration file to obtain the latest video segment information.
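To make the parser's job concrete, here is a simplified sketch that extracts the variant streams (bandwidth and URI) from an HLS master playlist. Real parsers, such as the ones in hls.js or Shaka Player, handle many more attributes and server quirks than this.

```typescript
interface Variant {
  bandwidth: number; // advertised peak bitrate in bits per second
  uri: string;       // media playlist for this quality level
}

// Parse the #EXT-X-STREAM-INF entries of an M3U8 master playlist.
function parseMasterPlaylist(m3u8: string): Variant[] {
  const lines = m3u8.split(/\r?\n/);
  const variants: Variant[] = [];
  for (let i = 0; i < lines.length; i++) {
    if (lines[i].startsWith("#EXT-X-STREAM-INF:")) {
      const match = /BANDWIDTH=(\d+)/.exec(lines[i]);
      const uri = lines[i + 1]?.trim();
      if (match && uri && !uri.startsWith("#")) {
        variants.push({ bandwidth: Number(match[1]), uri });
      }
    }
  }
  // Sort from lowest to highest quality so the ABR controller can index it.
  return variants.sort((a, b) => a.bandwidth - b.bandwidth);
}
```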

2. Downloader (downloads declaration files, multimedia segments, and keys)

The downloader is a module that wraps the native APIs for making HTTP requests. It is used not only to download multimedia segments, but also to download declaration files and DRM keys when necessary. The downloader plays an important role in handling network errors and retries, and in collecting data about the currently available bandwidth.

Note: Multimedia files can be downloaded using HTTP or other protocols, such as WebRTC in peer-to-peer real-time communication scenarios.
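A minimal sketch of such a downloader, assuming the Fetch API is available, might retry on failure and report a throughput sample for each completed request, roughly as follows:

```typescript
interface DownloadResult {
  data: ArrayBuffer;
  bitsPerSecond: number; // measured throughput for this request
}

// Download a segment (or manifest/key) with simple retry logic and
// report the observed bandwidth to whoever is estimating capacity.
async function download(url: string, retries = 3): Promise<DownloadResult> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const start = performance.now();
      const response = await fetch(url);
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      const data = await response.arrayBuffer();
      const seconds = (performance.now() - start) / 1000;
      return { data, bitsPerSecond: (data.byteLength * 8) / seconds };
    } catch (err) {
      if (attempt === retries) throw err;
      // Back off a little before retrying (500 ms, 1 s, 1.5 s, ...).
      await new Promise((r) => setTimeout(r, 500 * (attempt + 1)));
    }
  }
  throw new Error("unreachable");
}
```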

3. Streaming playback engine

The streaming playback engine is the central module that interacts with the decoder API. It feeds the different multimedia segments into the decoder and handles multi-bitrate switching and playback quirks (such as discrepancies between the declaration file and the video segments, or automatic frame skipping when playback stalls).
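On the Web, this interaction with the browser's decoder happens through Media Source Extensions. The sketch below shows the basic MSE plumbing for appending fMP4 segments in sequence; the codec string and the segment URLs are assumptions, and a real engine would interleave downloads, buffer management, and quality switches rather than loop linearly.

```typescript
// Feed fMP4 segments to the browser's decoder through Media Source Extensions.
async function startPlayback(video: HTMLVideoElement, segmentUrls: string[]) {
  const mediaSource = new MediaSource();
  video.src = URL.createObjectURL(mediaSource);

  // Wait until the MediaSource is attached and ready to accept buffers.
  await new Promise<void>((resolve) =>
    mediaSource.addEventListener("sourceopen", () => resolve(), { once: true })
  );

  // The codec string must match the actual content; this one is an example.
  const sourceBuffer = mediaSource.addSourceBuffer(
    'video/mp4; codecs="avc1.64001f,mp4a.40.2"'
  );

  for (const url of segmentUrls) {
    const segment = await (await fetch(url)).arrayBuffer();
    sourceBuffer.appendBuffer(segment);
    // Wait until the decoder has consumed the append before pushing more.
    await new Promise<void>((resolve) =>
      sourceBuffer.addEventListener("updateend", () => resolve(), { once: true })
    );
  }
  mediaSource.endOfStream();
}
```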

4. Resource quality parameter estimator (bandwidth, CPU, frame rate, etc.)

The estimator collects data from several dimensions (chunk size, download time per segment, dropped frames) and aggregates it to estimate the bandwidth and CPU power available to the user. This output is then used by the ABR (adaptive bitrate) switching controller to make its decisions.
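A common way to aggregate throughput samples is an exponentially weighted moving average (EWMA). The sketch below illustrates the idea with an arbitrary smoothing factor; it is not any particular engine's implementation.

```typescript
// Exponentially weighted moving average of throughput samples.
// Recent downloads count more than old ones, so the estimate reacts
// to network changes without jumping on every noisy sample.
class BandwidthEstimator {
  private estimate: number | null = null;

  constructor(private readonly alpha = 0.3) {} // smoothing factor, 0..1

  addSample(bitsPerSecond: number): void {
    this.estimate =
      this.estimate === null
        ? bitsPerSecond
        : this.alpha * bitsPerSecond + (1 - this.alpha) * this.estimate;
  }

  // Current bandwidth estimate in bits per second, or null before
  // any segment has been downloaded.
  current(): number | null {
    return this.estimate;
  }
}
```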

5. The ABR switching controller

The ABR switching controller is probably the most critical part of the multimedia engine, and often the most overlooked. The controller reads the data output by the estimator (bandwidth and dropped frames) and uses a custom algorithm to decide, based on this data, whether to tell the streaming engine to switch video or audio quality. There is a lot of research in this area; the biggest challenge is finding the right balance between the risk of rebuffering and the frequency of switching (switching too often hurts the user experience).
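As a purely illustrative example of this trade-off, a very simple rule-based controller might pick the highest rendition whose bitrate fits under the estimated bandwidth with a safety margin, and only switch upward when the margin is comfortable, to limit oscillation. Production algorithms also weigh buffer level, dropped frames, and segment duration.

```typescript
// Pick a rendition index, given renditions sorted by ascending bitrate.
// safetyFactor < 1 leaves headroom so a small bandwidth dip does not
// immediately cause rebuffering; requiring extra headroom before an
// upward switch reduces oscillation between two adjacent qualities.
function selectRendition(
  bitrates: number[],          // bits per second, ascending
  estimatedBandwidth: number,  // from the estimator
  currentIndex: number,
  safetyFactor = 0.8,
  upSwitchMargin = 1.2
): number {
  let candidate = 0;
  for (let i = 0; i < bitrates.length; i++) {
    if (bitrates[i] <= estimatedBandwidth * safetyFactor) candidate = i;
  }
  // Only move up if the estimate clearly supports the higher rendition.
  if (
    candidate > currentIndex &&
    bitrates[candidate] * upSwitchMargin > estimatedBandwidth * safetyFactor
  ) {
    return currentIndex;
  }
  return candidate;
}
```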

6. DRM Manager (Optional)

All of today's paid video services rely on DRM, which is largely platform- or device-dependent, as we will see later when we discuss players. The DRM manager in the multimedia engine is a wrapper around the content decryption APIs of the lower-level decoder. Whenever possible, it tries to abstract away differences between browser and operating system implementations. This component is usually tightly coupled to the streaming engine because it interacts frequently with the decoder layer.

7. Format conversion multiplexer (optional component)

As we will see later, each platform has its own limitations in terms of container and codec (Flash reads H.264/AAC wrapped in FLV containers, MSE reads H.264/AAC wrapped in ISOBMFF containers). As a result, some video segments need to be repackaged before they can be decoded. For example, with an MPEG2-TS to ISOBMFF format conversion multiplexer, hls.js can play HLS video streams through MSE. Format conversion at the multimedia engine level has been questioned, but with the performance of modern JavaScript and Flash interpreters, the performance loss is negligible and the impact on the user experience is minimal.

There are many other components and features in the multimedia engine, from subtitles to screenshots to ad insertion and more. We will also write a separate article comparing different engines and giving some practical guidance on engine selection through testing and market data. It is important to note that to build a player that is compatible across platforms, you need to provide multiple, freely interchangeable multimedia engines, because the underlying decoders are platform-specific, which is what we will focus on next.

III. Decoder and DRM manager

Decoders and DRM managers are tightly coupled to the operating system platform, for reasons of decoding performance (decoders) and security (DRM).




Figure 6. Decoder, renderer, and DRM workflow flowchart

1. The decoder

The decoder handles the logic associated with the lowest level of playback. It unpacks video in different formats, decodes its content, and then renders the decoded video frame to the operating system for the end user to see.

As video compression algorithms become more and more complex, decoding is a computationally intensive process, and to guarantee decoding performance and a smooth playback experience, it needs to rely heavily on the operating system and hardware. Most decoding today relies on GPU acceleration (which is one reason why the free and more efficient VP9 codec has not taken market share from H.264). Without GPU acceleration, decoding a 1080p video can consume around 70% of the CPU, and frame loss can be severe.

On top of decoding and rendering video frames, this layer also exposes a native buffer that the multimedia engine can interact with directly, to know its size in real time and flush it if necessary.
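On the Web, for example, the engine can inspect that buffer through the `buffered` TimeRanges exposed on the video element. A small sketch of reading the amount of forward buffer at the current playhead:

```typescript
// Seconds of media buffered ahead of the current playback position.
function forwardBufferLength(video: HTMLVideoElement): number {
  const { buffered, currentTime } = video;
  for (let i = 0; i < buffered.length; i++) {
    if (buffered.start(i) <= currentTime && currentTime <= buffered.end(i)) {
      return buffered.end(i) - currentTime;
    }
  }
  return 0; // playhead is outside every buffered range
}
```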

As mentioned earlier, each platform has its own rendering engine and APIs: Flash has NetStream, Android has the MediaCodec API, and the Web has the standard Media Source Extensions. MSE is gaining traction and may become the de facto standard on platforms beyond the browser in the future.

2. DRM manager




Figure 7. DRM manager

Today, DRM is a necessity for delivering paid content produced by studios. This content must be protected from theft, so the DRM code and its inner workings are hidden from end users and developers. Decrypted content never leaves the decoding layer and therefore cannot be intercepted.

To standardize DRM and provide interoperability across platform implementations, several Web giants joined forces to create the Common Encryption standard (CENC) and Encrypted Media Extensions (EME), building a common set of APIs across multiple DRM providers (for example, EME can talk to PlayReady on Edge and Widevine on Chrome) that can read the content encryption keys from the DRM license module and use them for decryption.

CENC defines a standard encryption and key mapping method that allows the same content to be decrypted on multiple DRM systems, as long as the same keys are provided.

Inside the browser, based on the meta information in the video content, EME identifies which DRM system the content is encrypted with and invokes the corresponding Content Decryption Module (CDM) to decrypt the CENC-encrypted content. The CDM handles content licensing, obtaining the keys, and decrypting the video content.
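A much-simplified sketch of this EME flow is shown below. The key system string, capabilities, and the license exchange are assumptions that vary per DRM provider; the actual request to the license server is left as a comment.

```typescript
// Ask the browser for a CDM that can handle Widevine-protected content,
// attach it to the video element, and create a session when the media
// signals that it is encrypted. The license exchange itself is omitted.
async function setupDrm(video: HTMLVideoElement) {
  const access = await navigator.requestMediaKeySystemAccess("com.widevine.alpha", [
    {
      initDataTypes: ["cenc"],
      videoCapabilities: [{ contentType: 'video/mp4; codecs="avc1.64001f"' }],
    },
  ]);

  const mediaKeys = await access.createMediaKeys();
  await video.setMediaKeys(mediaKeys);

  video.addEventListener("encrypted", async (event) => {
    const session = mediaKeys.createSession();
    // The "message" event carries the license request to send to the
    // DRM license server; the response goes back via session.update().
    session.addEventListener("message", (msg) => {
      // sendToLicenseServer(msg.message).then((license) => session.update(license));
    });
    await session.generateRequest(event.initDataType, event.initData!);
  });
}
```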

CENC does not specify the details of license requests, license formats, license storage, or the mapping of usage rules and permissions; all of these are handled by the DRM provider.

IV. Conclusion

Today, we have taken an in-depth look at the three layers of a video player. The most outstanding feature of this modern player architecture is that the interactive part is completely separated from the multimedia engine logic, so that broadcasters can seamlessly and freely customize the end-user experience, while using different multimedia engines on different terminals still guarantees smooth playback of content in different formats.

On the Web, MSE and EME are becoming the new playback standards, backed by mature libraries such as dash.js, Shaka Player, and hls.js, and are being adopted by more and more influential players. In recent years, attention has also begun to shift to set-top boxes and internet TVs, and we are seeing more and more of these new devices use MSE as their underlying multimedia processing engine. We will continue to put more effort into supporting these standards.

Translated by He Lishi, Qiniu Cloud evangelist. Original article: blog.streamroot.io/how-modern-…