Background: Today we are going to share the topic of "Media Intelligence: Interactive Practice of Taobao Live Streaming Media". The content is divided into five parts. First, we look at how anchors can pay New Year greetings to users in the Taobao Live broadcast room, and then how to make a gesture-triggered New Year greeting effect. Next, we introduce the overall scheme design of media intelligence and one of its core pieces, the implementation of an editor like Mediaai Studio. Finally, we talk about the direction of our future construction.

How can anchors pay New Year greetings in the broadcast room?

The New Year is coming soon, and we all pay New Year's greetings to relatives and friends. But in the broadcast room, how does an anchor greet the users? Ahead of this year's Spring Festival, we did a project that lets anchors pay New Year greetings to their fans in the broadcast room, and created some Spring Festival atmosphere effects there.

The specific design is as follows: during the live broadcast, we recognize the anchor's New Year greeting gesture in real time to trigger the rendering of Spring Festival atmosphere effects, and we recognize the anchor's face in real time so that face props can follow it.

The illustrations above show several of these effects: the anchor can make a heart or New Year greeting gesture to trigger floating decorative text, couplets or fireworks in the broadcast room, and we can also put a God of Wealth hat and other face props on the anchor's face to enhance the festive atmosphere of the broadcast room.

Making the gesture-triggered New Year greeting effect

So how are these effects made for use during a live broadcast? The following describes in detail how such an effect for the broadcast room is produced.

It is generally divided into four steps. The first step is for designers to make static or dynamic materials with design software, such as the God of Wealth hat and its related micro-animations; the deliverable can be a sequence-frame animation. The second step is for designers to assemble a complete material package in our self-developed MAI editor, where they can do frame adaptation, face-recognition following, gesture trigger conditions, real-time local effect preview and so on. The third step is to package the edited material and upload it to the content material platform. The last step is use on the production side: the anchor enables a material package at the push-stream end, and during streaming the gestures and faces are recognized in real time, the effects are rendered and merged into the stream, and the result is distributed to users to watch.

For example, to configure the editor to recognize a gesture and trigger a specific effect, the general operation flow is as follows: first add a sticker module and upload the replacement material image, which can be a sequence frame; then adjust the sticker's position and size, and in the playback settings choose the trigger condition, here the New Year greeting gesture. How do you check the effect? Select a pre-made video in the preview panel on the right, and you can see the flower effect triggered by the New Year greeting gesture:

Another example: we want to put a hat on the anchor. The general flow is: add a sticker module, select the sequence-frame animation of the hat, adjust its size and position, and choose the forehead as the follow position on the face. Preview the effect with a prefabricated video, and you can see that when the anchor nods, the God of Wealth hat moves along, as if the hat were really on the anchor's head:

After the material is made, uploaded, distributed and enabled, we can see the final effect in the broadcast room:

Media intelligence scheme design

Having introduced a single case, what exactly is the streaming media gameplay of media intelligence as we define it? Next we introduce the overall scheme design of media intelligence in detail. Let's look at another case:

The traditional red envelope rain interaction in the broadcast room overlays an ordinary H5 page on top of the video stream, so it is detached from the stream content. The media-intelligence streaming interaction we want to build renders the materials inside the video stream, so the anchor can control when the red envelope rain falls through gestures. This deeply combines the interaction with the live content and improves the interaction rate and watch time of the broadcast.

Media intelligence, as we define it, is a new kind of streaming media interaction formed by combining AI/AR gameplay with live and short-video streams. From the front-end side, our goal is to build a media intelligence solution covering production through consumption, and at the same time form an engineering link that shortens the production cycle of a streaming media interaction to a "7+3+1" mode: 7 days of algorithm development, 3 days of gameplay coding and 1 day of material production bring a new gameplay online. For example, for this year's Double 11 we spent the full "7+3+1" to develop a new gameplay; in next year's daily operations, we only need one day to make new materials, or three days to change the gameplay logic, to launch a new gameplay.

Next, we break the scheme down into four links: gameplay production, gameplay management, gameplay use and gameplay presentation.

Producers create gameplay with the editor and manage the material packages through the material platform, which is connected to Alive for component management. On the production side, the anchor configures and enables gameplay from the central console; the push-stream side executes the gameplay, merges the rendered result into the stream, and at the same time transmits key-frame information such as material positions and hot areas through SEI. The live container in the broadcast room listens for the SEI and handles the interactive responses.

The scheme design of media intelligence covers two aspects: intelligent materials and interactive gameplay. Intelligent materials are for product managers, operations and designers, providing a one-stop intelligent material production platform; interactive gameplay is for developers, providing a streaming-media interaction IDE that supports code writing, debugging, preview and deployment.

What is an intelligent material? For example, we have produced many intelligent materials through Mediaai Studio, providing users with face props for shooting scenes. For this Spring Festival we also produced some Spring Festival atmosphere materials for live scenes, such as the animated decorations at the top and bottom of the broadcast room, and the effects triggered by recognizing the anchor's New Year greeting gesture.

The technical scheme of intelligent materials is relatively simple. The core is an agreed JSON protocol for module configuration. The underlying layer relies on algorithm capabilities such as face detection, gesture detection and object recognition, on top of which we abstract a module configuration layer including filters, stickers, beautification, text templates and so on. Each module carries its own configuration, such as style, playback and animation. Finally, the material is downloaded on the device, the configuration is parsed, and the underlying rendering and computing engine renders according to the configuration and merges the output into the stream.

Below is an example of the module JSON configuration we defined. You can see the editor version, the canvas frame, the module type configuration, as well as the material resources, position and size information, playback settings, trigger settings, animation settings, and so on.
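As a rough sketch only, a configuration of this shape might look like the following in TypeScript; every field name (editorVersion, frame, modules, trigger and so on) is an illustrative assumption rather than the actual protocol:

```typescript
// Illustrative shape of a material-package configuration; field names are assumptions.
interface StickerModule {
  type: 'sticker';
  resource: { frames: string[]; fps: number };          // sequence-frame images
  layout: { x: number; y: number; width: number; height: number };
  follow?: { target: 'face'; anchor: 'forehead' };      // face-follow position
  trigger?: { gesture: 'new_year_greeting' | 'heart' }; // gesture trigger condition
  animation?: { loop: boolean; durationMs: number };
}

interface MaterialPackage {
  editorVersion: string;
  frame: { width: number; height: number };             // canvas/frame adaptation
  modules: StickerModule[];
}

const greetingEffect: MaterialPackage = {
  editorVersion: '1.0.0',
  frame: { width: 750, height: 1334 },
  modules: [
    {
      type: 'sticker',
      resource: { frames: ['flower_000.png', 'flower_001.png'], fps: 24 },
      layout: { x: 0, y: 200, width: 750, height: 400 },
      trigger: { gesture: 'new_year_greeting' },
      animation: { loop: false, durationMs: 2000 },
    },
  ],
};
```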

What is interactive gameplay? Take two cases from this year's Double 11 on Taobao Live. On the left is the Xiaomi Venti challenge, where the host moves the character with their body to catch the props dropping from the top of the screen. On the right is Pop Mart's career challenge, where the host controls the figure's movement with their face; after collision detection, item points are earned to complete the game logic.

To realize such streaming media interactive gameplay during a live broadcast, the general technical solution is as follows, using the red envelope rain case above to walk through the link. First, an editor such as Mediaai Studio is used to produce the gameplay materials and scripts. Then a new red envelope rain component is created in Alive and bound to the gameplay. The anchor starts the gameplay from the central console, the push-stream end downloads and executes the gameplay scripts and merges the red envelope materials into the stream. The player on the user side obtains the positions of the red envelopes from the SEI key-frame information in the stream, consumes this interaction in the Alive component, and responds to the user's operations by drawing the hot areas.
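As a minimal sketch of the player side of this link: the payload shape and handler below are assumptions for illustration, not the real SEI protocol or component code.

```typescript
// Hypothetical SEI payload carrying red-envelope positions for one frame.
interface EnvelopeFrame {
  timestampMs: number;
  envelopes: { id: string; x: number; y: number; width: number; height: number }[];
}

// Restore clickable hot areas above the video from a parsed SEI payload.
// `container` is assumed to be a DOM layer stacked over the player element.
function renderHotAreas(container: HTMLElement, frame: EnvelopeFrame): void {
  container.innerHTML = ''; // clear the previous frame's hot areas
  for (const env of frame.envelopes) {
    const area = document.createElement('div');
    area.style.cssText =
      `position:absolute;left:${env.x}px;top:${env.y}px;` +
      `width:${env.width}px;height:${env.height}px;`;
    area.onclick = () => {
      // hand the hit over to the interactive component (placeholder handling)
      console.log(`red envelope ${env.id} clicked`);
    };
    container.appendChild(area);
  }
}
```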

Editor Mediaai Studio

We have mentioned the gameplay editor several times; one of our core tasks in the media intelligence link is to build a gameplay production editor.

Based on Electron, we built Mediaai Studio, an editor for gameplay production. Underneath, it relies on the client-side cross-platform rendering and computing engine Race, which integrates the MNN inference framework and the PixelAI algorithm platform, providing the basic capabilities of algorithm recognition and rendering computation.

The main process of Electron is responsible for window management, upgrade services and so on, while the rendering process is responsible for tools such as the module tree and editing panel, as well as real-time rendering. A worker thread is opened inside the rendering process to communicate with Race's Node module and do some image processing. At the functional level, the editor provides project management, material production, gameplay development, account management and other features.
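As a rough illustration of this split (a sketch only; file names and options are assumptions, not the actual Mediaai Studio code):

```typescript
// main.ts: minimal sketch of the main-process responsibilities described above.
import { app, BrowserWindow } from 'electron';

app.whenReady().then(() => {
  const editorWindow = new BrowserWindow({
    width: 1440,
    height: 900,
    webPreferences: {
      // allow the renderer's worker thread to load native Node modules
      nodeIntegrationInWorker: true,
    },
  });
  // the renderer hosts the module tree, editing panel and real-time preview
  editorWindow.loadFile('index.html');
});
```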

We encapsulate Race's C++ module as a .node C extension, expose functions such as frame parsing, background setting and rendering output through N-API, and provide JS scripting ability by calling the C++ module through JSBinding. The rendering part involves a large amount of canvas pixel exchange and output, so we abstract a worker layer under the render layer that handles background updates, frame updates, module updates, buffer updates and so on. The worker and the render layer communicate through JSON protocols and binary data protocols to achieve real-time rendering.
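Below is a minimal sketch of how the worker side of such a setup could look, assuming a hypothetical race.node addon with renderFrame/updateConfig exports; these names are illustrative, not the real Race binding.

```typescript
// render-worker.ts: illustrative worker side of the render/worker split.
declare const require: (id: string) => any; // provided by Electron's node-integrated worker

// `race.node` and its exports (renderFrame, updateConfig) are hypothetical names.
const race = require('./build/Release/race.node');

self.onmessage = (e: MessageEvent) => {
  const msg = e.data;
  if (msg.type === 'frameUpdate') {
    // msg.pixels: ArrayBuffer of RGBA pixels for the current preview frame
    const rendered: ArrayBuffer = race.renderFrame(msg.pixels, msg.width, msg.height);
    // send the rendered pixels back to the render layer, transferring ownership to avoid a copy
    (self as any).postMessage({ type: 'rendered', pixels: rendered }, [rendered]);
  } else if (msg.type === 'moduleUpdate') {
    // module changes (stickers, filters, ...) travel as the JSON protocol mentioned above
    race.updateConfig(JSON.stringify(msg.config));
  }
};
```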

Here is the editor we implemented in action.

From a designer's perspective, intelligent materials can be produced: a fixed sticker at the bottom, a face sticker on the head, and a gesture-triggered sticker:

From a developer's perspective, gameplay scripts can be written in the editor. In this case, an intelligent interaction is realized by controlling the movement trajectory of the bird through face recognition:

Future construction

We are only beginning to explore media intelligence. At present we mainly focus on the tooling perspective: the core is to provide the production of intelligent materials and interactive gameplay through Mediaai Studio, a PC desktop tool. For example, the JS scripts in our interactive gameplay also need to conform to front-end safe-production specifications, so we need to connect the editor with the publishing platform to cover project creation, debugging, code review, release and deployment. Finally, on top of the tools and platform, we want to bring in designers and an ISV ecosystem, and even commercial operation, to rapidly grow the volume of live streaming media interaction and enrich the types of gameplay.

Postscript: D2 live Q&A

Q1: What work does the front end undertake in the field of gameplay effects (in addition to the material editing platform)?

A1: The gameplay link is divided into four parts: gameplay production, gameplay management, gameplay use and gameplay presentation. The core of gameplay production is the Mediaai Studio editor, whose front end is a PC client built with Electron. Gameplay management is mainly the Alive platform. For gameplay use, we provide tools for both PC and app scenes; for PC we also launched an Electron project that combines the streaming capability with the effects and gameplay more deeply. Gameplay presentation is mainly the interactive components in the broadcast room and short videos, and the overall open technology solution is front-end driven. So the front end undertakes work in every link, with the production, use and presentation links carrying the core front-end work.

Q2: How is the detection frequency of the special effects chosen?

A2: Push streaming itself already has a large performance cost, including audio and video capture, encoding, filters and so on, so for the algorithm detection part of the gameplay effects we apply two layers of frequency control. First, the whole gameplay package can be switched on and off, and the corresponding algorithm detection runs only when the anchor or an assistant has explicitly enabled a gameplay. Second, different algorithms have different detection settings; detection is also split into detection frames and follow frames to minimize the performance overhead of gameplay detection.
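A minimal sketch of such two-layer frequency control; detectGesture and trackGesture below are placeholders standing in for the real algorithm calls, and the numbers are illustrative:

```typescript
// Illustrative two-layer frequency control: gameplay switch + detect/follow frames.
interface GestureResult { gesture: string; score: number }

// Placeholders for the real algorithm calls (expensive detection vs. cheap tracking).
const detectGesture = (_frame: ImageData): GestureResult | null =>
  ({ gesture: 'new_year_greeting', score: 0.9 });
const trackGesture = (_frame: ImageData, prev: GestureResult): GestureResult => prev;

let gameplayEnabled = false;  // layer 1: only detect when the anchor enabled a gameplay
const DETECT_EVERY_N = 5;     // layer 2: full detection every N frames, tracking in between
let frameIndex = 0;
let last: GestureResult | null = null;

function onFrame(frame: ImageData): GestureResult | null {
  if (!gameplayEnabled) return null;
  frameIndex += 1;
  if (last === null || frameIndex % DETECT_EVERY_N === 0) {
    last = detectGesture(frame);       // detection frame (expensive)
  } else {
    last = trackGesture(frame, last);  // follow frame (cheap)
  }
  return last;
}
```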

Q3: On which ends are recognition and stream merging implemented? What technologies and protocols does the stream use?

A3: At present, recognition and stream merging are both done on the anchor's push-stream end, on both PC and app. The stream uses traditional live broadcast technologies and protocols: RTMP for pushing, and HLS and HTTP-FLV for pulling.

Q4: Live streaming already has latency; does stream merging increase it? How is the data delay between the pushed picture and user interaction kept in check?

A4: Stream merging does not increase live broadcast latency. If the algorithm runs too slowly to finish processing a frame in time, the live frame rate drops, which the viewer perceives as a choppy picture. As I understand it, the user interaction asked about here refers to interaction by C-end users; in general, C-end interaction and its response are handled entirely on the C-end, and so far we have no case where the push end must respond in real time to a C-end user's interaction. For scenes that do require a high degree of synchronization between picture and content, we guarantee the synchronization of picture and data through SEI plus CDN.

Q5: Can you recommend an open source library for gesture detection?

A5: MediaPipe, open-sourced by Google Research.

Q6: Does recognition significantly increase the size of the front-end package?

A6: No. The front-end package mainly contains material resources and JS scripts; the algorithm models and recognition capability live on the client side and are not included in the front-end package.

Q7: What is the algorithm part of the editor implemented with? TF.js?

A7: Not TF.js. The algorithm capability comes from the MNN inference framework and the PixelAI algorithm platform, and the cross-platform rendering and computing framework Race integrates this capability.

Q8: In the red envelope rain, are the red envelope positions random? How are the hot areas predefined?

A8: They are random. After the script runs on the push end, the position, size, deformation and other information of the drawn red envelopes is transmitted to the player end through SEI key frames. After the player end parses the SEI, the front end restores the corresponding hot areas to respond to the user's operation events.

Q9: How is the execution efficiency of the gameplay code guaranteed?

A9: At present, the gameplay JS calls Race's C++ code through JSBinding, the gameplay picture is rendered by the client, and Race applies JS call optimizations at the bottom layer, so execution efficiency is close to native. As business complexity grows, gameplay development efficiency has become the bottleneck. Next, we are considering using the WebGL-protocol interface provided by Race Canvas to adapt to the mainstream H5 game engines in Taobao, leveraging the mature interaction capabilities and engine ecosystem of H5 games, combined with the unique APIs of multimedia interaction, for streaming media interactive development, so that a gameplay is developed once and runs on multiple rendering engines.

This article is original content from Aliyun and may not be reproduced without permission.