At the 2018 RTC Real-Time Internet Conference, Zhao Lili, technical director of Meitu Cloud Vision, shared Meitu's applications of AI technology in short video. The talk covered three parts: Meitu's short-video business scenarios, the content analysis and retrieval technology built for those scenarios, and the problems encountered together with their solutions, closing with some reflections from the platform-building process. The following is a summary of the speech.


Meitu's representative product in the short video field is the app "Meipai", launched in 2014. In recent years competitors such as Douyin and Kuaishou have emerged, and Meipai has recently repositioned its content around beauty and tutorials, in the hope that users can pick up useful information and knowledge while being entertained.

The technologies involved in a video

A video may pass through many processing techniques over its lifetime. It starts with 2D or 3D capture, then encoding and decoding, along with transmission and storage, followed by editing and processing such as cutting, filter beautification, style transfer, and background segmentation. Next comes information extraction, including object recognition, scene detection, character analysis, behavior recognition, subject extraction, and event detection. Once these steps have been applied across a large number of videos, video retrieval is also needed. Retrieval has two parts: retrieving the content we want from within a given video, and retrieving similar videos from a large database given a query video.

AI technology is applied in Meitu's short video business at two levels: the tool level and the content level.

At the tool level, AI is used to process the video itself, such as beautifying the people in a video, replacing the background, and slimming figures. The content level is about tagging: identifying objects in the video, detecting scenes, and detecting user behavior. Most importantly, once we receive a video, we can use AI to assess its picture quality and whether its content violates the rules. After extracting video features, we carry out video retrieval work to support the businesses around short video, including user profiling, operations, recommendation, and search.

Based on the above business requirements, we built a multimedia content analysis and retrieval platform, which splits into two parts around the content analysis algorithms. The first is the multimedia content analysis platform, responsible for analyzing the characteristics of video content and tagging it. The other is the multimedia data retrieval platform.

Figure: Application architecture of Meitu short video content analysis and retrieval platform

Technical challenges in short video content analysis and retrieval

After getting a video, how do we understand its content? This is a multi-dimensional problem. First and simplest, when we see a video, our first impression is of its tone, texture, style, and picture quality. Going further, we need to know what objects the video contains, where the scene takes place, the characteristics of the people in it, including gender, age, appearance, and clothing, and whether the content violates the rules. Beyond that, there is a deeper level of video content recognition and detection; behavior recognition, for example, is at the forefront of academic research. These are the dimensions Meitu considers when analyzing a video's content.

Based on the above business requirements, we process the video, audio, image, and text, feed them into the multimedia content analysis platform, and then analyze the following four types of information:

  • Basic features: tone, texture, style, picture quality;

  • Character analysis: gender, age, appearance level, hairstyle, dress style;

  • Product analysis: product identification, brand identification;

  • General content analysis: video classification, feature extraction, scene classification, angle detection, object detection, watermark detection, cover selection.


Based on this, the multimedia content analysis platform provides tags, features, and indexes to support business requirements.

Short video data has several characteristics:

  • Video source: mobile phone shooting;

  • Video form: portrait orientation, person-centered, special effects and filters;

  • Video structure: the scene within a single video is fixed;

  • Information dimension: multi-modal information; the visuals and the background audio may be inconsistent;

  • Large amount of data;

  • Contents unknown;

  • Timeliness.

In building this platform, we encountered a number of problems, which boil down to two key issues:

The first is how to define the label system effectively. As mentioned earlier, tags are one form of output of this platform, and we need to decide which labels to produce in order to help the business, so the definition of labels is very important. Deep-learning-based training also requires training data, and how labels are formulated in that training data matters just as much.

The second is how to improve the efficiency of model iteration. Short video data is highly time-sensitive: a model trained two months ago may no longer perform as well today. We therefore need a mechanism that can quickly annotate data and swap models online to support the business stably.

How to define a label system effectively

We have a hot-video pool, and operations and product staff manually tag some of these videos. You might say we could simply train models on these tags. So what goes wrong if we apply business tags directly to the algorithms?

First, business labels are rather abstract. Labels like "funny" or "humorous" may be defined, but whether a video is funny or humorous cannot be judged accurately from visual or audio information alone.

For example, a video of a 3-year-old crying might be posted by a parent as a funny video, while a video of someone in their 20s or 50s crying is more likely a sad one.

Second, the training data is imbalanced. The figure above shows the data volumes for a subset of the business labels we obtained. Since business staff do not consider class balance when defining labels, the training data ends up imbalanced, which also hurts model training.

Another problem is low differentiation between categories. For example, finger dancing and selfies look indistinguishable from a purely visual point of view; forcing them into two categories during training disturbs what the network learns and introduces noise.

A further problem is that labels are one-dimensional. Typically a video is labeled along at most four or five dimensions; with any more, trying to characterize the entire video becomes very complicated.

Our solution is to use business labels as a guide: we extract video and audio features from our video data, plus text features from the accompanying data (video titles and related comments), cluster them all together, and then abstract each cluster into a corresponding visual label element. These label elements are the labels we actually train on. Finally, the labels output by training are mapped back to business labels. Labels defined this way are multi-level and multi-dimensional.
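To make the clustering step concrete, here is a minimal sketch in Python, assuming per-video visual, audio, and text features have already been extracted; the feature dimensions, cluster count, and names such as `video_feats` are illustrative placeholders, not Meitu's actual pipeline.

```python
# Minimal sketch of the label-element discovery step, assuming features exist.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in features: 1000 videos with visual (512-d), audio (128-d), text (128-d)
video_feats = rng.standard_normal((1000, 512))
audio_feats = rng.standard_normal((1000, 128))
text_feats = rng.standard_normal((1000, 128))

# Concatenate the modalities into one descriptor per video
X = np.concatenate([video_feats, audio_feats, text_feats], axis=1)

# Cluster the descriptors; each cluster is then inspected by hand and
# abstracted into a "label element" used for training (cluster count assumed).
N_CLUSTERS = 50
kmeans = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_  # cluster assignment per video
```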

As shown in the figure below, the video contains a girl; pose analysis shows her whole body swaying, and the music is detected as rock. From this we can judge that the video shows a young woman dancing as an amateur performance, which belongs to the talent-show category, so the generated labels are "beauty", "dance", and "talent". This completes the mapping from algorithm labels to business labels.

How to improve the iterative update efficiency of online algorithm models

There are three core issues: fast data annotation, effective and stable model evaluation mechanism, and guaranteed algorithm performance.

For fast annotation, we use an automatic labeling algorithm of the kind described in papers on unsupervised and semi-supervised deep learning. We pre-train a model on general data, annotate a small amount of business data, and train a classifier on that small labeled set. The classifier then labels the remaining unlabeled data, producing a confidence score along with each label. High-confidence data is added to the training set, while low-confidence data goes into the next iteration, and this process is repeated many times. Depending on the difficulty of the task, the loop also includes manual verification and labeling.
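The loop described above amounts to classic self-training. A minimal sketch, assuming scikit-learn and a generic probabilistic classifier; the 0.95 threshold and round count are assumed values, not Meitu's:

```python
# Confidence-based auto-labeling (self-training) loop.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95, rounds=5):
    clf = LogisticRegression(max_iter=1000)
    X_train, y_train = X_labeled.copy(), y_labeled.copy()
    pool = X_unlabeled.copy()
    for _ in range(rounds):
        clf.fit(X_train, y_train)
        if len(pool) == 0:
            break
        proba = clf.predict_proba(pool)
        conf = proba.max(axis=1)          # confidence of each pseudo-label
        pseudo = proba.argmax(axis=1)
        keep = conf >= threshold          # high confidence -> becomes training data
        if not keep.any():
            break                         # nothing confident; hand off to humans
        X_train = np.vstack([X_train, pool[keep]])
        y_train = np.concatenate([y_train, pseudo[keep]])
        pool = pool[~keep]                # low confidence -> next iteration
    return clf
```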

Meitu has a smart labeling service, divided into two parts as shown in the figure below: the offline process on top and the online process on the bottom. Online, a video comes in and the appropriate tags or features are produced. Automatic labeling in the offline process is the algorithmic labeling described above; its output data and labels are used to train a model. That model is then evaluated, and the data used in the evaluation is validation data that we have annotated ourselves. When the evaluation accuracy reaches our threshold, we consider the model usable and push the update online.
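The threshold gate at the end of this offline flow might look like the following sketch; `deploy()` and `ACCURACY_THRESHOLD` are hypothetical placeholders for the actual online-update interface, not Meitu's real values.

```python
# Evaluation gate: a newly trained model replaces the online one only if its
# accuracy on held-out, human-verified data clears the threshold.
from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.90  # assumed value

def maybe_deploy(model, X_val, y_val, deploy):
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc >= ACCURACY_THRESHOLD:
        deploy(model)   # push the model to the online inference service
        return True
    return False        # otherwise keep the current online model
```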

Besides video, there are many image algorithms, but by comparison video algorithms are more complex. Here we take video classification as an example to introduce how we approach it.

Most video classification methods take one of these forms. In the first, we simply sample frames from the video, extract features with a CNN, fuse them, and classify the result. This method does not exploit the temporal information of the video. There are a couple of other algorithms, shown at the bottom left and on the right, which do take temporal information into account, but their complexity is too high for them to be applied easily in real scenarios.

Evaluated on our short video application scenarios, the first method achieves good accuracy, and compared with the other schemes its time complexity is also very low.
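A minimal sketch of this first, frame-level scheme, assuming PyTorch/torchvision; the ResNet-50 backbone, average-pooling fusion, and class count are illustrative choices rather than the exact production model:

```python
# Frame-level video classification: CNN features per frame, fused by
# average pooling over time, then a linear classification head.
import torch
import torchvision.models as models

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # keep 2048-d features, drop ImageNet head
backbone.eval()

N_CLASSES = 30                      # assumed number of video categories
classifier = torch.nn.Linear(2048, N_CLASSES)  # head to be trained on business data

def classify_video(frames):
    """frames: (T, 3, 224, 224) preprocessed frame batch for one video."""
    with torch.no_grad():
        feats = backbone(frames)    # (T, 2048) per-frame features
    clip_feat = feats.mean(dim=0)   # temporal fusion by average pooling
    return classifier(clip_feat)    # class logits
```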

Meitu also has some cutting-edge research results in video classification. In cooperation with the Chinese Academy of Sciences, we proposed an unsupervised video feature learning and behavior recognition method based on brain-inspired intelligence.

There are many models in industry and academia, but in actual business scenarios the seemingly huge networks do not perform as well as expected. What matters more to us is how to extract the most critical data from the mass of business data to train a model suited to our scenarios. From building the multimedia content analysis platform, we have concluded that three services are essential:

  1. Inference service: on the one hand it runs inference on incoming data; on the other it swaps in the updated model after training;

  2. Data service: it provides annotated data for model training, and its core module is automatic algorithmic annotation;

  3. Training service: it contains a training trigger module that periodically re-runs model training (see the sketch after this list).
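A skeletal illustration of how these three services might connect, with the training trigger as a periodic loop; the interval and the service hooks are assumptions for illustration only, not Meitu's actual design:

```python
# Periodic training trigger tying the three services together.
import time

RETRAIN_INTERVAL_S = 7 * 24 * 3600   # e.g. weekly; assumed value

def training_loop(fetch_annotated_data, train, evaluate_and_maybe_deploy):
    while True:
        data = fetch_annotated_data()      # data service: auto-annotated data
        model = train(data)                # training service: retrain the model
        evaluate_and_maybe_deploy(model)   # gate, then update inference service
        time.sleep(RETRAIN_INTERVAL_S)
```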

Everything above concerns content analysis, most of which is output in the form of tags. In fact, describing things with labels has some problems. One is timeliness: the figure above shows how part of the label system changed between March and August. The second is incompleteness: labeling a video or an image requires many dimensions. And what if we want to compare the similarity of two videos when doing video retrieval? That is very difficult with tags.

The figure above shows the video retrieval process, a scheme that applies the text-retrieval approach proposed for video in 2003 (the "Video Google" work). It consists of two parts: building the target retrieval database, and retrieval itself. After obtaining videos, frames are sampled, features are extracted from those frames, and a visual word list is built from them.
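A minimal sketch of building the visual word list, assuming frame features are already extracted; the vocabulary size and scikit-learn's KMeans are illustrative choices:

```python
# Bag-of-visual-words index construction, as in text retrieval applied to video.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(frame_feats, n_words=1000):
    """Cluster frame features; the cluster centers are the 'visual words'."""
    return KMeans(n_clusters=n_words, n_init=4, random_state=0).fit(frame_feats)

def video_histogram(vocab, frame_feats):
    """Quantize one video's frames to words and count them, like a document."""
    words = vocab.predict(frame_feats)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)   # normalized word histogram
```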

Features fall into two classes, binary features and floating-point features, each with its own advantages and disadvantages. Binary features offer efficient storage and distance computation, but the range of values they can represent is relatively small. Floating-point features are compared with Euclidean distance, are highly resistant to interference, and their distance values span a large range, theoretically from zero to infinity.
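The trade-off can be seen directly in how the two distance computations work; a small NumPy sketch:

```python
# Binary codes compared with a cheap Hamming distance vs. floating-point
# vectors compared with Euclidean distance (range [0, inf)).
import numpy as np

def hamming(a, b):
    """a, b: uint8 arrays of packed bits; distance = number of differing bits."""
    return int(np.unpackbits(a ^ b).sum())

def euclidean(x, y):
    """x, y: float vectors."""
    return float(np.linalg.norm(x - y))

a = np.packbits(np.random.randint(0, 2, 256).astype(np.uint8))
b = np.packbits(np.random.randint(0, 2, 256).astype(np.uint8))
print(hamming(a, b))        # bounded: at most 256 for 256-bit codes
print(euclidean(np.random.rand(128), np.random.rand(128)))  # unbounded
```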

Here is a simple idea: we use the multi-level label system constructed above to guide the network's learning, together with a triplet loss. From one video, 5 frames are selected, then another 5 frames from a different time segment of the same video, forming a positive pair. Negative pairs are formed between that video and other videos, so that the network learns which information truly characterizes a video.
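A sketch of this triplet setup, assuming PyTorch; `embed` stands for the embedding network, and the margin and the pooling of the 5-frame clips are assumed details:

```python
# Triplet loss: clips from the same video are pulled together, clips from
# different videos are pushed apart.
import torch
import torch.nn.functional as F

def triplet_step(embed, anchor_frames, positive_frames, negative_frames,
                 margin=0.2):
    """Each *_frames tensor is (5, C, H, W): a 5-frame clip."""
    # embed(...) maps frames to vectors; normalize and mean-pool into one
    # embedding per clip
    a = F.normalize(embed(anchor_frames), dim=1).mean(dim=0)
    p = F.normalize(embed(positive_frames), dim=1).mean(dim=0)  # same video
    n = F.normalize(embed(negative_frames), dim=1).mean(dim=0)  # other video
    # hinge on the squared-distance gap
    loss = F.relu((a - p).pow(2).sum() - (a - n).pow(2).sum() + margin)
    return loss
```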

The diagram above shows the architecture of the short video similarity retrieval service, with the offline module on the left and the online module on the right. The offline side does batch training and generates the hashes.
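One common way to generate such binary hashes offline is random-hyperplane LSH over the floating-point features; whether Meitu uses exactly this scheme is an assumption, but it illustrates the offline step:

```python
# Random-hyperplane hashing: float features -> compact binary codes that can
# be compared with the Hamming distance from the earlier sketch.
import numpy as np

rng = np.random.default_rng(0)
N_BITS, DIM = 64, 256                         # assumed code and feature sizes
planes = rng.standard_normal((N_BITS, DIM))   # fixed random projection

def to_hash(feat):
    """feat: (DIM,) float vector -> packed 64-bit binary code (8 bytes)."""
    bits = (planes @ feat > 0).astype(np.uint8)  # sign of each projection
    return np.packbits(bits)
```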

Conclusion

In the process of building this platform, we encountered some problems and formed some thoughts. The main points are:

  • AI has opened up greater application space for multimedia technology, and there are still many business scenarios in the video field that need AI's help.

  • Data is still the core element for effectively deploying AI algorithms at the current stage; the closer to the business scenario, the more important domain data becomes.

  • The role of general-purpose algorithms is weakening; they need to be deeply optimized in combination with specific business scenarios.

  • Some technical areas still need continuous exploration, such as algorithm performance, fine-grained semantic understanding, and intelligent interaction with Internet multimedia content.


