In personalized recommendation systems, users' interests are usually understood by mining item attributes to build recommendation models. Understanding item properties from user behavior is often done simplistically, typically with basic tag statistics. To understand content more deeply from user behavior, Meipai clusters videos using users' click and play behavior. On the one hand, this breaks the restriction of understanding video content purely from the visual perspective; on the other hand, it can surface categories that were never summarized manually, improving the effect of personalized recommendation.

In the ninth Meitu Technical Salon, Bai Yang from Meitu introduced a video clustering scheme based on user behavior and discussed some practices of video clustering in Meipai's recommendation system.


Applications of clustering in Meipai

As shown in the figure below, the screenshot on the far left is the UI of the Meipai home page, where users can see 4-6 videos. The application scenario of video clustering on this page is as follows: when two of the six videos displayed belong to the same cluster, they should not appear next to each other. In this case, we perform a scatter operation to separate videos from the same cluster, ensuring that users see more categories on a limited page and achieving diversity in the recommendation results.

The second is similar-video retrieval. We often need to find videos similar to a given video. In this case, we can determine which cluster the video belongs to through video clustering, and then look for similar videos within that cluster, achieving fast retrieval.
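As a concrete sketch, in-cluster retrieval might look like the following (the vectors and cluster ids are hypothetical toy data; the talk does not describe Meipai's actual index):

```python
import numpy as np

# Hypothetical data: one embedding vector and one cluster id per video.
video_vecs = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.85, 0.15],
])
cluster_ids = np.array([0, 0, 1, 1, 0])

def similar_in_cluster(query_idx, k=2):
    """Retrieve the top-k most similar videos from the query's own cluster."""
    same = np.flatnonzero(cluster_ids == cluster_ids[query_idx])
    same = same[same != query_idx]          # drop the query itself
    q = video_vecs[query_idx]
    cands = video_vecs[same]
    sims = cands @ q / (np.linalg.norm(cands, axis=1) * np.linalg.norm(q))
    return same[np.argsort(-sims)[:k]]

print(similar_in_cluster(0))  # neighbours of video 0, drawn only from cluster 0
```

Restricting the candidate set to one cluster is what makes the retrieval fast: similarity is computed over a small subset instead of the whole catalogue.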

The third application scenario is finding niche videos, or short-term popular videos, through clustering, to help the product team devise better operational strategies.

The fourth scenario is expanding recommendation strategies. Through clustering we can find out which clusters a user is interested in; for example, if a user is interested in the food and beauty clusters, we can recommend videos from those clusters.

The final scenario is adding video clusters as features to the ranking model to improve its performance. These are the five important application scenarios of video clustering in Meipai.

So how do we find out what a video is about? The most intuitive method is to mine the information expressed in the video from the content itself. First, from the image perspective, we can mine the entities in the video (such as food or pets in the picture); second, from the sound perspective (such as background music or audio in the video); third, from the video's text (such as its description, comments, and subtitles); and finally, from the video cover, keyframes, and consecutive screenshots.

At the same time, the above methods also have defects:

Video content/image:

Prior knowledge is required

Text:

Coverage is incomplete and description may not be accurate

Therefore, we mine video content from user behavior, which consists of user portraits and video portraits. The intuition is as follows: if a large number of users have all watched the same two videos, we can infer that the two videos share an audience (that is, their content is related), and further infer that they belong to the same cluster. This method requires no prior knowledge.
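The co-audience intuition above can be sketched by counting how many users co-watched each video pair (the play logs below are hypothetical toy data):

```python
from collections import Counter
from itertools import combinations

# Hypothetical play logs: user id -> videos the user watched.
plays = {
    "u1": ["food_1", "food_2", "pet_1"],
    "u2": ["food_1", "food_2"],
    "u3": ["pet_1", "pet_2"],
    "u4": ["food_2", "food_1"],
}

# Count how many users co-watched each unordered video pair.
co_watch = Counter()
for videos in plays.values():
    for a, b in combinations(sorted(set(videos)), 2):
        co_watch[(a, b)] += 1

# Pairs watched together by many users are likely related in content.
print(co_watch.most_common(1))
```

Pairs with a high co-watch count are the ones the later schemes (graph edges, skip-gram windows) exploit.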

So what problems should be solved by video clustering based on user behavior?

1. Massive data. Meipai's daily user behavior is terabyte-scale, and the model needs to process this massive data every day.

2. Rapid model updates. Since users upload many new videos every day, the cluster a new video belongs to must be found as soon as possible.

3. Interpretability. We want to understand the implied meaning of each cluster; for example, if a video belongs to both the food and beauty clusters, we can infer that it is a beauty eating-show video.



Evolution of video clustering scheme

We put forward four solutions to the above problems; the figure shows how these four solutions evolved. We started with TopicModel, moved to Item2vec, then to keyword propagation, and finally to DSSM. Next, we will mainly cover the evolution of these four models and their online effects.

1.TopicModel

First, topic clustering with TopicModel. TopicModel is a classic model in natural language processing that can mine topics from a large collection of documents. With such a model, we can find out which topics each document belongs to and which topics each word in a document belongs to. Its generative story goes like this: to write a document under a model with 100 topics, we first select a topic according to the document's topic distribution, then select a word from that topic's word distribution, and repeat until the document is complete. TopicModel estimates the topic distributions of documents and words by counting word frequency and word co-occurrence across documents.

So how does it apply to Meipai's recommendation? First, a user's behavior can be understood as a document, and the videos the user played or liked can be understood as its words. In this way, user behavior data can be fed into TopicModel to obtain clustering results.
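A minimal sketch of this modeling choice, using scikit-learn's LDA on hypothetical play logs (one "document" per user, one "word" per video id; the talk does not specify which LDA implementation Meipai used):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical behavior logs: each "document" is one user's play history,
# each "word" is a video id.
user_docs = [
    "food_1 food_2 food_3",
    "food_2 food_3 pet_1",
    "pet_1 pet_2 pet_3",
    "pet_2 pet_3 food_1",
]

counts = CountVectorizer(token_pattern=r"\S+").fit_transform(user_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# doc_topics[u] is user u's topic distribution; lda.components_ gives, per
# topic, the (unnormalised) weight of each video in that topic.
doc_topics = lda.transform(counts)
print(doc_topics.shape)   # one row per user, one column per topic
```

Reading `lda.components_` column-wise gives each video's affinity to each topic, which is the video clustering signal.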

So how does TopicModel meet the three requirements mentioned above?

1. Handling massive amounts of data. TopicModel can solve the problem of massive data by means of data parallelism.


2. Update quickly. For a new video, user behavior arrives very soon after upload. We can infer the new video's topics from the topics of the users who watched it, allowing the topic model to be updated quickly.


3. The explainability of TopicModel is very good. We can intuitively understand the general meaning of each topic (cluster) and obtain the distribution of videos across topics, so we can judge whether the results match human understanding.

With those three requirements met, TopicModel became our first solution. Four common problems of topic models then had to be solved:

1. Model evaluation. We need an appropriate way to evaluate the quality of a topic model.


2. Modeling method. As mentioned above, behavior-based modeling treats each video as a word, but there are other ways to model.


3. Number of topics (clusters). The number of topics is a parameter that often needs tuning in topic models.


4. Duplicate clusters. Topic model output often contains too many similar clusters, which hurts the scatter scenario: videos that should be separated end up not being scattered correctly.

The first thing to solve is model evaluation. Only by finding an accurate model evaluation method can we compare the results of various schemes.

As shown in the figure above, the first two are metrics commonly used for topic models, and the chart below shows how they trend over one training run. These two metrics indicate whether the model has converged, and can also give a rough sense of model quality. But we also want to know how the clustering performs in downstream applications, so we introduce a third metric: the topic (cluster) results are fed into the ranking model as features, and the ranking model's own metrics are used to judge the clustering model.

With these metrics in hand, we tackle the second problem: modeling. The first approach, described above, treats each video as a word. Conversely, we can model with users as words: each video becomes a document and the users who watched it become its words. Comparing the three metrics of the two schemes, user-as-word scores higher on all three, but its training takes longer, about 5 hours more than video-as-word. Since a popular video may have been played by a million users, its "document" would contain a million words, hence the longer training time.

The figure above shows the effect of the two modeling methods in the ranking model; the light blue line at the bottom is the online baseline, and the vertical axis is the ranking model's AUC. After introducing the topic model, both schemes beat the online baseline, so the topic model clearly helps.

From this comparison we found that the AUCs of the two modeling schemes differ little, so to shorten training time we chose video-as-word. When trying different numbers of topics, however, we found a clear gap between the dark blue and yellow AUC curves, so we next had to solve the third problem: choosing the number of topics.

The number of topics depends strongly on the application scenario, so we discuss it per scenario:

1. Discovering niche video clusters. To find low-frequency, long-tail clusters, expand the number of topics; empirically, with 1000 topics or more, long-tail video clustering works better.

2. Feeding the topic model into the ranking model. Comparing the AUC for each topic count, the baseline is still at the bottom; each line in the figure above shows AUC for a different number of topics. We found that AUC stops changing significantly once the number of topics reaches 100 or 200. The screenshot shows the experiment results; based on the ranking model's AUC and GAUC, the number of topics was set to 150.

3. Cluster scattering. In the scatter scenario, we found that fewer topics benefit the sharing metric. For example, with 100 topics, per-capita online sharing increased by 8%, while the playback metric decreased somewhat. With 200 topics, sharing increased by only 2%, but playback also improved.

Why does this happen? With fewer topics, the clustering granularity is coarser, which means users see more distinct categories among the six videos on one screen. Suppose a girl shares a styling video to her Moments; she will probably share only one or two videos of that category in a day. Since a user shares few videos of the same category per day, increasing the variety of recommended categories raises the probability of hitting a category the user wants to share, which improves the sharing metric. However, scattering pushes the categories the user is most interested in further down the feed, so playback drops.

So why do both metrics rise when the number of topics increases? Mainly because scattering still improves video diversity, while the finer clustering granularity keeps the videos users are interested in from being pushed back too far. The better experience makes users scroll further, which naturally lifts both metrics.

The last problem is similar topics. For example, after one training run we obtained two similar topics, both consisting of claw-machine videos. Without merging these two clusters, claw-machine videos could not be scattered correctly. The topic model outputs each video's distribution over topics; transposing this gives, for each topic, a distribution over videos, i.e., a topic vector whose every dimension is the probability that a given video belongs to that topic. With these topic vectors we can compute pairwise topic similarity and merge similar topics, combining the two claw-machine clusters (and other similar clusters) to improve clustering accuracy. These are the problems we encountered applying the topic model to video clustering.

To summarize the advantages and disadvantages of TopicModel:

Advantages:

The topic model is relatively simple to use: just organize the user behavior, feed the whole behavior log into the topic model as documents, and it produces the desired clustering.

Disadvantages:

The clustering granularity is coarse. For example, in NLP, topic mining over a large news corpus may yield a topic for entertainment news. That topic could in fact be split into finer clusters (such as clusters for individual celebrities), but the topic model has difficulty with such fine-grained refinement.


How do we get finer-grained clustering? This is where the second scheme, Item2vec, comes in.

2.Item2vec

Item2vec is essentially Word2vec adapted to our recommendation scenario.

First, a brief introduction to the SkipGram model. As shown on the right, this is an example of positive-sample generation. SkipGram's main purpose is to find words that appear in similar contexts. The idea is that, much like topic modeling, we can apply the model to user behavior by treating videos as words and using it to find similar videos. How is the model trained? The blue box in the figure represents an input word, and the white boxes represent output words. Input and output words are combined into word pairs, and these pairs are fed into the network so that it learns, for each word, which words share a similar context.
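A minimal sketch of the (input, output) pair generation just described, with video ids standing in for words (the play sequence is hypothetical):

```python
# Generate skip-gram (input, output) pairs over one user's play sequence,
# treating each video id as a "word".
def skipgram_pairs(sequence, window=2):
    """Pair every video with its neighbours inside the sampling window."""
    pairs = []
    for i, center in enumerate(sequence):
        lo, hi = max(0, i - window), min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs

plays = ["food_1", "food_2", "pet_1", "food_3"]
print(skipgram_pairs(plays, window=1))
```

These pairs are what the network trains on; the window parameter is exactly what gives Item2vec its locality, as discussed next.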

What are the advantages of this model over the topic model? Why is it finer-grained? Compare the sampling window with the topic model: the topic model analyzes a user's complete behavior. For example, if a month of a user's behavior is one document, the topic model counts video co-occurrence across the whole month, while Item2vec attends only to videos co-occurring near each other. The correlation between videos clicked on the same day is clearly higher than the correlation between a video clicked today and one clicked a month ago, so this model is finer-grained than the topic model. The second advantage is its simple network structure, which can be plugged into other deep learning tasks for end-to-end optimization.

Word vectors have interesting analogies in NLP, and video vectors do as well. Take last year's popular WeChat game "Leapfrog": we found that the vector of a video of a cat playing Leapfrog minus the vector of a dog playing Leapfrog is approximately equal to the vector of a cat video minus the vector of a dog video. The learned video vectors thus carry not only surface-level entity information but also some higher-level semantics. Similarly, the vector of a woman dancing minus the vector of a man dancing is analogous to the vector of a woman video minus the vector of a man video.

Next comes the clustering comparison. To assess clustering granularity, we compare a video's most similar videos under each scheme. Taking food videos as an example, we compared the most similar videos from the topic model and from Item2vec. The topic model's clusters carry higher-level meaning: the similar videos all belong to food, and only a few can be subdivided into recipes or cooking. The similar videos from Item2vec are clearly more precise even just from their covers: both are eating-show videos, some with very similar content. The new scheme is obviously finer-grained.

We added video clusters as features to the ranking model, and the AUC improved greatly. In retrospect, the topic model features lifted AUC by only 0.1% to 0.2%, while the Item2vec clustering features reached 0.9%, a very significant effect.

The second scenario expands the recommendation strategy: cluster the videos using the vectors produced by the model, find the clusters the user is interested in, recall candidate videos from those clusters, rank them, and recommend. This recommendation strategy improved the online metric by 4%, also a very clear effect.

After introducing Item2vec, summarize the advantages and disadvantages:

Advantages:

The clustering granularity is finer.

Disadvantages:

Poor stability.

Why is it less stable? Suppose we cluster a set of video vectors with the simplest method (e.g., KMeans). The cluster with ID 0 after the first run may represent food. If we cluster the same vectors again, is cluster 0 still food? Almost certainly not. That is why it is less stable.

What is the consequence of poor stability? If the cluster ID is added to the ranking model as a feature, the hidden meaning of each cluster ID changes with every training run, which has a large impact on feature engineering and requires tedious engineering work. Therefore, we want stable clustering results.


Poor accuracy for low-frequency videos: since clustering is based on user behavior, videos with little behavior are clustered poorly.

So how do we solve these two problems? We introduce text information to summarize the meaning of clusters and to improve accuracy for low-frequency videos.

3.KeywordPropagation

However, clustering with text has its own problems. First, text coverage is relatively low; taking Meipai as an example, not every user is willing to fill in a description when uploading a video. Second, video descriptions can be wrong: users may want to ride trending topics, so they write popular hashtags in the description even though the video has nothing to do with them. Third, keyword extraction requires maintaining a long-tail vocabulary, because we want to find niche and fresh video clusters.

To solve these three problems, we investigated label propagation. Suppose we have a graph whose nodes are videos, with the relationships between videos as edges. Some videos have keywords and some don't, so we can use the edges to propagate keywords to the videos without them.

First, initialize the nodes: each node (video) is given a unique label; videos with extracted keywords are assigned those keywords directly, while nodes without keywords keep their unique initial label. Second, propagate labels along the edges between videos. After propagation, update each video's label: a video may receive many labels from other videos, and the simplest merge rule is to take the label with the largest total weight received as the node's new label. Then check how much the labels of the whole graph changed after this round; if the change is small, the propagation can be considered converged.

How does user behavior come into this? The graph is constructed from user behavior: for example, if 100 users watched both video A and video B, the nodes for A and B are connected by an edge with weight 100, and the nodes and edges can then be used to propagate keywords.
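Putting the two steps together, here is a toy sketch of weighted label propagation over a co-view graph (all node names, edge weights, and seed keywords are hypothetical):

```python
from collections import defaultdict

# Hypothetical graph: edge weights are co-view counts; "A" and "E" have
# keywords extracted from their descriptions, the rest start unlabeled.
edges = {("A", "B"): 100, ("B", "C"): 80, ("C", "D"): 5, ("D", "E"): 90}
labels = {"A": "food", "E": "dance"}
labels.update({v: f"init_{v}" for v in "BCD"})  # unique initial labels

neighbours = defaultdict(dict)
for (u, v), w in edges.items():
    neighbours[u][v] = w
    neighbours[v][u] = w

# Each unlabeled node repeatedly adopts the label with the highest total
# incoming edge weight; seed nodes "A" and "E" keep their keywords.
for _ in range(10):
    changed = 0
    for node in "BCD":
        votes = defaultdict(float)
        for nb, w in neighbours[node].items():
            votes[labels[nb]] += w
        best = max(votes, key=votes.get)
        if labels[node] != best:
            labels[node] = best
            changed += 1
    if changed == 0:   # labels stopped changing: converged
        break

print(labels)
```

The unique initial labels on B, C, and D die out because they never gather as much weight as the propagated keywords, which is also the mechanism behind the niche-cluster discovery discussed below.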

Reviewing the full pipeline: first, keywords are extracted from each video's description, comments, and subtitles; then the graph is constructed from user behavior; then keywords are propagated over the graph so that every video obtains keywords; finally, N-grams are used for clustering, with bi-grams adopted online.

This keyword dissemination effect is as follows:

  • Improved coverage

    A large number of videos without keywords obtain keywords through propagation; coverage has now reached 95%.

  • Text description error correction

    As shown in the picture above, the video in the lower left corner is a braiding and styling video, but its description is "#Miss-you gesture dance# double-tap to support", which has nothing to do with the video; it is simply riding a trending topic. In the graph built from user behavior, the styling video is surrounded by other styling videos, so the original wrong keywords are overwritten with accurate ones, correcting the video's description.

  • Find clusters of niche videos

    Suppose the keyword vocabulary lacks the phrase "Tibetan dance"; then accurate keywords cannot be extracted for these videos, and clustering naturally fails. Keyword propagation can still discover such a niche cluster. Take one hundred "Tibetan dance" videos: in the initialization step, since no keywords can be extracted, each of the hundred videos is assigned its own unique label. During label propagation, labels spread among the neighbouring "Tibetan dance" videos, and in the end their labels converge to the label with the highest edge weight among them. This makes it possible to discover niche clusters that the keyword vocabulary does not maintain.

To recap: the topic model and Item2vec cluster in an unsupervised way. Keyword propagation is semi-supervised, exploiting the videos that already have keywords together with the relationships between videos. Is there a supervised way to improve video clustering further? This is where we introduced the deep model DSSM.

4.DSSM

DSSM was originally applied to search. Briefly: searching Baidu for "Meipai" returns many related pages, perhaps Meipai news sites or the Baidu Baike entry for Meipai, and we choose one to click. That click can be understood as a sample. Likewise, a video click can be understood as a positive sample, and videos the user dislikes as negative samples. If dislike signals are scarce, exposed-but-unclicked data can be used as negatives.

Interpreting DSSM in the NLP scenario: as shown in the figure above, Q on the left is the query typed into Baidu, and D on the right is a page title shown in the results. Word hashing is mainly used for dimensionality reduction in NLP, e.g., reducing a 500,000-word vocabulary to about 30,000 dimensions. In Meipai's video scenario, this layer can instead be fed by the models described above (topic model, Item2vec). A DNN then maps the input into a 128-dimensional semantic space; the matching layer computes the cosine similarity between Q and D; finally, softmax converts the similarities into posterior probabilities. Applied to Meipai recommendation, Q is the user and D is the video, and user behavior provides weakly supervised learning. Why weakly supervised? Mainly because the negative samples (exposed but unclicked videos) are not necessarily disliked by the user; some were simply ranked lower, so we call it a weakly supervised model.
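A toy forward pass of the two-tower idea just described (a single linear layer stands in for each DNN tower, with random untrained weights; the real DSSM learns these weights from click pairs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-tower sketch: map user and video features into a shared
# 128-d semantic space, score with cosine similarity, normalise with softmax.
W_user = rng.normal(size=(64, 128))    # user tower weights
W_video = rng.normal(size=(64, 128))   # video tower weights

def tower(x, W):
    return np.tanh(x @ W)              # stand-in for the DNN mapping

def click_probs(user_feat, video_feats):
    u = tower(user_feat, W_user)
    v = tower(video_feats, W_video)
    cos = (v @ u) / (np.linalg.norm(v, axis=1) * np.linalg.norm(u))
    exp = np.exp(cos - cos.max())      # softmax over the candidate videos
    return exp / exp.sum()

user = rng.normal(size=64)
videos = rng.normal(size=(5, 64))      # e.g. 1 clicked + 4 unclicked videos
probs = click_probs(user, videos)
print(probs)                           # posterior over the 5 candidates
```

Training would then push the clicked video's probability up against the unclicked candidates, which is what makes the supervision (weakly) click-driven.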

Meipai made the following improvements to DSSM. First, video plays come from many scenarios: the home page feed is the most common in Meipai, but there are other sources, such as plays from an author's video list or plays after a search, and users' intent differs across sources. We split the different sources while sharing a single user tower, letting the model learn which videos users want to play in each source. Second, as mentioned, the bottom input of DSSM is a bag of words, so we introduced LSTM, hoping the model can learn more context information and capture users' longer-term interests.

Reviewing the effects brought by DSSM: the topic model initially improved ranking AUC by 0.1%, Item2vec by 0.9%, and DSSM finally by 1.3%, a very clear effect. Note also that the first two models (especially the topic model) require very large training sets, perhaps two weeks or even a month of user behavior, while DSSM needs only two or three days of behavioral data to get a better result.

Reviewing the development path of our four clustering schemes: we started with the topic model because it is simple to use and highly interpretable; then, for scenarios needing fine-grained clusters, we introduced the Item2vec clustering scheme; the third scheme, keyword propagation, mainly uses text to stabilize clustering and improve results for low-frequency, niche videos; the final scheme, DSSM, uses a (weakly) supervised approach to improve the user and video vectors.


Future directions

The first is multiple levels. Our video clustering currently has only a single level, but many more clusters are actually available: under food we could also obtain clusters such as malatang, pasta, and cake, which sit one level below it. Our current schemes have no sense of hierarchy; hierarchical clustering over text could be used to solve this.

The second is real-time operation. We hope clustering can analyze and update online which cluster a video belongs to; more importantly, for a newly uploaded video with only a small amount of user behavior, we want to determine its cluster immediately.

The third is accuracy. There are many ways to improve it; for example, user portrait or video portrait features can be fed into DSSM to improve the video vectors, and better video vectors yield more accurate video clustering.

So those are the three directions we want to go in the short term.