As an emerging business with explosive growth, script killing (Jubensha, a live-action murder-mystery role-playing game) still has shortcomings in merchant listing, user purchasing, supply-demand matching, and so on. Supply standardization can create value for users, merchants, and the platform, and boost business growth. This article introduces the process and algorithm scheme by which the Meituan In-store Integrated Business Data team rapidly built script killing supply standardization from 0 to 1. We extended the Meituan in-store knowledge graph GENE (GEneral NEeds Net) to the script killing industry, constructed a script killing knowledge graph, and carried out the standardization of supply, covering script killing supply mining, standard script library construction, and association between supply and standard scripts, with applications landed in multiple scenarios. We hope it brings you some help or inspiration.

1. Background

The script killing industry has witnessed explosive growth in recent years. However, since script killing is an emerging industry, the platform's existing category system and product forms increasingly fail to meet the rapidly growing needs of users and merchants, mainly in the following three aspects:

  • Lack of platform categories: the platform has no dedicated "script killing" category and no centralized traffic entrance, which muddles users' decision paths and makes it hard to establish a unified user perception.
  • Low user decision efficiency: the core of script killing is the script itself. Because there is no standard script library and no association between standard scripts and supply, script information display and supply management are poorly standardized, which hurts the efficiency of users' script-selection decisions.
  • Cumbersome product listing: merchants must enter product information item by item, with no standard template available for pre-filling, so only a low proportion of scripts are listed on the platform and listing efficiency has much room for improvement.

To solve these pain points, the business needed to standardize script killing supply: first establish a new "script killing" category and migrate the corresponding supply (including merchants, products, and content) to it. On this basis, with the script as the core, build a standard script library and associate script killing supply with it, then establish script-dimension information distribution channels, a rating and scoring system, and ranking lists, so as to support the user decision path of "finding stores by script".

It is worth noting that supply standardization is an important means of simplifying user cognition, aiding user decisions, and promoting supply-demand matching; the degree of standardization has a decisive influence on the scale of the platform business. For the script killing industry specifically, standardized supply construction is an important foundation for the continued growth of the script killing business, and building the standard script library is the key to it. A specific script cannot be identified by script attributes alone, such as specification (e.g., "city limited"), background (e.g., "ancient style"), or theme (e.g., "emotion"); only the script name, such as "Shiri", can serve as a unique identifier. Therefore, building the standard script library means first building standard script names, and then building standard script attributes such as specification, background, theme, difficulty, and genre.

To sum up, the Meituan In-store Integrated Business Data team worked with the business to standardize script killing supply. The construction involves multiple entity types, such as script names, script attributes, categories, merchants, products, and content, as well as diverse relations among them. A knowledge graph, as a semantic network that reveals entities and their relations, is particularly well suited to this problem, and we had already built the Meituan in-store integrated knowledge graph GENE (GEneral NEeds Net). Therefore, drawing on the experience of building GENE, we quickly built a knowledge graph for the new script killing business and carried out its supply standardization from 0 to 1, improving supply management and supply-demand matching, and creating greater value for users, merchants, and the platform.

2. Solution

We built GENE around user needs, progressing layer by layer through five levels: the industry system, demand objects, concrete demands, scene elements, and scene demands, covering leisure, medical, education, parenting, wedding, and other comprehensive local-life businesses; its system design and technical details can be found in the related articles on the Meituan in-store comprehensive knowledge graph. As an emerging Meituan in-store integrated business, script killing reflects users' new needs in leisure, which naturally fits GENE's architecture. We therefore extended GENE to this new business and followed the same approach to build the corresponding knowledge graph, thereby realizing the corresponding supply standardization.

The key to realizing script killing standardization based on the knowledge graph is to build the script killing graph around standard scripts. The design of the graph schema is shown in Figure 1. Specifically, at the industry-system layer a new script killing category is created, script killing supply is mined, and the affiliation between supply (including merchants, products, and content) and the category is established. On this basis, at the demand-object layer, the core object nodes, standard script names, together with their script-attribute nodes, are mined and their relations built, forming the standard script library; finally, each standard script in the library is associated with supply and users. The remaining three layers of concrete demand, scene elements, and scene demand express users' concrete service demands and scene demands about scripts explicitly; since they are only loosely connected to supply standardization, they are not covered here.

A concrete example of the standardization-related part of the script killing knowledge graph is shown in Figure 2 below. The standard script name is the core node, surrounded by standard script attribute nodes of various types, including theme, specification, genre, difficulty, background, nickname, and so on. Relations such as "same series" can also be built between standard scripts, for example between "Shiri" and "Shiri 2". In addition, each standard script is related to products, merchants, content, and users. The sketch below renders this example as triples.
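To make the schema concrete, here is a minimal rendering of the Figure 2 example as (head, relation, tail) triples; the relation names and supply identifiers are illustrative assumptions, not the production schema.

```python
# An illustrative fragment of the script killing knowledge graph as
# (head, relation, tail) triples; relation names and IDs are assumed.
triples = [
    ("Shiri", "has_theme",          "emotion"),
    ("Shiri", "has_background",     "ancient style"),
    ("Shiri", "has_specification",  "city limited"),
    ("Shiri", "has_nickname",       "Shiri one"),
    ("Shiri", "same_series_as",     "Shiri 2"),
    ("Shiri", "sold_as",            "product:12345"),
    ("Shiri", "offered_by",         "merchant:678"),
    ("Shiri", "mentioned_in",       "ugc:901"),
]
```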

Supply standardization is carried out on these nodes and relations of the script killing knowledge graph. Building the graph involves three main steps: script killing supply mining, standard script library construction, and association between supply and standard scripts. The implementation details and algorithms of the three steps are introduced below.

3. Implementation

3.1 Script killing supply mining

As script killing is an emerging business, there is no corresponding category in the existing industry category tree, so its related supply (including merchants, products, and content) cannot be obtained directly by category. We therefore first mine script killing supply, that is, mine script-killing-related supply from existing supply in industries similar to script killing.

For merchant supply mining, we need to judge whether a merchant provides script killing services, based on text corpora from merchant names, product names, product details, and merchant UGC. This is essentially a multi-source text classification problem. However, because labeled training samples were lacking, we did not directly use an end-to-end multi-source classification model; instead, we implemented it efficiently with a combination of unsupervised matching and supervised fitting. The discrimination process is shown in Figure 3 below:

  • Unsupervised matching: we first build a script-killing-related keyword lexicon and perform exact matching against the text corpora from merchant names, product names, product details, and merchant UGC respectively. We then apply a general BERT-based [1] semantic-drift discrimination model to filter the matching results. Finally, a matching score is computed from each source's matching results according to business rules.
  • Supervised fitting: to quantify how the matching scores from different sources affect the final decision, operations staff manually label a small number of merchant scores representing the strength of each merchant's script killing service. We then fit a linear regression model to these labeled scores, obtaining the weight of each source and achieving accurate mining of script killing merchants (a minimal sketch follows this list).
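Below is a minimal sketch of this two-stage scheme; the keyword lexicon, merchant fields, labeled scores, and decision threshold are all illustrative assumptions, and the BERT-based semantic-drift filter is omitted for brevity.

```python
# Two-stage merchant mining sketch: per-source keyword-match scores
# (unsupervised), then linear regression over a few labeled merchants
# (supervised) to weight the sources. All data shapes are assumed.
import numpy as np
from sklearn.linear_model import LinearRegression

KEYWORDS = {"script kill", "murder mystery", "larp"}  # hypothetical lexicon

def source_score(texts):
    """Share of texts from one source that hit a keyword exactly
    (the BERT semantic-drift filter would further clean these hits)."""
    hits = sum(any(kw in t.lower() for kw in KEYWORDS) for t in texts)
    return hits / max(len(texts), 1)

def features(m):
    """One matching score per source: name, product names, details, UGC."""
    return [source_score([m["name"]]), source_score(m["products"]),
            source_score(m["details"]), source_score(m["ugc"])]

# Supervised step: fit per-source weights to a handful of merchant
# service-strength scores labeled by operations staff.
labeled = [
    {"name": "Mystery Manor", "products": ["murder mystery night"],
     "details": ["6-player script kill room"], "ugc": ["great larp!"],
     "score": 0.9},
    {"name": "Joe's Diner", "products": ["burger set"],
     "details": ["family meals"], "ugc": ["tasty food"], "score": 0.0},
]
reg = LinearRegression().fit([features(m) for m in labeled],
                             [m["score"] for m in labeled])

def is_script_kill_merchant(m, threshold=0.5):  # threshold is an assumption
    return reg.predict([features(m)])[0] >= threshold
```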

With this method, mining of both tabletop and live-scene script killing merchants was completed, with precision and recall meeting the requirements. Based on the merchant mining results, products can be further mined and the script killing category created, laying a solid data foundation for the subsequent construction of the script killing knowledge graph and of standardization.

3.2 Standard script library construction

As the core of the whole knowledge graph, standard scripts play an important role in supply standardization. We mined standard scripts by aggregating similar script killing products, combined with manual audit, and obtained script authorization from the relevant publishers to build the standard script library. A standard script consists of two parts, the standard script name and the standard script attributes, so the construction of the library likewise splits into mining standard script names and mining standard script attributes.

3.2.1 Mining of standard script names

Based on the characteristics of script killing products, we mined iteratively with three methods, rule-based aggregation, semantic aggregation, and multimodal aggregation, distilling thousands of standard script names from hundreds of thousands of product names. The three aggregation methods are introduced below.

Rule-based aggregation

The same script is often named differently across merchants, with many non-standard, personalized variants. On the one hand, the name of the same script can itself be written in various ways; for example, "Shiri", "Shiri one", and "Shiri 1" are the same script. On the other hand, besides the script name, product names often append script attributes such as specification and theme, as well as descriptive marketing text, such as emotional copy like "Parting". We therefore first designed cleaning strategies tailored to the naming characteristics of script killing products, cleaning product names before aggregating them.

In addition to compiling common non-script words into a lexicon for rule-based filtering, we also tried framing cleaning as a named entity recognition problem [2], using sequence labeling to tag each character as "part of the script name" or "not part of the script name". The cleaned product names are then aggregated using a similarity rule based on the longest common subsequence (LCS) combined with threshold filtering; for example, "Shiri", "Shiri one", and "Shiri 1" end up clustered together. The whole process is shown in Figure 4, and a minimal sketch follows. Rule-based aggregation helped the business quickly aggregate script names at the initial stage of construction.
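The sketch below illustrates LCS-based similarity with greedy threshold clustering; the cleaning step, similarity normalization, and threshold are assumptions for illustration.

```python
# Rule aggregation sketch: compare cleaned product names with an
# LCS-based similarity and greedily cluster by threshold.

def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def lcs_sim(a: str, b: str) -> float:
    """Normalize LCS length by the longer string (an assumed choice)."""
    return lcs_len(a, b) / max(len(a), len(b), 1)

def aggregate(names, threshold=0.5):
    """Greedy single pass: each name joins the first cluster whose
    representative is similar enough, else starts a new cluster."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if lcs_sim(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

print(aggregate(["Shiri", "Shiri 1", "Shiri one", "Freud's Anchor"]))
# -> [['Shiri', 'Shiri 1', 'Shiri one'], ["Freud's Anchor"]]
```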

Semantic aggregation

Although rule-based aggregation is simple and easy to use, the diversity and complexity of script names leave some problems in its results: 1) products that do not belong to the same script get aggregated, for example "Shiri" and "Shiri 2" are two different scripts of the same series but were merged together; 2) products of the same script fail to aggregate, for example when a product name uses an abbreviation of the script name ("Chinatown Detective and Cat" vs. "Detective Tang") or contains a wrongly written character (two variants of "Freud's Anchor" differing by a typo), which rules can hardly handle.

To address these two problems, we further used semantic matching on product names to aggregate from the perspective of textual semantics. Commonly used text semantic matching models fall into two types: interaction-based and two-tower. Interaction-based models feed the two texts into the encoder together, letting them exchange information during encoding before the final decision; two-tower models encode the two texts into vectors with an encoder and then decide based on the two vectors.

Given the large number of products, the interaction-based approach would require pairing product names before every model prediction, which is inefficient, so we adopted the two-tower approach. Following the model structure of Sentence-BERT [3], the two product names are encoded into vectors by BERT, and cosine similarity is then used to measure the similarity between them. The complete structure is shown in Figure 5 below:

When training the model, we first built coarse-grained training samples from the rule-aggregation results, generating positive pairs within a cluster and negative pairs across clusters, to train an initial model. On this basis, the sample data was further improved with active learning. In addition, we generated targeted samples in batches for the two problems with rule aggregation described above; specifically, samples were constructed automatically by appending serial numbers to product names and by substituting wrong, variant, or traditional characters. A minimal training sketch follows.
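The sketch below shows two-tower training in the style of Sentence-BERT [3] using the sentence-transformers library; the checkpoint name, sample pairs, and hyperparameters are illustrative assumptions, not the production setup.

```python
# Two-tower semantic matcher sketch: BERT towers with mean pooling,
# trained with a cosine-similarity objective on rule-aggregation pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

# A plain BERT checkpoint is wrapped with mean pooling automatically.
model = SentenceTransformer("bert-base-chinese")

# Coarse-grained samples from rule aggregation: positives within a
# cluster, negatives across clusters (label is the target cosine score).
train_examples = [
    InputExample(texts=["Shiri", "Shiri 1"], label=1.0),
    InputExample(texts=["Shiri", "Freud's Anchor"], label=0.0),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(model))],
          epochs=1, warmup_steps=10)

# At inference, each product name is encoded once; similarity between
# any two names is then just the cosine of their vectors.
emb = model.encode(["Shiri one", "Shiri 1"], convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]))
```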

Multimodal aggregation

Semantic aggregation achieves synonym aggregation at the semantic level of product-name text. However, re-analyzing its results revealed another problem: two products may belong to the same script yet be indistinguishable from their names alone. For example, "Shiri 2" and "Denunciation" cannot be aggregated semantically, but they are in essence one script, "Shiri 2 · Denunciation". Although the two product names differ, their images are often identical or similar, so we introduced product image information to assist aggregation.

A simple approach is to use a mature pre-trained CV model as an image encoder for feature extraction and directly compute the image similarity of two products. To unify image-similarity computation with product-name semantic matching, we instead built a multimodal product matching model that makes full use of both name and image information. The model follows the two-tower structure used in semantic aggregation; the overall structure is shown in Figure 6 below:

In the multimodal matching model, the name and image of a script killing product are encoded into vectors by a text encoder and an image encoder respectively, then concatenated into the final product vector; cosine similarity measures the similarity between products. Specifically:

  • Text encoder: the pre-trained text model BERT [1] is used as the text encoder, with its average-pooled output as the text vector.
  • Image encoder: the pre-trained image model EfficientNet [4] is used as the image encoder, with the output of the last backbone layer as the image vector.

During training, the text encoder is fine-tuned while the image encoder's parameters are frozen. For training samples, we narrowed the range for manual annotation using product-image similarity on top of the semantic-aggregation results: product pairs in the same cluster with highly similar images directly become positives, cross-cluster pairs with dissimilar images directly become negatives, and the remaining pairs are labeled manually. Multimodal aggregation compensates for the limitations of text-only matching, improving accuracy by 5% over it and further improving standard script mining. A minimal sketch of one tower follows.
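Below is a minimal sketch of one product tower under assumed checkpoints (BERT and an EfficientNet-B0 stand-in) and dimensions; the production encoders and preprocessing may differ.

```python
# One tower of the multimodal matcher: mean-pooled BERT text vector
# concatenated with a frozen EfficientNet image vector; two products
# are compared by the cosine similarity of their tower outputs.
import torch
import torch.nn.functional as F
from transformers import BertModel
from torchvision import models

class ProductTower(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.text_enc = BertModel.from_pretrained("bert-base-chinese")
        self.img_enc = models.efficientnet_b0(weights="DEFAULT")
        self.img_enc.classifier = torch.nn.Identity()   # keep backbone features
        for p in self.img_enc.parameters():             # image encoder frozen;
            p.requires_grad = False                     # only BERT is fine-tuned

    def forward(self, input_ids, attention_mask, image):
        out = self.text_enc(input_ids=input_ids, attention_mask=attention_mask)
        mask = attention_mask.unsqueeze(-1).float()
        # Average pooling over valid tokens gives the text vector.
        text_vec = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
        img_vec = self.img_enc(image)                   # (B, 1280) for B0
        return torch.cat([text_vec, img_vec], dim=-1)   # final product vector

def match_score(tower, prod_a, prod_b):
    """Cosine similarity between the vectors of two products."""
    return F.cosine_similarity(tower(**prod_a), tower(**prod_b))
```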

3.2.2 Mining of standard script attributes

Standard script attributes cover more than ten dimensions, such as background, specification, genre, theme, and difficulty. Since merchants enter these attribute values when listing products, mining a standard script's attributes essentially means mining the attributes of all the aggregated products that correspond to it.

In practice, we mine by vote counting: for each attribute of a standard script, the attribute values of the corresponding aggregated products vote, the value with the most votes becomes the candidate value, and manual audit confirms it. In addition, while mining standard script names we found that the same script goes by multiple names; to describe standard scripts better, we added a nickname attribute, obtained by cleaning and deduplicating the names of all aggregated products of each standard script. A minimal sketch follows.
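The sketch below shows attribute mining by vote counting over one standard script's aggregated products; the field names and values are hypothetical.

```python
# Attribute voting sketch: the most frequent merchant-entered value
# becomes the candidate attribute value, pending manual audit.
from collections import Counter

def mine_attribute(products, attr):
    """Most frequent value of `attr` across the aggregated products."""
    votes = Counter(p[attr] for p in products if p.get(attr))
    return votes.most_common(1)[0][0] if votes else None

products = [  # products aggregated under one standard script (assumed)
    {"name": "Shiri 1",   "background": "ancient", "difficulty": "easy"},
    {"name": "Shiri one", "background": "ancient", "difficulty": "medium"},
    {"name": "Shiri",     "background": "modern",  "difficulty": "easy"},
]
candidate = {a: mine_attribute(products, a) for a in ("background", "difficulty")}
print(candidate)  # -> {'background': 'ancient', 'difficulty': 'easy'}
```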

3.3 Association between supply and standard scripts

After the standard script library is built, the three kinds of script killing supply, products, merchants, and content, still need to be associated with standard scripts to complete supply standardization. Since the merchant-script relation can be derived directly from the association between a merchant's products and standard scripts, we only need to associate products and content with standard scripts.

3.3.1 Product association

In Section 3.2, we mined standard scripts by aggregating existing script killing products, which already established the association between those products and standard scripts. Newly added products must likewise be matched to standard scripts to establish the relation. Products that cannot be associated with any standard script go through automatic mining of standard script names and attributes and, after manual review, are added to the standard script library.

The whole product association process is shown in Figure 7 below. The product name is first cleaned and then matched; during matching, the association decision is made based on the multimodal information of the product and the standard script.

Unlike product-to-product matching, the relation between a product and a standard script need not be symmetric. To ensure association quality, we modified the structure of the multimodal matching model from Section 3.2.1: after concatenating the two vectors, the probability of association between the product and the standard script is computed through a fully connected layer and a softmax layer. Training samples are constructed directly from the existing associations between stock products and standard scripts. Through product association, we standardized the majority of script killing products. A minimal sketch of the association head follows.
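The sketch below shows the asymmetric association head, a fully connected layer plus softmax over the concatenated vectors; the vector dimensions are assumptions.

```python
# Asymmetric association head sketch: concatenate the product vector
# and the standard-script vector, then classify "associated or not".
import torch

class AssociationHead(torch.nn.Module):
    def __init__(self, dim=2048):            # dim = 2 * tower-output size (assumed)
        super().__init__()
        self.fc = torch.nn.Linear(dim, 2)    # two classes: associated / not

    def forward(self, product_vec, script_vec):
        pair = torch.cat([product_vec, script_vec], dim=-1)
        return torch.softmax(self.fc(pair), dim=-1)[:, 1]  # P(associated)

head = AssociationHead()
probs = head(torch.randn(4, 1024), torch.randn(4, 1024))  # batch of 4 pairs
```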

3.3.2 Content association

Content association mainly targets the association between user-generated content (UGC, such as user reviews) and standard scripts. Since a piece of UGC text usually contains multiple sentences, and only some of them mention information related to a standard script, we match UGC against standard scripts at the clause granularity. Balancing efficiency and effectiveness, the matching process is further split into a recall stage and a ranking stage, as shown in Figure 8 below:

In the recall stage, the UGC text is split into clauses, and exact matching against standard script names and their nicknames is performed over the clause set. Clauses that match proceed to fine-grained association discrimination in the ranking stage.

In the ranking stage, association discrimination is cast as an aspect-level classification problem. Following the practice of aspect-level sentiment classification [5], we build a matching model based on BERT sentence-pair classification: the standard script name actually hit in a UGC clause and the clause itself are concatenated with [SEP] and fed into BERT, followed by a fully connected layer and a softmax layer that perform the binary classification of whether the UGC genuinely concerns the script. The classification probabilities output by the model are finally filtered by a threshold, yielding the standard scripts associated with the UGC. A minimal sketch of the two stages follows.
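Below is a minimal sketch of the recall-then-rank pipeline; the checkpoint, clause splitter, and threshold are assumptions, and the classifier shown here is untrained.

```python
# UGC association sketch: exact-match recall over review clauses, then
# a BERT sentence-pair classifier ("[CLS] script [SEP] clause [SEP]").
import re
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2)  # fine-tuned on labeled pairs in practice

def recall(ugc_text, script_names):
    """Split the review into clauses and keep (script, clause) pairs
    where a standard script name or nickname literally appears."""
    clauses = re.split(r"[,。,;;!!??\n]", ugc_text)
    return [(s, c) for c in clauses for s in script_names if s and s in c]

def rank(pairs, threshold=0.5):
    """Score each pair and keep those whose association probability
    clears the threshold."""
    kept = []
    for script, clause in pairs:
        enc = tokenizer(script, clause, return_tensors="pt", truncation=True)
        prob = torch.softmax(model(**enc).logits, dim=-1)[0, 1].item()
        if prob >= threshold:
            kept.append((script, clause, prob))
    return kept

pairs = recall("Played Shiri last night, the ending was great", ["Shiri"])
print(rank(pairs))
```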

Unlike the model training described earlier, the UGC-to-standard-script matching model cannot quickly obtain a large number of training samples. Given this scarcity, a few hundred samples were labeled manually at first. On top of active learning, we also tried a contrastive-learning-style approach, applying a regularization constraint to the outputs of the model's two dropout passes following Regularized Dropout (R-Drop) [6]. In the end, with fewer than 1K training samples, the accuracy of UGC-to-standard-script association met the online requirements, and the number of UGC items associated with each standard script also increased substantially. A minimal sketch of the R-Drop loss follows.
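The sketch below shows the R-Drop [6] regularizer: the same batch is passed through the model twice with different dropout masks, and the two output distributions are pulled together with a symmetric KL term; the weight alpha is an assumption.

```python
# R-Drop loss sketch: cross-entropy on both passes plus a symmetric
# KL divergence between the two dropout-perturbed output distributions.
import torch.nn.functional as F

def r_drop_loss(model, inputs, labels, alpha=1.0):
    logits1 = model(**inputs).logits    # first pass, one dropout mask
    logits2 = model(**inputs).logits    # second pass, another mask
    ce = 0.5 * (F.cross_entropy(logits1, labels) +
                F.cross_entropy(logits2, labels))
    p1 = F.log_softmax(logits1, dim=-1)
    p2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (F.kl_div(p1, p2, reduction="batchmean", log_target=True) +
                F.kl_div(p2, p1, reduction="batchmean", log_target=True))
    return ce + alpha * kl
```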

4. Application and practice

The current script killing knowledge graph, with thousands of standard scripts at its core, is associated with millions of supply items. The results of the standardized construction of script supply have been applied in several Meituan business scenarios; the specific applications and effects are described below.

4.1 Category Construction

Script killing supply mining helped the business identify script killing merchants, facilitating the construction of the new script killing category and the corresponding list pages. The script killing category migration, the script killing entrance on the leisure and entertainment channel page, and the script killing list page have all gone live. The channel page's script killing icon is fixed as the first item in the third row, providing a centralized traffic entrance and helping establish unified user perception. Online examples are shown in Figure 9 ((a) the script killing entrance on the leisure and entertainment channel page; (b) the script killing list page).

4.2 Personalized Recommendation

The standard script and attribute nodes in the script killing knowledge graph, together with their associations with supply and users, power the recommendation slots on script killing pages. They are applied to the popular-script recommendations on the script list page (Figure 10(a)), as well as to the product recommendations and playable-store module (left of Figure 10(b)) and the related-script recommendation module (right of Figure 10(b)) on the script detail page. These recommendation slots help cultivate users' habit of finding scripts on the platform, improve users' cognition and purchase experience, and raise the matching efficiency between users and supply.

Taking the popular-script recommendation module on the script list page as an example, the nodes and relations of the script killing knowledge graph are used not only directly for script recall but also in the ranking stage. In ranking, based on the script killing knowledge graph, combined with user behavior, and drawing on the Deep Interest Network (DIN) [7] model structure, we model the sequences of scripts and products a user has visited, building a two-channel DIN model to capture user interest in depth and distribute scripts in a personalized way. The product-visit sequence is converted into a script sequence via the product-to-standard-script association and modeled against the candidate script with attention. The model structure is shown in Figure 11 below; a minimal sketch of the attention unit follows.
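Below is a minimal sketch of one DIN-style [7] attention channel: the candidate script attends over the user's script-visit sequence (the product channel is analogous); the embedding dimension and MLP sizes are assumptions.

```python
# DIN-style attention sketch: score each behavior against the candidate
# from the two vectors, their difference, and their element-wise product,
# then take the weighted sum of behaviors as the user-interest vector.
import torch

class DinAttention(torch.nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(4 * dim, 32), torch.nn.ReLU(),
            torch.nn.Linear(32, 1))

    def forward(self, seq, candidate):            # seq: (B, T, D); candidate: (B, D)
        cand = candidate.unsqueeze(1).expand_as(seq)
        feats = torch.cat([seq, cand, seq - cand, seq * cand], dim=-1)
        weights = self.mlp(feats)                 # (B, T, 1) interest weights
        return (weights * seq).sum(dim=1)         # (B, D) user-interest vector

att = DinAttention()
interest = att(torch.randn(8, 10, 64), torch.randn(8, 64))  # batch of 8 users
```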

4.3 Information exposure and filtering

Based on the nodes and relations in the script killing knowledge graph, related tag filters were added to the script killing list page and the script list page, and script attributes and associated supply information are exposed; the related applications are shown in Figure 12 below. These tag filters and exposed information give users a standardized information display, lower their decision cost, and make choosing stores and scripts more convenient.

4.4 Ratings and lists

On the script detail page, the association between content and standard scripts feeds into the score calculation for each script (Figure 13(a)). On this basis, script-dimension lists of classic must-plays and recently popular scripts are generated, as shown in Figure 13(b), further helping users select scripts and make decisions.

5. Summary and outlook

Facing the emerging script killing industry, we responded to the business quickly: with the standard script as the core node and in light of industry characteristics, we built the corresponding knowledge graph through script killing supply mining, standard script library construction, and association between supply and standard scripts, gradually completing supply standardization from 0 to 1 and solving the business's problems with simple and effective methods.

At present, the script killing knowledge graph has been applied in multiple script killing scenarios, supporting continued business growth and significantly improving user experience. In future work, we will continue to optimize and explore:

  • Continuous improvement of the standard script library: optimize standard script names and attributes and the corresponding supply relations, ensure the quality and coverage of the library, and try introducing external knowledge to supplement the current standardization results.
  • Deeper construction of the script killing graph: the current graph mainly focuses on users' concrete demand objects such as "scripts"; follow-up work will dig deeper into users' scene demands and explore linkage between script killing and other industries, to better support the industry's development.
  • More application exploration: the graph data can be applied to modules such as search, improving supply-demand matching efficiency in more scenarios and creating greater value.

References

[1] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.

[2] Lample G, Ballesteros M, Subramanian S, et al. Neural architectures for named entity recognition[J]. arXiv preprint arXiv:1603.01360, 2016.

[3] Reimers N, Gurevych I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks[J]. arXiv preprint arXiv:1908.10084, 2019.

[4] Tan M, Le Q. EfficientNet: Rethinking model scaling for convolutional neural networks[C]//International Conference on Machine Learning. PMLR, 2019:6105-6114.

[5] Sun C, Huang L, Qiu X. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence[J]. arXiv preprint arXiv:1903.09588, 2019.

[6] Liang X, Wu L, Li J, et al. R-Drop: Regularized dropout for neural networks[J]. arXiv preprint arXiv:2106.14448, 2021.

[7] Zhou G, Zhu X, Song C, et al. Deep interest network for click-through rate prediction[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018: 1059-1068.

About the authors

Li Xiang, Chen Huan, Zhihua, Xiaoyang, and Wang Qi are all from the Integrated Business Data team of the Meituan In-store Platform Technology Department.

Recruitment information

The Integrated Business Data team of the Meituan In-store Platform Technology Department is hiring long-term for algorithm (natural language processing / recommendation), data warehouse, data science, and system development positions, based in Shanghai. Interested candidates are welcome to send their resumes to [email protected].
