1. The background

Named Entity Recognition (NER), also known as proper name recognition, refers to recognizing entities with specific meanings in text, including person names, place names, organization names, proper nouns and so on. NER is an important basic tool for information extraction, question answering systems, syntactic analysis, machine translation, metadata annotation for the Semantic Web and other applications, and plays an important role in putting natural language processing technology to practical use. In the Meituan search scenario, NER is a basic signal of Deep Query Understanding (DQU) and is mainly applied to search recall, user intent recognition, entity linking and other modules. The quality of the NER signal directly affects the user's search experience.

The application of entity recognition in search recall is briefly described below. In O2O search, a merchant POI is described by several text fields, such as merchant name, address and category, which have little correlation with one another. If an O2O search engine simply intersects hits across all text fields, it may produce a large number of spurious recalls. Our solution, shown in Figure 1 below, is to let a specific query perform inverted retrieval only in a specific text field, which we call "structured recall", to ensure that recalled merchants are strongly relevant. For example, for a query such as "Haidilao", some merchants describe their address as "a few hundred meters from Haidilao"; a full-text search across all fields would recall these merchants, which is obviously not what the user wants. With structured recall, NER identifies "Haidilao" as a merchant name, and retrieval is then performed only in the merchant-name field, so that only Haidilao-brand merchants are recalled, precisely meeting the user's need.

Different from other application scenarios, the NER task in Meituan search has the following characteristics:

  • The number of new entities is large and growing fast: the local life service sector is developing rapidly, with new stores, new goods and new service categories emerging constantly. User queries often contain non-standard expressions, abbreviations and buzzwords (such as "anxious", "cat-sniffing", etc.), which makes achieving NER with high accuracy and high coverage a great challenge.

  • Strong domain relevance: entity recognition in search is highly related to business supply. Besides general semantics, business-related knowledge is needed to assist judgment. For example, "having a haircut" would generally be understood as a generic description of an action, but in search it corresponds to a merchant entity.

  • High performance requirements: the time from the user initiating a search to the final results being presented is very short, and NER, as a basic module of DQU, must finish within milliseconds. Recently, much research and practice based on deep networks has significantly improved NER quality, but these models are often computationally heavy and slow at prediction time. Optimizing model performance to meet NER's latency requirements is another big challenge in NER practice.

2. Technical selection

Given the characteristics of NER tasks in the O2O field, our overall technology choice is a framework of "entity dictionary matching + model prediction", shown in Figure 2 below. Entity dictionary matching and model prediction each have their own emphasis, and both are indispensable at the current stage. Below we answer three questions about this choice.

Why is entity dictionary matching needed?

**A:** Mainly for the following four reasons:

First, head queries in search are usually short and simply phrased, and are concentrated in a few entity types such as merchant, category and address. Entity dictionary matching is simple, yet its accuracy on such queries exceeds 90%.

Second, NER is domain-dependent. Business entity dictionaries can be obtained by mining business data resources, and online dictionary matching guarantees that the recognition results fit the domain.

Third, onboarding a new business line is more flexible: entity recognition for a new business scenario can be enabled simply by providing the business-related entity thesaurus.

Fourth, some downstream consumers of NER have very strict response-time requirements, and dictionary matching is fast, so there are essentially no performance problems.

Why is model prediction needed when there is already entity dictionary matching?

**A:** There are two main reasons:

First, as search volume keeps growing, long-tail search traffic is phrased in increasingly complex ways, and more and more OOV (Out of Vocabulary) problems appear. Entity dictionaries can no longer satisfy increasingly diverse user needs, whereas model prediction can generalize and serves as an effective supplement to dictionary matching.

Second, dictionary matching cannot resolve ambiguity. Take the query "Yellow Crane Tower food": "Yellow Crane Tower" appears in the entity dictionary simultaneously as a scenic spot in Wuhan, a merchant in Beijing and a cigarette brand. Dictionary matching has no ability to disambiguate, so all three types would be output, while model prediction can use the context and will not output the cigarette-brand reading of "Yellow Crane Tower".

How are the results of entity dictionary matching and model prediction combined?

**A:** Currently, we use a trained CRF weight network as a classifier to score the NER paths produced by entity dictionary matching and by model prediction. When there is no dictionary-matching result, or its path score is clearly lower than that of the model prediction, the model's recognition result is adopted; otherwise the dictionary-matching result is used.

Having introduced our technology choices, we will now describe our work on entity dictionary matching and online model prediction, hoping it provides some help for your own exploration of the O2O NER field.

3. Entity dictionary matching

Traditional NER techniques can only deal with established, existing entities in the general domain and cannot handle entity types unique to a vertical domain. In the case of Meituan search, offline mining of POI structured information, merchant review data, search logs and other proprietary data solves the domain entity recognition problem well. After continuous enrichment and accumulation of the offline entity library, online entity recognition via lightweight thesaurus matching is simple, efficient and controllable, and covers head and torso traffic well. At present, the online NER recognition rate based on the entity library reaches 92%.

3.1 Offline Mining

Meituan has rich and diverse structured data, and a high-precision initial entity library can be obtained by processing structured data within the domain. For example, from basic merchant information we can obtain entities such as merchant names, categories, addresses and the goods or services sold; from Maoyan entertainment data we can obtain entities such as films, TV series and artists. However, the entity names users search for often contain many non-standard expressions that differ from the standard entity names defined by the business, so mining domain entities from non-standard expressions becomes particularly important.

Existing new-word mining techniques fall mainly into unsupervised learning, supervised learning and distant supervision. Unsupervised learning generates candidate sets of frequent sequences and filters them by computing compactness and freedom indicators. Although this approach can produce good candidate sets, filtering by feature thresholds alone cannot effectively balance precision and recall; in practice a threshold is usually chosen to guarantee high precision at the expense of recall. Most state-of-the-art new-word mining algorithms are supervised, usually involving complex grammatical analysis models or deep networks, and relying on domain experts to design numerous rules or on large amounts of manually labeled data. Distant supervision generates a small amount of labeled data from open-source knowledge bases; although it alleviates the high cost of manual labeling to some extent, such small samples can only train simple statistical models and cannot train complex models with strong generalization ability.

Our offline entity mining is multi-source and multi-method, drawing on data sources including the structured merchant information base, encyclopedia entries, semi-structured search logs, and unstructured user-generated content (UGC) such as reviews. Many mining methods are used, including rules, traditional machine learning models and deep learning models. As unstructured text, UGC contains a large number of non-standard entity names. Below we describe in detail an automatic method for mining new words in the vertical domain from UGC; it consists of three steps, shown in Figure 3 below:

Step1: Candidate sequence mining. Word sequences that occur frequently and contiguously are effective candidates for potential new words, so we use frequent sequences to generate a sufficiently large candidate set.
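As a rough illustration of Step1 (not the production pipeline; the n-gram range and frequency threshold below are hypothetical), candidate mining can be sketched as:

```python
from collections import Counter

def mine_candidate_ngrams(corpus, max_n=6, min_freq=20):
    """Collect character n-grams that occur at least min_freq times in the UGC
    corpus. The surviving n-grams form the candidate set consumed by Step2."""
    counts = Counter()
    for sent in corpus:
        for n in range(2, max_n + 1):
            for i in range(len(sent) - n + 1):
                counts[sent[i:i + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}

# Example: candidates = mine_candidate_ngrams(ugc_sentences)
```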

Step2: Large-scale labeled corpus generation based on distant supervision. Frequent sequences change with the given corpus, so manual labeling is extremely costly. We use the entity dictionary accumulated in the domain as the distant-supervision thesaurus, and take the intersection of the Step1 candidate sequences and the entity dictionary as positive training samples. Analysis of the candidates also shows that only about 10% of the millions of frequent n-grams are truly high-quality new words, so negative sampling is used to produce the negative training set [1]. For the massive UGC corpus, we designed four dimensions of statistical features to measure the usability of candidate phrases:

  • Frequency: a meaningful new word should appear sufficiently often in the corpus; the frequency is computed in Step1.

  • Compactness: mainly used to evaluate the co-occurrence strength of adjacent elements within a phrase, using indicators such as the t-test, Pearson's chi-square test, pointwise mutual information and the likelihood ratio.

  • Informativeness: a newly discovered word should have real meaning and refer to a new entity or concept. This feature mainly considers the inverse document frequency of the phrase in the corpus, its part-of-speech distribution and its stop-word distribution.

  • Completeness: a newly discovered word should be interpretable as a whole in the given context, so completeness is measured by considering the compactness of both its sub-phrases and its super-phrases.

After constructing the small labeled sample set and extracting the multi-dimensional statistical features, a binary classifier is trained to predict the quality of candidate phrases. Since the negative training samples are produced by negative sampling, a small number of high-quality phrases are mixed into them. To reduce the impact of this negative-example noise on the predicted phrase quality scores, the error can be reduced by ensembling multiple weak classifiers. After model prediction, candidate sequences whose scores exceed a certain threshold form the positive-example pool, and those with lower scores form the negative-example pool.
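A minimal sketch of this step, assuming scikit-learn and a four-dimensional feature vector per candidate (feature extraction itself is omitted; thresholds are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

def train_phrase_scorer(features, labels):
    """features: (n_candidates, 4) matrix of the statistical features above
    (frequency, compactness, informativeness, completeness);
    labels: 1 for dictionary-matched positives, 0 for negatively sampled candidates.
    An ensemble of shallow trees dampens the label noise introduced by negative sampling."""
    clf = RandomForestClassifier(n_estimators=50, max_depth=4)
    clf.fit(features, labels)
    return clf

def split_pools(clf, features, candidates, pos_thresh=0.8, neg_thresh=0.2):
    """Candidates scoring above pos_thresh form the positive-example pool,
    those below neg_thresh form the negative-example pool."""
    scores = clf.predict_proba(features)[:, 1]
    pos_pool = [c for c, s in zip(candidates, scores) if s >= pos_thresh]
    neg_pool = [c for c, s in zip(candidates, scores) if s <= neg_thresh]
    return pos_pool, neg_pool
```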

Step3: Phrase quality assessment based on a deep semantic network. With a large amount of labeled data, deep network models can automatically and efficiently learn corpus features and produce effective models with generalization ability. BERT learns semantic representations of text from massive natural-language corpora with a deep model, and set new records on several natural language understanding tasks with simple fine-tuning, so we trained our phrase quality scorer on top of BERT. To further improve the quality of the training data, we use search log data to distantly supervise the large-scale positive and negative example pools generated in Step2, treating terms with many search records as meaningful keywords. To improve the reliability and diversity of the training data, we take the part of the positive-example pool that overlaps with the search logs as the model's positive samples, and the negative-example pool minus the search-log set as the negative samples. In addition, a bootstrapping method is adopted: after obtaining phrase quality scores for the first time, the training samples are updated according to the current scores and the distant-supervision search logs, and the classifier is retrained iteratively. This improves the phrase quality classifier and effectively reduces false positives and false negatives.
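A condensed sketch of the BERT-based phrase quality scorer using the Hugging Face transformers API (the checkpoint name and hyperparameters are placeholders, not our production configuration):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(phrases, labels):
    """phrases: list of candidate strings; labels: 1 = high-quality phrase, 0 = noise."""
    batch = tokenizer(phrases, padding=True, truncation=True, return_tensors="pt")
    out = model(**batch, labels=torch.tensor(labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

def quality_score(phrase):
    """Probability that the candidate is a high-quality phrase; in the bootstrapping
    loop these scores are used to refresh the positive and negative pools."""
    with torch.no_grad():
        logits = model(**tokenizer([phrase], return_tensors="pt")).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()
```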

After a large number of new words or phrases are extracted from the UGC corpus, we follow AutoNER [2] to predict the types of the newly mined words and thus expand the offline entity library.

3.2 Online Matching

The original online NER dictionary matching method performed bidirectional maximum matching directly on the query to obtain a candidate set of recognized components, and then filtered the final result by word frequency (here, entity search volume). This strategy is relatively simple but places very high demands on the accuracy and coverage of the thesaurus, so it has the following problems:

  • The character-based maximum matching algorithm is prone to segmentation errors when the query contains entities not covered by the thesaurus. For example, for the query "Haituo valley", if the thesaurus only contains "Haituo mountain", the segmentation "Haituo mountain / valley" is produced, which is wrong.

  • Granularity is not controllable. For example, how the query "Starbucks coffee" is segmented depends on whether the lexicon covers "Starbucks", "coffee" or "Starbucks coffee".

  • Node weights are hard to define properly. For example, when entity search volume is used as the node weight, for the query "Xinyang restaurant" the segmentation "Xinyang cuisine / hall" scores higher than "Xinyang / restaurant".

To solve the above problems, a CRF word segmentation model is introduced before entity dictionary matching: segmentation criteria are formulated for the Meituan search vertical domain, the training corpus is manually annotated, and the CRF segmentation model is trained on it. Meanwhile, a two-stage repair method is designed to address model segmentation errors:

  1. Combine the terms from model segmentation with the terms from domain-dictionary matching, and use dynamic programming to find the segmentation that maximizes the sum of term-sequence weights (a simplified sketch follows this list).
  2. Apply strong repair rules based on Pattern regular expressions. Finally, the component recognition result based on entity-library matching is output.
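A simplified sketch of the dynamic-programming repair in step 1, assuming each candidate term (from the CRF model or the domain dictionary) carries a weight such as its entity search volume:

```python
def best_segmentation(query, term_weights):
    """Return the segmentation of `query` that maximizes the sum of term weights.
    term_weights maps candidate terms to weights; queries that cannot be fully
    covered fall back to the unsegmented string."""
    n = len(query)
    best = [float("-inf")] * (n + 1)   # best[i]: max score of query[:i]
    back = [None] * (n + 1)            # back[i]: start index of the last term
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(i):
            term = query[j:i]
            if term in term_weights and best[j] + term_weights[term] > best[i]:
                best[i] = best[j] + term_weights[term]
                back[i] = j
    if back[n] is None:
        return [query]
    terms, i = [], n
    while i > 0:
        terms.append(query[back[i]:i])
        i = back[i]
    return terms[::-1]
```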

4. Online model prediction

For long-tail queries and queries containing unregistered (OOV) entities, we use the model for online recognition. The NER model has evolved through several stages, as shown in Figure 5 below. The main models currently online are BERT [3] and a BERT+LR cascade model, and the offline effectiveness of several models still under exploration has also been verified. Building the online NER model for search mainly faces three problems:

  1. High performance requirements: NER is a basic module and model prediction must finish within milliseconds, yet current deep learning models are computationally heavy and slow at prediction time.
  2. Strong domain correlation: entity types in search are highly correlated with business supply, so considering general semantics alone cannot guarantee recognition accuracy.
  3. Lack of annotated data: NER annotation is relatively difficult, requiring both entity boundary segmentation and entity type information; the process is time-consuming and laborious, and large-scale annotated data is hard to obtain.

To address the high performance requirements, we carried out a series of performance optimizations when upgrading our online model to BERT. To address domain relevance, we propose knowledge-enhanced NER methods that fuse search log features and entity dictionary information. To address the difficulty of obtaining training data, we propose a weakly supervised NER method. Let's look at these techniques in turn.

4.1 BERT model

BERT is a natural language processing method that Google released in October 2018. Upon publication it attracted wide attention from academia and industry. In terms of effectiveness, BERT refreshed the state of the art on 11 NLP tasks, and the method was rated a major NLP advance of 2018 and won the NAACL 2019 best paper award [4,5]. BERT and the GPT method released earlier by OpenAI follow essentially the same technical route, with slight differences in detail; the main contribution of both works is the pre-training + fine-tuning paradigm for solving natural language processing problems. Taking BERT as an example, applying the model involves two steps:

  • Pre-training: network parameters are learned from large general corpora, including Wikipedia and BookCorpus, which contain a large amount of text and exhibit rich linguistic phenomena.

  • Fine-tuning: network parameters are fine-tuned with task-specific annotated data, so there is no need to train a task-specific network from scratch for the target task.

Applying BERT to online entity recognition poses one challenge: slow prediction. We explored model distillation and prediction acceleration, and launched a distilled BERT model, a BERT+Softmax model and a BERT+CRF model in successive stages.

4.1.1 Model distillation

We tried both pruning and distillation on the BERT model. The results showed that for a complex NLP task like NER, pruning caused serious precision loss, while model distillation was feasible. Model distillation uses a simple model to approximate the output of a complex model, reducing the computation needed for prediction while preserving prediction quality. Hinton elaborated the core idea in his 2015 paper [6]. The complex model is generally called the Teacher Model and the distilled simple model the Student Model. Hinton's distillation method trains the Student Model on the probability distributions over pseudo-labeled data rather than on its hard labels; the argument is that, compared with labels, probability distributions provide more information and stronger constraints, which better keep the Student Model's predictions consistent with the Teacher Model's. At a NeurIPS 2018 workshop, [7] proposed a new network structure, BlendCNN, to approximate the prediction quality of GPT, which is essentially model distillation: BlendCNN is 300 times faster than the original GPT and even slightly more accurate on certain tasks. Two conclusions can be drawn about model distillation:

  • The essence of model distillation is function approximation. For a specific task, the authors believe the Student Model can be structurally completely different from the Teacher Model, as long as its complexity matches the complexity of the problem. An example of Student Model selection is shown in Figure 6 below: suppose the samples (x, y) of a problem are drawn from a polynomial function whose highest degree is d=2, and the available Teacher Model uses a higher degree (say d=5); in this case the Student Model chosen for prediction must not be less complex than the problem itself, i.e. its degree should be at least d=2.

  • Depending on the amount of unlabeled data, the constraint used for distillation can vary. As shown in Figure 7, if the unlabeled data is small in scale, value (logits) approximation can be used as the learning constraint; if it is of medium scale, distribution approximation can be adopted; if it is large in scale, label approximation can be adopted, i.e. only the Teacher Model's predicted labels are used to guide Student Model learning.

With these conclusions in mind, how do we apply model distillation to the search NER task? First, analyze the task. Compared with the tasks in the literature, search NER has a significant difference: as an online application, search has a huge amount of unlabeled data, with tens of millions of user queries per day, far larger than the data provided by offline benchmarks. Accordingly, we simplify the distillation process: we do not restrict the form of the Student Model, but choose a mainstream neural network model with fast inference to approximate BERT; and instead of value approximation or distribution approximation, training directly uses label approximation as the target to guide Student Model learning.
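A schematic of this simplified distillation loop (model classes and the student's loss function are placeholders, assuming a CRF-style student such as IDCNN-CRF):

```python
import torch

def distill_by_label_approximation(teacher, student, unlabeled_loader, optimizer, epochs=3):
    """teacher: frozen BERT NER model; student: lightweight sequence labeler.
    Unlabeled queries are abundant, so only the teacher's predicted tag
    sequences (label approximation) are used as training targets."""
    teacher.eval()
    for _ in range(epochs):
        for token_ids, mask in unlabeled_loader:
            with torch.no_grad():
                pseudo_tags = teacher(token_ids, mask).argmax(dim=-1)   # hard pseudo-labels
            loss = student.neg_log_likelihood(token_ids, pseudo_tags, mask)  # e.g. CRF loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```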

We use IDCNN-CRF to approximate the BERT entity recognition model. IDCNN (Iterated Dilated CNN) is a multi-layer CNN: the lower layers use ordinary convolution, where the convolution result is a weighted sum over positions covered by a sliding window whose positions are spaced 1 apart; the higher layers use dilated (atrous) convolution, where the positions covered by the sliding window are spaced d apart (d > 1). Using dilated convolution at higher layers reduces the amount of convolution computation without losing the ability to model longer-range sequence dependencies. In text mining, IDCNN is often used as a replacement for LSTM. Experiments show that, compared with the original BERT model, the distilled model's online prediction speed improves by tens of times with no obvious loss in accuracy.
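A minimal PyTorch sketch of an iterated dilated convolution block (the dilation schedule and kernel size are illustrative):

```python
import torch.nn as nn

class IDCNNBlock(nn.Module):
    """1-D convolutions whose dilation grows with depth: dilation 1 at the bottom
    (ordinary convolution), larger dilations above, enlarging the receptive field
    without extra computation per position."""
    def __init__(self, hidden_dim, kernel_size=3, dilations=(1, 1, 2, 4)):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size,
                      padding=d * (kernel_size - 1) // 2, dilation=d)
            for d in dilations
        ])
        self.act = nn.ReLU()

    def forward(self, x):          # x: (batch, hidden_dim, seq_len)
        for conv in self.layers:
            x = self.act(conv(x))
        return x                   # same shape; a CRF layer is stacked on top
```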

4.1.2 Prediction acceleration

Because BERT contains a large number of small operators and attention computation is heavy, BERT's prediction latency is relatively high when deployed online. We mainly use the following three methods to accelerate model prediction. In addition, for high-frequency queries in the search log we cache the prediction results as a dictionary, further reducing the QPS pressure on online model prediction. The three acceleration methods are:

Operator fusion: reducing the number of kernel launches and improving the memory-access efficiency of small operators lowers the overhead of BERT's many small operators. We investigated the Faster Transformer implementation for this. The average latency achieves a speedup of 1.4x to 2x, and TP999 a speedup of 2.1x to 3x; the method is suitable for the standard BERT model. However, the open-source version of Faster Transformer is of low engineering quality, with many usability and stability problems, and cannot be applied directly, so we did secondary development on NVIDIA's open-source Faster Transformer, mainly improving its stability and ease of use:

  • Ease of use: Automatic conversion, Dynamic Batch, and Auto Tuning are supported.
  • Stability: Fixes memory leaks and thread safety issues.

Batching: the principle of batching is to merge multiple requests into one batch for inference, reducing the number of kernel launches and making full use of multiple GPU SMs (streaming multiprocessors), thereby improving overall throughput. With max_batch_size set to 4, the native BERT model achieves an average latency below 6 ms and a maximum throughput of 1300 QPS. This method is very suitable for BERT optimization in the Meituan search scenario, since search traffic has obvious peaks and troughs and batching improves model throughput during peak periods.

Mixed precision: mixed precision refers to mixing FP32 and FP16. It speeds up BERT training and prediction and reduces GPU memory overhead while combining the stability of FP32 with the speed of FP16. FP16 is used to accelerate the computation itself, while during training the weights are stored in FP32 and parameter updates are performed in FP32; updating FP32 master weights in the FP32 data type effectively avoids overflow. Mixed precision improves training and prediction speed to a certain extent without affecting model quality.
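A sketch of mixed-precision training with PyTorch automatic mixed precision (illustrative; parameters stay in FP32 while selected operations run in FP16, matching the scheme described above):

```python
import torch

scaler = torch.cuda.amp.GradScaler()        # scales the loss to avoid FP16 gradient underflow

def train_step(model, optimizer, batch, labels, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # forward pass runs selected ops in FP16
        loss = loss_fn(model(**batch), labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)                   # unscales gradients, updates FP32 weights
    scaler.update()
    return loss.item()

def predict(model, batch):
    model.eval()
    with torch.no_grad(), torch.cuda.amp.autocast():
        return model(**batch)                # FP16 inference path
```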

4.2 Knowledge-enhanced NER

How to embed external, domain-specific knowledge into language models as auxiliary information has been a hot research topic in recent years. K-BERT [8], ERNIE [9] and other models have explored ways of combining knowledge graphs with BERT, providing a good reference for us. NER in Meituan search is domain-dependent and the determination of entity types is highly related to business supply, so we also explore how to integrate external knowledge such as supply POI information, user clicks and the domain entity thesaurus into the NER model.

4.2.1 Lattice-LSTM fused with search log features

In O2O vertical search, a large number of entities are defined by merchants themselves (such as merchant names and deal names), and entity information is hidden in the attributes of the supplied POIs, so traditional purely semantic methods recognize them poorly. For Chinese entity recognition, Lattice-LSTM [10] enriches semantic information by additionally feeding in word vectors. Borrowing this idea and combining it with user search behavior, we mine potential phrases in queries; these phrases carry POI attribute information, which we embed into the model, alleviating the domain new-word discovery problem to a certain extent. Compared with the original Lattice-LSTM method, recognition accuracy improves by 5 thousandths.

1) Phrase mining and feature calculation

This process mainly includes two steps, matching position calculation and phrase generation, described in detail below.

Step1: Matching position calculation. Search logs are processed to compute detailed matches between queries and document fields, as well as document weights such as click-through rate. As shown in Figure 9, for the query "handmade weaving", in document D1 (a POI in search) "handmade" matches the deal-name field and "weaving" matches the address field; in document D2, "handmade weaving" matches both the merchant-name and deal-name fields. The start and end positions record where the matched query substring begins and ends.

Step2: Phrase generation. Taking the results of Step1 as input, a model infers candidate phrases; several models can be used to produce results under different assumptions. We formulate candidate phrase generation as an Integer Linear Programming (ILP) problem and define an optimization framework whose hyperparameters can be customized to business requirements to obtain results under the desired assumptions. For a specific query Q, each segmentation result can be represented by integer variables x_ij: x_ij = 1 means that positions i through j of the query form a phrase, i.e. Q_ij is a phrase; x_ij = 0 means they do not. The optimization objective is to maximize the total matching score collected under a given segmentation {x_ij}. The objective and constraint functions are shown in Figure 10, where p denotes a document, f a field, w_p the weight of document p, and w_f the weight of field f; x_ijpf indicates whether the query substring Q_ij appears in field f of document p and is taken as observation evidence by the final segmentation scheme; Score(x_ijpf) is the observation score considered by the final segmentation scheme; w(x_ij) is the weight of segment Q_ij; y_ijpf is the observed match of query substring Q_ij in field f of document p; and χ_max is the maximum number of phrases the query may contain. Here χ_max, w_p, w_f and w(x_ij) are hyperparameters that must be set before solving the ILP problem; they can be set under different assumptions, either manually from experience or based on other signals, following the method given in Figure 10. The feature vector of the final phrase is represented by the click distribution over the POI attribute fields.
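A much-simplified sketch of this ILP, using the PuLP library (the objective here only aggregates pre-computed evidence scores and is not the exact formulation of Figure 10):

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

def generate_phrases(query, evidence, chi_max=3, span_weight=None):
    """evidence[(i, j)]: matching score of query[i:j] accumulated over documents
    and fields, already weighted by w_p and w_f (from Step1).
    span_weight: optional per-span weight w(x_ij); chi_max: max number of phrases."""
    spans = list(evidence)
    x = {s: LpVariable(f"x_{s[0]}_{s[1]}", cat=LpBinary) for s in spans}
    w = span_weight or {s: 1.0 for s in spans}

    prob = LpProblem("phrase_generation", LpMaximize)
    prob += lpSum(w[s] * evidence[s] * x[s] for s in spans)          # maximize matching score
    for pos in range(len(query)):                                    # chosen spans must not overlap
        prob += lpSum(x[(i, j)] for (i, j) in spans if i <= pos < j) <= 1
    prob += lpSum(x[s] for s in spans) <= chi_max                    # at most chi_max phrases
    prob.solve()
    return [query[i:j] for (i, j) in spans if x[(i, j)].value() == 1]
```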

2) Model structure

The model structure is shown in Figure 11. The blue part is a layer of a standard LSTM network (which can be trained alone or combined with other models) whose input is word vectors; the orange part represents all word vectors in the current query; the red part represents all phrase vectors in the current query computed in Step1. The LSTM hidden-state input consists of two levels of features: current textual semantic features, including the current word-vector input and the hidden output of the previous word vector; and potential entity-knowledge features, including the phrase features and word features at the current position. The computation and fusion of the potential-knowledge features are given below (in the formulas, σ denotes the sigmoid function and ⊙ denotes matrix multiplication):
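The original formulas are not reproduced in this text; as a rough sketch in our own notation (an assumption about the general form, following the usual gated-fusion pattern, not the exact published equations):

```latex
% x_t: current word embedding; h_{t-1}: previous LSTM hidden state;
% p_t: phrase feature vector covering position t (click distribution from Step1);
% \circ: element-wise product (our notation in this sketch).
\begin{aligned}
g_t &= \sigma\left(W_g\,[x_t; h_{t-1}; p_t] + b_g\right)   && \text{gate over the phrase knowledge} \\
k_t &= g_t \circ \tanh\left(W_p\, p_t + b_p\right)          && \text{gated phrase feature} \\
\tilde{x}_t &= [\,x_t;\, k_t\,]                             && \text{knowledge-augmented input to the LSTM cell}
\end{aligned}
```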

4.2.2 Two-stage NER fusing the entity dictionary

We consider fusing domain dictionary knowledge into the model and propose a two-stage NER method that splits the NER task into two sub-tasks: entity boundary recognition and entity label recognition. Compared with the traditional end-to-end NER approach, its advantage is that entity boundary segmentation can be reused across domains; moreover, in the entity label recognition stage, the accumulated entity data and entity linking can be fully exploited to improve label accuracy. The disadvantage is error propagation between the stages.

In the first stage, the BERT model focuses on determining entity boundaries; in the second stage, the information gain brought by the entity dictionary is fused into the entity classification model. The second-stage entity classification could predict each entity in isolation, but that would lose entity context, so we handle it as follows: the entity dictionary is used as training data to train an IDCNN classification model, the segmentation results output by the first stage are encoded, and this encoding is fed into the second-stage label recognition model, which completes decoding together with the context words. Evaluated on benchmark annotated data, this model improves query-level accuracy by 1% over BERT-NER. IDCNN is used here mainly for performance reasons; it can be replaced by BERT or another classification model depending on the application scenario.
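A schematic of the two-stage pipeline (class and function names are placeholders, not our production interfaces):

```python
def two_stage_ner(query, boundary_model, tag_model, entity_dict):
    """Stage 1: the BERT boundary model splits the query into entity mentions.
    Stage 2: an IDCNN classifier assigns a type to each mention, combining the
    mention itself, its query context and the accumulated dictionary types."""
    mentions = boundary_model.segment(query)              # e.g. ["Yellow Crane Tower", "food"]
    results = []
    for mention in mentions:
        dict_types = entity_dict.get(mention, [])         # types accumulated in the entity library
        context = query.replace(mention, "[ENT]")         # keep context words for decoding
        results.append((mention, tag_model.classify(mention, context, dict_types)))
    return results
```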

4.3 Weakly supervised NER

To address the difficulty of obtaining annotated data, we propose a weakly supervised scheme consisting of two processes: weakly supervised labeled-data generation and model training. The two processes are described in detail below.

Step1: Weakly supervised annotation sample generation

  1. Initial model: an entity recognition model is trained on the small labeled data set; here the BERT model described above is used, yielding the initial model ModelA.

  2. Dictionary data prediction: the entity recognition module has accumulated millions of high-quality entities as a dictionary, stored as entity text, entity type and attribute information. ModelA from the previous step is used to predict entity recognition results for these dictionary entries.

  3. Prediction result correction: entities in the entity dictionary have high accuracy, so in theory at least one entity type predicted by the model should appear among the types given in the dictionary for that entity; otherwise the model does not recognize this type of input well, more samples are needed, and the model's output for this input is corrected. We tried two correction methods: global correction and partial correction. Global correction corrects the whole input to the dictionary entity type, while partial correction performs type correction on individual terms segmented by the model. For example, the dictionary gives the entity type "merchant" for "Brother Barbecue Personalized DIY", while the model prediction is modifier + dish + category, with no term of the merchant type. Since the model prediction differs from the dictionary, the model's output labels need to be corrected. There are three correction candidates: "merchant + dish + category", "modifier + merchant + category", and "modifier + dish + merchant". We choose the one closest to the model's prediction. The rationale is that the model has largely converged and its predicted distribution is close to the true distribution, so we only need to fine-tune the predicted distribution rather than change it drastically. How do we pick the candidate closest to the model's prediction? We compute each correction candidate's score (probability) under the model and then compute its probability ratio against the current prediction (taken as the model's optimal result); the calculation is shown in Formula 2. The candidate with the largest probability ratio becomes the corrected, weakly supervised labeled sample. In the "Brother Barbecue Personalized DIY" example, the candidate "merchant + dish + category" has the largest probability ratio against the output "modifier + dish + category", so the annotation "Brother/merchant Barbecue/dish DIY/category" is obtained.
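Formula 2 is not reproduced in this text; in spirit, each correction candidate is compared against the model's own prediction by a probability ratio (our notation, an assumption about the exact form):

```latex
% q: the query; \hat{y}: the label sequence predicted by the model;
% y^{(c)}: a correction candidate consistent with the dictionary entity type.
r\bigl(y^{(c)}\bigr) = \frac{P_{\theta}\bigl(y^{(c)} \mid q\bigr)}{P_{\theta}\bigl(\hat{y} \mid q\bigr)},
\qquad
y^{*} = \arg\max_{c}\; r\bigl(y^{(c)}\bigr)
```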

Step2: Weakly supervised model training

There are two ways to train the weakly supervised model: first, mix the generated weakly supervised samples with the labeled samples and retrain the model from scratch without distinguishing them; second, fine-tune ModelA (trained on the labeled samples) with the weakly supervised samples. We tried both approaches; according to the experimental results, fine-tuning works better.

5. Summary and outlook

This article has introduced the characteristics of the NER task in the O2O search scenario and our technology choices, and detailed our exploration and practice in entity dictionary matching and model construction.

For entity dictionary matching, offline mining of POI structured information, merchant review data, search logs and other proprietary data solves the domain entity recognition problem well. In this part we introduced an automatic new-word mining method suited to the vertical domain; we have also accumulated other mining techniques for multi-source data, and are happy to exchange ideas offline if needed.

On the model side, we explored the three core problems of building an NER model for search: high performance requirements, strong domain correlation, and lack of annotated data. For the high performance requirements, model distillation and prediction acceleration allowed the main online NER model to be upgraded to the more effective BERT. For domain correlation, we proposed methods that fuse search-log domain knowledge and the entity dictionary respectively; experiments show that both improve prediction accuracy to a certain extent. For the difficulty of obtaining annotated data, we proposed a weakly supervised scheme that partially alleviates the poor prediction quality caused by scarce labeled data.

In the future, we will continue in-depth research on unregistered-entity (OOV) recognition, ambiguity and domain-related issues in NER, and we welcome industry peers to exchange ideas with us.

6. Reference materials

[1] Automated Phrase Mining from Massive Text Corpora. 2018.

[2] Learning Named Entity Tagger using Domain-Specific Dictionary. 2018.

[3] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018.

[4] www.jiqizhixin.com/articles/20…

[5] naacl2019.org/blog/best-p…

[6] Hinton et al. Distilling the Knowledge in a Neural Network. 2015.

[7] Yew Ken Chia et al. Transformer to CNN: Label-scarce distillation for efficient text classification. 2018.

[8] K-BERT: Enabling Language Representation with Knowledge Graph. 2019.

[9] ERNIE: Enhanced Language Representation with Informative Entities. 2019.

[10] Chinese NER Using Lattice LSTM. 2018.

7. Author profile

Li Hong, Xing Chi, Yan Hua, Ma Lu, Liao Qun, Zhi'an, Liu Liang, Li Chao, Zhang Gong, Yun Sen, Yong Chao, etc., all from the Meituan Search and NLP department.

Recruitment information

The Meituan Search department is hiring search, recommendation and NLP algorithm engineers on a long-term basis, based in Beijing. Interested candidates are welcome to send their resumes to [email protected] (subject line: Search and NLP).
