In the field of natural language processing, text representation learning transforms real-world text into data that computers can process, so that more accurate learning models can be built. However, in Chinese search scenarios, recall and similarity matching for text containing homophones, easily confused words, and typos remain persistent and difficult problems. This article describes an attempt to train Chinese word vectors from the perspective of graph computation, which has achieved positive results, and shares the approach. Author: Zhai Binxu, Senior R&D Engineer, Tencent Cloud Big Data.

I. Technical background

In the context of Chinese search, recall and similarity matching for homophones, easily confused words, typos, and similar text is a common and difficult problem. Text matching and recall in the NLP (Natural Language Processing) community has evolved from early full-text retrieval based on word segmentation and inverted indexes to the text vector retrieval that is popular today.

Vector retrieval obtains text vectors by training and learning distributed representations of text, which can solve semantic similarity matching problems that an inverted index cannot. Moreover, the industry has mature solutions for fast large-scale retrieval of high-dimensional vectors, such as Faiss and Nmslib.
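As a minimal illustration of how such vector retrieval is typically wired up (the embeddings, dimensionality, and index type below are placeholders, not the setup used in this article), a Faiss flat inner-product index can serve nearest-neighbour search over normalized text vectors:

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 128                                                    # embedding dimensionality (placeholder)
corpus_vecs = np.random.rand(10000, dim).astype("float32")   # stand-in for real text embeddings
faiss.normalize_L2(corpus_vecs)                              # normalize so inner product == cosine similarity

index = faiss.IndexFlatIP(dim)                               # exact inner-product index
index.add(corpus_vecs)                                       # add corpus vectors

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)                         # top-5 most similar corpus vectors
print(ids[0], scores[0])
```

In production, approximate indexes (e.g. IVF or HNSW variants) are usually chosen instead of the exact flat index once the corpus grows large.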

However, the representation learning methods commonly used today rarely consider the text similarity matching problems caused by input errors and pronunciation in Chinese scenarios.

For example, the author frequently encountered such requirements while developing Tencent Cloud's enterprise portrait product. When a user searches for "Tengxun Technology Group Co., Ltd." in the product, the registered enterprise name they actually want is "Tencent Technology (Shenzhen) Co., Ltd.". But because of input method errors ("Tencent" mistyped as "Tengxun"), cognitive errors ("limited liability company" remembered as "Group limited liability company"), and similar reasons, the user's input fails to match the desired search result, and may even be OOV ("Tengxun" may not be in the vocabulary).

How to solve the representation learning problem for Chinese morphology, without putting too much weight on semantic similarity, is the focus of this article.

II. Evolution of word embedding training

Word embedding is a key technology in the field of NLP under the statistical learning paradigm. The common training methods and main characteristics of word embeddings (also known as text representations) are shown in the figure below.

Figure 1. Overview of distributed representation methods for text

Early research on word embedding mainly covered one-hot encoding, TF-IDF, and other bag-of-words models. The Bag of Words (BOW) model ignores elements such as grammar and word order and treats a document as an unordered collection of words, each of which is independent.

These methods are all discrete representations. When the vocabulary is large, they occupy a lot of storage space: the size of the vocabulary determines the dimension of the vector, leading to the curse of dimensionality. In addition, such methods cannot compute any similarity between words, so the word vectors carry no notion of relatedness.
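A tiny sketch (with a made-up three-word vocabulary) makes both drawbacks concrete: the vector dimension equals the vocabulary size, and any two distinct one-hot vectors have zero cosine similarity no matter how related the words are:

```python
import numpy as np

vocab = ["腾讯", "科技", "公司"]          # toy vocabulary; real vocabularies have 10^5-10^6 entries
one_hot = np.eye(len(vocab))             # each word is a |V|-dimensional unit vector

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Distinct words are always orthogonal, regardless of how related they are.
print(cosine(one_hot[0], one_hot[1]))    # 0.0
print(cosine(one_hot[0], one_hot[0]))    # 1.0
```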

To address the curse of dimensionality and the semantic gap in bag-of-words representations, Yoshua Bengio et al. [1] showed that a language model trained with a neural network can produce better word vectors, and proposed several methods to optimize training.

As shown in the figure below, the network is divided into two parts. The first part uses the word feature matrix C to obtain the distributed representation of each word (i.e., its word embedding). The second part concatenates the embeddings of the n context words, passes them through a hidden layer and an output layer, and finally outputs p(w_t | context) through a softmax. The model is trained by maximizing the probability of the current word given its context.

This framework not only trains a language model parameterized by a neural network, but also obtains word embeddings (in matrix C) as a by-product of the language model.

Figure 2. Classical natural language models (Bengio et al., 2003)
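A minimal PyTorch sketch of this architecture (layer sizes and hyperparameters are illustrative, not those of the original paper) shows how the embedding matrix C falls out of training the language model:

```python
import torch
import torch.nn as nn

class NNLM(nn.Module):
    """Bengio-style neural language model: embed the context words, predict the next word."""
    def __init__(self, vocab_size, emb_dim=100, context_size=4, hidden_dim=128):
        super().__init__()
        self.C = nn.Embedding(vocab_size, emb_dim)            # word feature matrix C
        self.hidden = nn.Linear(context_size * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)          # expensive: |V|-way output layer

    def forward(self, context_ids):                           # context_ids: (batch, context_size)
        e = self.C(context_ids).flatten(start_dim=1)          # concatenate context embeddings
        h = torch.tanh(self.hidden(e))
        return self.out(h)                                    # logits; softmax is applied in the loss

# Training with cross-entropy over the next word maximizes p(w_t | context);
# after training, model.C.weight holds the learned word embeddings.
```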

Training the classical language model is computationally expensive (the cost is concentrated in the fully connected layer from the hidden layer to the output layer and in the softmax over the output layer), which makes it hard to apply. To address this, Mikolov et al. [2,3] simplified the language model and proposed the word2vec model with the CBOW and Skip-gram architectures. During training, Hierarchical Softmax or Negative Sampling can be used to further reduce complexity.

As shown in the figure below, this architecture greatly reduces the computation. However, the word vectors trained by word2vec have a one-to-one static mapping to words, so the polysemy problem remains unsolved.

Figure 3. Word2vec model structure
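For reference, a minimal skip-gram training run with negative sampling can be expressed with the gensim library (the corpus and hyperparameters here are placeholders, not the settings used later in this article):

```python
from gensim.models import Word2Vec

# Each sentence is a list of tokens; for Chinese, the text must be segmented first.
sentences = [["腾讯", "科技", "深圳", "有限", "公司"],
             ["腾讯", "云", "大数据"]]

model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimensionality
    window=5,          # context window size
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # negative sampling instead of a full softmax
    min_count=1,
    epochs=20,
)

print(model.wv.most_similar("腾讯", topn=3))
```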

To address polysemy, the ELMo model [4] was proposed. It learns word embeddings through a language model: the embedding of a word in actual use is adjusted according to the semantics of its context, so the same word obtains different embeddings in different contexts.

The network adopts a bidirectional LSTM structure. The forward and backward two-layer LSTMs act as forward and backward encoders, whose inputs are the preceding and following context of the word respectively. When a sentence is fed into the trained network, each word obtains three embedding representations: the two layer-wise representations from the bidirectional LSTM and the token embedding. The two LSTM layers encode the syntactic and semantic information of the word respectively. For a downstream task, the embedding representations of the words are extracted from the network and added to the task as new features.

Because ELMo adjusts the embedding dynamically according to context, it can not only find sentences with the same semantics, but also ensure that the corresponding words in those sentences have the same part of speech. However, ELMo's feature extraction with LSTMs is weaker than that of the later Transformer, and its bidirectional language model merges features by simple concatenation, so the fused features perform relatively poorly.

Figure 4. Schematic diagram of the ELMo model

BERT [5], the culmination of dynamic word embedding representation learning, is a model that fuses features bidirectionally.

BERT proposes two pre-training tasks: MLM (Masked Language Model) and NSP (Next Sentence Prediction). The former is word-level: 15% of the tokens are randomly masked and the model must predict them, which trains a deep bidirectional embedding representation. The latter is a sentence-level, binary classification task: two sentences are concatenated as the model input and the model predicts whether the second sentence follows the first, which teaches the model relationships between sentences and captures sentence-level representations. The word embeddings obtained by BERT therefore incorporate more syntactic, lexical, and semantic information, and because they change dynamically, the same word has different embeddings in different contexts.
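A rough sketch of the MLM masking rule helps make it concrete (simplified: real BERT replaces 80% of the selected tokens with [MASK], 10% with a random token, and leaves 10% unchanged, which is omitted here):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly mask ~15% of tokens and return (masked sequence, prediction targets)."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)        # the model must predict the original token here
        else:
            masked.append(tok)
            targets.append(None)       # no loss on unmasked positions
    return masked, targets

print(mask_tokens(["腾讯", "科技", "深圳", "有限", "公司"]))
```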

However, BERT requires large-scale data; without a sufficient corpus it is hard to reach the expected effect, and its computation is heavy and costly.

Figure 5. Schematic diagram of BERT model structure

Mainstream word vector models have mostly been built for Western languages, whose internal units are Latin letters. Chinese is entirely different: it has homophones and easily confused characters, and the components, radicals, and pronunciation inside Chinese characters also carry strong semantic information. How to effectively use the information inside Chinese characters to train word vectors has therefore become a hot research topic in recent years [6,7,8].

A typical representative is the cw2vec model based on Chinese strokes, proposed by Ant Financial in 2018 [6]. The paper divides Chinese strokes into 5 categories and, following an idea similar to fastText [9], represents each word as multiple stroke n-gram sequences obtained by sliding an n-gram window. Each stroke n-gram and each word is represented as a vector, and the vectors are trained by computing their similarity. To simplify the computation, the paper also uses negative sampling, and achieves good results in experiments.

Figure 6. Schematic diagram of the cw2vec model
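A small sketch of the stroke n-gram idea (the stroke-category encoding of "大人" below is written out by hand for illustration, assuming the 5-way categorization 1=横, 2=竖, 3=撇, 4=捺/点, 5=折 used in the paper; cw2vec derives it from a full stroke table):

```python
def stroke_ngrams(stroke_seq, n_min=3, n_max=5):
    """Slide n-gram windows over a word's stroke-category sequence (cw2vec-style)."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(stroke_seq) - n + 1):
            grams.append(stroke_seq[i:i + n])
    return grams

# "大人": strokes of 大 = 横, 撇, 捺 -> "134"; strokes of 人 = 撇, 捺 -> "34"
print(stroke_ngrams("13434"))
# ['134', '343', '434', '1343', '3434', '13434']
```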

III. Existing problems and solutions

From the work above, it can be seen that mainstream word embedding methods learn embeddings mainly from the perspective of contextual semantics in a text corpus, and there are few studies from other perspectives, such as Chinese morphology. With word vectors trained by these methods, even words that share the same pronunciation or have a small edit distance in Chinese can be far apart in the embedding space.

For example, taking the publicly released Tencent AI Lab Chinese word embeddings covering millions of words as an example, this word vector model captures semantic similarity between Chinese words well, but performs poorly in similarity measurement for sub-words and homophones, as shown in the figure below.

Figure 7. Example of word vector similarity calculation

Without putting too much weight on semantic similarity, this article proposes to train and learn vector representations of text from the perspective of graph computation, in order to solve the morphological similarity matching problem in Chinese. The basic principle of the algorithm is as follows.

Construct an undirected weighted graph over the common Chinese characters and the thesaurus of the business scenario: each word and each Chinese character is a node in the graph, and sub-word and pinyin nodes are added as well. Edges are established between the "word - sub-word - single character - pinyin" nodes in turn (as shown in Figure 8), and the weights of the edges between nodes are assigned according to the pinyin and compositional edit distance between words (this can be set flexibly according to business needs, or a separate weighting model can be trained).

In particular, this article focuses on establishing edges between the nodes of homophones, words differing in flat/retroflex tongue pronunciation, and sub-word sequences, to ensure that homophones and easily confused words are reachable from each other in the graph. Meanwhile, the introduction of sub-words preserves the word-order characteristics of the text to some extent. Then the node2vec or metapath2vec skip-gram model is used to learn a vector representation of each node, which serves as the distributed representation of the text.

Figure 8. Example of the graph construction
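A minimal sketch of this graph construction (using networkx and pypinyin as stand-ins for the production PySpark pipeline described in the experiments section; the 2-gram sub-words and the uniform edge weights are placeholders, not the business weighting rule):

```python
import networkx as nx
from pypinyin import lazy_pinyin   # pip install pypinyin

def build_graph(words):
    """Build an undirected weighted graph of word -> sub-word -> character -> pinyin nodes."""
    g = nx.Graph()
    for w in words:
        subwords = [w[i:i + 2] for i in range(len(w) - 1)]    # 2-gram sub-words
        for sw in subwords:
            g.add_edge(("word", w), ("subword", sw), weight=1.0)
            for ch in sw:
                g.add_edge(("subword", sw), ("char", ch), weight=1.0)
                py = lazy_pinyin(ch)[0]
                # Homophones share the same pinyin node, so they become reachable.
                g.add_edge(("char", ch), ("pinyin", py), weight=0.5)   # placeholder weight
    return g

g = build_graph(["腾讯科技", "腾某科技"])
print(g.number_of_nodes(), g.number_of_edges())
```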

As for the choice of walk strategy, the mainstream methods in the industry include DeepWalk [10], node2vec [11], LINE [12], and metapath2vec [13]. Since the graph constructed here is weighted, DeepWalk is not suitable; and since node associations in the graph need to consider second-degree or higher-order connections, LINE is not suitable either. Therefore, this article focuses on comparing the embedding quality of node2vec and metapath2vec.

For out-of-vocabulary words, a method similar to fastText can be used: average the character vectors of the unknown word; if a character is itself unknown, the vector of its pinyin node can be used instead. In this way, high-quality word embeddings can be obtained, so that morphologically similar words are also close in the embedding space. At the same time, the addition of pinyin and sub-word nodes greatly alleviates the OOV (Out Of Vocabulary) problem.
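A sketch of this fallback logic (the embedding lookup table `emb` and its tuple keys are assumptions for illustration, matching the toy construction sketch above rather than the exact production interfaces):

```python
import numpy as np
from pypinyin import lazy_pinyin

def word_vector(word, emb):
    """Look up a word vector; for OOV words, average character vectors,
    falling back to pinyin-node vectors for characters that are also OOV."""
    if ("word", word) in emb:
        return emb[("word", word)]
    parts = []
    for ch in word:
        if ("char", ch) in emb:
            parts.append(emb[("char", ch)])
        else:
            py = lazy_pinyin(ch)[0]
            if ("pinyin", py) in emb:          # last-resort fallback
                parts.append(emb[("pinyin", py)])
    return np.mean(parts, axis=0) if parts else None
```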

In short, the main steps of the algorithm are: graph construction -> random walk sampling -> word embedding training.

The above is the complete introduction to Tencent's Chinese word representation learning algorithm, AlphaEmbedding. AlphaEmbedding focuses on distributed word representation learning for Chinese scenarios from the perspective of graph computation, so that the learned vectors of morphologically similar Chinese words are close in the vector space.

IV. Experimental results

For the experimental data, the full set of domestically registered company names (about 220 million) was used. The company names were segmented (an in-house BERT-CRF sequence labeling model, developed for business needs, was available for this), and the graph was constructed after cleaning, filtering, and removing the name field. An example of the input data is shown below:

Figure 9. Schematic diagram of name segmentation

The graph construction was partly implemented in PySpark; the resulting graph has tens of millions of nodes and tens of millions of edges. Random walks and word embedding training were carried out on a Tencent Cloud EMR cluster using WXG's open-source Plato graph computing framework. The random walks, run on 20 cluster nodes, took about 1 minute, and node embedding learning took about 2 hours. The constructed weighted graph is shown below:

Figure 10. An example of an undirected weighted graph constructed
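A rough PySpark sketch of the kind of edge-list construction involved (the toy input, column names, and uniform weight are hypothetical; the production pipeline and the input format expected by Plato are not reproduced here):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("alpha-embedding-graph").getOrCreate()

# Hypothetical input: one segmented company name per row, tokens in an array column.
names = spark.createDataFrame(
    [(["腾讯", "科技", "深圳", "有限公司"],)], ["tokens"]
)

words = names.select(F.explode("tokens").alias("word")).distinct()

# word -> character edges; sub-word and pinyin edges are built analogously.
edges = (words
         .select("word", F.explode(F.split("word", "")).alias("char"))
         .where(F.col("char") != "")
         .withColumn("weight", F.lit(1.0)))

edges.show(truncate=False)
```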

Regarding the specific graph construction, this article compares two implementations. In the initial one, based on permutations and combinations of the full name, only homophone words and sub-words were considered when establishing edges between nodes; each Chinese-word node had at least two characters, and each pinyin node was the pinyin of the corresponding Chinese word node. We call this method combination style; its construction is illustrated in the figure below.

Figure 11. Example of combination style construction

node2vec was used for training. Taking the similarity ranking of words commonly used in the business as an example, the experimental results are as follows (the query word "Tencent" is marked in red in the figure):

Figure 12. Combination Style effect

In the business application of combination style there were many OOV cases, so the more concise "word - sub-word - character - pinyin" construction was then adopted (as shown in Figure 8), which we call fasttext style. In addition, both node2vec and metapath2vec were used for node embedding training. Again taking the similarity ranking of words commonly used in the business as an example, the experimental results are as follows:

Figure 13. Fasttext Style + node2vec depth-first search effect

Figure 14. Fasttext Style + node2vec breadth-first search effect

Figure 15. Fasttext Style + Metapath2vec effect

The parameter settings of the three walk configurations are as follows; apart from the walk sampling settings, the input data and embedding learning parameters are identical (a training sketch follows the list):

  • node2vec depth-first: p=100, q=0.2, step=5, epoch=30

  • node2vec breadth-first: p=100, q=5, step=5, epoch=30

  • metapath2vec: meta="3-2-1-0", step=5, epoch=30

  • word2vec: learning_rate=[(linear, 0.02, 0.0001)], window_size=5, epoch=20
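For readers without access to the Plato framework, a roughly equivalent configuration can be sketched with the open-source node2vec Python package built on gensim (parameter names follow that package; this is an illustration of the settings above, not the production code):

```python
from node2vec import Node2Vec   # pip install node2vec
# 'g' is a weighted networkx graph, e.g. the one from the build_graph sketch above.

n2v = Node2Vec(
    g,
    dimensions=128,     # embedding size (not specified in the article; placeholder)
    walk_length=5,      # step=5
    num_walks=30,       # epoch=30 walk rounds per node
    p=100, q=0.2,       # depth-first-leaning setting; use q=5 for the breadth-first run
    weight_key="weight",
)
model = n2v.fit(window=5, min_count=1, epochs=20)   # gensim Word2Vec under the hood

# The node2vec package stringifies node IDs, so look up str(node).
print(model.wv.most_similar(str(("word", "腾讯科技")), topn=5))
```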

Comparing the results, it can be seen that the combination style graph, which only considers permutations and combinations of sub-words, is the smallest (about ten million nodes). The words nearest to the query are homophones, but the ranking among the homophones does not take morphological similarity into account; the pinyin node acts as a cut-off point in the similarity ranking, after which the similarity of sub-word nodes drops sharply. The overall effect is relatively poor.

The vocabulary obtained by metapath2vec is larger than that of combination style but smaller than that of node2vec (more than 20 million nodes). Similar words skew toward sub-word nodes, and the distinction between neighbouring words is poor.

This result may be related to the walk strategy. On the one hand, the meta-path defined for metapath2vec is relatively fixed: if walks only follow the "word - sub-word - character - pinyin" pattern, then the cases that actually appear in the business graph where a word connects directly to characters without any sub-word (for which the corresponding meta-path should be "word - character - pinyin") are lost during sampling, causing an imbalance in node sampling. On the other hand, because a sub-word node always co-occurs with the same word nodes during the walks, the similarity of sub-word nodes is high across the board. Owing to the poor discrimination of the metapath2vec results, the resulting word embeddings do not perform well in downstream tasks.

node2vec depth-first search and breadth-first search produce similar results: they yield vocabularies of the same size (nearly 30 million nodes) and return similar rankings for specific query words.

In comparison, depth-first search can match sub-words with a larger edit distance, pushing the model's fault-tolerance boundary outward, while breadth-first search yields a smoother distinction between neighbouring words, which is friendlier to downstream tasks. For sub-word pairs with the same edit distance, the relative ordering returned by the two search strategies is basically the same, because n-gram word-order information is taken into account.

Overall, comparing the two graph constructions and walk strategies above, the node2vec model achieves better word vectors, and the breadth-first search results better match the business requirements.

V. Summary

This article reviewed the current mainstream methods for learning distributed text representations in NLP. Aiming at the similarity matching problem of homophones, easily confused words, and similar text in Chinese search scenarios, it proposed a word vector training method from the perspective of graph computation, so that words that are morphologically similar in Chinese also have a close distance in the word vector space.

By comparing the combination style and fasttext style graph constructions, and the node2vec depth-first, node2vec breadth-first, and metapath2vec walk sampling methods, the word embedding effects in business applications were evaluated. This article explores the application of graph computation to text representation learning and provides practical help for improving business results.

This work has now been applied to the search functionality of Tencent Cloud's enterprise portrait product. In the future, we will make more attempts and explorations in related directions: for example, adding stroke-based modeling of word construction to improve similarity matching for miswritten characters, and using graph neural networks such as GCN, GraphSAGE, and GAT to improve the quality of the word embeddings. Colleagues in the industry are welcome to offer more ideas and criticisms.

References:

[1] Bengio Y, Ducharme R, Vincent P, et al. A neural probabilistic language model[J]. Journal of Machine Learning Research, 2003, 3: 1137-1155.

[2] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality[C]//Advances in neural information processing systems. 2013: 3111-3119.

[3] Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space[J]. arXiv preprint arXiv:1301.3781, 2013.

[4] Peters M E, Neumann M, Iyyer M, et al. Deep contextualized word representations[J]. arXiv preprint arXiv:1802.05365, 2018.

[5] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.

[6] Cao S, Lu W, Zhou J, et al. cw2vec: Learning chinese word embeddings with stroke n-gram information[C]//Thirty-second AAAI conference on artificial intelligence. 2018.

[7] Yin R, Wang Q, Li P, et al. Multi-granularity chinese word embedding[C]//Proceedings of the 2016 conference on empirical methods in natural language processing. 2016: 981-986.

[8] Xu J, Liu J, Zhang L, et al. Improve chinese word embeddings by exploiting internal structure[C]//Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016: 1041-1050.

[9] Joulin A, Grave E, Bojanowski P, et al. Bag of tricks for efficient text classification[J]. arXiv preprint arXiv:1607.01759, 2016.

[10] Perozzi B, Al-Rfou R, Skiena S. Deepwalk: Online learning of social representations[C]//Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014: 701-710.

[11] Grover A, Leskovec J. node2vec: Scalable feature learning for networks[C]//Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016: 855-864.

[12] Tang J, Qu M, Wang M, et al. Line: Large-scale information network embedding[C]//Proceedings of the 24th international conference on world wide web. 2015: 1067-1077.

[13] Dong Y, Chawla N V, Swami A. metapath2vec: Scalable representation learning for heterogeneous networks[C]//Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 2017: 135-144.

Tencent Technology: to learn more about cloud computing, visit the Cloud+ Community: cloud.tencent.com/developer