Han Wei: data mining engineer at Philosophical Data, responsible for the mining and application of text data. Mainly works on building and implementing label extraction and text classification systems, with a strong interest in deep learning and NLP data mining.

0 Overview

In natural language processing, the key to handling massive collections of text is to extract the issues users care about most. Whether a text is long or short, its theme can often be captured by a few keywords. At the same time, both text-based recommendation and text-based search rely heavily on keywords, and the accuracy of keyword extraction directly affects the final quality of a recommendation or search system. Keyword extraction is therefore a very important part of text mining.

Keyword extraction methods fall into three categories: supervised, semi-supervised and unsupervised.

Supervised keyword extraction treats the task as a binary classification problem: deciding whether a word or phrase in a document is a keyword. Since it is a classification problem, a labeled training corpus must be provided; the corpus is used to train a keyword extraction model, which is then applied to documents that need keywords extracted.

Semi-supervised keyword extraction needs only a small amount of training data, which is used to build an initial keyword extraction model; the model is then used to extract keywords from new texts. These keywords are filtered manually, and the filtered keywords are added to the training set to retrain the model.

Unsupervised methods need no manually annotated corpus; they use various heuristics to find the important words in a text and return them as keywords.

Supervised keyword extraction requires a high labeling cost, so existing systems mainly adopt unsupervised extraction, which is more broadly applicable. The extraction process is as follows:

Figure 1. Flow chart of unsupervised text keyword extraction

Unsupervised keyword extraction algorithms can be divided into three categories: keyword extraction based on statistical features, keyword extraction based on word graph model and keyword extraction based on topic model.

1. Keyword extraction based on statistical features

The idea of statistical-feature-based keyword extraction is to use the statistical information of words in a document to extract keywords from it. The text is usually preprocessed to obtain a set of candidate words, from which keywords are then selected by feature quantization. The crux of this approach is the choice of feature quantization metrics. Three kinds are commonly used:

1) Feature quantization based on word weight

Word-weight-based feature quantization mainly covers part of speech, word frequency, inverse document frequency, relative word frequency and word length.

2) Feature quantization based on word position in the document

This kind of feature quantization rests on the assumption that sentences at different positions in an article have different importance to the document. Typically the first N words, the last N words, paragraph openings and endings, the title and the introduction are representative, and words in these positions can express the topic of the whole text.

3) Feature quantization based on word association information

Word association information refers to the degree of association between a word and the document, including mutual information, HITS value, contribution, dependency, TF-IDF value, etc.

Below we introduce several commonly used feature quantization metrics.

1.1 Part of speech

Part of speech is obtained through word segmentation and grammatical analysis. Most existing keywords are nouns or gerunds, and in general nouns express the main idea of a text better than other parts of speech. As a feature quantization metric, however, part of speech is usually combined with other metrics rather than used alone.

1.2 Word frequency

Word frequency indicates how often a word appears in the text. It is generally believed that the more frequently a word occurs in the text, the more likely it is to be a core word. Word frequency simply counts occurrences, so keywords obtained from frequency alone carry large uncertainty, and for long texts this method introduces a lot of noise.
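As a minimal illustration, frequency-based ranking can be sketched in a few lines (the tokenized input and the stopword list here are assumptions for the example, not part of any particular system):

```python
from collections import Counter

def top_k_by_frequency(tokens, k=3, stopwords=()):
    """Rank candidate keywords purely by how often they occur."""
    counts = Counter(t for t in tokens if t not in stopwords)
    return [word for word, _ in counts.most_common(k)]

tokens = ("keyword extraction ranks candidate words ; "
          "frequent words are assumed to be keyword candidates").split()
print(top_k_by_frequency(tokens, k=2, stopwords={";", "are", "to", "be"}))
# → ['keyword', 'words']
```

Note that ties in `most_common` keep first-seen order, so the result is deterministic here; real systems would normalize case and filter by part of speech first.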

1.3 Location Information

In general, the position of a word carries useful information about it. For example, the title and the abstract are the author's own summary of the article's central idea, so words that appear there are representative and more likely to be keywords. However, every author has different habits and writing styles, and key sentences appear in different positions, so position is a rather coarse signal for obtaining keywords and is generally not used alone.

1.4 Mutual information

Mutual information is a concept from information theory that measures the interdependence between variables. It is not limited to real-valued random variables; more generally, it measures how far the joint distribution p(X,Y) departs from the product p(X)p(Y) of the marginal distributions. Mutual information is calculated as:

I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}

where p(x,y) is the joint probability distribution of X and Y, and p(x) and p(y) are the marginal probability distributions of X and Y respectively.

When mutual information is used as the feature quantization for keyword extraction, a PAT tree is built from the body and title of the text, and the mutual information of candidate strings is then computed on it.
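The formula itself is easy to compute once a joint distribution has been estimated; a self-contained sketch (the toy distributions stand in for real corpus counts):

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) * log(p(x,y) / (p(x) p(y))).

    `joint` maps (x, y) -> p(x, y); marginals are derived from it.
    """
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Independent variables -> zero mutual information.
independent = {(x, y): 0.25 for x in "ab" for y in "cd"}
print(round(mutual_information(independent), 6))  # → 0.0
# Perfectly dependent variables -> I(X;Y) = log 2 (in nats).
dependent = {("a", "c"): 0.5, ("b", "d"): 0.5}
print(round(mutual_information(dependent), 6))    # → 0.693147
```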

1.5 Word span

Word span is the distance between the first and last occurrence of a word or phrase in a text. The larger the span, the more important the word is to the text and the better it reflects the text's theme. Word span is calculated as:

\mathrm{span}_i = \frac{last_i - first_i + 1}{sum}

where last_i is the position of the last occurrence of word i in the text, first_i is the position of its first occurrence, and sum is the total number of words in the text.

Word span is useful for keyword extraction because, in practice, texts always contain a lot of noise (words that are not keywords), and using word span helps reduce that noise.
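A minimal sketch of word span over a pre-tokenized text, using 0-based positions and the (last − first + 1) / total convention described above:

```python
def word_span(tokens, word):
    """span = (last - first + 1) / total number of tokens."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    if not positions:
        return 0.0
    return (positions[-1] - positions[0] + 1) / len(tokens)

tokens = "data mining extracts keywords from data so data matters".split()
print(word_span(tokens, "data"))  # 8/9: "data" spans almost the whole text
```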

1.6 TF-IDF value

The TF of a word is its frequency of occurrence in the document. Suppose a word w appears m times in a text whose total word count is n; then

\mathrm{tf}_w = \frac{m}{n}

The IDF of a word is obtained from the corpus and reflects how common the word is across the whole corpus. Suppose the corpus contains N texts, of which M contain the word w; then

\mathrm{idf}_w = \log \frac{N}{M + 1}

(the +1 in the denominator avoids division by zero for words that never appear in the corpus).

Thus, the TF-IDF value of the word w is:

\mathrm{tf\text{-}idf}_w = \mathrm{tf}_w \times \mathrm{idf}_w

TF-IDF has the advantage of being simple to implement and relatively easy to understand. However, it also has obvious shortcomings for keyword extraction. It relies heavily on the corpus, so a high-quality corpus consistent with the texts being processed must be selected for training. Moreover, IDF is itself a weighting that tries to suppress noise and therefore favors low-frequency words, which limits the accuracy of TF-IDF. Another drawback is that TF-IDF cannot reflect positional information: positions such as the title or the first and last sentences of a text carry important information, and words there should receive higher weight.
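A compact sketch of TF-IDF scoring over tokenized documents (the toy corpus is an assumption; the idf uses a log(N / (M + 1)) smoothing so words absent from the corpus do not divide by zero):

```python
import math

def tf_idf(term, doc, corpus):
    """tf = m / n over the document; idf = log(N / (M + 1)) over the corpus."""
    tf = doc.count(term) / len(doc)
    containing = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (containing + 1))
    return tf * idf

corpus = [
    "neural networks learn representations".split(),
    "keyword extraction ranks keyword candidates".split(),
    "the corpus has many documents".split(),
]
doc = corpus[1]
# "keyword" is frequent in this document and rare corpus-wide,
# so it gets the highest score.
scores = {w: tf_idf(w, doc, corpus) for w in set(doc)}
print(max(scores, key=scores.get))  # → keyword
```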

A statistical-feature-based extraction algorithm ranks candidate words by the feature quantization metrics above and takes the TopK words as keywords.

The key point of statistical-feature-based keyword extraction lies in computing the feature quantization metrics: different metrics yield different results, and each has its own strengths and weaknesses. In practice, several metrics are usually combined to obtain the TopK words as keywords.

2. Keyword extraction based on the word graph model

Word-graph-based keyword extraction first builds a language network graph of the document, then analyzes this graph to find the words or phrases that play important roles in it; these are the document's keywords. The nodes of a language network graph are essentially words. Depending on how words are linked, language networks take four main forms: co-occurrence networks, syntactic networks, semantic networks, and other networks.

When building the language network graph, the preprocessed words serve as nodes and the relationships between words serve as edges. Edge weights are generally expressed by the degree of correlation between words. To obtain keywords from the graph, the importance of each node is evaluated, the nodes are ranked by importance, and the words represented by the TopK nodes are selected as keywords. Node importance can be computed in the following ways.

2.1 Comprehensive feature method

The comprehensive feature method is also called the centrality analysis method of social networks. Its core idea is that a node's importance equals its saliency, under the premise that the integrity of the network is not destroyed. It quantitatively analyzes the topological properties of the network structure from both local and global perspectives. The commonly used quantitative measures are as follows.

2.1.1 Degree

The degree of a node is the number of nodes directly connected to it, and represents the node's local influence. For an unweighted network, the degree of node v_i is:

D_i = \sum_{j} a_{ij}

where a_{ij} = 1 if nodes v_i and v_j are connected and 0 otherwise.

For a weighted network, the degree of a node is also called its strength, computed as:

S_i = \sum_{j} w_{ij}

where w_{ij} is the weight of the edge between v_i and v_j.
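Both variants are straightforward to compute on an adjacency-map representation (the toy co-occurrence graph and its weights below are assumptions for the example):

```python
def degree(adj, node):
    """Unweighted degree: number of neighbors directly connected to `node`."""
    return len(adj.get(node, {}))

def strength(adj, node):
    """Weighted degree (node strength): sum of weights of incident edges."""
    return sum(adj.get(node, {}).values())

# Tiny co-occurrence graph: edge weight = number of co-occurrences.
adj = {
    "data":    {"mining": 3, "text": 2},
    "mining":  {"data": 3},
    "text":    {"data": 2, "keyword": 1},
    "keyword": {"text": 1},
}
print(degree(adj, "data"), strength(adj, "data"))  # → 2 5
```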

2.1.2 Closeness

The closeness of a node is the reciprocal of the sum of the shortest-path distances from the node to all other nodes, and represents how easily information spreads from it. It is computed as:

C_i = \frac{1}{\sum_{j} d_{ij}}

where d_{ij} is the shortest-path distance between nodes v_i and v_j.
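For an unweighted graph, closeness can be computed with a breadth-first search; a sketch on a tiny path graph (the graph is an assumption for the example):

```python
from collections import deque

def closeness(adj, node):
    """Closeness = 1 / (sum of shortest-path lengths to every other node)."""
    dist = {node: 0}
    queue = deque([node])
    while queue:                         # BFS from `node`
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(d for n, d in dist.items() if n != node)
    return 1.0 / total if total else 0.0

# Path graph a - b - c: the middle node is closest to everyone else.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(closeness(adj, "b"), closeness(adj, "a"))  # b: 0.5, a: 1/3
```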

2.1.3 Eigenvector centrality

The idea of eigenvector centrality is that a node's centrality value is determined by the nodes connected to it: a node's centrality index should equal a linear superposition of its neighbors' centrality indices, representing the indirect influence gained through highly valued neighbors. It is computed as:

x_i = \frac{1}{\lambda} \sum_{j} a_{ij} x_j

where \lambda is a constant (an eigenvalue of the adjacency matrix) and a_{ij} is the adjacency between nodes v_i and v_j.

2.1.4 Clustering coefficient

The clustering coefficient of a node is the ratio of the number of edges actually present among its neighbors to the number of all possible edges among them; it describes the degree of clustering among the vertices of the graph. It is computed as:

C_i = \frac{2 E_i}{k_i (k_i - 1)}

where k_i is the number of neighbors of node v_i and E_i is the number of edges among those neighbors.
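A direct sketch of this formula, counting the edges among a node's neighbors (the small graph is an assumption for the example):

```python
def clustering_coefficient(adj, node):
    """Ratio of edges among a node's neighbors to all possible such edges."""
    neighbors = list(adj.get(node, ()))
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if neighbors[j] in adj[neighbors[i]])
    return 2.0 * links / (k * (k - 1))

# "a" has neighbors b, c, d; only the b-c edge exists among them: C = 1/3.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(clustering_coefficient(adj, "a"))  # → 0.3333333333333333
```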

2.1.5 Average shortest path

The average shortest path of a node is also called compactness centrality. It is the average of the node's shortest-path distances to all other nodes, and represents how much the node depends on others when transmitting information. The closer a node is to the other nodes, the less it depends on others to disseminate information; if a node is only a short distance from every other node in the network, it is not constrained by other nodes. It is computed as:

L_i = \frac{1}{n-1} \sum_{j \neq i} d_{ij}

where n is the number of nodes in the network and d_{ij} is the shortest-path distance between v_i and v_j.

Because each measure emphasizes something different, the quantitative method chosen varies with the practical problem. For keyword extraction, a word collocation network can also be built by combining the word weights obtained with the statistical methods of the previous section, such as part of speech, and keywords can then be obtained with the methods above.

2.2 System science method

The system science approach to centrality analysis holds that a node's importance equals the damage done to the whole language network graph when the node is deleted: deleting an important node changes the connectivity of the network. If deleting a node changes certain specified features of the graph, the node's importance can be measured by the magnitude of that change, and nodes can be filtered accordingly.

2.3 Random walk

The random walk algorithm is a well-known graph algorithm: starting from a given node of a given graph, it randomly moves to one of the current node's neighbors, then repeats the process from the new node.

A famous application of random walks is the PageRank algorithm, the core algorithm behind Google search, which computes the importance of web pages from the hyperlinks between them; its key idea is the propagation of importance. In keyword extraction, the TextRank algorithm proposed by Mihalcea et al. borrows this idea for text.

PageRank treats the entire Internet as a directed graph in which web pages are nodes and the links between pages are edges. Following the idea of importance propagation, if a large website A has a hyperlink to page B, the importance of page B rises in proportion to the importance of page A. The idea is illustrated below.

Figure 2 Simple description of PageRank (from PageRank paper)

In PageRank, the crucial problem is computing the initial importance (PR value) of each page, since the importance of page A in the figure above cannot be known in advance. The original paper gives an iterative method: it points out that the power method for computing the eigenvector of a matrix is independent of the initial value. Each page can therefore be assigned a random initial PR value, and iteration converges to a value that is independent of that initialization.

PageRank computes the PR value of page i as follows:

PR(V_i) = (1-d) + d \sum_{V_j \in In(V_i)} \frac{PR(V_j)}{|Out(V_j)|}

where d is the damping factor, usually 0.85; In(V_i) is the set of pages that link to page i; Out(V_j) is the set of pages that page j links to; and |Out(V_j)| is the number of elements in that set.
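A small power-iteration sketch of this update rule (the three-page link structure is an assumption for the example):

```python
def pagerank(links, d=0.85, iterations=50):
    """PR(i) = (1 - d) + d * sum over pages j linking to i of PR(j) / |Out(j)|.

    `links` maps each page to the list of pages it links to.
    Initial PR values are arbitrary; the iteration converges regardless.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # arbitrary initial PR values
    for _ in range(iterations):
        nxt = {}
        for p in pages:
            incoming = (pr[q] / len(links[q]) for q in pages if p in links[q])
            nxt[p] = (1 - d) + d * sum(incoming)
        pr = nxt
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(links)
# "C" receives links from both A and B, so it ends up with the highest PR.
print(max(pr, key=pr.get))  # → C
```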

When constructing the graph, TextRank changes the nodes from web pages to sentences (or candidate words, for keyword extraction) and introduces weights on the edges between nodes, where a weight indicates the similarity of two sentences. Its formula is:

WS(V_i) = (1-d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)

where w_{ji} is the weight of the edge between nodes V_j and V_i; the other symbols are the same as in the PageRank formula.

TextRank can be used not only for keyword extraction but also for extractive summarization, with good results. However, its computational complexity is relatively high, so it is not as widely used.
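A rough keyword-oriented sketch of the TextRank idea: build a co-occurrence graph over a sliding window, then iterate the weighted ranking update. The window size, edge weights and iteration count here are arbitrary illustrative choices, not the paper's settings:

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, d=0.85, iterations=50, k=2):
    """Co-occurrence graph within a window + weighted PageRank-style update."""
    weight = defaultdict(float)  # symmetric edge weights w_ij
    for i, u in enumerate(tokens):
        for v in tokens[i + 1:i + 1 + window]:
            if u != v:
                weight[(u, v)] += 1.0
                weight[(v, u)] += 1.0
    nodes = set(tokens)
    out_sum = {u: sum(w for (a, _), w in weight.items() if a == u)
               for u in nodes}
    ws = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        ws = {u: (1 - d) + d * sum(weight[(v, u)] / out_sum[v] * ws[v]
                                   for v in nodes if weight[(v, u)] > 0)
              for u in nodes}
    return sorted(nodes, key=ws.get, reverse=True)[:k]

tokens = "graph ranking scores graph nodes by graph structure".split()
print(textrank_keywords(tokens))  # "graph" ranks first
```

Real implementations would first filter candidates by part of speech and merge adjacent top-ranked words into phrases.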

3. Keyword extraction based on topic model

Topic-based keyword extraction mainly uses the topic distributions of a topic model to extract keywords. The algorithm proceeds as follows:

  1. Obtain candidate keywords from the article: segment the text into words, optionally filtering candidates by part of speech.
  2. Learn a topic model from a large-scale corpus.
  3. Compute the topic distributions of the document and of the candidate keywords according to the learned topic model.
  4. Compute the topic similarity between the document and each candidate keyword, rank the candidates, and take the top N words as keywords.
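The steps above can be sketched end to end. In this toy version, two hand-made topic-word distributions and a document-topic distribution stand in for the model that step 2 would learn from a large corpus (e.g. with LDA); candidates are then ranked by cosine similarity between their topic distribution and the document's:

```python
import math

def topic_distribution(word, topics):
    """p(z|w) proportional to p(w|z), assuming uniform topic priors here."""
    scores = [t.get(word, 0.0) for t in topics]
    total = sum(scores)
    return [s / total for s in scores] if total else [0.0] * len(topics)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Hand-made stand-ins for a learned model (steps 2 and 3 of the algorithm).
topics = [
    {"network": 0.5, "graph": 0.4, "word": 0.1},      # topic 0
    {"probability": 0.5, "topic": 0.3, "word": 0.2},  # topic 1
]
doc_topics = [0.9, 0.1]  # the document is mostly about topic 0

candidates = ["graph", "probability", "word"]
ranked = sorted(candidates,
                key=lambda w: cosine(topic_distribution(w, topics), doc_topics),
                reverse=True)
print(ranked[0])  # → graph
```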

The key of this algorithm is the construction of the topic model. A topic model is a document-generation model. For an article, the idea is to first pick several topics, then choose words to describe each topic, then compose those words into sentences and paragraphs according to grammar rules, finally producing the article. A topic model formalizes this idea: it treats a document as a mixture of topics, and each topic as a probability distribution over words. pLSA was the first model built on this idea. Conversely, if we recover the topics of a document, the words of those topics represent the document's core meaning, i.e., its keywords.

In the pLSA model, every word in a document is generated by first selecting a topic with some probability, and then selecting a word from that topic with some probability. The probability of a word is computed as:

p(w_i \mid d_j) = \sum_{k=1}^{K} p(w_i \mid z_k)\, p(z_k \mid d_j)

Researchers of the Bayesian school later improved pLSA. They argued that the document-topic probabilities and the topic-word probabilities are not fixed but themselves obey certain probability distributions, which led to the topic model most commonly used today: the LDA topic model.

LDA was proposed by D. M. Blei in 2003. It adopts the bag-of-words model to simplify the problem. In LDA, each document is a probability distribution over several topics, and each topic is a probability distribution over many words. Moreover, neither the document-topic distribution nor the topic-word distribution is fixed: both obey Dirichlet prior distributions.

The document generation process can be represented by the following graphical model:

where α and β are the hyperparameters of the prior distributions, φ_k is the word distribution of the k-th topic, θ is the topic distribution of the document, w is a word of the document, and z is the topic of w.

Figure 3. Blei's graphical model from the LDA paper

LDA mines the deep semantics of a text, namely its topics. Representing a text by its topics also reduces the dimension of the text vector to some extent; many people classify texts this way and have achieved good results. For details, refer to the LDA algorithm.

The LDA-based keyword extraction algorithm uses the implicit semantic information of documents to extract keywords, but the keywords extracted by a topic model tend to be rather broad and may not reflect the document's theme well. In addition, the LDA model has high time complexity and requires substantial training time.

4 Applications

At present, text keyword extraction is widely used in text-based search, recommendation, and data mining. In practice, however, application environments are complex, and for different kinds of text, such as long versus short texts, the same extraction method does not achieve the same effect. Different algorithms therefore suit different conditions, and no single algorithm works well in every environment.

Compared with using the algorithms above individually, combinations of algorithms are widely used in engineering to make up for the deficiencies of any single one, for example combining TF-IDF with TextRank, or integrating TF-IDF with part of speech to obtain keywords. In practice, results also depend heavily on text preprocessing and on the accuracy of word segmentation: textual errors and deformed words need to be resolved in the preprocessing stage, and the choice of segmentation algorithm and the recognition of unknown and ambiguous words all significantly affect keyword extraction.

Keyword extraction looks like a simple task but is quite difficult in real applications. Engineering optimizations are made on top of the existing algorithms; Philosophical Data has put great effort into this and achieved good results.

5 Conclusion

This paper introduced three commonly used families of unsupervised keyword extraction algorithms along with their advantages and disadvantages. Keyword extraction has very broad applications in text mining, and the existing methods still have problems. We will continue to study keyword extraction, and we welcome active exchange.

References

[1] TextRank algorithm for extracting keywords and abstracts, ….me/2017/04/08/…
[2] Page L, Brin S, Motwani R, et al. The PageRank Citation Ranking: Bringing Order to the Web [R]. Stanford InfoLab, 1999.
[3] Research on Keyword Extraction Method Based on Document Topic Structure [D]. Beijing: Tsinghua University, 2011.
[4] TF-IDF, zh.wikipedia.org/zh-hans/Tf-…
[5] LDA topic model in machine learning, zhuanlan.51cto.com/art/201712/…
[6] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet Allocation [J]. Journal of Machine Learning Research, 2003, 3(Jan).
[7] Zhao Jingsheng, Zhu Qiaoming, Zhou Guodong, et al. A Review of Automatic Keyword Extraction [J]. Journal of Software, 2017, 28(9): 2431-2449.
