Keyword extraction

An excerpt from the original article follows.

Among automatic summarization algorithms, TF-IDF is the most common and the easiest to implement, but its results are not as good as TextRank's.

TextRank is a weighting algorithm designed for the sentences in a text, inspired by Google’s PageRank algorithm, with automatic summarization as its goal. It works by voting: each word casts votes for its neighbors (the words inside a sliding window, in the jargon), and the strength of those votes depends on how many votes the word itself has received. This is a “chicken or egg” problem, which PageRank solves by iterating a matrix computation until it converges. TextRank is no exception:

PageRank calculation formula:
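
In the notation of the TextRank paper (the symbols are an assumption here, since the source does not define them), PageRank scores a vertex V_i as follows, where d is the damping factor (commonly 0.85), In(V_i) is the set of vertices that point to V_i, and Out(V_j) is the set of vertices that V_j points to:

```latex
S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} \, S(V_j)
```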

Standard TextRank formula

The standard TextRank formula adds edge weights to PageRank’s formula; each weight represents the similarity of two sentences:
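
Using the same assumed notation as the TextRank paper, with w_ji the weight of the edge from V_j to V_i, the weighted score WS can be written as:

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)
```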

But I only want to extract keywords. If each word is treated as a sentence, then no two "sentences" (words) overlap, so every edge has the same similarity weight (0 in the original reasoning: no intersection, no similarity). The weight w then cancels between the numerator and the denominator, and the algorithm degrades to PageRank. It is therefore not unreasonable to call this keyword extraction algorithm PageRank.
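
To see the cancellation, suppose (an assumption made here for illustration) that every edge carries the same nonzero weight c. With a literal weight of 0 the fraction would be 0/0, so c > 0 is assumed. The weighted term of TextRank then reduces to the PageRank term:

```latex
\frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}}
  = \frac{c}{c \, |Out(V_j)|}
  = \frac{1}{|Out(V_j)|}
```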

In addition, if you want to extract key sentences (automatic summarization), please refer to the companion article “Java Implementation of Automatic Summarization for TextRank Algorithm”.
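
The voting scheme for keywords can be sketched in a few lines of Python. This is a minimal illustration, not HanLP's implementation: the function name textrank_keywords, the window size of 5, the damping factor 0.85, and the convergence threshold are all assumptions, and the input is assumed to be an already tokenized list of words.

```python
from collections import defaultdict


def textrank_keywords(words, window=5, d=0.85, max_iter=200, tol=1e-4):
    """Rank words by iterated voting over a co-occurrence graph."""
    # Build an undirected graph: words that fall inside the same
    # sliding window are neighbors.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Every word starts with the same score.
    score = {word: 1.0 for word in neighbors}
    if not score:
        return []

    for _ in range(max_iter):
        new_score = {}
        for word in neighbors:
            # Each neighbor splits its current score evenly among its own neighbors.
            votes = sum(score[n] / len(neighbors[n]) for n in neighbors[word])
            new_score[word] = (1 - d) + d * votes
        converged = max(abs(new_score[w] - score[w]) for w in score) < tol
        score = new_score
        if converged:
            break

    # Highest-scoring words first.
    return sorted(score, key=score.get, reverse=True)
```

For example, textrank_keywords(tokens)[:5] would give the five top-ranked words of a tokenized text.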

pyhanlp usage is as follows

from pyhanlp import *

# Keyword extraction
content = (
    "Programmer (English: Programmer) is a professional engaged in the development and maintenance of programs. "
    "Programmers are generally divided into program designers and program coders, "
    "but the line between the two is not very clear, especially in China. "
    "Software practitioners are divided into junior programmers, senior programmers, "
    "system analysts and project managers.")
TextRankKeyword = JClass("com.hankcs.hanlp.summary.TextRankKeyword")
# Extract the 5 highest-ranked keywords
keyword_list = HanLP.extractKeyword(content, 5)
print(keyword_list)

# The text is too short for this; we will come back to it when it is needed
# newword_list = HanLP.extractWords(content, 5)
# print(newword_list)
[Programmer, person, program, divided into, development]

Automatic summarization

An excerpt from the original article

Automatic summarization is the automatic extraction of key sentences from an article. What is a key sentence? To a human, it is a sentence that sums up the central idea of the article. A machine can only approximate that judgment: it defines a weighted scoring criterion, scores every sentence, and returns the top few.

TextRank formula

TextRank’s scoring idea is still derived from PageRank’s iterative idea, as shown in the following formula:
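
Written out in the notation of the TextRank paper (the symbols are an assumption, since the source does not define them), with d the damping factor and every pair of distinct sentences taken to be connected, the update for sentence V_i is:

```latex
WS(V_i) = (1 - d) + d \sum_{j \neq i} \frac{w_{ji}}{\sum_{k \neq j} w_{jk}} \, WS(V_j)
```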

The left-hand side of the equation is the weight of a sentence (WS is short for weight_sum), and the sum on the right-hand side is the contribution of each adjacent sentence to that sentence. Unlike keyword extraction, all sentences are generally considered adjacent to one another, so no window is used.

The numerator wji is the similarity of the two sentences, the denominator is again a weight_sum (the total weight of the edges leaving sentence j), and WS(Vj) is the weight of sentence j from the previous iteration. The whole formula is an iterative process.
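
The iteration described above can be sketched as follows. This is an illustrative sketch rather than HanLP's code: the function name textrank_sentences, the damping factor 0.85, the stopping threshold, and the convention that sim[j][i] holds wji are all assumptions.

```python
def textrank_sentences(sim, d=0.85, max_iter=200, tol=1e-4):
    """Iterate sentence weights WS over a similarity matrix; sim[j][i] is w_ji."""
    n = len(sim)
    if n == 0:
        return []
    # weight_sum[j]: total weight of the edges leaving sentence j (the denominator).
    weight_sum = [sum(sim[j][k] for k in range(n) if k != j) for j in range(n)]
    ws = [1.0] * n
    for _ in range(max_iter):
        new_ws = []
        for i in range(n):
            total = 0.0
            for j in range(n):
                # Sentence j passes on its previous weight in proportion to w_ji.
                if j != i and weight_sum[j] > 0:
                    total += sim[j][i] / weight_sum[j] * ws[j]
            new_ws.append((1 - d) + d * total)
        delta = max(abs(a - b) for a, b in zip(new_ws, ws))
        ws = new_ws
        if delta < tol:
            break
    return ws
```

Sorting the sentence indices by the returned weights and keeping the first few gives the summary sentences.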

Calculation of similarity degree

BM25 is recommended for calculating wji

The BM25 algorithm is commonly used to score relevance in search. Its main idea in one sentence: parse the query into morphemes qi; then, for each search result D, compute the relevance score between each morpheme qi and D; finally, take a weighted sum of these scores to obtain the relevance score between the query and D.
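
As a rough sketch of that idea, the function below scores tokenized documents against a tokenized query with the common Okapi BM25 formula. The name bm25_scores, the parameters k1 = 1.5 and b = 0.75, and the smoothed IDF variant are assumptions chosen for illustration; they are not taken from the source or from HanLP. The list of documents is assumed to be non-empty.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenized document in docs against a tokenized query."""
    n = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for q in query:
            if q not in tf:
                continue  # a term absent from the document contributes nothing
            # Smoothed IDF ("+1" inside the log), which stays non-negative.
            idf = math.log((n - df[q] + 0.5) / (df[q] + 0.5) + 1)
            # Term frequency, normalized by document length relative to the average.
            norm = tf[q] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[q] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

For sentence ranking, the query would be one tokenized sentence and docs the tokenized sentences of the article, so each score can serve as an edge weight.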

BM25 algorithm PDF

Using automatic summarization in pyhanlp

from pyhanlp import *

# Automatic summarization
document = '''Chen Mingzhong, director of the Water Resources Department of the Ministry of Water Resources, said at a press conference held by the State Council Information Office on September 29 that some provinces had just completed the assessment of their water resources management system, and some had exceeded the red line. As for some areas that exceed the red line, Chen Mingzhong said that the approval of some water projects will be limited, and the verification of water resources and the approval of water extraction permits will be strictly carried out.'''

TextRankSentence = JClass("com.hankcs.hanlp.summary.TextRankSentence")

# extractSummary(document, n) returns the n highest-ranked sentences as a list
sentence_list = HanLP.extractSummary(document, 3)
print(sentence_list)

sentence_list = HanLP.extractSummary(document, 2)
print(sentence_list)

sentence_list = HanLP.extractSummary(document, 1)
print(sentence_list)

# getSummary(document, max_length) returns the summary as text, limited by max_length
sentence_list = HanLP.getSummary(document, 50)
print(sentence_list)

sentence_list = HanLP.getSummary(document, 30)
print(sentence_list)

sentence_list = HanLP.getSummary(document, 20)
print(sentence_list)
Strictly carry out the verification of water resources and the approval of water intake permits. Some provinces have exceeded the red line, Said Chen Mingzhong, director of the Department of Water Resources of the Ministry of Water Resources, at a press conference held by The State Council Information Office on September 29. Chen Mingzhong, director of the Department of water Resources at the Ministry of Water Resources, revealed at a press conference held by The State Council Information Office on September 29 that some provinces had exceeded the red line [to strictly conduct water resource verification and approval of water extraction permits]. Some provinces have exceeded the red line. Some provinces have exceeded the red line. Strictly conduct water resource verification and approval of water extraction permits. Some provinces have exceeded the red line.