
This article was published in the Cloud + Community column by the Tencent QQ Big Data team.

With the growth and accumulation of social networks, the production, dissemination, and consumption of content have become deeply woven into people's lives, and content analysis has accordingly come into view. In recent years a variety of public trend analysis products have emerged, and companies are leveraging their own resources to grab a foothold.

A public trend analysis platform uses natural language processing and machine learning to analyze data and help users with public opinion analysis, competitor analysis, data-driven marketing, and brand image building. Hot spot discovery is an indispensable part of such analysis: by mining massive data (this article focuses on text data), it surfaces the content that matters to the relevant audience.

In our business scenario, quickly and accurately surfacing real-time topics from large volumes of short social posts helps colleagues in product, operations, public relations, and other roles better engage users. However, generating grammatically correct and well-formed topics directly from massive text is not easy. This article introduces a relatively simple and efficient method for topic generation.

What is a topic

At present, topic collection on many content platforms is supported by product strategy or operations colleagues, for example by letting users define topics themselves and mark them with special symbols, such as "#White Valentine's Day#". In some text scenarios these supports are unavailable, and we need to extract hot topics, or hot events, directly from massive user-generated social text. The goal of this article is to automatically discover hot events or hot topics from large volumes of short social posts.

Much related work treats topic extraction as a topic analysis problem, applying topic models (LDA, etc.), clustering, and similar methods. That line of work, however, outputs a handful of keywords or related words for each topic rather than directly generating topic phrases. Event extraction and text summarization could also be applied to hot topic extraction in such scenarios, but they usually require labeled training data. This article introduces a simple and practical method for extracting hot topics.

The specific approach

This article proposes a method that builds hot topics from hot words. The overall flow of the method: first extract hot words, then perform topic extraction on top of them. The two stages are detailed below.

Hot word extraction

The main idea is to combine the word frequency gradient with a smoothing method.

A word's popularity is influenced by many factors:

  • Overall traffic effects: the total volume of social messages fluctuates widely between daytime and early morning, weekends and weekdays, holidays and ordinary days.
  • Inter-word influence: a passage in the corpus may suddenly become very popular, causing words that are otherwise unrelated to each other to turn into hot words all at once.
  • Periodic effects: 24-hour, weekly, monthly, and seasonal cycles often turn words with weak event significance, such as "good morning", "Monday", and "March", into hot words.
  • Self-driven trends: this is the heat signal we care about most. The sudden, incremental rises of related words caused by real events are what our algorithm aims to identify and analyze.

In view of the above factors, we extract hot words from the following aspects.

1. Preprocessing: mainly text deduplication, advertisement detection, and other data-cleaning steps to strip noisy data.

2. Gradient: mainly measures the increment in word frequency.

3. Bayesian average: a method of estimating the mean of a population using outside information, especially a pre-existing belief.

Typical applications of the Bayesian average include user-voted rankings, product rating rankings, smoothing of ad click-through rates, and so on.

Take user-voted ranking as an example. When very few users have voted on an item, the raw average score is unlikely to be objective. We therefore introduce outside information: assume an additional C voters, each giving the average score M. Adding these virtual ratings to the existing users' ratings and averaging yields a corrected score that is more objective. For an item with n real votes x_1, ..., x_n, the Bayesian average is

\bar{x} = \frac{C \cdot M + \sum_{i=1}^{n} x_i}{C + n}

It is easy to see that when turnout is low, the score tends toward the prior average; the larger the number of voters, the closer the Bayesian average gets to the arithmetic mean of the real votes, and the less the added parameters influence the final ranking.
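As a minimal sketch of this behavior (the numbers, and the choice of C and M, are purely illustrative):

```python
def bayesian_average(ratings, C, M):
    """Blend observed ratings with C virtual votes at the prior mean M.
    With few ratings the result stays near M; with many ratings it
    approaches the plain arithmetic mean."""
    return (C * M + sum(ratings)) / (C + len(ratings))

# Hypothetical example: a prior of C=10 virtual voters at M=3.0 stars.
print(bayesian_average([5, 5, 5], C=10, M=3.0))    # ~3.46: pulled toward the prior
print(bayesian_average([5] * 1000, C=10, M=3.0))   # ~4.98: close to the raw mean
```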

4. Heat score computation: use the Bayesian average to correct the gradient score.

Here the average word frequency plays the role of C in the Bayesian average formula, and the average gradient score plays the role of M. That is, in hot word extraction we use the mean gradient score as the prior M and the mean word frequency as C.

Hot word extraction can then be understood as follows: every occurrence of a word is equivalent to one vote on that word's heat.

If a word's frequency is small, few "voters" have scored it, so the score is highly uncertain and needs to be corrected and smoothed toward the average. For example, if a word appeared 18 times today and 6 times yesterday, its gradient score is relatively high at 0.75, yet the word is most likely not a hot word.

Words whose frequency is much higher than the average are rated by more people, so their scores approach their true values and the influence of the prior average shrinks. This is reasonable: if a word that normally appears a million times triples in volume the next day, its heat has clearly increased.
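A minimal sketch of this heat score, assuming hypothetical counts and comparing today against yesterday at the same time period (M is the mean gradient score across the vocabulary and C the mean word frequency, as described above):

```python
from collections import Counter

def gradient_score(freq_today, freq_yesterday):
    """Today's share of the word's two-day volume, in [0, 1].
    0.5 means no change; above 0.5 means growth."""
    return freq_today / (freq_today + freq_yesterday)

def heat_scores(today, yesterday):
    """Bayesian-average-corrected heat scores: each occurrence of a word
    is one 'vote' at the word's gradient score, smoothed by C virtual
    votes at the prior mean M."""
    vocab = set(today) | set(yesterday)
    grads = {w: gradient_score(today[w], yesterday[w]) for w in vocab}
    C = sum(today.values()) / len(vocab)   # mean word frequency
    M = sum(grads.values()) / len(vocab)   # mean gradient score
    return {w: (C * M + today[w] * grads[w]) / (C + today[w]) for w in vocab}

# Hypothetical counts for the same hour today vs. yesterday.
today = Counter({"paris": 1800, "comeback": 900, "monday": 50, "rareword": 18})
yesterday = Counter({"paris": 600, "comeback": 250, "monday": 48, "rareword": 6})
for w, s in sorted(heat_scores(today, yesterday).items(), key=lambda kv: -kv[1]):
    print(w, round(s, 3))
# "rareword" has the same 0.75 gradient as "paris" but is pulled back
# toward the prior because its frequency is tiny.
```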

5. Differencing: this mainly addresses the periodic behavior of hot words. The approach is simple: the time interval used for comparison should span a significant period. For hourly hot words, for example, it is best to compare today against yesterday at the same hour.

6. Co-occurrence model: applies a further layer of screening based on the words that co-occur with hot words.

Using frequent itemset mining, word2vec, and similar methods, we find the relationships between co-occurring words. This co-occurrence information is then used to screen the hot words and keep the most valuable ones, avoiding redundant information.
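As a rough sketch of such screening (the document-overlap measure and the 0.5 threshold are our illustrative choices, not the article's exact recipe), co-occurrence statistics can collapse hot words that describe the same event:

```python
from itertools import combinations

def cooccurrence_screen(hot_words, docs, heat, threshold=0.5):
    """Keep only the hottest word of each group of co-occurring hot words.
    Two hot words are treated as describing the same event when the
    Jaccard overlap of the document sets they occur in exceeds threshold."""
    doc_sets = {w: {i for i, d in enumerate(docs) if w in d} for w in hot_words}
    dropped = set()
    # Visit pairs hotter-word-first so the hotter word absorbs duplicates.
    for a, b in combinations(sorted(hot_words, key=lambda w: -heat[w]), 2):
        if a in dropped or b in dropped:
            continue
        overlap = (len(doc_sets[a] & doc_sets[b])
                   / max(len(doc_sets[a] | doc_sets[b]), 1))
        if overlap > threshold:
            dropped.add(b)  # b rides along with the hotter word a
    return [w for w in hot_words if w not in dropped]

# Hypothetical tokenized posts:
docs = [["barca", "comeback", "paris"], ["barca", "comeback"], ["paris", "fashion"]]
print(cooccurrence_screen(["barca", "comeback", "paris"], docs,
                          heat={"barca": 0.8, "comeback": 0.75, "paris": 0.74}))
# -> ['barca', 'paris']: "comeback" is absorbed by "barca"
```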

7. Time series analysis: takes finer-grained historical factors into account.

By analyzing the time series of word frequencies, we can distinguish short-term, long-term, and periodic hot spots in more detail, trigger heat alerts for the most valuable hot words, analyze the growth trends of hot words, and so on.
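One simple way to tell periodic peaks from genuine bursts (an illustrative heuristic of ours, not the article's exact model) is to compare the latest count both with the recent history and with the same hour on previous days:

```python
import statistics

def classify_trend(hourly_counts, period=24, z_burst=3.0):
    """Classify the latest hour of a word's frequency series.
    hourly_counts: hourly frequencies, oldest first, covering several days.
    Thresholds are illustrative."""
    current, history = hourly_counts[-1], hourly_counts[:-1]
    mean = statistics.mean(history)
    std = statistics.pstdev(history) or 1.0
    # Same hour on previous days: one full period back, two periods back, ...
    same_hour = hourly_counts[-1 - period::-period]
    if (current - mean) / std > z_burst:
        # High relative to recent history; does the daily cycle explain it?
        if same_hour and current < 2 * statistics.mean(same_hour):
            return "periodic"   # e.g. "good morning" peaking every morning
        return "burst"          # sudden, event-driven spike
    return "stable"

print(classify_trend([5] * 71 + [60]))        # burst
print(classify_trend(([5] * 23 + [60]) * 3))  # periodic
```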

To sum up, we score word heat with a Bayesian-average-corrected word frequency gradient computed over cycle-aligned time intervals, and further screen the hot words using word co-occurrence information from the corpus. Time series analysis then yields the characteristics and growth trends of the hot words.

Topic extraction

The hot words are now extracted, but a single word's ability to express an event or topic is limited. We therefore start from the hot words and go on to extract topics.

Topic extraction is likewise done in two steps. The first step finds candidate topic phrases; the second uses the idea of Attention to pick, from the candidates, the phrase containing the most important words and outputs it as the topic.

Candidate phrase extraction

Candidate phrase extraction is based mainly on information entropy, using the following features (a combined sketch follows the list).

1. Internal cohesion — mutual information

We start from information entropy. The entropy of a random variable measures the expected amount of information carried by its outcomes:

H(X) = -\sum_x p(x) \log p(x)

The higher a variable's entropy, the more possible states it can take and the more uncertain it is, that is, the more information it carries.

Mutual information measures the strength of the relationship between two random variables. It is defined as:

I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}

Transforming the formula above gives:

I(X;Y) = H(Y) - H(Y|X)

H(Y) is the uncertainty of Y, and H(Y|X), the conditional entropy of Y given X, is the uncertainty that remains in Y once X is known. I(X;Y) therefore represents the reduction in the uncertainty of Y brought about by X. The larger the value, the more the appearance of X reduces the uncertainty of Y; that is, Y is likely to appear as well, meaning X and Y are closely related. And vice versa.

In practice, the internal cohesion of a phrase is the cohesion between the words inside it. For each phrase, we select the word combination that reduces uncertainty the most to represent the phrase's internal cohesion.

2. The richness of context — left and right information entropy

As just mentioned, entropy measures the amount of information. The greater a phrase's left and right entropy, the more varied the words that can appear on its left and right, and the richer its collocations; the more different contexts the phrase is discussed in, the more likely it is to independently describe an event or topic.

3. Commonness — this can be measured intuitively by the phrase's frequency of occurrence.
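A minimal sketch combining the three features over a toy tokenized corpus (how the features are weighted and thresholded is not specified in the article, so only the raw features are computed here):

```python
import math
from collections import Counter

def _entropy(counter):
    n = sum(counter.values())
    return -sum(c / n * math.log(c / n) for c in counter.values()) if n else 0.0

def phrase_features(phrase, docs):
    """Features of a candidate phrase (a tuple of >= 2 words whose adjacent
    pairs all occur in docs): internal cohesion, left entropy, right entropy,
    and frequency."""
    words = Counter(w for d in docs for w in d)
    pairs = Counter(p for d in docs for p in zip(d, d[1:]))
    n_words, n_pairs = sum(words.values()), sum(pairs.values())

    # Internal cohesion: per the article, the adjacent word pair whose
    # pointwise mutual information is strongest (reduces uncertainty most).
    def pmi(a, b):
        return math.log((pairs[(a, b)] / n_pairs)
                        / ((words[a] / n_words) * (words[b] / n_words)))
    cohesion = max(pmi(a, b) for a, b in zip(phrase, phrase[1:]))

    # Context richness: entropy of the words immediately to the left and
    # right of each occurrence of the phrase.
    left, right, freq = Counter(), Counter(), 0
    k = len(phrase)
    for d in docs:
        for i in range(len(d) - k + 1):
            if tuple(d[i:i + k]) == phrase:
                freq += 1
                if i > 0:
                    left[d[i - 1]] += 1
                if i + k < len(d):
                    right[d[i + k]] += 1
    return cohesion, _entropy(left), _entropy(right), freq
```

For example, `phrase_features(("barca", "comeback", "paris"), docs)` returns the four raw feature values, which can then be thresholded or combined into a candidate score.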

Fine screening of topics

For a given hot word, once a batch of candidate phrases has been selected, each phrase contains different words and carries a different amount of information. For example, for the word "Paris" on March 9 we obtained the candidate phrases "Paris fans", "Paris players", "eliminated Paris", "love Paris", "Barcelona comeback over Paris", "Paris, France", and "Paris Fashion Week". Among these, "Paris fans", "Paris players", "eliminated Paris", "love Paris", and "Paris, France" are weakly directed: "fans", "players", "eliminated", and "love" appear in many other contexts, and "Paris, France" conveys nothing but a location. In contrast, "Barcelona comeback over Paris" and "Paris Fashion Week" carry more specific information, the football match with its teams and result, or the fashion show with its location, that is, more concrete events. We therefore need to screen the candidate topic phrases.

The main idea behind the screening is the same as the Attention mechanism: the key is to find the important words. For example, paired with "Paris", the words "Barcelona", "comeback", and "fashion week" carry more information and meaning than "fans", "players", "love", and "France". We can expect that "Barcelona", "comeback", and "fashion week" rarely appear in unrelated corpora, while "fans", "players", "love", and "France" appear across many different corpora and point nowhere in particular. In our problem, therefore, Attention can be determined with TF-IDF-style thinking.

Specifically, we measure the specificity of each word in the phrase. We have reason to believe that words like "Barcelona", "comeback", and "fashion week" appear mostly within the part of the relevant corpus that also contains "Paris". The event or topic presentation score of a candidate phrase S for a hot word h can be calculated with the following formula:

\mathrm{Score}(S) = \frac{1}{N} \sum_{i=1}^{N} \frac{|\mathrm{Corpus}(w_i) \cap \mathrm{Corpus}(h)|}{|\mathrm{Corpus}(w_i)|}

where N is the number of words in the candidate phrase, w_i is the i-th word in the phrase, and Corpus(W) denotes the part of the relevant corpus containing the word W.

On the other hand, we also need to consider phrase frequency: the more often a phrase occurs, the more important the event.

In summary, we fine-screen the topics for each hot word by combining the candidate phrases' event or topic presentation scores with their occurrence frequencies.
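A minimal sketch of this final step, assuming the score formula reconstructed above and weighting it by phrase frequency with a simple product (an illustrative choice):

```python
def presentation_score(phrase, hot_word, docs):
    """Mean specificity of the phrase's words to the hot word: for each
    word, the fraction of documents containing it that also contain the
    hot word (the formula reconstructed above)."""
    def doc_set(w):
        return {i for i, d in enumerate(docs) if w in d}
    hot_docs = doc_set(hot_word)
    words = phrase.split()
    return sum(len(doc_set(w) & hot_docs) / max(len(doc_set(w)), 1)
               for w in words) / len(words)

def rank_topics(candidates, hot_word, docs):
    """Rank candidate phrases by presentation score times the number of
    documents containing the whole phrase."""
    def phrase_freq(p):
        return sum(all(w in d for w in p.split()) for d in docs)
    return sorted(candidates, reverse=True,
                  key=lambda p: presentation_score(p, hot_word, docs) * phrase_freq(p))

# Hypothetical tokenized posts:
docs = [["barcelona", "comeback", "paris"], ["paris", "fans"],
        ["fans", "concert"], ["barcelona", "comeback", "paris", "fans"]]
print(rank_topics(["barcelona comeback paris", "paris fans"], "paris", docs))
# -> ['barcelona comeback paris', 'paris fans']
```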


This article has been authorized by the author for publication by Tencent Cloud + Community. Original link: https://cloud.tencent.com/developer/article/1155587?fromSource=waitui
