
1. What is TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency.

TF-IDF is a statistical method used to assess how important a word is to a document within a collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but is offset by how frequently it appears across the corpus.

The bottom line: the more often a word appears in an article, and the less often it appears in the document collection as a whole, the better it represents that article.

For example, suppose we have a long article entitled “Crayfish farming in China” and want a computer to extract its keywords. An easy first idea is to find the word that appears most often: if a word is important, it should appear many times in the article. So we start by computing “Term Frequency” (abbreviated TF).

It turns out that the most frequent words are “of” and “is”: common function words that tell us nothing about the article. They are called “stop words”, words that contribute nothing to the result and must be filtered out.

Suppose we filter out all these stop words and consider only the ones that have real meaning. Another problem is that we might find the words “China”, “crayfish” and “farming” used equally often. Does that mean they are equally important as keywords?

Obviously not. Because “China” is a very common word, “crayfish” and “farming” are relatively less common. If these three words appear equally often in an article, it would be reasonable to assume that “crayfish” and “farming” are more important than “China”. That is to say, in the order of keywords, “crayfish” and “farming” should rank above “China”

Therefore, we need an importance adjustment factor to measure whether a word is common or not. If a word is rare, but it appears many times in the article, it probably reflects the characteristics of the article and is the key word we need

In statistical terms, each word is assigned an “importance” weight based on how common it is. The most common words (“of”, “is”) get the smallest weight, fairly common words (“China”) get a small weight, and relatively rare words (“crayfish”) get a large weight. This weight is called the “Inverse Document Frequency” (IDF), and it is larger the rarer the word is.

Once we know the term frequency (TF) and the inverse document frequency (IDF), multiplying the two gives the word's TF-IDF value. The more important a word is to a text, the larger its TF-IDF value.

Here are the details of the algorithm

Term frequency originally refers to the raw number of times a given word appears in a text. However, since articles differ in length, the count is normalized so that different articles can be compared.


$$\text{TF}_{i,j}=\frac{n_{i,j}}{\sum_k n_{k,j}}$$

where:

  • $n_{i,j}$: the number of times word $t_i$ appears in document $d_j$
  • $\sum_k n_{k,j}$: the total number of occurrences of all words in document $d_j$
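As a quick sketch, the normalized term frequency above can be computed from a tokenized document with a counter (the tokens below are made up for illustration):

```python
from collections import Counter

def term_frequency(document):
    """Normalized TF: n_{i,j} / sum_k n_{k,j} for each word in the document."""
    counts = Counter(document)        # n_{i,j}: raw count of each word
    total = sum(counts.values())      # sum_k n_{k,j}: total number of tokens
    return {word: count / total for word, count in counts.items()}

doc = ["china", "crayfish", "farming", "crayfish", "in", "china", "crayfish"]
print(term_frequency(doc)["crayfish"])  # 3 occurrences out of 7 tokens
```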

The main idea of inverse document frequency is this: the fewer documents contain a word $t_i$, the larger its IDF, and the better the word discriminates between documents. The IDF of a word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient.


$$\text{IDF}_i=\log\frac{|D|}{|\{j:t_i \in d_j\}| + 1}$$

where:

  • $|D|$: the total number of documents in the corpus
  • $|\{j : t_i \in d_j\}|$: the number of documents containing word $t_i$; the $+1$ avoids a zero denominator when no document contains the word
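Under the same notation, a minimal IDF sketch (log base 10 is an arbitrary choice here; the +1 smoothing comes from the denominator of the formula above):

```python
import math

def inverse_document_frequency(term, documents):
    """IDF_i = log(|D| / (|{j : t_i in d_j}| + 1)), using log base 10."""
    containing = sum(1 for doc in documents if term in doc)  # |{j : t_i in d_j}|
    return math.log10(len(documents) / (containing + 1))

docs = [["china", "crayfish"], ["china", "farming"], ["china", "panda"]]
print(inverse_document_frequency("crayfish", docs))  # log10(3 / (1 + 1)) ≈ 0.176
```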

A high term frequency within a particular document, multiplied by a low document frequency across the corpus, yields a high weight: $\text{TF-IDF}_{i,j}=\text{TF}_{i,j}\times \text{IDF}_i$. TF-IDF therefore tends to filter out common words and keep the important ones.

Take “Crayfish farming in China” as an example. Suppose the text has 1000 words, and “China”, “crayfish” and “farming” each appear 20 times; the term frequency (TF) of each of the three words is then 0.02. Now assume there are 25 billion documents in total, of which 6.23 billion contain the word “China”, 48.4 million contain the word “crayfish”, and 97.3 million contain the word “farming”. Their inverse document frequencies (IDF, using log base 10) and TF-IDF values are then:

| Word | Documents containing it (billions) | IDF | TF-IDF |
| --- | --- | --- | --- |
| China | 6.23 | 0.603 | 0.0121 |
| crayfish | 0.0484 | 2.713 | 0.0543 |
| farming | 0.0973 | 2.410 | 0.0482 |

As the table shows, “crayfish” has the highest TF-IDF value and “China” the lowest, with “farming” in between. So if we had to choose a single keyword, “crayfish” would be the keyword of this article.
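The figures in the table can be checked with a few lines of arithmetic, assuming log base 10 and document counts of 6.23 billion, 48.4 million and 97.3 million (the values the IDF column implies); the +1 smoothing is omitted here since the counts are far from zero:

```python
import math

TOTAL_DOCS = 25e9          # 25 billion documents in total
TF = 0.02                  # 20 occurrences in a 1000-word article

containing = {             # number of documents containing each word
    "China": 6.23e9,
    "crayfish": 48.4e6,
    "farming": 97.3e6,
}

for word, count in containing.items():
    idf = math.log10(TOTAL_DOCS / count)
    print(f"{word}: IDF = {idf:.3f}, TF-IDF = {TF * idf:.4f}")
```

Running this reproduces the table: “crayfish” (TF-IDF ≈ 0.0543) ranks first, “farming” second, and “China” last.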