This series grew out of my own research into big data: as I study and develop, I organize the related material along the way. Online resources are often fragmentary, and you may need to read several articles side by side to get the full picture. My hope is that interested readers can pick up the relevant knowledge more easily and quickly, so I will try to introduce the concepts and algorithms simply enough that people without an engineering background can follow along.

PS: Work keeps me busy, so updates will be posted to my blog from time to time; you can bookmark www.cybermkd.com.

A simple explanation of the algorithm

The content-based recommendation algorithm is a very common recommendation-engine algorithm.

This algorithm is typically used to compute a user’s preferences from the user’s behavioral history, such as ratings, shares, and likes. These behaviors are combined into a preference profile, and the items whose computed similarity to that profile is highest are then recommended. In book recommendation, for example, books that share attributes (such as author, category, and tags) with books the user has already read or rated can be recommended to that user.

Content-based recommendation can be done in two ways. The first is the personalized recommendation just described, driven by the user’s own behavior. It relies heavily on user data, however, which makes it a poor fit for the cold-start case where no user data exists yet; it generally suits catalogs with a small number of items and users with specialized interests.

The second method is based on the relatedness of the items themselves. It makes recommendations by comparing items on the attributes they have in common. For example, if user A likes Dota2, and Dota2 is a competitive online game, then user A is likely to also enjoy League of Legends.

The advantage of this approach is that it does not depend on the user’s behavior at all. It does require the item content to be accurate, complete, and unambiguous, but this can be addressed by entering tags manually.
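To make this concrete, here is a minimal sketch of attribute-based item similarity over hand-entered tags. The items, the tags, and the choice of the Jaccard measure are all illustrative assumptions, not part of any particular library:

```python
# Minimal sketch: attribute-based item similarity over hand-entered tags.
# Items and tags here are hypothetical examples.

def jaccard_similarity(tags_a: set, tags_b: set) -> float:
    """Ratio of shared tags to all tags across two items."""
    if not tags_a and not tags_b:
        return 0.0
    return len(tags_a & tags_b) / len(tags_a | tags_b)

items = {
    "Dota2":             {"competitive", "online", "MOBA", "fantasy"},
    "League of Legends": {"competitive", "online", "MOBA"},
    "Solitaire":         {"single-player", "cards"},
}

liked = "Dota2"
scores = {
    name: jaccard_similarity(items[liked], tags)
    for name, tags in items.items()
    if name != liked
}

# Recommend the most similar items first.
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```

With these made-up tags, League of Legends scores 0.75 while Solitaire scores 0.0, matching the intuition in the Dota2 example above.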

Relevant algorithms

1. Keyword-based vector space model

Keywords are usually extracted with TF-IDF, a common statistics-based weighting method that evaluates how important a word is to a paragraph or an article.

TF-IDF distinguishes documents best through words that appear frequently in a given document but rarely in the other documents of the collection. If term frequency (TF) alone is used as the measure in the feature-space coordinate system, it can already reflect the characteristics of similar texts.

In addition, to account for a word’s ability to separate categories, the TF-IDF method assumes that the rarer a word is across the text collection, the better it distinguishes between categories of text. Hence the inverse document frequency (IDF) is introduced, and the product of TF and IDF is used as the measure in the feature-space coordinate system, adjusting the TF weight accordingly. The purpose of the adjustment is to highlight important words and suppress unimportant ones.

There are a number of different mathematical formulas that can be used to calculate TF-IDF.

If a word or phrase appears with a high frequency (TF) in one article yet rarely in other articles, it is considered to discriminate well between categories and is therefore suitable for classification. TF-IDF is literally TF × IDF, where TF stands for term frequency and IDF for inverse document frequency. TF measures how frequently a term appears in document D. The idea behind IDF is that the fewer documents contain the term t (i.e. the smaller n is), the larger IDF becomes, indicating that t distinguishes categories well; IDF is typically taken as the logarithm of the total number of documents divided by n. Note, however, that if m documents of some category C contain the term t, and k documents of the other categories also contain it, then the total number of documents containing t is n = m + k. When m is large, n is large as well, so the IDF formula yields a small value and suggests that t discriminates poorly, even though t may in fact be characteristic of category C.

Term frequency (TF) is the number of times a word occurs divided by the total number of words in the document. If a document contains 100 words in total and the word “cow” appears 3 times, the term frequency of “cow” in that document is 0.03 (3/100). One way to compute document frequency (DF) is to count how many documents contain the word “cow” and divide by the total number of documents in the collection. So if “cow” appears in 1,000 documents out of 10,000,000, its document frequency is 0.0001 (1,000/10,000,000). The TF-IDF score is then obtained by dividing the term frequency by the document frequency: in this example, the TF-IDF score of “cow” in this collection is 300 (0.03/0.0001). Another form of the formula applies a logarithm to the inverse document frequency.
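As a sanity check, here is that calculation in Python, reproducing the “cow” numbers above; the function names are my own:

```python
import math

def tf(term_count: int, total_terms: int) -> float:
    """Term frequency: occurrences of the word over total words in the document."""
    return term_count / total_terms

def df(docs_with_term: int, total_docs: int) -> float:
    """Document frequency: documents containing the word over all documents."""
    return docs_with_term / total_docs

tf_cow = tf(3, 100)             # 0.03
df_cow = df(1_000, 10_000_000)  # 0.0001

print(tf_cow / df_cow)          # 300.0 -- TF divided by DF, as in the example

# The logarithmic variant mentioned at the end of the paragraph:
idf_cow = math.log(10_000_000 / 1_000)  # log of the inverse document frequency
print(tf_cow * idf_cow)                 # ~0.276
```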

The vector space model turns text into numbers: features are selected and each is given a weight by a formula such as TF-IDF, so that every document becomes a vector. Similarities can then be computed directly on these vectors.

We can turn a user’s preferences into such a vector, do the same for each item, and then compute the cosine similarity between the item vectors and the user-preference vector.
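A minimal sketch of that comparison, assuming a shared three-word vocabulary and made-up TF-IDF weights:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical vocabulary: ["fantasy", "strategy", "sports"]
user_profile = [0.8, 0.5, 0.0]  # aggregated from items the user liked
book_a       = [0.9, 0.3, 0.0]
book_b       = [0.0, 0.1, 0.9]

print(cosine_similarity(user_profile, book_a))  # close to 1 -> recommend
print(cosine_similarity(user_profile, book_b))  # close to 0 -> skip
```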

Details on cosine similarity and TF-IDF can be found in the next article.

2. Rocchio algorithm

The Rocchio algorithm is an efficient classification algorithm, widely used in text classification, query expansion, and other fields. It works by constructing a prototype vector for each category.

Rocchio’s algorithm is probably the first and most intuitive solution to text categorization that comes to mind. The basic idea is to average, dimension by dimension, all the sample documents of a category: for the “sports” class, for example, average the number of occurrences of “basketball” across all sports documents, then average “referee”, and so on for each word. The result is a new vector called the centroid, the single vector most representative of that category. The next time a new document needs to be judged, compare how similar the new document is to each centroid (in other words, measure the distance between them) to decide which class the document belongs to.
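A minimal sketch of the idea, assuming hypothetical keyword weights and using cosine similarity as the closeness measure:

```python
import math

def centroid(vectors):
    """Average the training vectors of a category, dimension by dimension."""
    n = len(vectors)
    return [sum(dims) / n for dims in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical vocabulary: ["basketball", "referee", "election"]
training = {
    "sports":   [[3.0, 1.0, 0.0], [2.0, 2.0, 0.0]],
    "politics": [[0.0, 0.0, 4.0], [0.0, 1.0, 3.0]],
}
centroids = {label: centroid(vecs) for label, vecs in training.items()}

# Classify a new document by the most similar centroid.
new_doc = [1.5, 1.0, 0.0]
best = max(centroids, key=lambda label: cosine(new_doc, centroids[label]))
print(best)  # "sports"
```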

With the two algorithms above, we can judge how similar pieces of content are to one another and make recommendations accordingly.