We’re helping The New York Times develop a content-based recommendation system, which you can think of as a simple worked example of building a recommender. Based on the user’s recent article-browsing data, we will recommend new articles for them to read. To do this, we only need to recommend content similar to the text of the article the user has read.

Examining the data

The following is an excerpt from the first NYT article in the dataset, after text processing.

“TOKYO — state-backed Japan Bank for International Cooperation [jbc.ul] will lend about 4 billion yen ($39 million) to Russia’s Sberbank, which is subject to Western sanctions, in the hope of advancing talks on a territorial dispute, the Nikkei business daily said on Saturday, […]”

The first problem to solve is how to vectorize this content and whether to design new features such as part-of-speech tags, n-grams, sentiment scores, or named entities.

Obviously, these NLP avenues are worth exploring further, and you could spend a great deal of time experimenting with existing approaches. But good science starts by testing the simplest workable solution, so that subsequent iterations have a baseline to improve on.

In this article, we implement that simple, workable solution first.

To preprocess the data, we pull the relevant columns out of the DataFrame, shuffle them, and split them into training and test sets.

import numpy as np
from sklearn.utils import shuffle

# df is assumed to be a pandas DataFrame of NYT articles loaded earlier

# move articles to an array
articles = df.body.values

# move article section names to an array
sections = df.section_name.values

# move article web_urls to an array
web_url = df.web_url.values

# shuffle these three arrays in unison
articles, sections, web_url = shuffle(articles, sections, web_url, random_state=4)

# split the shuffled articles into two arrays
n = 10

# one will have all but the last 10 articles -- think of this as your training set/corpus 
X_train = articles[:-n]
X_train_urls = web_url[:-n]
X_train_sections = sections[:-n]

# the other will have those last 10 articles -- think of this as your test set/corpus 
X_test = articles[-n:]
X_test_urls = web_url[-n:]
X_test_sections = sections[-n:]

For text vectorization, we can choose among several schemes, such as bag-of-words (BoW), TF-IDF, and Word2Vec.

One of the reasons we chose TF-IDF is that, unlike BoW, it weighs a word’s importance not only by its frequency within a document (term frequency) but also by its inverse document frequency across the corpus.

For example, a word like “Obama” that appears several times in an article but in only a few articles across the corpus should be given a higher weight, whereas words like “a” and “the” that appear everywhere convey little information.

That is because “Obama” is neither a stop word nor an everyday word, which means it is highly relevant to the topic of the article.
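As a concrete sketch of this step (assuming scikit-learn; the variable names vectorizer and X_train_tfidf are ours, and they feed the recommendation function below):

from sklearn.feature_extraction.text import TfidfVectorizer

# learn the vocabulary and IDF weights from the training corpus only
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)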

There are several options for the similarity measure, such as Jaccard similarity and cosine similarity.

Jaccard similarity works by comparing two sets and measuring their overlap. Since we have chosen TF-IDF as the vectorization scheme, which produces real-valued vectors rather than sets, Jaccard similarity is not a meaningful option here. Had we chosen BoW vectorization, Jaccard might have come into play.
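For reference, a minimal sketch of Jaccard similarity over raw token sets (not used in the system we build here):

def jaccard_similarity(doc_a, doc_b):
    # |intersection| / |union| of the two documents' token sets
    tokens_a = set(doc_a.lower().split())
    tokens_b = set(doc_b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)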

Therefore, we use cosine similarity as our similarity measure.

If terms such as “Obama” or “White House” carry high weights in both article A and article B, the dot product of the two vectors will yield a higher similarity than if the same terms carried low weights in article B.
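A quick way to check this, assuming the X_train_tfidf matrix from the sketch above (TfidfVectorizer L2-normalizes its rows by default, so cosine similarity reduces to a plain dot product):

from sklearn.metrics.pairwise import cosine_similarity

# cosine similarity between the first two training articles (an arbitrary pair)
pair_score = cosine_similarity(X_train_tfidf[0], X_train_tfidf[1])
print(pair_score)  # a 1x1 array holding a value between 0 and 1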

Based on the similarity between the article the user has read and all the other articles in the corpus (that is, the training data), we can now create a function that outputs the top N most similar articles and start making recommendations to the user.

def get_top_n_rec_articles(X_train_tfidf, X_train, test_article, X_train_sections, X_train_urls, n=5):
    '''Calculate similarity scores between a document and a corpus.

    INPUT:  vectorized document corpus, 2D array
            text document corpus, 1D array
            vectorized user article, 1D array
            article section names, 1D array
            article URLs, 1D array
            number of articles to recommend, int
    OUTPUT: top n recommendations, 1D array
            top n corresponding section names, 1D array
            top n corresponding URLs, 1D array
            similarity scores between user article and entire corpus, 1D array
    '''
    # calculate similarity between the corpus (i.e. the training data) and the user's article
    similarity_scores = X_train_tfidf.dot(test_article.toarray().T).ravel()
    # get similarity score indices, sorted from most to least similar
    sorted_indices = np.argsort(similarity_scores)[::-1]
    # get sorted similarity scores
    sorted_sim_scores = similarity_scores[sorted_indices]
    # get top n most similar documents
    top_n_recs = X_train[sorted_indices[:n]]
    # get top n corresponding document section names
    rec_sections = X_train_sections[sorted_indices[:n]]
    # get top n corresponding urls
    rec_urls = X_train_urls[sorted_indices[:n]]

    # return recommendations and corresponding article meta-data
    return top_n_recs, rec_sections, rec_urls, sorted_sim_scores

The function proceeds in the following steps (a usage example follows the list):

1. Calculate the similarity between the user’s article and the corpus;

2. Sort the similarity scores from high to low;

3. Get the top N most similar articles;

4. Get the section names and URLs of those top N articles;

5. Return the top N articles, section names, URLs, and similarity scores.
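Putting it together, a call might look like this (k is a hypothetical index into the test set, and vectorizer is the fitted TfidfVectorizer from the sketch above):

# vectorize one held-out article with the fitted vectorizer
k = 0
test_article = vectorizer.transform([X_test[k]])

# get the top 5 recommendations and their metadata
top_n_recs, rec_sections, rec_urls, sorted_sim_scores = get_top_n_rec_articles(
    X_train_tfidf, X_train, test_article, X_train_sections, X_train_urls, n=5)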

Validating the results

Now we can test the system by recommending articles based on what the user is currently reading.

Next, let’s compare the user’s article and its section name with the recommended articles and their section names.

First, look at the similarity scores.

# similarity scores
sorted_sim_scores[:5]
# OUTPUT:
# 0.566
# 0.498
# 0.479
# ...

Cosine similarity ranges from 0 to 1, so these scores are not especially high. How could we improve them? We could choose a different vectorization scheme, such as Doc2Vec, or switch to a different similarity measure.
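For instance, a minimal Doc2Vec sketch with gensim 4.x (the hyperparameters here are illustrative placeholders, not tuned values) could replace the TF-IDF step:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tag each training article with its index so vectors can be looked up later
tagged = [TaggedDocument(words=doc.lower().split(), tags=[i])
          for i, doc in enumerate(X_train)]

# train a small Doc2Vec model
model = Doc2Vec(tagged, vector_size=100, min_count=2, epochs=20)

# embed the user's article and list the most similar training articles
user_vec = model.infer_vector(X_test[k].lower().split())
print(model.dv.most_similar([user_vec], topn=5))

Nonetheless, let’s take a look at the results we get with TF-IDF.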

# user's article's section name
X_test_sections[k]
# OUTPUT:
'U.S'

# corresponding section names for top n recs 
rec_sections
# OUTPUT:
'World'
'U.S'
'World'
'World'
'U.S.'

As the results show, the recommended section names line up well with the user’s article.

# user's article
X_test[k]
# OUTPUT:
'LOS ANGELES — The White House says President Barack Obama has told The Defense Department that it must ensure service members instructed to repay enlistment bonuses are being treated fairly and expeditiously.\nWhite House spokesman Josh Earnest says the president only recently become aware of Pentagon demands that some soldiers repay their enlistment bonuses after audits revealed overpayments by the California National Guard. If soldiers refuse, they could face interest charges, wage garnishments and tax liens.\nEarnest says he did not believe the president was prepared to support a blanket waiver of those repayments, but he said "we're not going to nickel and dime" service members when they get back from serving the country. He says they should not be held responsible for fraud perpetrated by others.'

All five recommendations are related to the article the user is currently reading: the recommendation system behaves as expected.

This ad-hoc check, comparing the recommended articles and their section names against the user’s article, shows that the recommender works as required.

Manual validation works fine for a prototype, but ultimately we want a fully automated system that can be wrapped in a model and validate itself.
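As a small step in that direction, here is a sketch of one crude automated check, using section-name agreement as a rough proxy for relevance (this is our own illustrative metric, not a full self-validating pipeline):

def section_match_rate(rec_sections, user_section):
    # fraction of recommended articles whose section matches the user's article
    matches = sum(1 for s in rec_sections if s == user_section)
    return matches / len(rec_sections)

# a value closer to 1 means more recommendations share the user's section
print(section_match_rate(rec_sections, X_test_sections[k]))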

How to fit this recommender into such a model is beyond the scope of this article, which aims to show how to design a prototype recommendation system based on a real data set.

About the author: Alexander Barriga is a data scientist.