In recommender systems research, attention has shifted toward user satisfaction as a way to deliver more value to users. Since a recommender must be genuinely "useful" to users, rather than merely pushing more of the same products, researchers also study how users interact with and consume content through the system. To that end, they now evaluate systems with a range of metrics instead of relying on predictive accuracy and machine learning techniques alone.

The performance of a recommender system should be measured by the value it generates for users. Many indicators exist for this purpose, such as coverage, novelty, diversity, and serendipity, and these assessment methods go by different names.

Some scholars refer to novelty, relevance, serendipity, and the like as "concepts", others call them "dimensions", and still others call them "measures of recommender system evaluation".

In this article, we use the term "concept" to refer to the different aspects along which a recommender system can be evaluated. We group the existing concepts into six categories: utility, novelty, diversity, unexpectedness, coverage, and serendipity. Some concepts are left unmentioned here, such as trust, risk, robustness, privacy, adaptability, and scalability; for convenience, those will be presented in separate articles.

Table 1 summarizes the notations used in all the evaluation metrics in this article.

Utility

The utility of a recommender system goes by many names, such as relevance, usefulness, recommendation value, and user satisfaction. According to the Recommender Systems Handbook, utility is the value a user gains from a recommendation: if the user likes the recommended items, the recommendations he or she receives are useful. Utility has also been defined in terms of the order in which users prefer to consume: if users consume only what they like best, recommending those items helps them find what they love more quickly, and the recommendations thereby achieve their usefulness.

As these definitions show, most of them tie utility to users' willingness to consume and to user satisfaction. Under such a definition, evaluating the utility of a recommender system should focus on how users respond to the predictions the system generates. One approach is to measure the ratings users give after consuming an item. This is attractive when the recommendations truly add value to the user, but it requires online evaluation. For offline evaluation, some scholars suggest using accuracy-based metrics instead.

In this article, we use the symbol util(R_u) to denote the utility of the recommendation list R_u; the metrics used to evaluate utility are introduced one by one below.

1. Error metrics

Error metrics are widely used to measure predictive accuracy. MAE (Mean Absolute Error) evaluates the average difference between the ratings predicted by the recommender system and the ratings actually given by the user.

Formula 1 shows the MAE metric.

In addition, Root Mean Squared Error (RMSE) is another error metric; by squaring the differences, it penalizes large rating-prediction errors more heavily, as shown in Formula 2.

Standard deviation measures the dispersion of a set of numbers, while root mean squared error measures the deviation of predicted values from observed ones. Their objects and purposes differ, but the calculations are similar, and both are computed over the prediction list.
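The two error metrics above can be sketched in a few lines of Python; the rating values here are made up for illustration, on an assumed 1-5 scale:

```python
import math

def mae(predicted, actual):
    """Mean Absolute Error between predicted and user-given ratings (Formula 1)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root Mean Squared Error; squaring penalizes large errors more (Formula 2)."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Hypothetical predicted vs. observed ratings
pred = [4.0, 3.5, 5.0, 2.0]
true = [4.0, 3.0, 4.0, 1.0]
print(mae(pred, true))   # 0.625
print(rmse(pred, true))  # 0.75
```

Note how RMSE exceeds MAE whenever the individual errors are unequal, reflecting its heavier penalty on large deviations.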

There are other error metrics as well, such as Average RMSE, Average MAE, and Mean Squared Error (MSE).

2. Precision and Recall

Precision counts the number of items in the recommendation list that the user consumed (or rated), as described in Formula 3: it measures the percentage of items in the recommendation list that the user liked and consumed.

Recall is the number of consumed items that appear in the recommendation list, divided by the total number of items the user consumed. Formula 4 gives the recall calculation.
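A minimal sketch of Formulas 3 and 4, treating the recommendation list and the user's consumed items as sets of hypothetical item IDs:

```python
def precision(recommended, consumed):
    """Fraction of the recommendation list the user actually consumed (Formula 3)."""
    hits = len(set(recommended) & set(consumed))
    return hits / len(recommended)

def recall(recommended, consumed):
    """Fraction of the user's consumed items that appeared in the list (Formula 4)."""
    hits = len(set(recommended) & set(consumed))
    return hits / len(consumed)

recommended = ["a", "b", "c", "d", "e"]   # 5 items shown to the user
consumed = ["b", "e", "f", "g"]           # 4 items the user consumed
print(precision(recommended, consumed))   # 2 hits / 5 recommended = 0.4
print(recall(recommended, consumed))      # 2 hits / 4 consumed = 0.5
```

The two metrics share a numerator (the hits) but divide by different totals, which is why lengthening the list tends to raise recall while lowering precision.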

3. The ROC curve

ROC stands for Receiver Operating Characteristic. As the name implies, the main analysis method is to draw this characteristic curve.

The ROC curve measures the percentage of items in the recommendation list that the user likes. Unlike error, precision, and recall metrics, the ROC calculation also accounts for items that are recommended but not liked by the user. The Area Under the ROC Curve (AUC) summarizes the curve in a single number and can be used to evaluate an algorithm across different scenarios.
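AUC can be computed without drawing the full curve, using its pairwise interpretation: the probability that a randomly chosen liked item is scored above a randomly chosen disliked one (ties count half). A sketch with made-up scores and like/dislike labels:

```python
def auc(scores, labels):
    """AUC via pairwise comparison: P(liked item scored above disliked item).

    scores -- predicted relevance scores for each item
    labels -- 1 if the user liked the item, 0 otherwise
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    # Count a win for each liked/disliked pair ranked correctly; ties score 0.5
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
print(auc(scores, labels))  # 5 correct pairs out of 6 -> 0.8333...
```

An AUC of 1.0 means every liked item outranks every disliked one; 0.5 is no better than random ordering.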

4. Ranking Score

Ranking metrics are useful when evaluating recommendation lists. Recommender systems usually predict ranked lists, but users are unlikely to browse through all the items. Ranking metrics are therefore interesting because they combine utility with position information: items at the top of the list matter more.

Formula 5 gives the R-Score metric, where r(i, j) is the rating of the item at position j of user i's list, d is the median (neutral) rating, and α is the half-life decay parameter.
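A minimal sketch of the R-Score (half-life utility), assuming the common form R_u = Σ_j max(r_j − d, 0) / 2^((j−1)/(α−1)), where j is the list position; verify the exact exponent against Formula 5 before relying on it:

```python
def r_score(ratings, d, alpha):
    """Half-life utility (R-Score) of one user's recommendation list.

    ratings -- the user's ratings of the list items, in recommended order
    d       -- the neutral ("don't care") rating; items below it contribute 0
    alpha   -- half-life: the position whose utility weight is halved
    """
    return sum(max(r - d, 0) / 2 ** ((j - 1) / (alpha - 1))
               for j, r in enumerate(ratings, start=1))

# With alpha = 2, each position counts half as much as the one before it:
# position 1: (5-3)/1 = 2, position 2: max(3-3,0)/2 = 0, position 3: (4-3)/4 = 0.25
print(r_score([5, 3, 4], d=3, alpha=2))  # 2.25
```

The exponential decay encodes the assumption from the paragraph above: a relevant item buried deep in the list is worth far less than the same item at the top.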

In addition to R-Score, there are other ranking metrics, such as the Kendall and Spearman rank correlations and the Normalized Distance-based Performance Measure (NDPM).

5. Online evaluation metrics for utility

Online evaluation assesses the utility of a recommender system with real users. Researchers often conduct user trials to test the usefulness of their systems, or evaluate them in production for industrial use.

CTR (click-through rate) is the percentage of recommended items that the user clicked on or interacted with, out of the total number of items recommended. CTR came to prominence with the rise of web and mobile advertising and online marketing, and it is a major metric in recommender systems for studying the effective consumption of recommended items.

CTR evaluates the usefulness of a recommender system under the premise that a recommendation is useful if the user clicks on, interacts with, or consumes it. From a business point of view, it shows how effective the system's predictions are. The metric is given in Formula 6.
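As a concrete sketch of Formula 6 (the item IDs are hypothetical), CTR can be computed from the set of items shown and the set the user clicked, counting only clicks that came from the recommendation list:

```python
def ctr(shown, clicked):
    """Click-through rate: recommended items the user clicked / items shown."""
    return len(set(shown) & set(clicked)) / len(shown)

shown = ["a", "b", "c", "d"]   # items the recommender displayed
clicked = ["b", "d", "x"]      # "x" was clicked elsewhere, not from the list
print(ctr(shown, clicked))     # 2 of 4 recommendations clicked = 0.5
```

Intersecting the two sets keeps the metric honest: clicks on items the system never recommended do not inflate its score.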

Retention is also used to evaluate recommender systems online. It measures the impact of the recommender system on keeping users consuming or using the system. Retention has been a focus of evaluation and has been used in many scenarios.

It is worth mentioning that the utility metrics discussed earlier are also applicable to online evaluation: accuracy-based metrics (error measures, precision, recall) can be used online as well.

Related reading:

Workflow of a recommender system

Recommender systems in plain language

Want to learn about recommendation systems? Look here! (2) — Neural network method

Want to learn about recommendation systems? Look here! (1) — Collaborative filtering and singular value decomposition

How to automate deployment, operations, and maintenance of an intelligent recommender system

Getting started with recommendation systems, a list of knowledge you shouldn’t miss
