Similarity in recommendation systems

Author | Madhukara Putty   Compiled by | VK   Source | gitconnected

Have you ever wondered how Netflix recommends movies you want to watch? Or how Amazon shows you products you feel you need to buy?

Clearly, those sites already have some idea of what you like to watch or buy. They run code in the background that collects data about user behavior online and predicts how much an individual user will like or dislike a particular piece of content or product. Such systems are called “recommendation systems”.

Generally speaking, there are two ways to build a recommendation system. In one approach, the system takes into account the attributes of the content an individual consumes. For example, if you watch The Matrix on Netflix one day, Netflix learns that you like sci-fi movies and becomes more likely to recommend other sci-fi movies. In other words, the recommendation is based on the movie's genre, science fiction in this case.

In another approach, the recommendation system takes into account the preferences of others with similar tastes to yours and suggests films they have seen. Unlike the first approach, recommendations are based on the behavior of multiple users, rather than on the attributes of the content consumed. This method is called collaborative filtering.

In this example, both approaches would probably recommend sci-fi movies to you, but they reach that conclusion by different routes.

The utility matrix

An important part of collaborative filtering is identifying viewers with similar preferences. While Netflix collects information on user preferences in a variety of ways, for the sake of simplicity, let’s say it asks viewers to rate movies on a scale of 1-5. We also assume that only seven films are being rated (the Harry Potter trilogy HP1-3, Twilight TW, and the Star Wars trilogy SW1-3), and that only four viewers are asked to rate them.

Figure 1 shows the ratings provided by our four carefully selected viewers. Such a table, with products in columns and users in rows, is called a utility matrix. A blank cell means that the user has not yet rated that movie.

In reality, Netflix offers thousands of shows that are watched by millions of viewers every day. Its actual utility matrix will therefore have millions of rows and thousands of columns. In addition, because the system continuously collects information about user behavior, the matrix is updated dynamically.
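Because most cells in a real utility matrix are blank, one common way to hold it in memory is to store only the ratings that exist. The following is a minimal sketch, not how Netflix does it. The ratings for viewers A, B, and C are the ones used in the calculations later in this article; viewer D is left out because her exact ratings appear only in Figure 1. The helper names `add_rating` and `rated_fraction` are hypothetical.

```python
# Sparse utility matrix: one dict per viewer, holding only the movies they rated.
MOVIES = ["HP1", "HP2", "HP3", "TW", "SW1", "SW2", "SW3"]

utility = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
}


def add_rating(matrix, user, movie, rating):
    """Record (or overwrite) a single rating; creates the row if needed."""
    matrix.setdefault(user, {})[movie] = rating


def rated_fraction(matrix, movies):
    """Fraction of cells in the full users-by-movies table that are filled."""
    filled = sum(len(row) for row in matrix.values())
    return filled / (len(matrix) * len(movies))


# 9 of the 3 x 7 = 21 cells are filled for viewers A, B, and C.
print(rated_fraction(utility, MOVIES))
```

A dictionary of dictionaries makes a dynamic update a single assignment, at the cost of slower bulk arithmetic than a dense array.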

Looking at the utility matrix in Figure 1, we can draw some obvious conclusions.

Viewer A likes Harry Potter 1 and Twilight, but not Star Wars 1
Viewer B likes all the Harry Potter films
Viewer C likes Star Wars 1 and 2, but doesn’t like Twilight
Viewer D doesn’t mind watching Harry Potter 2 and Star Wars 2 on a boring day, but neither film would be her first choice

Overall, viewers A and B have similar tastes, because they both like Harry Potter 1. In contrast, viewers A and C have different tastes: A likes Twilight, but C doesn’t like it at all. Similarly, A doesn’t like Star Wars 1, but C does. A recommendation system needs a way to compare the ratings given by different viewers and tell us how similar their tastes are.

Quantifying similarity

There are different measures for comparing the ratings provided by two viewers and finding out whether they have similar tastes. In this article, we will study two of them: the Jaccard distance and the cosine distance. Under either measure, viewers with similar tastes are a smaller distance apart.

Jaccard distance

Jaccard distance is a function of another quantity called Jaccard similarity. By definition, the Jaccard similarity of sets S and T is the ratio of the size of the intersection of S and T to the size of their union. Mathematically, it can be written as:

$$J(S, T) = \frac{|S \cap T|}{|S \cup T|}$$

The Jaccard distance d(S, T) between sets S and T is given by the following equation:

$$d(S, T) = 1 - J(S, T)$$
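The definition above can be sketched in a few lines of Python. The function names are illustrative, and the handling of two empty sets (treated here as identical) is an assumption, since the definition itself leaves that case undefined.

```python
def jaccard_similarity(s, t):
    """Size of the intersection divided by the size of the union."""
    s, t = set(s), set(t)
    union = s | t
    if not union:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(s & t) / len(union)


def jaccard_distance(s, t):
    """Jaccard distance is one minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(s, t)


# The two sets share one item ("y") out of three distinct items.
print(jaccard_distance({"x", "y"}, {"y", "z"}))
```

The distance is 0 for identical sets and 1 for sets with nothing in common.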

Cosine distance

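The cosine distance between two vectors A and B is the angle d(A, B) between them, given by:

$$d(A, B) = \arccos\left(\frac{\sum_{i=1}^{n} A_i B_i}{\|A\|_2 \, \|B\|_2}\right)$$

where

$$\|A\|_2 = \sqrt{\sum_{i=1}^{n} A_i^2}, \qquad \|B\|_2 = \sqrt{\sum_{i=1}^{n} B_i^2}$$

are the $L_2$ norms of vectors A and B, respectively, and n is the number of products (in this case, movies) under consideration. Cosine distance varies between 0 and 180 degrees.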
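As a rough sketch, the angle can be computed with the standard library alone. The function name is illustrative; raising an error for an all-zero vector is a design choice made here, because the angle is undefined in that case.

```python
import math


def cosine_distance(a, b):
    """Angle between vectors a and b, in degrees (0 to 180)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    # Clamp to [-1, 1] so floating-point error cannot push acos out of range.
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos))


print(cosine_distance([1, 0], [0, 1]))   # perpendicular vectors
print(cosine_distance([1, 0], [-1, 0]))  # opposite vectors
```

Vectors pointing in the same direction give an angle near 0, regardless of their lengths.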

Calculating distance measures on the utility matrix

To better understand these distance measures, let’s calculate the distance using the data in the utility matrix (Figure 1).

Calculating the Jaccard distance: The first step in calculating the Jaccard distance is to write each user’s ratings as a set of the movies that user has rated. The sets corresponding to users A and B are:

A={HP1, TW, SW1}

B={HP1, HP2, HP3}

The intersection of sets A and B is the set of elements common to both sets. The union of A and B is the set of all the members of A and B. So:

A ⋂ B = {HP1}

A⋃B={HP1, HP2, HP3, TW, SW1}.

The Jaccard distance between A and B is:

$$d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|} = 1 - \frac{1}{5} = 0.8$$

Similarly, the Jaccard distance between A and C is d(A, C) = 0.5, since A and C have two of the four movies in their union in common. By this measure, viewers A and C are more similar than viewers A and B, contrary to what an intuitive reading of the utility matrix suggests. The Jaccard distance is therefore not well suited to the type of data we are considering.
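The calculation can be checked with a short script. This is only a sketch: the rating dictionaries mirror the values used for viewers A, B, and C in this article, and the variable names are illustrative. Note how the ratings themselves are thrown away as soon as the sets are built.

```python
ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
}

# Keep only WHICH movies each viewer rated; the rating values are discarded.
rated = {user: set(row) for user, row in ratings.items()}


def jaccard_distance(s, t):
    """One minus the ratio of intersection size to union size."""
    return 1 - len(s & t) / len(s | t)


d_ab = jaccard_distance(rated["A"], rated["B"])  # 1 shared movie out of 5
d_ac = jaccard_distance(rated["A"], rated["C"])  # 2 shared movies out of 4
print(d_ab, d_ac)
```

A's rating of 1 for Star Wars 1 and C's rating of 4 for the same film count as agreement here, which is why the measure misleads on this data.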

Calculating the cosine distance: Now let’s calculate the cosine distance between viewers A and B and between viewers A and C. To do this, we must first create a vector that represents each viewer's ratings. For simplicity, let’s treat a blank as a rating of 0. This is a questionable choice, because a rating of zero could also be read as a very low rating given by the viewer. The vectors corresponding to viewers A, B, and C, with entries in the order HP1, HP2, HP3, TW, SW1, SW2, SW3, are:

A=[4,0,0,5,1,0,0]

B=[5,5,4,0,0,0,0]

C=[0,0,0,2,4,5,0]

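The cosine distance between A and B is:

$$d(A, B) = \arccos\left(\frac{4 \cdot 5}{\sqrt{42}\,\sqrt{66}}\right) \approx \arccos(0.380) \approx 67.7^\circ$$

Similarly, the cosine distance between A and C is:

$$d(A, C) = \arccos\left(\frac{5 \cdot 2 + 1 \cdot 4}{\sqrt{42}\,\sqrt{45}}\right) \approx \arccos(0.322) \approx 71.2^\circ$$

This is reasonable, because it shows that A is closer to B than to C.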
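These two angles can be reproduced with a few lines of Python. The sketch below fills blanks with 0, as assumed above, and uses seven entries per vector (one per movie, in the order HP1, HP2, HP3, TW, SW1, SW2, SW3). The function name is illustrative.

```python
import math

A = [4, 0, 0, 5, 1, 0, 0]
B = [5, 5, 4, 0, 0, 0, 0]
C = [0, 0, 0, 2, 4, 5, 0]


def cosine_distance(a, b):
    """Angle between two equal-length, non-zero vectors, in degrees."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Clamp to guard against floating-point values just outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norms))))


d_ab = cosine_distance(A, B)
d_ac = cosine_distance(A, C)
print(round(d_ab, 1), round(d_ac, 1))  # roughly 67.7 and 71.2 degrees
```

The gap between the two angles is small, which reflects how heavily the zero-filled blanks dominate these vectors.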

Transforming ratings

We can also transform the data captured in the utility matrix by applying well-defined rules to each element in the matrix. In this article, we’ll look at two types of transformations: rounding and normalization.

Rounding

Viewers often give similar ratings to similar films. For example, viewer B gives a high rating to all the Harry Potter movies, while viewer C gives a high rating to Star Wars 1 and 2. We can capture this pattern, and stop low ratings from counting as agreement with high ones, by rounding the ratings according to a rule. For example, we could round ratings of 3, 4, and 5 to 1 and treat ratings of 1 and 2 as blanks. After applying this rule, our utility matrix becomes:

With rounded ratings, the intersection of the sets corresponding to viewers A and C is the empty set. This reduces their Jaccard similarity to its minimum value, 0, and pushes their Jaccard distance to its maximum value, 1. Meanwhile, the Jaccard distance between the sets corresponding to viewers A and B is less than 1 (it is 0.75, since they share one of four movies), which makes A closer to B than to C. Note that the Jaccard distance did not provide this insight into user behavior when it was computed from the raw ratings. Computing the cosine distance on the rounded ratings leads to the same conclusion.
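The rounding rule and its effect on the Jaccard distance can be sketched as follows. The helper `round_ratings` and its `threshold` parameter are hypothetical names; the rule itself (ratings of 3 and above become 1, lower ratings become blanks) is the one described above, applied to the ratings of viewers A, B, and C.

```python
ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
}


def round_ratings(row, threshold=3):
    """Ratings >= threshold become 1; lower ratings are dropped (blank)."""
    return {movie: 1 for movie, rating in row.items() if rating >= threshold}


def jaccard_distance(s, t):
    """One minus the ratio of intersection size to union size."""
    return 1 - len(s & t) / len(s | t)


rounded = {user: round_ratings(row) for user, row in ratings.items()}
liked = {user: set(row) for user, row in rounded.items()}

d_ab = jaccard_distance(liked["A"], liked["B"])  # share HP1 out of 4 movies
d_ac = jaccard_distance(liked["A"], liked["C"])  # share nothing
print(d_ab, d_ac)
```

After rounding, each set contains only the movies a viewer liked, so overlap now means shared taste rather than shared viewing history.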

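Normalizing ratings

Another way to transform the original ratings is to normalize them. By normalization, we mean subtracting each viewer's average rating from each of that viewer's ratings. For example, let’s find the normalized ratings for viewer A, whose average rating is 10/3. Her normalized ratings are:

$$\text{HP1}: 4 - \tfrac{10}{3} = \tfrac{2}{3}, \qquad \text{TW}: 5 - \tfrac{10}{3} = \tfrac{5}{3}, \qquad \text{SW1}: 1 - \tfrac{10}{3} = -\tfrac{7}{3}$$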

The utility matrix with all values normalized is given below. Note that normalization turns a viewer's above-average ratings into positive values and below-average ratings into negative values.

Since the individual values in the utility matrix change, we can expect the cosine distance to change. However, the Jaccard distance remains the same, because it depends only on which movies the two users have rated, not on the ratings they gave.

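For the normalized values, the vectors corresponding to viewers A, B, and C (with blanks still treated as 0) are:

A=[2/3, 0, 0, 5/3, -7/3, 0, 0]

B=[1/3, 1/3, -2/3, 0, 0, 0, 0]

C=[0, 0, 0, -5/3, 1/3, 4/3, 0]

The cosine distances between A and B and between A and C are:

$$d(A, B) = \arccos\left(\frac{2}{\sqrt{78}\,\sqrt{6}}\right) \approx \arccos(0.092) \approx 84.7^\circ$$

$$d(A, C) = \arccos\left(\frac{-32}{\sqrt{78}\,\sqrt{42}}\right) \approx \arccos(-0.559) \approx 124.0^\circ$$

Although computing the cosine distance on normalized ratings does not change the original conclusion (that A is closer to B than to C), it does magnify the distances between the vectors. Vectors A and C end up particularly far apart under normalized ratings, although A and B are not very close either.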
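The normalization step and the resulting angles can be reproduced with the sketch below. It subtracts each viewer's own mean from the movies that viewer rated and keeps blanks at 0, as assumed throughout this article. The helper names `normalize` and `to_vector` are hypothetical.

```python
import math

MOVIES = ["HP1", "HP2", "HP3", "TW", "SW1", "SW2", "SW3"]
ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
}


def normalize(row):
    """Subtract the viewer's mean rating from each movie they rated."""
    mean = sum(row.values()) / len(row)
    return {movie: rating - mean for movie, rating in row.items()}


def to_vector(row, movies):
    """Dense vector in a fixed movie order; unrated movies stay at 0."""
    return [row.get(movie, 0.0) for movie in movies]


def cosine_distance(a, b):
    """Angle between two equal-length, non-zero vectors, in degrees."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norms))))


vectors = {u: to_vector(normalize(row), MOVIES) for u, row in ratings.items()}
d_ab = cosine_distance(vectors["A"], vectors["B"])
d_ac = cosine_distance(vectors["A"], vectors["C"])
print(round(d_ab, 1), round(d_ac, 1))  # roughly 84.7 and 124.0 degrees
```

An angle above 90 degrees, as between A and C, means the two viewers' normalized ratings point in broadly opposite directions.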

Conclusion

Recommendation systems are at the core of the internet economy. They are the computer programs that keep us hooked on social media, online shopping, and entertainment platforms. The job of a recommendation system is to predict what a particular user is likely to buy or consume. One of the two broad ways to predict this is to look at what other people, especially those whose preferences are similar to the user's, buy or consume. A key part of this approach is quantifying the similarity between users.

Jaccard distance and cosine distance are two ways to quantify the similarity between users. The Jaccard distance takes into account which products the two users being compared have rated, rather than the actual values of the ratings. Cosine distance, on the other hand, takes into account the actual values of the ratings, not just which products were rated. Because they are calculated differently, the Jaccard and cosine distance measures sometimes lead to conflicting conclusions. In some cases, we can avoid such conflicts by rounding the ratings according to an explicit rule.

Ratings can also be converted by subtracting the average rating given by the user from each rating given by the user. This process, called normalization, does not affect the Jaccard distance, but has a tendency to amplify the cosine distance.

Original article: levelup.gitconnected.com/measuring-s…
