RS in Action (Part 2): Using User Behavior Data

The second chapter is arguably the core of the whole book: it explains how to use user behavior data to "listen to what users say and watch what they do". It focuses on two algorithms:

  • UserCF, user-based collaborative filtering
  • ItemCF, item-based collaborative filtering

I had read some blog posts on this before and found them messy; after reading Xiang Liang's book, I at least understand the theoretical difference between the two algorithms, which is a real gain.

User behavior data

Mining user data

We can learn about users’ interests and needs from the words and behaviors they leave behind.

Users' behaviors are not random; they contain many patterns. Algorithms can automatically mine user behavior data, infer users' interests from their behaviors, and then recommend items that match those interests.

Beer and diapers

Supermarket staff noticed that many customers bought beer 🍺 and diapers at the same time. The explanation: the wife stays at home to take care of the baby and asks the husband to buy diapers, and the husband, while out shopping, does not forget to grab some beer for himself.

So the supermarket staff placed the beer and the diapers together. As a result, sales of both seemingly unrelated items increased.

The beer-and-diapers story has been amplified on the Internet into basket analysis: analyzing users' shopping carts to find patterns like "people who bought item A also bought item B."
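A minimal sketch of this kind of basket analysis, assuming transactions come as lists of item names (the baskets and names below are hypothetical):

from collections import Counter
from itertools import combinations

def co_purchase_counts(baskets):
    # count how often each unordered pair of items appears in the same basket
    pairs = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            pairs[(a, b)] += 1
    return pairs

baskets = [["beer", "diapers", "milk"], ["beer", "diapers"], ["bread", "milk"]]
print(co_purchase_counts(baskets).most_common(2))  # ('beer', 'diapers') appears twice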

User behavior data

This data exists on the site in the form of logs. Session logs are typically stored in distributed data warehouses, such as Hadoop Hive for offline analysis and Google Dremel for online analysis. There are two types of user behavior feedback:

  • Explicit feedback: behavior that clearly expresses the user's preference for an item, such as a rating
  • Implicit feedback: behavior that does not clearly express the user's preference; the most representative example is page browsing

By clarity of feedback, user behavior data is divided into explicit feedback and implicit feedback; by direction of feedback, it is divided into positive feedback and negative feedback.

Positive feedback: the user likes the item. Negative feedback: the user tends not to like the item.

Explicit feedback can clearly distinguish positive from negative feedback, while for implicit feedback the distinction is difficult.


Six elements of a user behavior record

A user behavior record can be described by six elements (a representation sketch follows this list):

  • the user who produced the behavior
  • the object of the behavior (the item)
  • the type of the behavior (e.g. browsing, purchasing, rating)
  • the context in which the behavior was produced (e.g. time and location)
  • the content of the behavior (e.g. the text of a comment)
  • the weight of the behavior (e.g. a rating value)
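As a sketch, one way to represent such a record in code (the field names are mine, not the book's):

from dataclasses import dataclass

@dataclass
class Behavior:
    user_id: str        # the user who produced the behavior
    item_id: str        # the object of the behavior
    behavior_type: str  # e.g. "browse", "purchase", "rate"
    context: dict       # e.g. time and location the behavior occurred
    content: str        # e.g. the text of a comment
    weight: float       # e.g. a rating value or dwell time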

Representative datasets

Different datasets record different kinds of user behavior:

| Type | Data recorded | Representative dataset |
| --- | --- | --- |
| No context + implicit feedback | user ID, item ID | |
| No context + explicit feedback | user ID, item ID, user's rating of the item | |
| With context + implicit feedback | user ID, item ID, timestamp of the user's behavior on the item | Lastfm |
| With context + explicit feedback | user ID, item ID, user's rating of the item, timestamp of the behavior | Netflix Prize |

User behavior analysis

Zipf’s law

Zipf, a linguist at Harvard University, discovered while studying English words that:

if words are ranked in descending order of frequency, each word's frequency is inversely proportional to a power of its rank.

Most English words occur with low frequency; only a few words are used very frequently.
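Stated as a formula (a standard form of Zipf's law; the exponent $s$ is corpus-dependent, roughly 1 for English text):

$$f(r) \propto \frac{1}{r^{s}}$$

where $f(r)$ is the frequency of the word of rank $r$.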

The long tail distribution


Many data distributions on the Internet follow a Power Law, also known as the long-tail distribution:

A long-tail distribution is one with a very long tail; Zipf's law describes a long-tail distribution.

User behavior data also satisfies Zipf's law. Let $f_u(k)$ be the number of users who have interacted with exactly $k$ items, and $f_i(k)$ be the number of items that have been interacted with by exactly $k$ users. Both follow a long-tail (power-law) distribution:

$$f_u(k) = \alpha_u k^{\beta_u}, \qquad f_i(k) = \alpha_i k^{\beta_i}$$

Item popularity: the total number of users who have interacted with the item.

User activity: the total number of items the user has interacted with.

(Figure: item popularity distribution curve.)

(Figure: user activity distribution curve.)
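A minimal sketch (mine, not the book's code) of computing both distributions from a list of (user, item) records:

from collections import Counter

def behavior_distributions(records):
    # records: iterable of (user, item) pairs
    item_popularity = Counter(item for user, item in records)  # popularity of each item
    user_activity = Counter(user for user, item in records)    # activity of each user
    fi = Counter(item_popularity.values())  # f_i(k): number of items with popularity k
    fu = Counter(user_activity.values())    # f_u(k): number of users with activity k
    return fi, fu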

Collaborative filtering algorithm

The relationship between user activity and item popularity:

New users are unfamiliar with the site and tend to browse popular items; veteran users gradually begin to browse less popular, long-tail items.

Recommendation algorithms based on user behavior analysis are an important class of algorithms in personalized recommender systems; in academia they are known as collaborative filtering algorithms.

Collaborative filtering: through continuous interaction with the site, users collaborate with one another so that their recommendation lists keep filtering out items they are not interested in, matching their needs better and better.

Collaborative filtering has been studied along several lines:

  • Neighborhood-based methods (neighborhood-based; the most widely used)
  • Latent factor models (LFM, latent factor model)
  • Graph-based random walk algorithms (random walk on graph)

The neighborhood-based approach includes the two best-known and most widely used algorithms (a similarity sketch follows this list):

  • UserCF, user-based collaborative filtering: recommends to a user the items liked by other users with similar interests
  • ItemCF, item-based collaborative filtering: recommends items similar to the items the user liked before
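To make the user-based idea concrete, here is a minimal cosine-similarity sketch over two users' interaction sets. It illustrates only the similarity step, not the book's full UserCF implementation:

import math

def user_similarity(items_u, items_v):
    # cosine similarity between the sets of items two users have interacted with
    if not items_u or not items_v:
        return 0.0
    return len(items_u & items_v) / math.sqrt(len(items_u) * len(items_v))

# hypothetical interaction histories
print(user_similarity({"a", "b", "d"}, {"a", "c"}))  # 1/sqrt(6) ≈ 0.408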

Evaluating the algorithms

Three evaluation methods

There are three ways to evaluate a recommender system:

  • Offline experiments
  • User surveys
  • Online tests

The dataset

The MovieLens dataset is used here. It is a rating dataset in which users rate movies on five levels (1-5).

The book focuses on the TopN recommendation problem on implicit feedback datasets:

TopN recommendation problem: predict whether a user will rate a movie, rather than predict what rating the user would give a particular movie having already decided to rate it.

The experiment design

Design steps of the offline experiment (a usage sketch follows the code):

  1. Split the user behavior dataset into M parts uniformly at random; pick one part as the test set and use the remaining M-1 parts as the training set
  2. Build the user interest model on the training set
  3. Predict user behavior on the test set and compute the corresponding evaluation metrics

To prevent overfitting of the results, the experiment is run M times with a different test set each time, and the average over the M runs is used as the final evaluation metric. Python code to split the dataset into a training set and a test set:

import random

def SplitData(data, M, k, seed):
    test = []
    train = []
    random.seed(seed)  # same random seed across experiments, so the M splits are reproducible
    for user, item in data:
        # randint(0, M - 1) picks one of M folds uniformly; a different k is chosen
        # for each experiment (the original randint(0, M) spans M + 1 values)
        if random.randint(0, M - 1) == k:
            test.append([user, item])
        else:
            train.append([user, item])
    return train, test
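A sketch of how SplitData drives the M-fold procedure above; data and evaluate are hypothetical stand-ins for the behavior records and the metric computations below:

M, seed = 8, 42
scores = []
for k in range(M):  # a different fold serves as the test set in each run
    train, test = SplitData(data, M, k, seed)
    scores.append(evaluate(train, test))  # hypothetical: compute recall, precision, etc.
final_score = sum(scores) / M  # average of the M experiments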

Evaluation metrics

N items are recommended to each user u; the recommendation list is denoted R(u), and the set of items user u likes in the test set is T(u). The accuracy of the algorithm is evaluated by recall and precision:

  • Recall:

$$\text{Recall} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |T(u)|}$$

Recall describes what fraction of the user-item records in the test set end up in the final recommendation lists.

  • Precision:

$$\text{Precision} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |R(u)|}$$

Precision describes what fraction of the recommendation lists consists of user-item records from the test set.

# Recall and precision

def Recall(train, test, N):
    hit = 0
    all = 0
    for user in train.keys():
        tu = test.get(user, {})  # items the user liked in the test set (guard against missing users)
        rank = GetRecommendation(user, N)
        for item, pui in rank:
            if item in tu:
                hit += 1
        all += len(tu)
    return hit / (all * 1.0)

def Precision(train, test, N):
    hit = 0
    all = 0
    for user in train.keys():
        tu = test.get(user, {})
        rank = GetRecommendation(user, N)
        for item, pui in rank:
            if item in tu:
                hit += 1
        all += N
    return hit / (all * 1.0)
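All of these metric functions call GetRecommendation, which the book leaves to the concrete algorithm. A hypothetical stub showing the interface the code above expects, namely a list of (item, predicted_interest) pairs:

def GetRecommendation(user, N):
    # hypothetical: model is a trained recommender returning {item: predicted_interest}
    rank = model.recommend(user)
    return sorted(rank.items(), key=lambda x: x[1], reverse=True)[:N]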

Coverage

Coverage reflects a recommendation algorithm's ability to surface long-tail items: the higher the coverage, the better the algorithm is at recommending long-tail items to users.

Coverage is the fraction of all items that appear in at least one final recommendation list: $\text{Coverage} = \frac{|\bigcup_{u} R(u)|}{|I|}$, where $I$ is the set of all items.

Coverage is 100% if every item is recommended to at least one user.

def Coverage(train, test, N):
    recommend_items = set()
    all_items = set()
    for user in train.keys():
        for item in train[user].keys():
            all_items.add(item)
        rank = GetRecommendation(user, N)
        for item, pui in rank:
            recommend_items.add(item)
    return len(recommend_items) / (len(all_items) * 1.0)

Novelty

Novelty is measured by the average popularity of the items in the recommendation lists: the more popular the recommended items, the lower the novelty; the less popular they are, the more novel the recommendations.

import math

def Popularity(train, test, N):
    # count how many users have interacted with each item (item popularity)
    item_popularity = dict()
    for user, items in train.items():
        for item in items.keys():
            if item not in item_popularity:
                item_popularity[item] = 0
            item_popularity[item] += 1
    ret = 0
    n = 0
    for user in train.keys():
        rank = GetRecommendation(user, N)
        for item, pui in rank:
            # take the log: item popularity is long-tailed, so averaging logs is more stable
            ret += math.log(1 + item_popularity[item])
            n += 1
    ret /= n * 1.0
    return ret

References

  1. Xiang Liang, Recommendation System Practice