Tagging is one of the main ways users express themselves through user-generated content, and tagging behavior is also a useful feature in today’s recommendation scenarios. Tags are also an abstract form of user profile with a very wide range of uses, and recommendation algorithms can be built directly on a tag system. This article ends with a discussion of three tag-based recommendation algorithms: SimpleTagBased, NormTagBased and TagBased-TFIDF.

1. Functions of the tag system

Shilad Wieland Sen of GroupLens surveyed users of the MovieLens movie recommendation system about tagging. The results show that users gave the following feedback on tagging movies:

  • Expression: The tagging system helps me express my views on the item. (30% agreed.)
  • Organizing: Tagging helps me organize my favorite movies. (23% agreed.)
  • Learning: Tagging helps me learn more about movies. (27% agreed.)
  • Discovery: The tagging system makes it easier for me to find movies I like. (19% agreed.)
  • Decision making: The tagging system helps me decide whether to watch a particular movie. (14% agreed.)

Motivation for labeling

Users’ motivation for tagging can be discussed along two dimensions. The first is the social dimension: some tags are meant for the content uploader (to help them organize their own information), while others are meant for users at large (to help other users find information). The second is the functional dimension: some tags are used to organize content so the user can find it again later, while others convey specific information, such as when and where a photo was taken.

How do users tag

Although each user’s behavior on the Internet seems random, many regularities lie behind these seemingly random behaviors. In user behavior data sets, both user activity and item popularity follow a long-tail (power-law) distribution. So let’s first look at the distribution of tag popularity. We define popularity as follows: each time a user applies a tag to an item, that tag’s popularity increases by one.
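
As a minimal sketch of this definition, tag popularity can be counted directly from the tagging records. The records below are made up for illustration, and the names `records`, `tag_popularity` and `popularity_histogram` are assumptions rather than part of any data set or library.

```python
from collections import Counter

# Hypothetical (user, item, tag) records, made up for illustration.
records = [
    ("u1", "i1", "funny"),
    ("u1", "i2", "funny"),
    ("u2", "i1", "funny"),
    ("u2", "i3", "blog"),
    ("u3", "i3", "to-read"),
]

# A tag's popularity grows by one each time a user applies it to an item.
tag_popularity = Counter(tag for _, _, tag in records)

# Number of tags that share each popularity value. Plotting this histogram
# on a real data set is one way to inspect the popularity distribution.
popularity_histogram = Counter(tag_popularity.values())

print(tag_popularity.most_common())
print(sorted(popularity_histogram.items()))
```

On a real data set, the histogram would usually be plotted on log-log axes to check whether it has a long-tail shape.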

The contents of the label

When a user sees an item, we would like them to choose tags that accurately describe its content, but users do not always do what we want and may apply all kinds of unexpected tags. Typical kinds of tags are:

  • Indicate what the item is: if it is a bird, it gets the tag “bird”; if it is the home page of Douban, it gets the tag “Douban”; if it is Jobs’s home page, it gets the tag “Jobs”.
  • Indicate the type of item: For example, in Delicious bookmarks, the tags that represent a category of a web page include article, blog, book, and so on.
  • Indicate who owns things: Many blogs, for example, have tags that include information about who wrote the blog.
  • Express the user’s opinion: for example, if the user thinks the page is interesting, they’ll tag it funny, or if they think it’s boring, they’ll tag it boring.
  • User related tags: My favorite, My comment, etc.
  • User tasks: to read, job search, etc.

Recommend labels to users

When user u tags item i, the system can recommend tags related to item i to the user in the following ways:

  • Method 1: Recommend the most popular tags in the whole system to user u
  • Method 2: Recommend the most popular tags on item i to user u
  • Method 3: Recommend to user u the tags he or she uses most often
  • Method 4: Fuse methods 2 and 3 with weights to generate the final tag recommendation
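
The weighted fusion of methods 2 and 3 can be sketched as follows. This is only an illustration under stated assumptions: the records are made up, the function names `normalize` and `recommend_tags` are hypothetical, and both the max-normalization and the fusion weight `alpha` are choices made here, not something the text prescribes.

```python
from collections import Counter

# Hypothetical (user, item, tag) records, made up for illustration.
records = [
    ("u1", "i1", "funny"),
    ("u1", "i2", "funny"),
    ("u1", "i2", "blog"),
    ("u2", "i3", "blog"),
    ("u2", "i3", "python"),
    ("u3", "i3", "python"),
]


def normalize(counts):
    """Scale counts so the most frequent tag gets weight 1.0 (an assumed choice)."""
    peak = max(counts.values(), default=0)
    return {tag: n / peak for tag, n in counts.items()} if peak else {}


def recommend_tags(user, item, records, alpha=0.5, top_n=3):
    """Fuse the item's popular tags (method 2) with the user's own tags (method 3)."""
    item_scores = normalize(Counter(t for _, i, t in records if i == item))
    user_scores = normalize(Counter(t for u, _, t in records if u == user))
    fused = {
        tag: alpha * item_scores.get(tag, 0.0) + (1 - alpha) * user_scores.get(tag, 0.0)
        for tag in set(item_scores) | set(user_scores)
    }
    # Sort by score (descending), breaking ties alphabetically for stable output.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


print(recommend_tags("u1", "i3", records, alpha=0.6))
```

With `alpha=0.6` the item’s own tags weigh slightly more than the user’s habits, so “python” (popular on i3) ranks ahead of “funny” (the tag u1 uses most).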

2. Tag-based personalized recommendation algorithm

A data set of user tagging behavior is typically represented as a set of triples, where the record (u, i, b) indicates that user u has tagged item i with tag b. Of course, real tagging data is far richer than these triples and also includes, for example, the tagging time, user attributes and item attributes.

2.1 SimpleTagBased

  • Count the tags each user uses most often
  • For each tag, count the items that have been tagged with it the most times
  • For a specific user, find the tags he or she uses most and collect the most popular items under those tags
  • Sort the candidate items by score and recommend the top ones

The ranking score formula is as follows:


$$score(u,i)=\sum_{t}UserTags[u,t]*TagItems[t,i]$$

UserTags[u,t] is the number of times user u has used tag t, and TagItems[t,i] is the number of times item i has been tagged with tag t.
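
A small worked example may make the formula concrete. The counts below are made up, and the nested dictionaries `user_tags` and `tag_items` (standing in for UserTags and TagItems) and the function name `simple_tag_based` are assumptions for this sketch.

```python
from collections import defaultdict

# Hypothetical counts, made up for illustration.
user_tags = {"u1": {"funny": 2, "blog": 1}}       # UserTags[u][t]
tag_items = {
    "funny": {"i1": 3, "i2": 1},                  # TagItems[t][i]
    "blog": {"i2": 2, "i3": 2},
}


def simple_tag_based(user):
    """score(u, i) = sum over t of UserTags[u, t] * TagItems[t, i]."""
    scores = defaultdict(int)
    for tag, n_ut in user_tags[user].items():
        for item, n_ti in tag_items.get(tag, {}).items():
            scores[item] += n_ut * n_ti
    return dict(scores)


scores = simple_tag_based("u1")
print(scores)  # i1: 2*3 = 6, i2: 2*1 + 1*2 = 4, i3: 1*2 = 2
```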

2.2 NormTagBased

This algorithm normalizes the score of the SimpleTagBased algorithm:


$$score(u,i)=\sum_{t}\frac{UserTags[u,t]}{UserTags[u]}*\frac{TagItems[t,i]}{TagItems[t]}$$
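
The normalized score can be sketched in the same way. The text does not define UserTags[u] and TagItems[t], so this sketch assumes that UserTags[u] is the total number of tags user u has applied and TagItems[t] is the total number of times tag t has been applied. The counts and the function name `norm_tag_based` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical counts, made up for illustration.
user_tags = {"u1": {"funny": 2, "blog": 1}}       # UserTags[u][t]
tag_items = {
    "funny": {"i1": 3, "i2": 1},                  # TagItems[t][i]
    "blog": {"i2": 2, "i3": 2},
}


def norm_tag_based(user):
    """Sum over t of (UserTags[u,t] / UserTags[u]) * (TagItems[t,i] / TagItems[t])."""
    # Assumption: UserTags[u] is the total number of tags the user has applied.
    user_total = sum(user_tags[user].values())
    scores = defaultdict(float)
    for tag, n_ut in user_tags[user].items():
        items = tag_items.get(tag, {})
        # Assumption: TagItems[t] is the total number of times the tag was applied.
        tag_total = sum(items.values())
        for item, n_ti in items.items():
            scores[item] += (n_ut / user_total) * (n_ti / tag_total)
    return dict(scores)


norm_scores = norm_tag_based("u1")
print(norm_scores)
```

Under these assumptions each factor lies between 0 and 1, so heavy taggers and heavily used tags no longer dominate the score purely through volume.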

2.3 TagBased-TFIDF

If a tag is popular, UserTags[u,t] tends to be large, so even if TagItems[t,i] is small, score(u,i) can still be very large. This leads to popular items being recommended to users, which reduces the novelty of the recommendation results. In addition, the formula models a user’s interest with a tag vector in which each entry is a tag the user has used, weighted by the number of times the user has used it. The drawback of this model is that it gives too much weight to popular tags and so fails to reflect the user’s individual interests. TagBased-TFIDF borrows the idea of TF-IDF and penalizes popular tags, where TagUsers[t] is the number of distinct users who have used tag t:


$$score(u,i)=\sum_{t}\frac{UserTags[u,t]}{\log(1+TagUsers[t])}*TagItems[t,i]$$
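
The effect of the penalty can be illustrated with a small sketch. All counts are made up, `tag_users` (standing in for TagUsers) and `tag_based_tfidf` are hypothetical names, and the natural logarithm is an assumption because the formula does not state a base.

```python
import math
from collections import defaultdict

# Hypothetical counts, made up for illustration.
user_tags = {"u1": {"funny": 2, "blog": 1}}       # UserTags[u][t]
tag_items = {
    "funny": {"i1": 3, "i2": 1},                  # TagItems[t][i]
    "blog": {"i2": 2, "i3": 2},
}
# TagUsers[t]: "funny" is used by many users, "blog" by few.
tag_users = {"funny": 100, "blog": 2}


def tag_based_tfidf(user):
    """Sum over t of UserTags[u,t] / log(1 + TagUsers[t]) * TagItems[t,i]."""
    scores = defaultdict(float)
    for tag, n_ut in user_tags[user].items():
        # Popular tags get a larger denominator and therefore a smaller weight.
        weight = n_ut / math.log(1 + tag_users[tag])
        for item, n_ti in tag_items.get(tag, {}).items():
            scores[item] += weight * n_ti
    return dict(scores)


tfidf_scores = tag_based_tfidf("u1")
ranking = sorted(tfidf_scores, key=tfidf_scores.get, reverse=True)
print(ranking)
```

With the same counts, the unpenalized SimpleTagBased score would rank i1 first (6 versus 4 and 2). Here i1 drops to last place because it is reachable only through the widely used tag “funny”.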

3. SimpleTagBased in practice

Algorithm steps:

  1. Import the data set and store it in the dictionary format {user: {item: [tags]}};
  2. Divide it into a training set and a test set;
  3. Compute the scores for a given user. For each tag, multiply n from {user: {tag: n}} by m from {tag: {item: m}} and sum n*m per item; then sort the items from largest to smallest score and take the top N;
  4. Evaluate using the test set.

# Use the SimpleTagBased algorithm to make recommendations on the Delicious data set
# Raw data set: https://grouplens.org/datasets/hetrec-2011/
# Data format: userID bookmarkID tagID timestamp
import pandas as pd
import warnings
import math
import random
import operator
warnings.filterwarnings('ignore')

file_path = 'user_taggedbookmarks-timestamps.dat'
# Dictionary format: {user: {item1: [tag1, tag2], ...}, ...}
records = {}
# Training set, test set
train_data = {}
test_data = {}
# User-tag, user-item and tag-item counts
user_tags = dict()
user_items = dict()
tag_items = dict()

# Data loading
def load_data():
    print('Data loading... ')
    df = pd.read_csv(file_path,sep = '\t')
    # put df into the dictionary format
    for i in range(len(df)):
    #for i in range(10):
        uid = df['userID'][i]
        iid = df['bookmarkID'][i]
        tag = df['tagID'][i]
        # setdefault creates an empty dict for a new uid and an empty list for a new iid
        records.setdefault(uid,{})
        records[uid].setdefault(iid,[])
        records[uid][iid].append(tag)
    #print(records)
    print('Dataset size: %d.' %len(df))
    print('Number of users who tagged: %d.' % len(records))
    print('Data loading complete \n')

# Split the data set into a training set and a test set; ratio is the proportion assigned to the test set
def train_test_split(ratio, seed=100):
    random.seed(seed)
    for u in records.keys():
        for i in records[u].keys():
            # each (user, item) pair goes to the test set with probability ratio
            if random.random()<ratio:
                test_data.setdefault(u,{})
                test_data[u].setdefault(i,[])
                for t in records[u][i]:
                    test_data[u][i].append(t)
            else:
                train_data.setdefault(u,{})
                train_data[u].setdefault(i,[])
                for t in records[u][i]:
                    train_data[u][i].append(t)
    print("The number of training set users is %d, and the number of test set users is %d." % (len(train_data), len(test_data)))

# The "matrix" mat[index][item] stores the relationship between index and item as {index: {item: n}}, where n is the number of occurrences
def addValueToMat(mat, index, item, value=1):
    if index not in mat:
        mat.setdefault(index,{})
        mat[index].setdefault(item,value)
    else:
        if item not in mat[index]:
            mat[index].setdefault(item,value)
        else:
            mat[index][item] +=value

# Initialize user_tags, user_items and tag_items from the training set.
# user_tags is {user: {tag: n}}, where n is the number of times the user used the tag;
# user_items is {user: {item: n}}, built the same way;
# tag_items is {tag: {item: n}}, built the same way.
def initStat():
    records = train_data
    for u,items in records.items():
        for i,tags in records[u].items():
            for tag in tags:
                # Relationship between users and tags
                addValueToMat(user_tags,u,tag,1)
                # Relationship between users and items
                addValueToMat(user_items,u,i,1)
                # Relationship between tags and items
                addValueToMat(tag_items,tag,i,1)
    print('User_tags,user_items,tag_items initialization completed.')

# Top-N recommendation for a user
def recommend(user, N):
    recommend_item = dict()
    tagged_items = user_items[user]
    for tag,utn in user_tags[user].items():
        for item,tin in tag_items[tag].items():
            # Not recommended if a user has already tagged an item
            if item in tagged_items:
                continue
            if item not in recommend_item:
                recommend_item[item] = utn * tin
            else:
                recommend_item[item] = recommend_item[item]+utn*tin
    # Sort by score, from largest to smallest, and keep the top N
    return sorted(recommend_item.items(), key=operator.itemgetter(1), reverse=True)[:N]

# Use the test set to calculate precision and recall
def precisionAndRecall(N):
    hit = 0
    h_recall = 0
    h_precision = 0
    for user,items in test_data.items():
        if user not in train_data:
            continue
        rank = recommend(user,N)
        for item,rui in rank:
            if item in items:
                hit = hit+1
        h_recall = h_recall +len(items)
        h_precision = h_precision+N

    # Return precision and recall
    return (hit/(h_precision*1.0)), (hit/(h_recall*1.0))

# Use test_data to evaluate the recommendation results
def testRecommend():
    print('The recommendation results are evaluated as follows:')
    print("%3s %10s %10s" % ('N', 'Precision', 'Recall'))
    for n in [5, 10, 20, 40, 60, 80, 100]:
        precision, recall = precisionAndRecall(n)
        print("%3d %10.3f%% %10.3f%%" % (n, precision * 100, recall * 100))


load_data()
train_test_split(0.2)
initStat()
testRecommend()