Author | GUEST   Compiled by | VK   Source | Analytics Vidhya

Introduction

Recommendation systems are becoming increasingly important in today’s busy world. People are always looking for the products and services that suit them best, and recommendation systems help them make a good choice without spending much cognitive effort.

In this blog, we will learn the basics of recommendation systems and learn how to build a movie recommendation system using collaborative filtering by implementing the K-nearest neighbor algorithm. We will also predict the rating of a given movie based on its neighbors and compare it to the actual rating.

Types of recommendation systems

Recommendation systems can be broadly divided into three categories:

  1. Collaborative filtering

  2. Content-based filtering

  3. Hybrid recommendation system

Collaborative filtering

This filtering method is usually based on collecting and analyzing information about users’ behaviors, activities, or preferences, and predicting what they will like based on their similarity to other users. A major advantage of the collaborative filtering approach is that it does not rely on machine-parsed content, so it can accurately recommend complex items such as movies without the need to “understand” the items themselves.

There are several types of collaborative filtering algorithms:

  • User-user collaborative filtering: Search for customers similar to the target customer and offer products based on what those similar customers chose.
  • Item-item collaborative filtering: Very similar to the previous algorithm, but instead of finding similar customers, we try to find similar items. Once we have the item-item similarity matrix, we can easily recommend similar products to customers who buy from the store.
  • Other algorithms: There are other methods, such as market basket analysis, which looks for combinations of items that frequently occur together in transactions.
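To make the user-user idea concrete, here is a minimal sketch. It assumes a tiny hand-made ratings dictionary (the users, titles, and ratings are hypothetical, not taken from the TMDb data), finds the most similar other user by cosine similarity over co-rated items, and suggests what that user rated but the target user has not seen.

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"Inception": 5, "Interstellar": 4, "Up": 1},
    "bob": {"Inception": 4, "Interstellar": 5, "Tenet": 5},
    "carol": {"Up": 5, "Coco": 4, "Inception": 1},
}

def user_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    common = set(ratings[a]) & set(ratings[b])
    if not common:
        return 0.0
    dot = sum(ratings[a][i] * ratings[b][i] for i in common)
    norm_a = math.sqrt(sum(ratings[a][i] ** 2 for i in common))
    norm_b = math.sqrt(sum(ratings[b][i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def recommend(user):
    """Return the most similar other user and the items only they have rated."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: user_similarity(user, u))
    unseen = [i for i in ratings[nearest] if i not in ratings[user]]
    return nearest, unseen

print(recommend("alice"))
```

A real system would weight several neighbours and handle users with very few co-rated items; this sketch only shows the core lookup.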

Content-based filtering

These filtering methods are based on the description of an item and a profile of the user’s preferences. In content-based recommendation systems, keywords are used to describe items, and a user profile is built to describe the types of items the user prefers. In other words, these algorithms try to recommend products that are similar to what the user has liked in the past.
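As a rough illustration of content-based filtering, the sketch below describes each item by a set of keywords, builds a user profile from the items the user liked, and recommends the unseen item whose keywords overlap the profile most (Jaccard overlap). The titles and keyword sets are made up for the example.

```python
# Hypothetical item descriptions as keyword sets
items = {
    "The Avengers": {"superhero", "marvel", "team"},
    "Ant-Man": {"superhero", "marvel", "heist"},
    "Up": {"animation", "balloon", "adventure"},
}

def build_profile(liked):
    """User profile: the union of keywords of all liked items."""
    profile = set()
    for title in liked:
        profile |= items[title]
    return profile

def jaccard(a, b):
    """Overlap between two keyword sets, from 0 (disjoint) to 1 (identical)."""
    return len(a & b) / len(a | b)

def recommend(liked):
    """Return the not-yet-liked item closest to the user's profile."""
    profile = build_profile(liked)
    candidates = [t for t in items if t not in liked]
    return max(candidates, key=lambda t: jaccard(profile, items[t]))

print(recommend(["The Avengers"]))
```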

Hybrid recommendation system

Recent research suggests that a hybrid approach combining collaborative filtering and content-based filtering may be more effective in some cases. A hybrid approach can be achieved in a number of ways: making content-based and collaborative predictions separately and then combining them, adding content-based features to a collaborative approach (or vice versa), or unifying the approaches into a single model.
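The first of these strategies (separate predictions, then combine) can be sketched as a simple weighted blend. The function name, the weight, and the candidate scores below are all hypothetical; real systems typically learn the weighting rather than fixing it by hand.

```python
def hybrid_score(collab_score, content_score, weight=0.5):
    """Blend two scores on the same scale; `weight` is the share given to the collaborative score."""
    return weight * collab_score + (1 - weight) * content_score

# Hypothetical (collaborative, content-based) scores per candidate
candidates = {"Movie A": (0.9, 0.2), "Movie B": (0.6, 0.7)}

# Rank candidates by their blended score, best first
ranked = sorted(candidates, key=lambda t: hybrid_score(*candidates[t]), reverse=True)
print(ranked)
```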

Netflix is a good example of a hybrid recommendation system. The site makes recommendations by comparing the viewing and search habits of similar users (collaborative filtering) and by offering movies with the same characteristics as movies that users rate highly (content-based filtering).

Now that we have a basic intuition about recommendation systems, let’s start by building a simple movie recommendation system in Python.

The complete code, the data sets, and all illustrations can be found in this Python notebook: www.kaggle.com/heeraldedhi…

TMDb: The Movie Database

The Movie Database (TMDb) is a community-built movie and TV database that holds a large amount of data about movies and TV shows. Its statistics are available at: www.themoviedb.org/

For simplicity and ease of calculation, I used a subset of this huge dataset, the TMDb 5000 dataset. It has information for 5000 movies, split into 2 CSV files.

  • **tmdb_5000_movies.csv:** Contains scores, titles, release dates, genres, and more.
  • **tmdb_5000_credits.csv:** Contains the cast and crew information for each movie.

The link to the dataset is here: www.kaggle.com/tmdb/tmdb-m…

Python implementation

Step 1- Import the data set

Import the required Python libraries, such as Pandas, NumPy, Seaborn, and Matplotlib. The CSV files are then loaded with Pandas’ built-in read_csv() function.

import pandas as pd

movies = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_movies.csv')
credits = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_credits.csv')

Step 2- Data exploration and cleanup

We will first use the head() and describe() functions to look at the values and structure of the dataset, and then continue to clean up the data.

movies.head()

movies.describe()

Similarly, we can inspect the credits data frame in the same way.

Looking at the data set, we can see that genres, keywords, production_companies, production_countries, and spoken_languages are all in JSON format. Similarly, in the other CSV file, cast and crew are in JSON format. Now let’s convert these columns into a format that is easy to read and interpret. We will convert them to strings, and later to lists for easier interpretation.

The JSON format resembles dictionary (key: value) pairs embedded in a string. In general, parsing data is computationally expensive and time consuming. Fortunately, this data set does not have a very complex structure. One thing the columns have in common is that each entry has a name key holding the value we need to collect. The easiest way to do this is to parse the JSON and look up the name key in each entry. Once we find the name key, we store its value in a list and replace the JSON with that list.

But we can’t parse the JSON directly, because it must be decoded first. To do this, we use json.loads to decode it. We can then parse through the list to find the desired values. Let’s look at the correct syntax below.
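As a small illustration of this decode-then-collect step, the sketch below uses a made-up string in the same shape as the genres column (the helper name extract_names and the sample value are assumptions for the example, not part of the original notebook).

```python
import json

# A made-up value in the same shape as the dataset's genres column
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def extract_names(json_string):
    """Decode a JSON array of objects and return each object's "name" value."""
    return [entry["name"] for entry in json.loads(json_string)]

print(extract_names(raw))
```

An empty JSON array simply yields an empty list, so movies without any genre entries need no special handling.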

import json

# Convert the genres column from JSON to a string holding a list of genre names
movies['genres'] = movies['genres'].apply(json.loads)
for index, genre_dicts in zip(movies.index, movies['genres']):
    names = [d['name'] for d in genre_dicts]  # "name" contains the name of the genre
    movies.loc[index, 'genres'] = str(names)

In a similar fashion, we’ll convert the JSON to a list of strings for the columns keywords, production_companies, cast, and crew. Any row can then be inspected with movies.iloc[index] to check the conversion.

Step 3- Merge two CSV files

We’ll merge the movies and credits data frames and select the columns we need, so that we have a single movies data frame to work with.

movies = movies.merge(credits, left_on='id', right_on='movie_id', how='left')
movies = movies[['id', 'original_title', 'genres', 'cast', 'vote_average', 'director', 'keywords']]

We can then check the size and attributes of the merged movies data frame.

Step 4 – Use the “Genres” column

We will clean up the genres column to obtain a list of genres for each movie.

# Strip the brackets, spaces, and quotes from the stringified list, then split it into a list
movies['genres'] = movies['genres'].str.strip('[]').str.replace(' ', '').str.replace("'", '')
movies['genres'] = movies['genres'].str.split(',')

Let’s plot the genres by how often they occur to get a sense of their popularity.

import matplotlib.pyplot as plt
import seaborn as sns

plt.subplots(figsize=(12, 10))
list1 = []
for i in movies['genres']:
    list1.extend(i)
# Count how often each genre occurs and keep the ten most frequent
top_genres = pd.Series(list1).value_counts()[:10].sort_values(ascending=True)
ax = top_genres.plot.barh(width=0.9, color=sns.color_palette('hls', 10))
for i, v in enumerate(top_genres.values):
    ax.text(.8, i, v, fontsize=12, color='white', weight='bold')
plt.title('Top Genres')
plt.show()

Drama appears to be the most popular genre, followed by Comedy.

Now let’s generate a list, genreList, that contains every unique genre mentioned in the dataset.

genreList = []
for index, row in movies.iterrows():
    genres = row["genres"]
    
    for genre in genres:
        if genre not in genreList:
            genreList.append(genre)
genreList[:10] # Now we have a list of genres

One-hot encoding

genreList now holds all the genres. But how do we know which genres each movie belongs to? Some movies are “Action”, some are “Action, Adventure”, and so on. We need to classify movies according to their genres.

Let’s create a new column in the data frame that holds binary values indicating whether each genre is present. First, let’s create a function that returns a binary list for each movie’s genres. genreList is used for the comparison.

For example, suppose we have 20 genres in the list. The function below then returns a list of 20 elements, each either 0 or 1. For a movie with genre = 'Action', the new column will contain [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].

Similarly, for “Action, Adventure” we will have [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]. Converting genres into such binary lists makes it easy to classify movies by genre.

def binary(genre_list):
    binaryList = []
    
    for genre in genreList:
        if genre in genre_list:
            binaryList.append(1)
        else:
            binaryList.append(0)
    
    return binaryList

Applying the binary() function to the genres column gives us the genres_bin column.

For the other features, such as actors, directors, and keywords, we will follow the same approach.

movies['genres_bin'] = movies['genres'].apply(lambda x: binary(x))
movies['genres_bin'].head()
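The encoding described above can be checked on a toy vocabulary. In this sketch the four-genre list and the helper name to_binary are assumptions for illustration; the real genreList is built from the dataset and is longer.

```python
# Toy genre vocabulary (the real genreList is built from the dataset)
genre_vocab = ["Action", "Adventure", "Comedy", "Drama"]

def to_binary(movie_genres, vocab):
    """Return one 0/1 flag per vocabulary entry, in vocabulary order."""
    return [1 if g in movie_genres else 0 for g in vocab]

print(to_binary(["Action"], genre_vocab))
print(to_binary(["Adventure", "Action"], genre_vocab))
```

Because the flags follow the vocabulary order rather than the order of the movie's own genre list, two movies with the same genres always get the same vector.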

Step 5- Use the Cast column

Let’s plot the actors who appear in the most movies.

# Assumes each entry of movies['cast'] is already a list of actor names
plt.subplots(figsize=(12, 10))
list1 = []
for i in movies['cast']:
    list1.extend(i)
# Count appearances per actor and keep the fifteen most frequent
top_actors = pd.Series(list1).value_counts()[:15].sort_values(ascending=True)
ax = top_actors.plot.barh(width=0.9, color=sns.color_palette('muted', 40))
for i, v in enumerate(top_actors.values):
    ax.text(.8, i, v, fontsize=10, color='white', weight='bold')
plt.title('Actors with highest appearance')
plt.show()

Samuel L. Jackson, aka Nick Fury in “The Avengers,” appears in the most movies. I originally thought Morgan Freeman might be the actor who made the most movies, but numbers are better than assumptions!

When I first created the list of all actors, it had about 50,000 unique values, because many movies list about 15-20 actors.

But do we need all of them? The answer is no. We only need the actors who contribute the most to a movie. The Dark Knight movies have a large cast, but we only need the main cast: Christian Bale, Michael Caine, Heath Ledger. I have selected the four main actors from each film.

One question that might come to mind is how you determine the importance of an actor in a movie. Fortunately, the order of actors in the JSON format is based on their contribution to the movie.

Let’s see how we do this and create a column “cast_bin”

# Keep only the four top-billed actors per movie, sorted alphabetically.
# Assumes each entry of movies['cast'] is a list of actor names.
movies['cast'] = movies['cast'].apply(lambda names: sorted(names[:4]))

# Collect every unique actor name
castList = []
for index, row in movies.iterrows():
    for actor in row['cast']:
        if actor not in castList:
            castList.append(actor)

# binary() as defined above compares against genreList, so the actor
# vector is built against castList directly
movies['cast_bin'] = movies['cast'].apply(lambda x: [1 if a in x else 0 for a in castList])
movies['cast_bin'].head()

Step 6 – Use the “Directors” column

Let’s plot the directors who have made the most movies.

def xstr(s):
    # Treat a missing director as an empty string
    if s is None:
        return ''
    return str(s)

movies['director'] = movies['director'].apply(xstr)

plt.subplots(figsize=(12, 10))
# Ignore movies without a director and keep the ten most frequent names
top_directors = movies[movies['director'] != ''].director.value_counts()[:10].sort_values(ascending=True)
ax = top_directors.plot.barh(width=0.9, color=sns.color_palette('muted', 40))
for i, v in enumerate(top_directors.values):
    ax.text(.5, i, v, fontsize=12, color='white', weight='bold')
plt.title('Directors with highest movies')
plt.show()

We create a new column “director_bin”, just as we did before

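directorList = []
for i in movies['director']:
    if i not in directorList:
        directorList.append(i)

# binary() as defined above compares against genreList, so the director
# vector is built against directorList directly (one director per movie)
movies['director_bin'] = movies['director'].apply(lambda x: [1 if d == x else 0 for d in directorList])
movies.head()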

With all of this done, the movies data frame now contains the new genres_bin, cast_bin, and director_bin columns.

Step 7- Use the Keywords column

Keywords or tags contain a lot of information about a movie and are a key feature for finding similar movies. For example, movies like The Avengers and Ant-Man may have keywords in common, such as superhero or Marvel.

To analyze the keywords, we will take a different approach and plot a word cloud to get a better intuition:

from wordcloud import WordCloud, STOPWORDS
import nltk
from nltk.corpus import stopwords

# Requires the NLTK 'stopwords' corpus and tokenizer data to be downloaded
plt.subplots(figsize=(12, 12))
stop_words = set(stopwords.words('english'))
stop_words.update([',', ';', '!', '?', '.', '(', ')', '$', '#', '+', ':', '...', ' ', ''])

# Tokenize the keyword strings and flatten them into one list of words
words = movies['keywords'].dropna().apply(nltk.word_tokenize)
word = []
for i in words:
    word.extend(i)
word = pd.Series(word)
word = [i for i in word.str.lower() if i not in stop_words]
wc = WordCloud(background_color="black", max_words=2000, stopwords=STOPWORDS, max_font_size=60, width=1000, height=1000)
wc.generate(" ".join(word))
plt.imshow(wc)
plt.axis('off')
fig = plt.gcf()
fig.set_size_inches(10, 10)
plt.show()

Above is a word cloud that shows the main keywords or tags that describe the movie

We build the words_bin column from the keywords as follows:

# Strip brackets, spaces, and quotes, then split into a sorted list of keywords
movies['keywords'] = movies['keywords'].str.strip('[]').str.replace(' ', '').str.replace("'", '').str.replace('"', '')
movies['keywords'] = movies['keywords'].str.split(',').apply(sorted)

# Collect every unique keyword
words_list = []
for index, row in movies.iterrows():
    for word in row['keywords']:
        if word not in words_list:
            words_list.append(word)

# binary() as defined above compares against genreList, so the keyword
# vector is built against words_list directly
movies['words_bin'] = movies['keywords'].apply(lambda x: [1 if w in x else 0 for w in words_list])

# Remove movies with a score of 0 and movies without a director name
movies = movies[movies['vote_average'] != 0]
movies = movies[movies['director'].str.strip() != '']

Step 8- Similarities between movies

We will use cosine similarity to find similarities between two movies. How does cosine similarity work?

Let’s say we have two vectors. If the vectors are nearly parallel, that is, the angle between them is close to 0, we can say that they are “similar”, because cos(0°) = 1. However, if the vectors are orthogonal, we can say that they are independent, or not “similar”, because cos(90°) = 0.
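The two cases above can be verified with a few lines of plain Python. This is a minimal sketch with hypothetical helper names; note that scipy's spatial.distance.cosine, used in the code that follows, returns the cosine distance, which is 1 minus the similarity computed here.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Distance form: 0 for parallel vectors, 1 for orthogonal ones."""
    return 1 - cosine_similarity(u, v)

parallel = cosine_similarity([1, 1, 0], [2, 2, 0])    # same direction
orthogonal = cosine_similarity([1, 0, 0], [0, 1, 0])  # nothing in common
print(parallel, orthogonal)
```

For the binary vectors built in the previous steps, orthogonal simply means the two movies share no genre, actor, director, or keyword in that feature.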

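A more detailed explanation is available here: blog.christianperone.com/2013/09/mac…

Below I define a function, Similarity(), that measures how similar two movies are by summing the cosine distances between their genre, cast, director, and keyword vectors.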

from scipy import spatial

def Similarity(movieId1, movieId2):
    # movieId1 and movieId2 are positional row indices into movies
    a = movies.iloc[movieId1]
    b = movies.iloc[movieId2]

    genresA = a['genres_bin']
    genresB = b['genres_bin']
    genreDistance = spatial.distance.cosine(genresA, genresB)

    scoreA = a['cast_bin']
    scoreB = b['cast_bin']
    scoreDistance = spatial.distance.cosine(scoreA, scoreB)

    directA = a['director_bin']
    directB = b['director_bin']
    directDistance = spatial.distance.cosine(directA, directB)

    wordsA = a['words_bin']
    wordsB = b['words_bin']
    wordsDistance = spatial.distance.cosine(wordsA, wordsB)

    # Sum of the four cosine distances: a lower total means more similar movies
    return genreDistance + directDistance + scoreDistance + wordsDistance

Let’s check the similarity between two random movies.

We see that the distance is about 2.068, which is high. The greater the distance, the less similar the movies are. Let’s see what these random movies are.

Clearly, The Dark Knight Rises and How to Train Your Dragon 2 are very different movies, so the distance is large.
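To put such a number in context: each cosine distance between non-negative binary vectors lies between 0 and 1, so the sum over four features lies between 0 and 4. The sketch below reproduces the summing idea on two hypothetical toy movies with three-element vectors (the data and the helper total_distance are assumptions, not taken from the dataset).

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norms

FEATURES = ["genres_bin", "cast_bin", "director_bin", "words_bin"]

def total_distance(movie_a, movie_b):
    """Sum the per-feature cosine distances, as Similarity() does."""
    return sum(cosine_distance(movie_a[f], movie_b[f]) for f in FEATURES)

# Hypothetical toy movies: same genres and director, different cast,
# partly overlapping keywords
movie_a = {"genres_bin": [1, 1, 0], "cast_bin": [1, 0, 0],
           "director_bin": [1, 0, 0], "words_bin": [1, 1, 0]}
movie_b = {"genres_bin": [1, 1, 0], "cast_bin": [0, 1, 0],
           "director_bin": [1, 0, 0], "words_bin": [1, 0, 1]}

print(round(total_distance(movie_a, movie_b), 3))
```

Here the total is 1.5: the cast contributes a full 1, the keywords 0.5, and the identical genres and director contribute nothing.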

Step 9- Score predictor

Now that everything is in place, we can set up a score predictor. The main function is Similarity(), which computes the distance between movies and lets us find the 10 most similar ones. These 10 movies help predict the score of the movie we are interested in: we average the scores of the similar movies and use that as the predicted score.

The similarity between movies depends on our newly created columns containing the binary lists. We know that features such as the director or the actors play a very important role in a film’s success. We tend to assume that David Fincher and Chris Nolan movies will do well. They also have a higher chance of success when they work with their favorite actors, who have brought them success before, and on their favorite subject matter. Using these observations, let’s try to build our score predictor.
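The averaging step described above can be sketched independently of the data frame. In this minimal example the candidate tuples (title, distance, rating) and the function name predict_rating are hypothetical; the full function that follows derives the distances from Similarity() instead.

```python
def predict_rating(neighbors, k):
    """neighbors: list of (title, distance, rating) tuples.

    Returns the mean rating of the k nearest neighbours and their titles.
    """
    nearest = sorted(neighbors, key=lambda n: n[1])[:k]
    predicted = sum(rating for _, _, rating in nearest) / len(nearest)
    return predicted, [title for title, _, _ in nearest]

# Hypothetical candidates: (title, distance to the query movie, rating)
candidates = [("A", 0.4, 8.0), ("B", 2.1, 5.0), ("C", 0.9, 7.0), ("D", 1.5, 6.0)]

rating, titles = predict_rating(candidates, k=2)
print(rating, titles)
```

A common variation is to weight each neighbour's rating by the inverse of its distance, so that closer movies count for more than the plain mean allows.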

import operator

# Assumes movies has a 'new_id' column equal to each row's position,
# e.g. movies['new_id'] = range(len(movies)), because Similarity() uses iloc
def predict_score():
    name = input('Enter a movie title: ')
    new_movie = movies[movies['original_title'].str.contains(name)].iloc[0].to_frame().T
    print('Selected Movie: ', new_movie.original_title.values[0])

    def getNeighbors(baseMovie, K):
        distances = []
        base_id = baseMovie['new_id'].values[0]

        for index, movie in movies.iterrows():
            if movie['new_id'] != base_id:
                dist = Similarity(base_id, movie['new_id'])
                distances.append((movie['new_id'], dist))

        # Smallest distance first, then keep the K nearest neighbors
        distances.sort(key=operator.itemgetter(1))
        return distances[:K]

    K = 10
    avgRating = 0
    neighbors = getNeighbors(new_movie, K)

    print('\nRecommended Movies: \n')
    for neighbor in neighbors:
        row = movies.iloc[neighbor[0]]
        avgRating = avgRating + row['vote_average']
        print(row['original_title'] + " | Genres: " + str(row['genres']).strip('[]').replace(' ', '') + " | Rating: " + str(row['vote_average']))

    print('\n')
    avgRating = avgRating / K
    print('The predicted rating for %s is: %f' % (new_movie['original_title'].values[0], avgRating))
    print('The actual rating for %s is %f' % (new_movie['original_title'].values[0], new_movie['vote_average'].values[0]))

Now simply run the function below and enter your favorite movie to get 10 similar movies and its predicted rating.

predict_score()

This completes the implementation of a movie recommendation system based on the K-nearest neighbors algorithm.

K value

In this project, I chose an arbitrary value of K=10.

But in other applications of KNN, finding a good value of K is not easy. A small K means that noise has a greater influence on the result, while a large K makes the computation more expensive. Data scientists usually choose an odd K when the number of classes is 2. Another simple approach is to set K = sqrt(n), where n is the number of samples.
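The square-root heuristic, combined with the preference for odd values, can be written as a short helper. This is only a sketch of the rule of thumb: the function name and the force_odd flag are hypothetical, and in practice K is usually tuned by validating several candidates.

```python
import math

def rule_of_thumb_k(n_samples, force_odd=True):
    """Return roughly sqrt(n_samples), bumped to the next odd number if requested."""
    k = max(1, round(math.sqrt(n_samples)))
    if force_odd and k % 2 == 0:
        k += 1
    return k

print(rule_of_thumb_k(100))   # sqrt is 10, bumped to 11
print(rule_of_thumb_k(5000))  # sqrt is about 70.7, rounded to 71
```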

The complete code, the data sets, and all illustrations can be found in this Python notebook: www.kaggle.com/heeraldedhi…

Further reading

  1. Recommender systems: en.wikipedia.org/wiki/Recomm…

  2. Machine learning with the K-nearest neighbors algorithm: towardsdatascience.com/machine-lea…

  3. Recommender systems with Python, Part 2: collaborative filtering (K-nearest neighbors algorithm): heartbeat.fritz.ai/recommender…

  4. What is cosine similarity?: deepai.org/machine-lea…

  5. How to find the optimal value of K in KNN?: towardsdatascience.com/how-to-find…

Original article: www.analyticsvidhya.com/blog/2020/0…
