“This is the 15th day of my participation in the November Gwen Challenge. See details of the event: The Last Gwen Challenge 2021”.

Preface

This post uses Python to crawl IMDB movie data. Without further ado, let's get started.

Development tools

Python version: 3.6.4

Related modules:

requests module;

random module;

bs4 module;

and some modules that come with Python.

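Environment setup

Install Python and add it to your PATH environment variable, then use pip to install the required modules. The code below also imports chardet and pandas and uses the lxml parser, so install those as well.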

Why IMDB? First, Douban is the usual starting point for learning web crawling, and experts have already analyzed it thoroughly from every angle. Second, as the Chinese film industry develops, it is worth turning our attention to the international market and using data analysis to learn which films foreign audiences care about.

Approach

IMDB top250 home page

IMDB Movie Details Page (1)

IMDB Movie Details Page (2)

Based on the page structure above, we only need the unique detail-page ID of each movie. From there, two hops take us to detail page (1), which gives the country and genre, and detail page (2), which gives the scores and the number of voters. The crawling approach is summarized in a mind map.
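The two hops described above can be sketched as follows. This is a minimal sketch under assumptions, not the original code: it assumes Top 250 links have the form `/title/tt.../`, that the rating breakdown lives at a `ratings` sub-path of the detail page, and that the site root is `https://www.imdb.com`. The helper name `detail_urls` is hypothetical.

```python
import re

BASE_URL = "https://www.imdb.com"  # assumed site root, not given in the post


def detail_urls(href):
    # Pull the unique title ID (e.g. tt0111161) out of a Top 250 link
    match = re.search(r"/title/(tt\d+)", href)
    if match is None:
        raise ValueError("no title ID found in link: " + href)
    title_id = match.group(1)
    detail = BASE_URL + "/title/" + title_id + "/"
    # Assumed location of the per-group rating breakdown
    ratings = detail + "ratings"
    return detail, ratings


page1, page2 = detail_urls("/title/tt0111161/?ref_=chart")
print(page1)
print(page2)
```

Matching the ID with a regular expression is a little more robust than slicing a fixed number of characters, since it still works if the ID length or the query string changes.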

The crawler code

IMDB top250 home page

# Import libraries -------------------------------------------
from urllib import request
from chardet import detect
from bs4 import BeautifulSoup
import pandas as pd
import time
import random

# Base URL for detail-page links (assumed: url2 is not defined in the original)
url2 = 'https://www.imdb.com'

# Get the page source and build the soup object -------------------------
def getSoup(url):
    with request.urlopen(url) as fp:
        byt = fp.read()
        det = detect(byt)
        # Pause 1-4 seconds between requests (randrange needs integer arguments)
        time.sleep(random.randrange(1, 5))
        # Fall back to UTF-8 if chardet cannot detect the encoding
        return BeautifulSoup(byt.decode(det['encoding'] or 'utf-8'), 'lxml')

# Parse data -------------------------------------------
def getData(soup):
    # Get the scores
    ol = soup.find('tbody', attrs={'class': 'lister-list'})
    score_info = ol.find_all('td', attrs={'class': 'imdbRating'})
    film_scores = [k.text.replace('\n', ' ') for k in score_info]
    # Get movie title, director/actors, release year, detail-page link
    film_info = ol.find_all('td', attrs={'class': 'titleColumn'})
    film_names = [k.find('a').text for k in film_info]
    film_actors = [k.find('a').attrs['title'] for k in film_info]
    film_years = [k.find('span').text[1:5] for k in film_info]
    # The first 17 characters of the href are '/title/tt.......' plus the trailing slash
    next_nurl = [url2 + k.find('a').attrs['href'][0:17] for k in film_info]
    data = pd.DataFrame({'name': film_names, 'year': film_years, 'score': film_scores,
                         'actors': film_actors, 'newurl': next_nurl})
    return data
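The parser above keeps scores and years as raw text. Before any analysis they need to become numbers. The following is a minimal sketch of that cleaning step, assuming the score text looks like a number padded with newlines and the year text looks like `(1994)`; the helper names `clean_score` and `clean_year` are hypothetical and not part of the original crawler.

```python
def clean_score(raw):
    # Strip surrounding whitespace and newlines, then convert to float
    return float(raw.strip())


def clean_year(raw):
    # Strip whitespace and the surrounding parentheses, then convert to int
    return int(raw.strip().strip("()"))


print(clean_score("\n9.2\n"))
print(clean_year("(1994)"))
```

Stripping the parentheses instead of slicing `[1:5]` avoids silently producing a wrong year if the text ever carries extra padding.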

IMDB Top250 movie details page

# Get more data from the detail pages -------------------------------------------
def nextUrl(detail, detail1):
    # Get the movie's country
    detail_list = detail.find('div', attrs={'id': 'titleDetails'}).find_all('div', attrs={'class': 'txt-block'})
    detail_str = [k.text.replace('\n', ' ') for k in detail_list]
    # Keep only the "key: value" blocks
    detail_str = [k for k in detail_str if k.find(':') >= 0]
    detail_dict = {k.split(':')[0]: k.split(':')[1] for k in detail_str}
    country = detail_dict['Country']
    # Get the movie genre
    detail_list1 = detail.find('div', attrs={'class': 'title_wrapper'}).find_all('div', attrs={'class': 'subtext'})
    detail_str1 = [k.find('a').text for k in detail_list1]
    movie_type = pd.DataFrame({'Type': detail_str1})
    # Get detailed movie ratings and number of voters by group
    div_list = detail1.find_all('td', attrs={'align': 'center'})
    value = [k.find('div', attrs={'class': 'bigcell'}).text.strip() for k in div_list]
    num = [k.find('div', attrs={'class': 'smallcell'}).text.strip() for k in div_list]
    scores = pd.DataFrame({'value': value, 'num': num})
    return country, movie_type, scores
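The dictionary built from the `txt-block` entries splits each block on every colon, which drops part of any value that itself contains a colon. The sketch below shows one possible alternative that splits on the first colon only. It works on plain strings, so it can be tried without fetching a page; the helper name `parse_details` and the sample strings are hypothetical.

```python
def parse_details(blocks):
    details = {}
    for text in blocks:
        # Collapse newlines and repeated spaces into single spaces
        flat = " ".join(text.split())
        if ":" not in flat:
            continue  # skip blocks that are not "key: value" pairs
        # Split on the first colon only, so values may contain colons
        key, value = flat.split(":", 1)
        details[key.strip()] = value.strip()
    return details


sample = [
    "\nCountry:\nUSA\n",
    "Official Sites: https://example.com",
    "See more",
]
info = parse_details(sample)
print(info["Country"])
print(info["Official Sites"])
```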

Results

Data analysis

Genre comparison

Let’s take a look at the percentage of films by genre:

The top three genres among the Top 250 movies are comedy, crime and action.

A tense, exciting atmosphere and a well-paced plot make for the viewing experiences that fans remember most.
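A genre share like the one described above can be computed with a simple count. This is a minimal sketch on made-up sample data, not the author's analysis code. It assumes each film carries a list of genre tags and counts every tag, so the percentages are shares of tags rather than of films. The function name `genre_share` is hypothetical.

```python
from collections import Counter


def genre_share(films):
    # Count every genre tag across all films
    counts = Counter(g for film in films for g in film["genres"])
    total = sum(counts.values())
    # Percentage of all tags, most common genre first
    return {g: round(100 * n / total, 1) for g, n in counts.most_common()}


films = [  # made-up sample rows
    {"name": "Film A", "genres": ["Crime", "Drama"]},
    {"name": "Film B", "genres": ["Action", "Crime"]},
    {"name": "Film C", "genres": ["Comedy"]},
    {"name": "Film D", "genres": ["Action"]},
]
shares = genre_share(films)
for genre, pct in shares.items():
    print(genre, pct)
```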

Next, let's take a look at the scores of each genre.

By genre, the high scores of Westerns may be due to their small audience and to devoted fans who rate them generously. Crime, action, adventure, mystery and horror also score high.
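When comparing average scores across genres, it helps to keep the number of films next to each mean, since a high average built on a handful of films is less reliable. The following is a minimal sketch on made-up numbers; the function name `mean_score_by_genre` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_genre(rows):
    # Group scores by genre
    buckets = defaultdict(list)
    for genre, score in rows:
        buckets[genre].append(score)
    # Return (mean score, number of films) for each genre
    return {g: (round(mean(s), 2), len(s)) for g, s in buckets.items()}


rows = [  # made-up (genre, score) pairs
    ("Western", 8.8), ("Western", 8.4),
    ("Crime", 8.5), ("Crime", 8.1), ("Crime", 8.3),
]
summary = mean_score_by_genre(rows)
for genre, (avg, n) in summary.items():
    print(genre, avg, n)
```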

Year comparison

First, let's look at the release years of the Top 250 films.

1957, 1995 and 2014 each contribute many films to the Top 250 list, and the number of listed films per year rises noticeably after 1975, which may reflect the growing maturity of the film industry.

As for 1995, film fans may know that it marked the 100th anniversary of cinema, and many gifted filmmakers released major works around then as a tribute. Familiar titles from this period include Se7en (1995) and, a year earlier, The Shawshank Redemption, Forrest Gump, Pulp Fiction, Four Weddings and a Funeral and The Lion King (all 1994).

Next, let's look at the scores of films by year.

Scores show no obvious upward or downward trend over time, which suggests that films do not lose their value as they age. For a film, technology is not what matters most; emotional resonance carries more weight. Which movie is the best? The answer lies within each of us.

Country comparison

Let's take a look at the share of each country and region in the Top 250.

It's an interesting statistic, a bit like the Nobel Prize: American films take half of the pie and the rest of the world shares the other half. Next on the list are Britain, France, Japan and Germany. From China, the only film on the list is In the Mood for Love.

If Western mainstream values were the cause, neighboring Japan, also a representative of Eastern culture, would not have 16 films on the list. So Western values cannot be the main reason why so few Chinese films appear. In recent years China has released a number of high-quality films, such as Big Fish and Begonia and the recently released The Wandering Earth, but their performance in the international market remains lackluster. I believe that movies share a common language and that universal values really do exist. How to build an international film industry and tell stories to the people of the world is the next topic that Chinese filmmakers need to explore.

Director comparison

Let's take a look at the directors who appear most often on the Top 250 list.

Here are the directors who made the list. Since the names of foreign directors may be unfamiliar, a table matching directors to their films is provided here. It is worth noting that Ridley Scott, James Cameron and David Fincher directed Alien, Aliens and Alien 3 respectively. One franchise producing three directors on this list shows how influential the series is.

Audience comparison

First, let's look at how different groups of people score.

By gender, men are more likely than women to give high marks. By age, teenagers are the most likely to give high scores, scores grow more alike as age increases, and people over 45 give the lowest scores. Is it that, having seen so much, a hardened heart is harder to move? Or does it take broad experience to evaluate a film fairly and objectively? Perhaps the subject deserves a study of its own, such as "A Scientific Approach to the Age Allocation of Film Festival Judges".

However, knowing how each group scores is not enough; we also need to know what share of voters each group represents.

Although older viewers give lower ratings, there is no need to worry too much about their effect on a movie's reputation. The data tells us that a movie earns a good reputation by appealing to the 18-29 and 30-44 age groups. In recent years, war action movies such as Wolf Warrior and Operation Red Sea have won acclaim, which offers a glimpse of how this rating dynamic works.

The relationship between genre, age and score

First, we use a heat map to look at how different groups score different genres.

Different age groups prefer different genres. For example, males and females under 18 show a strong interest in mysteries and Westerns, while males and females over 45 prefer science fiction and film noir respectively.
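The table behind such a heat map is a genre-by-group grid of mean scores. The sketch below builds that grid with plain dictionaries on made-up records; it is an illustration of the idea, not the author's plotting code. The function name `pivot_mean` and the group labels are hypothetical.

```python
from collections import defaultdict


def pivot_mean(records):
    # Accumulate [sum of scores, count] for each (genre, group) cell
    cells = defaultdict(lambda: [0.0, 0])
    for genre, group, score in records:
        cell = cells[(genre, group)]
        cell[0] += score
        cell[1] += 1
    # Turn the cells into a nested {genre: {group: mean}} table
    table = defaultdict(dict)
    for (genre, group), (total, n) in cells.items():
        table[genre][group] = round(total / n, 2)
    return {genre: dict(row) for genre, row in table.items()}


records = [  # made-up (genre, group, score) rows
    ("Mystery", "Males <18", 9.0),
    ("Mystery", "Males <18", 8.6),
    ("Mystery", "Females 45+", 8.0),
    ("Sci-Fi", "Males 45+", 8.2),
]
grid = pivot_mean(records)
for genre, row in grid.items():
    print(genre, row)
```

Each inner dictionary is one row of the heat map; a plotting library can then color the cells by value.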

Scores also need to be analyzed together with each group's share of voters.

This time we have refined the data down to each age group. Combining each group's share with its scores, we give recommended movies from the Top 250 list for each age group below.
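Combining a group's score with its share of voters amounts to a vote-weighted average. As a minimal sketch on made-up numbers (the function name `weighted_score` and the figures are hypothetical), the overall score is the sum of each group's score times its votes, divided by the total votes:

```python
def weighted_score(groups):
    # groups maps a label to (mean score, number of votes)
    total_votes = sum(votes for _, votes in groups.values())
    weighted_sum = sum(score * votes for score, votes in groups.values())
    return weighted_sum / total_votes


groups = {  # made-up scores and vote counts per age group
    "<18": (9.0, 100),
    "18-29": (8.5, 500),
    "30-44": (8.3, 300),
    "45+": (8.0, 100),
}
overall = weighted_score(groups)
print(round(overall, 2))
```

With these sample numbers the 18-29 and 30-44 groups hold 80% of the votes, so the overall score stays close to their ratings even though the 45+ group rates lower.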

Movie recommendations

Males under 18

Males aged 18 to 29

Males aged 30 to 44

Males aged 45+

Females under 18

Females aged 18 to 29

Females aged 30 to 44

Females aged 45+

The above are movie recommendations based on the IMDB Top 250 data. Apologies if they do not match your own tastes; after all, the preferences of American audiences differ somewhat from those of Chinese audiences.