I watch a lot of movies in my spare time, and I often check Douban movie ratings. Douban is friendly to crawler enthusiasts: it has few anti-crawler measures, which makes it a good target for beginners.

This article uses BeautifulSoup, pandas, Matplotlib and other popular data analysis libraries to scrape the Douban Top 250 movie list and visualize it.

Data crawling

First, open the page and switch to debug mode (the browser's developer tools). We will see the following page:

As the figure shows, the data is plainly visible in the page. The image and text information for each film is what we want to scrape this time.

Next, click the pager at the bottom of the page. The address of the second page is movie.douban.com/top250?star… and the address of the third page is movie.douban.com/top250?star…

It’s not hard to see the pattern, so start writing code.
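As a small illustration of that pattern, the sketch below builds the address of every page. It assumes the only thing that changes is the start query parameter used by the scraper code in this article, with 25 films per page. The names BASE_URL, PAGE_SIZE, TOTAL and page_urls are made up for this example.

```python
BASE_URL = "https://movie.douban.com/top250"  # same address the scraper uses
PAGE_SIZE = 25  # films per page (assumed from the paging pattern)
TOTAL = 250     # films in the whole list


def page_urls():
    """Return the URL of every page of the Top 250 list."""
    return [f"{BASE_URL}?start={start}" for start in range(0, TOTAL, PAGE_SIZE)]


urls = page_urls()
print(len(urls))  # 10
print(urls[1])    # https://movie.douban.com/top250?start=25
```

No request is sent here; the list only shows which addresses the crawler will visit.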

import os

import matplotlib.pyplot as plt
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Assumption: a browser-like User-Agent, in case the default one is rejected
REQUEST_HEADERS = {'User-Agent': 'Mozilla/5.0'}


def movies_spider():
    """Scrape all 10 pages of the Top 250 list into a list of dicts."""
    records = []
    for start in range(0, 250, 25):  # 25 movies per page
        url = f"https://movie.douban.com/top250?start={start}"
        response = requests.get(url, headers=REQUEST_HEADERS).text
        soup = BeautifulSoup(response, 'html.parser')
        movie_list = soup.find_all(class_='item')
        for item in movie_list:
            rank = int(item.find('em').string)  # rank
            pic = item.find(class_='pic')
            href = pic.find('a')['href']  # link

            info = item.find(class_='info')
            name = info.find(class_='title').string  # movie title
            rating_num = float(info.find(class_='rating_num').string)  # score
            # The rating count sits two siblings after the score;
            # [:-3] drops the 3-character suffix that follows the number
            total = int(info.find(
                class_='rating_num').find_next_sibling().find_next_sibling().string[:-3])
            inq = info.find(class_='inq')  # one-line quote, missing for some movies
            quote = inq.get_text() if inq is not None else None
            bd_div = item.find(class_='bd')
            infos = bd_div.find('p').get_text().strip().split('\n')
            # e.g. infos = ['导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: 蒂姆·罗宾斯 Tim Robbins /...',
            #               '1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情']
            info1 = infos[0].split('\xa0\xa0\xa0')
            director = info1[0][4:]  # director, without the 4-character '导演: ' prefix
            info2 = infos[1].lstrip().split('\xa0/\xa0')
            year = info2[0][:4]
            area = info2[1]
            movie_type = info2[2]

            movie = {
                'rank': rank,
                'name': name,
                'director': director,
                'year': year,
                'area': area,
                'type': movie_type,
                'rating_num': rating_num,
                'comment_num': total,
                'quote': quote,
                'url': href
            }
            records.append(movie)
    return records

The records are stored in a CSV file and then analyzed with pandas.

headers = ['rank', 'name', 'director', 'year', 'area', 'type', 'rating_num', 'comment_num', 'quote', 'url']
rows = movies_spider()
df = pd.DataFrame(rows, columns=headers)
df.to_csv('top250.csv', index=False)

Let’s take a look at the data:
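The original screenshot of the data is not reproduced here. As a rough sketch of how the saved file could be loaded and inspected, the example below reads a tiny in-memory CSV that stands in for top250.csv. The two rows and the names Film A and Film B are placeholders, not real Douban data.

```python
import io

import pandas as pd

# Two hand-written rows standing in for top250.csv (placeholder values only)
sample_csv = io.StringIO(
    "rank,name,year,rating_num,comment_num\n"
    "1,Film A,1994,9.6,1000000\n"
    "2,Film B,1993,9.5,800000\n"
)

df = pd.read_csv(sample_csv)
print(df.dtypes)  # shows which columns were parsed as numbers
print(df.head())  # first rows of the table
```

With the real file, the same two calls would be pd.read_csv('top250.csv') followed by df.head().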

Data analysis and visualization

Countries and regions

Let's start with the countries and regions of production. Some films list more than one: for example, Farewell My Concubine was produced in mainland China and Hong Kong. We can therefore use split to separate them.

area_split = df['area'].str.split(' ', expand=True)

The result shows that a movie lists at most 5 countries and regions. We then count how often each one appears, add up the counts, and plot the totals.

# Count each region per column, then add the columns together
a = area_split.apply(pd.Series.value_counts)
area_count = a.sum(axis=1)
area_df = area_count.astype(int).to_frame('count').sort_values(by='count')
area_df.plot.barh()
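The statement that a film lists at most five regions can be checked by counting the pieces that split produces for each row. The sketch below uses a toy Series in place of the area column; the region codes US, CN, HK, UK and CA are placeholders chosen for the example.

```python
import pandas as pd

# Toy stand-in for df['area']: regions separated by single spaces
area = pd.Series(["US", "CN HK", "US UK CA"])

# Number of regions listed for each film
regions_per_film = area.str.split(" ").str.len()

print(regions_per_film.max())  # 3 for this toy data
```

On the scraped data, the same expression applied to df['area'] gives the largest number of regions attached to a single film.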

The US leads the pack with 144 films, while Hong Kong is fourth with 25 and mainland China is seventh with 17.

  • The United States: 144

  • Japan: 34

  • United Kingdom: 34

  • Hong Kong: 25

  • France: 21

Film genre

We analyze the genres in the same way.

Drama leads by a wide margin with 191 films; romance, adventure, comedy and crime are also popular genres.

  • Drama: 192

  • Romance: 57

  • Adventure: 47

  • Comedy: 47

  • Crime: 44
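The genre counts can be reproduced along these lines. This is a sketch rather than the article's original code: count_labels is a hypothetical helper, and the three toy rows stand in for the type column, where genres are assumed to be separated by single spaces as in the area column.

```python
import pandas as pd


def count_labels(column: pd.Series, sep: str = " ") -> pd.Series:
    """Count how often each label occurs in a column of sep-separated labels."""
    return column.str.split(sep).explode().value_counts()


# Toy stand-in for df['type']
genres = pd.Series(["crime drama", "drama romance", "comedy"])
genre_count = count_labels(genres)
print(genre_count)
```

Unlike the column-by-column approach used for regions, explode puts every label on its own row first, so a single value_counts call is enough.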

Directors

The top 5 directors are:

  • Christopher Nolan: 7

  • Hayao Miyazaki: 6

  • Wong Kar Wai: 5

  • Steven Spielberg: 5

  • Ang Lee: 5

These directors have certainly earned their places.
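A ranking like this can be obtained with value_counts on the director column. The sketch below uses placeholder names (Director A, B, C) instead of scraped data, and assumes each row holds exactly one director string.

```python
import pandas as pd

# Toy stand-in for df['director']
directors = pd.Series([
    "Director A", "Director B", "Director A",
    "Director C", "Director A", "Director B",
])

# value_counts sorts by frequency, so head(n) gives the n most frequent
top_directors = directors.value_counts().head(2)
print(top_directors)
```

For the list above, head(5) on df['director'] would be the corresponding call.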

Score

Let’s start with the 10 highest-rated movies.

Rank  Movie                         Score
1     The Shawshank Redemption      9.6
33    Witness for the Prosecution   9.6
2     Farewell My Concubine         9.6
5     Life Is Beautiful             9.5
8     Schindler’s List              9.5
3     Léon: The Professional        9.4
4     Forrest Gump                  9.4
32    12 Angry Men                  9.4
14    The Chorus                    9.3
217   City Lights                   9.3
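One possible way to produce such a table is to sort by score and keep the first rows. The sketch below uses four placeholder films; breaking ties by rank is a choice made for this example, not necessarily what the article did.

```python
import pandas as pd

# Toy stand-in for the scraped table
df = pd.DataFrame({
    "rank": [1, 2, 3, 4],
    "name": ["Film A", "Film B", "Film C", "Film D"],
    "rating_num": [9.6, 9.4, 9.6, 9.1],
})

# Highest score first; films with equal scores are ordered by rank
top = df.sort_values(["rating_num", "rank"], ascending=[False, True]).head(3)
print(top[["rank", "name", "rating_num"]])
```

With the real data, head(10) gives the ten highest-rated films.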

Note that City Lights ranks only 217th despite a score of 9.3. So how are the ratings and the rankings related? We analyze this by drawing a scatter plot and a histogram.

df.plot.scatter(x='rating_num', y='rank')
plt.title('Rating and Ranking')
plt.xlabel('score')
plt.ylabel('rank')
plt.gca().invert_yaxis()  # put rank 1 at the top

plt.figure()  # new figure, so the histogram is not drawn over the scatter plot
df['rating_num'].plot.hist(bins=10, rwidth=0.9)
plt.title('Score distribution')
plt.show()

As the figures show, films with higher scores tend to rank nearer the top. Scores fall mainly between 8.4 and 9.2, with a mean of 8.83 and a mode of 8.7. The correlation coefficient between score and rank is -0.6882.
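Summary figures of this kind can be computed directly with pandas. The sketch below uses five made-up rows, and assumes the correlation is the Pearson coefficient, which is what Series.corr computes by default; the article does not say which coefficient it used.

```python
import pandas as pd

# Toy stand-in for the scraped table
df = pd.DataFrame({
    "rank": [1, 2, 3, 4, 5],
    "rating_num": [9.6, 9.4, 9.4, 9.0, 8.8],
})

mean_score = df["rating_num"].mean()
mode_score = df["rating_num"].mode().iloc[0]  # mode() can return several values
corr = df["rating_num"].corr(df["rank"])      # Pearson by default

print(mean_score, mode_score, corr)
```

A negative coefficient means that higher scores go with smaller rank numbers, that is, with positions nearer the top of the list.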

Number of ratings

Top 10 movies by number of ratings

Using the same method as before, we analyze the distribution of the number of reviewers and its relationship with the ranking.

As the number of reviewers increases, a film's ranking tends to improve. The number of reviewers falls mainly between 100,000 and 500,000, with a mean of about 343,600, and its correlation coefficient with the rank is -0.6865. The number of reviewers is fairly strongly correlated with the ranking.
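To put a number on how many films fall in a given range of reviewers, a boolean mask can be averaged. The sketch below uses five invented counts in place of the comment_num column; the bounds of 100,000 and 500,000 are taken from the paragraph above.

```python
import pandas as pd

# Toy stand-in for df['comment_num']
comments = pd.Series([80_000, 150_000, 320_000, 480_000, 900_000])

# True for films with 100,000 to 500,000 reviewers (both ends included)
in_range = comments.between(100_000, 500_000)

share = in_range.mean()                  # fraction of films in the range
mean_in_10k = comments.mean() / 10_000   # mean in units of 10,000 reviewers

print(share, mean_in_10k)
```

The mean of a boolean Series is the fraction of True values, which is why no explicit counting is needed.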

Year

The movies are concentrated after 1990. The correlation coefficient between year and rank is 0.0173, so there is essentially no linear correlation between the two.

The earliest film is Chaplin’s City Lights, from 1931.

The most recent films are from 2017. There are three, shown here with their rankings:

  • Call Me by Your Name: 117

  • Three Billboards Outside Ebbing, Missouri: 191

  • Coco: 63

The three years with the most films are:

  • 2010: 14

  • 2004: 12

  • 1994: 11
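The per-year figures can be derived from the year column once it is numeric. The sketch below uses five placeholder films; the scraper stores the year as a four-character string, so the example converts it with astype(int) first.

```python
import pandas as pd

# Toy stand-in for the scraped table; years are strings, as scraped
df = pd.DataFrame({
    "name": ["Film A", "Film B", "Film C", "Film D", "Film E"],
    "year": ["1994", "2010", "1931", "2010", "1994"],
})
df["year"] = df["year"].astype(int)

per_year = df["year"].value_counts()              # films per year
earliest = df.loc[df["year"].idxmin(), "name"]    # title of the oldest film
latest_year = df["year"].max()

print(per_year.head(3))
print(earliest, latest_year)
```

On the real data, per_year.head(3) gives the three years with the most films.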

Although 1994 comes only third, it accounts for three of the top five films. Let’s also take a look at the films from 2010, the year with the most entries:

Rank  Movie
9     Inception
24    Flipped
68    Let the Bullets Fly
86    Shutter Island
99    Confessions
118   How to Train Your Dragon
123   Despicable Me
125   The Secret World of Arrietty
128   Echoes of the Rainbow
142   Black Swan
155   Toy Story 3
163   You Are Umasou
214   A Little Thing Called Love
237   The King’s Speech
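A listing like this is a filter on the year column followed by a sort on rank. The sketch below uses four placeholder films and assumes the year column has already been converted to integers.

```python
import pandas as pd

# Toy stand-in for the scraped table
df = pd.DataFrame({
    "rank": [9, 3, 24, 1],
    "name": ["Film A", "Film B", "Film C", "Film D"],
    "year": [2010, 1994, 2010, 1994],
})

# Keep only the films from 2010, best-ranked first
films_2010 = df[df["year"] == 2010].sort_values("rank")[["rank", "name"]]
print(films_2010)
```

Changing the year in the comparison gives the same table for any other year.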

Conclusion

This article only performs a simple analysis and visualization of the data; the purpose is to learn the basic use of these libraries. You could go deeper, for example by examining the relationships among the number of ratings, scores, rankings, actors, language and running time. A good idea may matter more than technique.

Finally, here is a family portrait of the Top 250:

May you meet another world in the movie

May you feel the joy of learning in coding
