After watching the "Little Turtle" video on crawling the Douban Top 250, I rewrote the code.

Preparatory work

First, look over each page so that the later steps do not trip over edge cases: some movies do not list their leading actors, or the actor field is just an ellipsis (in short, their information is incomplete). Knowing this in advance makes the code easier to get right. Open movie.douban.com/top250: there are 25 movies per page and 10 pages in total. Going to the second page changes the URL to movie.douban.com/top250?start=25&filter=, and each following page increases start by 25, up to 225; on the first page start is 0.
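
As a quick check of that pattern, a minimal sketch that builds all ten page URLs from the start parameter described above:

# minimal sketch: build the URL of each of the 10 pages from the start parameter
base = "https://movie.douban.com/top250"
page_urls = [base + "?start=" + str(25 * i) + "&filter=" for i in range(10)]
for u in page_urls:
    print(u)  # start goes 0, 25, 50, ..., 225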

Try to get the page source code

url = "https://movie.douban.com/top250"
r = requests.get(url)
r.encoding = r.apparent_encoding
print(r.status_code)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
Copy the code

The result is an HTTP status code of 418. When that happens, the first step is usually to look up the status code; only 200 means the request succeeded, so the server must be rejecting the crawler's request. Broadly speaking, there are a few ways around this:

The request header

Pass the headers argument to the get or request method (for request, the first argument is the method, one of the 7 HTTP request methods, and the second is the url). The parameter is a dictionary; its "User-Agent" key holds a value identifying the browser you appear to be using on the site.

url = "https://movie.douban.com/top250"
r = requests.get(url)
print(r.request.headers)
{'User-Agent': 'the python - requests / 2.24.0'.'Accept-Encoding': 'gzip, deflate'.'Accept': '* / *'.'Connection': 'keep-alive'}
Copy the code

If this parameter is not set, the site recognizes the crawler from the default "User-Agent" value and denies access. Things change once we set it.

url = "https://movie.douban.com/top250"
hd = {'User-Agent':'the Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/ 537.36EDg /85.0.564.63'}
r = requests.get(url, headers=hd)
print(r.request.headers)
{'User-Agent': 'the Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/ 537.36EDg /85.0.564.63'.'Accept-Encoding': 'gzip, deflate'.'Accept': '* / *'.'Connection': 'keep-alive'}
Copy the code

With this identifier the site accepts the request and returns a response object: it reads the "User-Agent" from the request headers, the crawler disguises itself as a browser, and the site treats the visit as a human click.

The proxy IP

Set the proxies parameter to use proxy servers. A crawler usually hits a site very frequently, and the site monitors how many requests a given IP makes within a period of time; too many and that IP gets blocked. So we can set up a few proxy servers to do the work for us and switch to a different IP every so often, so that access is never cut off. For both this and the request header, we can use random.choice from the random library to pick a "User-Agent" string or an "http"/"https" proxy at random.
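
As a rough sketch of that idea, random.choice can pick both a User-Agent and a proxy before each request; the pools below are placeholders, not guaranteed working proxies:

import random
import requests

# placeholder pools; real entries would need live proxies in "ip:port" form
ua_pool = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0",
]
ip_pool = ["121.232.148.225:3128", "123.101.207.185:9999"]

headers = {"User-Agent": random.choice(ua_pool)}
proxies = {"http": "http://" + random.choice(ip_pool)}
r = requests.get("https://movie.douban.com/top250", headers=headers, proxies=proxies, timeout=10)
print(r.status_code)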

Cookie authentication

Set the cookies parameter in either of the two methods above to match the cookies of the request; you can find them under the Network tab for the page. When the crawler carries an authentication cookie, the server redirects it to the resource it originally asked for instead of sending it to the login page. If the request still fails after several attempts, other ways around the anti-crawling measures have to be considered.
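
A minimal sketch of the second route, passing cookies as a dictionary instead of pasting the whole Cookie header (the names and values below are made up; copy your own from the browser's Network tab):

import requests

# made-up cookie values for illustration only
cookies = {"bid": "xxxxxxxxxxx", "dbcl2": "123456789:xxxxxxxxxxx"}
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"}
r = requests.get("https://movie.douban.com/top250", headers=headers, cookies=cookies, timeout=10)
print(r.status_code)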

How to get the number 10

Scroll to the "10" at the bottom of the page, right-click it, and choose Inspect. The browser shows the page source and jumps straight to the tag for the 10: <a href="?start=225&filter=">10</a>.

url = "https://movie.douban.com/top250"
hd = {'User-Agent':'the Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/ 537.36EDg /85.0.564.63'}
r = requests.get(url, headers=hd)
r.encoding = r.apparent_encoding
soup = BeautifulSoup(r.text, "html.parser")
depth = soup.find("a", href="? start=225&filter=")
print(depth.text)
Copy the code

The result is 10, but as type str, so a cast is required. There is another way to do this; it is messier and a little convoluted, but there is one small detail worth highlighting. The <a> and <span> tags both sit under the same <div>, so they are siblings at the same level: first find the <span class="next"> tag, then traverse sideways to <a href="?start=225&filter=">10</a>.

url = "https://movie.douban.com/top250"
hd = {'User-Agent':'the Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/ 537.36EDg /85.0.564.63'}
r = requests.get(url, headers=hd)
r.encoding = r.apparent_encoding
soup = BeautifulSoup(r.text, "html.parser")
depth = soup.find('span', class_='next').previous_sibling
print(depth)
Copy the code

The result looks like nothing at all, except that the output seems to contain one extra blank line.

>>> depth = soup.find('span', class_='next')
>>> print(list(depth))
['\n', <link href="?start=25&filter=" rel="next">
<a href="?start=25&filter=">后页&gt;</a></link>]

Adding one more previous_sibling lands on the <a> tag and gives us the 10; again, a type conversion is needed.

depth = soup.find('span', class_='next').previous_sibling.previous_sibling
print(depth.text)

Crawl web pages

With the above preparation we can fetch the source code of the pages. Without further ado, here is the code, explained piece by piece.

# crawl the page code
def get_html(url):
    # use a proxy
    iplist = ["121.232.148.225", "123.101.207.185", "69.195.157.162", "175.44.109.246", "175.42.128.246"]
    # random.choice from the random library picks an IP address at random
    proxies = {"http": random.choice(iplist)}
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 Edg/85.0.564.63',
        'cookie': 'bid=mDUxIx660Vg; douban-fav-remind=1; ll="118300"; '
                  '_vwo_uuid_v2=D0D095AA577CA9982D96BF53E5B82D902|19baada46789b48f67201d5ad830a0f6; '
                  '__yadk_uid=gAC153pUujwGBubBMhpUqg8osqjO7FfD; '
                  '__utmz = 30149280.1601027331.15.8.utmcsr=accounts.douban.com | utmccn = ('
                  'referral)|utmcmd=referral|utmcct=/passport/login; '
                  '__utmz = 223695111.1601028074.12.5.utmcsr=accounts.douban.com | utmccn = ('
                  'referral)|utmcmd=referral|utmcct=/passport/login; push_noty_num=0; push_doumail_num=0; ct=y; '
                  'ap_v = 0,6.0; __utmc=30149280; __utmc=223695111; '
                  '__utma = 30149280.846971982.1594690041.1601194295.1601199438.18; '
                  '__utma = 223695111.2022617817.1595329698.1601194295.1601199438.15; __utmb = 223695111.0.10.1601199438; '
                  '_pk_ses. 100001.4 cf6 = *; _pk_ref. 100001.4 cf6 = % 5 b % 22% 2 c % 22% 2 22% 22% c1601199438%2 c % 22 HTTPS % 3 a % 2 f '
                  '%2Faccounts.douban.com%2Fpassport%2Flogin%3Fredir%3Dhttps%253A%252F%252Fmovie.douban.com'
                  '%252Ftop250%22%5D; dbcl2="216449506:1anURmdR9nE"; ck=LxNQ; __utmt=1; __utmv = 30149280.21644; '
                  'douban-profile-remind=1; '
                  '__gads=ID=c502f7a27a33facd-22f60957b9c300e3:T=1601201244:S=ALNI_MbN9wSYWbg2k5f3fWucwuXpORbGNw; '
                  '__utmb = 30149280.20.10.1601199438; '
                  '_pk_id. 100001.4 cf6 = d35078454df4f375. 1595329083.15.1601201286.1601197192.'}
    res = requests.get(url, headers=headers, proxies=proxies, timeout=30)
    return res

Define a function that fetches a page and returns a Response object. There are two ways to set cookies: add them to headers directly (as here), or pass the cookies parameter. Also set a timeout; timeout can be a tuple of (connect timeout, read timeout), or a single value that applies to both, which is what we use here so the crawler does not hang silently without ever raising an error.
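
For reference, a small sketch of the two timeout styles requests accepts (the numbers here are arbitrary):

import requests

url = "https://movie.douban.com/top250"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"}

# a single value applies to both the connect and the read phase
r1 = requests.get(url, headers=headers, timeout=30)

# a tuple is interpreted as (connect timeout, read timeout), in seconds
r2 = requests.get(url, headers=headers, timeout=(3.05, 27))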

Parse web pages

Parsing is the most important part of a crawler. We have the source code of movie.douban.com/top250?star… in hand, and now we have to parse the HTML. Boil it into a tasty soup with the BeautifulSoup library: use html.parser to parse the text attribute of the Response object returned by our get_html(url) function.

soup = BeautifulSoup(res.text, 'html.parser')

The remaining 9 pages work much the same way; just watch out for the individual movies with missing information, then loop over the pages. Each movie basically has a title, a score, a number of raters, a director, leading actors, a release year, a country and genre, and a one-line description. That is a lot of content, and the crawler extracts it piece by piece. Define a function that stores all the extracted information in a list and returns that list.

The movie name

movies = []
# find the <div class="hd"> tags
targets = soup.find_all("div", class_="hd")
for each in targets:
    try:
        # the <span> under the <a> under the <div> holds the movie name
        # (the three tags are nested one inside another)
        movies.append(each.a.span.text)
    finally:
        continue

Hover the mouse over a movie name and click Inspect to see the corresponding tag; it is the <a href="movie.douban.com/subject/129…"> inside the <div class="hd"> tag.

The <span> under that link holds the name (The Shawshank Redemption, for example). find_all finds every tag that matches and returns a result set; we then append the name from each element to the list. To make sure the program keeps running even if an entry is malformed, the loop body is wrapped in a try/finally.

Score

ranks = []
targets = soup.find_all("span", class_="rating_num")
for each in targets:
    try:
        ranks.append(each.text)
    finally:
        continue

Scores are extracted the same way as the movie names.

The director

director = []
targets = soup.find_all("div", class_="bd")
for each in targets:
    try:
        # split breaks the string up and returns a list of the pieces
        # lstrip removes the given characters from the left
        director.append(each.p.text.split('\n')[1].strip().split('\xa0\xa0\xa0')[0].lstrip('Director:'))
    finally:
        continue

As before, find the tag that holds the director: it is the <p> under <div class="bd">, and the information we want is inside it. The next step is to extract the text of <p>, which is a string: split the whole string into a list, pick the right piece, and strip off the irrelevant parts.
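
To see what that chain does, here is a small worked example on an invented string shaped like the text inside <p> (it uses the translated 'Director:'/'starring:' labels from the code above):

# invented sample in the same shape as the real <p> text
text = "\nDirector: Frank Darabont\xa0\xa0\xa0starring: Tim Robbins\n1994\xa0/\xa0USA\xa0/\xa0Crime\n"

line = text.split('\n')[1].strip()    # 'Director: Frank Darabont\xa0\xa0\xa0starring: Tim Robbins'
left = line.split('\xa0\xa0\xa0')[0]  # 'Director: Frank Darabont'
print(left.lstrip('Director:'))       # ' Frank Darabont' -- lstrip strips a set of characters, not a prefix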

Starring

leading_actor = []
targets = soup.find_all("div", class_="bd")
for each in targets:
    try:
        leading_actor.append(each.p.text.split('\n')[1].strip().split('\xa0\xa0\xa0')[1].split('starring:')[1])
    finally:
        continue
if movies[0] == "Shawshank Redemption.":
    leading_actor.insert(24, "No")
if movies[0] == "Havoc in Heaven":
    leading_actor.insert(6, "No")
    leading_actor.insert(8, "No")
if movies[0] == "Once upon a Time in America":
    leading_actor.insert(5, "No")
if movies[0] == "Sisters on the Sunshine":
    leading_actor.insert(23, "No")
if movies[0] == "Seven Samurai":
    leading_actor.insert(10, "No")
if movies[0] == "The Imitation Game":
    leading_actor.insert(4, "No")
    leading_actor.insert(9, "No")
if movies[0] == "Rashomon":
    leading_actor.insert(2, "No")

There are 10 pages in total, and on some pages a movie is missing its leading actors. Without the try/finally the program would stop with "IndexError: list index out of range": for those movies the string passed to append() cannot be split on the separator, so the list produced by split has no element at index 1.

For example, the last movie on the first page, The Intouchables, cannot be split on "starring:", so split('starring:')[1] has nothing to index. After each loop, the missing entries for the affected page are inserted into the list, which guarantees that leading_actor has length 25 on every one of the 10 loops, since there are 25 movies per page.
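
The failure mode is easy to reproduce on its own; the string below is invented and uses the same translated 'starring:' label as the code above:

line = "Director: Olivier Nakache\xa0\xa0\xa0"   # a movie line with no actor section
parts = line.split('\xa0\xa0\xa0')[1].split('starring:')
print(parts)     # [''] -- the separator is absent, so split returns a one-element list
print(parts[1])  # IndexError: list index out of range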

Number of raters

people = []
targets = soup.find_all("div", class_="star")
for each in targets:
    try:
        people.append(str(each.contents[7].text).rstrip("Human evaluation"))
    finally:
        continue

Take the relevant child of the <div> tag, convert its text to str, strip the trailing "Human evaluation" label, and store the number in the list.

Movie description

# Movie description, there are two movies that are not described on the last two pages
sentence = []
targets = soup.find_all("span", class_="inq")
for each in targets:
    try:
        sentence.append(str(each.text))
    finally:
        continue
if sentence[22] == "The whole angel protection thing.":
    sentence.insert(9, "No")
    sentence.insert(14, "No")
if sentence[22] == "Sick E.T. has the color of persimmon cake.":
    sentence.insert(6, "No")
    sentence.insert(22, "No")

The description at a fixed position on the affected page is used as a marker; "No" is then inserted into the list for the movies that have no description, which keeps the length of sentence at 25 on every loop.

Release year

year = []
targets = soup.find_all("div", class_="bd")
for each in targets:
    try:
        year.append(each.p.text.split('\n')[2].strip().split('\xa0')[0].strip("(Mainland China)"))
    finally:
        continue
if movies[0] == "Havoc in Heaven":
    year[0] = "1961/1964/1978/2004"

Here strip removes the "(Mainland China)" string that follows the year of The Book of Legends. Another movie, Havoc in Heaven, was released in 1961, 1964, 1978 and 2004; on that page we simply overwrite year[0] with all four years.

Country

country = []
targets = soup.find_all("div", class_="bd")
for each in targets:
    try:
        country.append(each.p.text.split('\n')[2].strip().split('\xa0')[2])
    finally:
        continue

Type

kinds = []
targets = soup.find_all("div", class_="bd")
for each in targets:
    try:
        kinds.append(each.p.text.split('\n')[2].strip().split('\xa0')[4])
    finally:
        continue

It’s the same thing as before.

Integration of the data

result = []
length = len(movies)
for i in range(length):
    result.append([movies[i], ranks[i], director[i], leading_actor[i], people[i], sentence[i], year[i], country[i], kinds[i]])
return result

The value of length is 25. If a movie's information were missing and one of the lists did not reach 25 entries, an IndexError would be raised, which is why each per-field list is padded to 25 after every loop. The loop then gathers all the information for each movie into result, which is returned as the function's return value.

Save the data

def save_to_excel(result):
    # instantiating Workbook is equivalent to creating an Excel document
    wb = Workbook()
    # activate the sheet
    ws = wb.active
    # set the column titles
    ws["A1"] = "Name"
    ws["B1"] = "Grade"
    ws["C1"] = "Director"
    ws["D1"] = "Star"
    ws["E1"] = "Number of evaluators"
    ws["F1"] = "Movie Description"
    ws["G1"] = "Year"
    ws["H1"] = "Country"
    ws["I1"] = "Type"
    # append each row of information to ws
    for each in result:
        ws.append(each)
    # save the file
    wb.save("Top250.xlsx")

This uses the third-party Python module openpyxl, installed with pip (pip install openpyxl).

The main function

def main():
    host = "https://movie.douban.com/top250"
    res = get_html(host)
    depth = find_pages(res)
    result = []
    for i in range(depth):
        url = host + '/?start=' + str(25 * i)
        res = get_html(url)
        result.extend(parser_html(res))
    save_to_excel(result)

find_pages(res) returns the number of pages. We loop depth times, build the URL for each page, fetch and parse it, extend result with the parsed rows, and finally save everything to an Excel file.

The program code

import requests
from bs4 import BeautifulSoup
import random
from openpyxl import Workbook


def get_html(url):
    # use a proxy
    iplist = ["121.232.148.225", "123.101.207.185", "69.195.157.162", "175.44.109.246", "175.42.128.246"]
    proxies = {"http": random.choice(iplist)}
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 Edg/85.0.564.63',
        'cookie': 'bid=mDUxIx660Vg; douban-fav-remind=1; ll="118300"; '
                  '_vwo_uuid_v2=D0D095AA577CA9982D96BF53E5B82D902|19baada46789b48f67201d5ad830a0f6; '
                  '__yadk_uid=gAC153pUujwGBubBMhpUqg8osqjO7FfD; '
                  '__utmz = 30149280.1601027331.15.8.utmcsr=accounts.douban.com | utmccn = ('
                  'referral)|utmcmd=referral|utmcct=/passport/login; '
                  '__utmz = 223695111.1601028074.12.5.utmcsr=accounts.douban.com | utmccn = ('
                  'referral)|utmcmd=referral|utmcct=/passport/login; push_noty_num=0; push_doumail_num=0; ct=y; '
                  'ap_v = 0,6.0; __utmc=30149280; __utmc=223695111; '
                  '__utma = 30149280.846971982.1594690041.1601194295.1601199438.18; '
                  '__utma = 223695111.2022617817.1595329698.1601194295.1601199438.15; __utmb = 223695111.0.10.1601199438; '
                  '_pk_ses. 100001.4 cf6 = *; _pk_ref. 100001.4 cf6 = % 5 b % 22% 2 c % 22% 2 22% 22% c1601199438%2 c % 22 HTTPS % 3 a % 2 f '
                  '%2Faccounts.douban.com%2Fpassport%2Flogin%3Fredir%3Dhttps%253A%252F%252Fmovie.douban.com'
                  '%252Ftop250%22%5D; dbcl2="216449506:1anURmdR9nE"; ck=LxNQ; __utmt=1; __utmv = 30149280.21644; '
                  'douban-profile-remind=1; '
                  '__gads=ID=c502f7a27a33facd-22f60957b9c300e3:T=1601201244:S=ALNI_MbN9wSYWbg2k5f3fWucwuXpORbGNw; '
                  '__utmb = 30149280.20.10.1601199438; '
                  '_pk_id. 100001.4 cf6 = d35078454df4f375. 1595329083.15.1601201286.1601197192.'}
    res = requests.get(url, headers=headers, proxies=proxies, timeout=30)
    # res = requests.get(url, headers=headers)
    return res


def parser_html(res):
    soup = BeautifulSoup(res.text, 'html.parser')

    # the movie name
    movies = []
    targets = soup.find_all("div", class_="hd")
    for each in targets:
        try:
            movies.append(each.a.span.text)
        finally:
            continue

    # score
    ranks = []
    targets = soup.find_all("span", class_="rating_num")
    for each in targets:
        try:
            ranks.append(each.text)
        finally:
            continue

    # director
    director = []
    targets = soup.find_all("div", class_="bd")
    for each in targets:
        try:
            director.append(each.p.text.split('\n')[1].strip().split('\xa0\xa0\xa0')[0].lstrip('Director:'))
        finally:
            continue

    # starring
    leading_actor = []
    targets = soup.find_all("div", class_="bd")
    for each in targets:
        try:
            leading_actor.append(each.p.text.split('\n')[1].strip().split('\xa0\xa0\xa0')[1].split('starring:')[1])
        finally:
            continue
    if movies[0] == "Shawshank Redemption.":
        leading_actor.insert(24, "No")
    if movies[0] == "Havoc in Heaven":
        leading_actor.insert(6, "No")
        leading_actor.insert(8, "No")
    if movies[0] == "Once upon a Time in America":
        leading_actor.insert(5, "No")
    if movies[0] == "Sisters on the Sunshine":
        leading_actor.insert(23, "No")
    if movies[0] == "Seven Samurai":
        leading_actor.insert(10, "No")
    if movies[0] == "The Imitation Game":
        leading_actor.insert(4, "No")
        leading_actor.insert(9, "No")
    if movies[0] == "Rashomon":
        leading_actor.insert(2, "No")

    # Number of people evaluated
    people = []
    targets = soup.find_all("div", class_="star")
    for each in targets:
        try:
            people.append(str(each.contents[7].text).rstrip("Human evaluation"))
        finally:
            continue

    # Movie description, there are two movies that are not described on the last two pages
    sentence = []
    targets = soup.find_all("span", class_="inq")
    for each in targets:
        try:
            sentence.append(str(each.text))
        finally:
            continue
    if sentence[22] == "The whole angel protection thing.":
        sentence.insert(9, "No")
        sentence.insert(14, "No")
    if sentence[22] == "Sick E.T. has the color of persimmon cake.":
        sentence.insert(6, "No")
        sentence.insert(22, "No")

    # years
    year = []
    targets = soup.find_all("div", class_="bd")
    for each in targets:
        try:
            year.append(each.p.text.split('\n')[2].strip().split('\xa0')[0].strip("(Mainland China)"))
        finally:
            continue
    if movies[0] == "Havoc in Heaven":
        year[0] = "1961/1964/1978/2004"
    # countries
    country = []
    targets = soup.find_all("div", class_="bd")
    for each in targets:
        try:
            country.append(each.p.text.split('\n')[2].strip().split('\xa0')[2])
        finally:
            continue
    # type
    kinds = []
    targets = soup.find_all("div", class_="bd")
    for each in targets:
        try:
            kinds.append(each.p.text.split('\n')[2].strip().split('\xa0')[4])
        finally:
            continue
    result = []
    length = len(movies)
    for i in range(length):
        result.append([movies[i], ranks[i], director[i], leading_actor[i],
                       people[i], sentence[i], year[i], country[i], kinds[i]])
    return result


# find out how many pages there are
def find_pages(res):
    soup = BeautifulSoup(res.text, 'html.parser')
    depth = soup.find('span', class_='next').previous_sibling.previous_sibling.text

    return int(depth)


def save_to_excel(result):
    wb = Workbook()
    ws = wb.active
    ws["A1"] = "Name"
    ws["B1"] = "Grade"
    ws["C1"] = "Director"
    ws["D1"] = "Star"
    ws["E1"] = "Number of evaluators"
    ws["F1"] = "Movie Description"
    ws["G1"] = "Year"
    ws["H1"] = "Country"
    ws["I1"] = "Type"
    for each in result:
        ws.append(each)
    wb.save("Top250.xlsx")


def main():
    host = "https://movie.douban.com/top250"
    res = get_html(host)
    depth = find_pages(res)
    result = []
    for i in range(depth):
        url = host + '/?start=' + str(25 * i)
        res = get_html(url)
        result.extend(parser_html(res))
    save_to_excel(result)


if __name__ == "__main__":
    main()