This is the 17th original article in my daily Python tutorial series.

Following the previous article on the BeautifulSoup library, this article applies that knowledge to crawl today's target website: the Maoyan Top 100 movies list. The site is fairly easy, so you can try crawling it yourself first and come back to this article if you run into problems.

This article is mainly a practice exercise with nothing new in it, so experienced readers may want to skip it!

1. Libraries and website used in this article

  • requests

  • BeautifulSoup

  • The target website: http://maoyan.com/board/4

2. Analyzing the target website


It is easy to find the information we want. The five arrows in the screenshot above point to everything we need: the movie's poster image link, the title, the stars, the release time, and the score. Now that we have located the content, it's time to get the link to the next page.

There are two ways to do this. The first is to collect the links to all the pages up front; the second is to grab the link to the next page from each page as we go. Since the pager only shows links to some of the pages directly, we take the second route and grab the next-page link, which is easier.
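(For reference, a minimal sketch of the first approach is below. It assumes the board is paginated with an offset query parameter in steps of 10, which is an assumption you would need to confirm from the pager links; it is not shown in the analysis above.)

# Sketch of approach 1: build every page URL up front.
# Assumes pagination via an "offset" parameter in steps of 10 -- verify this in the pager links.
base_url = 'http://maoyan.com/board/4'
page_urls = [base_url + '?offset={}'.format(offset) for offset in range(0, 100, 10)]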


All right, analysis done, code up.

3. Writing the code

First things first: make a GET request.

import requests
from bs4 import BeautifulSoup

url_start = 'http://maoyan.com/board/4'
response = requests.get(url_start)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'lxml')
print(response.text)



Output result:

Surprised? If you have written a few crawlers, this is nothing unusual: we are being blocked by the site's anti-crawling measures. Let's try adding a request header.

url_start = 'http://maoyan.com/board/4'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'}
response = requests.get(url_start, headers=headers)


This time it returns normally. Most sites do some basic anti-crawling checks on the request headers, so when you run into anti-crawling, don't panic: try adding a request header first.

Next, BeautifulSoup is used to extract the content we want.

imgs = soup.select('dd .board-img')  # This gets the image links
titles = soup.select('dd .board-item-main .name')  # This gets the movie names
starses = soup.select('dd .board-item-main .movie-item-info .star')  # This gets the movie stars
times = soup.select('dd .board-item-main .movie-item-info .releasetime')  # This gets the release times
scores = soup.select('dd .board-item-main .score-num')  # This gets the scores


Here, each select statement returns one kind of information for every movie on the page, so unlike with a regular expression we cannot capture all of one movie's information in a single match. For example, when I get the images, one statement returns the image links for all the movies on the page, and we have to pull them apart when we store them. So what I do here is loop over the indices 0 through 9 and put the items at the same index into the same dictionary.

films = []  # Store all the movie information on one page
for x in range(0, 10):
    # Get the attribute of the tag (the image link)
    img = imgs[x]['data-src']
    # Get the text content of the tag and strip the whitespace at both ends
    title = titles[x].get_text().strip()
    stars = starses[x].get_text().strip()[3:]  # Slice off the "starring" prefix
    time = times[x].get_text().strip()[5:]  # Slice off the "release time" prefix
    score = scores[x].get_text().strip()
    film = {'title': title, 'img': img, 'stars': stars, 'time': time, 'score': score}
    films.append(film)


The next step is to get the link to the next page.

pages = soup.select('.list-pager li a')  # The link to the next page is in the last <a> tag
page = pages[len(pages) - 1]['href']


The rest is simple: just loop over every page in the same way and collect all the content. I won't post the full code here.
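That said, here is a rough sketch of what the loop could look like, stitched together from the snippets above. It assumes the Top 100 board spans 10 pages of 10 movies each, and it uses urljoin to resolve the relative next-page href; treat it as a sketch rather than the exact code in my repository.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'}
url = 'http://maoyan.com/board/4'
all_films = []
for _ in range(10):  # assume the Top 100 board spans 10 pages of 10 movies
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.text, 'lxml')
    imgs = soup.select('dd .board-img')
    titles = soup.select('dd .board-item-main .name')
    starses = soup.select('dd .board-item-main .movie-item-info .star')
    times = soup.select('dd .board-item-main .movie-item-info .releasetime')
    scores = soup.select('dd .board-item-main .score-num')
    for x in range(len(titles)):
        all_films.append({'title': titles[x].get_text().strip(),
                          'img': imgs[x]['data-src'],
                          'stars': starses[x].get_text().strip()[3:],
                          'time': times[x].get_text().strip()[5:],
                          'score': scores[x].get_text().strip()})
    pages = soup.select('.list-pager li a')
    if not pages:
        break
    url = urljoin(url, pages[-1]['href'])  # the next-page link sits in the last <a>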

Final words

This was a small practice exercise for the BeautifulSoup library. I didn't use much beyond what was covered yesterday: the selectors and getting text content and attributes. Personally, I still find the re module more convenient; a single regular expression can capture the details of every movie, like this:

<dd>.*?board-index.*?>([\d]{1,3})</i>.*?title="(.*?)".*?class="star">(.*?)</p>.*?class="releasetime">(.*?)</p>.*?class="integer">(.*?)</i>.*?class="fraction">(.*?)</i>


We also need the re.S matching flag so that . can match newlines. So for this kind of page I would recommend using regular expressions.
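For anyone who wants to try the regex route, here is a minimal sketch of how the pattern above might be used with re.S; it assumes response still holds the first page fetched with the request header earlier, and it only prints the results rather than storing them.

import re

# re.S lets '.' also match newlines, so one pattern can cover a whole <dd> block.
pattern = re.compile(
    r'<dd>.*?board-index.*?>([\d]{1,3})</i>.*?title="(.*?)".*?class="star">(.*?)</p>'
    r'.*?class="releasetime">(.*?)</p>.*?class="integer">(.*?)</i>.*?class="fraction">(.*?)</i>',
    re.S)
for rank, title, stars, time, integer, fraction in pattern.findall(response.text):
    # Same slicing as before: strip the "starring" and "release time" prefixes.
    print(rank, title, stars.strip()[3:], time.strip()[5:], integer + fraction)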

If you need the complete code, please check my GitHub!

GitHub: github.com/SergioJune/…

If this article was useful to you, how about liking and sharing it?

◐◑ Crawling "The Hitchhiker's Guide to Python"! Advanced Python books and PDFs

◐◑ BeautifulSoup for python crawlers

◐◑ An old driver takes you through crawling girl pics with Python




Daily learning Python

There is more to code than bugs; there is also beauty and fun.