1. What is a web crawler

When it comes to crawlers, we usually think of quickly grabbing the network resources we need from the Internet, such as pictures, text or videos. It is true that once we master crawler technology, we can collect things from the web in a short time. However, the resources we can crawl are the ones a website already displays: anything we could obtain by manually copying and pasting, a crawler simply does for us automatically and much faster. Before writing a web crawler, we have to be clear about which resources we want to obtain, know which website or request interface holds them, and then make sure those resources are actually accessible to us, because if the site owner does not want their resources crawled at will, the site may have authentication or anti-crawler mechanisms in place to intercept crawlers.

The general workflow of a web crawler is: fetch the HTML content of web pages in batches (or obtain data through the site's resource interface), then write data-processing code to convert it into the resources we need. Crawling is much easier if the site exposes a data interface; in that case we typically use Requests to get the data directly. In most cases, though, the data we need (or the data interface) is buried in the HTML of the page, so we also have to parse the HTML and extract its content before we get the actual data. Page fetching, content parsing, data extraction and finally data storage, sometimes with anti-crawler mechanisms to work around, sounds like a lot of work, but Python has a third-party library that simplifies all of this and makes crawling more efficient: Scrapy.
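
When a site does expose a data interface, the crawl step really is that simple. A minimal sketch with Requests, using a made-up URL purely for illustration:

import requests

# a hypothetical JSON data interface; the URL and parameters are made up
resp = requests.get("https://example.com/api/movies", params={"page": 1})
resp.raise_for_status()
data = resp.json()    # structured data comes back directly, no HTML parsing needed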

2. Scrapy

According to Scrapy’s website, Scrapy is an application framework for crawling web data and extracting structured data. Today we’re going to try our first crawler using scrapy.

2.1 Installing Scrapy

Let's create a new folder for our crawler project called douban_scrapy. The target this time is the high-scoring movies on Douban. Someone has already compiled a list of high-scoring movies on Douban, but unfortunately that list cannot be filtered by time or score, or grouped by movie type, so we will crawl the data ourselves and organize it however we like.

Before installing Scrapy, let's create a new Python environment under the project path (a good habit that keeps the project from colliding with the system environment):

  • Create a new folder, myenv, to store the libraries of the new Python environment
  • Create the environment by running: python -m venv ./douban_scrapy_env
  • Activate the new environment: source ./douban_scrapy_env/bin/activate

Then execute pip install scrapy to install the Scrapy library.

2.2 Create a scrapy project

Now we can use Scrapy. Executing scrapy startproject douban automatically creates a new project folder (douban) in the current directory. This folder contains the following files:

douban/
    scrapy.cfg
    douban/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            

Among them:

  • scrapy.cfg is used to configure the project, such as the path to the project settings file, the project path, etc.
  • items.py is used to define the entities we crawl, such as title, image, etc.
  • middlewares.py is used to define middleware
  • settings.py holds the project settings, such as the crawler bot name, the spider module path, etc.
  • The spiders folder is where the crawler scripts are placed

To write a crawler with Scrapy, we define items first and then write the crawler script under the spiders folder. But we cannot do either just yet, because we are not familiar with the page we want to crawl. Let's use Scrapy's interactive shell to get to know the page first.

2.3 scrapy shell

We can use the scrapy shell [URL] command to open an interactive scraping terminal for the given page; the crawl result is available as response, and we can debug our data-extraction rules against it. Let's try scrapy shell https://www.douban.com/doulist/30299/ to fetch the first page of the Douban high-score movie list:

As you can see, the request returned 403, which means our crawl was intercepted by the site's anti-crawler mechanism. To get around this, try adding -s USER_AGENT='Mozilla/5.0' to the end of the command:

This time we get a normal crawl result. The request header had identified us as a crawler script; changing USER_AGENT makes the request look as if it came from a browser. If we don't want to add USER_AGENT every time we run a command, we can set USER_AGENT = 'Mozilla/5.0' in the settings file of the project folder.
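
For example, a one-line addition to the generated settings file (the path assumes the default project layout) might look like:

# douban/settings.py
USER_AGENT = 'Mozilla/5.0'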

Now we can start debugging the extraction rules. As mentioned, response holds the result of crawling the target page. Once we create a selector from this object, we can use it to extract the content we want with Scrapy's methods. There are two ways to get a selector:

  • Import the Selector class and build a selector object explicitly: from scrapy.selector import Selector
  • Use the selector that is already attached to the response: response.selector
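
A minimal sketch of both approaches inside the scrapy shell (the //title rule is only a placeholder):

from scrapy.selector import Selector

sel = Selector(response=response)                      # built explicitly from the crawled response
sel.xpath("//title/text()").extract()

response.selector.xpath("//title/text()").extract()    # the selector scrapy already attached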

In Scrapy, we can call either the css or the xpath method on a selector to extract the data we need. If you are familiar with front-end development, you will know CSS as the language used to define page styles; the selector's css method takes CSS-style selector expressions and extracts data based on them. xpath is another method defined on the selector for extracting data, with a rule syntax that differs from CSS. This article focuses on the xpath approach, so here are the basic rules of xpath syntax (using single letters such as a, b, c to stand for HTML elements); a small self-contained example follows the list:

  • response.selector.xpath("a"), extracts a element that has no parent element in the specified environment
  • response.selector.xpath("a/b"), extract the first-level element B under element A
  • response.selector.xpath("//a"), extract all a elements in the page regardless of their location
  • response.selector.xpath("a//b"), extract all elements b under element A, regardless of the level of element B below element A
  • response.selector.xpath("a/@b"), extract the b attribute under element A
  • response.selector.xpath("a/b[1]"), extract the first of the sublevel b elements of element A
  • response.selector.xpath("//a[@b]")Extract all a elements with b attributes
  • response.selector.xpath("//a[@b='c']"), extract all a elements whose b attribute value is C
  • response.selector.xpath("//a[contains("b", "c")]"), extract all a elements that have b attributes and whose b attribute values contain C characters

Now that we know some xpath syntax, let's try to extract the names of all the movies on the list page. First execute view(response) to open the page in the browser, then use the browser's developer tools to inspect the HTML that contains the movie names.

From what we can observe, each item of the movie list is stored in a div with class doulist-item, which contains a div with class mod holding the content; inside that, a div with class bd doulist-subject holds the body content, and the a tag inside the div with class title holds the movie name we need. Most movie names on the page are stored this way (every entry in the list uses the same structure), so we can use the following rule to extract the name of every movie on the page:

response.selector.xpath("//div[@class='doulist-item']/div[@class='mod']/div[@class='bd doulist-subject']/div[@class='title']/a/text()").extract()

Since we want the text content inside the a tag, we append text() to the a tag's path to indicate that we want the text of that element. The xpath method returns a list of selector objects, and we call extract() to convert them into plain text.

The image above shows the movie names extracted from the page. Since the extracted text contains some extra newlines and spaces, we call strip() on each extracted string to get the final result. Looking at the results carefully, we find blank strings mixed in. Normally our extraction path should match every entry in exactly the same way, so why do blank strings occasionally appear in the results?
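
The cleanup step described above looks roughly like this in the shell:

raw_names = response.selector.xpath(
    "//div[@class='doulist-item']/div[@class='mod']"
    "/div[@class='bd doulist-subject']/div[@class='title']/a/text()"
).extract()
names = [n.strip() for n in raw_names]    # drop the surrounding newlines and spaces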

We notice that the second movie title has an extra play icon that the first one does not. Is that why we extract a blank string? Let's expand the second selector (the one with the icon) and compare it with the first one, which contains only the movie name:

As the figure shows, when the a tag containing the movie name also contains a play icon, the name is preceded by a text node made up of a newline and spaces. That is why, after stripping the extra characters, some extracted "names" end up as blank strings. When two pieces of text are extracted from one tag, the first is the blank string and the second is the movie name we want, so we simply take the last piece of text each time. We can therefore adjust our xpath rule:

response.selector.xpath("//div[@class='doulist-item']/div[@class='mod']/div[@class='bd doulist-subject']/div[@class='title']/a/text()[last()]").extract()

Here we add [last()] after text() to indicate that we only want the last text node. Also, since xpath returns a list of selectors, we can call xpath not only on a single selector object but also on that list. This means we can run xpath again on the result returned by a previous xpath call, which makes the code structure clearer. The movie-name extraction above is equivalent to:

films_box = response.selector.xpath("//div[@class='doulist-item']/div[@class='mod']")

films_body = films_box.xpath("div[@class='bd doulist-subject']")

films_name = films_body.xpath("div[@class='title']/a/text()[last()]").extract()

Finally, we summarize the process of deriving the rules for extracting target data using Scrapy’s xpath:

  • Use scrapy shell to crawl the target page
  • Use view(response) to open the crawled page in the browser
  • Open the browser's developer tools and observe where the target data sits in the page
  • Translate the path to the target data in the page into an xpath extraction rule

2.4 Items

We now have a rule for extracting movie names from the movie list, which means we can start writing the crawler script. In Scrapy, before writing a spider we define items, the target entities we want to crawl. Class variables defined directly on the item class become the item's fields. For example:

import scrapy

class Film(scrapy.Item):
    name = scrapy.Field()
    

The above defines our Film item, which has a single field for the movie name. The item class inherits from scrapy.Item, and each field is declared with scrapy.Field() without specifying a type. The main purpose of defining items is to use Scrapy's built-in persistence features, such as exporting the extracted results to a file or saving them to a database according to the item structure we defined.
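
Items behave like dictionaries restricted to the declared fields; a quick sketch (the film name is made up):

film = Film(name="Some film")    # fields can be filled in at construction time
film["name"]                     # ...and read or written like a dict: 'Some film'
dict(film)                       # {'name': 'Some film'}
# film["year"] = 1994            # would raise KeyError: 'year' is not a declared field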

2.5 Spiders

Now that we know how to define an item, we can start to think about writing spider scripts.

  • Import the Spider class with from scrapy import Spider (or simply import scrapy); every crawler class must inherit from it
  • The name class variable defines the name of our crawler and lets Scrapy run the crawler by that name; by convention the value of name is lowercase
  • To set the URLs or request objects to crawl, we can either list the start URLs in the start_urls class variable, or generate an iterable of request objects by writing a start_requests method (see the sketch after this list)
  • A method called parse is what Scrapy calls on each crawled response if no other callback is specified
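
The list above mentions start_requests as an alternative to start_urls; here is a minimal sketch of that approach (ExampleSpider is a hypothetical spider, not part of our project):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # equivalent to start_urls, but lets us build the Request objects ourselves
        urls = ["https://www.douban.com/doulist/30299/"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.logger.info("crawled %s", response.url)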

Below is our Douban movie crawler script:

import scrapy
from douban.items import Film

class DouBan(scrapy.Spider):
    name = "douban"
    start_urls=[""https://www.douban.com/doulist/30299/""]
    
    def parse(self, response):
        films_box = response.selector.xpath("//div[@class='doulist-item']/div[@class='mod']")
        films_body = films_box.xpath("div[@class='bd doulist-subject']")
        films_name = films_body.xpath("div[@class='title']/a/text()[last()]").extract()
        for name in films_name:
            name = name.strip()
            film = Film()
            film["name"] = name
            yield film

In this crawler script, we set the crawler name to douban and the page to crawl, and in the parse method we parse the crawl result and extract the data we need. Once the script is done, we can run scrapy crawl douban -o <file name> to save the extracted data to a file. Scrapy supports output formats such as json, jsonlines, jl, csv, xml, marshal and pickle, but Excel is not among them, and exporting to an Excel file is exactly what we want (so we can filter or sort the movie results directly in Excel).

2.6 scrapy-xlsx

To fix this, we can install the Python library scrapy-xlsx, whose main job is to export results to Excel files from within a Scrapy project. Run pip install scrapy-xlsx to install it, then register its exporter in settings.py:

FEED_EXPORTERS = {
    'xlsx': 'scrapy_xlsx.XlsxItemExporter',
}
            

Now we can execute scrapy crawl douban -o douban_films.xlsx in the project folder to export the results to an Excel file.

2.7 Item pipelines

Earlier we used last() in the xpath rule to filter out empty movie names by taking only the last text node. But if useless text cannot be handled with xpath syntax alone, do we have to deal with it inside the spider script? In fact, Scrapy provides a cleaner place to process and save extracted results: item pipelines. Think of a pipeline as a production line: each item passes through it to be inspected, processed, packaged or stored. We typically define pipelines for the following four purposes:

  • Cleaning up the data
  • Discarding unusable data
  • Checking for duplicate data
  • Saving the data

Each pipeline is a class; it can be named whatever we like, but it must define a process_item(item, spider) method, which Scrapy calls with each extracted item so the pipeline can process it. Let's write a DoubanPipeline that drops items whose name is a blank string:

from scrapy.exceptions import DropItem

class DoubanPipeline:
    def process_item(self, item, spider):
        if item["name"]:
            return item
        else:
            raise DropItem("Discard blank movie name")

Now that we've written the pipeline, we need to tell Scrapy to use it, in settings.py:

ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}

The number 300 is the pipeline's priority: when several pipelines are enabled, those with smaller values run first (by convention the values are kept in the 0-1000 range).

Now we can modify our original crawler script:

def parse(self, response):
    films_box = response.selector.xpath("//div[@class='doulist-item']/div[@class='mod']")
    films_body = films_box.xpath("div[@class='bd doulist-subject']")
    films_name = films_body.xpath("div[@class='title']/a/text()").extract()
    for name in films_name:
        name = name.strip()
        film = Film()
        film["name"] = name
        yield film
        

In the parse function we no longer use last() in the xpath statement, nor do we check for and filter empty movie names, because the pipeline already filters them out. Run scrapy crawl douban -o douban_film.xlsx again:

As the log output shows, whenever a movie name is extracted as a blank string the item is dropped and our custom warning message is printed, and the final output file matches the previous result.

2.8 Extract each page

Now we know how to extract results from a page, but the Douban movie list we are targeting spans more than one page. To crawl the whole list, we need to know the link for each page.

Looking at the page-jump tags (1, 2, 3, ...) at the bottom of the page and the links behind them, we found that every link is identical except for the number after start=, which increases by 25 from one page to the next; the first page lists 25 movies. So for each link, the value of start= should be the page index multiplied by 25.
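
In other words, the page URLs can be generated like this (sketching just the first three pages):

# the list URL for each page only shifts start= by 25
base = "https://www.douban.com/doulist/30299/?start={}&sort=seq&playable=0&sub_type="
urls = [base.format(page * 25) for page in range(3)]
# start=0 -> page 1, start=25 -> page 2, start=50 -> page 3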

Now we know how to generate a link for every page to crawl; we also need to know when to stop. The span tag marking the current page in the pagination bar has an attribute called data-total-page, which is exactly what we need.

We know that Scrapy crawls every link in start_urls and, by default, calls parse on each response. So we put the first page's link in start_urls, extract the data-total-page value in parse, and yield scrapy.Request objects for the generated links with the callback set to parse_page.

class DouBan(scrapy.Spider):
  name = "douban"
  start_urls = ["https://www.douban.com/doulist/30299/"]

  def parse(self, response):
      page_total = response.selector.xpath("//span[@class='thispage']/@data-total-page").extract()
      if page_total:
          page_total = int(page_total[0])
          for i in range(page_total):
              url = "https://www.douban.com/doulist/30299/?start={}&sort=seq&playable=0&sub_type=".format(i*25)
              yield scrapy.Request(url, callback=self.parse_page)

  def parse_page(self, response):
      films_box = response.selector.xpath("//div[@class='doulist-item']/div[@class='mod']")
      films_body = films_box.xpath("div[@class='bd doulist-subject']")
      films_name = films_body.xpath("div[@class='title']/a/text()").extract()
      for name in films_name:
          name = name.strip()
          film = Film()
          film["name"] = name
          yield film

In addition, we have a lot of data to crawl this time. If we keep hammering the site, we may be detected; the site owner probably does not want their website hit this intensively for no good reason. To be polite, we can slow down our crawler's access to the site with settings in settings.py in the project folder:

AUTOTHROTTLE_ENABLED = True
DOWNLOAD_DELAY = 3

Among them:

  • AUTOTHROTTLE_ENABLED: if set to True, Scrapy automatically adjusts the number of concurrent requests and the download delay; the automatically chosen delay never drops below the configured DOWNLOAD_DELAY

  • DOWNLOAD_DELAY: the download delay, in seconds, between one download and the next
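
A slightly fuller settings.py sketch; the extra AUTOTHROTTLE_* entries are optional Scrapy settings and the numbers are only examples:

# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5     # initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 60      # highest delay when the site responds slowly
DOWNLOAD_DELAY = 3               # lower bound for the delay between two downloads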

2.9 Crawl all required data

We now know how to debug xpath extraction paths, define items, write spiders, process data with pipelines, and crawl every page; we have basically learned how to write a crawler with Scrapy. But remember the original goal: crawl the information of every movie (name, score, number of ratings, director, starring actors, type, country or region of production, and year), so that offline we can sort through the data at our leisure and find the movies we would like to watch.

Before writing the extraction paths for this information, we add the new fields to the original item class Film:

class Film(scrapy.Item):
  name = scrapy.Field()
  score = scrapy.Field()
  rating_users = scrapy.Field()
  director = scrapy.Field()
  starring = scrapy.Field()
  film_type = scrapy.Field()
  country_regin = scrapy.Field()
  year = scrapy.Field()

Now let’s debug the xpath extraction path for each message in the shell:

import re

films_box = response.selector.xpath("//div[@class='doulist-item']/div[@class='mod']")
films_body = films_box.xpath("div[@class='bd doulist-subject']")
# movie score
films_score = films_body.xpath("div[@class='rating']/span[@class='rating_nums']/text()").extract()
# number of rating users; keep only the digits
rating_users = films_body.xpath("div[@class='rating']/span[3]/text()").extract()
number_pattern = "[0-9]+"
rating_users = [re.search(number_pattern, i).group() for i in rating_users]
# the abstract block holds director, starring, type, country/region and year
films_abstract = films_body.xpath("div[@class='abstract']")

The abstract block holds the rest of the information we need: the director, the leading actors, the type and so on. Now let's add all of this to the original spider script:

...
import re
...

def parse_page(self, response):
    films_box = response.selector.xpath("//div[@class='doulist-item']/div[@class='mod']")
    films_body = films_box.xpath("div[@class='bd doulist-subject']")
    films_name = films_body.xpath("div[@class='title']/a/text()[last()]").extract()
    films_score = films_body.xpath("div[@class='rating']/span[@class='rating_nums']/text()").extract()
    rating_users = films_body.xpath("div[@class='rating']/span[3]/text()").extract()
    number_pattern = r"\d+"
    rating_users = [re.search(number_pattern, i).group() for i in rating_users]
    films_abstract = films_body.xpath("div[@class='abstract']")
    for idx, name in enumerate(films_name):
        name = name.strip()
        film = Film()
        film["name"] = name
        film["score"] = films_score[idx]
        film["rating_users"] = rating_users[idx]
        info_list = films_abstract[idx].xpath("text()").extract()
        info_list = [i.strip() for i in info_list]
        # each line of the abstract starts with a Chinese label on the Douban page:
        # "导演: " (director), "主演: " (starring), "类型: " (type),
        # "制片国家/地区: " (country/region), "年份: " (year)
        for s in info_list:
            if s.startswith("导演"):
                film["director"] = s[4:]
            elif s.startswith("主演"):
                film["starring"] = s[4:]
            elif s.startswith("类型"):
                film["film_type"] = s[4:]
            elif s.startswith("制片国家/地区"):
                film["country_regin"] = s[9:]
            elif s.startswith("年份"):
                film["year"] = s[4:]
        yield film

Now run scrapy crawl douban -o douban_films.xlsx again in the project folder to get an Excel file containing all the information we need.

The screenshot above shows the movies filtered from the resulting Excel file with a score greater than 8.9, sorted in descending order by number of raters, that is, the movies most people agreed are worth watching.
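
If you prefer to do the filtering in Python rather than in Excel, here is a small sketch with pandas (assuming the export file name used above and that pandas and openpyxl are installed):

import pandas as pd

df = pd.read_excel("douban_films.xlsx")
df["score"] = df["score"].astype(float)
df["rating_users"] = df["rating_users"].astype(int)

# score above 8.9, most-rated first
good = df[df["score"] > 8.9].sort_values("rating_users", ascending=False)
print(good[["name", "score", "rating_users"]].head(10))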

3. Summary

In this article we wrote our first crawler with Scrapy. Along the way we covered:

  • What a web crawler is
  • What Scrapy is
  • Installing Scrapy
  • Starting a Scrapy project
  • The structure of a newly created Scrapy project
  • Using the scrapy shell to debug extraction paths
  • What xpath is and its basic syntax
  • Writing items
  • Writing spiders
  • Scrapy file output
  • How to export Excel from Scrapy
  • Using pipelines
  • How to generate every URL we need
  • Throttling the crawler's request rate to the site
  • Crawling Douban movie information

I hope this article can serve as a helpful introduction to web crawling.