
Preface

Middleware is, as the name suggests, software that sits in the middle: it processes requests on the way out (for example, adding a proxy IP or request headers) and processes responses on the way back.

This article focuses on the concept of downloader middleware, how to use the built-in middlewares, and how to write custom ones.

MiddleWare classification

It’s still the familiar architecture diagram.

From the figure, middleware can be divided into two categories:

  1. Downloader MiddleWare
  2. Spider MiddleWare

This article mainly introduces the downloader middleware, first look at the official definition:

The downloader middleware is a framework of hooks into Scrapy's request/response processing. It is a light, low-level system for globally altering Scrapy's requests and responses.

Role

As described in the architecture diagram, the downloader middleware is located between the Engine and the downloader. When the engine sends an unprocessed request to the downloader, it passes through the downloader middleware, where the request can be wrapped, such as modifying the request header (UA, cookie, etc.) and adding the proxy IP.

When the downloader sends the website's response back to the engine, it also passes through the downloader middleware, where the response content can be processed.

Built-in downloader middleware

Scrapy ships with a lot of built-in downloader middleware for developers to use. When we start a Scrapy crawler, Scrapy enables these middlewares automatically. As shown in the figure:

This is the log the console prints when we start a Scrapy program. We can see that Scrapy enabled quite a few downloader middlewares and spider middlewares.

Here’s a look at how the built-in middleware works.

RetryMiddleware

The built-in middleware actually works together with the configuration in Settings. Take RetryMiddleware as an example: when a request fails, the RETRY_ENABLED and RETRY_TIMES settings enable the retry policy and determine the number of retries. That's all there is to it!
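A minimal sketch of what that looks like in settings.py (the values here are illustrative; RETRY_HTTP_CODES is a related built-in setting that lists the response codes worth retrying):

# settings.py -- illustrative retry configuration
RETRY_ENABLED = True                # enable the retry policy (on by default)
RETRY_TIMES = 3                     # retry a failed request up to 3 times
RETRY_HTTP_CODES = [500, 502, 503]  # HTTP codes that trigger a retry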

Where can I find the mapping between Settings and middleware?

Here I do it in two ways:

  1. Go to the official documentation (the link is in the previous article)
  2. Read the source code comments; each middleware has a corresponding .py file under the scrapy package

The comments spell this out clearly, and you can also see which settings the code reads.
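For example, here is a paraphrased sketch (not the exact source, but close to it) of how RetryMiddleware reads those settings when it is initialized:

from scrapy.exceptions import NotConfigured

class RetryMiddleware:
    def __init__(self, settings):
        if not settings.getbool('RETRY_ENABLED'):
            raise NotConfigured  # middleware disabled via settings
        self.max_retry_times = settings.getint('RETRY_TIMES')
        self.retry_http_codes = set(
            int(x) for x in settings.getlist('RETRY_HTTP_CODES')
        )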

Custom middleware

Sometimes the built-in middleware doesn't meet our needs, so we have to go our own way and write custom middleware. All custom middleware is defined in middlewares.py.

Opening middlewares.py, we find that a downloader middleware and a spider middleware have already been generated automatically.

First look at the self-generated downloader middleware template:
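For reference, the generated template looks roughly like this (the class name depends on your project name; this sketch assumes a project called ScrapyDemo):

from scrapy import signals

class ScrapyDemoDownloaderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to create the middleware and hook up signals
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)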

As you can see, there are five main methods:

  1. from_crawler: a class method used to initialize the middleware
  2. process_request: called for every request that passes through the downloader middleware, corresponding to step 4 in the architecture diagram
  3. process_response: processes the response content returned by the downloader, corresponding to step 7 in the architecture diagram
  4. process_exception: called when the downloader or a process_request() method raises an exception
  5. spider_opened: a callback bound to the spider_opened signal; we can ignore it here

I will focus on methods 2 and 3, and briefly look at method 4.

process_request()

This method takes two arguments:

  1. request: the request, initiated by the spider, that is about to be processed
  2. spider: the spider that this request belongs to
def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        # installed downloader middleware will be called
        return None

The body above is just the template comments, but they make the key point: this method must return one of a fixed set of values.

  1. None: the usual return value. It means the request continues on to be processed by the next middleware.
  2. request: stops the remaining process_request calls and reschedules the returned request back onto the queue
  3. response: skips the downloader; process_response is executed directly, without calling the remaining process_request methods

You can also raise IgnoreRequest, but in practice you will return None almost all the time. The other cases are left for you to explore if you are interested.
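To make the response case concrete, here is a hedged sketch (class and attribute names are hypothetical) of a middleware that returns a cached Response from process_request, skipping the downloader entirely:

from scrapy.http import HtmlResponse

class LocalCacheMiddleware(object):
    # Hypothetical in-memory cache: url -> page body (bytes)
    cache = {}

    def process_request(self, request, spider):
        body = self.cache.get(request.url)
        if body is not None:
            # Returning a Response skips the remaining process_request
            # methods and the downloader; process_response still runs.
            return HtmlResponse(url=request.url, body=body, encoding='utf-8')
        return None  # continue down the middleware chain as usual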

process_response()

This method takes three arguments:

  1. request: the request corresponding to this response
  2. response: the response to be processed
  3. spider: the spider that this response belongs to
def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

Again, looking at the comments, this method must return one of two values:

  1. response: the response content returned by the downloader, processed in turn by the process_response of each middleware
  2. request: stops the remaining process_response calls; the response never reaches the spider, and the returned request is rescheduled back onto the queue

Remember, just return response.
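As a hedged sketch (hypothetical class name and status codes), here is a process_response that sends a request back onto the queue when the site returns an anti-crawler page:

class BlockedPageMiddleware(object):

    def process_response(self, request, response, spider):
        if response.status in (403, 429):
            # Returning a Request stops the process_response chain;
            # the response never reaches the spider and the request
            # is rescheduled. dont_filter avoids the duplicate filter.
            return request.replace(dont_filter=True)
        return response  # the normal case: pass the response along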

process_exception()

def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

This method is entered when the downloader or a process_request() method raises an exception. It must return one of three values; they behave much like the ones above, and None is the usual choice.
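A hedged sketch (hypothetical class name; the proxy address is a placeholder) of a process_exception that retries a timed-out request through a new proxy:

from twisted.internet.error import TimeoutError as NetTimeoutError

class ProxyOnTimeoutMiddleware(object):

    def process_exception(self, request, exception, spider):
        if isinstance(exception, NetTimeoutError):
            request.meta['proxy'] = 'http://ip:port'  # placeholder proxy
            return request  # returning a Request stops the exception chain
        return None  # let other middlewares handle the exception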

Enabling and disabling middleware

A custom middleware sometimes duplicates the job of a built-in one, and the two can interfere with each other. In that case we can disable the built-in middleware in the configuration.

For example, I prefer to set the User-Agent in a custom middleware, but Scrapy has UserAgentMiddleware built in, which conflicts with it: if the built-in middleware has a lower priority and runs later, its UA overwrites the custom one. So we need to disable the built-in UA middleware.

Downloader middleware is configured with the DOWNLOADER_MIDDLEWARES setting. The key is the import path of the middleware and the value is its execution priority: the smaller the number, the higher the priority. Setting the value to None disables that middleware.

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Disable the default UserAgent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # Enable the custom middleware
    'ScrapyDemo.middlewares.VideospiderDownloaderMiddleware': 543,
}

In this way, the built-in UA middleware is disabled.

Call priority

The other thing we need to be clear about is that middleware is called as a chain: a request passes through each middleware in order of priority, and so does the response.

As mentioned above, each middleware has an execution priority, and the smaller the number, the higher the priority. Suppose middleware 1 has priority 200 and middleware 2 has priority 300.

When a spider issues a request, it is handled first by process_request of middleware 1, then by the same method of middleware 2, and so on through all the middlewares, until it finally reaches the downloader, which requests the website and comes back with a response.

process_response works in reverse: the response goes through this method in middleware 2 first, then middleware 1, and is finally returned to the spider for the developer to handle.
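A minimal sketch (hypothetical names) that makes this order visible by logging from two middlewares:

class MiddlewareOne(object):

    def process_request(self, request, spider):
        spider.logger.info('M1 process_request')   # 1st on the way out
        return None

    def process_response(self, request, response, spider):
        spider.logger.info('M1 process_response')  # 2nd on the way back
        return response


class MiddlewareTwo(object):

    def process_request(self, request, spider):
        spider.logger.info('M2 process_request')   # 2nd on the way out
        return None

    def process_response(self, request, response, spider):
        spider.logger.info('M2 process_response')  # 1st on the way back
        return response

# settings.py (sketch):
# DOWNLOADER_MIDDLEWARES = {
#     'ScrapyDemo.middlewares.MiddlewareOne': 200,
#     'ScrapyDemo.middlewares.MiddlewareTwo': 300,
# }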

Practice

Here we define a custom downloader middleware that adds a User-Agent to each request.

Custom middleware

Define a middleware in middlewares.py:

class CustomUserAgentMiddleWare(object):

    def process_request(self, request, spider):
        request.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
        return None

    def process_response(self, request, response, spider):
        print(request.headers['User-Agent'])
        return response

Enabling middleware

For simplicity, we don't change the global configuration in settings.py; instead we use custom_settings locally in the spider code.

import scrapy

class DouLuoDaLuSpider(scrapy.Spider):
    name = 'DouLuoDaLu'
    allowed_domains = ['v.qq.com']
    start_urls = ['https://v.qq.com/detail/m/m441e3rjq9kwpsc.html']

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            # Disable the default UserAgent middleware
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            # Enable the custom middleware
            'ScrapyDemo.middlewares.CustomUserAgentMiddleWare': 400,
        }
    }

    def parse(self, response):
        pass

First the default UA middleware is disabled, then the custom UA middleware is enabled. I also set a breakpoint on the last line and run in Debug mode to check whether the UA was set successfully.

The test results

Start the program in Debug mode, with the custom UA middleware disabled (commented out) for now.

As you can see, the request UA is the default 'Scrapy'. Now we remove the comment, enable the custom UA middleware, and run the test again.

As shown in the figure, the UA of the request has become the UA that I set in the middleware.

Setting the Proxy IP address

Again, the proxy IP is set in the process_request method.

The code is as follows:

request.meta["proxy"] = 'http://ip:port'
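Wrapped in a middleware, a minimal sketch looks like this (the address is still a placeholder and the class name is hypothetical):

class ProxyMiddleware(object):

    def process_request(self, request, spider):
        # Scrapy's HTTP downloader picks the proxy up from request.meta
        request.meta['proxy'] = 'http://ip:port'  # placeholder address
        return None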

Conclusion

The main function of the downloader middleware is to wrap requests. My own custom downloader middlewares are used to set the UA dynamically and to detect and switch proxy IPs in real time. For other scenarios, the built-in downloader middlewares are basically sufficient.

Of course, you can develop Scrapy crawlers without ever touching downloader middleware, but downloader middleware will make your crawler even better.

Originally I wanted to cover downloader middleware and spider middleware in a single article, but the material is fragmented, hard to lay out, and easy to confuse, so spider middleware is left for the next article. Looking forward to the next encounter.