
Preface

Let's talk about Spider middleware. It's the small hours of the morning and I didn't really want to write this one, mainly because it isn't very useful. Well, that's not quite fair: I just don't use it much. Work kept pushing this post back, so I'm taking advantage of the late night to get it written.

scrapy-deltafetch implements its deduplication logic as a Spider middleware, but in my own development work I rarely use Spider middleware.

Role

Here is the same familiar architecture diagram again. Barring surprises, this is its last appearance in this series of Scrapy articles.

As the architecture diagram shows, Spider middleware sits between the Spiders and the Engine. It processes Responses on their way into a spider, and the Items and Requests a spider produces on their way out, before an Item reaches the Pipeline. The official definition is as follows:

Spider middleware is a framework of hooks into Scrapy's spider processing. You can add code to process the responses sent to spiders and the items and requests that spiders generate.

Spider middleware

When we start the crawler, Scrapy automatically activates some of the built-in Spider middleware.

As shown in the figure, five Spider middlewares are enabled for us here. Let's go through them in turn.
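For reference, the five entries can be written out as a plain dictionary. The paths and order values below are what I recall from Scrapy's default SPIDER_MIDDLEWARES_BASE setting in releases around the time of writing; treat them as an assumption and check your own startup log, since newer releases may differ.

```python
# Sketch of the default spider middleware table (assumed values, verify locally).
SPIDER_MIDDLEWARES_BASE = {
    "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": 50,
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
    "scrapy.spidermiddlewares.referer.RefererMiddleware": 700,
    "scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": 800,
    "scrapy.spidermiddlewares.depth.DepthMiddleware": 900,
}

# Lower order values sit closer to the engine, higher values closer to the spider.
ordered_names = [
    path.rsplit(".", 1)[-1]
    for path, _order in sorted(SPIDER_MIDDLEWARES_BASE.items(), key=lambda kv: kv[1])
]

for name in ordered_names:
    print(name)
```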

Built-in Spider middleware

As mentioned in the earlier downloader middleware article, most built-in middleware is driven by configuration in settings, and Spider middleware is no exception.

1. HttpErrorMiddleware

Effect: by default, filters out all unsuccessful responses, i.e. those whose status code is not in the range [200, 300).

Usage: there are several ways; here are two, pick either.

  1. Defined as an attribute in the spider
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    handle_httpstatus_list = [404]  # let 404 responses reach the callbacks

Like custom_settings, this applies only to the current spider, but it is declared as a class attribute.

  2. Global definition in settings.py
HTTPERROR_ALLOWED_CODES = [400, 404]

If you set this option through custom_settings instead, it takes effect only for the current spider, just like method 1. Now let's look at how the HttpError middleware source handles the status code.

The figure above shows the HttpError middleware source. It reads the status code from the response's status attribute and then decides whether to filter: if the code is in the interval [200, 300), the response passes straight through; otherwise the configuration is consulted before making the final decision.
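The decision described above can be sketched as a small standalone function. This is a simplification for illustration, not the middleware's actual source: the function and parameter names are made up, and per-request overrides are left out. It assumes that a spider-level handle_httpstatus_list takes precedence over HTTPERROR_ALLOWED_CODES.

```python
def passes_http_error_filter(status, spider_allowed=None, setting_allowed=(), allow_all=False):
    """Return True if a response with this status would reach the spider.

    spider_allowed  stands in for the spider's handle_httpstatus_list attribute
    setting_allowed stands in for the HTTPERROR_ALLOWED_CODES setting
    allow_all       stands in for the HTTPERROR_ALLOW_ALL setting
    """
    # Successful responses always pass.
    if 200 <= status < 300:
        return True
    # Everything passes when all status codes are allowed.
    if allow_all:
        return True
    # Assumption: the spider attribute, when present, overrides the global setting.
    allowed = spider_allowed if spider_allowed is not None else setting_allowed
    return status in allowed


print(passes_http_error_filter(404))
print(passes_http_error_filter(404, spider_allowed=[404]))
```

With no configuration a 404 is filtered; once 404 is listed by the spider or in the setting, it reaches the callback.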

2. OffsiteMiddleware

This middleware filters out all requests whose host names are not covered by the spider's allowed_domains attribute. A request that sets dont_filter is not filtered, even if its domain is not in the allowed list.
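A minimal sketch of this kind of host check, using only the standard library. The function names are hypothetical and the matching rule (exact domain or any subdomain, with an empty list allowing everything) is my assumption about the behaviour, not a copy of Scrapy's code.

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    if not allowed_domains:
        # Assumption: with no allowed domains configured, nothing is filtered.
        return True
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def should_follow(url, allowed_domains, dont_filter=False):
    """A request marked dont_filter bypasses the offsite check."""
    return dont_filter or is_onsite(url, allowed_domains)


print(should_follow("https://docs.example.com/page", ["example.com"]))
print(should_follow("https://other.org/", ["example.com"]))
print(should_follow("https://other.org/", ["example.com"], dont_filter=True))
```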

3. RefererMiddleware

Sets the Referer header of a Request based on the URL of the Response that generated it.

4. UrlLengthMiddleware

Filters out requests whose URL is longer than URLLENGTH_LIMIT, the maximum URL length allowed for crawling.

5. DepthMiddleware

Tracks the depth of each request and can be used, among other things, to limit the maximum crawl depth. DEPTH_LIMIT: the maximum depth allowed for a crawl; 0 means no limit. DEPTH_STATS: whether to collect depth statistics. DEPTH_PRIORITY: adjusts a request's priority based on its depth.
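How the depth limit and the depth-based priority interact can be sketched as follows. This is an illustrative function of my own, not Scrapy's source; it assumes the priority is reduced by depth multiplied by DEPTH_PRIORITY and that a request is dropped once its depth exceeds a non-zero DEPTH_LIMIT.

```python
def next_request_meta(parent_depth, priority=0, depth_limit=0, depth_priority=0):
    """Return (depth, priority) for a child request, or None if it is too deep.

    depth_limit and depth_priority stand in for DEPTH_LIMIT and DEPTH_PRIORITY.
    """
    depth = parent_depth + 1
    # A limit of 0 means "no limit".
    if depth_limit and depth > depth_limit:
        return None
    # A positive depth_priority lowers the priority of deeper requests.
    return depth, priority - depth * depth_priority


print(next_request_meta(0))
print(next_request_meta(2, depth_limit=2))
print(next_request_meta(1, depth_limit=2, depth_priority=1))
```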

Custom Spider middleware

Spider middleware is also defined in middlewares.py and is enabled via SPIDER_MIDDLEWARES.
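Enabling a custom Spider middleware in settings.py might look like the fragment below. The project name, class name and order value are placeholders for illustration; mapping a built-in middleware to None is the usual way to switch it off.

```python
# settings.py (fragment). "myproject" and "MyProjectSpiderMiddleware" are
# hypothetical names; replace them with your own module path and class.
SPIDER_MIDDLEWARES = {
    "myproject.middlewares.MyProjectSpiderMiddleware": 543,
    # Setting an entry to None disables that middleware.
    "scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": None,
}
```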

Let’s take a look at the custom template given by Scrapy.

As shown in the figure, the following methods are mainly needed:

  1. from_crawler: class method used to create and initialize the middleware
  2. process_spider_input: called for each response that passes through the Spider middleware on its way to the spider
  3. process_spider_output: called with the results the spider returns after processing a response
  4. process_spider_exception: called when the spider, or an earlier middleware, raises an exception
  5. process_start_requests: called with the spider's start requests; it works like process_spider_output, except that there is no associated response and it must return only requests (not items)
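The methods listed above can be put together in a minimal sketch. The class below is a made-up example (the class name and the REQUIRED_ITEM_FIELD setting are hypothetical): it drops dict items that lack a required field and passes everything else through. A Spider middleware is a plain class, so the sketch needs no Scrapy import, and the usage at the bottom runs against stand-in objects rather than a real crawler.

```python
from types import SimpleNamespace


class DropIncompleteItemsMiddleware:
    """Hypothetical spider middleware that drops dict items missing a field."""

    def __init__(self, required_field="title"):
        self.required_field = required_field

    @classmethod
    def from_crawler(cls, crawler):
        # REQUIRED_ITEM_FIELD is an invented setting name for this sketch.
        return cls(crawler.settings.get("REQUIRED_ITEM_FIELD", "title"))

    def process_spider_input(self, response, spider):
        # Returning None lets the response continue on to the spider.
        return None

    def process_spider_output(self, response, result, spider):
        for obj in result:
            if isinstance(obj, dict) and not obj.get(self.required_field):
                continue  # drop items with a missing or empty required field
            yield obj  # requests and complete items pass through unchanged

    def process_spider_exception(self, response, exception, spider):
        # Returning None leaves the exception to other middleware.
        return None

    def process_start_requests(self, start_requests, spider):
        # No response here, and only requests may be yielded.
        yield from start_requests


# Usage with stand-in objects instead of a real crawler, response and spider.
fake_crawler = SimpleNamespace(settings={"REQUIRED_ITEM_FIELD": "title"})
mw = DropIncompleteItemsMiddleware.from_crawler(fake_crawler)
results = [{"title": "ok"}, {"title": ""}, "a-request-placeholder"]
kept = list(mw.process_spider_output(None, results, None))
print(kept)
```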

For a concrete implementation, the built-in HttpErrorMiddleware is a good reference.

The difference

What is the difference between Spider middleware and downloader middleware?

  1. Spider middleware can see items, the structured containers that hold the scraped data.
  2. Spider middleware sees responses on their way into the spider and requests and items on their way out. Downloader middleware sees each request on its way to the downloader, and then the request together with its response on the way back.
  3. Spider middleware mainly deals with the response after a request has completed; downloader middleware mainly shapes the request before it is sent, for example by adding request headers or a proxy IP.

Conclusion

This article is meant as background knowledge. For Spider middleware, knowing and using the built-in ones is enough; custom ones are rarely needed.

This kind of basic theory is the most tiresome to write: you may understand it at a glance, yet it is hard to explain well. Fortunately, the hands-on part is coming soon. See you next time.