The previous two articles introduced the use of downloader middleware. This article introduces the use of spider middleware (referred to below as crawler middleware).

Crawler middleware

The use of crawler middleware is very similar to that of downloader middleware; they simply act on different objects. Downloader middleware acts on requests and responses: it processes a request before it is sent and the response after it comes back. Crawler middleware acts on the crawler itself, more specifically on the code written in the files under the spiders folder. Their relationship can be nicely delineated on Scrapy's data flow diagram, as illustrated in the following figure.

In the diagram, 4 and 5 correspond to the downloader middleware, and 6 and 7 correspond to the crawler middleware. Crawler middleware is invoked in the following situations (a skeleton sketch of the corresponding methods follows the list).

  1. When the crawler runs yield scrapy.Request() or yield item, the crawler middleware's process_spider_output() method is called.
  2. When the crawler's own code raises an Exception, the crawler middleware's process_spider_exception() method is called.
  3. Before a callback function parse_xxx() in the crawler is called, the crawler middleware's process_spider_input() method is called.
  4. When start_requests() runs, the crawler middleware's process_start_requests() method is called.
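
For orientation, here is a minimal, do-nothing skeleton of a crawler middleware showing where each of the four methods listed above would be implemented. The class name is only a placeholder and is not from the book.

class ReferenceSpiderMiddleware(object):
    # A placeholder skeleton: every hook simply passes data through unchanged.

    def process_start_requests(self, start_requests, spider):
        # Called with the requests produced by start_requests()
        for request in start_requests:
            yield request

    def process_spider_input(self, response, spider):
        # Called before the response reaches a parse_xxx() callback
        return None

    def process_spider_output(self, response, result, spider):
        # Called with everything the callback yields (items and requests)
        for item_or_request in result:
            yield item_or_request

    def process_spider_exception(self, response, exception, spider):
        # Called when the spider's own code raises an exception
        return None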

Handling the crawler's own exceptions in middleware

Crawler middleware can handle exceptions raised by the crawler itself. As an example, write a crawler that scrapes the UA practice page exercise.kingname.info/exercise_mi… and deliberately create an exception in the crawler, as shown in Figure 12-26.

Because the website returns a plain string rather than a JSON-formatted string, parsing it as JSON is bound to raise an error. This type of error is different from those encountered in the downloader middleware. Errors in the downloader middleware are usually caused by external factors and have nothing to do with the code itself, whereas this kind of error comes from the code itself, namely code that does not cover every situation.
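
As a minimal sketch (the spider name and field name are assumptions, not the book's code), the error typically comes from a callback like this, where json.loads() is applied to a plain-text body and raises json.JSONDecodeError:

import json

import scrapy


class UaExerciseSpider(scrapy.Spider):
    name = 'ua_exercise'  # placeholder name

    def parse(self, response):
        # The practice page returns a plain string, so this call raises
        # json.JSONDecodeError and the crawler middleware gets involved.
        data = json.loads(response.text)
        yield {'content': data}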

To solve this problem, in addition to carefully examining the code and considering every possible situation, you can also develop crawler middleware to skip or handle such errors. Write a class in middlewares.py:

class ExceptionCheckSpider(object):

    def process_spider_exception(self, response, exception, spider):
        print(f'Returned content: {response.body.decode()}\nError cause: {type(exception)}')
        return None

This class serves only a logging purpose: if the content returned by the site fails to be parsed as JSON, it prints out that content.

The process_spider_exception() method can return None, run a yield item statement, or, just like crawler code, use yield scrapy.Request() to initiate a new request. If it yields an item or a request, the program bypasses the original code in the crawler.
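
For instance, here is a hedged sketch of a middleware that abandons the failed callback and yields a fallback request instead; the middleware name and fallback URL are placeholders, not from the book.

import scrapy


class FallbackOnErrorSpiderMiddleware(object):

    def process_spider_exception(self, response, exception, spider):
        # Whatever is yielded here replaces the output the broken callback
        # would have produced, so the original crawler code is bypassed.
        yield scrapy.Request(
            url='http://example.com/fallback',  # placeholder URL
            callback=spider.parse,
            dont_filter=True,
        )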

For example, for requests that raise exceptions there is no need to retry, but it is necessary to record which requests raised them. In this case, the exception can be detected in the crawler middleware and an item containing only a marker can be generated. Again take crawling the practice page exercise.kingname.info/exercise_mi… as an example, but this time do not retry; simply note which page has the problem. Look at the crawler code, which this time carries the page number in meta, as shown below.
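
The crawler code itself appears in the book as a figure. A reconstructed sketch of the idea (the class name, the custom exception, the page range and the stand-in for the truncated URL are all assumptions) might look like this:

import scrapy


class ParamCheckError(Exception):
    # Hypothetical custom exception for a parameter that looks wrong
    pass


class ExercisePageSpider(scrapy.Spider):
    name = 'exercise_page'  # placeholder name
    # The real practice-page URL is truncated in the text; this is a stand-in.
    base_url = 'http://exercise.kingname.info/exercise_mi...'

    def start_requests(self):
        for page in range(1, 11):
            # Carry the page number in meta so the middleware can record it later
            yield scrapy.Request(f'{self.base_url}/{page}',
                                 callback=self.parse,
                                 meta={'page': page})

    def parse(self, response):
        if not response.text.strip():
            # Manually raise the custom exception with the raise keyword
            raise ParamCheckError(f"empty body for page {response.meta['page']}")
        yield {'page': response.meta['page'], 'content': response.text}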

If the crawler finds a parameter error, it manually raises a custom exception with the raise keyword. In actual crawler development, the reader can also deliberately not use try…except to catch exceptions in some places, and instead let them be thrown directly. For example, when reading the result of an XPath match, you can take the value directly without first checking whether the list is empty. If the list is empty, an IndexError will be thrown, which lets the crawler's flow enter process_spider_exception() in the crawler middleware.

An ErrorItem is created in items.py to record which page has a problem, as shown below.
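
The item definition is shown in the book as a figure; a plausible reconstruction in items.py (the field names are assumptions) is:

import scrapy


class ErrorItem(scrapy.Item):
    page = scrapy.Field()  # which page caused the problem
    time = scrapy.Field()  # when the problem happened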

Next, in the crawler middleware, the error page number and the current time are stored in the ErrorItem, which is then submitted to the pipeline and saved in MongoDB, as shown in the figure below.
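
A hedged sketch of that middleware logic, reusing the ErrorItem fields assumed above (the project import path is a placeholder):

import datetime

from myproject.items import ErrorItem  # adjust to your project's module name


class ExceptionCheckSpider(object):

    def process_spider_exception(self, response, exception, spider):
        # Record which page failed and when, then hand the item to the pipeline
        item = ErrorItem()
        item['page'] = response.meta.get('page')
        item['time'] = datetime.datetime.now()
        yield item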

In this way, the page numbers that caused errors are recorded, which makes it easier to analyze the cause of the errors later. Since the item is submitted to the pipeline here, don't forget to enable the pipeline in settings.py and configure MongoDB. The code for storing the error pages to MongoDB is shown below.
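
The storage code appears only as a figure in the book; a minimal pipeline sketch using pymongo (the connection string, database and collection names are assumptions) could look like this:

import pymongo


class ErrorPagePipeline(object):

    def open_spider(self, spider):
        # Placeholder connection string, database and collection names
        self.client = pymongo.MongoClient('mongodb://localhost:27017')
        self.collection = self.client['spider_db']['error_pages']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item

Remember to register this class in ITEM_PIPELINES in settings.py so the item actually reaches it.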

Activating crawler middleware

Crawler middleware is activated in much the same way as downloader middleware. In settings.py, the configuration item for crawler middleware sits above the one for downloader middleware and is also commented out by default.
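
For example, uncommenting and editing that setting might look like the following (the project and class names are placeholders):

# settings.py
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.ExceptionCheckSpider': 543,
}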

Scrapy also has several built-in crawler middleware classes; their names and order are shown below.
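
The figure is not reproduced here, but for reference, the built-in defaults (SPIDER_MIDDLEWARES_BASE) in the Scrapy versions this book targets are roughly the following; check the default_settings.py of your installed version to confirm:

SPIDER_MIDDLEWARES_BASE = {
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
}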

The smaller a crawler middleware's number, the closer it is to the Scrapy engine; the larger the number, the closer it is to the crawler. If you are not sure where your custom middleware should go, choose a number between 500 and 700.

Crawler middleware input/output

There are two less commonly used methods in crawler middleware: process_spider_input(response, spider) and process_spider_output(response, result, spider). process_spider_input(response, spider) is called immediately after the downloader middleware has finished processing, before the response enters a callback function such as parse_xxx(). process_spider_output(response, result, spider) is called when the crawler runs yield item or yield scrapy.Request(). After this method finishes processing, if the data is an item, it is passed to the pipeline; if it is a request, it is handed to the scheduler, and then the downloader middleware starts running. So you can make further changes to the item or request in this method. The result argument contains the items or scrapy.Request() objects yielded by the crawler. Since yield produces a generator, and generators are iterable, result is also iterable and can be expanded with a for loop.

import scrapy


def process_spider_output(response, result, spider):
    for item in result:
        if isinstance(item, scrapy.Item):
            # Here you can do various things to the item before it is submitted to the pipeline
            print('This item will be submitted to the pipeline')
        yield item

Or to monitor and modify requests:

import time

import scrapy


def process_spider_output(response, result, spider):
    for request in result:
        if not isinstance(request, scrapy.Item):
            # Here you can make various changes to the request
            print('Now you can also modify the request object...')
            request.meta['request_start_time'] = time.time()
        yield request

This article is excerpted from my new book Python Crawler Development: From Beginner to Practice. The full table of contents can be found on JD.com: item.jd.com/12436581.ht…

Whether or not you buy the book is not important. What matters is following my official WeChat account: 未闻Code.

The account has been updated daily for more than three consecutive months, and it will keep up the daily updates for a long time to come.