Scrapyrt provides an HTTP interface for scheduling Scrapy crawls. Instead of launching Scrapy from the command line, we can schedule crawl tasks by sending requests to this HTTP interface, so no shell access is needed to start a project. If the project is deployed on a remote server, this is a convenient way to run it.

I. Objectives of this section

We will use the Scrapy project introduced earlier in this chapter as an example to illustrate how to use Scrapyrt. The project source code is available at https://github.com/Python3WebSpider/ScrapyTutorial.

II. Preparation

Make sure Scrapyrt is installed correctly and runs properly.

III. Starting the service

First, download the project and run Scrapyrt from the project directory; we assume the service is running on port 9080.
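As a minimal sketch of that setup (launching Scrapyrt from Python via the subprocess module, and the ScrapyTutorial directory name, are illustrative assumptions rather than part of the original tutorial):

# Illustrative sketch: launch Scrapyrt from the Scrapy project directory
# (the tutorial assumes the service ends up listening on port 9080).
# subprocess.run() blocks for as long as the service keeps running.
import subprocess

subprocess.run(["scrapyrt"], cwd="ScrapyTutorial")

With the service running, here is how to use Scrapyrt.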

IV. GET requests

Currently, the GET request supports the following parameters; a short Python sketch that uses them follows the list.

  • spider_name: the Spider name, string, mandatory. If the specified Spider does not exist, a 404 error is returned.

  • url: the URL to crawl, string. It must be passed if no start URLs are defined in the Spider. If it is passed, Scrapy generates a Request directly from this URL and ignores the start_requests() method and the start_urls attribute.

  • callback: the name of the callback function, string, optional. If passed, this callback is used; otherwise the callback defined in the Spider is used by default.

  • max_requests: the maximum number of requests, integer, optional. It caps the number of requests Scrapy will execute; for example, if it is set to 5, at most five requests are executed and the rest are ignored.

  • start_requests: whether to execute the start_requests() method, boolean, optional. In an ordinary Scrapy project, the start_requests() method defined in a Spider is invoked by default when the crawl starts. Scrapyrt, however, does not execute start_requests() by default; to run it, set the start_requests parameter to true.
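Before the curl example below, here is a minimal Python sketch of how these parameters can be passed to the interface (it uses the third-party requests library and is an illustration, not code from the original tutorial):

import requests

# Illustrative sketch: schedule the "quotes" spider via Scrapyrt's GET interface.
params = {
    "spider_name": "quotes",               # mandatory: name of the Spider
    "url": "http://quotes.toscrape.com/",  # start URL for this crawl
    "callback": "parse",                   # optional: callback method name
    "max_requests": 5,                     # optional: cap on executed requests
}
resp = requests.get("http://localhost:9080/crawl.json", params=params)
print(resp.status_code)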

We execute the following command:

curl "http://localhost:9080/crawl.json?spider_name=quotes&url=http://quotes.toscrape.com/"

The result is a JSON-formatted string; its structure is shown below:

{
  "status": "ok",
  "items": [
    {
      "text": "The world as we have created it is a process of o...",
      "author": "Albert Einstein",
      "tags": [
        "change",
        "deep-thoughts",
        "thinking",
        "world"
      ]
    },
    ...
    {
      "text": "“... a mind needs books as a sword needs a whetsto...",
      "author": "George R.R. Martin",
      "tags": [
        "books",
        "mind"
      ]
    }
  ],
  "items_dropped": [],
  "stats": {
    "downloader/request_bytes": 2892,
    "downloader/request_count": 11,
    "downloader/request_method_count/GET": 11,
    "downloader/response_bytes": 24812,
    "downloader/response_count": 11,
    "downloader/response_status_count/200": 10,
    "downloader/response_status_count/404": 1,
    "dupefilter/filtered": 1,
    "finish_reason": "finished",
    "finish_time": "2017-07-12 15:09:02",
    "item_scraped_count": 100,
    "log_count/DEBUG": 112,
    "log_count/INFO": 8,
    "memusage/max": 52510720,
    "memusage/startup": 52510720,
    "request_depth_max": 10,
    "response_received_count": 11,
    "scheduler/dequeued": 10,
    "scheduler/dequeued/memory": 10,
    "scheduler/enqueued": 10,
    "scheduler/enqueued/memory": 10,
    "start_time": "2017-07-12 15:08:56"
  },
  "spider_name": "quotes"
}

Most of the items are omitted here. The items field contains the Items scraped by Scrapy, items_dropped is the list of Items that were dropped, and stats holds the crawl statistics. This result is the same as what we get when running the Scrapy project directly.
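As a continuation of the earlier Python sketch (again illustrative rather than tutorial code), these fields can be read straight from the parsed response:

# Illustrative sketch: consume the JSON returned by Scrapyrt,
# reusing resp from the requests.get() call shown earlier.
data = resp.json()

for item in data["items"]:                      # scraped Scrapy Items
    print(item["author"], "-", item["text"])

print("dropped items:", len(data["items_dropped"]))
print("item_scraped_count:", data["stats"].get("item_scraped_count"))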

In this way, we can schedule our Scrapy project through an HTTP interface and obtain the crawl results. If the Scrapy project is deployed on a server, we can simply run a Scrapyrt service to schedule crawl tasks and fetch the results directly.

V. POST requests

In addition to GET requests, we can also call Scrapyrt with POST requests. In this case the request body must be a valid JSON configuration in which the parameters are specified, and more configuration options are supported this way.

Currently, the JSON configuration supports the following parameters.

  • spider_name: the Spider name, string, mandatory. If the specified Spider does not exist, a 404 error is returned.

  • max_requests: the maximum number of requests, integer, optional. It caps the number of requests Scrapy will execute; for example, if it is set to 5, at most five requests are executed and the rest are ignored.

  • request: the Request configuration, JSON object. Its fields map to the parameters of a Scrapy Request; the url field specifying the URL to crawl is required, while the other fields are optional.

Let’s look at an example JSON configuration:

{
    "request": {
        "url": "http://quotes.toscrape.com/",
        "callback": "parse",
        "dont_filter": "True",
        "cookies": {
            "foo": "bar"
        }
    },
    "max_requests": 2,
    "spider_name": "quotes"
}

We execute the following command to pass the JSON configuration and issue a POST request:

curl http://localhost:9080/crawl.json -d '{"request": {"url": "http://quotes.toscrape.com/", "dont_filter": "True", "callback": "parse", "cookies": {"foo": "bar"}}, "max_requests": 2, "spider_name": "quotes"}'
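The same request can be issued from Python; the sketch below (using the requests library, as an illustration rather than code from the original tutorial) posts the JSON configuration shown above:

import requests

# Illustrative sketch: schedule the spider via Scrapyrt's POST interface
# by sending the JSON configuration as the request body.
payload = {
    "request": {
        "url": "http://quotes.toscrape.com/",
        "callback": "parse",
        "dont_filter": "True",
        "cookies": {"foo": "bar"},
    },
    "max_requests": 2,
    "spider_name": "quotes",
}
resp = requests.post("http://localhost:9080/crawl.json", json=payload)
print(resp.json()["status"])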

The result is similar to the one above, with the same crawl status, scraped items, statistics, and other content in the output.

VI. Conclusion

This section introduced Scrapyrt, which allows us to easily schedule runs of a Scrapy project and retrieve the crawl results. For more usage details, refer to the official documentation: http://scrapyrt.readthedocs.io.

This article first appeared on Cui Qingcai's personal blog: Python3 Web Crawler Development Tutorial.
