The distributed crawler is now complete and running successfully, but one step remains very cumbersome: code deployment.

Consider the following scenarios.

  • If you deploy the code by uploading files, you first compress the code, upload the archive to the server over SFTP or FTP, and then connect to the server to decompress it. This has to be repeated on every server.

  • If you use Git synchronization to deploy the code, you push the code to a Git repository and then pull it on each remote host.

Whenever the code changes, every server has to be updated, and if the version on any host gets out of sync, it can affect the whole distributed crawl.

So we need a more convenient tool for deploying Scrapy projects, one that spares us from logging in to each server and deploying over and over again.

In this section we’ll take a look at Scrapyd, a tool that provides distributed deployment.

1. Understanding Scrapyd

Scrapyd is a service for running Scrapy crawlers. It provides a set of HTTP interfaces that help you deploy, start, stop, and remove crawlers. Scrapyd also supports version management and can manage multiple crawler tasks at once, so with it we can very conveniently deploy Scrapy projects and schedule their crawl tasks.

2. Preparation

Make sure Scrapyd is properly installed on your machine or server.

3. Accessing Scrapyd

After installing and running Scrapyd, you can visit port 6800 of the server and see a web page. For example, my server address is 120.27.34.25, so I can open http://120.27.34.25:6800 in my local browser and see the Scrapyd home page (replace this with your own server address). If this page loads successfully, the Scrapyd configuration is fine.

4. Scrapyd features

Scrapyd provides a series of HTTP interfaces for various operations. In the examples below, the host of these interfaces is my Scrapyd server at 120.27.34.25.

1. daemonstatus.json

This interface is responsible for checking the current service and task status of Scrapyd. We can query it with a curl command:

curl http://139.217.26.30:6800/daemonstatus.json

We get the following result:

{"status": "ok"."finished": 90, "running": 9, "node_name": "datacrawl-vm"."pending": 0}Copy the code

The result is a JSON string: status is the service status, finished is the number of completed Scrapyd tasks, running is the number of tasks currently running, pending is the number of tasks waiting to be scheduled, and node_name is the host name.
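These interfaces can also be called from a Python script. Below is a minimal sketch using the requests library to query the same status interface; it is only an illustration, and the server address should be replaced with your own:

import requests

# Query the service and task status of a Scrapyd server
# (address taken from the example above; replace with your own)
response = requests.get('http://120.27.34.25:6800/daemonstatus.json')
status = response.json()
print(status['running'], status['pending'], status['finished'])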

2. addversion.json

This interface is primarily used to deploy Scrapy projects. We first package the project as an Egg file, then pass in the project name and deployment version.

We can implement project deployment as follows:

curl http://120.27.34.25:6800/addversion.json -F project=weibo -F version=first -F egg=@weibo.egg

In this case, -F means adding a form parameter, and the project needs to have been packaged into an Egg file locally beforehand.

After making the request, we get the following result:

{"status": "ok"."spiders": 3}Copy the code

This result indicates that the deployment succeeded and that the project contains 3 Spiders.
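The same request can be made from Python. The following is only a sketch equivalent to the curl command above, assuming a weibo.egg file has already been built locally:

import requests

# Upload a locally built Egg file; data and files mirror the -F form fields of curl
with open('weibo.egg', 'rb') as egg:
    data = {'project': 'weibo', 'version': 'first'}
    response = requests.post('http://120.27.34.25:6800/addversion.json',
                             data=data, files={'egg': egg})
print(response.json())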

This deployment method can be cumbersome, and more convenient tools to implement project deployment will be described later.

3. schedule.json

This interface is responsible for scheduling deployed Scrapy projects.

We can use the following interface to implement task scheduling:

curl http://120.27.34.25:6800/schedule.json -d project=weibo -d spider=weibocn

We need to pass in two parameters: project, the Scrapy project name, and spider, the Spider name.

The result is as follows:

{"status": "ok"."jobid": "6487ec79947edab326d6db28a2d86511e8247444"}Copy the code

status indicates that the Scrapy task was started successfully, and jobid is the identifier of the crawl task that is now running.
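If we issue the same request from Python, we can keep the returned jobid for later use, for example to cancel the task. A minimal sketch with requests:

import requests

# Start the weibocn Spider of the weibo project and record its job ID
response = requests.post('http://120.27.34.25:6800/schedule.json',
                         data={'project': 'weibo', 'spider': 'weibocn'})
jobid = response.json()['jobid']
print(jobid)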

4. cancel.json

This interface can be used to cancel a crawl task. If the task is pending, it will be removed; if it is running, it will be terminated.

We can cancel the task with the following command:

curl http://120.27.34.25:6800/cancel.json -d project=weibo -d job=6487ec79947edab326d6db28a2d86511e8247444

Two parameters are passed in: project, the project name, and job, the crawl task ID.

The result is as follows:

{"status": "ok"."prevstate": "running"}Copy the code

status indicates the execution status of the request, and prevstate is the state the task was in before it was cancelled.
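In Python, cancelling is just another POST request that sends the job ID back. A sketch that continues the scheduling example above (the job ID is the one returned by schedule.json):

import requests

# Cancel a previously scheduled task by its job ID
jobid = '6487ec79947edab326d6db28a2d86511e8247444'
response = requests.post('http://120.27.34.25:6800/cancel.json',
                         data={'project': 'weibo', 'job': jobid})
print(response.json())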

5. listprojects.json

This interface is used to list all the projects deployed to the Scrapyd service.

We can list all the projects on our Scrapyd server with the following command:

curl http://120.27.34.25:6800/listprojects.json

You don’t need to pass any arguments here.

The result is as follows:

{"status": "ok"."projects": ["weibo"."zhihu"]}Copy the code

status indicates the execution status of the request, and projects is the list of project names.

6. listversions.json

This interface is used to get all the version numbers of a project in order, with the last entry being the latest version number.

We can get the project version number with the following command:

curl http://120.27.34.25:6800/listversions.json?project=weibo

You need a parameter project, which is the name of the project.

The result is as follows:

{"status": "ok"."versions": ["v1"."v2"]}Copy the code

status indicates the execution status of the request, and versions is the list of version numbers.

7. listspiders.json

This interface is used to get all the Spider names for the latest version of a project.

We can get the project's Spider names with the following command:

curl http://120.27.34.25:6800/listspiders.json?project=weibo

You need a parameter project, which is the name of the project.

The result is as follows:

{"status": "ok"."spiders": ["weibocn"]}Copy the code

status indicates the execution status of the request, and spiders is the list of Spider names.
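The listing interfaces are handy when combined. For example, the following sketch enumerates every project on the server and the Spiders each one contains (server address as above):

import requests

base = 'http://120.27.34.25:6800'

# List every deployed project, then the Spiders of each project
projects = requests.get(base + '/listprojects.json').json()['projects']
for project in projects:
    spiders = requests.get(base + '/listspiders.json',
                           params={'project': project}).json()['spiders']
    print(project, spiders)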

8. listjobs.json

This interface is used to get the details of all tasks of a project, including pending, running, and finished tasks.

We can obtain all task details with the following command:

curl http://120.27.34.25:6800/listjobs.json?project=weibo

You need a parameter project, which is the name of the project.

The result is as follows:

{"status": "ok"."pending": [{"id": "78391cc0fcaf11e1b0090800272a6d06"."spider": "weibocn"}]."running": [{"id": "422e608f9f28cef127b3d5ef93fe9399"."spider": "weibocn"."start_time": "The 2017-07-12 10:14:03. 594664"}]."finished": [{"id": "2f16646cfcaf11e1b0090800272a6d06"."spider": "weibocn"."start_time": "The 2017-07-12 10:14:03. 594664"."end_time": "The 2017-07-12 10:24:03. 594664"}}]Copy the code

status indicates the execution status of the request, pending lists the tasks waiting to run, running lists the tasks currently running, and finished lists the completed tasks.
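A common use of this interface is to poll it until a given task is no longer pending or running. The wait_for_job helper below is only a sketch of that idea, not part of Scrapyd:

import time
import requests

def wait_for_job(base, project, jobid, interval=10):
    # Poll listjobs.json until the job leaves the pending and running lists
    while True:
        jobs = requests.get(base + '/listjobs.json',
                            params={'project': project}).json()
        active = [job['id'] for job in jobs['pending'] + jobs['running']]
        if jobid not in active:
            return
        time.sleep(interval)

wait_for_job('http://120.27.34.25:6800', 'weibo',
             '6487ec79947edab326d6db28a2d86511e8247444')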

9. delversion.json

This interface is used to delete a version of a project.

We can remove the project version with the following command:

curl http://120.27.34.25:6800/delversion.json -d project=weibo -d version=v1

You need a parameter project, which is the name of the project, and a parameter version, which is the version of the project.

The result is as follows:

{"status": "ok"}Copy the code

status is ok, which indicates that the deletion was successful.
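Together with listversions.json, this interface can be used to prune old versions while keeping the latest one. A sketch of such a cleanup, assuming the version list is ordered with the latest entry last as described earlier:

import requests

base = 'http://120.27.34.25:6800'
project = 'weibo'

# Delete every version of the project except the latest one
versions = requests.get(base + '/listversions.json',
                        params={'project': project}).json()['versions']
for version in versions[:-1]:
    requests.post(base + '/delversion.json',
                  data={'project': project, 'version': version})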

10. delproject.json

This interface is used to delete a project.

We can delete a project with the following command:

curl http://120.27.34.25:6800/delproject.json -d project=weibo

You need a parameter project, which is the name of the project.

The result is as follows:

{"status": "ok"}Copy the code

status is ok, which indicates that the deletion was successful.

These are all the interfaces Scrapyd provides. By requesting these HTTP interfaces directly, we can control project deployment, startup, and operation.

5. Using the Scrapyd API

Calling these interfaces directly may not be very convenient. Fortunately, there is also a Scrapyd API library that wraps them, as described in Chapter 1.

Here is how to use the Scrapyd API. Its underlying principle is the same as requesting the HTTP interfaces directly, but the Python wrapper makes the calls much easier.

We can create a Scrapyd API object as follows:

from scrapyd_api import ScrapydAPI
scrapyd = ScrapydAPI('http://120.27.34.25:6800')

Call its methods to implement operations on the corresponding interface, such as deployment operations, as follows:

egg = open('weibo.egg', 'rb')
scrapyd.add_version('weibo', 'v1', egg)

In this way, once we have packaged the project into an Egg file locally, we can deploy it to the remote Scrapyd service.

In addition, the Scrapyd API implements all of the interfaces provided by Scrapyd, with the same names and parameters.

For example, calling list_projects() lists all deployed projects in Scrapyd:

scrapyd.list_projects()
['weibo', 'zhihu']
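Scheduling and cancelling tasks work in the same way, continuing with the scrapyd object created above. A short sketch (the method names are those of the python-scrapyd-api wrapper; the return values are the parsed counterparts of the HTTP responses shown earlier):

# Start the weibocn Spider; schedule() returns the ID of the new job in this wrapper
job_id = scrapyd.schedule('weibo', 'weibocn')

# Inspect the project's tasks, then cancel the one we just started
print(scrapyd.list_jobs('weibo'))
scrapyd.cancel('weibo', job_id)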

The other methods are not listed one by one here; they all have the same names and arguments as the corresponding HTTP interfaces. For more detailed operations, please refer to the official documentation: http://python-scrapyd-api.readthedocs.io/.

6. Conclusion

This section introduced the use of Scrapyd and the Scrapyd API, which allow us to deploy projects and control the execution of tasks through HTTP interfaces. However, the deployment process is still a bit cumbersome, since the project first has to be packaged into an Egg file and then uploaded. In the next section, we introduce a more convenient tool to complete the deployment process.
