Package and deploy crawlers using scrapyd-client

Once the crawler code is written, you can either start the crawler directly by running its startup script, or deploy it to Scrapyd and start it through Scrapyd's HTTP API.
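For example, once a project is deployed, a spider can be started by sending a request to Scrapyd's schedule.json endpoint. A minimal sketch with curl, assuming a project named arts and a spider named art (both names are placeholders for your own project):

curl http://localhost:6800/schedule.json -d project=arts -d spider=art

Scrapyd responds with a JSON object containing the status and the id of the job it started.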

So how do you package and deploy crawler projects to Scrapyd?

Through two concrete deployment examples (to a local machine and to a cloud server), we will get familiar with Scrapy crawler packaging, the installation and use of scrapyd-client, and crawler deployment.

Packaging the crawler

The overall Scrapyd packaging and deployment process is: configure the project's scrapy.cfg, package the project into an egg with scrapyd-deploy, and upload the egg to the Scrapyd service via the addversion.json endpoint.

Preparing to package

Once you’ve written your crawler code with the Scrapy framework, you need to package the project before deploying it to Scrapyd. The official documentation describes the process:

Deploying your project involves eggifying it and uploading the egg to Scrapyd via the
addversion.json endpoint. You can do this manually, but the easiest way is to use the scrapyd-deploy tool provided by scrapyd-client which will do it all for you.

Scrapy projects need to be packaged using the scrapyd-client tool.

Scrapyd-client

scrapyd-client is a client tool for packaging Scrapy projects, also developed by the Scrapy team. It packages a project into an .egg file.

Installing scrapyd-client

Like Scrapyd, it can be installed via pip:

pip install scrapyd-client
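If you want to confirm that the installation succeeded, pip can report the installed package (a quick check, not part of the deployment itself):

pip show scrapyd-client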

Project configuration before packaging

Before packaging, we need to configure the Scrapy project. In the project root directory, locate the .cfg file (usually scrapy.cfg) and open it with an editor:

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = arts.settings

[deploy]
#url = http://localhost:6800/
project = arts


The configuration file is divided into a [settings] section and a [deploy] section. [settings] specifies the settings module the project uses, while [deploy] specifies how the project is packaged and deployed:

  • url – the target Scrapyd address to deploy to
  • project – the name of the project to package
  • deploy:<alias> – an optional alias (target name) identifying this deployment configuration

This section uses the arts project and the local Scrapyd service at localhost:6800 for the demonstration.

As you can see, the url in the .cfg file is commented out by default. Remove the comment marker and add the alias locals to the deploy section:

[settings]
default = arts.settings

[deploy:locals]
url = http://localhost:6800/
project = arts
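Before packaging, you can check that the new target is recognized: scrapyd-deploy -l lists all deploy targets defined in scrapy.cfg. The output below is only illustrative:

scrapyd-deploy -l
locals               http://localhost:6800/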

Packaging and deployment

Make sure the Scrapyd service is running, then run the following command from the root of the arts project (the same level as scrapy.cfg):

scrapyd-deploy locals -p arts

This packages the project and deploys it to the specified target service. The Scrapyd service returns the result of the request in JSON format:

node-name:arts$ scrapyd-deploy locals -p arts
Packing version 1538645094
Deploying to project "arts" in http://localhost:6800/addversion.json
Server response (200):
{"node_name": "node-name", "status": "ok", "project": "arts", "version": "1538645094", "spiders": 1}

The response contains the node name, project status, project name, the package version number, and the number of spiders the project contains; the deploy output also shows the target service address. In addition, the arts project can now be seen in Scrapyd's web interface.
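You can also confirm the deployment from the command line through Scrapyd's API, using the listprojects.json and listspiders.json endpoints (the spider names returned will depend on your own project):

curl http://localhost:6800/listprojects.json
curl http://localhost:6800/listspiders.json?project=arts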

Questions to consider

Is the deploy name mandatory in scrapy.cfg? What happens if you don’t set it? Can there be more than one [deploy] section?

We can answer these questions with a few hands-on experiments.

If no deploy name is set

As you can see, if the [deploy] section has no name, the project can still be packaged without specifying a target name on the command line, as shown below.
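A minimal sketch of this case: with the unnamed [deploy] section from the original scrapy.cfg (its url uncommented), the project is packaged using the default target:

scrapyd-deploy -p arts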

If multiple deploy configurations are required

Add two [deploy] sections to the scrapy.cfg file: one with no name and a url pointing to the local Scrapyd, and another named servers with a url pointing to the server's Scrapyd. The scrapy.cfg then looks like this:

[settings]
default = arts.settings

[deploy]
url = http://localhost:6800/
project = arts

[deploy:servers]
url = http://192.168.0.61:6800/
project = arts

As you can see, multiple [deploy] sections are allowed, and the deploy name is used to distinguish them, as shown below.
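With multiple targets configured, the target name on the command line decides which Scrapyd instance receives the package. A sketch, using the server address from the configuration above:

# deploy to the local Scrapyd (unnamed default target)
scrapyd-deploy -p arts

# deploy to the remote Scrapyd aliased as servers
scrapyd-deploy servers -p arts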

Summary

Through this Scrapy project deployment example, we learned how to install and use scrapyd-client, how to configure the scrapy.cfg file before packaging, and successfully packaged and deployed a Scrapy project to the target service.

Not enough?

Does the GIF-based teaching style feel easier to understand and more engaging?

More crawler deployment knowledge and Scrapyd customization tips are waiting for you. Click here to see the Juejin (Nuggets) booklet, and let's build a crawler deployment console with access control together!
