We used scrapyd-client to deploy our Scrapy project to Scrapyd, but that required us to install Scrapyd on the server in advance and keep the Scrapyd service running, which is a bit of a hassle. If we want to deploy the same Scrapy project to 100 servers at once, do we really have to configure the Python environment and adjust the Scrapyd configuration on every server by hand? And if those servers run different Python versions, or already host other projects, version conflicts can cause unnecessary trouble.

So we need to solve a pain point: Python environment configuration and version conflicts. If we package Scrapyd directly into a Docker image, we only need to run a Docker command on each server to start the Scrapyd service, without worrying about the Python environment or about version conflicts.

Next, we’ll package Scrapyd into a Docker image.

1. Preparation

Please make sure Docker is properly installed on the machine.

2. Integrating with Docker

Create a new project folder and, inside it, a new scrapyd.conf file with the following contents:

[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 5
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 10
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
http_port   = 6800
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

This is essentially the example configuration file from the official documentation (https://scrapyd.readthedocs.io/en/stable/config.html#example-configuration-file) with two changes:

  • max_proc_per_cpu = 10: the default is 4, meaning a single CPU core can run at most 4 Scrapy tasks at a time; here we raise it to 10.

  • bind_address = 0.0.0.0: the default is 127.0.0.1, which only allows local access; setting it to 0.0.0.0 allows the service to be reached from outside the container.
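If you later want to adjust these settings without rebuilding the image (which we build further below), one option is to mount a modified scrapyd.conf over the copy baked into the image when starting the container. A minimal sketch, assuming the image is tagged scrapyd:latest as in the build step later in this article:

docker run -d -p 6800:6800 \
  -v $(pwd)/scrapyd.conf:/etc/scrapyd/scrapyd.conf \
  scrapyd:latest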

Create a requirements.txt file and list the libraries your Scrapy projects depend on, as follows:

requests
selenium
aiohttp
beautifulsoup4
pyquery
pymysql
redis
pymongo
flask
django
scrapy
scrapyd
scrapyd-client
scrapy-redis
scrapy-splash

If your Scrapy projects need additional libraries, you can add them to this file yourself.
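If you want the build to be reproducible across all your servers, you can also pin each library to a specific version in this file. The version numbers below are only hypothetical placeholders for illustration; use whatever versions your projects actually need:

scrapy==2.5.1
scrapyd==1.2.1
scrapyd-client==1.1.0
requests==2.25.1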

Create a new Dockerfile with the following contents:

FROM python:3.6
ADD . /code
WORKDIR /code
COPY ./scrapyd.conf /etc/scrapyd/
EXPOSE 6800
RUN pip3 install -r requirements.txt
CMD scrapyd

The FROM in the first line means the image is built on top of the python:3.6 image, so a Python 3.6 environment is already available at build time.

The ADD in the second line puts the local code into the container. It takes two arguments: the first is ".", the current local path; the second is /code, the path inside the container. This places the entire contents of the local project into the container's /code directory.

The WORKDIR in the third line specifies the working directory, which we set to the code path just added. The directory structure under this path matches the local project structure, so we can run the library installation command directly from this directory.

The COPY in the fourth line copies the scrapyd.conf file from the current directory to the container's /etc/scrapyd/ directory, which is where Scrapyd reads its configuration from by default.

The EXPOSE in the fifth line declares the port on which the container provides its service. Note that this is only a declaration; a running container does not necessarily open a service on this port. The declaration tells users which port the image's service listens on, making port mappings easier to configure, and when the container is run with random port mapping (docker run -P), the EXPOSEd ports are mapped to random host ports automatically.

The RUN in the sixth line executes commands, generally for environment preparation. The container only contains a Python 3 environment and none of the Python libraries we need, so we run this command to install them into the image, which also allows projects to be deployed to Scrapyd later.

CMD on line 7 is the container start command, which is executed when the container is running. Here we start the Scrapyd service directly with scrapyd.

With the basic work done, we run the following command to build:

docker build -t scrapyd:latest .

After the build succeeds, run the image to test it:

docker run -d -p 6800:6800 scrapyd

Open http://localhost:6800 in your browser and you will see the Scrapyd web page.
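Besides the web page, you can also verify the service through Scrapyd's JSON API, for example the daemonstatus.json endpoint declared in the configuration above. A quick check might look like this (the exact field values will differ on your machine):

curl http://localhost:6800/daemonstatus.json
# Example response:
# {"node_name": "...", "status": "ok", "pending": 0, "running": 0, "finished": 0}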

The Scrapyd Docker image is built and running successfully.
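As an aside, because the Dockerfile declares EXPOSE 6800, you can also let Docker pick a random host port instead of mapping 6800 explicitly. A sketch (the container name scrapyd-test is arbitrary):

docker run -d -P --name scrapyd-test scrapyd:latest
# Ask Docker which host port was mapped to the container's 6800:
docker port scrapyd-test 6800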

We can upload this image to Docker Hub. For example, my Docker Hub username is germey and I create a repository called scrapyd, so I tag the image like this:

docker tag scrapyd:latest germey/scrapyd:latest

Replace germey/scrapyd with your own username and repository name.
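If you have not logged in to Docker Hub on this machine yet, log in first:

docker login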

Then push it to Docker Hub:

docker push germey/scrapyd:latest

Then run this command on another host to start the Scrapyd service:

docker run -d -p 6800:6800 germey/scrapyd

Scrapyd runs successfully on other servers.
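At this point, deploying a project to the remote Scrapyd works just as it did with scrapyd-client before; only the target URL in scrapy.cfg changes. A minimal sketch, where your-server-ip, the target name docker, and myproject are placeholders for your own values:

# In your Scrapy project's scrapy.cfg:
[deploy:docker]
url = http://your-server-ip:6800/
project = myproject

# Then run from the project directory:
scrapyd-deploy docker -p myproject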

3. Conclusion

We solved the Python environment problem with Docker. Next, we will tackle batch deployment of these Docker containers.

This article was first published on Cui Qingcai's personal blog Jingmi: Python 3 Web Crawler Development in Action tutorial.

For more crawler-related content, follow my personal WeChat official account: Attack Coder.
