Disclaimer: This article is for study and research only and must not be used for illegal purposes. If anything here infringes on your rights, please let me know and it will be removed. Thank you!

Project scenario:

When we first start working with crawlers, we may have only a few, so deploying and scheduling them by hand is manageable. Over time, however, they accumulate from ten to a hundred, and once those hundred crawlers finish we have to restart them manually, which is very tedious; checking their output logs one by one is just as bad. A tool for quick deployment, task scheduling, and log viewing becomes a must. Here we use the Scrapyd deployment service together with the SpiderKeeper visual crawler management UI to do this.

Module Overview:

Scrapy: an open-source web crawler framework written in Python, designed to crawl websites and extract structured data.

pip install scrapy

Scrapyd: a service for running Scrapy crawlers, allowing you to deploy projects and control your spiders through an HTTP JSON API.

pip install scrapyd

Scrapyd-client: a client for Scrapyd that lets you deploy projects to a Scrapyd server; it can also build an egg file of your project.

pip install scrapyd-client

SpiderKeeper: a visual crawler management UI that lets you schedule periodic runs and view run statistics.

pip install SpiderKeeper
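
For convenience, the four packages above can also be installed in one command; this sketch assumes they all go into the same Python environment (a virtualenv is a reasonable choice):

pip install scrapy scrapyd scrapyd-client SpiderKeeper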


Solution:


1. Create a new crawler project (scrapy startproject mySpider), then enter the mySpider directory and create a spider for www.baidu.com.
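
A rough sketch of this step; the spider name baidu is only an assumed example, since the original text gives just the domain www.baidu.com:

scrapy startproject mySpider
cd mySpider
scrapy genspider baidu www.baidu.com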



2. Modify scrapy.cfg and add a deployment name, my, after deploy.
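
A minimal scrapy.cfg sketch for this step; the url and project values are assumptions based on the defaults used later in this article (Scrapyd on localhost:6800, project name myspider):

[settings]
default = mySpider.settings

[deploy:my]
url = http://localhost:6800/
project = myspider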

3. Start scrapyd
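
Running scrapyd in the project's environment is a single command, and it listens on port 6800 by default. If you want it to keep running after the terminal is closed, something like the following works (the log file name is arbitrary):

nohup scrapyd > scrapyd.log 2>&1 &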


4. Under the mySpider directory, upload our crawler project: scrapyd-deploy my -p myspider
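
To double-check that the project arrived, you can query Scrapyd's listprojects.json endpoint (an optional verification step, not part of the original walkthrough):

curl http://127.0.0.1:6800/listprojects.json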



5. The status returned after the upload should be "ok". Then run the spider: curl http://127.0.0.1:6800/schedule.json -d project=myspider -d spider=<spider name created in step 1>
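
Once a run is scheduled, its state (pending/running/finished) can be inspected through Scrapyd's listjobs.json endpoint, and a running job can be stopped with cancel.json; the job id below is a placeholder:

curl http://127.0.0.1:6800/listjobs.json?project=myspider
curl http://127.0.0.1:6800/cancel.json -d project=myspider -d job=<job id>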



6. A status of "ok" indicates the run was scheduled successfully. Next we want to see it in the SpiderKeeper UI, so start SpiderKeeper and point it at http://localhost:6800: spiderkeeper --server=http://localhost:6800
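
As with scrapyd, SpiderKeeper can be kept alive in the background; a minimal sketch using only the flag shown above (the log file name is arbitrary):

nohup spiderkeeper --server=http://localhost:6800 > spiderkeeper.log 2>&1 &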



7. After startup, open http://<server IP>:5000 to reach the SpiderKeeper management page. The default username and password are both admin.



8. Click Create Project to create the project, then generate the egg file: scrapyd-deploy --build-egg output.egg. If the build output ends without errors, the egg was generated successfully.



9. Then upload the egg file on the SpiderKeeper page.





10. Click Submit, then click Projects and select the project we just created.

Conclusion:

At this point, our Scrapy project is successfully deployed. If your crawler code is updated later, you only need to re-upload the project to Scrapyd: scrapyd-deploy <deployment name> -p <project name>
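
For example, with the names used in this article (the spider name is a placeholder), a redeploy plus a fresh run looks like:

scrapyd-deploy my -p myspider
curl http://127.0.0.1:6800/schedule.json -d project=myspider -d spider=<spider name>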
Reference link: zhuanlan.zhihu.com/p/63302475