Key words:

Crawlers, big data, programming

Description:

At present, we plan to crawl the public data of a target website; in total, about 1.2 million API requests are expected. Each crawler script is single-process and single-threaded. Based on a key field (such as the ID), the data to be crawled is split into multiple segments (10,000 records each), and the segments are assigned to different crawlers that run at the same time. I call this set of crawlers the crawler group. All crawled data is stored in a local MySQL database.
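The segmentation can be sketched as follows; the start and end IDs are illustrative assumptions, and only the 10,000-record segment size comes from the description above.

```python
# Minimal sketch of splitting an ID range into 10,000-record segments.
# START_ID and END_ID are assumed values for illustration.
START_ID = 1
END_ID = 1_200_000
SEGMENT_SIZE = 10_000

def make_segments(start_id, end_id, size):
    """Yield (segment_start, segment_end) pairs covering [start_id, end_id]."""
    for seg_start in range(start_id, end_id + 1, size):
        yield seg_start, min(seg_start + size - 1, end_id)

segments = list(make_segments(START_ID, END_ID, SEGMENT_SIZE))
print(len(segments), "segments, e.g.", segments[0])   # 120 segments, e.g. (1, 10000)
```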

How is a crawler implemented? In Jupyter Lab, create N .ipynb files, add one cell to each file, copy the initial crawler script into it, modify the key parameters (such as the start ID), and then run the cell; that activates one crawler.
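For concreteness, one such cell might look roughly like the sketch below. The API endpoint, table schema, and database credentials are hypothetical; only the start ID, the segment size, and the local MySQL storage come from the description above.

```python
# Hedged sketch of one crawler cell; endpoint, schema and credentials are made up.
import time
import requests
import pymysql

START_ID = 1            # the key parameter edited in each copy of the cell
SEGMENT_SIZE = 10_000

conn = pymysql.connect(host="localhost", user="root",
                       password="***", database="crawl_db")

with conn.cursor() as cur:
    for record_id in range(START_ID, START_ID + SEGMENT_SIZE):
        resp = requests.get("https://example.com/api/item",   # hypothetical endpoint
                            params={"id": record_id}, timeout=30)
        resp.raise_for_status()
        cur.execute("INSERT INTO items (id, payload) VALUES (%s, %s)",
                    (record_id, resp.text))
        conn.commit()
        time.sleep(1)    # polite pause; see the rate calculation below
conn.close()
```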

If the crawl is to finish within 7*24 hours of continuous running, the required throughput is 1,200,000 / (7*24*60*60) ≈ 1.98 requests per second, i.e. an average interval of about 0.5 seconds between requests. Since each request takes about 20 seconds to respond and process, a single crawler delivers only 1/20 = 0.05 requests per second, so roughly 40 crawler scripts would have to run at the same time to finish on schedule.
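The estimate can be checked with a few lines of arithmetic; the 20-second per-request figure is the one quoted above.

```python
# Reproducing the capacity estimate above.
total_requests = 1_200_000
window_s = 7 * 24 * 60 * 60             # one week = 604,800 seconds

rate = total_requests / window_s        # ≈ 1.98 requests per second required
interval = 1 / rate                     # ≈ 0.50 s between requests on average
per_request_s = 20                      # observed time per request
crawlers_needed = per_request_s * rate  # ≈ 40 concurrent single-threaded crawlers

print(round(rate, 3), round(interval, 3), round(crawlers_needed, 1))
# 1.984 0.504 39.7
```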

Question:

Is this volume of requests an attack on the target site, or is it normal traffic?

Ideas:

1. Crawler ethics: crawlers should not interfere with or affect the target website's business, and they should not crawl data the target website has not published.

2. Crawler performance: we need to calculate and understand the performance of our crawler group, and judge, based on the target website's business, whether that load exceeds what the site can tolerate. Crawler performance covers, on one hand, the traffic generated toward the target website and, on the other, local traffic and disk I/O; memory and CPU are generally not a problem. One simple way to keep the load within bounds is to throttle each crawler, as in the sketch after this list.

3. Crawler safety: on one hand, the target website's anti-crawling measures must be considered; on the other hand, ethics must be considered.
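For idea 2, a simple way to keep the crawler group's load within a chosen budget is to give each crawler a minimum interval between requests. The target rate and crawler count below are assumptions for illustration, not figures from this article.

```python
# Hedged sketch of capping the group's load on the target site.
# TARGET_GROUP_RPS and N_CRAWLERS are assumed values.
import time
import requests

TARGET_GROUP_RPS = 1.0    # total requests per second the whole group may send
N_CRAWLERS = 10           # crawlers running in parallel
MIN_INTERVAL = N_CRAWLERS / TARGET_GROUP_RPS   # seconds between requests per crawler

def fetch_politely(url, params):
    """Issue one request, then sleep so this crawler stays within its share."""
    started = time.monotonic()
    resp = requests.get(url, params=params, timeout=30)
    elapsed = time.monotonic() - started
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)
    return resp
```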

Conclusion:

1. The crawlers put no pressure on local machine performance.

2. The target website's business volume is not large, and the current crawler group's request frequency is clearly too high; it should be reduced by at least half.


This article is signed on PRESS.one: press.one/file/v?s=50…