It has been more than half a year since I joined Bingjian Technology to work on crawlers. After developing and maintaining several of them, I finally feel I have found my footing with web crawlers. I stumbled into plenty of pitfalls along the way and formed my own views on many of the details, so today I would like to share some of that experience. Many crawler topics are hard to discuss in detail: be too specific and others will immediately build targeted anti-crawler measures, and much of the industry's know-how has no general, published solution (those who have figured something out are often unwilling to share it), so I have had to explore gradually on my own. Still, I believe appropriate technical exchange within the industry and among peers is necessary; working behind closed doors helps no one. I am also eager to have deeper private exchanges with peers so that we can all learn from each other and improve. Recently I have been studying app decompilation for crawling, so I am especially interested in that area.

Why PHP

In today's industry, Python has by far the most crawler tooling, and most of my colleagues use Python for crawlers. Since I had used PHP heavily for web backends, I knew the PHP ecosystem and its third-party libraries well, and the company imposed no requirements on language choice, so I went with PHP. Some may feel that PHP has few crawler libraries; even some developers who normally work on PHP backends reach for Python when they need to write a crawler. Is PHP unsuitable for crawling? On the contrary: PHP has accumulated a large number of mature third-party libraries for the web, and its strong content-handling capabilities make it well suited to everyday crawler tasks. By running time, crawlers roughly fall into two types:

1. Real-time crawlers: when a request comes in, a crawler is spun up to fetch the result immediately. These usually expose an API to external callers.
2. Long-running crawlers: these run continuously or on a schedule to keep updating data in a database.

Both kinds need frequent maintenance and updates. As a scripting language that is simple to deploy, PHP makes hot-updating crawler code very convenient.

Use third-party libraries

For PHP crawlers, make full use of the third-party libraries on Composer. PHP has accumulated a large number of mature third-party libraries for the web; almost any library you can think of can be found on GitHub. If you don't use them, you are giving up one of PHP's biggest advantages on the web. The ones I rely on:

1. Guzzle: a fully featured HTTP client with asynchronous concurrency, something hard to find in other scripting languages.
2. Symfony DomCrawler + CssSelector: a simple wrapper for extracting elements from the HTML DOM using CSS selectors.
3. Symfony Process: a process library (wrapping proc_open) that is compatible with Windows; remember that the pcntl extension does not support Windows.
4. php-webdriver: the Selenium WebDriver PHP client officially maintained by Facebook.

Some time ago there was a popular post, "I 'stole' one million Zhihu users in a day with a crawler, just to prove that PHP is the best language in the world". That repo is very popular and is still maintained. I read through its code and found it to be high quality, but one drawback is that the author chose to encapsulate everything himself rather than use existing third-party libraries. We should spend our energy on crawler business logic instead of reinventing the wheel; I simply use the existing Composer libraries without a second thought. In the 8 months since I joined the company in April this year, I have written only three crawlers, and everything together — the crawler business itself plus Redis-based distributed crawler scheduling, single-machine multi-crawler concurrency, alerting + monitoring + parameter control, Selenium multi-browser matching + feature customization, proxy strategy customization, and so on — adds up to just 6,000 lines of PHP. Mature, stable third-party libraries already exist, so building your own wheels is not worth it.
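To make the point concrete, here is a minimal sketch of the Guzzle + DomCrawler combination: fetch a page and pull out its links with a CSS selector. The URL is a placeholder, and it assumes guzzlehttp/guzzle, symfony/dom-crawler, and symfony/css-selector have been installed via Composer.

```php
<?php
// Minimal sketch: fetch a page with Guzzle, extract links with DomCrawler.
// composer require guzzlehttp/guzzle symfony/dom-crawler symfony/css-selector
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client(['timeout' => 10]);
$html   = $client->request('GET', 'https://example.com')   // placeholder URL
                 ->getBody()
                 ->getContents();

$crawler = new Crawler($html);
// filter() uses CSS selectors, which requires symfony/css-selector.
$links = $crawler->filter('a')->each(fn (Crawler $node) => $node->attr('href'));

print_r($links);
```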

Multi-threaded, multi-process, and asynchronous

Any discussion of crawlers has to cover concurrency. Crawling is IO-bound rather than CPU-bound, so a good concurrent crawler should:

1. Use as much download bandwidth as possible (the more bandwidth, the more data you can fetch).
2. Consume as little CPU and memory as possible.

Multithreading seems like an obvious way to get concurrency, and the oft-repeated claim that "PHP has no multithreading" tends to put PHPers on the defensive. PHP cannot use multithreading as a web backend, but it can on the command line. PHP comes in thread-safe (ZTS) and non-thread-safe (NTS) builds; the ZTS build exists largely for compatibility with IIS's ISAPI on Windows, which is why PHP extensions generally have to ship both thread-safe and non-thread-safe variants. In theory, command-line PHP multithreading is real multithreading, with no global interpreter lock like Python's or Ruby's (where only one thread actually runs at any given moment); in practice, though, multithreaded PHP on the command line is not very stable (it was never really designed for PHP-CLI), so for command-line applications I recommend sticking with multiple processes for concurrency.
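As an illustration of CLI multi-process concurrency (not the production scheme I describe later, which uses symfony/process), here is a minimal pcntl_fork sketch; pcntl is CLI-only and unavailable on Windows, and the URL list is made up.

```php
<?php
// Minimal sketch: one child process per URL, parent reaps them all.
// Requires the pcntl extension (CLI only, not available on Windows).
$urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'];

foreach ($urls as $url) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        exit("fork failed\n");
    }
    if ($pid === 0) {
        // Child process: do the (placeholder) crawl work, then exit.
        echo 'child ' . getmypid() . " fetching $url\n";
        file_get_contents($url);
        exit(0);
    }
}

// Parent process: wait for every child so none are left as zombies.
while (pcntl_waitpid(-1, $status) > 0) {
    // loop until all children have exited
}
```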

Asynchrony is another important way to get concurrency. In most cases where a crawler needs concurrency, the goal is simply to fetch multiple URLs at the same time; for that you don't need multiple processes or threads at all, a single asynchronous process will do. PHP's Guzzle, for example, handles this very well. Guzzle's default asynchrony is built on curl's curl_multi family of functions. If you want a more efficient asynchronous event loop, you can swap Guzzle's handler for react-guzzle-psr7 (which of course requires an async PECL extension such as event). Personally I find Guzzle's default asynchrony sufficient: a single process can keep dozens or even hundreds of HTTP requests in flight and saturate the pipe without trouble, with minimal CPU and memory cost. In short, combining PHP's multi-process model with asynchrony is enough to achieve good concurrency.
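Here is a minimal sketch of that kind of single-process concurrency using Guzzle's Pool (which drives curl_multi under the hood); the URL list and concurrency level are arbitrary examples.

```php
<?php
// Minimal sketch: one process, many concurrent requests via GuzzleHttp\Pool.
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

$client = new Client(['timeout' => 10]);
$urls   = array_map(fn ($i) => "https://example.com/page/$i", range(1, 100));

// A generator so requests are created lazily as the pool consumes them.
$requests = function () use ($urls) {
    foreach ($urls as $url) {
        yield new Request('GET', $url);
    }
};

$pool = new Pool($client, $requests(), [
    'concurrency' => 20,   // how many requests are in flight at once
    'fulfilled'   => function ($response, $index) {
        echo "#$index done, " . strlen((string) $response->getBody()) . " bytes\n";
    },
    'rejected'    => function ($reason, $index) {
        echo "#$index failed: " . $reason->getMessage() . "\n";
    },
]);

$pool->promise()->wait();   // block until every request has settled
```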

About crawler frameworks

An out-of-the-box crawler framework is not a silver bullet. I started by looking at some of the better-known frameworks in Java and Python, trying to learn them first and then fit my crawler tasks into them, and found this difficult. Admittedly, with a crawler framework you can often change two lines and have something running, which is great for simple tasks. But a framework encapsulated by someone else leaves the crawler poorly customizable (and crawlers need to handle all kinds of situations flexibly), while the essence of crawling is just this: open an HTTP client, fetch the HTML, extract data from the DOM, done (plus multi-process management if you want concurrency). Wrapping that simple job in a complex framework designed to satisfy as many people as possible does not necessarily fit every situation. There was a question on V2EX along the lines of: "Requests alone is already easy for me, so what is the advantage of Scrapy?" My take is that a crawler framework's advantage lies in handling concurrent scheduling for you, whereas a hand-rolled crawler without scheduling is just a single process. But multi-process scheduling for crawlers isn't that complicated, and it doesn't need to be. Let me describe how my PHP crawlers do it (Python next time).

Crawler multi-process scheduling

My PHP crawlers' multi-process scheduling is simple and crude. A crawler splits into a Master process that manages the crawler processes and Worker processes that carry out the actual crawling, while Redis is responsible for controlling the crawlers and exposing their state.

For example, suppose I have a task to crawl site A. After I develop crawler Worker A, I can set parameters in Redis so that two Worker A processes run on server Node1. The master process on Node1 reads the control parameters from Redis periodically; if the number of Worker A processes on Node1 falls below 2, it starts a new Worker A process to make up the difference. What the control parameters contain is up to you; for example, I define a per-node worker limit, the proxy policy to use, whether image loading is disabled, browser feature customization, and so on.

There are two ways for the Master process to start a Worker process. One is to launch a new command-line PHP worker via an exec-style call (for example proc_open('php worker.php balabala', $descriptorSpec, $pipes) in the Master process); the other is fork. I use the exec-style approach (actually the symfony/process library, which wraps proc_open) to start Worker processes. If you need to pass command-line arguments to a Worker, base64-encode them, because the command line may mangle some characters. The advantage of this approach is decoupling. Note that all Worker processes are child processes of the Master process, so if the Master exits, all Workers exit with it; the Master therefore has to be careful about catching exceptions, especially around Redis, databases, and other network IO. If you want Workers to daemonize instead, follow the instructions in this article (the same applies to PHP, but it is not Windows-compatible). I do not recommend controlling Workers from the Master via IPC, because that couples the Master and Worker processes; the Master should simply be responsible for starting Workers. Worker control can be done entirely through Redis: every time a Worker completes an HTTP request, or every few seconds, it reads its control parameters from Redis and, if needed, reports its own state (use a Redis pipeline when there are many parameters). In practice this works well, and I have adopted this simple, crude scheme in all my PHP crawlers (the sketch after the list below illustrates the master loop). I think it has four advantages:

1. It supports distribution with simple dependencies: parameter control and state reporting go through a single Redis node. I recommend a good Redis GUI tool for managing it; Redis's five data structures are very convenient for crawler parameter control and state display.
2. The Master and Worker processes are decoupled, which also takes care of crawler memory leaks (the Worker process simply exits when it is done).
3. It adapts to all kinds of crawler tasks: for a real-time crawler, Workers compete for requests the Master pushes into a Redis list; for a long-running task, a Worker that exits unexpectedly is replenished immediately.
4. Development is easy, because you don't have to worry about scheduling.
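As a rough illustration of the master loop described above, here is a minimal sketch assuming symfony/process and the phpredis extension; the Redis key name, worker.php, and the encoded parameters are all hypothetical placeholders.

```php
<?php
// Minimal sketch of a master loop: keep N workers alive, N read from Redis.
// Assumes symfony/process (Composer) and the phpredis extension.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\Process\Process;

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

/** @var Process[] $workers */
$workers = [];

while (true) {
    // Drop workers that have exited (crashed or finished their batch).
    $workers = array_filter($workers, fn (Process $p) => $p->isRunning());

    // Read the desired worker count for this node (hypothetical key name).
    $target = (int) $redis->get('crawler:workerA:target');

    // Top up to the target: new workers are child processes of this master.
    while (count($workers) < $target) {
        // base64-encode CLI arguments so the shell cannot mangle them.
        $args = base64_encode(json_encode(['site' => 'A', 'proxy' => 'default']));
        $p = new Process(['php', 'worker.php', $args]);
        $p->start();
        $workers[] = $p;
    }

    sleep(5);   // poll the control parameters every few seconds
}
```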

The downside, of course, is that you have to write all the mechanics yourself; the price of high customizability is doing it yourself.

Conclusion

That covers some of my PHP crawler experience. Due to limited space, I will save my Selenium experience for next time.