It’s the end of the year, so it’s time to write a summary for myself. Today I want to talk about the part of my work I have been responsible for at the company for more than a year: crawlers. After working on crawlers for so long, it’s time to write something down and leave something behind. I have summarized the following types of crawler design ideas.

  • Simple server-side timed crawler
  • Client-side crawler
    • Lua parsing
    • JavaScript parsing
  • Server-side offline crawler

Let’s go through each one in detail.

Server – Timed simple crawler

In the beginning, this is what we did. It is probably the simplest kind of crawler, and probably how the earliest search engines started, too.

The defining feature of this kind of crawler is that I only need to crawl a certain part of a website’s data: initiate an HTTP request, parse the HTML, save to the database, and that’s it. It suits websites that expose public data or don’t require real-time freshness: Autohome car data, League of Legends hero data, the data displayed on some government websites. I used scrapy to grab some dota2 data, and when I realized scrapy wasn’t always under my control, I wrote my own crawler framework called Tspider. I still use it now for simple crawlers. It is PHP-based, built on coroutines and the curl_multi_* function set, and a single process handles up to 2000 effective requests per minute.

The processing flow of this kind of crawler looks like the following:





Simple crawler flow chart
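
To make the flow concrete, here is a minimal Python sketch of the fetch-parse-store loop. The URL, the `.hero-name` selector, and the table schema are all made-up examples (my Tspider is PHP-based, so this is only an illustration of the idea):

```python
import sqlite3
import time

import requests
from bs4 import BeautifulSoup

db = sqlite3.connect("heroes.db")
db.execute("CREATE TABLE IF NOT EXISTS heroes (name TEXT PRIMARY KEY)")

def crawl_once(url: str) -> None:
    # 1. Initiate the HTTP request.
    html = requests.get(url, timeout=10).text
    # 2. Parse the HTML for the fields we care about.
    soup = BeautifulSoup(html, "html.parser")
    names = [n.get_text(strip=True) for n in soup.select(".hero-name")]
    # 3. Save to the database, and that's it.
    db.executemany("INSERT OR IGNORE INTO heroes VALUES (?)",
                   [(name,) for name in names])
    db.commit()

if __name__ == "__main__":
    while True:  # "timed" simply means re-running on a fixed interval
        crawl_once("https://example.com/dota2/heroes")
        time.sleep(3600)
```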

The crawlers hyped online, things like “XXX grabs Taobao MM photos” or “how much data I scraped from Zhihu”, are roughly of this kind. They have a distinct “Master XX in 21 Days” feel to them.

The advantage of this type of crawler is its simplicity.

Client-side parsing (Lua, JavaScript)

Our company builds mobile products. If you also happen to have (Android/iOS) clients, and the data has high real-time requirements, or your server IPs get banned too aggressively when crawling, you can try the approach I describe below. I’ll call it client parsing.

In this kind of crawler, a script execution engine is embedded in the client. The HTTP request and data parsing are both performed on the client side, and the result is finally presented there or reported back to the server. Accuracy and real-time performance are both high.





Client parsing crawler

  • Scripts: a script simply converts the site’s data (JSON, JSONP, HTML, etc.) into the formatted data we need. When the site changes, we just change the script.
  • Policy: the policy tells the client which method in the script to execute, and when. Does it need caching? Does it need to show the original site’s content? It is how the server controls the client’s behavior (see the sketch after this list).
  • Offline crawler: when a crawl does need to happen on the server, throw the request onto the message queue and let the offline crawler take it from there.
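
To make the policy idea concrete, here is a sketch of what a server-delivered policy payload might look like, with a tiny dispatch helper. Every field name here is an invented example, not a real protocol:

```python
# Hypothetical policy payload pushed from the server to the client.
POLICY = {
    "site": "example-site",
    "script_url": "https://cdn.example.com/scripts/example_site_v3.lua",
    "entry_method": "parse_play_list",        # which script method to run
    "run_at": ["app_start", "pull_refresh"],  # when the client should run it
    "cache_ttl_seconds": 300,                 # do we need buffering?
    "show_origin_page": False,                # show the original site instead?
    "report_to_server": True,                 # upload parsed data afterwards
}

def choose_action(policy: dict, trigger: str) -> str:
    """Decide what the client should do when a trigger fires."""
    if trigger not in policy["run_at"]:
        return "ignore"
    if policy["show_origin_page"]:
        return "render_origin_page"
    return "execute:" + policy["entry_method"]
```

Because both the script and the policy live on the server, a parsing change on the target site only requires pushing a new script version; no client release is needed.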

This method has at least two advantages: the requests come from many scattered client IPs, and the data is highly real-time.

Server – Offline crawler system

For most services, data crawling needs to be done on the server. For this class of crawler, the architecture needs to be designed with good scalability in mind.





Server offline crawler

  • Crawl requests come from the background control panel and the external gateway.
  • The crawler trigger tells the crawler nodes, through the message queue, when to crawl and whose data to crawl.
  • Background control: manages which websites are supported, plus alarms and exception handling.
  • Message queues distribute messages to the crawler nodes.
  • Crawler node: performs the actual crawl and formats the crawled data. Adding support for a new website usually means changing only this layer; statistics and alarms also need to be done well here.
  • Deduplication: you can try a Bloom filter, or compare SimHash fingerprints by Hamming distance (see the sketch after this list).
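
As a minimal sketch of the deduplication step: a 64-bit SimHash fingerprint per document, compared by Hamming distance. The tokenization (whitespace split), the hash choice, and the distance threshold are all illustrative assumptions, not a production design:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Build a SimHash fingerprint: each token votes on every bit."""
    weights = [0] * bits
    for token in text.split():  # naive tokenization, illustrative only
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def is_duplicate(fp: int, seen: list, threshold: int = 3) -> bool:
    """Near-duplicate pages land within a small Hamming distance."""
    return any(hamming(fp, old) <= threshold for old in seen)
```

Comparing a new fingerprint against every stored one is linear and gets slow; in practice fingerprints are indexed by bit chunks so only a small candidate set needs the Hamming comparison. When near-duplicates don’t matter, an exact check with a Bloom filter on the URL or content hash is cheaper.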

Ideas or principles

To sum up:

  • Only care about the correct cases, not the wrong ones; you can never enumerate all the ways things go wrong.
  • In a layered architecture, the further down a request travels, the higher the proportion of valid requests should be.
  • The backup mindset is important. If the probability of one machine going down is 1 in 100, the probability of two machines going down at the same time is 1 in 10,000 (assuming independent failures: 0.01 × 0.01 = 0.0001).
  • There is no master key: each specific problem needs its own analysis and its own solution.
  • There is no perfect solution, and sometimes you have to make trade-offs based on your business.

Finally, I would like to thank my leader, Corey. Thank you!