Before we understand what anti-crawler means, let’s first take a look at what a crawler is.

What is a crawler

There is a lot of useful data on the Internet today. With patient observation and some technical means, we can obtain a great deal of valuable data. By "technical means" I mean web crawlers.

A crawler is a program that automatically fetches web content. Search engines such as Google and Baidu are the best-known example: every day they run huge crawler systems that collect data from websites all over the world so that users can search it.
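To make "automatically fetches web content" concrete, here is a minimal sketch of the step every crawler repeats: parse a page and collect the links to visit next. The names `LinkExtractor`, `extract_links` and `SAMPLE_PAGE` are made up for this illustration, and the page is a literal string rather than a real download so the example runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# A real crawler would download the page first, for example with
# urllib.request.urlopen(url).read(); a literal string keeps this sketch offline.
SAMPLE_PAGE = '<a href="/news">News</a> <a href="https://example.com/about">About</a>'
print(extract_links(SAMPLE_PAGE, "https://example.com/"))
```

A full crawler would put the returned links into a queue, fetch each one, and repeat, which is exactly the traffic that anti-crawler measures try to tell apart from real visitors.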

Malicious crawlers not only consume a large share of a website's traffic, so that real users cannot reach the site, but may also leak the site's key information and disrupt the normal operation of the website or app.

Therefore, on websites whose data is valuable, developers usually deploy technical measures against web crawlers. These measures are what we call anti-crawler.

If you want to implement a simple crawler example for yourself, check out my previous article:

Five steps take you to explore the truth behind the crawler video barrage, with crawler implementation source code

Common anti-crawler measures

Anti-crawler methods are usually classified by their characteristics: information verification, dynamic rendering, text obfuscation, behavior verification, and so on.

Among them, text obfuscation anti-crawler is the most interesting, while behavior verification anti-crawler is the most difficult.

Text obfuscation anti-crawler

Text obfuscation is about preventing crawlers from obtaining important text data in a web application. Any anti-crawler measure must not interfere with users browsing the page and reading its text. Simply scrambling the text would be obvious to readers, so developers usually obfuscate it through a font mapping instead.

For example: Autohome forum text mapping.

Here some characters are replaced through a font mapping, so a crawler cannot obtain the complete text when it collects the page, while normal users still read the page as usual.
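The idea of a font mapping can be sketched as follows. The page source contains private-use code points, and a custom font draws each of them as an ordinary character. The table `GLYPH_TO_CHAR` and both helper functions are hypothetical; a real site keeps its table inside the font file and may change it between requests.

```python
# Hypothetical mapping: the site's custom font draws these private-use
# code points as ordinary digits.
GLYPH_TO_CHAR = {
    "\ue001": "1",
    "\ue002": "5",
    "\ue003": "8",
}
CHAR_TO_GLYPH = {char: glyph for glyph, char in GLYPH_TO_CHAR.items()}


def obfuscate(text, char_to_glyph):
    """Server side: swap protected characters for private-use glyphs."""
    return "".join(char_to_glyph.get(ch, ch) for ch in text)


def restore(text, glyph_to_char):
    """What a reader's browser does visually, and what a crawler must reproduce."""
    return "".join(glyph_to_char.get(ch, ch) for ch in text)


page_source = obfuscate("Price: 158", CHAR_TO_GLYPH)
print(repr(page_source))                    # digits are gone from the raw text
print(restore(page_source, GLYPH_TO_CHAR))  # readable again with the table
```

A crawler that only reads `page_source` sees meaningless code points in place of the digits; it needs the mapping table to recover the text that users see rendered on screen.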

Dynamic rendering anti-crawler

As technology keeps evolving, more and more websites have moved from traditional static data loading to dynamic data loading, and dynamic loading increasingly comes with data encryption.

Put simply, dynamic data loading lets the browser load the general skeleton of the page first and then send asynchronous requests to fill in the data. By encrypting the request parameters, the site blocks the most basic crawler scripts.

For example: Redman data set – JS parameter encryption

Here the key parameters of each asynchronous request are verified, so the most basic crawler requests are rejected outright. A crawler can obtain the data only by reproducing the parameter encryption process.
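A minimal sketch of key-parameter verification is shown below. It uses an HMAC signature over the request parameters plus a timestamp, which is one common way to do this and not necessarily the scheme of the site in the example. `SECRET_KEY`, `sign_params`, `is_valid_request` and the 60-second window are all assumptions made for illustration.

```python
import hashlib
import hmac
from urllib.parse import urlencode

# Hypothetical secret; on a real site it is hidden in obfuscated JavaScript.
SECRET_KEY = b"demo-secret"


def sign_params(params, timestamp):
    """Client side: sign the sorted parameters together with a timestamp."""
    payload = urlencode(sorted({**params, "ts": timestamp}.items()))
    return hmac.new(SECRET_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def is_valid_request(params, timestamp, signature, now, max_age=60):
    """Server side: reject stale requests and requests with a wrong signature."""
    if abs(now - timestamp) > max_age:
        return False
    expected = sign_params(params, timestamp)
    return hmac.compare_digest(expected, signature)


signature = sign_params({"page": 1}, timestamp=1000)
print(is_valid_request({"page": 1}, 1000, signature, now=1010))  # genuine request
print(is_valid_request({"page": 2}, 1000, signature, now=1010))  # tampered parameter
```

A script that simply replays or edits a captured URL fails the check, which is why the crawler has to reproduce the signing step itself.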

Behavior verification anti-crawler

Behavioral CAPTCHAs are a popular kind of verification code. As the name suggests, verification is done through the user's actions, without reading distorted images or text. There are two common types: drag and tap.

Example: 12306 Login verification code – tap behavior authentication

The site judges from the choices the user makes after looking at the picture whether the request comes from a real user, which blocks crawlers with little technical sophistication.
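The server-side check behind a tap CAPTCHA can be sketched roughly like this: the server knows which regions of the picture are correct and tests whether the submitted tap coordinates fall inside them. `Box` and `verify_taps` are hypothetical names, the target boxes are assumed not to overlap, and a production system would typically also look at timing and pointer movement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rectangular target region in picture coordinates (pixels)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x, y):
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def verify_taps(taps, target_boxes):
    """True when every target box is tapped exactly once and no tap misses."""
    if len(taps) != len(target_boxes):
        return False
    remaining = list(target_boxes)
    for x, y in taps:
        hit = next((box for box in remaining if box.contains(x, y)), None)
        if hit is None:
            return False
        remaining.remove(hit)
    return True


targets = [Box(0, 0, 50, 50), Box(100, 0, 150, 50)]
print(verify_taps([(120, 20), (10, 10)], targets))  # both targets hit
print(verify_taps([(10, 10), (20, 20)], targets))   # same target tapped twice
```

The order of the taps does not matter in this sketch; what matters is that a script cannot pass without first recognizing what the picture shows.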

Finally, crawling and anti-crawling are a battle of wits between Internet engineers. As website developers, we should not only master crawler technology but also know how to implement anti-crawler measures.

If you want to learn more, stay tuned: a series of articles on specific anti-crawler solutions is coming next.

Thanks for your attention ~

If you need more Python source code, you can get it from my Git repository, which also contains Java and big data code. Take whatever you want to learn from; I will keep updating it.

The repository address is here

For beginners, I have also written some introductory Python material in the README that you can check out.
