— published simultaneously at imweb.io

1. Introduction

For a web page, we often want it to be well structured and semantically clear so that search engines can index it accurately. But there are also situations where we don't want content to be so easy to harvest: flash-sale prices on e-commerce sites, question banks on education sites, and so on. Content like this is often the lifeblood of a product and must be protected effectively. That is where the topic of crawlers and anti-crawler measures comes from.

2. Common anti-crawler strategies

Let's be clear up front: no website in the world is perfectly crawler-proof.

If a page wants to display normally to users while giving crawlers no opening, it must be able to tell humans and robots apart. Engineers have therefore made all kinds of attempts. Most of these strategies live on the back end, which is also the conventional and most effective approach today. For example:

  • User-Agent + Referer detection
  • Account and cookie authentication
  • CAPTCHAs
  • IP rate limiting
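As a minimal sketch of the first item, a back end might screen requests by their headers. The UA blocklist keywords and the trusted site domain below are invented for illustration:

```python
# Server-side User-Agent + Referer screening (sketch).
# The keyword list and trusted domain are made-up examples.
BLOCKED_UA_KEYWORDS = ("python-requests", "scrapy", "curl", "wget")
TRUSTED_REFERER = "https://example.com"  # hypothetical site domain

def looks_like_crawler(user_agent: str, referer: str) -> bool:
    """Return True if the request headers look automated."""
    ua = (user_agent or "").lower()
    # Well-known HTTP-client signatures in the UA are an easy giveaway.
    if any(keyword in ua for keyword in BLOCKED_UA_KEYWORDS):
        return True
    # Deep pages should normally be reached from within the site itself.
    if referer and not referer.startswith(TRUSTED_REFERER):
        return True
    return False
```

Of course, both headers are trivially spoofable, which is exactly why such checks only raise the bar rather than stop a determined crawler.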

Meanwhile, a crawler can get arbitrarily close to a real human, for example:

  • Headless Chrome or PhantomJS to simulate a browser environment
  • Tesseract to recognize CAPTCHAs
  • Proxy IPs, available for purchase on Taobao
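The counter-moves above can be sketched in a few lines: spoof browser-like headers and rotate through a proxy pool. The proxy addresses here are placeholder examples, not real endpoints:

```python
import random

# Crawler-side evasion (sketch): browser-like headers plus a rotating
# proxy. The proxy list uses documentation-reserved example addresses.
PROXY_POOL = ["203.0.113.10:8080", "203.0.113.11:8080"]

def build_request_profile() -> dict:
    """Assemble headers and a proxy that mimic a real browser session."""
    return {
        "headers": {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            "Referer": "https://example.com/",
        },
        # Picking a different exit IP per request defeats naive rate limits.
        "proxy": random.choice(PROXY_POOL),
    }
```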

So: a 100% effective anti-crawler strategy? It doesn't exist. Anti-crawling is a war of attrition, and the only real question is how hard you make it.

But as front-end engineers, we can raise the difficulty of the game and design some truly devious anti-crawler strategies.

3. Front-end and anti-crawler

3.1 Font-face patchwork

Example: Maoyan Movies

On Maoyan Movies, the box-office figures are not plain numbers. The page uses font-face to define a custom character set and maps it through Unicode. In other words, apart from image recognition, a crawler must also fetch the character set at the same time in order to recognize the numbers.

What's more, the URL of the character-set file changes every time the page is refreshed, which undoubtedly raises the cost of crawling further.
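From the crawler's side, once the per-refresh font file has been downloaded and its glyphs matched against known digit shapes (e.g. with a font library such as fontTools), decoding reduces to a codepoint-to-digit lookup. The private-use codepoints and their mapping below are invented to illustrate one refresh's character set:

```python
# Reversing a font-face character set (sketch). The mapping from
# private-use-area codepoints to digits is a made-up example; on the
# real site it must be rebuilt after every refresh.
GLYPH_TO_DIGIT = {
    "\ue343": "3",
    "\ue137": "0",
    "\uf5aa": "7",
}

def decode_box_office(obfuscated: str) -> str:
    """Translate private-use-area codepoints back into plain digits."""
    # Characters outside the mapping (e.g. the decimal point) pass through.
    return "".join(GLYPH_TO_DIGIT.get(ch, ch) for ch in obfuscated)
```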

3.2 Background patchwork

Example: Meituan

Similar to the font-face strategy, Meituan uses background sprites. The "numbers" are actually images that display different characters depending on the background offset.

Moreover, the character order within the sprite differs from page to page. In theory only 0-9 and the decimal point need to be generated; why there are repeated characters is not entirely clear.

Page A and Page B (screenshots omitted): the same sprite technique, but with different character orders.
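Reversing this trick means mapping each cell's background-position offset back to a character. The character order and cell width below are invented; on Meituan both would have to be recovered per page:

```python
# Reversing a background-sprite offset (sketch). Assumes the sprite
# lays characters out left to right at a fixed cell width; both the
# order string and the width are hypothetical examples.
SPRITE_ORDER = "5209.48137"  # per-page character order (invented)
CELL_WIDTH = 12              # pixel width of one sprite cell (invented)

def char_from_offset(offset_px: int) -> str:
    """Map a negative background-position-x offset to a character."""
    index = abs(offset_px) // CELL_WIDTH
    return SPRITE_ORDER[index]
```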

3.3 Interspersed characters

Example: WeChat Official Account articles

Some WeChat Official Account articles are interspersed with all kinds of mysterious characters, which are then hidden away with styling.

Shocking as it looks, these characters are not that hard to identify and filter out. You could push the idea much further; it just takes imagination.
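Filtering is indeed straightforward: drop any element whose style hides it, then read what remains. The sketch below handles only inline `display: none` styles with the standard library; real pages hide decoys via classes and external CSS, which takes a proper CSS parser:

```python
import re

# Filtering interspersed decoy characters (sketch). Only the
# inline-style case is handled here, for illustration.
HIDDEN_SPAN = re.compile(
    r'<span[^>]*style="[^"]*display:\s*none[^"]*"[^>]*>.*?</span>',
    re.S,
)

def strip_hidden(markup: str) -> str:
    """Drop spans hidden with display:none, then remove remaining tags."""
    visible = HIDDEN_SPAN.sub("", markup)
    return re.sub(r"<[^>]+>", "", visible)
```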

(By the way, can anyone reimburse me for the mobile data this research cost?)

3.4 Hidden form of pseudo-element

Example: Autohome

On Autohome, key manufacturer information is placed in the content of pseudo-elements. The idea is the same: to scrape the page, a crawler has to parse the CSS and extract the pseudo-element content, which makes crawling harder.
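Recovering such text means fetching the stylesheet and pulling the `content` value for each `::before` rule. The selector and CSS in the example are invented; a robust version would use a real CSS parser rather than a regex:

```python
import re

# Extracting pseudo-element text from CSS (sketch). A regex is enough
# to show the idea; production crawlers would use a CSS parser.
CONTENT_RULE = re.compile(
    r'([^{}]+)::before\s*\{[^}]*content\s*:\s*"([^"]*)"'
)

def pseudo_contents(css: str) -> dict:
    """Map selectors to the text their ::before pseudo-element injects."""
    return {sel.strip(): text for sel, text in CONTENT_RULE.findall(css)}
```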

3.5 Element positioning overlay type

Example: Qunar

For a 4-digit air-ticket price, the page first renders four <i> tags, then uses two absolutely positioned <b> tags to cover the deliberately wrong <i> digits, finally forming the correct price visually…

This means a crawler not only has to parse the CSS, it also has to be able to do math.
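The math itself is simple once the layout is understood: start from the <i> digits in DOM order, then let each absolutely positioned <b> overwrite the slot its left offset points at. The slot width and sample values below are invented:

```python
# Resolving an overlaid price (sketch). Assumes each digit slot has a
# fixed pixel width; the width and example values are hypothetical.
DIGIT_WIDTH = 16  # pixel width of one digit slot (invented)

def resolve_price(i_digits: list, b_overlays: list) -> str:
    """Apply (left_px, digit) overlay corrections onto the base digits."""
    slots = list(i_digits)
    for left_px, digit in b_overlays:
        # The left offset tells us which base digit the <b> tag covers.
        slots[left_px // DIGIT_WIDTH] = digit
    return "".join(slots)
```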

3.6 Iframe Asynchronous loading

Example: NetEase Cloud Music

When a NetEase Cloud Music page is opened, the HTML source contains almost nothing but a single iframe, whose src is blank: about:blank. The JavaScript then starts running and asynchronously stuffs the entire page frame into that iframe…

However, this approach doesn't add much difficulty; it is just a twist on asynchronous loading and iframe handling (or perhaps it exists for some other reason and is not entirely an anti-crawler measure). Whether you use Selenium or PhantomJS, there is an API for retrieving iframe content.

3.7 Character segmentation

Example: proxy-IP listing sites

On some pages that list proxy-IP information, protecting the IPs themselves is also taken very seriously.

They first split the IP's digits and symbols across separate DOM nodes, and then insert decoy digits in between. A crawler unaware of this trick will think it has successfully grabbed the number. But once a crawler notices, it's easy to handle.
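One way to handle it, assuming the decoys can be recognized (here by an invented "decoy" class standing in for the site's real obfuscation classes), is to walk the DOM and keep only the visible text:

```python
from html.parser import HTMLParser

# Reassembling a split IP while skipping decoy nodes (sketch).
# The "decoy" class name is a hypothetical stand-in.
class IPExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # >0 while inside a decoy subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        if "decoy" in classes or self.hidden_depth:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.parts.append(data.strip())

def extract_ip(markup: str) -> str:
    parser = IPExtractor()
    parser.feed(markup)
    return "".join(parser.parts)
```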

3.8 Character set Substitution

Example: Qunar mobile site

Also fooling crawlers is the mobile version of Qunar.

The HTML says 3211, but what the user sees is 1233. It turns out they redefined the character set so that the glyphs for 3 and 1 are swapped…
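Once a crawler discovers the swap, say by rendering reference glyphs from the font, undoing it is a one-line inverse mapping. The 3↔1 swap mirrors the example above; real mappings must be recovered from each font:

```python
# Undoing a swapped character set (sketch). This hard-codes the 3<->1
# swap from the example; a real crawler must derive the mapping from
# the served font file.
SWAP = str.maketrans("31", "13")

def decode_displayed(raw_html_digits: str) -> str:
    """Recover what the user actually sees from the raw HTML digits."""
    return raw_html_digits.translate(SWAP)
```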