Hello, everyone. Today I would like to walk through the research and implementation of an anti-crawler technical scheme. For content-oriented companies, the importance of data security is self-evident. For an online education platform, the question-bank data is critical; if someone crawls it all away with crawler technology, the result is painful. Or suppose an indie developer wants to copy your product: he takes away your core data through packet capture and crawlers, and in a short time builds a website and app that become your competitor. (This article was posted on SegmentFault and received 148 upvotes.)

Security in the big-front-end era


If you want to learn more about security at the big-front-end level (Web, App, API), check out my other article.

The crawler engineer's methods


  • Find the nodes of interest directly in the rendered HTML page and extract the corresponding text
  • Analyze the corresponding API responses, which is a more convenient and accurate way to obtain the data

Working out a Web-side anti-crawling scheme


Starting from these two angles (make the rendered page unreadable, and make inspecting API requests useless), I developed the following anti-crawling scheme.

  • Use HTTPS
  • Block the account if too many requests arrive per unit of time
  • Front-end technical restrictions (the core technique, described next)
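The second bullet, blocking accounts that send too many requests per unit of time, can be sketched as a simple sliding-window rate limiter. This is an illustrative sketch, not the author's actual implementation; the limit and window values are assumptions.

```javascript
// Sliding-window rate limiter: block an account that exceeds
// `limit` requests within `windowMs` milliseconds (values are assumptions).
class RateLimiter {
  constructor(limit = 100, windowMs = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.hits = new Map(); // account -> timestamps of recent requests
  }

  // Returns true if the request is allowed, false if the account should be blocked.
  allow(account, now = Date.now()) {
    const times = (this.hits.get(account) || []).filter(
      (t) => now - t < this.windowMs
    );
    times.push(now);
    this.hits.set(account, times);
    return times.length <= this.limit;
  }
}
```

In a real deployment this state would live in something like Redis rather than in-process memory, so that all API servers share the same counters.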

For example, suppose the data that should be displayed correctly is “19950220”.

  1. First, define a digit out-of-order mapping according to your needs (e.g., 0 still maps to 0, but 1 <-> 9, 3 <-> 8, …) and create a custom font (TTF) that implements it
  2. Apply this out-of-order mapping to the real data to obtain the string to be returned
  3. Iterate over each character of the string obtained in the previous step and apply a linear transformation (y = kx + b) to it. The coefficient and constant term of the linear equation are derived from the current date; for example, if the current date is “2018-07-24”, then k is 7 and b is 24
  4. Join the transformed values with the string “3.1415926” and return the result to the caller. (Why 3.1415926? Because for digit forgery, a separator that is itself a number will not attract a researcher's attention; but if the number were too short, normal data could be matched by accident, so the familiar π is used)

For example, if the back end returns 323.14743.14743.1446, then by the agreed algorithm the decoded result is 1773.

The decoded 1773 is then rendered through the custom TTF file, so what the page shows is 1995.
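The linear transform and π-separator steps can be sketched end to end. Two assumptions are needed to stay numerically consistent with the worked example (1773 → 323.14743.14743.1446): the separator must be the π prefix "3.14" rather than the full "3.1415926", and the constant term b must be 25. The function names are mine, not the author's.

```javascript
// Encode: apply y = k*x + b to each digit, then join with a π-prefix separator.
// K, B, and SEP are assumptions chosen to match the article's worked example.
const K = 7, B = 25, SEP = '3.14';

function encode(digits) {
  return digits
    .split('')
    .map((ch) => String(K * Number(ch) + B))
    .join(SEP);
}

// Decode: split on the separator and invert the linear transform.
function decode(payload) {
  return payload
    .split(SEP)
    .map((v) => String((Number(v) - B) / K))
    .join('');
}
```

Decoding 323.14743.14743.1446 yields 1773, which the out-of-order TTF then renders on screen as 1995.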

  • Then, to stop crawler developers from reading the JS and reverse-engineering the scheme, the JS files are obfuscated. If your stack is Vue, React, etc., Webpack offers JS obfuscation plugins, which make this easy to handle

JS obfuscation tool
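As a sketch, a Webpack setup using the community webpack-obfuscator plugin (which wraps javascript-obfuscator) might look like this. The options shown are illustrative assumptions, not the author's configuration.

```javascript
// webpack.config.js — illustrative obfuscation setup with webpack-obfuscator
const WebpackObfuscator = require('webpack-obfuscator');

module.exports = {
  // ...entry, output, loaders...
  plugins: [
    new WebpackObfuscator(
      {
        stringArray: true,       // move string literals into a lookup array
        rotateStringArray: true, // shuffle that array to hinder static reading
      },
      ['excluded_bundle.js']     // bundles to skip obfuscating
    ),
  ],
};
```

Keep in mind that obfuscation only raises the cost of reading the code; it is not encryption, and a determined researcher can still step through it in a debugger.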

  • Personally, I don't think this approach alone is very safe, so I came up with combinations of several schemes, such as the following

Anti-crawling upgrade


Personally, I think a crawler developer with rich front-end experience could still crack the scheme above, so I built an upgraded version on top of it.

  • Combination 1: The font file is not fixed. Although the requested URL stays the same, the last digit of the current timestamp is taken modulo some number; in the demo it is modulo 4, which yields the four values 0, 1, 2, 3, each corresponding to a different font file. So even when a crawler works hard to crack the font in one case, it does not expect that on the next request the font-file rule has already changed
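Combination 1 amounts to a small selection function: take the last digit of the timestamp, reduce it modulo 4, and serve the matching font file. A minimal sketch (the font file names are hypothetical):

```javascript
// Choose one of four font files based on the last digit of the timestamp.
// The font file names here are hypothetical examples.
const FONT_FILES = ['font-0.ttf', 'font-1.ttf', 'font-2.ttf', 'font-3.ttf'];

function pickFont(timestampMs) {
  const lastDigit = timestampMs % 10; // last digit of the timestamp
  return FONT_FILES[lastDigit % 4];   // modulo 4 -> 0, 1, 2, 3
}
```

The server must of course apply the same selection when encoding the response, so that data and font stay in sync for any given request.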

  • Combination 2: The previous rule only shuffled which digit maps to which, e.g. 1 -> 4, 5 -> 8. The next trick is to assign each digit its own Unicode code point and then build your own font from it, whether .ttf, .woff, etc.

The combination of these blows is enough to stop the average crawler.

Further anti-crawling upgrades


The methods above mainly target numbers. What if you want anti-crawling for Chinese characters? Several schemes are offered below.

Scheme 1: Build a Chinese-character mapping, i.e. a custom font file, for the highest-frequency word cloud on your site, following the same steps as for digits. First, generate a TTF file for the commonly used Chinese characters. Then convert the TTF file to an SVG file using the link provided below, select the generated SVG file on the website behind the “Font mapping” link, and map each Chinese character in the SVG file — in other words, give each Chinese character its own Unicode code point (note: do not use the codes the site generates directly, because direct generation is also regular; my method is to generate them with the website and then make small manual changes to the results, such as “e342” to “e231”). Finally, the data returned by the API is mapped backward according to the rules of our font file.

Scheme 2: Render the important text and HTML of the site as images. This makes the crawler's cost of extracting the content very high — OCR is needed, and it is inefficient — so it can stop some of the crawlers.

Scheme 3: The principle is that different machines and different hardware always show pixel-level differences in drawings produced with Canvas. Therefore, if a large number of accesses share an identical Canvas fingerprint, we judge the visitor to be a crawler and can block it.
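The server side of Scheme 3 reduces to counting identical fingerprints (the fingerprint itself is produced in the browser, typically by hashing the output of canvas `toDataURL()`). A minimal sketch; the threshold is an assumption, and the function name is mine:

```javascript
// Count how many requests in a batch share the same canvas fingerprint;
// flag a fingerprint as a likely crawler when it repeats too often.
// The threshold value is an assumption for illustration.
function suspiciousFingerprints(fingerprints, threshold = 100) {
  const counts = new Map();
  for (const fp of fingerprints) {
    counts.set(fp, (counts.get(fp) || 0) + 1);
  }
  return [...counts.entries()]
    .filter(([, n]) => n >= threshold)
    .map(([fp]) => fp);
}
```

In practice the counting would run over a time window and be combined with other signals (IP, account, request rate) before blocking, to avoid punishing users who happen to share hardware.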

Key steps

  1. First, find the keywords commonly used in your product and generate a word cloud
  2. Based on the word cloud, assign a Unicode code point to each character
  3. Build the Chinese characters contained in the word cloud into a font library
  4. Convert the TTF font library to SVG format, then upload it to icomoon to make a custom font — but with a rule: for example, the Unicode code of “年” (year) is “\u5e74”, but we apply a Caesar cipher; with an offset of 1, the Caesar-encrypted code of “年” becomes “\u5e75”. Use this rule to create the font library we need
  5. On each API call, the server does this: it encapsulates a method that determines whether the data contains word-cloud characters. If a character is in the word cloud, it applies the rule (look up the Unicode code point of the Chinese character, then Caesar-encrypt it with the agreed offset, which is 1 in the demo) and returns the encrypted data
  6. What the client does:
    • Import the Chinese font library we made earlier
    • Call the API to get the data and render it into the corresponding DOM node
    • If the text consists of Chinese characters, set the node's CSS class to the Chinese-character class, whose font-family is the Chinese font library imported above
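The server-side half of steps 4–5 — Caesar-shifting the code point of each word-cloud character — can be sketched as follows (function name is mine; the offset of 1 matches the demo):

```javascript
// Caesar-encrypt word-cloud characters by shifting their Unicode code points.
// Characters outside the word cloud pass through unchanged.
const OFFSET = 1; // offset used in the demo

function caesarEncrypt(text, wordCloud) {
  return [...text]
    .map((ch) =>
      wordCloud.has(ch)
        ? String.fromCodePoint(ch.codePointAt(0) + OFFSET)
        : ch
    )
    .join('');
}
```

The client then renders these shifted code points with the custom font, whose glyph for \u5e75 is the original 年, so the page still reads correctly while the raw response does not.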

If you want to implement this yourself, check out this open-source project to see what it does, how it works, and why.

Open source address: Github.com/FantasticLB…

Did you like today's recommendation? If so, please leave a comment at the bottom of the article or give it a like to show your support. Your comments, likes, shares, and follows are what keep me updating!

Follow the official account and reply “1024” to receive a big batch of free learning resources — first come, first served!