For a content company, data security matters, because the data is the product. If you run an online education platform, for example, your question bank is extremely valuable, so what happens when someone crawls all of it? Once your core competitive asset has been taken, you are in serious trouble. Or consider an independent developer who wants to clone your product: by capturing packets and running crawlers, they can extract your core data, put up a website and an app in a short time, and quickly become a fierce rival.

Crawler techniques

  • Most crawlers today simply locate the node of interest in the rendered HTML page and read its text
  • Some websites take extra precautions. For example, the list page may be easy to fetch, but to reach a details page you must click an item on the list page, which submits the itemId through a form; the server then generates the corresponding parameters and redirects to the details page (the detailID parameter is only available in the redirected address). This step alone stops some crawler developers
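To see why plain text in the markup is so easy to harvest, here is a minimal sketch of the first approach above. The HTML snippet, the ids, and the `textOfNode` helper are all made up for illustration; a real crawler would normally use a proper HTML parser rather than a regular expression.

```javascript
// Hypothetical page fragment: the interesting values sit in the markup as plain text.
const sampleHtml = '<ul><li id="year">1995</li><li id="day">20</li></ul>';

// Hypothetical helper: return the text of the first element carrying the given id.
function textOfNode(html, id) {
  const match = html.match(new RegExp('id="' + id + '"[^>]*>([^<]*)<'));
  return match ? match[1] : null;
}

console.log(textOfNode(sampleHtml, "year")); // 1995
console.log(textOfNode(sampleHtml, "day")); // 20
```

Everything the scheme in this article does is aimed at making the text obtained this way useless.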

Developing a web-side anti-crawling solution

I approached the problem from two angles (what is seen on the web page cannot be scraped directly, and inspecting the interface requests is useless) and developed the following anti-crawling scheme.

  • Use the HTTPS protocol

  • Limit the number of requests per unit of time; an account that exceeds the limit is blocked

  • Front-end technical restrictions (the core technique, described next)

For example, suppose the page needs to display the correct value "19950220".

  1. First, choose a mapping rule to suit your needs (a shuffled digit mapping: normally 0 corresponds to 0, but after shuffling it might be 0<->1, 1<->9, 3<->8, ...) and make a custom font (TTF) that matches it.
  2. Apply the shuffled mapping to the data that needs to be returned: 19950220 -> 17730220.
  3. Iterate over each character of the string obtained in step 2 and apply a linear transformation (y = kx + b) to it. The coefficient and the constant term are derived from the current date. For example, if the current date is "2018-07-24", then k is 7 and b is 24.
  4. Join the transformed values with "3.1415926" and return the result to the interface caller. (Why 3.1415926? The separator is there to mislead crawlers, so the spliced-in text must itself be numeric in order not to attract the attention of anyone studying the data. But if the number is too short it could collide with genuine data, so we use the familiar digits of π.)

    1773 -> "1*7+24" + "3.1415926" + "7*7+24" + "3.1415926" + "7*7+24" + "3.1415926" + "3*7+24" -> 313.1415926733.1415926733.141592645
    02 -> "0*7+24" + "3.1415926" + "2*7+24" -> 243.141592638
    20 -> "2*7+24" + "3.1415926" + "0*7+24" -> 383.141592624

The caller then reverses the process:

  1. Split the received string into an array on "3.1415926".
  2. For each element of the array, invert the linear transformation (y = kx + b, with k and b again derived from the current date) to recover the original value.
  3. Concatenate the values obtained in step 2 and render the page with the TTF file.
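The protocol above can be sketched as a small round trip. This is only an illustration under stated assumptions: the function names are hypothetical, k and b are passed in explicitly instead of being read from the clock (so the output is reproducible), and the shuffle table is the one that matches the 19950220 -> 17730220 example.

```javascript
const SEPARATOR = "3.1415926";

// Digit shuffle matching the walkthrough (9 -> 7, 5 -> 3, ...).
const SHUFFLE = { 0: 0, 1: 1, 2: 2, 3: 4, 4: 5, 5: 3, 6: 8, 7: 6, 8: 9, 9: 7 };

// Server side: shuffle each digit, apply y = k*x + b, join with the separator.
function encodeDigits(digits, k, b) {
  return digits
    .split("")
    .map((ch) => SHUFFLE[ch] * k + b)
    .join(SEPARATOR);
}

// Client side: split on the separator and invert the linear transformation.
// The result is still shuffled; the custom font is what makes it look right.
function decodeDigits(payload, k, b) {
  return payload
    .split(SEPARATOR)
    .map((part) => (parseInt(part, 10) - b) / k)
    .join("");
}

const wire = encodeDigits("1995", 7, 24); // date 2018-07-24 -> k = 7, b = 24
console.log(wire); // 313.1415926733.1415926733.141592645
console.log(decodeDigits(wire, 7, 24)); // 1773
```

Passing k and b as parameters also makes it easy to test the edge case where the server and the client disagree on the date, for instance around midnight or across time zones.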
  • The back end encodes the data according to the protocol designed in the previous step

Here’s what you need to do on the back end, using Node.js as an example

  • First, set up the interface routes on the back end

  • Get the parameters that follow the route

  • Generate the data according to the business requirements and SQL statements. Any numeric part must be converted as agreed above

  • Convert the generated data to JSON and return it to the caller

    // json
    var JoinOparatorSymbol = "3.1415926";
    function encode(rawData, ruleType) {
      if (!isNotEmptyStr(rawData)) {
        return "";
      }
      // k and b of the linear transformation come from the current date
      var date = new Date();
      var year = date.getFullYear();
      var month = date.getMonth() + 1;
      var day = date.getDate();

      var encodeData = "";
      for (var index = 0; index < rawData.length; index++) {
        var datacomponent = rawData[index];
        if (!isNaN(datacomponent)) {
          if (ruleType < 3) {
            // Shuffle the digit, then apply y = month * x + day
            var currentNumber = rawDataMap(String(datacomponent), ruleType);
            encodeData += (currentNumber * month + day) + JoinOparatorSymbol;
          }
          else if (ruleType == 4) {
            encodeData += rawDataMap(String(datacomponent), ruleType);
          }
          else {
            encodeData += rawDataMap(String(datacomponent), ruleType) + JoinOparatorSymbol;
          }
        }
        else if (ruleType == 4) {
          encodeData += rawDataMap(String(datacomponent), ruleType);
        }
      }
      // Strip the trailing separator, if any
      if (encodeData.length >= JoinOparatorSymbol.length) {
        var lastTwoString = encodeData.substring(encodeData.length - JoinOparatorSymbol.length, encodeData.length);
        if (lastTwoString == JoinOparatorSymbol) {
          encodeData = encodeData.substring(0, encodeData.length - JoinOparatorSymbol.length);
        }
      }
      return encodeData;
    }

    // Font mapping processing
    function rawDataMap(rawData, ruleType) {
    
      if (!isNotEmptyStr(rawData) || !isNotEmptyStr(ruleType)) {
        return;
      }
      var mapData;
      var rawNumber = parseInt(rawData);
      var ruleTypeNumber = parseInt(ruleType);
      if (!isNaN(rawData)) {
        lastNumberCategory = ruleTypeNumber;
        // Data encryption rules under font file 1
        if (ruleTypeNumber == 1) {
          if (rawNumber == 1) {
            mapData = 1;
          }
          else if (rawNumber == 2) {
            mapData = 2;
          }
          else if (rawNumber == 3) {
            mapData = 4;
          }
          else if (rawNumber == 4) {
            mapData = 5;
          }
          else if (rawNumber == 5) {
            mapData = 3;
          }
          else if (rawNumber == 6) {
            mapData = 8;
          }
          else if (rawNumber == 7) {
            mapData = 6;
          }
          else if (rawNumber == 8) {
            mapData = 9;
          }
          else if (rawNumber == 9) {
            mapData = 7;
          }
          else if (rawNumber == 0) {
            mapData = 0;
          }
        }
        // Data encryption rules under font file 2
        else if (ruleTypeNumber == 0) {
    
          if (rawNumber == 1) {
            mapData = 4;
          }
          else if (rawNumber == 2) {
            mapData = 2;
          }
          else if (rawNumber == 3) {
            mapData = 3;
          }
          else if (rawNumber == 4) {
            mapData = 1;
          }
          else if (rawNumber == 5) {
            mapData = 8;
          }
          else if (rawNumber == 6) {
            mapData = 5;
          }
          else if (rawNumber == 7) {
            mapData = 6;
          }
          else if (rawNumber == 8) {
            mapData = 7;
          }
          else if (rawNumber == 9) {
            mapData = 9;
          }
          else if (rawNumber == 0) {
            mapData = 0;
          }
        }
        // Data encryption rules under font file 3
        else if (ruleTypeNumber == 2) {
    
          if (rawNumber == 1) {
            mapData = 6;
          }
          else if (rawNumber == 2) {
            mapData = 2;
          }
          else if (rawNumber == 3) {
            mapData = 1;
          }
          else if (rawNumber == 4) {
            mapData = 3;
          }
          else if (rawNumber == 5) {
            mapData = 4;
          }
          else if (rawNumber == 6) {
            mapData = 8;
          }
          else if (rawNumber == 7) {
            mapData = 3;
          }
          else if (rawNumber == 8) {
            mapData = 7;
          }
          else if (rawNumber == 9) {
            mapData = 9;
          }
          else if (rawNumber == 0) {
            mapData = 0;
          }
        }
        else if (ruleTypeNumber == 3) {
    
          if (rawNumber == 1) {
            mapData = "&#xefab;";
          }
          else if (rawNumber == 2) {
            mapData = "&#xeba3;";
          }
          else if (rawNumber == 3) {
            mapData = "&#xecfa;";
          }
          else if (rawNumber == 4) {
            mapData = "&#xedfd;";
          }
          else if (rawNumber == 5) {
            mapData = "&#xeffa;";
          }
          else if (rawNumber == 6) {
            mapData = "&#xef3a;";
          }
          else if (rawNumber == 7) {
            mapData = "&#xe6f5;";
          }
          else if (rawNumber == 8) {
            mapData = "&#xecb2;";
          }
          else if (rawNumber == 9) {
            mapData = "&#xe8ae;";
          }
          else if (rawNumber == 0) {
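            mapData = "&#xe1f2;";
          }
        }
        else {
          mapData = rawNumber;
        }
      }
      else if (ruleTypeNumber == 4) {
        // Note: in the original source these entries are single Chinese characters; most were machine-translated into English words here
        var sources = ["Year", "万", "Industry", "People", "Believe", "Yuan", "Thousand", "Department", "State", "Information", "Build", "Money"];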
        // The string is Chinese
        if (/^[\u4e00-\u9fa5]*$/.test(rawData)) {
    
          if (sources.indexOf(rawData) > - 1) {
            var currentChineseHexcod = rawData.charCodeAt(0).toString(16);
            var lastCompoent;
            var mapComponetnt;
            var numbers = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"];
            var characters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"];
    
            if (currentChineseHexcod.length == 4) {
              lastCompoent = currentChineseHexcod.substr(3, 1);
              var locationInComponents = 0;
              if (/[0-9]/.test(lastCompoent)) {
                locationInComponents = numbers.indexOf(lastCompoent);
                mapComponetnt = numbers[(locationInComponents + 1) % 10];
              }
              else if (/[a-z]/.test(lastCompoent)) {
                locationInComponents = characters.indexOf(lastCompoent);
                mapComponetnt = characters[(locationInComponents + 1) % 26];
              }
              mapData = "&#x" + currentChineseHexcod.substr(0, 3) + mapComponetnt + ";";
            }
          }
          else {
            mapData = rawData;
          }
        }
        else if (/[0-9]/.test(rawData)) {
          mapData = rawDataMap(rawData, 2);
        }
        else {
          mapData = rawData;
        }
      }
      return mapData;
    }
    //api
    module.exports = {
        "GET /api/products": async (ctx, next) => {
            ctx.response.type = "application/json";
            ctx.response.body = {
                products: products
            };
        },
    
        "GET /api/solution1": async (ctx, next) => {
    
            try {
                var data = fs.readFileSync(pathname, "utf-8");
                ruleJson = JSON.parse(data);
                rule = ruleJson.data.rule;
            } catch (error) {
                console.log("fail: " + error);
            }
    
            var data = {
                code: 200,
                message: "success",
                data: {
                    name: "@Hangcheng Xiao Liu",
                    year: LBPEncode("1995", rule),
                    month: LBPEncode("02", rule),
                    day: LBPEncode("20", rule),
                    analysis : rule
                }
            }
    
            ctx.set("Access-Control-Allow-Origin", "*");
            ctx.response.type = "application/json";
            ctx.response.body = data;
        },
    
    
        "GET /api/solution2": async (ctx, next) => {
            try {
                var data = fs.readFileSync(pathname, "utf-8");
                ruleJson = JSON.parse(data);
                rule = ruleJson.data.rule;
            } catch (error) {
                console.log("fail: " + error);
            }
    
            var data = {
                code: 200,
                message: "success",
                data: {
                    name: LBPEncode("Builder",rule),
                    birthday: LBPEncode("February 20, 1995",rule),
                    company: LBPEncode("Zhongtian Company",rule),
                    address: LBPEncode("Shixiang Road, Gongshu District, Hangzhou City, Zhejiang Province",rule),
                    bidprice: LBPEncode("20000 dollars",rule),
                    negative: LBPEncode("Too productive and negative in 2018.",rule),
                    title: LBPEncode("Builder",rule),
                    honor: LBPEncode("Best prize",rule),
                    analysis : rule
                }
            }
            ctx.set("Access-Control-Allow-Origin", "*");
            ctx.response.type = "application/json";
            ctx.response.body = data;
        },
    
        "POST /api/products": async (ctx, next) => {
            var p = {
                name: ctx.request.body.name,
                price: ctx.request.body.price
            };
            products.push(p);
            ctx.response.type = "application/json";
            ctx.response.body = p;
        }
    };

    // routing
    const fs = require("fs");
    
    function addMapping(router, mapping){
        for(var url in mapping){
            if (url.startsWith("GET ")) {
                var path = url.substring(4);
                router.get(path,mapping[url]);
                console.log(`Register URL mapping: GET: ${path}`);
            }else if (url.startsWith('POST ')) {
                var path = url.substring(5);
                router.post(path, mapping[url]);
                console.log(`Register URL mapping: POST ${path}`);
            } else if (url.startsWith('PUT ')) {
                var path = url.substring(4);
                router.put(path, mapping[url]);
                console.log(`Register URL mapping: PUT ${path}`);
            } else if (url.startsWith('DELETE ')) {
                var path = url.substring(7);
                router.del(path, mapping[url]);
                console.log(`Register URL mapping: DELETE ${path}`);
            } else {
                console.log(`Invalid URL: ${url}`);
            }
        }
    }

    function addControllers(router, dir){
        fs.readdirSync(__dirname + "/" + dir).filter( (f) => {
            return f.endsWith(".js");
        }).forEach( (f) => {
            console.log(`Process controllers:${f}. `);
            let mapping = require(__dirname + "/" + dir + "/" + f);
            addMapping(router,mapping);
        });
    }
    
    module.exports = function(dir){
        let controllers = dir || "controller";
        let router = require("koa-router")();
        addControllers(router, controllers);
        return router.routes();
    };
    
    
  • The front end decodes the data returned by the server

    $("#year").html(getRawData(data.year,log));
    
    // util.js
    var JoinOparatorSymbol = "3.1415926";
    function isNotEmptyStr($str) {
      if (String($str) == "" || $str == undefined || $str == null || $str == "null") {
        return false;
      }
      return true;
    }
    
    function getRawData($json, analisys) {
      $json = $json.toString();
      if (!isNotEmptyStr($json)) {
        return;
      }

      // k and b must match the values the server derived from the date
      var date = new Date();
      var year = date.getFullYear();
      var month = date.getMonth() + 1;
      var day = date.getDate();
      var datacomponents = $json.split(JoinOparatorSymbol);
      var orginalMessage = "";
      for (var index = 0; index < datacomponents.length; index++) {
        var datacomponent = datacomponents[index];
        if (!isNaN(datacomponent) && analisys < 3) {
          // Invert y = month * x + day
          var currentNumber = parseInt(datacomponent);
          orginalMessage += (currentNumber - day) / month;
        }
        else if (analisys == 3) {
          orginalMessage += datacomponent;
        }
        else {
          // Other cases to be continued; this Demo will be updated as my research into and practice of anti-crawling techniques progresses
        }
      }
      return orginalMessage;
    }


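For example, suppose the back end returns 323.14743.14743.1446. Decoding it with the agreed algorithm yields 1773. (These numbers are consistent with a separator shortened to 3.14 and with k = 7, b = 25: 32 -> 1, 74 -> 7, 74 -> 7, 46 -> 3.)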

  • Render the page with the TTF file

    The calculation above yields 1773; rendered with the TTF file, the page shows 1995
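The last step is not code at all but the font: each digit that arrives over the wire is drawn with the glyph of the real digit. As a rough model of what the font does, the sketch below inverts the shuffle table of font file 1 from the back-end code; the names are hypothetical and no real font is involved.

```javascript
// Server-side shuffle for font file 1: real digit -> digit sent over the wire.
const FONT1_SHUFFLE = { 0: 0, 1: 1, 2: 2, 3: 4, 4: 5, 5: 3, 6: 8, 7: 6, 8: 9, 9: 7 };

// The font draws, for every wire digit, the glyph of the real digit it stands for.
const GLYPH_FOR_WIRE_DIGIT = {};
for (const [realDigit, wireDigit] of Object.entries(FONT1_SHUFFLE)) {
  GLYPH_FOR_WIRE_DIGIT[wireDigit] = realDigit;
}

// What a human sees on screen for a given text node.
function renderedText(wireText) {
  return wireText
    .split("")
    .map((digit) => GLYPH_FOR_WIRE_DIGIT[digit])
    .join("");
}

console.log(renderedText("1773")); // 1995
```

A crawler reading the DOM gets "1773"; only someone who also reverse-engineers the font recovers "1995".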

  • Then, to keep crawler developers from working out the scheme by reading the JS, obfuscate the JS files. If your technology stack is Vue, React, or similar, webpack offers a plugin for JS obfuscation, so this is also easy to handle

    JS obfuscation tool

  • Personally, I don't think this approach alone is very safe, so I came up with a combination of punches

Anti-crawling, upgraded version

Personally, I think a crawler developer with rich front-end experience could still crack the solution above, so I built an upgraded version on top of it

  1. Combo punch 1: the font file is not fixed. The requested link stays the same, but the server takes the last digit of the current timestamp modulo some number; the Demo uses modulo 4, giving four values 0, 1, 2, 3. These four values correspond to different font files, so when a crawler developer has painstakingly cracked the font for one case, they do not expect that on the next request the rules of the font file have changed 😂
  2. Combo punch 2: the previous rule only shuffles digits within the font, matching one digit to another, like 1 -> 4 or 5 -> 8. The next step is to assign each digit a Unicode code point and make your own font for those code points, which can be .ttf, .woff, etc.

After this combination of punches, the average crawler developer gives up.
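Both punches can be sketched in a few lines. The helper names are hypothetical; the entity table is copied from the rule 3 branch of rawDataMap in the back-end code, and the rule selection follows the description above (last digit of the timestamp, modulo 4).

```javascript
// Combo punch 1: pick one of four font/rule variants from the last digit of a timestamp.
function pickRule(timestampMs) {
  const lastDigit = Number(String(timestampMs).slice(-1));
  return lastDigit % 4; // 0, 1, 2 or 3
}

// Combo punch 2: every digit is sent as a private-use code point that only
// the custom font knows how to draw.
const DIGIT_ENTITIES = {
  0: "&#xe1f2;", 1: "&#xefab;", 2: "&#xeba3;", 3: "&#xecfa;", 4: "&#xedfd;",
  5: "&#xeffa;", 6: "&#xef3a;", 7: "&#xe6f5;", 8: "&#xecb2;", 9: "&#xe8ae;",
};

function digitsToEntities(digits) {
  return digits
    .split("")
    .map((digit) => DIGIT_ENTITIES[digit])
    .join("");
}

console.log(pickRule(1532400007)); // 3
console.log(digitsToEntities("1995")); // &#xefab;&#xe8ae;&#xe8ae;&#xeffa;
```

Note that taking a last digit (0 to 9) modulo 4 does not select the four variants equally often; if that matters, the whole timestamp could be reduced modulo 4 instead.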

Escalating the anti-crawling measures

The methods above mainly protect numbers. What about protecting Chinese characters? Here are a few options

  1. Solution 1: for the words that appear most frequently on your site (your word cloud), make a Chinese character map, that is, a custom font file, using the same steps as for the numbers. First, generate a TTF file containing the commonly used Chinese characters. Then convert the TTF file to an SVG file using the link provided below, select the generated SVG file on the website under the “font mapping” link, and map each Chinese character in the SVG file; in other words, assign each Chinese character a Unicode code. (Note that the Unicode codes should not be taken directly from the online generator, because what it generates is regular and predictable. My method is to generate them with the website first and then make a simple change to the result, such as “e342” to “e231”.) Finally, map the data returned by the interface according to the rules of our font file.

  2. Solution 2: render the site's important text as images inside the HTML, so that identifying the required content becomes very expensive for the crawler: it needs OCR, which is also very inefficient. This lets us stop some of the crawlers

  3. Solution 3: from a Ctrip technology talk: “The highest level of anti-crawling is Canvas fingerprinting. The principle is that different machines and different hardware always produce pixel-level differences in drawings rendered by Canvas. Therefore, if a large number of visits share the same Canvas fingerprint, we consider them crawlers and can block them”.
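The server-side half of Solution 3 amounts to counting visits per fingerprint. The sketch below only illustrates that counting step: the threshold, the fingerprint strings, and the function name are hypothetical, and how the browser computes a Canvas fingerprint is not shown.

```javascript
// Hypothetical threshold: more visits than this with one fingerprint looks automated.
const MAX_VISITS_PER_FINGERPRINT = 3;
const visitsByFingerprint = new Map();

// Record one visit and report whether this fingerprint should now be blocked.
function recordVisit(fingerprint) {
  const count = (visitsByFingerprint.get(fingerprint) || 0) + 1;
  visitsByFingerprint.set(fingerprint, count);
  return count > MAX_VISITS_PER_FINGERPRINT;
}

for (let i = 1; i <= 4; i++) {
  console.log(i, recordVisit("fp-abc")); // false, false, false, then true
}
```

In practice the counts would need a time window and a threshold tuned to real traffic, since ordinary visitors on identical hardware can also share a fingerprint.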

    I will implement Solution 1 in the Demo.

Key steps

  1. Start by finding the commonly used keywords for your product and generate a word cloud
  2. Based on the word cloud, generate a Unicode code for each character
  3. Make a font library of the characters included in the word cloud
  4. Convert the TTF to SVG format and upload it to icomoon to create a custom font, following a rule. For example, the Unicode code for “年” (year) is “\u5e74”, but we apply a Caesar cipher to it; with an offset of 1, the Unicode code for “年” after Caesar encryption is “\u5e75”. Use this rule to make the font library we need
  5. Each time the interface is called, the server uses a helper method to determine whether each character of the data is in the word cloud. If it is, the server applies the rule (look up the Unicode code of the Chinese character, then shift it by the Caesar offset, which is 1 in the Demo) to each such character and returns the encrypted data
  6. What the client does:
    • First load the Chinese character font library we created earlier
    • Call the interface, take the data, and display it in the corresponding DOM node
    • If it is Chinese text, set the CSS class of the corresponding node to the Chinese character class, which uses the font library loaded above

    /* style.css */
    @font-face {
      font-family: "NumberFont";
      src: url("http://127.0.0.1:8080/Util/analysis");
      -webkit-font-smoothing: antialiased;
      -moz-osx-font-smoothing: grayscale;
    }

    @font-face {
      font-family: "CharacterFont";
      src: url("http://127.0.0.1:8080/Util/map");
      -webkit-font-smoothing: antialiased;
      -moz-osx-font-smoothing: grayscale;
    }

    h2 {
      font-family: "NumberFont";
    }

    h3, a {
      font-family: "CharacterFont";
    }
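The server-side rule from step 5 can be sketched as follows. This is a simplified illustration, not the Demo's code: the word cloud here holds a single sample character, the names are hypothetical, and the offset is added to the whole code point, whereas rawDataMap in the back-end code shifts only the last hexadecimal digit.

```javascript
// Hypothetical one-character word cloud; a real list would be much longer.
const WORD_CLOUD = new Set(["年"]);
const CAESAR_OFFSET = 1;

// Replace every word-cloud character with an entity for its shifted code point.
function encodeText(text) {
  return Array.from(text)
    .map((ch) =>
      WORD_CLOUD.has(ch)
        ? "&#x" + (ch.codePointAt(0) + CAESAR_OFFSET).toString(16) + ";"
        : ch
    )
    .join("");
}

console.log(encodeText("2018年")); // 2018&#x5e75;
```

The custom font then has to draw the glyph for "年" at U+5E75, which is why the font and the server rule must be generated from the same offset.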

Links

Font making steps, TTF to SVG, font mapping rules

Results achieved

  1. The data shown on the page does not match what Inspect Element shows
  2. The data returned by the interface matches neither what Inspect Element shows nor what the page shows
  3. The results differ again after every page refresh
  4. Numbers and Chinese characters are handled differently



The TTF to SVG site linked earlier limits the conversion when the TTF file is too large and asks you to pay. A new link is posted below.

TTF to SVG converter

Demo address

Running steps

    // Client. Check your machine's IP and change the interface address in Demo/Spider-develop/Solution/Solution1.js and Demo/Spider-develop/Solution/Solution2.js to that IP
    $ cd Demo
    $ ls
    REST Spider-release file-Server.js Spider-develop Util rule.json
    $ node file-Server.js
    Server is runnig at http://127.0.0.1:8080
    $ cd REST/
    $ npm install
    $ node app.js