A brief analysis of front-end strategies in crawler attack and defense

I recently read an article about the imaginative tricks front-end engineers use against crawlers (see the article here). It describes how several large websites adopt a variety of anti-crawler strategies, all of which show real imagination on the front end. I found it very interesting, so I briefly analyzed each scheme to see whether it can be worked around, and wrote up the results in this blog post.

1. Custom fonts

This scheme defines a custom font. The page source contains garbled or otherwise obfuscated characters, and only rendering them with the custom font displays the correct data.

Representative sites are Maoyan Movies and Qunar’s mobile site.

1.1 Maoyan Movies

The figure above shows today’s top 10 box office statistics on the Maoyan home page (the screenshot captures only the top 3). Maoyan wants to keep its box office data away from crawlers, so the data is “encrypted”.

In the page source, the numbers are just a bunch of gibberish boxes…

Inspecting the “box” numbers in the browser’s developer tools shows that a custom font is used to render them as readable digits.

WOFF (Web Open Font Format) is an open font format for web pages. For details, see MDN.

Let’s download the WOFF font file and see what this custom font is hiding.

Here I recommend an online font editing tool: Baidu’s font editor.

Open the downloaded WOFF file in Baidu’s font editor:

The font file defines the digits 0-9, plus a blank character and a decimal point, under random Unicode code points. The order in which the digits are defined is not fixed, and the code points are not contiguous.

That is, the Unicode code point of each box in the HTML page corresponds to a glyph in the font, which you can verify with the charCodeAt() method.
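The same check can be run outside the browser. The sketch below uses Python’s built-in ord() in the role of charCodeAt(); the function name code_points and the two sample characters are illustrative assumptions, not taken from the Maoyan page.

```python
def code_points(text):
    """Return the hexadecimal Unicode code point of every character in text."""
    return [f"{ord(ch):04x}" for ch in text]


# Two private-use characters of the kind that show up as boxes in a page source.
print(code_points("\ueaba\uf4ca"))  # ['eaba', 'f4ca']
```

Each value printed can then be compared with the glyph names listed in the font editor.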

If the WOFF file were fixed, the problem would be simple. However, Maoyan’s WOFF file is not fixed; a new one is generated at random.

It would also be easy if the glyphs were defined in the font file in the same order as the real digits, or if their Unicode values sorted in the same order as the real digits. But the order is random too…

So, is there really no way to do it?

Of course there is! Anything that can be displayed on the page can be extracted.

In fact, the code of Baidu’s font editor is open source, and it relies on a core library, fonteditor-core, which should also be able to parse font data. In my experiments, however, it always failed with a parsing error and I do not know why. Interested readers can investigate and share what they find.

Instead, I found fontTools, a Python library that can convert a font into an XML file, and looked for information in the XML:

from fontTools.ttLib import TTFont



font = TTFont('/Users/coolcao/Downloads/b0a53bf9d791622d4681b8344fd118f92088.woff')

font.saveXML('/Users/coolcao/maoyan2.xml')


The generated XML file has two important parts:

<GlyphOrder>

  <!-- The 'id' attribute is only for humans; it is ignored when parsed. -->

  <GlyphID id="0" name="glyph00000"/>

  <GlyphID id="1" name="x"/>

  <GlyphID id="2" name="uniEABA"/>

  <GlyphID id="3" name="uniEB51"/>

  <GlyphID id="4" name="uniE06D"/>

  <GlyphID id="5" name="uniF88C"/>

  <GlyphID id="6" name="uniF012"/>

  <GlyphID id="7" name="uniF6C7"/>

  <GlyphID id="8" name="uniE373"/>

  <GlyphID id="9" name="uniF48F"/>

  <GlyphID id="10" name="uniE429"/>

  <GlyphID id="11" name="uniF4CA"/>

</GlyphOrder>


The first part is the glyph order, which lists the names of the glyphs in the font. Note that the id here has nothing to do with the digit a glyph draws; it does not mean 0, 1, 2, and so on. Each name is derived from the glyph’s Unicode value, which matches exactly what Baidu’s font editor shows.

<TTGlyph name="uniF4CA" xMin="0" yMin="-13" xMax="511" yMax="719">

  <contour>

    <pt x="130" y="201" on="1"/>

    <pt x="145" y="126" on="0"/>

    <pt x="216" y="60" on="0"/>

    <pt x="270" y="60" on="1"/>

    <pt x="332" y="60" on="0"/>

    <pt x="417" y="146" on="0"/>

    <pt x="417" y="270" on="0"/>

    <pt x="378" y="309" on="1"/>

    <pt x="337" y="349" on="0"/>

    <pt x="277" y="349" on="1"/>

    <pt x="251" y="349" on="0"/>

    <pt x="215" y="339" on="1"/>

    <pt x="225" y="416" on="1"/>

    <pt x="239" y="415" on="1"/>

    <pt x="296" y="415" on="0"/>

    <pt x="385" y="474" on="0"/>

    <pt x="385" y="535" on="1"/>

    <pt x="385" y="583" on="0"/>

    <pt x="322" y="646" on="0"/>

    <pt x="268" y="646" on="1"/>

    <pt x="217" y="646" on="0"/>

    <pt x="149" y="584" on="0"/>

    <pt x="139" y="518" on="1"/>

    <pt x="51" y="533" on="1"/>

    <pt x="67" y="623" on="0"/>

    <pt x="124" y="670" on="1"/>

    <pt x="182" y="719" on="0"/>

    <pt x="266" y="719" on="1"/>

    <pt x="324" y="719" on="0"/>

    <pt x="374" y="693" on="1"/>

    <pt x="423" y="669" on="0"/>

    <pt x="476" y="581" on="0"/>

    <pt x="476" y="485" on="0"/>

    <pt x="426" y="410" on="0"/>

    <pt x="377" y="388" on="1"/>

    <pt x="440" y="373" on="0"/>

    <pt x="511" y="281" on="0"/>

    <pt x="511" y="211" on="1"/>

    <pt x="511" y="118" on="0"/>

    <pt x="374" y="-13" on="0"/>

    <pt x="270" y="-13" on="1"/>

    <pt x="175" y="-13" on="0"/>

    <pt x="51" y="99" on="0"/>

    <pt x="42" y="189" on="1"/>

  </contour>

  <instructions/>

</TTGlyph>


The second part is the outline coordinate data for each glyph; here I extracted only one glyph, uniF4CA. If we refresh the page twice more, download the two new WOFF files, and convert them to XML, a comparison shows that although the Unicode values differ each time, the order is random, and the values are not contiguous, one thing stays the same: the coordinate data in the second part. Why? Because the shape of each digit is fixed, the coordinates that draw it must be the same.

At this point the approach is basically clear. We manually label the coordinates of each digit once. Then, every time we refresh and get a new WOFF font, we use fontTools to convert it to XML, match each glyph’s coordinates against the labelled ones to determine which Unicode value stands for which digit, and finally convert the “boxes” in the page source into real numbers.
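The matching step can be sketched with the standard library alone, working on the XML that fontTools produces. This is a minimal sketch under two assumptions: the outlines are identical from one font to the next (as observed above), and glyph names have the form uniXXXX. The function names glyph_signatures, build_decode_table and decode are hypothetical helpers, not part of fontTools.

```python
import xml.etree.ElementTree as ET


def glyph_signatures(ttx_xml):
    """Map glyph name -> tuple of (x, y, on) points from a fontTools XML dump."""
    root = ET.fromstring(ttx_xml)
    signatures = {}
    for glyph in root.iter("TTGlyph"):
        points = tuple(
            (int(pt.get("x")), int(pt.get("y")), int(pt.get("on")))
            for pt in glyph.iter("pt")
        )
        if points:  # glyphs without an outline (e.g. the blank) are skipped
            signatures[glyph.get("name")] = points
    return signatures


def build_decode_table(labelled, new_ttx_xml):
    """Return {obfuscated character: real character} for a newly downloaded font.

    labelled maps an outline signature to the character it draws; it is
    prepared once by hand from a reference font.
    """
    table = {}
    for name, signature in glyph_signatures(new_ttx_xml).items():
        if name.startswith("uni") and signature in labelled:
            # "uniF4CA" -> the character U+F4CA that appears in the page source
            table[chr(int(name[3:], 16))] = labelled[signature]
    return table


def decode(text, table):
    """Replace every obfuscated character in text; leave other characters alone."""
    return "".join(table.get(ch, ch) for ch in text)
```

To use it, build labelled once from a reference dump, for example {glyph_signatures(reference_xml)["uniF4CA"]: "3"} if that outline is known to draw a 3, then call build_decode_table for each new font. If the site ever perturbed the outlines slightly, exact matching would fail and a tolerance-based comparison would be needed.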

1.2 Qunar mobile web page

Qunar’s solution is similar to Maoyan’s: it also uses a custom font for obfuscation.

The first difference is that Qunar uses a font file in TTF format.

The second is that Qunar’s custom font uses Unicode in a simpler way, as shown below:

The font reuses the Unicode code points of the real digits directly, but the glyphs do not correspond one-to-one to those digits. For example, if the page source contains ‘183’, the number actually displayed is ‘361’.

The mapping also seems to change every time. That does not matter: as long as you can use fontTools to convert the font into an XML file and read the coordinates inside, the same approach still works.

1.3 Qidian Chinese Website

Qidian.com, a novel reading website, also uses custom fonts. A friend on CNode asked about it, so I took a look this time.

After getting the “box characters” from the page source, I printed their character codes, with the following result:

$ node test.js

94.37

d821

d821

d821

d821

d821


The first line, 94.37, is the actual number shown on the page, and each of the following lines is the character code of one box character. The result floored me: they are all the same, yet the same code appears to render different digits.

How is this different from Maoyan and Qunar?

In the page source, the obfuscated digits use a CSS class named zxJBLkdl. Following the source to the definition of class zxJBLkdl leads to this code:

<p>

    <em>

        <style>

            @font-face { 

                font-family: zxJBLkdl; 

                src: url('https://qidian.gtimg.com/qd_anti_spider/zxJBLkdl.eot?') format('eot'); 

                src: url('https://qidian.gtimg.com/qd_anti_spider/zxJBLkdl.woff') format('woff'), url('https://qidian.gtimg.com/qd_anti_spider/zxJBLkdl.ttf') format('truetype'); 

            } 

            .zxJBLkdl { 

                font-family: 'zxJBLkdl' !important;

                display: initial !important;

                color: inherit !important;

                vertical-align: initial !important;

            }

        </style>

        <span class="zxJBLkdl">&#100181;&#100184;&#100186;&#100181;&#100185;</span>

    </em>

    <cite>word</cite><i>|</i><em><style>@font-face { font-family: zxJBLkdl; src: url('https://qidian.gtimg.com/qd_anti_spider/zxJBLkdl.eot?') format('eot'); src: url('https://qidian.gtimg.com/qd_anti_spider/zxJBLkdl.woff') format('woff'), url('https://qidian.gtimg.com/qd_anti_spider/zxJBLkdl.ttf') format('truetype'); } .zxJBLkdl { font-family: 'zxJBLkdl' !important; display: initial !important; color: inherit !important; vertical-align: initial !important; }</style><span class="zxJBLkdl">&#100183;&#100185;&#100186;&#100181;&#100188;</span></em>

<cite> <span>&#183; </span> Member week click

        <style>

        @font-face {

            font-family: zxJBLkdl;

            src: url('https://qidian.gtimg.com/qd_anti_spider/zxJBLkdl.eot?') format('eot');

            src: url('https://qidian.gtimg.com/qd_anti_spider/zxJBLkdl.woff') format('woff'), url('https://qidian.gtimg.com/qd_anti_spider/zxJBLkdl.ttf') format('truetype');

        }



        .zxJBLkdl {

            font-family: 'zxJBLkdl' !important;

            display: initial !important;

            color: inherit !important;

            vertical-align: initial !important;

        }

        </style><span class="zxJBLkdl">&#100181;&#100185;&#100187;&#100184;</span></cite><i>|</i><em><style>@font-face { font-family: zxJBLkdl; src: url('https://qidian.gtimg.com/qd_anti_spider/zxJBLkdl.eot?') format('eot'); src: url('https://qidian.gtimg.com/qd_anti_spider/zxJBLkdl.woff') format('woff'), url('https://qidian.gtimg.com/qd_anti_spider/zxJBLkdl.ttf') format('truetype'); } .zxJBLkdl { font-family: 'zxJBLkdl' !important; display: initial !important; color: inherit !important; vertical-align: initial !important; }</style><span class="zxJBLkdl">&#100179;&#100186;&#100181;&#100188;</span></em>

    <cite><span>&#183;</span> weeks

        <style>

        @font-face {

            font-family: zxJBLkdl;

            src: url('https://qidian.gtimg.com/qd_anti_spider/zxJBLkdl.eot?') format('eot');

            src: url('https://qidian.gtimg.com/qd_anti_spider/zxJBLkdl.woff') format('woff'), url('https://qidian.gtimg.com/qd_anti_spider/zxJBLkdl.ttf') format('truetype');

        }



        .zxJBLkdl {

            font-family: 'zxJBLkdl' !important;

            display: initial !important;

            color: inherit !important;

            vertical-align: initial !important;

        }

        </style><span class="zxJBLkdl">&#100188;&#100187;</span></cite>

</p>


As this code shows, the font file name is also random (for example zxJBLkdl.woff), but that doesn’t matter.

Look at the line &#100181;&#100184;&#100186;&#100181;&#100185;. These are HTML numeric character references: they output several unknown “box” characters, which the font then renders as the actual digits.

They work like the regular escape &lt;, except that here the characters are written as entity numbers so that the browser can resolve them. For details, see here.

But why did I get the same character code every time?

The answer is in the font file. Use fontTools to convert the font file to XML, and you’ll find the following:

<GlyphOrder>

  <!-- The 'id' attribute is only for humans; it is ignored when parsed. -->

  <GlyphID id="0" name=".notdef"/>

  <GlyphID id="1" name="period"/>

  <GlyphID id="2" name="zero"/>

  <GlyphID id="3" name="one"/>

  <GlyphID id="4" name="two"/>

  <GlyphID id="5" name="three"/>

  <GlyphID id="6" name="four"/>

  <GlyphID id="7" name="five"/>

  <GlyphID id="8" name="six"/>

  <GlyphID id="9" name="seven"/>

  <GlyphID id="10" name="eight"/>

  <GlyphID id="11" name="nine"/>

</GlyphOrder>


Here the glyphs are named with the English words for the digits. Further down in the file, you can find the following mapping:

<cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="148" language="0" nGroups="11">

<map code="0x18751" name="eight"/><!-- ?????? -->

<map code="0x18753" name="two"/><!-- ?????? -->

<map code="0x18754" name="five"/><!-- ?????? -->

<map code="0x18755" name="three"/><!-- ?????? -->

<map code="0x18756" name="zero"/><!-- ?????? -->

<map code="0x18757" name="nine"/><!-- ?????? -->

<map code="0x18758" name="six"/><!-- ?????? -->

<map code="0x18759" name="four"/><!-- ?????? -->

<map code="0x1875a" name="period"/><!-- ?????? -->

<map code="0x1875b" name="one"/><!-- ?????? -->

<map code="0x1875c" name="seven"/><!-- ?????? -->

</cmap_format_12>


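This is the mapping between each character’s code point (in hexadecimal) and its glyph name. Convert the decimal number in each escape to hexadecimal and you get exactly these codes: &#100181;, for example, is 0x18755, which maps to three. Because these code points are effectively “custom”, the browser cannot display them without the font and shows only boxes. As for the identical values I printed, assuming test.js used charCodeAt() as before: all of these code points lie above U+FFFF, so JavaScript stores each one as a UTF-16 surrogate pair, and charCodeAt() returns only the high surrogate, which is 0xD821 for the entire range 0x18751 to 0x1875c. That, rather than anything going wrong while copying the characters, is most likely why the codes all came out the same.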
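As a sketch of how this table can be put to work, the snippet below transcribes the cmap entries above into a dictionary and decodes the decimal escapes from the page source. It also computes the first UTF-16 code unit (the high surrogate) of each code point, which is what JavaScript’s charCodeAt() reports for characters above U+FFFF. The names CMAP, decode_entities and high_surrogate are illustrative, and the table is only valid for this particular font file.

```python
import re

# Code point -> character, transcribed from the cmap table above.
CMAP = {
    0x18751: "8", 0x18753: "2", 0x18754: "5", 0x18755: "3",
    0x18756: "0", 0x18757: "9", 0x18758: "6", 0x18759: "4",
    0x1875A: ".", 0x1875B: "1", 0x1875C: "7",
}


def decode_entities(html_fragment):
    """Replace each decimal escape such as &#100181; with its mapped character."""
    codes = re.findall(r"&#(\d+);", html_fragment)
    return "".join(CMAP[int(code)] for code in codes)


def high_surrogate(code_point):
    """First UTF-16 code unit of a code point above U+FFFF."""
    return 0xD800 + ((code_point - 0x10000) >> 10)


print(decode_entities("&#100183;&#100185;&#100186;&#100181;&#100188;"))  # 94.37
print({hex(high_surrogate(cp)) for cp in CMAP})  # {'0xd821'}
```

The first result matches the 94.37 shown on the page, and the single value in the second set matches the d821 printed five times by test.js.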

With that, Qidian’s scheme is also clear: in essence it is the same as Maoyan’s, and only the process and form differ.

1.4 Summary

Websites that use custom fonts all follow the same idea.

The back end provides a font-generation interface that randomly generates a font (WOFF, for example), returns the font file together with the Unicode value assigned to each digit, and fills the page with data encoded accordingly.

Almost any custom-font scheme can be cracked with the approach above: first get a font file, use fontTools to convert it to XML, and manually record the coordinates of each digit. Then write a program that, whenever it gets a new font file, determines which digit each glyph represents from the coordinate data.

Another solution is to take screenshots with a headless browser and run an OCR tool over them for text recognition. However, OCR has a certain error rate, so this solution is not perfect, and I won’t go into it here.

2. Element positioning overlay

This is an interesting scheme: the page contains two sets of data, a fake set first and the real set after it, and CSS positioning places the real data on top of the fake data so that only the real data is visible.

In this code block, the first b element contains three <i> elements, of which the 0 and the 8 are fake; they are covered by the 9 and the 4 in the following two b elements. The number actually displayed is 479.

This scheme is fun, but as an anti-crawler measure it is somewhat easier to beat than the custom fonts above; it feels a bit like a trick for fooling children.

We can use the width of the first b element, together with the width and offset of each following b element, to work out the value actually displayed (this is what the browser does too).

For example, suppose the first b element is 54px wide, the second b element is 18px wide with a left offset of -18px, and the third b element is 18px wide with a left offset of -54px. Then the third b element obviously covers the first digit.

I have not written the code for this scheme, but it should not be difficult; interested readers can try it.
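A minimal sketch of that calculation follows. It assumes every digit is the same width (18px here) and that each overlay’s left offset is measured from the right edge of the first element, as in the example above. The function name visible_digits and the base string "078" are illustrative; the base digits are inferred from the description rather than copied from the page.

```python
def visible_digits(base_digits, overlays, digit_width=18):
    """Apply overlay digits on top of the base digits and return what is shown.

    overlays is a list of (digit, left_offset_px) pairs with negative offsets.
    """
    result = list(base_digits)
    total_width = digit_width * len(result)
    for digit, left in overlays:
        # An offset of -total_width lands on the first digit,
        # an offset of -digit_width lands on the last one.
        index = (total_width + left) // digit_width
        result[index] = digit
    return "".join(result)


print(visible_digits("078", [("9", -18), ("4", -54)]))  # 479
```

In a real crawler the widths and offsets would come from parsing each element’s style, and the positioning model should be checked against the actual page.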

3. Background image stitching

Another scheme uses a background image and, by giving each element a position into that image, effectively crops pieces out of it and stitches them into the real number.

The imweb article mentions that Meituan does this. I could not find a Meituan page that still works this way, though; Meituan has probably been redesigned, and its pages now display numbers directly.

This scheme is similar in idea to the element positioning overlay above, but a little more complicated. First download the background image, then parse the HTML to get each background-position value, and use an image library to crop out the digit. What you get is still a digit in image form, so it has to go through OCR once; with a picture there is really no way around that.
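The cropping step depends on turning a background-position value into a rectangle inside the sprite image. The sketch below shows only that conversion, assuming pixel offsets in the style attribute and a known element size; crop_box is a hypothetical helper, and the resulting tuple is in the (left, top, right, bottom) form that image libraries commonly accept for cropping.

```python
import re


def crop_box(style, width, height):
    """Convert a pixel background-position into a (left, top, right, bottom) box."""
    match = re.search(r"background-position\s*:\s*(-?\d+)px\s+(-?\d+)px", style)
    if match is None:
        raise ValueError("no pixel background-position found")
    x, y = (int(value) for value in match.groups())
    # A background shifted by -36px shows the part of the image starting at x=36.
    left, top = -x, -y
    return (left, top, left + width, top + height)


print(crop_box("background-position: -36px 0px;", 18, 24))  # (36, 0, 54, 24)
```

Each cropped region would still have to be recognized, either by OCR or by comparing it with a small set of labelled digit images.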

Because it is a picture anyway, rather than working out which digit sits at each position, it is easier to take a screenshot with a headless browser and run OCR on it directly. The browser only ever displays a picture, so the text has to be recognized one way or another.

This scheme is more complex to crack and recognition has a certain error rate, so it is actually a good front-end anti-crawler scheme. Its drawback is that pictures are not as sharp as text and blur when the browser zooms, which hurts the user experience.

4. Pseudo-element substitution

Autohome now uses pseudo-elements: it breaks phrases apart and replaces some of the words with pseudo-elements.

This feels harder to deal with than the custom fonts above.

On Autohome, the class names the pseudo-elements hang on are random, and the CSS that defines them is generated dynamically by JS, which makes things more troublesome.

I have no need to crawl Autohome, so I did not try it. I did find an article online about cracking this scheme, which interested readers can read: “Anti-crawler cracking series: how to crack Autohome’s use of CSS styles to replace text”.

Judging from the end result, JS dynamically fetches the words to be replaced and then replaces them dynamically. After Autohome’s latest upgrade, the replacement text is also fetched dynamically, without any identifying marker, so in practice it is quite difficult.
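Once the generated CSS has been captured (for example from a headless browser after the scripts have run), the substitution itself is mechanical. The sketch below illustrates only that last step and does not address the dynamic generation. The class names, the ::before rule layout, and the helpers pseudo_content_map and substitute are all assumptions made for illustration, not Autohome’s actual markup.

```python
import re

# Matches rules such as: .some-class::before { content: "text" }
RULE = re.compile(
    r"""\.([\w-]+)::?before\s*\{[^}]*?content\s*:\s*(["'])(.*?)\2""", re.S
)
# Matches the empty placeholder element that carries the class.
PLACEHOLDER = re.compile(r"""<span class=(["'])([\w-]+)\1>\s*</span>""")


def pseudo_content_map(css):
    """Map class name -> text injected by its ::before rule."""
    return {m.group(1): m.group(3) for m in RULE.finditer(css)}


def substitute(html, css):
    """Replace each placeholder span with the text its pseudo-element shows."""
    mapping = pseudo_content_map(css)
    return PLACEHOLDER.sub(lambda m: mapping.get(m.group(2), m.group(0)), html)


css = '.hs_kw0_a::before { content: "fuel"; } .hs_kw1_a:before{content:"cost"}'
html = 'Low <span class="hs_kw0_a"></span> <span class="hs_kw1_a"></span> overall'
print(substitute(html, css))  # Low fuel cost overall
```

Placeholders whose class has no matching rule are left untouched, which makes missing rules easy to spot in the output.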

Hats off to Autohome’s front-end team…

If you are interested, give it a real try; cracking it would bring a real sense of accomplishment.

5. Adding interference characters and hiding them

Examples of this kind are WeChat official account articles and the whole-network proxy IP website.

In WeChat official account articles, the text underlined on the left is interference text, hidden by setting the CSS opacity to 0.

On the proxy IP website, the text in the narrow boxes on the left is interference text, hidden with the CSS display:none.

To handle this scheme, parse each DOM element, pick out the characters that are actually visible according to their CSS styles, and assemble them. It shouldn’t be difficult, so I did not write a concrete implementation.
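A minimal sketch of that filtering, using only the standard library, might look as follows. It assumes the hiding rules are written as inline style attributes; sites that hide text through class-based CSS would need the stylesheet resolved first. The names VisibleText and visible_text are hypothetical.

```python
import re
from html.parser import HTMLParser

# Inline styles that hide an element: display:none or an opacity of exactly 0.
HIDDEN = re.compile(r"display\s*:\s*none|opacity\s*:\s*0(?:\.0+)?\s*(?:;|$)")
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class VisibleText(HTMLParser):
    """Collect text that is not inside an element hidden by its inline style."""

    def __init__(self):
        super().__init__()
        self.hidden_stack = []  # one flag per currently open element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        style = (dict(attrs).get("style") or "").lower()
        parent_hidden = bool(self.hidden_stack and self.hidden_stack[-1])
        self.hidden_stack.append(parent_hidden or bool(HIDDEN.search(style)))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_stack:
            self.hidden_stack.pop()

    def handle_data(self, data):
        if not (self.hidden_stack and self.hidden_stack[-1]):
            self.parts.append(data)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = '<p>12<span style="display:none">9</span>3.<span style="opacity: 0">x</span>45</p>'
print(visible_text(sample))  # 123.45
```

The stack makes hidden state inherit to nested elements, so text inside a child of a hidden element is dropped as well.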

6. Summary

This weekend I focused mainly on custom fonts. I found them very interesting, because I had run into them before and did not know how to deal with them.

Crawlers and anti-crawlers are locked in a constant arms race; as the saying goes, the priest climbs a foot and the devil climbs ten. Besides back-end anti-crawler strategies, such as limiting access frequency per IP or per logged-in user, front-end engineers have also racked their brains and come up with many tricks.

Again, all these anti-crawler measures do is keep raising the cost of parsing out the correct data; they cannot truly stop crawlers.

For a crawler, any data you can see in a browser is obtainable; it is just not always easy to get.

I hope this article gives you some ideas and helps you do more inventive work on both crawlers and anti-crawlers.