Excerpted from the forthcoming book Python3 Anti-Crawler Principles and Circumventions. The openly available part of the book is Chapter 6, on text-obfuscation anti-crawler techniques. This is Section 3 of Chapter 6, on the SVG mapping anti-crawler.

SVG mapping anti-crawler

SVG is an XML-based format for describing two-dimensional vector graphics. Because the graphics are vectors, their quality is not affected by zooming in or out, which is why SVG is widely used on websites.

The next anti-crawler technique we will study is implemented with SVG. It replaces specific text with vector graphics. This does not affect normal reading for users, but a crawler cannot obtain the content of SVG graphics the way it reads text. Since each SVG graphic stands for one character, the real text must be mapped to and replaced with the corresponding SVG graphic at the back end or front end. This technique is called the SVG mapping anti-crawler.

6.3.1 Bypassing the SVG mapping anti-crawler in practice

Example 6: SVG mapping anti-crawler example.

Website: http://www.porters.vip/confusion/food.html.

Task: Crawl the contact phone number, store address, and rating data from the food merchant review website, as shown in Figure 6-15.

Figure 6-15 Example 6 page

Before writing Python code, we need to locate the elements that hold the target data. While doing so, we notice something unusual: some digits do not exist in the HTML code. For example, Figure 6-16 shows the element location of the taste score.

Figure 6-16 Location of taste score elements in scoring data

Based on what the page displays, the HTML code should contain 8.7, but instead we see:

<span class="item">: <d class="vhkjj4"></d>.7</span>

The HTML code has the digit 7 and the decimal point, but not the digit 8, whose place seems to be taken by the d tag. The merchant's phone number is even stranger: it contains no digits at all. The corresponding HTML code for the merchant phone is as follows:

<div class="col more">
    <d class="vhkbvu"></d>
    <d class="vhk08k"></d>
    <d class="vhk08k"></d>
    <d class="">-</d>
    <d class="vhk84t"></d>
    <d class="vhk6zl"></d>
    <d class="vhkqsc"></d>
    <d class="vhkqsc"></d>
    <d class="vhk6zl"></d>
</div>

The code contains many d tags. Are the d tags used as placeholders that are then covered by other elements? If we compare the number of d tags with the number of digits, we find that they are the same, that is, one pair of d tags represents one digit.

Each pair of d tags has a class attribute; some share the same value and some differ. Let's compare the class attribute values with the digits to see if we can find a pattern, as shown in Figure 6-17.

Figure 6-17 Comparison between class attribute values and numbers

As shown in Figure 6-17, the class attribute values correspond to the digits one to one. For example, vhk08k corresponds to the digit 0. Based on this clue, we can guess that each digit corresponds to one attribute value, as shown in Figure 6-18.

Figure 6-18 Mapping between numbers and attribute values

During rendering, the browser maps the d tags in the HTML to digits according to this relationship and renders the result on the page. That is why the page displays digits while we see only class attribute values in the HTML code. Figure 6-19 shows the mapping logic.

Figure 6-19 Mapping logic

Our crawler code can implement the mapping with the same logic. When parsing the HTML code, we take out the class attribute value of the d tag and then obtain the digit displayed on the page through the mapping. How do we implement the mapping in crawler code? For an "attribute value to number" structure, Python's built-in dictionary works just fine. We can test this with Python code, which looks like this:

mappings = {'vhk08k': 0, 'vhk6zl': 1, 'vhk9or': 2, 'vhkfln': 3, 'vhkbvu': 4,
            'vhk84t': 5, 'vhkvxd': 6, 'vhkqsc': 7, 'vhkjj4': 8, 'vhk0f1': 9}
# Class attribute value of a d tag in the HTML
html_d_class = 'vhkvxd'
# Look up the digit that this attribute value maps to
print(mappings.get(html_d_class))

The code first defines the mapping between attribute values and digits, then assumes a class attribute value taken from a d tag in the HTML, and finally prints the mapping result for that value. After the code is run, the result is as follows:

6

The result shows that the mapping method is feasible. Next, let's try mapping the merchant's contact number:

mappings = {'vhk08k': 0, 'vhk6zl': 1, 'vhk9or': 2, 'vhkfln': 3, 'vhkbvu': 4,
            'vhk84t': 5, 'vhkvxd': 6, 'vhkqsc': 7, 'vhkjj4': 8, 'vhk0f1': 9}
# Class attribute values of the d tags in the merchant phone, in page order;
# the empty string is the d tag that holds the literal "-"
html_d_class = ['vhkbvu', 'vhk08k', 'vhk08k', '', 'vhk84t',
                'vhk6zl', 'vhkqsc', 'vhkqsc', 'vhk6zl']
phone = [mappings.get(i) for i in html_d_class]
print(phone)

The running results are as follows:

[4, 0, 0, None, 5, 1, 7, 7, 1]

The mapping method gives us the merchant's contact number (the None corresponds to the d tag that holds the literal hyphen), which shows that the SVG mapping anti-crawler has been bypassed.
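As a minimal sketch (not the book's code), the same dictionary can be applied directly to the HTML fragment, so that literal text such as the hyphen is kept instead of turning into None. The names `PhoneDecoder` and `decode_phone` are hypothetical, the mapping lists only the classes that occur in this phone number, and the standard-library `html.parser` module is used here purely for illustration.

```python
from html.parser import HTMLParser

# Class-to-digit mapping from Example 6 (only the classes used in the phone number).
MAPPINGS = {'vhk08k': '0', 'vhk6zl': '1', 'vhkbvu': '4', 'vhk84t': '5', 'vhkqsc': '7'}


class PhoneDecoder(HTMLParser):
    """Hypothetical helper: collects mapped digits and literal text from d tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'd':
            css_class = dict(attrs).get('class') or ''
            # An unmapped (for example empty) class contributes nothing here;
            # its literal text is picked up by handle_data instead.
            self.parts.append(MAPPINGS.get(css_class, ''))

    def handle_data(self, data):
        # Keep visible text such as "-" and ignore whitespace between tags.
        self.parts.append(data.strip())


def decode_phone(html):
    decoder = PhoneDecoder()
    decoder.feed(html)
    decoder.close()
    return ''.join(decoder.parts)


html = ('<div class="col more"> <d class="vhkbvu"></d> <d class="vhk08k"></d> '
        '<d class="vhk08k"></d> <d class="">-</d> <d class="vhk84t"></d> '
        '<d class="vhk6zl"></d> <d class="vhkqsc"></d> <d class="vhkqsc"></d> '
        '<d class="vhk6zl"></d> </div>')
print(decode_phone(html))  # 400-51771
```

Because plain text is passed through unchanged, the same approach would also cope with pages that mix d tags and ordinary digits.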

6.3.2 Anti-crawler case of Dianping

This mapping technique is used not only in this book's examples but also on large websites. Dianping is a leading local lifestyle information and trading platform in China and the earliest independent third-party consumer review website in the world. Dianping provides users with merchant information, consumer reviews, and discounts, as well as O2O (Online To Offline) transaction services such as group buying, restaurant reservations, takeout, and e-membership cards. Dianping also uses this mapping-based anti-crawler technique. Open a browser and visit https://www.dianping.com/shop/14741057; the page is shown in Figure 6-20.

Figure 6-20 Merchant information page of Dianping

The merchant information page of Dianping mainly displays consumers' scores, the phone number, the store address, and recommended dishes. We can take a look at the HTML code for a merchant's phone number or score, as shown in Figure 6-21.

Figure 6-21 HTML code of the merchant phone number

Not every digit of a merchant phone number on Dianping is replaced with a d tag; some appear as plain digits. However, careful observation shows that the number of digits in the merchant phone number equals the number of d tags plus the number of plain digits, which suggests that the class attribute values of the d tags may again map one to one to digits. Interested readers can use the method in Example 6 to try mapping the digits in the Dianping case.

If this technique were really so easy to bypass, it would have been abandoned long ago, so why do even large websites like Dianping use it? Let's continue and look at the HTML code for business hours on Dianping, as shown in Figure 6-22.

Figure 6-22 Business hours of Dianping merchants

In addition to digits, Dianping also maps Chinese characters. At this point, manually mapping each class value to its character, as in Example 6, would be very troublesome. Imagine that all the text on a web page used this mapping anti-crawler technique. What would a crawler engineer do? Map every character by hand?

That can't be done. There are 10 digits, 26 English letters, and thousands of commonly used Chinese characters to map. And once the target site changes the correspondence, the crawler engineer has to remap all the text. Faced with such a problem, we must find the rule behind the text mapping and implement the mapping algorithm in Python. Then the algorithm gives the correct result no matter how the target site changes the correspondence.

How is this mapping implemented in a web page? Is it an array defined in JavaScript on the page? Or JSON data fetched from an API with an asynchronous request? Both are possible, and we are going to find out.

6.3.3 Mechanism of SVG anti-crawler

The mapping can't happen out of thin air; it must rely on some technical feature. In a web page, the only things that work with a tag's class attribute are JavaScript and CSS. With this clue, we continue with Example 6. The HTML code for the merchant phone in this case is:

<div class="col more">
    <d class="vhkbvu"></d>
    <d class="vhk08k"></d>
    <d class="vhk08k"></d>
    <d class="">-</d>
    <d class="vhk84t"></d>
    <d class="vhk6zl"></d>
    <d class="vhkqsc"></d>
    <d class="vhkqsc"></d>
    <d class="vhk6zl"></d>
</div>

We can pick any pair of d tags and see whether its CSS style offers clues for further analysis. If not, we can look at the JavaScript. The CSS style of the d tag is as follows:

d[class^="vhk"] {
    width: 14px;
    height: 30px;
    margin-top: -9px;
    background-image: url(../font/food.svg);
    background-repeat: no-repeat;
    display: inline-block;
    vertical-align: middle;
    margin-left: -6px;
}
.vhkqsc {
    background: -288.0px -141.0px;
}

The d tag's own style looks unremarkable, except that it sets coordinate values in the background property. The background image, however, is set in the shared d tag style above it. We can copy the address of the background image and open it in a new browser tab. The background image of the d tag is shown in Figure 6-23.

Figure 6-23 Background image of the d tag

The background image of the d tag is full of seemingly unordered digits, arranged in 4 lines. But it doesn't behave like an ordinary image, so let's look at the source code of the image page, as shown in Figure 6-24.

Figure 6-24 Image page source code

The first two lines of the source code indicate that this is an SVG file. It uses the style tag to set the text style and the text tag to define the text, which is the digits displayed on the image page. Could these unordered digits be the phone numbers and rating numbers we see on the page?

Apart from the d tag whose class attribute value is vhkbvu, the other d tags use this CSS style as well, but each pair of d tags has different coordinates. Their coordinates are positioned as follows:

.vhkbvu { 
     background: -386px -97px; 
} 
.vhk08k { 
     background: -274px -141px; 
} 
.vhk84t {
     background: -176px -141px; 
}

Coordinates are the key to locating numbers, and to know how coordinates are computed, you must know something about SVG.

At the beginning of this section, we looked briefly at the concept of SVG and learned that SVG is XML-based. In fact, it describes image content in a text-based descriptive language, which is why SVG is a vector graphics format independent of image resolution. Open a text editor and write the following to a new file:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg xmlns="http://www.w3.org/2000/svg" version="1.1"
xmlns:xlink="http://www.w3.org/1999/xlink" width="250px" height="250.0px">
<text x='10' y='30'>hello,world</text>
</svg>

Save the file as test.svg and open it in a browser, as shown in Figure 6-25.

Figure 6-25 Content displayed by test.svg

The first three lines of code declare the file type, lines 4 and 5 define the SVG content block and the canvas width and height, and line 6 defines a piece of text with the text tag and specifies its coordinates. This text is what we see in the browser, and the x and y coordinates in the code determine the position of the text on the canvas, according to the following rules.

• The origin is the top-left corner of the page, that is, the coordinate (0, 0).
• Coordinates are in pixels.
• The x axis is positive from left to right, and the y axis is positive from top to bottom.
• n characters can have n positional arguments.

If there are more characters than positional arguments, the characters without positional arguments are laid out in text order, taking the last positional argument as their starting point.

This may not be easy to grasp at first, but we can modify the code to understand how the axes are defined. The x in the text tag represents the position of a character on the x axis of the page. The x value in test.svg is 10. Now let's set it to 0, save the file, and refresh the page, as shown in Figure 6-26.

Figure 6-26 test.svg displayed when x is 0

When x is 0, the text sits flush against the left edge of the browser window. When x is 10, the text is some distance from the left edge, which shows that the value of x determines the horizontal position of the text. Now we change the value of x in the code to "10, 50, 30, 40, 20, 60" (note that the second number, 20, has deliberately been swapped with the fifth number, 50) to set the coordinates of the first six characters.

In this case, the position parameter of the first character is 10, that of the second character is 50, that of the third character is 30, and so on. Because the position parameters of the second and fifth characters are swapped, the letters e and o exchange places, and the characters displayed on the page are as follows:

holle,world

Figure 6-27 shows the result in the browser.

Figure 6-27 Setting an SVG with multiple X values
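The positional-argument rules can be checked with a short simulation. This is only a sketch under stated assumptions: the function name `visual_order` is hypothetical, `xs` must contain at least one value, and characters without their own position are assumed to advance by a fixed 10 px each, whereas a real browser uses the actual glyph widths.

```python
def visual_order(text, xs, advance=10):
    """Return the characters of `text` in left-to-right display order.

    `xs` holds the explicit x positions; `advance` is an assumed fixed
    character width used for characters without their own position.
    """
    positions = []
    for index, _ in enumerate(text):
        if index < len(xs):
            positions.append(xs[index])
        else:
            # Continue from the last explicit position, in text order.
            positions.append(xs[-1] + advance * (index - len(xs) + 1))
    # Sort characters by x position; sorted() is stable, so ties keep text order.
    ordered = sorted(zip(positions, text), key=lambda pair: pair[0])
    return ''.join(char for _, char in ordered)


print(visual_order('hello,world', [10, 50, 30, 40, 20, 60]))  # holle,world
```

With a single x value the function returns the text unchanged, which matches the original test.svg.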

The text order in Figure 6-27 is what we expected, which shows that each character in SVG can have its own x coordinate. In the same way, each character can have its own y coordinate. Although the text has 11 characters and we set only six positional arguments, the characters without positional arguments are still laid out in text order. Now that we understand the basics of SVG, let's go back to the coordinate parameters in the SVG file used in the example. The characters in Figure 6-23 correspond one to one with the characters in the image page source code in Figure 6-24. Each character has its own x-axis position parameter, while each line has only one y-axis value.

Once we understand positional parameters, we need to figure out character positioning. The browser determines the corresponding number in SVG based on the coordinates and the width and height of the element set in the CSS style. The positive direction of the X axis is from left to right, and the positive direction of the Y axis is from top to bottom, as shown in Figure 6-28.

Figure 6-28 Relationship between the X-axis and Y-axis of SVG and position parameters

In the CSS style, the x and y axes run in the opposite direction to those of SVG, that is, offsets along both the x axis and the y axis are negative, as shown in Figure 6-29.

Figure 6-29 Relationship between the x and y axes of the CSS and position parameters

So when we position characters of an SVG from CSS, we need to use negative numbers. To understand the relationship, suppose we want to locate the center of the first character in line 1 of Figure 6-30 from CSS.

Figure 6-30 SVG

Assuming a character size of 14 px, the SVG coordinates are calculated as follows.

• Center on the x axis: divide the character size by 2 and add the starting position of the character on the x axis, that is, 14 ÷ 2 + 0 = 7.
• Center on the y axis: subtract the y-axis starting point and the character size from the y-axis height, divide the result by 2, add the y-axis starting point, and finally add half of the character size, that is, (38 − 0 − 14) ÷ 2 + 0 + 7 = 19.

Finally, the coordinates in SVG are:

x='7' y='19'

The X-axis and Y-axis of the CSS style are the opposite of SVG, so the position of the character in the CSS style is:

-7px -19px

This enables you to locate the center point of the specified character. However, if you want to display this character in an HTML page, you also need to set the width and height style for the corresponding tag in the HTML, for example:

width: 14px; 
height: 30px;

Once we understand the relationship between SVG and CSS styles, we can map the characters in SVG according to the CSS styles.
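The center calculation and the sign flip between SVG and CSS can be written out as a small sketch. The function and parameter names (`svg_center`, `css_position`, `row_height`, and so on) are hypothetical and simply follow the worked numbers above: a 14 px character, a starting point of 0 on both axes, and a y-axis height of 38.

```python
def svg_center(font_size, x_start, y_start, row_height):
    """Center of a character according to the two rules above (values in px)."""
    x = font_size / 2 + x_start
    y = (row_height - y_start - font_size) / 2 + y_start + font_size / 2
    return x, y


def css_position(x, y):
    """CSS offsets are the SVG coordinates with the sign flipped."""
    return f'-{x:g}px -{y:g}px'


x, y = svg_center(font_size=14, x_start=0, y_start=0, row_height=38)
print(x, y)                # 7.0 19.0
print(css_position(x, y))  # -7px -19px
```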

In a real-world scenario, we need to automate the mapping between CSS styles and SVG rather than do it by hand. Using the SVG and CSS styles in Example 6, if we want Python code to perform the mapping automatically, we first need the URLs of these two files:

url_css = 'http://www.porters.vip/confusion/css/food.css'
url_svg = 'http://www.porters.vip/confusion/font/food.svg'

We also need the class attribute value of the HTML tag to be mapped, for example:

css_class_name = 'vhkbvu'

Then use the Requests library to request the URL and get the text content. The corresponding code is as follows:

import requests 
css_resp = requests.get(url_css).text 
svg_resp = requests.get(url_svg).text

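Extract the coordinate values for the class from the CSS style file, using the re module to match them. The corresponding code is as follows:

import re

# Pattern for ".vhkbvu{background:-386px-97px;}" once whitespace is removed;
# the optional "(?:\.\d+)?" also accepts values such as "-288.0px".
pile = r'\.%s\{background:-(\d+)(?:\.\d+)?px-(\d+)(?:\.\d+)?px;\}' % css_class_name
pattern = re.compile(pile)
# Remove line breaks and spaces so the rule is on a single line
css = css_resp.replace('\n', '').replace(' ', '')
coord = pattern.findall(css)
if coord:
    x, y = coord[0]
    x, y = int(x), int(y)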
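When many classes need to be mapped, it may be convenient to extract every coordinate pair in one pass instead of compiling one pattern per class. The following is a sketch of that idea, not the book's code: `parse_offsets` and `RULE` are hypothetical names, and the sample CSS is the three positioning rules quoted earlier in this section rather than the full food.css file.

```python
import re

# Sample rules copied from the positioning styles shown earlier.
css_text = '''
.vhkbvu {
     background: -386px -97px;
}
.vhk08k {
     background: -274px -141px;
}
.vhk84t {
     background: -176px -141px;
}
'''

# Matches ".name{background:-Xpx-Ypx;}" after whitespace is removed.
# The optional "(?:\.\d+)?" tolerates values written as "-288.0px".
RULE = re.compile(r'\.(\w+)\{background:-(\d+)(?:\.\d+)?px-(\d+)(?:\.\d+)?px;\}')


def parse_offsets(css):
    """Hypothetical helper: map each class name to its positive (x, y) offsets."""
    compact = re.sub(r'\s+', '', css)
    return {name: (int(x), int(y)) for name, x, y in RULE.findall(compact)}


offsets = parse_offsets(css_text)
print(offsets['vhkbvu'])  # (386, 97)
```

Rules that do not have this exact shape, such as the shared d tag style, are simply not matched.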

Because the pattern captures only the digits after the minus sign, the resulting coordinate values are positive and can be used directly to locate characters in SVG. To do this, we need to get the Element objects of all text tags in the SVG:

from parsel import Selector 
svg_data = Selector(svg_resp) 
texts = svg_data.xpath('//text')
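If parsel is not available, the text tags can also be read with the standard library. This is a swapped-in alternative rather than the approach used in this section: it relies on `xml.etree.ElementTree`, the helper name `text_rows` is hypothetical, and the sample input is the test.svg content from earlier with the XML prolog left out.

```python
import xml.etree.ElementTree as ET

# ElementTree prefixes tag names with the namespace declared on the svg element.
SVG_NS = '{http://www.w3.org/2000/svg}'

# The test.svg content used in this section, without the XML prolog.
svg_source = (
    '<svg xmlns="http://www.w3.org/2000/svg" version="1.1" '
    'width="250px" height="250.0px">'
    "<text x='10' y='30'>hello,world</text>"
    '</svg>'
)


def text_rows(svg):
    """Hypothetical helper: return (y, text) for every text tag in the SVG."""
    root = ET.fromstring(svg)
    return [(int(node.get('y')), node.text or '')
            for node in root.iter(SVG_NS + 'text')]


print(text_rows(svg_source))  # [(30, 'hello,world')]
```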

Then loop over the Element objects from the previous step and take the y value of the first text tag whose y is not less than the y value from the CSS:

axis_y = [i.attrib.get('y') for i in texts if y <= int(i.attrib.get('y'))][0]

After you get the y value, you can begin locating the character. Note that the y value of the text tag in SVG does not need to match the y value from the CSS style exactly, because the style can be adjusted freely. For example, -90 and -92 in the CSS style make no difference to SVG positioning. So we only need to know which text tag it is.

So how do you figure out which text it is?

If the y value in the CSS style is -97, then the y value of the matching text tag in SVG cannot be less than 97. We just need the y value of the text tag that is closest to 97 without being smaller than it. For example, suppose the y values of all text tags in SVG are:

[38, 83, 120, 164]

The closest value that is not less than 97 is 120. Converting this logic into code:

axis_y = [i.attrib.get('y') for i in texts if y <= int(i.attrib.get('y'))][0]
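The list comprehension above takes the first match, so it relies on the text tags appearing in ascending y order. A sketch that does not depend on the order, using the y values from the example, could look like this (the name `pick_row_y` is hypothetical):

```python
def pick_row_y(row_ys, css_y):
    """Return the smallest text y value that is not less than css_y."""
    candidates = [value for value in row_ys if value >= css_y]
    if not candidates:
        raise ValueError('no text row at or below y=%s' % css_y)
    return min(candidates)


print(pick_row_y([38, 83, 120, 164], 97))  # 120
```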

Once you get the y value, you can determine which text tag it is. The corresponding code is as follows:

svg_text = svg_data.xpath('//text[@y="%s"]/text()' % axis_y).extract_first()

Next we need to confirm the text size in the SVG, which means finding the value of the font-size property. The corresponding code is as follows:

font_size = re.search(r'font-size:(\d+)px', svg_resp).group(1)

Once we have the font size, we can locate the specific character. How many characters are there along the x axis? svg_text holds the characters of the selected text tag:

'671260781104096663000892328440489239185923'

Do we need to calculate the string length? No. We know that each character is 14 px wide, so dividing the x value from the CSS style by the character size gives the position of the character in the string. The result may or may not be an integer. If it is an integer, the location is exact and can be used directly as an index. If it is not, the location is not exact; since half a character cannot be displayed, we use floor division (which returns the integer part of the quotient) to get an integer:

position = x // int(font_size)  # 386 // 14 = 27

That is, the CSS class vhkbvu maps to the character at position 27 of the line 4 text in SVG. Figure 6-31 shows the mapping result.

Figure 6-31 Mapping result

Then use the index to get the character. The corresponding code is as follows:

number = svg_text[position]
print(number)

Running the code prints 4. We can try other class attribute values as well and get the same characters as displayed on the page, which indicates that the mapping algorithm is correct. So far, we have completed the bypass of the mapping anti-crawler.
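The last step can be checked in isolation with the values from this section: an x offset of 386, a character size of 14, and the text row quoted above. The helper name `char_at` is hypothetical, and the bounds check is an addition for illustration rather than part of the original code.

```python
def char_at(svg_text, x, font_size):
    """Hypothetical helper: character selected by a positive CSS x offset."""
    position = x // font_size  # floor division drops the partial character
    if position >= len(svg_text):
        raise IndexError('x offset %s lies beyond the text row' % x)
    return svg_text[position]


svg_text = '671260781104096663000892328440489239185923'
print(char_at(svg_text, 386, 14))  # 4
```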

6.3.4 Summary

As in Sections 6.1 and 6.2, the anti-crawler technique in this section prevents you from getting the content you "see" on the page, even with rendering tools. The SVG mapping anti-crawler takes advantage of the difference between how browsers render a page and how programs read it, together with front-end knowledge of SVG and CSS positioning. If a crawler engineer is not familiar with rendering principles and this front-end knowledge, the technique will cause a great deal of trouble.

Author: Huawei Cloud enjoy expert attack coder