Hello everyone, I’m Charlie. Websites use many anti-crawling measures, such as JS anti-crawling, IP anti-crawling, CSS anti-crawling, font anti-crawling, CAPTCHA anti-crawling, and slider or click verification. Today we use a recruitment website to learn about font anti-crawling.

Font anti-crawling

Font anti-crawling is a common anti-crawling technique in which a web page and a front-end font file work together. 58.com and Autohome were among the earliest sites to use it, and many mainstream websites and apps now use it as an additional anti-crawling measure.

How font anti-crawling works: a custom font replaces some of the data in the page, so unless we decode it correctly we cannot obtain the real content.
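One way to see this effect in fetched text is to look for characters that have no standard meaning. The following is a minimal sketch, assuming the site maps its glyphs to Unicode private-use code points, as codes such as f0e2 later in this article suggest. The function name and the sample string are made up for illustration.

```python
import unicodedata

def find_private_use(text):
    """Return the hex codes of all private-use characters found in text."""
    # Unicode category 'Co' marks private-use code points, which only
    # render meaningfully when the page's custom font is loaded.
    return [hex(ord(ch)) for ch in text if unicodedata.category(ch) == 'Co']

# Made-up sample: two ordinary characters around one private-use code point.
sample = 'A\uf0e2B'
print(find_private_use(sample))
```

If a page returns many such characters where numbers or common words should be, a custom font is a likely cause.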

A custom font is declared with @font-face in the page's styles, as shown below:

The syntax format is:

@font-face {
    font-family: "name";
    src: url('font file link');
    src: url('font file link') format('file type');
}

Font files are usually TTF, EOT, or WOFF. WOFF is the most widely used, so WOFF files are what we usually encounter.
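When the @font-face rule is present in the page's CSS, the font link can also be pulled out programmatically. This is a minimal sketch, assuming the rule has the shape shown above. The stylesheet text, the font name and the path are all made up.

```python
import re

# Hypothetical stylesheet text; a real page would supply its own CSS.
SAMPLE_CSS = """
@font-face {
    font-family: "myFont";
    src: url('/fonts/custom.woff') format('woff');
}
"""

def find_font_urls(css):
    """Return every url(...) target found inside @font-face blocks."""
    urls = []
    # Match each @font-face { ... } block, up to its first closing brace.
    for block in re.findall(r'@font-face\s*\{(.*?)\}', css, flags=re.S):
        # Capture the text between url( and ), without surrounding quotes.
        urls.extend(re.findall(r'url\(\s*[\'"]?(.*?)[\'"]?\s*\)', block))
    return urls

print(find_font_urls(SAMPLE_CSS))
```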

So what does a WOFF file contain, and what encoding ties the data to its codes?

Taking the font file of a recruitment website as an example, open the font file in the Baidu font editor, as shown in the picture below:

Open any glyph, as shown below:

The glyph for 6 is drawn on a coordinate plane, and the glyph's code is derived from the points on that plane. How the code is derived is not covered here.

How do we get around font anti-crawling?

First of all, the mapping can be viewed as a dictionary. Two methods are commonly used:

The first: manually extract a set of code-to-character correspondences and express it as a dictionary, as shown below:

replace_dict = {
    '0xf7ce': '1',
    '0xf324': '2',
    '0xf23e': '3',
    # ... more entries ...
    '0xfe43': 'n',
}
# data is the text fetched from the page.
for key in replace_dict:
    data = data.replace(key, replace_dict[key])

We define a dictionary that maps each font code to its character, then replace the codes in the data one by one in a for loop.

Note: this approach is mainly suitable when only a few characters are mapped.

The second: download the website's font file, convert it to an XML file, find the codes of the font mapping inside, decode them with a decode function, combine the decoded codes into a dictionary, and then replace the data one by one according to that dictionary. Because the code is relatively long, it is not shown here; it appears later in the walkthrough.

That is all for the basics of font anti-crawling. Next we crawl a recruitment website for real.

Hands-on practice

Finding the custom font file

First, open the recruitment website and open the developer tools, as shown in the picture below:

Here we can see that some characters in the page source are not displayed normally and are replaced by codes instead. This suggests that a custom font file is being used, so we need to find that file. Where do we look? Open the developer tools and click the Network tab, as shown in the picture below:

Normally, font files are listed under the Font tab. We find 5 entries here, so which one is the custom font file? The custom font file is requested each time we click through to the next page, so we only need to click the next page on the website, as shown in the figure below:

One more entry appears, whose name begins with file, so we can tentatively conclude that it is the custom font file. Now download it: simply copy the entry's URL and open it in the browser. After downloading, open the file in the Baidu font editor, as shown in the figure below:

At this point the file would not open. Did we find the wrong font file? The editor reported that the file type was not supported, so we change the suffix of the downloaded file to .woff, as shown in the picture below:

This time it opens successfully.
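Instead of guessing the suffix, the first four bytes of the downloaded file can be checked against the signatures of common font containers. This is a minimal sketch with hypothetical function names, and it only covers the signatures listed. EOT files are not detected, because they do not start with a fixed signature.

```python
# Leading four bytes that identify common font containers.
SIGNATURES = {
    b'wOFF': 'woff',
    b'wOF2': 'woff2',
    b'OTTO': 'otf',
    b'\x00\x01\x00\x00': 'ttf',
}

def guess_font_type(data):
    """Guess a font's type from its first four bytes, or None if unknown."""
    return SIGNATURES.get(data[:4])

def guess_font_type_from_file(path):
    """Read only the first four bytes of the file at path and classify them."""
    with open(path, 'rb') as f:
        return guess_font_type(f.read(4))

print(guess_font_type(b'wOFF\x00\x01\x00\x00'))
```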

Font mapping

Now that we have found the custom font file, how do we use it? We define get_fontfile() to handle the custom font file and then expose the mapping in the font file as a dictionary, in two steps.

  1. Font file download and conversion;
  2. Font mapping decoding.

Font file download and conversion

Custom font files are updated very frequently, so we fetch the page's custom font file in real time; using a stale font file would produce inaccurate data. First look at the URLs of the custom font file:

www.xxxxxx.com/interns/ico…
www.xxxxxx.com/interns/ico…
www.xxxxxx.com/interns/ico…

We can see that only the rand parameter of the URL changes, and that it is a random 16-digit floating-point number less than 1. So we only need to construct the rand parameter. The main code is as follows:

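import random

import requests
from fontTools.ttLib import TTFont


def get_fontfile():
    # headers is the request-header dictionary defined elsewhere in the program.
    rand = round(random.uniform(0, 1), 17)
    url = f'https://www.xxxxxx.com/interns/iconfonts/file?rand={rand}'
    response = requests.get(url, headers=headers).content
    # Save the raw font bytes, then dump the font as XML for inspection.
    with open('file.woff', 'wb') as f:
        f.write(response)
    font = TTFont('file.woff')
    font.saveXML('file.xml')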

First, random.uniform() controls the range of the random number and round() controls the number of digits, which gives the rand value. Then the .content attribute returns the response body as bytes, which we write to file.woff. Finally, TTFont() loads the font file and saveXML() saves its contents as an XML file. The content of the XML file is as follows:
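Returning to the rand parameter, the URL construction can be tried on its own without any network access. This is a minimal sketch in which build_font_url() and FONT_URL are hypothetical names, and the host is the masked placeholder used throughout this article.

```python
import random

# Masked placeholder host from the article, not a real address.
FONT_URL = 'https://www.xxxxxx.com/interns/iconfonts/file?rand={rand}'

def build_font_url(rng=random):
    """Return a font-file URL with a fresh random rand parameter."""
    # uniform(0, 1) keeps the value between 0 and 1; round() caps its digits.
    rand = round(rng.uniform(0, 1), 17)
    return FONT_URL.format(rand=rand)

print(build_font_url())
```

Passing a seeded random.Random instance as rng makes the generated URL reproducible, which can help when debugging.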

Font decoding and presentation

The file.xml file is 4589 lines long. Which part holds the font mapping?

First, let’s go back to the Baidu font editor, as shown below:

The code for the Chinese character for "person" is f0e2, so we search for that code in file.xml, as shown in the figure below:

There are 4 results in total, but on closer inspection they are all identical. We can therefore extract the mapping according to the pattern of these entries, decode the values to obtain the corresponding characters, and finally present the result as a dictionary. This is the job of get_dict(), which the main program calls later.

The code first reads file.xml, finds the code and name values in each mapping entry and uses them as keys and values respectively, then decodes each value into the character we want in a for loop, and finally pairs keys and values with zip() and converts the pairs into a dictionary with dict(). The result is shown in the figure:
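The steps described above can be sketched as follows. This is not the author's get_dict() but a minimal illustration under stated assumptions: the XML excerpt is made up, glyph names are assumed to have the form uniXXXX, where XXXX is the hexadecimal code point of the real character, and chr(int(..., 16)) is used in place of a decode function. The name build_font_dict() is hypothetical.

```python
import re

# Hypothetical excerpt of file.xml; a real file has many more <map> entries.
SAMPLE_XML = '''
<cmap_format_4 platformID="0" platEncID="3" language="0">
  <map code="0xf0e2" name="uni4EBA"/>
  <map code="0xf7ce" name="uni31"/>
</cmap_format_4>
'''

def build_font_dict(xml_text):
    """Map each font code (e.g. '0xf0e2') to the character its glyph names."""
    pairs = re.findall(
        r'<map code="(0x[0-9a-fA-F]+)" name="uni([0-9a-fA-F]+)"/>', xml_text)
    # The glyph name carries the real code point in hex, so chr() recovers it.
    return {code: chr(int(name, 16)) for code, name in pairs}

print(build_font_dict(SAMPLE_XML))
```

Because identical entries simply overwrite each other in the dictionary, the repeated results found in the XML do no harm.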

Fetching the recruitment data

In the previous step we converted the font mapping into dictionary data. Next we make a network request to get the data. The main code is as follows:

import parsel
import requests


def get_data(word_dict, url):
    # Turn '&#xf0e2' into '0xf0e2' so the codes match the dictionary keys.
    response = requests.get(url, headers=headers).text.replace('&#', '0')
    for key in word_dict:
        response = response.replace(key, word_dict[key])
    selector = parsel.Selector(response)
    datas = selector.xpath('//*[@id="__layout"]/div/div[2]/div[2]/div[1]/div[1]/div[1]/div')
    for i in datas:
        data = {
            'workname': i.xpath('./div[1]/div[1]/p[1]/a/text()').extract_first(),
            'link': i.xpath('./div[1]/div[1]/p[1]/a/@href').extract_first(),
            'salary': i.xpath('./div[1]/div[1]/p[1]/span/text()').extract_first(),
            'place': i.xpath('./div[1]/div[1]/p[2]/span[1]/text()').extract_first(),
            'work_time': (i.xpath('./div[1]/div[1]/p[2]/span[3]/text()').extract_first()
                          + i.xpath('./div[1]/div[1]/p[2]/span[5]/text()').extract_first()),
            'company_name': i.xpath('./div[1]/div[2]/p[1]/a/text()').extract_first(),
            'Field_scale': (i.xpath('./div[1]/div[2]/p[2]/span[1]/text()').extract_first()
                            + i.xpath('./div[1]/div[2]/p[2]/span[3]/text()').extract_first()),
            'advantage': ','.join(i.xpath('./div[2]/div[1]/span/text()').extract()),
            'welfare': ','.join(i.xpath('./div[2]/div[2]/span/text()').extract()),
        }
        # Pass the field values, in insertion order, to the database helper.
        saving_data(list(data.values()))

We define get_data() to receive the dictionary of font mappings. The response text first has '&#' replaced with '0' so that the codes in the page match dictionary keys such as 0xf0e2. A for loop then replaces each code with its character, xpath() extracts the fields we want, and finally the data is passed to our custom saving_data() function.
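The replacement step can also be done in a single pass with a regular expression instead of one replace() call per key. This is an alternative sketch, not the approach used above. It assumes each code is a reference of the form &#x followed by exactly four hex digits, with an optional semicolon. The mapping, the page fragment and the function name are made up.

```python
import re

# Hypothetical mapping and page fragment, for illustration only.
font_dict = {'0xf7ce': '1', '0xf324': '2', '0xf23e': '3'}
html = '<span>&#xf7ce&#xf324&#xf23e/day</span>'

def decode_entities(text, mapping):
    """Replace each &#x.... reference with its mapped character in one pass."""
    def substitute(match):
        key = '0' + match.group(1).lower()  # '&#xf7ce' -> '0xf7ce'
        # References without a mapping are left untouched.
        return mapping.get(key, match.group(0))
    return re.sub(r'&#(x[0-9a-fA-F]{4});?', substitute, text)

print(decode_entities(html, font_dict))
```

A single pass avoids rescanning the whole page once per dictionary entry, which matters more as the mapping grows.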

Save the data

Now that the data has been retrieved, the main code to save the data is as follows:

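import pymysql


def saving_data(data):
    # host, user, passwd and port are connection settings defined elsewhere.
    db = pymysql.connect(host=host, user=user, password=passwd, port=port, db='recruit')
    cursor = db.cursor()
    sql = ('insert into recruit_data(work_name, link, salary, place, work_time, '
           'company_name, Field_scale, advantage, welfare) '
           'values(%s, %s, %s, %s, %s, %s, %s, %s, %s)')
    try:
        cursor.execute(sql, data)
        db.commit()
    except Exception:
        # Undo the failed insert instead of leaving a partial transaction.
        db.rollback()
    finally:
        db.close()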
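To try the saving logic without a MySQL server, the same idea can be sketched with sqlite3 from the standard library. This swaps in a different database and is not the code used in this article. The column names follow the insert statement above, while create_table(), save_row() and the sample row are hypothetical.

```python
import sqlite3

COLUMNS = ('work_name', 'link', 'salary', 'place', 'work_time',
           'company_name', 'Field_scale', 'advantage', 'welfare')

def create_table(conn):
    """Create the recruit_data table with one TEXT column per field."""
    column_sql = ', '.join(name + ' TEXT' for name in COLUMNS)
    conn.execute(f'CREATE TABLE IF NOT EXISTS recruit_data ({column_sql})')

def save_row(conn, row):
    """Insert one row; the connection context manager commits or rolls back."""
    placeholders = ', '.join('?' for _ in COLUMNS)
    names = ', '.join(COLUMNS)
    with conn:
        conn.execute(
            f'INSERT INTO recruit_data ({names}) VALUES ({placeholders})', row)

conn = sqlite3.connect(':memory:')
create_table(conn)
# Made-up row with one value per column.
save_row(conn, ['intern', '/job/1', '123/day', 'Beijing', '5 days/week',
                'ACME', 'IT', 'a,b', 'c'])
print(conn.execute('SELECT COUNT(*) FROM recruit_data').fetchone()[0])
```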

Start the program

The program is almost complete. Next we write the code that runs it. The main code is as follows:

if __name__ == '__main__':
    # create_db() prepares the database; its code is not shown in this article.
    create_db()
    get_fontfile()
    for i in range(1, 3):
        url = f'https://www.xxxxxx.com/interns?page={i}&type=intern&salary=-0&city=%E5%85%A8%E5%9B%BD'
        get_data(get_dict(), url)

Results

That is it for learning font anti-crawling and crawling a recruitment website! If this article helped you, please like and follow; your support is my biggest motivation.