Reading time: about 5 minutes.

The JS is one big pig's trotter

Yesterday, a member of my reader group posted a question: how do you extract the TD content from a certain web page? The chat screenshot showed the page source as seen in the browser.

I had just gotten off work. When I saw the message, I thought, "This is easy. Write an XPath, use BeautifulSoup, or write a regex, and you can get it in no time." However, it was not as easy as I thought.

Then the key piece of information came out: when the request is actually made from Python, the server returns JS code instead.

What I clearly saw in the browser was a beautifully structured, hierarchical little sister of an HTML page. How did requesting it from code turn it into an obscure "big pig's trotter" of JS code? I want my little sister back!


For a while the group members were at a loss: no matter what they tried, they could not extract the content they wanted. Some people in the group then suggested trying bs4 or regular-expression matching.

As an old hand at crawling, I may not have scraped thousands of websites, but I have scraped close to a hundred, large and small, and I have fallen into most of the common pits. When I receive a new crawler task, my preferred first step is to analyze how the page requests its data; very often there is a much simpler way to get the data than scraping the rendered page.

So when the group members came to me, I first took a look at what the site actually looked like. The information they needed was the name of each region.

Website Address:

https://xyq.cbg.163.com/

When I first saw this website, my impression was that its structure was not complicated and the information would not be hard to extract. But from the earlier messages in the group, I already knew that this page is rendered by JS code.

Web pages rendered with JS

JS-rendered pages are a common sight in crawling. Such sites have one characteristic: if you do not send the request with browser-like environment information, the server will not return the real data. A plain request only gets you a big pig's trotter of JS code, obscure and hard to read.
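As a rough illustration (my own sketch, not part of the original article), you can often tell this kind of response apart just by inspecting the body a plain request returns:

```python
import re

def looks_like_js_payload(body: str) -> bool:
    """Crude heuristic: a JS-rendered site often answers a plain HTTP
    request with script source instead of an HTML document."""
    stripped = body.lstrip().lower()
    if stripped.startswith("<!doctype") or stripped.startswith("<html"):
        return False  # looks like a normal HTML document
    # Obvious signs of raw JavaScript in the response body
    return bool(re.search(r"\b(var|function)\b|window\.|document\.", body))

print(looks_like_js_payload("var server_data = {};"))         # True
print(looks_like_js_payload("<!DOCTYPE html><html></html>"))  # False
```

This is only a heuristic, but it is a quick first check before deciding which of the two approaches below to take.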

In this case, if you want to see the little sister in her full glory, there are two ways: 1. use the Selenium automation framework, or 2. analyze the specific JS code.

Selenium is the brute-force option: it directly simulates a real browser environment, so it gets the real data as easily as a real browser does. The price of this brute force is very low efficiency.

So we can try the second method: analyze the specific JS code and pull the little sister's face gently out of the muck.

So I opened the browser console as usual, watched the page's network requests, and picked out the JS requests specifically. Scanning all the JS files, I found one named server_list_data.js with a list_data field that most likely stores some data, so I clicked on the file to inspect it.

Sure enough, the file was full of Unicode escape sequences, so I checked a few of them on a transcoding site.

Those Unicode escapes are exactly the content displayed on the page. What we need to do next is request the JS URL from our program, parse the returned content, and convert the Unicode escapes into Chinese.
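As a quick standalone illustration (not part of the article's original code), Python can turn `\uXXXX` escape sequences into Chinese with the `unicode_escape` codec:

```python
# Escaped text as it appears in the JS file (raw string, so the
# backslashes are literal characters here)
escaped = r"\u5e7f\u4e1c"

# Re-encode to bytes, then decode the \uXXXX escapes into characters
decoded = escaped.encode("ascii").decode("unicode_escape")
print(decoded)  # -> 广东 ("Guangdong")
```

In the script below this step happens implicitly, because evaluating the string literals interprets the escapes.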

The program code

import requests
import re

def parse_js():
    url = "https://cbg-xyq.res.netease.com/js/server_list_data.js"
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/69.0.3497.100 Safari/537.36"
    }
    html = requests.get(url, headers=headers)
    # Capture everything after "var server_data =" to the end of the file
    pattern = re.compile(r"(.*)var server_data =(.*)", re.S)
    data = re.findall(pattern, html.text)
    # Drop the trailing ";" and evaluate the object literal as Python
    server_data = eval(data[0][1][:-1])
    for area in server_data:
        for server in server_data[area]:
            print(server)

if __name__ == '__main__':
    parse_js()
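One caveat: `eval` will execute whatever the server sends back, which is risky. A safer variation of my own (assuming the `server_data` value is a plain dict/list literal terminated by a semicolon, which may not hold for every JS file) uses `ast.literal_eval`, which only accepts literals:

```python
import ast
import re

# Toy stand-in for the downloaded JS file (the real one holds region
# names as \uXXXX escapes inside an object literal)
SAMPLE_JS = 'var foo = 1;\nvar server_data = {"1": [["\\u5e7f\\u4e1c", 123]]};'

def extract_server_data(js_text):
    # Capture the literal assigned to server_data, up to the final ";"
    match = re.search(r"var server_data\s*=\s*(.*?);\s*$", js_text, re.S)
    if not match:
        return {}
    # literal_eval parses dict/list/str/number literals but refuses to
    # execute arbitrary code, unlike eval
    return ast.literal_eval(match.group(1))

print(extract_server_data(SAMPLE_JS))  # -> {'1': [['广东', 123]]}
```

The escape sequences are decoded for free, because `literal_eval` interprets `\uXXXX` inside string literals just as `eval` does.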

Output result:

What a lovely little sister — bah, I mean, what neat data.

This article was first published on the public account "Chi hai"; reply "1024" on the account to get the latest programming resources.