Crawler series (2): Chrome packet capture analysis

In this article, we use an intuitive web analysis tool, Chrome Developer Tools, to capture and analyze the requests a web page makes

1. Test environment

Browser: Chrome

Browser Version: 67.0.3396.99 (official version) (32-bit)

Web analysis tool: Chrome Developer Tools

2. Web analysis

(1) Web source code analysis

As we know, web pages are either static or dynamic. Many people assume that a static page is simply a page without dynamic effects, but that is wrong:

Static page: a non-interactive web page with no back-end database, usually with a .htm, .html, or .xml suffix
Dynamic page: an interactive web page that can exchange data with a back-end database, usually with a .aspx, .asp, .jsp, or .php suffix
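
As a rough illustration of the suffix convention above, the sketch below guesses a page type from the URL path. The function name `guess_page_type` and the two suffix sets are illustrative assumptions built from the lists above. A suffix is only a hint: many sites serve dynamic content from .html URLs or from URLs with no suffix at all.

```python
import posixpath
from urllib.parse import urlsplit

# Suffix lists copied from the definitions above (a convention, not a rule).
STATIC_SUFFIXES = {".htm", ".html", ".xml"}
DYNAMIC_SUFFIXES = {".aspx", ".asp", ".jsp", ".php"}


def guess_page_type(url):
    """Return 'static', 'dynamic', or 'unknown' based only on the URL suffix."""
    path = urlsplit(url).path              # drop the query string and fragment
    suffix = posixpath.splitext(path)[1].lower()
    if suffix in STATIC_SUFFIXES:
        return "static"
    if suffix in DYNAMIC_SUFFIXES:
        return "dynamic"
    return "unknown"
```

Because the suffix can be misleading, the checks described later in this article are more reliable.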

In addition, many dynamic sites load content asynchronously (Ajax), which is often why the source code a crawler fetches differs from what the site displays in the browser

There are two ways to crawl dynamic web pages:

The first is to analyze the Ajax requests through packet capture, which I cover next
The second is to render the page dynamically with a tool such as Selenium; see my other article, Basic Use of Selenium

Below, we take a Jingdong (JD.com) product as an example to show how to capture packets with Chrome. First, open a product's home page:

item.jd.com/10072615543…

Right-click a blank area of the page and choose View page source (or press Ctrl+U to open it directly)

Note that View page source shows the original source code returned by the server, which is usually what a crawler fetches

Right-click a blank area of the page again and choose Inspect (or press Ctrl+Shift+I or F12 to open it directly)

Note that what you see here is the source code after Ajax loading and JavaScript rendering, that is, the source code of the page as currently displayed

Comparing the two shows that their contents differ, which is a typical sign of asynchronous loading (Ajax)

At least the prices of Jingdong products are currently loaded asynchronously. Here are three ways to determine whether a piece of content on a page is dynamically generated:

The first is to examine the output of View page source: look for typical dynamic-request statements, or compare it with the source shown by Inspect
The second is packet capture analysis, explained below; it is the most commonly used method and worth mastering
The third is a trick: disable JavaScript in Chrome

Specifically, enter chrome://settings/content/javascript in Chrome's address bar to open the JavaScript settings page

Then turn off the JavaScript option and reload the page; the place where the price was previously displayed is now blank

This indicates that the original price was dynamically generated by JavaScript
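
The first check, comparing what the browser shows with the raw page source, can be partly automated. The sketch below extracts the text outside script and style blocks from raw HTML and tests whether a string seen in the browser is present. The helper names and the sample HTML are hypothetical and are not taken from the Jingdong page.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the page text with scripts removed and whitespace collapsed."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def is_in_static_source(html, text):
    """True if `text` already appears in the raw source, i.e. it is not loaded later."""
    return text in visible_text(html)


# Made-up sample: the price only exists inside a script, not in the markup.
SAMPLE = (
    '<html><body><h1>Demo item</h1><span class="price"></span>'
    '<script>var price = "19.90";</script></body></html>'
)
print(is_in_static_source(SAMPLE, "Demo item"))  # True
print(is_in_static_source(SAMPLE, "19.90"))      # False
```

A `False` result only suggests dynamic loading; text split across several tags would also be missed by this simple substring test.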

(2) Web page packet capture analysis

Let's again take a Jingdong product as an example: open the product's home page and try to capture the dynamically loaded price data

item.jd.com/10072615543…

Press Ctrl+Shift+I or F12 to open Developer Tools, then select the Network tab for packet capture analysis

Press F5 to refresh the page. Various requests appear in Developer Tools; we use the Filter bar to narrow them down

First, we select Doc and see that only one request appears in the list

Typically, this is the first response the browser receives, and it carries the original source code of the requested page

Click Headers to see its request and response headers

Click Response to see the returned source code. It is easy to confirm that it matches what View page source shows
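
Outside the browser, the same document request can be reproduced with Python's standard library. This is a minimal sketch under assumptions: the header value is a placeholder to be replaced with what the Headers panel shows, `build_request` and `fetch_source` are hypothetical helper names, and the site may require extra headers or cookies.

```python
import urllib.request

# Placeholder header; copy the real values from the Headers panel.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0",
}


def build_request(url, headers=None):
    """Create a GET request carrying browser-like headers."""
    merged = dict(DEFAULT_HEADERS)
    merged.update(headers or {})
    return urllib.request.Request(url, headers=merged)


def fetch_source(url, headers=None, timeout=10):
    """Download the raw page source, i.e. what View page source would show."""
    request = build_request(url, headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_source` with the product page URL should return roughly the same text as the Response panel, without any content that is loaded later by JavaScript.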

Let's get back to the point. To analyze dynamically loaded content, focus on the XHR and JS filters

After filtering by JS, the list still contains many requests. After going through them, we picked out the request marked in the figure below

This request returns price information, but on closer inspection the prices are not for the current item; they are for related items

Still, the request is price-related, so let's look at its request URL

https://p.3.cn/prices/mgets?callback=jQuery1609108&type=1&area=1_72_2799_0&pdtk=&pduid=1539779074977382417990&pdpin=&pin=null&pdbp=0&skuIds=J_25630711066%2CJ_26395831446%2CJ_20823451030%2CJ_11332156897%2CJ_14020547214%2CJ_26498549638&ext=11100000&source=item-pc

Removing the unnecessary parameters, including callback, yields a much shorter URL

https://p.3.cn/prices/mgets?skuIds=J_25630711066%2CJ_26395831446%2CJ_20823451030%2CJ_11332156897%2CJ_14020547214%2CJ_26498549638
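
The parameter trimming can also be done programmatically. The sketch below uses `urllib.parse` to keep only the chosen query parameters of the captured URL; `keep_params` is a hypothetical helper name, and the choice to keep only `skuIds` follows the simplification made above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The request URL captured in the Network tab.
CAPTURED_URL = (
    "https://p.3.cn/prices/mgets?callback=jQuery1609108&type=1"
    "&area=1_72_2799_0&pdtk=&pduid=1539779074977382417990&pdpin=&pin=null"
    "&pdbp=0&skuIds=J_25630711066%2CJ_26395831446%2CJ_20823451030"
    "%2CJ_11332156897%2CJ_14020547214%2CJ_26498549638"
    "&ext=11100000&source=item-pc"
)


def keep_params(url, wanted):
    """Return `url` with every query parameter dropped except those in `wanted`."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key in wanted]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(keep_params(CAPTURED_URL, {"skuIds"}))
```

Removing parameters one at a time and re-requesting the URL is a simple way to find out which ones the server actually needs.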

Open the URL directly in a browser, and you can see that it does return JSON data containing price information (although the prices are for other items)
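
When the callback parameter is left in, a server that supports it typically wraps the JSON in a function call (JSONP), which `json.loads` cannot parse directly. The sketch below handles both forms. `parse_json_or_jsonp` is a hypothetical helper, and the sample payload and its field names are made up for illustration; the real response fields may differ.

```python
import json
import re

# Matches "callbackName( ... )" with an optional trailing semicolon.
_JSONP = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def parse_json_or_jsonp(text):
    """Parse plain JSON, or unwrap a JSONP callback first if one is present."""
    match = _JSONP.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Made-up payloads, not real responses from the price endpoint.
plain = '[{"id": "J_1", "price": "9.90"}]'
wrapped = 'jQuery1609108([{"id": "J_1", "price": "9.90"}]);'
print(parse_json_or_jsonp(plain) == parse_json_or_jsonp(wrapped))  # True
```

Dropping the callback parameter, as done above, avoids the wrapper entirely and is usually the simpler option for a crawler.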

From the URL parameters, we can infer that skuIds holds the unique identifier of each product. So where do we find the SKU ID of the product we want?

In fact, SKU is an abbreviation commonly used in logistics and transportation; it stands for Stock Keeping Unit.

It is the basic unit of inventory measurement and has come to mean a uniform product number; each product should have a unique SKU

Look again at the URL of the product page we opened earlier, item.jd.com/10072615543…

Isn't the unique identifier of the current product (10072615543) hidden right there? Let's try it!

Sure enough, substituting it gives the full URL for this product's price, p.3.cn/prices/mget…

Visiting this URL directly returns the current price information
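
Pulling the SKU ID out of a product page URL is easy to script. The sketch below assumes the ID is the first run of digits in the URL path, which matches the product URL above; `extract_sku_id` is a hypothetical helper name, and the rest of the path, elided in this article, is ignored.

```python
import re
from urllib.parse import urlsplit


def extract_sku_id(url):
    """Return the first run of digits in the URL path, or None if there is none."""
    if "//" not in url:
        url = "//" + url  # let urlsplit treat "item.jd.com/..." as host + path
    path = urlsplit(url).path
    match = re.search(r"/(\d+)", path)
    return match.group(1) if match else None


print(extract_sku_id("item.jd.com/10072615543"))  # 10072615543
```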

In fact, we can also generalize the URL so that it works for price crawling of any Jingdong product

It is very simple: just separate out skuIds as a parameter, p.3.cn/prices/mget…

With the generalized URL, in theory, as long as we know a product's SKU ID, we can retrieve its price
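
The generalized URL can be sketched as a small builder function. The endpoint is the one from the captured request; the `J_` prefix is an assumption based on the skuIds values in that request, and `build_price_url` and `fetch_prices` are hypothetical helper names. Whether the endpoint still responds as it did when this was written is not something the sketch can guarantee.

```python
import json
import urllib.request
from urllib.parse import urlencode

PRICE_ENDPOINT = "https://p.3.cn/prices/mgets"  # taken from the captured request


def build_price_url(sku_ids, prefix="J_"):
    """Build the price URL for one or more SKU IDs.

    The "J_" prefix is assumed from the captured URL, not documented anywhere.
    """
    joined = ",".join(f"{prefix}{sku_id}" for sku_id in sku_ids)
    query = urlencode({"skuIds": joined})  # encodes the commas as %2C
    return f"{PRICE_ENDPOINT}?{query}"


def fetch_prices(sku_ids, timeout=10):
    """Request the price data and decode it as JSON (performs a network call)."""
    url = build_price_url(sku_ids)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


print(build_price_url(["25630711066", "26395831446"]))
```

Passing several SKU IDs in one call mirrors the captured request, which asked for six products at once.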
