
Hello, I’m Brother Chen ~~~

Prerequisite: Before following this mini program data collection tutorial, you should already know how to capture packets, for example with Mitmproxy. If you haven’t mastered that yet, you can read Brother Chen’s earlier article on using Mitmproxy (Hands-on | Teaching you how to use the packet-capture tool Mitmproxy).

Objective of this article: use Mitmproxy to capture the scenic spot data of a mini program, and crawl it page by page (next page) in a loop.

Ideas:

1. Capture and analyze data packets using Mitmproxy

2. Based on the analysis, write Python code to extract the data and collect subsequent pages

01. Mitmproxy captures data packets

1. Start mitmproxy

Configure the mobile phone’s proxy to the IP address of the machine running mitmproxy, then start mitmproxy.

Start mitmweb in the terminal:

mitmweb


View the packets in the browser (mitmweb opens the web page in the browser automatically; if it does not, enter the address manually):

http://127.0.0.1:8081/#/flows
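The address above is only mitmweb’s web interface for viewing flows; the proxy itself listens on mitmproxy’s default port 8080, and that is the port the phone’s proxy settings should point at. As a quick sanity check (a sketch, not part of the original workflow), you can send a request from the computer through the proxy and watch it appear in mitmweb:

import requests

# Route a test request through mitmproxy (default proxy port 8080).
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}

# verify=False because mitmproxy re-signs HTTPS traffic with its own CA;
# alternatively, install the mitmproxy CA certificate and keep verification on.
resp = requests.get("https://httpbin.org/ip", proxies=proxies, verify=False, timeout=10)
print(resp.status_code)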

2. Access the mini program

Open the Tongcheng Travel mini program and tap All Attractions.

A list of attractions appears on the page:

3. View data packets in the browser

The part in the red box in the figure above is the API interface for the scenic spot list. Click Response to view the returned data.

02. Python parses packets

1. Analyze the interface

Analysis shows that the interface has no anti-crawling protection (no signature verification), so multi-page data can be crawled directly by modifying the parameters in the interface link.

Parameters:

page: page number

PageSize: number of results per page

CityId: city ID

keyword: search keyword

...

Therefore, all scenic spot data can be obtained by modifying page.

Once you know the interface link, fetching the data with Python’s requests library is something we can all do.

import requests

# Get data from page 1 to page 10
for p in range(1, 11):  # page number
    url = "https://wx.17u.cn/scenery/json/scenerylist.html?PosCityId=78&CityId=53&page=" + str(p) + "&sorttype=0&PageSize=20&IsSurrounding=1&isSmallPro=1&isTcSmallPro=1&isEncode=0&Lon=113.87234497070312&Lat=22.90543556213379&issearchbytimenow=0&IsNeedCount=1&keyword=&IsPoi=0&status=2&CityArea=5&Grades=&IsSearchKeyWordScenery=1"
    response = requests.get(url).json()
    print(response)

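The same request can also be written with a params dict (a sketch equivalent to the loop above), so that only the page value changes on each iteration:

import requests

base_url = "https://wx.17u.cn/scenery/json/scenerylist.html"
params = {
    "PosCityId": 78, "CityId": 53, "sorttype": 0, "PageSize": 20,
    "IsSurrounding": 1, "isSmallPro": 1, "isTcSmallPro": 1, "isEncode": 0,
    "Lon": 113.87234497070312, "Lat": 22.90543556213379,
    "issearchbytimenow": 0, "IsNeedCount": 1, "keyword": "",
    "IsPoi": 0, "status": 2, "CityArea": 5, "Grades": "",
    "IsSearchKeyWordScenery": 1,
}

# Get data from page 1 to page 10, changing only the page parameter
for p in range(1, 11):
    params["page"] = p
    data = requests.get(base_url, params=params, timeout=10).json()
    print(data)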

Today we will get the data in a different way, one that can also be used to get around anti-crawling based on interface signature verification, such as encrypted signature parameters like sign or X-sign.

2. Parse data packets directly

Readers who have seen Brother Chen’s earlier article (Hands-on | Teaching you how to use the packet-capture tool Mitmproxy) will know that, besides viewing captured packets in the browser, you can also write Python code that parses the packets as Mitmproxy captures them.

Let’s take a look at some of the data that Python can retrieve from packets.
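The original screenshot of that script is not reproduced here, so below is a minimal sketch of what such a mitmproxy addon could look like: it filters on the scenerylist interface found above and prints the URL and response body of each matching packet. The file name chenge.py matches the one used later in this article.

# chenge.py -- minimal sketch of a mitmproxy addon that prints
# the attractions-list responses as they are captured
from mitmproxy import http


def response(flow: http.HTTPFlow) -> None:
    # Only handle the scenic spot list interface found in the capture
    if "scenery/json/scenerylist" in flow.request.url:
        print(flow.request.url)
        print(flow.response.text)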

Running this script from the terminal with mitmdump produces output like the following:

Now let’s write the actual Python code to save the scenic spot data directly to a TXT file.

In the chenge.py file, modify the response function (as shown above).

Start the program:

mitmdump.exe -s chenge.py


The data returned by the API interface begins with:

"state":"100","error":"query successful"

So if the response data contains this content, it carries a list of attractions.

The scenic spot list is in the sceneryinfo field of the JSON data. We take out the name, address, and grade fields and save them to a TXT file named scenic_spot.txt.
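Putting these pieces together, here is a sketch of what the modified response function in chenge.py could look like. The "state":"100" marker, the sceneryinfo field, and the name/address/grade keys follow the description above; the exact key names in the real response may differ, so treat this as an outline rather than the author’s exact code.

# chenge.py -- sketch of the response hook that saves the attractions
# list to a TXT file; run with:  mitmdump.exe -s chenge.py
import json

from mitmproxy import http


def response(flow: http.HTTPFlow) -> None:
    text = flow.response.text or ""
    # Per the article, responses that carry an attractions list start with this marker
    if '"state":"100"' not in text.lower():
        return
    try:
        data = json.loads(text)
    except ValueError:
        return
    # Assumes sceneryinfo is a list of objects with name/address/grade fields
    scenery_list = data.get("sceneryinfo", [])
    with open("scenic_spot.txt", "a", encoding="utf-8") as f:
        for item in scenery_list:
            f.write("{}\t{}\t{}\n".format(
                item.get("name", ""), item.get("address", ""), item.get("grade", "")))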

Swipe down in the mini program to load more data; Mitmproxy keeps capturing the packets, and the Python program keeps saving the data to the TXT file.

PS: This article is only about the technique and does not crawl all of the data. To keep the demonstration simple, the data is saved to TXT for now; readers can save it to a database or Excel as needed.

03. Summary

Objective of this article: use Mitmproxy to capture the scenic spot data of a travel mini program, and crawl it page by page (next page) in a loop. It also describes how Mitmproxy can be used to get around anti-crawling based on interface signature verification, such as encrypted signature parameters like sign or X-sign (this interface happens to have no encrypted parameters, but the technique is worth mastering in advance so it is ready when you need it).

If you haven’t got it yet, go practice it hands-on! One last word: creating original content is not easy, so please give it a like!