A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically collects information from the World Wide Web according to certain rules. Other, less commonly used names include ant, automatic indexer, emulator, and worm. Python is a common choice of programming language for implementing crawler algorithms.

1. Overview of web crawler technology

For developers without a software background, web crawler technology can appear fairly complex. For professional B/S (browser/server) developers, however, crawler development revolves around the HTTP protocol and mainly involves URLs, HTTP requests, web service interfaces, and data formats.

For non-professional developers, the legal acquisition of dynamic data can be understood simply as a robot that automatically queries web pages through a browser and records the data it obtains: the process of sending a Request and receiving a Response, followed by parsing the recorded page data.

1.1. The HTTP processing flow

A simplified description of the HTTP processing flow is as follows: the client (for example, the Chrome browser, or a Python program you write yourself) sends an HTTP request to a URL, and the server responds by returning HTML + CSS + JavaScript code (including data). The JavaScript code may itself issue further URL requests and dynamically re-request resources, until the entire page has loaded.
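As a minimal illustration of this request/response cycle, the sketch below sends a single GET request with the requests library and prints parts of the response. The URL here is only a placeholder, not one of the interfaces discussed later.

    import requests  # third-party HTTP client: pip install requests

    # Placeholder URL; replace it with the page you are studying.
    url = 'https://example.com/'
    response = requests.get(url)

    print(response.status_code)               # e.g. 200
    print(response.headers.get('Content-Type'))
    print(response.text[:200])                # first 200 characters of the returned HTML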

What we usually call a crawler is really the task of finding, among this pile of dynamic HTTP requests, the URL that returns the data we need, whether it is a web page link or an API link captured from an app. Once we have it, we simulate the client, send a request packet, receive the response packet, and then parse the data out of the response. The main items to record are (the sketch after the list shows how to inspect each of them with requests):

  • URL
  • Request methods (POST, GET)
  • Request packet headers
  • Request packet contents
  • Response packet headers
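The following sketch builds a request with the requests library and prints each of the items above. The URL, headers, and payload are hypothetical placeholders used only for illustration.

    import requests

    # Hypothetical POST request; URL and payload are placeholders.
    req = requests.Request(
        method='POST',
        url='https://example.com/api/data',
        headers={'User-Agent': 'Mozilla/5.0'},
        data={'date': '2021/2/1'},
    )
    prepared = req.prepare()

    print(prepared.url)      # URL
    print(prepared.method)   # request method
    print(prepared.headers)  # request packet headers
    print(prepared.body)     # request packet contents

    resp = requests.Session().send(prepared)
    print(resp.headers)      # response packet headers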

An HTTP request consists of three parts: the request line, the message headers, and the request body.

The request methods and their meanings:

  • GET — Requests the resource identified by the Request-URI
  • POST — Appends new data to the resource identified by the Request-URI
  • HEAD — Requests only the response message headers for the resource identified by the Request-URI
  • PUT — Requests that the server store a resource and identify it by the Request-URI
  • DELETE — Requests that the server remove the resource identified by the Request-URI
  • TRACE — Requests that the server echo back the request it received, mainly for testing or diagnosis
  • CONNECT — Reserved for future use
  • OPTIONS — Queries the server's capabilities, or the options and requirements related to a resource

The HTTP response also consists of three parts: the status line, the message headers, and the response body. What we care about is the response body, from which we can extract the data we need directly. The more common approach, of course, is to parse the response page, which usually yields more complete results.
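The two approaches look like this in practice; a small sketch with placeholder URLs, assuming the first endpoint returns JSON and the second returns an HTML page:

    import requests

    # 1) The response body is already structured data (e.g. a JSON API):
    #    take what we need directly.
    api_resp = requests.get('https://example.com/api/prices')
    data = api_resp.json()      # parsed into Python dicts/lists

    # 2) The response body is an HTML page: parse it first.
    page_resp = requests.get('https://example.com/prices.html')
    html = page_resp.text       # hand this to a parser (see section 1.2)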

1.2. Data parsing

An HTTP request returns an HTTP response, which has a fixed format but whose data body can take many forms. There are four common ways to parse it (a short comparison follows the list):

  • Regular expression
  • requests-html
  • BeautifulSoup
  • lxml / XPath
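The sketch below compares three of these on a tiny, made-up HTML fragment (requests-html is omitted here). BeautifulSoup and lxml are third-party packages that must be installed separately.

    import re
    from bs4 import BeautifulSoup   # pip install beautifulsoup4
    from lxml import etree          # pip install lxml

    html = '<html><body><p class="price">92#: 6.13</p></body></html>'

    # Regular expression: quick, but fragile if the markup changes
    print(re.search(r'<p class="price">(.*?)</p>', html).group(1))

    # BeautifulSoup: tolerant of messy HTML, convenient selectors
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.find('p', class_='price').get_text())

    # lxml + XPath: fast, expressive path queries
    tree = etree.HTML(html)
    print(tree.xpath('//p[@class="price"]/text()')[0])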

2. A practical case: crawling retail price data for refined oil products

Occasionally we need to obtain some data from the Internet; for example, suppose we need the retail price data of refined oil for the most recent year.

2.1. Find the data source

The national oil price data center on Eastmoney (Data.eastmoney.com/cjsj/oil_de…) provides the data we need.

2.2. Find the web data query interface

Fiddler is a common packet-capture and analysis tool. We can use Fiddler to analyze HTTP requests in detail and to simulate the corresponding requests.

When Fiddler is running, all requests and responses between the local browser and the server pass through Fiddler and are forwarded by it. Because all traffic goes through Fiddler, it can intercept the data without capturing raw network packets.
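Fiddler works as a local HTTP proxy (by default on 127.0.0.1:8888). If you also want your own Python requests to show up in Fiddler while debugging, you can point them at that proxy; a sketch, assuming Fiddler is running with its default settings:

    import requests

    # Route traffic through the local Fiddler proxy (default port 8888).
    proxies = {
        'http': 'http://127.0.0.1:8888',
        'https': 'http://127.0.0.1:8888',
    }

    # verify=False avoids certificate errors when Fiddler decrypts HTTPS;
    # use it only for local debugging.
    resp = requests.get('https://example.com/', proxies=proxies, verify=False)
    print(resp.status_code)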

After tracking and analyzing the network requests, we find the link that serves as the data interface (the datacenter-web.eastmoney.com/api/data/get endpoint used in the code below). By monitoring the request, we also find that the date, which appears in the WebForms request form, is the variable parameter of this interface API. If you are interested, you can locate the API that returns the list of available dates yourself.

2.3. Analyze the returned data format

Parsing the web page itself would be cumbersome, so we intercept the interface data directly and analyze the crawled results.

Data format:

jQuery11230774292942777187_1622422491627(
{"version":"cafaf0743a45b7965a243c937f23dea5","result":{"pages":1,"data":[
{"DIM_ID":"EMI01641328","DIM_DATE":"2021/2/1","V_0":5.8,"V_92":6.13,"V_95":6.59,"V_89":5.74,"CITYNAME":"Anhui province","ZDE_0":0.06,"ZDE_92":0.06,"ZDE_95":0.06,"ZDE_89":0.06,"QE_0":5.74,"QE_92":6.07,"QE_95":6.53,"QE_89":5.68},
{"DIM_ID":"EMI01521389","DIM_DATE":"2021/2/1","V_0":5.81,"V_92":6.16,"V_95":6.56,"V_89":5.76,"CITYNAME":"Beijing","ZDE_0":0.06,"ZDE_92":0.06,"ZDE_95":0.07,"ZDE_89":0.05,"QE_0":5.75,"QE_92":6.1,"QE_95":6.49,"QE_89":5.71},
{"DIM_ID":"EMI01641332","DIM_DATE":"2021/2/1","V_0":5.77,"V_92":6.14,"V_95":6.55,"V_89":5.71,"CITYNAME":"Fujian","ZDE_0":0.06,"ZDE_92":0.06,"ZDE_95":0.06,"ZDE_89":0.05,"QE_0":5.71,"QE_92":6.08,"QE_95":6.49,"QE_89":5.66}],"count":25},"success":true,"message":"ok","code":0});

2.4. Fetching the data with Python

This example uses the requests module, which needs to be installed separately:

pip install requests
# -*- coding: utf-8 -*-
"""Created on May 31, 2021  @author: Xiao Yongwei"""
import requests
import re
import json
from datetime import datetime
import pandas as pd
from time import sleep
import random


class Crawler(object):
    def __init__(self):
        # Eastmoney data interface
        self.url = 'http://datacenter-web.eastmoney.com/api/data/get'
        # Price adjustment date interface API
        self.params_date = 'callback=jQuery112300671446287155848_1622441721838&type=RPTA_WEB_YJ_RQ&sty=ALL&st=dim_date&sr=-1&token=894050c76af8597a853f5b408b759f5d&p=1&ps=5000&source=WEB&_=1622441721839'
        # Oil price interface API
        self.params_price = 'callback=jQuery11230774292942777187_1622422491627&type=RPTA_WEB_YJ_JH&sty=ALL&filter=(dim_date%3D%27$date$%27)&st=cityname&sr=1&token=8&p=1&ps=100&source=WEB&_=1622422491638'

    # Get the price adjustment dates
    def getDates(self, start_date):
        dates_json = self._getResponse(self.params_date)
        self.dates = []
        start_date = datetime.strptime(start_date, "%Y-%m-%d").date()
        for dates in dates_json:
            dim_date = dates['dim_date']
            dim_date = dim_date.replace('/', '-')
            if datetime.strptime(dim_date, "%Y-%m-%d").date() >= start_date:
                self.dates.append(dim_date)

        print(self.dates)

    # Get the prices
    def getOilprice(self):
        self.pricedatas = []
        k = len(self.dates)
        i = 0
        for dates in self.dates:
            params_price = self.params_price.replace('$date$', dates)
            prices_json = self._getResponse(params_price)
            self.pricedatas.extend(prices_json)
            sleep(random.randint(0, 3))
            i = i + 1
            print('Done: {:.2%}'.format(i / k))

        self.df = pd.DataFrame(self.pricedatas)
        # self.df.to_csv('price20210531.csv', encoding='utf_8_sig')
        print(self.pricedatas)

    # Get the API return value
    def _getResponse(self, params):
        r = requests.get(self.url, params)
        # Get the JSON data inside the parentheses
        p = re.compile(r'[(](.*?)[)]', re.S)
        jsondata = re.findall(p, r.text)
        # Parse the JSON and return the data list
        result = json.loads(jsondata[0])

        return result['result']['data']


def main():
    Oil_prices = Crawler()

    Oil_prices.getDates('2008-03-01')
    Oil_prices.getOilprice()


if __name__ == '__main__':
    main()

2.5. Data analysis

Because the API data is intercepted directly, it is relatively easy to process: the embedded JSON data is matched directly with a regular expression.

The relevant code snippet:

        r = requests.get(self.url, params)
        # Get the JSON data inside the parentheses
        p = re.compile(r'[(](.*?)[)]', re.S)
        jsondata = re.findall(p, r.text)
        # Parse the JSON and return the data list
        result = json.loads(jsondata[0])

        return result['result']['data']
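Once the records are in a pandas DataFrame (self.df in the code above), further analysis is straightforward. As a small sketch, assuming the CSV export line in getOilprice() has been uncommented and that V_92 is the 92-octane gasoline price (as the sample data suggests), one could pivot the prices by date and city:

    import pandas as pd

    # Read back the exported data; the file name comes from the commented-out to_csv call.
    df = pd.read_csv('price20210531.csv')

    # One row per adjustment date, one column per city, values = 92-octane price.
    price_92 = df.pivot_table(index='DIM_DATE', columns='CITYNAME', values='V_92')
    print(price_92.head())

    # Average 92-octane price across cities on each adjustment date.
    print(price_92.mean(axis=1))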

3. Summary

Crawler technology, when the goal is only to speed up and simplify data acquisition, is actually fairly simple: from the point of view of IT software, it rests on very mature technologies and protocols. If you know JSP, front-end/back-end separation, and SOA, legal crawling will come easily. The key is to find the request URL and to identify the interface API and its parameter variables.

It is also recommended to learn more about web service interface technology to improve data-parsing efficiency.
