Preface

The basics of crawlers are behind us, so this time let's find a website and put them into practice. But why funds? That starts with my story.

I'm a post-'00s "leek" (a complete beginner among retail investors) who followed the crowd into the fund market with a lump sum and held it Buddha-style, roughly breaking even. Before the New Year I saw liquor funds blazing hot, so I put in a little more; who knew that once trading resumed everything turned deep green, whatever a leek had earned was wiped out, and a week ago I finally woke up from the dream.

I still remember how cool the wind on the rooftop was that day. Looking down at the cars streaming past below, I felt a little afraid of heights. I wanted to light a cigarette to set the mood, then remembered I don't smoke. In my sadness I suddenly recalled what a famous person once said: "As long as you don't run, you are not a leek." So I went home, sat down in front of my computer, and wrote this article.

To prepare

  1. Define the crawl target: fund data for the various sector and theme boards

  2. Find the data site: Daily Fund (fund.eastmoney.com)

  3. Determine the site entry point: on the home page, click Investment Vehicles -> Thematic Funds, then on the theme page select Theme Index, as shown below:

  4. Determine what to crawl: under the theme index, click Liquor to enter the liquor fund list.

Click a fund to enter its details page, and decide which fields to crawl from the page content based on your needs. Besides the fields in the red box, there are also the fund name, the fund code, and the theme field.

  5. Map out the page navigation: theme page -> list page -> details page, three layers in total

Website analysis

Level 1: Request site entry

Press F12 or right-click -> Inspect, and use the developer tools to find the HTML elements of the fund categories. Right-click the HTML element and copy its xpath, or of course write your own. Then write code to fetch the category list. As the figure shows, I tried both my own xpath and the copied one against the category elements, but the result was empty. With that in mind, let's look at the content of the returned page.
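To make the problem concrete, here is a minimal sketch of that first attempt. The entry URL and the xpath are placeholders for whatever you copied from the developer tools; the only point is that the result comes back empty.

import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0'}

# Request the theme page the same way the crawler will
# (replace the URL with the actual entry page found in the preparation step)
response = requests.get('http://fund.eastmoney.com/ztjj/', headers=headers)
html = etree.HTML(response.text)

# Hypothetical xpath copied from the developer tools for the category list
categories = html.xpath('//div[@class="category"]/ul/li/a/text()')
print(categories)  # prints [] -- the categories are not in the raw HTML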

As the figure shows, the page returned to the crawler has different elements from the page seen in the browser, and the industry category content is missing! Newcomers to crawlers may still be wondering why, while experienced crawler developers already have the answer: dynamic loading. So what is dynamic loading? Let me explain it in my own words.

Dynamic loading

When we visit a web page in a browser, the server returns the HTML page along with JS, CSS, and other files. While loading the page, the browser kernel (also called the rendering engine) executes the JS against the HTML and then displays the rendered page. In other words, what you see in the browser is the original HTML plus the result of the browser's JS rendering.

The process by which JS renders data into the page is dynamic loading. So where does that data come from?

When you enter a URL to visit a website, methods defined in the JS quietly make additional requests for you. The most common case is a page with a data display area: when you click "next page", the page itself does not navigate away, only the data area refreshes. This is the partial refresh implemented with Ajax, and it is one of the most common forms of dynamic loading. Let's go over the general principle.

The front-end developer attaches a click listener to the "next page" button in JS. When the button is clicked, the corresponding JS function runs, uses Ajax to request a back-end URL, gets back JSON or data in some other format, selects the HTML element of the data display area, clears the existing data, and inserts the newly fetched data. The data is refreshed without the page navigating anywhere, which is why this is also called an asynchronous request, or partial refresh. Of course, many websites also use Ajax to fetch data for rendering while the page is first loading.

But a crawler has no rendering engine and cannot execute JS, so all it gets back is the original HTML. The page source you see in the browser is the page before JS rendering, and it is also exactly what our crawler ends up with.

As the figure shows, there is no category element in the page source. At this point we can conclude: the developer tools show the JS-rendered HTML, while the page source is the original HTML.

Now think about it: why do we parse web pages at all? To get the data! But the data is not in this page, so there is no need to request the page URL. We just need to find the URL the JS fetches the data from, and request that URL directly.

How do you normally handle dynamic loading?

Find the URL of the interface

In my opinion, getting data from a dynamically loaded page is much easier than from an ordinary page, except when encrypted parameters are involved. You get JSON or other text-format data directly from the interface, with no web page to parse, and crawler development shifts from being page-oriented to being data-oriented. The first thing to do is find the URL of the interface.
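In other words, once the interface is found, the whole "parse the page" step disappears. A minimal sketch, with a made-up interface URL standing in for whatever you find in the Network panel:

import requests

headers = {'User-Agent': 'Mozilla/5.0'}

# Hypothetical data interface spotted in the Network panel (not the page URL itself)
api_url = 'http://example.com/api/fundlist'
response = requests.get(api_url, params={'pageindex': 1, 'pagesize': 10}, headers=headers)

data = response.json()  # structured data comes back directly, nothing to parse
for item in data.get('Data', []):
    print(item)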

How do I find the interface URL?

  1. Open the developer tools, refresh the page, and search for keywords

Searching the returned data by keyword, as shown in the figure, we find the response that contains "liquor". Let's look at what it returns, and keep the BKCode and BKName fields in mind.

  2. Look at the URL and construct the parameters

Now look at the request behind this response. As shown, we find the URL, and it has two request parameters.

Judging from the request and response, this is a JSONP request. The rule for this kind of request is that the callback in the URL consists of a method name plus a timestamp, the _ parameter is also a timestamp, and the response has the format callback(JSON). If you are interested, go and learn more about JSONP; if you just want the data, knowing this rule is enough.
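Knowing that rule is enough to write a small helper. The sketch below builds the callback and _ parameters from the current timestamp, sends the request, and strips the callback(...) wrapper to get plain JSON; the callback prefix and URL are taken from the request observed above, and the helper itself is just an assumption about how you might organize this.

import json
import time
import requests

def get_jsonp(url, callback_prefix, params=None, headers=None):
    # callback = method name + timestamp, and _ is also a timestamp
    timestamp = int(time.time() * 1000)
    callback = f'{callback_prefix}_{timestamp}'
    query = dict(params or {}, callback=callback, _=timestamp)
    text = requests.get(url, params=query, headers=headers).text
    # Response looks like callback({...}); keep what sits between the outer parentheses
    payload = text[text.find('(') + 1: text.rfind(')')]
    return json.loads(payload)

# Usage against the category interface found above
data = get_jsonp('http://fundtest.eastmoney.com/dataapi1015/ztjj//GetBKListByBKType',
                 'jQuery18306789193760800711')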

Level 2: Parse the list page

  1. Click into the "electronic information" fund list page, as shown in the figure

  2. Going through the page's requests, you can see that this data also comes from a JSONP interface, so again we need to find the interface URL.

The field we care about most here is FCODE. The list page shows ten funds per page, so paging is needed; at the end of the response data there is a TotalCount field, which can be used to work out how many pages there are, as sketched below.
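The page count is just TotalCount divided by the page size, rounded up. A quick sketch with a made-up example value; math.ceil avoids requesting one extra empty page when TotalCount happens to be an exact multiple of ten, which the int(total / 10) + 1 form in the full code below would do.

import math

total_count = 156   # example value; in practice this comes from the TotalCount field
page_size = 10
pages = math.ceil(total_count / page_size)
print(pages)        # 16 pages of up to ten funds each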

  3. View the request parameters

Here the tp parameter is the BKCode, and pageindex is the page number passed in for the current request; a quick sketch of filling them in follows.
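So a list-page request is just the interface URL from the previous step with tp and pageindex filled in. A small sketch, using an illustrative BKCode (in practice it comes from the category response):

import time
import requests

headers = {'User-Agent': 'Mozilla/5.0'}

timestamp = int(time.time() * 1000)
callback = f'jQuery1830316287740290561061_{timestamp}'
bk_code = 'BK0477'   # illustrative BKCode; use the one returned by the category interface
page = 1             # current page number

list_url = ('http://fundtest.eastmoney.com/dataapi1015/ztjj/GetBKRelTopicFund'
            f'?callback={callback}&sort=SON_1N&sorttype=DESC'
            f'&pageindex={page}&pagesize=10&tp={bk_code}&isbuy=1&_={timestamp}')
response = requests.get(list_url, headers=headers)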

Level 3: Parse the details page

Open any fund details page and you will find it is a traditional static page, which can be parsed directly with CSS selectors or xpath. From the URL you can see that it is the FCODE field from the list page that leads to each fund's details page.
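A minimal sketch of this last layer, reusing the detail-page URL pattern and one xpath from the full code below; the fund code is just an illustrative value, and setting response.encoding is one alternative to the encode/decode round-trip used later.

import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0'}

fund_code = '161725'   # illustrative FCODE; in practice it comes from a list-page response
response = requests.get(f'http://fund.eastmoney.com/{fund_code}.html', headers=headers)
response.encoding = 'utf-8'          # the page itself is UTF-8
html = etree.HTML(response.text)

# Static page, so plain xpath works (path copied from the full code below)
manager = html.xpath('//div[@class="infoOfFund"]/table/tr[1]/td[3]/a/text()')
print(manager[0] if manager else 'not found')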

Application development

From the analysis above, the category data and the list data are loaded dynamically, and the returned content is JSONP text that is almost JSON: we can strip the redundant wrapper and parse it as JSON directly. The details page is a static page, so we use xpath.

Code development

import requests
import time
import datetime
import json
import pymysql
from lxml.html import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
    'Referer': 'http://fund.eastmoney.com'
}

# Initialize the database connection
connection = pymysql.connect(host='47.102.***.**', user='root', password='root', database='scrapy', port=3306, charset='utf8')
cursor = connection.cursor()

# Program entry, parse fund classification
def start_requests():
    timestamp = int(time.time() * 1000)
    callback = 'jQuery18306789193760800711_' + str(timestamp)
    start_url = f'http://fundtest.eastmoney.com/dataapi1015/ztjj//GetBKListByBKType?callback={callback}&_={timestamp}'
    response = requests.get(start_url, headers=headers)
    # Format the data returned by the category as JSON
    result = response.text.replace(callback, '')
    result = result[1: result.rfind(')')]
    data = json.loads(result)
    # Traverse the industry classification data to get the name and code
    for item in data['Data']['hy']:
        time.sleep(3)
        code = item['BKCode']
        category = item['BKName']
        print(code, category)
        parseFundList(code, category)
    # Traverse the concept category data
    for item in data['Data']['gn']:
        time.sleep(3)
        code = item['BKCode']
        category = item['BKName']
        print(code, category)
        parseFundList(code, category)

# Parse the list of funds under each category
def parseFundList(code, category):
    timestamp = int(time.time() * 1000)
    callback = 'jQuery1830316287740290561061_' + str(timestamp)
    index = 1
    url = f'http://fundtest.eastmoney.com/dataapi1015/ztjj/GetBKRelTopicFund?callback={callback}&sort=SON_1N&sorttype=DESC&pageindex={index}&pagesize=10&tp={code}&isbuy=1&_={timestamp}'
    response = requests.get(url, headers=headers)
    result = response.text.replace(callback, '')
    result = result[1: result.rfind(')')]
    data = json.loads(result)
    totalCount = data['TotalCount']
    # Calculate the total number of pages based on totalCount
    pages = int(int(totalCount) / 10) + 1
    # Parse out each page of fund FCode
    for index in range(1, pages + 1):
        timestamp = int(time.time() * 1000)
        callback = 'jQuery1830316287740290561061_' + str(timestamp)
        url = f'http://fundtest.eastmoney.com/dataapi1015/ztjj/GetBKRelTopicFund?callback={callback}&sort=SON_1N&sorttype=DESC&pageindex={index}&pagesize=10&tp={code}&isbuy=1&_={timestamp}'
        response = requests.get(url, headers=headers)
        result = response.text.replace(callback, '')
        result = result[1: result.rfind(')')]
        data = json.loads(result)
        for item in data['Data']:
            time.sleep(3)
            fundCode = item['FCODE']
            fundName = item['SHORTNAME']
            parse_info(fundCode, fundName, category)


def parse_info(fundCode, fundName, category):
    url = f'http://fund.eastmoney.com/{fundCode}.html'
    response = requests.get(url, headers=headers)
    content = response.text.encode('ISO-8859-1').decode('UTF-8')
    html = etree.HTML(content)
    worth = html.xpath('//*[@id="body"]/div[11]/div/div/div[3]/div[1]/div[1]/dl[2]/dd[1]/span[1]/text()')
    if worth:
        worth = worth[0]
    else:
        worth = 0
    scope = html.xpath('//div[@class="infoOfFund"]/table/tr[1]/td[2]/text()')[0].replace(':', '')
    manager = html.xpath('//div[@class="infoOfFund"]/table/tr[1]/td[3]/a/text()')[0]
    create_time = html.xpath('//div[@class="infoOfFund"]/table/tr[2]/td[1]/text()')[0].replace(':', '')
    company = html.xpath('//div[@class="infoOfFund"]/table/tr[2]/td[2]/a/text()')[0]
    level = html.xpath('//div[@class="infoOfFund"]/table/tr[2]/td[3]/div/text()')
    if level:
        level = level[0]
    else:
        level = 'No rating yet'
    month_1 = html.xpath('//*[@id="body"]/div[11]/div/div/div[3]/div[1]/div[1]/dl[1]/dd[2]/span[2]/text()')
    month_3 = html.xpath('//*[@id="body"]/div[11]/div/div/div[3]/div[1]/div[1]/dl[2]/dd[2]/span[2]/text()')
    month_6 = html.xpath('//*[@id="body"]/div[11]/div/div/div[3]/div[1]/div[1]/dl[3]/dd[2]/span[2]/text()')
    if month_1:
        month_1 = month_1[0]
    else:
        month_1 = ' '

    if month_3:
        month_3 = month_3[0]
    else:
        month_3 = ' '

    if month_6:
        month_6 = month_6[0]
    else:
        month_6 = ' '
    print(fundName, fundCode, category, worth, scope, manager, create_time, company, level, month_1, month_3, month_6, sep='|')
    # Store to mysql
    today = datetime.date.today()
    sql = f"insert into fund_info values('{today}', '{fundName}', '{fundCode}', '{category}', '{worth}', '{scope}', '{manager}', '{create_time}', '{company}', '{level}', '{month_1}', '{month_3}', '{month_6}')"
    cursor.execute(sql)
    connection.commit()
# Start crawling
start_requests()

Disclaimer: the code above is for learning purposes only. Do not use it to flood the website with malicious requests or cause it harm; you bear the consequences if you do.

As the program above shows, parsing dynamically loaded data is clearly simpler than parsing pages: the data fields follow a fixed structure and you rarely have to worry about missing fields, while page parsing runs into all kinds of situations.

Also, many parts of the program could be optimized. For example:

  1. The repeated code could be factored into a method; it is written out line by line here for clarity.
  2. Several parsing methods could be set up for the different layouts of the details page.
  3. Every field on the details page should get a null check with a default value; here I only checked three or four fields (see the sketch below).
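For point 3, one small helper is enough to cover every field; a minimal sketch (the abbreviated xpaths in the comments stand for the full paths used in parse_info):

def first_or_default(nodes, default=''):
    # xpath() returns a list; take the first match or fall back to a default
    return nodes[0].strip() if nodes else default

# e.g. in parse_info:
# worth = first_or_default(html.xpath('...dd[1]/span[1]/text()'), 0)
# level = first_or_default(html.xpath('...td[3]/div/text()'), 'No rating yet')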

Creating a database table

CREATE TABLE `fund_info` (
  `op_time` varchar(20) DEFAULT NULL,
  `fundName` varchar(20) DEFAULT NULL,
  `fundCode` varchar(20) DEFAULT NULL,
  `category` varchar(20) DEFAULT NULL,
  `worth` varchar(20) DEFAULT NULL,
  `scope` varchar(20) DEFAULT NULL,
  `manager` varchar(20) DEFAULT NULL,
  `create_time` varchar(20) DEFAULT NULL,
  `company` varchar(20) DEFAULT NULL,
  `level` varchar(20) DEFAULT NULL,
  `month_1` varchar(20) DEFAULT NULL,
  `month_3` varchar(20) DEFAULT NULL,
  `month_6` varchar(20) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8

The results

Console output:

Database query:

Conclusion

I settled on the topic and started writing on March 6, and finished on March 14. Along the way I learned that doing the development is easier than describing it. This article goes from analyzing the website to developing the crawler and storing the data, with some knowledge of dynamic loading woven in, covering the whole process of building a crawler. I hope it gives you some inspiration, and I look forward to our next encounter.



Everything I write comes from my own day-to-day practice, explained from a 0-to-1 perspective, to make sure everyone can really understand it.

Articles are published first on my official account [the road from beginner to giving up]; I look forward to your follow.