One, a few words up front

In my daily work I need to compile statistics on the trust and asset-management products sold on a trust website. Entering the records one by one is obviously far too time-consuming, so I decided to write a crawler.

Writing in a computer language can be thought of as imitating a purpose or intention through a series of actions. So before writing any code, first be clear about what you want to do and what each of your steps would be if you were doing it yourself; then translate those steps, one by one, into code, and let the computer carry them out for you efficiently.

This article combines regular expressions with the popular BeautifulSoup (BS4) to parse and extract data from web pages, so it’s worth a brief introduction to regular expressions and BS4 before we start.

Two, basic knowledge

1. Regular expressions

You can look up the details online; here I only list the basic rules and the most common usage.

Regular expression rules:

Single characters:
.   any character except a newline
[]  any one character from the given set
\d  a digit            \D  a non-digit
\w  a digit, letter, underscore or Chinese character            \W  the opposite of \w
\s  a whitespace character            \S  a non-whitespace character

Quantifiers:
*      any number of times
+      at least once
?      zero or one time (also makes a quantifier non-greedy)
{m}    exactly m times
{m,}   at least m times
{m,n}  m to n times

Anchors:
^  begins with ...
$  ends with ...

Common combinations and functions:
.*    greedy match: any character, any number of times
.*?   non-greedy match: any character, any number of times
r = re.compile(r'regex', re.S)   the most common usage: compile the rule into a pattern object so it can be reused
re.match(pattern, string)
re.search(pattern, string)
re.findall(pattern, string)
re.sub(pattern, replacement, string)
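
As a quick illustration of the functions listed above, here is a minimal, self-contained sketch (the pattern and sample text are invented for this example and are not part of the crawler below):

import re

text = 'id:101, name:Alpha; id:202, name:Beta'

# compile the rule once so it can be reused
r = re.compile(r'id:(\d+), name:(\w+)', re.S)

print(r.findall(text))                            # [('101', 'Alpha'), ('202', 'Beta')]
print(re.search(r'name:(\w+)', text).group(1))    # Alpha
print(re.sub(r'\d+', 'N', text))                  # id:N, name:Alpha; id:N, name:Beta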

2. bs4

Again, look up the details elsewhere; here I only introduce the most common usage: select() combined with CSS selectors.

from bs4 import BeautifulSoup
soup = BeautifulSoup(response)    # response is the content returned by the page

There are mainly the following extraction rules:

1. soup.a.attrs : get all the attributes and values of the a tag under soup
2. soup.a['name'] : get the name attribute of the a tag
3. soup.a.string : get the text content of the a tag
4. soup.find_all(['a','b']) : find all a and b tags in the soup
5. soup.find_all('a', limit=5) : find the first 5 a tags

select usage (the focus here) works together with CSS selectors; the common selectors are:

Tag selector: written as the tag name itself, e.g. div is written as div
Class selector: written with a leading dot, e.g. class="you" is written as .you
Id selector: written with a leading #, e.g. id="me" is written as #me
Combined selectors: e.g. div,.you,#me
Hierarchy selectors: e.g. div .you #me selects the #me content inside the .you class inside the div tag; in div > .you > #me, the > means only the immediate child level
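
A minimal sketch of select() with these selectors (the HTML snippet is invented purely for illustration):

from bs4 import BeautifulSoup

html = '<div id="me"><p class="you">hello</p><p class="you">world</p></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.p.attrs)                        # all attributes of the first p tag: {'class': ['you']}
print(soup.select('.you')[0].text)         # class selector: hello
print(soup.select('#me > .you')[1].text)   # hierarchy selector (direct children of #me): world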

Three, hands-on: crawling for-sale trust data from a trust website

1. Before crawling: sort out the logic of the code

As mentioned earlier, before writing code you need to be clear about what your goal is and, if you were doing it yourself, what actions you would take to achieve it.

First, what is the purpose or intent? For this case, I need to get, for any range of pages, the following data about the trust products on sale: product name, issuer, issue date, product term, investment industry, front-page yield, issue location, income distribution method, issue size, lowest yield, highest yield, and interest rate grade, 12 fields in all.

Second, if a person were doing this, what actions would be needed to reach that goal? Looking at the web page, the actions are clear:

Search for / enter the site > open the page > click "Trust products" and "On sale" in the red boxes > record the information in the green box below > find that the information is incomplete, so click into the product and continue recording on its detail page (next picture).

2. Start to crawl

Since the actions are clear, we can let the computer imitate them and do the crawling.

Then there is the logic of writing code. Let’s tease out the process by working backwards, as we usually do in math.

To get the data, you must parse the response the page gives you; to parse a response, you must first receive one; to receive a response, you must send a request; to send a request, you must build one; and to build a request, you must have a URL. Reversing that gives the workflow:

Get the URL > Build the request > Send the request > Get the response > Parse the response > Get the required data > Save the data.

Following this workflow, we can first build a big frame and then add flesh and blood to it. The big frame is a main function.

It is worth noting that in this example two clicks are needed to obtain each product's information: the data on the first page is incomplete, so we click through to the detail page to get the rest. This example therefore has a two-layer data acquisition process; the first layer uses regular expressions and the second layer uses bs4.

① Define the main function

Below is the main function. You can ignore the "fill in the relevant data" block at the top for now; it belongs to the first step, getting the URL, and is filled in later.

Back to the goal: extract data from any start page to any end page, so we need a loop; inside that loop, a two-layer page-fetching framework emerges. (The URL of a second-layer page is spliced together from one field of the first-layer page, and the first-layer page returns the information of all products on that page at once, so the second-layer extraction is itself a loop that walks through the first-layer products one by one.)

# Define the main function
def main():

    # Fill in the relevant data
    url_1 = 'http://www.某信托网.com/Action/ProductAJAX.ashx?'
    url_2 = 'http://www.某信托网.com/Product/Detail.aspx?'
    size = input('Number of products per page: ')
    start_page = int(input('Start page: '))
    end_page = int(input('End page: '))
    type = input('Product type (1 for trust, 2 for asset management): ')
    items = []                       # an empty list to hold the data

    # Loop over every page
    for page in range(start_page, end_page + 1):

        # Crawling flow for the first-layer page
        print('Page {}: crawl started'.format(page))
        # 1. Splice the URL (helper function 1: joint)
        url_new = joint(url_1,size=size,page=page,type=type)

        # 2. Send the request and get the response (helper function 2: que_res)
        response = que_res(url_new)

        # 3. Parse the content and get the required data (helper function 3: parse_content_1)
        contents = parse_content_1(response)

        # 4. Sleep for 2 seconds
        time.sleep(2)

        # Crawling flow for the second-layer pages
        for content in contents:
            print('    Page {} product {}: download started'.format(page,content[0]))
            # 1. Splice the URL
            id = content[0]
            url_2_new = joint(url_2,id=id)      # joint is function 1 defined above

            # 2. Send the request and get the response
            response_2 = que_res(url_2_new)     # que_res is function 2 defined above

            # 3. Parse the content and get the required data (helper function 4: parse_content_2, which returns a dict directly)
            item = parse_content_2(response_2,content)

            # Store the data
            items.append(item)
            print('    Page {} product {}: download finished'.format(page,content[0]))
            # Sleep for 5 seconds
            time.sleep(5)

        print('Page {}: crawl finished'.format(page))

    # Save the data as a CSV file via a DataFrame
    df = pd.DataFrame(items)
    df.to_csv('data.csv',index=False,sep=',',encoding='utf-8-sig')

    print('*'*30)
    print('All crawling finished')

if __name__ == '__main__':
    main()

② Get the URL (shared by layer 1 and layer 2)

Since we need to fetch two layers of pages, we define a single function that can splice the URL for either layer.

The content and source code of the first-layer page are shown below. The second red box (X-Requested-With: XMLHttpRequest) tells us this is an AJAX GET request carrying the data in the third red box, and that data is exactly the tail of the URL in the first red box: http://www.某信托网.com/Action/ProductAJAX.ashx? plus the data in the third box. The third box contains several variable fields: pageSize (how many products are shown per page), pageIndex (the page number) and conditionStr (the product type, 1 for trust, 2 for asset management); the remaining fields are fixed (one of them, _:1544925791285, looks like a random timestamp, so I simply dropped it).
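
In other words, the first-layer request URL is just this fixed prefix plus a URL-encoded query string. A minimal sketch of that splicing, using the parameter names captured above (the values here are made up for illustration):

import urllib.parse

base = 'http://www.某信托网.com/Action/ProductAJAX.ashx?'
params = {
    'mode': 'statistics',
    'pageSize': '4',                               # products per page
    'pageIndex': '1',                              # page number
    'conditionStr': 'producttype:1|status:在售',    # 1 = trust, '在售' = on sale
    'start_released': '',
    'end_released': '',
    'orderStr': '1',
    'ascStr': 'ulup'
}
print(base + urllib.parse.urlencode(params))       # the full first-layer request URL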

The next figure shows the content and source code of the second-layer page, which is just a simple GET request with a very simple URL: http://www.某信托网.com/Product/Detail.aspx? plus an id. Where does the id come from? The answer is in the response data of the first-layer page (see the red box in the next image).

From the analysis above, the request URL of the first-layer page is a fixed part plus some data, while the second-layer URL depends on data from the first layer. We first write url_1, url_2 and the variable data into the main function (see the main function above), and then define a function, joint, that splices the URL for either layer. Since the fixed part of the first-layer URL is 47 characters long and that of the second layer is 43, a length check decides which kind of URL to splice.

# Define helper function 1, joint, used to splice the URL
def joint(url,size=None,page=None,type=None,id=None):
    if len(url) > 45:
        condition = 'producttype:' + type + '|status:在售'    # '在售' means "on sale"; the string is sent to the server as-is
        data = {
        'mode': 'statistics',
        'pageSize': size,
        'pageIndex': str(page),
        'conditionStr': condition,
        'start_released': '',
        'end_released': '',
        'orderStr': '1',
        'ascStr': 'ulup'
        }
        joint_str = urllib.parse.urlencode(data)
        url_new = url + joint_str
    else:
        data = {
            'id':id
            }
        joint_str = urllib.parse.urlencode(data)
        url_new = url + joint_str
    return url_new

③ Build the request and get the response in one go (shared by layer 1 and layer 2)

With the URL in hand, the next step is to build the request, send it, and get the response; here we define one function to do all of that. To guard against anti-crawling measures, the User-Agent is chosen at random from several candidates and a (small) proxy pool is used; I also use a LAN IP proxy on my own computer.

# Define function 2, que_res, which builds the request, sends it, and returns the response
def que_res(url):

    # Step 1 of building the request: build the headers
    USER_AGENTS = [
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
        ]
    user_agent = random.choice(USER_AGENTS)
    headers = {
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'Host': 'www.某信托网.com',
        'Referer': 'http://www.某信托网.com/Product/Index.aspx',
        'User-Agent': user_agent,
        'X-Requested-With': 'XMLHttpRequest'
        }

    # Step 2 of building the request: build the Request object
    request = urllib.request.Request(url=url, headers=headers)

    # Step 1 of sending the request: build a proxy pool
    proxy_list = [
        {'http':'125.40.29.100:8118'},
        {'http':'14.118.135.10:808'}
        ]
    proxy = random.choice(proxy_list)

    # Step 2 of sending the request: create the handler and opener
    handler = urllib.request.ProxyHandler(proxy)
    opener = urllib.request.build_opener(handler)

    # Step 3 of sending the request: send it, read the response and decode it
    response = opener.open(request).read().decode()

    # Return value
    return response

④ Parse the content of the first layer of the web page

Once the response is obtained, it is time to parse it and extract the data. The first layer is handled with regular expressions. The response looks like this:

So the following regex can be written to match, from left to right, the ID, product name, issuer, issue date, product term, investment industry and front-page yield.

# Define function 3, parse_content_1, which parses and matches the first-layer page content using regular expressions
def parse_content_1(response):

    # Write the regex that matches the required data
    re_1 = re.compile(
    r'{"ROWID".*?"ID":"(.*?)","Title":"(.*?)","producttype".*?"issuers":"(.*?)","released":"(.*?) 0:00:00","PeriodTo":(.*?),"StartPrice".*?"moneyinto":"(.*?)","EstimatedRatio1":(.*?),"status":.*?"}')
    contents = re_1.findall(response)
    return contents
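
Since the pattern contains seven capture groups, re.findall returns one tuple per product: content[0] is the ID (used later to build the detail-page URL), content[1] the product name, content[2] the issuer, content[3] the issue date, content[4] the product term, content[5] the investment industry and content[6] the front-page yield. A tiny check on an invented record whose field order simply mirrors the regex above (the real response will differ in its details; run this after the definition above, with import re in scope):

record = ('{"ROWID":1,"ID":"12345","Title":"Some Trust Plan","producttype":"1",'
          '"issuers":"Some Trust Co.","released":"2018-12-01 0:00:00","PeriodTo":24,'
          '"StartPrice":"100","moneyinto":"Real estate","EstimatedRatio1":8.5,"status":"1"}')
print(parse_content_1(record))
# [('12345', 'Some Trust Plan', 'Some Trust Co.', '2018-12-01', '24', 'Real estate', '8.5')]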

⑤ Parse the content of the second-layer page and output the data

The second layer uses bs4's select() together with CSS selectors. Besides the data already extracted in the first layer, we still need the issue location, income distribution method, issue size, lowest yield, highest yield and interest rate grade. As the figure shows, the information we need is hidden in the tr tags of the table inside the tag with id="procon1".

Since we do not need all of the information, we extract the fields one by one and output the result. The code is as follows (it uses the selector knowledge mentioned above plus a bit of string handling):

# Define function 4, parse_content_2, which parses the second-layer page content and outputs the data, using BeautifulSoup
def parse_content_2(response,content):

    # Use bs4 to extract the second-layer information
    soup = BeautifulSoup(response)

    # Extract the issue location and income distribution method, found in the 4th tr of the table under id "procon1"
    tr_3 = soup.select('#procon1 > table > tr')[3]         # select the 4th target tr
    address = tr_3.select('.pro-textcolor')[0].text        # the 1st element with class pro-textcolor in this tr (issue location)
    r_style = tr_3.select('.pro-textcolor')[1].text        # the 2nd element with class pro-textcolor in this tr (income distribution method)

    # Extract the issue size, found in the 5th tr of the same table
    tr_4 = soup.select('#procon1 > table > tr')[4]         # select the 5th target tr
    guimo = tr_4.select('.pro-textcolor')[1].text          # the 2nd pro-textcolor element in this tr (issue size: "至***万")
    re_2 = re.compile(r'.*?(\d+).*?', re.S)                # a regex that pulls out the pure digits
    scale = re_2.findall(guimo)[0]                         # the issue size as pure digits

    # Extract the yield, found in the 8th tr of the same table
    tr_7 = soup.select('#procon1 > table > tr')[7]         # select the 8th target tr
    rate = tr_7.select('.pro-textcolor')[0].text[:(-1)]    # the 1st pro-textcolor element in this tr (the [:-1] slice drops the trailing %)
    r = rate.split('至')                                   # split on '至' ("to") to get the lowest and highest yields
    r_min = r[0]
    r_max = r[1]

    # Extract the interest rate grade, found in the 12th tr
    tr_11 = soup.select('#procon1 > table > tr')[11]       # select the 12th target tr
    r_grade = tr_11.select('p')[0].text                    # the 1st p element in this tr (the interest rate grade)

    # Save the data in a dictionary
    item = {
    'Product name':content[1],
    'Issuer':content[2],
    'Issue date':content[3],
    'Product term':content[4],
    'Investment industry':content[5],
    'Front-page yield':content[6],
    'Issue location': address,
    'Income distribution': r_style,
    'Issue size': scale,
    'Lowest yield': r_min,
    'Highest yield': r_max,
    'Rate grade': r_grade
    }

    # Return the data
    return item

⑥ Save the data locally (DataFrame to a local CSV file)

    # Save the data as a CSV file via a DataFrame
    df = pd.DataFrame(items)
    df.to_csv('data.csv',index=False,sep=',',encoding='utf-8-sig')

And that is basically it. One last thing: don't just please yourself, be kind to the other side's server as well, so sleep for a few seconds in a couple of places. The complete code is below.

import urllib.request
import urllib.parse
import re
import random
from bs4 import BeautifulSoup
import pandas as pd
import time

# Define helper function 1, joint, used to splice the URL
def joint(url,size=None,page=None,type=None,id=None):
    if len(url) > 45:
        condition = 'producttype:' + type + '|status:在售'    # '在售' means "on sale"; the string is sent to the server as-is
        data = {
        'mode': 'statistics',
        'pageSize': size,
        'pageIndex': str(page),
        'conditionStr': condition,
        'start_released': '',
        'end_released': '',
        'orderStr': '1',
        'ascStr': 'ulup'
        }
        joint_str = urllib.parse.urlencode(data)
        url_new = url + joint_str
    else:
        data = {
            'id':id
            }
        joint_str = urllib.parse.urlencode(data)
        url_new = url + joint_str
    return url_new

# Define function 2, que_res, which builds the request, sends it, and returns the response
def que_res(url):

    # Step 1 of building the request: build the headers
    USER_AGENTS = [ 
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
        ]
    user_agent = random.choice(USER_AGENTS)
    headers = {
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive', 
        'Host': 'www.某信托网.com',
        'Referer': 'http://www.某信托网.com/Product/Index.aspx',
        'User-Agent': user_agent,
        'X-Requested-With': 'XMLHttpRequest'
        }

    # Step 2 of building the request: build the Request object
    request = urllib.request.Request(url=url, headers=headers)


    # Step 1 of sending the request: build a proxy pool
    proxy_list = [      
        {'http':'125.40.29.100:8118'},
        {'http':'14.118.135.10:808'}
        ]
    proxy = random.choice(proxy_list)

    # Step 2 of sending the request: create the handler and opener
    handler = urllib.request.ProxyHandler(proxy)
    opener = urllib.request.build_opener(handler)

    # Step 3 of sending the request: send it, read the response and decode it
    response = opener.open(request).read().decode()

    # Return value
    return response

# Define function 3, parse_content_1, which parses and matches the first-layer page content using regular expressions
def parse_content_1(response):

    # Write the regex that matches the required data
    re_1 = re.compile(
    r'{"ROWID".*?"ID":"(.*?)","Title":"(.*?)","producttype".*?"issuers":"(.*?)","released":"(.*?) 0:00:00","PeriodTo":(.*?),"StartPrice".*?"moneyinto":"(.*?)","EstimatedRatio1":(.*?),"status":.*?"}')
    contents = re_1.findall(response)
    return contents

# Define function 4, parse_content_2, which parses the second-layer page content and outputs the data, using BeautifulSoup
def parse_content_2(response,content):

    # Use bs4 to extract the second-layer information
    soup = BeautifulSoup(response)

    # Extract the issue location and income distribution method, found in the 4th tr of the table under id "procon1"
    tr_3 = soup.select('#procon1 > table > tr')[3]         # select the 4th target tr
    address = tr_3.select('.pro-textcolor')[0].text        # the 1st element with class pro-textcolor in this tr (issue location)
    r_style = tr_3.select('.pro-textcolor')[1].text        # the 2nd element with class pro-textcolor in this tr (income distribution method)

    # Extract the issue size, found in the 5th tr of the same table
    tr_4 = soup.select('#procon1 > table > tr')[4]         # select the 5th target tr
    guimo = tr_4.select('.pro-textcolor')[1].text          # the 2nd pro-textcolor element in this tr (issue size: "至***万")
    re_2 = re.compile(r'.*?(\d+).*?', re.S)                # a regex that pulls out the pure digits
    scale = re_2.findall(guimo)[0]                         # the issue size as pure digits

    # Extract the yield, found in the 8th tr of the same table
    tr_7 = soup.select('#procon1 > table > tr')[7]         # select the 8th target tr
    rate = tr_7.select('.pro-textcolor')[0].text[:(-1)]    # the 1st pro-textcolor element in this tr (the [:-1] slice drops the trailing %)
    r = rate.split('至')                                   # split on '至' ("to") to get the lowest and highest yields
    r_min = r[0]
    r_max = r[1]

    # Extract the interest rate grade, found in the 12th tr
    tr_11 = soup.select('#procon1 > table > tr')[11]       # select the 12th target tr
    r_grade = tr_11.select('p')[0].text                    # the 1st p element in this tr (the interest rate grade)

    # Save the data in a dictionary
    item = {
    'Product name':content[1],
    'Issuer':content[2],
    'Issue date':content[3],
    'Product term':content[4],
    'Investment industry':content[5],
    'Front-page yield':content[6],
    'Issue location': address,
    'Income distribution': r_style,
    'Issue size': scale,
    'Lowest yield': r_min,
    'Highest yield': r_max,
    'Rate grade': r_grade
    }

    # Return the data
    return item

# Define the main function
def main():

    # Fill in the relevant data
    url_1 = 'http://www.某信托网.com/Action/ProductAJAX.ashx?'
    url_2 = 'http://www.某信托网.com/Product/Detail.aspx?'
    size = input('Number of products per page: ')
    start_page = int(input('Start page: '))
    end_page = int(input('End page: '))
    type = input('Product type (1 for trust, 2 for asset management): ')
    items = []                       # an empty list to hold the data

    # Loop over every page
    for page in range(start_page, end_page + 1):

        # Crawling flow for the first-layer page
        print('Page {}: crawl started'.format(page))
        # 1. Splice the URL (helper function 1: joint)
        url_new = joint(url_1,size=size,page=page,type=type)

        # 2. Send the request and get the response (helper function 2: que_res)
        response = que_res(url_new)

        # 3. Parse the content and get the required data (helper function 3: parse_content_1)
        contents = parse_content_1(response)

        # 4. Sleep for 2 seconds
        time.sleep(2)

        # Crawling flow for the second-layer pages

        for content in contents:
            print('    Page {} product {}: download started'.format(page,content[0]))
            # 1. Splice the URL
            id = content[0]
            url_2_new = joint(url_2,id=id)      # joint is function 1 defined above

            # 2. Send the request and get the response
            response_2 = que_res(url_2_new)     # que_res is function 2 defined above

            # 3. Parse the content and get the required data (helper function 4: parse_content_2, which returns a dict directly)
            item = parse_content_2(response_2,content)

            # Store the data
            items.append(item)
            print('    Page {} product {}: download finished'.format(page,content[0]))
            # Sleep for 5 seconds
            time.sleep(5)

        print('Page {}: crawl finished'.format(page))


    # Save the data as a CSV file via a DataFrame
    df = pd.DataFrame(items)
    df.to_csv('data.csv',index=False,sep=',',encoding='utf-8-sig')

    print('*'*30)
    print('All crawling finished')

if __name__ == '__main__':
    main()

3. Crawl results

Run the code. As an example, I display 4 products per page and crawl the first 3 pages of trusts on sale; the result is as follows:

Then open the locally saved CSV file; the result looks good.

This kind of two-layer page scraping can be used in a great many places.

To get the full source code for this article, follow the official account "PyCity" and reply "trust" at the bottom of the account.
