
I. Project Overview

This project crawls all the messages addressed to leaders on the message board at liuyan.people.com.cn/home?p=0, saving the details of each message, of its reply, and of its evaluation for data analysis and further processing, which can provide a basis for government decision-making and the implementation of e-government. For the project description and environment configuration, refer to the first article, Python Crawl Message Board (I): Single Process + Selenium Emulation. This article makes one major improvement on top of the second part: multithreading is replaced with multiprocessing, with the number of concurrent processes set to 3. This is a reasonable number that keeps several processes crawling at the same time while avoiding excessive demands on memory, CPU, and network bandwidth, and it greatly reduces the overall running time.

II. Project Implementation

Because there are two common ways of dispatching tasks to the process pool, two different concrete implementations are given below.

1. Import required libraries

import csv
import os
import random
import re
import time


import dateutil.parser as dparser
from random import choice
from multiprocessing import Pool
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

This imports the processing libraries needed during the crawl and the Selenium classes that are used, including TimeoutException for the timeout handling later on.

2. Configure global variables and parameters

# Time cutoff: only messages after this date are crawled
start_date = dparser.parse('2019-06-01')
# Browser options: disable image loading to save bandwidth
chrome_options = Options()
chrome_options.add_argument('blink-settings=imagesEnabled=false')

We assume that only messages posted after 2019-06-01 need to be crawled, because earlier messages were rated automatically and have no reference value, so a time cutoff is set. The browser is also forbidden to load images, which reduces network bandwidth usage and speeds up page loading.
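
As a quick illustration of the cutoff (the date strings below are made up), each message date is parsed with dateutil and compared against start_date:

import dateutil.parser as dparser

start_date = dparser.parse('2019-06-01')

# Made-up example date strings as they might appear on the page
for datestr in ['2019-07-15', '2019-05-20']:
    date = dparser.parse(datestr, fuzzy=True)
    # Messages dated before the cutoff cause the loading loop to stop
    print(datestr, 'keep' if date >= start_date else 'stop')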

3. Generate random time and user agent

def get_time():
    """Get a random wait time between requests."""
    return round(random.uniform(3, 6), 1)


def get_user_agent():
    """Get a random user agent."""
    user_agents = [
        "Mozilla / 4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; The.net CLR 1.1.4322; The.net CLR 2.0.50727)"."Mozilla / 4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; The.net CLR 2.0.50727; Media Center PC 5.0; The.net CLR 3.0.04506)"."Mozilla / 4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; The.net CLR 1.1.4322; The.net CLR 2.0.50727)"."Mozilla / 5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)"."Mozilla / 5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident / 5.0; The.net CLR 3.5.30729; The.net CLR 3.0.30729; The.net CLR 2.0.50727; Media Center PC (6.0)"."Mozilla / 5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident / 4.0; WOW64; Trident / 4.0; SLCC2; The.net CLR 2.0.50727; The.net CLR 3.5.30729; The.net CLR 3.0.30729; The.net CLR 1.0.3705; The.net CLR 1.1.4322)"."Mozilla / 4.0 (compatible; MSIE 7.0 b; Windows NT 5.2; The.net CLR 1.1.4322; The.net CLR 2.0.50727; InfoPath.2; The.net CLR 3.0.04506.30)"."Mozilla / 5.0 (Windows; U; Windows NT 5.1; Zh-cn) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)"."Mozilla / 5.0 (X11; U; Linux; En-us) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6"."Mozilla / 5.0 (Windows; U; Windows NT 5.1; en-US; The rv: 1.8.1.2 pre) - Ninja Gecko / 20070215 K / 2.1.1"."Mozilla / 5.0 (Windows; U; Windows NT 5.1; zh-CN; The rv: 1.9) Gecko / 20080705 Firefox/Kapiko / 3.0 3.0"."Mozilla / 5.0 (X11; Linux i686; U;) Gecko / 20070322 Kazehakase / 0.4.5"."Mozilla / 5.0 (X11; U; Linux i686; en-US; The rv: 1.9.0.8) Gecko Fedora / 1.9.0.8-1. Fc10 Kazehakase / 0.5.6"."Mozilla / 5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"."Mozilla / 5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20"."Opera / 9.80 (Macintosh; Intel Mac OS X 10.6.8; U; Fr) Presto / 2.9.168 Version / 11.52"."Mozilla / 5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11"."Mozilla / 5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER"."Mozilla / 5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident / 5.0; SLCC2; The.net CLR 2.0.50727; The.net CLR 3.5.30729; The.net CLR 3.0.30729; Media Center PC 6.0; . NET4.0 C; . NET4.0 E; LBBROWSER)"."Mozilla / 4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; . NET4.0 C; . NET4.0 E; LBBROWSER)"."Mozilla / 5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER."Mozilla / 4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident / 5.0; SLCC2; The.net CLR 2.0.50727; The.net CLR 3.5.30729; The.net CLR 3.0.30729; Media Center PC 6.0; . NET4.0 C; NET4.0 E)"."Mozilla / 5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident / 5.0; SLCC2; The.net CLR 2.0.50727; The.net CLR 3.5.30729; The.net CLR 3.0.30729; Media Center PC 6.0; . NET4.0 C; . NET4.0 E; QQBrowser / 7.0.3698.400)"."Mozilla / 4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; . NET4.0 C; NET4.0 E)"."Mozilla / 4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident / 4.0; SV1; QQDownload 732; . NET4.0 C; . NET4.0 E; 360SE)"."Mozilla / 4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; . NET4.0 C; NET4.0 E)"."Mozilla / 4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident / 5.0; SLCC2; The.net CLR 2.0.50727; The.net CLR 3.5.30729; The.net CLR 3.0.30729; Media Center PC 6.0; . 
NET4.0 C; NET4.0 E)".Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1."Mozilla / 5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1"."Mozilla / 5.0 (the device; U; CPU OS 4_2_1 like Mac OS X; Zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5"."Mozilla / 5.0 (Windows NT 6.1; Win64; x64; The rv: b13pre) Gecko / 20110307 Firefox 2.0/4.0 b13pre"."Mozilla / 5.0 (X11; Ubuntu; Linux x86_64; The rv: 16.0) Gecko / 20100101 Firefox / 16.0"."Mozilla / 5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11"."Mozilla / 5.0 (X11; U; Linux x86_64; zh-CN; Rv :1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"."MQQBrowser / 26 Mozilla / 5.0 (Linux; U; Android 2.3.7. zh-cn; MB200 Build/GRJ22; Cyanogenmod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"."Mozilla / 5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1"."Mozilla / 5.0 (Linux; Android 5.1.1. Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36"."Mozilla / 5.0 (iPod; U; CPU iPhone OS 2_1 like Mac OS X; Ja-jp) AppleWebKit/525.18.1 (KHTML, like Gecko) Version/3.1.1 Mobile/5F137 Safari/525.20"."Mozilla / 5.0 (Linux; u; Android 4.2.2; zh-cn;) AppleWebKit/534.46 (KHTML,like Gecko) Version/5.1 Mobile Safari/10600.6.3 (compatible; Baiduspider / 2.0; +http://www.baidu.com/search/spider.html)"."Mozilla / 5.0 (compatible; Baiduspider / 2.0; + http://www.baidu.com/search/spider.html)"
    ]
    # Pick a random user agent from the list to act as a mock browser
    user_agent = choice(user_agents)
    return user_agent


Random wait times and random user agents (simulated browsers) are generated for visiting the pages, which reduces the chance of the server recognizing the crawler and banning it.

4. Obtain the FID of the leader

def get_fid():
    """Get all the leader fids."""
    with open('url_fid.txt', 'r') as f:
        content = f.read()
        fids = content.split()
    return fids

Each leader has an fid that distinguishes them from the others. Here the fids are collected manually and saved to a text file, which is read in when the crawl starts.
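
As a minimal sketch (the fid values below are made up), url_fid.txt simply holds whitespace-separated fids, and content.split() turns them into a list:

# url_fid.txt is assumed to hold whitespace-separated fids, e.g. (made-up values):
#   534526
#   534527
#   534528
fids = get_fid()
print(fids)  # -> ['534526', '534527', '534528']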

5. Obtain the message links

def get_detail_urls(position, list_url):
    """Get all the message links for one leader."""
    user_agent = get_user_agent()
    chrome_options.add_argument('user-agent=%s' % user_agent)
    drivertemp = webdriver.Chrome(options=chrome_options)
    drivertemp.maximize_window()
    drivertemp.get(list_url)
    time.sleep(2)
    # Load the list page in a loop by clicking "load more"
    try:
        while WebDriverWait(drivertemp, 50.2).until(EC.element_to_be_clickable((By.ID, "show_more"))):
            datestr = WebDriverWait(drivertemp, 10).until(
                lambda driver: driver.find_element_by_xpath(
                    '//*[@id="list_content"]/li[position()=last()]/h3/span')).text.strip()
            datestr = re.search(r'\d{4}-\d{2}-\d{2}', datestr).group()
            date = dparser.parse(datestr, fuzzy=True)
            print('Crawling links --', position, '--', date)
            if date < start_date:
                break
            # Simulate click-to-load
            drivertemp.find_element_by_xpath('//*[@id="show_more"]').click()
            time.sleep(get_time())
        detail_elements = drivertemp.find_elements_by_xpath('//*[@id="list_content"]/li/h2/b/a')
        # Get all the links
        for element in detail_elements:
            detail_url = element.get_attribute('href')
            yield detail_url
        drivertemp.quit()
    except TimeoutException:
        drivertemp.quit()
        # Retry on timeout; yield from lets the caller keep consuming the retried generator
        yield from get_detail_urls(position, list_url)

Following step 4, this collects the links to all the messages of the leader identified by the fid. Since the leader's message list is not displayed all at once, there is a "load more" button below the list, so the click has to be simulated: scroll down, click, wait for the next batch to load, and click again, until the bottom is reached. At the bottom the button is most likely no longer displayed, or it may fail to load because of anti-crawling measures or a poor network, so locating the element will eventually time out; exception handling with a recursive retry is added for that case. Instead of returning a whole list at once, the function uses the yield keyword to create a generator, which produces URLs at the pace at which the program consumes them and so reduces the pressure on memory.
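
A tiny sketch of the list-versus-generator difference (produce_urls and example.com are placeholders, unrelated to the real site):

def produce_urls(n):
    """Toy generator: yield URLs one at a time instead of building a full list."""
    for i in range(n):
        # Each URL is produced only when the consumer asks for the next one
        yield 'http://example.com/detail/{}'.format(i)


# The consumer can start working on the first URL immediately,
# without waiting for (or storing) all n URLs in memory
for url in produce_urls(100000):
    pass  # crawl the detail page here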

6. Obtain message details

def get_message_detail(driver, detail_url, writer, position):
    """Get the details of one message."""
    print('Crawling message --', position, '--', detail_url)
    driver.get(detail_url)
    # Skip the message if it has not been evaluated
    try:
        satis_degree = WebDriverWait(driver, 2.5).until(
            lambda driver: driver.find_element_by_class_name("sec-score_firstspan")).text.strip()
    except:
        return
    # Get all parts of the message
    message_date_temp = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[6]/h3/span")).text
    message_date = re.search(r'\d{4}-\d{2}-\d{2}', message_date_temp).group()
    message_datetime = dparser.parse(message_date, fuzzy=True)
    if message_datetime < start_date:
        return
    message_title = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_class_name("context-title-text")).text.strip()
    label_elements = WebDriverWait(driver, 2.5).until(lambda driver: driver.find_elements_by_class_name("domainType"))
    try:
        label1 = label_elements[0].text.strip()
        label2 = label_elements[1].text.strip()
    except:
        label1 = ' '
        label2 = label_elements[0].text.strip()
    message_content = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[6]/p")).text.strip()
    replier = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[8]/ul/li[1]/h3[1]/i")).text.strip()
    reply_content = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[8]/ul/li[1]/p")).text.strip()
    reply_date_temp = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[8]/ul/li[1]/h3[2]/em")).text
    reply_date = re.search(r'\d{4}-\d{2}-\d{2}', reply_date_temp).group()
    review_scores = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_elements_by_xpath("/html/body/div[8]/ul/li[2]/h4[1]/span/span/span"))
    resolve_degree = review_scores[0].text.strip()[:-1]
    handle_atti = review_scores[1].text.strip()[:-1]
    handle_speed = review_scores[2].text.strip()[:-1]
    review_content = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[8]/ul/li[2]/p")).text.strip()
    is_auto_review = 'yes' if (('Automatic praise' in review_content) or ('Default rating' in review_content)) else 'no'
    review_date_temp = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[8]/ul/li[2]/h4[2]/em")).text
    review_date = re.search(r'\d{4}-\d{2}-\d{2}', review_date_temp).group()
    # Save the row to the CSV file
    writer.writerow(
        [position, message_title, label1, label2, message_date, message_content, replier, reply_content, reply_date,
         satis_degree, resolve_degree, handle_atti, handle_speed, is_auto_review, review_content, review_date])

We only need messages that have been evaluated, so messages without an evaluation are filtered out at the start. XPath and class-name locators are then used to find the corresponding elements and extract each part of the message. Each message is saved as a row of 16 fields in the CSV file.

7. Obtain and save all messages left by leaders

def get_officer_messages(index, fid):
    """Get and save all of one leader's messages."""
    user_agent = get_user_agent()
    chrome_options.add_argument('user-agent=%s' % user_agent)
    driver = webdriver.Chrome(options=chrome_options)
    list_url = "http://liuyan.people.com.cn/threads/list?fid={}#state=4".format(fid)
    driver.get(list_url)
    try:
        position = WebDriverWait(driver, 10).until(
            lambda driver: driver.find_element_by_xpath("/html/body/div[4]/i")).text
        print(index, '-- crawling --', position)
        start_time = time.time()
        csv_name = position + '.csv'
        # If the file already exists, delete it so it can be recreated
        if os.path.exists(csv_name):
            os.remove(csv_name)
        with open(csv_name, 'a+', newline='', encoding='gb18030') as f:
            writer = csv.writer(f, dialect="excel")
            writer.writerow(
                ['Position name', 'Message title', 'Message tag 1', 'Message tag 2', 'Message date', 'Message content',
                 'Responder', 'Reply content', 'Reply date', 'Satisfaction', 'Resolution degree', 'Attitude score',
                 'Processing speed score', 'Auto positive review', 'Evaluation content', 'Evaluation date'])
            for detail_url in get_detail_urls(position, list_url):
                get_message_detail(driver, detail_url, writer, position)
                time.sleep(get_time())
        end_time = time.time()
        crawl_time = int(end_time - start_time)
        crawl_minute = crawl_time // 60
        crawl_second = crawl_time % 60
        print(position, 'crawl finished!')
        print('Time for this leader: {} minutes {} seconds.'.format(crawl_minute, crawl_second))
        driver.quit()
        time.sleep(5)
    except:
        driver.quit()
        get_officer_messages(index, fid)

This obtains the leader's position title, creates a separate CSV file to hold that leader's messages, adds a recursive retry as exception handling, calls get_message_detail() to obtain and save the details of each message, and measures how long each leader takes to crawl.
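
One caveat of retrying through recursion inside the except block is that repeated failures can recurse without bound. A minimal alternative sketch, assuming the recursive call inside get_officer_messages were removed and the exception re-raised instead, would cap the number of attempts:

def get_officer_messages_with_retry(index, fid, max_retries=3):
    """Crawl one leader, retrying at most max_retries times instead of recursing without limit."""
    for attempt in range(1, max_retries + 1):
        try:
            get_officer_messages(index, fid)
            return
        except Exception as e:
            print('Attempt', attempt, 'for fid', fid, 'failed:', e)
    print('Giving up on fid', fid, 'after', max_retries, 'attempts')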

8. Merge files

def merge_csv():
    """Merge all the per-leader files."""
    file_list = os.listdir('.')
    csv_list = []
    for file in file_list:
        # Collect the per-leader CSV files, excluding the merged output itself
        if file.endswith('.csv') and file != 'DATA.csv':
            csv_list.append(file)
    # If the merged file already exists, delete it so it can be recreated
    if os.path.exists('DATA.csv'):
        os.remove('DATA.csv')
    with open('DATA.csv', 'a+', newline='', encoding='gb18030') as f:
        writer = csv.writer(f, dialect="excel")
        writer.writerow(
            ['Position name', 'Message title', 'Message tag 1', 'Message tag 2', 'Message date', 'Message content',
             'Responder', 'Reply content', 'Reply date', 'Satisfaction', 'Resolution degree', 'Attitude score',
             'Processing speed score', 'Auto positive review', 'Evaluation content', 'Evaluation date'])
        for csv_file in csv_list:
            with open(csv_file, 'r', encoding='gb18030') as csv_f:
                reader = csv.reader(csv_f)
                line_count = 0
                for line in reader:
                    line_count += 1
                    if line_count != 1:
                        writer.writerow(
                            (line[0], line[1], line[2], line[3], line[4], line[5], line[6], line[7], line[8],
                             line[9], line[10], line[11], line[12], line[13], line[14], line[15]))

Merges all leaders’ data from the crawl.
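
As an aside, if pandas is available, the same merge could be sketched in a few lines; this is only an alternative to the csv-module approach above, not what the project uses:

import glob

import pandas as pd

# Read every per-leader CSV (assumed to be gb18030-encoded) and concatenate them
frames = [pd.read_csv(name, encoding='gb18030')
          for name in glob.glob('*.csv') if name != 'DATA.csv']
pd.concat(frames, ignore_index=True).to_csv('DATA.csv', index=False, encoding='gb18030')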

9. Main function call

The multiprocessing is mainly implemented in this part, and there are two ways to do it:

  • A for loop iterates over all the tasks and parameters and calls apply_async() to add each task to the process pool:
def main():
    """Main function."""
    fids = get_fid()
    print('Crawler starts execution:')
    s_time = time.time()
    # Create a process pool with 3 worker processes
    pool = Pool(3)
    # Add each task to the process pool, passing in its parameters
    for index, fid in zip(range(1, len(fids) + 1), fids):
        pool.apply_async(get_officer_messages, (index, fid))
    pool.close()
    pool.join()
    print('Crawler execution completed!! ')
    print('Start composing file:')
    merge_csv()
    print('File synthesis finished!! ')
    e_time = time.time()
    c_time = int(e_time - s_time)
    c_minute = c_time // 60
    c_second = c_time % 60
    print('Total time for {} leaders: {} minutes {} seconds.'.format(len(fids), c_minute, c_second))


if __name__ == '__main__':
    # Execute the main function
    main()
  • Alternatively, calling map() adds the tasks to the process pool and maps the function onto the parameters:
def main():
    """Main function."""
    fids = get_fid()
    print('Crawler starts execution:')
    s_time = time.time()
    # Pair each fid with its index so a single iterable can be passed to map()
    itera_merge = list(zip(range(1, len(fids) + 1), fids))
    # Create a process pool with 3 worker processes
    pool = Pool(3)
    # Hand the tasks to the pool via map(); get_officer_messages_enc is presumably a small
    # wrapper that unpacks the (index, fid) tuple and calls get_officer_messages
    pool.map(get_officer_messages_enc, itera_merge)
    print('Crawler execution completed!! ')
    print('Start composing file:')
    merge_csv()
    print('File synthesis finished!! ')
    e_time = time.time()
    c_time = int(e_time - s_time)
    c_minute = c_time // 60
    c_second = c_time % 60
    print('Total time for {} leaders: {} minutes {} seconds.'.format(len(fids), c_minute, c_second))


if __name__ == '__main__':
    # Execute the main function
    main()

In the main function, all the leaders' messages are first obtained through multiple processes, then all the data files are merged to complete the whole crawl, and the total running time of the program is recorded, which makes it easy to analyze its efficiency.

III. Results, Analysis and Description

1. Description of results

The complete code and test results can be downloaded from Download.csdn.net/download/CU… Readers are welcome to test it and exchange ideas; please do not abuse it.

Compared with a single process, the whole execution is greatly shortened. I selected 10 leaders for testing, with different numbers of messages, in order to show the advantage of running the crawls concurrently. On the cloud server the running time drops to under 100 minutes, clearly shorter and more efficient than a single process, because three subprocesses execute at the same time; with multiple processes, long and short tasks complement each other, and several processes are running at any one moment. It can also be seen, however, that the multiprocess version runs somewhat longer than the multithreaded one. The difference is not large, but it may point to a performance bottleneck: processes need more resources, placing higher demands on memory, CPU, and the network, and when the device cannot keep up with what the program requires, efficiency drops. The final result is the merged DATA.csv file.

A simple comparison of multithreading and multiprocessing: a process contains at least one thread, and a thread is a finer-grained unit than a process (it needs fewer resources), which gives multithreaded programs high concurrency; each process has its own independent memory during execution, whereas the threads of one process share memory, which greatly improves the program's running efficiency; threads cannot execute independently and must depend on a process. Multithreading:

  • Threads are cheap to run (they occupy very few resources), but this makes resource management and protection harder;
  • If data needs to be shared between tasks, threads are recommended;
  • Threads suit IO-intensive tasks (web requests, reading and writing documents, etc.): I/O blocking is far slower than the CPU, so more threads can be opened, and while some threads are blocked the others keep working normally.

Multiple processes:

  • Processes are expensive to run (they occupy many resources), but this is conducive to resource management and protection;
  • Processes suit compute-intensive tasks (video decoding, scientific computation, etc.).

Obviously, multithreading should be preferred in crawlers.
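
As a hedged sketch of that preference (not part of the project code), multiprocessing.dummy provides a thread pool with the same interface as Pool, so a thread-based variant only needs a different import:

from multiprocessing.dummy import Pool as ThreadPool  # same interface as multiprocessing.Pool, but threads


def crawl(fid):
    # Stand-in for the real per-leader crawl, e.g. get_officer_messages(index, fid)
    print('crawling fid', fid)


if __name__ == '__main__':
    fids = ['534526', '534527', '534528']  # made-up fids for illustration
    pool = ThreadPool(3)
    pool.map(crawl, fids)
    pool.close()
    pool.join()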

2. Improvement analysis

(1) This version of the code does not yet crawl all the fids automatically; they still have to be saved by hand. This is one of its shortcomings and can be improved later.
(2) The message detail pages are still crawled with Selenium emulation, which lowers request efficiency; the requests library could be considered for these requests instead (see the sketch below).
(3) This version's counter-measures against anti-crawling are weak, so exceptions occur many times, such as elements on the page not being located normally or requests taking too long; further counter-measures can be added in a later version to make the code more robust.
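
For point (2), a minimal sketch of what a requests-based detail request might look like; detail_url would still come from get_detail_urls(), the returned HTML would still need parsing, and whether a plain request returns the fully rendered content depends on how the page is built, so this is an assumption rather than part of the current code:

import requests


def fetch_detail(detail_url):
    """Fetch a message detail page with requests and a random user agent instead of Selenium."""
    headers = {'User-Agent': get_user_agent()}
    resp = requests.get(detail_url, headers=headers, timeout=10)
    resp.raise_for_status()
    # The returned HTML would still have to be parsed, e.g. with lxml or BeautifulSoup
    return resp.text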

3. Description of legality

  • This project is for learning and scientific research purposes only. All readers may refer to its ideas and program code, but they must not be used for malicious or illegal purposes (maliciously attacking the website's servers, illegal profit, etc.); anyone who does so bears the responsibility themselves.
  • The data obtained in this project is intended, after further analysis, to help improve the implementation of e-government and to provide a certain reference for government decision-making; it is not grabbed maliciously to gain an unfair competitive advantage, nor used commercially for illegal gain. The code was run only with a few fids for testing rather than for large-scale crawling, and the crawl rate was strictly controlled to avoid putting pressure on the server. If the interests of the party concerned (namely the crawled website) are infringed, please get in touch so the content can be changed or deleted.
  • This article is part of the message-board crawling series, which will continue to be updated; readers are welcome to exchange ideas so it can keep improving.