Welcome to my official account: Early Rise Python

=======

1. Foreword

Often, after we have written a crawler and met the functional requirements, we find many areas that could be improved, and one of the most important is crawl speed. This article explains, through code, how to use multiprocessing, multithreading, and coroutines to speed up crawling. Note: we will not go into theory or principles; everything is in the code.

2. Synchronization

First, let's write a simplified crawler, splitting the work into separate functions and deliberately using a functional style. The purpose of the following code is to visit the Baidu homepage 300 times and print the status code: the parse_1 function sets the number of loops and passes the URL to the parse_2 function on each iteration.

import requests

def parse_1():
    url = 'https://www.baidu.com'
    for i in range(300):
        parse_2(url)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()

The performance cost lies mainly in the IO of the requests: in single-process, single-thread mode, each request has to be waited on before the next one can start. The example code is typical serial logic.

3. Multithreading

Because the CPU runs only one thread at any given moment, multithreading improves the utilization of the process, and thus of the CPU, by overlapping the waiting time. There are many libraries that implement multithreading; here we demonstrate ThreadPoolExecutor from concurrent.futures, chosen because its code is more concise than that of the other libraries.

For ease of explanation, code lines that are newly added relative to the previous example are prefixed with a > symbol so they are easy to spot; remove the > when actually running the code.

import requests
> from concurrent.futures import ThreadPoolExecutor

def parse_1():
    url = 'https://www.baidu.com'
    # Create a thread pool
    > pool = ThreadPoolExecutor(6)
    for i in range(300):
        > pool.submit(parse_2, url)
    > pool.shutdown(wait=True)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()

The opposite of synchronous is asynchronous. Asynchronous means the tasks are independent of each other: while waiting for one thing to happen, we keep doing other things instead of blocking until that event completes. Multithreading is one way to achieve asynchrony, which also means we do not immediately know each task's result; when we do need the result, we can register a callback.

import requests
from concurrent.futures import ThreadPoolExecutor

# Add a callback function
> def callback(future):
    > print(future.result())

def parse_1():
    url = 'https://www.baidu.com'
    pool = ThreadPoolExecutor(6)
    for i in range(300):
        > results = pool.submit(parse_2, url)
        # The key step for the callback
        > results.add_done_callback(callback)
    pool.shutdown(wait=True)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()

Python's implementation of multithreading has the much-criticized GIL (Global Interpreter Lock), but multithreading is still a good fit for tasks that are mostly IO-intensive, such as crawling web pages.
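
As a side note, when we only need the results in order and do not care about callbacks, executor.map is an even more compact option. A minimal sketch, not part of the original examples; fetch here is just parse_2 renamed and changed to return the status code instead of printing it:

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # The worker simply returns the status code
    return requests.get(url).status_code

def parse_1():
    url = 'https://www.baidu.com'
    # map() yields results in the order the tasks were submitted
    with ThreadPoolExecutor(6) as pool:
        for status in pool.map(fetch, [url] * 300):
            print(status)

if __name__ == '__main__':
    parse_1()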

4. Multiprocessing

Multiprocessing can be implemented in two ways: ProcessPoolExecutor and the multiprocessing module.

1. ProcessPoolExecutor

This works much like ThreadPoolExecutor, which implemented multithreading above.

import requests
> from concurrent.futures import ProcessPoolExecutor

def parse_1():
    url = 'https://www.baidu.com'
    # Create a process pool
    > pool = ProcessPoolExecutor(6)
    for i in range(300):
        > pool.submit(parse_2, url)
    > pool.shutdown(wait=True)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()

As you can see, after changing the class name in just two places the code is still very concise. You can also add a callback function.

import requests
from concurrent.futures import ProcessPoolExecutor

> def callback(future):
    > print(future.result())

def parse_1():
    url = 'https://www.baidu.com'
    pool = ProcessPoolExecutor(6)
    for i in range(300):
        > results = pool.submit(parse_2, url)
        > results.add_done_callback(callback)
    pool.shutdown(wait=True)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()

2. multiprocessing

Look directly at the code; everything is explained in the comments.

import requests
> from multiprocessing import Pool

def parse_1():
    url = 'https://www.baidu.com'
    # Build the process pool
    > pool = Pool(processes=5)
    # List for collecting the AsyncResult objects
    > res_lst = []
    for i in range(300):
        # Add a task to the pool
        > res = pool.apply_async(func=parse_2, args=(url,))
        # The result must be fetched explicitly later, so keep a reference
        > res_lst.append(res)
    # Collect the usable results
    > good_res_lst = []
    > for res in res_lst:
        # Use get() to obtain the result of each task
        > good_res = res.get()
        # Keep only non-empty results
        > if good_res:
            > good_res_lst.append(good_res)
    # Close the pool and wait for all tasks to complete
    > pool.close()
    > pool.join()

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()
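
As a side note, apply_async also accepts a callback argument, so each result can be handled as soon as its task finishes. A minimal sketch under the following assumptions: handle_result is an illustrative name of my own, and parse_2 is changed to return the status code rather than print it, so there is something to pass to the callback:

import requests
from multiprocessing import Pool

def parse_2(url):
    # Return the status code instead of printing it
    return requests.get(url).status_code

def handle_result(status_code):
    # Runs in the parent process as soon as each task finishes
    print(status_code)

def parse_1():
    url = 'https://www.baidu.com'
    pool = Pool(processes=5)
    for i in range(300):
        pool.apply_async(parse_2, args=(url,), callback=handle_result)
    pool.close()
    pool.join()

if __name__ == '__main__':
    parse_1()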


As you can see, the multiprocessing library code is a little more verbose, but it offers more flexibility. Multiprocessing and multithreading can both speed things up, but there is an even better way…

5. Asynchronous non-blocking

Coroutines plus callbacks, together with cooperative task switching, achieve asynchronous non-blocking execution while essentially using only a single thread, so resources are used very efficiently.

The classic way to achieve asynchronous non-blocking is the asyncio library together with yield; for convenience, higher-level wrappers such as aiohttp have gradually appeared on top of it (a rough sketch of that approach is shown after the gevent example below). To really understand asynchronous non-blocking it is best to develop a deeper understanding of asyncio. Gevent is a library that makes coroutines very convenient to use, and it is what we use here.

import requests
> from gevent import monkey
# The monkey patch is the soul of coroutine execution
> monkey.patch_all()
> import gevent

def parse_1():
    url = 'https://www.baidu.com'
    # Create a task list
    > tasks_list = []
    for i in range(300):
        > task = gevent.spawn(parse_2, url)
        > tasks_list.append(task)
    > gevent.joinall(tasks_list)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()
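
For reference, the asyncio + aiohttp approach mentioned earlier looks roughly like the sketch below. It is not part of the original examples and assumes aiohttp is installed and Python 3.7+ is used (for asyncio.run); requests cannot be used here because it is blocking:

import asyncio
import aiohttp

async def parse_2(session, url):
    # Each request yields control while waiting for the response
    async with session.get(url) as response:
        print(response.status)

async def parse_1():
    url = 'https://www.baidu.com'
    async with aiohttp.ClientSession() as session:
        tasks = [parse_2(session, url) for _ in range(300)]
        await asyncio.gather(*tasks)

if __name__ == '__main__':
    asyncio.run(parse_1())

Both approaches drive all 300 requests from a single thread.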

Gevent speeds things up a lot, but it also introduces a new problem: what if we do not want to hammer the server with too much speed? With the multiprocess or multithread pool methods you can control the pool size; gevent also has a good way to control speed: create a queue. The Queue class is provided by gevent.queue, and the code below changes considerably.

import requests
from gevent import monkey
monkey.patch_all()
import gevent
> from gevent.queue import Queue

# Instantiate the queue at module level so parse_2 can also see it
> url_queue = Queue()

def parse_1():
    url = 'https://www.baidu.com'
    tasks_list = []
    for i in range(300):
        # Put every URL into the queue
        > url_queue.put_nowait(url)
    # Spawn only two greenlets to consume the queue, which limits the speed
    > for _ in range(2):
        > task = gevent.spawn(parse_2)
        > tasks_list.append(task)
    gevent.joinall(tasks_list)

# No arguments need to be passed in; everything comes from the queue
> def parse_2():
    # Loop as long as the queue is not empty
    > while not url_queue.empty():
        # Pop a URL off the queue
        > url = url_queue.get_nowait()
        response = requests.get(url)
        # Print the queue size to watch its status
        > print(url_queue.qsize(), response.status_code)

if __name__ == '__main__':
    parse_1()
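
Another way to limit gevent's concurrency, as an alternative to the queue above, is gevent.pool.Pool, which caps how many greenlets run at once. A minimal sketch, not from the original article:

from gevent import monkey
# Patch the standard library as early as possible, before importing requests
monkey.patch_all()
from gevent.pool import Pool
import requests

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

def parse_1():
    url = 'https://www.baidu.com'
    # At most 2 greenlets run concurrently
    pool = Pool(2)
    for i in range(300):
        pool.spawn(parse_2, url)
    pool.join()

if __name__ == '__main__':
    parse_1()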

Conclusion

These are just a few common ways to speed things up. If you are interested in testing the code, you can use the time module to measure the runtime. Speeding up a crawler is an important skill, but properly controlling the speed is also a good habit for crawler developers: do not put too much pressure on the server. Bye~
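
A minimal sketch of such a timing check: replace the __main__ block of any example above with the lines below and compare the numbers.

import time

if __name__ == '__main__':
    start = time.time()
    parse_1()
    # Elapsed wall-clock time in seconds
    print('Elapsed:', round(time.time() - start, 2), 'seconds')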