This is day 40 of my first Text Challenge 2022. Using coroutines in a Python crawler can greatly improve the efficiency of collecting data from target sites, so we will keep revisiting this concept and apply it to a crawler case.

Definition of coroutines

With the introduction of the previous two articles, defining a coroutine should now be very simple: add the async keyword in front of a function and it becomes a coroutine function. You can verify the type directly with isinstance.

from collections.abc import Coroutine


async def func():
    print("I'm a coroutine function.")


if __name__ == '__main__':
    # create coroutine object, note that coroutine object does not run function code, that is, does not output any information
    coroutine = func()

    # Type judgment
    print(isinstance(coroutine, Coroutine))

Running it produces the following output:

True
sys:1: RuntimeWarning: coroutine 'func' was never awaited

The type check confirms that calling a function declared with the async keyword produces a Coroutine object. Ignore the warning for now; its cause is that the coroutine was created but never registered with an event loop and awaited.
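To silence the warning, the coroutine must be handed to an event loop and awaited. A minimal sketch using asyncio.run(), available since Python 3.7:

import asyncio


async def func():
    print("I'm a coroutine function.")


if __name__ == '__main__':
    # Running the coroutine in an event loop consumes it,
    # so no "never awaited" warning is emitted
    asyncio.run(func())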

Using coroutines

The target site is banan.huiben.61read.com/, a picture book website affiliated with China Children's Press and Publication Group. The site hosts a large number of children's picture book animations without advertisements, and the animations are in MP4 format, which makes them easy to download.

import asyncio

import requests


# coroutine function
async def get_html():
    res = requests.get("http://banan.huiben.61read.com/Video/List/1d4a3be3-0a72-4260-979b-743d9db8ad85")
    if res is not None:
        return res.status_code
    else:
        return None


# Declare the coroutine object
coroutine = get_html()

# event loop object
loop = asyncio.get_event_loop()

# Convert the coroutine to a task
task = loop.create_task(coroutine)
# task = asyncio.ensure_future(coroutine)  # this method also converts a coroutine to a task

# Put the task into the event loop and run it
loop.run_until_complete(task)

# Output the result
print("Result output:", task.result())

Since Python 3.7, the above code can also be modified to run the top-level entry function with the asyncio.run() method.

import asyncio
import requests


# coroutine function
async def get_html():
    res = requests.get("http://banan.huiben.61read.com/Video/List/1d4a3be3-0a72-4260-979b-743d9db8ad85")
    if res is not None:
        print(res.status_code)
    else:
        return None


async def main():
    await get_html()


asyncio.run(main())

Next, building on the code above, let's implement downloading two MP4 videos.

# http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4
# http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4
import asyncio
import time
import requests


async def requests_get(url):
    headers = {
        "Referer": "http://banan.huiben.61read.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36"
    }
    try:
        res = requests.get(url, headers=headers)
        return res
    except Exception as e:
        print(e)
        return None


async def get_video(url):
    res = await requests_get(url)
    if res is not None:
        with open(f'./mp4/{time.time()}.mp4', "wb") as f:
            f.write(res.content)


async def main():
    start_time = time.perf_counter()
    # http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4
    # http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4

    await get_video("http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4")
    await get_video("http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4")

    print("Code runtime:", time.perf_counter() - start_time)


if __name__ == '__main__':
    asyncio.run(main())

In this test, downloading the two videos took about 44 seconds; the exact time will vary with your computer and network speed.

Next, use the asyncio.create_task() function to run multiple coroutines concurrently, continuing to modify the code and optimize the execution time.

import asyncio
import time
import requests


async def requests_get(url):
    headers = {
        "Referer": "http://banan.huiben.61read.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36"
    }
    try:
        res = requests.get(url, headers=headers)
        return res
    except Exception as e:
        print(e)
        return None


async def get_video(url):
    res = await requests_get(url)
    if res is not None:
        with open(f'./mp4/{time.time()}.mp4', "wb") as f:
            f.write(res.content)


async def main():
    start_time = time.perf_counter()
    # http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4
    # http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4

    task1 = asyncio.create_task(
        get_video("http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4"))

    task2 = asyncio.create_task(
        get_video("http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4"))
    await task1
    await task2
    print("Code runtime:", time.perf_counter() - start_time)


if __name__ == '__main__':
    asyncio.run(main())

The code now runs in about 27 seconds, and you can see an improvement in efficiency.
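One caveat: requests is a blocking library, so requests.get() inside a coroutine still blocks the event loop while it runs. A minimal sketch of a non-blocking variant hands the call to the default thread pool via loop.run_in_executor(); the helper blocking_get below is illustrative and not part of the original code:

import asyncio
import time
import requests


def blocking_get(url):
    # A plain blocking request, executed in a worker thread
    return requests.get(url)


async def get_video(url):
    loop = asyncio.get_running_loop()
    # The event loop stays free while the thread pool downloads
    res = await loop.run_in_executor(None, blocking_get, url)
    if res is not None:
        with open(f'./mp4/{time.time()}.mp4', "wb") as f:
            f.write(res.content)

With that caveat noted, before formally examining the code above, let's learn the concept of awaitables.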

An object that can be used in an await statement is an awaitable object. There are three main types of awaitable objects: coroutines, tasks, and futures.

A distinction must be made in Python between coroutine functions and coroutine objects, which are the objects returned by the former.
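A minimal sketch illustrating all three awaitable types (the names here are illustrative):

import asyncio


async def nested():
    return 42


async def main():
    # 1. A coroutine object can be awaited directly
    print(await nested())

    # 2. A Task wraps a coroutine and schedules it on the loop
    task = asyncio.create_task(nested())
    print(await task)

    # 3. A Future is a low-level awaitable, normally created by libraries
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    loop.call_soon(future.set_result, 42)
    print(await future)


asyncio.run(main())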

asyncio.create_task(coro, *, name=None) creates a Task object and schedules its execution. The first parameter is a coroutine object and the second is the task name (added in Python 3.8). On Python versions before 3.7, use the asyncio.ensure_future() function instead.
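A short sketch of both ways to create a task (the coroutine work() is illustrative; the name argument requires Python 3.8+):

import asyncio


async def work():
    return "done"


async def main():
    # Preferred since Python 3.7: create_task (name= added in 3.8)
    task1 = asyncio.create_task(work(), name="worker-1")
    print(task1.get_name(), await task1)

    # The older, equivalent way: ensure_future
    task2 = asyncio.ensure_future(work())
    print(await task2)


asyncio.run(main())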

The prototype of the function for running tasks concurrently is shown below:

asyncio.gather(*aws, loop=None, return_exceptions=False) -> awaitable 

The awaitable objects in the aws sequence are run concurrently. If an awaitable in aws is a coroutine, it is automatically scheduled as a task.

Description of the return_exceptions parameter:

  1. If return_exceptions is False (the default), the first exception raised is immediately propagated to the task that awaits gather(). The other awaitables in the aws sequence are not cancelled and continue to run;
  2. If return_exceptions is True, exceptions are treated like successful results and aggregated into the result list.

If gather() itself is cancelled, all submitted (unfinished) awaitables are also cancelled.
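A minimal sketch of the two return_exceptions behaviors (the coroutines ok() and boom() are illustrative):

import asyncio


async def ok():
    return "ok"


async def boom():
    raise ValueError("boom")


async def main():
    # return_exceptions=True: the exception becomes part of the result list
    results = await asyncio.gather(ok(), boom(), return_exceptions=True)
    print(results)  # ['ok', ValueError('boom')]

    # return_exceptions=False (the default): the exception propagates here
    try:
        await asyncio.gather(ok(), boom())
    except ValueError as e:
        print("propagated:", e)


asyncio.run(main())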

The simple wait function prototype is as follows:

asyncio.wait(aws, *, loop=None, timeout=None, return_when=ALL_COMPLETED) -> coroutine 

Runs the awaitable objects specified by aws concurrently and blocks until the condition specified by return_when is met.

If an awaitable in aws is a coroutine, it is automatically scheduled as a task. Passing coroutine objects directly to wait() is deprecated.

This function returns two Task/Future collections, typically written (done, pending).

return_when specifies when this function should return. It must be one of the following constants (a short sketch follows the list):

  • FIRST_COMPLETED: the function returns when any awaitable finishes or is cancelled;
  • FIRST_EXCEPTION: the function returns when any awaitable finishes by raising an exception; if no exception is raised, it is equivalent to ALL_COMPLETED;
  • ALL_COMPLETED: the function returns when all awaitables are finished or cancelled.
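A minimal sketch of return_when in action (the coroutines fast() and slow() are illustrative):

import asyncio


async def fast():
    await asyncio.sleep(0.1)
    return "fast"


async def slow():
    await asyncio.sleep(1)
    return "slow"


async def main():
    tasks = [asyncio.create_task(fast()), asyncio.create_task(slow())]
    # Return as soon as the first task finishes
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    print(len(done), "done,", len(pending), "pending")
    # Clean up the still-running task
    for task in pending:
        task.cancel()


asyncio.run(main())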

A function similar to wait() is wait_for(), whose prototype looks like this:

asyncio.wait_for(aw, timeout, *, loop=None) -> coroutine 

Waits for the awaitable aw to complete, timing out after the specified number of timeout seconds.

Coroutines can be passed to this function; if a timeout occurs, the task is cancelled and asyncio.TimeoutError is raised.

wait() differs from wait_for() in that wait() does not cancel the awaitables when a timeout occurs.
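A minimal sketch of the timeout behavior (the coroutine eternity() is illustrative):

import asyncio


async def eternity():
    await asyncio.sleep(3600)


async def main():
    try:
        # The inner task is cancelled after 1 second
        await asyncio.wait_for(eternity(), timeout=1.0)
    except asyncio.TimeoutError:
        print("timeout!")


asyncio.run(main())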

Bind the callback function

The principle behind asynchronous I/O is to suspend the program at the point of I/O and resume it after the I/O completes. When writing a crawler, you often rely on the return value of that I/O, and that is where callbacks come in. First, implement the callback in a synchronous style.

Declare a variable in front of await and receive the return value directly:

import asyncio
import time
import requests


async def requests_get(url):
    headers = {
        "Referer": "http://banan.huiben.61read.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36"
    }
    try:
        res = requests.get(url, headers=headers)
        return res
    except Exception as e:
        print(e)
        return None


async def get_video(url):
    res = await requests_get(url)
    if res is not None:
        with open(f'./mp4/{time.time()}.mp4', "wb") as f:
            f.write(res.content)
        return (url, "success")
    else:
        return None


async def main():
    start_time = time.perf_counter()
    # http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4
    # http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4

    task1 = asyncio.create_task(
        get_video("http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4"))

    task2 = asyncio.create_task(
        get_video("http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4"))
    # Get the return values synchronously via await
    ret1 = await task1
    ret2 = await task2
    print(ret1, ret2)
    print("Code runtime:", time.perf_counter() - start_time)


if __name__ == '__main__':
    asyncio.run(main())

Implement it by adding a callback function through asyncio

The method used is add_done_callback(), which adds a callback to be run when the Task object completes. The corresponding method for removing a callback is remove_done_callback().

import asyncio
import time
import requests


async def requests_get(url):
    headers = {
        "Referer": "http://banan.huiben.61read.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36"
    }
    try:
        res = requests.get(url, headers=headers)
        return res
    except Exception as e:
        print(e)
        return None


async def get_video(url):
    res = await requests_get(url)
    if res is not None:
        with open(f'./mp4/{time.time()}.mp4', "wb") as f:
            f.write(res.content)
        return (url, "success")
    else:
        return None


async def main():
    start_time = time.perf_counter()
    # http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4
    # http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4

    task1 = asyncio.create_task(
        get_video("http://static.61read.com/flipbooks/huiben/huazhuangwuhui/web/1.mp4"))
    task1.add_done_callback(callback)

    task2 = asyncio.create_task(
        get_video("http://static.61read.com/flipbooks/huiben/jingubanghexiaofengche/web/1.mp4"))
    task2.add_done_callback(callback)
    # Wait for the tasks; the callbacks run when they complete
    await task1
    await task2
    print("Code runtime:", time.perf_counter() - start_time)


def callback(future):
    print("The callback function returns:", future.result())


if __name__ == '__main__':
    asyncio.run(main())

The crawler case for this lesson

In this tutorial, the full code can be downloaded from Codechina. The main ideas are as follows.

Step 1: Get the addresses of all the list pages. Because the data is all on one page, the acquisition method is relatively simple, and the web page can be parsed directly.

Step 2: Get the video download addresses. While searching, I found that the address of a video thumbnail follows a fixed pattern relative to the address of the video itself, as shown below:

# Video thumbnail address
http://static.61read.com/flipbooks/huiben/chudiandetouyuzei/cover.jpg
# Video address
http://static.61read.com/flipbooks/huiben/chudiandetouyuzei/web/1.mp4

Removing cover.jpg and replacing it with web/1.mp4 dramatically lowers the difficulty of getting the videos.

Step 3: Write the code to download the videos.

import asyncio
import time
import requests
from bs4 import BeautifulSoup
import lxml

BASE_URL = "http://banan.huiben.61read.com"


async def requests_get(url):
    headers = {
        "Referer": "http://banan.huiben.61read.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36"
    }
    try:
        res = requests.get(url, headers=headers)
        return res
    except Exception as e:
        print(e)
        return None


async def get_video(name, url):
    res = await requests_get(url)
    if res is not None:
        with open(f'./mp4/{name}.mp4', "wb") as f:
            f.write(res.content)
        return (name, url, "success")
    else:
        return None


async def get_list_url():
    """Get the list page addresses."""
    res = await requests_get("http://banan.huiben.61read.com/")
    soup = BeautifulSoup(res.text, "lxml")
    all_a = []
    for ul in soup.find_all(attrs={'class': 'inline'}):
        all_a.extend(BASE_URL + _['href'] for _ in ul.find_all('a'))
    return all_a


async def get_mp4_url(url):
    """Get the MP4 addresses."""
    res = await requests_get(url)
    soup = BeautifulSoup(res.text, "lxml")
    mp4s = []
    for div_tag in soup.find_all(attrs={'class': 'item_list'}):
        # Get the image thumbnail
        src = div_tag.a.img['src']
        # Replace the thumbnail address with the MP4 video address
        src = src.replace('cover.jpg', 'web/1.mp4').replace('cover.png', 'web/1.mp4')
        name = div_tag.div.a.text.strip()
        mp4s.append((src, name))

    return mp4s


async def main():
    # Task to get the list page addresses
    task_list_url = asyncio.create_task(get_list_url())
    all_a = await task_list_url
    # Create the task list
    tasks = [asyncio.ensure_future(get_mp4_url(url)) for url in all_a]
    # Add a callback function
    # ret = map(lambda x: x.add_done_callback(callback), tasks)
    # async execution
    dones, pendings = await asyncio.wait(tasks)
    all_mp4 = []
    for task in dones:
        all_mp4.extend(task.result())
    # All MP4 addresses obtained

    total = len(all_mp4)
    print("A total of [", total, "] videos collected")
    print("_" * 100)
    print("Ready to download the videos")

    # Download 10 at a time
    total_pages = total // 10 if total % 10 == 0 else total // 10 + 1
    # print(total_pages)
    for page in range(0, total_pages):
        print("Downloading videos on page {}".format(page + 1))
        start_page = 0 if page == 0 else page * 10
        end_page = (page + 1) * 10
        print("Download addresses:")
        print(all_mp4[start_page:end_page])
        mp4_download_tasks = [asyncio.ensure_future(get_video(name, url)) for url, name in all_mp4[start_page:end_page]]
        mp4_dones, mp4_pendings = await asyncio.wait(mp4_download_tasks)
        for task in mp4_dones:
            print(task.result())


if __name__ == '__main__':
    asyncio.run(main())

Final words

For the complete code, check out the comments section at the top.

Today is day 243/365 of continuous writing. I look forward to your follows, likes, comments, and favorites.