Hello, everyone, I am Chi Ye!

Suppose you have a file with 100,000 URLs, and you need to send an HTTP request to each URL and print the status code of the response. How do you write the code to do this as quickly as possible?
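For comparison, the naive sequential version is just a loop that sends one request at a time, so the total time grows linearly with the number of URLs; with 100,000 URLs it is far too slow. (This is only a minimal sketch for reference; the urllist.txt file name and the requests library are the same ones used in the examples below.)

import requests

# Sequential baseline: every request waits for the previous one to finish.
with open("urllist.txt") as reader:
    for url in reader:
        url = url.strip()
        try:
            res = requests.get(url)
            print(res.status_code, url)
        except requests.RequestException:
            print("error", url)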

There are many ways to write concurrent code in Python, including the threading module, thread pools via concurrent.futures, coroutines with asyncio and, of course, grequests. Each of these can meet the requirement above. Keep them handy for your future concurrent programming:

Queue + multithreading

Define a queue with a size of 400 and start 200 threads, each of which constantly fetches a URL from the queue and requests it.

The main thread reads the URLs from the file into the queue and waits until every element in the queue has been fetched and processed. The code is as follows:

from threading import Thread
from queue import Queue
import sys

import requests

concurrent = 200

def doWork():
    # Each worker thread loops forever, pulling a URL off the queue,
    # requesting it, and reporting the result.
    while True:
        url = q.get()
        status, url = getStatus(url)
        doSomethingWithResult(status, url)
        q.task_done()

def getStatus(ourl):
    try:
        res = requests.get(ourl)
        return res.status_code, ourl
    except requests.RequestException:
        return "error", ourl

def doSomethingWithResult(status, url):
    print(status, url)

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True  # daemon threads exit when the main thread exits
    t.start()

try:
    for url in open("urllist.txt"):
        q.put(url.strip())
    q.join()  # block until every queued URL has been processed
except KeyboardInterrupt:
    sys.exit(1)

The running results are as follows:

Did you learn new skills?

The thread pool

If you use thread pools, the more advanced concurrent.futures library is recommended:

import concurrent.futures

import requests

out = []
CONNECTIONS = 100
TIMEOUT = 5

urls = []
with open("urllist.txt") as reader:
    for url in reader:
        urls.append(url.strip())

def load_url(url, timeout):
    ans = requests.get(url, timeout=timeout)
    return ans.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
    # Submit every request, then handle each result as its future completes.
    future_to_url = (executor.submit(load_url, url, TIMEOUT) for url in urls)
    for future in concurrent.futures.as_completed(future_to_url):
        try:
            data = future.result()
        except Exception as exc:
            data = str(type(exc))
        finally:
            out.append(data)
            print(data)

Coroutines + aiohttp

Coroutines are also a very common tool for concurrency:

import asyncio

from aiohttp import ClientSession, ClientConnectorError

async def fetch_html(url: str, session: ClientSession, **kwargs) -> tuple:
    try:
        resp = await session.request(method="GET", url=url, **kwargs)
    except ClientConnectorError:
        return (url, 404)
    return (url, resp.status)

async def make_requests(urls: set, **kwargs) -> None:
    async with ClientSession() as session:
        tasks = []
        for url in urls:
            tasks.append(
                fetch_html(url=url, session=session, **kwargs)
            )
        # Run every request concurrently and wait for all of them to finish.
        results = await asyncio.gather(*tasks)

    for result in results:
        print(f'{result[1]} - {str(result[0])}')

if __name__ == "__main__":
    import sys
    assert sys.version_info >= (3, 7), "Script requires Python 3.7+."
    with open("urllist.txt") as infile:
        urls = set(map(str.strip, infile))
    asyncio.run(make_requests(urls=urls))

grequests[1]

grequests is a third-party library for asynchronous HTTP requests built on Requests + Gevent[2]; it currently has about 3.8K stars. Under the hood, Gevent is still coroutine-based.

Before use:

pip install grequests

It’s fairly simple to use:

import grequests

urls = []
with open("urllist.txt") as reader:
    for url in reader:
        urls.append(url.strip())

rs = (grequests.get(u) for u in urls)
for result in grequests.map(rs):
    print(result.status_code, result.url)

Note that grequests.map(rs) executes the requests concurrently. The running results are as follows:

Exception handling can also be added:

>>> def exception_handler(request, exception):
...     print("Request failed")

>>> reqs = [
...     grequests.get('http://httpbin.org/delay/1', timeout=0.001),
...     grequests.get('http://fakedomain/'),
...     grequests.get('http://httpbin.org/status/500')]
>>> grequests.map(reqs, exception_handler=exception_handler)
Request failed
Request failed
[None, None, <Response [500]>]

The last word

Some people say that asynchronous code (coroutines) performs better than multithreading. In fact, no single approach suits every scenario. In my own experiment of requesting URLs, once the number of concurrent HTTP requests exceeded 500, the coroutine version became noticeably slower. So you cannot say one is simply better than the other; you have to judge it case by case.
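If you do want to push coroutines to very high concurrency, one common mitigation is to cap the number of requests in flight. The sketch below only illustrates that idea by combining the aiohttp example above with asyncio.Semaphore; the limit of 100 is an arbitrary value you would tune for your own workload, not a recommendation.

import asyncio

from aiohttp import ClientSession, ClientConnectorError

LIMIT = 100  # arbitrary cap on in-flight requests; tune for your workload

async def fetch_html(url, session, sem):
    # The semaphore ensures at most LIMIT requests run at the same time.
    async with sem:
        try:
            resp = await session.request(method="GET", url=url)
        except ClientConnectorError:
            return (url, 404)
        return (url, resp.status)

async def make_requests(urls):
    sem = asyncio.Semaphore(LIMIT)
    async with ClientSession() as session:
        tasks = [fetch_html(url, session, sem) for url in urls]
        results = await asyncio.gather(*tasks)
    for url, status in results:
        print(f'{status} - {url}')

if __name__ == "__main__":
    with open("urllist.txt") as infile:
        urls = set(map(str.strip, infile))
    asyncio.run(make_requests(urls))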