Task requirements

1. Request 250,000 URLs and determine whether each target URL returns a normal page
2. Store the URLs of the normal pages in a database

Ideas for completing the task

1. Thread pools

Define the request function, set the size of the thread pool according to the CPU capacity of the host, and then submit the tasks to the thread pool; the pool then works through the URLs automatically, running as many tasks at a time as its size allows. A sketch follows this paragraph.
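A minimal sketch of this approach, assuming the requests library for HTTP; check_url and save_to_db are hypothetical names, with save_to_db standing in for the real database insert:

import concurrent.futures
import requests

def save_to_db(url):
    # Placeholder for the real database insert (an assumption, not shown in the original)
    print("normal page:", url)

def check_url(url):
    # Request one URL and return it only if the page looks normal (HTTP 200)
    try:
        if requests.get(url, timeout=5).status_code == 200:
            return url
    except requests.RequestException:
        pass
    return None

if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(250000)]  # stand-in URL list

    # For I/O-bound crawling the pool can be larger than the CPU count,
    # since threads spend most of their time waiting on the network.
    with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:
        for url in executor.map(check_url, urls):
            if url:
                save_to_db(url)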

2. Process pool

Define the request function, set the size of the process pool according to the CPU capacity of the host, and then submit the tasks to the process pool; the process pool then performs the tasks automatically based on its size, as in the sketch below.
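Structurally, only the executor class changes from the thread-pool sketch; one caveat with processes is that the submitted function must be defined at module level so the workers can import it. The shared module name below is hypothetical:

import concurrent.futures
import os

# check_url and save_to_db as defined in the thread-pool sketch; with a process
# pool they must live in an importable module (crawler_common is a made-up name).
from crawler_common import check_url, save_to_db

if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(250000)]  # stand-in URL list

    # One worker per core; chunksize batches the URLs to amortize inter-process overhead.
    with concurrent.futures.ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
        for url in executor.map(check_url, urls, chunksize=1000):
            if url:
                save_to_db(url)

For a network-bound task like this one, threads or coroutines are usually the better fit; a process pool pays off when the per-URL work is CPU-heavy.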

3. Coroutines

Problems arising

Note: all of the coroutines get scheduled, but some of them fail and are never actually reprocessed, so you need to limit the number of coroutines and use gather to process them in ordered batches.

The code could then look like the following sketch, which assumes aiohttp for the HTTP requests and reuses the hypothetical save_to_db() helper from the earlier examples:
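import asyncio
import aiohttp

async def check_url(session, url):
    # Return the URL only if it responds with a normal page (HTTP 200)
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as resp:
            if resp.status == 200:
                return url
    except (aiohttp.ClientError, asyncio.TimeoutError):
        pass
    return None

async def main(urls, batch_size=500):
    async with aiohttp.ClientSession() as session:
        # gather runs one bounded batch at a time, in order
        for start in range(0, len(urls), batch_size):
            batch = urls[start:start + batch_size]
            for url in await asyncio.gather(*(check_url(session, u) for u in batch)):
                if url:
                    save_to_db(url)  # hypothetical DB helper from the first sketch

if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(250000)]  # stand-in URL list
    asyncio.run(main(urls))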

Without such a limit, the run aborts with:

ValueError: Too many file descriptors in select(). The maximum number of files that can be opened is 1024 on Linux and 509 on Windows.

The solution

1. Limit the maximum number of concurrent coroutines (1024 on Linux, 509 on Windows)

2. Use callbacks

3. Raise the operating system's maximum number of open files; the default value can be changed in a system configuration file (on Linux, for example, the ulimit -n limit)

The first solution, controlling the concurrency of the coroutines, can be sketched with an asyncio.Semaphore; the sketch below reuses check_url and save_to_db from the earlier examples:
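import asyncio
import aiohttp

async def main(urls, limit=500):
    sem = asyncio.Semaphore(limit)  # keep in-flight requests well below the descriptor ceiling

    async def bounded(session, url):
        async with sem:  # at most `limit` sockets are ever open at once
            return await check_url(session, url)  # check_url from the batching sketch

    async with aiohttp.ClientSession() as session:
        for url in await asyncio.gather(*(bounded(session, u) for u in urls)):
            if url:
                save_to_db(url)  # hypothetical DB helper

if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(250000)]  # stand-in URL list
    asyncio.run(main(urls))

All 250,000 coroutines are created up front, but the semaphore ensures that only 500 sockets are open at any moment, which stays under the select() ceiling.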

Another solution is to create a pool, just as with threads and processes; gevent provides a pool for coroutines (greenlets):

import gevent.monkey
gevent.monkey.patch_all()  # patch the standard library first, before any socket-using imports

from gevent.pool import Pool
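A sketch of how the pool might drive the crawl, again with requests for HTTP and the hypothetical save_to_db() helper; the pool size caps how many greenlets run at once:

import gevent.monkey
gevent.monkey.patch_all()  # must run before requests is imported

from gevent.pool import Pool
import requests

def check_url(url):
    # Return the URL only if the page looks normal (HTTP 200)
    try:
        if requests.get(url, timeout=5).status_code == 200:
            return url
    except requests.RequestException:
        pass
    return None

def save_to_db(url):
    # Placeholder for the real database insert (an assumption)
    print("normal page:", url)

urls = ["http://example.com/page/%d" % i for i in range(250000)]  # stand-in URL list

pool = Pool(500)  # at most 500 greenlets run concurrently
for url in pool.imap_unordered(check_url, urls):
    if url:
        save_to_db(url)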