The workman is like water (blog.csdn.net/yanbober). No reprint without permission; please respect the author's work. Contact me privately for permission.

1 Background

As we said at the start of this series, writing a simple crawler is easy, but building an efficient and robust one is not. So far in this series we have covered the following core crawler topics:

Python3.X crawler (Static downloaders and parsers)

With the articles above, using a crawler as a handy development tool is basically covered (for example, if your boss asks you to keep a regular eye on the online user-behavior data of a feature you shipped, so that potential risks are caught early, you can write a small Python crawler that checks it in the background on a schedule). But we have not yet talked about crawling dynamic pages, about processing the crawled data, or about crawler performance — there is still a lot left to explore. In this article we start with concurrency in Python crawlers. Whoosh!

The reason we take up this topic is to fix the efficiency problem of the lxml-based image crawler from the previous article, "Python3.X crawler (static downloaders and parsers)". You will have noticed that the program there runs very slowly: crawling all those girls' photos takes a painfully long time because everything is done sequentially. On top of that, we never attempted a full-site crawl of that gallery site; a full-site crawl would be pretty frightening — if you don't believe it, use the site-inspection approach introduced in "Python3.X crawler actual battle (first climb up hi)" to check how many pages the site has to crawl, as follows:

That may not sound like much, but the slow crawl is already tiresome, so we have to find a way to fix it — and that is the subject of this article. You will need some basic knowledge of Python concurrency first; if you don't have it, have a look at the "Beauty of Python Concurrent Programming" series on Zhihu, which covers it well, or read the book "Core Python Programming".


2 Python 3.X concurrency primer

Strictly speaking this section is not necessary; it is here for completeness (note: if you already have a concurrency background, skip straight to Part 3, the concurrent crawler practice). The concepts of processes and threads, their relationship and their differences, are not language specific. If you have studied processes and threads before — in computer-science fundamentals, advanced UNIX programming in C, Java, or Android — then Python concurrency is easy to pick up; the only differences are the syntax and the API names and usage.

Python 3 builds its threads on POSIX-compliant threads (pthreads) and provides several modules for multithreaded programming, such as _thread, threading, the queue module and the concurrent.futures package. The _thread module (renamed from thread in Python 2 for compatibility) offers only basic thread and lock support; threading provides a much richer thread-management mechanism; the queue module gives us queues that multiple threads can safely share data through; and the concurrent.futures package, part of the standard library since Python 3.2, offers ThreadPoolExecutor and ProcessPoolExecutor, high-level abstractions over threading and multiprocessing that expose a unified interface for asynchronous calls.

2-1 Python 3.X _thread module

This is a largely discouraged Python concurrency module. It was called thread in earlier Python versions and was renamed _thread in Python 3, kept mainly for compatibility; its use is not recommended, for the following reasons:

  • The _thread module has only one synchronization primitive (a basic lock), which is weak; the threading module has many.
  • Everything _thread offers is also available through threading, which is built on top of it, so there is little reason to use _thread directly.
  • Daemon threads are not supported: _thread has essentially no control over how the process ends (when the main thread exits, all other threads are killed immediately, with no warning and no cleanup), while the threading module makes sure important child threads finish before the main thread exits (see the sketch after this list).
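A minimal sketch (my own illustration, not part of the original article) of the daemon behaviour just described: with threading, a thread marked daemon=True is abandoned the moment the main thread exits, while a normal thread keeps the program alive until it finishes.

import threading
import time

def worker(tag, seconds):
    # Pretend to do some work, then report completion.
    time.sleep(seconds)
    print(tag + ' worker finished')

if __name__ == '__main__':
    # The daemon worker is killed as soon as the main thread exits...
    threading.Thread(target=worker, args=('daemon', 5), daemon=True).start()
    # ...while the interpreter waits for this non-daemon worker to finish.
    threading.Thread(target=worker, args=('normal', 1)).start()
    print('main thread is done; only the normal worker will get to print')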

The bottom line: I have no appetite for wrestling with the _thread module, so for real work I shamelessly go with threading. Still, for completeness here is the classic _thread demo — this kind of multithreading code looks much the same in any language: [full source for this example: demo_thread.py]

import _thread
import time
'''
Python 3.X _thread module demo.
If you comment out self.lock.acquire() and self.lock.release() and run the code, the final
count comes out wrong (e.g. 467195) because of the race condition; with the lock kept in
place the final count is 1000000 - the lock guarantees correctness under concurrency.
The time.sleep(5) at the end works around the _thread limitation described above: without it
the main thread would exit and kill the worker threads before they finish.
'''
class ThreadTest(object):
    def __init__(self):
        self.count = 0
        self.lock = None

    def runnable(self):
        self.lock.acquire()
        print('thread ident is '+str(_thread.get_ident())+', lock acquired! ')
        for i in range(0, 100000):
            self.count += 1
        print('thread ident is ' + str(_thread.get_ident()) + ', pre lock release! ')
        self.lock.release()

    def test(self):
        self.lock = _thread.allocate_lock()
        for i in range(0, 10):
            _thread.start_new_thread(self.runnable, ())

if __name__ == '__main__':
    test = ThreadTest()
    test.test()
    print('thread is running... ')
    time.sleep(5)
    print('test finish, count is:' + str(test.count))

That was straightforward enough — nothing to write home about. Now let's look at threading.

2-2 Python 3.X threading module

The __all__ definition of the threading module is as follows:

__all__ = ['get_ident', 'active_count', 'Condition', 'current_thread', 'enumerate', 'main_thread',
           'TIMEOUT_MAX', 'Event', 'Lock', 'RLock', 'Semaphore', 'BoundedSemaphore', 'Thread',
           'Barrier', 'BrokenBarrierError', 'Timer', 'ThreadError', 'setprofile', 'settrace',
           'local', 'stack_size']

There are two ways to use the Thread class from the threading module: subclass Thread and override run(), or pass a callable directly to the Thread constructor. The demo below shows both: [full source for this example: demo_threading.py]

import threading
from threading import Thread
import time
# Python 3.X threading module demo: two ways of using the Thread class.
class NormalThread(Thread):
    '''Subclass Thread and override run(), much like implementing Runnable in Java.'''
    def __init__(self, name=None):
        Thread.__init__(self, name=name)
        self.counter = 0

    def run(self):
        print(self.getName() + ' thread is start! ')
        self.do_customer_things()
        print(self.getName() + ' thread is end! ')

    def do_customer_things(self):
        while self.counter < 10:
            time.sleep(1)
            print('do customer things counter is:'+str(self.counter))
            self.counter += 1


def loop_runner(max_counter=5):
    '''A plain function passed directly to Thread as the target.'''
    print(threading.current_thread().getName() + " thread is start!")
    cur_counter = 0
    while cur_counter < max_counter:
        time.sleep(1)
        print('loop runner current counter is:' + str(cur_counter))
        cur_counter += 1
    print(threading.current_thread().getName() + " thread is end!")


if __name__ == '__main__':
    print(threading.current_thread().getName() + " thread is start!")

    normal_thread = NormalThread("Normal Thread")
    normal_thread.start()

    loop_thread = Thread(target=loop_runner, args=(10,), name='LOOP THREAD')
    loop_thread.start()

    loop_thread.join()
    normal_thread.join()

    print(threading.current_thread().getName() + " thread is end!")

With the Thread class you can use join() to wait for child threads to finish; there are other ways too, which you can explore on your own. Either way, the usage is very close to Java threads, which is nice. Next, a simple example of lock-based synchronization:

'''
Python 3.X threading lock demo.
If you comment out self.lock.acquire() and self.lock.release() and run the code, the final
count comes out wrong (e.g. 467195) because of the race condition; with the lock kept in
place the final count is 1000000 - the lock guarantees correctness under concurrency.
'''
import threading
from threading import Thread

class LockThread(Thread):
    count = 0

    def __init__(self, name=None, lock=None):
        Thread.__init__(self, name=name)
        self.lock = lock

    def run(self):
        self.lock.acquire()
        print('thread is '+threading.current_thread().getName()+', lock acquired! ')
        for i in range(0, 100000):
            LockThread.count += 1
        print('thread is '+threading.current_thread().getName()+', pre lock release! ')
        self.lock.release()


if __name__ == '__main__':
    threads = list()
    lock = threading.Lock()
    for i in range(0, 10):
        thread = LockThread(name=str(i), lock=lock)
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()
    print('Main Thread finish, LockThread.count is:'+str(LockThread.count))

A plain Lock is enough for everyday synchronization; the other primitives listed in __all__ above (RLock, Condition, Semaphore, Event, Barrier and so on) follow similar patterns and can be picked up when you need them — a small Semaphore sketch follows below. After that, let's look at the thread-safe queues that crawlers use all the time:
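Here is the promised Semaphore sketch (my own illustration, not from the original article): a BoundedSemaphore caps how many threads do a given thing at the same time, which is handy in a crawler that must not hammer a site with too many simultaneous requests.

import threading
import time

MAX_CONCURRENT = 3
semaphore = threading.BoundedSemaphore(MAX_CONCURRENT)

def fetch(task_id):
    # At most MAX_CONCURRENT threads can be inside this block at any moment.
    with semaphore:
        print('task %d fetching...' % task_id)
        time.sleep(1)  # stand-in for a real network request
        print('task %d done' % task_id)

if __name__ == '__main__':
    threads = [threading.Thread(target=fetch, args=(i,)) for i in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()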



Python 3's queue module provides synchronized, thread-safe queue classes: the FIFO Queue, the LIFO LifoQueue and the PriorityQueue. They handle the locking internally, so they can be used directly from multiple threads and also serve for synchronization between threads. Here is a simple but classic example (the producer-consumer problem):

[demo_threading_queue.py]

from queue import Queue
from random import randint
from threading import Thread
from time import sleep
# Python 3.X threading + queue demo: a simple producer/consumer example.

class TestQueue(object):
    def __init__(self):
        self.queue = Queue(2)

    def writer(self):
        print('Producer start write to queue.')
        self.queue.put('key', block=1)
        print('Producer write to queue end. size is:'+str(self.queue.qsize()))

    def reader(self):
        value = self.queue.get(block=1)
        print('Consumer read from queue end. size is:'+str(self.queue.qsize()))

    def producer(self):
        for i in range(5):
            self.writer()
            sleep(randint(0, 3))

    def consumer(self):
        for i in range(5):
            self.reader()
            sleep(randint(2, 4))

    def go(self):
        print('TestQueue Start! ')
        threads = []
        functions = [self.consumer, self.producer]
        for func in functions:
            thread = Thread(target=func, name=func.__name__)
            thread.start()
            threads.append(thread)
        for thread in threads:
            thread.join()
        print('TestQueue Done! ')

if __name__ == '__main__':
    TestQueue().go()
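The demo above uses the plain FIFO Queue. Since crawlers often want some URLs handled before others, here is a small sketch (my own, with made-up example URLs) of the PriorityQueue mentioned earlier:

from queue import PriorityQueue

# Items are (priority, payload) tuples; lower numbers come out first.
url_queue = PriorityQueue()
url_queue.put((2, 'http://example.com/category'))
url_queue.put((1, 'http://example.com/index'))
url_queue.put((3, 'http://example.com/footer'))

while not url_queue.empty():
    priority, url = url_queue.get()
    print('crawl (priority %d): %s' % (priority, url))
    url_queue.task_done()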

That covers the common Python 3 threading facilities most relevant to crawlers. There are of course more advanced usages and helper classes we have not touched; those you accumulate over time and pick according to your needs. We won't go deeper here — doing threaded concurrency really well is a deep topic in any language and involves many subtle problems — but this is enough for ordinary business code.

2-3 Python 3.X process module

Because of CPython's Global Interpreter Lock (GIL), Python's thread concurrency is not quite like that of other languages: threads cannot fully exploit multiple CPU cores. If we want to make full use of multi-core CPU resources in Python, we have to use multiple processes. Python provides a very good multi-process package, multiprocessing, which supports child processes, inter-process communication, data sharing and other utilities.
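A minimal sketch (my own illustration, with a toy CPU-bound function) of why this matters: because of the GIL, four threads running pure-Python computation take roughly as long as running it serially, while four processes can genuinely use four cores.

import time
from threading import Thread
from multiprocessing import Process

def burn_cpu(n=10000000):
    # Pure-Python CPU-bound loop; the GIL prevents threads from running this in parallel.
    total = 0
    for i in range(n):
        total += i

def timed(label, worker_cls):
    # Run four workers of the given kind and report the wall-clock time.
    workers = [worker_cls(target=burn_cpu) for _ in range(4)]
    start = time.time()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print('%s took %.2fs' % (label, time.time() - start))

if __name__ == '__main__':
    timed('4 threads  ', Thread)
    timed('4 processes', Process)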

The Process class of multiprocessing is used in much the same way as threading's Thread class; the demo below mirrors the threading demo above:

import multiprocessing
import time
from multiprocessing import Process
# Python 3.X multiprocessing demo, mirroring the threading demo above: subclass Process and
# override run(), or pass a target callable directly to Process.
class NormalProcess(Process):
    def __init__(self, name=None):
        Process.__init__(self, name=name)
        self.counter = 0

    def run(self):
        print(self.name + ' process is start! ')
        self.do_customer_things()
        print(self.name + ' process is end! ')

    def do_customer_things(self):
        while self.counter < 10:
            time.sleep(1)
            print('do customer things counter is:'+str(self.counter))
            self.counter += 1


def loop_runner(max_counter=5):
    print(multiprocessing.current_process().name + " process is start!")
    cur_counter = 0
    while cur_counter < max_counter:
        time.sleep(1)
        print('loop runner current counter is:' + str(cur_counter))
        cur_counter += 1
    print(multiprocessing.current_process().name + " process is end!")


if __name__ == '__main__':
    print(multiprocessing.current_process().name + " process is start!")
    print("cpu count:"+str(multiprocessing.cpu_count())+", active chiled count:"+str(len(multiprocessing.active_children())))
    normal_process = NormalProcess("NORMAL PROCESS")
    normal_process.start()

    loop_process = Process(target=loop_runner, args=(10,), name='LOOP PROCESS')
    loop_process.start()

    print("cpu count:" + str(multiprocessing.cpu_count()) + ", active chiled count:" + str(len(multiprocessing.active_children())))
    normal_process.join()
    loop_process.join()
    print(multiprocessing.current_process().name + " process is end!")

As you can see, the two ways of using Process mirror those of Thread, but the semantics underneath are different: each process has its own memory space, whereas threads share memory (that is true in any language, not just Python). With that in mind we can look at Process-level locking and the explicit data-sharing mechanism processes need, as follows:

'''
Python 3.X multiprocessing lock and shared-data demo.
If you comment out self.lock.acquire() and self.lock.release() and run the code, the final
count comes out wrong (e.g. 467195) because of the race condition; with the lock kept in
place the final count is 1000000 - the lock guarantees correctness under concurrency.
'''
import multiprocessing
from multiprocessing import Process

class LockProcess(Process):
    def __init__(self, name=None, lock=None, m_count=None):
        Process.__init__(self, name=name)
        self.lock = lock
        self.m_count = m_count

    def run(self):
        self.lock.acquire()
        print('process is '+multiprocessing.current_process().name+', lock acquired! ')
        # Touching the managed value 100000 times directly would be slow (every access goes
        # through the Manager), so copy it into a local, loop, then write it back.
        count = self.m_count.value
        for i in range(0, 100000):
            count += 1
        self.m_count.value = count
        print('process is '+multiprocessing.current_process().name+', pre lock release! ')
        self.lock.release()


if __name__ == '__main__':
    processes = list()
    lock = multiprocessing.Lock()
    m_count = multiprocessing.Manager().Value('count', 0)

    for i in range(0, 10):
        process = LockProcess(name=str(i), lock=lock, m_count=m_count)
        process.start()
        processes.append(process)

    for process in processes:
        process.join()
    print('Main Process finish, LockProcess.count is:' + str(m_count.value))

As you can see, it is pretty much the same routine as threading; the differences come from the essential distinction between threads and processes, not from the API. multiprocessing also provides a Queue analogous to the one used with threading — a minimal sketch is below, and a fuller producer/consumer version is left for you to try yourself.
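Here is the minimal multiprocessing.Queue sketch just mentioned (my own, not the author's demo); unlike queue.Queue, it carries data between separate processes:

from multiprocessing import Process, Queue

def producer(q):
    for i in range(5):
        q.put('item %d' % i)
    q.put(None)  # sentinel telling the consumer to stop

def consumer(q):
    while True:
        item = q.get()
        if item is None:
            break
        print('consumed ' + item)

if __name__ == '__main__':
    q = Queue()
    p = Process(target=producer, args=(q,))
    c = Process(target=consumer, args=(q,))
    p.start()
    c.start()
    p.join()
    c.join()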

2-4 Python 3.X concurrency pool

Moving from Python's concurrent threads to concurrent processes, you will find the standard library's _thread, threading and multiprocessing modules excellent. But have you ever considered (as in other languages such as C or Java) that in a real project, creating and destroying threads or processes frequently and at scale is very expensive? That is where pooling comes in: trading space for time. Fortunately, since Python 3.2 the standard library ships the concurrent.futures module, which contains the ThreadPoolExecutor and ProcessPoolExecutor classes (their base class is the abstract Executor, which is not used directly). They are high-level abstractions over threading and multiprocessing and provide direct support for thread pools and process pools, so tasks are scheduled automatically and you no longer need to maintain your own queues or worry about deadlocks.
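Before the full demos, the smallest possible sketch (my own, with a placeholder task) of that unified interface: swap ThreadPoolExecutor for ProcessPoolExecutor and nothing else in the code changes.

from concurrent.futures import ThreadPoolExecutor  # or ProcessPoolExecutor

def square(x):
    return x * x

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() schedules the calls on the pool and yields results in input order.
        for result in pool.map(square, range(10)):
            print(result)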

Example thread pool: [demo_thread_pool_executor.py]

# Python 3.X ThreadPoolExecutor demo
import concurrent
from concurrent.futures import ThreadPoolExecutor
from urllib import request

class TestThreadPoolExecutor(object):
    def __init__(self):
        self.urls = [
            'https://www.baidu.com/', 'http://blog.jobbole.com/', 'http://www.csdn.net/', 'https://juejin.cn', 'https://www.zhihu.com/'
        ]

    def get_web_content(self, url=None):
        print('start get web content from: '+url)
        try:
            headers = {"User-Agent": "Mozilla / 5.0 (X11; Linux x86_64)"}
            req = request.Request(url, headers=headers)
            return request.urlopen(req).read().decode("utf-8")
        except BaseException as e:
            print(str(e))
            return None
        print('get web content end from: ' + str(url))

    def runner(self):
        thread_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix='DEMO')
        futures = dict()
        for url in self.urls:
            future = thread_pool.submit(self.get_web_content, url)
            futures[future] = url

        for future in concurrent.futures.as_completed(futures):
            url = futures[future]
            try:
                data = future.result()
            except Exception as e:
                print('Run thread url ('+url+') error. '+str(e))
            else:
                print(url+'Request data ok. size='+str(len(data)))
        print('Finished! ')

if __name__ == '__main__':
    TestThreadPoolExecutor().runner()
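For reference, a sketch (my own variation, not part of the demo above) of concurrent.futures.wait(), an alternative to as_completed() when you simply want to block until every submitted future has finished:

from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

def fake_fetch(url):
    # Placeholder task; the real demo downloads and decodes the page here.
    return len(url)

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(fake_fetch, u)
                   for u in ['https://www.baidu.com/', 'https://www.zhihu.com/']]
        done, not_done = wait(futures, return_when=ALL_COMPLETED)
        for future in done:
            print(future.result())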

Example process pool: [demo_process_pool_executor.py]

# Python 3.X ProcessPoolExecutor demo
import concurrent
from concurrent.futures import ProcessPoolExecutor
from urllib import request

class TestProcessPoolExecutor(object):
    def __init__(self):
        self.urls = [
            'https://www.baidu.com/', 'http://blog.jobbole.com/', 'http://www.csdn.net/', 'https://juejin.cn', 'https://www.zhihu.com/'
        ]

    def get_web_content(self, url=None):
        print('start get web content from: '+url)
        try:
            headers = {"User-Agent": "Mozilla / 5.0 (X11; Linux x86_64)"}
            req = request.Request(url, headers=headers)
            return request.urlopen(req).read().decode("utf-8")
        except BaseException as e:
            print(str(e))
            return None
        print('get web content end from: ' + str(url))

    def runner(self):
        process_pool = ProcessPoolExecutor(max_workers=4)
        futures = dict()
        for url in self.urls:
            future = process_pool.submit(self.get_web_content, url)
            futures[future] = url

        for future in concurrent.futures.as_completed(futures):
            url = futures[future]
            try:
                data = future.result()
            except Exception as e:
                print('Run process url ('+url+') error. '+str(e))
            else:
                print(url+'Request data ok. size='+str(len(data)))
        print('Finished! ')

if __name__ == '__main__':
    TestProcessPoolExecutor().runner()

Programming languages really are interoperable: once you understand one language deeply, everything else is mostly a matter of getting used to the syntax. There is still much more to learn about concurrency in Python 3 — asynchronous IO, more locking primitives and so on; pick the form of concurrency that actually fits your needs.


3 Concurrent crawler combat

With the Python 3 concurrency groundwork above, we can finally get practical. The crawlers we wrote earlier ran on a single main thread, which has two really bad consequences: if one link cannot be fetched, everything stalls and the rest of the program can only sit and watch; and no matter how powerful my machine is, the crawl is still serial and therefore slow. The two examples below are meant to kill exactly those two weaknesses.

3-1 Multithreaded crawler combat

No more talk — straight to the code. And don't expect a plain multithreading demo; we go directly to a thread pool. The crawler itself needs little explanation: read the comments below or run it yourself to understand it. [Full source for this example: spider_multithread.py]

import os
from concurrent.futures import ThreadPoolExecutor
from urllib import request
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup
# Example: one thread pool handles crawling and parsing, and a separate thread pool stores the
# parsed results. Crawls the Baidu Baike "Android" entry and linked entries, writing each
# entry's summary to the output directory under the current directory.

class CrawlThreadPool(object):
    '''Thread pool with a maximum of 5 concurrent workers for URL crawling and result parsing;
    the final crawl result is delivered via the complete_callback argument of crawl().'''
    def __init__(self):
        self.thread_pool = ThreadPoolExecutor(max_workers=5)

    def _request_parse_runnable(self, url):
        print('start get web content from: ' + url)
        try:
            headers = {"User-Agent": "Mozilla / 5.0 (X11; Linux x86_64)"}
            req = request.Request(url, headers=headers)
            content = request.urlopen(req).read().decode("utf-8")
            soup = BeautifulSoup(content, "html.parser", from_encoding='utf-8')
            new_urls = set()
            links = soup.find_all("a", href=re.compile(r"/item/\w+"))
            for link in links:
                new_urls.add(urljoin(url, link["href"]))
            data = {"url": url, "new_urls": new_urls}
            data["title"] = soup.find("dd", class_="lemmaWgt-lemmaTitle-title").find("h1").get_text()
            data["summary"] = soup.find("div", class_="lemma-summary").get_text()
        except BaseException as e:
            print(str(e))
            data = None
        return data

    def crawl(self, url, complete_callback):
        future = self.thread_pool.submit(self._request_parse_runnable, url)
        future.add_done_callback(complete_callback)


class OutPutThreadPool(object):
    '''Thread pool with a maximum of 5 concurrent workers that stores the results produced by
    the crawl/parse thread pool above.'''
    def __init__(self):
        self.thread_pool = ThreadPoolExecutor(max_workers=5)

    def _output_runnable(self, crawl_result):
        try:
            url = crawl_result['url']
            title = crawl_result['title']
            summary = crawl_result['summary']
            save_dir = 'output'
            print('start save %s as %s.txt.' % (url, title))
            if os.path.exists(save_dir) is False:
                os.makedirs(save_dir)
            save_file = save_dir + os.path.sep + title + '.txt'
            if os.path.exists(save_file):
                print('file %s is already exist! ' % title)
                return
            with open(save_file, "w") as file_input:
                file_input.write(summary)
        except Exception as e:
            print('save file error.'+str(e))

    def save(self, crawl_result):
        self.thread_pool.submit(self._output_runnable, crawl_result)


class CrawlManager(object):
    '''Crawler management class that manages the crawl/parse thread pool and the storage thread pool.'''
    def __init__(self):
        self.crawl_pool = CrawlThreadPool()
        self.output_pool = OutPutThreadPool()

    def _crawl_future_callback(self, crawl_url_future):
        try:
            data = crawl_url_future.result()
            for new_url in data['new_urls']:
                self.start_runner(new_url)
            self.output_pool.save(data)
        except Exception as e:
            print('Run crawl url future thread error. '+str(e))

    def start_runner(self, url):
        self.crawl_pool.crawl(url, self._crawl_future_callback)


if __name__ == '__main__':
    root_url = 'http://baike.baidu.com/item/Android'
    CrawlManager().start_runner(root_url)

This crawls far more efficiently than the Baidu Baike crawler from the first article in this series.

3-2 multi-process crawler combat

Again, no preamble: having seen how fast the multithreaded crawler is, we should naturally look at what the multi-process crawler can do. Same deal — no more concepts (everything needed was covered above); just roll up your sleeves and read the code and its comments: [full source for this example: spider_multiprocess.py]

import os
from concurrent.futures import ProcessPoolExecutor
from urllib import request
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup
# Example of using a process pool for crawling/parsing and for storing results: crawls the
# Baidu Baike "Android" entry and linked entries, writing each entry's summary to the output
# directory under the current directory.


class CrawlProcess(object):
    '''URL crawling and result parsing using a process pool; the final crawl result is
    delivered via the complete_callback argument of crawl().'''
    def _request_parse_runnable(self, url):
        print('start get web content from: ' + url)
        try:
            headers = {"User-Agent": "Mozilla / 5.0 (X11; Linux x86_64)"}
            req = request.Request(url, headers=headers)
            content = request.urlopen(req).read().decode("utf-8")
            soup = BeautifulSoup(content, "html.parser", from_encoding='utf-8')
            new_urls = set()
            links = soup.find_all("a", href=re.compile(r"/item/\w+"))
            for link in links:
                new_urls.add(urljoin(url, link["href"]))
            data = {"url": url, "new_urls": new_urls}
            data["title"] = soup.find("dd", class_="lemmaWgt-lemmaTitle-title").find("h1").get_text()
            data["summary"] = soup.find("div", class_="lemma-summary").get_text()
        except BaseException as e:
            print(str(e))
            data = None
        return data

    def crawl(self, url, complete_callback, process_pool):
        future = process_pool.submit(self._request_parse_runnable, url)
        future.add_done_callback(complete_callback)


class OutPutProcess(object):
    "" with the process pool for the above climb parsing process results for process pool storage; ' ' '
    def _output_runnable(self, crawl_result):
        try:
            url = crawl_result['url']
            title = crawl_result['title']
            summary = crawl_result['summary']
            save_dir = 'output'
            print('start save %s as %s.txt.' % (url, title))
            if os.path.exists(save_dir) is False:
                os.makedirs(save_dir)
            save_file = save_dir + os.path.sep + title + '.txt'
            if os.path.exists(save_file):
                print('file %s is already exist! ' % title)
                return None
            with open(save_file, "w") as file_input:
                file_input.write(summary)
        except Exception as e:
            print('save file error.'+str(e))
        return crawl_result

    def save(self, crawl_result, process_pool):
        process_pool.submit(self._output_runnable, crawl_result)


class CrawlManager(object):
    '''Crawler management class: a single process pool schedules both the crawl/parse tasks and the storage tasks.'''
    def __init__(self):
        self.crawl = CrawlProcess()
        self.output = OutPutProcess()
        self.crawl_pool = ProcessPoolExecutor(max_workers=8)
        self.crawl_deep = 100   # crawl limit: maximum number of URLs to submit (a count, not a true depth)
        self.crawl_cur_count = 0

    def _crawl_future_callback(self, crawl_url_future):
        try:
            data = crawl_url_future.result()
            self.output.save(data, self.crawl_pool)
            for new_url in data['new_urls']:
                self.start_runner(new_url)
        except Exception as e:
            print('Run crawl url future process error. '+str(e))

    def start_runner(self, url):
        if self.crawl_cur_count > self.crawl_deep:
            return
        self.crawl_cur_count += 1
        self.crawl.crawl(url, self._crawl_future_callback, self.crawl_pool)


if __name__ == '__main__':
    root_url = 'http://baike.baidu.com/item/Android'
    CrawlManager().start_runner(root_url)

The structure is the same as the thread-pool crawler; the only real change is that the work is dispatched to a process pool instead.


4 Summary of concurrent crawlers

To wrap up, we gave two concurrent crawler examples based on Python 3 thread pools and process pools, mainly to put the Python 3 concurrency groundwork above to use. They achieve the basic goal, but really mastering concurrency is not a matter of a day or two, and in a large project it is a serious discipline with a long road ahead. Still, with this groundwork we can start to feel out the basic idea of a distributed crawler — at heart it is a multi-process crawler — and we can explore Python's asynchronous IO mechanism on our own; that is the real core, and not something one or two articles can explain.

^_^ If this article helped you, feel free to scan the QR code and chip in a little so I can buy badminton shuttles (they are not cheap these days) — it is both encouragement and sharing. Thank you!


