We have already implemented a Scrapy micro-blog crawler. Although the crawler is asynchronous and multi-threaded, it can only run on a single host, so its crawling efficiency is limited. A distributed crawler combines multiple hosts to complete a crawling task together, which greatly improves crawling efficiency.

1. Distributed crawler architecture

Before we look at distributed crawler architecture, let’s review the architecture of Scrapy, as illustrated below.

A standalone Scrapy crawler has a local crawl queue, which is implemented with the deque module. When a new Request is generated, it is placed in the queue; the Scheduler then schedules the Request and hands it to the Downloader to perform the crawl. The simple scheduling architecture is shown in the following figure.

If two Schedulers fetch Requests from the queue at the same time, and each Scheduler has its own Downloader, what happens to the crawl efficiency, assuming sufficient bandwidth, normal crawling, and ignoring the access pressure on the queue? The crawl efficiency doubles.

In this way, the Scheduler can be scaled out, and so can the Downloader, while the crawl queue must always remain a single one, that is, a shared crawl queue. This ensures that after one Scheduler takes a Request from the queue, other Schedulers will not schedule that Request again, so multiple Schedulers can crawl in parallel. This is the basic prototype of a distributed crawler, and its simple scheduling architecture is shown in the figure below.

What we need to do is run crawler tasks on multiple hosts at the same time for cooperative crawling, and the premise of cooperative crawling is a shared crawl queue. Instead of maintaining its own crawl queue, each host takes Requests from the shared queue. However, each host still has its own Scheduler and Downloader, so scheduling and downloading are carried out locally. Ignoring the performance cost of queue access, the crawl efficiency grows in proportion to the number of hosts.

2. Maintain the crawl queue

So how is this queue maintained? The first thing to consider is performance, which naturally points to Redis, a memory-based store. It supports a variety of data structures, such as List, Set, and Sorted Set, and its access operations are also very simple.

Each of these Redis data structures has its own advantages for maintaining the crawl queue.

  • A List has lpush(), lpop(), rpush(), and rpop() methods, which we can use to implement either a FIFO (first-in, first-out) queue or a LIFO (last-in, first-out) stack as the crawl queue.

  • The elements of a Set are unordered and non-repeating, which makes it very easy to implement a randomly ordered, non-duplicating crawl queue.

  • A Sorted Set stores a score with each element, and Scrapy Requests have priorities, so we can use it to implement a queue with priority scheduling.

We need to choose different queues flexibly according to the needs of specific crawlers.
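
To make the three options concrete, here is a minimal sketch using the redis-py package against a local Redis server; the key name myspider:requests and the serialized request strings are illustrative assumptions only.

import redis

# Connect to a local Redis server (assumed to be running on the default port)
r = redis.StrictRedis(host='localhost', port=6379)

# FIFO queue with a List: push on the right, pop from the left
r.rpush('myspider:requests', 'serialized-request-1')
first_in = r.lpop('myspider:requests')

# LIFO stack with a List: push and pop from the same side
r.lpush('myspider:requests', 'serialized-request-2')
last_in = r.lpop('myspider:requests')

# Priority queue with a Sorted Set: the score carries the Request priority,
# and we pop the highest-scored member first
r.zadd('myspider:requests', {'serialized-request-3': 10})
highest = r.zrange('myspider:requests', 0, 0, desc=True)
if highest:
    r.zrem('myspider:requests', *highest)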

3. How to de-duplicate

Scrapy has automatic de-duplication, implemented with a Python set. This set records the fingerprint of each Request in Scrapy, which is essentially a hash of the Request. Take a look at Scrapy's source code, as shown below:

import hashlib
import weakref

from w3lib.url import canonicalize_url
from scrapy.utils.python import to_bytes

_fingerprint_cache = weakref.WeakKeyDictionary()

def request_fingerprint(request, include_headers=None):
    if include_headers:
        include_headers = tuple(to_bytes(h.lower())
                                for h in sorted(include_headers))
    cache = _fingerprint_cache.setdefault(request, {})
    if include_headers not in cache:
        fp = hashlib.sha1()
        fp.update(to_bytes(request.method))
        fp.update(to_bytes(canonicalize_url(request.url)))
        fp.update(request.body or b'')
        if include_headers:
            for hdr in include_headers:
                if hdr in request.headers:
                    fp.update(hdr)
                    for v in request.headers.getlist(hdr):
                        fp.update(v)
        cache[include_headers] = fp.hexdigest()
    return cache[include_headers]

request_fingerprint() is the method that calculates the Request fingerprint; internally it uses hashlib's sha1(). The fields included in the calculation are the Request's Method, URL, Body, and Headers, so if any of them differs even slightly, the result will be different. The result is a hashed string, the fingerprint. Each Request has its own fingerprint, and a fingerprint is just a string. It is much easier to determine whether two strings are the same than whether two Request objects are the same, so fingerprints can be used to decide whether a Request is a duplicate.
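
As a quick illustration, and assuming a Scrapy version in which the request_fingerprint() function shown above is importable from scrapy.utils.request, two Requests that differ only in query-parameter order receive the same fingerprint, because canonicalize_url() normalizes the URL:

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# The example URLs are illustrative; only the fingerprint comparison matters
fp1 = request_fingerprint(Request('http://www.example.com/?a=1&b=2'))
fp2 = request_fingerprint(Request('http://www.example.com/?b=2&a=1'))
fp3 = request_fingerprint(Request('http://www.example.com/?a=1&b=3'))

print(fp1 == fp2)  # True: same canonical URL, same fingerprint
print(fp1 == fp3)  # False: different query value, different fingerprint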

So how do we determine whether a Request is a duplicate? Scrapy implements it as follows:

class RFPDupeFilter:  # simplified excerpt of Scrapy's de-duplication filter

    def __init__(self):
        # Set of fingerprints of all Requests seen so far
        self.fingerprints = set()

    def request_seen(self, request):
        # Duplicate if the fingerprint already exists; otherwise record it
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)

The de-duplication class RFPDupeFilter has a request_seen() method, which takes a Request and checks whether it is a duplicate. It calls request_fingerprint() to get the fingerprint of the Request and then checks whether that fingerprint is in the fingerprints variable, a set whose elements are non-repeating. If the fingerprint exists, the method returns True, indicating that the Request is a duplicate; otherwise the fingerprint is added to the set. When the same Request arrives next time, its fingerprint is the same and is already in the set, so the Request is judged a duplicate. In this way, de-duplication is achieved.

This is how Scrapy takes advantage of the non-repeating nature of set elements to implement Request de-duplication.

For a distributed crawler, we can no longer use each crawler's own local set, because each host would then maintain its own set, which cannot be shared. If multiple hosts generate the same Request, each host can only de-duplicate it against its own set, so duplicates across hosts cannot be avoided.

To achieve de-duplication across hosts, the fingerprint set also needs to be shared. Redis happens to provide a Set data structure, so we can use a Redis Set as the fingerprint set, making the de-duplication set shared through Redis as well. After each host generates a Request, it compares the Request's fingerprint with the set: if the fingerprint already exists, the Request is a duplicate; otherwise, the fingerprint is added to the set. With the same principle and a different storage structure, we achieve distributed Request de-duplication.
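
A minimal sketch of this idea with redis-py, assuming a local Redis server; the key name myspider:dupefilter and the example fingerprint are illustrative only:

import redis

r = redis.StrictRedis(host='localhost', port=6379)

def request_seen(fingerprint):
    # SADD returns 1 if the member was newly added and 0 if it already
    # existed, so 0 means the Request is a duplicate
    added = r.sadd('myspider:dupefilter', fingerprint)
    return added == 0

print(request_seen('example-fingerprint'))  # False the first time: not seen before
print(request_seen('example-fingerprint'))  # True the second time: duplicate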

4. Prevent interruption

In Scrapy, the Request queue is kept in memory while the crawler is running. When the crawler is interrupted, that memory is freed and the queue is destroyed, so once a crawl is interrupted, running the crawler again starts an entirely new crawl.

To resume crawling after an interruption, we can save the Requests in the queue, and the next crawl can read the saved data to restore the queue of the previous run. In Scrapy, we specify a path for the crawl queue with the JOB_DIR variable, using the following command:

scrapy crawl spider -s JOB_DIR=crawls/spider

For more detailed usage, refer to the official documentation: https://doc.scrapy.org/en/latest/topics/jobs.html.

In Scrapy, this actually saves the crawl queue to local disk, and the second crawl reads it back to restore the queue. Do we need to worry about this in a distributed architecture? No, because the crawl queue itself is stored in the database: if the crawler is interrupted, the Requests still exist in the database, and the next startup continues crawling from where the last run stopped.

Therefore, when the Redis queue is empty, the crawler starts a fresh crawl; when the Redis queue is not empty, the crawler picks up where it left off.

5. Implementation of architecture

We then need to implement this architecture in our program. First, we need to implement a shared crawl queue and the de-duplication function. We also need to rewrite the Scheduler so that it takes Requests from the shared crawl queue.

Fortunately, someone has already implemented this logic and architecture and published it as a Python package called scrapy-redis. Next, we will look at the source code of scrapy-redis and how it works in detail.
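
As a preview of how it is wired in, and assuming the scrapy-redis package is installed, the typical approach is to point Scrapy at the scrapy-redis Scheduler and dupefilter in settings.py; the Redis address below is only an example.

# settings.py (sketch): delegate scheduling and de-duplication to scrapy-redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the Redis queue and fingerprint set between runs so that an
# interrupted crawl can resume (see the previous section)
SCHEDULER_PERSIST = True

# Connection to the shared Redis server (example address)
REDIS_URL = 'redis://localhost:6379'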

This content was first published on Cui Qingcai's personal blog Jingmi: Python3 Web Crawler Development Practical Tutorial.
