Follow the “Water Drops and Silver Bullets” public account to get quality technical content first-hand. Written by a senior back-end developer with 7 years of experience who explains technology in a simple, clear way.

It takes about 15 minutes to read this article.

How does Scrapy work? In the previous article we broke down the core logic of how Scrapy gets up and running, that is, what it does before it actually starts scraping.

So what are the core components of Scrapy? What is each of them responsible for? And how are they implemented internally to accomplish those functions?

Crawler

Let’s pick up where we left off last time. Last time, Scrapy’s startup ended with calling the Crawler’s crawl method, so let’s take a look at this method:

@defer.inlineCallbacks
def crawl(self, *args, **kwargs):
    assert not self.crawling, "Crawling already taking place"
    self.crawling = True
    try:
        # Find the crawler class via the SpiderLoader and instantiate the crawler
        self.spider = self._create_spider(*args, **kwargs)
        # Create the engine
        self.engine = self._create_engine()
        # Call the crawler's start_requests method to get the list of seed URLs
        start_requests = iter(self.spider.start_requests())
        # Call the engine's open_spider, passing in the crawler instance and the initial requests
        yield self.engine.open_spider(self.spider, start_requests)
        yield defer.maybeDeferred(self.engine.start)
    except Exception:
        if six.PY2:
            exc_info = sys.exc_info()
        self.crawling = False
        if self.engine is not None:
            yield self.engine.close()
        if six.PY2:
            six.reraise(*exc_info)
        raise

At this point, we see that the crawler instance is created, then the engine is created, and finally the crawler is handed over to the engine for processing.

In the last article, we also mentioned that when Crawler is instantiated, a SpiderLoader is created, which finds the location of the Crawler based on the configuration file settings.py we defined.

The SpiderLoader then scans these code files, finds every class whose parent is scrapy.Spider, and builds a {spider_name: spider_cls} mapping. So when we run scrapy crawl <spider_name>, Scrapy looks up our crawler class by this name and instantiates it by calling the _create_spider method:

def _create_spider(self, *args, **kwargs):
    # Call the from_crawler class method to instantiate the crawler
    return self.spidercls.from_crawler(self, *args, **kwargs)

The crawler is not initialized with the ordinary constructor here. Instead, it is created through the from_crawler class method, which is defined in the scrapy.Spider base class:

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = cls(*args, **kwargs)
    spider._set_crawler(crawler)
    return spider

def _set_crawler(self, crawler):
    self.crawler = crawler
    # Assign the Settings object to the spider instance
    self.settings = crawler.settings
    crawler.signals.connect(self.close, signals.spider_closed)

So we can see that this class method simply calls the constructor to create the instance and also attaches the Settings configuration to it. What does the constructor do, then?

class Spider(object_ref):
    name = None
    custom_settings = None

    def __init__(self, name=None, **kwargs):
        # name is required
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        # If start_urls is not set, default to []
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

Does this look familiar? Here are some of the most common attributes we use when writing crawlers: name, start_urls, custom_settings:

  • name: used to locate the crawler class we wrote when we run the crawler;
  • start_urls: the crawl entry points, also called seed URLs;
  • custom_settings: crawler-specific configuration that overrides the items in the configuration file.
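To make these attributes concrete, here is a minimal spider sketch (not part of the Scrapy source; the spider name, URL, and settings below are hypothetical examples, and extract_first is used so it also works on older Scrapy versions):

import scrapy

class QuotesSpider(scrapy.Spider):
    # name: how `scrapy crawl quotes` finds this class
    name = 'quotes'
    # start_urls: the seed URLs the engine schedules first
    start_urls = ['http://quotes.toscrape.com/']
    # custom_settings: overrides settings.py for this spider only
    custom_settings = {
        'CONCURRENT_REQUESTS': 8,
    }

    def parse(self, response):
        # Callback for each downloaded response from start_urls
        yield {'title': response.css('title::text').extract_first()}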

engine

After analyzing the crawler’s initialization, let’s go back to the Crawler’s crawl method, where the engine object is created next via _create_engine. So what happens when the engine is initialized?

class ExecutionEngine(object):
    """Engine"""
    def __init__(self, crawler, spider_closed_callback):
        self.crawler = crawler
        # Save the Settings configuration on the engine
        self.settings = crawler.settings
        # Signals
        self.signals = crawler.signals
        # Log formatter
        self.logformatter = crawler.logformatter
        self.slot = None
        self.spider = None
        self.running = False
        self.paused = False
        # Find the Scheduler class in the Settings
        self.scheduler_cls = load_object(self.settings['SCHEDULER'])
        # Likewise, find the Downloader class
        downloader_cls = load_object(self.settings['DOWNLOADER'])
        # Instantiate the Downloader
        self.downloader = downloader_cls(crawler)
        # The Scraper is the bridge between the engine and the crawler
        self.scraper = Scraper(crawler)
        self._spider_closed_callback = spider_closed_callback

Here the core components appear: the Scheduler, the Downloader, and the Scraper. Note that the Scheduler class is only located here (loaded from the configuration) and not instantiated yet, while the Downloader and Scraper are instantiated.

That is, the engine is at the heart of Scrapy, managing and scheduling the components to make them work together.

How are these core components initialized?

The scheduler

The scheduler is actually instantiated in the engine’s open_spider method, but let’s look at its initialization here in advance.

class Scheduler(object):
    """Scheduler"""
    def __init__(self, dupefilter, jobdir=None, dqclass=None, mqclass=None,
                 logunser=False, stats=None, pqclass=None):
        # Request fingerprint filter
        self.df = dupefilter
        # Task queue folder
        self.dqdir = self._dqdir(jobdir)
        # Priority task queue class
        self.pqclass = pqclass
        # Disk task queue class
        self.dqclass = dqclass
        # Memory task queue class
        self.mqclass = mqclass
        # Whether to log unserializable requests
        self.logunser = logunser
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        # Get the fingerprint filter class from the configuration file
        dupefilter_cls = load_object(settings['DUPEFILTER_CLASS'])
        # Instantiate the fingerprint filter
        dupefilter = dupefilter_cls.from_settings(settings)
        # Get the priority queue, disk queue, and memory queue classes from the configuration file
        pqclass = load_object(settings['SCHEDULER_PRIORITY_QUEUE'])
        dqclass = load_object(settings['SCHEDULER_DISK_QUEUE'])
        mqclass = load_object(settings['SCHEDULER_MEMORY_QUEUE'])
        # Switch for logging unserializable requests
        logunser = settings.getbool('LOG_UNSERIALIZABLE_REQUESTS', settings.getbool('SCHEDULER_DEBUG'))
        return cls(dupefilter, jobdir=job_dir(settings), logunser=logunser,
                   stats=crawler.stats, pqclass=pqclass, dqclass=dqclass, mqclass=mqclass)

As you can see, the scheduler initialization does two things:

  • Instantiate the request fingerprint filter: mainly used to filter duplicate requests;
  • Define the different task queue classes: the priority queue, the disk-based queue, and the memory-based queue.

What is the request fingerprint filter?

In the configuration file, we can see that the default fingerprint filter defined is RFPDupeFilter:

class RFPDupeFilter(BaseDupeFilter):
    """Request fingerprint filter"""
    def __init__(self, path=None, debug=False):
        self.file = None
        # The fingerprint collection is an in-memory set
        self.fingerprints = set()
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        # Request fingerprints can also be saved to disk
        if path:
            self.file = open(os.path.join(path, 'requests.seen'), 'a+')
            self.file.seek(0)
            self.fingerprints.update(x.rstrip() for x in self.file)

    @classmethod
    def from_settings(cls, settings):
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(job_dir(settings), debug)

When the request fingerprint filter is initialized, it creates a fingerprint collection backed by an in-memory set, and it can optionally persist the fingerprints to disk so they can be reused on the next run.

In other words, the fingerprint filter is responsible for filtering duplicate requests, and you can customize the filtering rules.
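As a rough illustration of that customization point, a filter subclass can be plugged in through the DUPEFILTER_CLASS setting. The class and module names below are hypothetical, and this is only a sketch of the idea, not code from Scrapy itself:

from scrapy.dupefilters import RFPDupeFilter

class URLOnlyDupeFilter(RFPDupeFilter):
    """Hypothetical filter: treat requests as duplicates based on URL alone."""

    def request_fingerprint(self, request):
        # Ignore the method, body, and headers; deduplicate purely by URL
        return request.url

# settings.py (sketch):
# DUPEFILTER_CLASS = 'myproject.dupefilters.URLOnlyDupeFilter'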

In a later article we’ll look at how each request’s fingerprint is generated and how the duplicate-filtering logic is implemented; for now it’s enough to know what this component does.

What task queues does the scheduler define?

The scheduler defines two queue types by default:

  • Disk-based task queue: if a storage path is configured in the configuration file, the queued requests are persisted to disk when the crawler exits;
  • Memory-based task queue: requests are kept only in memory, so they are gone the next time the crawler starts;

The default configuration file definition is as follows:

# Disk-based task queue (last in, first out)
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
# Memory-based task queue (last in, first out)
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'
# Priority queue
SCHEDULER_PRIORITY_QUEUE = 'queuelib.PriorityQueue'

If we define the JOBDIR configuration item in the configuration file, the task queue is saved to disk each time the crawler exits, so that the next time the crawler starts, the tasks can be reloaded and execution continues.

If this configuration item is not defined, memory queues are used by default.
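For example (a sketch; the directory name is arbitrary), persistence can be enabled either in settings.py or per run from the command line:

# settings.py
JOBDIR = 'crawls/my_spider-1'

# or, equivalently, for a single run:
# scrapy crawl my_spider -s JOBDIR=crawls/my_spider-1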

If you’re careful, you might notice that the default queue structure is LIFO (last in, first out). What does that mean?

It means that while our crawler code is running, if a newly generated request is put into the task queue, the scheduler will take that newest task out of the queue first and execute it first.

What does this implementation imply? Scrapy’s default crawl order is depth-first.

How do you change this mechanism to breadth-first crawling? This is where the scrapy.squeues module comes in; it defines various types of queues:

# FIFO disk queue (pickle serialization)
PickleFifoDiskQueue = _serializable_queue(queue.FifoDiskQueue, \
    _pickle_serialize, pickle.loads)
# LIFO disk queue (pickle serialization)
PickleLifoDiskQueue = _serializable_queue(queue.LifoDiskQueue, \
    _pickle_serialize, pickle.loads)
# FIFO disk queue (marshal serialization)
MarshalFifoDiskQueue = _serializable_queue(queue.FifoDiskQueue, \
    marshal.dumps, marshal.loads)
# LIFO disk queue (marshal serialization)
MarshalLifoDiskQueue = _serializable_queue(queue.LifoDiskQueue, \
    marshal.dumps, marshal.loads)
# FIFO memory queue
FifoMemoryQueue = queue.FifoMemoryQueue
# LIFO memory queue
LifoMemoryQueue = queue.LifoMemoryQueue

If we want to change the crawl order to breadth-first, all we have to do is switch the queue classes to their FIFO counterparts in the configuration file! As you can see, the coupling between Scrapy components is very low, and each module is customizable.
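Concretely, a breadth-first setup might look like the following in settings.py. This is a sketch assuming the default classes shown above; DEPTH_PRIORITY is an additional, related setting that makes shallower requests get scheduled first:

# settings.py (sketch): switch the scheduler to FIFO queues for breadth-first crawling
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
# Optionally prioritize requests at a shallower depth
DEPTH_PRIORITY = 1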

If you want to explore how these queues are implemented, check out the author’s scrapy/queuelib project on GitHub.

downloader

Going back to the initialization of the engine, let’s look at how the downloader is initialized.

In the default configuration file default_settings.py, the downloader is configured as follows:

DOWNLOADER = 'scrapy.core.downloader.Downloader'

Let’s look at the initialization of the Downloader class:

class Downloader(object):
    """Downloader"""
    def __init__(self, crawler):
        # Likewise, get the Settings object
        self.settings = crawler.settings
        self.signals = crawler.signals
        self.slots = {}
        self.active = set()
        # Initialize the DownloadHandlers
        self.handlers = DownloadHandlers(crawler)
        # Get the configured total concurrency
        self.total_concurrency = self.settings.getint('CONCURRENT_REQUESTS')
        # Number of concurrent requests per domain
        self.domain_concurrency = self.settings.getint('CONCURRENT_REQUESTS_PER_DOMAIN')
        # Number of concurrent requests per IP address
        self.ip_concurrency = self.settings.getint('CONCURRENT_REQUESTS_PER_IP')
        # Whether to randomize the download delay
        self.randomize_delay = self.settings.getbool('RANDOMIZE_DOWNLOAD_DELAY')
        # Initialize the downloader middleware manager
        self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
        self._slot_gc_loop = task.LoopingCall(self._slot_gc)
        self._slot_gc_loop.start(60)

In this process, the download handlers, the downloader middleware manager, and the parameters controlling request concurrency are all initialized from the configuration file.

So what do the download handlers do? And what does the downloader middleware do?

First, the DownloadHandlers:

class DownloadHandlers(object):
    """Download handlers"""
    def __init__(self, crawler):
        self._crawler = crawler
        self._schemes = {}   # Class path per scheme, used later for instantiation
        self._handlers = {}  # Instantiated download handler per scheme
        self._notconfigured = {}
        # Build the download handlers from DOWNLOAD_HANDLERS_BASE in the configuration
        # Note: the getwithbase method merges the XXX and XXX_BASE configurations
        handlers = without_none_values(
            crawler.settings.getwithbase('DOWNLOAD_HANDLERS'))
        # Store the class path for each scheme, to be instantiated later
        for scheme, clspath in six.iteritems(handlers):
            self._schemes[scheme] = clspath

        crawler.signals.connect(self._close, signals.engine_stopped)

Download handlers are configured like this in the default configuration file:

# User-definable download handlers
DOWNLOAD_HANDLERS = {}
# Default download handlers
DOWNLOAD_HANDLERS_BASE = {
    'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
    'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
    'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}

As you can see, the download handlers select the appropriate handler for the resource being downloaded based on its URL scheme; the most commonly used are the HTTP and HTTPS handlers.

Note, however, that these handlers are not instantiated here. They are only initialized when a network request of that scheme is actually made, and only once; this will be described in a later article.
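As a side note, because the merged configuration passes through without_none_values, a scheme can be disabled or replaced from settings.py. The custom class path below is hypothetical:

# settings.py (sketch)
DOWNLOAD_HANDLERS = {
    # Setting a scheme to None removes the default handler for it
    'ftp': None,
    # Or point a scheme at your own handler class (hypothetical path)
    's3': 'myproject.handlers.MyS3DownloadHandler',
}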

Next, let’s look at how the downloader middleware manager, DownloaderMiddlewareManager, is initialized. As before, it is created through the from_crawler class method. DownloaderMiddlewareManager inherits from the MiddlewareManager class, so let’s see what work that base class does during initialization:

class MiddlewareManager(object):
    """The parent class of all middleware managers, providing common middleware methods"""
    component_name = 'foo middleware'

    @classmethod
    def from_crawler(cls, crawler):
        # Delegate to from_settings
        return cls.from_settings(crawler.settings, crawler)

    @classmethod
    def from_settings(cls, settings, crawler=None):
        # Call the subclass's _get_mwlist_from_settings to get the class paths of all middlewares
        mwlist = cls._get_mwlist_from_settings(settings)
        middlewares = []
        enabled = []
        # Instantiate them in order
        for clspath in mwlist:
            try:
                # Load the middleware class
                mwcls = load_object(clspath)
                # If the middleware class defines from_crawler, instantiate it with that
                if crawler and hasattr(mwcls, 'from_crawler'):
                    mw = mwcls.from_crawler(crawler)
                # Otherwise, if it defines from_settings, instantiate it with that
                elif hasattr(mwcls, 'from_settings'):
                    mw = mwcls.from_settings(settings)
                # If neither method exists, call the constructor directly
                else:
                    mw = mwcls()
                middlewares.append(mw)
                enabled.append(clspath)
            except NotConfigured as e:
                if e.args:
                    clsname = clspath.split('.')[-1]
                    logger.warning("Disabled %(clsname)s: %(eargs)s",
                                   {'clsname': clsname, 'eargs': e.args[0]},
                                   extra={'crawler': crawler})

        logger.info("Enabled %(componentname)ss:\n%(enabledlist)s",
                    {'componentname': cls.component_name,
                     'enabledlist': pprint.pformat(enabled)},
                    extra={'crawler': crawler})
        # Call the constructor with the middleware instances
        return cls(*middlewares)

    @classmethod
    def _get_mwlist_from_settings(cls, settings):
        # Which middleware classes to load is defined by the subclass
        raise NotImplementedError

    def __init__(self, *middlewares):
        self.middlewares = middlewares
        # The registered middleware methods
        self.methods = defaultdict(list)
        for mw in middlewares:
            self._add_middleware(mw)

    def _add_middleware(self, mw):
        # Default registration; subclasses can override this
        # If the middleware class defines open_spider, register the method
        if hasattr(mw, 'open_spider'):
            self.methods['open_spider'].append(mw.open_spider)
        # If the middleware class defines close_spider, register it as well
        # self.methods holds chains of middleware methods that are called in turn
        if hasattr(mw, 'close_spider'):
            self.methods['close_spider'].insert(0, mw.close_spider)

DownloaderMiddlewareManager instantiation:

class DownloaderMiddlewareManager(MiddlewareManager):
    """Downloader middleware manager"""
    component_name = 'downloader middleware'

    @classmethod
    def _get_mwlist_from_settings(cls, settings):
        # Get all downloader middlewares from DOWNLOADER_MIDDLEWARES_BASE and DOWNLOADER_MIDDLEWARES
        return build_component_list(
            settings.getwithbase('DOWNLOADER_MIDDLEWARES'))

    def _add_middleware(self, mw):
        # Register the downloader middleware's request, response, and exception methods
        if hasattr(mw, 'process_request'):
            self.methods['process_request'].append(mw.process_request)
        if hasattr(mw, 'process_response'):
            self.methods['process_response'].insert(0, mw.process_response)
        if hasattr(mw, 'process_exception'):
            self.methods['process_exception'].insert(0, mw.process_exception)

DownloaderMiddlewareManager inherits from MiddlewareManager and overrides its _add_middleware method to register the default hooks around downloading: before the download (process_request), after the download (process_response), and on exceptions (process_exception).

Here we can stop and think: what are the benefits of handling things through middleware like this?

You can probably see it from here: when data passes from one component to another, it goes through a series of middlewares, each of which defines its own processing logic. It works like a pipeline: the input can be processed before it is handed to the next component, and after that component has run its own logic, the response passes back through the same series of middlewares, which can process the result before the final output.
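For example, a user-defined downloader middleware only needs to implement some of these hooks. The class below is a hypothetical sketch, enabled through DOWNLOADER_MIDDLEWARES in settings.py:

import random

class RandomUserAgentMiddleware(object):
    """Hypothetical middleware: rotate the User-Agent header per request."""

    USER_AGENTS = ['agent-a/1.0', 'agent-b/2.0']  # placeholder values

    def process_request(self, request, spider):
        # Called on the way out, before the request reaches the download handler
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None  # None means: continue along the middleware chain

    def process_response(self, request, response, spider):
        # Called on the way back, before the response reaches the Scraper and the spider
        return response

# settings.py (sketch):
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RandomUserAgentMiddleware': 400}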

Scraper

After the downloader is instantiated, let’s return to the engine’s initialization and look at the Scraper. As mentioned in the architecture overview article, this class does not appear in the architecture diagram, but it actually sits between the Engine, the Spiders, and the Pipeline, acting as the bridge that connects these three components.

Let’s take a look at its initialization:

class Scraper(object):

    def __init__(self, crawler):
        self.slot = None
        # Instantiate the crawler middleware manager
        self.spidermw = SpiderMiddlewareManager.from_crawler(crawler)
        # Load the Pipeline manager class from the configuration file
        itemproc_cls = load_object(crawler.settings['ITEM_PROCESSOR'])
        # Instantiate the Pipeline manager
        self.itemproc = itemproc_cls.from_crawler(crawler)
        # Get from the configuration file how many output items to process concurrently
        self.concurrent_items = crawler.settings.getint('CONCURRENT_ITEMS')
        self.crawler = crawler
        self.signals = crawler.signals
        self.logformatter = crawler.logformatter

The Scraper first creates the SpiderMiddlewareManager. Here is its initialization:

class SpiderMiddlewareManager(MiddlewareManager):
    """Crawler (spider) middleware manager"""
    component_name = 'spider middleware'

    @classmethod
    def _get_mwlist_from_settings(cls, settings):
        # Get the default crawler middlewares from SPIDER_MIDDLEWARES_BASE and SPIDER_MIDDLEWARES
        return build_component_list(settings.getwithbase('SPIDER_MIDDLEWARES'))

    def _add_middleware(self, mw):
        super(SpiderMiddlewareManager, self)._add_middleware(mw)
        # Register the crawler middleware processing methods
        if hasattr(mw, 'process_spider_input'):
            self.methods['process_spider_input'].append(mw.process_spider_input)
        if hasattr(mw, 'process_spider_output'):
            self.methods['process_spider_output'].insert(0, mw.process_spider_output)
        if hasattr(mw, 'process_spider_exception'):
            self.methods['process_spider_exception'].insert(0, mw.process_spider_exception)
        if hasattr(mw, 'process_start_requests'):
            self.methods['process_start_requests'].insert(0, mw.process_start_requests)

The crawler middleware manager is initialized much like the downloader middleware manager: it first loads the default crawler middleware classes from the configuration file and then registers their series of process methods in turn. The default crawler middleware classes defined in the configuration file are as follows:

SPIDER_MIDDLEWARES_BASE = {
    # Default crawler middleware classes
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
}

Here’s what the default crawler middleware does:

  • HttpErrorMiddleware: handles non-200 error responses;
  • OffsiteMiddleware: if allowed_domains is defined in the Spider, requests to other domains are automatically filtered out;
  • RefererMiddleware: adds the Referer header to requests;
  • UrlLengthMiddleware: filters out requests whose URL length exceeds the limit;
  • DepthMiddleware: filters out requests that exceed the configured crawl depth;

Of course, you can also define your own crawler middleware to handle your own logic.
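For instance, a crawler middleware just needs to implement one or more of the methods registered above. The class below is a hypothetical sketch, enabled through SPIDER_MIDDLEWARES in settings.py:

class DropShortTitleMiddleware(object):
    """Hypothetical spider middleware: drop scraped items whose title is too short."""

    def process_spider_output(self, response, result, spider):
        # result is the iterable of items/requests yielded by the spider callback
        for element in result:
            if isinstance(element, dict) and len(element.get('title', '')) < 3:
                continue  # silently drop this item
            yield element

# settings.py (sketch):
# SPIDER_MIDDLEWARES = {'myproject.middlewares.DropShortTitleMiddleware': 543}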

After the crawler middleware manager is initialized, the Pipeline component is initialized. The default Pipeline component is ItemPipelineManager:

class ItemPipelineManager(MiddlewareManager):

    component_name = 'item pipeline'

    @classmethod
    def _get_mwlist_from_settings(cls, settings):
        # Load the pipeline classes from ITEM_PIPELINES_BASE and ITEM_PIPELINES
        return build_component_list(settings.getwithbase('ITEM_PIPELINES'))

    def _add_middleware(self, pipe):
        super(ItemPipelineManager, self)._add_middleware(pipe)
        # Register the pipeline's process_item method
        if hasattr(pipe, 'process_item'):
            self.methods['process_item'].append(pipe.process_item)

    def process_item(self, item, spider):
        # Call the process_item method of every registered pipeline in turn
        return self._process_chain('process_item', item, spider)

ItemPipelineManager is also a subclass of the middleware manager. It behaves very much like middleware, but because of its independent functionality it is counted as one of the core components.
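A user-defined pipeline only needs a process_item method for this manager to register it. The class below is a hypothetical sketch, enabled through ITEM_PIPELINES in settings.py:

from scrapy.exceptions import DropItem

class DropEmptyTitlePipeline(object):
    """Hypothetical pipeline: discard items that have no title."""

    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem('missing title')
        return item  # passed on to the next pipeline in the chain

# settings.py (sketch):
# ITEM_PIPELINES = {'myproject.pipelines.DropEmptyTitlePipeline': 300}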

From the initialization of the Scraper, we can see that it manages the data interaction associated with Spiders and pipelines.

conclusion

At this point the engine, the downloader, the scheduler, the crawler, and the Scraper have all been initialized, and each of their submodules is responsible for its own part of the work.

These components each play their own role and coordinate with one another to complete the scraping task together. You can also see from the code that every component class is specified in the configuration file, which means we can implement our own logic and swap these components out. This kind of design is well worth learning from.

In the next article, I’ll take a look at the core of Scrapy and how the components work together to accomplish our scraping tasks.

Crawler series:

  • Scrapy source code analysis (1): architecture overview
  • Scrapy source code analysis (2): how does Scrapy run?
  • Scrapy source code analysis (3): what are the core components of Scrapy?
  • Scrapy source code analysis (4): how is a scraping task completed?
  • How to build a crawler proxy service?
  • How to build a universal vertical crawler platform?

My advanced Python series:

  • Python Advanced – How to implement a decorator?
  • Python Advanced – How to use magic methods correctly? (on)
  • Python Advanced – How to use magic methods correctly? (below)
  • Python Advanced — What is a metaclass?
  • Python Advanced – What is a Context manager?
  • Python Advancements — What is an iterator?
  • Python Advancements — How to use yield correctly?
  • Python Advanced – What is a descriptor?
  • Python Advancements – Why does GIL make multithreading so useless?

Want to read more hardcore technical articles? Follow the “Water Drops and Silver Bullets” public account to get quality technical content first-hand. Written by a senior back-end developer with 7 years of experience who explains technology in a simple, clear way.