Follow the "Water Drop and Silver Bullet" WeChat official account to get high-quality technical articles as soon as they are published. Written by a senior back-end developer with 7 years of experience who explains technology in plain language.

It takes about 15 minutes to read this article.

In the previous article, Scrapy source code analysis (1): architecture overview, we looked at Scrapy's architecture and data flow as a whole, without digging into each module. Starting with this article, I'll walk you through how Scrapy actually works.

In this article, we'll start with the most basic question: how does Scrapy get up and running?

Where does the scrapy command come from?

Once we have written a spider with Scrapy, how do we get it running? It's as simple as executing the following command:

 scrapy crawl <spider_name>

With this command, our spider really gets to work. So what exactly happens between running this command and the crawling logic being executed?

Before we begin, have you ever wondered, as I have, where the scrapy command itself comes from?

In fact, once Scrapy is installed, you can locate the command file, which is Scrapy's entry point, with the following command:

$ which scrapy
/usr/local/bin/scrapy

Open this file in an editor and you'll see that it's really just a Python script with very little code.

import re
import sys

from scrapy.cmdline import execute

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(execute())

Why does installing Scrapy produce this entry point?

The answer lies in Scrapy's installation file, setup.py, where the program's entry point is declared:

from os.path import dirname, join
from setuptools import setup, find_packages

setup(
    name='Scrapy',
    version=version,
    url='http://scrapy.org',
    ...
    entry_points={
        # The scrapy command maps to the execute method of scrapy.cmdline
        'console_scripts': ['scrapy = scrapy.cmdline:execute']
    },
    classifiers=[...],
    install_requires=[...],
)

The configuration we need to focus on is entry_points: it tells us that Scrapy starts at the execute method of cmdline.py.

In other words, when we install Scrapy, setuptools (the package management tool) generates this wrapper script and places it on an executable path, so that running the scrapy command ends up calling the execute method of cmdline.py inside the scrapy module.

And here's a tip: how do you write an executable command in Python? It's actually very simple; just imitate the idea above with the following steps:

  1. Write a script containing a main method (the first line must be a shebang specifying the Python interpreter path)
  2. Remove the .py suffix
  3. Make the file executable (chmod +x <filename>)
  4. Run the file directly by its filename

For example, create a file called mycmd, write a main method in it containing the logic you want to execute, run chmod +x mycmd to make it executable, and then run it with ./mycmd instead of invoking it through python. Isn't that easy?
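
A minimal sketch of such a script (the file name mycmd and its contents are just an example):

#!/usr/bin/env python
# mycmd: a tiny executable Python script; note there is no .py suffix
import sys

def main():
    # Put whatever logic you want to execute here
    print("arguments:", sys.argv[1:])

if __name__ == '__main__':
    sys.exit(main())

After chmod +x mycmd, running ./mycmd works just like any other command-line tool.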

Run entry (execute.py)

Now that we know Scrapy starts at the execute method of scrapy/cmdline.py, let's look at that method.

def execute(argv=None, settings=None):
    if argv is None:
        argv = sys.argv

    # Compatibility with scrapy.conf.settings from older Scrapy versions
    if settings is None and 'scrapy.conf' in sys.modules:
        from scrapy import conf
        if hasattr(conf, 'settings'):
            settings = conf.settings
    # ---------------------------------------------------------------

    # Initialize the environment, get the project configuration and return a Settings object
    if settings is None:
        settings = get_project_settings()
    # Check for deprecated configuration items
    check_deprecated_settings(settings)

    # Compatibility with scrapy.conf.settings from older Scrapy versions
    import warnings
    from scrapy.exceptions import ScrapyDeprecationWarning
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ScrapyDeprecationWarning)
        from scrapy import conf
        conf.settings = settings
    # ---------------------------------------------------------------

    # Check whether a scrapy.cfg configuration file exists for the project
    inproject = inside_project()

    # Collect all command classes into a {cmd_name: cmd_instance} dictionary
    cmds = _get_commands_dict(settings, inproject)
    # Parse which command to execute from the command line
    cmdname = _pop_command_name(argv)
    parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(),
                                   conflict_handler='resolve')
    if not cmdname:
        _print_commands(settings, inproject)
        sys.exit(0)
    elif cmdname not in cmds:
        _print_unknown_command(settings, cmdname, inproject)
        sys.exit(2)

    # Find the command instance by command name
    cmd = cmds[cmdname]
    parser.usage = "scrapy %s %s" % (cmdname, cmd.syntax())
    parser.description = cmd.long_desc()
    # Attach the project settings (with 'command' priority) to the command
    settings.setdict(cmd.default_settings, priority='command')
    cmd.settings = settings
    # Add parsing rules
    cmd.add_options(parser)
    # Parse the command-line arguments
    opts, args = parser.parse_args(args=argv[1:])
    _run_print_help(parser, cmd.process_options, args, opts)

    # Initialize a CrawlerProcess instance and attach it to the command as crawler_process
    cmd.crawler_process = CrawlerProcess(settings)
    # Execute the run method of the command instance
    _run_print_help(parser, _run_command, cmd, args, opts)
    sys.exit(cmd.exitcode)

This code is the entry point for Scrapy execution, which includes configuration initialization, command parsing, crawler loading, and running.

Now that we understand the overall entry flow, let's analyze each step in detail.

Initialize the project configuration

The first step is to initialize the configuration based on the environment. There is some code here for compatibility with older Scrapy configurations, which we can ignore. Let's focus on how the configuration is initialized. This is related to environment variables and scrapy.cfg, and ultimately a Settings instance is generated by calling the get_project_settings method.

def get_project_settings():
    # Check whether SCRAPY_SETTINGS_MODULE is configured in the environment variables
    if ENVVAR not in os.environ:
        project = os.environ.get('SCRAPY_PROJECT', 'default')
        # Find the user configuration file settings.py and set it as the
        # SCRAPY_SETTINGS_MODULE environment variable
        init_env(project)
    # Generate a Settings instance from default_settings.py
    settings = Settings()
    # Get the user configuration module
    settings_module_path = os.environ.get(ENVVAR)
    # If there is a user configuration, it overrides the defaults
    if settings_module_path:
        settings.setmodule(settings_module_path, priority='project')
    # Scrapy-related settings found in the environment variables also override the defaults
    pickled_settings = os.environ.get("SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE")
    if pickled_settings:
        settings.setdict(pickle.loads(pickled_settings), priority='project')
    env_overrides = {k[7:]: v for k, v in os.environ.items() if
                     k.startswith('SCRAPY_')}
    if env_overrides:
        settings.setdict(env_overrides, priority='project')
    return settings

During initial configuration, the default configuration file default_settings.py is loaded, with the main logic in the Settings class.

class Settings(BaseSettings):
    def __init__(self, values=None, priority='project'):
        # Call the superclass constructor
        super(Settings, self).__init__()
        # Load all settings from default_settings.py into this Settings instance
        self.setmodule(default_settings, 'default')
        # Promote dict-valued settings to BaseSettings attributes
        for name, val in six.iteritems(self):
            if isinstance(val, dict):
                self.set(name, BaseSettings(val, 'default'), 'default')
        self.update(values, priority)

As you can see, all configuration items from the default configuration file default_settings.py are first loaded into the Settings instance, and each of them carries a priority ('default' in this case).
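
To get a feel for how these priorities behave, here is a quick sketch (the numbers assume the stock defaults, where CONCURRENT_REQUESTS is 16):

from scrapy.settings import Settings

settings = Settings()  # loads default_settings.py at 'default' priority
print(settings.getint('CONCURRENT_REQUESTS'))   # 16, the stock default

# A higher priority such as 'project' overrides the default...
settings.set('CONCURRENT_REQUESTS', 32, priority='project')
print(settings.getint('CONCURRENT_REQUESTS'))   # 32

# ...while a lower priority is silently ignored
settings.set('CONCURRENT_REQUESTS', 8, priority='default')
print(settings.getint('CONCURRENT_REQUESTS'))   # still 32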

The default configuration file default_settings.py is very important, and it is worth paying attention to when reading the source code. It contains the default values for all components as well as the class paths of each component: the scheduler class, the spider middleware classes, the downloader middleware classes, the download handler classes, and so on.

# Downloader class
DOWNLOADER = 'scrapy.core.downloader.Downloader'
# Scheduler class
SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# Scheduler queue classes
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.ScrapyPriorityQueue'

Does it seem strange that there are so many class modules in the default configuration?

This is a feature of Scrapy, and the advantage of it is that any module is replaceable.

What does that mean? For example, if you feel the default scheduler doesn't meet your needs, you can implement your own scheduler following the same interface, register your scheduler class in your configuration file, and Scrapy will load yours at runtime instead. This gives you a great deal of flexibility.

So, any module class that is configured in the default configuration file is replaceable.
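
For example, swapping in your own scheduler only takes one line in your project's settings.py. This is only a sketch: the module path below is hypothetical, and the class must follow the same interface as scrapy.core.scheduler.Scheduler.

# myproject/settings.py (hypothetical project)

# Use a custom scheduler class instead of the built-in one; the class should
# implement the same methods as scrapy.core.scheduler.Scheduler
# (from_crawler, open, close, enqueue_request, next_request, has_pending_requests)
SCHEDULER = 'myproject.scheduler.MyScheduler'

# The same idea applies to every other component class in default_settings.py:
# DOWNLOADER = 'myproject.downloader.MyDownloader'
# SCHEDULER_MEMORY_QUEUE = 'myproject.squeues.MyMemoryQueue'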

Check whether we are running inside a project

After the configuration is initialized, the next step is to check whether we are running inside a crawler project. We know that some scrapy commands are project-dependent while others are global. This is determined by looking for the nearest scrapy.cfg file. The main logic is in the inside_project method.

def inside_project():
    # Check whether this environment variable exists (it was set above)
    scrapy_module = os.environ.get('SCRAPY_SETTINGS_MODULE')
    if scrapy_module is not None:
        try:
            import_module(scrapy_module)
        except ImportError as exc:
            warnings.warn("Cannot import scrapy settings module %s: %s" % (scrapy_module, exc))
        else:
            return True
    # If a scrapy.cfg can be found nearby, we are considered to be inside a project
    return bool(closest_scrapy_cfg())

In short, whether we are inside a crawler project depends on whether a scrapy.cfg file can be found. If it can, we are inside a project; otherwise, the command is treated as a global one.
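
For illustration, here is a rough re-implementation of what "looking for the nearest scrapy.cfg" means (the real helper is closest_scrapy_cfg in scrapy.utils.conf; the function name below is made up):

import os

def closest_scrapy_cfg_sketch(path='.'):
    # Walk up from the given directory looking for a scrapy.cfg file;
    # return its path, or '' if we reach the filesystem root without finding one.
    path = os.path.abspath(path)
    cfgfile = os.path.join(path, 'scrapy.cfg')
    if os.path.exists(cfgfile):
        return cfgfile
    parent = os.path.dirname(path)
    if parent == path:   # reached the root
        return ''
    return closest_scrapy_cfg_sketch(parent)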

Assemble a collection of command instances

Further down, we get to the command-loading logic. How does Scrapy know which commands exist, such as scrapy crawl or scrapy fetch? The answer is in the _get_commands_dict method.

def _get_commands_dict(settings, inproject):
    # Generate a {cmd_name: cmd} dictionary
    cmds = _get_commands_from_module('scrapy.commands', inproject)
    cmds.update(_get_commands_from_entry_points(inproject))
    # If the user configuration defines COMMANDS_MODULE, load those custom command classes too
    cmds_module = settings['COMMANDS_MODULE']
    if cmds_module:
        cmds.update(_get_commands_from_module(cmds_module, inproject))
    return cmds

def _get_commands_from_module(module, inproject):
    d = {}
    # Find all command classes (ScrapyCommand subclasses) in this module
    for cmd in _iter_command_classes(module):
        if inproject or not cmd.requires_project:
            # Build the {cmd_name: cmd} dictionary
            cmdname = cmd.__module__.split('.')[-1]
            d[cmdname] = cmd()
    return d

def _iter_command_classes(module_name):
    # Iterate over all modules in this package and yield the ScrapyCommand subclasses
    for module in walk_modules(module_name):
        for obj in vars(module).values():
            if inspect.isclass(obj) and \
                    issubclass(obj, ScrapyCommand) and \
                    obj.__module__ == module.__name__:
                yield obj

The process imports all modules under the commands package and produces a {cmd_name: cmd} dictionary; if the user has configured custom command classes in the configuration file, those are appended as well. In other words, we can write our own command classes, register them in the configuration file, and then use our own commands.
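
Here is a rough sketch of what that looks like (the module, class, and project names are made up; the hooks shown are the ones ScrapyCommand exposes):

# myproject/commands/listspiders.py  (hypothetical module)
from scrapy.commands import ScrapyCommand

class Command(ScrapyCommand):
    # This command only makes sense inside a project
    requires_project = True

    def short_desc(self):
        return "Print the spiders known to this project"

    def run(self, args, opts):
        # crawler_process is attached by execute() before run() is called
        for name in sorted(self.crawler_process.spider_loader.list()):
            print(name)

Register it with COMMANDS_MODULE = 'myproject.commands' in settings.py and scrapy listspiders becomes available; the command name comes from the module file name, matching the cmd.__module__.split('.')[-1] logic above.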

Parse the command

Once the command classes are loaded, it is time to work out which command we are executing. The parsing logic is fairly simple:

def _pop_command_name(argv):
    i = 0
    for arg in argv[1:]:
        if not arg.startswith('-'):
            del argv[i]
            return arg
        i += 1

The procedure simply walks the command-line arguments. For example, when we execute scrapy crawl <spider_name>, this method resolves crawl, which is then used to look up the corresponding command class.

Parse command-line arguments

After finding the corresponding command instance, call cmd.process_options to parse our arguments:

def process_options(self, args, opts):
    # First call the parent's process_options to handle the common fixed options
    ScrapyCommand.process_options(self, args, opts)
    try:
        # Convert the -a command-line arguments into a dictionary
        opts.spargs = arglist_to_dict(opts.spargs)
    except ValueError:
        raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)
    if opts.output:
        if opts.output == '-':
            self.settings.set('FEED_URI', 'stdout:', priority='cmdline')
        else:
            self.settings.set('FEED_URI', opts.output, priority='cmdline')
        feed_exporters = without_none_values(
            self.settings.getwithbase('FEED_EXPORTERS'))
        valid_output_formats = feed_exporters.keys()
        if not opts.output_format:
            opts.output_format = os.path.splitext(opts.output)[1].replace(".", "")
        if opts.output_format not in valid_output_formats:
            raise UsageError("Unrecognized output format '%s', set one"
                             " using the '-t' switch or as a file extension"
                             " from the supported list %s" % (opts.output_format,
                                                              tuple(valid_output_formats)))
        self.settings.set('FEED_FORMAT', opts.output_format, priority='cmdline')

The process parses the remaining command-line arguments: the fixed options (such as the output location) are handled by the parent class, while command-specific arguments are parsed by each command class itself.
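
For illustration only, converting the -a NAME=VALUE pairs into a dictionary amounts to something like this tiny re-implementation (the real helper is arglist_to_dict in scrapy.utils.conf):

def arglist_to_dict_sketch(arglist):
    # ['page=1', 'category=books'] -> {'page': '1', 'category': 'books'}
    # A malformed pair without '=' raises ValueError, which is what triggers
    # the "Invalid -a value" UsageError above.
    return dict(arg.split('=', 1) for arg in arglist)

# e.g. scrapy crawl <spider_name> -a page=1 -a category=books
print(arglist_to_dict_sketch(['page=1', 'category=books']))
# {'page': '1', 'category': 'books'}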

Initialize CrawlerProcess

With everything in place, you finally initialize the CrawlerProcess instance and run the run method of the corresponding command instance.

cmd.crawler_process = CrawlerProcess(settings)
_run_print_help(parser, _run_command, cmd, args, opts)

We start a spider with scrapy crawl <spider_name>, which means the run method in commands/crawl.py is called:

def run(self, args, opts):
    if len(args) < 1:
        raise UsageError()
    elif len(args) > 1:
        raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
    spname = args[0]

    self.crawler_process.crawl(spname, **opts.spargs)
    self.crawler_process.start()

The crawl and start methods of the CrawlerProcess instance are called in the run method, and the whole crawler is up and running.
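
This pair of calls is also Scrapy's documented way of running a spider from your own script; a minimal sketch (the spider name is hypothetical) looks like this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# The same two steps as commands/crawl.py: crawl() registers the spider,
# start() spins up the Twisted reactor and blocks until crawling finishes
process = CrawlerProcess(get_project_settings())
process.crawl('myspider')   # 'myspider' is a hypothetical spider name
process.start()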

Let’s start with the CrawlerProcess initialization:

class CrawlerProcess(CrawlerRunner):
    def __init__(self, settings=None):
        # Call the parent class constructor
        super(CrawlerProcess, self).__init__(settings)
        # Signal and logging initialization
        install_shutdown_handlers(self._signal_shutdown)
        configure_logging(self.settings)
        log_scrapy_info(self.settings)

The constructor in turn calls the constructor of its parent class, CrawlerRunner:

class CrawlerRunner(object):
    def __init__(self, settings=None):
        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)
        self.settings = settings
        # Get the spider loader
        self.spider_loader = _get_spider_loader(settings)
        self._crawlers = set()
        self._active = set()

At initialization, the _get_spider_loader method is called:

def _get_spider_loader(settings):
    # Read the SPIDER_MANAGER_CLASS configuration item in the configuration file
    if settings.get('SPIDER_MANAGER_CLASS'):
        warnings.warn(
            'SPIDER_MANAGER_CLASS option is deprecated. '
            'Please use SPIDER_LOADER_CLASS.',
            category=ScrapyDeprecationWarning, stacklevel=2
        )
    cls_path = settings.get('SPIDER_MANAGER_CLASS',
                            settings.get('SPIDER_LOADER_CLASS'))
    loader_cls = load_object(cls_path)
    try:
        verifyClass(ISpiderLoader, loader_cls)
    except DoesNotImplement:
        warnings.warn(
            'SPIDER_LOADER_CLASS (previously named SPIDER_MANAGER_CLASS) does '
            'not fully implement scrapy.interfaces.ISpiderLoader interface. '
            'Please add all missing methods to avoid unexpected runtime errors.',
            category=ScrapyDeprecationWarning, stacklevel=2
        )
    return loader_cls.from_settings(settings.frozencopy())

This reads the spider loader class from the configuration; the default is the SpiderLoader class in scrapy.spiderloader. As the name suggests, this class is used to load the spiders we have written. Let's look at its concrete implementation.

@implementer(ISpiderLoader)
class SpiderLoader(object):
    def __init__(self, settings):
        # Get the spider module paths (SPIDER_MODULES) from the configuration
        self.spider_modules = settings.getlist('SPIDER_MODULES')
        self._spiders = {}
        # Load all spiders
        self._load_all_spiders()

    def _load_spiders(self, module):
        # Build the {spider_name: spider_cls} dictionary
        for spcls in iter_spider_classes(module):
            self._spiders[spcls.name] = spcls

    def _load_all_spiders(self):
        for name in self.spider_modules:
            for module in walk_modules(name):
                self._load_spiders(module)

As you can see, the spider loader loads all spider modules and ultimately builds a {spider_name: spider_cls} dictionary, so that when we execute scrapy crawl <spider_name>, Scrapy can find our spider by name.
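
In other words, the name attribute of each spider class is the lookup key. A minimal spider (the module path, names, and URL below are only examples) looks like this:

# myproject/spiders/quotes.py  (must live under a path listed in SPIDER_MODULES)
import scrapy

class QuotesSpider(scrapy.Spider):
    # This value becomes the key in the {spider_name: spider_cls} dictionary,
    # i.e. what you type on the command line: scrapy crawl quotes
    name = 'quotes'

    def start_requests(self):
        # Called by Crawler.crawl() to produce the seed requests (see the next section)
        yield scrapy.Request('http://quotes.toscrape.com/', callback=self.parse)

    def parse(self, response):
        yield {'title': response.css('title::text').extract_first()}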

Run the crawler

After the CrawlerProcess is initialized, call its crawl method:

def crawl(self, crawler_or_spidercls, *args, **kwargs):
    # Create the crawler
    crawler = self.create_crawler(crawler_or_spidercls)
    return self._crawl(crawler, *args, **kwargs)

def _crawl(self, crawler, *args, **kwargs):
    self.crawlers.add(crawler)
    # Call the Crawler's crawl method
    d = crawler.crawl(*args, **kwargs)
    self._active.add(d)

    def _done(result):
        self.crawlers.discard(crawler)
        self._active.discard(d)
        return result
    return d.addBoth(_done)

def create_crawler(self, crawler_or_spidercls):
    if isinstance(crawler_or_spidercls, Crawler):
        return crawler_or_spidercls
    return self._create_crawler(crawler_or_spidercls)

def _create_crawler(self, spidercls):
    # If it is a string, load the spider class from the spider loader
    if isinstance(spidercls, six.string_types):
        spidercls = self.spider_loader.load(spidercls)
    # Then create the Crawler
    return Crawler(spidercls, self.settings)

This procedure creates a Crawler instance and calls its crawl method:

@defer.inlineCallbacks
def crawl(self, *args, **kwargs):
    assert not self.crawling, "Crawling already taking place"
    self.crawling = True

    try:
        # Only now is the spider instantiated
        self.spider = self._create_spider(*args, **kwargs)
        # Create the engine
        self.engine = self._create_engine()
        # Call the spider's start_requests method
        start_requests = iter(self.spider.start_requests())
        # Execute the engine's open_spider, passing in the spider instance and the initial requests
        yield self.engine.open_spider(self.spider, start_requests)
        yield defer.maybeDeferred(self.engine.start)
    except Exception:
        if six.PY2:
            exc_info = sys.exc_info()

        self.crawling = False
        if self.engine is not None:
            yield self.engine.close()

        if six.PY2:
            six.reraise(*exc_info)
        raise

def _create_spider(self, *args, **kwargs):
    return self.spidercls.from_crawler(self, *args, **kwargs)

At this point, an instance of our spider class is created, then the engine is created, and the spider's start_requests method is called to obtain the seed requests, which are handed to the engine to execute.

Finally, let's see how CrawlerProcess gets things started, with its start method:

def start(self, stop_after_crawl=True):
    if stop_after_crawl:
        d = self.join()
        if d.called:
            return
        d.addBoth(self._stop_reactor)
    reactor.installResolver(self._get_dns_resolver())
    # Configure the reactor thread pool size (REACTOR_THREADPOOL_MAXSIZE)
    tp = reactor.getThreadPool()
    tp.adjustPoolsize(maxthreads=self.settings.getint('REACTOR_THREADPOOL_MAXSIZE'))
    # Register the stop callback to run before reactor shutdown
    reactor.addSystemEventTrigger('before', 'shutdown', self.stop)
    # Start execution
    reactor.run(installSignalHandlers=False)

A component called the reactor shows up here. What is a reactor? It is the event manager of the Twisted framework: we register the events we want to execute with the reactor and then call its run method, and it executes the registered events for us, automatically switching to other runnable events whenever one is waiting on network I/O.

We won't go into the details of how the reactor works here; you can think of it as an event loop (backed by a thread pool) that executes events through the callbacks registered with it.
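
If you have never touched Twisted, here is a tiny standalone sketch (independent of Scrapy) of that register-then-run pattern:

from twisted.internet import reactor

def say_hello():
    print("hello from a scheduled event")
    reactor.stop()   # stop the event loop once our only event has run

# Register an event to fire one second after the reactor starts...
reactor.callLater(1, say_hello)
# ...then hand control to the event loop; run() blocks until reactor.stop()
reactor.run()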

At this point, the entry process of Scrapy has been fully analyzed; the crawling and scheduling logic is handed off to the ExecutionEngine, which coordinates all the components to carry out the whole task.

Conclusion

To summarize, before Scrapy actually starts crawling, it initializes the configuration environment, loads the command classes, loads the spider modules, parses the command and its arguments, and then runs our spider class; from there, the engine takes over the scheduling of the crawl.

Here I have also summarized the whole process into a mind map for your understanding:

Our next article will take a closer look at each of the core components, what they do, and how they coordinate to complete the scraping task.

Crawler series:

  • Scrapy source code analysis (1): architecture overview
  • Scrapy source code analysis (2): how does Scrapy run?
  • Scrapy source code analysis (3): what are Scrapy's core components?
  • Scrapy source code analysis (4): how is a scraping task completed?
  • How to build a crawler proxy service?
  • How to build a universal vertical crawler platform?

My advanced Python series:

  • Python Advanced: How to implement a decorator?
  • Python Advanced: How to use magic methods correctly? (part 1)
  • Python Advanced: How to use magic methods correctly? (part 2)
  • Python Advanced: What is a metaclass?
  • Python Advanced: What is a context manager?
  • Python Advanced: What is an iterator?
  • Python Advanced: How to use yield correctly?
  • Python Advanced: What is a descriptor?
  • Python Advanced: Why does the GIL make multithreading so useless?

Want to read more hardcore technical articles? Follow the "Water Drop and Silver Bullet" WeChat official account to get high-quality technical content as soon as it is published. Written by a senior back-end developer with 7 years of experience who explains technology in plain language.