Scrapy is a pure-Python crawler framework built on the Twisted asynchronous networking library. Its architecture is clear, its modules are loosely coupled, and it is highly extensible, so it can flexibly handle a wide range of crawling requirements. We only need to customize a few modules to easily implement a crawler.

1. Architecture introduction

Let us first take a look at the architecture of the Scrapy framework, as shown in the figure below.

It can be divided into the following parts.

  • Engine. The engine processes the data flow of the whole system and triggers events; it is the core of the entire framework.

  • Item. The item defines the data structure of the crawl results; the crawled data is assigned to Item objects (a minimal sketch follows this list).

  • Scheduler. The scheduler accepts requests sent over by the engine, adds them to a queue, and supplies them back to the engine when the engine asks for the next request.

  • Downloader. The downloader downloads web page content and returns the downloaded content to the spiders.

  • Spiders. The spider, which defines the crawl logic and page parsing rules, is responsible for parsing the response and generating extraction results and new requests.

  • Item Pipeline. The item pipeline processes the items extracted from web pages by the spiders; it is mainly responsible for cleaning, validating, and storing the crawled data.

  • Downloader Middlewares. The downloader middleware, a hook framework between the engine and downloader, handles requests and responses between the engine and downloader.

  • Spider Middlewares. The spider middleware, a hook framework between the engine and the spiders, mainly processes the responses fed into the spiders and the results and new requests the spiders produce.
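
Of these components, the Engine, Scheduler, and Downloader are provided by the framework itself; the Item, Spider, and Item Pipeline are the parts we normally write ourselves. Below is a rough, hedged sketch of the user-facing pieces (the quotes.toscrape.com URL and the field names are purely illustrative, not taken from this section):

import scrapy

class QuoteItem(scrapy.Item):
    # Item: the data structure of one crawl result
    text = scrapy.Field()
    author = scrapy.Field()

class QuotesSpider(scrapy.Spider):
    # Spider: crawl logic and page parsing rules
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        # extraction results are assigned to Item objects and yielded
        for quote in response.css("div.quote"):
            item = QuoteItem()
            item["text"] = quote.css("span.text::text").get()
            item["author"] = quote.css("small.author::text").get()
            yield item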

2. The data flow

The flow of data in Scrapy is controlled by the engine as follows.

  1. The Engine first opens a website, finds the Spider that handles that site, and asks the Spider for the first URL(s) to crawl.

  2. The Engine obtains the first URL to crawl from the Spider and schedules it as a Request through the Scheduler.

  3. The Engine asks the Scheduler for the next URL to crawl.

  4. The Scheduler returns the next URL to be crawled to the Engine, and the Engine forwards it to the Downloader through the Downloader Middlewares.

  5. Once the page is downloaded, the Downloader generates a Response for the page and sends it to the Engine through the Downloader Middlewares.

  6. The Engine receives the Response from the Downloader and sends it to the Spider for processing through the Spider Middlewares.

  7. The Spider processes the Response and returns the extracted Items and new Requests to the Engine.

  8. The Engine passes the Items returned by the Spider to the Item Pipeline and the new Requests to the Scheduler.

  9. Steps 2 through 8 are repeated until there are no more Requests in the Scheduler; the Engine then closes the site and the crawl finishes.

Through the collaboration of these components, with each component doing its own job and all of them supporting asynchronous processing, Scrapy makes full use of the available network bandwidth and greatly improves the efficiency of crawling and processing data.
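
Most of this flow is handled inside Scrapy; from our side, a Spider simply yields Items and Requests from its parse callback, and the Engine routes the former to the Item Pipeline and the latter back to the Scheduler. A minimal, hedged sketch (the CSS selectors and the next-page link are placeholders):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        # yielded items are handed by the Engine to the Item Pipeline
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}

        # yielded Requests go back to the Scheduler via the Engine,
        # and the cycle repeats until the Scheduler runs out of Requests
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)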

3. Project structure

Unlike pyspider, Scrapy creates projects from the command line, and the code is written in an IDE of your choice. After a project is created, its file structure looks like this:

scrapy.cfg
project/
    __init__.py
    items.py
    pipelines.py
    settings.py
    middlewares.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
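
This layout is what the scrapy startproject command generates; individual spiders are then usually added under the spiders directory, for example with the scrapy genspider command.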

The functions of each file are described as follows.

  • scrapy.cfg: This is the configuration file of the Scrapy project; it defines the path of the project’s settings file, deployment information, and so on.

  • items.py: This defines the Item data structures; all Item definitions can be placed here.

  • pipelines.py: This defines the implementations of the Item Pipelines; all Item Pipeline implementations can be placed here (a short sketch follows this list).

  • settings.py: This defines the global configuration of the project.

  • middlewares.py: This defines the implementations of the Spider Middlewares and Downloader Middlewares.

  • spiders/: This directory contains the Spider implementations, each in its own file.
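
As a hedged illustration of how pipelines.py and settings.py fit together (the TextPipeline name, the text field, and the priority value 300 are arbitrary choices, not part of the generated template):

# pipelines.py: clean, validate, and store the Items handed over by the Engine
from scrapy.exceptions import DropItem

class TextPipeline:
    def process_item(self, item, spider):
        if not item.get("text"):
            raise DropItem("missing text")   # validation: drop incomplete items
        item["text"] = item["text"].strip()  # cleaning
        return item

# settings.py: enable the pipeline; the number controls its order (lower runs first)
ITEM_PIPELINES = {
    "project.pipelines.TextPipeline": 300,
}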

4. Conclusion

This section introduced the basic architecture, data flow, and project structure of the Scrapy framework. In the sections that follow, we will take a closer look at how to use Scrapy and see how powerful it is.


This article was first published on Cui Qingcai’s personal blog: Python3 Web Crawler Development Tutorial.
