Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

The above quote is taken from the official documentation. Scrapy is an application framework designed to crawl websites and extract structured data; it can be used in a wide range of applications including data mining, information processing, and archiving historical data. Scrapy is one of Python's most mature and stable crawler frameworks, which makes it a good choice for developing crawlers.

Other articles in the series:

  • Python Crawler – Using the Scrapy Framework
  • Python Crawler (Scrapy)
  • Python Crawler – Using the Scrapy Framework
  • Python Crawler – Using the Scrapy Framework

This article introduces Scrapy's architecture, workflow, and advantages, and walks through setting up the development environment.

I. Scrapy

Scrapy architecture diagram:

[Figure: Scrapy architecture diagram]

With a clear architecture diagram, we can see how Scrapy works.

1. Scrapy components:
  • Scrapy Engine: Scrapy is built on Twisted, an event-driven networking framework, with the engine at its core. The engine processes the data flow of the whole system and triggers the individual events.
  • Scheduler: The Scheduler receives requests from the engine, maintains the queue of URLs to be crawled, and returns URLs as download requests according to a specified scheduling policy.
  • Item Pipeline: Converts the extracted content into entity objects (Items) and performs custom operations on them, such as validation and persistence.
  • Downloader: Sends the HTTP requests for web pages and receives the responses; it is responsible for fetching pages and returning their data.
  • Spiders: Spiders define which URLs to crawl and the rules for extracting information from each web page.
2. Scrapy's three kinds of middleware connect these modules:
  • Downloader Middlewares: Middleware between the Scrapy engine and the Downloader that passes requests and responses between them.
  • Scheduler Middlewares: Middleware between the Scrapy engine and the Scheduler that transmits scheduling requests and responses.
  • Spider Middlewares: Middleware between the Scrapy engine and the Spiders that handles the spiders' response input and request output.
3. Scrapy workflow (a minimal spider sketch follows this list):
  • The engine opens a website, finds the Spider that handles that site, and asks the Spider for the first URL(s) to crawl.
  • The engine takes the first URL to crawl from the Spider and schedules it as a Request in the Scheduler.
  • The engine asks the Scheduler for the next URL to crawl.
  • The Scheduler returns the next URL to crawl to the engine, which forwards it to the Downloader through the downloader middleware (request direction).
  • Once the page is downloaded, the Downloader generates a Response for the page and sends it to the engine through the downloader middleware (response direction).
  • The engine receives the Response from the Downloader and sends it through the spider middleware (input direction) to the Spider for processing.
  • The Spider processes the Response and returns the extracted Items and any follow-up Requests to the engine.
  • The engine passes the extracted Items to the Item Pipeline and the Requests to the Scheduler.
  • The cycle repeats until there are no more requests in the Scheduler, at which point the engine closes the site.
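To make the workflow concrete, here is a minimal sketch of a Spider and an Item Pipeline. The target site, the XPath expressions, and the QuotesSpider/SaveQuotePipeline names are hypothetical; the point is simply to show how a Spider yields Items and follow-up Requests, which the engine then routes to the Item Pipeline and the Scheduler respectively.

    import scrapy


    class QuotesSpider(scrapy.Spider):
        """A hypothetical spider; the URL and XPath expressions are placeholders."""
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # Extract items from the current page
            for quote in response.xpath('//div[@class="quote"]'):
                yield {
                    'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                    'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                }
            # Follow the "next page" link; the engine sends this Request back to the Scheduler
            next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)


    class SaveQuotePipeline(object):
        """A hypothetical pipeline; every Item yielded by the spider passes through here."""
        def process_item(self, item, spider):
            # Validation and persistence (e.g. writing to MongoDB) would happen here
            return item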
4. Advantages of Scrapy

Why use Scrapy instead of another crawler framework? Besides its maturity and stability, it has many other advantages:

  • Uses the more readable XPath instead of regular expressions for HTML parsing
  • Ships with an interactive shell, which makes debugging easy (see the example after this list)
  • Highly extensible and loosely coupled, so it is easy to customize
  • Automatic encoding detection and robust encoding support
  • Powerful statistics collection and logging system
  • Supports asynchronous requests for multiple URLs
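
As a quick illustration of the first two points, the Scrapy shell lets you try out XPath expressions interactively before putting them into a spider. The URL and expressions below are placeholders:

    scrapy shell "http://quotes.toscrape.com/"
    >>> response.xpath('//title/text()').extract_first()
    >>> response.xpath('//div[@class="quote"]/span[@class="text"]/text()').extract()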

II. Setting up the development environment

1. Install the Python environment

Scrapy currently supports Python 2.7 as well as Python 3.3 and above, so you can choose the Python version that suits your needs. The development environment in this article uses Python 3.5. If you are a beginner, you are advised to start directly with Python 3 and leave Python 2's historical baggage behind.

Download each version of Python from www.python.org/downloads

2. Install Scrapy

Install Scrapy; see the official documentation for details. This can be a bit of a problem on Windows: because Scrapy depends on a number of third-party packages, they are installed along with it, and some of them may fail to install. For example:

  • Twisted fails to install on Windows. You need to download the Windows installation package of Twisted manually from www.lfd.uci.edu/~gohlke/pyt… . Download the package that matches your Windows and Python versions.

    If the downloaded package is Twisted-17.1.0-cp35-cp35m-win_amd64.whl, run the command

    pip install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
  • lxml fails to install in a Windows environment, similar to the Twisted situation. You need to manually download the installation package from www.lfd.uci.edu/~gohlke/pyt… .

    If the downloaded package is lxml-3.7.3-cp35-cp35m-win_amd64.whl, run the command

    pip install lxml-3.7.3-cp35-cp35m-win_amd64.whl
3. Install MongoDB

MongoDB is used here to store the information scraped from web pages, such as article titles, categories, image save paths, and so on. Installing MongoDB is fairly simple; download the installer from www.mongodb.com/download-ce… If you have any questions during installation, refer to the official documentation. You can also refer to this article on downloading and installing MongoDB on Windows, which covers the process in detail. After MongoDB is installed, you also need to install pymongo, the third-party library used to operate MongoDB from Python code. Installation command:

pip install pymongo
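
As a quick sanity check, and as a sketch of how a pipeline might persist items, the snippet below connects to a local MongoDB instance and inserts one document. The database and collection names are hypothetical.

    import pymongo

    # Connect to a local MongoDB instance (default port 27017)
    client = pymongo.MongoClient('localhost', 27017)
    db = client['scrapy_demo']       # hypothetical database name
    articles = db['articles']       # hypothetical collection name

    # Insert a document shaped like a scraped item
    articles.insert_one({'title': 'Example', 'category': 'test', 'image_path': '/tmp/example.jpg'})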
4. Install Redis

Redis is a storage system whose data structure is key-value pairs. In a crawler, Redis is mainly used as a cache: recording when the HTTP proxy list was last refreshed, filtering out URLs that have already been crawled, and so on. Redis downloads and installation instructions can be found at redis.io/download. There is no official Windows version of Redis, so if you want to install it on Windows you have to use the build provided by the Microsoft Open Tech group; the download address and installation tutorial are on GitHub at github.com/ServiceStac… . Then install the redis Python client:

pip install redis
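
For example, a simple way to filter out URLs that have already been crawled is to keep them in a Redis set. This is only a sketch; the key name and connection settings are assumptions.

    import redis

    # Connect to a local Redis instance (default port 6379)
    r = redis.StrictRedis(host='localhost', port=6379, db=0)

    def seen_before(url):
        # sadd returns 1 if the URL was newly added, 0 if it was already in the set
        return r.sadd('crawled_urls', url) == 0

    print(seen_before('http://example.com/page/1'))  # False the first time
    print(seen_before('http://example.com/page/1'))  # True the second time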
5. Install other third-party libraries

1) Pillow: handles image cropping and saving

pip install pillow
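
A minimal sketch of cropping and saving an image with Pillow; the file paths and crop box are placeholders.

    from PIL import Image

    img = Image.open('input.jpg')           # placeholder path
    cropped = img.crop((0, 0, 200, 200))    # (left, upper, right, lower)
    cropped.save('output.jpg')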

2) Requests: an HTTP library built on top of urllib, released under the Apache2 open source license

pip install requests
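
A quick check that requests is installed and working; the URL is a placeholder.

    import requests

    response = requests.get('http://example.com')
    print(response.status_code)   # 200 if the request succeeded
    print(response.text[:100])    # the first 100 characters of the HTML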

3) Schedule: used to manage scheduled tasks

pip install schedule
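
A minimal sketch of running a recurring job with schedule; the interval and the job body are placeholders.

    import time
    import schedule

    def crawl_job():
        print('run the crawler here')

    # Run the job every 10 minutes (placeholder interval)
    schedule.every(10).minutes.do(crawl_job)

    while True:
        schedule.run_pending()
        time.sleep(1)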
6. Install the IDE

PyCharm is the recommended IDE for Python development, and the PyCharm Community Edition is powerful enough for this series.

Finally

Now that we have a basic understanding of Scrapy and have set up the development environment, it is time to start coding. See the next article in the series: Python Crawler – Using the Scrapy Framework.

Attached:

The detailed project code is on GitHub; if you find it helpful, remember to give it a Star.