
(1) Scrapy module installation

Scrapy supports Python 2.7 and above (note that current Scrapy releases require Python 3).

Python packages can be installed globally (also known as system-wide) or in user space.

1. Windows:
   1. Download the Twisted wheel (.whl) that matches your Python version from https://www.lfd.uci.edu/~gohlke/pythonlibs/.
   2. On the command line, go to the directory containing the downloaded file and run: pip install <Twisted wheel file name>
   3. Then run: pip install scrapy

2. Under Anaconda (officially recommended):
   1. Install Anaconda (see docs.anaconda.com/anaconda/pa…; for older versions see https://blog.csdn.net/ychgyyn/article/details/82119201).
   2. Install Scrapy: conda install scrapy
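Whichever method you use, a quick way to verify the installation is a minimal check like the following (a sketch; it simply assumes Scrapy was installed into the currently active Python environment):

import scrapy
print(scrapy.__version__)   # prints the installed Scrapy version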

(2) Scrapy framework introduction

Scrapy is an efficient, structured web scraping framework developed in pure Python.

What is Scrapy?

Scrapy is an application framework designed to extract structured data from websites. Originally designed for page scraping (more specifically, web scraping), it can also be used to retrieve data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is widely used for data mining, monitoring, and automated testing. It uses the Twisted asynchronous networking library to handle network communication.

Why do we use it?

1. It lets us focus on the requests and the parsing. 2. It is a common enterprise requirement.

(3) Operation process

(Whenever a framework is introduced, the key thing to focus on is its workflow/logical order.)

Introduction:

An example:

Xiao Ming, a freshman, arrives at school at the start of the semester. He goes to the reception desk and waits. A senior student (the administrator) sees him and asks, “Do you need help?” Xiao Ming, not knowing what to do, says he is a new student who has come to register. The administrator puts Xiao Ming’s information into a queue, and once it is Xiao Ming’s turn, the queue hands his number back to the administrator.

Then the administrator passes the number to the registration office, which gathers Xiao Ming’s enrollment information (class, dormitory, and so on) and returns it to the administrator. The administrator gives the information to Xiao Ming and asks him to check whether it is what he needs. After checking carefully, Xiao Ming finds it is all what he wants and tells the administrator: “I have confirmed it, that’s it!”

Finally, the administrator submits the information to the information management office for storage.

Note the arc in the diagram: if, after receiving and checking the information, Xiao Ming finds it is not what he needs, he tells the administrator, “This is not what I want; I’d like to request something else,” and the administrator puts Xiao Ming’s new request back into the queue!

1. Get to the point:


Spiders (crawlers)

Items

Engine

Scheduler

Downloader

Item Pipelines

Downloader middlewares and spider middlewares

Data flow: the diagram above shows the architecture of the Scrapy framework and its components, as well as the data flow within the system (shown by the red arrows). The data flow in Scrapy is controlled by the execution engine, and the process is as follows:

1. The engine gets the initial requests from the spider.
2. The engine puts those requests into the scheduler and asks for the next request to crawl.
3. The scheduler returns the next request to the engine.
4. The engine sends the request to the downloader, passing it through all the downloader middlewares in turn.
5. Once the page finishes downloading, the downloader generates a response containing the page data and sends it back to the engine, again passing through the downloader middlewares.
6. The engine receives the response from the downloader and sends it to the spider for parsing, passing it through all the spider middlewares in turn.
7. The spider processes the response, parses out items and new requests, and sends them to the engine.
8. The engine sends the parsed items to the item pipelines and the new requests to the scheduler.
9. The process repeats (from step 3) until there are no more requests in the scheduler.
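To make steps 7 and 8 concrete, here is a minimal illustrative spider (not part of this project) whose callback yields both items, which the engine forwards to the pipelines, and new requests, which go back to the scheduler. It uses the public demo site quotes.toscrape.com, and the spider name and field name are hypothetical:

import scrapy


class QuotesDemoSpider(scrapy.Spider):
    name = 'quotes_demo'                            # hypothetical spider name
    start_urls = ['http://quotes.toscrape.com/']    # public demo site

    def parse(self, response):
        # Parsed items flow: spider -> engine -> item pipelines
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

        # New requests flow: spider -> engine -> scheduler
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)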

Introduction to middleware:

(1) Downloader middleware: a downloader middleware is a specific hook between the engine and the downloader; it handles requests on their way from the engine to the downloader and responses on their way from the downloader back to the engine. Use a downloader middleware if you want to do any of the following: process a request before it is sent to the downloader (that is, before Scrapy sends the request to the website); send a response directly to the spider instead of passing on the response received from the website; pass a response to the spider without fetching the web page at all; or silently drop some requests. A sketch follows below.
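As a rough illustration (a sketch, not this project’s code), a custom downloader middleware might look like the following; the class name and the header value are hypothetical, and it would still need to be enabled in DOWNLOADER_MIDDLEWARES in settings.py:

from scrapy.exceptions import IgnoreRequest


class DemoDownloaderMiddleware:
    def process_request(self, request, spider):
        # Called for every request on its way from the engine to the downloader.
        request.headers.setdefault(b'User-Agent', b'demo-agent/1.0')
        if 'skip-me' in request.url:
            raise IgnoreRequest('silently drop this request')
        return None                     # None = continue normal processing

    def process_response(self, request, response, spider):
        # Called for every response on its way from the downloader back to the engine.
        spider.logger.debug('Got %s for %s', response.status, request.url)
        return response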

(2) Spider (crawler) middleware: a spider middleware is a specific hook between the engine and the spider; it can process the responses passed into the spider as well as the items and requests passed out of it. Use a spider middleware if you need to: post-process requests or items after a spider callback, process start_requests, handle spider exceptions, or call an errback instead of the callback for certain requests based on the response content. Again, a sketch follows below.
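And a similar sketch of a custom spider middleware (again hypothetical; it would be enabled via SPIDER_MIDDLEWARES in settings.py):

class DemoSpiderMiddleware:
    def process_spider_output(self, response, result, spider):
        # Called with the items and requests returned by a spider callback.
        for element in result:
            yield element               # pass everything through unchanged

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider callback raises an exception.
        spider.logger.warning('Callback failed for %s: %r', response.url, exception)
        return []                       # swallow the exception, yield nothing

    def process_start_requests(self, start_requests, spider):
        # Called with the spider's initial requests.
        for request in start_requests:
            yield request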

2. Introduction to each component:

The Scrapy Engine controls the flow of data between all the components of the system and fires events when certain operations occur.

The scheduler receives requests from the engine and queues them so that the engine can request them later.

The Downloader fetches the web pages and feeds them to the engine, which feeds them to the spider.

A spider is a custom class written by the user that parses responses, extracting data (items) from them and/or additional requests to follow.

The Item Pipeline is responsible for the subsequent processing of the data after it is extracted by the crawler. Typical tasks include cleanup, validation, and persistence (such as storing the data in a database); see the sketch below.
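For instance, a sketch of a simple item pipeline that validates, cleans, and persists items; the field name 'text' and the output file are hypothetical, and the pipeline would need to be enabled via ITEM_PIPELINES in settings.py:

import json

from scrapy.exceptions import DropItem


class DemoPipeline:
    def open_spider(self, spider):
        self.file = open('items.jl', 'w', encoding='utf-8')   # hypothetical output file

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        if not item.get('text'):                     # validation: drop incomplete items
            raise DropItem('missing text field')
        item['text'] = item['text'].strip()          # cleanup
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')  # persistence
        return item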

(4) Simple use

1. Basic operation (simple project command)!

(1) Create project:


<> is required; [] is optional!

Tip 1: Typing scrapy in the PyCharm terminal shows some help output, which helps us with those hard-to-remember commands!

Typing scrapy followed by a command keyword shows the detailed usage of that command!

Step 1: cd into the folder where you want to put your Scrapy project.

Step 2: Use the scrapy command to create a new scrapy project.

scrapy startproject <project_name> [project_dir]

This command creates a new Scrapy project named project_name under the project_dir directory. If project_dir is not specified, project_dir will be the same as project_name.

Execute command:

scrapy startproject baidu

The following files will be created in the specified folder:
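For reference, a sketch of the layout that Scrapy's default template generates (exact file names may vary slightly between Scrapy versions):

baidu/                  # project root (project_dir)
    scrapy.cfg          # deployment configuration file
    baidu/              # the project's Python module
        __init__.py
        items.py        # item definitions
        middlewares.py  # downloader / spider middlewares
        pipelines.py    # item pipelines
        settings.py     # project settings
        spiders/        # spider files go here
            __init__.py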


(2) Create crawler file

To write the spider, we create a BdSpider class that inherits from scrapy.Spider and define the following three attributes/methods:

name: the name of the spider; it must be unique.

start_urls: the list of initial URLs.

parse(self, response): the method called after each initial URL has been downloaded. This method does two things: parse the response, encapsulate the data as Item objects and return them; and extract new URLs to download, creating and returning new requests for them.

We can also create the crawler file with a command.

This command creates the spider file bd.py under the spiders folder. The format is: scrapy genspider <spider_name> <domain>. For example, run: scrapy genspider bd www.baidu.com. The generated bd.py looks like this:

# -*- coding: utf-8 -*-
import scrapy


class BdSpider(scrapy.Spider):          # Inherits from scrapy.Spider
    name = 'bd'                         # The name must be unique; when we start the project, the crawler file is found by this name
    allowed_domains = ['www.baidu.com'] # Allowed domain names (this restriction can be omitted!)
    start_urls = ['http://www.baidu.com/']  # The initial request (required); otherwise there is nothing to start the whole framework running!

    def parse(self, response):          # The parse method must not be renamed; it receives the data downloaded by the downloader
        print("* * * * * * *")          # Printed just to observe more intuitively whether the framework is working properly!
        print("* * * * * * *")
        print("* * * * * * *")
        print("* * * * * * *")
        print("* * * * * * *")
        print("* * * * * * *")
        print(response)                 # The response object
        # Two ways to obtain the data:
        print(response.body.decode())   # response.body is bytes, so decode it
        # print(response.text)

Note: the engine ultimately delivers the downloaded data to the spider module through the response parameter of the parse method.
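Besides response.body and response.text, the response object also provides selectors. A small illustrative sketch (the XPath/CSS expressions here are only examples):

# Inside parse(self, response):
print(response.url)                               # the URL that was downloaded
print(response.status)                            # the HTTP status code, e.g. 200
title = response.xpath('//title/text()').get()    # extract data with XPath
links = response.css('a::attr(href)').getall()    # extract data with CSS selectors
print(title, len(links))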

(3) Run the crawler file

scrapy crawl [options] <spider_name>

Execute command:

scrapy crawl bd

But! After running the crawler file, we find that the print output used for testing does not appear. Checking the terminal output, we learn that the Scrapy framework obeys the robots protocol by default, so of course we do not get the data!!

How to solve this problem?

Open the settings file settings.py and set ROBOTSTXT_OBEY to False, as shown below:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

Extension: a second way to run Scrapy!

cd into the spiders folder and run the following command: scrapy runspider bd.py (that is, scrapy runspider followed by the spider file name).

Advanced extension: (Note: neither of these two ways of running a Scrapy project can be debugged, which is very inconvenient; if something goes wrong, it is hard to track down!! So there is a third way to start a Scrapy project: Django automatically generates a py file (manage.py or main.py) when you create a project, which Scrapy does not have, but we can define one ourselves!)

1. Create a py file named main.py or manage.py in the project folder:

2. Write the following code in it:

from scrapy.cmdline import execute
import sys
import os

sys.path.append(os.path.dirname(os.path.abspath(__file__)))  # Add the project directory to sys.path (you don't have to!)
execute(["scrapy", "crawl", "bd"])                            # Equivalent to running: scrapy crawl bd
3. Now we can run this py file directly and see that it runs the Scrapy project just like the previous two methods do. What's more, we can debug the Scrapy project by debugging this py file.

🔆 In The End!

Start from now and stick with it: a little progress every day, and in the near future you will thank yourself for your efforts!

This blogger will continue to update the crawler basics column and the practical crawler column. Friends who read this article carefully are welcome to like, bookmark, and comment with their impressions. You can also follow this blogger to read more crawler articles in the days to come!

If there are any mistakes or inappropriate wording, please point them out in the comments section, thank you! If you reprint this article, please contact me for consent and credit the source and this blogger's name, thank you!