Here is a simple project that walks through the complete Scrapy crawl process. Through it, we can get a general idea of Scrapy's basic usage and principles.


First, goals of this section

The tasks to be accomplished in this section are as follows.


  • Create a Scrapy project.

  • Create a spider to crawl sites and process data.

  • Export the crawled content via the command line.

  • Save the captured content to the MongoDB database.

Second, preparation

We need to install the Scrapy framework, MongoDB, and the PyMongo library.
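For example, on a typical setup they can be installed roughly as follows (a sketch only; the exact commands depend on your platform and Python environment):

```
pip3 install scrapy pymongo
# MongoDB itself is installed separately (e.g. with your system package manager),
# and the mongod service should be running before the crawl is executed.
```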


Third, create a project

To create a Scrapy project, the project files can be generated directly with the scrapy command, as follows:

```
scrapy startproject tutorial
```

This command can be run in any folder. If a permission error is reported, you can prefix the command with sudo. The command creates a folder called tutorial, with the structure shown below:

```
scrapy.cfg            # Scrapy deployment configuration file
tutorial/             # the project's module, imported from here
    __init__.py
    items.py          # Item definitions, the structure of the crawled data
    middlewares.py    # Middleware definitions
    pipelines.py      # Pipeline definitions
    settings.py       # project settings
    spiders/          # folder that holds the spiders
        __init__.py
```

Fourth, create a spider

A spider is a class that we define ourselves and that Scrapy uses to pull content from web pages and parse the results. This class must inherit from scrapy.Spider, define the spider's name and its starting requests, and specify how to handle the crawled results.


You can also use the command line to create a spider. For example, to generate the quotes spider, run the following commands:

```
cd tutorial
scrapy genspider quotes quotes.toscrape.com
```

Here we first enter the tutorial folder just created and then execute the genspider command. The first argument is the name of the spider and the second is the domain name of the website. After it runs, there is a new file in the spiders folder called quotes.py; this is the spider just created, with the following contents:

```
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass
```

There are three attributes here, name, allowed_domains and start_urls, and one method, parse.

  • name: the unique name of each project, used to distinguish one spider from another.

  • allowed_domains: the domains the spider is allowed to crawl. If an initial or subsequent request link is not under one of these domains, the link is filtered out.

  • start_urls: the list of URLs the spider starts crawling from on startup; it defines the initial requests.

  • parse: a method of the spider. By default, it is called after the requests for the links in start_urls have been downloaded, and the returned response is passed to it as its only argument. The method is responsible for parsing the response, extracting data, or generating further requests to be processed.

Fifth, create an Item

An Item is a container for the crawled data, and it is used much like a dictionary. However, compared with a dictionary, an Item has an additional protection mechanism that guards against misspelled or undefined fields.
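For example, the protection works roughly like this (a small sketch with a hypothetical DemoItem; assigning to a field that was never declared raises a KeyError):

```
import scrapy

class DemoItem(scrapy.Item):
    text = scrapy.Field()

item = DemoItem()
item['text'] = 'hello'   # fine: 'text' is a declared field
item['txet'] = 'oops'    # raises KeyError: a misspelled, undeclared field is rejected
```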


To create an Item, define a class that inherits from scrapy.Item and declare fields of type scrapy.Field. Looking at the target website, we can see that the content we want to extract is text, author and tags.

To define the Item, modify items.py as follows:

```
import scrapy


class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
```

There are three fields defined here, and we’ll use this item for the next crawl.

Sixth, parse the response

As we saw earlier, the response parameter of the parse() method is the result of downloading the links in start_urls. So in the parse() method, we can directly analyze the response variable: for example, look at the source code of the requested page, analyze that source further, or find links in the result that lead to the next request.


We can see that the page contains both the results we want and a link to the next page, and we need to handle both.

First look at the structure of the page, as shown below. Each page contains multiple blocks whose class is quote, and each block contains text, author and tags. So we first find all the quote blocks, and then extract the content of each one.

Extraction can be done with CSS selectors or XPath selectors; in this case we use CSS selectors. The parse() method is rewritten as follows:

```
def parse(self, response):
    quotes = response.css('.quote')
    for quote in quotes:
        text = quote.css('.text::text').extract_first()
        author = quote.css('.author::text').extract_first()
        tags = quote.css('.tags .tag::text').extract()
```

Here all the quotes are first selected with the selector and assigned to the quotes variable; then a for loop iterates over each quote and parses its content.

For text, we can see that its class is text, so the .text selector could be used; but its result would be the whole tag node rather than just the body text. To get the text content, ::text is appended. In this case the result is a list of length 1, so the extract_first() method is used to take its first element. For tags, since we want all of them, the extract() method is used to get the whole list.

Taking the first quote as an example, the selection methods and results are described below.

The source code is as follows:

```
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
    <span class="text" itemprop="text">"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."</span>
    <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world">
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
```

The different selectors return the following.

1. quote.css('.text')

```
[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]" data='<span class="text" itemprop="text">"The '>]
```

2. quote.css('.text::text')

```
[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]/text()" data='"The world as we have created it is a pr'>]
```

3. quote.css('.text').extract()

```
['<span class="text" itemprop="text">"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."</span>']
```

4. quote.css('.text::text').extract()

```
['"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."']
```

5. quote.css('.text::text').extract_first()

```
'"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."'
```

So, for text, to get the first element of the result we use the extract_first() method; for tags, to get the list of all results we use the extract() method.
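For comparison, the same fields can also be extracted with XPath selectors instead of CSS selectors. The following is only a rough equivalent sketch (the class matching via contains() is an assumption about the markup, slightly looser than the CSS form used above):

```
def parse(self, response):
    # XPath equivalents of the CSS selectors used above (sketch)
    quotes = response.xpath('//div[contains(@class, "quote")]')
    for quote in quotes:
        text = quote.xpath('.//span[contains(@class, "text")]/text()').extract_first()
        author = quote.xpath('.//small[contains(@class, "author")]/text()').extract_first()
        tags = quote.xpath('.//div[contains(@class, "tags")]/a[contains(@class, "tag")]/text()').extract()
```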

Seventh, use the Item

The Item was defined above; now it is time to use it. An Item can be thought of as a dictionary, but it needs to be instantiated when declared. Then each field of the Item is assigned in turn from the results just parsed, and finally the Item is yielded.


The QuotesSpider is rewritten as follows:

```
import scrapy
from tutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item
```

In this way, all the content of the first page is parsed and assigned to individual QuoteItems.

Eighth, subsequent requests

The above implementation pulls the content from the initial page. So how do we crawl the next page? We need to find information on the current page to generate the next request, then find information on the page of that request to construct the request after it, and so on in a loop, so that the whole site is crawled.


Scroll the page to the bottom, as shown below; there is a Next button. Viewing its source code, we find that it links to /page/2/, i.e. the full URL http://quotes.toscrape.com/page/2/. With this link we can construct the next request.

When constructing a request we need to use scrapy.Request, passing it two arguments, url and callback, described as follows.

  • url: the link to request.

  • callback: the callback function. After the request with this callback has completed and the response has been obtained, the engine passes the response as a parameter to the callback for parsing or for generating the next request. The callback here is the parse() method shown above.

Since parse() is the method that parses text, author and tags, and the structure of the next page is the same as that of the page just parsed, we can reuse the parse() method for the page parsing.

What we need to do next is use the selector to get the link to the next page and generate a request for it. Append the following code to the parse() method:

```
next = response.css('.pager .next a::attr(href)').extract_first()
url = response.urljoin(next)
yield scrapy.Request(url=url, callback=self.parse)
```

The first line of code gets the link to the next page with a CSS selector, that is, the href attribute of the a element inside the next button; this uses the ::attr(href) operation, and then extract_first() is called to get its content.

The second line calls the urljoin() method, which joins a relative URL into an absolute one. For example, the next-page link obtained here is /page/2/, and urljoin() turns it into http://quotes.toscrape.com/page/2/.
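As a quick illustration, the standard library behaves the same way (a sketch; response.urljoin() resolves the relative link against the response's own URL in just this manner):

```
from urllib.parse import urljoin

print(urljoin('http://quotes.toscrape.com/', '/page/2/'))
# -> http://quotes.toscrape.com/page/2/
```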

The third line constructs a new request from the url and callback variables; the callback is still the parse() method. When this request completes, the response is processed by parse again, which yields the parsing result of the second page and generates the request for the third page, and so on. The crawler thus enters a loop until the last page.

With a few lines of code, we can easily implement a fetching loop that pulls down the results of each page.

Now, the rewritten Spider class looks like this:

```
import scrapy
from tutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item

        next = response.css('.pager .next a::attr("href")').extract_first()
        url = response.urljoin(next)
        yield scrapy.Request(url=url, callback=self.parse)
```

Ninth, run

Next, go to the project directory and run the following command:


```
scrapy crawl quotes
```

You can then see Scrapy running; part of the output is shown below.

```
2017-02-19 13:37:20 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
2017-02-19 13:37:20 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-02-19 13:37:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-02-19 13:37:20 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-02-19 13:37:20 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-02-19 13:37:20 [scrapy.core.engine] INFO: Spider opened
2017-02-19 13:37:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-02-19 13:37:20 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-02-19 13:37:21 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2017-02-19 13:37:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2017-02-19 13:37:21 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': u'Albert Einstein',
 'tags': [u'change', u'deep-thoughts', u'thinking', u'world'],
 'text': u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'}
2017-02-19 13:37:21 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': u'J.K. Rowling',
 'tags': [u'abilities', u'choices'],
 'text': u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d'}
...
2017-02-19 13:37:27 [scrapy.core.engine] INFO: Closing spider (finished)
2017-02-19 13:37:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2859,
 'downloader/request_count': 11,
 'downloader/request_method_count/GET': 11,
 'downloader/response_bytes': 24871,
 'downloader/response_count': 11,
 'downloader/response_status_count/200': 10,
 'downloader/response_status_count/404': 1,
 'dupefilter/filtered': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 2, 19, 5, 37, 27, 227438),
 'item_scraped_count': 100,
 'log_count/DEBUG': 113,
 'log_count/INFO': 7,
 'request_depth_max': 10,
 'response_received_count': 11,
 'scheduler/dequeued': 10,
 'scheduler/dequeued/memory': 10,
 'scheduler/enqueued': 10,
 'scheduler/enqueued/memory': 10,
 'start_time': datetime.datetime(2017, 2, 19, 5, 37, 20, ...)}
2017-02-19 13:37:27 [scrapy.core.engine] INFO: Spider closed (finished)
```

This is only part of the output; some of the crawl results in the middle have been omitted.

First, Scrapy prints the current version number and the name of the project being launched. It then prints the settings that were overridden in the current settings.py. Next it prints the Middlewares and Pipelines that are currently enabled; the Middlewares are enabled by default and can be modified in settings.py, and the Pipelines are empty by default and can likewise be configured in settings.py. We will cover them in detail later.

Next comes the output of the crawl results for each page. You can see that the crawler parses and turns pages at the same time until all the content has been crawled, and then terminates.

Finally, Scrapy prints statistics for the whole crawl, such as the number of bytes requested, the number of requests and responses, the completion reason, and so on.

The whole Scrapy program has run successfully. With very little code we completed a crawl of the site's content, which is much simpler than writing a crawler from scratch ourselves.

Tenth, save to a file

After running Scrapy, we only see the output on the console. What if you want to save the results?


No extra code is needed for this; Scrapy's Feed Exports can easily export the results. For example, if we want to save the above results as a JSON file, we can run the following command:

```
scrapy crawl quotes -o quotes.json
```

After the command runs, a quotes.json file appears in the project directory; it contains all the content just crawled, in JSON format.

Alternatively, we can output one line of JSON per Item by using the jl suffix, short for JSON lines, as follows:

```
scrapy crawl quotes -o quotes.jl
```

or

```
scrapy crawl quotes -o quotes.jsonlines
```

Many other output formats are supported as well, such as csv, xml, pickle and marshal; remote output such as FTP and S3 is also supported, and you can additionally customize an ItemExporter to implement other kinds of output.

For example, the following commands correspond to csv, xml, pickle and marshal output and to FTP remote output, respectively:

```
scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle
scrapy crawl quotes -o quotes.marshal
scrapy crawl quotes -o ftp://user:pass@ftp.example.com/path/to/quotes.csv
```

The user name, password, address, and output path must be correctly configured for FTP output. Otherwise, an error message will be displayed.
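The same export can also be configured in the project settings instead of on the command line. A minimal sketch, assuming a Scrapy version of this era where the FEED_URI and FEED_FORMAT settings are available (newer releases use the FEEDS dictionary instead):

```
# settings.py (sketch): write every crawl's items to a JSON lines file.
# FEED_URI / FEED_FORMAT are assumptions about the Scrapy version in use.
FEED_URI = 'quotes.jl'
FEED_FORMAT = 'jsonlines'
```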

Feed Exports provided by Scrapy make it easy to export crawl results to files. For some small projects, this should be enough. But if we want more complex output, such as output to a database, we can use an Item Pipeline to do this.

Eleventh, use Item Pipeline

If we want to perform more complex operations, such as saving the results to MongoDB or filtering out useful Items, we can define an Item Pipeline to do this.


An Item Pipeline is, as its name says, a pipeline for Items. When an Item is generated by the Spider, it is automatically sent to the Item Pipeline for processing. We often use the Item Pipeline for the following operations.

  • Clean up HTML data.

  • Verify the crawl data and check the crawl fields.

  • Check for duplicates and discard duplicate items (a small sketch of this case follows the list).

  • Save the crawl results to the database.
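As an illustration of the duplicate-checking case above, a sketch only and not part of this project's code (it assumes the item has a text field that identifies duplicates):

```
from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):
    """Drop any item whose text has already been seen during this crawl."""

    def __init__(self):
        self.seen_texts = set()

    def process_item(self, item, spider):
        if item['text'] in self.seen_texts:
            raise DropItem('Duplicate item found: %s' % item['text'])
        self.seen_texts.add(item['text'])
        return item
```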

Implementing an Item Pipeline is simple: just define a class and implement the process_item() method. When the Item Pipeline is enabled, Scrapy automatically calls this method. The process_item() method must either return a dictionary or an Item object containing the data, or raise a DropItem exception.

The process_item() method takes two parameters. One is item: every item generated by the spider is passed in as this parameter. The other is spider, the instance of the Spider itself.

Next, we implement an Item Pipeline that truncates any text longer than 50 characters and saves the results to MongoDB.

To do this, modify the pipelines.py file in the project. The content auto-generated by the command earlier can be deleted; instead, add a TextPipeline class as follows:

```
from scrapy.exceptions import DropItem


class TextPipeline(object):
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')
```

This code defines the limit length as 50 in the constructor and implements the process_item() method, whose parameters are item and spider. The method first checks whether the item's text attribute exists; if it does not, a DropItem exception is raised. If it does exist, the method checks whether its length is greater than 50 and, if so, truncates it and appends an ellipsis, then returns the item.

Next, we store the processed item into MongoDB by defining another Pipeline. In the same pipelines.py, we implement another class, MongoPipeline, as follows:

```
import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.__class__.__name__
        self.db[name].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```

The MongoPipeline class implements several other methods defined by the API.

  • from_crawler: a class method, identified by @classmethod; it is a form of dependency injection. Its parameter is crawler, through which we can get every piece of the global configuration. In the global settings.py we define MONGO_URI and MONGO_DB to specify the address and database name needed for the MongoDB connection; after reading this configuration, the method returns a class instance. In other words, this method is mainly defined to obtain the configuration in settings.py.

  • open_spider: called when the spider is opened. Here it mainly performs some initialization, creating the MongoDB connection.

  • close_spider: called when the spider is closed. Here it closes the database connection.

The main process_item() method performs the data insertion.

After defining the TextPipeline and MongoPipeline classes, we need to enable them in settings.py; the MongoDB connection information also needs to be defined there.

We add the following to settings.py:

```
ITEM_PIPELINES = {
    'tutorial.pipelines.TextPipeline': 300,
    'tutorial.pipelines.MongoPipeline': 400,
}
MONGO_URI = 'localhost'
MONGO_DB = 'tutorial'
```

ITEM_PIPELINES is assigned a dictionary whose keys are the class paths of the pipelines and whose values are call priorities. The priorities are numbers; the smaller the number, the earlier the corresponding pipeline is called.

Perform the crawl again with the following command:

```
scrapy crawl quotes
```

After the crawl finishes, a tutorial database with a QuoteItem collection is created in MongoDB, as shown in the figure below.

Long text has been processed and an ellipsis appended; shorter text remains unchanged; author and tags have also been saved accordingly.
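If you want to double-check the stored data without a GUI client, a quick PyMongo query works too (a sketch, assuming MongoDB is running locally on the default port):

```
import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client['tutorial']
# print a few of the saved quotes from the QuoteItem collection
for doc in db['QuoteItem'].find().limit(3):
    print(doc)
```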

Twelfth, source code

The code for this section is available at: https://github.com/Python3WebSpider/ScrapyTutorial.

This article was first published on Cui Qingcai's personal blog: Python3 Web Crawler Development Tutorial.

For more crawler information, please follow my personal wechat official account: Attack Coder
