Introduction to Scrapy

This article introduces Scrapy through a simple project, walking through the full scraping process so that you get a general sense of Scrapy's basic usage and the principles behind it.

Before starting, this article assumes that Scrapy has been installed successfully. If not, refer to the previous installation lesson.

The tasks to complete in this section are:

  • Create a Scrapy project

  • Create a Spider to crawl the site and process the data

  • Export the scraped content via the command line

Create a project

Before scraping, you need to create a Scrapy project. It can be created with the following command:

scrapy startproject tutorial

You can run this command in any folder; if a permission problem occurs, run it with sudo. The command creates a folder named tutorial with the following structure:

scrapy.cfg          # Scrapy deployment configuration file
tutorial/           # the project's module; the project is imported from here
    __init__.py
    __pycache__/
    items.py        # definition of Items, defines the data structures to scrape
    middlewares.py  # definition of Middlewares, defines the crawling middlewares
    pipelines.py    # definition of Pipelines, defines the data pipelines
    settings.py     # project configuration file
    spiders/        # folder that holds the Spiders
        __init__.py
        __pycache__/

Create the Spider

A Spider is a Class that you define, and Scrapy uses it to pull content from a web page and parse the results. However, this Class must inherit from Scrapy’s Spider Class scrapy.Spider, and you must define the name of the Spider and its starting request, as well as how to handle the results of the crawl.

A Spider can also be generated with a command. For example, to generate the Quotes Spider, execute the following commands:

cd tutorial
scrapy genspider quotes quotes.toscrape.com

First go to the tutorial folder you just created, and then execute the genspider command. The first argument is the name of the Spider and the second argument is the domain name of the web site. When this is done, you will see an extra quotes.py added to the spiders folder. This is the Spider you just created, which looks like this:

# -*- coding: utf-8 -*-
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass

You can see that the Spider has three attributes, name, allowed_domains and start_urls, and one method, parse.

  • The name, which is unique within each project, is used to distinguish different spiders.

  • Allowed_domains, the domains the Spider is allowed to crawl; initial or subsequent request links that are not under these domains will be filtered out.

  • Start_urls, the list of URLs the Spider crawls at startup; it defines the initial requests.

  • Parse, a method of the Spider. By default, when the requests built from the links in start_urls have been downloaded, the returned response is passed to this method as its only argument. The method parses the response, extracts data, or generates further requests for processing (see the sketch below).
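
To make this default behavior concrete, here is a rough sketch of what the inherited logic amounts to. It is not the exact Scrapy source (the real default also marks the initial requests so they are not filtered as duplicates), and the class name here is only illustrative:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ['http://quotes.toscrape.com/']

    def start_requests(self):
        # Roughly what the base Spider class already does for us:
        # build one Request per start URL and route its response to parse.
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Receives each downloaded response as its only argument.
        pass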

Create the Item

Item is a container for the scraped data. It is used much like a dictionary; although a plain dictionary could be used instead, Item adds an extra protection mechanism that guards against spelling mistakes and undefined fields.

To create an Item, inherit the scrapy.Item class and define class attributes of type scrapy.Field. Looking at the target website, the content we want to scrape includes text, author and tags.

So the Item can be defined as follows; change items.py to this:

import scrapy

class QuoteItem(scrapy.Item):

    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Three fields are defined, which we will use for the next crawl.
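
As a quick, hypothetical illustration of the protection mechanism mentioned above (this snippet is not part of the project code), a QuoteItem accepts only the fields declared on it:

from tutorial.items import QuoteItem

item = QuoteItem()
item['text'] = 'Some quote text'   # fine: text is a declared Field
print(item['text'])                # reads back just like a dictionary
# item['auther'] = 'Someone'       # would raise KeyError for the undeclared (misspelled)
#                                  # field, whereas a plain dict would silently accept it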

Parse the Response

As explained above, the response parameter of the parse method is the result of crawling the links in start_urls. So in the parse method we can directly parse the content contained in the response: inspect the source code of the crawled result, analyze it further, or find links in it to build the next request.

Looking at the website, we can see that the page contains both the results we want and a link to the next page, so both need to be handled.

First, let's look at the structure of the web page. Each page has multiple blocks whose class is quote, and each block contains text, author and tags. So the first step is to find all the quotes, and then extract the content of each one.

Extraction can be done with either CSS selectors or XPath selectors. Here we use CSS selectors. The parse method is rewritten as follows:

def parse(self, response):
    quotes = response.css('.quote')
    for quote in quotes:
        text = quote.css('.text::text').extract_first()
        author = quote.css('.author::text').extract_first()
        tags = quote.css('.tags .tag::text').extract()

Here we use CSS selector syntax. First we select all the quotes and assign them to the quotes variable, then the for loop iterates over each quote and parses its content.

For text, we observe that its class is text, so it can be selected with .text. The selection result is the whole element, tag included; to get only the text inside it, append ::text. The result is then a list of length 1, so the extract_first method is used to take the first element, while for tags, since we need all of them, extract is used.

Taking the first quote as an example, the selection methods and their results are listed below:

  • The source code

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
    <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
    <span>by <small class="author" itemprop="author">Albert Einstein</small>
    <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world">
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

  • quote.css('.text')

[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]" data='<span class="text" itemprop="text">“The'>]

  • quote.css('.text::text')

[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]/text()" data='“The world as we have created it is a pr'>]

  • quote.css('.text').extract()

['<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>']

  • quote.css('.text::text').extract()

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']

  • quote.css('.text::text').extract_first()

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

So the extract_first() method is used for text to get the first element, and extract() is used for tags to get all elements.
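
As a side note, extract_first() also accepts a default value so that a missing element does not come back as None, and newer Scrapy releases provide get() and getall() as equivalent, more readable aliases. A hypothetical snippet for illustration (check the behavior against your own Scrapy version):

text = quote.css('.text::text').extract_first(default='')  # '' instead of None when nothing matches
text = quote.css('.text::text').get()                       # equivalent to extract_first()
tags = quote.css('.tags .tag::text').getall()               # equivalent to extract()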

Using the Item

Now that the Item is defined, it is time to use it. It can be treated much like a dictionary, except that it has to be instantiated first. We then assign the results we just parsed to the Item's fields and finally yield the Item.

The QuotesSpider is rewritten as follows:

import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item

In this way, all the content of the first page is parsed and assigned to QuoteItems.

The subsequent Request

The code above scrapes the content of the initial page, but how do we keep fetching the next page? We need to find, on the current page, the information required to generate the next request, then find the same information on the page that request returns, and so on, so that the whole site is crawled.

Looking at the bottom of the page just crawled, there is a Next button. Inspecting its source code, we find that it links to /page/2/, that is, the full link is http://quotes.toscrape.com/page/2/. From this link we can construct the next request.

The request is constructed with scrapy.Request. Here we pass two parameters, url and callback:

  • Url, the link to request.

  • Callback, the callback function. When the request has finished and the response has been downloaded, the response is passed as an argument to this callback, which parses the content or generates the next request, just like the parse method above.

Since parse is used to parse text, author, and tags, and the structure of the next page is the same as the structure of the page we just parsed, we can use the parse method again.

So all we need to do is use a selector to get the link to the next page and generate a new request from it. Append the following code to the parse method:

next = response.css('.pager .next a::attr(href)').extract_first()
url = response.urljoin(next)
yield scrapy.Request(url=url, callback=self.parse)

The first line of code uses a CSS selector to get the link to the next page: ::attr(href) selects the href attribute of the hyperlink, and the extract_first method is then called to get its value.

The second line calls the urljoin method, which turns a relative URL into an absolute one. For example, the next-page address obtained here is /page/2/, and after processing with urljoin the result is http://quotes.toscrape.com/page/2/.
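
The same joining behavior can be reproduced with the standard library, which is roughly what response.urljoin delegates to, using the response's own URL as the base (a standalone snippet for illustration only):

from urllib.parse import urljoin

print(urljoin('http://quotes.toscrape.com/', '/page/2/'))
# http://quotes.toscrape.com/page/2/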

A new request is then constructed from the url and callback. The callback is still the parse method, so when this request completes, its response is handled by parse again, which yields the parsed results of the second page and generates a request for the page after it, the third page. The crawl thus loops on until the last page is reached.

With a few lines of code, we can easily implement a fetching loop that pulls down the results of each page.

The entire Spider class now looks like this:

import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item

        next = response.css('.pager .next a::attr("href")').extract_first()
        url = response.urljoin(next)
        yield scrapy.Request(url=url, callback=self.parse)

Let’s try it and see what happens. Go to the directory and run the following command:

scrapy crawl quotes

You can then see Scrapy's run results:

2017-02-19 13:37:20 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: tutorial)
2017-02-19 13:37:20 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
2017-02-19 13:37:20 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-02-19 13:37:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-02-19 13:37:20 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-02-19 13:37:20 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-02-19 13:37:20 [scrapy.core.engine] INFO: Spider opened
2017-02-19 13:37:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-02-19 13:37:20 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-02-19 13:37:21 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2017-02-19 13:37:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2017-02-19 13:37:21 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': u'Albert Einstein',
 'tags': [u'change', u'deep-thoughts', u'thinking', u'world'],
 'text': u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'}
2017-02-19 13:37:21 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': u'J.K. Rowling',
 'tags': [u'abilities', u'choices'],
 'text': u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d'}
...
2017-02-19 13:37:27 [scrapy.core.engine] INFO: Closing spider (finished)
2017-02-19 13:37:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2859,
 'downloader/request_count': 11,
 'downloader/request_method_count/GET': 11,
 'downloader/response_bytes': 24871,
 'downloader/response_count': 11,
 'downloader/response_status_count/200': 10,
 'downloader/response_status_count/404': 1,
 'dupefilter/filtered': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 2, 19, 5, 37, 27, 227438),
 'item_scraped_count': 100,
 'log_count/DEBUG': 113,
 'log_count/INFO': 7,
 'request_depth_max': 10,
 'response_received_count': 11,
 'scheduler/dequeued': 10,
 'scheduler/dequeued/memory': 10,
 'scheduler/enqueued': 10,
 'scheduler/enqueued/memory': 10,
 'start_time': datetime.datetime(2017, 2, 19, 5, 37, 20, 321557)}
2017-02-19 13:37:27 [scrapy.core.engine] INFO: Spider closed (finished)

Only part of the run output is shown here; some of the intermediate crawl results have been omitted.

First, Scrapy prints the current version number and the project being started. Then it outputs some of the settings overridden in settings.py, followed by the middlewares and item pipelines currently enabled. The middlewares are enabled by default and can be changed in settings.py; the item pipelines are empty by default and are likewise configured in settings.py, which we will do later.

Next comes the output of the results scraped from each page. You can see the Spider parsing and turning pages at the same time until all the content has been scraped, and then it terminates.

Finally, Scrapy outputs statistics for the whole run, such as the number of request bytes, the request count, the response count, the finish reason, and so on.

This completes the entire Scrapy program.

As you can see, a few very simple lines of code are enough to crawl a site's content. Compared with writing a crawler from scratch bit by bit, isn't this much simpler?

Save to file

After running Scrapy, we only see the output on the console. What if we want to save it?

The simplest option, for example, is to save the results as a JSON file.

This does not require any extra code. Scrapy provides Feed Exports, which can easily export the scraped results. For example, to save the results as a JSON file, execute the following command:

scrapy crawl quotes -o quotes.json

After the command finishes, a quotes.json file appears in the project folder. It contains everything just scraped, in valid JSON format, with the items wrapped in square brackets as a JSON array.

You can also output one JSON object per Item, each on its own line (the JSON Lines format). The result is not wrapped in brackets, and each line corresponds to one Item; a sample is shown after the commands below.

scrapy crawl quotes -o quotes.jl

or

scrapy crawl quotes -o quotes.jsonlines
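
For reference, the difference between the two formats looks roughly like this, using the first items shown in the log above as illustrative content (exact escaping and field order may vary by Scrapy version):

quotes.json, a single JSON array:

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
...
]

quotes.jl, one JSON object per line:

{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]}
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["abilities", "choices"]}
...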

The output also supports formats such as csv, xml, pickle and marshal, as well as remote destinations such as FTP and S3; other outputs can be implemented with a custom ItemExporter.

For example, the following commands correspond to csv, xml, pickle and marshal output, plus FTP remote output, respectively:

scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle
scrapy crawl quotes -o quotes.marshal
scrapy crawl quotes -o ftp://user:[email protected]/path/to/quotes.csv

FTP output requires the user name, password, host address and output path to be configured correctly; otherwise an error will be reported.
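
As an alternative to the -o flag, the same exports can be configured in settings.py. A sketch, assuming a Scrapy version recent enough to support the FEEDS setting (older versions use the FEED_URI and FEED_FORMAT settings instead):

FEEDS = {
    'quotes.json': {'format': 'json'},
    'quotes.csv': {'format': 'csv'},
}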

The Feed Exports that Scrapy provides can easily export scraping results to files, which is sufficient for small projects. If more complex output is needed, such as writing to a database, an Item Pipeline handles it more conveniently.

Using the Item Pipeline

At this point, you can successfully scrape and save the results. If you want to do something more complex, such as saving the results to a database like MongoDB or filtering out certain Items, you can define an Item Pipeline to do it.

Item Pipeline means project pipeline. When the Spider generates an Item, it is automatically sent to the Item Pipeline for processing. We commonly use it for the following operations:

  • Cleaning up HTML data

  • Verify the crawl data and check the crawl fields

  • Check duplicates and discard duplicates

  • Store the crawl results in the database

Implementing an Item Pipeline is simple: define a class and implement the process_item method. When the Item Pipeline is enabled, this method is called automatically and must either return a dictionary or Item object containing the data, or raise a DropItem exception.

This method takes two arguments: item, the Item passed in each time the Spider generates one, and spider, the Spider instance.

Next, let's implement an Item Pipeline that truncates Items whose text is longer than 50 characters, and then save the results to MongoDB.

Modify the project's pipelines.py file. The automatically generated content can be deleted; add a TextPipeline class with the following contents:

from scrapy.exceptions import DropItem

class TextPipeline(object):
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')

The constructor defines the length limit, 50. The process_item method checks whether the item's text attribute exists; if it does not, a DropItem exception is raised. If it does exist, the method checks whether its length exceeds 50; if it is longer, the text is truncated and an ellipsis is appended, and the item is then returned.

Next, we will store the processed items into MongoDB. If you haven’t already installed MongoDB, please install it first.

Install pymongo, the MongoDB driver for Python, using pip:

pip3 install pymongo

Next, define another class, MongoPipeline, in pipelines.py to save the data to MongoDB:

import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

This class implements several other methods defined by the Item Pipeline API:

  • From_crawler, a class method marked with @classmethod, is a form of dependency injection. Its parameter is crawler, through which we can access the project's global settings, such as the MONGO_URI and MONGO_DB defined in settings.py, and then return an instance of the class. So this method is defined mainly to fetch configuration from settings.py.

  • Open_spider, called when the Spider is opened; here we do some initialization work, such as opening the database connection.

  • Close_spider, called when the Spider is closed; here we close the database connection.

The most important method, process_item, performs the data insertion.

Now that TextPipeline and MongoPipeline are defined, we need to enable them in settings.py, and we also need to define the MongoDB connection information.

Add the following to settings.py:

ITEM_PIPELINES = {
   'tutorial.pipelines.TextPipeline': 300,
   'tutorial.pipelines.MongoPipeline': 400,
}
MONGO_URI='localhost'
MONGO_DB='tutorial'

ITEM_PIPELINES is a dictionary whose keys are the Pipeline class paths and whose values are their calling priorities; the smaller the number, the earlier the corresponding pipeline is called.

Once defined, perform the crawl again with the following command:

scrapy crawl quotes

After the crawl finishes, you can see that a tutorial database with a QuoteItem collection has been created in MongoDB.
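
To double-check the stored data, you can query MongoDB directly. A hypothetical snippet, assuming MongoDB is running locally with the MONGO_URI and MONGO_DB values configured above:

import pymongo

client = pymongo.MongoClient('localhost')
db = client['tutorial']
print(db['QuoteItem'].count_documents({}))  # number of stored quotes
print(db['QuoteItem'].find_one())           # one example document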

So far, we’ve done a very simple introduction to Scrapy by grabbing Quotes, but that’s just the tip of the iceberg. There’s a lot more to explore.

The source code

This section code: github.com/Germey/Scra…