This is the first day of my participation in the November Gwen Challenge. For details, see: the last Gwen Challenge of 2021.

Preface

“Back to the beginning, standing in front of the mirror.”

Originally, this article was intended to cover Spider middleware, but since that topic involves Item, I will cover Item first, then Pipeline, and come back to Spider middleware afterwards.

Item and Pipeline

As always, let's start with the architecture diagram.

As you can see from the architecture diagram, when the downloader gets the web response from the site, it passes it back to the Spider through the engine. In our program, we parse the response with CSS or XPath rules and construct the result as an Item object.

The Item and response contents are processed by the Spider middleware as they are passed to the engine. Finally, the Pipeline persists the Item passed by the engine.

Summary: Item is a data object, Pipeline is a data Pipeline.

Item

Item is simply a class that contains data fields. The purpose is to allow you to structure the target data parsed from the web page. Note that we usually define the structure of the Item first, then construct it in the program and process it in the pipeline.

As before, we'll use Douluo Continent (Douluo Dalu) as the example.

The Item class definition

Items are defined in items.py. Let's start by looking at the Item definition template in that file.

As shown in the figure, the generated file is just a template. There are two main points (a sketch follows the list below):

  1. The Item class inherits from scrapy.Item
  2. Each field is defined as field_name = scrapy.Field()
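
For reference, here is a minimal sketch of what that generated template roughly looks like (the generated class name is derived from the project name; ScrapyDemo is assumed from the examples below):

import scrapy


class ScrapyDemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass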

Here, the Item is defined according to the data fields we need to collect from the Douluo Continent page.

class DouLuoDaLuItem(scrapy.Item):
    name = scrapy.Field()
    alias = scrapy.Field()
    area = scrapy.Field()
    parts = scrapy.Field()
    year = scrapy.Field()
    update = scrapy.Field()
    describe = scrapy.Field()

Item data construction

Once we have defined the Item class, we need to construct it in the spider, that is, populate it with data.

# Import the Item class from the ScrapyDemo project
from ScrapyDemo.items import DouLuoDaLuItem

# Construct the Item object
item = DouLuoDaLuItem()
item['name'] = name
item['alias'] = alias
item['area'] = area
item['parts'] = parts
item['year'] = year
item['update'] = update
item['describe'] = describe

As shown above, an Item data object is constructed.
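
As a small aside (my own sketch, not code from this project): an Item behaves much like a dict, but only the declared fields are accepted, which is what keeps the scraped data structured.

# Fields can also be set via keyword arguments at construction time
item = DouLuoDaLuItem(name='Douluo Continent', area='China')

print(item['name'])   # dict-style access to a declared field
print(dict(item))     # an Item can be converted to a plain dict

# Assigning a field that was not declared in the Item class raises a KeyError
# item['foo'] = 'bar'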

Send the Item to the Pipeline

After the Item object is constructed, one more line of code is needed to pass the Item to the Pipeline.

yield item

So, Pipeline, here I am.

Pipeline

A Pipeline is a data pipeline that processes Items, mainly for persistence. In plain English, it writes the data to files or databases of various kinds.

Functions

The official documentation lists the typical uses of a Pipeline as:

  1. Cleaning up HTML data
  2. Validating data (checking that an item contains certain fields)
  3. Checking for duplicates (and dropping them)
  4. Saving the crawled results to a database

In actual development, scenario 4 is by far the most common.
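
As a quick illustration of uses 2 and 3, here is a minimal hypothetical pipeline (not part of this project) that validates the name field and drops duplicates; DropItem is Scrapy's standard way to discard an item:

from scrapy.exceptions import DropItem


class DuplicatesPipeline:
    # Hypothetical example: validate the name field and drop duplicate items
    def __init__(self):
        self.seen_names = set()

    def process_item(self, item, spider):
        if not item.get('name'):
            raise DropItem('Missing name in %s' % item)
        if item['name'] in self.seen_names:
            raise DropItem('Duplicate item: %s' % item['name'])
        self.seen_names.add(item['name'])
        return item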

Define Pipeline

Pipelines are defined in pipelines.py. Again, let's look at the template Scrapy generates.
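
For reference, a rough sketch of that generated template (the class name again comes from the project name, and the exact boilerplate varies slightly between Scrapy versions):

class ScrapyDemoPipeline:
    def process_item(self, item, spider):
        return item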

As shown, only the process_item() method is implemented to handle the passed Item. But in real development, we typically implement three methods:

  1. __init__: used to construct object properties, such as database connections
  2. from_crawler: a class method used to read settings and initialize variables
  3. process_item: core logic code that handles Item

Here, we will customize a Pipeline to put Item data into the database.

Configure the Pipeline

As with middleware, Pipelines are configured in settings.py, using the ITEM_PIPELINES setting.

ITEM_PIPELINES = {
    'ScrapyDemo.pipelines.CustomDoLuoDaLuPipeline': 300
}

The key is the full path of the Pipeline class, and the value is its priority: the smaller the number, the higher the priority. An Item is passed through each enabled Pipeline in priority order, so it can be processed by each of them in turn.
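
For example (FirstPipeline and SecondPipeline are hypothetical names), with the configuration below an Item goes through FirstPipeline before SecondPipeline, provided each process_item() returns the item:

ITEM_PIPELINES = {
    'ScrapyDemo.pipelines.FirstPipeline': 100,   # runs first (smaller number, higher priority)
    'ScrapyDemo.pipelines.SecondPipeline': 300,  # runs second
}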

For the sake of intuition, I will later configure the Pipeline locally in the code.

Connect the Pipeline to the database

1. Configure database properties

We first configure the database's host, username, password, and database name in settings.py so that they can be read directly in the pipeline to create connections.

MYSQL_HOST = '175.27.xx.xx'
MYSQL_DBNAME = 'scrapy'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'root'

2. Define pipeline

The pymysql driver is used to connect to the database, and Twisted's adbapi is used to perform the database operations asynchronously. Asynchrony is the key here; roughly speaking, asynchrony means efficiency and speed.

import pymysql
from twisted.enterprise import adbapi
from ScrapyDemo.items import DouLuoDaLuItem


class CustomDoLuoDaLuPipeline(object):

    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_crawler(cls, crawler):
        # Read the configuration from settings
        params = dict(
            host=crawler.settings['MYSQL_HOST'],
            db=crawler.settings['MYSQL_DBNAME'],
            user=crawler.settings['MYSQL_USER'],
            passwd=crawler.settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=False
        )
        # Create a connection pool, with pymysql as the connection module
        dbpool = adbapi.ConnectionPool('pymysql', **params)
        return cls(dbpool)

    def process_item(self, item, spider):
        if isinstance(item, DouLuoDaLuItem):
            query = self.dbpool.runInteraction(self.do_insert, item)
            query.addErrback(self.handle_error, item, spider)
        return item

    # Callback that performs the database insert
    def do_insert(self, cursor, item):
        sql = 'insert into DLDLItem(name, alias, area, parts, year, `update`, `describe`) values (%s, %s, %s, %s, %s, %s, %s)'
        params = (item['name'], item['alias'], item['area'], item['parts'], item['year'], item['update'], item['describe'])
        cursor.execute(sql, params)

    # Errback called when a database operation fails
    def handle_error(self, failure, item, spider):
        print(failure)

Here are a few points to highlight in the code above.

  1. Why does process_item() use isinstance to check the Item's type?

This handles the scenario where multiple Item types pass through the same Pipeline and need different methods to perform their database operations. As shown below:

Different Items have different structures, which means different SQL is needed to insert them into the database. Therefore, the Item type is checked first, and then the corresponding method is called to process it.
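
A minimal sketch of that dispatch inside the pipeline, assuming a hypothetical second item class OtherItem and a corresponding do_insert_other() method:

    def process_item(self, item, spider):
        # Dispatch to a different insert method depending on the Item type
        # (OtherItem and do_insert_other are hypothetical)
        if isinstance(item, DouLuoDaLuItem):
            query = self.dbpool.runInteraction(self.do_insert, item)
        elif isinstance(item, OtherItem):
            query = self.dbpool.runInteraction(self.do_insert_other, item)
        else:
            return item
        query.addErrback(self.handle_error, item, spider)
        return item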

  2. Why do the update and describe fields in the SQL use backquotes?

update, describe, and select are all MySQL keywords, so if you want to use these words as field names, you need to wrap them in backquotes both in the SQL statements and in the table definition, otherwise you will get an error.

3. Yield the Item into the pipeline

What's coming up is the familiar code again, using the Item structure defined in items.py above. The Pipeline is also configured locally within the code; if that part is unclear, see the second article in this series.

import scrapy
from ScrapyDemo.items import DouLuoDaLuItem

class DouLuoDaLuSpider(scrapy.Spider):
    name = 'DouLuoDaLu'
    allowed_domains = ['v.qq.com']
    start_urls = ['https://v.qq.com/detail/m/m441e3rjq9kwpsc.html']

    custom_settings = {
        'ITEM_PIPELINES': {
            'ScrapyDemo.pipelines.CustomDoLuoDaLuPipeline': 300
        }
    }

    def parse(self, response):
        name = response.css('h1.video_title_cn a::text').extract()[0]
        common = response.css('span.type_txt::text').extract()
        alias, area, parts, year, update = common[0], common[1], common[2], common[3], common[4]
        describe = response.css('span._desc_txt_lineHight::text').extract()[0]
        item = DouLuoDaLuItem()
        item['name'] = name
        item['alias'] = alias
        item['area'] = area
        item['parts'] = parts
        item['year'] = year
        item['update'] = update
        item['describe'] = describe
        print(item)
        yield item

4. Program testing

Start the program; the console prints the list of enabled pipelines as well as the contents of the item. After the program finishes, we check the database to see whether the data has been written.

As shown in the figure, the data can already be found in the DLDLItem table of the database.

Conclusion

Item and Pipeline make it easy to structure crawled data and persist it. Multiple Pipelines can be defined and configured; after an Item is yielded, the data is stored in files or databases through them.

There is also ItemLoader, which I haven't used much; I'll cover it later as an extension. Looking forward to our next meeting.