Item Pipeline means "item pipeline": it is the component that processes Items after they are scraped, and we'll look at its usage in detail in this section.

First let’s look at the structure of the Item Pipeline in Scrapy, as shown below.

On the far left is the Item Pipeline, which is invoked after the Spider has generated an Item. When the Spider finishes parsing a Response, the Item is passed to the Item Pipeline, and the defined Item Pipeline components are called in sequence to carry out a series of processing steps, such as data cleaning and storage.

The main functions of Item Pipeline include the following four points.

  • Clean up HTML data.

  • Validate the crawled data and check the crawled fields.

  • Check for and discard duplicates.

  • Save the crawl results to the database.

1. Core methods

We can customize the Item Pipeline by simply implementing the specified methods, one of which must be implemented: process_item(item, spider).

In addition, there are several more practical methods as follows.

  • open_spider(spider).

  • close_spider(spider).

  • from_crawler(cls, crawler).

We will explain the usage of these methods in detail below.

1. process_item(item, spider)

process_item() is the method that must be implemented; the defined Item Pipeline calls it by default to process each Item. For example, we can process the data or write it to a database here. The method must either return an object of Item type or raise a DropItem exception.

The process_item() method takes two arguments.

  • item is the Item object, that is, the Item being processed.

  • spider is the Spider object, that is, the Spider that generated this Item.

The return value of the process_item() method falls into one of two cases, summarized below; a minimal sketch follows the list.

  • If it returns an Item object, the Item is passed to the process_item() method of the next lower-priority Item Pipeline, and so on until all pipelines have been called.

  • If it raises a DropItem exception, the Item is discarded and no longer processed.
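
For example, a minimal sketch of a pipeline that exercises both behaviors might look like the following (the TitleValidationPipeline name and the title check are illustrative, not part of this section's project):

from scrapy.exceptions import DropItem

class TitleValidationPipeline(object):
    def process_item(self, item, spider):
        # Raising DropItem discards the Item; no later pipeline will see it
        if not item.get('title'):
            raise DropItem('Missing title in %s' % item)
        # Returning the Item passes it on to the next (lower-priority) pipeline
        return item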

2. open_spider(spider)

The open_spider() method is called automatically when the Spider is opened. Here we can do some initialization work, such as opening a database connection. The argument spider is the Spider object that was opened.

3. close_spider(spider)

The close_spider() method is called automatically when the Spider is closed. Here we can do some cleanup work, such as closing the database connection. The argument spider is the Spider object that was closed.

4. from_crawler(cls, crawler)

The from_crawler() method is a class method, marked with @classmethod, and is a form of dependency injection. Through the crawler object we can access all of Scrapy's core components, such as the global settings, and then create a Pipeline instance. The cls argument is the pipeline class itself, and the method returns an instance of that class.

Let’s use an example to further understand the use of Item Pipeline.

2. Objectives of this section

We will take the photography images on 360 Images (image.so.com) as an example and implement three pipelines: MongoDB storage, MySQL storage, and image download storage.

3. Preparation

Make sure that MongoDB and MySQL are installed, and that the Python libraries PyMongo and PyMySQL and the Scrapy framework are installed.

4. Crawl analysis

The target site for this crawl is https://image.so.com. Open the page and switch to the photography section; it displays many beautiful photographs. Open the browser developer tools, switch the filter to the XHR option, and then scroll down the page; a number of Ajax requests appear, as shown in the figure below.

We look at the details of a request and observe the data structure returned, as shown in the figure below.

The return format is JSON. The list field contains the details of each image, including the ID, name, link, thumbnail, and other information for 30 images. If we look at the parameters of the Ajax request, there is a parameter sn that keeps changing, which is clearly the offset: if sn is 30, the first 30 images are returned; if sn is 60, the 31st to 60th images are returned. In addition, the ch parameter is the photography category, listtype is the sort order, and the temp parameter can be ignored.

So all we have to do is change the value of sn.
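
For illustration, a request URL can be assembled from the parameters observed above (the ch, listtype, and sn values shown here are the ones used later in this section):

from urllib.parse import urlencode

# Build one sample Ajax URL; only sn needs to change between requests
params = {'ch': 'photography', 'listtype': 'new', 'sn': 30}
url = 'https://image.so.com/zj?' + urlencode(params)
print(url)  # https://image.so.com/zj?ch=photography&listtype=new&sn=30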

Below we use Scrapy to crawl the images, save the image information to MongoDB and MySQL, and store the images locally.

5. Create a new project

Start by creating a new project with the following command:

scrapy startproject images360

Next, create a new Spider with the following command:

scrapy genspider images images.so.com

We have successfully created a Spider.

6. Construct requests

Next, define the number of pages to crawl. For example, to crawl 50 pages with 30 images per page (1,500 images in total), we can define a MAX_PAGE variable in settings.py by adding the following definition:

MAX_PAGE = 50

Define the start_requests() method to generate 50 requests, as shown below:

def start_requests(self):
    data = {'ch': 'photography', 'listtype': 'new'}
    base_url = 'https://image.so.com/zj?'
    for page in range(1, self.settings.get('MAX_PAGE') + 1):
        data['sn'] = page * 30
        params = urlencode(data)
        url = base_url + params
        yield Request(url, self.parse)

Here we first define the two initial parameters ch and listtype; the sn parameter is generated by the loop. We then use urlencode() to convert the dictionary into the GET parameters of the URL, construct the complete URL, and build and yield the Request.

You also need to import the Request class from scrapy and the urlencode function from urllib.parse, as shown below:

from scrapy import Spider, Request
from urllib.parse import urlencode

Change the ROBOTSTXT_OBEY variable in settings.py to False, otherwise the crawler cannot fetch the pages, as shown below:

ROBOTSTXT_OBEY = False

Run the crawler with the following command, and you can see that all the links are requested successfully:

scrapy crawl images

The following figure shows the result.

All requests have a status code of 200, which indicates that the image information was successfully crawled.

7. Extract information

First we define an Item, called ImageItem, as follows:

from scrapy import Item, Field
class ImageItem(Item):
    collection = table = 'images'
    id = Field()
    url = Field()
    title = Field()
    thumb = Field()

Here we define four fields: the image ID, link, title, and thumbnail. There are also two extra attributes, collection and table, both set to the string 'images', which represent the collection name used for MongoDB storage and the table name used for MySQL storage, respectively.

Next we extract the information from Spider and rewrite the parse() method as follows:

def parse(self, response):
    result = json.loads(response.text)
    for image in result.get('list'):
        item = ImageItem()
        item['id'] = image.get('imageid')
        item['url'] = image.get('qhimg_url')
        item['title'] = image.get('group_title')
        item['thumb'] = image.get('qhimg_thumb_url')
        yield item

We first parse the JSON, iterate over its list field to retrieve the image information one by one, and then assign the values to an ImageItem to generate the Item object.

So we’re done extracting the information.

8. Store information

Next, we need to save the image information to MongoDB and MySQL, and save the image locally.

MongoDB

First, make sure MongoDB is installed and running properly.

First we implement a MongoPipeline to save the information to MongoDB. Add the following class in pipelines.py:

import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        self.db[item.collection].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

Two variables are needed here, MONGO_URI and MONGO_DB, which hold the MongoDB connection string and the database name. We add them to settings.py as follows:

MONGO_URI = 'localhost'
MONGO_DB = 'images360'

This completes a pipeline that saves to MongoDB. The core method here is process_item(), which calls the insert() method of the Collection object directly to insert the data and then returns the Item object.
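
Note that Collection.insert() is deprecated in PyMongo 3 and removed in PyMongo 4. If you are running a recent version of PyMongo, the same process_item() can use insert_one() instead, as in this sketch:

    def process_item(self, item, spider):
        # insert_one() is the modern replacement for the deprecated insert()
        self.db[item.collection].insert_one(dict(item))
        return item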

MySQL

First, make sure MySQL is properly installed and running.

Create a new database named images360. The SQL statement is as follows:

CREATE DATABASE images360 DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci

Next, create a new table containing the four fields id, url, title, and thumb. The SQL statement is as follows:

CREATE TABLE images (id VARCHAR(255) NULL PRIMARY KEY, url VARCHAR(255) NULL, title VARCHAR(255) NULL, thumb VARCHAR(255) NULL)

After executing the SQL statement, we have successfully created the table. Now you can store data in the table.

Next we implement a MysqlPipeline, as follows:

import pymysql

class MysqlPipeline():
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'),
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(self.host, self.user, self.password, self.database, charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        data = dict(item)
        keys = ', '.join(data.keys())
        values = ', '.join(['%s'] * len(data))
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item

As mentioned earlier, the insert here constructs the SQL statement dynamically from the Item's fields, so the same pipeline works for any Item; a quick illustration follows.
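
To make the construction concrete, here is a quick illustration (with hypothetical values) of the SQL string that process_item() builds for an ImageItem:

data = {'id': '12345', 'url': 'https://example.com/a.jpg',
        'title': 'Sample', 'thumb': 'https://example.com/a_t.jpg'}
keys = ', '.join(data.keys())           # 'id, url, title, thumb'
values = ', '.join(['%s'] * len(data))  # '%s, %s, %s, %s'
sql = 'insert into %s (%s) values (%s)' % ('images', keys, values)
# sql == 'insert into images (id, url, title, thumb) values (%s, %s, %s, %s)'
# The actual values are passed separately to cursor.execute(), so the
# MySQL driver handles quoting and escaping of the data.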

Here we need a few more MySQL configurations. Let’s add a few variables to settings.py, as shown below:

MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'images360'
MYSQL_PORT = 3306
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'

This section defines the MySQL address, database name, port, user name, and password.

This completes the MySQL Pipeline.

Image Pipeline

Scrapy provides pipelines dedicated to downloading, covering both file downloads and image downloads. The download principle is the same as for fetching pages, so the download process is asynchronous and very efficient. Let's take a look at the implementation.

The official documentation is at: https://doc.scrapy.org/en/latest/topics/media-pipeline.html.

First, define the path where the files will be stored. This requires an IMAGES_STORE variable; add the following to settings.py:

IMAGES_STORE = './images'

Here we define the path as the images subfolder under the current path, that is, downloaded images will be saved to the images folder of this project.

The built-in ImagesPipeline reads the Item's image_urls field by default, assumes it is a list, iterates over it, and fetches each URL to download the images.

However, in the Item we generate, the image link is not stored in an image_urls field, and it is a single URL rather than a list. So to download the images we need to redefine part of the download logic: define a custom ImagePipeline that inherits from the built-in ImagesPipeline and overrides a few methods.

We define ImagePipeline as follows:

from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        return item

    def get_media_requests(self, item, info):
        yield Request(item['url'])

Here we implement ImagePipeline, inheriting from Scrapy's built-in ImagesPipeline and overriding the following methods.

  • get_media_requests(). Its first argument, item, is the crawled Item object. We take its url field and generate a Request object from it. The Request is added to the scheduling queue, waits to be scheduled, and is then downloaded.

  • file_path(). Its first argument, request, is the Request object of the current download. This method returns the file name to save to; here we use the last segment of the image URL as the file name, splitting the URL with split('/') and taking the last part. The downloaded image is saved under the name returned by this method.

  • item_completed(), which is called when the downloads for a single Item have finished. Since not every image downloads successfully, we need to inspect the download results and drop the images that failed; if an image failed to download, there is no need to save that Item to the database. The method's first parameter, results, is the download result for the Item, a list whose elements are tuples containing the success or failure information of each download. Here we iterate over the results to collect the paths of all successful downloads. If the list is empty, the Item's image failed to download and we raise a DropItem exception to discard the Item; otherwise we return the Item, indicating it is valid.

At this point, the definitions of the three Item Pipelines are complete. Finally, enable them by modifying ITEM_PIPELINES in settings.py, as shown below:

ITEM_PIPELINES = {
    'images360.pipelines.ImagePipeline': 300,
    'images360.pipelines.MongoPipeline': 301,
    'images360.pipelines.MysqlPipeline': 302,
}

Note the priority numbers and the order of the calls here: pipelines with smaller numbers are called first, so ImagePipeline runs before the storage pipelines. We call ImagePipeline first in order to filter the Items by download result; Items whose downloads fail are dropped directly and are never saved to MongoDB or MySQL. The two storage pipelines are then called, which ensures that the image information stored in the databases corresponds to successfully downloaded images.

Next run the program and perform the crawl as follows:

scrapy crawl images

The crawler crawls and downloads at the same time. The download speed is very fast. The corresponding output log is shown in the figure below.

Check the local images folder and see that the images have been downloaded successfully, as shown below.

Check MySQL, the downloaded image information has been saved successfully, as shown in the following figure.

Check MongoDB. The downloaded images are also saved, as shown in the following figure.

In this way, we can successfully download the image and store the image information in the database.

9. Code for this section

This section of code address is: https://github.com/Python3WebSpider/Images360.

10. Conclusion

Item Pipeline is an important component of Scrapy, and data storage is almost entirely implemented through this component. Please grasp this content carefully.
