
1. Media Pipeline

An abstract pipeline that implements the image thumbnail generation logic.

In short: a set of methods dedicated to downloading media files and generating images from them.

All media pipelines implement the following features:

1. Avoid re-downloading media that was downloaded recently
2. Specify the storage location (a file system directory, an Amazon S3 bucket, or a Google Cloud Storage bucket)

The image pipeline has some additional image-processing functions (the settings that control them are sketched just below):

1. Convert all downloaded images to a common format (JPG) and mode (RGB)
2. Generate thumbnail images
3. Check the width/height of the images for minimum-size filtering
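These extras are controlled entirely from settings.py. A minimal sketch using the standard image-pipeline settings (the values here are only examples):

# settings.py -- example values only, adjust to your project
IMAGES_EXPIRES = 90              # don't re-download images fetched within the last 90 days
IMAGES_THUMBS = {                # generate thumbnails under thumbs/small/ and thumbs/big/
    'small': (50, 50),
    'big': (270, 270),
}
IMAGES_MIN_HEIGHT = 110          # silently drop images smaller than 110 x 110 pixels
IMAGES_MIN_WIDTH = 110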

(1) A quick look at the source code:


There is too much of it to paste here; go study it yourself!

(2) Build our own media pipeline for image storage (override part of the framework's media pipeline class methods to customize the names the images are saved under):

1. In the spider file, extract the image URLs and yield them in an item.
2. In items.py, define the special field name: image_urls = scrapy.Field()
3. In settings.py, set the IMAGES_STORE path; if the path does not exist, the framework will create it for us.
4. To use the default pipeline, enable 'scrapy.pipelines.images.ImagesPipeline': 60 in settings.py; to build our own pipeline, inherit from ImagesPipeline and enable the corresponding pipeline in settings.py.
5. Methods that can be overridden according to the official documentation: get_media_requests, item_completed

1. Crawler file:

# -*- coding: utf-8 -*-
import scrapy

import re
from ..items import BaiduimgPipeItem
import os
class BdimgSpider(scrapy.Spider):
    name = 'bdimgpipe'
    allowed_domains = ['image.baidu.com']
    start_urls = ['https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&fm=index&pos=history&word=%E7%8C%AB%E5%92%AA']

    def parse(self, response):
        text = response.text
        # extract the thumbnail URLs from the page source
        image_urls = re.findall('"thumbURL":"(.*?)"', text)
        item = BaiduimgPipeItem()
        item["image_urls"] = image_urls
        yield item

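With that in place, the spider can be run from the project directory with scrapy crawl bdimgpipe; its parse callback yields a single item whose image_urls field carries every thumbnail URL found on the page, which is exactly what the pipeline configured below will consume.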

2. Set the special field name in the items.py file:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BaiduimgPipeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()

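The name image_urls matters: it is the field the image pipeline reads URLs from by default, and it writes its download results back to a field called images. If you would rather use other field names, Scrapy lets you point the pipeline at them; a sketch of the relevant settings (only needed when you deviate from the defaults):

# settings.py -- optional, only if you don't use the default field names
IMAGES_URLS_FIELD = 'image_urls'   # field the pipeline reads image URLs from
IMAGES_RESULT_FIELD = 'images'     # field the pipeline stores download results in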
3. In settings.py, enable the self-built pipeline and set the file storage path:
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # 'baiduimg.pipelines.BaiduimgPipeline': 300,
   'baiduimg.pipelines.BdImagePipeline': 40,
   # 'scrapy.pipelines.images.ImagesPipeline': 60,
}

# IMAGES_STORE = r'C:\my\pycharm_work\crawler_eight_class\baiduimg\baiduimg\dir0'
IMAGES_STORE = 'C:/my/pycharm_work/crawler/eight_class_ImagesPipeline/baiduimg/baiduimg/dir3'
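As mentioned at the top, IMAGES_STORE does not have to be a local directory: Scrapy also accepts Amazon S3 and Google Cloud Storage URIs. A sketch with placeholder bucket names:

# settings.py -- remote storage alternatives (bucket names are placeholders)
# IMAGES_STORE = 's3://my-example-bucket/images/'   # Amazon S3
# IMAGES_STORE = 'gs://my-example-bucket/images/'   # Google Cloud Storage (also set GCS_PROJECT_ID)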

4. Write pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.http import Request
import os

from scrapy.pipelines.images import ImagesPipeline        # import the media pipeline class we build on
from .settings import IMAGES_STORE


class BdImagePipeline(ImagesPipeline):
    image_num = 0
    print("Spider's media pipeline class")

    def get_media_requests(self, item, info):  # can be overridden
        # Turn each image URL into a Request and hand it to the engine
        '''
        Equivalent loop form:
        req_list = []
        for x in item.get(self.images_urls_field, []):
            req_list.append(Request(x))
        return req_list
        '''
        return [Request(x) for x in item.get(self.images_urls_field, [])]

    def item_completed(self, results, item, info):
        images_path = [x["path"] for ok, x in results if ok]

        for image_path in images_path:  # rename: the first argument is the image's original path, the second is the custom path
            # IMAGES_STORE + "/" + image_path is the original absolute path of the saved image;
            # the second argument is the new absolute path of the renamed image (also under IMAGES_STORE)
            os.rename(IMAGES_STORE + "/" + image_path, IMAGES_STORE + "/" + str(self.image_num) + ".jpg")
            self.image_num += 1

        return item  # pass the item on to any later pipeline stages


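Renaming files in item_completed works, but as a side note, ImagesPipeline also exposes a file_path() method that decides the storage path (relative to IMAGES_STORE) before the file is written, so overriding it avoids the os.rename step entirely. A minimal sketch of that alternative, not the approach used in this article; the class name and the cats/ subfolder are made up, and a deterministic name derived from the URL is used because file_path() can be called more than once per file (on recent Scrapy versions the signature also accepts an item keyword argument):

import hashlib

from scrapy.pipelines.images import ImagesPipeline


class BdImageNamePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Return a path relative to IMAGES_STORE; the pipeline saves the file there directly.
        name = hashlib.md5(request.url.encode()).hexdigest()  # stable name per URL
        return f"cats/{name}.jpg"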

Here is the method we overrode, as it appears in the source code:

def item_completed(self, results, item, info):
    if isinstance(item, dict) or self.images_result_field in item.fields:
        item[self.images_result_field] = [x for ok, x in results if ok]
    return item

url – the URL the file was downloaded from. This is the URL of the request returned by the get_media_requests() method.

path – the path where the file was stored (relative to FILES_STORE, or IMAGES_STORE for images).

checksum – the MD5 hash of the image contents.

This is a typical value for the results parameter:

[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
]

In item_completed(), the comprehension [x for ok, x in results if ok] keeps only the results whose ok flag is True, i.e. the successfully downloaded files.

Likewise, images_path = [x["path"] for ok, x in results if ok] in our pipeline pulls out just the path value of each successful result.
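A tiny worked example with a made-up results value, just to show what those two comprehensions return (a real failure entry would be a Twisted Failure object; a plain Exception stands in for it here):

results = [
    (True, {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
            'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
            'url': 'http://www.example.com/files/product1.pdf'}),
    (False, Exception("download failed")),   # stand-in for a failed download
]

print([x for ok, x in results if ok])            # only the successful result dicts
print([x["path"] for ok, x in results if ok])    # ['full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg']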

🔆 In The End!

Start now, stick to it, make a little progress every day, and in the near future you will thank yourself for your efforts!

I will keep updating the crawler basics column and the crawler-in-action column. If you have read this far, feel free to like, bookmark, and comment with your thoughts, and follow me to read more crawler articles in the days ahead!

If there are mistakes or poorly chosen words, please point them out in the comments, thank you! If you want to reprint this article, please contact me for consent first and credit the source and this blogger's name, thank you!