First Look at a Powerful Crawling Tool (1)

I heard your crawler got blocked again? (2)

Crawling without saving the data is just fooling around (3)

Crawling 20,000+ rental listings to show the current state of rents in Guangzhou (4)

Can Scrapy grab pictures of girls too? (5)


Contents

  • preface
  • Media Pipeline
  • To enable the Media Pipeline
  • Using ImgPipeline
  • Grab girl graph
  • A blind bibi

preface

When scraping data, text is not the only thing you will want; sooner or later you will need to grab images too. Can Scrapy download images? Of course it can. To my shame, I did not know that until last month, when someone in the Zone7 fan group asked how to crawl image data with Scrapy. I looked into it, and this post summarizes what I found.

Media Pipeline

An item pipeline can do more than process text. Scrapy also ships media pipelines that save files and images: FilesPipeline and ImagesPipeline, respectively.

Files Pipeline

  • Avoid re-downloading recently downloaded data
  • Specifying a storage path

A typical FilesPipeline workflow looks like this:

  • In a spider, you scrape an item and put the URLs of the files you want into its file_urls field.
  • The item is returned from the spider and enters the item pipeline.
  • When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download by Scrapy's regular scheduler and downloader (so the scheduler and downloader middlewares are reused), but with a higher priority, so they are processed before other pages are fetched. The item stays "locked" at this pipeline stage until the downloads finish (or fail for some reason).
  • When the files have been downloaded, another field (files) is populated on the item. It holds a list of dictionaries describing each downloaded file: the storage path, the original URL (taken from the file_urls field), and the file's checksum. The entries in files keep the same order as the URLs in file_urls. If a file fails to download, an error is logged and that file does not appear in the files field.
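As a rough sketch of the workflow above, the settings and item shapes below show what goes in and what comes out when Scrapy's built-in FilesPipeline is used unmodified. The storage directory, the example.com URL, and the placeholder values in item_after are assumptions made for illustration, not output from a real crawl.

```python
# settings.py (sketch): enable Scrapy's built-in FilesPipeline
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
FILES_STORE = "downloads/files"  # hypothetical storage directory

# What the spider yields: an item whose file_urls field lists what to download
item_before = {
    "file_urls": ["https://example.com/report.pdf"],  # placeholder URL
}

# Shape of the same item once the pipeline is done (placeholder values)
item_after = {
    "file_urls": ["https://example.com/report.pdf"],
    "files": [
        {
            "url": "https://example.com/report.pdf",  # copied from file_urls
            "path": "full/<sha1-of-url>.pdf",         # relative to FILES_STORE
            "checksum": "<md5-of-file-content>",
        },
    ],
}
```

The images pipeline follows the same pattern with the image_urls and images fields.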

Images Pipeline

  • Avoid re-downloading recently downloaded data
  • Specifying a storage path
  • Convert all downloaded images to common formats (JPG) and modes (RGB)
  • Thumbnail generation
  • Detect the width/height of images to make sure they meet the minimum limits

To enable the Media Pipeline

import os

# Enable the image pipeline (replace the path with your own ImgPipeline)
ITEM_PIPELINES = {
    'girlScrapy.pipelines.ImgPipeline': 1,
}
FILES_STORE = os.getcwd() + '/girlScrapy/file'  # File storage path
IMAGES_STORE = os.getcwd() + '/girlScrapy/img'  # Image storage path

# Skip files that were already downloaded within the last 90 days
FILES_EXPIRES = 90
# Skip images that were already downloaded within the last 30 days
IMAGES_EXPIRES = 30

# Thumbnails to generate for each image
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (250, 250),
}
# Image filter: images below this minimum height or width are dropped
IMAGES_MIN_HEIGHT = 128
IMAGES_MIN_WIDTH = 128
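To make the expiry and size settings more concrete, here is a minimal sketch of the two checks they describe. It is a simplified approximation written for this post, not Scrapy's actual code; the function names is_still_fresh and meets_minimum_size are made up for illustration.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def is_still_fresh(last_modified, expires_days, now=None):
    """Return True if a stored file is recent enough to skip re-downloading.

    last_modified and now are Unix timestamps in seconds.
    """
    now = time.time() if now is None else now
    age_days = (now - last_modified) / SECONDS_PER_DAY
    return age_days <= expires_days


def meets_minimum_size(width, height, min_width=128, min_height=128):
    """Return True if an image is at least min_width x min_height pixels."""
    return width >= min_width and height >= min_height
```

With IMAGES_EXPIRES = 30, an image stored 29 days ago would be skipped, while one stored 31 days ago would be fetched again.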

Note that each downloaded image is named after the SHA-1 hash of its URL, for example:

0bddea29939becd7ad1e4160bbb4ec2238accbd9.jpg

The final storage path is:

your/img/path/full/0bddea29939becd7ad1e4160bbb4ec2238accbd9.jpg
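The naming rule can be reproduced in a few lines. This is a sketch of the default behaviour as I understand it (SHA-1 of the URL, stored under full/ with a .jpg extension); the helper name image_path_for and the example.com URL are hypothetical.

```python
import hashlib


def image_path_for(url):
    """Build the storage path (relative to IMAGES_STORE) for an image URL."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()  # 40 hex characters
    return f"full/{digest}.jpg"


print(image_path_for("http://example.com/photo.jpg"))  # placeholder URL
```

Because the name depends only on the URL, requesting the same URL twice maps to the same file, which is what lets the pipeline skip recent downloads.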

Using ImgPipeline

Here is the ImgPipeline from my demo, which overrides two methods.

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class ImgPipeline(ImagesPipeline):  # Inherit from ImagesPipeline

    def get_media_requests(self, item, info):
        # Schedule one download request per image URL in the item
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # Keep the storage paths of the images that downloaded successfully
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item

The two methods are:

get_media_requests(self, item, info):
item_completed(self, results, item, info):

get_media_requests(self, item, info):

Here we receive the item produced in parse, so the image addresses are available. Yielding a scrapy.Request(image_url) for each address tells Scrapy to download that image.

item_completed(self, results, item, info):

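item is the item that came from the spider (here it holds the list of image URLs), and info is the pipeline's internal bookkeeping object for the current spider. results is a list of (success, detail) tuples, one per requested image, and prints like this: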

# success
[(True, {'path': 'full/0bddea29939becd7ad1e4160bbb4ec2238accbd9.jpg', 'checksum': '98eb559631127d7611b499dfed0b6406', 'url': 'http://mm.chinasareview.com/wp-content/uploads/2017a/06/13/01.jpg'})]
# error
[(False,
  Failure(...))]
name      value
path      Path where the image is saved
checksum  MD5 hash of the image content
url       Image URL
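To show how these tuples are usually consumed, here is a small sketch that separates successes from failures. The helper name split_results is made up, and the plain Exception stands in for the Failure object Scrapy would pass on error.

```python
def split_results(results):
    """Separate the paths of successful downloads from the failures."""
    paths = [detail["path"] for ok, detail in results if ok]
    failures = [detail for ok, detail in results if not ok]
    return paths, failures


results = [
    (True, {
        "path": "full/0bddea29939becd7ad1e4160bbb4ec2238accbd9.jpg",
        "checksum": "98eb559631127d7611b499dfed0b6406",
        "url": "http://mm.chinasareview.com/wp-content/uploads/2017a/06/13/01.jpg",
    }),
    (False, Exception("download failed")),  # stand-in for a Failure object
]

paths, failures = split_results(results)
print(paths)          # paths of the images that were saved
print(len(failures))  # number of images that failed
```

Inside item_completed, the same list comprehension decides whether the item has any usable images before it is passed on or dropped.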

Grab girl graph

OK, that covers the theory. Now let's put it into practice.

spider

The spider is simple:

import scrapy
from bs4 import BeautifulSoup

from girlScrapy.items import ImgItem  # assumes ImgItem is defined in girlScrapy/items.py


class GirlSpider(scrapy.spiders.Spider):
    name = 'girl'
    start_urls = ["http://www.meizitu.com/a/3741.html"]

    def parse(self, response):
        soup = BeautifulSoup(response.body, 'html5lib')
        pic_list = soup.find('div', id="picture").find_all('img')  # Find all images on the page
        link_list = []
        item = ImgItem()
        for i in pic_list:
            pic_link = i.get('src')  # Get the URL of the image
            link_list.append(pic_link)  # Collect the image link
        item['image_urls'] = link_list
        print(item)
        yield item
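One thing to watch for: the pipeline can only download absolute URLs, and a page may use relative src values. The sketch below resolves them against the page URL with the standard library. The helper name absolute_image_urls and the /uploads/01.jpg path are hypothetical; inside a spider, response.urljoin(src) does the same job.

```python
from urllib.parse import urljoin


def absolute_image_urls(page_url, srcs):
    """Resolve each src against the page URL, skipping missing values."""
    return [urljoin(page_url, src) for src in srcs if src]


urls = absolute_image_urls(
    "http://www.meizitu.com/a/3741.html",
    [
        "/uploads/01.jpg",  # hypothetical relative path
        None,               # an <img> tag without a src attribute
        "http://mm.chinasareview.com/wp-content/uploads/2017a/06/13/01.jpg",
    ],
)
print(urls)
```

Absolute URLs pass through unchanged, so the helper is safe to apply to every link collected in parse.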

item

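import scrapy


class ImgItem(scrapy.Item):
    image_urls = scrapy.Field()  # Links to the images to download
    images = scrapy.Field()  # Download results (populated by the default ImagesPipeline.item_completed)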

ImgPipeline

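import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class ImgPipeline(ImagesPipeline):  # Inherit from ImagesPipeline

    def get_media_requests(self, item, info):
        # Schedule one download request per image URL in the item
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # Keep the storage paths of the images that downloaded successfully
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item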

Start the crawl

scrapy crawl girl

The final crawl result is as follows:

A blind bibi

That's it for today's update. Did you pick up a new skill? You now know how to batch-download pictures of girls, and I trust their good looks will motivate you to improve the code (just kidding)! Finally, reply to get the source code.

This article was first published on the public account "Zone7". Stay tuned for the latest posts!