This article takes about 10 minutes to read.


A lot of people say crawlers are fun, but they don’t know how to get started. In fact, getting started with crawlers is simple; the hard part is the anti-crawling mechanisms of the major websites. Of course, some simple sites are still very easy to crawl.


To learn crawlers, you should first clarify your motivation, whether you want to crawl some Zhihu data or some movie resources. Motivation is very important: it determines whether you are interested enough to keep learning.


For many people, the first motivation to learn crawlers is to crawl the girl pictures on major websites, such as the well-known MZITU. When you crawl these sites, you can enjoy the pictures while learning the technology, which is very nice.


Today I will use the excellent Scrapy framework to grab some girl images and save the data to a MongoDB database. The site to be crawled this time is 360’s image search website.


Address: http://images.so.com/


The quality of the girl pictures on 360 is quite good. Here are a few for you to judge.



Pure and lovely



Literary and artistic temperament


Ethereal


Very pleasing to the eye.


Program approach


The program runs on Windows 10 with Python 3.6. Make sure Scrapy, PyMongo, and a MongoDB database are installed before running it.
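If Scrapy and PyMongo are not installed yet, they can usually be installed with pip; MongoDB itself needs to be installed and started separately:

pip install scrapy pymongo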


A quick analysis of the 360 image website shows that it has no strong anti-crawling measures, and the page data is loaded via Ajax requests.



Let’s take a closer look at the details of the request and observe the data structure returned.



It returns data in JSON format, in which the list field stores information about each image, for example the image address field cover_imgURL that we need. In addition, if you look at the parameters of the Ajax request, there is an sn parameter that keeps changing, and it is obviously the offset: when sn is 30, the first 30 pictures are returned, and so on. We only need to keep changing the value of sn to fetch image information page after page.
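Before writing the spider, you can probe the Ajax endpoint directly to confirm this structure. Below is a minimal sketch using the requests library, assuming the /zj endpoint and the ch/listtype/sn parameters that the spider uses later; the exact field names may differ slightly from what your browser shows:

import json
from urllib.parse import urlencode

import requests

# sn is the offset: 30 returns the first 30 images, 60 the next 30, and so on
params = urlencode({'ch': 'photography', 'listtype': 'new', 'sn': 30})
response = requests.get('https://image.so.com/zj?' + params)
result = json.loads(response.text)

# each entry of the "list" field describes one image
for image in result.get('list', []):
    print(image.get('group_title'), image.get('qhimg_url'))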


The next step is to use the high-performance Scrapy framework to save the images from this site to local storage.



New project


Start by creating a Scrapy project locally and naming it images360. It can be created with the following command:


scrapy startproject images360


The project structure is as follows:



The next step is to add a Spider to the spiders directory, with the following command:


scrapy genspider images images.so.com


In this way, our project has been created, and the final structure of the project is as follows.
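For reference (the original screenshot is omitted here), a project generated this way typically has the following structure:

images360/
    scrapy.cfg
    images360/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            images.py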



The program code


settings.py


In settings.py we define a variable MAX_PAGE, which is the maximum number of pages we want to crawl. For example, in this application we set it to 50, which means we want to crawl 50 pages with 30 images per page, 1500 images in total.


MAX_PAGE = 50


In settings.py we also set some database-specific configuration information.


MONGO_URI = 'localhost'
MONGO_DB = 'test'
IMAGES_STORE = './images'


Also note that we need to set the ROBOTSTXT_OBEY variable in settings.py to False, otherwise Scrapy will obey the site’s robots.txt and refuse to crawl.

ROBOTSTXT_OBEY = False


start_requests()

This method constructs the initial requests; it generates the 50 requests, one per page.

    def start_requests(self):
        # requires `from urllib.parse import urlencode` and `from scrapy import Request` at the top of images.py
        data = {'ch': 'photography', 'listtype': 'new'}
        base_url = 'https://image.so.com/zj?'
        for page in range(1, self.settings.get('MAX_PAGE') + 1):
            # sn is the offset: 30, 60, 90, ... one request per page of 30 images
            data['sn'] = page * 30
            params = urlencode(data)
            url = base_url + params
            yield Request(url, self.parse)


Extracting information

We’ll define an Images360Item class in our items.py file that defines our data structure.

from scrapy import Item, Field


class Images360Item(Item):
    # collection is the MongoDB collection name; table is kept as an alias
    collection = table = 'images'
    id = Field()      # image id
    url = Field()     # image URL
    title = Field()   # image title
    thumb = Field()   # thumbnail URL


These fields include the id, url (link), title, and thumbnail of the image. There are also two other class attributes, collection and table, both set to the string 'images', which is the name of the MongoDB collection the items will be stored in.

Next we extract the information in the Spider’s parse() method, as follows:

    def parse(self, response):
        # the response body is JSON; each entry in "list" describes one image
        result = json.loads(response.text)
        for image in result.get('list'):
            item = Images360Item()
            item['id'] = image.get('imageid')
            item['url'] = image.get('qhimg_url')
            item['title'] = image.get('group_title')
            item['thumb'] = image.get('qhimg_thumb_url')
            yield item


Now that we have extracted the information, we need to save the captured information to MongoDB.

MongoDB

Make sure you have MongoDB installed locally and running. We save the captured information to MongoDB with a MongoPipeline; add the following class to pipelines.py:

import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection settings defined in settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # item.collection is 'images', the collection defined on Images360Item
        self.db[item.collection].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()



ImagePipeline

Scrapy provides dedicated pipelines for handling downloads, including file downloads and image downloads. Downloading files and images uses the same machinery as fetching pages, so the download process is asynchronous and very efficient.

We first define an IMAGES_STORE variable in settings.py that represents the path where the image is stored.

IMAGES_STORE = './images'


The built-in ImagesPipeline reads the item’s image_urls field by default, assumes it is a list, iterates over it, and downloads each URL in it.
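For reference, here is a minimal sketch of how the stock ImagesPipeline is normally used with a hypothetical item that follows the default field names (this is not the item used in this project):

# settings.py: enable the stock pipeline and point it at a storage directory
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = './images'

# items.py: the default field names that ImagesPipeline expects
from scrapy import Item, Field

class DefaultImageItem(Item):   # hypothetical item, for illustration only
    image_urls = Field()        # list of image URLs to download
    images = Field()            # filled in with the download results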

However, the image link in our Item is stored in the url field, not image_urls, and it is a single URL rather than a list. So to download the images we need to redefine part of the download logic: we define a custom ImagePipeline that inherits from the built-in ImagesPipeline and implements our own image download logic.

from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        # save each image under the last segment of its URL
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        # drop items whose image could not be downloaded
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        return item

    def get_media_requests(self, item, info):
        # schedule a download for the single URL stored in item['url']
        yield Request(item['url'])


Finally, enable both pipelines in ITEM_PIPELINES in settings.py. The lower number gives ImagePipeline higher priority, so items whose images fail to download are dropped before they reach MongoPipeline:

ITEM_PIPELINES = {
    'images360.pipelines.ImagePipeline': 300,
    'images360.pipelines.MongoPipeline': 301,
}


Finally, we just need to run the program to perform the crawl, with the following command:

scrapy crawl images
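Once the crawl finishes, you can verify that the data actually reached MongoDB. A quick check sketch with PyMongo, assuming the MONGO_URI, MONGO_DB, and collection name from the settings above:

import pymongo

client = pymongo.MongoClient('localhost')
db = client['test']
print(db['images'].count_documents({}))   # number of stored image records
print(db['images'].find_one())            # one sample document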


I have uploaded the complete code to the backend of my WeChat public account. You can get it by replying “360” to the “Crazy Sea” public account.


This article was first published on the public account “Chi Hai”. Reply “1024” to the account to get the latest programming resources.