This article appeared on my blog at blog.d77.xyz/archives/35…

Preface

For Scrapy practice I found a crawler training platform at scrape.center. So far I have crawled the first ten relatively simple sites; thanks to the platform author for providing it.

Environment Setup

The environment is PyCharm with Scrapy and PyMySQL. Use scrapy startproject learnscrapy to create a new Scrapy project and host it on GitHub.

Preparation before starting

A new project uses the default settings, so adjust them before analyzing the corresponding sites to make later crawling easier.

Setting Robots Rules

Change the ROBOTSTXT_OBEY value to False in the settings.py file. Strictly speaking this isn't needed here, since the target site has no robots.txt file, but make it a habit so data doesn't go missing for no apparent reason.
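For reference, the changed line in settings.py is simply:

ROBOTSTXT_OBEY = False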

Setting a Log Level

The default log level is DEBUG, which prints every request and its status. For large-scale crawls the log becomes very large and messy. Add the following to the settings.py file to filter out some of the log output.

import logging
LOG_LEVEL = logging.INFO

Setting the Download Delay

To avoid putting too much load on the target site, add a small download delay. The default is 0; uncomment DOWNLOAD_DELAY and set it to 0.1.
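The uncommented line in settings.py then reads:

DOWNLOAD_DELAY = 0.1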

Turn off the Telnet Console

It is of little use for now, so turn it off by setting TELNETCONSOLE_ENABLED to False.
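That is, in settings.py:

TELNETCONSOLE_ENABLED = False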

Change the default request header

To make our requests look more like those of a normal browser and reduce the chance of being banned, set the default request headers.

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml,application/json;q=0.9,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'no-cache',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, '
                  'like Gecko) Chrome/86.0.4240.75 Safari/537.36',
}

Enable item pipelines and middlewares only when they are needed. That concludes the preparation.

The database

Create a MySQL Docker container, set the account and password for the database, and create a new database named test. Run the test.sql file in the root directory of the project to generate the three tables.
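As a quick sanity check (a minimal sketch; the connection parameters are the ones used later in the pipeline, so adjust host, user, and password to your own container), you can verify that the test database and its tables exist:

import pymysql

# Connection parameters match the pipeline below; adjust to your own MySQL container.
conn = pymysql.connect(host='192.168.233.128', user='root', password='123456',
                       db='test', port=3306, charset='utf8')
with conn.cursor() as cur:
    cur.execute('SHOW TABLES')
    print(cur.fetchall())  # should list the three tables created by test.sql
conn.close()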

Start Crawling

ssr1

Ssr1 is described as follows:

ssr1: Movie data site, no anti-crawling measures, data rendered on the server side, suitable for basic crawler practice.

Since it is server-side rendering, the data must be in the HTML source code, so you can grab the data directly from the source code.

Create a new file ssr1.py in the spiders folder and write the following code:

import scrapy


class SSR1(scrapy.Spider):
    name = 'ssr1'

    def start_requests(self):
        urls = [f'https://ssr1.scrape.center/page/{a}' for a in range(1, 11)]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response, **kwargs):
        pass

These are the two methods we need to implement: the start_requests method builds the starting URLs and yields the requests to the downloader, and the returned responses are handled by the default parse callback.

Before parsing the data there are a few more things to do: define the fields for the data to be collected, enable the item pipeline and specify which class will handle the data, and write the changed settings to the settings.py file.

  1. Create fields for the data you want to collect by adding the following to the items.py file:
class SSR1ScrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    fraction = scrapy.Field()
    country = scrapy.Field()
    time = scrapy.Field()
    date = scrapy.Field()
    director = scrapy.Field()
  2. Enable the item pipeline: in the settings.py file, uncomment ITEM_PIPELINES and change it to:
ITEM_PIPELINES = {
    'learnscrapy.pipelines.SSR1Pipeline': 300,
}
  3. Add the following to the pipelines.py file:
class SSR1Pipeline:
    def process_item(self, item, spider):
        print(item)
        return item

We then write code in ssr1.py to parse the HTML source: use XPath to extract the corresponding fields, use yield Request to request each movie's detail page, pass the item along through the meta argument, and specify the callback function with the callback argument.

The complete code is as follows:

import scrapy
from scrapy import Request

from learnscrapy.items import SSR1ScrapyItem


class SSR1(scrapy.Spider):
    name = 'ssr1'

    def start_requests(self):
        urls = [f'https://ssr1.scrape.center/page/{a}' for a in range(1, 11)]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response, **kwargs):
        result = response.xpath('//div[@class="el-card item m-t is-hover-shadow"]')
        for a in result:
            item = SSR1ScrapyItem()
            item['title'] = a.xpath('.//h2[@class="m-b-sm"]/text()').get()
            item['fraction'] = a.xpath('.//p[@class="score m-t-md m-b-n-sm"]/text()').get().strip()
            item['country'] = a.xpath('.//div[@class="m-v-sm info"]/span[1]/text()').get()
            item['time'] = a.xpath('.//div[@class="m-v-sm info"]/span[3]/text()').get()
            item['date'] = a.xpath('.//div[@class="m-v-sm info"][2]/span/text()').get()
            url = a.xpath('.//a[@class="name"]/@href').get()
            yield Request(url=response.urljoin(url), callback=self.parse_person, meta={'item': item})

    def parse_person(self, response):
        item = response.meta['item']
        item['director'] = response.xpath(
            '//div[@class="directors el-row"]//p[@class="name text-center m-b-none m-t-xs"]/text()').get()
        yield item

Then create a startup file main.py in the project root. Normally the crawler would be started from the command line; here we start it from PyCharm to make debugging easier. In the main.py file, write the following:

from scrapy.cmdline import execute

if __name__ == '__main__':
    execute(['scrapy', 'crawl', 'ssr1'])

Execute it, and the data should be printed out as dictionaries, because for now we only print the items in the item pipeline and have not written them to the database yet.

Next, write the data to the database. Extend SSR1Pipeline in pipelines.py so that items are queued up and inserted into MySQL in batches with pymysql:

import pymysql


class SSR1Pipeline:
    def __init__(self):
        self.conn: pymysql.connections.Connection
        self.cur: pymysql.cursors.Cursor
        self.queue = []
        self.count = 0

    def open_spider(self, spider):
        self.conn = pymysql.connect(host='192.168.233.128', user='root', password='123456', db='test',
                                    port=3306, charset='utf8')
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        # Flush any remaining queued items before closing the connection
        if len(self.queue) > 0:
            self.insert_database()
        self.cur.close()
        self.conn.close()

    def insert_database(self):
        sql = "insert into ssr1 (country,date,director,fraction,time,title) values (%s,%s,%s,%s,%s,%s)"
        self.cur.executemany(sql, self.queue)
        self.queue.clear()
        self.conn.commit()

    def process_item(self, item, spider):
        self.queue.append(
            (item['country'], item['date'], item['director'], item['fraction'], item['time'], item['title']))
        # Write in batches once more than 30 items have accumulated
        if len(self.queue) > 30:
            self.insert_database()
        return item

Run the crawler again, and all 100 records should be written to the database.

The complete code submission is at github.com/libra146/le…

ssr2

Ssr2 is described as follows:

Movie data site, no anti-crawling measures, no valid HTTPS certificate, suitable for practicing HTTPS certificate verification.

Scrapy does not validate HTTPS certificates by default, so the crawl logic is the same as for ssr1. However, the ssr2 backend service seems to have a problem: it kept returning 504 errors and would not open in a browser either, so I could not verify whether the rules work.

The complete code submission is at github.com/libra146/le…

ssr3

Ssr3 is described as follows:

Movie data site, no anti-crawling measures, protected by HTTP Basic Authentication, suitable for HTTP authentication cases; the username and password are both admin.

Because this site uses Basic Authentication, the authentication information has to be added to the request headers, which is done with a downloader middleware in Scrapy.

The rest of the code is the same as for the two sites above, except that ssr3 specifies the custom downloader middleware through custom_settings.

class SSR3Spider(scrapy.Spider):
    name = "ssr3"
    # Override the global settings and use a custom downloader middleware
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'learnscrapy.middlewares.SSR3DownloaderMiddleware': 543,
        }
    }

You also need to add the following code to middlewares.py:

import base64

from scrapy import signals


class SSR3DownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        # Add the Basic Authentication header
        request.headers.update({'authorization': 'Basic ' + base64.b64encode('admin:admin'.encode()).decode()})
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

The authentication information is added to the request headers before the request is sent. However, the server still returned 504 after the authentication information was added; presumably HTTP Basic Authentication is configured in nginx while the backend service behind it is still down.
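As an aside, Scrapy also ships with a built-in HttpAuthMiddleware that reads http_user and http_pass attributes from the spider, so in principle the same Basic Authentication header could be produced without a custom middleware. A minimal sketch, not the approach used in the committed code (the spider name and URL pattern here are assumptions based on ssr1; newer Scrapy versions also expect an http_auth_domain attribute to limit which requests receive the credentials):

import scrapy


class SSR3AltSpider(scrapy.Spider):
    # Hypothetical alternative relying on Scrapy's built-in HttpAuthMiddleware
    name = 'ssr3_alt'
    http_user = 'admin'
    http_pass = 'admin'
    http_auth_domain = 'ssr3.scrape.center'  # needed on newer Scrapy versions

    def start_requests(self):
        urls = [f'https://ssr3.scrape.center/page/{a}' for a in range(1, 11)]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response, **kwargs):
        pass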

The complete code submission is at github.com/libra146/le…

ssr4

Ssr4 is described as follows:

Movie data site, no anti-crawling measures, each response delayed by 5 seconds, suitable for testing slow-site crawling or crawl-speed tests while reducing interference from network speed.

For Scrapy, ssr4 can be used to test crawl speed while ruling out interference from network speed. The crawl rules are no different; each request simply takes a little longer. The code is basically unchanged.
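If you do want to use ssr4 as a rough speed test, one option is to raise concurrency for just this spider via custom_settings and compare the elapsed time reported in Scrapy's closing stats. This is only a sketch; the spider name, URL pattern, and numbers are illustrative assumptions, not part of the committed code:

import scrapy


class SSR4Spider(scrapy.Spider):
    name = 'ssr4'
    # Illustrative only: with more concurrent requests, the fixed 5-second
    # response delay overlaps across requests instead of adding up serially.
    custom_settings = {
        'CONCURRENT_REQUESTS': 16,
        'DOWNLOAD_DELAY': 0,
    }

    def start_requests(self):
        urls = [f'https://ssr4.scrape.center/page/{a}' for a in range(1, 11)]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response, **kwargs):
        pass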

The full code submission is at github.com/libra146/le…

Conclusion

That is all of the analysis and code for the ssr sites. Sites like these, with no anti-crawling measures and no limits on crawl speed, are relatively simple to crawl, and working through them helps in understanding how to write Scrapy spiders and how the framework fits together.

Reference Links

Docs.scrapy.org/en/latest/i…