Preface

Reading novels is a habit I have kept up for many years. “Coiling Dragon”, “Fight against the Sky”, “Fairy Inverse”, “Mortals Xiuxian Biography” and others accompanied me all through my school days. Recently I found that the experience of novel apps on iOS is poor: frequent pop-up ads, delayed updates and forced sharing. So one rainy night, I decided to stop putting up with those apps and write a book crawler myself.

An introduction to Scrapy

Scrapy is one of Python’s most widely used crawler frameworks. It makes it easy to grab web pages by URL, and provides more built-in tooling and higher concurrency than the plain Requests library. The official documentation is the recommended place to start learning.

That said, it doesn’t matter if you know nothing about Scrapy yet; you should still be able to pull this off after reading this article.

Scrapy in practice

Before we start, we need to have the following things ready:

  1. The URL of the novel you want to crawl
  2. A working environment
  3. Basic Python syntax

Selecting a site

Here I chose m.book9.net/, for three reasons:

  1. Fast update speed (stable service)
  2. Simple page structure (easy to parse)
  3. No crawler protection (easy to operate)

Next, find the home page of the novel you want to follow.

For example, Chen Dong’s The Holy Ruins.

If we wanted to catch up on updates manually, we would have to visit the site each time and click the first entry under the latest chapters to reach the newest chapter.

Following these steps, I sketched out a flow like this:

So next, we just need to translate this flow into code.

Setting up the project

We’re going to create a Scrapy project, but before doing so, make sure you have Python installed. The framework itself is compatible with both Python 2 and 3, so you don’t need to worry much about the version.

My local environment is Python 3, so some details may differ slightly from Python 2.

1. Install Scrapy

> pip3 install scrapy

2. Create a crawler project and name it NovelCrawler

> scrapy startproject NovelCrawler

3. Create a crawler service for the target domain

> scrapy genspider novel m.book9.net

That completes the basic project setup. Once everything is created, you can execute

> scrapy crawl novel

to start the crawler service. However, our crawler doesn’t implement any rules yet, so running this command does nothing. We need to add some crawling rules to the project.
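For orientation, the commands above generate a project layout roughly like the following (the exact set of files may vary slightly between Scrapy versions):

NovelCrawler/
    scrapy.cfg            # deployment configuration
    NovelCrawler/
        __init__.py
        items.py          # item definitions (not used in this article)
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            novel.py      # the spider generated by genspider

The spiders directory is where all the crawling logic lives, and novel.py is the file we will be filling in below.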

Writing the crawler

Next, we open the project we just created in PyCharm.

All of Scrapy’s crawler services live under the spiders directory, which is where we add our crawler file, novel.py.

Requesting the novel home page

# encoding:utf-8

import scrapy

class NovelSpider(scrapy.Spider):
    # The crawler name, the same as the service created above
    name = 'novel'
    # The domains the crawler is allowed to visit
    allowed_domains = ['m.book9.net']
    # URL to request first: the home page of The Holy Ruins
    start_urls = ['https://m.book9.net/wapbook/10.html']

    # The default callback invoked when the request succeeds
    def parse(self, response):
        pass


In the code above, the response object passed to the parse function is a black box to us, and that is one of the more troublesome parts of learning Python. One way to deal with it is to use PyCharm’s debugger to inspect the parameter.

As you can see from the figure above, the response contains the requested HTML. So we only need to pick out the parts we want.
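If you would rather not attach a debugger, Scrapy also ships an interactive shell that fetches a page and drops you into a Python prompt with the response object already populated, so you can inspect it directly (using the same home-page URL as above):

> scrapy shell "https://m.book9.net/wapbook/10.html"

Once inside, expressions such as response.xpath('//title/text()').extract_first() can be tried out interactively before being written into the spider.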

Getting the latest chapter URL

So how do we parse out the nodes we need? Response provides an xpath method: we only need to pass in an XPath expression to locate the corresponding HTML node.

It doesn’t matter if you don’t know XPath syntax; Chrome can generate the XPath of a node for you with one click (right-click -> Inspect -> Copy -> Copy XPath), as shown below:

Using that XPath, we can get the address of the latest chapter from this page:

# encoding:utf-8

import scrapy

class NovelSpider(scrapy.Spider):
    name = 'novel'
    allowed_domains = ['m.book9.net']
    start_urls = ['https://m.book9.net/wapbook/10.html']

    def parse(self, response):
        # The jump link inside the target <a> tag
        context = response.xpath('/html/body/div[3]/div[2]/p[1]/a/@href')
        # The first result in the list is the URL of the latest chapter
        url = context.extract_first()
        print(url)
        pass

Requesting the chapter content

Once we have the link, we can jump to the next page. Response also provides a follow method, which makes it convenient to follow relative links within the site.

# encoding:utf-8

import scrapy

class NovelSpider(scrapy.Spider):
    name = 'novel'
    allowed_domains = ['m.book9.net']
    start_urls = ['https://m.book9.net/wapbook/10.html']

    def parse(self, response):
        context = response.xpath('/html/body/div[3]/div[2]/p[1]/a/@href') 
        url = context.extract_first()
        # Follow the short link and hand the result to the specified callback
        yield response.follow(url=url, callback=self.parse_article)

    # Custom callback method
    def parse_article(self, response):
        # The response here is the specific chapter page
        print(response)
        pass


(If the yield keyword in the code confuses you, it is worth reading up on Python generators.)
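As a quick, Scrapy-independent illustration: yield turns a function into a generator that produces values lazily, one at a time, which is exactly how Scrapy consumes the requests and items a spider hands back.

def count_up_to(n):
    # Execution pauses at each yield and resumes on the next iteration
    i = 1
    while i <= n:
        yield i
        i += 1

for number in count_up_to(3):
    print(number)  # prints 1, then 2, then 3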

With the chapter page in hand, we just need to parse its HTML. This part is very detail-oriented and only applies to this particular site, so I won’t go through it line by line. Here is the commented code:

# encoding:utf-8
import re

import os
import scrapy

class NovelSpider(scrapy.Spider):
    name = 'novel'
    allowed_domains = ['m.book9.net']
    start_urls = ['https://m.book9.net/wapbook/10.html']

    def parse(self, response):
        # The jump link inside the target <a> tag
        context = response.xpath('/html/body/div[3]/div[2]/p[1]/a/@href')
        url = context.extract_first()
        # Follow the short link and hand the result to the specified callback
        yield response.follow(url=url, callback=self.parse_article)

    def parse_article(self, response):
        # Get the title of the chapter
        title = self.generate_title(response)
        # Build the HTML for the chapter
        html = self.build_article_html(title, response)
        # Save the chapter HTML locally
        self.save_file(title + ".html", html)
        # Open the saved HTML with the default browser (macOS "open" command),
        # escaping spaces in the title for the shell
        os.system("open " + title.replace(" ", "\\ ") + ".html")
        pass

    @staticmethod
    def build_article_html(title, response):
        # Get the chapter content
        context = response.xpath('//*[@id="chaptercontent"]').extract_first()
        # Strip the <a> jump-link tags embedded in the article
        re_c = re.compile('<\s*a[^>]*>[^<]*<\s*/\s*a\s*>')
        article = re_c.sub("", context)
        # Splice the chapter HTML together
        html = '<html><meta charset="utf-8"><div align="center"><b><font size="5">' \
               + title + '</font></b></div>' + article + "</html>"
        return html

    @staticmethod
    def generate_title(response):
        title = response.xpath('//*[@id="read"]/div[1]/text()').extract()
        return "".join(title).strip()

    @staticmethod
    def save_file(file_name, context):
        fh = open(file_name, 'wb')
        fh.write(context.encode(encoding="utf-8"))
        fh.close()
        pass
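One portability note: the open command used in parse_article is macOS-specific. A more portable alternative, as a minimal sketch assuming the chapter file was saved in the current working directory, is Python’s standard webbrowser module:

import os
import webbrowser

# Open the freshly saved chapter HTML in the default browser, cross-platform
webbrowser.open("file://" + os.path.abspath(title + ".html"))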

Now we can run the following command in the current directory:

> scrapy crawl novel

Demo

Reflections

After writing the whole thing, I found it difficult to make one piece of code fit multiple sites. So when you need to crawl several sites, it is more practical to build a separate crawler file for each site.
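For example, a second site would get its own spider class in its own file under spiders/, runnable by name just like the first one. The sketch below is purely hypothetical: the domain, URLs and XPath rules are placeholders for illustration only.

# spiders/other_site.py -- a hypothetical spider for a second site
import scrapy

class OtherSiteSpider(scrapy.Spider):
    name = 'other_site'  # run with: scrapy crawl other_site
    allowed_domains = ['example.com']  # placeholder domain
    start_urls = ['https://example.com/book/1.html']  # placeholder URL

    def parse(self, response):
        # Site-specific XPath rules live here, isolated from the book9 spider
        latest = response.xpath('//a[@class="latest"]/@href').extract_first()
        if latest:
            yield response.follow(url=latest, callback=self.parse_article)

    def parse_article(self, response):
        # Parsing logic tailored to this site's chapter page layout
        print(response.url)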

Source code address (I hear stars double at the end of the year!)