1. Install scrapy framework

A detailed installation tutorial is available elsewhere on this site.

2. Create a scrapy project

Next, generate the crawler files. Open cmd.exe in the target directory and enter the following commands:

scrapy startproject mxp7 
cd mxp7 
scrapy genspider sp mxp7.com

After opening the project in PyCharm, we can see that all the files have already been created. We only need to edit the code in these files and then run a command on the command line to crawl the data.
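For reference, the project generated by these commands typically has the following layout (a sketch; names can vary slightly between Scrapy versions):

```
mxp7/
├── scrapy.cfg              # deployment configuration
└── mxp7/
    ├── __init__.py
    ├── items.py            # item definitions
    ├── middlewares.py      # spider and downloader middlewares
    ├── pipelines.py        # item pipelines
    ├── settings.py         # project settings (USER_AGENT, log level, ...)
    └── spiders/
        └── sp.py           # the spider created by "scrapy genspider sp mxp7.com"
```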

3. Extract data

First, open a blog post in the browser to look at the basic structure of the page. Then, in sp.py, our crawler file, we create a new item dictionary and extract the contents of the blog with XPath.

**Note: here is one XPath-related pitfall.** When I first started extracting page tags, I liked to copy the XPath directly from the browser's debugging panel.

For example, copying the title's XPath as shown in the image above gives the following expression. The browser builds this XPath by walking down the tag hierarchy in order:

/html/body/div[2]/div[1]/section[1]/main/article/header/h1

However, this only works for static web pages, because some content on dynamic pages is loaded by JavaScript after the initial request. The XPath we copied describes the directory structure after dynamic loading, but the crawler never runs that code; it only sees the original HTML in the response. If the crawler is to locate content by the order of tags, the expression has to be checked against the code in the response, which is usually quite tedious to work out.

How to open the Response page

Open the target web page, press F12 to open the browser's developer tools, select the Network tab, and press F5 to refresh. The files requested by the page appear at the bottom of the panel. Select the file whose name matches the current URL, click the "Response" tab on the right, and you can view the content of the response.
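Besides the browser, Scrapy's own interactive shell is a convenient way to check what the crawler actually receives, since it fetches the page exactly as the spider would and runs no JavaScript. A minimal sketch (the article URL is a placeholder; the XPath expressions are the ones discussed above):

```
# open the shell against the article page being inspected
scrapy shell "http://mxp7.com/some-article"

# inside the shell, test expressions against the raw response
>>> response.xpath("/html/body/div[2]/div[1]/section[1]/main/article/header/h1/text()").extract_first()
>>> response.xpath("//h1[@class='entry-title']/text()").extract_first()
```

If the copied absolute path returns nothing while the browser clearly shows the element, that part of the DOM was built by dynamic loading.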

Alternatively, in the browser's debugging interface you can copy an XPath that locates the element by its id.

Obtain the following content:

//*[@id="post-331"]/header/h1

However, notice that this id matches the post number on the site, which means the id in this path is different on every page, so an id-based XPath cannot be generalized to all pages. I therefore recommend locating elements by their class attribute instead. Once the XPath expressions are written, the source code of sp.py looks like this:

import scrapy


class SpSpider(scrapy.Spider):
    name = 'sp'
    allowed_domains = ['mxp7.com']
    start_urls = ['http://mxp7.com']

    def parse(self, response):
        # extract the link to the first article listed on the home page
        href = response.xpath("//h2[@class='entry-title'][1]//a/@href").extract_first()
        print(href)
        # follow the link and let parse1 handle the article page
        yield scrapy.Request(href, callback=self.parse1)

    def parse1(self, response):
        # locate each field by class attribute rather than by id
        item = {}
        item["date"] = response.xpath("//span[@class='meta-date']//time/text()").extract_first()
        item["category"] = response.xpath("//span[@class='meta-category']//a/text()").extract_first()
        item["title"] = response.xpath("//h1[@class='entry-title']/text()").extract_first()
        item["author"] = response.xpath("//span[@class='meta-author']//a/text()").extract_first()
        contain_path = response.xpath("//div[@class='entry-content clearfix']")
        # string(.) concatenates all the text nodes inside the content div
        item["contain"] = contain_path.xpath("string(.)").extract_first()
        yield item


# pipelines.py source code
class Mxp7Pipeline:
    def process_item(self, item, spider):
        # the generated pipeline simply passes each item through unchanged
        return item
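As a side note, if this pipeline is later given real processing logic, it has to be enabled in settings.py before Scrapy will call it; a minimal sketch (300 is just the conventional priority value):

```python
# settings.py -- register the pipeline so process_item() actually runs
ITEM_PIPELINES = {
    'mxp7.pipelines.Mxp7Pipeline': 300,
}
```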


With this done, we are ready to fetch. In the D:\Project\Spider\mxp7> directory, enter the crawl command on the command line.
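The crawl command uses the spider name defined in sp.py (the same name that appears in the export command later in this article):

```
scrapy crawl sp
```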

We will then find that the scraped data is buried in a large amount of log output. For a cleaner view, we can add a line to settings.py to raise the log level to ERROR.
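LOG_LEVEL is the standard Scrapy setting for this; the original screenshot is not reproduced here, but matching the text, the line would be:

```python
# settings.py -- only print log messages at ERROR level and above
LOG_LEVEL = "ERROR"
```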

Then re-enter the fetch command in CMD

We can successfully crawl the contents of the page

If we want to get all the articles published on the blog, we need to modify sp.py and add, inside the parse function, code that follows the link to the next page. Open the page in your browser and look at how the next-page link is formatted.

Then we simply pass that link to scrapy.Request:

pre_url = response.xpath("//div[@class='nav-previous']//a/@href").extract_first()
if pre_url:
    yield scrapy.Request(pre_url, callback=self.parse1)  # hand the next page to parse1 again

However, on some websites the URL obtained this way is not a complete URL (the anti-crawling measures of my own site are quite weak).

In that case you need to prepend the site's base address to the extracted URL; I trust you can handle that, so I won't go into detail. After this change, we also need to set the value of USER_AGENT in settings.py, and before doing so we need to find out what USER_AGENT value the browser sends for the page.
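For the incomplete-URL case, Scrapy also offers response.urljoin() as an alternative to prepending the base address by hand; a minimal sketch of how the previous snippet could use it:

```python
pre_url = response.xpath("//div[@class='nav-previous']//a/@href").extract_first()
if pre_url:
    # urljoin() works whether the extracted href is absolute or relative
    yield scrapy.Request(response.urljoin(pre_url), callback=self.parse1)
```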

**Note:** USER_AGENT is where crawlers and anti-crawler measures clash, and both sides take all kinds of counter-measures around it. This step only deals with the simplest anti-crawler mechanism. In fact, my site currently has no anti-crawler mechanism at all, so even without this step we could crawl all of the article content normally.

In the Network panel, select the request, click the "Headers" tab on the right, and scroll to the bottom to find the value of User-Agent.

We then find the commented-out USER_AGENT line in settings.py and change its value to the one we just copied:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36'

After these changes, we can run the crawl in cmd again.

This time we successfully grab the content of every article on the site.

4. Save the file

To save the scraped data to a file, just enter the following command on the command line:

scrapy crawl sp -o mxp7.csv

A new mxp7.csv file is created in the current directory

If we open it directly, the text is garbled. We can open the file in Notepad instead and re-save it, choosing a different encoding.

Open it again and the file reads normally.
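As an alternative to re-saving the file by hand (this is a standard Scrapy setting rather than a step from the original walkthrough), the export encoding can also be set in settings.py so that the CSV opens correctly right away:

```python
# settings.py -- write the CSV as UTF-8 with a BOM so Excel detects the encoding
FEED_EXPORT_ENCODING = "utf-8-sig"
```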