This is the 8th day of my participation in the August More Text Challenge. For details, see: August More Text Challenge.

I have received feedback from many fans in private chat: "The Scrapy framework is hard to learn. I have learned nearly all of the basic crawler libraries and done a lot of hands-on practice, but many outsourcing jobs and bosses require skilled use of the Scrapy framework, and I don't know what to do!"

ˏ₍•ɞ•₎ So I worked hard for three days and three nights to compile and summarize, and the result is this 20,000-word article, with a complete Scrapy framework learning route attached at the end. If you read this article earnestly, form an impression of Scrapy in your mind, and then work through the learning route at the end with real effort, the Scrapy framework will be easy for you!!




Design purpose: Scrapy is a framework for crawling network data and extracting structured data. It uses the Twisted asynchronous networking framework to speed up downloads.

Official documentation address: https://docs.scrapy.org/en/latest/

pip install scrapy
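
A quick sanity check, if you like: after installing, the built-in scrapy version command prints the installed version, confirming the install worked:

scrapy version   # prints the installed Scrapy version if the install succeeded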

Scrapy project development process

Extract data: implement data collection in the spider according to the structure of the target site.
Save data: use pipelines for the follow-up processing and saving of the data.



Description (the specific role of each Scrapy module):

Request object: consists of url, method, post_data, headers, and so on. Response object: consists of url, body, status, headers, and so on. Item data object: essentially a dictionary.
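
As a rough, hedged sketch of those three objects (the URL is the one used later in this article; the field values are made up):

import scrapy

# Request object: built from url, method, post data / body, headers, and so on
req = scrapy.Request(
    url='http://www.itcast.cn/channel/teacher.shtml',
    method='GET',
    headers={'User-Agent': 'Mozilla/5.0'},
)

# Item data object: essentially a dictionary of the fields you want to keep
item = {'name': 'some teacher', 'title': 'senior lecturer'}

# Response object: inside a spider callback you read response.url, response.status,
# response.headers and response.body (bytes) -- see the attribute list further below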

Description of how the process runs:

  1. The starting URL is constructed into a Request object in the spider -> spider middleware -> engine -> scheduler
  2. The scheduler takes out the Request -> engine -> downloader middleware -> downloader
  3. The downloader sends the request and gets a Response -> downloader middleware -> engine -> spider middleware -> spider
  4. The spider extracts URL addresses and assembles them into Request objects -> spider middleware -> engine -> scheduler, then step 2 repeats
  5. The spider extracts data -> engine -> pipeline, which processes and saves the data

The spider middleware and the downloader middleware only run their logic at different points in the flow; their functions can overlap, for example replacing the UA.
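
For example, a minimal downloader-middleware sketch that replaces the UA could look like this (the class name and UA string are invented for illustration, and the class would still have to be registered under DOWNLOADER_MIDDLEWARES in settings.py):

# middlewares.py (sketch)
class ReplaceUserAgentMiddleware:
    def process_request(self, request, spider):
        # Called for every request that passes through the downloader middleware
        request.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
        return None  # returning None lets the request continue on to the downloader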

1. Create a project

Example: scrapy startproject myspider

The resulting directories and files are as follows:
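
The screenshot is not reproduced here; for reference, scrapy startproject myspider typically generates a layout like this:

myspider/
    scrapy.cfg            # project configuration file
    myspider/             # the project's Python module
        __init__.py
        items.py          # item (data model) definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory that holds the spider files
            __init__.py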

2. Create crawler files

cd myspider
scrapy genspider itcast itcast.cn

Crawler name: used as a parameter when the crawler is run. Allowed domains: the crawling range set for the crawler; once set, it is used to filter the URLs to be crawled. If a URL to be crawled does not belong to the allowed domains, it is filtered out.

3. Run the scrapy crawler

Example: scrapy crawl itcast
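
As a side note, scrapy crawl also accepts a -o option that exports the scraped items straight to a file, for example:

scrapy crawl itcast -o teachers.json   # the output file name here is just an example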

Write the itcast.py crawler file:

# -*- coding: utf-8 -*-
import scrapy


class ItcastSpider(scrapy.Spider):
    # Crawler name, used as a parameter when the crawler runs
    name = 'itcast'
    # Domains allowed to be crawled
    allowed_domains = ['itcast.cn']
    # Starting URLs
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml#ajacaee']

    # Data extraction method: receives the response passed back through the
    # downloader middleware and defines the site-specific operations
    def parse(self, response):
        # Get all teacher nodes
        t_list = response.xpath('//div[@class="li_txt"]')
        print(t_list)
        # Traverse the list of teacher nodes
        for teacher in t_list:
            # xpath returns a list of Selector objects; extract_first() extracts
            # the data held by the first Selector object
            tea_dict = {}
            tea_dict['name'] = teacher.xpath('./h3/text()').extract_first()
            tea_dict['title'] = teacher.xpath('./h4/text()').extract_first()
            tea_dict['desc'] = teacher.xpath('./p/text()').extract_first()
            yield tea_dict

Run it and you will find that it works!

4. After data modeling (once the data to be crawled has been defined, use a pipeline for data persistence)

# -*- coding: utf-8 -*-
import scrapy
from ..items import UbuntuItem


class ItcastSpider(scrapy.Spider):
    # Crawler name, used as a parameter when the crawler runs
    name = 'itcast'
    # Domains allowed to be crawled
    allowed_domains = ['itcast.cn']
    # Starting URLs
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml#ajacaee']

    # Data extraction method: receives the response passed back through the
    # downloader middleware and defines the site-specific operations
    def parse(self, response):
        # Get all teacher nodes
        t_list = response.xpath('//div[@class="li_txt"]')
        print(t_list)
        # Traverse the list of teacher nodes
        for teacher in t_list:
            # Create a new item for each teacher; xpath returns a list of Selector
            # objects and extract_first() extracts the data from the first one
            item = UbuntuItem()
            item['name'] = teacher.xpath('./h3/text()').extract_first()
            item['title'] = teacher.xpath('./h4/text()').extract_first()
            item['desc'] = teacher.xpath('./p/text()').extract_first()
            yield item

Notes:

  1. A spider that inherits from scrapy.Spider must define a parse() method; it is the callback that handles the responses to the URLs in start_urls.
  2. If the site structure is more complex, you can also define additional parsing functions.
  3. A URL extracted in a parsing function must belong to allowed_domains if you want to send a request for it, but the URLs in start_urls are not subject to this restriction.
  4. parse() uses yield to return data. Note: the only objects a parsing function can yield are BaseItem, Request, dict and None (see the sketch below).
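
To make the last point concrete, here is a hedged sketch of a parsing method that yields both a data dict and a follow-up Request (the next-page XPath is invented purely for illustration):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    def parse(self, response):
        # Yield data as plain dicts
        for teacher in response.xpath('//div[@class="li_txt"]'):
            yield {'name': teacher.xpath('./h3/text()').extract_first()}

        # Yield a Request for another page; it goes back through the engine to the scheduler
        next_url = response.xpath('//a[@class="next"]/@href').extract_first()  # hypothetical selector
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)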

Locating elements and extracting data and attribute values in a Scrapy crawler (a short example follows):

  1. The response.xpath method returns a list-like type containing Selector objects. It can be operated on like a list, but it has some additional methods.
  2. Additional method extract(): returns a list of strings.
  3. Additional method extract_first(): returns the first string in the list; if the list is empty, it returns None.
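
A quick illustration of the difference, using the same li_txt nodes as above:

# Inside a parsing method:
names = response.xpath('//div[@class="li_txt"]/h3/text()').extract()        # list of strings (possibly empty)
first = response.xpath('//div[@class="li_txt"]/h3/text()').extract_first()  # first string, or None if nothing matched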

Common attributes of the response object (a small usage sketch follows):

  response.url: the URL of the current response
  response.request.url: the URL of the request corresponding to the current response
  response.headers: the response headers
  response.request.headers: the headers of the request corresponding to the current response
  response.body: the response body, i.e. the HTML source, of bytes type
  response.status: the response status code
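
A small sketch of reading those attributes inside a callback (the utf-8 decode assumes a UTF-8 page):

def parse(self, response):
    print(response.url)                    # URL of the current response
    print(response.status)                 # HTTP status code
    print(response.request.headers)        # headers of the request that produced this response
    html = response.body.decode('utf-8')   # body is bytes; decode it to get the HTML text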

5. Saving data with pipelines

Define the operations on the data in the pipelines.py file:

  1. Define a pipeline class.
  2. Override the pipeline class's process_item method.
  3. The process_item method must return the item to the engine after it has been processed.

Updated pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json


class UbuntuPipeline(object):

    def __init__(self):
        self.file = open('itcast.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # This method has a fixed name and runs once for every item yielded
        # by the crawler file; by default the data must be returned to the
        # engine after the pipeline is done with it
        # Cast the item object to a dict (this operation is only available in Scrapy)
        item = dict(item)
        # 1. Serialize the dictionary data
        # ensure_ascii=False writes non-ASCII characters as-is instead of \u escapes; defaults to True
        json_data = json.dumps(item, ensure_ascii=False, indent=2) + ',\n'

        # 2. Write the data to the file
        self.file.write(json_data)

        return item

    def __del__(self):
        self.file.close()

6. Enable the pipeline in settings.py

In the settings.py file, uncomment the following code:
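
For reference, assuming the project module is named myspider as in step 1, the enabled block would look roughly like this; the number 300 is the pipeline's priority, and lower values run earlier:

ITEM_PIPELINES = {
    'myspider.pipelines.UbuntuPipeline': 300,
}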

7. Scrapy data modeling and requests

(Data modeling is usually done in items.py during the project)

Why model the data?

  1. Defining the item means planning in advance which fields need to be crawled, which prevents manual mistakes: once the definition is done, the system checks it automatically at runtime and reports an error if a field does not match.
  2. Together with comments, you can clearly see which fields are to be crawled; undefined fields cannot be crawled. When there are only a few target fields, a dictionary can be used instead.
  3. Some Scrapy-specific components require Item support, such as Scrapy's ImagesPipeline class.

Work in the items.py file:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class UbuntuItem(scrapy.Item):
    # Name of the lecturer
    name = scrapy.Field()
    # Title of the lecturer
    title = scrapy.Field()
    # Motto of the lecturer
    desc = scrapy.Field()

Note: in the from ..items import UbuntuItem line, make sure the import path of the Item is correct, and ignore the error that PyCharm flags.
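
Both import styles below are equivalent from inside the project (assuming the project module is named myspider; adjust the name to your own project):

from ..items import UbuntuItem           # relative import, as used in the spider above
# from myspider.items import UbuntuItem  # absolute import; PyCharm usually complains less about this form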

8. Set the user-agent

Find the following code in the settings.py file, uncomment it, and add the UA:
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36',
}

9. At this point, an entry-level Scrapy crawler is done.

Now go to your project directory and type in scrapy crawl itcast to run scrapy!

Summary of the development process (a command-line recap follows the list):

  1. Create a project: scrapy startproject <project name>

  2. Define the target: model the data in the items.py file

  3. Create a crawler:

    Create the crawler: scrapy genspider <crawler name> <allowed domain>. Complete the crawler: modify start_urls, check and modify allowed_domains, and write the parsing method.

  4. Store data: define the data-processing pipeline in the pipelines.py file, and register the enabled pipeline in the settings.py file
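
Putting the commands from this article together, the whole flow on the command line is:

scrapy startproject myspider
cd myspider
scrapy genspider itcast itcast.cn
# ... edit items.py, the spider file, pipelines.py and settings.py ...
scrapy crawl itcast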

In The End!

Start now, stick with it, and make a little progress every day; in the near future, you will thank yourself for your efforts!

This blogger will continue to update the basic crawler column and the practical crawler column. Friends who have read this article carefully, feel free to like, bookmark, and comment with your thoughts, and follow this blogger to read more crawler articles in the days to come!

If there are any mistakes or inappropriate wording, please point them out in the comment section, thank you! If you want to reprint this article, please contact me for consent first, and credit the source and this blogger's name, thank you!