Introduction

From what we have learned so far, we already have a basic understanding of crawlers and have crawled some static websites and simple dynamic websites. Now it's time to start learning a more powerful crawler framework.

First look at the Scrapy library

Scrapy introduction:

Scrapy is an application framework designed to crawl website data and extract structured data. It can be used in a range of applications including data mining, information processing, and storing historical data. Although originally designed for page scraping (more precisely, web scraping), it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.
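Before anything else, Scrapy has to be installed. Assuming a working Python and pip setup, the usual way is:

pip install scrapy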

Scrapy components

  1. Scrapy Engine: responsible for communication, signals, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.
  2. Scheduler: receives Requests from the Engine, queues them in a certain order, and returns them to the Engine when the Engine asks for them.
  3. Downloader: downloads all Requests sent by the Scrapy Engine and returns the Responses to the Engine, which hands them to the Spider for processing.
  4. Spiders: analyze and extract data from all Responses, obtain the data required by the Item fields, and submit follow-up URLs to the Engine so they re-enter the Scheduler.
  5. Item Pipeline: where Items retrieved from Spiders are post-processed (detailed analysis, filtering, storage, etc.); see the sketch below.
  6. Downloader Middlewares: components that can be customized to extend the download functionality.
  7. Spider Middlewares: components that customize and extend how the Engine and Spiders communicate.

Data Flow

(Scrapy architecture diagram: the green lines indicate the data flow.)
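To make component 5 more concrete, here is a minimal, hypothetical sketch of a custom Item Pipeline (the class and field names are illustrative and not part of this project); it would be enabled by listing it under ITEM_PIPELINES in settings.py:

from scrapy.exceptions import DropItem

class ExamplePipeline:
    def process_item(self, item, spider):
        # Drop items that are missing a required field, pass the rest through.
        if not item.get('name'):
            raise DropItem("missing name field")
        return item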

A first attempt

Before crawling, create a Scrapy project: open CMD, change to the directory where you want the project to live, and run the following command.

scrapy startproject scrapyspider

The following information is displayed after the creation is successful:

New Scrapy project 'scrapyspider', using template directory 'k:\apy\lib\site-packages\scrapy\templates\project', created in:
    K:\learning\Scrapy\spider

You can start your first spider with:
    cd scrapyspider
    scrapy genspider example example.com

The created project directory is as follows.

scrapyspider/
    scrapy.cfg            # deployment configuration file
    scrapyspider/         # the project's Python module
        __init__.py
        items.py          # Item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # Item Pipelines
        settings.py       # project settings
        spiders/          # directory where spider files live
            __init__.py

Writing the items.py file: next, edit scrapyspider/items.py. An Item is used to hold the crawled data; it defines a dictionary-like container of structured data fields.

Items use a simple class-definition syntax with Field object declarations. This project needs to crawl two kinds of information, the product name and its link, so two fields are created. The code is as follows.

import scrapy

class jdItem(scrapy.Item):
    name = scrapy.Field()   # product name
    src = scrapy.Field()    # product link
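As a quick sanity check (the values below are made up for illustration), an Item instance behaves much like a dictionary:

item = jdItem(name='example product', src='https://example.com/item')
print(item['name'])   # 'example product'
print(dict(item))     # {'name': 'example product', 'src': 'https://example.com/item'}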

Writing the spider file: first create a spider file in the spiders directory. Since I want to crawl JD, I create a file named jd_spider.py.

You can also generate the spider file from CMD with the scrapy genspider command; files created this way come with default template code.
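For reference, a hedged sketch of that command-based route (run it inside the project directory; the exact contents of the generated file depend on the Scrapy version):

cd scrapyspider
scrapy genspider jd_spider1 search.jd.com

This creates a file under spiders/ containing a spider skeleton with name, allowed_domains, start_urls, and an empty parse method, which you then fill in as below.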

Once you’ve created the file, you need to import the Spider class and the jdItem class you just created.
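A minimal sketch of those imports, assuming the project is named scrapyspider as created above (so the items module lives at scrapyspider/items.py):

from scrapy import Spider              # base class for spiders
from scrapyspider.items import jdItem  # the Item defined earlier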

So with that foundation, it’s actually pretty easy to understand.

The first step is to define the spider class, give it a name, and tell it which URL to start crawling from (start_urls).

class jd_spider1(Spider):
    name = 'jd_spider1'
    start_urls = ['https://search.jd.com/Search?keyword=%E5%8F%A3%E7%BA%A2&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E5%8F%A3%E7%BA%A2&stock=1&psort=3&click=0']

Then comes writing the page-parsing logic. Instead of the BS4 library, XPath syntax is used here to parse the page; the approach is essentially the same, selecting nodes, only the expressions differ. Let's work out how to extract the names and links.

A quick look at the page source shows where the required information is located.

First, the product information: each product is stored in an li tag with class="gl-item".

The product name and link are then found in the title and href attributes of the a tag with target="_blank".

The name can therefore be extracted with mes.xpath('.//a[@target = "_blank"]//@title').
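If you want to verify these expressions before writing the parse method, Scrapy's interactive shell is handy. A quick sketch, using an abbreviated form of the search URL above (JD may of course block or alter responses to automated requests):

scrapy shell "https://search.jd.com/Search?keyword=%E5%8F%A3%E7%BA%A2&enc=utf-8"
>>> mes_links = response.xpath('//li[@class="gl-item"]')
>>> mes_links[0].xpath('.//a[@target = "_blank"]//@title').extract()[0]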

The next step is to write the extraction code following XPath syntax rules.

    def parse(self, response):
        jditem = jdItem()
        # Each li.gl-item node corresponds to one product.
        mes_links = response.xpath('//li[@class="gl-item"]')
        for mes in mes_links:
            jditem['name'] = mes.xpath('.//a[@target = "_blank"]//@title').extract()[0]
            jditem['src'] = mes.xpath('.//a[@target = "_blank"]//@href').extract()[0]
            yield jditem

At this point, the crawler is ready to do what it needs.

Running the crawler: run the following command in the project directory.

scrapy crawl jd_spider1 -o jingdong.csv

The -o option is a shortcut Scrapy provides for exporting the scraped items, here to CSV format.

If the CSV file is garbled, add FEED_EXPORT_ENCODING = "gb18030" to the settings.py file.
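For reference, a minimal sketch of the relevant line in settings.py (the file generated by scrapy startproject); gb18030 is chosen on the assumption that the CSV will be opened by software defaulting to a Chinese locale encoding, such as Excel on Chinese Windows:

# scrapyspider/settings.py (only the feed-export encoding line shown)
FEED_EXPORT_ENCODING = "gb18030"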