Contents

  • Preface
  • Scrapy data flow
  • Scrapy components
  • Douban Top 250 movies
  • Afterword
  • Book giveaway

Preface

Why Scrapy? Scan any batch of crawler job postings and Scrapy shows up again and again, mainly because Scrapy really is that strong. So what exactly makes it strong? You'll find the answer in this article.

Scrapy data flow

Let's take a look at how Scrapy works; the full picture is in the official Scrapy documentation.

1. The engine gets the initial requests from the spider and starts the crawl.
2. The engine schedules the requests with the scheduler and asks for the next request to crawl.
3. The scheduler returns the next request to the engine.
4. The engine sends the request to the downloader, passing through the downloader middleware, and the downloader fetches the data over the network.
5. Once the page finishes downloading, the downloader returns the result for that page.
6. The engine passes the downloader's response through the middleware to the spider for processing.
7. The spider processes the response and returns the extracted items and new requests to the engine, again through the middleware.
8. The engine sends the processed items to the item pipeline and hands the new requests to the scheduler, which plans the next fetch.
9. The process repeats (from step 1) until all URL requests have been crawled.
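
To make the loop concrete, here is a minimal sketch of the two pieces you write yourself: the start URLs and a parse callback that yields both items (step 7) and follow-up requests (step 8). The class name and URL here are placeholders, not part of the Douban project below.

import scrapy


class MinimalSpider(scrapy.Spider):
    name = "minimal"
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        # Step 7: hand an extracted item back to the engine
        yield {"page_title": response.css("title::text").get()}
        # Step 8: new requests go back through the engine to the scheduler
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)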

Scrapy components

Engine

The engine is responsible for controlling the data flow between all the components, and for triggering events when certain actions occur.

Scheduler

The scheduler receives requests from the engine, enqueues them, and feeds them back to the engine when the engine asks for the next request.

Downloader

The downloader fetches pages over the network at the engine's request and returns the results to the engine.

Spider

The spider issues the initial requests, processes the downloader responses the engine passes back to it, and returns extracted items and new requests (URLs that match its crawl rules) to the engine.

Item pipeline

The item pipeline handles the items the spiders extract and the engine passes along, and persists the data, for example to a database or a file.

Downloader middleware

Downloader middleware sits between the engine and the downloader. It is a set of hooks (plug-ins) that can intercept requests on their way to the downloader, influence how the data is downloaded, and process the responses returned to the engine.
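
As a sketch of what such a hook might look like (the class name and header value are made up for illustration), a downloader middleware can adjust every outgoing request in its process_request method; it would then be enabled through the DOWNLOADER_MIDDLEWARES setting:

class CustomHeaderMiddleware(object):
    # Called for every request on its way to the downloader.
    # Returning None lets processing continue as normal.
    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', 'my-crawler/1.0')
        return None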

Spider middleware

Spider middleware sits between the engine and the spiders. It is a set of hooks (plug-ins) that process the responses going to the spiders, and the items and new requests the spiders return to the engine.
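
The spider-side counterpart (again purely illustrative) is process_spider_output, which sees every item and request a spider callback yields and can filter them before they reach the engine; it is enabled through the SPIDER_MIDDLEWARES setting:

import scrapy


class DropUntitledMiddleware(object):
    # Called with the iterable of items and requests a callback produced.
    def process_spider_output(self, response, result, spider):
        for element in result:
            # Let requests through untouched; drop items with no title field
            if isinstance(element, scrapy.Request) or element.get('title'):
                yield element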

Douban Top 250 movies

Installation

pip install scrapy

Initialize the project

scrapy startproject doubanTop250

The directory architecture is as follows (the standard scrapy startproject layout); douban_spider.py is the one file created manually.
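
doubanTop250/
    scrapy.cfg
    doubanTop250/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            douban_spider.py   # created manually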

Start the crawler

scrapy crawl douban

spider

The following code is douban_spider.py, commented for easier comprehension.

import scrapy
from scrapy import Request
from scrapy.selector import Selector
from urllib.parse import urljoin

from doubanTop250.items import Doubantop250Item


class RecruitSpider(scrapy.spiders.Spider):
    # The spider name referred to earlier; this is what scrapy crawl uses
    name = "douban"
    # Set the domain names allowed to be crawled
    allowed_domains = ["douban.com"]
    # Set the starting URL
    start_urls = ["https://movie.douban.com/top250"]

    # Whenever page data is downloaded, it is sent here for parsing,
    # which then yields new links to join the request queue
    def parse(self, response):
        selector = Selector(response)
        Movies = selector.xpath('//div[@class="info"]')
        for eachMovie in Movies:
            item = Doubantop250Item()  # one fresh item per movie
            title = eachMovie.xpath('div[@class="hd"]/a/span/text()').extract()  # multiple span tags
            fullTitle = "".join(title)
            movieInfo = eachMovie.xpath('div[@class="bd"]/p/text()').extract()
            star = eachMovie.xpath('div[@class="bd"]/div[@class="star"]/span/text()').extract()[0]
            quote = eachMovie.xpath('div[@class="bd"]/p[@class="quote"]/span/text()').extract()
            # quote may be empty
            if quote:
                quote = quote[0]
            else:
                quote = ''
            item['title'] = fullTitle
            item['movieInfo'] = '; '.join(movieInfo)
            item['star'] = star
            item['quote'] = quote
            yield item
        nextLink = selector.xpath('//span[@class="next"]/link/@href').extract()
        # Page 10 is the last page and has no next-page link
        if nextLink:
            nextLink = nextLink[0]
            yield Request(urljoin(response.url, nextLink), callback=self.parse)

pipelines

In the code above, yield item is what hands the scraped data back to the engine. The item pipeline then generally takes charge of:

  • Checking that certain fields are present
  • Saving the data to a database or file

Since this is just a first attempt at a Scrapy crawler, I left the default pipeline unchanged:
class Doubantop250Pipeline(object):
    def process_item(self, item, spider):
        return item
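
As a sketch of what a working pipeline might look like (the class name is made up for illustration), here is the field check mentioned above; raising DropItem discards an item instead of passing it on, and the pipeline would be enabled via the ITEM_PIPELINES setting:

from scrapy.exceptions import DropItem


class ValidateTitlePipeline(object):
    # Called once for every item the spider yields
    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem('Missing title in %s' % item)
        return item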

items

Define the fields we want to extract:

import scrapy


class Doubantop250Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # Movie title
    movieInfo = scrapy.Field()  # Description of the movie: director, stars, genre, etc.
    star = scrapy.Field()  # Movie rating
    quote = scrapy.Field()  # Memorable one-line quote for the movie

settings

settings.py holds the crawler's configuration; since this article is just a primer on Scrapy, the settings will be covered in a later post.
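
Even so, two settings are worth knowing right away, because sites like Douban often refuse requests that announce themselves as a bot. A minimal sketch of what one might add to settings.py (the User-Agent string is a placeholder, not a recommendation):

# Present a browser-like identity (placeholder value)
USER_AGENT = 'Mozilla/5.0 (compatible; doubanTop250-tutorial)'

# Be polite: pause between successive requests
DOWNLOAD_DELAY = 1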

Start the crawler

scrapy crawl douban
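
To inspect the results without writing a pipeline, Scrapy's built-in feed exports can dump the yielded items straight to a file:

scrapy crawl douban -o top250.json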

Afterword

That wraps up this first look at Scrapy. Its clear division of labor between engine, scheduler, downloader, spiders, pipelines, and middleware is a big part of what makes it so strong, and later posts will dig deeper.

Book giveaway

Per the giveaway rules, a winner who fails to claim the prize in time is deemed to have forfeited it. After checking the backend data, I have decided to pass that slot to the reader [French fries].