0. Foreword

I am new to crawlers. After a period of practice I wrote a few simple ones, and there are many online examples of crawling Douban movies, but most are very simple and only cover requesting and parsing the page. As a novice I wanted a more complete example, so I collected many examples and articles, combined them, and added some content on top of the existing Douban crawlers to make something relatively complete. It mainly covers project setup, requesting pages, xpath parsing, automatic page turning, data output, encoding handling, and so on.

1. Build the project

Execute the following command to create a Scrapy crawler project:

scrapy startproject spider_douban

After the command is executed, the spider_douban folder is created with the following directory structure:

.
├── scrapy.cfg
└── spider_douban
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

2. Build crawler data model

Open the ./spider_douban/items.py file and edit it as follows:

import scrapy


class DoubanMovieItem(scrapy.Item):
    # ranking
    ranking = scrapy.Field()
    # movie name
    movie_name = scrapy.Field()
    # score
    score = scrapy.Field()
    # number of ratings
    score_num = scrapy.Field()

3. Create a crawler file

spiders/douban_spider.py

from scrapy import Request
from scrapy.spiders import Spider
from spider_douban.items import DoubanMovieItem


class DoubanMovieTop250Spider(Spider):
    name = 'douban_movie_top250'
    start_urls = {
        'https://movie.douban.com/top250'
        }

    '''
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
        }

    def start_requests(self):
        url = 'https://movie.douban.com/top250'
        yield Request(url, headers=self.headers)
    '''

    def parse(self, response):
        item = DoubanMovieItem()
        movies = response.xpath('//ol[@class="grid_view"]/li')
        print(movies)
        print('=============================================')
        for movie in movies:
            item['ranking'] = movie.xpath(
                './/div[@class="pic"]/em/text()').extract()[0]
            item['movie_name'] = movie.xpath(
                './/div[@class="hd"]/a/span[1]/text()').extract()[0]
            item['score'] = movie.xpath(
                './/div[@class="star"]/span[@class="rating_num"]/text()'
            ).extract()[0]
            item['score_num'] = movie.xpath(
                './/div[@class="star"]/span/text()').re(r'(\d+)人评价')[0]
            yield item

        next_url = response.xpath('//span[@class="next"]/a/@href').extract()
        if next_url:
            next_url = 'https://movie.douban.com/top250' + next_url[0]
            yield Request(next_url)

Notes on each part of the crawler file

The douban_spider.py file consists of several main parts.

Module imports

from scrapy import Request
from scrapy.spiders import Spider
from spider_douban.items import DoubanMovieItem

The Request class is used to request the page to be crawled, the Spider class is the base class for crawlers, and DoubanMovieItem is the crawler data model we built earlier in items.py.

Initial setup

In DoubanMovieTop250Spider, the basic information of the crawler is first defined:

name: the name of the crawler within the project; running scrapy list in the project directory prints the list of crawlers that have been defined.


start_urls: the address of the first page to crawl.


headers: a User-Agent message attached when sending a page request to the web server, telling the server what type of browser or device is requesting the page. For websites without even a simple anti-crawling mechanism, the headers part can be omitted.

So that the web server cannot easily identify the crawler, User-Agent information is generally defined when a crawler sends a web request. There are two ways to write it.

  • The first way to define headers:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    }

def start_requests(self):
    url = 'https://movie.douban.com/top250'
    yield Request(url, headers=self.headers)

As you can see, start_urls is no longer defined; start_requests is defined instead, with the starting URL inside the function. In addition, a headers dictionary is defined and passed along when the Request is sent. This approach is simple and intuitive, but the disadvantage is that during the execution of the crawler project, all requests carry the same User-Agent attribute.

  • The second way to define headers:
    start_urls = {
        'https://movie.douban.com/top250'
        }

Here the start_urls attribute is defined simply and directly, while the headers of each Request are defined in a different way, described later (see the random User-Agent middleware below).

Parse processing function

1. Create an item instance based on our defined DoubanMovieItem class

item = DoubanMovieItem()

2. Parse the page – get the content frame

By analyzing the page source, we can see that the movie information is kept inside an <ol> tag, which has a unique class attribute grid_view, and each individual movie's information is kept inside an <li> tag. The following code gets the contents of all <li> tags under the <ol> tag whose class attribute is grid_view:

movies = response.xpath('//ol[@class="grid_view"]/li')

    3. Parse the page – get the items

    Within each <li> tag there is further internal structure, which is parsed out with xpath() and assigned to the fields of the item instance. You can easily find the tag definitions by viewing the source of the movie.douban.com/top250 page. If you check the type of the movies variable with the type() function, you will find it is <class 'scrapy.selector.unified.SelectorList'>. Each <li> tag under the <ol> tag is an item in this list, so you can iterate over movies.
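
    A quick way to confirm this is a short check in the Scrapy shell (a sketch that mirrors the spider code above; it assumes the request is not blocked by the site):

    # launched with: scrapy shell https://movie.douban.com/top250
    movies = response.xpath('//ol[@class="grid_view"]/li')
    print(type(movies))  # <class 'scrapy.selector.unified.SelectorList'>
    print(len(movies))   # 25 <li> entries per page
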
    First, look at the page structure inside an <li> tag. You can see the tags where each piece of data to be extracted is located:

    • Ranking: in the <em> tag under the <div> tag whose class attribute is pic.

    • Movie name: in the first <span> tag under the <a> tag inside the <div> tag whose class attribute is hd.

    • Rating: in the <span> tag whose class attribute is rating_num, inside the <div> tag whose class attribute is star.

    • Number of ratings: in a <span> tag under the <div> tag whose class attribute is star; because a regular expression is used here, there is no need to specify which <span> tag.

    Back in the code section, iterate over the movies you defined earlier, retrieving the data you want to grab item by item.

    for movie in movies:
        item['ranking'] = movie.xpath(
            './/div[@class="pic"]/em/text()').extract()[0]
        item['movie_name'] = movie.xpath(
            './/div[@class="hd"]/a/span[1]/text()').extract()[0]
        item['score'] = movie.xpath(
            './/div[@class="star"]/span[@class="rating_num"]/text()'
        ).extract()[0]
        item['score_num'] = movie.xpath(
            './/div[@class="star"]/span/text()').re(r'(\d+)人评价')[0]
        yield item
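
    Note that extract()[0] raises an IndexError when an xpath matches nothing. A slightly more defensive variant (a sketch, not part of the original tutorial) uses extract_first(), which returns None instead:

    # inside the same loop; extract_first() avoids the [0] indexing
    item['ranking'] = movie.xpath('.//div[@class="pic"]/em/text()').extract_first()
    item['movie_name'] = movie.xpath('.//div[@class="hd"]/a/span[1]/text()').extract_first()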

    4. URL jump (page turning)

    So far we can crawl the first page at https://movie.douban.com/top250, but that is only 25 records. To crawl all 250 records, the following code is executed:

            next_url = response.xpath('//span[@class="next"]/a/@href').extract()
    
            if next_url:
                next_url = 'https://movie.douban.com/top250' + next_url[0]
                yield Request(next_url)

    First we parse the link to the next page out of the current page with xpath and assign it to the next_url variable. If we are currently on the first page, the next-page link is ?start=25&filter=. Joining the parsed link with the base URL forms the full address; executing Request() again then crawls all 250 records. Note: the result of xpath parsing is a list, so it is referenced as next_url[0].
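
    Plain string concatenation works here only because the parsed href is relative to the start URL. A sketch of a slightly more robust alternative (not in the original code) uses Scrapy's response.urljoin():

    next_url = response.xpath('//span[@class="next"]/a/@href').extract_first()
    if next_url:
        # urljoin() resolves the relative href (e.g. '?start=25&filter=') against response.url
        yield Request(response.urljoin(next_url))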

    4. Handle random header attributes (random User-Agent)

    To send a random User-Agent header with each request, two files are mainly modified:

    settings.py

    USER_AGENT_LIST = [
        'zspider/0.9-dev http://feedback.redkolibri.com/',
        'Xaldon_WebSpider/2.0.b1',
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/)',
        'Mozilla/5.0 (compatible; Speedy Spider; http://www.entireweb.com/about/search_tech/speedy_spider/)',
        'Speedy Spider (Entireweb; Beta/1.3; http://www.entireweb.com/about/search_tech/speedyspider/)',
        'Speedy Spider (Entireweb; Beta/1.2; http://www.entireweb.com/about/search_tech/speedyspider/)',
        'Speedy Spider (Entireweb; Beta/1.1; http://www.entireweb.com/about/search_tech/speedyspider/)',
        'Speedy Spider (Entireweb; Beta/1.0; http://www.entireweb.com/about/search_tech/speedyspider/)',
        'Speedy Spider (Beta/1.0; www.entireweb.com)',
        'Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/)',
        'Speedy Spider (http://www.entireweb.com/about/search_tech/speedyspider/)',
        'Speedy Spider (http://www.entireweb.com)',
        'Sosospider+(+http://help.soso.com/webspider.htm)',
        'sogou spider',
        'Nusearch Spider (www.nusearch.com)',
        'nuSearch Spider (compatible; MSIE 4.01; Windows NT)',
        'lmspider ([email protected])',
        'lmspider [email protected]',
        'ldspider (http://code.google.com/p/ldspider/wiki/Robots)',
        'iaskspider/2.0(+http://iask.com/help/help_index.html)',
        'iaskspider',
        'hl_ftien_spider_v1.1',
        'hl_ftien_spider',
        'FyberSpider (+http://www.fybersearch.com/fyberspider.php)',
        'FyberSpider',
        'everyfeed-spider/2.0 (http://www.everyfeed.com)',
        'envolk[ITS]spider/1.6 (+http://www.envolk.com/envolkspider.html)',
        'envolk[ITS]spider/1.6 (http://www.envolk.com/envolkspider.html)',
        'Baiduspider+(+http://www.baidu.com/search/spider_jp.html)',
        'Baiduspider+(+http://www.baidu.com/search/spider.htm)',
        'BaiDuSpider',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0) AddSugarSpiderBot www.idealobserver.com',
    ]

    DOWNLOADER_MIDDLEWARES = {
        'spider_douban.middlewares.RandomUserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    }

    USER_AGENT_LIST defines a number of browser User-Agent attributes. There are many available online, and you can add them directly to the list. Note that some User-Agent strings are for mobile devices (phones or tablets). DOWNLOADER_MIDDLEWARES defines the downloader middleware, which is called when page request data is sent.

    middlewares.py

    from spider_douban.settings import USER_AGENT_LIST
    import random
    
    class RandomUserAgentMiddleware():
        def process_request(self, request, spider):
            ua  = random.choice(USER_AGENT_LIST)
            if ua:
                request.headers.setdefault('User-Agent', ua)

    Every time request data is sent, RandomUserAgentMiddleware() randomly selects one User-Agent record from USER_AGENT_LIST.

    5. Save the results

    Edit pipelines.py:

    from scrapy import signals
    from scrapy.contrib.exporter import CsvItemExporter
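    # Note: in newer Scrapy versions this import lives at scrapy.exporters, i.e.
    # from scrapy.exporters import CsvItemExporter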
    
    class SpiderDoubanPipeline(CsvItemExporter):
        def __init__(self):
            self.files = {}
    
        @classmethod
        def from_crawler(cls, crawler):
            print('==========pipeline==========from_crawler==========')
            pipeline = cls()
            crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
            crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
            return pipeline
    
        def spider_opened(self, spider):
            savefile = open('douban_top250_export.csv', 'wb+')
            self.files[spider] = savefile
            print('==========pipeline==========spider_opened==========')
            self.exporter = CsvItemExporter(savefile)
            self.exporter.start_exporting()
    
        def spider_closed(self, spider):
            print('==========pipeline==========spider_closed==========')
            self.exporter.finish_exporting()
            savefile = self.files.pop(spider)
            savefile.close()
    
        def process_item(self, item, spider):
            print('==========pipeline==========process_item==========')
            print(type(item))
            self.exporter.export_item(item)
            return item

    The SpiderDoubanPipeline class was generated when the project was created; here it is modified to save the scraped results to a file.

    def from_crawler(cls, crawler):

    • If present, this class method is called to create a pipeline instance from a Crawler. It must return a new pipeline instance. The Crawler object provides access to all core Scrapy components, such as settings and signals; it is a way for the pipeline to access them and hook its functionality into Scrapy.

    In this method, an instance of the pipeline class (cls) is created: pipeline.

    Signals: Scrapy uses signals to notify you when certain events occur. You can catch signals in your Scrapy project (for example, with an extension) to perform additional tasks or add extra functionality. Although a signal provides several arguments, the handler does not have to accept all of them; the signal dispatching mechanism only delivers the arguments that the handler accepts. You can connect to signals (or send your own) through the Signals API.

    connect: connects a receiver function to a signal. A signal can be any object, although Scrapy provides some predefined signals.
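
    As a minimal sketch of this API outside a pipeline (a hypothetical extension, not part of this project), connecting a receiver function to a predefined signal looks like this:

    from scrapy import signals

    class LogSpiderClosedExtension:
        @classmethod
        def from_crawler(cls, crawler):
            ext = cls()
            # connect() hooks the receiver to the predefined spider_closed signal
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
            return ext

        def spider_closed(self, spider):
            spider.logger.info('spider %s closed', spider.name)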

    def spider_opened(self, spider):

    • This signal is sent after a spider has been opened for crawling. It is typically used to allocate spider resources, but it can be used for anything. This signal supports returning deferreds from its handlers.

    In this method, an instance of a file object is created: savefile.

    CsvItemExporter(savefile): exports items to a CSV file. If you add the fields_to_export attribute, it defines the CSV columns and their order.
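
    For example, a sketch that fixes the column order using this project's item fields:

    self.exporter = CsvItemExporter(
        savefile,
        # columns appear in the CSV in exactly this order
        fields_to_export=['ranking', 'movie_name', 'score', 'score_num'])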

    def spider_closed(self, spider):

    • This signal is sent after a spider has been closed. It can be used to release the resources allocated in spider_opened. This signal supports returning deferreds from its handlers.

    def process_item(self, item, spider):

    • This method is called by every item pipeline component. It must either return an Item (or any subclass) object or raise a DropItem exception. A dropped item is not processed by subsequent pipeline components.
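
    A sketch of what dropping an item looks like (hypothetical validation, not part of this project's pipeline):

    from scrapy.exceptions import DropItem

    def process_item(self, item, spider):
        # discard records without a movie name; later pipeline components never see them
        if not item.get('movie_name'):
            raise DropItem('missing movie_name in %s' % item)
        return item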

    Enable the pipeline

    To enable SpiderDoubanPipeline, add it to ITEM_PIPELINES in settings.py:

    ITEM_PIPELINES = {
        'spider_douban.pipelines.SpiderDoubanPipeline': 300,
    }

    6. Run the crawler

    scrapy crawl douban_movie_top250

    Run the crawler and you can see the data it scrapes...

    If the previous part of the pipeline code is not written, you can also use the following command to export data directly during the crawler execution:

    scrapy crawl douban_movie_top250 -o douban.csv

    Add the -o parameter to save the data to the douban.csv file.

    7. The file encoding problem

    I ran the crawler on a Linux server and generated the CSV file there; opened with Excel on Windows 7, it appeared as garbled text. I found some articles online, some of which simply changed the default file encoding on Linux, but I felt that would affect other projects, so I finally chose a relatively easy approach. Follow these steps (a settings-based alternative is sketched after the list):

    1. Do not open the CSV file directly in Excel. Start by opening Excel and creating a blank worksheet.
    2. Choose the Data tab and open From Text under Get External Data.
    3. In the Import Text File dialog box, select the CSV file to import.
    4. In Text Import Wizard - Step 1, set File origin to 65001 : Unicode (UTF-8).
    5. Go to the next step and select comma as the delimiter; the text then imports correctly.
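
    Alternatively, if you export with the -o option instead of the custom pipeline, a settings-based sketch that usually avoids the Excel garbling is to write the CSV with a UTF-8 BOM (assuming a Scrapy version that supports FEED_EXPORT_ENCODING):

    # settings.py
    # 'utf-8-sig' prepends a BOM so that Excel detects UTF-8 automatically
    FEED_EXPORT_ENCODING = 'utf-8-sig'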