Environment: PyCharm 2017.1, Scrapy 1.3.3

The target

Crawl the posts on the target site and save their content to items.json.

Page analysis



As the figure above shows, the content we want lives in a div with the class post.

The markup for a single post looks like this:

```html
<div class="post">
  <!-- baidu_tc block_begin: {"action": "DELETE"} -->
  <div class="date"><span>April</span><span class="f">7,</span></div>
  <!-- baidu_tc block_end -->
  <h2>
    <a href="http://www.abckg.com/193.html" title="7 April Tao coins for mileage, Jingdong check-in" rel="bookmark" target="_blank">7 April Tao coins for mileage, Jingdong check-in</a>
    <span>end</span>
  </h2>
  <h6>Release date: 2017-04-07 | Classification: <a href="http://www.abckg.com/xunibi">Virtual currency</a> | Views: 125177</h6>
  <div class="intro">
    <p>Tao coins one-key collection http://021.tw/t/ [PC, 30 gold] https://www.chaidu.com/App/Web/Taobao-Coin/ https://taojinbi.taobao.com/inde... auto_take=true http://h5.m.taobao... </p>
  </div>
</div>
```


Implementation method

1. Define the items

```python
import scrapy


class DemoItem(scrapy.Item):
    id = scrapy.Field()
    title = scrapy.Field()
    href = scrapy.Field()
    content = scrapy.Field()
```

2. Create a crawler named test

```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request

from demo.items import DemoItem


class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["www.abckg.com"]
    start_urls = ['http://www.abckg.com/']

    def parse(self, response):
        # Each .post block is one article (see the HTML sample above).
        # Note: the id field defined on DemoItem is left unset here.
        for resp in response.css('.post'):
            item = DemoItem()
            item['href'] = resp.css('h2 a::attr(href)').extract()
            item['title'] = resp.css('h2 a::text').extract()
            item['content'] = resp.css('.intro p::text').extract()
            yield item

        # Follow pagination links
        urls = response.css('.pageinfo a::attr(href)').extract()
        for url in urls:
            yield Request(url, callback=self.parse)

        # Follow category links from the menu
        categorys = response.css('.menu li a::attr(href)').extract()
        for ct in categorys:
            yield Request(ct, callback=self.parse)
```

3. Modify settings.py by adding the following line, so the exported JSON keeps non-ASCII characters readable:

FEED_EXPORT_ENCODING = 'utf-8'
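Without this setting, Scrapy's JSON exporter escapes every non-ASCII character, so Chinese titles come out as \uXXXX sequences. The effect is the same as the ensure_ascii flag of Python's json.dumps (the title string below is just an illustrative value):

```python
import json

# A Chinese title like the ones on the target site (illustrative value)
title = u'淘金币'

# Default behaviour: non-ASCII characters are escaped to \uXXXX
print(json.dumps({'title': title}))

# With ensure_ascii=False the characters are written verbatim --
# FEED_EXPORT_ENCODING = 'utf-8' does the equivalent for the feed export
print(json.dumps({'title': title}, ensure_ascii=False))
```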

4. Run the crawler: open a terminal (CMD) and enter

scrapy crawl test -o items.json
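With the item defined above, each matched .post becomes one JSON object in items.json, roughly as follows (values taken from the sample HTML earlier, whitespace added for readability; note that extract() returns lists, so every field is a list of strings):

```
[
{"href": ["http://www.abckg.com/193.html"],
 "title": ["7 April Tao coins for mileage, Jingdong check-in"],
 "content": ["Tao coins one-key collection ..."]},
...
]
```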

A known bug

If you run the crawler multiple times with -o, Scrapy 1.x does not overwrite items.json; it appends to the existing file, which leaves duplicate records and can make the JSON invalid. Delete the old file before each run.
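One simple workaround is to delete the stale feed file before starting a new crawl. A minimal sketch (the items.json path matches the command above; Scrapy 2.0 later added a -O overwrite flag, but 1.3.3 only has -o):

```python
import os

FEED = 'items.json'  # the output file passed to `scrapy crawl test -o`


def remove_stale_feed(path=FEED):
    """Delete the previous run's output so the new crawl starts clean."""
    if os.path.exists(path):
        os.remove(path)


remove_stale_feed()
```

Running this (or simply `del items.json` / `rm items.json`) before each crawl avoids the append behaviour.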

Expandable content

1. Run the crawler on a schedule, fetch new data, and send an email notification when a check shows the site has been updated
2. Check whether the collected data contains duplicates
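The duplicate check in point 2 can be sketched without any framework: remember every href already seen and flag repeats. (In a real Scrapy project this logic would live in an item pipeline that raises scrapy.exceptions.DropItem; the class and item values here are illustrative.)

```python
class DedupeFilter(object):
    """Track hrefs already seen and flag duplicates (illustrative sketch)."""

    def __init__(self):
        self.seen = set()

    def is_duplicate(self, item):
        # extract() returns a list, so use a tuple as the hashable set key
        key = tuple(item.get('href', []))
        if key in self.seen:
            return True
        self.seen.add(key)
        return False


f = DedupeFilter()
item = {'href': ['http://www.abckg.com/193.html']}
print(f.is_duplicate(item))  # False: first sighting of this href
print(f.is_duplicate(item))  # True: the same href again
```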