Environment: PyCharm 2017.1, Scrapy 1.3.3

The target

Crawl the posts on the target site and save their content to items.json.

Page analysis



As the figure above shows, the content we want lives in a div with the class post.

The markup for a single post looks like this:

```html
<div class="post">
  <!-- baidu_tc block_begin: {"action": "DELETE"} -->
  <div class="date"><span>April</span><span class="f">7,</span></div>
  <!-- baidu_tc block_end -->
  <h2>
    <a href="http://www.abckg.com/193.html" title="7 April Tao coins for mileage, Jingdong check-in" rel="bookmark" target="_blank">7 April Tao coins for mileage, Jingdong check-in</a>
    <span>end</span>
  </h2>
  <h6>Release date: 2017-04-07 | Classification: <a href="http://www.abckg.com/xunibi">Virtual currency</a> | Views: 125177</h6>
  <div class="intro">
    <p>Tao coins one-key collection http://021.tw/t/ [PC, 30 gold] https://www.chaidu.com/App/Web/Taobao-Coin/ https://taojinbi.taobao.com/inde... auto_take=true http://h5.m.taobao... </p>
  </div>
</div>
```


Implementation method

1. Define the items

```python
import scrapy


class DemoItem(scrapy.Item):
    id = scrapy.Field()
    title = scrapy.Field()
    href = scrapy.Field()
    content = scrapy.Field()
```

2. Create a crawler named test

```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request

from demo.items import DemoItem


class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["www.abckg.com"]
    start_urls = ['http://www.abckg.com/']

    def parse(self, response):
        # Each .post block is one article (see the HTML sample above).
        # Note: the id field defined on DemoItem is left unset here.
        for resp in response.css('.post'):
            item = DemoItem()
            item['href'] = resp.css('h2 a::attr(href)').extract()
            item['title'] = resp.css('h2 a::text').extract()
            item['content'] = resp.css('.intro p::text').extract()
            yield item

        # Follow pagination links
        urls = response.css('.pageinfo a::attr(href)').extract()
        for url in urls:
            yield Request(url, callback=self.parse)

        # Follow category links from the menu
        categorys = response.css('.menu li a::attr(href)').extract()
        for ct in categorys:
            yield Request(ct, callback=self.parse)
```

3. Modify settings.py by adding the following line, so the exported JSON keeps non-ASCII characters readable:

FEED_EXPORT_ENCODING = 'utf-8'
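Without this setting, Scrapy's JSON exporter escapes every non-ASCII character, so Chinese titles come out as \uXXXX sequences. The effect is the same as the ensure_ascii flag of Python's json.dumps (the title string below is just an illustrative value):

```python
import json

# A Chinese title like the ones on the target site (illustrative value)
title = u'淘金币'

# Default behaviour: non-ASCII characters are escaped to \uXXXX
print(json.dumps({'title': title}))

# With ensure_ascii=False the characters are written verbatim --
# FEED_EXPORT_ENCODING = 'utf-8' does the equivalent for the feed export
print(json.dumps({'title': title}, ensure_ascii=False))
```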

4. Run the crawler: open a terminal (CMD) and enter

scrapy crawl test -o items.json
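With the item defined above, each matched .post becomes one JSON object in items.json, roughly as follows (values taken from the sample HTML earlier, whitespace added for readability; note that extract() returns lists, so every field is a list of strings):

```
[
{"href": ["http://www.abckg.com/193.html"],
 "title": ["7 April Tao coins for mileage, Jingdong check-in"],
 "content": ["Tao coins one-key collection ..."]},
...
]
```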

A known bug

If you run the crawler multiple times with -o, Scrapy 1.x does not overwrite items.json; it appends to the existing file, which leaves duplicate records and can make the JSON invalid. Delete the old file before each run.
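One simple workaround is to delete the stale feed file before starting a new crawl. A minimal sketch (the items.json path matches the command above; Scrapy 2.0 later added a -O overwrite flag, but 1.3.3 only has -o):

```python
import os

FEED = 'items.json'  # the output file passed to `scrapy crawl test -o`


def remove_stale_feed(path=FEED):
    """Delete the previous run's output so the new crawl starts clean."""
    if os.path.exists(path):
        os.remove(path)


remove_stale_feed()
```

Running this (or simply `del items.json` / `rm items.json`) before each crawl avoids the append behaviour.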

Expandable content

1. Run the crawler on a schedule, fetch new data, and send an email notification when a check shows the site has been updated
2. Check whether the collected data contains duplicates
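The duplicate check in point 2 can be sketched without any framework: remember every href already seen and flag repeats. (In a real Scrapy project this logic would live in an item pipeline that raises scrapy.exceptions.DropItem; the class and item values here are illustrative.)

```python
class DedupeFilter(object):
    """Track hrefs already seen and flag duplicates (illustrative sketch)."""

    def __init__(self):
        self.seen = set()

    def is_duplicate(self, item):
        # extract() returns a list, so use a tuple as the hashable set key
        key = tuple(item.get('href', []))
        if key in self.seen:
            return True
        self.seen.add(key)
        return False


f = DedupeFilter()
item = {'href': ['http://www.abckg.com/193.html']}
print(f.is_duplicate(item))  # False: first sighting of this href
print(f.is_duplicate(item))  # True: the same href again
```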