PyCharm = PyCharm = PyCharm = PyCharm = PyCharm = PyCharm = PyCharm = PyCharm = PyCharm

1. Create a Scrapy project first:

On the command line, type:

scrapy startproject project_nameCopy the code

Project_name is the project name, for example my project name is py_scrapyJobbole and the generated directory is:


2. Create a new Spider

On the command line, type:

Scrapy genspider jobbole blog.jobbole.comCopy the code
# -*- coding: utf-8 -*- import scrapy class JobboleSpider(scrapy.Spider): name = 'jobbole' allowed_domains = ['blog.jobbole.com'] start_urls = ['http://blog.jobbole.com/111322/'] def parse(self,  response): re_select = response.xpath('//*[@id="post-111322"]/div[1]/h1') passCopy the code

3. Configure setting.py file (this step is important)

BOT_NAME = 'py_scrapyjobbole' SPIDER_MODULES = ['py_scrapyjobbole.spiders'] NEWSPIDER_MODULE = 'py_scrapyjobbole.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent # USER_AGENT  = 'py_scrapyjobbole (+http://www.yourdomain.com)' # Obey robots.txt rules ROBOTSTXT_OBEY = FalseCopy the code

ROBOTSTXT_OBEY = False Must be set to False for breakpoint debugging to work properly. <>


4, create the main.py file in the project directory and debug it later!

from scrapy.cmdline import execute
import sys
import os

# Interrupt point debug py file
# sys.path.append('D:\PyCharm\py_scrapyjobbole')
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
print(os.path.dirname(os.path.abspath(__file__)))
execute(['scrapy'.'crawl'.'jobbole'])Copy the code

5. Debug breakpoints

The appendix

Xpath knowledge

Your knowledge of Xpath may be useful when Scrapy is used for data crawling, so here’s a quick diagram:

The difference between ‘/’ and ‘//’ is noteworthy.

/ : represents child elements. The selected elements must be parent and child

// : represents all descendant elements. The selected elements are not necessarily parent-child, but only descendant elements

However, if you find that difficult, you can also use Chrome’s element lookup feature to copy xpath paths: