Goal: crawl all new-home and second-hand housing listings from the Fangtianxia (fang.com) website.

  • Deploying the Scrapy framework on Windows 10
  1. Install Visual C++ Build Tools

Among Scrapy's dependency libraries, pywin32 and Twisted are built on C extensions, so a C compilation environment must be installed first. For Python 3.6 you can provide this environment by installing Visual C++ Build Tools. Download address: https://visualstudio.microsoft.com/thank-you-downloading-visual-studio/?sku=BuildTools&rel=152


2. Install lxml and pywin32

On Windows, two of Scrapy's third-party dependencies cannot be installed in the usual way: lxml and pywin32. Installing either of them with a plain pip command is not recommended; both can be installed from downloaded installation packages (.exe or .whl files).
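One possible install sequence, assuming the wheel files have already been downloaded for your Python version (the exact filenames below are illustrative, not prescribed by the original text):

```shell
# Install the downloaded packages into the current Python environment.
# Filenames depend on your Python version and architecture; these are examples only.
pip install lxml-4.2.1-cp36-cp36m-win_amd64.whl
pip install pywin32-223-cp36-cp36m-win_amd64.whl
```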

3. Create Virtualenv

Twisted, Scrapy, and their many dependency libraries are used only for Scrapy crawlers and are rarely needed in everyday development. Installing them into the system Python clutters the environment and makes it awkward to export just the required dependencies when publishing a crawler.

So we use Virtualenv to create a virtual Python environment to install the rest of Scrapy.

Virtualenv is a third-party Python package for creating isolated virtual environments. It can be installed in the normal way:

        pip install virtualenv

To let the virtual environment reuse the third-party libraries already installed in the system Python, create it from CMD with:

         virtualenv --always-copy --system-site-packages venv

After creating a virtual environment, you can use the following commands to activate the virtual environment:

         venv\Scripts\activate

Do not close this window; all subsequent steps take place in it.

4. Install Twisted and Scrapy

In the same window, install each of them with pip:

         pip install twisted

         pip install scrapy

  • Example: scraping housing information from the Fangtianxia website

1. Get all city URLs from http://www.fang.com/SoufunFamily.htm (a city URL looks like http://cq.fang.com/)
2. New-home URL, e.g. http://newhouse.sh.fang.com/house/s/
3. Second-hand housing URL, e.g. http://esf.sh.fang.com/
4. Beijing uses different URL rules for new and second-hand homes: http://newhouse.fang.com/house/s/ and http://esf.fang.com/
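The URL rules above can be sketched in plain Python, without Scrapy. The function name and the sample city URL are illustrative:

```python
def build_city_urls(city_url):
    """Derive the new-home and second-hand (esf) URLs from a city URL."""
    scheme, domain = city_url.split("//")   # e.g. "http:", "cq.fang.com/"
    if 'bj' in domain:
        # Beijing uses fixed URLs
        return ('http://newhouse.fang.com/house/s/', 'http://esf.fang.com/')
    # Other cities get a subdomain prefix inserted before the domain
    newhouse_url = scheme + "//" + "newhouse." + domain + "house/s/"
    esf_url = scheme + "//" + "esf." + domain + "house/s/"
    return (newhouse_url, esf_url)

print(build_city_urls("http://cq.fang.com/"))
# → ('http://newhouse.cq.fang.com/house/s/', 'http://esf.cq.fang.com/house/s/')
```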

Create a project

Enter the following commands in the CMD window opened above:

scrapy startproject fang

scrapy genspider sfw_spider "fang.com"

 sfw_spider.py        

# -*- coding: utf-8 -*-
import scrapy
import re
from fang.items import NewHouseItem,ESFHouseItem

class SfwSpiderSpider(scrapy.Spider):
    name = 'sfw_spider'
    allowed_domains = ['fang.com']
    start_urls = ['http://www.fang.com/SoufunFamily.htm']

    def parse(self, response):
        trs = response.xpath("//div[@class='outCont']//tr")
        provice = None
        for tr in trs:
            # select the td cells without a class attribute (province and city)
            tds = tr.xpath(".//td[not(@class)]")
            provice_td = tds[0]
            provice_text = provice_td.xpath(".//text()").get()
            # If the province cell is blank, reuse the province from the previous row
            provice_text = re.sub(r"\s", "", provice_text) if provice_text else ""
            if provice_text:
                provice = provice_text
            # Skip overseas cities, listed under "其它" (other)
            if provice == '其它':
                continue

            city_td = tds[1]
            city_links = city_td.xpath(".//a")
            for city_link in city_links:
                city = city_link.xpath(".//text()").get()
                city_url = city_link.xpath(".//@href").get()
                # print("province:", provice)
                # print("city:", city)
                # print("city_url:", city_url)
                # Build the new-home and second-hand URLs from city_url
                # city url:        http://cq.fang.com/
                # new-home url:    http://newhouse.cq.fang.com/house/s/
                # second-hand url: http://esf.cq.fang.com/
                url_module = city_url.split("//")
                scheme = url_module[0]     # http:
                domain = url_module[1]     # cq.fang.com/
                if 'bj' in domain:
                    newhouse_url = 'http://newhouse.fang.com/house/s/'
                    esf_url = 'http://esf.fang.com/'
                else:
                    # new-home url
                    newhouse_url = scheme + "//" + "newhouse." + domain + "house/s/"
                    # second-hand url
                    esf_url = scheme + "//" + "esf." + domain + "house/s/"
                # print("newhouse_url:", newhouse_url)
                # print("esf_url:", esf_url)

                # meta carries extra data with the Request; it can be read back from response.meta in the callback
                yield scrapy.Request(url=newhouse_url,
                                     callback=self.parse_newhouse,
                                     meta = {'info':(provice,city)}
                                     )

                yield scrapy.Request(url=esf_url,
                                     callback=self.parse_esf,
                                     meta={'info': (provice, city)})


    def parse_newhouse(self,response):
        # new houses
        provice,city = response.meta.get('info')
        lis = response.xpath("//div[contains(@class,'nl_con')]/ul/li")
        for li in lis:
            name = li.xpath(".//div[contains(@class,'house_value')]//div[@class='nlcd_name']/a/text()").get()
            if name:
                name = re.sub(r"\s", "", name)
                # rooms
                house_type_list = li.xpath(".//div[contains(@class,'house_type')]/a/text()").getall()
                house_type_list = list(map(lambda x: re.sub(r"\s", "", x), house_type_list))
                rooms = list(filter(lambda x: x.endswith("居"), house_type_list))
                # area
                area = "".join(li.xpath(".//div[contains(@class,'house_type')]/text()").getall())
                area = re.sub(r"\s|－|/", "", area)
                # address
                address = li.xpath(".//div[@class='address']/a/@title").get()
                address = re.sub(r"[请选择]", "", address) if address else None
                sale = li.xpath(".//div[contains(@class,'fangyuan')]/span/text()").get()
                price = "".join(li.xpath(".//div[@class='nhouse_price']//text()").getall())
                price = re.sub(r"\s|广告", "", price)
                # detail-page URL
                origin_url = li.xpath(".//div[@class='nlcd_name']/a/@href").get()

                item = NewHouseItem(
                    name=name,
                    rooms=rooms,
                    area=area,
                    address=address,
                    sale=sale,
                    price=price,
                    origin_url=origin_url,
                    provice=provice,
                    city=city
                )
                yield item

        # next page
        next_url = response.xpath("//div[@class='page']//a[@class='next']/@href").get()
        if next_url:
            yield scrapy.Request(url=response.urljoin(next_url),
                                 callback=self.parse_newhouse,
                                 meta={'info': (provice, city)}
                                 )


    def parse_esf(self,response):
        # second-hand houses
        provice, city = response.meta.get('info')
        dls = response.xpath("//div[@class='shop_list shop_list_4']/dl")
        for dl in dls:
            item = ESFHouseItem(provice=provice,city=city)
            name = dl.xpath(".//span[@class='tit_shop']/text()").get()
            if name:
                infos = dl.xpath(".//p[@class='tel_shop']/text()").getall()
                infos = list(map(lambda x: re.sub(r"\s", "", x), infos))
                for info in infos:
                    if '厅' in info:
                        item['rooms'] = info
                    elif '层' in info:
                        item['floor'] = info
                    elif '向' in info:
                        item['toward'] = info
                    elif '㎡' in info:
                        item['area'] = info
                    elif '年建' in info:
                        item['year'] = re.sub("年建", "", info)
                item['address'] = dl.xpath(".//p[@class='add_shop']/span/text()").get()
                # total price
                item['price'] = "".join(dl.xpath(".//span[@class='red']//text()").getall())
                # unit price
                item['unit'] = dl.xpath(".//dd[@class='price_right']/span[2]/text()").get()
                item['name'] = name
                detail = dl.xpath(".//h4[@class='clearfix']/a/@href").get()
                item['origin_url'] = response.urljoin(detail)
                yield item
        # next page
        next_url = response.xpath("//div[@class='page_al']/p/a/@href").get()
        if next_url:
            yield scrapy.Request(url=response.urljoin(next_url),
                                 callback=self.parse_esf,
                                 meta={'info': (provice, city)}
                                 )
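The keyword dispatch in `parse_esf` routes each info fragment to a field by the Chinese keyword it contains. A standalone sketch of that logic, with illustrative sample strings (not taken from the site):

```python
import re

def classify_info(infos):
    """Map second-hand listing fragments to item fields by keyword."""
    item = {}
    for info in infos:
        info = re.sub(r"\s", "", info)
        if '厅' in info:        # e.g. "3室2厅"    -> rooms
            item['rooms'] = info
        elif '层' in info:      # e.g. "低层(共8层)" -> floor
            item['floor'] = info
        elif '向' in info:      # e.g. "南北向"     -> orientation
            item['toward'] = info
        elif '㎡' in info:      # e.g. "89㎡"      -> area
            item['area'] = info
        elif '年建' in info:    # e.g. "2010年建"   -> year built
            item['year'] = re.sub("年建", "", info)
    return item

print(classify_info(["3室2厅", "低层(共8层)", "南北向", "89㎡", "2010年建"]))
```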

items.py

# -*- coding: utf-8 -*-

import scrapy

class NewHouseItem(scrapy.Item):
    # provinces
    provice = scrapy.Field()
    # city
    city = scrapy.Field()
    # residential complex name
    name = scrapy.Field()
    # price
    price = scrapy.Field()
    # rooms (a list)
    rooms = scrapy.Field()
    # area
    area = scrapy.Field()
    # address
    address = scrapy.Field()
    # sale status
    sale = scrapy.Field()
    # URL of the detail page on fang.com
    origin_url = scrapy.Field()

class ESFHouseItem(scrapy.Item):
    # provinces
    provice = scrapy.Field()
    # city
    city = scrapy.Field()
    # Community name
    name = scrapy.Field()
    # How many rooms
    rooms = scrapy.Field()
    # floor
    floor = scrapy.Field()
    # orientation
    toward = scrapy.Field()
    # year built
    year = scrapy.Field()
    # address
    address = scrapy.Field()
    # Floor area
    area = scrapy.Field()
    # the total price
    price = scrapy.Field()
    # unit price
    unit = scrapy.Field()
    # Details page URL
    origin_url = scrapy.Field()

pipelines.py

# -*- coding: utf-8 -*-

from scrapy.exporters import JsonLinesItemExporter
from fang.items import NewHouseItem

class FangPipeline(object):
    def __init__(self):
        self.newhouse_fp = open('newhouse.json', 'wb')
        self.esfhouse_fp = open('esfhouse.json', 'wb')
        self.newhouse_exporter = JsonLinesItemExporter(self.newhouse_fp, ensure_ascii=False)
        self.esfhouse_exporter = JsonLinesItemExporter(self.esfhouse_fp, ensure_ascii=False)

    def process_item(self, item, spider):
        # Route each item to the matching exporter instead of writing it to both files
        if isinstance(item, NewHouseItem):
            self.newhouse_exporter.export_item(item)
        else:
            self.esfhouse_exporter.export_item(item)
        return item

    def close_spider(self,spider):
        self.newhouse_fp.close()
        self.esfhouse_fp.close()
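`JsonLinesItemExporter` writes one JSON object per line. A stdlib-only sketch of the same output format (the file name and sample item are illustrative):

```python
import json

def export_jsonlines(items, path):
    """Write items in JSON-lines format, one object per line."""
    with open(path, 'w', encoding='utf-8') as fp:
        for item in items:
            # ensure_ascii=False keeps Chinese text readable, as in the pipeline
            fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")

export_jsonlines([{'name': '某小区', 'price': '500万'}], 'newhouse_demo.json')
```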

middlewares.py sets a random User-Agent (no proxy IP pool is used here):

# -*- coding: utf-8 -*-

import random

class UserAgentDownloadMiddleware(object):
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    ]

    def process_request(self,request,spider):
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = user_agent
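The middleware logic can be exercised without Scrapy. `FakeRequest` below is a minimal stand-in for `scrapy.Request`, used only for demonstration; the User-Agent list is shortened:

```python
import random

class FakeRequest:
    """Minimal stand-in for scrapy.Request, just enough to hold headers."""
    def __init__(self):
        self.headers = {}

class UserAgentDownloadMiddleware:
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    ]

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)

req = FakeRequest()
UserAgentDownloadMiddleware().process_request(req, spider=None)
print(req.headers['User-Agent'])
```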

settings.py

ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 1

DOWNLOADER_MIDDLEWARES = {
   'fang.middlewares.UserAgentDownloadMiddleware': 543,
}

ITEM_PIPELINES = {
   'fang.pipelines.FangPipeline': 300,
}

start.py

from scrapy import cmdline

cmdline.execute("scrapy crawl sfw_spider".split())