E-commerce owners and managers may need to crawl their own websites in order to monitor web pages, track site traffic, search for optimization opportunities, and so on.

Discrete web scraping tools and services are available to help with each of these tasks, but you can also create your own site crawlers and site-monitoring systems with relatively little development effort.

The first step in building a custom site crawler and monitor is simply to get a list of all the pages on the site. This article shows you how to generate that list easily using the Python programming language and a neat web crawler framework called Scrapy.

You need a server, Python and Scrapy

This is a development project, so it requires a server with Python and Scrapy installed. You also need access to the server's command line through a terminal application or an SSH client. Information on installing Python is available in the documentation section of Python.org, and Scrapy also has excellent installation documentation. Make sure Python and Scrapy are installed on your server before continuing.
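If Scrapy is not already installed and your server has pip available, it can usually be installed with a single command (see the official installation guide if your platform needs additional dependencies):

pip install scrapy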

Create a Scrapy project

Use an SSH client such as PuTTY on Windows, or a terminal application on a Mac or Linux computer, to navigate to the directory where you want to keep your Scrapy project. Using the built-in Scrapy command startproject, we can quickly generate the basic files we need.

This article will crawl a website called Business Idea Daily, so name the project “bid”.
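Using that name, the full command is:

scrapy startproject bid

Scrapy will create a bid directory containing the project's basic files, including the items.py file we edit below.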

Generate a new Scrapy Web Spider

For convenience, Scrapy also provides another command-line tool that automatically generates new web spiders.

scrapy genspider -t crawl getbid businessideadaily.com

The first term, scrapy, refers to the Scrapy framework. Next, there’s the genspider command, which tells Scrapy we want a new web spider or, if you prefer, a new web crawler.

-t tells Scrapy that we want to use a particular template. The genspider command can generate any of four generic web spider templates: basic, crawl, csvfeed, and xmlfeed. Directly after -t, we specify the desired template; in this case, we use the crawl template, which produces a spider based on the class Scrapy calls CrawlSpider. The word “getbid” is the name of the spider.

The last part of the command tells Scrapy which site we want to crawl. The framework uses it to fill in some of the new spider’s parameters.
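As an aside, if you want to confirm which spider templates your Scrapy installation provides, genspider can list them:

scrapy genspider -l

The output should look something like this:

Available templates:
  basic
  crawl
  csvfeed
  xmlfeed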

Define the Items

In Scrapy, Items are the models we use to organize the data our spider collects as it crawls a particular site. While we could easily accomplish our goal – getting a list of all pages on a particular site – without using Items, skipping them might limit us if we want to expand our crawler later.

To define an Item, simply open the items.py file that Scrapy created when we built the project. Inside it, there will be a class called BidItem. The class name is based on the name we gave the project.

class BidItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

Replace pass with the definition of a new field named url.

url = scrapy.Field()
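After this change, items.py should look roughly like this (the import line comes from the file Scrapy generated):

import scrapy


class BidItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()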

Save the file.

Build the Web spider

Next, open the spiders directory in your project and look for the new spider Scrapy generated. In this case, the spider is called getbid, so the file is getbid.py.

When you open this file in the editor, you should see something like the following.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from bid.items import BidItem


class GetbidSpider(CrawlSpider):
    name = 'getbid'
    allowed_domains = ['businessideadaily.com']
    start_urls = ['http://www.businessideadaily.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = BidItem()
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

We need to make a few small changes to the code Scrapy generated for us. First, we need to modify the LinkExtractor arguments inside the Rule: delete everything in its parentheses.

Rule(LinkExtractor(), callback='parse_item', follow=True),

With this update, our spider will find each link on the start page (the home page), pass each individual link to the parse_item method, and follow links to the site's other pages to make sure we reach every linked page.

Next, we need to update the parse_item method. Delete all comment lines. These lines are just examples for Scrapy.

def parse_item(self, response):
    i = BidItem()
    return i

I like to use variable names that make sense, so I’m going to change i to href, which is the name of the attribute in an HTML link that holds the address of the link's target.

def parse_item(self, response):
    href = BidItem()
    return href

Now the magic happens: we capture the page URL in our Item.

def parse_item(self, response):
    href = BidItem()
    href['url'] = response.url
    return href
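Putting these edits together, the finished getbid.py should look roughly like this (the boilerplate may differ slightly depending on your Scrapy version):

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from bid.items import BidItem


class GetbidSpider(CrawlSpider):
    name = 'getbid'
    allowed_domains = ['businessideadaily.com']
    start_urls = ['http://www.businessideadaily.com/']

    rules = (
        # Follow every link on each page and hand it to parse_item.
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Record the address of the page we just visited.
        href = BidItem()
        href['url'] = response.url
        return href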

That’s it. The new spider is ready to crawl.

Crawl websites, get data

From the command line, navigate to the project directory. Once there, we’ll run a simple command to launch our new spider and get a list of pages.

scrapy crawl getbid -o 012916.csv

This command has several parts. First, we invoke the Scrapy framework. Then we tell Scrapy we want to crawl. Finally, we specify that we want to use the getbid spider.

-o tells Scrapy to output the results to a file. The 012916.csv part of the command tells Scrapy to place those results in a comma-separated values (.csv) file with that name.
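Scrapy's feed exports infer the format from the file extension, so if you would rather have JSON, the same crawl can be exported with a command like this (pages.json is just an example filename):

scrapy crawl getbid -o pages.json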

In our example, Scrapy returns three page addresses. One of the reasons I chose this site for the example is that it has only a few pages. If you point a similar spider at a website with thousands of pages, it will take some time to run, but it will return a similar response.

url

businessideadaily.com/auth/login

businessideadaily.com/

Businessideadaily.com/password/em…

With just a few lines of code, you can lay the foundation for your own site-monitoring application.
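As a starting point, here is a minimal sketch of what that monitoring step could look like. It assumes the 012916.csv file produced above and uses only Python's standard library, requesting each exported URL and printing its HTTP status so that broken pages stand out.

import csv
import urllib.error
import urllib.request

# Load the URLs the Scrapy crawl exported (assumes the CSV from the example above).
with open('012916.csv', newline='') as f:
    urls = [row['url'] for row in csv.DictReader(f)]

for url in urls:
    # The exported values may omit the scheme, so add one if needed.
    if not url.startswith(('http://', 'https://')):
        url = 'http://' + url
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            print(url, response.status)
    except urllib.error.URLError as err:
        print(url, 'ERROR:', err)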

This article was originally written by Data Star