In this article, we’ll look at more specific examples to familiarize you with Scrapy’s two most important classes, Request and Response.

1.1 A crawler that needs to log in

More often than not, you’ll find that the site you want to extract data from has a login mechanism. In most cases, the site will ask you to provide a username and password. You can find the example we’ll use at http://web:9312/dynamic (accessed from the dev machine) or http://localhost:9312/dynamic (accessed from the host browser). If you use “user” as the username and “pass” as the password, you will gain access to a web page with links to three property pages. The question is, how do you do the same thing with Scrapy?

Let’s use Google Chrome’s developer tools to try to understand how the login works (see Figure 1.1). First, open the Network tab (1). Then, fill in the username and password and click Login (2). If the username and password are correct, you will see a page with three links. If they do not match, you will see an error page.

Figure 1.1 Request and response at site login

When you click the Login button, you can see in the Network tab of Chrome’s developer tools that a request with Request Method: POST is sent, with http://localhost:9312/dynamic/login as its destination address.

When you click on that request (3), you can see the data sent to the server, including Form Data (4), which contains the username and password we entered. The data is transmitted to the server as text; Chrome simply organizes it and displays it more clearly. The response from the server is 302 Found (5), which redirects us to a new page: /dynamic/gated. This page appears only after a successful login. If you try to visit http://localhost:9312/dynamic/gated directly without entering the correct username and password, the server detects that you are cheating and redirects you to an error page at http://localhost:9312/dynamic/error. How does the server know you and your password? If you click on the gated request in the developer tools (6), you will see that a Cookie value (8) has been set in the Request Headers area (7).

In summary, even a single operation such as a login can involve multiple server round trips, including POST requests and HTTP redirects. Scrapy handles most of this automatically, and the code we need to write is simple.

We start from the crawler named easy in Chapter 3 and create a new crawler named login, keeping the original file and modifying the name attribute of the crawler (as shown below):

class LoginSpider(CrawlSpider):
    name = 'login'

We need to send the initial login request by performing a POST request to http://localhost:9312/dynamic/login. This is implemented through Scrapy’s FormRequest class. To use this class, you first need to import the following module.

from scrapy.http import FormRequest

Then, replace the start_urls statement with the start_requests() method. We do this because in this case we need to start with some customized requests, not just a few URLs. More specifically, we create and return a FormRequest from this method.

# Start with a login request
def start_requests(self):
    return [
        FormRequest(
            "http://web:9312/dynamic/login",
            formdata={"user": "user", "pass": "pass"})]

As crazy as it sounds, the default parse() method of CrawlSpider (LoginSpider’s base class) does handle the Response and can still use the Rule and LinkExtractor from Chapter 3. We wrote very little extra code because Scrapy handles cookies transparently for us and, once we are logged in, transmits them on subsequent requests, just as a browser does. You can then run scrapy crawl as usual.
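For reference, here is a minimal sketch of how the assembled login spider might look. The Rule/LinkExtractor patterns and the parse_item() stub are assumptions standing in for the Chapter 3 code, not the book’s exact listing.

from scrapy.http import FormRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LoginSpider(CrawlSpider):
    name = 'login'

    # Follow pagination links and hand property pages to parse_item()
    # (the restrict_xpaths values are illustrative)
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
        Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
             callback='parse_item'),
    )

    def start_requests(self):
        # Log in first; Scrapy keeps the session cookie for later requests
        return [FormRequest(
            "http://web:9312/dynamic/login",
            formdata={"user": "user", "pass": "pass"})]

    def parse_item(self, response):
        # The Chapter 3 extraction logic would go here
        yield {'url': response.url}

Running scrapy crawl login then produces output like the following.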

$ scrapy crawl login
INFO: Scrapy 1.0.3 started (bot: properties)
...
DEBUG: Redirecting (302) to <GET .../gated> from <POST .../login>
DEBUG: Crawled (200) <GET .../data.php>
DEBUG: Crawled (200) <GET .../property_000001.html> (referer: .../data.php)
DEBUG: Scraped from <200 .../property_000001.html>
{'address': [u'Plaistow, London'],
 'date': [datetime.datetime(2015, 11, 25, 12, 7, 27, 120119)],
 'description': [u'features'],
 'image_urls': [u'http://web:9312/images/i02.jpg'],
...
INFO: Closing spider (finished)
INFO: Dumping Scrapy stats:
{...
 'downloader/request_method_count/GET': 4,
 'downloader/request_method_count/POST': 1,
...
 'item_scraped_count': 3,

We can see the redirect from dynamic/login to dynamic/gated in the log, and then the Items are scraped as usual. In the statistics, you can see 1 POST request and 4 GET requests (1 to the dynamic/gated index page and 3 to property pages).

If you use the wrong username and password, you will be redirected to a page without any items and the crawl process will be terminated, as shown below.

$ scrapy crawl login
INFO: Scrapy 1.0.3 started (bot: properties)
...
DEBUG: Redirecting (302) to <GET .../dynamic/error> from <POST .../dynamic/login>
DEBUG: Crawled (200) <GET .../dynamic/error>
...
INFO: Spider closed (closespider_itemcount)

This is a simple login example that demonstrates the basic login mechanism. Most sites have more complex mechanisms, but Scrapy handles them easily as well. For example, some sites require you to transfer certain form variables from the form page to the login page when performing the POST request, in order to confirm that cookies are enabled; this also makes it harder to brute-force thousands of username/password combinations. Figure 1.2 shows an example of such a situation.

Figure 1.2 Requests and responses for a more advanced login example that uses a one-time nonce

For example, when you visit http://localhost:9312/dynamic/nonce, you will see a page that looks the same, but if you inspect it with Chrome’s developer tools, you will find that the page’s form contains a hidden field called nonce. When you submit the form (to http://localhost:9312/dynamic/nonce-login), the login will not succeed unless you transmit both the correct username/password and the nonce value the server gave you when you accessed the login page. You cannot guess that value, because it is usually random and single-use. This means that you now need two requests for a successful login: you must first visit the form page and then the login page to transfer the data. Of course, Scrapy has built-in functionality to help us do this.

We create a NonceLoginSpider crawler similar to the previous one. Now, in start_requests(), you return a simple Request (don’t forget to import it) to the form page and handle the response manually by setting its callback attribute to the processing method parse_welcome(). In parse_welcome(), we use the FormRequest helper method from_response() to create a FormRequest object that pre-populates all the fields and values from the original form. FormRequest.from_response() roughly simulates a submit click on the first form of the page, with all the fields left blank.

This method is useful to us because it effortlessly includes all the hidden fields of the form as-is. All we need to do is fill in the user and pass fields via the formdata parameter and return the FormRequest. Here is the code.

# Start on the welcome page
def start_requests(self):
    return [
        Request(
            "http://web:9312/dynamic/nonce",
            callback=self.parse_welcome)]

# Post welcome page's first form with the given user/pass
def parse_welcome(self, response):
    return FormRequest.from_response(
        response,
        formdata={"user": "user", "pass": "pass"})

We can run the crawler as usual.

$ scrapy crawl noncelogin
INFO: Scrapy 1.0.3 started (bot: properties)
...
DEBUG: Crawled (200) <GET .../dynamic/nonce>
DEBUG: Redirecting (302) to <GET .../dynamic/gated> from <POST .../dynamic/nonce-login>
DEBUG: Crawled (200) <GET .../dynamic/gated>
...
INFO: Dumping Scrapy stats:
{...
 'downloader/request_method_count/GET': 5,
 'downloader/request_method_count/POST': 1,
...
 'item_scraped_count': 3,

As you can see, the first GET request goes to the /dynamic/nonce page, then a POST request is made to /dynamic/nonce-login, which redirects to the /dynamic/gated page as in the previous example. That concludes the login discussion. This example completed the login in two steps; with enough patience, you can form an arbitrarily long chain and perform almost any login operation.
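To give a sense of how such a chain might be written, here is a rough sketch of a spider that adds one more step after submitting the form. The extra step, its callback names, and the final link-following logic are hypothetical illustrations; only the nonce URL and the user/pass credentials come from the example site.

# Hypothetical multi-step chain: fetch the form page, post the form,
# then continue crawling from the page we land on.
from scrapy.http import FormRequest, Request
from scrapy.spiders import Spider


class ChainedLoginSpider(Spider):
    name = 'chainedlogin'

    def start_requests(self):
        # Step 1: fetch the page that carries the login form
        return [Request("http://web:9312/dynamic/nonce",
                        callback=self.parse_welcome)]

    def parse_welcome(self, response):
        # Step 2: submit the pre-filled form, including hidden fields
        return FormRequest.from_response(
            response,
            formdata={"user": "user", "pass": "pass"},
            callback=self.parse_gated)

    def parse_gated(self, response):
        # Step 3: logged in; follow links and crawl as usual from here
        for href in response.xpath('//a/@href').extract():
            yield Request(response.urljoin(href))

Each callback returns (or yields) the Request for the next step, so the chain can be extended to as many steps as the site requires.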

1.2 Crawlers that use JSON APIs and AJAX pages

Sometimes you will find that the data you are looking for cannot be found in the page’s HTML. For example, when you visit http://localhost:9312/static/ (see Figure 1.3) and right-click anywhere on the page to Inspect Element (1, 2), you can see a DOM tree containing all the usual HTML elements. However, when you request the page with the scrapy shell, or right-click View Page Source (3, 4) in Chrome, you will find that the HTML code of the page contains no information about the properties. So where does this data come from?

Figure 1.3 Page request and response when loading JSON objects dynamically

As usual, the next step when encountering such a case is to open the Network tab of Chrome’s developer tools and see what happens. In the list on the left, you can see the requests Chrome performed while loading this page. On this simple page, there are only three requests: static/ is the request we just inspected; jquery.min.js fetches the code of a popular JavaScript framework; and api.json looks interesting. When you click on that request (6) and then on the Preview tab on the right (7), you will see that it contains the data we are looking for. In fact, http://localhost:9312/properties/api.json contains the IDs and titles of the properties (8), as shown below.

[{"id": 0, "title": "better set unique family well"},
 ...
 {"id": 29, "title": "better portered mile"}]

This is a very simple example of a JSON API. More complex APIs might require you to log in, use POST requests, or return more interesting data structures. In any case, JSON is one of the easiest formats to parse, because you don’t need to write any XPath expressions to extract data from it.

Python provides a very good JSON parsing library. After we import json, we can parse the JSON with json.loads(response.body), converting it into an equivalent Python object made up of Python primitives, lists, and dictionaries.
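As a minimal illustration (using a hand-written string in place of response.body, with records copied from the api.json sample above):

import json

# Stand-in for response.body; the record mirrors the api.json sample
body = '[{"id": 0, "title": "better set unique family well"}]'
items = json.loads(body)
print(items[0]["id"])       # 0
print(items[0]["title"])    # better set unique family well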

We copy manual.py from Chapter 3 to do this. In this case it is the best starting point, because we need to manually create the property URLs and Request objects from the IDs we find in the JSON object. We rename the file to api.py, rename the crawler class to ApiSpider, and change the name attribute to 'api'. The new start_urls will be the JSON API URL, as shown below.

start_urls = ('http://web:9312/properties/api.json',)

If you want to perform a POST request, or something more complex, you can use the start_requests() method described in the previous section. At this point, Scrapy opens the URL and calls the parse() method with the Response as an argument. After importing json, you can parse the JSON object using the following code.

def parse(self, response):
    base_url = "http://web:9312/properties/"
    js = json.loads(response.body)
    for item in js:
        id = item["id"]
        url = base_url + "property_%06d.html" % id
        yield Request(url, callback=self.parse_item)

The previous code uses json.loads(response.body) to parse the response’s JSON object into a Python list and then iterates over it. For each item in the list, we combine three parts into a URL (base_url, property_%06d, and .html). base_url is the URL prefix defined earlier. %06d is a very useful piece of Python syntax that lets us build new strings from Python variables. In this case, %06d is replaced by the value of the variable id (the variable after % at the end of the line). The id is treated as a number (%d means treat it as a number), and if it has fewer than 6 digits it is padded with leading zeros to 6 characters. For example, if id is 5, %06d is replaced with 000005; if id is 34322, %06d is replaced with 034322. The end result is a valid URL for the property page. We use this URL to form a new Request object and yield it as we did in Chapter 3. You can then run the example with scrapy crawl as usual.
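As a quick sanity check of the %06d formatting described above, you can try it in a Python shell:

>>> base_url = "http://web:9312/properties/"
>>> base_url + "property_%06d.html" % 5
'http://web:9312/properties/property_000005.html'
>>> base_url + "property_%06d.html" % 34322
'http://web:9312/properties/property_034322.html'

Running the crawler then gives output like the following.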

$ scrapy crawl api
INFO: Scrapy 1.0.3 started (bot: properties)
...
DEBUG: Crawled (200) <GET ...properties/api.json>
DEBUG: Crawled (200) <GET .../property_000029.html>
...
INFO: Closing spider (finished)
INFO: Dumping Scrapy stats:
...
 'downloader/request_count': 31,
...
 'item_scraped_count': 30,

You might notice that the stats at the end show 31 requests: one for each Item, plus the original api.json request.

1.2.1 Transferring parameters between responses

In many cases, the JSON API will contain information of interest that you may want to store in your Items. In our example, to demonstrate this, the JSON API prepends “better” to the title of the given property. For example, if the title of the property is “Covent Garden”, the API will say “Better Covent Garden”. Suppose we want to store these “better”-prefixed titles in our Items; how do we pass this information from the parse() method to the parse_item() method?

Not surprisingly, you can do this by setting something on the Requests generated by parse() and then retrieving it from the Responses received by parse_item(). Request has a dict named meta that is directly accessible on the Response. In our example, we can set a title value in this dict to store the title from the JSON object.

title = item["title"]
yield Request(url, meta={"title": title}, callback=self.parse_item)

Inside parse_item(), you can use this value instead of the XPath expression you used previously.

l.add_value('title', response.meta['title'],
            MapCompose(unicode.strip, unicode.title))

You’ll notice that instead of calling add_xpath(), we call add_value(), because we are not using any XPath expression for this field. You can now run the new crawler with scrapy crawl and see the titles from api.json on the scraped PropertiesItems.
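Putting the two sides together, here is a minimal sketch of the whole meta hand-off. It assumes the api.py spider described above and the book’s project layout (a properties/items.py defining PropertiesItem), and, like the book’s code, it is Python 2 (it uses unicode); it is a sketch, not the book’s exact listing.

import json

from scrapy import Spider
from scrapy.http import Request
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose
from properties.items import PropertiesItem


class ApiSpider(Spider):
    name = 'api'
    start_urls = ('http://web:9312/properties/api.json',)

    def parse(self, response):
        base_url = "http://web:9312/properties/"
        for item in json.loads(response.body):
            url = base_url + "property_%06d.html" % item["id"]
            # Carry the "better ..." title over to parse_item() via meta
            yield Request(url, meta={"title": item["title"]},
                          callback=self.parse_item)

    def parse_item(self, response):
        l = ItemLoader(item=PropertiesItem(), response=response)
        # Use the value set in parse() instead of an XPath expression
        l.add_value('title', response.meta['title'],
                    MapCompose(unicode.strip, unicode.title))
        return l.load_item()

Since the title now arrives through meta, parse_item() no longer needs any XPath for that field.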

1.3 A 30-times-faster property crawler

There is a tendency, when you start using a framework, to do everything in the most complicated way possible. You’ll find yourself doing the same thing with Scrapy. Before going crazy over technologies like XPath, it’s worth pausing to think: Is the approach I’ve chosen the easiest way to extract data from a web site?

If you can extract basically the same information from the index page, you can get an order of magnitude improvement by avoiding grabbing every listing page.

For example, in the real estate example, all the information we need exists on the index page, including title, description, price, and picture. This means that by scraping a single index page, we can extract 30 entries plus a link to the next page. By crawling 100 index pages, we need only 100 requests instead of 3,000 to get 3,000 entries. That’s great!

On the real Gumtree site, the descriptions on the index page are slightly shorter than the full descriptions on the listing pages, but scraping them may still be feasible, and even desirable.

In our case, if you look at the HTML code of any index page, you will find that each property on the index page has its own node, marked with itemtype="http://schema.org/Product". Within that node, all the property information is annotated in exactly the same way as on the detail page, as shown in Figure 1.4.

Figure 1.4 Extracting multiple properties from a single index page

We load the first index page in our Scrapy shell and test it using XPath expressions.

$ scrapy shell http://web:9312/properties/index_00000.html

In the Scrapy shell, try to select everything with the Product itemtype:

>>> p=response.xpath('//*[@itemtype="http://schema.org/Product"]')
>>> len(p)
30
>>> p
[<Selector xpath='//*[@itemtype="http://schema.org/Product"]' data=u'<li class="listing-maxi" itemscopeitemt'...]

You can see that we get a list of 30 Selector objects, each pointing to a property. In a sense, a Selector object is similar to a Response object in that we can use XPath expressions on it and get information only from the place it points to. The only caveat is that these expressions should be relative XPath expressions. A relative XPath expression is basically the same as what we saw before, but with a ‘.’ dot prefixed. As an example, let’s see how extracting the title from the fourth property works using the relative XPath expression .//*[@itemprop="name"][1]/text().

>>> selector = p[3]
>>> selector
<Selector xpath='//*[@itemtype="http://schema.org/Product"]' ... >
>>> selector.xpath('.//*[@itemprop="name"][1]/text()').extract()
[u'l fun broadband clean people brompton european']

You can use a for loop over this list of Selector objects to extract the information for all 30 entries on the index page, as in the snippet below.
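For example, still in the same shell session, something like this collects the titles of all 30 properties (the value shown for the fourth entry matches the earlier output):

>>> titles = [s.xpath('.//*[@itemprop="name"][1]/text()').extract()
...           for s in p]
>>> len(titles)
30
>>> titles[3]
[u'l fun broadband clean people brompton european']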

To do this, we start again from manual.py of Chapter 3, rename the crawler to “fast”, and rename the file to fast.py. We reuse most of the code, with only minor changes in the parse() and parse_item() methods. The code of the latest methods is as follows.

def parse(self, response):
    # Get the next index URLs and yield Requests
    next_sel = response.xpath('//*[contains(@class,"next")]//@href')
    for url in next_sel.extract():
        yield Request(urlparse.urljoin(response.url, url))

    # Iterate through products and create PropertiesItems
    selectors = response.xpath(
        '//*[@itemtype="http://schema.org/Product"]')
    for selector in selectors:
        yield self.parse_item(selector, response)

In the first part of the code, the yield of Requests to the next index pages is unchanged. The only change is in the second part, where instead of using yield to create a Request for each detail page, we iterate over the selectors and call parse_item(). The code of parse_item() is also very similar to the original, as shown below.

def parse_item(self, selector, response):
    # Create the loader using the selector
    l = ItemLoader(item=PropertiesItem(), selector=selector)

    # Load fields using XPath expressions
    l.add_xpath('title', './/*[@itemprop="name"][1]/text()',
                MapCompose(unicode.strip, unicode.title))
    l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                MapCompose(lambda i: i.replace(',', ''), float),
                re='[,.0-9]+')
    l.add_xpath('description',
                './/*[@itemprop="description"][1]/text()',
                MapCompose(unicode.strip), Join())
    l.add_xpath('address',
                './/*[@itemtype="http://schema.org/Place"][1]/*/text()',
                MapCompose(unicode.strip))
    make_url = lambda i: urlparse.urljoin(response.url, i)
    l.add_xpath('image_urls', './/*[@itemprop="image"][1]/@src',
                MapCompose(make_url))

    # Housekeeping fields
    l.add_xpath('url', './/*[@itemprop="url"][1]/@href',
                MapCompose(make_url))
    l.add_value('project', self.settings.get('BOT_NAME'))
    l.add_value('spider', self.name)
    l.add_value('server', socket.gethostname())
    l.add_value('date', datetime.datetime.now())

    return l.load_item()

The minor changes we made are shown below.

  • ItemLoader now uses selector as the source instead of Response. This is a handy feature of the ItemLoader API that allows us to extract data from the currently selected portion rather than the entire page.

  • The XPath expressions are switched to relative XPath by prefixing them with a dot (.).

  • We have to compile the URL of the Item ourselves. Previously, response.url gave us the URL of the listing page. Now it gives us the URL of the index page, because that is the page we are crawling. We need to extract the URL of the property using the familiar .//*[@itemprop="url"][1]/@href XPath expression, and then convert it to an absolute URL with the MapCompose processor.

Small changes can save huge amounts of work. Now we can run the crawler using the following code.

$ scrapy crawl fast -s CLOSESPIDER_PAGECOUNT=3
...
INFO: Dumping Scrapy stats:
 'downloader/request_count': 3,
...
 'item_scraped_count': 90,
...

As expected, 90 items were scraped in just three requests. If we had not gotten them from the index pages, we would have needed 93 requests. That’s a smart way to go!

If you want to use scrapy parse for debugging, you now must set the --spider argument, as shown below.

$ scrapy parse --spider=fast http://web:9312/properties/index_00000.html
...
>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items --------------------------------------------
[{'address': [u'Angel, London'],
 ... 30 items ...
# Requests ---------------------------------------------------
[<GET http://web:9312/properties/index_00001.html>]

As expected, parse() returns 30 Items and a Request to the next index page. Feel free to experiment with scrapy parse, for example by passing --depth=2.

This article is excerpted from Master Python crawler Framework Scrapy.

Scrapy is a fast, high-level screen scraping and web crawling framework developed in Python, used for crawling websites and extracting structured data from pages. This book introduces the basics of Scrapy and shows how to use Python and third-party APIs to extract and organize data to suit your needs.

