In the previous chapter, we analyzed Ajax requests to fetch the data we needed, but not every page can be crawled that way. For example, Taobao's page data is indeed loaded through Ajax, but the interface parameters are complicated and may include encrypted tokens, so constructing the Ajax requests yourself is difficult. The easiest and most reliable way to crawl such pages is with Selenium. In this section, we will use Selenium to simulate browser operations, scrape product information from Taobao, and save the results to MongoDB.

1. Objectives of this section

In this section, we will use Selenium to scrape Taobao product listings, parse them with PyQuery to extract each product's image, name, price, number of buyers, shop name, and shop location, and save the results to MongoDB.

2. Preparation

In this section, we use Chrome as the example browser for Selenium. Before you start, make sure the Chrome browser is properly installed and ChromeDriver is configured, and that the Python Selenium library is installed. At the end we will also connect to PhantomJS and Firefox, so make sure PhantomJS and Firefox are installed and GeckoDriver is configured. If your environment is not set up, refer to Chapter 1.

3. Interface analysis

First, let's take a look at Taobao's interface and see how it differs from the ordinary Ajax pages we have crawled before.

Open the Taobao page and search for a product, such as iPad. Then open the developer tools and inspect the Ajax requests; we can find the interface that returns the product list, as shown in Figure 7-19.

Figure 7-19 List interfaces

Its link contains several GET parameters, so it seems tempting to construct the Ajax link and request it directly; the response comes back in JSON format, as shown in Figure 7-20.

Figure 7-20 JSON data

However, this Ajax interface contains several parameters whose origin is not obvious. The _ksTS and rn parameters cannot be found directly, and working out how they are generated can be quite cumbersome. If we use Selenium to simulate the browser instead, we don't need to worry about these interface parameters at all: whatever you can see in the browser, you can crawl. That is why we choose Selenium to crawl Taobao.

4. Page analysis

The goal of this section is to crawl product information. Figure 7-21 shows one product entry, which contains the product's basic information: its image, name, price, number of buyers, shop name, and shop location. We need to extract all of this.

Figure 7-21 Items

The crawl entry point is Taobao's search page, whose URL can be constructed directly from its parameters. For example, if you search for iPad, you can open s.taobao.com/search?q=iP… and the first page of search results is displayed, as shown in Figure 7-22.

Figure 7-22 Search results

At the bottom of the page there is pagination navigation, which includes links to the first five pages, a link to the next page, and an input box for jumping to an arbitrary page number, as shown in Figure 7-23.

Figure 7-23 Paging navigation

The product search results here are usually limited to 100 pages at most. To get the content of every page, we only need to iterate over the page numbers from 1 to 100; the range of page numbers is fixed. Therefore, we can directly enter a page number in the "jump to page" text box and click the "OK" button to jump to the corresponding page.

The reason for not simply clicking "Next page" is that if the crawl exits abnormally, for example on page 50, we cannot quickly resume from that page by clicking "Next page". In addition, we would have to record the current page number during crawling, and if a page failed to load after clicking "Next page", we would also need extra error handling to detect which page is currently loaded. The whole process would be relatively complex, so here we crawl by jumping directly to each page number.

Once a page's product list has loaded successfully, we can use Selenium to obtain the page source and then parse it with a suitable parsing library; here we use PyQuery. Below we implement the whole scraping process in code.

5. Get a list of goods

First, we construct the URL to crawl: s.taobao.com/search?q=iP… . The URL is very concise: the parameter q is the keyword to search for, so changing this parameter is enough to get a list of different products. Here we define the product keyword as a variable and then construct the URL from it.
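As a quick illustration (using the same base URL and keyword as the full listing below), the search URL can be built like this:

from urllib.parse import quote

KEYWORD = 'iPad'
# quote() percent-encodes the keyword so that non-ASCII searches also form a valid URL
url = 'https://s.taobao.com/search?q=' + quote(KEYWORD)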

Then, you need to use Selenium to crawl. We implement the following method of fetching the list page:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from urllib.parse import quote

browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)
KEYWORD = 'iPad'

def index_page(page):
    """Fetch a listing page.
    :param page: page number
    """
    print('Crawling page', page)
    try:
        url = 'https://s.taobao.com/search?q=' + quote(KEYWORD)
        browser.get(url)
        if page > 1:
            # Locate the page-number input box and the "OK" (submit) button
            input = wait.until(
                EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-pager div.form > input')))
            submit = wait.until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, '#mainsrp-pager div.form > span.btn.J_Submit')))
            input.clear()
            input.send_keys(page)
            submit.click()
        # Wait until the highlighted page number matches the requested page
        wait.until(
            EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#mainsrp-pager li.item.active > span'), str(page)))
        # Wait until the product list has loaded, then parse it
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
        get_products()
    except TimeoutException:
        index_page(page)

We first construct a WebDriver object using Chrome, specify the keyword (iPad), and then define the index_page() method to fetch a listing page.

In this method, we first visit the search link for the keyword and then check the requested page number. If it is greater than 1, we perform the page-jump operation; otherwise we just wait for the first page to finish loading.

For waiting we use a WebDriverWait object, which lets us specify a wait condition together with a maximum wait time, 10 seconds in this case. If the condition is met within that time, i.e. the target element has loaded, the result is returned immediately and execution continues; otherwise a timeout exception is thrown once the maximum wait time is exceeded.

For example, when finally waiting for the product information to load, we specify the condition presence_of_element_located and pass in the .m-itemlist .items .item selector, which matches the block of information for each product; you can verify this in the page source. Once this load succeeds, the subsequent get_products() method is executed to extract the product information.
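Here is a minimal sketch of this waiting pattern in isolation, assuming the browser object from the listing above: wait at most 10 seconds for the product list to appear, and handle the timeout explicitly.

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

wait = WebDriverWait(browser, 10)
try:
    # Returns the first matching element as soon as it is present in the DOM
    first_item = wait.until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
except TimeoutException:
    print('The product list did not load within 10 seconds')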

For the page-jump operation, we first obtain the page-number input box, assigned to input, and then the "OK" button, assigned to submit; these are the two elements shown in Figure 7-24.

Figure 7-24 Jump options

First, we call the clear() method to empty the input box, then call send_keys() to type the page number into it, and finally click the "OK" button.

So how do we know whether the jump to the target page succeeded? Note that the current page number is highlighted after a successful jump, as shown in Figure 7-25.

Figure 7-25 Page highlights

We just need to check that the highlighted page number equals the page we requested, so we use another wait condition, text_to_be_present_in_element, which succeeds when the specified text appears in the given node. Here we pass the CSS selector of the highlighted page-number node together with the current page number; the condition then checks whether the highlighted node shows the number we passed in, and if it does, the page has successfully jumped to that page.

Then the index_page() method can be called with any page number: it loads the product list for that page and calls get_products() to parse it.

6. Parse the list

Next, we implement the get_products() method to parse the product list. Here we obtain the page source directly and parse it with PyQuery, as follows:

from pyquery import PyQuery as pq

def get_products():
    """Extract product data from the current page."""
    html = browser.page_source
    doc = pq(html)
    items = doc('#mainsrp-itemlist .items .item').items()
    for item in items:
        # Each item is itself a PyQuery object, so find() can be used on it
        product = {
            'image': item.find('.pic .img').attr('data-src'),
            'price': item.find('.price').text(),
            'deal': item.find('.deal-cnt').text(),
            'title': item.find('.title').text(),
            'shop': item.find('.shop').text(),
            'location': item.find('.location').text()
        }
        print(product)
        save_to_mongo(product)

First, we call the page_source property to obtain the source of the current page and construct a PyQuery object from it. Then we select the product list using the CSS selector #mainsrp-itemlist .items .item, which matches every product on the page. Since it matches multiple results, we iterate over them with a for loop and parse each result separately: each iteration assigns one match to the item variable, which is itself a PyQuery object, so we can call its find() method with a CSS selector to retrieve a specific part of a single product.
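To make the PyQuery calls concrete, here is a self-contained sketch run on a made-up snippet whose class names mirror Taobao's markup (the HTML and URLs here are invented purely for illustration):

from pyquery import PyQuery as pq

html = '''
<div class="item">
  <div class="pic"><img class="img" data-src="//example.com/big.jpg" src="//example.com/small.jpg"></div>
  <div class="price">3299</div>
  <div class="title">iPad</div>
</div>
'''
doc = pq(html)
for item in doc('.item').items():                   # .items() yields each match as a PyQuery object
    print(item.find('.pic .img').attr('data-src'))  # attribute value -> image URL
    print(item.find('.price').text())               # text content -> price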

For example, view the source code of the product information, as shown in Figure 7-26.

Figure 7-26 Source code of product information

As you can see, it is an img node with attributes such as id, class, data-src, alt, and src. The image is visible because its src attribute is set to the image URL, so extracting the src attribute would give us an image of the product. But notice the data-src attribute, whose value is also an image URL; on inspection it turns out to be the full-size image, while src points to a compressed thumbnail. So here we grab the data-src attribute as the product image.

Therefore, we locate the image node with the find() method and then call attr() to read its data-src attribute, which gives us the product image link. We extract the price, deal count, title, shop, and shop location in the same way, assign all of the extracted values to a dictionary named product, and finally call save_to_mongo() to save it to MongoDB.

7. Save the configuration to MongoDB

Next, we save the product information to MongoDB with the following code:

import pymongo

MONGO_URL = 'localhost'
MONGO_DB = 'taobao'
MONGO_COLLECTION = 'products'

client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]

def save_to_mongo(result):
    """Save a product to MongoDB.
    :param result: product dict
    """
    try:
        if db[MONGO_COLLECTION].insert(result):
            print('Saved to MongoDB successfully')
    except Exception:
        print('Failed to save to MongoDB')

Here we create a connection object to MongoDB, specify the database and collection names, and insert the data directly by calling insert(). The result variable is the product dictionary passed in from the get_products() method and contains the information for a single product.
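Note that insert() is deprecated in pymongo 3.x and removed in pymongo 4.x; if you are running a recent pymongo, an equivalent sketch using insert_one() (same connection settings as above) is:

import pymongo

client = pymongo.MongoClient('localhost')
db = client['taobao']

def save_to_mongo(result):
    """Save a single product dict to the products collection."""
    try:
        db['products'].insert_one(result)   # insert_one() replaces the deprecated insert()
        print('Saved to MongoDB successfully')
    except Exception:
        print('Failed to save to MongoDB')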

8. Go through each page

The index_page() method we just defined takes the parameter page, which represents the page number. Here we implement the traversal of page numbers, with the following code:

MAX_PAGE = 100

def main():
    """Go through every page."""
    for i in range(1, MAX_PAGE + 1):
        index_page(i)

The implementation is as simple as a for loop: the maximum page number is defined as 100, range(1, MAX_PAGE + 1) produces the page numbers 1 through 100, and we call index_page() for each of them in turn.

With that, our Taobao product crawler is complete; it can be run by calling the main() method.
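One small addition not in the original listing: close the browser once the loop finishes. A sketch:

if __name__ == '__main__':
    try:
        main()
    finally:
        browser.close()   # release the browser window once all pages have been crawled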

9. Run

When you run the code, a Chrome window pops up, the Taobao pages are visited, and the console prints the corresponding extraction results, as shown in Figure 7-27.

Figure 7-27 Running result

As you can see, the extracted product information is in dictionary form and is saved to MongoDB.

Take a look at the results in MongoDB, as shown in Figure 7-28.

Figure 7-28 Saved results

As you can see, all the information is saved to MongoDB, indicating that the crawl was successful.

10. Chrome Headless

As of Chrome 59, Headless mode is supported, which means the browser runs without a visible interface, so no window pops up while crawling. To use this mode, upgrade Chrome to version 59 or later. Headless mode can be enabled as follows:

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
browser = webdriver.Chrome(chrome_options=chrome_options)

To enable Chrome's Headless mode, we first create a ChromeOptions object, add the --headless argument to it, and then pass the ChromeOptions object via the chrome_options parameter when initializing the Chrome WebDriver object.
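If you are on a newer Selenium release, the chrome_options keyword has been deprecated in favour of options=; an equivalent sketch under that assumption is:

options = webdriver.ChromeOptions()
options.add_argument('--headless')          # run Chrome without a visible window
browser = webdriver.Chrome(options=options)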

11. Docking Firefox

To connect to Firefox, it’s as simple as making one change:

browser = webdriver.Firefox()

This changes how the browser object is created, so Firefox will be used for crawling instead.
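Firefox also has a headless mode; a minimal sketch on a recent Selenium, assuming GeckoDriver is configured as described in the preparation section:

options = webdriver.FirefoxOptions()
options.add_argument('--headless')           # run Firefox without a visible window
browser = webdriver.Firefox(options=options)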

12. Docking PhantomJS

If you don't want to use Chrome's Headless mode, you can also crawl with PhantomJS, a browser without a graphical interface, so no window pops up during scraping. To use it, just change the WebDriver declaration (note that recent Selenium releases have deprecated and removed PhantomJS support, so this only applies to older Selenium versions):

browser = webdriver.PhantomJS()

PhantomJS also supports command-line configuration. For example, you can enable disk caching and disable image loading to further improve crawl efficiency:

SERVICE_ARGS = ['--load-images=false', '--disk-cache=true']
browser = webdriver.PhantomJS(service_args=SERVICE_ARGS)

Finally, the code for this section is available at: github.com/Python3WebS… .

In this section, we demonstrated how to scrape Taobao product pages with Selenium. With it, we no longer need to analyze the Ajax requests: whatever can be seen in the browser can be crawled.

