Web crawling is usually the first step of data analysis, and Python is a popular language for writing crawlers. Because of that popularity there are also many related tools and libraries, but for beginners, picking the right one can be a hassle. This article walks through some of the more mainstream tools and when each should be used.

Requests / urllib

Both Requests and urllib are tools for speaking HTTP. urllib is the more complete toolkit built into the Python standard library (URL encoding, file downloading, and so on), while Requests is a third-party package that focuses on sending requests and handling responses in a friendlier way.
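
For example, fetching a page with either package looks roughly like this (the URL is only a placeholder):

```python
import urllib.request

import requests

URL = "https://example.com"  # placeholder URL

# urllib (standard library): lower level, you decode the body yourself
with urllib.request.urlopen(URL) as resp:
    html_from_urllib = resp.read().decode("utf-8")

# Requests (third party): friendlier API, handles encoding and errors for you
resp = requests.get(URL, timeout=10)
resp.raise_for_status()
html_from_requests = resp.text
```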

BeautifulSoup / PyQuery

BeautifulSoup and PyQuery turn the HTML string in a response into a DOM-like object that can be traversed and searched. Under the hood they rely on parsers such as lxml or html5lib to read the HTML string. Both packages support CSS selectors for locating data.
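
A minimal sketch of both packages parsing the same snippet (the sample HTML is made up for illustration):

```python
from bs4 import BeautifulSoup
from pyquery import PyQuery

html = "<html><body><h1 class='title'>Hello</h1><a href='/next'>next</a></body></html>"

# BeautifulSoup: choose a parser (lxml or html5lib if installed, else the built-in one)
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("h1.title").text)   # CSS selector -> "Hello"
print(soup.find("a")["href"])             # attribute access -> "/next"

# PyQuery: jQuery-style CSS selectors
doc = PyQuery(html)
print(doc("h1.title").text())             # "Hello"
print(doc("a").attr("href"))              # "/next"
```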

XPath

XPath is a query language for locating nodes in XML documents. Since HTML can be treated as XML, you can also use XPath expressions to pick out the information you need.
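
A small sketch using the XPath support in lxml on a made-up HTML snippet:

```python
from lxml import etree

html = "<html><body><div id='main'><p>first</p><p>second</p></div></body></html>"

# Parse the HTML, then query it with XPath expressions
tree = etree.HTML(html)
print(tree.xpath("//div[@id='main']/p/text()"))  # ['first', 'second']
print(tree.xpath("//p[2]/text()"))               # ['second']
```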

Selenium

Selenium was originally a browser automation tool for web testing. With dynamic web pages and AJAX, content that is loaded by JavaScript cannot be fetched through Requests alone, so a browser automation tool like Selenium can be used to actually execute the JavaScript and read the rendered page.
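
A minimal sketch, assuming Chrome and a matching ChromeDriver are installed (the URL is a placeholder):

```python
from selenium import webdriver

# Launch a real browser, load the page, and read the HTML after JavaScript has run
driver = webdriver.Chrome()
try:
    driver.get("https://example.com")   # placeholder URL
    rendered_html = driver.page_source  # HTML after the scripts executed
    print(driver.title)
finally:
    driver.quit()
```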

PhantomJS

Selenium originally had to drive a real browser such as Chrome or Firefox, which costs resources and performance. PhantomJS is a headless browser that Selenium can drive instead, so pages can be rendered without opening a visible browser window.
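
A minimal sketch of driving PhantomJS through Selenium. This assumes an older Selenium release (3.x or earlier) and a phantomjs binary on the PATH; newer Selenium versions have removed this driver:

```python
from selenium import webdriver

# Headless rendering: no browser window is opened
driver = webdriver.PhantomJS()
try:
    driver.get("https://example.com")   # placeholder URL
    print(driver.page_source[:200])     # rendered HTML, fetched headlessly
finally:
    driver.quit()
```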

Ghost

As mentioned earlier, Selenium solves the problem of dynamic pages that need JavaScript by literally emulating the workings of a browser. Ghost.py takes a different route: it embeds a WebKit engine inside Python so that JavaScript can be executed from Python code, again with the goal of obtaining dynamically generated data.
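
A rough sketch of using Ghost.py; its API has changed across releases, so treat the session-based calls below as an assumption about the 0.2-era interface:

```python
from ghost import Ghost

ghost = Ghost()
with ghost.start() as session:
    # Open the page; the embedded WebKit engine executes its JavaScript
    page, resources = session.open("https://example.com")  # placeholder URL
    # Run JavaScript from Python and read back the result
    result, resources = session.evaluate("document.title")
    print(result)
```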

Scrapy

Scrapy is a full crawling framework aimed at crawlers that need to fetch and follow many web pages at once; it handles scheduling requests, following links, and exporting the scraped items.
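
A minimal spider following the pattern from Scrapy's own tutorial (quotes.toscrape.com is a public practice site); save it to a file and run it with scrapy runspider:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Crawl the first page of quotes and follow the pagination links."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract items from the current page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "next page" link so multiple pages get crawled
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```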

Pyspider

Pyspider is another crawler framework; it provides a Web UI for writing, running, and monitoring crawlers, and it also supports crawling multiple pages.
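
A sketch of a pyspider handler, closely following the default script its Web UI generates (the start URL is a placeholder):

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # Entry point: queue the start URL once a day
        self.crawl("https://example.com/", callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Queue every outgoing link found on the index page
        for each in response.doc("a[href^='http']").items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # Return the scraped fields; results are visible in the Web UI
        return {"url": response.url, "title": response.doc("title").text()}
```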

Summary

These crawler tools can be roughly divided into several types:

  1. Fetching static pages: Requests/urllib
  2. Parsing the fetched HTML: BeautifulSoup/PyQuery/XPath
  3. Getting data from dynamic (JavaScript-driven) sites: Selenium/PhantomJS/Ghost
  4. Multi-page crawling frameworks: Scrapy/Pyspider

License

This work by Chang Wei-Yaun (V123582) is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.