Selenium's WebDriver works well for crawling dynamic web pages, but it is not very efficient. Recently we have been studying how to improve the efficiency of dynamic page crawlers, mainly through high concurrency and distributed crawling. We gained a lot in the process, but also stepped on plenty of pitfalls, so here is a summary. The following is a rough guide to what we learned during this period.

1. Scrapy + PhantomJS

Scrapy is an efficient asynchronous crawler framework. It is widely used, its documentation is very complete, and developers can quickly build high-performance crawlers with it. The basic usage of Scrapy will not be covered here; there are good reading notes on Scrapy elsewhere. However, Scrapy can only fetch static web content by default, so it must be customized further.

Combining Scrapy with PhantomJS seems like a good choice. PhantomJS is a headless browser that can render dynamic pages and is relatively lightweight. We therefore need to modify Scrapy's page-request module so that PhantomJS requests the page and returns the rendered dynamic content. After some research, there are roughly three ways to customize it:

1. Request each URL twice. In the callback function, discard the returned response and fetch response.url again with PhantomJS. Since no Request object is constructed for this second fetch, there is no callback; the spider simply blocks while waiting for the result. This method makes two requests for the same URL: the first is Scrapy's default HTTP request, and the second, made by PhantomJS, fetches the dynamic page. It is a quick way to implement a small-scale crawler: use the default Scrapy project and simply modify the callback function, as in the sketch below.
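
A rough sketch of method 1, assuming a standard Scrapy project and a PhantomJS binary on the PATH; the spider name and the CSS selector below are illustrative only:

import scrapy
from selenium import webdriver


class DoubleRequestSpider(scrapy.Spider):
    name = 'double_request_demo'
    start_urls = ['http://www.baidu.com']

    def parse(self, response):
        # Discard Scrapy's static response and block while PhantomJS
        # re-requests the same URL and renders the dynamic content
        driver = webdriver.PhantomJS()
        driver.get(response.url)
        rendered = driver.page_source
        driver.quit()
        # Parse the rendered HTML with a plain Selector
        selector = scrapy.Selector(text=rendered)
        yield {'title': selector.css('title::text').extract_first()}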

2. Customize a downloader middleware. A downloader middleware preprocesses Request objects sent from the scheduler before they are actually requested; it is where you would normally add headers, a User-Agent, or cookies. But the middleware can also return an HtmlResponse object directly, skipping the download module entirely and handing the response straight to the callback function. The code is as follows:

    from scrapy.http import HtmlResponse
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    class CustomMetaMiddleware(object):
        def process_request(self, request, spider):
            # Configure PhantomJS: skip images and cap resource load time
            dcap = dict(DesiredCapabilities.PHANTOMJS)
            dcap["phantomjs.page.settings.loadImages"] = False
            dcap["phantomjs.page.settings.resourceTimeout"] = 10
            # Path to the phantomjs executable (placeholder from the original)
            driver = webdriver.PhantomJS(r"E:xx\xx\xx", desired_capabilities=dcap)
            driver.get(request.url)
            body = driver.page_source.encode('utf8')
            url = driver.current_url  # final URL after any redirects
            driver.quit()
            # Returning an HtmlResponse here bypasses Scrapy's downloader
            return HtmlResponse(request.url, body=body, encoding='utf8')

But there is a big problem with this approach: it is no longer asynchronous. Because web pages are requested directly inside the downloader middleware, Scrapy loses its asynchronous behaviour and can only download pages one by one, blocking on each. Still, it is a quick way to get a dynamic crawler running if high concurrency is not required. After adding the code, remember to enable the middleware in the Settings configuration.
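
A possible settings.py entry for enabling the middleware above (the module path myproject.middlewares is just an assumed project layout):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomMetaMiddleware': 543,
}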

3. Customize the downloader. The downloader is the module with which Scrapy actually issues HTTP requests, and it is the module that implements asynchronous requests, so a custom downloader is the ideal solution. However, writing one is tricky: you must follow a number of Twisted conventions. Fortunately, there are several open-source downloaders available online that are easier to adapt, and there are articles that explain downloader development in detail.

Some pitfalls and tips

  1. Running Scrapy from code via the CrawlerProcess class is a useful way to launch a crawler, but passing the Settings to the spider turned out to be a huge pitfall that took a long time to solve. The final solution was to modify the PYTHONPATH and SCRAPY_SETTINGS_MODULE environment variables, adding the crawler project's directory, so that Python can find the configuration file (see the sketch after this list).

  2. Set the DOWNLOAD_TIMEOUT option. The default is 180 seconds, which is relatively long; it can be shortened for efficiency.

  3. PhantomJS's support for multiple processes is extremely erratic. Specifically, if several PhantomJS processes are opened on one host at the same time, the results of each individual PhantomJS instance fluctuate wildly and puzzling errors appear frequently. The official GitHub issues also mention that PhantomJS does not support multiple processes well. ChromeDriver is recommended for multi-process crawlers.

  4. Scrapy's advantage is its efficient asynchronous request framework, but it does not support crawling dynamic pages by itself. If you do not have particularly high efficiency requirements, using the framework is not strictly necessary: becoming familiar with it takes time, and programming inside the framework imposes constraints. For some simple crawlers it is sometimes easier to roll your own.
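
As a rough sketch of tips 1 and 2, the snippet below runs a spider from code while pointing Scrapy at the project settings and shortening DOWNLOAD_TIMEOUT; the path, project module, and spider name are placeholders:

import os
import sys

# Tip 1: make sure Python can find the project and its settings module
sys.path.append('/path/to/myproject')
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'myproject.settings')

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
settings.set('DOWNLOAD_TIMEOUT', 30)   # Tip 2: shorter than the 180 s default

process = CrawlerProcess(settings)
process.crawl('myspider')              # spider name registered in the project
process.start()                        # blocks until the crawl finishes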

2. Scrapy-Splash

Because of PhantomJS's shortcomings under concurrency, Scrapy + PhantomJS is limited in efficiency, so it is not a particularly good option.

After more research, Splash seemed like a good choice. Splash is a JavaScript rendering service with an HTTP API, implemented in Python on top of Twisted and Qt; it is essentially a lightweight browser. Twisted makes the service asynchronous, which lets it take advantage of WebKit's concurrent rendering capabilities.

Using Splash in Scrapy is also very simple; see www.cnblogs.com/zhonghuason… .

In general, you only need to yield a SplashRequest object in your Scrapy spider. For example:

# requires: from scrapy_splash import SplashRequest
yield SplashRequest(url='http://' + url, callback=self.parse, endpoint='render.html',
                    args={'wait': 2}, errback=self.errback_fun, meta={})

It is also possible to return a Request object carrying POST parameters. More simply, you can construct the request yourself with a library such as urllib or requests, because Splash is essentially a port proxy that accepts ordinary HTTP requests.
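
For example, a plain HTTP call to Splash's render.html endpoint with the requests library might look like the following (assuming a Splash instance is listening on its default port 8050):

import requests

# Ask Splash to render the page and wait 2 seconds for JavaScript to run
resp = requests.get('http://localhost:8050/render.html',
                    params={'url': 'http://www.baidu.com', 'wait': 2},
                    timeout=30)
rendered_html = resp.text   # the JavaScript-rendered page source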

Splash occupies relatively little memory, but it still has some problems under heavy concurrency: the request failure rate increases sharply and the rendered pages occasionally come out wrong. It is also limited by the bandwidth and speed of the server host.

Splash also has significant advantages: other distributed nodes can easily fetch dynamic pages through the HTTP API, which minimizes coupling between the server and the other nodes and makes scaling easy. In addition, distributed nodes can get dynamic pages without configuring their own rendering environment, which is much simpler than PhantomJS's complex setup. Splash is a great choice if you simply want a dynamic page crawler, but it is limited by a single server's bandwidth, its speed is limited, and the rendering is sometimes imperfect.

3. ChromeDriver concurrency

Compared with both PhantomJS and Splash, stability aside, neither matches ChromeDriver in rendering quality or speed; after all, the V8 engine is no joke! But ChromeDriver has obvious downsides: it is a huge drain on memory, and it runs with a visible browser window rather than headless!

For a while I needed to crawl Baidu search results. I used the requests library to simulate POST requests; it was efficient, but Baidu blocked me frequently, so I switched to PhantomJS. I figured that even if it was slower, it was a real browser, so Baidu would not block it. But PhantomJS turned out to be unstable, kept throwing errors, and could not run with multiple concurrent processes, so I had to give ChromeDriver a try. I used to be afraid of Chrome the memory killer (open the Chrome browser and the task manager fills up with Chrome processes), but I had no choice. After running it for a while, I found that Chrome's efficiency is actually quite good, its memory usage is not as bad as I imagined, its multi-process support is excellent (running 20 instances at once on my computer was no problem), its stability is good, and Baidu surprisingly did not block it! (Shocking!! Chrome has an anti-anti-crawler halo!) However, because the program mainly runs on an Aliyun host with no display, ChromeDriver, which needs a GUI, was not considered at the time; only recently did I learn that a virtual display can be introduced to let Chrome run on a headless host.

Python's PyVirtualDisplay library provides such a virtual display. The code is also very simple:

from pyvirtualdisplay import Display
from selenium import webdriver

# Start an invisible virtual display so Chrome can run on a host without a GUI
display = Display(visible=0, size=(800, 600))
display.start()
driver = webdriver.Chrome()
driver.get('http://www.baidu.com')

After personal testing, I found Chrome's multi-process support to be very good and its rendering fast. Its memory footprint is relatively large, but multiple processes plus distribution can improve efficiency, and crucially Chrome is not easily blocked.
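
A minimal multi-process sketch along these lines, assuming chromedriver is on the PATH; the URLs and pool size are only examples:

from multiprocessing import Pool
from selenium import webdriver


def fetch_title(url):
    # Each worker process drives its own Chrome instance
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()


if __name__ == '__main__':
    urls = ['http://www.baidu.com/s?wd=python',
            'http://www.baidu.com/s?wd=scrapy']
    pool = Pool(2)   # two Chrome processes in parallel
    print(pool.map(fetch_title, urls))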

PS. Common ChromeDriver code for disabling image loading:

from selenium import webdriver

chromeOptions = webdriver.ChromeOptions()
# 2 = block images, which saves bandwidth and speeds up page loads
prefs = {"profile.managed_default_content_settings.images": 2}
chromeOptions.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(chromedriver_path, chrome_options=chromeOptions)
driver.get(url)

4. Selenium Grid

Selenium Grid is Selenium's distributed extension, which lets users spread test cases across several machines and execute them in parallel. Since it can do distributed testing, a distributed crawler is of course no problem either. The mechanism of Selenium Grid is shown below: first start a central node (Hub), then start multiple remote control nodes (RC) and have each RC register its own information with the Hub, including its operating system, supported WebDrivers, maximum number of concurrent sessions, and so on. The Hub node then knows about every RC and can schedule work accordingly.

(Figure: Selenium Grid mechanism)

After the environment is set up, the test or crawler script sends requests to the Hub's service port. Based on the current state of the registered RC nodes and load-balancing rules, the Hub dispatches each request to a suitable RC node, which executes the command it receives.

from selenium import webdriver

# Point the Remote driver at the Hub; the Hub picks a suitable RC node
url = "http://localhost:4444/wd/hub"
driver = webdriver.Remote(command_executor=url, desired_capabilities={'browserName': 'chrome'})
driver.get("http://www.baidu.com")
print(driver.title)

As shown in the figure below, I set up a Hub node locally on the default port 4444, then registered two RC nodes locally on ports 5555 and 6666. The console of the Hub's service port shows that each node can support 5 instances of Firefox, 1 instance of Internet Explorer, and 5 instances of Chrome (this is customizable). Since Opera is not installed on this machine, there is of course no Opera instance.

(Figure: The Selenium Grid console page)

Selenium Grid is a very good framework for implementing distributed testing or a distributed dynamic crawler; its principle and operation are not complicated, and interested readers can dig deeper.
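
As a rough sketch, distributing fetches across the Grid only requires pointing several Remote drivers at the Hub and letting it do the load balancing; the Hub address below is the local default used above, and the URLs are examples:

from multiprocessing import Pool
from selenium import webdriver

HUB = "http://localhost:4444/wd/hub"


def fetch_via_grid(url):
    # The Hub dispatches each session to whichever RC node has free capacity
    driver = webdriver.Remote(command_executor=HUB,
                              desired_capabilities={'browserName': 'chrome'})
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()


if __name__ == '__main__':
    urls = ['http://www.baidu.com', 'http://www.baidu.com/s?wd=selenium']
    print(Pool(2).map(fetch_via_grid, urls))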

5. Summary

The features of the above software or frameworks are summarized as follows:

  1. PhantomJS is relatively lightweight, but its support for multiple concurrent processes is very poor
  2. ChromeDriver renders fast and supports multiple concurrent processes well, but consumes a lot of memory
  3. Splash exposes an HTTP API, which makes distributed extension easy, but its page rendering is mediocre
  4. Selenium Grid is a professional testing framework that is easy to scale and supports advanced features such as load balancing

Therefore, distributed Scrapy + ChromeDriver or Selenium Grid is the better choice for implementing a distributed dynamic crawler.