In the last section, we integrated Scrapy with Selenium, which is one way to crawl JavaScript dynamically rendered pages. Besides Selenium, Splash can do the same job. In this section, let's take a look at how Scrapy works with Splash to scrape pages.

1. Preparation

Make sure that Splash is properly installed and running, and that the scrapy-splash library is installed.
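Before moving on, it can help to confirm that the Splash service is reachable. The following is a minimal sketch, assuming Splash listens at the default local address http://localhost:8050; the requests library used here is only for this quick check:

import requests

# Ask Splash to render a page via its render.html endpoint; a 200 status means it is up and rendering.
response = requests.get('http://localhost:8050/render.html',
                        params={'url': 'https://www.baidu.com', 'wait': 1})
print(response.status_code)
print(len(response.text))  # length of the rendered HTML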

2. Creating a New Project

Start by creating a new project called scrapysplashtest as follows:

scrapy startproject scrapysplashtest

Create a new Spider with the following command:

scrapy genspider taobao www.taobao.com
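This generates a Spider skeleton in spiders/taobao.py that looks roughly like the following (the exact template depends on your Scrapy version); we will replace start_urls with a start_requests() method later in this section:

import scrapy


class TaobaoSpider(scrapy.Spider):
    name = 'taobao'
    allowed_domains = ['www.taobao.com']
    start_urls = ['http://www.taobao.com/']

    def parse(self, response):
        pass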

3. Adding the Configuration

You can follow the scrapy-splash configuration instructions step by step; the link is: https://github.com/scrapy-plugins/scrapy-splash#configuration.

Modify settings.py to configure SPLASH_URL. In this case, our Splash runs locally, so we can directly configure the local address:

SPLASH_URL = 'http://localhost:8050'

If Splash is running on a remote server, this should be the remote address. For example, if the IP address of the server is 120.27.34.25, set the following parameters:

SPLASH_URL = 'http://120.27.34.25:8050'

A few more Middleware configurations are required, with code like this:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

Here we configure three Downloader Middleware classes and one Spider Middleware class, which are the core of scrapy-splash. Unlike the Selenium integration, we do not need to implement a Downloader Middleware ourselves; the scrapy-splash library provides everything, and we only need to configure it.

You also need to configure a DUPEFILTER_CLASS with the following code:

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

Finally, configure the cache storage class, HTTPCACHE_STORAGE, as follows:

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

4. Constructing Requests

Once the configuration is done, we can use Splash to scrape pages. We simply construct a SplashRequest object and pass the relevant parameters; Scrapy forwards the request to Splash, Splash loads and renders the page, and the result is passed back. The content of the Response is then the rendered page, which is finally handed to the Spider to parse.

Let’s take a look at an example like this:

yield SplashRequest(url, self.parse_result,
    args={
        # optional; parameters passed to Splash HTTP API
        'wait': 0.5,
        # 'url' is prefilled from request url
        # 'http_method' is set to 'POST' for POST requests
        # 'body' is set to request body for POST requests
    },
    endpoint='render.json',  # optional; default is render.html
    splash_url='<url>',      # optional; overrides SPLASH_URL
)

Here a SplashRequest object is constructed; the first two arguments are still the request URL and the callback function. We can also pass rendering parameters such as the wait time through args, and specify the rendering endpoint through the endpoint parameter. For more parameters, refer to the documentation: https://github.com/scrapy-plugins/scrapy-splash#requests.
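To make the whole flow concrete, here is a minimal, self-contained sketch of a Spider that issues a SplashRequest and reads the rendered result in its callback. The spider name and target URL are placeholders for illustration, not part of the project we build in this section:

from scrapy import Spider
from scrapy_splash import SplashRequest


class DemoSpider(Spider):
    name = 'splash_demo'  # hypothetical spider, for illustration only

    def start_requests(self):
        # 'wait' tells Splash how long to pause so JavaScript can finish rendering
        yield SplashRequest('https://www.taobao.com', callback=self.parse, args={'wait': 2})

    def parse(self, response):
        # the response body is already the rendered HTML, so normal selectors work
        self.logger.info('Page title: %s', response.xpath('//title/text()').get())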

Alternatively, we can generate an ordinary Request object and configure Splash through its meta attribute, with code like this:

yield scrapy.Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,
            # 'url' is prefilled from request url
            # 'http_method' is set to 'POST' for POST requests
            # 'body' is set to request body for POST requests
        },
        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # optional; overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
        'splash_headers': {},           # optional; a dict with headers sent to Splash
        'dont_process_response': True,  # optional, default is False
        'dont_send_headers': True,      # optional, default is False
        'magic_response': False,        # optional, default is True
    }
})

A SplashRequest object configured through args and a Request object configured through meta achieve exactly the same effect.

In this section, we want to scrape Taobao product information, which involves waiting for the page to load and simulating clicks to turn pages. We can first define a Lua script that loads the page and simulates clicking the page-number button, as follows:

function main(splash, args)
  args = {
    url="https://s.taobao.com/search?q=iPad".wait=5,
    page=5
  }
  splash.images_enabled = false
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  js = string.format("document.querySelector('#mainsrp-pager div.form > input').value=%d; document.querySelector('#mainsrp-pager div.form > span.btn.J_Submit').click()", args.page)
  splash:evaljs(js)
  assert(splash:wait(args.wait))
  return splash:png()
end

Here we define three parameters: the request URL, the wait time, and the page number. We then disable image loading, request Taobao's product list page, and invoke JavaScript code via the evaljs() method to fill in the page number and click the page-turn button; finally, the script returns a screenshot of the page. If we put this script into Splash, the page screenshot is obtained normally, as shown below.

The page-turning operation has also been implemented successfully: the current page number shown in the figure below matches the page parameter we passed in.
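Besides running the script interactively in Splash, we can also send it to Splash's /execute HTTP endpoint directly, which is handy for debugging the script before wiring it into Scrapy. The following is a rough sketch, assuming Splash runs at http://localhost:8050; the hard-coded args table is dropped here so that the values can be passed with the request instead:

import requests

lua_script = '''
function main(splash, args)
  splash.images_enabled = false
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  js = string.format("document.querySelector('#mainsrp-pager div.form > input').value=%d; document.querySelector('#mainsrp-pager div.form > span.btn.J_Submit').click()", args.page)
  splash:evaljs(js)
  assert(splash:wait(args.wait))
  return splash:png()
end
'''

# Extra JSON fields (url, wait, page) are exposed to the Lua script as args.
response = requests.post('http://localhost:8050/execute', json={
    'lua_source': lua_script,
    'url': 'https://s.taobao.com/search?q=iPad',
    'wait': 5,
    'page': 2,
})
with open('taobao_page.png', 'wb') as f:
    f.write(response.content)  # splash:png() returns binary PNG data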

Now we just need to use this Lua script with SplashRequest in the Spider, as shown below:

from scrapy import Spider
from urllib.parse import quote
from scrapysplashtest.items import ProductItem
from scrapy_splash import SplashRequest

script = """ function main(splash, args) splash.images_enabled = false assert(splash:go(args.url)) assert(splash:wait(args.wait)) js = string.format("document.querySelector('#mainsrp-pager div.form > input').value=%d; document.querySelector('#mainsrp-pager div.form > span.btn.J_Submit').click()", args.page) splash:evaljs(js) assert(splash:wait(args.wait)) return splash:html() end """

class TaobaoSpider(Spider):
    name = 'taobao'
    allowed_domains = ['www.taobao.com']
    base_url = 'https://s.taobao.com/search?q='

    def start_requests(self):
        for keyword in self.settings.get('KEYWORDS'):
            for page in range(1, self.settings.get('MAX_PAGE') + 1):
                url = self.base_url + quote(keyword)
                yield SplashRequest(url, callback=self.parse, endpoint='execute',
                                    args={'lua_source': script, 'page': page, 'wait': 7})

Here we define the Lua script as a long string, pass the parameters through the args of SplashRequest, and change the endpoint to execute. The lua_source field in args specifies the content of the Lua script. With this, the SplashRequest is successfully constructed and the integration with Splash is complete.

The other configurations do not need to be changed. The Item, Item Pipeline, and other settings are the same as in the previous Selenium section, and the parse() callback function is also exactly the same.
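For readers who skipped the previous section, the sketch below illustrates roughly what that Item and parse() callback look like. The field names and XPath expressions here are illustrative assumptions; the authoritative versions are in the repository linked at the end of this section:

from scrapy import Item, Field


class ProductItem(Item):
    # assumed fields, for illustration only
    title = Field()
    price = Field()
    shop = Field()
    location = Field()


# This method lives inside TaobaoSpider; it receives HTML already rendered by Splash.
def parse(self, response):
    products = response.xpath('//div[contains(@class, "item")]')
    for product in products:
        item = ProductItem()
        item['title'] = ''.join(product.xpath('.//div[contains(@class, "title")]//text()').getall()).strip()
        item['price'] = ''.join(product.xpath('.//div[contains(@class, "price")]//text()').getall()).strip()
        item['shop'] = ''.join(product.xpath('.//div[contains(@class, "shop")]//text()').getall()).strip()
        item['location'] = ''.join(product.xpath('.//div[contains(@class, "location")]//text()').getall()).strip()
        yield item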

5. Running

Next, we run the crawler with the following command:

scrapy crawl taobao

The running result is shown in the figure below.

Since both Scrapy and Splash support asynchronous processing, we can see multiple requests being crawled successfully at the same time. With the Selenium integration, each page is rendered and downloaded inside the Downloader Middleware, so the process is blocking: Scrapy has to wait for it to finish before processing and scheduling other requests, which hurts crawl efficiency. Therefore, the crawl efficiency with Splash is much higher than with Selenium.

Finally, take a look at MongoDB’s results, as shown in the figure below.

The results are also saved to MongoDB normally.

6. Code for This Section

The code for this section is available at: https://github.com/Python3WebSpider/ScrapySplashTest.

7. Conclusion

In Scrapy, it is therefore recommended to use Splash to handle JavaScript dynamically rendered pages. It does not break Scrapy's asynchronous processing and greatly improves crawl efficiency. Moreover, Splash is relatively simple to install and configure, and since the interaction happens through an API, the rendering service stays decoupled from the crawler, which also makes large-scale crawl deployments more convenient.