Introduction

In day-to-day crawling, dynamic websites are troublesome because their data is loaded dynamically by JavaScript. In these cases we can use a Selenium downloader middleware to simulate browser operations and obtain the dynamic data.

Getting started

Create a project

1. scrapy startproject Taobao
2. cd Taobao
3. scrapy genspider taobao www.taobao.com
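
After these commands, the generated project should look roughly like this (middlewares.py is the file we will edit later):

Taobao/
    scrapy.cfg
    Taobao/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            taobao.py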

Open the project

In settings.py, set the robots.txt compliance rule to False:

ROBOTSTXT_OBEY = False

Enter the following at the terminal (scrapy view downloads the page and opens it in a browser, showing what Scrapy actually receives):

 scrapy view "http://www.taobao.com"

At this point a browser window opens showing the page as Scrapy downloaded it. Because scrapy view does not execute JavaScript, the dynamically loaded content is missing.

Using middleware

We focus on the process_request function of the downloader middleware (TaobaoDownloaderMiddleware) in middlewares.py, adding one line of code to see its effect:

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        # installed downloader middleware will be called
        print("------ I am middleware, request to pass through me ------")
        return None

Then, in settings.py, uncomment the lines below (543 is the middleware's priority):

DOWNLOADER_MIDDLEWARES = {
   'Taobao.middlewares.TaobaoDownloaderMiddleware': 543,
}

Let’s run it to see what it looks like:

scrapy view "http://www.taobao.com"

The output:

------ I am middleware, request to pass through me ------
2018-08-29 15:38:21 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.taobao.com/> from <GET http://www.taobao.com>
------ I am middleware, request to pass through me ------
2018-08-29 15:38:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.taobao.com/> (referer: None)

The middleware is taking effect. So if we simulate browser operations inside the middleware, can we obtain the dynamic data?

Working in the middleware (middlewares.py)

First, import Selenium at the top of middlewares.py:

from selenium import webdriver
# Options lets us run Chrome without a visible browser window (headless)
from selenium.webdriver.chrome.options import Options

Write the following code in the process_request function

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        print("------ I am middleware, request to pass through me ------")
        # Run Chrome in headless mode (no visible window)
        option = Options()
        option.add_argument("--headless")
        # Create a driver object
        # driver = webdriver.Chrome()  # with a visible browser window
        driver = webdriver.Chrome(options=option)
        # Wait up to 15 seconds for elements to appear (implicit wait)
        driver.implicitly_wait(15)
        driver.get(request.url)
        # Scroll to the bottom of the page to simulate a human operation
        js = 'window.scrollTo(0,document.body.scrollHeight)'
        # Execute the JS
        driver.execute_script(js)
        # Get the rendered page source
        content = driver.page_source
        driver.quit()
        from scrapy.http import HtmlResponse
        # Build a response object to return to the spider's parse function
        resp = HtmlResponse(request.url, request=request, body=content, encoding='utf-8')
        return resp
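
Note that the code above starts a new Chrome instance for every single request, which is slow. A common refinement, sketched below under the assumption that you wire it up via Scrapy's spider_opened/spider_closed signals (this wiring is not part of the original code), is to create the driver once per spider and quit it when the spider closes:

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class TaobaoDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        # Tie the browser's lifetime to the spider's lifetime
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def spider_opened(self, spider):
        option = Options()
        option.add_argument("--headless")
        # One headless Chrome instance shared by all requests
        self.driver = webdriver.Chrome(options=option)
        self.driver.implicitly_wait(15)

    def spider_closed(self, spider):
        self.driver.quit()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Scroll to the bottom to trigger lazy-loaded content
        self.driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
        return HtmlResponse(self.driver.current_url, request=request,
                            body=self.driver.page_source, encoding='utf-8')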

Well, at this point we have obtained the dynamically loaded data. Parsing it is not covered here; the above is the general approach to crawling dynamic websites.
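
For completeness, here is a minimal sketch of what the spider's parse callback could look like once the middleware hands it the rendered HTML. The CSS selector is purely illustrative (an assumption, not taken from Taobao's real markup):

import scrapy


class TaobaoSpider(scrapy.Spider):
    name = 'taobao'
    allowed_domains = ['www.taobao.com']
    start_urls = ['http://www.taobao.com/']

    def parse(self, response):
        # response here is the Selenium-rendered HtmlResponse from the middleware
        # NOTE: 'a::text' is an illustrative selector; adjust it to the real page
        for text in response.css('a::text').getall():
            print(text.strip())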