This is the 20th day of my participation in the August Text Challenge. More challenges in August.

Reproduction without permission is prohibited!

How can a crawler handle stream-loading pages? Bookmark these 100 lines of code!

A while back, Lei Xuewei built a tool to archive a website as screenshots, only to find that the target page is a stream-loading page: a plain screenshot captures just part of it. What to do?

On a so-called stream-loading page, the height of the page keeps growing: no single load ever reveals the true height of the whole page! Are there stream-loading pages around you? CSDN's hot list is one.

This kind of stream-loading window behaves like a tap being turned on and off: content appears in bursts rather than loading all at once.

Anyone who follows the hot list knows that when you first open it, you only see the Top 5 or so; you have to scroll the browser down for more content to load dynamically. Keep scrolling, and bit by bit you pull in the entire list.
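To make this behavior concrete, here is a toy simulation in pure Python (no browser involved; the `StreamPage` class and its numbers are invented for this sketch). Scrolling near the bottom triggers another batch of content, so a single height measurement misses most of the page:

```python
class StreamPage:
    """Toy model of a stream-loading page: scrolling near the bottom loads more."""
    def __init__(self, batch=1000, batches_left=3):
        self.height = batch          # height of the initially loaded content
        self.batch = batch
        self.batches_left = batches_left

    def scroll_to(self, y):
        # Scrolling close to the current bottom triggers another content load
        if y >= self.height - 200 and self.batches_left > 0:
            self.height += self.batch
            self.batches_left -= 1

page = StreamPage()
print(page.height)   # 1000 -- a one-shot measurement sees only the first batch
page.scroll_to(900)
page.scroll_to(1900)
page.scroll_to(2900)
print(page.height)   # 4000 -- the true height only appears after scrolling
```

This is exactly why the crawler below has to scroll and re-measure in a loop instead of reading the height once.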

The implementation in this article is summarized as follows:

First question: how do I get the height of the streaming window?

Setting up the crawler (the driver) was covered in a previous article; please read that on your own.

Back to the point: how do we get the height of a dynamically stream-loaded content window?

Core code of Lei Xuewei's solution for stream-loading pages:

```python
from time import sleep

def resolve_height(driver, pageh_factor=5):
    """Scroll down until the page height stops growing, or until the page
    has grown past pageh_factor times its initial height (endless feeds)."""
    js = "return document.body.scrollHeight"
    height = 0
    page_height = driver.execute_script(js)
    ref_pageh = int(page_height * pageh_factor)  # growth limit for endless pages
    step = 150       # scroll increment in px
    max_count = 15   # budget of scroll passes
    count = 0
    while count < max_count and height < page_height:
        count += 1
        # scroll down to the current page bottom in small steps
        for i in range(height, page_height, step):
            slowjs = 'window.scrollTo(0, {})'.format(i)
            print('exec js: %s' % slowjs)
            driver.execute_script(slowjs)
            sleep(0.3)
        if page_height >= ref_pageh:
            # the page keeps growing past the limit: give up on reading it fully
            print('not fully read')
            break
        height = page_height
        sleep(2)  # give freshly triggered content time to load
        page_height = driver.execute_script(js)
    print("finish scroll")
    return page_height
```

Not much code.

core idea

  1. Keep scrolling the window down
  2. Until the page stops loading new content, or the page grows past the limit value
  3. Then stop (some stream pages have no bottom: you can scroll forever and always see something new)

Check out the renderings:

Core code of the screenshot step (reusing the `resolve_height` function above):

```python
# get the actual page height
page_height = resolve_height(driver)
print("[Demo] page height: %s" % page_height)
sleep(5)

# scroll back to the top before capturing
driver.execute_script('document.documentElement.scrollTop=0')
sleep(1)
driver.save_screenshot(img_path)
page_height = driver.execute_script('return document.documentElement.scrollHeight')  # final page height
print("get accurate height: %s" % page_height)

if page_height > window_height:
    n = page_height // window_height  # floor division
    for i in range(n):
        # scroll one viewport down, then capture that screen
        driver.execute_script(f'document.documentElement.scrollTop={window_height*(i+1)};')
        sleep(1)
        driver.save_screenshot(f'./leixuewei_rank_{i}.png')
```

There’s still not much code.

core idea

  1. Keep scrolling the window, one viewport at a time
  2. Save each screen as an image (ordered top to bottom)
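The arithmetic behind the per-screen capture loop is simple floor division; the values below are made up for illustration:

```python
page_height = 3500    # hypothetical full page height in px
window_height = 1000  # hypothetical viewport height in px

# One screenshot is taken at the top, then one per additional viewport.
n = page_height // window_height                   # floor division
offsets = [window_height * (i + 1) for i in range(n)]
print(n, offsets)   # 3 [1000, 2000, 3000]
```

Note that the last viewport overlaps the one before it when `window_height` does not divide `page_height` evenly; the browser simply stops scrolling at the bottom.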

Here’s a screenshot from the middle:

So how do we merge these into a single image?

In the project directory you can see that several images have been generated. Surely we're not going to stitch them together by hand in Photoshop?

The first thing to understand is: what is an image?

An image is essentially a 2D matrix of pixels.

Every image we see is really a grid of pixels arranged in rows and columns, rendered to the screen as a picture.
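For instance, a tiny made-up "image" (not a real screenshot) looks like this as an array:

```python
import numpy as np

# A 2x3-pixel RGB "image": rows x columns x color channels
img = np.array([
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],       # red, green, blue pixels
    [[0, 0, 0], [128, 128, 128], [255, 255, 255]], # black, gray, white pixels
], dtype=np.uint8)

print(img.shape)   # (2, 3, 3): 2 rows, 3 columns, 3 channels
```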

With that, the merging idea is easy: use the NumPy library directly.

Let’s take the above code to transform:

Core code of the merge step, adapted from the capture loop above:

```python
import numpy as np
from PIL import Image

if page_height > window_height:
    n = page_height // window_height  # floor division
    base_matrix = np.atleast_2d(Image.open(img_path))  # top screen as a pixel matrix
    for i in range(n):
        driver.execute_script(f'document.documentElement.scrollTop={window_height*(i+1)};')
        sleep(1)
        driver.save_screenshot(f'./leixuewei_rank_{i}.png')
        delta_matrix = np.atleast_2d(Image.open(f'./leixuewei_rank_{i}.png'))
        # concatenate the new screen below the accumulated image
        base_matrix = np.append(base_matrix, delta_matrix, axis=0)
    Image.fromarray(base_matrix).save('./leixuewei_rank_full.png')
```

Awesome. Just a little bit of code. It’s all about ideas.

Code parsing

This is essentially a loop that converts each screenshot into a pixel matrix.

The matrices are then appended along the vertical axis: the width stays the same while the content grows downward, forming one complete picture.
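The stacking step can be checked with dummy arrays (shapes chosen arbitrarily; real screenshots would be far larger):

```python
import numpy as np

top = np.zeros((2, 4, 3), dtype=np.uint8)     # 2 rows of a 4-px-wide RGB image
bottom = np.ones((3, 4, 3), dtype=np.uint8)   # 3 more rows, same width

# axis=0 appends rows: the width (axis 1) must match, the height grows
full = np.append(top, bottom, axis=0)
print(full.shape)   # (5, 4, 3)
```

If the widths differed, `np.append` would raise an error, which is why every screenshot must come from the same browser window size.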

Here’s a full screen shot of the list.

conclusion

The whole approach flows naturally in under 100 lines of code, but the code is not the hard part; the idea is. It mainly uses the following libraries:

selenium
numpy
Pillow

Finally, be careful when using crawlers: don't treat scraping organizations' websites as child's play. And while studying, don't hammer a production network with requests; behavior like that will land you in jail sooner or later!

This article is for demonstration purposes only; if the demo website has any objections, please let us know.

By the way, you can also follow these for long-term reading => Lei Xuewei's fun programming stories compilation

Or => Lei Xuewei's NodeJS series

Keep learning, keep developing. I am Lei Xuewei! Programming is fun; the key is getting the technique right. Creation is not easy, so please support me: like, bookmark, and follow!