One. Puppeteer

Puppeteer itself is not covered in detail here; for more information, see the following links

  • Open-source repository
  • English documentation
  • Chinese community
  • Nuggets (Juejin) Puppeteer column

Two. Crawling dynamic web pages

1. The requirement

First, let's look at the requirement: crawl the ZoomCharts documentation and save locally the pages behind all accessible links under the Net Chart directory.

2. Studying the ZoomCharts documentation page structure

Before going any further, we need to understand how the ZoomCharts pages are loaded and what the DOM tree of the left-hand navigation looks like.

  • First page load

    When the page loads for the first time, the first directory on the left, Introduction, is highlighted. The console shows that an active class has been added to its element, and at this point there is only one a element node under the li[data-section="net-chart"] node.

  • Clicking the Net Chart directory

    When the Net Chart directory is clicked, it is highlighted and drops down to show its subdirectories. The console shows that its element node gains the active class and a ul child element node is added; at this point each first-level subdirectory node has only a single a child element node.

  • Conclusion

    The left-hand directory is generated dynamically rather than hard-coded. A child directory is displayed only after its parent directory is clicked, and the drop class on a parent directory's element indicates that it has child directories.

3. Writing the main program

Based on the above analysis, the general process is as follows

  1. Walk the Net Chart directory DOM tree from top to bottom; whenever an a.drop element node is found, simulate a mouse click event on it so that its subdirectory nodes are generated
  2. Find all a links in the Net Chart directory and build an array from them
  3. Traverse the array, visit each subdirectory page, and save the page's HTML file locally

Each step is then implemented in turn.

  • Project initialization

Install puppeteer and rimraf (needed for folder operations)

npm i -S puppeteer rimraf

Create a test.js file and import the dependencies

const puppeteer = require('puppeteer');
const chalk = require('chalk');
const path = require('path');
const https = require('https');
const fs = require('fs');
const rm = require('rimraf');

const settings = {
    headless: false
}

function resolve(dir, dir2 = '') {
    return path.posix.join(__dirname, '/', dir, dir2);
}

async function main () {
    const browser = await puppeteer.launch(settings); // Create a Browser object
    try {
        const page = await browser.newPage(); // Create Page with Browser
        page.setDefaultNavigationTimeout(600000);
        // listen to the page's console output
        page.on('console', msg => {
            for (let i = 0; i < msg.args().length; ++i) {
                console.log(`${i}: ${msg.args()[i]}`);
            }
        });

        // <!-- main start -->
        // main area
        // <!-- main end -->

        console.log('Service ended normally')
    } catch (error) {
        console.log('Service failed: ')
        console.log(error)
    } finally {}
}

main()

All of the remaining code goes in the main area. The full code can be viewed in the GitHub repository.

  • Create a folder to hold the crawled files (a sketch of this step follows the sub-list)

    • Define the file output path
    • Generate the folder from that path
    • If the folder already exists, delete it first and then recreate it
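
A minimal sketch of this folder-preparation step, assuming a hypothetical output folder name netchart and reusing the resolve helper and the fs and rm (rimraf) imports from test.js:

const outputPath = resolve('netchart'); // hypothetical folder name for illustration

// If the folder already exists, delete it first, then recreate it
if (fs.existsSync(outputPath)) {
    rm.sync(outputPath);
}
fs.mkdirSync(outputPath, { recursive: true });
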
  • Click every a.drop element in the Net Chart directory

This step involves DOM manipulation. Actual DOM elements can only be accessed inside page.evaluate(), and a function defined outside page.evaluate() cannot be called from within it.

await page.evaluate(async () => {
    const rootNode = document.querySelector('#menu > ul > li:nth-child(5) > ul > li:nth-child(5)');
    await window.walkDOM(rootNode)
})

The walkDOM function bound to the window object here must be defined via page.evaluateOnNewDocument() in order to take effect:

await page.evaluateOnNewDocument(() => {
    // traverse the DOM
    window.walkDOM = (node) => {
        if (node === null) {
            return
        }
        if (node.tagName === 'A' && node.className.indexOf('drop') > -1) {
            node.click() // simulate the click event
        }
        node = node.firstElementChild
        while (node) {
            walkDOM(node)
            node = node.nextElementSibling
        }
    }
})

Once all the a.drop elements in the Net Chart directory have been clicked, all of the descendant subdirectories in the Net Chart directory are loaded and generated, and the next steps are straightforward.

  • Get all a elements in the Net Chart directory (see the sketch after this sub-list)

    • Use document.querySelectorAll() to find all a elements and convert the result into an array
    • Iterate through the array, turning each item into a {href: '', text: ''} object
    • Return the array of objects
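
A minimal sketch of this collection step, reusing the Net Chart selector from the earlier walkDOM call; the links variable name is an assumption for illustration:

// Collect every link under the Net Chart directory node
const links = await page.evaluate(() => {
    const anchors = document.querySelectorAll('#menu > ul > li:nth-child(5) > ul > li:nth-child(5) a');
    return Array.from(anchors).map(a => ({
        href: a.href,            // absolute URL of the subdirectory page
        text: a.innerText.trim() // link text, later used as a file name
    }));
});
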
  • Iterate through the array of objects, visiting each link and downloading its HTML file (a sketch follows this sub-list)

    • Navigate to each link and download the required HTML into the specified folder
    • If the HTML contains img elements, download all of the images as well
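
A minimal sketch of the download loop, reusing the hypothetical links array and outputPath from the previous sketches; image downloading is left out here:

for (const link of links) {
    // Visit the subdirectory page and wait for network activity to settle
    await page.goto(link.href, { waitUntil: 'networkidle2' });
    // Save the fully rendered HTML into the output folder
    const html = await page.content();
    const fileName = link.text.replace(/[\\/:*?"<>|\s]+/g, '_') + '.html';
    fs.writeFileSync(path.join(outputPath, fileName), html);
    console.log(`Saved ${fileName}`);
}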

4. Summary

Using Puppeteer for the first time was a bumpy ride; it took time, reference reading, and practice.

Code repository

  • Code repository

Reference article

  • Using a Puppeteer crawler on the front end to generate the "React.js Little Book" PDF and merge it