One. Puppeteer

Puppeteer itself is not covered in detail here; for more information, see the following links

  • Open-source repository
  • English documentation
  • Chinese community
  • Nuggets (Juejin) Puppeteer column

Two. Crawling dynamic web pages

1. The requirement

First, let's look at the requirement: crawl the ZoomCharts documentation and save locally the pages behind all accessible links under the Net Chart directory.

2. Studying the ZoomCharts documentation page structure

Before going any further, we need to understand how the ZoomCharts pages are loaded and what the DOM tree of the left-hand navigation looks like.

  • First page load

    When the page loads for the first time, the first directory on the left, Introduction, is highlighted. The console shows that an active class has been added to its element, and at this point there is only one a element node under the li[data-section="net-chart"] node.

  • Clicking the Net Chart directory

    When the Net Chart directory is clicked, it is highlighted and drops down to show its subdirectories. The console shows that its element node gains the active class and a ul child element node is added; at this point each first-level subdirectory node has only a single a child element node.

  • Conclusion

    The left-hand directory is generated dynamically rather than hard-coded. A child directory is displayed only after its parent directory is clicked, and the drop class on a parent directory's element indicates that it has child directories.

3. Writing the main program

Based on the above analysis, the general process is as follows

  1. Walk the Net Chart directory DOM tree from top to bottom; whenever an a.drop element node is found, simulate a mouse click event on it so that its subdirectory nodes are generated
  2. Find all a links in the Net Chart directory and build an array from them
  3. Traverse the array, visit each subdirectory page, and save the page's HTML file locally

Each step is then implemented in turn.

  • Project initialization

Install puppeteer and rimraf (needed for folder operations)

npm i -S puppeteer rimraf

Create a test.js file and import the dependencies

const puppeteer = require('puppeteer');
const chalk = require('chalk');
const path = require('path');
const https = require('https');
const fs = require('fs');
const rm = require('rimraf');

const settings = {
    headless: false
}

function resolve(dir, dir2 = '') {
    return path.posix.join(__dirname, '/', dir, dir2);
}

async function main () {
    const browser = await puppeteer.launch(settings); // Create a Browser object
    try {
        const page = await browser.newPage(); // Create Page with Browser
        page.setDefaultNavigationTimeout(600000);
        // listen to the page's console output
        page.on('console', msg => {
            for (let i = 0; i < msg.args().length; ++i) {
                console.log(`${i}: ${msg.args()[i]}`);
            }
        });

        // <!-- main start -->
        // main area
        // <!-- main end -->

        console.log('Service ended normally')
    } catch (error) {
        console.log('Service failed: ')
        console.log(error)
    } finally {}
}

main()

All of the remaining code goes in the main area. The full code can be viewed in the GitHub repository.

  • Create a folder to hold the crawled files (a sketch of this step follows the sub-list)

    • Define the file output path
    • Generate the folder from that path
    • If the folder already exists, delete it first and then recreate it
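
A minimal sketch of this folder-preparation step, assuming a hypothetical output folder name netchart and reusing the resolve helper and the fs and rm (rimraf) imports from test.js:

const outputPath = resolve('netchart'); // hypothetical folder name for illustration

// If the folder already exists, delete it first, then recreate it
if (fs.existsSync(outputPath)) {
    rm.sync(outputPath);
}
fs.mkdirSync(outputPath, { recursive: true });
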
  • Click every a.drop element in the Net Chart directory

This step involves DOM manipulation. Actual DOM elements can only be accessed inside page.evaluate(), and a function defined outside page.evaluate() cannot be called from within it.

await page.evaluate(async () => {
    const rootNode = document.querySelector('#menu > ul > li:nth-child(5) > ul > li:nth-child(5)');
    await window.walkDOM(rootNode)
})

The walkDOM function bound to the window object here must be defined via page.evaluateOnNewDocument() in order to take effect:

await page.evaluateOnNewDocument(() => {
    // traverse the DOM
    window.walkDOM = (node) => {
        if (node === null) {
            return
        }
        if (node.tagName === 'A' && node.className.indexOf('drop') > -1) {
            node.click() // simulate the click event
        }
        node = node.firstElementChild
        while (node) {
            walkDOM(node)
            node = node.nextElementSibling
        }
    }
})

Once all the a.drop elements in the Net Chart directory have been clicked, all of the descendant subdirectories in the Net Chart directory are loaded and generated, and the next steps are straightforward.

  • Get all a elements in the Net Chart directory (see the sketch after this sub-list)

    • Use document.querySelectorAll() to find all a elements and convert the result into an array
    • Iterate through the array, turning each item into a {href: '', text: ''} object
    • Return the array of objects
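
A minimal sketch of this collection step, reusing the Net Chart selector from the earlier walkDOM call; the links variable name is an assumption for illustration:

// Collect every link under the Net Chart directory node
const links = await page.evaluate(() => {
    const anchors = document.querySelectorAll('#menu > ul > li:nth-child(5) > ul > li:nth-child(5) a');
    return Array.from(anchors).map(a => ({
        href: a.href,            // absolute URL of the subdirectory page
        text: a.innerText.trim() // link text, later used as a file name
    }));
});
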
  • Iterate through the array of objects, visiting each link and downloading its HTML file (a sketch follows this sub-list)

    • Navigate to each link and download the required HTML into the specified folder
    • If the HTML contains img elements, download all of the images as well
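
A minimal sketch of the download loop, reusing the hypothetical links array and outputPath from the previous sketches; image downloading is left out here:

for (const link of links) {
    // Visit the subdirectory page and wait for network activity to settle
    await page.goto(link.href, { waitUntil: 'networkidle2' });
    // Save the fully rendered HTML into the output folder
    const html = await page.content();
    const fileName = link.text.replace(/[\\/:*?"<>|\s]+/g, '_') + '.html';
    fs.writeFileSync(path.join(outputPath, fileName), html);
    console.log(`Saved ${fileName}`);
}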

4. Summary

Using Puppeteer for the first time was a bumpy ride; it took time, reference reading, and practice.

Code repository

  • Code repository

Reference article

  • Using a Puppeteer crawler on the front end to generate the "React.js Little Book" PDF and merge it