This article assumes the reader has some familiarity with Docker and Node.

Introduction to Puppeteer

Puppeteer was created on GitHub in May 2017. It is a relatively new Node library for controlling headless Chromium.

Using Puppeteer in a real project

Basic installation

There were two main reasons for choosing Puppeteer:

  • 1: It is officially maintained by Google and highly active; personally, I feel it has a bright future.
  • 2: Our product works best on Chrome. The latest version at the time was 0.13.0, but we adopted 0.12.0, because the 0.13.0 release changed some APIs in ways that did not meet our requirements. There were two problems we had to solve for the screenshots:
    • The screenshot can only be taken after all queries on the current dashboard have completed
    • We do not know how long the queries initiated by the dashboard will take to finish
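Since we cannot know in advance when the dashboard's queries will finish, one generic approach is to poll for a completion signal. The sketch below is a hypothetical helper, not our actual service code; the idea is that the page could expose a flag (e.g. a hypothetical `window.__allQueriesDone`) that the screenshot service polls for before capturing.

```javascript
// Generic polling helper: resolves once `predicate()` returns true,
// or rejects after `timeout` milliseconds. A screenshot service could
// pass a predicate that checks a "queries finished" flag on the page.
function waitFor(predicate, { timeout = 30000, interval = 500 } = {}) {
  return new Promise((resolve, reject) => {
    const start = Date.now();
    const timer = setInterval(() => {
      if (predicate()) {
        clearInterval(timer);
        resolve(true);
      } else if (Date.now() - start > timeout) {
        clearInterval(timer);
        reject(new Error('Timed out waiting for queries to finish'));
      }
    }, interval);
  });
}
```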

When npm installs puppeteer, it downloads Chromium from a Google server, but behind the Great Firewall the download fails. Our workaround is to set an environment variable:

    set PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1

This prevents the install step from downloading Chromium automatically. We then download Chromium manually and package it into a base image with docker build. Our Dockerfile starts FROM that image, and we run:

    npm install puppeteer@0.12.0 --save

Docker images can now be built very quickly. The /usr/src/node/ directory in the image contains our Node code and the Chromium directory.
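For reference, here is a minimal sketch of what such a base-image Dockerfile might look like. The base image tag, package list, and paths are assumptions for illustration, not our actual build file:

```dockerfile
# Sketch of a base image: Node plus a locally vendored Chromium.
FROM node:8

WORKDIR /usr/src/node

# Chromium was downloaded manually and sits next to the Dockerfile
COPY chromium /usr/src/node/chromium

# Shared libraries Chromium needs at runtime (Debian package names)
RUN apt-get update && apt-get install -y --no-install-recommends \
        libnss3 libxss1 libasound2 libatk1.0-0 libgtk-3-0 fonts-liberation \
    && rm -rf /var/lib/apt/lists/*

# Skip the bundled-Chromium download when `npm install` runs later
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1
```

Images built FROM this base only need to copy the application code and run the npm install shown above.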

Basic operation

Launching Puppeteer

We manually specify the Chromium executable to run:

    const browser = await puppeteer.launch({
        args: ['--no-sandbox', '--disable-setuid-sandbox'], // these two args are needed to run inside Docker
        executablePath: 'chromium/chrome', // the base image copied Chromium to /usr/src/node/chromium
    });

Save the picture

Open the site by URL:

   await page.goto(fullUrl, {
       waitUntil: 'networkidle',
       networkIdleTimeout: 15000,
       timeout: 240000
   });

networkIdleTimeout: 15000 means navigation is considered complete only after the network has been idle for at least 15 seconds; without this, the exported screenshot data can be incomplete. Saving the entire page as an image or a PDF is easy, since there are APIs to call directly. But in our case we only save one region as an image.
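The timeout: 240000 option above makes goto fail if navigation takes longer than four minutes. The underlying pattern is a promise race against a timer; the sketch below shows that pattern as a standalone helper (a hypothetical illustration, not Puppeteer's internals):

```javascript
// Race a promise against a timer: resolves with the promise's value,
// or rejects with an error once `ms` milliseconds have elapsed.
function withTimeout(promise, ms, message = 'Navigation timeout exceeded') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(message)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```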

    let rect = await page.evaluate(() => {
        // select the DOM node with the specified class
        const element = document.querySelector('.class1');
        const { x, y, width, height } = element.getBoundingClientRect();
        return { x, y, width, height };
    });
    await page.screenshot({
        path: imagePath,
        clip: {
            x: rect.x,
            y: rect.y,
            width: actualWidth,   // computed elsewhere from rect
            height: actualHeight
        }
    });

You can manipulate page elements inside page.evaluate, so you can obtain information such as the width and height of a given region and then clip just that region. For the full API, refer to the Puppeteer API documentation on GitHub.
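The actualWidth and actualHeight variables in the snippet above are computed elsewhere in our service. One practical concern when deriving a clip from getBoundingClientRect is that fractional or out-of-bounds values can make page.screenshot fail; the hypothetical helper below (an assumption, not our actual code) rounds the rectangle to integer pixels and clamps it to the viewport:

```javascript
// Hypothetical helper: round a DOMRect-like object to integer pixels and
// clamp it to the viewport, so the clip region passed to page.screenshot
// never extends past the edge of the rendered page.
function toClip(rect, viewport) {
  const x = Math.max(0, Math.floor(rect.x));
  const y = Math.max(0, Math.floor(rect.y));
  return {
    x,
    y,
    width: Math.min(Math.ceil(rect.width), viewport.width - x),
    height: Math.min(Math.ceil(rect.height), viewport.height - y),
  };
}
```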

Save the PDF

As mentioned in the previous section, saving the entire page as a PDF would be easy, but we only save a certain region, and the PDF-saving API has no clip parameter like page.screenshot does. There are many ways to convert; I used the PDFKit library. I will not go into the code here, as there are plenty of demos to refer to.
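For readers who want a starting point, here is a minimal sketch of the clipped-screenshot-to-PDF idea with PDFKit: take the PNG produced by page.screenshot and place it on a single PDF page sized to match. This is an illustrative sketch under assumed variable names, not our production code, and it requires the pdfkit package to be installed:

```javascript
const fs = require('fs');
const PDFDocument = require('pdfkit'); // npm install pdfkit

// Wrap a screenshot PNG in a single-page PDF whose page size
// matches the image dimensions (width/height in PDF points).
function imageToPdf(imagePath, pdfPath, width, height) {
  return new Promise((resolve, reject) => {
    const doc = new PDFDocument({ size: [width, height], margin: 0 });
    const out = fs.createWriteStream(pdfPath);
    out.on('finish', resolve);
    out.on('error', reject);
    doc.pipe(out);
    doc.image(imagePath, 0, 0, { width, height }); // fill the page with the PNG
    doc.end();
  });
}
```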

Conclusion

Because we package and deploy our Node services with Docker, CI/CD, and DevOps practices, we hit a few Puppeteer pitfalls in Docker; fortunately, there are official troubleshooting guides for most of them. In practice I still see the occasional page-load failure, and I expect future releases to be more powerful and stable.