
Project Background:

Recently, the company needed to generate PDF files, which sounded easy enough. I laughed and passed the requirement on to the front end. When the product manager handed me a gaudy prototype, I realized it wasn't that simple.

To put it simply, our requirement is an interface that exports a PDF with flashy content that still looks good when printed.

So I did some comparative research. There are many front-end and back-end methods for generating PDFs on the web. Here are some common ones for reference.

Method comparison:

Front end

html2canvas+jsPDF

How it works: Convert HTML elements to canvas or images to generate PDF files.

Advantages: there are plenty of front-end tutorials for this method, it is easy to use, and it is fast.

Disadvantages: elements truncated at page boundaries are hard to fix, the print quality is poor, and because everything happens in the browser, a PDF cannot be generated directly through an interface call.
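To see why this approach truncates elements, here is a minimal sketch of the slicing arithmetic. It assumes the whole page is rendered to one tall image that is then cut at fixed A4-height intervals; the function names and the sample layout are hypothetical and are not part of html2canvas or jsPDF.

```javascript
// A4 size in millimetres.
const A4_WIDTH_MM = 210;
const A4_HEIGHT_MM = 297;

// Height in canvas pixels of one A4 page when the canvas is scaled to the page width.
function pageHeightPx(canvasWidthPx) {
  return Math.floor((canvasWidthPx * A4_HEIGHT_MM) / A4_WIDTH_MM);
}

// Y offsets (in canvas pixels) at which the tall image is cut into pages.
function cutPositions(canvasWidthPx, canvasHeightPx) {
  const step = pageHeightPx(canvasWidthPx);
  const cuts = [];
  for (let y = step; y < canvasHeightPx; y += step) {
    cuts.push(y);
  }
  return cuts;
}

// Elements whose box straddles a cut are the ones that get truncated.
function truncatedElements(elements, cuts) {
  return elements.filter((el) =>
    cuts.some((y) => el.top < y && y < el.top + el.height));
}

// Hypothetical layout: a render 1050 px wide and 3000 px tall with three blocks.
const cuts = cutPositions(1050, 3000);
const blocks = [
  { name: 'title', top: 0, height: 200 },
  { name: 'chart', top: 1300, height: 400 },
  { name: 'table', top: 1800, height: 300 },
];
console.log(cuts, truncatedElements(blocks, cuts).map((el) => el.name));
```

In this made-up layout the first cut lands inside the chart, so the chart is split across two pages. Avoiding that on the front end means measuring every element and padding it past the cut by hand.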

Back end

iText/POI: converts HTML/XML to PDF. If the PDF has a lot of content, the back end has to maintain bloated structured data or styles, which is painful to write and maintain, and the result still cannot look flashy. Only suitable for simple PDF exports.

Poi-tl: generates the PDF from a DOC template. Using template files saves a lot of trouble with styles and typesetting, but because of the limitations of DOC it cannot support elements as complex as HTML can, such as charts, fonts, tables and other highly customized styles. I think this method is superior to the two methods above in efficiency.

Having compared the advantages and disadvantages of these approaches, the rest of this article focuses on using and deploying Puppeteer.

Puppeteer:

Puppeteer is a Node library that provides a high-level API for controlling Chrome or Chromium via the DevTools protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.

What can Puppeteer do?

Most of the things you can do manually in a browser can be done using Puppeteer! Here are some examples to help you get started:

  • Generate screenshots and PDFs of pages.
  • Crawl a SPA (single-page application) and generate pre-rendered content (that is, "SSR" (server-side rendering)).
  • Automate form submission, UI testing, keyboard input, and more.
  • Create an up-to-date automated test environment. Run your tests directly in the latest version of Chrome, using the latest JavaScript and browser features.
  • Capture a timeline trace of the site to help diagnose performance problems.
  • Test Chrome extensions.

Puppeteer is rich in features, but we only care about the first point, PDF generation. Put simply, Puppeteer runs Chrome in the server environment and uses Chrome's API to generate PDFs. This may seem complicated, but the complexity has its advantages. PDFs generated by Puppeteer avoid the problem of text or tables being cut off at page boundaries. Truncation of other elements, such as canvases or images, is also very simple to solve.
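As a minimal sketch of that idea, the helper below renders an HTML string to a PDF with an already-running browser. The name htmlToPdf is hypothetical. The browser argument is assumed to be a Puppeteer Browser obtained from puppeteer.launch() or puppeteer.connect(), and it is passed in so the helper itself does not start Chrome.

```javascript
// Render an HTML string to a PDF using an already-running browser.
// `browser` is assumed to be a Puppeteer Browser instance.
async function htmlToPdf(browser, html) {
  const page = await browser.newPage();
  try {
    // Wait until network activity has settled so images and fonts are loaded.
    await page.setContent(html, { waitUntil: 'networkidle0' });
    // printBackground keeps CSS background colours and images in the output.
    return await page.pdf({ format: 'A4', printBackground: true });
  } finally {
    // Always close the tab, even if rendering failed.
    await page.close();
  }
}
```

Inside an async function this might be called as `const pdf = await htmlToPdf(await puppeteer.launch(), '<h1>Report</h1>')`.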

jvppeteer (GitHub: fanyong920/jvppeteer, "Headless Chrome For Java")

jvppeteer is a Java version of Puppeteer, so Java developers can use it to generate PDF files. However, since it depends on a Chrome environment either way, we deploy Puppeteer with Node instead.

Official sample

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Get the "viewport" of the page, as reported by the page.
  const dimensions = await page.evaluate(() => {
    return {
      width: document.documentElement.clientWidth,
      height: document.documentElement.clientHeight,
      deviceScaleFactor: window.devicePixelRatio,
    };
  });

  console.log('Dimensions:', dimensions);

  await browser.close();
})();

puppeteer.launch() starts the browser, but we don't need to launch a new browser for every request. Instead, we can use puppeteer.connect() to attach to the browser that is already running and open a new page, which reduces resource consumption.

Code sample

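For the puppeteer.launch() and page.pdf() configuration options, please refer to the official website.

const puppeteer = require('puppeteer');
const fs = require('fs');
// Assumption: `app` is an Express app; its setup is not shown in the original.
const express = require('express');
const logger = require('./log4js');

const app = express();
app.use(express.json());

// Launch the browser once and save its WebSocket endpoint to a file.
async function launchBrowser() {
  try {
    const browser = await puppeteer.launch({
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--enable-accelerated-2d-canvas',
        '--enable-aggressive-domstorage-flushing',
      ],
      ignoreHTTPSErrors: true,
      headless: true,
      timeout: 60000,
    });
    const wsAddress = browser.wsEndpoint();
    const w_data = Buffer.from(wsAddress);
    fs.writeFile(__dirname + '/wsa.txt', w_data, { flag: 'w+' }, function (err) {
      if (err) {
        logger.error(err);
      } else {
        logger.info('Browser started successfully: ', wsAddress);
      }
    });
  } catch (e) {
    logger.error(e);
  }
}

// Connect to the running browser over WebSocket and open a new tab.
async function newPage() {
  const getWSAddress = () => new Promise((resolve, reject) => {
    fs.readFile(__dirname + '/wsa.txt', { flag: 'r+', encoding: 'utf8' }, function (err, data) {
      if (err) {
        // Reject so the caller does not wait forever when the file is missing.
        reject(err);
        return;
      }
      resolve(data);
    });
  });
  const wsa = await getWSAddress();
  const browserConfig = { browserWSEndpoint: wsa };
  const browser = await puppeteer.connect(browserConfig);
  return browser.newPage();
}

// The original does not show these templates; minimal placeholders are used here.
const headerTemplate = '<span></span>';
const footerTemplate = '<div style="font-size:10px;width:100%;text-align:center;"><span class="pageNumber"></span></div>';

// PDF export options
const options = {
  // Paper size
  format: 'A4',
  // Print background graphics; the default is false
  printBackground: true,
  // Display the header and footer (page numbers, etc.)
  displayHeaderFooter: true,
  headerTemplate,
  footerTemplate,
  margin: { top: '2px', bottom: '35px' },
};

app.post('/PDF', async (req, res) => {
  // Assumption: the page URL arrives in the JSON body; the original does not show where `url` comes from.
  const { url } = req.body;
  const printPdf = async () => {
    let page;
    try {
      page = await newPage();
      await page.goto(url, { waitUntil: 'networkidle0' });
      return await page.pdf(options);
    } catch (e) {
      logger.error(e);
    } finally {
      if (page) {
        await page.close();
      }
    }
  };
  const result = await printPdf();
  if (!result) {
    res.status(500).send('PDF generation failed');
    return;
  }
  // Newer Puppeteer versions return a Uint8Array, so normalize to a Buffer before sending.
  const pdf = Buffer.from(result);
  res.set({ 'Content-Type': 'application/pdf', 'Content-Length': pdf.length });
  res.status(200).send(pdf);
});

// Startup is not shown in the original; port 8888 matches the EXPOSE line in the Dockerfile below.
launchBrowser();
app.listen(8888);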
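The sample above does not show the author's actual headerTemplate and footerTemplate. As a minimal sketch of what a footer might look like: when Puppeteer prints, it fills in elements that carry the classes pageNumber and totalPages. The helper name buildFooterTemplate, the label text and the inline styling are assumptions for illustration.

```javascript
// Escape characters that would otherwise be interpreted as HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Build a footer string for page.pdf({ displayHeaderFooter: true, footerTemplate }).
// Puppeteer injects values into elements with the classes "pageNumber" and "totalPages".
function buildFooterTemplate(label) {
  // The font size is set inline so the footer text is rendered at a readable size.
  return [
    '<div style="font-size:10px; width:100%; text-align:center;">',
    `${escapeHtml(label)} `,
    '<span class="pageNumber"></span> / <span class="totalPages"></span>',
    '</div>',
  ].join('');
}

const footerTemplate = buildFooterTemplate('Report <draft>');
console.log(footerTemplate);
```

The bottom margin in the PDF options has to leave enough room for the footer, otherwise the page content can cover it.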

Project deployment

If you deploy on Windows, you avoid a lot of environment dependencies, missing fonts and similar problems, but most services are deployed on Linux. Here is how to deploy using Docker.

For the deployment and the missing Chinese fonts I followed the methods in the two posts below. Thanks again to their authors.

Puppeteer docker: garbled Chinese text and "Failed to launch chrome" (clut_dead_line9527)

The Temptation of Screenshots: Deploying a Puppeteer Project with Docker (Jianshu)

# Pull the base image. This image is no longer maintained; to be safe, build your own image.
FROM buildkite/puppeteer:10.0.0

# Use a domestic mirror for apt
RUN sed -i 's/deb.debian.org/mirrors.163.com/g' /etc/apt/sources.list && \
    apt update && \
    apt-get install -y dpkg wget unzip

# Copy the font packages into the image.
# The source path of the .deb is garbled in the original and is assumed to mirror the line after it.
COPY fonts/fonts-noto-cjk.deb ./tmp/fonts-noto-cjk.deb
COPY fonts/source-sans-pro.zip ./tmp/source-sans-pro.zip

RUN cd /tmp && dpkg -i fonts-noto-cjk.deb && \
    unzip source-sans-pro.zip && \
    cd source-sans-pro-2.040R-ro-1.090R-it && \
    mv ./OTF /usr/share/fonts/ && \
    fc-cache -f -v

RUN apt-get update
RUN apt-get -y install fontconfig xfonts-utils
RUN fc-list :lang=zh

WORKDIR /app
COPY ./package.json /app/
RUN npm config set unsafe-perm true
RUN npm config set registry https://registry.npm.taobao.org
# Install pm2
RUN npm i pm2 -g
RUN npm install
COPY ./ /app/
EXPOSE 8888
CMD [ "yarn", "server" ]

Problems
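The Dockerfile runs fc-list :lang=zh at build time to confirm that Chinese fonts are present. The same check could be repeated from Node when the service starts. This is a minimal sketch: the function names are hypothetical, it assumes fc-list is on the PATH, and the sample lines are made up for illustration rather than captured from a real container.

```javascript
const { execFileSync } = require('child_process');

// Parse `fc-list` output lines of the form "<file>: <family>[,<family>...]:style=<style>"
// and return the unique family names in order of first appearance.
function parseFcList(output) {
  const families = new Set();
  for (const line of output.split('\n')) {
    const sep = line.indexOf(': ');
    if (sep === -1) continue; // skip blank or unexpected lines
    const familyPart = line.slice(sep + 2).split(':')[0];
    for (const name of familyPart.split(',')) {
      if (name.trim()) families.add(name.trim());
    }
  }
  return [...families];
}

// Ask fontconfig which installed fonts cover Chinese; requires fc-list on the PATH.
function listChineseFonts() {
  const output = execFileSync('fc-list', [':lang=zh'], { encoding: 'utf8' });
  return parseFcList(output);
}

// Illustrative sample only.
const sample = [
  '/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc: Noto Sans CJK SC,Noto Sans CJK SC Regular:style=Regular',
  '/usr/share/fonts/opentype/noto/NotoSansCJK-Bold.ttc: Noto Sans CJK SC,Noto Sans CJK SC Bold:style=Bold',
].join('\n');
console.log(parseFcList(sample));
```

A service could call listChineseFonts() at startup and log a warning when the result is empty, since garbled text in the PDF would otherwise be the first sign that the fonts are missing.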

  1. Garbled Chinese characters

    Chinese fonts must be installed in the Linux environment; otherwise Chinese text may come out as garbled characters. Because downloading fonts over the network is unreliable, the font files are uploaded with the project and copied into the image.

  2. Images are truncated

    page-break-after : auto | always | avoid | left | right
    page-break-before : auto | always | avoid | left | right
    page-break-inside : auto | avoid

    Add the following to the styles of the elements that need handling:

    page-break-inside: avoid

    page-break-before: always

    Pagination generally works by inserting blank space above the element, which pushes it onto the next page so that it is not cut at the page boundary.

  3. Caching

    After the front-end page is modified, the generated PDF may still show the old version because of the browser cache. In this case, incognito mode or page.setCacheEnabled(false) can be used (not verified yet).
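Problems 2 and 3 can be sketched together in one helper. Assumptions: page is a Puppeteer Page, the cache is bypassed with Puppeteer's page.setCacheEnabled(false), and the CSS is injected with page.addStyleTag() after navigation. The function names and selectors are hypothetical, and like the cache note above, this has not been verified against the author's project.

```javascript
// Build print CSS from the page-break properties listed in problem 2.
// `avoidInside` and `breakBefore` are arrays of CSS selectors.
function buildPrintCss(avoidInside, breakBefore) {
  const rules = [];
  if (avoidInside.length > 0) {
    rules.push(`${avoidInside.join(', ')} { page-break-inside: avoid; }`);
  }
  if (breakBefore.length > 0) {
    rules.push(`${breakBefore.join(', ')} { page-break-before: always; }`);
  }
  return rules.join('\n');
}

// Load `url` for printing: bypass the cache, then inject the print CSS.
async function gotoForPrint(page, url, css) {
  await page.setCacheEnabled(false); // problem 3: do not reuse cached resources
  await page.goto(url, { waitUntil: 'networkidle0' });
  if (css) {
    await page.addStyleTag({ content: css }); // problem 2: keep elements whole
  }
}

// Hypothetical selectors: keep images, canvases and table rows whole,
// and start every element with the class "chapter" on a new page.
const css = buildPrintCss(['img', 'canvas', 'tr'], ['.chapter']);
console.log(css);
```

Injecting the rules at print time keeps them out of the front-end code base. Putting the same rules in the page's own stylesheet works just as well.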