Four Ways to Convert HTML into PDF

In this article, I’ll show you how to generate PDF documents from complex React pages using Node.js, Puppeteer, Headless Chrome, and Docker. Since this task is too complex to solve with a few simple CSS rules, we first explored the possible implementations and found three main solutions. This blog post walks you through what each of them can do and then through the final implementation.

Table of contents:
- Client side or server side?
- Solution 1: Make a screenshot from the DOM
- Solution 2: Use only a PDF library
- Final solution 3: Node.js, Puppeteer and Headless Chrome
- Style control
- Send the file to the client and save it
- Using Puppeteer in Docker
- Solution 3 + 1: CSS print rules
- Summary

Client side or server side?

PDF files can be generated on both the client and the server side. It usually makes more sense to let the back end handle it, because you don’t want to use up all the resources a user’s browser can provide. Even so, I’ll show solutions for both approaches.

Solution 1: Make a screenshot from the DOM

At first glance this solution seems to be the simplest, and it is, but it has its limitations. If you don’t have special needs, such as selectable or searchable text in the PDF, it is an easy method to use. The idea is straightforward: take a screenshot of the page and put it into a PDF file.

We can use two packages to do this: html2canvas, which renders a screenshot from the DOM, and jsPDF, a library for generating PDFs. Let’s start coding:


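    npm install html2canvas jspdf

    import html2canvas from 'html2canvas'
    import jsPdf from 'jspdf'

    function printPDF () {
      const domElement = document.getElementById('your-id')

      html2canvas(domElement, {
        // onclone receives the cloned document, so the live page is not changed
        onclone: (clonedDocument) => {
          clonedDocument.getElementById('print-button').style.visibility = 'hidden'
        }
      }).then((canvas) => {
        const imgData = canvas.toDataURL('image/png')
        const pdf = new jsPdf()

        // Fill the page width and keep the aspect ratio of the screenshot
        const width = pdf.internal.pageSize.getWidth()
        const height = (canvas.height * width) / canvas.width

        pdf.addImage(imgData, 'PNG', 0, 0, width, height)
        pdf.save('your-filename.pdf')
      })
    }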
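The width and height passed to addImage decide how the screenshot fits on the page. Below is a minimal sketch of that arithmetic, assuming the image should fill the page width and keep its aspect ratio. fitImageToPage is a hypothetical helper, not part of jsPDF or html2canvas, and the 210 x 297 page size assumes an A4 page measured in millimetres.

```javascript
// Hypothetical helper (not part of jsPDF or html2canvas): scale a canvas to
// the page width, keep its aspect ratio, and report how many pages of the
// given height the scaled image would span.
function fitImageToPage(canvasWidth, canvasHeight, pageWidth, pageHeight) {
  if (canvasWidth <= 0 || canvasHeight <= 0) {
    throw new RangeError('canvas dimensions must be positive');
  }
  const height = (canvasHeight * pageWidth) / canvasWidth;
  return {
    width: pageWidth,
    height,
    pageCount: Math.ceil(height / pageHeight),
  };
}

// Assumed example values: a 1200 x 3000 px canvas on an A4 page (210 x 297 mm).
const fit = fitImageToPage(1200, 3000, 210, 297);
console.log(fit);
```

A pageCount above 1 means the scaled screenshot is taller than a single page, so it would be cut off unless it is split across several pages or scaled down further.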

That’s it! Notice the onclone option of html2canvas. It comes in handy when you need to manipulate the DOM (such as hiding the print button) before the screenshot is taken. I’ve seen many projects use this package. Unfortunately, it is not what we want, because we need to create the PDF on the back end.

Solution 2: Use only a PDF library

There are several libraries on npm for this, such as jsPDF (mentioned above) or PDFKit. The problem with them is that I would have to recreate the page structure a second time. That definitely hurts maintainability, because every subsequent change has to be applied to both the PDF template and the React page.

Look at the code below. You need to create the PDF document by hand: traverse the DOM, find every element, and convert each one to PDF, which is tedious work. There has to be an easier way.


    const fs = require('fs')
    const PDFDocument = require('pdfkit')

    const doc = new PDFDocument()

    // Stream the document into a file while it is being built
    doc.pipe(fs.createWriteStream('output.pdf'))

    doc.font('fonts/PalatinoBold.ttf')
      .fontSize(25)
      .text('Some text with an embedded font!', 100, 100)

    doc.image('path/to/image.png', {
      fit: [250, 300],
      align: 'center',
      valign: 'center'
    })

    doc.addPage()
      .fontSize(25)
      .text('Here is some vector graphics...', 100, 100)

    // Finalize the PDF and close the stream
    doc.end()

This snippet comes from the PDFKit documentation. The library can still be useful if your goal is to generate a PDF file directly, rather than to convert an existing (and constantly changing) HTML page.

Final solution 3: Node.js, Puppeteer and Headless Chrome

What is Puppeteer? Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium over the DevTools protocol. It runs Chrome or Chromium in headless mode by default, but it can also be configured to run in full (non-headless) mode. It is essentially a browser that you can run from Node.js. If you read its documentation, one of the first things it says is that you can use Puppeteer to generate screenshots and PDFs of pages. Excellent! This is exactly what we want.

Start by installing Puppeteer with npm i puppeteer, then implement our functionality.


    const puppeteer = require('puppeteer')

    async function printPDF() {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();

      // Navigation counts as finished once the network has gone idle
      await page.goto('https://blog.risingstack.com', {waitUntil: 'networkidle0'});

      // Without a 'path' option the PDF is returned in memory
      const pdf = await page.pdf({ format: 'A4' });

      await browser.close();
      return pdf
    }

This is a simple function that navigates to a URL and generates a PDF file of the site. First, we launch the browser (PDF generation is supported only in headless mode), then we open a new page and navigate to the provided URL. Setting the waitUntil: 'networkidle0' option means that Puppeteer considers navigation to be finished when there have been no network connections for at least 500 milliseconds. (You can find more information in the API docs.) After that, we store the PDF in a variable, close the browser, and return the PDF.

Note: the page.pdf method accepts an options object, and you can use its 'path' option to save the file to disk. If no path is provided, the PDF is not saved to disk and you get a buffer instead. (I’ll discuss how to handle it later.)

If you need to log in first to generate a PDF from a protected page, navigate to the login page, look up the ID or name of the form elements, fill them in, and submit the form:


    await page.type('#email', process.env.PDF_USER)
    await page.type('#password', process.env.PDF_PASSWORD)
    await page.click('#submit')
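The login snippet reads PDF_USER and PDF_PASSWORD from the environment. A small guard, sketched below, can report a missing variable clearly before the browser is even launched. getPdfCredentials is a hypothetical helper and the example values are made up for illustration.

```javascript
// Hypothetical guard: read the login credentials from an env-like object and
// fail early, naming the missing variables without printing any values.
function getPdfCredentials(env) {
  const required = ['PDF_USER', 'PDF_PASSWORD'];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return { user: env.PDF_USER, password: env.PDF_PASSWORD };
}

// In a real script you would pass process.env; these values are placeholders.
const credentials = getPdfCredentials({
  PDF_USER: 'reports@example.com',
  PDF_PASSWORD: 's3cret',
});
```

The returned user and password can then be handed to page.type in place of the direct process.env lookups.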

Always keep login credentials in environment variables; never hardcode them!

Style control

Puppeteer also has a solution for style manipulation. You can insert a style tag before the PDF is generated, and Puppeteer will produce the file with the modified styles.


    await page.addStyleTag({ content: '.nav { display: none} .navbar { border: 0px} #print-button {display: none}' })
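When the list of elements to hide grows, the CSS string can be assembled from an array of selectors instead of being written by hand. A minimal sketch, assuming that hiding elements is all that is needed; buildHideCss is a hypothetical helper, and the !important flag is an added assumption so that the rule wins over the page’s own styles.

```javascript
// Hypothetical helper: build a stylesheet that hides every given selector.
function buildHideCss(selectors) {
  return selectors
    .map((selector) => `${selector} { display: none !important; }`)
    .join('\n');
}

const css = buildHideCss(['.nav', '#print-button']);
console.log(css);
```

The resulting string can be passed to Puppeteer as page.addStyleTag({ content: css }).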

Send the file to the client and save it

Now you have a PDF file generated on the back end. What’s next? As mentioned above, if you do not save the file to disk, you get a buffer. You just need to send that buffer to the front end with the appropriate content type.


    printPDF().then(pdf => {
      res.set({ 'Content-Type': 'application/pdf', 'Content-Length': pdf.length })
      res.send(pdf)
    })
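The same response can be written without a framework. The sketch below uses only the writeHead and end methods of Node’s http response object. sendPdf is a hypothetical helper, and the Content-Disposition header is an addition the original snippet does not set; it asks the browser to download the file under the given name.

```javascript
// Hypothetical helper for a plain Node.js http handler (no framework).
function sendPdf(res, pdf, filename) {
  // Copy the bytes into a Buffer; works for both a Buffer and a Uint8Array
  const body = Buffer.from(pdf);
  res.writeHead(200, {
    'Content-Type': 'application/pdf',
    'Content-Length': body.length,
    'Content-Disposition': `attachment; filename="${filename}"`,
  });
  res.end(body);
}
```

Inside an async request handler this could be called as sendPdf(res, await printPDF(), 'report.pdf'), where the file name is a placeholder.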

Now, you just need to send a request to the server from your browser to get the generated PDF.


    function getPDF() {
      return axios.get(`${API_URL}/your-pdf-endpoint`, {
        // Ask for raw bytes so the PDF is not decoded as text
        responseType: 'arraybuffer',
        headers: {
          'Accept': 'application/pdf'
        }
      })
    }

Once the request is sent, the contents of the buffer should start downloading. The final step is to convert the buffer data to a PDF file.


    savePDF = () => {
      this.openModal('Loading...') // open modal
      return getPDF() // API call
        .then((response) => {
          // Wrap the bytes in a Blob and trigger a download through a temporary link
          const blob = new Blob([response.data], {type: 'application/pdf'})
          const link = document.createElement('a')
          link.href = window.URL.createObjectURL(blob)
          link.download = `your-file-name.pdf`
          link.click()
          this.closeModal() // close modal
        })
        .catch((err) => { /** error handling **/ })
    }

    <button onClick={this.savePDF}>Save as PDF</button>

That’s it! If you click the Save button, the browser will save the PDF.

Using Puppeteer in Docker

I think this is the trickiest part of the implementation, so let me save you a few hours of searching. The official documentation notes that getting headless Chrome up and running in Docker can be tricky. It has a troubleshooting section where you can find all the necessary information about installing Puppeteer with Docker. If you install Puppeteer on an Alpine image, be sure to scroll down a bit further when you reach that part of the page. Otherwise you might miss the fact that you cannot run the latest Puppeteer version there, and that you also need to disable /dev/shm usage with a flag:


    const browser = await puppeteer.launch({
      headless: true,
      args: ['--disable-dev-shm-usage']
    });

Without this flag, the Puppeteer child process may run out of memory before it even starts properly.

Solution 3 + 1: CSS print rules

One might think that simply using CSS print rules is the easy option from a developer’s perspective: no npm modules, just pure CSS. But how does it fare in terms of cross-browser compatibility? When you choose CSS print rules, you have to test the result in every browser to make sure it produces the same layout, and it does not do so 100% of the time. For example, inserting a break-after rule after a given element is hardly a sophisticated technique, yet you might be surprised to find that it needs workarounds in Firefox. Unless you are a seasoned CSS expert with a lot of experience in building printable pages, this can be very time consuming. Print rules are useful if you can keep the print stylesheet simple. Let’s look at an example.


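    @media print {
      /* Hide the print button in the printed output */
      .print-button {
        display: none;
      }

      /* Force a page break after each div inside .content */
      .content div {
        break-after: page;
      }
    }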

The CSS above hides the print button and inserts a page break after every div inside an element with the content class. There’s a great article summarizing what you can do with print rules and what the difficulties are, including browser compatibility. All things considered, CSS print rules work very well if you want to generate PDFs from less complex pages.

Summary

Let’s quickly review the options covered here for generating PDF files from HTML pages:
- Screenshot from the DOM: useful when you need a snapshot of a page (for example, to create a thumbnail), but it falls short when you have to handle a lot of data.
- PDF library only: the perfect solution if you plan to create PDF files programmatically from scratch. Otherwise you have to maintain both the HTML and the PDF templates, which is definitely a no-go.
- Puppeteer: although getting it to work in Docker was relatively difficult, it gave the best results for our use case and was also the easiest code to write.
- CSS print rules: if your users are savvy enough to know how to print a page to a file, and your pages are relatively simple, this is probably the most painless solution. As you saw, that was not true in our case.

Link: https://juejin.cn/post/6844903811182493704
