Felix · Ant Financial Data Experience Technology team

Our daily use of a browser follows the same steps: start the browser, open a web page, and interact with it. A headless browser lets us do all of this with scripts that mimic real browser usage.

With a headless browser, we can do things including but not limited to:

  • Take a screenshot of the web page and save it as a picture or PDF
  • Crawl single-page applications (SPAs), executing and rendering them (solving the problem that traditional HTTP crawlers struggle with the asynchronous requests SPAs rely on)
  • Do automatic form submission, automated UI testing, simulated keyboard input, etc.
  • Use some debugging tools and performance analysis tools that come with the browser to help us analyze problems
  • Test in the latest headless browser environment and use the latest browser features
  • Write crawlers to do whatever you want

There are many headless browsers, including but not limited to:

  • PhantomJS, based on WebKit
  • SlimerJS, based on Gecko
  • HtmlUnit, based on Rhino
  • TrifleJS, based on Trident
  • Splash, based on WebKit

This article introduces headless Chrome provided by Google, driven through the Puppeteer library, which offers a number of highly encapsulated interfaces built on the Chrome DevTools Protocol to make controlling the browser easy.

Simple code example

To use new features such as async/await, you need Node v7.6.0 or later. The snippets below use await at the top level for brevity; assume each one runs inside an async function.
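A minimal wrapper, assuming Puppeteer has been installed with npm install puppeteer; any of the snippets below can be pasted into the body of an async function like this:

const puppeteer = require('puppeteer');

(async () => {
    // ... paste a snippet from below here ...
})();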

Start/close the browser and open the page

// Start the browser
const browser = await puppeteer.launch({
    // Disable headless mode so we can watch the headless browser work
    // headless: false,
    timeout: 30000 // Default timeout is 30 seconds; 0 means no timeout
});

// Open a blank page
const page = await browser.newPage();

// Interact with the page
// ...

// Close the browser
await browser.close();

Set the page window size

// Set the browser viewport size
await page.setViewport({
    width: 1376,
    height: 768
});

Open a URL

// Navigate to the URL
await page.goto('https://google.com/', {
    // Configuration options
    // waitUntil: 'networkidle'
});

Save a web page as an image

Open a web page and save the screenshot locally:

await page.screenshot({
    path: 'path/to/saved.png'
});
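As a small extension (a sketch, assuming the same page object), the screenshot can capture the full scrollable page rather than just the viewport:

// Capture the entire scrollable page
await page.screenshot({
    path: 'path/to/saved-full.png',
    fullPage: true
});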

Complete sample code

Save the page as a PDF

Open a web page and save the PDF locally:

await page.pdf({
    path: 'path/to/saved.pdf',
    format: 'A4' // Paper size
});
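A variation (a sketch, assuming the same page object): page.pdf also accepts options such as landscape orientation and background printing:

await page.pdf({
    path: 'path/to/saved-landscape.pdf',
    format: 'A4',
    landscape: true, // Landscape orientation
    printBackground: true // Include CSS backgrounds in the PDF
});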

Complete sample code

Execute the script

To access the host environment of the open web page, use the page.evaluate method:

const dimensions = await page.evaluate(() => {
    return {
        width: document.documentElement.clientWidth,
        height: document.documentElement.clientHeight,
        deviceScaleFactor: window.devicePixelRatio
    };
});
console.log('Window info:', dimensions);

// Get an element handle for <html>
const htmlHandle = await page.$('html');
// Pass the handle to page.evaluate as the second argument
const html = await page.evaluate(dom => dom.outerHTML, htmlHandle);
// Destroy the handle once we are done with it
await htmlHandle.dispose();
console.log('html:', html);

page.$ can be understood as our usual document.querySelector, while page.$$ corresponds to document.querySelectorAll.
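A small sketch of page.$$ in action (the selector is illustrative): collect the text of every link on the page, disposing of each handle when done:

const linkHandles = await page.$$('a');
for (const handle of linkHandles) {
    const text = await page.evaluate(el => el.textContent, handle);
    console.log('link text:', text);
    await handle.dispose();
}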

Complete sample code

Automatic form submission

Go to the Google homepage, type in your keywords, and press Enter to search:

// Navigate to the URL
await page.goto('https://google.com/', {
    waitUntil: 'networkidle'
});

// Focus the search box
// await page.click('#lst-ib');
await page.focus('#lst-ib');

// Type the search keyword
await page.type('Spicy Chicken', {
    delay: 1000 // Interval between keypresses
});

// Press Enter to search
await page.press('Enter');
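After pressing Enter you usually want to wait for the results page before reading anything from it; a minimal sketch (the screenshot path is illustrative):

// Wait for the navigation triggered by the search, then capture the results
await page.waitForNavigation();
await page.screenshot({ path: 'search-results.png' });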

Complete sample code

A more complex code example

Simple actions add up to complex interactions; let’s look at two more concrete examples.

Crawling a single-page application: simulating an Ele.me takeout order

A traditional crawler is based on the HTTP protocol: it sends requests with a simulated UserAgent, then uses regular expressions to parse the desired content out of the returned HTML. This approach works well for pages that the server renders directly as complete HTML.
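For reference, a minimal sketch of that traditional approach in Node (built-in modules only; the URL and the regular expression are illustrative):

const https = require('https');

const options = {
    hostname: 'example.com',
    path: '/',
    headers: { 'User-Agent': 'Mozilla/5.0' } // Simulated UserAgent
};

https.get(options, res => {
    let html = '';
    res.on('data', chunk => (html += chunk));
    res.on('end', () => {
        // Extract <h1> contents with a regular expression
        const titles = html.match(/<h1[^>]*>[^<]*<\/h1>/g) || [];
        console.log(titles);
    });
});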

However, when it comes to single-page applications (SPAs) or pages behind login verification, this kind of crawler falls short.

With a headless browser, by contrast, we crawl pages entirely through simulated human interaction, so the page is fully initialized and rendered by the host browser environment; we no longer need to care about which HTTP requests the single-page application issues during its front-end initialization.
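A sketch of what that looks like (the URL and selector are hypothetical): navigate, let the browser run the SPA’s own initialization, wait for the rendered content, then read it from the DOM:

await page.goto('https://example.com/spa-list');
// Wait until the SPA has rendered the items we care about
await page.waitForSelector('.list-item');
const titles = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.list-item')).map(el => el.textContent)
);
console.log(titles);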

A headless browser provides click, input, and other commands that completely simulate human clicks and keystrokes, so there is no more worrying about that regex you just can’t write, haha.

Of course, there are some scenarios where it is more efficient to use a traditional HTTP crawler that writes regular matches.

Instead of comparing these trade-offs in detail, the following example demonstrates a complete human-computer interaction: ordering takeout with the mobile version of Ele.me.

Take a look at the effect first:

I won’t post all of the long code, but the key is a few lines:

const puppeteer = require('puppeteer');
const devices = require('puppeteer/DeviceDescriptors');
const iPhone6 = devices['iPhone 6'];

console.log('Start browser');
const browser = await puppeteer.launch();

console.log('Open the page');
const page = await browser.newPage();

// Emulate the mobile device
await page.emulate(iPhone6);

console.log('Enter the address in the address bar');
await page.goto(url);

console.log('Wait for the page to be ready');
await page.waitForSelector('.search-wrapper .search');

console.log('Click on the search box');
await page.tap('.search-wrapper .search');

await page.type("McDonald's", {
    delay: 200 // Interval between each keystroke
});

console.log('Tap the button to search');
await page.tap('button');

console.log('Wait for the search results to render.');
await page.waitForSelector('[class^="index-container"]');

console.log('Find the first takeaway!');
await page.tap('[class^="index-container"]');


console.log('Wait for the menu to render');
await page.waitForSelector('[class^="fooddetails-food-panel"]');


console.log("Let's just pick a dish.");
await page.tap('[class^="fooddetails-cart-button"]');

console.log('Wait two seconds for a clear view');
await page.waitFor(2000);

console.log('Submit the order');
await page.tap('[class^=submit-btn-submitbutton]');

// Close the browser
await browser.close();

The key steps are:

  • Loading the page
  • Wait for the DOM to be rendered and click
  • Continue to wait for the next DOM to render and click again

Key instructions:

  • page.tap (or page.click) performs a click
  • page.waitForSelector waits for the specified element to appear on the page, continuing immediately once it does; its selector argument is consistent with what document.querySelector accepts
  • page.waitFor can be passed a selector, a function, or a timeout in milliseconds; for example, page.waitFor(2000) pauses for two seconds before continuing. Here it is used purely for demonstration (see the sketch below)
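A quick sketch of the three forms (the selector comes from the example above; assuming a Puppeteer version where page.waitFor is available):

await page.waitFor('.search-wrapper .search'); // a selector
await page.waitFor(() => document.readyState === 'complete'); // a function run in the page
await page.waitFor(2000); // a timeout in milliseconds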

Each of the above instructions accepts a selector as an argument. Here are a few additional methods:

  • page.$(selector) is consistent with our usual document.querySelector(selector) and returns an ElementHandle (element handle)
  • page.$$(selector) is consistent with our usual document.querySelectorAll(selector) and returns an array of handles

In an ordinary, headed browser, we would select an element like this:

const body = document.querySelector('body');
const bodyInnerHTML = body.innerHTML;
console.log('bodyInnerHTML: ', bodyInnerHTML);

In a headless browser, we first need to get a handle, use it to read information from the page environment, and then destroy the handle:

const bodyHandle = await page.$('body');
const bodyInnerHTML = await page.evaluate(dom => dom.innerHTML, bodyHandle);
// Destroy the handle
await bodyHandle.dispose();
console.log('bodyInnerHTML:', bodyInnerHTML);

Alternatively, you can use page.$eval:

const bodyInnerHTML = await page.$eval('body', dom => dom.innerHTML);
console.log('bodyInnerHTML: ', bodyInnerHTML);

page.evaluate executes a script in the browser environment and can take a handle as its second argument, while page.$eval runs a function against a single selected DOM element.
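There is also a plural counterpart in Puppeteer versions that provide page.$$eval, which runs a function over all matching elements at once; a small sketch:

// Count every link on the page in a single call
const linkCount = await page.$$eval('a', links => links.length);
console.log('links:', linkCount);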

Complete sample code

Export web pages in bulk: downloading Turing books

I bought a lot of ebooks from the Turing community. It used to support pushing MOBI files to a Kindle or PDFs to an email address for reading, but the site often closed these push channels, leaving the web reader as the only option.

That is not very convenient for me, and the online reading pages are rendered by the server (they carry a lot of markup, so extracting well-typeset content is not easy), so the best approach is to open them for online reading and save them as PDFs or images.

Using the browser’s headless mode, I wrote a simple script that downloads purchased books as PDFs to the local machine, with support for downloading them in bulk.

Usage: pass in the account, password, and save path, for example:

$ node ./demo/download-ituring-books.js 'Username' 'password' './books'

Note: Puppeteer’s page.pdf() is currently only supported in headless mode, so executing it with headless mode disabled will report an error:

So start the script in headless mode:

const browser = await puppeteer.launch({
    // Disabling headless mode lets us watch the headless browser work,
    // but note that page.pdf() will fail if it is called while
    // headless mode is disabled
    // headless: false
});

Take a look at the implementation:

I have more than 20 books on my bookshelf that look like this after downloading them:

Complete sample code

What else can a headless browser do?

A headless browser essentially simulates what a human can do in an ordinary, headed browser. Naturally, a lot of human work can be handed over to a headless browser (the PDF download above, for example, automates what a person would otherwise do by opening each article page and pressing Ctrl+P or Command+P to save it locally).

Since automated tools can handle these tasks, we should not waste human labor on repetition. Besides the examples above, we can also build:

  • Automation tools, such as automatic form submission and automatic downloads
  • Automated UI tests, such as recording the correct DOM structure or screenshots, then automatically checking whether the DOM structure or screenshots still match after performing specified actions (UI assertions)
  • Periodic monitoring tools, such as taking periodic screenshots for weekly reports, or periodically checking that the pages on important service paths are available, combined with email alerts (see the sketch after this list)
  • Crawlers for pages that traditional HTTP crawlers cannot handle, using the headless browser’s rendering capabilities
  • Etc.
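A minimal sketch of the periodic-monitoring idea above (the URL, file names, and interval are all illustrative):

const puppeteer = require('puppeteer');

async function snapshot(url, path) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
    await page.screenshot({ path });
    await browser.close();
}

// Take a screenshot of the page once an hour
setInterval(() => {
    snapshot('https://example.com', `shot-${Date.now()}.png`).catch(console.error);
}, 60 * 60 * 1000);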

If you are interested, you can follow the column or send your resume to 'qingsheng.lqs####alibaba-inc.com'.replace('####', '@').

Original address: github.com/ProtoTeam/b…