Puppeteer in practice: an introduction based on our project experience

Puppeteer is a Node.js package released by the Chrome development team in 2017 for driving the Chrome browser programmatically. Our team has been a loyal user of Puppeteer since its release (mainly because of the many pitfalls we hit with PhantomJS). This article introduces Puppeteer and shares our day-to-day practice with it.

Before looking at Puppeteer, let's take a look at the Chrome DevTools Protocol (CDP).

What is the Chrome DevTools Protocol

CDP is based on WebSocket, which gives it a fast data channel to the browser kernel
CDP is divided into multiple domains (DOM, Debugger, Network, Profiler, Console…), and each domain defines its own Commands and Events
Tools built on CDP can debug and analyze Chrome. Chrome Developer Tools itself, for example, is implemented on top of CDP
If you start Chrome with --remote-debugging-port, the DevTools front end for debugging every open tab is served on that port, and an HTTP service on the same port provides the following interfaces:

GET /json/version                     # Get some meta information about the browser
GET /json or /json/list               # List the pages currently open in the browser
GET /json/protocol                    # Get the protocol description of the current CDP version
GET /json/new?{url}                   # Open a new tab page
GET /json/activate/{targetId}         # Activate a page so that it becomes the currently displayed page
GET /json/close/{targetId}            # Close a page
GET /devtools/inspector.html          # Open the developer debugging tool for the current page
WebSocket /devtools/page/{targetId}   # WebSocket address of a page

Many useful tools are built on CDP, such as Chrome Developer Tools, chrome-remote-interface and Puppeteer.
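
To make the Commands and Events of CDP more concrete, here is a minimal sketch of the JSON messages that travel over the WebSocket. The message shapes follow the protocol (a command carries id, method and params, a response echoes the id, an event has a method but no id). The helper names buildCommand and classifyMessage are made up for this illustration and are not part of any library.

```javascript
// Hypothetical helpers that illustrate the shape of CDP messages.
let nextId = 0;

// A command sent to the browser: {id, method, params}
function buildCommand(method, params = {}) {
    nextId += 1;
    return JSON.stringify({ id: nextId, method, params });
}

// A message received from the browser is either the answer to a command
// (it echoes the command id) or an event (it has a method but no id).
function classifyMessage(raw) {
    const message = JSON.parse(raw);
    if (typeof message.id === 'number') {
        return message.error ? 'error' : 'response';
    }
    return 'event';
}

const command = buildCommand('Page.navigate', { url: 'https://www.baidu.com' });
console.log(command);
console.log(classifyMessage('{"id":1,"result":{"frameId":"abc"}}'));
console.log(classifyMessage('{"method":"Network.requestWillBeSent","params":{}}'));
```

Puppeteer hides this bookkeeping (id allocation, matching responses to commands, dispatching events), which is why we rarely need to touch raw CDP messages.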

What is Headless Chrome

Headless Chrome runs Chrome in an environment without a user interface
Chrome can be operated from the command line or from a programming language
No human intervention is needed, so operation is more stable
With no interface to draw, and less CSS/JS loading and page rendering than in a regular browser, headless tests run faster than tests in a regular browser
To start Chrome in headless mode, add the --headless parameter when you start Chrome

alias chrome="/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome"  # Mac OS X command alias
chrome --headless --remote-debugging-port=9222 --disable-gpu                   # Enable remote debugging
chrome --headless --disable-gpu --dump-dom https://www.baidu.com               # Get the page DOM
chrome --headless --disable-gpu --screenshot https://www.baidu.com             # Take a screenshot

What other parameters can be passed to Chrome at startup? See the published list of Chromium command-line switches.

What is Puppeteer

Puppeteer is a Node.js library
Puppeteer provides a set of APIs that control the behavior of Chromium/Chrome through the Chrome DevTools Protocol
By default, Puppeteer starts Chrome in headless mode. You can also start Chrome with a visible interface by passing a parameter
By default, each Puppeteer release is bound to a recent Chromium version, and you can bind it to a different version yourself
Puppeteer lets us communicate with the browser without knowing much about the underlying CDP protocol

What can Puppeteer do

The official documentation says: "Most things that you can do manually in the browser can be done using Puppeteer". So what can it do?

Take web page screenshots or generate PDFs
Crawl SPA or SSR sites
Automate UI tests: simulate form submission, keyboard input, clicks and other behavior
Capture a timeline trace of your site to help diagnose performance problems
Create an up-to-date automated testing environment and run test cases with the latest JS features in the latest Chrome
Test Chrome extensions
…

The layered structure of the Puppeteer API

The API hierarchy in Puppeteer basically mirrors the hierarchy in the browser. Here are some of the commonly used classes:

Browser: corresponds to a browser instance. A Browser can contain multiple BrowserContexts
BrowserContext: corresponds to one browser session. Each BrowserContext has its own separate session (cookies and cache are not shared), just like opening a normal Chrome window and then opening another window in incognito mode. A BrowserContext can contain multiple Pages
Page: represents a tab page, created with browserContext.newPage() or browser.newPage(). browser.newPage() creates the page in the default BrowserContext. A Page can contain multiple Frames
Frame: a frame. Each page has one main frame (page.mainFrame()) and may have multiple child frames, which are mainly created by iframe tags
ExecutionContext: the JavaScript execution environment. Each Frame has a default JavaScript execution environment
ElementHandle: corresponds to an element node in the DOM. Through this instance you can click on the element, fill in a form and so on. Elements can be obtained with selectors, XPath, etc.
JSHandle: corresponds to a JavaScript object in the page. ElementHandle inherits from JSHandle. Since we cannot operate on objects in the page directly, they are wrapped as JSHandle objects that provide the related functions
CDPSession: communicates directly with the native CDP. Messages are sent with session.send and received with session.on, which makes it possible to implement functions that the Puppeteer API does not cover
Coverage: gets JavaScript and CSS code coverage
Tracing: captures performance data for analysis
Response: a response received by the page
Request: a request made by the page
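
As a rough illustration of this hierarchy, the sketch below walks from a Browser down to its frames and returns a plain summary. It only uses browser.browserContexts(), browserContext.pages(), page.url() and page.frames(). The function name summarizeBrowser is an assumption made for this example.

```javascript
// Walks Browser -> BrowserContext -> Page -> Frame and returns a plain summary.
// `browser` is expected to be a Puppeteer Browser (or any object with the same methods).
async function summarizeBrowser(browser) {
    const summary = [];
    for (const context of browser.browserContexts()) {
        const pages = await context.pages();
        summary.push({
            pages: pages.map(page => ({
                url: page.url(),
                frameCount: page.frames().length // main frame plus all child frames
            }))
        });
    }
    return summary;
}

// Usage with a real browser (requires puppeteer to be installed):
// const browser = await require('puppeteer').launch();
// console.log(JSON.stringify(await summarizeBrowser(browser), null, 2));
```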

How to create a Browser instance

Puppeteer provides two ways to create a Browser instance:

puppeteer.connect: connects to an existing Chrome instance
puppeteer.launch: launches a new Chrome instance each time

const puppeteer = require('puppeteer');
const request = require('request-promise-native');

// Use puppeteer.launch to launch Chrome
(async () => {
    const browser = await puppeteer.launch({
        headless: false,   // Start the browser with a visible interface
        slowMo: 100,       // Slow down browser execution to make it easier to observe during tests
        args: [            // Chrome startup parameters, as described above
            '--no-sandbox',
            '--window-size=1280,960'
        ]
    });
    const page = await browser.newPage();
    await page.goto('https://www.baidu.com');
    await page.close();
    await browser.close();
})();

// Use puppeteer.connect to connect to an existing Chrome instance
(async () => {
    // Obtain the WebSocket URL through the HTTP interface on port 9222
    const version = await request({
        uri: 'http://127.0.0.1:9222/json/version',
        json: true
    });
    // Connect directly to the existing Chrome
    const browser = await puppeteer.connect({
        browserWSEndpoint: version.webSocketDebuggerUrl
    });
    const page = await browser.newPage();
    await page.goto('https://www.baidu.com');
    await page.close();
    await browser.disconnect();
})();

A comparison of these two approaches:

puppeteer.launch has to start a new Chrome process every time, which takes 100 to 150 ms on average, so its performance is poor
puppeteer.connect lets callers share the same Chrome instance, which saves the time spent starting and closing the browser
With puppeteer.launch, the startup parameters can be changed dynamically on every launch
With puppeteer.connect, you can connect remotely to a Chrome instance, which makes it easy to deploy Chrome on a different machine
With puppeteer.connect, multiple pages share one Chrome instance, and pages crash occasionally, so you need concurrency control and have to restart the Chrome instance periodically
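
One way to combine the two approaches is to try puppeteer.connect first and fall back to puppeteer.launch when no shared instance is reachable. The sketch below is an assumption about how such a helper could look, not something Puppeteer ships: getBrowser and releaseBrowser are hypothetical names, and the puppeteer module is passed in as a parameter.

```javascript
// Tries to reuse a running Chrome first and only launches a new one if that fails.
// `puppeteer` is passed in so the function does not depend on how the module was loaded.
async function getBrowser(puppeteer, browserWSEndpoint, launchOptions = {}) {
    if (browserWSEndpoint) {
        try {
            const browser = await puppeteer.connect({ browserWSEndpoint });
            return { browser, owned: false }; // shared instance: disconnect() when done
        } catch (err) {
            console.warn(`connect failed (${err.message}), launching a new Chrome`);
        }
    }
    const browser = await puppeteer.launch(launchOptions);
    return { browser, owned: true }; // our own instance: close() when done
}

// Releases the browser in the way that matches how it was obtained.
async function releaseBrowser({ browser, owned }) {
    if (owned) {
        await browser.close();
    } else {
        await browser.disconnect();
    }
}
```

The owned flag matters because calling close() on a shared instance would shut down Chrome for every other client, while disconnect() only drops our own connection.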

How to wait for loading

In practice we often face questions such as how to tell when a page has finished loading, when to take a screenshot, and when to click a button. So how do we wait for loading?

Let's divide the APIs that wait for loading into three categories:

Loading and navigating pages

page.goto: opens a new page
page.goBack: goes back to the previous page
page.goForward: goes forward to the next page
page.reload: reloads the page
page.waitForNavigation: waits for the page to navigate

Almost all operations in Puppeteer are asynchronous. The APIs above all involve opening a page, so the question is when the function call should be considered complete. These functions all accept two parameters, waitUntil and timeout: waitUntil names the event that must occur before the call counts as complete, and timeout is the maximum time to wait, after which an exception is thrown.

await page.goto('https://www.baidu.com', {
    timeout: 30 * 1000,
    waitUntil: [
        'load',              // Wait for the "load" event to fire
        'domcontentloaded',  // Wait for the "DOMContentLoaded" event to fire
        'networkidle0',      // No network connections for at least 500 ms
        'networkidle2'       // No more than two network connections for at least 500 ms
    ]
});

waitUntil supports the four events above, and you can set one or more of them as the completion condition, depending on your needs. The 500 ms used by networkidle0 and networkidle2 is still rather long for time-sensitive users.

Waiting for elements, requests and responses

page.waitForXPath: waits for an element matching the XPath to appear and returns the corresponding ElementHandle instance
page.waitForSelector: waits for an element matching the selector to appear and returns the corresponding ElementHandle instance
page.waitForResponse: waits for a response to finish and returns the Response instance
page.waitForRequest: waits for a request to be made and returns the Request instance

await page.waitForXPath('//img');
await page.waitForSelector('#uniqueId');
await page.waitForResponse('https://d.youdata.netease.com/api/dash/hello');
await page.waitForRequest('https://d.youdata.netease.com/api/dash/hello');

Custom waiting

If none of the waiting methods above meets our needs, Puppeteer also provides two functions:

page.waitForFunction: waits for a custom function to return a truthy value in the page and returns a JSHandle instance
page.waitFor: waits for a fixed amount of time

await page.goto(url, {
    timeout: 120000,
    waitUntil: 'networkidle2'
});
// We can define our own "loading finished" flag in the page and set it at the appropriate point in time.
// If renderdone is true, the page rendered successfully and we can take a screenshot.
// If renderdone is an object, the page failed to load; we can catch this and report it.
const renderdoneHandle = await page.waitForFunction('window.renderdone', {
    polling: 120
});
const renderdone = await renderdoneHandle.jsonValue();
if (typeof renderdone === 'object') {
    console.log(`Failed to load page: report ${renderdone.componentId} error - ${renderdone.message}`);
} else {
    console.log('Page loaded successfully');
}
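
The page itself has to set window.renderdone for this to work. A minimal sketch of that page-side half might look like the following. The function name reportRenderResult is hypothetical, the field names componentId and message are taken from the code above, and the window object is passed in as a parameter only so that the snippet can also be tried outside a browser.

```javascript
// Page-side half of the convention: in production this code runs in the page, not in Node.js.
// `win` is the page's window object.
function reportRenderResult(win, error) {
    if (error) {
        // An object tells the Node.js side which component failed and why
        win.renderdone = { componentId: error.componentId, message: error.message };
    } else {
        // true means that rendering finished successfully
        win.renderdone = true;
    }
    return win.renderdone;
}

// Success and failure, tried against plain objects standing in for window
console.log(reportRenderResult({}, null));
console.log(reportRenderResult({}, { componentId: 'chart-1', message: 'query timed out' }));
```

Both values are truthy, so page.waitForFunction('window.renderdone') resolves in either case and the Node.js side tells success from failure by the type of the value.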

Two separate environments

When using Puppeteer, we almost always need to exchange data between two environments: the Node.js environment in which Puppeteer runs, and the page DOM environment that Puppeteer operates on. It is important to understand both.

First, Puppeteer provides many useful functions for executing code in the page DOM environment, which are described later
Second, Puppeteer provides ElementHandle and JSHandle, which wrap elements and objects of the page DOM environment as Node.js objects. The methods of these wrapper objects let us operate on the page DOM directly

10 use cases that show you how to use Puppeteer

Here are 10 use cases, each introducing some of the APIs, to show how Puppeteer is used:

Case 1: screenshots

With Puppeteer we can take a screenshot of a whole page or of a single element in the page:

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Set the size of the visible area
    await page.setViewport({width: 1920, height: 800});
    await page.goto('https://youdata.163.com');
    // Take a screenshot of the entire page
    await page.screenshot({
        path: './files/capture.png',  // Path where the image is saved
        type: 'png',
        fullPage: true                // Capture the full scrollable page
        // clip: {x: 0, y: 0, width: 1920, height: 800}
    });
    // Take a screenshot of a single element on the page
    const [element] = await page.$x('/html/body/section[4]/div/div[2]');
    await element.screenshot({
        path: './files/element.png'
    });
    await page.close();
    await browser.close();
})();

How do we get an element in the page?

page.$('#uniqueId'): gets the first element matching a selector
page.$$('div'): gets all elements matching a selector
page.$x('//img'): gets all elements matching an XPath
page.waitForXPath('//img'): waits for an element matching an XPath to appear
page.waitForSelector('#uniqueId'): waits for an element matching a selector to appear

Case 2: simulating user login

(async () => {
    const browser = await puppeteer.launch({
        slowMo: 100,    // Slow down
        headless: false,
        defaultViewport: {width: 1440, height: 780},
        ignoreHTTPSErrors: false,     // Whether to ignore HTTPS errors
        args: ['--start-fullscreen']  // Open the page in full screen
    });
    const page = await browser.newPage();
    await page.goto('https://demo.youdata.com');
    // Enter the account and password
    const uniqueIdElement = await page.$('#uniqueId');
    await uniqueIdElement.type('[email protected]', {delay: 20});
    const passwordElement = await page.$('#password');
    await passwordElement.type('123456', {delay: 20});
    // Click OK to log in
    const okButtonElement = await page.$('#btn-ok');
    // Clicking the button triggers a navigation, so start waiting with page.waitForNavigation()
    // at the same time as the click; the login has succeeded once the navigation completes
    await Promise.all([
        okButtonElement.click(),
        page.waitForNavigation()
    ]);
    console.log('Admin login successful');
    await page.close();
    await browser.close();
})();

What functions does ElementHandle provide for operating on elements?

elementHandle.click(): clicks the element
elementHandle.tap(): simulates a finger tap
elementHandle.focus(): focuses the element
elementHandle.hover(): hovers the mouse over the element
elementHandle.type('hello'): types text into an input field

Case 3: request interception

In some scenarios it is useful to intercept requests that are unnecessary in order to improve performance. We can listen for the page's request event and intercept requests there. The prerequisite is that request interception has been enabled with page.setRequestInterception(true).

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const blockTypes = new Set(['image', 'media', 'font']);
    await page.setRequestInterception(true); // Enable request interception
    page.on('request', request => {
        const type = request.resourceType();
        const shouldBlock = blockTypes.has(type);
        if (shouldBlock) {
            // Block the request directly
            return request.abort();
        } else {
            // Rewrite the request
            return request.continue({
                // url, method, postData and headers can be overridden
                headers: Object.assign({}, request.headers(), {
                    'puppeteer-test': 'true'
                })
            });
        }
    });
    await page.goto('https://demo.youdata.com');
    await page.close();
    await browser.close();
})();

What events does Page provide?

page.on('close'): the page is closed
page.on('console'): a console API is called
page.on('error'): the page crashes
page.on('load'): the page has loaded
page.on('request'): a request is issued
page.on('requestfailed'): a request fails
page.on('requestfinished'): a request finishes successfully
page.on('response'): a response is received
page.on('workercreated'): a web worker is created
page.on('workerdestroyed'): a web worker is destroyed
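
As a small illustration of these events, the sketch below forwards a few of them to a logger. It also uses the pageerror event (an uncaught exception inside the page), which is not in the list above but is provided by Puppeteer. The function name attachPageLogging and the log format are assumptions made for this example, and the dry run uses a plain Node.js EventEmitter in place of a real Page.

```javascript
const { EventEmitter } = require('events');

// Subscribes to a few Page events and forwards a readable line to `log`.
function attachPageLogging(page, log = console.log) {
    page.on('console', msg => log(`[console] ${msg.text()}`));
    page.on('pageerror', err => log(`[pageerror] ${err.message}`));
    page.on('requestfailed', req => {
        const failure = req.failure(); // null when no failure details are available
        log(`[requestfailed] ${req.url()} ${failure ? failure.errorText : ''}`);
    });
    page.on('response', res => {
        if (res.status() >= 400) {
            log(`[response] ${res.status()} ${res.url()}`);
        }
    });
}

// Dry run with a plain EventEmitter standing in for a Page
const fakePage = new EventEmitter();
attachPageLogging(fakePage);
fakePage.emit('console', { text: () => 'hello from the page' });
fakePage.emit('response', { status: () => 404, url: () => 'https://example.com/missing.png' });
```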

Case 4: getting WebSocket responses

Puppeteer does not currently provide a native API for handling WebSockets, but we can get at them through the lower-level Chrome DevTools Protocol (CDP).

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Create a CDP session
    const cdpSession = await page.target().createCDPSession();
    // Enable the Network domain so that we can listen for Network events of the Chrome DevTools Protocol
    await cdpSession.send('Network.enable');
    // Listen for the webSocketFrameReceived event to get the corresponding data
    cdpSession.on('Network.webSocketFrameReceived', frame => {
        const payloadData = frame.response.payloadData;
        if (payloadData.includes('push:query')) {
            // Parse payloadData to get the data pushed by the server
            const res = JSON.parse(payloadData.match(/\{.*\}/)[0]);
            if (res.code !== 200) {
                console.log(`Error calling websocket interface: code=${res.code}, message=${res.message}`);
            } else {
                console.log('Get webSocket data:', res.result);
            }
        }
    });
    await page.goto('https://netease.youdata.163.com/dash/142161/reportExport?pid=700209493');
    await page.waitForFunction('window.renderdone', {polling: 20});
    await page.close();
    await browser.close();
})();
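
The payload parsing in the listener can be pulled out into a small function that does not throw on frames without JSON. The sketch below is one possible version. The function name parsePushPayload is hypothetical, and the sample payloads are invented for illustration, since the real frame format depends on the server.

```javascript
// Extracts the JSON object embedded in a WebSocket frame payload and parses it.
// Returns null when the payload carries no JSON object or the JSON is invalid.
function parsePushPayload(payloadData) {
    const match = payloadData.match(/\{.*\}/s);
    if (!match) {
        return null;
    }
    try {
        return JSON.parse(match[0]);
    } catch (err) {
        return null;
    }
}

// The sample payloads below are invented for illustration
console.log(parsePushPayload('42["push:query",{"code":200,"result":[1,2,3]}]'));
console.log(parsePushPayload('3probe'));
```

Returning null instead of throwing keeps one malformed frame from crashing the event listener.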

Case 5: injecting JavaScript code

The most powerful feature of Puppeteer is that you can execute any JavaScript code you want in the browser. The example below comes from crawling the inbox user list of the 188 mailbox. The number of iframes grew so large that the browser froze and could not continue, so I added a script to the crawler code that deletes the useless iframes:

const crypto = require('crypto');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://webmail.vip.188.com');
    // Register a Node.js function so that it can be called from the browser
    await page.exposeFunction('md5', text =>
        crypto.createHash('md5').update(text).digest('hex')
    );
    // Use page.evaluate to run the code that deletes useless iframes in the browser
    await page.evaluate(async () => {
        // Take a snapshot, because the collection returned by getElementsByTagName is live
        const iframes = Array.from(document.getElementsByTagName('iframe'));
        for (let i = 3; i < iframes.length - 1; i++) {
            const iframe = iframes[i];
            if (iframe.name.includes('frameBody')) {
                iframe.src = 'about:blank';
                try {
                    iframe.contentWindow.document.write(' ');
                    iframe.contentWindow.document.clear();
                } catch (e) {}
                // Remove the iframe from the page
                iframe.parentNode.removeChild(iframe);
            }
        }
        // Call the function that runs in the Node.js environment from the page
        const myString = 'PUPPETEER';
        const myHash = await window.md5(myString);
        console.log(`md5 of ${myString} is ${myHash}`);
    });
    await page.close();
    await browser.close();
})();

What functions are available for executing code in the browser environment?

page.evaluate(pageFunction[, ...args]): executes a function in the browser environment
page.evaluateHandle(pageFunction[, ...args]): executes a function in the browser environment and returns a JSHandle object
page.$$eval(selector, pageFunction[, ...args]): passes all elements matching the selector to the function and executes it in the browser environment
page.$eval(selector, pageFunction[, ...args]): passes the first element matching the selector to the function and executes it in the browser environment
page.evaluateOnNewDocument(pageFunction[, ...args]): executes the function whenever a new document is created in the browser environment, before any of the page's own scripts run
page.exposeFunction(name, puppeteerFunction): registers a function on the window object. The function itself runs in the Node.js environment, which makes it possible to call Node.js libraries from the browser environment

Case 6: getting elements inside an iframe

Each Frame has its own execution context, and a function cannot be executed across Frames. A page can have multiple Frames, which are mainly created by embedding iframe tags. Most of the functions on page are actually shorthand for page.mainFrame().xx. Frames form a tree, and we can traverse all of them with frame.childFrames(). To execute a function in another Frame, you must first get that Frame and then call the function on it.

When logging in to the 188 mailbox, the login window is actually an embedded iframe. The following code gets that iframe and logs in through it:

(async () => {
    const browser = await puppeteer.launch({headless: false, slowMo: 50});
    const page = await browser.newPage();
    await page.goto('https://www.188.com');
    // Click "Login with password"
    const passwordLogin = await page.waitForXPath('//*[@id="qcode"]/div/div[2]/a');
    await passwordLogin.click();
    for (const frame of page.mainFrame().childFrames()) {
        // Find the iframe of the login page by its URL
        if (frame.url().includes('passport.188.com')) {
            await frame.type('.dlemail', '[email protected]');
            await frame.type('.dlpwd', '123456');
            await Promise.all([
                frame.click('#dologin'),
                page.waitForNavigation()
            ]);
            break;
        }
    }
    await page.close();
    await browser.close();
})();
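
The loop above only looks at the direct children of the main frame. Because frames form a tree, a nested iframe needs a recursive search. The sketch below is one way to write it using only frame.childFrames(). The function name findFrame is an assumption made for this example.

```javascript
// Depth-first search through a frame and all of its descendants.
// Returns the first frame for which `predicate` returns true, or null.
function findFrame(frame, predicate) {
    if (predicate(frame)) {
        return frame;
    }
    for (const child of frame.childFrames()) {
        const found = findFrame(child, predicate);
        if (found) {
            return found;
        }
    }
    return null;
}

// Usage with a real page (assumes `page` is a Puppeteer Page):
// const loginFrame = findFrame(page.mainFrame(), f => f.url().includes('passport.188.com'));
```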

Case 7: page performance analysis

Puppeteer provides a tool for analyzing page performance. At present it is fairly limited: it only records the raw performance data of a page, and the analysis is left to us.

A browser can only run one trace at a time
In the Performance panel of DevTools, you can load the resulting JSON file and view the analysis
We can write scripts that parse the data in trace.json for automated analysis
Tracing lets us understand page loading speed and script execution performance

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.tracing.start({path: './files/trace.json'});
    await page.goto('https://www.google.com');
    await page.tracing.stop();
    /* continue analysis from 'trace.json' */
    await browser.close();
})();
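
As a starting point for such automated analysis, the sketch below sums the duration of "complete" events per event name. It assumes the file follows the Chrome Trace Event Format: a traceEvents array whose complete events have ph set to 'X' and a dur field in microseconds. The function names summarizeTrace and summarizeTraceFile are hypothetical.

```javascript
const fs = require('fs');

// Sums the duration of complete events (ph === 'X') per event name and
// returns the `top` most expensive names, with durations in milliseconds.
function summarizeTrace(trace, top = 5) {
    const events = Array.isArray(trace) ? trace : trace.traceEvents || [];
    const totals = new Map();
    for (const event of events) {
        if (event.ph === 'X' && typeof event.dur === 'number') {
            totals.set(event.name, (totals.get(event.name) || 0) + event.dur);
        }
    }
    return [...totals.entries()]
        .sort((a, b) => b[1] - a[1])
        .slice(0, top)
        .map(([name, micros]) => ({ name, ms: micros / 1000 })); // dur is in microseconds
}

// Reads a trace file from disk; the path is whatever was passed to page.tracing.start
function summarizeTraceFile(path, top) {
    return summarizeTrace(JSON.parse(fs.readFileSync(path, 'utf8')), top);
}
```

Complete events can be nested inside each other, so these totals include the time of child events. They are a rough ranking rather than exact self time.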

Case 8: file upload and download

Automated tests often need to upload and download files. How is this done in Puppeteer?

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Set the download path through a CDP session
    const cdpSession = await page.target().createCDPSession();
    await cdpSession.send('Page.setDownloadBehavior', {
        behavior: 'allow',                // Allow all download requests
        downloadPath: 'path/to/download'  // Set the download path
    });
    // Click the button to trigger the download
    await (await page.waitForSelector('#someButton')).click();
    // Wait for the file to appear by polling for it.
    // waitForFile is not a Puppeteer API; it is a helper you have to write yourself
    await waitForFile('path/to/download/filename');

    // Upload: inputElement must be an <input type="file"> element
    const inputElement = await page.waitForXPath('//input[@type="file"]');
    await inputElement.uploadFile('/path/to/file');
    await browser.close();
})();
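
The waitForFile helper called in the download example is not part of Puppeteer. A minimal polling version, written here as an assumption about what such a helper could look like, might be:

```javascript
const fs = require('fs');

// Polls until `filePath` exists, or rejects after `timeout` milliseconds.
function waitForFile(filePath, { timeout = 30000, interval = 200 } = {}) {
    const deadline = Date.now() + timeout;
    return new Promise((resolve, reject) => {
        const check = () => {
            if (fs.existsSync(filePath)) {
                resolve(filePath);
            } else if (Date.now() >= deadline) {
                reject(new Error(`Timed out waiting for ${filePath}`));
            } else {
                setTimeout(check, interval);
            }
        };
        check();
    });
}
```

This only checks that the final file name exists. Chrome typically writes to a temporary .crdownload file and renames it when the download finishes, so waiting for the final name is usually enough, but treat that as an assumption to verify in your own environment.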

Case 9: switching to a new tab page

When clicking a button opens a new tab, a new page is created. How do we get the Page instance for it? We can listen for the targetcreated event on Browser, which indicates that a new page has been created:

const page = await browser.newPage();
await page.goto(url);
const btn = await page.waitForSelector('#btn');
// Before clicking the button, create a Promise that resolves to the Page object of the new tab
const newPagePromise = new Promise(res =>
    browser.once('targetcreated',
        target => res(target.page())
    )
);
await btn.click();
// After clicking the button, wait for the Page object of the new tab
const newPage = await newPagePromise;
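
If the click never opens a tab, the Promise above stays pending forever. A slightly more defensive variant adds a timeout and ignores targets that are not pages. This is a sketch under those assumptions, and the function name waitForNewPage and the default timeout are made up for the example.

```javascript
// Resolves with the Page of the next tab that is created, or rejects after `timeout` ms.
// Call it BEFORE triggering the click, so that the event cannot be missed.
function waitForNewPage(browser, timeout = 10000) {
    return new Promise((resolve, reject) => {
        const onTarget = target => {
            if (target.type() !== 'page') {
                return; // ignore workers and other non-page targets
            }
            clearTimeout(timer);
            browser.off('targetcreated', onTarget);
            target.page().then(resolve, reject);
        };
        const timer = setTimeout(() => {
            browser.off('targetcreated', onTarget);
            reject(new Error(`No new page within ${timeout} ms`));
        }, timeout);
        browser.on('targetcreated', onTarget);
    });
}

// Usage (assumes `browser` and `btn` from the example above):
// const newPagePromise = waitForNewPage(browser, 5000);
// await btn.click();
// const newPage = await newPagePromise;
```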

Case 10: emulating different devices

Puppeteer can emulate different devices. puppeteer.devices defines configuration information for many devices, mainly viewport and userAgent:

const puppeteer = require('puppeteer');
const iPhone = puppeteer.devices['iPhone 6'];
puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.emulate(iPhone);
  await page.goto('https://www.google.com');
  await browser.close();
});

Puppeteer vs PhantomJS

Puppeteer drives a real browser, so all Chrome features are supported
It can provide environments with different versions of Chrome
It is maintained by the Chrome team, so it has better compatibility and prospects
The headless parameter can be set dynamically, which makes debugging convenient, and with --remote-debugging-port=9222 you can open the debugging interface
It supports the latest JS syntax, such as async/await
PhantomJS is complicated to install, and its API is unfriendly to call
The main difference between the two is that PhantomJS uses an older version of WebKit as its rendering engine
It is faster and performs better than PhantomJS. Here is a benchmark someone else ran comparing the two:

Headless Chrome vs PhantomJS Benchmark

Puppeteer application scenarios in our team

Performance and optimization

About shared memory:

Chrome uses /dev/shm shared memory by default, but Docker's default /dev/shm is only 64 MB, which is often not enough. There are two ways to deal with this:
- Start the Docker container with --shm-size=1gb to enlarge /dev/shm. Note that Swarm does not currently support the shm-size parameter
- Start Chrome with --disable-dev-shm-usage so that it does not use /dev/shm shared memory

Use the same browser instance where possible, so that the cache can be shared
Intercept requests for resources that do not need to be loaded
Just as Chrome gets sluggish when we open many tabs ourselves, too many open pages will slow the browser down, so the number of tabs must be limited
A Chrome instance that has been running for a long time will inevitably develop memory leaks, page crashes and similar problems, so the Chrome instance needs to be restarted periodically
To improve performance, turn off features you do not need with flags such as --no-sandbox (disables the sandbox) and --disable-extensions
Avoid page.waitFor(1000) as far as possible; it is better to let the program decide for itself when it is ready
Note that WebSocket sticky-session issues may occur when you connect to a Chrome instance
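
To keep the number of open tabs under control, tasks can be passed through a small concurrency limiter. The sketch below is a generic implementation in plain JavaScript, not a Puppeteer feature. The name createLimiter and the limit of 5 in the usage comment are assumptions for illustration.

```javascript
// Returns a function that runs async tasks with at most `max` of them in flight.
function createLimiter(max) {
    let active = 0;
    const queue = [];
    const next = () => {
        if (active >= max || queue.length === 0) {
            return;
        }
        active += 1;
        const { task, resolve, reject } = queue.shift();
        Promise.resolve()
            .then(task)
            .then(resolve, reject)
            .finally(() => {
                active -= 1;
                next(); // start the next queued task, if any
            });
    };
    return task => new Promise((resolve, reject) => {
        queue.push({ task, resolve, reject });
        next();
    });
}

// Usage sketch (assumes `browser` is a Puppeteer Browser and `urls` is an array):
// const limit = createLimiter(5);
// await Promise.all(urls.map(url => limit(async () => {
//     const page = await browser.newPage();
//     try { await page.goto(url); } finally { await page.close(); }
// })));
```

Closing the page in a finally block matters as much as the limit itself, because a page that is never closed keeps its memory for the lifetime of the shared Chrome instance.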

References

Puppeteer guides
Puppeteer performance optimization and execution speed improvement
PhantomJs death, Chrome-Headless birth
Headless Chrome vs PhantomJS Benchmark
Scraping iframes with Puppeteer