Puppeteer is a Node.js package released by the Chrome development team in 2017, alongside Headless Chrome, for driving Chrome programmatically. It provides a high-level API to control headless Chrome or Chromium via the DevTools Protocol, and it can also be configured to use full (non-headless) Chrome or Chromium.

Before learning about Puppeteer, let’s take a look at the Chrome DevTools Protocol (CDP) and Headless Chrome.

What is the Chrome DevTools Protocol

  • CDP is based on WebSocket, which provides a fast data channel to the browser kernel.
  • CDP is divided into multiple domains (DOM, Debugger, Network, Profiler, Console…). Each domain defines its own Commands and Events.
  • Tools built on CDP can debug and analyze Chrome. Chrome Developer Tools itself is implemented on top of CDP.
  • Other useful tools implemented on top of CDP include chrome-remote-interface and Puppeteer.
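
To make the command/event split concrete, here is a minimal sketch of what CDP messages look like on the wire. The helper names (`makeCommand`, `classifyMessage`) are made up for this illustration; only the JSON shapes follow the protocol: commands and their responses share an `id`, events carry no `id`, and method names have the form `Domain.name`.

```javascript
// Illustrative helpers (not part of any library) showing the shape of CDP messages.
let nextId = 0;

// A command is a JSON object with a unique id, a "Domain.command" method and params.
function makeCommand(method, params = {}) {
  nextId += 1;
  return JSON.stringify({id: nextId, method, params});
}

// Responses echo the id of the command they answer; events have no id.
function classifyMessage(raw) {
  const message = JSON.parse(raw);
  if ('id' in message) {
    return {kind: message.error ? 'error' : 'response', id: message.id};
  }
  const [domain, name] = message.method.split('.');
  return {kind: 'event', domain, name};
}

console.log(makeCommand('Page.navigate', {url: 'https://example.com'}));
console.log(classifyMessage('{"method":"Network.requestWillBeSent","params":{}}'));
```

In practice these strings travel over the WebSocket connection; Puppeteer hides this bookkeeping behind its API.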

What is Headless Chrome

  • You can run Chrome in an environment without a user interface.
  • You can operate Chrome from the command line or from a programming language.
  • No human intervention is needed, so operation is more stable.
  • Start Chrome in headless mode by adding the --headless parameter when you launch it.

Headless Chrome is a UI-free mode of the Chrome browser: it lets you run applications with all of Chrome’s supported features without opening a browser window.
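
As a rough sketch of "operating Chrome from a programming language", the snippet below builds the command-line flags for a headless run and starts the browser with Node's `child_process`. The function names and the `chromePath` argument are assumptions for this example (the binary location differs per platform); `--headless` and `--remote-debugging-port` are standard Chrome switches.

```javascript
const {spawn} = require('child_process');

// Build the command-line flags for a headless run.
function headlessArgs(url, {debugPort} = {}) {
  const args = ['--headless'];
  if (debugPort) {
    // Expose the DevTools Protocol on this port so that tools can attach
    args.push(`--remote-debugging-port=${debugPort}`);
  }
  args.push(url);
  return args;
}

// Start Chrome; chromePath is an assumption and must point at a real binary.
function startHeadlessChrome(chromePath, url, options) {
  return spawn(chromePath, headlessArgs(url, options), {stdio: 'inherit'});
}

console.log(headlessArgs('https://example.com', {debugPort: 9222}));
```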

What is Puppeteer

  • Puppeteer is a Node.js library.
  • Puppeteer provides a set of APIs that control the behavior of Chromium/Chrome through the Chrome DevTools Protocol.
  • By default, Puppeteer starts Chrome in headless mode. You can also start Chrome with a visible window by passing a launch parameter.
  • Each Puppeteer release is bound to a recent Chromium version by default, and it can be pointed at a different version if needed.
  • Puppeteer lets us communicate with the browser without knowing much about the underlying CDP.

What can Puppeteer do

According to the official documentation, most things that you can do manually in the browser can be done with Puppeteer. For example:

  • Generate screenshots and PDFs of pages.
  • Crawl a SPA (single-page application) and generate pre-rendered content (SSR).
  • Automate form submission, UI testing, keyboard input, etc.
  • Create an up-to-date automated test environment. Run tests directly in the latest version of Chrome, using the latest JavaScript and browser features.
  • Capture a timeline trace of the site to help diagnose performance problems.
  • Test Chrome extensions.

The Puppeteer API is layered

The API hierarchy in Puppeteer is basically the same as that in the browser. Here are some of the classes that are commonly used:

  • Browser: corresponds to a browser instance. A Browser can contain multiple BrowserContexts
  • BrowserContext: corresponds to one browser session with its own cookies and cache (not shared with other contexts), just like opening an incognito window next to a normal Chrome window. A BrowserContext can contain multiple Pages
  • Page: represents a tab, created with browserContext.newPage() or browser.newPage(). browser.newPage() creates the page in the default BrowserContext. A Page can contain multiple Frames
  • Frame: each page has one main frame (page.mainFrame()) and may have multiple child frames, created primarily by iframe tags
  • ExecutionContext: the JavaScript execution environment. Each Frame has a default JavaScript execution context
  • ElementHandle: corresponds to an element node in the DOM. It can be used to click an element or fill in a form. Elements can be obtained with selectors, XPath, etc.
  • JSHandle: corresponds to a JavaScript object in the page; ElementHandle inherits from JSHandle. Since we cannot operate on in-page objects directly, they are wrapped as JSHandles to provide the related functions
  • CDPSession: communicates with the native CDP directly, sending messages with session.send and receiving them with session.on, which makes it possible to use features the Puppeteer API does not cover
  • Coverage: gathers JavaScript and CSS code coverage
  • Tracing: captures performance data for analysis
  • Response: represents a response received by the page
  • Request: represents a request made by the page
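
The nesting described above can be sketched as a small walker that goes Browser → BrowserContext → Page → Frame and collects URLs. `describeBrowser` is a name invented for this example; it relies only on `browser.browserContexts()`, `context.pages()`, `page.url()`, `page.frames()` and `frame.url()`, so any object exposing those methods works.

```javascript
// Walk Browser -> BrowserContext -> Page -> Frame and collect the URLs found.
async function describeBrowser(browser) {
  const result = [];
  for (const context of browser.browserContexts()) {
    const pages = [];
    // context.pages() is asynchronous and resolves to an array of Page objects
    for (const page of await context.pages()) {
      pages.push({
        url: page.url(),
        frames: page.frames().map(frame => frame.url()),
      });
    }
    result.push({pages});
  }
  return result;
}

// Usage (requires puppeteer to be installed):
//   const browser = await require('puppeteer').launch();
//   console.log(JSON.stringify(await describeBrowser(browser), null, 2));
```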

Puppeteer Installation and Environment

Note: Before v1.18.1, Puppeteer required at least Node v6.4.0. Versions from v1.18.1 to v2.1.0 depend on Node 8.9.0+. Starting with v3.0.0, Puppeteer depends on Node 10.18.1+. The examples below use async/await, which is only supported in Node v7.6.0 or later.

Puppeteer is a Node.js package, so installing Puppeteer is simple:

npm install puppeteer
# or
yarn add puppeteer

npm may fail while installing Puppeteer. This usually happens because the Chromium download is hosted on servers that cannot be reached from some networks. Using a proxy or installing through the Taobao mirror (cnpm) solves the problem.

When Puppeteer is installed, it downloads a recent version of Chromium. Since version 1.7.0, the puppeteer-core package is also officially available. It does not download a browser by default and is intended for launching an existing browser or connecting to a remote one. Make sure that the puppeteer-core version you install is compatible with the browser you intend to connect to.
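
A minimal sketch of the two puppeteer-core modes follows: connecting to a running browser or launching an existing one. The environment variable names `BROWSER_WS_ENDPOINT` and `CHROME_PATH` and both function names are hypothetical, chosen for this example; `puppeteer.connect({browserWSEndpoint})` and `puppeteer.launch({executablePath})` are the underlying API calls.

```javascript
// Decide how to obtain a browser with puppeteer-core, which ships without Chromium.
// BROWSER_WS_ENDPOINT and CHROME_PATH are hypothetical variable names for this sketch.
function resolveBrowserOptions(env) {
  if (env.BROWSER_WS_ENDPOINT) {
    // Attach to a browser that is already running
    return {mode: 'connect', options: {browserWSEndpoint: env.BROWSER_WS_ENDPOINT}};
  }
  if (env.CHROME_PATH) {
    // Launch a locally installed Chrome or Chromium
    return {mode: 'launch', options: {executablePath: env.CHROME_PATH}};
  }
  throw new Error('Set BROWSER_WS_ENDPOINT or CHROME_PATH');
}

async function openBrowser(env = process.env) {
  // Loaded lazily so the option logic above can be used without the package installed
  const puppeteer = require('puppeteer-core');
  const {mode, options} = resolveBrowserOptions(env);
  return mode === 'connect' ? puppeteer.connect(options) : puppeteer.launch(options);
}
```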

Puppeteer use cases

Case1: Screenshots

Puppeteer can take screenshots of a whole page as well as of a single element in the page:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Set the viewport size
  await page.setViewport({width: 1920, height: 800});
  await page.goto('https://www.baidu.com/');
  // Take a screenshot of the page
  await page.screenshot({
    path: './files/baidu_home.png', // Screenshot save path
    type: 'png',
    fullPage: true // Capture the full scrollable page
    // clip: {x: 0, y: 0, width: 1920, height: 800}
  });
  // Take a screenshot of a single element
  const element = await page.$('#s_lg_img');
  await element.screenshot({path: './files/baidu_logo.png'});
  await page.close();
  await browser.close();
})();

How do we get an element in a page?

  • page.$('#uniqueId'): Gets the first element corresponding to a selector
  • page.$$('div'): Gets all elements corresponding to a selector
  • page.$x('//img'): Gets all elements matching an XPath expression
  • page.waitForXPath('//img'): Waits for an element matching an XPath expression to appear
  • page.waitForSelector('#uniqueId'): Waits for the element corresponding to a selector to appear

Case2: Simulating user operations

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    slowMo: 100,                 // Slow down each operation by 100 ms
    headless: false,             // Open a visible browser window
    defaultViewport: {width: 1440, height: 780},
    ignoreHTTPSErrors: false,    // Do not ignore HTTPS errors
    args: ['--start-fullscreen'] // Open the page in full screen
  });
  const page = await browser.newPage();
  await page.goto('https://www.baidu.com/');
  // Input text
  const inputElement = await page.$('#kw');
  await inputElement.type('hello word', {delay: 20});
  // Click the search button
  const okButtonElement = await page.$('#su');
  // Clicking the button triggers a navigation, so wait for the click and the navigation together
  await Promise.all([
    okButtonElement.click(),
    page.waitForNavigation()
  ]);
  await page.close();
  await browser.close();
})();

What functions does ElementHandle provide to manipulate elements?

  • elementHandle.click(): Click on an element
  • elementHandle.tap(): Simulates finger touch and click
  • elementHandle.focus(): Focuses on an element
  • elementHandle.hover(): Hovers the mouse over an element
  • elementHandle.type('hello'): Enters text in the input box

Case3: Embedding JavaScript code

The most powerful feature of Puppeteer is that you can execute any JavaScript code you want in the browser. The following code crawls the hot-search recommendations on the Baidu home page as an example.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.baidu.com/');
  // The function passed to page.evaluate runs inside the page, not in Node.js
  const resultData = await page.evaluate(() => {
    const listEle = [...document.querySelectorAll('#hotsearch-content-wrapper .hotsearch-item')];
    return listEle.map((ele) => {
      const urlEle = ele.querySelector('a.c-link');
      const titleEle = ele.querySelector('.title-content-title');
      return {
        href: urlEle.href,
        title: titleEle.innerText,
      };
    });
  });
  console.log(resultData);
  await page.close();
  await browser.close();
})();

What functions are available to execute code in the browser environment?

  • page.evaluate(pageFunction[, ...args]): Executes a function in the browser environment
  • page.evaluateHandle(pageFunction[, ...args]): Executes a function in the browser environment and returns a JSHandle object
  • page.$$eval(selector, pageFunction[, ...args]): Passes all elements matching the selector to the function and executes it in the browser environment
  • page.$eval(selector, pageFunction[, ...args]): Passes the first element matching the selector to the function and executes it in the browser environment
  • page.evaluateOnNewDocument(pageFunction[, ...args]): Executes the function in the browser environment whenever a new document is created, before any of the page's scripts run
  • page.exposeFunction(name, puppeteerFunction): Registers a function on the window object. The function itself executes in the Node.js environment, which lets code in the browser environment call Node.js libraries
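
As a small illustration of `page.$$eval` with extra arguments, the sketch below keeps the in-page function separate so it can be exercised without a browser. `collectLinks` and `scrapeLinks` are names invented for this example. The function passed to `$$eval` is serialized and run in the page, so it must not reference variables from the Node.js scope; anything it needs has to be passed in as an argument.

```javascript
// Runs inside the page when passed to page.$$eval, so it must be self-contained.
// `elements` is the array of elements matching the selector.
function collectLinks(elements, maxItems) {
  return elements.slice(0, maxItems).map(el => ({
    href: el.href,
    title: el.innerText.trim(),
  }));
}

async function scrapeLinks(page, selector, maxItems = 10) {
  // Arguments after the function are serialized and handed to it in the page
  return page.$$eval(selector, collectLinks, maxItems);
}

// Usage with a real Page object: const links = await scrapeLinks(page, 'a', 5);
```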

Case4: Request interception

In some scenarios it is useful to block unnecessary requests to improve performance. We can listen for the page's request event and intercept requests there. The prerequisite is to enable request interception with page.setRequestInterception(true).

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const blockTypes = new Set(['image', 'media', 'font']);
  await page.setRequestInterception(true); // Enable request interception
  page.on('request', request => {
    const type = request.resourceType();
    const shouldBlock = blockTypes.has(type);
    if (shouldBlock) {
      // Abort the request directly
      return request.abort();
    } else {
      // Continue the request; url, method, postData and headers can be overridden
      return request.continue({
        headers: Object.assign({}, request.headers(), {
          'puppeteer-test': 'true'
        })
      });
    }
  });
  await page.goto('https://www.baidu.com/');
  await page.close();
  await browser.close();
})();

What events are available on the Page?

  • page.on('close'): the page is closed
  • page.on('console'): a console API is called in the page
  • page.on('error'): the page crashes
  • page.on('load'): the page has loaded
  • page.on('request'): the page issues a request
  • page.on('requestfailed'): a request fails
  • page.on('requestfinished'): a request finishes successfully
  • page.on('response'): a response is received
  • page.on('workercreated'): a web worker is created
  • page.on('workerdestroyed'): a web worker is destroyed
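
A short sketch of how a few of these events might be wired up for logging. `attachPageLogging` is a name made up for this example. It also listens for `pageerror`, a Page event that is not in the list above and fires on uncaught exceptions in the page. The listener arguments are used through their documented accessors (`msg.type()`, `msg.text()`, `req.failure()`, `res.status()`).

```javascript
// Attach listeners for a few Page events and forward readable lines to `log`.
function attachPageLogging(page, log = console.log) {
  page.on('console', msg => log(`[console.${msg.type()}] ${msg.text()}`));
  page.on('pageerror', err => log(`[pageerror] ${err.message}`));
  page.on('requestfailed', req => {
    const failure = req.failure(); // null unless the request actually failed
    log(`[requestfailed] ${req.url()} ${failure ? failure.errorText : ''}`.trim());
  });
  page.on('response', res => {
    // Only report error responses to keep the log short
    if (res.status() >= 400) log(`[response] ${res.status()} ${res.url()}`);
  });
}

// Usage with a real Page object: attachPageLogging(page); await page.goto(url);
```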

Case5: Getting WebSocket responses

Puppeteer does not currently provide a native API for handling WebSockets, but the data is available through the lower-level Chrome DevTools Protocol (CDP):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Create a CDP session
  const cdpSession = await page.target().createCDPSession();
  // Enable the Network domain so that network events are delivered
  await cdpSession.send('Network.enable');
  // Listen for webSocketFrameReceived events to obtain the corresponding data
  cdpSession.on('Network.webSocketFrameReceived', frame => {
    const payloadData = frame.response.payloadData;
    if (payloadData.includes('push:query')) {
      // Parse payloadData to get the data pushed by the server
      const res = JSON.parse(payloadData.match(/\{.*\}/)[0]);
      if (res.code !== 200) {
        console.log(`failed to call websocket: code=${res.code}, message=${res.message}`);
      } else {
        console.log('got websocket data:', res.result);
      }
    }
  });
  await page.goto('https://netease.youdata.163.com/dash/142161/reportExport?pid=700209493');
  await page.waitForFunction('window.renderdone', {polling: 20});
  await page.close();
  await browser.close();
})();

Case6: Working with elements inside an iframe

Each Frame has its own execution context, and a function cannot be executed across frames. A page can have multiple frames, which are mainly created by embedded iframe tags. Most of the functions on page are actually shorthand for page.mainFrame().xx. Frames form a tree, and we can traverse all of them with frame.childFrames(). To execute a function in another frame, you first have to get the corresponding Frame object.

The login form of the 188 mailbox (www.188.com) is actually an embedded iframe. The following code finds that iframe and logs in:
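
The tree traversal mentioned above can be sketched with two small recursive helpers. `dumpFrameTree` and `findFrame` are hypothetical names; they only rely on `frame.url()` and `frame.childFrames()`.

```javascript
// Recursively flatten the frame tree into indented lines, depth-first.
function dumpFrameTree(frame, depth = 0) {
  const lines = [`${'  '.repeat(depth)}${frame.url()}`];
  for (const child of frame.childFrames()) {
    lines.push(...dumpFrameTree(child, depth + 1));
  }
  return lines;
}

// Find the first frame (depth-first) whose URL contains the given fragment.
function findFrame(frame, urlFragment) {
  if (frame.url().includes(urlFragment)) return frame;
  for (const child of frame.childFrames()) {
    const match = findFrame(child, urlFragment);
    if (match) return match;
  }
  return null;
}

// Usage with a real Page object:
//   console.log(dumpFrameTree(page.mainFrame()).join('\n'));
```

Unlike a loop over `page.mainFrame().childFrames()`, which only sees direct children, the recursive search also reaches iframes nested inside other iframes.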

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({headless: false, slowMo: 50});
  const page = await browser.newPage();
  await page.goto('https://www.188.com');
  for (const frame of page.mainFrame().childFrames()) {
    // Find the iframe of the login page based on its URL
    if (frame.url().includes('passport.188.com')) {
      await frame.type('.dlemail', '[email protected]');
      await frame.type('.dlpwd', '123456');
      await Promise.all([
        frame.click('#dologin'),
        page.waitForNavigation()
      ]);
      break;
    }
  }
  await page.close();
  await browser.close();
})();

Case7: Page performance analysis

Puppeteer provides a Tracing tool for page performance analysis. Its capabilities are currently limited: it only captures the performance data of a page.

  • A browser can run only one trace at a time
  • The resulting JSON file can be loaded into the DevTools Performance panel to view the analysis
  • We can write scripts to parse the data in trace.json for automated analysis
  • Tracing shows page loading speed and script execution performance

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.tracing.start({path: './files/trace.json'});
  await page.goto('https://www.google.com');
  await page.tracing.stop();
  /* continue analysis from 'trace.json' */
  await browser.close();
})();
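
One possible starting point for analyzing trace.json with a script is sketched below. It assumes the file follows the Chrome Trace Event format, with a top-level `traceEvents` array in which complete events have `ph === 'X'` and a duration `dur` in microseconds. `summarizeTrace` and `summarizeTraceFile` are names invented for this example.

```javascript
const fs = require('fs');

// Sum the duration of complete events (ph === 'X') per event name, in milliseconds,
// and return the topN most expensive names.
function summarizeTrace(trace, topN = 5) {
  const totals = new Map();
  for (const event of trace.traceEvents || []) {
    if (event.ph !== 'X' || typeof event.dur !== 'number') continue;
    // dur is expressed in microseconds
    totals.set(event.name, (totals.get(event.name) || 0) + event.dur / 1000);
  }
  return [...totals.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN)
    .map(([name, ms]) => ({name, ms}));
}

function summarizeTraceFile(path, topN) {
  return summarizeTrace(JSON.parse(fs.readFileSync(path, 'utf8')), topN);
}

// Usage: console.table(summarizeTraceFile('./files/trace.json'));
```

Nested events are counted in full here, so the totals overlap; a more careful analysis would subtract child durations.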

Case8: File upload and download

The need to upload and download files is often encountered in automated testing. How is this implemented in Puppeteer?

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Set the download path through a CDP session
  const cdp = await page.target().createCDPSession();
  await cdp.send('Page.setDownloadBehavior', {
    behavior: 'allow',               // Allow all download requests
    downloadPath: 'path/to/download' // Set the download path
  });
  // Click a button to trigger the download
  await (await page.waitForSelector('#someButton')).click();
  // Wait for the file by polling until it exists
  // (waitForFile is a helper you write yourself, not a Puppeteer API)
  await waitForFile('path/to/download/filename');

  // To upload a file, get the corresponding input element first
  const inputElement = await page.waitForXPath('//input[@type="file"]');
  await inputElement.uploadFile('/path/to/file');
  await browser.close();
})();
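
The `waitForFile` call in the example is not part of Puppeteer. A minimal polling sketch of such a helper could look like the following; the option names `timeoutMs` and `intervalMs` and their defaults are assumptions of this example.

```javascript
const fs = require('fs');

// Hypothetical helper: resolves with the path once the file exists,
// rejects if it has not appeared within timeoutMs.
function waitForFile(filePath, {timeoutMs = 30000, intervalMs = 200} = {}) {
  return new Promise((resolve, reject) => {
    const deadline = Date.now() + timeoutMs;
    const check = () => {
      if (fs.existsSync(filePath)) {
        resolve(filePath);
      } else if (Date.now() >= deadline) {
        reject(new Error(`Timed out waiting for ${filePath}`));
      } else {
        // Not there yet: look again after a short pause
        setTimeout(check, intervalMs);
      }
    };
    check();
  });
}
```

This only checks that the final file name exists. Chrome typically writes to a temporary .crdownload file and renames it when the download finishes, so this is usually sufficient, but treat that as an assumption to verify for your setup.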

Case9: Switching to a new tab page

When clicking a button opens a new tab, a new Page is created. How do we get the Page instance for it? Listen for the targetcreated event on the Browser, which indicates that a new target (here, the new page) has been created:

let page = await browser.newPage();
await page.goto(url);
let btn = await page.waitForSelector('#btn');
// Define a Promise before clicking the button, so that the event is not missed
const newPagePromise = new Promise(res =>
  browser.once('targetcreated', target => res(target.page()))
);
await btn.click();
// After the click, wait for the Page object of the new tab
let newPage = await newPagePromise;
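
The same idea can be wrapped in a reusable helper. The sketch below (the name `waitForNewPage` is made up here) additionally filters on `target.type()`, since `targetcreated` also fires for targets that are not pages, such as workers.

```javascript
// Run `trigger` (e.g. a click) and resolve with the Page of the tab it opens.
function waitForNewPage(browser, trigger) {
  // Register the listener before triggering, so the event cannot be missed
  const newPagePromise = new Promise(resolve => {
    const onTarget = target => {
      if (target.type() !== 'page') return; // ignore workers and other targets
      browser.off('targetcreated', onTarget);
      resolve(target.page());
    };
    browser.on('targetcreated', onTarget);
  });
  return Promise.resolve(trigger()).then(() => newPagePromise);
}

// Usage with real objects: const newPage = await waitForNewPage(browser, () => btn.click());
```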

Case10: Simulating different devices

Puppeteer can simulate different devices. puppeteer.devices defines the configuration of many devices, including viewport and userAgent:

const puppeteer = require('puppeteer');
const iPhone = puppeteer.devices['iPhone 6'];
puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.emulate(iPhone);
  await page.goto('https://www.baidu.com');
  await browser.close();
});

Performance and optimization

  • About shared memory:

    Chrome uses /dev/shm shared memory by default, but Docker only gives /dev/shm 64 MB by default, which is not enough. There are two ways to solve this:

    • Start the Docker container with the parameter --shm-size=1gb to enlarge the /dev/shm shared memory
    • Start Chrome with the parameter --disable-dev-shm-usage so that it does not use /dev/shm shared memory

    const puppeteer = require('puppeteer');
    (async () => {
      const browser = await puppeteer.launch({
        args: ['--disable-dev-shm-usage'] // Do not use /dev/shm shared memory
      });
      await browser.close();
    })();

  • Try to reuse the same browser instance so that the cache can be shared

  • Use request interception to block resources that do not need to be loaded

  • Just as Chrome gets sluggish when you open many tabs by hand, too many pages will slow the instance down, so keep the number of open tabs under control

  • A Chrome instance that has been running for a long time will inevitably show memory leaks, page crashes and other problems, so restart the Chrome instance periodically

  • To improve performance, turn off configuration you do not need, for example with --no-sandbox (disables the sandbox), --disable-extensions, etc.

  • Avoid page.waitFor(1000) as much as possible. Instead of sleeping for a fixed time, let the program decide when to continue, for example by waiting for a selector or a condition

  • Because the connection to a Chrome instance uses WebSocket, you may run into the WebSocket sticky session problem when several instances are deployed.
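
The advice to reuse one browser instance and restart it periodically can be combined in a small wrapper. The sketch below is one possible shape, not an established pattern from Puppeteer itself: `BrowserRecycler` and `maxUses` are names chosen for this example, the launch function is injected, and the class assumes callers use the browser one at a time (it closes the old instance without checking for pages still in use).

```javascript
// Hand out a shared browser and relaunch it after maxUses acquisitions.
class BrowserRecycler {
  constructor(launch, maxUses = 50) {
    this.launch = launch;   // async function returning a browser, e.g. () => puppeteer.launch()
    this.maxUses = maxUses;
    this.browser = null;
    this.uses = 0;
  }

  async acquire() {
    if (this.browser && this.uses >= this.maxUses) {
      // The instance has served its quota: close it so a fresh one is launched
      await this.browser.close();
      this.browser = null;
    }
    if (!this.browser) {
      this.browser = await this.launch();
      this.uses = 0;
    }
    this.uses += 1;
    return this.browser;
  }

  async dispose() {
    if (this.browser) await this.browser.close();
    this.browser = null;
  }
}

// Usage (requires puppeteer):
//   const recycler = new BrowserRecycler(() => require('puppeteer').launch(), 100);
//   const browser = await recycler.acquire();
```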