Puppeteer is the Chrome team's official Node.js library for headless Chrome. It provides a set of APIs for driving Chrome without a UI, which makes it suitable for scenarios such as crawlers and automated processing.

According to the website, Puppeteer can do the following:

  • Generate screenshots and PDFs of pages
  • Automate form submission, UI testing, keyboard input, and more
  • Create an up-to-date automated test environment: run tests directly in the latest version of Chrome, using the latest JavaScript and browser features
  • Capture a timeline trace of the site to help diagnose performance problems
  • Crawl an SPA and generate pre-rendered content (i.e. ‘SSR’)

These are the roles Puppeteer can play.

1. Initialize the project

Note: The examples use newer ES6/ES7 features, so we compile the code with TypeScript.

npm install puppeteer typescript @types/puppeteer

tsconfig.json is configured as follows:

{
  "compileOnSave": true,
  "compilerOptions": {
    "target": "es5",
    "lib": ["es6", "dom"],
    "types": ["node"],
    "outDir": "./dist/",
    "sourceMap": true,
    "module": "commonjs",
    "watch": true,
    "moduleResolution": "node",
    "isolatedModules": false,
    "experimentalDecorators": true,
    "declaration": true,
    "suppressImplicitAnyIndexErrors": true
  },
  "include": ["./examples/**/*"]
}

The Puppeteer module provides a way to launch a Chromium instance.

import * as puppeteer from 'puppeteer'

(async () => {
  await puppeteer.launch()
})()

The code above creates a Browser instance with puppeteer.launch(), which accepts an options object. The most commonly used options are:

  • headless [boolean]: whether to start the browser in headless mode
  • slowMo [number]: slows down Puppeteer operations by the given number of milliseconds, which makes it easy to see what is going on
  • args [Array<string>]: additional arguments to pass to the browser instance
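
To show how these options combine, here is a minimal sketch that builds one options object for a debugging run and another for a normal headless run. The LaunchOptionsSketch type and the buildLaunchOptions helper are hypothetical names introduced for illustration and are not part of Puppeteer; only headless, slowMo, and args come from the list above, and --window-size is a Chromium command-line switch used as an example argument.

```typescript
// Local mirror of the three launch options listed above (hypothetical type name).
interface LaunchOptionsSketch {
  headless: boolean;
  slowMo?: number;
  args: string[];
}

// Build options for a visible, slowed-down debugging run or a plain headless run.
function buildLaunchOptions(debug: boolean): LaunchOptionsSketch {
  const args = ["--window-size=1280,800"];
  if (debug) {
    // Show the browser window and slow every operation down by 250 ms.
    return { headless: false, slowMo: 250, args };
  }
  return { headless: true, args };
}
```

The result would be passed straight to the launch call, for example puppeteer.launch(buildLaunchOptions(true)).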

2. Generate a screenshot

Let’s take example.com/ as an example

import * as puppeteer from "puppeteer";

(async () => {
  const browser = await puppeteer.launch(); // Create a browser instance
  const page = await browser.newPage(); // Open a new page in the default browser context
  await page.goto("https://example.com/"); // Navigate to https://example.com/
  await page.screenshot({
    // Save a screenshot
    path: "example.png"
  });
  await browser.close(); // Close the browser so the Node process can exit
})();

Note that by default a screenshot captures only the part of the page that is visible in the viewport. To capture the full height of a scrollable page, add fullPage: true to the screenshot options.

Run node dist/screenshot.js to generate example.png in the root directory.

Puppeteer's default page size is 800 x 600, which can be changed with page.setViewport().
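
The two settings just mentioned can be combined as in the sketch below. To keep it self-contained, it declares a small structural type, ScreenshotPage, covering only setViewport and screenshot; that type name, the captureFullPage helper, and the 1280 x 800 default size are assumptions made for illustration. A real Puppeteer page offers both methods, so it can be passed in directly.

```typescript
// Minimal structural type covering only the two page methods used here
// (hypothetical name; a Puppeteer page provides both methods).
interface ScreenshotPage {
  setViewport(viewport: { width: number; height: number }): Promise<void>;
  screenshot(options: { path: string; fullPage?: boolean }): Promise<unknown>;
}

// Resize the viewport, then capture the whole scrollable page to `path`.
async function captureFullPage(
  page: ScreenshotPage,
  path: string,
  width = 1280,
  height = 800
): Promise<void> {
  await page.setViewport({ width, height });
  await page.screenshot({ path, fullPage: true });
}
```

With a page that has already navigated somewhere, the call would look like await captureFullPage(page, "example.png").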

Puppeteer can also emulate mobile devices:

import * as puppeteer from "puppeteer";
import * as devices from "puppeteer/DeviceDescriptors";
const iPhone = devices["iPhone 6"];

(async () => {
  const browser = await puppeteer.launch({
    headless: false
  });
  const page = await browser.newPage();
  await page.emulate(iPhone);
  await page.goto("https://baidu.com/");
  await browser.close();
})();

3. Generate PDF

import * as puppeteer from "puppeteer";

(async () => {
  // PDF generation only works in headless mode, which is the default
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com/");
  await page.pdf({
    displayHeaderFooter: true,
    path: "example.pdf",
    format: "A4",
    headerTemplate: '<b style="font-size: 30px">Hello world</b>',
    footerTemplate: '<b style="font-size: 30px">Some text</b>',
    // Margins leave room for the header and footer templates
    margin: {
      top: "100px",
      bottom: "200px",
      right: "30px",
      left: "30px"
    }
  });
  await browser.close();
})();

Execute node dist/pdf.js.

4. Automate form submission and input

Here we simulate logging in to GitHub. To see the process more clearly, we set headless: false to turn off headless mode and watch the entire login flow.

import * as puppeteer from "puppeteer";

(async () => {
  const browser = await puppeteer.launch({
    headless: false
  });
  const page = await browser.newPage();
  await page.goto("https://github.com/login");
  await page.waitFor(1000); // Pause for one second before typing
  await page.type("#login_field", "Account"); // Type the account name at once
  await page.type("#password", "Password", {
    delay: 100
  }); // Simulate a user typing, with 100 ms between keystrokes
  await page.click("input[type=submit]"); // Click the login button
})();

Run node dist/login.js.
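
A fixed one-second pause works for a demo, but waiting for the form field itself is usually more reliable. The sketch below is one possible variation, not the original author's code: it waits for #login_field with waitForSelector, and it starts waitForNavigation before clicking so the navigation triggered by the click is not missed. The LoginPage type and the submitLogin helper are hypothetical names; the type lists only the page methods the sketch needs, so a Puppeteer page can be passed in.

```typescript
// Structural type for the page methods this sketch relies on (hypothetical name).
interface LoginPage {
  waitForSelector(selector: string): Promise<unknown>;
  type(selector: string, text: string, options?: { delay?: number }): Promise<void>;
  click(selector: string): Promise<void>;
  waitForNavigation(): Promise<unknown>;
}

async function submitLogin(
  page: LoginPage,
  account: string,
  password: string
): Promise<void> {
  // Wait until the form field exists instead of sleeping for a fixed time.
  await page.waitForSelector("#login_field");
  await page.type("#login_field", account);
  // Type the password with 100 ms between keystrokes, as in the example above.
  await page.type("#password", password, { delay: 100 });
  // Begin waiting for the navigation before clicking, so the event is not missed.
  await Promise.all([
    page.waitForNavigation(),
    page.click("input[type=submit]")
  ]);
}
```

After page.goto("https://github.com/login"), the call would be await submitLogin(page, "Account", "Password").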

5. Site timeline tracing

tracing.start and tracing.stop make it easy to create a trace file that can be opened in Chrome DevTools.

import * as puppeteer from "puppeteer";

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Record everything between start and stop into trace.json
  await page.tracing.start({
    path: "trace.json"
  });
  await page.goto("https://example.com/");
  await page.tracing.stop();
  await browser.close();
})();

Running node dist/trace.js generates a trace.json file. Open Chrome DevTools -> Performance and drag the file into the panel. This lets us analyze and optimize the performance of the site.
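
If the traced step throws, tracing.stop is never reached and the trace file is not finalized. One way to guard against that, sketched here under assumptions, is a small wrapper that stops the trace in a finally block. The TracingPage type and the withTrace helper are hypothetical names introduced for illustration; the screenshots flag is an option of tracing.start that adds screenshots to the trace.

```typescript
// Structural type for the tracing methods used here (hypothetical name).
interface TracingPage {
  tracing: {
    start(options: { path: string; screenshots?: boolean }): Promise<void>;
    stop(): Promise<unknown>;
  };
}

// Run `action` while tracing, and always stop the trace afterwards.
async function withTrace<T>(
  page: TracingPage,
  path: string,
  action: () => Promise<T>
): Promise<T> {
  await page.tracing.start({ path, screenshots: true });
  try {
    return await action();
  } finally {
    // Runs whether `action` resolved or threw, so the trace is always stopped.
    await page.tracing.stop();
  }
}
```

A call would look like await withTrace(page, "trace.json", () => page.goto("https://example.com/")).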

6. Crawlers and SSR

Most developers now build SPA sites with React, Vue, or Angular. SPAs have real advantages, such as fast development, modularization, componentization, and good performance. Their disadvantages are just as clear: the first screen is slow to render, and because the content is generated by JavaScript, SPAs are unfriendly to SEO and to crawlers.

Take preview.pro.ant.design/#/dashboard… as an example. If we right-click and view the page source, we find that the body contains only an empty container element, with none of the rendered content. Suppose we want to scrape the store sales ranking from that page and save it to a database for data analysis.

Python

# -*- coding: utf-8 -*-
from urllib.request import urlopen  # Python 3 replacement for urllib2

from bs4 import BeautifulSoup


def spider():
    # Fetch the raw HTML; no JavaScript is executed
    html = urlopen('https://preview.pro.ant.design/#').read()
    soup = BeautifulSoup(html, 'lxml')
    print(soup.prettify())


if __name__ == '__main__':
    spider()

Run python py/index.py to get the following result:

Node.js

import axios from "axios";

(async () => {
  const res = await axios.get("https://preview.pro.ant.design/#");
  console.log(res.data); // Raw HTML only; no JavaScript is executed
})();

Execute node dist/node-spider.js to get the same result as in the above example.

Puppeteer

import * as puppeteer from "puppeteer";

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://preview.pro.ant.design/#");
  console.log(await page.content()); // HTML after the page's JavaScript has run
  await browser.close();
})();

Execute node dist/spider.js to get the following:

import * as puppeteer from "puppeteer";

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://preview.pro.ant.design/#");
  const RANK = ".rankingList___11Ilg li";
  await page.waitForSelector(RANK); // Wait until the ranking list has rendered
  // The callback runs inside the page, so it cannot see variables from Node
  const res = await page.evaluate(() => {
    const getText = (v, selector) => {
      return v.querySelector(selector) && v.querySelector(selector).innerText;
    };
    const salesRank = Array.from(
      document.querySelectorAll(".rankingList___11Ilg li")
    );
    const data = [];
    salesRank.forEach(v => {
      const obj = {
        rank: getText(v, "span:nth-child(1)"),
        address: getText(v, "span:nth-child(2)"),
        sales: getText(v, "span:nth-child(3)")
      };
      data.push(obj);
    });
    return { data };
  });
  console.log(res);
  await browser.close();
})();

Execute node dist/spider.js to get the following:

At this point, we have used Puppeteer to scrape the data we need.
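
Every field returned by page.evaluate is page text, so some cleanup is usually needed before the rows go into a database. The sketch below is one way to do that and rests on assumptions about the page: that rank is an integer and that sales is a number with optional thousands separators such as "323,234". The RawRow and SalesRow types and the normalizeRows helper are hypothetical names introduced for illustration.

```typescript
// Shape of one scraped row: every field is page text, or null if the span was missing.
interface RawRow {
  rank: string | null;
  address: string | null;
  sales: string | null;
}

// Cleaned row with numeric fields, ready to store.
interface SalesRow {
  rank: number;
  address: string;
  sales: number;
}

function normalizeRows(rows: RawRow[]): SalesRow[] {
  const result: SalesRow[] = [];
  for (const row of rows) {
    // Skip rows where any span was not found.
    if (row.rank === null || row.address === null || row.sales === null) continue;
    const rank = parseInt(row.rank, 10);
    // Assumed format: digits with optional thousands separators.
    const sales = parseFloat(row.sales.replace(/,/g, ""));
    // Skip rows whose text could not be parsed as numbers.
    if (isNaN(rank) || isNaN(sales)) continue;
    result.push({ rank, address: row.address.trim(), sales });
  }
  // Order by rank so the output does not depend on DOM order.
  return result.sort((a, b) => a.rank - b.rank);
}
```

With the result from the scraping example, the call would be normalizeRows(res.data).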

That covers the basic functionality of Puppeteer. The sample code for this article is available on GitHub.

References

  • github.com/GoogleChrom…
  • pptr.dev/#?product=P…