Introduction to the Puppeteer Headless Browser

What is Puppeteer

Puppeteer is a Node library that provides a high-level API for controlling Chrome (or Chromium) over the DevTools Protocol. In plain English, it drives a headless Chrome browser (it can also be configured to run with a UI, but that is not the default).

Puppeteer structure

Puppeteer uses the DevTools Protocol to communicate with the browser.
A Browser instance can own multiple browser contexts.
A BrowserContext instance defines a browsing session and can own multiple pages.
A Page has at least one frame, the main frame. There may be other frames created by iframe or frame tags.
A Frame has at least one execution context (the default JavaScript execution context). A frame may have additional execution contexts associated with extensions.
A Worker has a single execution context and facilitates interaction with WebWorkers.
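
To make the hierarchy above concrete, here is a minimal sketch that walks a page's frame tree. It relies only on the synchronous Frame methods url() and childFrames(). The names frameTree and main are made up for this example, and main assumes the puppeteer package is installed.

```javascript
// Render a frame and its descendants as indented lines, one per frame.
function frameTree(frame, depth = 0) {
  const line = '  '.repeat(depth) + (frame.url() || '(no url)');
  const childLines = frame
    .childFrames()
    .flatMap((child) => frameTree(child, depth + 1));
  return [line, ...childLines];
}

// Illustrative driver: Browser -> BrowserContext -> Page -> Frame.
async function main() {
  const puppeteer = require('puppeteer'); // assumed to be installed
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.baidu.com/');
  console.log('contexts:', browser.browserContexts().length);
  console.log(frameTree(page.mainFrame()).join('\n'));
  await browser.close();
}
```

Calling main() from a Node script prints the number of browser contexts followed by one line per frame, indented by nesting depth.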

What Puppeteer can do

Generate screenshots or PDFs of web pages
Crawl single-page applications (SPAs) after their scripts have executed and rendered
Automate form submission, UI testing, simulated keyboard input, and so on
Use the debugging and performance analysis tools built into the browser to help analyze problems
Test in the latest headless browser environment and use the latest browser features

Installation

npm i puppeteer -S

By default, the installation downloads a recent version of Chromium. If you already have Chromium, you can use it instead by setting the launch option executablePath to its location. If the download fails, refer to the GitHub issues, or use the following mirror for a faster install:

npm config set puppeteer_download_host=https://npm.taobao.org/mirrors
npm i puppeteer -S

If a mkdir permission error occurs during the download on a Mac, add the following installation flags to resolve it:

sudo npm i puppeteer -S --unsafe-perm=true --allow-root

Or install puppeteer-cn:

npm i puppeteer-cn -S
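
As a sketch of the executablePath route mentioned above, the helper below builds launch options that point Puppeteer at an existing browser when a path is supplied. CHROME_PATH is a hypothetical environment variable chosen for this example, and buildLaunchOptions is not part of Puppeteer; executablePath and headless are real launch options.

```javascript
// Build options for puppeteer.launch(). CHROME_PATH is a hypothetical
// environment variable used here for illustration only.
function buildLaunchOptions(env = process.env) {
  const options = { headless: true };
  if (env.CHROME_PATH) {
    // Use an already-installed Chrome/Chromium instead of the bundled one.
    options.executablePath = env.CHROME_PATH;
  }
  return options;
}

// Illustrative driver: launch with the computed options and print the version.
async function main() {
  const puppeteer = require('puppeteer'); // assumed to be installed
  const browser = await puppeteer.launch(buildLaunchOptions());
  console.log(await browser.version());
  await browser.close();
}
```

Keeping the option logic in a plain function makes it easy to check without launching a browser.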

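Usage examples

The simplest example opens Baidu and then closes it (each snippet assumes puppeteer has been loaded with require('puppeteer') and that the code runs inside an async function)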

const browser = await puppeteer.launch({
  headless: false // Turn off headless mode
});
const page = await browser.newPage();
await page.goto('http://www.baidu.com/');
await browser.close();

Open Baidu, take a screenshot and generate a PDF, then close

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('http://www.baidu.com/');
await page.screenshot({path: 'baidu.png'});
// Currently, PDF generation is only supported in headless mode, see
// https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pagepdfoptions
await page.pdf({path: 'baidu.pdf'});
await browser.close();
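
Both calls accept further options. A minimal sketch, assuming you want the whole scrollable page in the screenshot and an A4 PDF with backgrounds printed. The function name capture and its basename parameter are made up for this example; fullPage, format, and printBackground are existing options.

```javascript
// Save a full-page screenshot and an A4 PDF of an already-loaded page.
// `page` is a Puppeteer Page; `basename` is an illustrative parameter.
async function capture(page, basename) {
  await page.screenshot({ path: `${basename}.png`, fullPage: true });
  // Per the note in the example above, PDF generation needs headless mode.
  await page.pdf({
    path: `${basename}.pdf`,
    format: 'A4',
    printBackground: true,
  });
  return [`${basename}.png`, `${basename}.pdf`];
}
```

For the Baidu example this would be called as await capture(page, 'baidu') after page.goto has resolved.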

Execute JS in the open page and return the result

const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage();
await page.goto('https://www.trip.com/m/');
const dimensions = await page.evaluate(() => {
  return {
    width: document.documentElement.clientWidth,
    height: document.documentElement.clientHeight,
    deviceScaleFactor: window.devicePixelRatio
  }
});
console.log('Dimensions:', dimensions);
await browser.close();
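
One detail worth spelling out: the function given to page.evaluate is serialized and executed inside the browser, so it cannot see variables from the Node side. Values have to be passed as extra arguments. A small sketch, where countElements is a name invented for this example:

```javascript
// Count the elements matching `selector` in the page.
// The callback runs in the browser, not in Node, so `selector` is handed
// over as an explicit argument instead of being captured from the closure.
async function countElements(page, selector) {
  return page.evaluate(
    (sel) => document.querySelectorAll(sel).length,
    selector
  );
}
```

Usage inside an async function would look like const links = await countElements(page, 'a'). Arguments and return values must be serializable, which is why the sketch returns a number rather than the DOM nodes themselves.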

Listen to the page's console; you will see Baidu's familiar recruitment ad

const browser = await puppeteer.launch();
const page = await browser.newPage();
page.on('console', msg => console.log(msg.type(), msg.text()));
await page.goto('https://www.baidu.com/');
await browser.close();
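
In practice you often want only some of the messages. A minimal sketch that keeps errors and warnings, assuming the message types reported by msg.type() include 'error' and 'warning'. The helper name collectConsole is made up for this example.

```javascript
// Collect console messages of selected types from a page.
// Returns an array that fills up as messages arrive.
function collectConsole(page, types = ['error', 'warning']) {
  const messages = [];
  page.on('console', (msg) => {
    if (types.includes(msg.type())) {
      messages.push({ type: msg.type(), text: msg.text() });
    }
  });
  return messages;
}
```

Register the listener before page.goto so that messages logged during page load are not missed, then inspect the returned array once navigation has finished.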

Set the viewport size

const browser = await puppeteer.launch({
  headless: false
});
const page = await browser.newPage();
// Set the viewport size before opening the page
await page.setViewport({width: 375, height: 667});
await page.goto('https://www.trip.com/m/');

Capture a performance trace, then load trace.json in DevTools -> Performance to view it

const browser = await puppeteer.launch({
  headless: false
});
const page = await browser.newPage();
// Set the trace output file and include screenshots
await page.tracing.start({
  path: 'trace.json',
  screenshots: true
});
await page.goto('https://www.trip.com');
await page.tracing.stop();
await browser.close();
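
Besides loading the file into DevTools, the trace can be inspected from Node. The sketch below assumes trace.json follows the Chrome trace event format, that is, a JSON object with a top-level traceEvents array whose entries carry a name field. The helper names topEventNames and summarizeTraceFile are made up for this example.

```javascript
const fs = require('fs');

// Count trace events by name and return the `limit` most frequent ones
// as [name, count] pairs, most frequent first.
function topEventNames(trace, limit = 5) {
  const counts = new Map();
  for (const event of trace.traceEvents || []) {
    counts.set(event.name, (counts.get(event.name) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit);
}

// Illustrative usage: summarize the file written by page.tracing.stop().
function summarizeTraceFile(path = 'trace.json') {
  const trace = JSON.parse(fs.readFileSync(path, 'utf8'));
  return topEventNames(trace);
}
```

This is only a rough overview of what the trace contains; DevTools remains the better tool for timeline analysis.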

Simulate a form submission, taking Trip.com as an example. You need to register an account first and substitute your own credentials. With luck, you will not run into a CAPTCHA

const browser = await puppeteer.launch({
  headless: false
});
const page = await browser.newPage();
await page.goto('https://www.trip.com/account/signin?');
await page.waitForSelector('#userName');
await page.focus('#userName');
await page.waitFor(500);
await page.type('#userName', 'your account', {delay: 100});
await page.focus('#txtPassword');
await page.waitFor(500);
await page.type('#txtPassword', 'your password', {delay: 100});
await page.waitFor(500);
await page.click('#btnSubmitData');
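
The example stops at the click. If the submit triggers a full page navigation (an assumption; a site that signs in via AJAX would need a different wait), a common pattern is to start waiting for the navigation before clicking. A sketch using the selectors from the example above, where signIn and the credentials object are names made up for illustration:

```javascript
// Fill in and submit a sign-in form, then wait for the resulting navigation.
// Assumes the submit causes a full page load.
async function signIn(page, credentials) {
  await page.waitForSelector('#userName');
  await page.type('#userName', credentials.user, { delay: 100 });
  await page.type('#txtPassword', credentials.password, { delay: 100 });
  // Start waiting before clicking so that a fast navigation is not missed.
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'domcontentloaded' }),
    page.click('#btnSubmitData'),
  ]);
  return page.url();
}
```

The returned URL gives a quick way to check whether the sign-in moved on to a new page or stayed on the form.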

Crawl the Douban movie search list. The returned document carries encrypted data in window.DATA, and the trick is that it is decrypted by front-end JS, so the rendered page is read from a real browser

const search_text = 'Marvel';
const size = 15; // Number of results per page
let start = 0; // Start page
const browser = await puppeteer.launch({
  headless: false
});
const page = await browser.newPage();
const crawlMovies = async () => {
  await page.goto(`https://movie.douban.com/subject_search?search_text=${encodeURIComponent(search_text)}&start=${start * size}`, {waitUntil: 'domcontentloaded'})
  console.log(`crawling page ${start + 1}...`);
  // Outer variables are not visible inside evaluate, so currentStart must be passed in as an argument
  let result = await page.evaluate((currentStart) => {
    // Get all movie titles on this page
    let list = Array.from(document.querySelectorAll('.detail')).map((item) => {
      return item.querySelector('.title a').innerHTML;
    });
    // Determine whether this is the last page, as the exit condition for the recursion
    let maxStart = Math.max.apply(null, Array.from(document.querySelectorAll('.paginator a')).map((item) => {
      let startNum = 0;
      try {
        startNum = item.getAttribute('href').match(/\d+$/)[0];
      } catch (e) {
      }
      return startNum;
    }))
    return {
      list: list,
      isEnd: currentStart > maxStart
    }
  }, start * size);
  if (result.isEnd) {
    return result.list;
  }
  start += 1;
  return result.list.concat((await crawlMovies()))
}
const movieList = await crawlMovies();
console.log(JSON.stringify(movieList, null, 2))
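
The exit condition is the subtle part of this crawler, so here it is pulled out into plain functions that can be checked without a browser, together with an iterative loop as an alternative to the recursion (a substitution made for this sketch, not the original approach). The names maxStartFromHrefs, isLastPage, crawlAll, and the fetchPage callback are all hypothetical.

```javascript
// Largest `start` offset found in the paginator links. Links whose href
// does not end in digits count as 0; an empty list yields -Infinity,
// which makes any page count as the last one.
function maxStartFromHrefs(hrefs) {
  const offsets = hrefs.map((href) => {
    const match = /(\d+)$/.exec(href || '');
    return match ? Number(match[1]) : 0;
  });
  return Math.max(...offsets);
}

// Same exit test as in the example: the current offset is past every
// offset the paginator links to.
function isLastPage(currentStart, hrefs) {
  return currentStart > maxStartFromHrefs(hrefs);
}

// Iterative variant of the crawl loop. `fetchPage(start)` is a hypothetical
// callback that loads one result page and resolves to { list, hrefs }.
// `maxPages` is a safety cap so a changed page layout cannot loop forever.
async function crawlAll(fetchPage, size = 15, maxPages = 50) {
  const titles = [];
  for (let pageIndex = 0; pageIndex < maxPages; pageIndex += 1) {
    const currentStart = pageIndex * size;
    const { list, hrefs } = await fetchPage(currentStart);
    titles.push(...list);
    if (isLastPage(currentStart, hrefs)) break;
  }
  return titles;
}
```

With Puppeteer, fetchPage would navigate to the search URL for the given offset and use page.evaluate to return the titles and the paginator hrefs.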

Summary

Puppeteer drives a headless browser and makes the things other headless browsers can do easier. The examples above give a brief overview of its basic usage; the detailed API is available in the official documentation.

Reference links

Official documentation

A practical tool for running Puppeteer online