Puppeteer is a high-level abstraction of the headless Chrome browser with an extensive API. This makes it very easy to automatically interact with web pages.

This article introduces you to a use case where we search for a keyword on GitHub and get the title of the first result.

This is a basic example, purely for demonstration purposes, and it can be done even without Puppeteer. Because the search keyword and the page number are reflected in the URL of GitHub’s results pages, you can navigate directly to the results.
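As a rough illustration of that shortcut, the sketch below builds a results URL straight from a keyword. The helper name buildSearchUrl is made up for this example, and the q, type, and p query parameters are assumptions about how GitHub search URLs are structured; they may change.

```javascript
// Hypothetical helper: builds a GitHub search URL without any browser automation.
// The query parameter names (q, type, p) are assumptions about GitHub's URL scheme.
function buildSearchUrl(keyword, pageNumber = 1) {
  const url = new URL('https://github.com/search');
  url.searchParams.set('q', keyword);            // the search keyword
  url.searchParams.set('type', 'repositories');  // restrict results to repositories
  url.searchParams.set('p', String(pageNumber)); // results page number
  return url.toString();
}

console.log(buildSearchUrl('react'));
```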

However, when your interactions on a web page are not reflected in the URL of the page and there is no public API to retrieve the data, automation through Puppeteer comes in handy.

Set up Puppeteer and Node.js

Let’s initialize a Node.js project in a folder. From your system terminal, navigate to the project folder you want and run the following command.

npm init -y

This will generate a package.json file. Next, install the NPM package for Puppeteer.

npm install --save puppeteer

Now, create a file called service.mjs. The .mjs extension tells Node.js to treat the file as an ES module, which lets us use import statements and top-level await. This file will be responsible for scraping pages using Puppeteer. Let’s do a quick test with Puppeteer to see if it works.

First, we launch a Chrome instance with the headless: false option so that the browser window is displayed instead of running without a GUI. Then we create a page using the newPage method and use the goto method to navigate to the URL passed as a parameter.

import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({
  headless: false
});

const page = await browser.newPage();
await page.goto('https://www.github.com');

When you run this code, a Chrome window should pop up and navigate to the URL in a new tab.

Use automation with Puppeteer

In order for Puppeteer to interact with the page, we need to manually inspect the page and specify which DOM elements to target.

We need to identify the selectors, that is, the class name, ID, element type, or, if we need a high degree of specificity, a combination of them. We can then use these selectors in various Puppeteer methods.

Now, let’s inspect www.github.com in a browser. We need to be able to focus on the search input field at the top of the page and type in the keyword we want to search for. Then we need to press the Enter key on the keyboard.

Open www.github.com in your favorite browser (I use Chrome, but any browser will do), right-click on the page, and click Inspect. Then, under the Elements tab, you can see the DOM tree. Using the element picker in the upper left corner of the inspection pane, you can click on elements to highlight them in the DOM tree.

The input field element we are interested in has several class names, but targeting .header-search-input alone is enough. To make sure we’re referring to the right element, we can do a quick test in the browser console. Click on the Console tab and use the querySelector method on the document object.

document.querySelector('.header-search-input')

If this returns the right element, then we know it can work with Puppeteer.

Note that there may be several elements that match the same selector. In this case, querySelector returns the first matched element. To reference the correct element, you need to use querySelectorAll and then pick the correct index from the returned NodeList.
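A minimal sketch of that idea is shown below. The helper name pickMatch is hypothetical; it takes the root to search as a parameter, so in the browser console you would pass document.

```javascript
// Hypothetical helper: returns the match at the given index, or null when
// fewer elements match the selector than expected.
function pickMatch(root, selector, index = 0) {
  const matches = root.querySelectorAll(selector); // NodeList in document order
  if (index < 0 || index >= matches.length) {
    return null;
  }
  return matches[index];
}

// In the browser console, for example:
// pickMatch(document, '.header-search-input', 1);
```

Returning null instead of undefined makes the "not enough matches" case explicit, which is easier to check before interacting with the element.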

There is one thing we should pay attention to here. If you narrow the GitHub page, the input field disappears from the header and is only available inside the hamburger menu.

Because it’s not visible unless the hamburger menu is opened, we can’t focus on it. To ensure that the input box is visible, we can explicitly set the browser window size by passing the defaultViewport object to the launch options.

const browser = await puppeteer.launch({
  headless: false,
  defaultViewport: {
    width: 1920,
    height: 1080
  }
});

Now it’s time to target the element. Before attempting to interact with the element, we must ensure that it is rendered and ready on the page. Puppeteer has a waitForSelector method for this purpose.

It takes a selector string as the first argument and an options object as the second. Because we’re going to interact with that element by focusing on it and then typing in the input box, it needs to be visible on the page, hence the visible: true option.

const inputField = '.header-search-input';
await page.waitForSelector(inputField, { visible: true });

As mentioned earlier, we need to focus on the input field element and then simulate typing. For these purposes, Puppeteer has the following methods.

const keyword = 'react';

await page.focus(inputField);
await page.keyboard.type(keyword);

So far, service.mjs looks like this.

import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({
  headless: false,
  defaultViewport: {
    width: 1920,
    height: 1080
  }
});

const page = await browser.newPage();
await page.goto('https://www.github.com');

const inputField = '.header-search-input';
const keyword = 'react';

await page.waitForSelector(inputField, { visible: true });
await page.focus(inputField);
await page.keyboard.type(keyword);

When you run the code, you should see the search field focused with the react keyword typed in.
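The wait, focus, and type steps tend to be repeated for every input you automate, so they can be bundled into one small helper. This is only a sketch: the name typeInto is made up here, and it assumes the page argument is a Puppeteer Page object.

```javascript
// Hypothetical helper: waits until the input is visible, focuses it, and types the text.
// `page` is assumed to be a Puppeteer Page instance.
async function typeInto(page, selector, text) {
  await page.waitForSelector(selector, { visible: true }); // element must be rendered and visible
  await page.focus(selector);                              // move keyboard focus to the input
  await page.keyboard.type(text);                          // simulate key presses
}

// Usage inside service.mjs, for example:
// await typeInto(page, '.header-search-input', 'react');
```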

Now, simulate the Enter key on the keyboard.

await page.keyboard.press('Enter');

After we hit Enter, Chrome navigates to a new page. If we search for a keyword manually and inspect the page we are taken to, we will find that the selector for the element we are interested in is .repo-list.

At this point, we need to make sure that navigation to the new page is complete. For this, there is the page.waitForNavigation method. After navigation, we again need to wait for the element by using the page.waitForSelector method.

However, if we are only interested in scraping some data from that element, we don’t need to wait until it becomes visible. So this time, we can omit { visible: true }, since visible is false by default.

const repoList = '.repo-list';

await page.waitForNavigation();
await page.waitForSelector(repoList);
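
One caveat: if the navigation finishes before waitForNavigation is called, the wait may never resolve. A common way to avoid this race, also suggested in the Puppeteer documentation for clicks, is to start waiting before triggering the navigation and await both together. The sketch below assumes page is a Puppeteer Page object; the helper name is made up.

```javascript
// Hypothetical helper: starts listening for the navigation first, then presses Enter,
// and resolves once both have completed.
async function pressEnterAndWaitForNavigation(page) {
  await Promise.all([
    page.waitForNavigation(),     // registered before the key press triggers navigation
    page.keyboard.press('Enter'),
  ]);
}

// Usage, replacing the separate press and waitForNavigation calls:
// await pressEnterAndWaitForNavigation(page);
// await page.waitForSelector('.repo-list');
```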

Once we know that the .repo-list element is in the DOM tree, we can retrieve the title by using the page.evaluate method.

First, we select .repo-list by passing the repoList variable to querySelector. We then chain querySelectorAll to get all the li elements and pick the first element from the returned NodeList.

Finally, we chain another querySelector that targets the .f4.text-normal element, which holds the title that we access through innerText.

const title = await page.evaluate((repoList) => (
  document
    .querySelector(repoList)
    .querySelectorAll('li')[0]
    .querySelector('.f4.text-normal')
    .innerText
), repoList);
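
The chained lookup above throws a TypeError if any link in the chain is missing, for example when a search returns no results. A more defensive variant, sketched below, returns null instead. It relies on page.$eval, which runs a function against the first element matching a selector; the function name extractFirstTitle is an assumption made for this example.

```javascript
// Hypothetical page function: receives the .repo-list element and returns the
// first result's title, or null when the list is empty or the markup differs.
function extractFirstTitle(listElement) {
  const firstItem = listElement.querySelector('li'); // first li in document order
  if (!firstItem) {
    return null;
  }
  const titleElement = firstItem.querySelector('.f4.text-normal');
  return titleElement ? titleElement.innerText.trim() : null;
}

// Usage with Puppeteer (page.$eval itself throws if '.repo-list' is not found):
// const title = await page.$eval('.repo-list', extractFirstTitle);
```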

Now we can wrap everything in a function and export it to another file for use, where we will set up an endpoint for the Express server to serve the data.

The final version of service.mjs exports an asynchronous function that takes the keyword as input. Inside the function, we use a try...catch block to catch and rethrow any errors so that the caller can handle them. Finally, we call browser.close to close the browser we started.

import puppeteer from 'puppeteer';

const service = async (keyword) => {
  const browser = await puppeteer.launch({
    headless: true,
    defaultViewport: {
      width: 1920,
      height: 1080
    }
  });

  const inputField = '.header-search-input';
  const repoList = '.repo-list';

  try {
    const page = await browser.newPage();
    await page.goto('https://www.github.com');

    await page.waitForSelector(inputField, { visible: true });
    await page.focus(inputField);
    await page.keyboard.type(keyword);

    await page.keyboard.press('Enter');

    await page.waitForNavigation();
    await page.waitForSelector(repoList);

    const title = await page.evaluate((repoList) => (
      document
        .querySelector(repoList)
        .querySelectorAll('li')[0]
        .querySelector('.f4.text-normal')
        .innerText
    ), repoList);

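    return title;
  } catch (e) {
    // Rethrow so the caller (the Express handler) can respond with an error.
    throw e;
  } finally {
    // Close the browser whether scraping succeeded or failed.
    await browser.close();
  }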
}

export default service;

Create an Express server

We need a single endpoint to serve the data, where we capture the keyword to search for as a route parameter. Because we defined the route path as /:keyword, it is exposed in the req.params object under the keyword key. Next, we call the service function, passing the keyword as an input parameter so that Puppeteer runs the search.

The contents of server.mjs are as follows.

import express from 'express';
import service from './service.mjs';

const app = express();

app.listen(5000);

app.get('/:keyword', async (req, res) => {
  const { keyword } = req.params;

  try {
    const response = await service(keyword);
    res.status(200).send(response);
  } catch (e) {
    res.status(500).send(e);
  }
});

From the terminal, run node server.mjs to start the server. In another terminal window, send a request to the endpoint using curl. This should return the title of the first result as a string.

curl localhost:5000/react

Note that this server is basic. In production, you should secure your endpoints and set up CORS in case you need to send requests from the browser instead of the server.
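One small detail worth checking in the handler is the error response. An Error’s message is not an enumerable property, so when Express serializes the error object passed to send as JSON, the client receives an empty object. The sketch below converts the error to a plain object first; the helper name toErrorBody and the shape of the response body are assumptions for illustration.

```javascript
// Hypothetical helper: turns any thrown value into a plain, JSON-friendly object.
function toErrorBody(error) {
  const message = error instanceof Error ? error.message : String(error);
  return { error: message };
}

// Serializing the Error directly loses the message:
console.log(JSON.stringify(new Error('Navigation timeout')));              // {}
console.log(JSON.stringify(toErrorBody(new Error('Navigation timeout')))); // {"error":"Navigation timeout"}

// In the route handler, for example:
// res.status(500).send(toErrorBody(e));
```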

Deploy to Google Cloud Functions

Now we will deploy this service to a serverless cloud function. The main difference between a cloud function and a server is that a cloud function is spun up on request and stays alive for a period of time to respond to subsequent requests, while a server is always running.

Deploying to Google Cloud Functions is very straightforward. However, in order to run Puppeteer successfully, you should be aware of a few settings.

First, allocate enough memory for your cloud function. Based on my tests, 512MB is sufficient for Puppeteer, but if you are having problems with memory, allocate more.

The contents of package.json should look like this.

{
  "name": "puppeteer-example",
  "version": "0.0.1",
  "type": "module",
  "dependencies": {
    "puppeteer": "^10.2.0",
    "express": "^4.17.1"
  }
}

We have puppeteer and express as dependencies and set "type": "module" to use ES module syntax.

Now create a file called service.js and fill it with what we used in service.mjs.

The contents of index.js are as follows.

import express from 'express';
import service from './service.js';

const app = express();

app.get('/:keyword', async (req, res) => {
  const { keyword } = req.params;

  try {
    const response = await service(keyword);
    res.status(200).send(response);
  } catch (e) {
    res.status(500).send(e);
  }
});

export const run = app;

Here, we import express from the express package and our function from service.js.

Unlike the server code we tested on localhost, we don’t need to listen on a port because this is handled automatically.

Also, unlike on localhost, we need to export the app object in a cloud function. Set the entry point to run, or to whatever variable name you chose to export. By default, this is set to helloWorld.

To test our cloud function, let’s make it public. Select the cloud function (note: tick the checkbox next to it), and then click the Permissions button in the top bar menu. This will display a side panel where you can add principals. Click Add principal and search for allUsers in the New principals field. Finally, select Cloud Functions Invoker as the role.

Note that once this principal is added, anyone with the trigger link can invoke this function. This is great for testing, but be sure to implement authentication for your cloud function to avoid unwanted invocations that will be reflected in the bill.

Now click on your function to see its details. Navigate to the Trigger tab, where you will find the trigger URL. Open the link with a keyword appended as the path (for example, /react) to invoke the function, which returns the title of the first repository in the list. Now you can use this link in your application to get the data.
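As a rough sketch of how an application might call the deployed function, the helpers below append an encoded keyword to the trigger URL and read the response as text. The function names are made up, the trigger URL is a placeholder you would replace with your own, and the code assumes a runtime with a global fetch (a modern browser or Node.js 18 and later).

```javascript
// Hypothetical helper: appends the URL-encoded keyword to the trigger URL.
function buildRequestUrl(triggerUrl, keyword) {
  const base = triggerUrl.replace(/\/+$/, ''); // drop any trailing slashes
  return `${base}/${encodeURIComponent(keyword)}`;
}

// Hypothetical helper: calls the cloud function and returns the title as a string.
async function fetchFirstTitle(triggerUrl, keyword) {
  const response = await fetch(buildRequestUrl(triggerUrl, keyword));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}

// Usage with a placeholder trigger URL:
// fetchFirstTitle('https://REGION-PROJECT_ID.cloudfunctions.net/run', 'react')
//   .then((title) => console.log(title));
```

Encoding the keyword matters for multi-word searches, because spaces and other special characters are not valid in a URL path segment.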

Conclusion

We have shown you how to use Puppeteer to automate basic interactions with a web page, scrape its content, and serve that content with the Express framework on a Node.js server. We then deployed it to Google Cloud Functions to make it a microservice that can be integrated and used in another application.

The post “Creating Puppeteer microservices for deployment to Google Cloud Functions” originally appeared on the LogRocket blog.