Puppeteer is not just for crawlers: this is what it is really for!

You may know Pyppeteer, the Python port of the test automation tool Puppeteer. However, Puppeteer and Pyppeteer can do more than build crawlers. Today we introduce another trick with Puppeteer: publishing articles automatically.

Preface

Automated testing is important and convenient in software development, but automated testing tools can also simulate human operations outside of testing. That is why some E2E testing tools (for example Selenium, Puppeteer and Appium) are also frequently used by crawler engineers to capture data, thanks to their powerful simulation capabilities.

There are many crawler tutorials on the Internet, but they only cover how to get data. We also know that these browser-based solutions carry a large performance overhead and are inefficient, so they are not the best choice for crawlers.

This article looks at another use of automated testing tools: automating operations that a person would otherwise perform by hand. The tool we use is Puppeteer, Google's open source testing framework, which automates Chromium (Google's open source browser). Step by step, we will show how to use Puppeteer to automatically publish articles on Juejin (Nuggets).

Principles of automated test tools

Automated testing tools work by driving the browser programmatically and simulating interactions with the target web page, such as clicking, typing and navigating. They can also retrieve a page's DOM or HTML, so they can easily collect web data as well.

In addition, data that a dynamic site renders with JavaScript is usually hard to obtain by other means, but automated testing tools get it easily because they load the HTML into a real browser and let it run.

About Puppeteer

Here is the definition from Puppeteer's GitHub home page.

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.

Loco's note: headless means running the browser without displaying its GUI. This improves performance, because rendering the interface is resource-intensive.

Here’s what Puppeteer can do:

• Generate screenshots and PDFs of pages;
• Crawl a single-page application and generate pre-rendered content (i.e. SSR, server-side rendering);
• Automate form submission, UI testing, keyboard input, and more;
• Create an up-to-date, automated testing environment;
• Capture a timeline trace of a site to help diagnose performance issues;
• Test Chrome extensions;
• …

Puppeteer installation

Installing Puppeteer is not difficult; just make sure Node.js is installed in your environment and npm works.

Since the official installation tutorial doesn't take Chromium into account, we use a third-party library called puppeteer-chromium-resolver, which lets you customize Puppeteer and manage the Chromium download.

Run the following command to install Puppeteer:

npm install puppeteer-chromium-resolver --save

For detailed usage of puppeteer-chromium-resolver, see: https://www.npmjs.com/package/puppeteer-chromium-resolver.

Common Puppeteer commands

The official API documentation for Puppeteer is at https://pptr.dev/, which describes Puppeteer's public interfaces in detail. Here we list only some commonly used commands.

Launch/close the browser

// import puppeteer-chromium-resolver
const PCR = require('puppeteer-chromium-resolver')

// resolve Chromium (downloading it if necessary)
const pcr = await PCR({
  revision: '',
  detectionPath: '',
  folderName: '.chromium-browser-snapshots',
  hosts: ['https://storage.googleapis.com', 'https://npm.taobao.org/mirrors'],
  retry: 3,
  silent: false
})

// launch the browser
const browser = await pcr.puppeteer.launch({ /* ... */ })

// close the browser
await browser.close()

Create a page

const page = await browser.newPage()

Navigate

await page.goto('https://baidu.com')

Wait

await page.waitFor(3000)
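Later Puppeteer releases deprecated page.waitFor, so it may help to know that a fixed pause does not need Puppeteer at all. A minimal sketch, assuming a helper named sleep (a hypothetical name, not part of Puppeteer or of the original code):

```javascript
// Resolves after `ms` milliseconds; a plain-JS stand-in for a fixed wait.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms))
}

// Example: pause briefly and report how long the pause actually took.
async function demo() {
  const start = Date.now()
  await sleep(50)
  return Date.now() - start
}
```

Inside an async function, await sleep(3000) pauses for about three seconds, just like the call above.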

Get page elements

const el = await page.$(selector)

Click an element

await el.click()

Type text

await el.type(text)
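The commands above can be chained into one small flow. The following is only a sketch: fillAndSubmit is a hypothetical helper (its name and parameters are assumptions made for illustration), and it expects a browser that has already been launched.

```javascript
// Opens a page, navigates, types into one field and clicks one button.
// `browser` is whatever launch() returned; the selectors come from the caller.
async function fillAndSubmit(browser, url, inputSelector, submitSelector, text) {
  const page = await browser.newPage()
  await page.goto(url)

  const elInput = await page.$(inputSelector)
  const elSubmit = await page.$(submitSelector)
  if (!elInput || !elSubmit) {
    // page.$ resolves to null when nothing matches the selector
    throw new Error(`element not found on ${url}`)
  }

  await elInput.type(text)
  await elSubmit.click()
  return page // hand the page back so the caller can keep working with it
}
```

Checking for null right after page.$ produces a clearer error than letting a later el.type call fail.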

Execute Console code (important)

const res = await page.evaluate((arg1, arg2, arg3) => {
    // anything frontend
    return 'frontend awesome'
}, arg1, arg2, arg3)

This is probably the most powerful API in Puppeteer. Any developer familiar with front-end technology knows the Console in Chrome Developer Tools, where any JS code can be run: triggering click events, getting elements, adding, deleting or changing elements, and so on. Our automatic publisher makes heavy use of this API.

As you can see, the evaluate method accepts extra arguments and passes them to the callback function, which runs in the front end. This lets us inject any data from the back end into the front-end DOM, such as the article title and the article content.

In addition, the callback's return value becomes the return value of evaluate and is assigned to res, which is often used for fetching data.
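One consequence of this design is that the callback cannot see variables from the surrounding Node.js code; data has to travel through the arguments. The sketch below illustrates this with fakePage, a stand-in object written for this example only (it is not Puppeteer). It mimics the way evaluate ships the function as source text and the arguments as serialized data. buildHeadline is likewise a hypothetical helper.

```javascript
// Stand-in for a Puppeteer page (an assumption for illustration only):
// the function travels as source text and the arguments as serialized data,
// so closures over Node.js variables do not survive the trip.
const fakePage = {
  async evaluate(fn, ...args) {
    const remoteFn = new Function(`return (${fn.toString()})`)()
    const remoteArgs = JSON.parse(JSON.stringify(args))
    return remoteFn(...remoteArgs)
  },
}

// Data reaches the callback only through the extra arguments,
// and the callback's return value comes back as the result.
async function buildHeadline(page, article) {
  return page.evaluate((title, author) => {
    // In a real browser this body could also touch `document`.
    return `${title} by ${author}`
  }, article.title, article.author)
}
```

With a real page object the call looks the same; only the place where the callback executes changes.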

Note that all of the code above uses the await keyword. It belongs to the async/await syntax introduced in ES2017, which is syntactic sugar over ES6 Promises and makes asynchronous code easier to read and understand. If you are not familiar with async/await, see this article: https://juejin.cn/post/6844903487805849613.
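To make the "syntactic sugar" point concrete, here is the same logic written both ways. fetchTitle is a made-up asynchronous step used only for this comparison; any function that returns a Promise would behave the same.

```javascript
// A stand-in asynchronous step (hypothetical).
function fetchTitle() {
  return Promise.resolve('hello')
}

// Promise chaining
function withThen() {
  return fetchTitle().then((title) => title.toUpperCase())
}

// The same logic with async/await
async function withAwait() {
  const title = await fetchTitle()
  return title.toUpperCase()
}
```

Both functions return a Promise that resolves to the same value; the second simply reads top to bottom.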

Puppeteer: automatically publishing articles on Juejin (Nuggets)

Talk is cheap, show me the code.

Below, we demonstrate Puppeteer's capabilities with an example of automatic publishing. The platform used in this article is Juejin.

Why Juejin? Because logging in to Juejin does not require a CAPTCHA, unlike some other sites such as CSDN, where it adds complexity. You only need an account name and password to log in.

To make things easier for beginners, we start with the basic structure of the crawler. (To save space, we skip browser and page initialization and cover only the highlights.)

Basic structure

To keep the crawler tidy, we factor the steps of publishing an article out into a base class. We may need to support more than one platform, and with an object-oriented design the other platforms can simply inherit from the base class.

The general structure of the base class of this crawler is as follows:

We don’t need to understand all of them, just that the entry point to our startup is the run method.

Every method is marked async, which means it returns a Promise; to run such calls one after another, each call must be prefixed with await.

The run method looks like this:

async run() {
  // initialize
  await this.init()

  if (this.task.authType === constants.authType.LOGIN) {
    // log in
    await this.login()
  } else {
    // use cookies
    await this.setCookies()
  }

  // navigate to the editor
  await this.goToEditor()

  // enter the editor content
  await this.inputEditor()

  // publish the article
  await this.publish()

  // close the browser
  await this.browser.close()
}

As you can see, the crawler is first initialized to do some basic configuration. Then the task's authentication type (authType) decides whether to pass the site's authentication by logging in or by using cookies (this article only covers login). Next it navigates to the editor, enters the editor content and publishes the article. Finally it closes the browser, and the publishing task is complete.
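The relationship between the base class and a platform class can be sketched as follows. This is an illustration only, not the original source: the method bodies are placeholders that record their own names, the steps array exists purely for demonstration, authType is compared with a plain string instead of a constant, and close() stands in for closing the browser.

```javascript
// Skeleton of the base class: run() fixes the order, subclasses fill in steps.
class BaseSpider {
  constructor(task) {
    this.task = task
    this.steps = [] // records executed steps, for illustration only
  }

  async init() { this.steps.push('init') }
  async login() { this.steps.push('login') }
  async setCookies() { this.steps.push('setCookies') }
  async goToEditor() { this.steps.push('goToEditor') }
  async inputEditor() { this.steps.push('inputEditor') }
  async publish() { this.steps.push('publish') }
  async close() { this.steps.push('close') }

  async run() {
    await this.init()
    if (this.task.authType === 'login') {
      await this.login()
    } else {
      await this.setCookies()
    }
    await this.goToEditor()
    await this.inputEditor()
    await this.publish()
    await this.close()
  }
}

// A platform class only overrides the steps that differ.
class JuejinSpider extends BaseSpider {
  async inputEditor() {
    await super.inputEditor()
    this.steps.push('afterInputEditor')
  }
}
```

Calling new JuejinSpider({ authType: 'login' }).run() executes the steps in the order fixed by the base class, with the platform-specific step slotted in.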

Login

async login() {
  logger.info(`logging in... navigating to ${this.urls.login}`)
  await this.page.goto(this.urls.login)
  let errNum = 0
  while (errNum < 10) {
    try {
      await this.page.waitFor(1000)
      const elUsername = await this.page.$(this.loginSel.username)
      const elPassword = await this.page.$(this.loginSel.password)
      const elSubmit = await this.page.$(this.loginSel.submit)
      await elUsername.type(this.platform.username)
      await elPassword.type(this.platform.password)
      await elSubmit.click()
      await this.page.waitFor(3000)
      break
    } catch (e) {
      errNum++
    }
  }
  this.status.loggedIn = errNum !== 10
  if (this.status.loggedIn) {
    logger.info('Logged in')
  }
}

Juejin's login page is https://juejin.im/login, so we navigate there first.

We then loop up to 10 times, trying to enter the username and password. If all 10 attempts fail, the login status is set to false; otherwise it is set to true.

Inside the loop we use the page.$(selector) and el.type(text) APIs to get the elements and type into them. The final elSubmit.click() submits the form.
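The retry loop can also be factored into a small reusable helper. The sketch below assumes a hypothetical function named retry (it is not part of Puppeteer or of the original source) and keeps the same behaviour: up to a fixed number of attempts, with true for success and false for failure.

```javascript
// Runs `attempt` until it succeeds or `maxAttempts` is reached.
// Returns true on success, false if every attempt threw.
async function retry(attempt, maxAttempts = 10, delayMs = 1000) {
  for (let errNum = 0; errNum < maxAttempts; errNum++) {
    try {
      await attempt()
      return true
    } catch (e) {
      // e.g. page.$ resolved to null because the form has not rendered yet,
      // so the following el.type call threw; wait and try again
      await new Promise((resolve) => setTimeout(resolve, delayMs))
    }
  }
  return false
}
```

With such a helper, the login status would simply be the value that retry resolves to after wrapping the form-filling steps.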

Editing the article

We skip the step of navigating to the article editor because it is simple: just call page.goto(url). See the source code mentioned at the end of the article for reference.

The code for the input editor looks like this:

async inputEditor() {
  logger.info('input editor title and content')

  // enter the title
  await this.page.evaluate(this.inputTitle, this.article, this.editorSel, this.task)
  await this.page.waitFor(3000)

  // enter the content
  await this.page.evaluate(this.inputContent, this.article, this.editorSel)
  await this.page.waitFor(3000)

  // enter the footnote
  await this.page.evaluate(this.inputFooter, this.article, this.editorSel)
  await this.page.waitFor(3000)

  await this.page.waitFor(10000)

  // follow-up handling
  await this.afterInputEditor()
}

First we enter the title by calling page.evaluate with this.inputTitle (the title callback) and the other arguments. The content callback is then invoked the same way, followed by the footnote callback. Finally, the follow-up handler is called.

Let's look at this.inputTitle in more detail:

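  async inputTitle(article, editorSel, task) {
    // runs in the page: locate the title field
    const el = document.querySelector(editorSel.title)
    // focus, select all, then delete any existing text or placeholder
    el.focus()
    el.select()
    document.execCommand('delete', false)
    // insert the title, preferring the task's title over the article's
    document.execCommand('insertText', false, task.title || article.title)
  }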

We first get the title element through the browser's standard document.querySelector API. In case the title field already contains a placeholder, we use el.focus() (take focus), el.select() (select all) and document.execCommand('delete', false) (delete) to remove it. We then enter the title with document.execCommand('insertText', false, text).

Next comes the content input, which works much like the title input:

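  async inputContent(article, editorSel) {
    // runs in the page: locate the content field
    const el = document.querySelector(editorSel.content)
    // focus, select all, then delete any existing text
    el.focus()
    el.select()
    document.execCommand('delete', false)
    // insert the whole article body in one step
    document.execCommand('insertText', false, article.content)
  }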

One might ask: why not use el.type(text) to enter the content instead of document.execCommand?

We avoid el.type because it fully simulates human keystrokes, which would break the existing formatting of the content. With document.execCommand, the content is inserted in one go.
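Since the title and content callbacks share the same focus, select, delete and insert sequence, that sequence can be written once. The following is a sketch under assumptions: clearAndInsert is a hypothetical helper, and the document object is passed in as a parameter only so the function can be exercised outside a browser.

```javascript
// Clears a field and inserts text in one step, mirroring inputTitle/inputContent.
// `doc` is the page's document object.
function clearAndInsert(doc, selector, text) {
  const el = doc.querySelector(selector)
  if (!el) return false // nothing matched the selector

  el.focus()
  el.select()
  doc.execCommand('delete', false)
  doc.execCommand('insertText', false, text)
  return true
}
```

Because page.evaluate ships only the callback's own source to the browser, a helper like this would have to be declared inside the callback (or otherwise injected into the page) before it could be used there.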

The base class BaseSpider reserves a method for selecting categories, tags and so on. In the inheriting class JuejinSpider it looks like this:

async afterInputEditor() {
  // click the publish button to open the publish popup
  const elPubBtn = await this.page.$('.publish-popup')
  await elPubBtn.click()
  await this.page.waitFor(5000)

  // select the category
  await this.page.evaluate((task) => {
    document.querySelectorAll('.category-list > .item').forEach(el => {
      if (el.textContent === task.category) {
        el.click()
      }
    })
  }, this.task)
  await this.page.waitFor(5000)

  // select the tag
  const elTagInput = await this.page.$('.tag-input > input')
  await elTagInput.type(this.task.tag)
  await this.page.waitFor(5000)
  await this.page.evaluate(() => {
    document.querySelector('.suggested-tag-list > .tag:nth-child(1)').click()
  })
  await this.page.waitFor(5000)
}

Publishing

Publishing is relatively easy: just click the publish button. The code is as follows:

async publish() {
  logger.info('publishing article')

  // click the publish button
  const elPub = await this.page.$(this.editorSel.publish)
  await elPub.click()
  await this.page.waitFor(10000)

  // follow-up handling
  await this.afterPublish()
}

this.afterPublish validates the publishing status and gets the URL of the published article; there is no space to go into the details here.

Source code

Due to the length of this article, not all of the automatic publishing features are covered. To learn more, send the message [Nuggets automatic release] to the WeChat official account [NightTeam] to obtain the source address. Note that the account is [NightTeam], not this one.

Conclusion

This article showed how to use Puppeteer to drive the Chromium browser and publish articles on Juejin.

Puppeteer is used by many people to fetch data, but we think it is inefficient and expensive and not suitable for large-scale fetching.

Instead, Puppeteer is better suited for automated tasks such as manipulating the browser to publish articles, post, submit forms and so on.

Puppeteer is similar to RPA (Robotic Process Automation) in that both automate tedious, repetitive tasks, but RPA is not limited to the browser: its scope is the entire operating system, which makes it more powerful and also more expensive.

As a relatively lightweight automation tool, Puppeteer is ideally suited to web automation. The automatic publishing feature described in this article is part of ArtiPub, an open source platform for publishing one article to multiple platforms.

Article author: “NightTeam” – Zhang Yeqing

Polish, proofread: “NightTeam” – Loco

Founded in 2019, NightTeam includes Cui Qingcai, Zhou Ziqi, Chen Xiangan, Tang Yifei, Feng Wei, Cai Jin, Dai Huangjin, Zhang Yeqing and Wei Shidong.

Our programming languages include but are not limited to Python, Rust, C++ and Go, and our fields include crawlers, deep learning, service development, object storage and more. The team is neither good nor evil; we only do what we think is right. Please be careful.

Cui Qingcai

Blogger at Jingmi and author of Python3 web crawler development