1. Introduction to Puppeteer and installation

Puppeteer is a Node library that provides a high-level API for controlling Chromium through the DevTools protocol. After Google released headless Chrome, I abandoned Selenium, because Puppeteer is so friendly to Node.js developers: it installs with a single npm i, with no need to install other dependencies (I was too naive back then O(╥﹏╥)O; it is actually not that simple).

My development machine runs macOS and the server runs CentOS 7. On macOS it really is simple: just run npm i puppeteer. If the installation fails, try one of the following solutions:

# 1. Set an environment variable to skip the Chromium download (no longer worked as of 2018-09-03)
#    Note: set is Windows cmd syntax; in a macOS/Linux shell use export instead
set PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1

# 2. Download only the module and skip its install scripts; Chromium must then be downloaded separately (worked as of 2018-09-03)
npm i --save puppeteer --ignore-scripts

# 3. Since v1.7.0, Puppeteer also ships a puppeteer-core package, which contains only the core library and does not download Chromium by default
npm i puppeteer-core

# If puppeteer cannot be installed, the Taobao mirror is recommended
npm config set registry="https://registry.npm.taobao.org"

If you downloaded Chromium yourself, add the following options when launching the headless browser:

this.browser = await puppeteer.launch({
  // On macOS this should be "xxx/Chromium.app/Contents/MacOS/Chromium"; on Linux, "/usr/bin/chromium-browser"
  executablePath: 'Chromium install path',
  // Disable the sandbox
  args: ['--no-sandbox', '--disable-dev-shm-usage']
});

See the Puppeteer use cases to learn more about Puppeteer.

2. Tips and tricks

Lazy loading screenshot

When taking screenshots or crawling, we often run into pages that load their data lazily, so the first screen does not show everything. For lazy loading, the workaround is to keep scrolling until the page reaches the bottom. And if the lazy loading has no bottom? Try calling the site's API directly. If you know a cleverer way, please point it out.

page.evaluate(pageFunction, ...args): this function lets us run code in the page context, where the built-in DOM selectors are available.

pageFunction is the function to be evaluated in the page context; ...args are the arguments passed to pageFunction.

const result = await page.evaluate((param1, param2, param3) => {
  return Promise.resolve(8 + param1 + param2 + param3);
}, param1, param2, param3);

// You can also pass a string:
console.log(await page.evaluate('1 + 2')); // prints "3"
const x = 10;
console.log(await page.evaluate(`1 + ${x}`)); // prints "11"

Code: using lazy loading on Jianshu as an example

/** Automatically scroll a lazy-loaded page */
const path = require('path');
const puppeteer = require('puppeteer-core');

const log = console.log;
(async () => {
  const browser = await puppeteer.launch({
    // executablePath: path.join(__dirname, './chromium/Chromium.app/Contents/MacOS/Chromium'),
    // Turn off headless mode so that the browser window opens
    headless: false,
    args: ['--no-sandbox', '--disable-dev-shm-usage']
  });
  const page = await browser.newPage();
  await page.goto('https://www.jianshu.com/u/40909ea33e50');
  await autoScroll(page);

  // Full-page screenshot
  await page.screenshot({
    path: 'auto_scroll.png',
    type: 'png',
    fullPage: true
  });
  await browser.close();
})();

async function autoScroll(page) {
  log('[AutoScroll begin]');
  await page.evaluate(async () => {
    await new Promise((resolve, reject) => {
      // Total height scrolled so far
      let totalHeight = 0;
      // The distance to scroll down each time
      let distance = 100;
      // Scroll repeatedly with setInterval
      let timer = setInterval(() => {
        let scrollHeight = document.body.scrollHeight;

        // Perform the scroll operation
        window.scrollBy(0, distance);

        // Stop once the scrolled distance reaches the current page height
        totalHeight += distance;
        if (totalHeight >= scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });

  log('[AutoScroll done]');
  // After lazy loading has finished, full screenshots can be taken or data can be crawled
  // do what you like ...
}
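
For pages whose lazy loading never reaches a bottom, one option is to cap the number of scroll steps. The sketch below is a variant of autoScroll under that assumption: maxScrolls, nextScrollState and autoScrollBounded are names made up for this illustration, not part of Puppeteer. The stopping rule is also written as a plain function so it can be checked without a browser.

```javascript
// Pure stopping rule: given the state so far and the current page height,
// decide whether to keep scrolling. maxScrolls is a hypothetical cap for
// pages that keep loading content forever.
function nextScrollState(state, scrollHeight, distance, maxScrolls) {
  const totalHeight = state.totalHeight + distance;
  const scrolls = state.scrolls + 1;
  const done = totalHeight >= scrollHeight || scrolls >= maxScrolls;
  return {totalHeight, scrolls, done};
}

// Browser-side loop applying the same rule. The callback passed to
// page.evaluate runs inside the page, so the rule is repeated inline there.
// `page` is a Puppeteer Page.
async function autoScrollBounded(page, distance = 100, maxScrolls = 200) {
  await page.evaluate(async (distance, maxScrolls) => {
    await new Promise(resolve => {
      let totalHeight = 0;
      let scrolls = 0;
      const timer = setInterval(() => {
        window.scrollBy(0, distance);
        totalHeight += distance;
        scrolls += 1;
        if (totalHeight >= document.body.scrollHeight || scrolls >= maxScrolls) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  }, distance, maxScrolls);
}
```

With the defaults above, scrolling stops after at most 200 steps (about 20 seconds), even if the page keeps growing.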

Precise element screenshots

A precise screenshot, as the name suggests, captures exactly the area an element occupies on the page. In Puppeteer terms, this means using the clip parameter of screenshot and positioning the capture by the element's coordinates relative to the viewport (x, y) together with its width and height. Of course, the element selector has to be accurate; otherwise no amount of screenshot precision will help.

  • The clip parameter of page.screenshot
  • element.getBoundingClientRect(): returns the position of the element relative to the viewport (the returned object includes left, top, width, and height); the related background is easy to find online
  • page.$eval: runs document.querySelector within the page and passes the matched element as the first argument to pageFunction
const path = require('path');
const puppeteer = require('puppeteer-core');

const log = console.log;
(async () => {
  const browser = await puppeteer.launch({
    // executablePath: path.join(__dirname, './chromium/Chromium.app/Contents/MacOS/Chromium'),
    // Turn off headless mode so that the browser window opens
    headless: false,
    args: ['--no-sandbox', '--disable-dev-shm-usage']
  });
  const page = await browser.newPage();
  await page.goto('https://www.jianshu.com/');
  const pos = await getElementBounding(page, '.board');

  // Clipped screenshot
  await page.screenshot({
    path: 'element_bounding.png',
    type: 'png',
    clip: {
      x: pos.left,
      y: pos.top,
      width: pos.width,
      height: pos.height
    }
  });
  await browser.close();
})();

async function getElementBounding(page, element) {
  log('[GetElementBounding]: ', element);

  const pos = await page.$eval(element, e => {
    // Equivalent to running the following inside page.evaluate:
    // document.querySelector(element).getBoundingClientRect()
    const {left, top, width, height} = e.getBoundingClientRect();
    return {left, top, width, height};
  });
  log('[Element position]: ', JSON.stringify(pos, undefined, 2));
  return pos;
}
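
The conversion from a bounding rect to a clip object can be isolated in a small helper. The following is a minimal sketch: toClip and screenshotElement are hypothetical names, and adding the window scroll offset is an assumption that only matters when the page itself has been scrolled (as far as I know, clip is measured from the top-left of the page rather than the viewport). The second function shows the built-in alternative, elementHandle.screenshot, which lets Puppeteer work out the area itself.

```javascript
// Convert a viewport-relative rect (as returned by getBoundingClientRect)
// into a clip object for page.screenshot. `scroll` is the window scroll
// offset; adding it is an assumption for pages that have been scrolled.
function toClip(rect, scroll = {x: 0, y: 0}) {
  if (rect.width <= 0 || rect.height <= 0) {
    throw new Error('Element has no visible area to capture');
  }
  return {
    x: rect.left + scroll.x,
    y: rect.top + scroll.y,
    width: rect.width,
    height: rect.height
  };
}

// Alternative: let Puppeteer compute the area from the element handle.
// `page` is a Puppeteer Page; `file` is the output path.
async function screenshotElement(page, selector, file) {
  const handle = await page.$(selector);
  if (!handle) {
    throw new Error(`No element matches ${selector}`);
  }
  await handle.screenshot({path: file});
}
```

For the example above, toClip(pos) could be passed directly as the clip option, since the page has not been scrolled at that point.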

OK, at this point we can take screenshots of most elements. What remains are elements that sit inside inner-scrolling containers.

Screenshots of inner scroll elements

Inner scrolling: in contrast to traditional window scrolling, the main scroll bar belongs to an element inside the page rather than to the browser window. The most common case is an admin dashboard, where the left sidebar and the right content area each have their own scroll bar.

Think of the NetEase Cloud Music web page: its first screen has two inner scroll bars, and to see more playlists you have to drag the scroll bar down. Screenshots inside inner-scrolling containers work the same way: scroll so that the target element is exposed within the visible area, then use its viewport coordinates to take a precise screenshot.

Steps:

  1. Get the coordinates of the target element and determine whether it is within the currently visible area. If it is already inside the viewport, no scrolling is needed
  2. Because this is inner scrolling, the target element must be wrapped in a parent element that has a scroll bar; scrolling that parent brings the target element into view. So this step determines the selector of the parent element
  3. Scroll the parent element from within the page (via window.scrollBy, or by setting scrollLeft / scrollTop) so that the target element appears fully inside the viewport
  4. Because the container has scrolled, get the coordinates of the target element again (getBoundingClientRect)
  5. Take a screenshot with the new coordinates
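
The arithmetic behind steps 1 and 3 can be sketched as a pure function. scrollDelta is a hypothetical name; it takes the rect from getBoundingClientRect and the viewport size, and returns how far the container has to scroll. Handling elements that are hidden above or to the left (negative offsets) is an addition made for this sketch.

```javascript
// How far must the container scroll so that the target rect
// (viewport-relative) fits inside the viewport?
// Returns {x: 0, y: 0} when the element is already fully visible.
function scrollDelta(pos, viewport) {
  const axis = (start, size, limit) => {
    // Hidden above or to the left: scroll back by the negative offset
    if (start < 0) return start;
    // Hidden below or to the right: scroll forward by the overflow
    if (start + size > limit) return start + size - limit;
    return 0;
  };
  return {
    x: axis(pos.left, pos.width, viewport.width),
    y: axis(pos.top, pos.height, viewport.height)
  };
}
```

For a 1920x600 viewport, an element whose top is at 900 and whose height is 20 needs the container to scroll down by 320 pixels.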

Here is a small detail: how to tell whether an element has a scroll bar. If an element has no horizontal scroll bar, setting its scrollLeft has no effect; only scrolling the window will work.

// If scrollWidth is greater than clientWidth, a horizontal scroll bar is present
element.scrollWidth > element.clientWidth

// If scrollHeight is greater than clientHeight, a vertical scroll bar is present
element.scrollHeight > element.clientHeight
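
The two comparisons can be wrapped in a helper and run inside the page through page.$eval. This is only a sketch; scrollbars and containerScrollbars are hypothetical names.

```javascript
// Works on anything exposing scrollWidth/clientWidth/scrollHeight/clientHeight,
// so it can be tried with plain objects as well as DOM elements.
function scrollbars(el) {
  return {
    horizontal: el.scrollWidth > el.clientWidth,
    vertical: el.scrollHeight > el.clientHeight
  };
}

// The same check executed inside the page. The callback runs in the browser,
// so the comparison is written inline rather than calling scrollbars().
// `page` is a Puppeteer Page.
async function containerScrollbars(page, selector) {
  return page.$eval(selector, el => ({
    horizontal: el.scrollWidth > el.clientWidth,
    vertical: el.scrollHeight > el.clientHeight
  }));
}
```

Note that this compares sizes only: an element with overflow: hidden can satisfy the condition without showing a scroll bar.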

Example code: using the official Node.js documentation as an example, take a screenshot of the TTY entry in the left column

/** Capture the li node that contains TTY in the left column */
const path = require('path');
const puppeteer = require('puppeteer-core');

const log = console.log;
(async () => {
  const browser = await puppeteer.launch({
    executablePath: path.join(__dirname, './chromium/Chromium.app/Contents/MacOS/Chromium'),
    // Turn off headless mode so that the browser window opens
    headless: false,
    args: ['--no-sandbox', '--disable-dev-shm-usage']
  });
  const page = await browser.newPage();
  await page.setViewport({width: 1920, height: 600});
  const viewport = page.viewport();

  // Official Node.js API documentation site.
  // Instead of sleeping with page.waitFor(1000) (1000 is a magic number that makes the code fragile),
  // let goto itself wait until the page has finished loading. A separate page.waitForNavigation()
  // after goto would wait for a navigation that never happens and time out.
  await page.goto('https://nodejs.org/dist/latest-v10.x/docs/api/', {
    // 20-second timeout
    timeout: 20000,
    // Consider the page loaded once there are no more network connections
    waitUntil: ['domcontentloaded', 'networkidle0']
  });

  // step2: determine the selector of the inner-scrolling parent element
  const containerEle = '#column2';
  // step1: determine the selector of the target element
  const targetEle = '#column2 ul:nth-of-type(2) li:nth-of-type(40)';

  // step1: get the coordinates of the target element in the current viewport
  let pos = await getElementBounding(page, targetEle);

  // Use the built-in DOM selector
  const ret = await page.evaluate(async (viewport, pos, element) => {

    // step1: determine whether the target element is currently visible
    const sumX = pos.width + pos.left;
    const sumY = pos.height + pos.top;

    // The distance to move along the X and Y axes
    const x = sumX <= viewport.width ? 0 : sumX - viewport.width;
    const y = sumY <= viewport.height ? 0 : sumY - viewport.height;

    const el = document.querySelector(element);

    // step3: scroll the element into the viewport
    // Check whether the container can scroll along x and y; if it cannot, scroll the window instead
    // If scrollWidth is greater than clientWidth, a horizontal scroll bar is present
    if (el.scrollWidth > el.clientWidth) {
      el.scrollLeft += x;
    } else {
      window.scrollBy(x, 0);
    }
    // If scrollHeight is greater than clientHeight, a vertical scroll bar is present
    if (el.scrollHeight > el.clientHeight) {
      el.scrollTop += y;
    } else {
      window.scrollBy(0, y);
    }

    return [el.scrollHeight, el.clientHeight];
  }, viewport, pos, containerEle);

  // step4: the target element started outside the viewport and has moved with its
  // inner-scrolling parent, so get its coordinates again.
  // Scrolling does not trigger a navigation, so no waitForNavigation is needed here.
  pos = await getElementBounding(page, targetEle);

  // step5: take the screenshot
  await page.screenshot({
    path: 'scroll_and_bounding.png',
    type: 'png',
    clip: {
      x: pos.left,
      y: pos.top,
      width: pos.width,
      height: pos.height
    }
  });
  await browser.close();
})();

// Same helper as in the previous example
async function getElementBounding(page, element) {
  const pos = await page.$eval(element, e => {
    const {left, top, width, height} = e.getBoundingClientRect();
    return {left, top, width, height};
  });
  log('[Element position]: ', JSON.stringify(pos, undefined, 2));
  return pos;
}

3. Pitfalls: installing Chromium on Linux

It turns out that installing Chromium in a Linux environment is an unforgettable experience. Installing puppeteer downloads Chromium automatically, and for well-known reasons the download often fails. After switching the mirror source, Chromium downloads successfully, but it reports various errors on startup because some of its dependencies are missing on Linux. After the required dependencies are installed, the code runs smoothly. However, the screenshots show that Chinese text in the browser is rendered as boxes. Install a Chinese font package, and Chinese characters display normally!

Best practices learned from these pitfalls

  • Install Chromium and the npm package separately: install only puppeteer-core and point executablePath at the Chromium you downloaded yourself, which greatly speeds up npm install
  • Switch the Linux package mirror to the Aliyun mirror so that Chromium downloads quickly
  • Deploy the project with Docker, to avoid the situation where everything works in local development but all kinds of problems appear in production
  • Avoid page.waitFor(1000): 1000 milliseconds is only a rough guess at the time needed. It is better to let the program decide for itself when the page is ready
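
One way to let the program decide for itself is to poll for a condition with a timeout instead of sleeping for a fixed time. The helper below is a generic sketch with hypothetical names (waitUntil, waitForElement); for the common case of waiting for an element, Puppeteer's own page.waitForSelector already does this.

```javascript
// Poll `predicate` until it returns a truthy value, or fail once `timeout`
// milliseconds have elapsed. Both option names are choices made for this sketch.
async function waitUntil(predicate, {timeout = 20000, interval = 100} = {}) {
  const deadline = Date.now() + timeout;
  while (true) {
    const value = await predicate();
    if (value) return value;
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeout} ms`);
    }
    await new Promise(resolve => setTimeout(resolve, interval));
  }
}

// Example: wait until the target element exists instead of sleeping.
// page.$ resolves to null while nothing matches the selector.
async function waitForElement(page, selector) {
  return waitUntil(() => page.$(selector));
}
```

The timeout still bounds the total wait, but the code continues as soon as the condition holds rather than always paying the full delay.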

Related solutions:

  • The official troubleshooting guide

  • Installing the dependency libraries on CentOS

yum install pango.x86_64 libXcomposite.x86_64 libXcursor.x86_64 libXdamage.x86_64 libXext.x86_64 libXi.x86_64 libXtst.x86_64 cups-libs.x86_64 libXScrnSaver.x86_64 libXrandr.x86_64 GConf2.x86_64 alsa-lib.x86_64 atk.x86_64 gtk3.x86_64 ipa-gothic-fonts xorg-x11-fonts-100dpi xorg-x11-fonts-75dpi xorg-x11-utils xorg-x11-fonts-cyrillic xorg-x11-fonts-Type1 xorg-x11-fonts-misc -y
  • Alpine installation tips
# Use the Aliyun mirror
echo "https://mirrors.aliyun.com/alpine/edge/main" > /etc/apk/repositories
echo "https://mirrors.aliyun.com/alpine/edge/community" >> /etc/apk/repositories
echo "https://mirrors.aliyun.com/alpine/edge/testing" >> /etc/apk/repositories

# Install Chromium and its dependencies, including Chinese font support
apk -U --no-cache update
apk -U --no-cache --allow-untrusted add zlib-dev xorg-server dbus ttf-freefont chromium wqy-zenhei@edge -f

Once installed, Chromium has to be run with the sandbox disabled, although this is not officially recommended.

Linux sandbox: in computer security, a sandbox is a mechanism for isolating programs in order to limit the permissions of untrusted processes. Sandboxing is often used to run untested or untrusted programs, so that they cannot disrupt the execution of other programs.

  • --no-sandbox: run with the sandbox disabled
  • --disable-dev-shm-usage: by default, Docker runs a container with a /dev/shm shared memory space of 64MB. This is usually too small for Chrome and causes Chrome to crash when rendering large pages. To fix this, run the container with docker run --shm-size=1gb to increase the size of /dev/shm. Since Chrome 65, you can instead launch the browser with the --disable-dev-shm-usage flag, which writes shared memory files to /tmp rather than /dev/shm.
const browser = await puppeteer.launch({
  args: ['--no-sandbox', '--disable-dev-shm-usage']
});
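
Since the Chromium path differs between the development machine and the server, the launch options can be assembled in one place. The sketch below assumes a hypothetical CHROMIUM_PATH environment variable (not something Puppeteer reads itself) and falls back to the paths mentioned earlier in this article; buildLaunchOptions and launchBrowser are made-up names.

```javascript
// Build launch options for puppeteer-core. CHROMIUM_PATH is a hypothetical
// environment variable chosen for this sketch; the default paths are the
// ones used earlier in the article.
function buildLaunchOptions(env = process.env, platform = process.platform) {
  const defaultPath = platform === 'darwin'
    ? './chromium/Chromium.app/Contents/MacOS/Chromium'
    : '/usr/bin/chromium-browser';
  return {
    executablePath: env.CHROMIUM_PATH || defaultPath,
    args: ['--no-sandbox', '--disable-dev-shm-usage']
  };
}

async function launchBrowser() {
  // Required lazily so that buildLaunchOptions can be used without the package
  const puppeteer = require('puppeteer-core');
  return puppeteer.launch(buildLaunchOptions());
}
```

Keeping the flags in one function also makes it easier to drop --no-sandbox later if the environment allows running with the sandbox enabled.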

4. Deploying the project with Docker containers

At the end of the project, I found that Chromium had to be installed for every deployment, and unexpected problems could occur each time. To save time for more meaningful things, I optimized the deployment process above with shell scripts and Docker containers.

Docker development process

  1. Choose the base image
  2. Write a Dockerfile based on the base image
  3. Build the project image from the Dockerfile
  4. Push the built image to a Docker registry; for a private deployment, export the image directly and import it into the customer's environment
  5. Pull the project image on the test/production machine, then create and run the Docker container
  6. Verify that the project is running properly

The following takes the deployment of a Puppeteer-based service as an example

Determine the base image

# Search for node images with docker search
docker search node

Visit Docker Hub for more detailed descriptions and versions

# Here we choose node:10-alpine as the base image
docker pull node:10-alpine

Writing the Dockerfile (this walkthrough is not exhaustive; please look up more detailed references online)

FROM: Specifies the base image; it must be the first non-comment instruction in the Dockerfile

FROM <image name>
FROM node:10-alpine

MAINTAINER: Sets the author of the image

# Not recommended
MAINTAINER <author name>
# Recommended: use LABEL to specify the image author
LABEL MAINTAINER="zhangqiling"

RUN: Executes a command in shell or exec form. The RUN instruction adds a new layer on top of the current image, and the committed result is used by the next instruction in the Dockerfile

RUN <command>

# RUN can execute any command and then create and commit a new layer on top of the current image
RUN echo "https://mirrors.aliyun.com/alpine/edge/main" > /etc/apk/repositories

# When executing a multi-line command, use \ to continue onto the next line
RUN apk -U add \
    zlib-dev \
    xorg-server

The intermediate images created by RUN instructions are cached and reused in the next build. If you do not want to use the cached images, specify the --no-cache flag at build time, for example docker build --no-cache.

CMD: Provides the default command for the container. A Dockerfile effectively allows only one CMD instruction; if there are several, only the last one takes effect

# There are three forms
CMD ["executable","param1","param2"]
CMD ["param1","param2"]
CMD command param1 param2

COPY: Copies files or directories from the build environment to an image

COPY <src>... <dest>
COPY ["<src>", "<dest>"]

# Copy the project to my_app
COPY . /workspase/my_app

ADD: Also copies files or directories from the build environment to the image

ADD <src>... <dest>
ADD ["<src>", "<dest>"]

Unlike COPY, the <src> of ADD can be a URL. In addition, if <src> is a local compressed tar archive, Docker unpacks it automatically.

WORKDIR: Specifies the working directory of the RUN, CMD, and ENTRYPOINT commands

WORKDIR /workspase/my_app

ENV: Sets environment variables

# Two ways
ENV <key> <value>
ENV <key>=<value>

VOLUME: Allows the container to access a directory on the host

VOLUME ["/data"]

EXPOSE: Specifies the port on which the container listens at run time

EXPOSE <port>

Below is a tested, working Dockerfile sample

A few points to note

  • Use the Aliyun mirror site in China to speed up installing the dependencies
  • Chinese is not displayed by default; the free WenQuanYi Chinese font must be installed, and this package can only be found at https://mirrors.aliyun.com/alpine/edge/testing/
  • The default time zone in the container is not UTC+8, which affects the timestamps in the logs, so the time zone needs to be reset
  • Running npm install in a Docker container on a CentOS machine reports an error; after running npm config set unsafe-perm true the installation succeeds. The cause is unclear (Docker on macOS does not have this problem)
# Pull the node image
FROM node:10-alpine

# Set the image author
LABEL MAINTAINER="[email protected]"

# Use the Aliyun mirror site, then install Chromium 68, the free WenQuanYi Chinese font, and other dependencies
RUN echo "https://mirrors.aliyun.com/alpine/v3.8/main/" > /etc/apk/repositories \
    && echo "https://mirrors.aliyun.com/alpine/v3.8/community/" >> /etc/apk/repositories \
    && echo "https://mirrors.aliyun.com/alpine/edge/testing/" >> /etc/apk/repositories \
    && apk -U --no-cache update && apk -U --no-cache --allow-untrusted add \
      zlib-dev \
      xorg-server \
      dbus \
      ttf-freefont \
      chromium \
      wqy-zenhei@edge \
      bash \
      bash-doc \
      bash-completion -f

# Set the time zone
RUN rm -rf /etc/localtime && ln -s /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

# Set environment variables
ENV NODE_ENV production

# Create a directory for the project code
RUN mkdir -p /workspace

# Specify the working directory for the RUN, CMD, and ENTRYPOINT instructions
WORKDIR /workspace

# Copy all files from the current path on the host to the working directory in the container
COPY . /workspace

# Clear the npm cache
RUN npm cache clean --force && npm cache verify
# If set to true, UID/GID switching is suppressed when running package scripts
# RUN npm config set unsafe-perm true

# Install pm2
RUN npm i pm2 -g

# Install dependencies
RUN npm install

# Expose the port
EXPOSE 3000

# Run command
ENTRYPOINT pm2-runtime start docker_pm2.json

Thanks to the following articles

  • A tool that emulates browser behavior
  • Scrolling: do you really understand it?
  • Puppeteer in Chinese
  • Introduction to Linux sandbox technology
  • How to write a good Dockerfile: Dockerfile best practices