Preface

Those who know me well may know that I do not write about hot topics. Why not? Is it because I don’t pay attention? Not really. There are some issues that I am very concerned about, and I do have a lot of ideas and opinions. But I have always followed one principle: make content that has life.

The content of this article comes from a crawler management platform that the author was previously responsible for developing. A relatively independent functional module has been abstracted from it to explain how to use NodeJS to develop your own crawler platform. This article covers a lot of ground, including NodeJS, the crawler framework, parent-child processes and their communication, React and UMI, etc. The author will introduce it in as plain a language as possible.

What you will gain

  • Introduction and basic use of the Apify framework
  • How to create a child process and communicate with the parent process
  • How to use JavaScript to manually control the maximum number of concurrent crawler tasks
  • How to take a snapshot of an entire web page
  • Using third-party libraries and modules in NodeJS
  • Using Umi3 + Antd4.0 to build the front-end interface of the crawler platform

Platform preview

Main content

Before starting this article, it is necessary to understand some applications of crawlers. Commonly known crawlers are mostly used to crawl web page data, capture request information, take web page screenshots, and so on; typical application scenarios include:

  • Automated testing
  • Server-side rendering
  • Automated form submission
  • Testing Chrome extensions
  • Performance diagnosis

Introduction and basic use of the Apify framework

Apify is a scalable web crawling library for JavaScript. It enables data extraction and web automation through headless Chrome and Puppeteer. It provides tools to manage and automatically scale a pool of headless Chrome/Puppeteer instances, to maintain request queues of target URLs, and to store crawling results either on the local file system or in the cloud.

It is very simple to install and use, and there are many examples on the official website for reference. The specific installation and usage steps are as follows:

Installation

npm install apify --save

Running your first example with Apify

const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://www.iana.org/' });
    const pseudoUrls = [new Apify.PseudoUrl('https://www.iana.org/[.*]')];

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ request, page }) => {
            const title = await page.title();
            console.log(`Title of ${request.url}: ${title}`);
            await Apify.utils.enqueueLinks({
                page,
                selector: 'a',
                pseudoUrls,
                requestQueue,
            });
        },
        maxRequestsPerCrawl: 100,
        maxConcurrency: 10,
    });

    await crawler.run();
});

Running this script with node will crawl the site and print the title of each page it visits to the console.

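If you also want to persist the extracted data (for example the page titles above), the Apify SDK provides a built-in dataset store. As a hedged sketch (the exact call may differ slightly between SDK versions), a single extra line inside handlePageFunction is enough; when run locally, the results end up as JSON files under the apify_storage directory:

// Inside handlePageFunction: store the result in the default dataset
await Apify.pushData({ url: request.url, title });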

How to create a child process and communicate with the parent process

If we want to implement a crawler platform, one key issue to consider is how and when crawler tasks are executed. Since page crawling and screenshots are only processed after the page has fully loaded (to ensure data integrity), we can treat crawling as a time-consuming task.

When we use NodeJS as the backend server, since NodeJS itself is single-threaded, once a crawl request is passed to NodeJS, it would have to wait for this "time-consuming task" to complete before processing other requests, which would cause all other requests to wait until the task finishes. So for a better user experience and smoother responses, we have to consider multi-process handling. The good news is that NodeJS is designed to support child processes, so we can put a time-consuming task such as crawling in a child process and notify the main process when the child process is finished. The whole flow is: the main process receives the request, hands the crawl to a child process, and the child process reports back to the main process when it is done.

NodeJS's child_process module offers several ways to create a child process (spawn, exec, execFile, and fork); here we use fork. The implementation is as follows:

// child.js
function computedTotal(arr, cb) {
    // Time consuming computing task
}

// Communicate with the main process:
// listen for messages from the main process
process.on('message', (msg) => {
  computedTotal(msg, (flag) => {
    // Send a completion signal back to the main process
    process.send(flag);
  });
});

// main.js
const { fork } = require('child_process');

app.use(async (ctx, next) => {
  if(ctx.url === '/fetch') {
    const data = ctx.request.body;
    // Tell the child process to start executing the task and pass in data
    const res = await createPromisefork('./child.js', data)
  }
  
  // Fork a child process and wrap its result in a Promise
  function createPromisefork(childUrl, data) {
    // Load the child process
    const res = fork(childUrl)
    // Notify the child process to start work
    data && res.send(data)
    return new Promise((resolve) => {
        res.on('message', f => {
            resolve(f)
        })
    })
  }
  
  await next()
})

The above is a simple example of communication between parent and child processes, and our crawler service adopts the same pattern.

Use JavaScript to manually control the maximum number of concurrent crawler tasks

The above are the technical problems that need to be solved to build our crawler application. Next, we begin to implement the business functionality. Since the crawler task runs in the child process, we will implement the crawler logic in the child process code. Let's first sort out the specific business requirements.

Let's first solve the problem of controlling the crawler's maximum concurrency. The reason this matters is performance: we cannot let the crawler fetch all pages at once, since that would spawn a large number of parallel tasks. We therefore need a throttling mechanism to control how many pages are fetched concurrently; only after the current batch has finished do we proceed to the next batch of page fetches. The concrete implementation is as follows:

// Asynchronous queue
const queue = []
// Maximum number of concurrent requests
const max_parallel = 6
// Start pointer
let start = 0

for(let i = 0; i < urls.length; i++) {
  // Add an asynchronous queue
  queue.push(fetchPage(browser, i, urls[i]))
  if((i && (i + 1) % max_parallel === 0) || i === (urls.length - 1)) {
    // Await every `max_parallel` pages (plus the final partial batch) so that
    // at most `max_parallel` fetches are in flight at any time
    await Promise.all(queue.slice(start, i + 1))
    start = i + 1
  }
}

The code above fetches at most 6 pages at a time; only when the current batch has finished does the next batch of tasks start. In the code, urls is the collection of URLs entered by the user, and fetchPage contains the crawler logic for fetching a single page, which I encapsulated as a promise.
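
The fetchPage function itself is not shown in this article (it lives in the project repository), but as a rough, hypothetical sketch of its shape, assuming a Puppeteer browser instance, it could look something like this:

// Hypothetical sketch of fetchPage; the real implementation is in the project repo.
// It opens a page, waits for it to settle, runs the capture logic, always closes
// the page, and resolves with a simple result object.
async function fetchPage(browser, index, url) {
  const page = await browser.newPage();
  try {
    // networkidle2 gives lazily loaded content a chance to render
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 60000 });
    // ... scrolling, text extraction and screenshot logic would go here ...
    return { index, url, ok: true };
  } catch (err) {
    return { index, url, ok: false, error: err.message };
  } finally {
    await page.close();
  }
}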

How to take a snapshot of an entire Web page

As we all know, Puppeteer only captures the part of the page that has already been rendered, which is perfectly fine for an ordinary static website. However, content-heavy or e-commerce sites mostly load their content on demand, so a straightforward screenshot only captures part of the page, or placeholders that have not yet been loaded, as shown in the picture below:

To solve this, we use Puppeteer's page.evaluate API to manually make the browser scroll to the bottom, one screen at a time, until the scroll height of the page no longer changes (or a maximum height is reached).

// Height scrolled per step (one screen)
let scrollStep = 1080;
// Maximum scroll height, to prevent infinitely loading pages from becoming an endless task
let max_height = 30000;
let m = { prevScroll: -1, curScroll: 0 }

// Stop scrolling once the previous scroll height equals the current one,
// or the scroll height exceeds the configured maximum
while(m.prevScroll !== m.curScroll && m.curScroll < max_height) {
    m = await page.evaluate((scrollStep) => {
      if (document.scrollingElement) {
        let prevScroll = document.scrollingElement.scrollTop;
        document.scrollingElement.scrollTop = prevScroll + scrollStep; 
        let curScroll = document.scrollingElement.scrollTop
        return {prevScroll, curScroll}
      }
    }, scrollStep);
    
    // Wait 3 seconds to continue scrolling in order for the page to load fully
    await sleep(3000);
}
// Other business code...
// Take a snapshot of the web page and set the image quality and save path
const screenshot = await page.screenshot({ path: `static/${uid}.jpg`, fullPage: true, quality: 70 });
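
The sleep function used in the loop is a small helper that is not shown above; assuming it simply waits for the given number of milliseconds, a minimal version would be:

// Resolve after `ms` milliseconds, so the loop can pause between scrolls
const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));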

The rest of the crawler code is not the core focus, so I will not walk through it here; I have put it on GitHub for you to study.

There are also ready-made APIs for extracting text from web pages, so you can choose whichever suits your business. Here, for example, is Puppeteer's page.$eval:

const txt = await page.$eval('body', el => {
    // el is the body DOM node; its child nodes can be extracted and analysed here,
    // for example by simply returning the page's visible text
    return el.innerText
})

Using third-party libraries and modules in NodeJS

In order to build a complete Node service platform, the author uses the following modules:

  • Koa is a lightweight, extensible Node framework
  • Glob traverses files using powerful pattern matching
  • Koa2-cors handles cross-domain access issues
  • Koa-static creates a static file directory

How these modules are combined into a complete server application is described in detail in the code comments, so it will not be discussed at length here. The specific code is as follows:
const Koa  = require('koa');
const { resolve } = require('path');
const staticServer = require('koa-static');
const koaBody = require('koa-body');
const cors = require('koa2-cors');
const logger = require('koa-logger');
const glob = require('glob');
const { fork } = require('child_process');

const app = new Koa();
// Create a static directory
app.use(staticServer(resolve(__dirname, './static')));
app.use(staticServer(resolve(__dirname, './db')));
app.use(koaBody());
app.use(logger());

const config = {
  imgPath: resolve('/', 'static'),
  txtPath: resolve('/', 'db')
}

// Set up cross-domain access
app.use(cors({
  origin: function (ctx) {
      if (ctx.url.indexOf('fetch') > -1) {
        return '*'; // Allow requests from all domains
      }
      return ''; // Otherwise restrict cross-domain access (e.g. only allow the http://localhost domain)
  },
  exposeHeaders: ['WWW-Authenticate', 'Server-Authorization'],
  maxAge: 5, // This field is optional and specifies the validity period of the preflight request, in seconds
  credentials: true,
  allowMethods: ['GET', 'POST', 'PUT', 'DELETE'],
  allowHeaders: ['Content-Type', 'Authorization', 'Accept', 'x-requested-with'],
}))

// Fork a child process and wrap its result in a Promise
function createPromisefork(childUrl, data) {
  // Load the child process
  const res = fork(childUrl)
  // Notify the child process to start work
  data && res.send(data)
  return new Promise((resolve) => {
    res.on('message', f => {
      resolve(f)
    })
  })
}

app.use(async (ctx, next) => {
  if(ctx.url === '/fetch') {
    const data = ctx.request.body;
    const res = await createPromisefork('./child.js', data)
    // Obtain the identifiers of the generated result files
    const txtUrls = [];
    let reg = /.*?(\d+)\.\w*$/;
    glob.sync(`${config.txtPath}/*.*`).forEach(item => {
      if(reg.test(item)) {
        txtUrls.push(item.replace(reg, '$1'))
      }
    })

    ctx.body = {
      state: res,
      data: txtUrls,
      msg: res ? 'Grab done' : 'Fetching failed due to an invalid URL, a request timeout, or an internal server error.'
    }
  }
  await next()
})

app.listen(80)
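
For completeness, here is a hypothetical sketch of how a client could call this service. The shape of the request body (a list of urls) is an assumption made for illustration; check the project code for the exact fields the child process expects:

// Hypothetical client call to the crawler service (the body shape is an assumption)
fetch('http://localhost/fetch', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ urls: ['https://www.example.com'] }),
})
  .then(res => res.json())
  .then(({ state, data, msg }) => {
    // state: whether the crawl succeeded, data: ids of the generated files, msg: status message
    console.log(state, data, msg);
  });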

Using Umi3 + Antd4.0 to build the front-end interface of the crawler platform

The front-end interface of the crawler platform is developed with Umi3 + Antd4.0. Compared with the previous version, Antd 4.0 has a much smaller bundle size and better performance, and its components are split more reasonably. Because the front-end page implementation is relatively simple, the entire front end is written with hooks in fewer than 200 lines of code, so I will not go through it here. You can look at it on my GitHub.

  • GitHub project address: a somewhat interesting crawler platform built with Apify + Node + React

The interface is as follows:

The crawler application

Documentation for the technologies used in this project

  • Apify: a scalable web crawler library for JavaScript
  • Puppeteer
  • Koa: the next-generation web development framework based on the NodeJS platform

Finally

If you want to learn more about H5 games, webpack, node, gulp, CSS3, JavaScript, Canvas data visualization, and other front-end knowledge and practice, you are welcome to join our technical group on the public account "Interesting Talk Front End" to learn and discuss together and explore the frontiers of front-end development.

More recommended

  • Develop a friend circle app for programmers based on React/Vue
  • Build a common form management configuration platform based on React (Vue)
  • Programmers must have several common sorting algorithm and search algorithm summary
  • A few very interesting summaries of javascript knowledge points
  • Front-end advanced from zero to one to achieve unidirectional & bidirectional linked list
  • Preliminary study of micro front-end architecture and my front-end technology inventory
  • Develop your own graph bed application using nodeJs
  • Implementing a CMS full stack project from 0 to 1 based on nodeJS (Part 1)
  • Implement a CMS full stack project from 0 to 1 based on nodeJS (middle) (source included)
  • CMS full stack project Vue and React (part 2)
  • Write a mock data server using nodeJS in 5 minutes
  • Develop a component library based on VUE from zero to one
  • Build a front-end team component system from 0 to 1 (Advanced Advanced Prerequisite)
  • Hand-write 8 common custom hooks in 10 minutes
  • Javascript Design Patterns front-end Engineers Need to Know in 15 minutes (with detailed mind maps and source code)
  • “Front End Combat Summary” using postMessage to achieve pluggable cross-domain chatbot