Preface

Objective of this article: crawl the keyword search results of the Baidu search engine with Puppeteer and deploy the crawler to Alibaba Cloud Function Compute.

Before you begin, take a quick look at the following Function Compute and Puppeteer concepts; the later steps build on them.

Function Compute

What is Function Compute

Function Compute is an event-driven, fully managed computing service that lets you focus on writing code rather than on server infrastructure. When an event is triggered, Function Compute runs the task elastically and reliably in the cloud, and it supports log queries, performance monitoring, and alarms.

Advantages of using Function Compute

  • You do not need to purchase or manage servers and other infrastructure, so operation and maintenance costs are low.
  • You only need to focus on business logic: design, optimize, test, review, and upload your application code in any development language supported by Function Compute.
  • Applications are triggered in an event-driven manner to respond to user requests. Seamless integration with Alibaba Cloud Object Storage Service (OSS), API Gateway, Log Service, Table Store, and other services helps you build applications quickly.
  • Log query, performance monitoring, and alarm features help you locate and fix faults quickly.
  • Elastic scaling at the millisecond level quickly expands the underlying capacity to cope with peak load.
  • Pay as you go, with billing at a granularity of 100 milliseconds. You pay only for the computing resources you actually use, which suits access patterns with obvious peaks and troughs.

Puppeteer

What is Puppeteer

Puppeteer is a Node library that provides a high-level API for controlling headless Chrome or Chromium via the DevTools protocol, and can also be configured to use full (non-headless) Chrome or Chromium.

You can control Chrome directly through the APIs that Puppeteer provides, simulating most user actions to run UI tests or to crawl pages and collect data.

What can Puppeteer do

  • Generate screenshots and PDFs of pages

  • Crawl SPA or SSR websites

  • Automate tests that simulate form submission, keyboard input, mouse events, and more

  • Capture a timeline of your site to help diagnose performance problems

  • Create an up-to-date automated test environment. Run tests directly in the latest version of Chrome, using the latest JavaScript and browser features.
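As a minimal sketch of the first item, the function below opens a page and saves a full-page screenshot. It takes the Puppeteer module as a parameter so that it can be exercised without a real browser; takeScreenshot, the URL, and the output path are illustrative names and are not part of this project.

```javascript
// Minimal sketch: save a full-page screenshot of a URL.
// `puppeteer` is passed in (for example require('puppeteer'));
// takeScreenshot is a hypothetical helper used only for illustration.
async function takeScreenshot(puppeteer, url, outputPath) {
    const browser = await puppeteer.launch({ headless: true });
    try {
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'load' });
        await page.screenshot({ path: outputPath, fullPage: true });
        return outputPath;
    } finally {
        // Always release the browser, even if navigation fails
        await browser.close();
    }
}
```

With Puppeteer installed, it could be called as takeScreenshot(require('puppeteer'), 'https://example.com', 'example.png').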

Puppeteer official help document

1. Preparation

Before starting, ensure that the following tools have been correctly installed, updated to the latest version, and configured correctly.

  • Funcraft
  • Docker

Funcraft

Funcraft is a command-line tool provided by Function Compute. With it you can easily manage resources such as Function Compute, API Gateway, and Log Service. Funcraft helps you develop, build, and deploy through a resource configuration file, template.yml. This article installs Funcraft through npm.

  1. Installation

    • Run the following command to install Funcraft

      npm install @alicloud/fun -g

    • After the installation is complete, run the following command in a terminal to check the version information.

      fun --version

  2. Configure Funcraft

    • Execute the following command

      fun config

    • Configure Account ID, AccessKeyId, AccessKeySecret, and Default Region Name as prompted.

Docker

Funcraft relies on Docker to simulate the runtime environment locally for compiling and installing dependencies, running and debugging locally, and so on.

Windows

  1. Installation

    Download and install Docker Desktop for your system.

  2. Configure a domestic registry mirror

    {
      "registry-mirrors": [
        "https://docker.mirrors.ustc.edu.cn",
        "https://registry.docker-cn.com",
        "http://hub-mirror.c.163.com"
      ]
    }

    Right-click the Docker icon in the system tray at the bottom right corner of the desktop, edit the JSON on the Docker Daemon tab, add the mirror addresses above to the "registry-mirrors" array, and save.

    Tip: the Alibaba Cloud Docker registry mirror is recommended.

2. Practice

Initialize the project

  1. Run the following command and select the HTTP-trigger-nodejs10 template

    fun init -n xxx

    • The -n, --name option sets the project name, which is also the name of the generated folder. The default value is fun-app.

  1. Create a folder to store the Baidu crawler function

    cd fun-puppetter && mkdir baiduKeywordResult

  2. Generate package.json file

    npm init -y

  3. Generate the Funcraft configuration file

    fun config

    Configure Account ID, AccessKeyId, AccessKeySecret, and Default Region Name as prompted.

  4. Replace the contents of the template.yml file with:

    ROSTemplateFormatVersion: '2015-09-01'
    Transform: 'Aliyun::Serverless-2018-04-03'
    Resources:
      FunPuppetter:
        Type: 'Aliyun::Serverless::Service'
        Properties:
          Description: 'Puppetter Service, a Service can create multiple functions'
        baiduKeywordResult:
          Type: 'Aliyun::Serverless::Function'
          Properties:
            Handler: index.handler
            Runtime: nodejs10
            CodeUri: './baiduKeywordResult'
            Timeout: 600
            MemorySize: 1024
            InstanceConcurrency: 3
          Events:
            httpTrigger:
              Type: HTTP
              Properties:
                AuthType: ANONYMOUS
                Methods: ['POST', 'GET']

    template.yml declares a service named FunPuppetter. In this service it declares a function named baiduKeywordResult, with an HTTP trigger named httpTrigger, the entry point index.handler, and the nodejs10 runtime. Timeout lets the handler run for at most 600 seconds, and MemorySize allocates 1024 MB to the function. InstanceConcurrency sets how many requests a single function instance can handle simultaneously. CodeUri points to the ./baiduKeywordResult directory; at deployment time, fun packages and uploads the directory that CodeUri specifies. For more configuration rules, see the Funcraft documentation.

  5. Move the /index.js file to /baiduKeywordResult

    index.js

    var getRawBody = require('raw-body');
    var getFormBody = require('body/form');
    var body = require('body');
    
    module.exports.handler = function(req, resp, context) {
        console.log('hello world');
    
        var params = {
            path: req.path,
            queries: req.queries,
            headers: req.headers,
            method : req.method,
            requestURI : req.url,
            clientIP : req.clientIP,
        }
            
        getRawBody(req, function(err, body) {
            resp.setHeader('content-type', 'text/plain');
    
            for (var key in req.queries) {
              var value = req.queries[key];
              resp.setHeader(key, value);
            }
            params.body = body.toString();
            resp.send(JSON.stringify(params, null, ' '));
        }); 
          
    }

    The contents of the directory should look like this:

Run fun local start to debug. Funcraft responds as follows:

Open the generated URL. If the response looks like this, you can start coding.

Coding

Here are the interfaces we want to implement:

Baidu keyword search results

// request
{
    url: 'http://localhost:8000/2016-08-15/proxy/FunPuppetter/baiduKeywordResult',
    params: {
        keyword, // The keyword to search for
        page, // How many pages to crawl
    }
}
// response
{
   msg: 'success',
   code: 2000,
   data: [
       {
           title,
           abstract,
           redirectUrl,
           url,
           domain,
           keyword,
           pageNum,
       }
   ]
}
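A minimal sketch of how a client might build the request URL for this interface, using the URL class built into Node.js so that the keyword is percent-encoded. buildRequestUrl is a hypothetical helper name; the base URL is the local debugging address from the request description.

```javascript
// Build the request URL for the keyword search interface.
// buildRequestUrl is an illustrative helper, not part of the project code.
function buildRequestUrl(baseUrl, { keyword, page }) {
    const url = new URL(baseUrl);
    // searchParams.set percent-encodes values, so non-ASCII keywords are safe
    url.searchParams.set('keyword', keyword);
    url.searchParams.set('page', String(page));
    return url.toString();
}

const base = 'http://localhost:8000/2016-08-15/proxy/FunPuppetter/baiduKeywordResult';
console.log(buildRequestUrl(base, { keyword: 'vue', page: 3 }));
// http://localhost:8000/2016-08-15/proxy/FunPuppetter/baiduKeywordResult?keyword=vue&page=3
```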

/baiduKeywordResult/index.js

The content of the /baiduKeywordResult/index.js file is as follows:

Dependencies

const puppeteer = require('puppeteer');
const _ = require('lodash');
const async = require('async');
const axios = require('axios');
const cheerio = require('cheerio');
const nodeUrl = require('url');
  • lodash

    A high-performance JavaScript utility library.

  • async

    async is an excellent asynchronous control-flow library. Besides its flow-control functions, it provides many other utility functions. It was especially valuable before async/await was available.

  • axios

    Axios is a Promise-based HTTP library that can be used in browsers and Node.js.

  • cheerio

    Cheerio is a fast, flexible, and lean implementation of jQuery's core functionality, designed mainly for server-side DOM manipulation.

  • url

    Used to process and parse URLs.
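For example, the crawler only needs the host part of each result URL. The following is a minimal sketch using the WHATWG URL class built into Node.js (the crawler itself calls url.parse, which exposes the same host field); hostOf is an illustrative name.

```javascript
// Extract the host from a URL string; returns null for unparsable input.
function hostOf(link) {
    try {
        return new URL(link).host;
    } catch (e) {
        return null;
    }
}

console.log(hostOf('https://cn.vuejs.org/v2/guide/')); // cn.vuejs.org
console.log(hostOf('not a url'));                      // null
```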

Handler function

module.exports.handler = async function(req, resp, context) {

    // Receive parameters
    let { keyword, page } = req.queries;

    if (_.isEmpty(keyword) || _.isEmpty(page)) {
        resp.send(JSON.stringify({
            msg: 'Incorrect parameters!',
            code: 4005,
            data: null
        }));
        // Stop here so the crawl does not run with missing parameters
        return;
    }

    try {
        // Baidu returns at most 76 pages of search results
        page = Math.min(page, 76);

        const task = new Task({ keyword, page })
        const result = await task.start();
        console.log('response result', result)
        resp.send(JSON.stringify(result))
    } catch(e) {
        console.log(e)
    }

}
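The parameter handling can also be isolated in a small pure function, which makes it easy to test without an HTTP request. The sketch below is one possible shape, not the project's code: parseParams and its return format are assumptions, and the 76-page cap is taken from the handler.

```javascript
const MAX_PAGES = 76; // upper bound used by the handler

// Validate the raw query object and normalise `page` to an integer in [1, 76].
// Returns { ok: false } when a parameter is missing or malformed.
function parseParams(queries) {
    const keyword = typeof queries.keyword === 'string' ? queries.keyword.trim() : '';
    const page = Number.parseInt(queries.page, 10);
    if (keyword === '' || Number.isNaN(page) || page < 1) {
        return { ok: false };
    }
    return { ok: true, keyword, page: Math.min(page, MAX_PAGES) };
}

console.log(parseParams({ keyword: 'vue', page: '3' }));   // { ok: true, keyword: 'vue', page: 3 }
console.log(parseParams({ keyword: 'vue', page: '500' })); // { ok: true, keyword: 'vue', page: 76 }
console.log(parseParams({ page: '3' }));                   // { ok: false }
```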

Task Class

class Task {

    // Constructor, which is called when an instance is created
    constructor(task) {
        this._result = {
            msg: 'success',
            data: {},
            code: 5000
        };
        this._browser = null;
        this._task = task;
    }

    async start() {
        try {
            await this.initialize();
            await this.execute();
            this._result.code = 2000;
        } catch(e) {
            console.log(e.stack);
            this._result.msg = e.stack;
            this._result.code = 5000;
        } finally {
            await this.destroy();
        }

        return this._result;
    }

    async initialize() {
        // Open a browser instance
        this._browser = await puppeteer.launch({
            headless: true,
            ignoreDefaultArgs: ['--disable-extensions'],
            args: [
                '--no-sandbox',
                '--disable-setuid-sandbox'
            ]
        });
    }

    async execute() {
        const { keyword, page } = this._task;
        // Result offsets for each page: 0, 10, 20, ...
        const pageRange = _.range(0, page * 10, 10);
        
        let results = [];
        
        // Retrieve search results for each page concurrently
        results = await async.mapLimit(
            pageRange,
            50,
            async (offset) => {
                
                // Failure retry mechanism
                let retry = 0;
                let success = false;
                
                do {
                    // Open a Tab page
                    let entryPage = await this._browser.newPage();
                    try {
                        // Encode the keyword so spaces and non-ASCII characters survive in the URL
                        const url = `https://baidu.com/s?wd=${encodeURIComponent(keyword)}&pn=${offset}`;
                        console.log('Crawler url:', url)
                        await entryPage.goto(url, {
                            waitUntil: 'load',
                            timeout: 1000 * 30
                        });

                        let pageData = [];
                        // isLastPage is async, so its result must be awaited
                        if (await this.isLastPage(entryPage)) {
                            pageData = await this.structureData(offset, entryPage);
                        }
                        
                        success = true;
                        return pageData;
                    } catch(e) {
                        console.log('error', e);
                        retry++;
                        
                        // After 6 failed attempts, rethrow; the exception is caught by the try/catch in start()
                        if (retry >= 6) {
                            throw e;
                        }
                    } finally {
                        await entryPage.close()
                    }
                } while (!success && retry < 6)
            }
        );

        // Flatten the per-page arrays and assign an overall rank
        results = _.flatMapDepth(results).map((item, index) => {
            item.rank = index + 1;
            return item;
        })
        console.log(results);
        this._result.data = results;

    }
    
    async structureData(offset = 0, entryPage) {

        const htmlContent = await entryPage.content();
        let htmlData = await this.htmlParse(htmlContent);
        
        // Iterate over parsed data, adding page and keyword fields
        htmlData = _.map(htmlData, (data) => {
            data.keyword = this._task.keyword;
            // Offsets are 0, 10, 20, ... so the 1-based page number is offset / 10 + 1
            data.pageNum = offset / 10 + 1;
            return data;
        });
    
        return htmlData;
    }
    
    async htmlParse(html) {
        // Parse HTML to get data
        const $ = cheerio.load(html);
        let pageItems = [];
        $(".result.c-container").each(function (i, el) {
            const that = $(el);
            const item = {
                title: _.trim(that.find("h3 > a").text()),
                abstract: _.trim(that.find(".c-abstract").text()),
                redirectUrl: _.trim(that.find("h3 > a").attr("href")),
                url: ""
            };
            pageItems.push(item);
        });

        // Request each redirect URL concurrently to obtain the real URL behind Baidu's redirect
        pageItems = await new Promise((resolve, reject) => {
            async.mapLimit(
                pageItems,
                50,
                async (item) => {
                    const redirectResponse = await axios.head(item.redirectUrl, {
                        timeout: 1000 * 10, // 10 seconds
                        maxRedirects: 0,
                        validateStatus: function (status) {
                            return status >= 200 && status < 400;
                        }
                    });
                    item.url = redirectResponse.headers.location || item.redirectUrl;
                    item.domain = nodeUrl.parse(item.url).host;
                    return item;
                },
                (err, results) => {
                    if (err) {
                        reject(err);
                    } else {
                        resolve(results);
                    }
                }
            );
        });

        return pageItems;
    }

    async isLastPage(entryPage) {
        const htmlContent = await entryPage.content();
        // Parse HTML to get data
        const $ = cheerio.load(htmlContent);
        
        return $("#page").length && $("#page .n").length
    }

    async destroy() {
        // The browser is still null if launch failed, so guard before closing
        if (this._browser) {
            await this._browser.close();
        }
    }
}
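The paging arithmetic used by execute and structureData can be checked in isolation. The sketch below assumes that Baidu's pn parameter is a zero-based result offset with 10 results per page, which is how the crawler builds its URLs; pageOffsets and pageNumberOf are illustrative helpers.

```javascript
const RESULTS_PER_PAGE = 10; // assumption: 10 results per Baidu result page

// Offsets for the first `pages` result pages: 0, 10, 20, ...
function pageOffsets(pages) {
    return Array.from({ length: pages }, (_, i) => i * RESULTS_PER_PAGE);
}

// Convert an offset back to a 1-based page number.
function pageNumberOf(offset) {
    return Math.floor(offset / RESULTS_PER_PAGE) + 1;
}

console.log(pageOffsets(3));                   // [ 0, 10, 20 ]
console.log(pageOffsets(3).map(pageNumberOf)); // [ 1, 2, 3 ]
```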

/baiduKeywordResult/package.json

"dependencies": {
    "async": "^3.2.0",
    "axios": "^0.19.2",
    "cheerio": "^1.0.0-rc.3",
    "lodash": "^4.17.15",
    "puppeteer": "^2.0.0",
    "url": "^0.11.0"
},

Install dependencies

$ fun install -d

If you install the dependencies directly with npm install, Puppeteer fails at runtime. Puppeteer depends on Chromium, which in turn depends on several system libraries, so npm install also triggers a Chromium download. Users often run into two problems here:

  1. Chromium is large, so the download often fails because of network problems.
  2. npm downloads only Chromium; the system libraries that Chromium depends on are not installed automatically, so users must find and install the missing dependencies themselves.

Fortunately, Funcraft, the Function Compute command-line tool, already integrates a Puppeteer solution. As long as puppeteer is listed in package.json, fun install -d installs all system dependencies in one step.

3. Debug the function locally

To debug code locally, you can use the following command:

$ fun local start
using template: template.yml
HttpTrigger httpTrigger of FunPuppetter/baiduKeywordResult was registered
        url: http://localhost:8000/2016-08-15/proxy/FunPuppetter/baiduKeywordResult
        methods: [ 'POST', 'GET' ]
        authType: ANONYMOUS

Open http://localhost:8000/2016-08-15/proxy/FunPuppetter/baiduKeywordResult in a browser; the response is downloaded automatically as a file.

Response:

{"msg":"Incorrect parameters!","code":4005,"data":null}

Response with keyword and page parameters:

http://localhost:8000/2016-08-15/proxy/FunPuppetter/baiduKeywordResult?keyword=vue&page=3

{
    "msg": "success",
    "data": [
        {
            "title": "Vue.js official website",
            "abstract": "Vue.js - The Progressive JavaScript Framework... Subscribe to our weekly you can browse past issues and listen to podcasts at news.vuejs.org.",
            "redirectUrl": "http://www.baidu.com/link?url=Men7IMCzaXf2qP148hYmJKK54l5fL03Wbya_S4L25_i",
            "url": "https://cn.vuejs.org/",
            "domain": "cn.vuejs.org",
            "keyword": "vue",
            "pageNum": 1,
            "rank": 1
        },
        {
            "title": "Vue.js tutorial | novice tutorial",
            "abstract": "Vue.js tutorial vue.js is a set of incremental frameworks for building user interfaces. Vue focuses only on the view layer and adopts a bottom-up incremental design. Vue aims to pass...",
            "redirectUrl": "http://www.baidu.com/link?url=WXIdaqC4EhUmm3Vdis5p0BCM3vUo139WwLQCB28LV8p5epqoiZMceQ1AWV_HpjKAb2jaqVpsXyWytUzPrnDqt_",
            "url": "https://www.runoob.com/vue2/vue-tutorial.html",
            "domain": "www.runoob.com",
            "keyword": "vue",
            "pageNum": 1,
            "rank": 2
        },
        {
            "title": "Introduction - vue.js",
            "abstract": "Vue.js - The Progressive JavaScript Framework... Vue (pronounced /vjuː/, similar to view) is a set of progressive frameworks for building user interfaces. Unlike other large frameworks, Vue is designed to...",
            "redirectUrl": "http://www.baidu.com/link?url=RjryFjnGxvreIzhFX1iicF8hHcRbNhkoTTTrFLjsLk4EmqM5ydhCbTR2vye8NBUv",
            "url": "https://cn.vuejs.org/v2/guide/",
            "domain": "cn.vuejs.org",
            "keyword": "vue",
            "pageNum": 1,
            "rank": 3
        },
        ...
    ],
    "code": 2000
}
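A client would typically check code before using data. As a small sketch (the field names follow the response format of this interface; summarizeByDomain is a hypothetical helper), the function below counts results per domain:

```javascript
// Count results per domain for a successful response; returns an empty
// object when the service reports an error code.
function summarizeByDomain(response) {
    if (response.code !== 2000 || !Array.isArray(response.data)) {
        return {};
    }
    const counts = {};
    for (const item of response.data) {
        counts[item.domain] = (counts[item.domain] || 0) + 1;
    }
    return counts;
}

const sample = {
    msg: 'success',
    code: 2000,
    data: [
        { domain: 'cn.vuejs.org', rank: 1 },
        { domain: 'www.runoob.com', rank: 2 },
        { domain: 'cn.vuejs.org', rank: 3 },
    ],
};
console.log(summarizeByDomain(sample)); // { 'cn.vuejs.org': 2, 'www.runoob.com': 1 }
```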

4. One-click service deployment

To deploy the service, run fun deploy and answer its prompts:

  • Confirm the configuration in the template.yml file and select Y. Running fun deploy -y skips this confirmation during deployment.

  • Use NAS services to manage dependencies

The FunPuppetter/baiduKeywordResult function is larger than 50 MB, so its dependencies must be managed through the NAS service.

  • ? Do you want to let fun to help you automate the configuration?

    Asks whether fun should automatically configure NAS to manage the dependencies. Select Yes.

  • ? We recommend using the ‘NasConfig: Auto’ configuration to manage your function dependencies.

    Asks whether to use the NasConfig: Auto configuration to manage the function dependencies. Select Yes.

    Tip: manual configuration is also possible; see the Function Compute documentation on mounting and accessing NAS. If you have already configured NAS manually, fun prompts you to select the configured NAS that stores the function dependencies.

If you see this, the deployment is successful.

Why is the response forced to download

The server forcibly adds the Content-Disposition: attachment field to the response headers, which makes the browser open the returned result as an attachment. This field cannot be overridden. Requests made through a custom domain name are not affected.
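To confirm this from a script, inspect the response headers. The following is a minimal sketch, assuming the headers have already been collected into a plain object with lower-cased names (as the Node.js http module and axios both provide); isForcedDownload is an illustrative name.

```javascript
// True when the Content-Disposition header asks the browser to download
// the body as an attachment instead of rendering it.
function isForcedDownload(headers) {
    const value = headers['content-disposition'];
    return typeof value === 'string' && value.trim().toLowerCase().startsWith('attachment');
}

console.log(isForcedDownload({ 'content-disposition': 'attachment' })); // true
console.log(isForcedDownload({ 'content-type': 'application/json' }));  // false
```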

Configure the custom domain name

Next we configure a custom domain name for the function service so that HTTP trigger responses are no longer forced to download.

  1. Log in to the Alibaba Cloud Function Compute console

  2. Open the custom domain names page and create a domain name

    Replace fun.root2.cn with your domain address

  3. Resolve the domain name to the Function Compute endpoint

    The endpoint is shown in the upper right corner of the Overview page of the Function Compute console.

    Open the Alibaba Cloud DNS console, select the domain name, and add a record.

    The record type is CNAME and the record value is the Function Compute endpoint.

  4. Test whether the resolution takes effect

    The following figure shows a successful resolution.

How do I update new dependencies?

If you add new dependencies, simply run fun nas sync again to synchronize them.

If you change the code, simply run fun deploy again to redeploy.

Project code

Github.com/ITHcc/fun-p…