In fact, writing crawlers is a technical activity that demands a fairly broad set of computer skills.

First of all, you need a basic understanding of network protocols, especially HTTP, and you should be able to analyze a website's request and response data. You also need to learn a few tools: for simple cases the Network panel of Chrome DevTools is enough; I usually pair it with Postman or Charles for analysis, and for more complicated cases a dedicated packet-capture tool such as Wireshark may be needed. The deeper you understand a site, the easier it is to come up with a simple way to crawl the information you want.

Besides some computer networking knowledge, you also need a certain amount of string-processing ability, which in practice means playing with regular expressions. In typical crawling scenarios, regular expressions don't require much advanced knowledge; the slightly more complex features you'll use most often are grouping and non-greedy matching. As the saying goes, learn regular expressions and you'll never fear handling strings again 🤣.
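
For instance, here is a tiny standalone snippet (not part of the crawler below) showing those two features:

// Grouping and non-greedy matching on a small HTML fragment
const html = '<a href="/a">first</a><a href="/b">second</a>';

// Greedy: .* runs to the last </a>, so group 1 captures "second"
console.log(html.match(/<a.*>(.*)<\/a>/)[1]);   // second

// Non-greedy: .*? stops at the first match, so group 1 captures "first"
console.log(html.match(/<a.*?>(.*?)<\/a>/)[1]); // first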

You also need to master some counter-anti-crawler techniques. Writing crawlers you may run into all kinds of problems, but don't be afraid: even a site as complicated as 12306 can be crawled, so what can stand in our way? Common problems crawlers run into include the server checking cookies, checking the Host and Referer headers, hidden fields in forms, captchas, access frequency limits, the need for proxies, SPA sites, and so on. In practice, most of the problems a crawler encounters can ultimately be solved by driving a real browser.
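
As a small hedged sketch (not the crawler built in this article), this is one way to set such headers with superagent when a site checks them; the header values are placeholders, not real credentials:

const superagent = require('superagent');

// Explicitly set Cookie, Referer and User-Agent before sending the request.
// Awaiting the returned request object sends it and yields the response.
const fetchWithHeaders = (url) => superagent
    .get(url)
    .set('Referer', 'https://example.com/some-page')   // pretend we came from a page on the site
    .set('Cookie', 'sessionid=placeholder-value')      // reuse a cookie copied from the browser
    .set('User-Agent', 'Mozilla/5.0 (crawler demo)');  // some sites also check the UA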

This article is the second in a series on writing crawlers with NodeJS. This time we use a crawler to grab the popular projects on Github. The goals are:

  1. Learn to extract data from the page source code, the most basic kind of crawler
  2. Save the captured data to a JSON file
  3. Get familiar with some of the modules I introduced in my last article
  4. Learn how Node handles user input

Analyze requirements

Our requirement is to grab the popular project data from Github, i.e. the projects with the highest star counts. But Github doesn't seem to have a page that ranks the top projects directly. Often, the search function a site provides is the focus of our analysis as crawler writers.

When I was browsing v2ex, a post discussing 996 taught me a way to look up the top-starred Github repositories: a Github search for stars:>60000 returns all repositories with more than 60000 stars. Analyze the screenshot below and note the comments in the image:

After analysis, the following information can be obtained:

  1. The search results page is returned as an HTML document via a GET request; I know this because I selected the Doc filter in the Network panel
  2. There are three parameters in the URL: p (page) is the page number, q (query) is the search content, and type is the type of content to search, as the sketch after this list shows
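
For example (a standalone sketch, not part of the crawler code later in this article), the same URL can be assembled with Node's built-in URLSearchParams:

// Build the search URL from its three query parameters
const params = new URLSearchParams({
    p: '2',                // p: the page number
    q: 'stars:>60000',     // q: the search content
    type: 'Repositories',  // type: the type of the search content
});
console.log(`https://github.com/search?${params}`);
// https://github.com/search?p=2&q=stars%3A%3E60000&type=Repositories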

Then I wondered whether Github checks cookies or other request headers such as Referer and Host, and decides whether to return the page based on whether they are present.

Let's test it with a plain curl request:

curl "https://github.com/search?p=2&q=stars%3A%3E60000&type=Repositories"

As expected, the page source comes back normally, so our crawler script doesn't need to add any headers or cookies.

Using Chrome's search function, we can see the project information we need in the source code of the web page.

That wraps up the analysis. This is actually a very simple crawler: we just need to configure the query parameters, fetch the page source with an HTTP request, parse it with a parsing library to extract the project information we need, process the data into an array, and finally serialize it into a JSON string and store it in a JSON file.

Start implementing this crawler

Get the source code

To get the page source in Node, we configure the URL parameters and then request the resulting URL with superagent, a module for sending HTTP requests.

'use strict';
const requests = require('superagent');
const cheerio = require('cheerio');
const constants = require('../config/constants');
const logger = require('../config/log4jsConfig').log4js.getLogger('githubHotProjects');
const requestUtil = require('./utils/request');
const models = require('./models');

/**
 * Crawl the source code of one search results page
 * @param {number} starCount the lower limit of stars, in k
 * @param {number} page the page number
 */
const crawlSourceCode = async (starCount, page = 1) => {
    // starCount is given in k, so convert it to the actual star count
    starCount = starCount * 1024;
    // Replace the parameters in the URL
    const url = constants.searchUrl.replace('${starCount}', starCount).replace('${page}', page);
    // Response.text is the source code returned
    const { text: sourceCode } = await requestUtil.logRequest(requests.get(encodeURI(url)));
    return sourceCode;
}

The constants module in the code above stores the project's constant configuration. When a constant needs to change you only edit the configuration file, and keeping the configuration in one place makes it easier to review.

module.exports = {
    searchUrl: 'https://github.com/search?q=stars:>${starCount}&p=${page}&type=Repositories',
};
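
The requestUtil module required earlier (./utils/request) is not shown in this article. A minimal sketch of what its logRequest helper might look like, assuming it simply logs the URL, status, and timing of the superagent request it is given (the require path below is also an assumption about the project layout):

const logger = require('../../config/log4jsConfig').log4js.getLogger('request');

module.exports = {
    // Await the superagent request, log how it went, and hand the response back
    async logRequest(superagentRequest) {
        const start = Date.now();
        const response = await superagentRequest;
        logger.info(`GET ${superagentRequest.url} -> ${response.status} (${Date.now() - start} ms)`);
        return response;
    },
};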

Parse the source code for project information

Here I have abstracted the project information into a Repository class, which lives in repository.js under the project's models directory.

const fs = require('fs-extra');
const path = require('path');


module.exports = class Repository {
    static async saveToLocal(repositories, indent = 2) {
        await fs.writeJSON(path.resolve(__dirname, '../../out/repositories.json'), repositories, { spaces: indent })
    }

    constructor({
        name,
        author,
        language,
        digest,
        starCount,
        lastUpdate,
    } = {}) {
        this.name = name;
        this.author = author;
        this.language = language;
        this.digest = digest;
        this.starCount = starCount;
        this.lastUpdate = lastUpdate;
    }

    display() {
        console.log(`Project: ${this.name}  Author: ${this.author}  Language: ${this.language}  Stars: ${this.starCount}  Digest: ${this.digest}  Last updated: ${this.lastUpdate}
`);
    }
}

To parse the source code we obtained, we use the cheerio parsing library, whose API is very similar to jQuery's.
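
If you have never used cheerio, this tiny standalone example (unrelated to the crawler) shows the general idea: load an HTML string, then query it with jQuery-style selectors.

const cheerio = require('cheerio');

const $ = cheerio.load('<ul><li class="repo">foo</li><li class="repo">bar</li></ul>');
$('.repo').each((index, li) => {
    console.log($(li).text()); // prints "foo", then "bar"
});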

/**
 * Crawl the repositories on one search results page
 * @param {number} starCount the lower limit of stars, in k
 * @param {number} page the page number
 */
const crawlProjectsByPage = async (starCount, page = 1) => {
    const sourceCode = await crawlSourceCode(starCount, page);
    const $ = cheerio.load(sourceCode);

    // If you are familiar with jquery, cheerio should pose no obstacles. If not, check the API in its official github repository; there aren't many APIs
    // Looking at the Elements panel, the information for each repository is in an li tag
    const repositoryLiSelector = '.repo-list-item';
    const repositoryLis = $(repositoryLiSelector);
    const repositories = [];
    repositoryLis.each((index, li) = > {
        const $li = $(li);

        // Get the a link that contains the repository author and repository name
        const nameLink = $li.find('h3 a');

        // Extract the repository name and author name
        const [author, name] = nameLink.text().split('/');

        // Get the project summary
        const digestP = $($li.find('p')[0]);
        const digest = digestP.text().trim();

        // Get the language
        // First get the span with the class name .repo-language-color, then get its parent div, which contains the language text
        // Note that some repositories have no language; in that case the span cannot be found and language ends up as an empty string
        const languageDiv = $li.find('.repo-language-color').parent();
        // Note that String.prototype.trim() is used to strip the surrounding whitespace
        const language = languageDiv.text().trim();

        // Get the number of stars
        const starCountLinkSelector = '.muted-link';
        const links = $li.find(starCountLinkSelector);
        // One of the .muted-link matches may be the issues link
        const starCountLink = $(links.length === 2 ? links[1] : links[0]);
        const starCount = starCountLink.text().trim();

        // Get the last update time
        const lastUpdateElementSelector = 'relative-time';
        const lastUpdate = $li.find(lastUpdateElementSelector).text().trim();
        const repository = new models.Repository({
            name,
            author,
            language,
            digest,
            starCount,
            lastUpdate,
        });
        repositories.push(repository);
    });
    return repositories;
}

Sometimes the search results span multiple pages, so I wrote another function to fetch the repositories for a specified number of pages.

const crawlProjectsByPagesCount = async (starCount, pagesCount) => {
    if (pagesCount === undefined) {
        pagesCount = await getPagesCount(starCount);
        logger.warn(`No page count was specified; all pages of repositories will be crawled, ${pagesCount} pages in total`);
    }

    const allRepositories = [];

    const tasks = Array.from({ length: pagesCount }, (ele, index) => {
        // Since page numbers start at 1, index + 1 is used here
        return crawlProjectsByPage(starCount, index + 1);
    });

    // Use Promise.all to run the page requests concurrently
    const resultRepositoriesArray = await Promise.all(tasks);
    resultRepositoriesArray.forEach(repositories => allRepositories.push(...repositories));
    return allRepositories;
}
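
The getPagesCount function called above is not shown in this article. Here is a rough sketch of one possible implementation, assuming the total page count can be read from the pagination links on the first results page (the '.pagination' selector is my assumption about the markup, not something verified here):

const getPagesCount = async (starCount) => {
    const sourceCode = await crawlSourceCode(starCount, 1);
    const $ = cheerio.load(sourceCode);
    // Take the largest numeric label among the pagination links
    const pageNumbers = $('.pagination a')
        .map((index, a) => Number($(a).text()))
        .get()
        .filter(n => !Number.isNaN(n));
    return pageNumbers.length ? Math.max(...pageNumbers) : 1;
};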

Make the crawler a bit more user-friendly

Just writing a script, configuring the parameters in the code, and then running it feels a bit crude, so here I've used readline-sync, a library for synchronously reading user input, to add a bit of interaction. I might consider using Electron to build a simple interface in a later crawler tutorial. Here's the startup code of the application.

const readlineSync = require('readline-sync');
const { crawlProjectsByPage, crawlProjectsByPagesCount } = require('./crawlHotProjects');
const models = require('./models');
const logger = require('../config/log4jsConfig').log4js.getLogger('githubHotProjects');

const main = async () => {
    let isContinue = true;
    do {
        const starCount = readlineSync.questionInt('Enter the lower limit of the number of stars you want to crawl on Github, in k: ', { encoding: 'utf-8' });
        const crawlModes = [
            'Grab a page',
            'Grab a certain number of pages',
            'Fetch all pages'
        ];
        const index = readlineSync.keyInSelect(crawlModes, 'Please select a crawl mode');

        let repositories = [];
        switch (index) {
            case 0: {
                const page = readlineSync.questionInt('Please enter the page number you want to crawl: ');
                repositories = await crawlProjectsByPage(starCount, page);
                break;
            }
            case 1: {
                const pagesCount = readlineSync.questionInt('Please enter the number of pages you want to crawl:');
                repositories = await crawlProjectsByPagesCount(starCount, pagesCount);
                break;
            }
            case 2: {
                repositories = await crawlProjectsByPagesCount(starCount);
                break;
            }
        }
        
        repositories.forEach(repository => repository.display());
        
        const isSave = readlineSync.keyInYN('Do you want to save it locally (JSON format)? ');
        isSave && models.Repository.saveToLocal(repositories);
        isContinue = readlineSync.keyInYN('Continue? ');
    } while (isContinue);
    logger.info('Program exited normally...');
}

main();

Let's see how it runs.

On Windows, git bash output is garbled in VS Code regardless of whether your file encoding is UTF-8. After searching a few issues, I found that switching the console to the UTF-8 code page in PowerShell, i.e. running chcp 65001, makes it display correctly.

The full source code of this project, as well as the source for the subsequent tutorials, is stored in my Github repository: Spiders. If this tutorial is helpful to you, I hope you won't skimp on your star 😊. The next tutorial will probably cover a more complex case: analyzing Ajax requests so we can hit the interface directly.