Introduction

Most front-end developers will know Qiwu Weekly (奇舞周刊), a technical blog that gathers a large number of articles contributed by excellent bloggers. I check it every few days, but its official website often fails to load, and every time I want to read an article I have to page through the list to find it. Sometimes I also just want to read a random article to broaden my knowledge or take a break.

To make reading articles more convenient, I started exploring a CLI tool. Its core function is to let developers quickly find the links of articles published on Qiwu Weekly.

Main features

  • Grab all article links
  • Randomly return N article links
  • Scheduled automatic crawling

Grab all article links

The purpose of this feature is to crawl the article link data, providing data support for the CLI tool and paving the way for later features such as keyword search, article content crawling, and article recommendation.

npx 75_action fetch

Local cache of article data

When used as a command-line tool, crawling the data from the official website takes 20+ seconds, so the crawled article data is cached in a local file.

After a command is executed, the data is read from the cache automatically; the cache is valid for 24 hours.

Randomly return N article links

This is one of the main features of the CLI tool: run one command to get the data of N randomly selected articles.

npx 75_action random <N>

Scheduled automatic crawling

Scheduled tasks are configured with GitHub Actions: the crawl described in "Grab all article links" runs every day at 00:00, 08:00, and 16:00, and the crawled article data is uploaded to GitHub for download.

The project design

  1. Get article data

  2. CLI tool

  3. Caching strategy

Function implementation

Article data capture

The corresponding source code can be viewed here: github.com/JohnieXu/75…

Crawl the HTML of the Qiwu Weekly homepage and parse out the collection (issue) data:

// Imports assumed by the snippets below (the article omits them):
// node-fetch for HTTP requests and cheerio for HTML parsing
const fetch = require('node-fetch')
const cheerio = require('cheerio')

function getCollections() {
  return fetch(homeUrl) // homeUrl: the Qiwu Weekly homepage URL
    .then(res => res.text())
    .then(res => {
      if (!res) {
        return Promise.reject(new Error('Failed to get web content'))
      }
      return cheerio.load(res)
    })
    .then($ => {
      const list = $('ol.issue-list > li')
      // cheerio's .map callback receives (index, element)
      const collections = list.map((i, l) => {
        const title = $(l).find('a').attr('title')
        const url = $(l).find('a').attr('href')
        const date = $(l).find('.date').attr('datetime')
        return { title, url, date }
      })
      return collections
    })
}

Crawl the HTML of each collection's page and parse out its article data:

function getArticleDoc(url) {
  return fetch(homeUrl + url)
    .then(res => res.text())
    .then(res => {
      if (!res) {
        return Promise.reject(new Error('Failed to get web content'))
      }
      return cheerio.load(res)
    })
}

function getArticles(doc) {
  const $ = doc
  const el = $('ul > li.article')
  const list = el.map((i, l) => {
    return {
      title: $(l).find('h3.title > a').text(),
      url: $(l).find('h3.title > a').attr('href'),
      desc: $(l).find('.desc').text()
    }
  })
  return list
}

Merge the article data of every collection and sort the result:

// For each collection ({ title, url, date }), fetch its page and merge its
// articles into `all`, tagging each article with its issue title and date
getArticleDoc(url)
  .then(getArticles)
  .then(list => list.map((_, item) => ({ ...item, issue: title, date })))
  .then(list => {
    all = [...all, ...list] // consolidate article data
  })

// After all collections are merged, sort dates in reverse chronological order
all = all.sort((a, b) => b.date.localeCompare(a.date))

The article's `date` field is the release date of the collection (issue) it belongs to, e.g. 2021-12-17. The dates need to be arranged in reverse chronological order, so `String.prototype.localeCompare` is used to sort the date strings.
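Because the dates are ISO-style `YYYY-MM-DD` strings, lexicographic comparison matches chronological order; a quick standalone illustration (not from the project):

```javascript
// ISO-style date strings sort chronologically when compared lexicographically,
// so b.localeCompare(a) yields newest-first order
const dates = ['2021-12-17', '2022-01-07', '2021-11-05']
const sorted = [...dates].sort((a, b) => b.localeCompare(a))
console.log(sorted) // → ['2022-01-07', '2021-12-17', '2021-11-05']
```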

Article data cache

The corresponding source code can be viewed here: github.com/JohnieXu/75…

Cache file and validity period

const fs = require('fs')     // assumed imports, omitted in the article
const path = require('path')

const CACHE_FILE = './.75_action/.data.json'
const CACHE_TIME = 1000 * 60 * 60 * 24 // cache is valid for 24 h
const homeDir = require('os').homedir()
const p = path.resolve(homeDir, CACHE_FILE) // the cache file lives in the user's home directory

Read the modification time of the cache file to determine whether it has expired (a missing cache file counts as expired):

function isCacheOutDate() {
  const p = path.resolve(require('os').homedir(), CACHE_FILE)
  if (!fs.existsSync(p)) {
    return true
  }
  const stat = fs.statSync(p)
  const lastModified = stat.mtime
  const now = new Date()
  return now - lastModified >= CACHE_TIME
}

If the cache has not expired, the cache file is read and used as the crawled article data:

function getHomeFileJson() {
  const homeDir = require('os').homedir()
  const p = path.resolve(homeDir, CACHE_FILE)
  const jsonStr = fs.readFileSync(p)
  let json
  try {
    json = JSON.parse(jsonStr)
  } catch(e) {
    console.error(e)
    json = []
  }
  return json
}

After new article data is crawled, it is written to the local cache:

function writeFileToHome(json) {
  const homeDir = require('os').homedir()
  const p = path.resolve(homeDir, CACHE_FILE) // write to the user's home directory
  // mkdirp (assumed dependency) creates the directory if it does not exist
  return mkdirp(path.dirname(p)).then(() => {
    fs.writeFileSync(p, JSON.stringify(json, null, 2)) // serialize as pretty-printed JSON
  })
}

CLI Tool Development

Bin Configuration

So that the `npx 75_action` command executes the `bin/75_action.js` script with Node.js, configure the `bin` field in package.json (the script itself should start with a `#!/usr/bin/env node` shebang line):

{
  "bin": {
    "75_action": "bin/75_action.js"
  }
}

The script file it points to can be viewed here: github.com/JohnieXu/75…

Command-line arguments

Use the commander library to register CLI commands and parse their arguments:

const program = require('commander')

// Register the commands
program.command('random [number]')
       .description('Randomly get n links to articles')
       .option('-d, --debug', 'Enable debug mode')
       .action((number, options) => {
         number = number || 1
         if (options.debug) {
           console.log(number, options)
         }
         fetch({ save: 'home', progress: true }).then(({ collections, articles }) => {
           const selected = random(number, articles)
           console.log(JSON.stringify(selected, null, 2))
           process.exit()
         }).catch((e) => {
           console.log(e)
           process.exit(1)
         })
       })

program.command('fetch')
       .description('Re-crawl article links')
       .option('-d, --debug', 'Enable debug mode')
       .action((options) => {
          if (options.debug) {
            console.log(options)
          }
          fetch({ save: 'home', progress: true, reload: true }).then(({ collections, articles }) => {
            console.log(`Crawl complete: ${collections.length} collections, ${articles.length} articles`)
            process.exit()
          })
       })

program.parse(process.argv)
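The `random(number, articles)` helper called above is not shown in this article; a plausible sketch (assumed, not the project's actual code) is a partial Fisher–Yates shuffle that samples N distinct articles without mutating the input:

```javascript
// Sample n distinct items by partially shuffling a copy of the array:
// each pass swaps a random element from the untouched tail into slot i,
// so the first n slots end up holding a uniform random sample.
function random(n, articles) {
  const pool = articles.slice() // copy, so the caller's array is untouched
  const count = Math.min(n, pool.length)
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(Math.random() * (pool.length - i))
    const tmp = pool[i]
    pool[i] = pool[j]
    pool[j] = tmp
  }
  return pool.slice(0, count)
}
```

Asking for more items than exist simply returns a shuffled copy of the whole array, which matches how a "random N" command would be expected to behave.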

Command line progress bar

Use the cli-progress library to render a progress bar in the terminal:

const cliProgress = require('cli-progress')
const bar1 = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic)
bar1.start(collections.length, 0) // total progress = number of collections
bar1.update(doneLen)              // update after each collection's articles are done

Periodic capture of data

This feature uses GitHub Actions to run the crawl as a scheduled task; add the corresponding YML configuration file to the project. The source file can be viewed here: github.com/JohnieXu/75…

name: FETCH
on:
  push:
    branches:
      - master
  schedule:
    - cron: "0 0,8,16 * * *" # runs daily at 00:00, 08:00 and 16:00 UTC (note the 8-hour offset from Beijing time)

jobs:
  build:

    runs-on: ubuntu-latest

    strategy:
      matrix:
        node-version: [16.x]
        # See supported Node.js release schedule at https://nodejs.org/en/about/releases/

    steps:
    - uses: actions/checkout@v2
    - name: Use Node.js ${{ matrix.node-version }}
      uses: actions/setup-node@v2
      with:
        node-version: ${{ matrix.node-version }}
        cache: 'npm'
    - run: npm i -g yarn
    - run: yarn
    - run: node index.js
    - name: Save
      uses: actions/upload-artifact@v2
      with:
        path: data.json

The workflow uses actions/checkout to clone the repository source, actions/setup-node to switch Node.js to 16.x, and finally actions/upload-artifact to package the data.json file generated by running `node index.js` and upload it to GitHub.

Publishing

Npm package release

For the `npx 75_action` command to work, the project needs to be published to the official npm registry under the name 75_action.

Follow the release steps below in order, choosing the ones that apply to your situation; nrm usage can be viewed here: www.npmjs.com/package/nrm.

nrm use npm     # switch the npm registry to the official one
npm login       # log in to the npm account
npm run publish # release

The finished product

Run the following commands in a terminal. Node.js 10.x or later is required, and the node and npx commands must work in the terminal.

A random article

npx 75_action random

Random 5 articles

npx 75_action random 5

Random N articles (N is a positive integer)

npx 75_action random N

Fetching and updating local article data

npx 75_action fetch

Conclusion

This article implements a Node.js-based CLI tool that crawls the title, description, and original link of Qiwu Weekly articles. It basically meets the need for quick access to Qiwu Weekly article links, and the article data can be cached locally, which noticeably improves the experience. Some advanced features are not yet developed, such as keyword search over article titles, returning the latest collection of articles, classification by article title, and article link validity checking.

These features will be developed gradually as circumstances allow; you are welcome to follow the project's progress. The project address is here: github.com/JohnieXu/75… .

References

[1] String.prototype.localeCompare(): developer.mozilla.org/zh-CN/docs/…

[2] cheerio documentation (Chinese): github.com/cheeriojs/c…

[3] commander usage documentation: github.com/tj/commande…

[4] cli-progress usage documentation: github.com/npkgz/cli-p…

[5] GitHub Actions usage tutorial: docs.github.com/cn/actions/…