Lately, when reading local news, I kept jumping from one news site to another, which got tedious. So I decided to write a small crawler that scrapes several news sites and combines them into my own news list. I chose Node to implement it. The news-crawler result: news.imondo.cn

Approach

Crawling the pages relies on several libraries:

  • request: an HTTP request library for Node
  • cheerio: a DOM parsing library with a jQuery-like API
  • iconv-lite: Node.js natively decodes only a few encodings such as UTF-8, so GBK-encoded pages come out garbled; iconv-lite is used to transcode them
  • node-schedule: scheduled task handling for Node
  • ejs: a template engine
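To see why the transcoding step matters, here is a minimal sketch that decodes the same GBK bytes two ways. It uses Node's built-in TextDecoder instead of iconv-lite, and assumes a Node build with full ICU support (so that the 'gbk' label is recognized). The helper name decodeGbk and the sample bytes are illustrative only.

```javascript
// Sketch only: the article uses iconv-lite; this uses the built-in TextDecoder
// (assumes a Node build with full ICU, which recognizes the 'gbk' label).
const { TextDecoder } = require('util');

// Hypothetical helper: decode a raw Buffer that holds GBK-encoded text.
function decodeGbk(buffer) {
  return new TextDecoder('gbk').decode(buffer);
}

// The two characters of "中文" encoded as GBK bytes.
const gbkBytes = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]);

// Treating the bytes as UTF-8 produces replacement characters, not the text.
const wrong = gbkBytes.toString('utf8');
const right = decodeGbk(gbkBytes);

console.log(wrong === '中文'); // false
console.log(right === '中文'); // true
```

This is also why the request call needs the raw bytes rather than a string that was already decoded as UTF-8.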

The program follows a simple flow

  • Fetch the web page with request
  • Parse the page structure with cheerio to extract the key news information
  • Write the data to a JSON file
  • Display the news list through EJS templates

Getting the data

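  • Use request to fetch the news page to crawl
  getList(url) {
    return new Promise((resolve, reject) => {
      // encoding: null makes request return the body as a raw Buffer
      request.get({ url, encoding: null }, function (err, res, body) {
        if (err) {
          console.log(err);
          return reject(err);
        }
        // The page is GBK-encoded, so decode it before parsing
        const html = iconv.decode(body, 'gbk');
        const $ = cheerio.load(html, { decodeEntities: false });
        resolve($);
      });
    });
  }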
  • Parse the page structure to extract the key information
  async get163List(ctx) {
    const newsList = [];
    const url = 'https://hunan.news.163.com/';
    const $ = await this.getList(url);
    // Hot news
    $('.news-feature').each(function (index, elem) {
      const $elem = $(elem);
      $elem.find('a').each(function (index, e) {
        const $e = $(e);
        const title = $e.text();
        const href = $e.attr('href');
        // Links nested inside an <h5> are treated as hot headlines
        const hot = $e.parents('h5').length > 0;
        newsList.push({
          title,
          href,
          hot,
          tag: 'Netease News'
        });
      });
    });
    return newsList;
  }
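Links scraped this way may include empty titles, relative addresses, or the same article linked twice. As a rough sketch of one way to tidy the list before storing it, the hypothetical helper below (normalizeNews is not part of the original crawler) trims titles, resolves each href against the page URL, and drops duplicates. The sample items are made up for illustration.

```javascript
// Hypothetical post-processing step; not part of the original crawler.
function normalizeNews(items, baseUrl) {
  const seen = new Set();
  const result = [];
  for (const item of items) {
    const title = (item.title || '').trim();
    if (!title || !item.href) continue; // skip image-only or empty links
    let url;
    try {
      // Resolve relative links such as "/a/b.html" against the page URL.
      url = new URL(item.href, baseUrl);
    } catch (e) {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    if (url.protocol !== 'http:' && url.protocol !== 'https:') continue;
    if (seen.has(url.href)) continue; // keep only the first link to each article
    seen.add(url.href);
    result.push({ ...item, title, href: url.href });
  }
  return result;
}

// Made-up sample data in the same shape as the items pushed to newsList.
const sample = [
  { title: '  Headline one ', href: '/20/0101/a.html', hot: true, tag: 'Netease News' },
  { title: '', href: '/20/0101/b.html', hot: false, tag: 'Netease News' },
  { title: 'Headline one again', href: '/20/0101/a.html', hot: false, tag: 'Netease News' },
];
console.log(normalizeNews(sample, 'https://hunan.news.163.com/'));
```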

Writing the data file

The key news information is written to a file with Node's fs module. I name each file after the current date.

fs.writeFile(path.resolve(__dirname, `../database/${dir}.json`), JSON.stringify({
  data
}), function (err) {
  if (err) {
    console.log(err);
    throw err;
  }
  console.log('Write done');
});
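The article does not show how the date-based file name is built, so the following is only a sketch under assumptions: a YYYY-MM-DD name in local time, with hypothetical helpers formatDate, dailyFile, and saveNews. The demo writes to a temporary directory instead of the real database folder.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: format a Date as YYYY-MM-DD in local time.
function formatDate(date) {
  const pad = n => String(n).padStart(2, '0');
  return `${date.getFullYear()}-${pad(date.getMonth() + 1)}-${pad(date.getDate())}`;
}

// Build the path of the JSON file for a given day, e.g. <dir>/2020-03-05.json.
function dailyFile(databaseDir, date) {
  return path.join(databaseDir, `${formatDate(date)}.json`);
}

// Write the list wrapped as { data }, creating the directory if needed.
function saveNews(databaseDir, date, data, cb) {
  fs.mkdir(databaseDir, { recursive: true }, mkdirErr => {
    if (mkdirErr) return cb(mkdirErr);
    fs.writeFile(dailyFile(databaseDir, date), JSON.stringify({ data }), cb);
  });
}

// Usage: write into a temporary directory rather than the real database folder.
const demoDir = path.join(os.tmpdir(), 'news-crawler-demo');
const demoDate = new Date(2020, 2, 5); // 5 March 2020 (months are zero-based)
saveNews(demoDir, demoDate, [{ title: 'Example', href: 'https://example.com/' }], err => {
  if (err) throw err;
  console.log('Write done:', dailyFile(demoDir, demoDate));
});
```

With one file per day, a later crawl on the same day simply overwrites that day's file.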

Template rendering

The program renders pages with EJS. Because it is built mainly on Koa, it needs the koa-static and koa-views middleware. First, read the JSON data file:

fs.readFile(path.join(__dirname, `../database/${date}.json`), (err, data) => {
  if (err) {
    reject(null);
  } else {
    resolve(JSON.parse(data.toString()).data);
  }
});
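The snippet above sits inside a Promise that the article does not show in full. As a hedged sketch of one possible wrapper, the hypothetical readNews function below uses fs.promises and, unlike the original (which rejects), falls back to an empty list when the day's file is missing or malformed, so a page can still render before the first crawl.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical wrapper: resolve with the stored list, or [] if the file
// is missing or does not contain the expected { data: [...] } shape.
async function readNews(file) {
  try {
    const text = await fs.promises.readFile(file, 'utf8');
    const parsed = JSON.parse(text);
    return Array.isArray(parsed.data) ? parsed.data : [];
  } catch (err) {
    return [];
  }
}

// Usage: a file that does not exist yields an empty list.
readNews(path.join(os.tmpdir(), 'no-such-news-file.json')).then(list => {
  console.log(list.length); // 0
});
```

Whether to swallow the error or propagate it is a design choice; rejecting, as the original does, makes failures more visible.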

Then render the data that was read:

// Named `serve` because `static` is a reserved word in strict mode
const serve = require('koa-static');
const views = require('koa-views');

// Static files
app.use(serve(
  path.join(__dirname, './static')
));

// Load the template engine
app.use(views(
  path.join(__dirname, './views'), {
    extension: 'ejs'
  }
));

app.use(async ctx => {
  let list = await crawler.getNews();
  await ctx.render('index', {
    list,
    time: utils.getNowDate()
  });
});
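The index template itself is not shown in the article. To give a feel for what it receives, the sketch below renders the same two values (list and time) with plain template literals standing in for EJS. The function names and the markup are assumptions, not the author's template.

```javascript
// Sketch only: plain template literals stand in for the EJS template here.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical view function taking the same data the article passes to
// ctx.render: the news list and the current time.
function renderIndex({ list, time }) {
  const rows = list
    .map(item => {
      const cls = item.hot ? ' class="hot"' : '';
      return `<li${cls}><a href="${escapeHtml(item.href)}">${escapeHtml(item.title)}</a> [${escapeHtml(item.tag)}]</li>`;
    })
    .join('\n');
  return `<p>Updated: ${escapeHtml(time)}</p>\n<ul>\n${rows}\n</ul>`;
}

console.log(renderIndex({
  list: [{ title: 'A <b>headline</b>', href: 'https://example.com/a', hot: true, tag: 'Netease News' }],
  time: '2020-03-05 08:00',
}));
```

Scraped titles are untrusted input, so whatever template is used, they should be escaped on output.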

At this point the crawler itself is finished. But news is time-sensitive, so we need a scheduled task that keeps crawling.

Scheduled task

The basic usage of node-schedule is covered in its documentation. The program crawls once every 4 hours.

const schedule = require('node-schedule');
const config = require('./../config');

const rule = new schedule.RecurrenceRule();

rule.hour = config.timeJob;
rule.minute = 0;

/* Scheduled task */
function timeJob (cb) {
  schedule.scheduleJob(rule, function () {
      console.log('Perform scheduled tasks once');
      cb && cb();
  });
}
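The contents of config.timeJob are not shown. One plausible value, assuming it is an array of hours (a RecurrenceRule accepts an array for hour), is every fourth hour of the day. The sketch below builds such a list and works out the next run time by hand. Both everyNHours and nextRun are hypothetical helpers for illustration, not part of node-schedule.

```javascript
// Hypothetical helper: hours of the day for a job that runs every `step` hours.
function everyNHours(step) {
  const hours = [];
  for (let h = 0; h < 24; h += step) hours.push(h);
  return hours;
}

// Given the allowed hours (minute fixed at 0), find the next run after `now`.
function nextRun(now, hours) {
  const sorted = [...hours].sort((a, b) => a - b);
  const next = new Date(now);
  next.setMinutes(0, 0, 0);
  const hour = sorted.find(h => h > now.getHours());
  if (hour === undefined) {
    // No slot left today: roll over to the first slot tomorrow.
    next.setDate(next.getDate() + 1);
    next.setHours(sorted[0]);
  } else {
    next.setHours(hour);
  }
  return next;
}

const timeJob = everyNHours(4); // assumed shape of config.timeJob
console.log(timeJob); // [ 0, 4, 8, 12, 16, 20 ]
console.log(nextRun(new Date(2020, 2, 5, 9, 30), timeJob).getHours()); // 12
```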

GitHub Actions deployment

GitHub Actions is used to deploy automatically to my own server. Deploying to a private server first requires solving the login authentication problem: either enter a password or log in with an SSH key. This program uses the second option.

SSH private key connection

  • Log in to your server and generate a key pair
$ mkdir -p ~/.ssh && cd ~/.ssh
$ ssh-keygen -t rsa -f mysite
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:

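ssh-keygen generates two files: mysite (the private key) and mysite.pub (the public key). The private key is your personal login credential, so do not share it with anyone. The public key must be placed on the target server.

  • Append the contents of the public key mysite.pub to ~/.ssh/authorized_keys on the target server
  • Make sure the permissions on the server's ~/.ssh folder are no more permissive than 711. I simply used 600 here (only this user can read and write). Note that a directory normally also needs the execute bit to be entered, so 700 for the folder and 600 for the files inside is the more common setup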
chmod 600 -R ~/.ssh

Automation configuration

In the GitHub repository's Settings, open Secrets and add an SSH_PRIVATE_KEY entry whose value is the content of the mysite private key from the previous step. The other private values the workflow needs (USERNAME, SERVER_IP, PORT) are added the same way, as shown in the picture.

The configuration file

GitHub Actions automatically reads the YML configuration in the repository's .github/workflows folder. The meaning of each setting is explained in the comments of the configuration file.

name: Mondo News action # workflow name
on:
  push:
    branches:
      - master # only pushes to master trigger deployment
    paths-ignore: # changes to the files below do not trigger deployment; add your own as needed
      - README.md
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest # run the automation scripts on an Ubuntu image
    steps: # automation steps
    - name: Checkout # step name
      uses: actions/checkout@master # use a step packaged by someone else
    - name: Deploy file to Server
      uses: Wlixcc/[email protected]
      with:
        username: ${{ secrets.USERNAME }} # reference to the configured secret: user name
        server: ${{ secrets.SERVER_IP }} # reference to the configured secret: server IP
        ssh_private_key: ${{ secrets.SSH_PRIVATE_KEY }} # SSH private key
        local_path: './*'
        remote_path: '/front/news'
    - name: Server Start
      uses: appleboy/ssh-action@master
      with:
        host: ${{ secrets.SERVER_IP }}
        username: ${{ secrets.USERNAME }}
        key: ${{ secrets.SSH_PRIVATE_KEY }}
        port: ${{ secrets.PORT }}
        script: sh /front/news/deploy.sh # run the script command

The biggest benefit of GitHub Actions is that deployment can reuse actions already written by third parties. We only need to write the configuration for it to run, which makes it very extensible. Commit the configuration file to the remote repository and you can see the run status in the Actions tab of the repository.

Conclusion

This time I used code to solve a small problem in my own life, and I also tried GitHub Actions to simplify some repetitive deployment work.

