Initialize & install dependencies

npm init --yes
npm i express superagent cheerio -S

SuperAgent is a lightweight Ajax API that can be used on both the server (Node.js) and the client (browser)

Cheerio is a fast, flexible, and lean implementation of core jQuery (see the Cheerio Chinese documentation)

Build a simple server

// Create a server instance
const express = require('express')
const app = express()

app.get('/', (req, res) => {
  res.send('Crawler in action')
})

// Get the server information and print it
let server = app.listen(3000, () => {
  let host = server.address().address;
  let port = server.address().port;
  // %s is a placeholder that console.log replaces with the next argument
  console.log('Program is running http://%s:%s', host, port);
})
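
app.listen returns Node's http.Server, so server.address() behaves as it does for a plain http server. When no host is passed, Node usually binds to the unspecified address (:: or 0.0.0.0), which does not make a URL you can open. The sketch below uses only the built-in http module. The formatUrl helper is a hypothetical name introduced here for illustration, and substituting localhost for the unspecified address is an assumption about what is convenient during local development.

```javascript
const http = require('http')

// Hypothetical helper: turn the object returned by server.address()
// into a URL a person can open in a browser.
function formatUrl(addressInfo) {
  const { address, port, family } = addressInfo
  // '::' and '0.0.0.0' mean "all interfaces", which is not a browsable host
  if (address === '::' || address === '0.0.0.0') {
    return `http://localhost:${port}`
  }
  // IPv6 literals must be wrapped in brackets inside a URL
  const isIPv6 = family === 'IPv6' || family === 6
  const host = isIPv6 ? `[${address}]` : address
  return `http://${host}:${port}`
}

// Usage with a plain http server; port 0 asks the OS for any free port
const demoServer = http.createServer((req, res) => res.end('ok'))
demoServer.listen(0, () => {
  console.log('Program is running at %s', formatUrl(demoServer.address()))
  demoServer.close()
})
```

The same helper could be called with the value returned by the Express server's address() method.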

Run the server

nodemon index.js

Open localhost:3000 in the browser

Analyze page content

Baidu News (news.baidu.com)

For example, open the Baidu News home page, open the browser console, and inspect the elements

Use Cheerio to select id > ul > li > a and get the text inside each a tag

Fetch the page

Import the superagent module and call its get method, passing in the page URL

const superagent = require('superagent')

superagent.get('http://news.baidu.com/').end((err, res) => {
  if (err) {
    console.log('Hot news fetching failed: ' + err);
    return
  }
  console.log(res);
})

After saving, nodemon restarts the server and the terminal prints the result. The output is too long for the terminal to hold, so the upper part has scrolled out of view

All data returned from the page URL is contained in res
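
Since printing the whole response floods the terminal, it can help to log only a short summary. The sketch below is one way to do that. summarizeResponse is a hypothetical helper, not part of superagent, and it reads only the status, type, and text properties of the response. The usage example passes a stand-in object rather than a real response so it runs without a network connection.

```javascript
// Hypothetical helper: reduce a superagent response to a few readable fields
function summarizeResponse(res, previewLength = 200) {
  const text = res.text || ''
  return {
    status: res.status,                   // HTTP status code
    type: res.type,                       // content type without parameters
    length: text.length,                  // size of the body in characters
    preview: text.slice(0, previewLength) // only the first part of the HTML
  }
}

// Usage with a stand-in object shaped like a superagent response
const fakeRes = {
  status: 200,
  type: 'text/html',
  text: '<html>' + 'x'.repeat(1000) + '</html>'
}
console.log(summarizeResponse(fakeRes, 20))
```

In the callback above, console.log(summarizeResponse(res)) could take the place of console.log(res).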

Process the data

Now let’s start processing the data

  1. Import the cheerio library
  2. Below the request, declare the handler function for res
  3. At the top, declare the variable that will hold the result
  4. In the superagent.get() callback, call the handler function and assign its return value to the pre-declared variable

Note that the route handler app.get('/', (req, res) => {}) is placed below the handler function
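
The timing matters as much as the order of the declarations: the request returns immediately and its callback runs later, so the pre-declared variable stays empty until the response arrives. The sketch below illustrates this with setTimeout as a stand-in for the network request. fakeFetch is a hypothetical function used only for this demonstration.

```javascript
let hotNews = []

// Stand-in for superagent.get(...).end(callback): calls back asynchronously
function fakeFetch(callback) {
  setTimeout(() => {
    callback(null, [{ title: 'Example headline', href: 'http://example.com/' }])
  }, 10)
}

fakeFetch((err, result) => {
  if (err) return console.log('Hot news fetching failed: ' + err)
  hotNews = result
  console.log('after callback:', hotNews.length)
})

// This line runs first, because the callback has not fired yet
console.log('before callback:', hotNews.length)
```

A request that reaches the route before the callback has run would therefore receive the empty array.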

const express = require('express')
const superagent = require('superagent')
const cheerio = require('cheerio')
const app = express()

let hotNews = []

superagent.get('http://news.baidu.com/').end((err, res) => {
  if (err) {
    console.log('Hot news fetching failed: ' + err);
    return
  }
  // Call the function and assign the result directly to the external variable
  hotNews = getHotNews(res)
})

let getHotNews = res => {
  // Get $ by passing res.text (the full response body as a string) to cheerio's load method
  let $ = cheerio.load(res.text)
  // Pass a selector to $ to select the matching elements:
  // $('#pane-news ul li a')
  console.log($('#pane-news ul li a'));
}

app.get('/', (req, res) => {
  res.send(hotNews)
})

// Get the server information and print it
let server = app.listen(3000, () => {
  let host = server.address().address;
  let port = server.address().port;
  // %s is a placeholder that console.log replaces with the next argument
  console.log('Program is running http://%s:%s', host, port);
})

$('#pane-news ul li a') returns an array-like Cheerio object containing all matching nodes

let getHotNews = res => {
  // Declare an empty array
  let hotNews = []
  // Get $ by passing res.text (the full response body as a string) to cheerio's load method
  let $ = cheerio.load(res.text)
  // Passing a selector to $ returns an array-like collection of all matching elements
  // Iterate over it and put each element's text and href into a news object
  $('#pane-news ul li a').each((index, ele) => {
    let news = {
      title: $(ele).text(), // Get the headline
      href: $(ele).attr('href') // Get the news page link
    }
    hotNews.push(news) // Push the news object from each iteration into the declared array
  })
  // Return the result after the loop; the caller assigns it to the top-level hotNews variable
  return hotNews
}
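
Scraped anchors can carry surrounding whitespace, empty titles, relative links, or duplicates. The sketch below is one possible cleanup step for the objects built above. normalizeNews is a hypothetical helper, and the base URL passed to it is assumed to be the page that was fetched. It relies only on the built-in URL class.

```javascript
// Hypothetical cleanup step for the { title, href } objects built above
function normalizeNews(items, baseUrl) {
  const seen = new Set()
  const result = []
  for (const item of items) {
    const title = (item.title || '').trim()
    if (!title || !item.href) continue // skip anchors without text or link
    let href
    try {
      href = new URL(item.href, baseUrl).href // resolve relative links against baseUrl
    } catch (e) {
      continue // skip values that cannot be parsed as a URL
    }
    if (seen.has(href)) continue // drop duplicate links
    seen.add(href)
    result.push({ title, href })
  }
  return result
}

// Usage
const cleaned = normalizeNews(
  [
    { title: '  Headline A ', href: '/a' },
    { title: '', href: 'http://example.com/empty' },
    { title: 'Headline A again', href: 'http://news.baidu.com/a' }
  ],
  'http://news.baidu.com/'
)
console.log(cleaned)
```

Only the first item survives in this usage example: the second has no title, and the third resolves to the same link as the first.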

Return the data

The function now returns its result instead of printing it

The route sends the returned value that was stored in hotNews

app.get('/', (req, res) => {
  res.send(hotNews)
})

Of course, once the data is retrieved, it does not have to be sent straight to the client for display

Further processing can be done in the superagent callback

superagent.get('http://news.baidu.com/').end((err, res) => {
  if (err) {
    console.log('Hot news fetching failed: ' + err);
    return
  }
  // Call the function and assign the result directly to the external variable
  hotNews = getHotNews(res)
  /* 1. Save the data to the database. 2. The route's page requests the data from the database and displays it in an ECharts chart */
})
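
The comment in the callback mentions saving the data to a database, which the article does not show. As a rough stand-in, the sketch below persists the items to a JSON file using only built-in modules. saveNews, loadNews, and the file location are all hypothetical choices made for illustration. A real project would likely use an actual database.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Assumed location for the stand-in "database" file
const STORE_PATH = path.join(os.tmpdir(), 'hot-news.json')

// Hypothetical helper: write the scraped items together with a timestamp
function saveNews(items, filePath = STORE_PATH) {
  const record = { savedAt: new Date().toISOString(), items }
  fs.writeFileSync(filePath, JSON.stringify(record, null, 2), 'utf8')
}

// Hypothetical helper: read the items back, or return [] if nothing was saved yet
function loadNews(filePath = STORE_PATH) {
  if (!fs.existsSync(filePath)) return []
  return JSON.parse(fs.readFileSync(filePath, 'utf8')).items
}

// Usage
saveNews([{ title: 'Example headline', href: 'http://example.com/' }])
console.log(loadNews())
```

With these helpers, the callback could call saveNews(getHotNews(res)) and the route could respond with loadNews().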

Failed to capture local news

I won't repeat the code here, since it follows the same pattern as above. This part of the data cannot be fetched that way because the page loads it dynamically in the browser after the initial page load

Requesting news.baidu.com through superagent returns only the static content served at that URL. It does not run the page's scripts, so the requests that load the dynamic content are never triggered

The solution is to use a third-party library that simulates a browser visiting the Baidu News home page. Once the dynamic content has loaded in the simulated browser, the data is captured and returned to the front-end browser

Nightmare implements dynamic data fetching

segmentio/nightmare: A high-level browser automation library. (github.com)

Use the Nightmare automated testing tool

Electron lets you create desktop applications in pure JavaScript by calling the rich native APIs it exposes on top of Chromium. Think of it as a variant of Node.js that focuses on desktop applications rather than web servers. Its browser-based application model makes all kinds of responsive interaction easy to build

Nightmare is an Electron-based framework for automated web testing and crawling. Like PhantomJS, it can simulate user behavior on a page to trigger asynchronous data loading. It can also visit a URL directly to fetch data, like the Request library, and it can wait for a set delay before continuing, so pages are easy to handle whether their scripts are triggered manually or by user behavior

Install dependencies

npm i nightmare -S

Usage

Import modules, get instances, and call methods to get data dynamically

const express = require('express')
const app = express()
const Nightmare = require('nightmare')
// Setting show: true displays an automated built-in browser
const nightmare = Nightmare({ show: true})
const cheerio = require('cheerio')

let localNews = []
//---------------------------------------------------------------------------------
nightmare
  .goto('http://news.baidu.com') // The URL to visit
  .wait('div#local_news') // Wait until this node has loaded
  .evaluate(() => document.querySelector('div#local_news').innerHTML) // Read the node's content inside the page
  .then(htmlStr => { // Receive the HTML string
    localNews = getLocalNews(htmlStr) // Call the parsing function
  })
  .catch(err => {
    console.error(err)
  })
//----------------------------------------------------------------------------------
let getLocalNews = htmlStr => {
  let localNews = []
  let $ = cheerio.load(htmlStr) // No .text needed because we already have a string
  $('ul#localnews-focus li a').each((index, ele) => {
    let news = {
      title: $(ele).text(),
      href: $(ele).attr('href')
    }
    localNews.push(news)
  })
  return localNews
}

app.get('/', (req, res) => {
  res.send(localNews)
})

// Get the server information and print it
let server = app.listen(3000, () => {
  let host = server.address().address;
  let port = server.address().port;
  // %s is a placeholder that console.log replaces with the next argument
  console.log('Program is running http://%s:%s', host, port);
})

Now open the link and you can see the dynamically loaded content