Foreword

Learning how to crawl data from the web is a very useful skill; by some estimates, 30 to 40 percent of web traffic is generated by crawlers. So what is a crawler? Simply put, it requests a link on the web and extracts the content we need from the returned HTML: Node requests a URL -> gets the HTML -> parses the HTML.

Preparation

As a small demo, we'll crawl the Douban Top250 movie ranking.

  1. Using VS Code as the IDE, create a folder and an index.js file inside it
  2. Right-click the folder and choose Open in Integrated Terminal

  3. Run `npm init -y` in the terminal to generate a package.json file

  4. Continue in the terminal: `npm install cheerio`. Cheerio is a fast, flexible and lean implementation of core jQuery functionality, used mainly for DOM manipulation on the server side. See the Cheerio official documentation for details
  5. Find the data you want to crawl on Douban Top250. For simplicity, we'll stick to the movie titles, ratings and covers

Detailed code

The code is minimal, but make sure you understand how it works.

  1. Load the modules

```js
// Load the https module; we need it to request the website
const https = require('https')
// Load the cheerio we installed earlier; it comes into play later
const cheerio = require('cheerio')
// Load fs (FileSystem), the module for working with files
const fs = require('fs')
```
  2. Find where the titles, ratings and covers live in the Douban Top250 page. Step 1: right-click an element and choose Inspect. See the figure for the remaining steps
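For reference, one entry in the list looks roughly like this. It is a simplified sketch (the real page has more wrapper elements), but these are the classes our selectors target:

```html
<!-- Simplified sketch of one Top250 entry; class names assumed from the selectors below -->
<li>
  <div class="item">
    <div class="pic">
      <img src="..." alt="肖申克的救赎">
    </div>
    <div class="info">
      <span class="title">肖申克的救赎</span>
      <span class="rating_num">9.7</span>
    </div>
  </div>
</li>
```

So `$('li .item')` selects each movie, and within it `.title`, `.rating_num` and `.pic img` give us the three pieces of data.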


So far we have all the data! It can be stored in a database and become your own. But what if one day Douban decides it no longer likes us hotlinking the cover images? We can download them locally; the steps are much the same as above, so I won't repeat them here. Of course, feel free to ask me if you need them.

The complete code

```js
// Load the https module; we need it to request the website
const https = require('https')
// Load the cheerio we installed earlier; it comes into play later
const cheerio = require('cheerio')
// Load fs (FileSystem), the module for working with files
const fs = require('fs')

// Use the get method of the https module to request the page below.
// In the callback, res is the requested resource
https.get('https://movie.douban.com/top250', function (res) {
    // Create an empty string for concatenation, since the resource arrives in pieces
    let html = ''
    // res.on is similar to addEventListener, except it listens for events on res:
    // this function runs whenever a chunk of data arrives, and we append it to html
    res.on('data', function (chunk) {
        html += chunk
    })
    // Once all of the data has arrived, the 'end' callback runs
    res.on('end', function () {
        // Loading the html into cheerio lets us use DOM-style manipulation
        const $ = cheerio.load(html)  // $ is the convention used by cheerio
        // Use this array to store the data we crawled
        let allFilms = []

        $('li .item').each(function () {
            // Inside this loop, `this` is the current movie's element
            const title = $('.title', this).text()
            const star = $('.rating_num', this).text()
            const pic = $('.pic img', this).attr('src')
            allFilms.push({ title, star, pic })
        })
        // Save the results to a JSON file with fs
        fs.writeFile('./files.json', JSON.stringify(allFilms), function (err) {
            if (!err) {
                console.log('File written down')
            }
        })
    })
})
```

The hard part is understanding the whole process; once you get that, the code becomes a lot simpler. If you've read this far, how about a like 🙃