Preface

This article introduces a Node.js crawler project aimed at newcomers to crawling. Through this project you can get a basic understanding of Node.js crawlers and start writing some simple ones yourself.

Project Address:

github

Start the KOA service

The crawled data is intended for use in web development, so I also start a web service here, based on Koa. Koa is a next-generation web framework for the Node.js platform, and starting a Node service with it is very easy: three lines of code are enough to start an HTTP service.

const Koa = require('koa')
const app = new Koa()

app.listen(8080)

You can learn more about Koa from the official documentation; if you can already use Node.js comfortably, you can pick up Koa in no time.
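
The service above responds with 404 to every request because it has no middleware yet. Here is a minimal sketch of adding a single middleware so it actually returns something (the response body is just an example):

const Koa = require('koa')
const app = new Koa()

// A single middleware that answers every request
app.use(async ctx => {
  ctx.body = { message: 'crawler service is running' } // example payload
})

app.listen(8080)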

The crawler analysis

What is the purpose of a 🕷️ crawler? It is actually very simple: to capture the data we want from a site. It does not matter how we do it or what language we use; as long as we get the data back, the goal is met. Analysing sites, however, shows two cases. Some sites are rendered statically, and no API requests can be observed from the front end, so the data can only be extracted by parsing the page itself; this is static capture. Other pages are rendered by the front end from API responses, so we can find the API address directly and simulate the request in the crawler; this is dynamic capture. Based on this distinction, I designed a simple, general-purpose crawler.
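
To make the distinction concrete, here is a rough sketch of the two approaches using the libraries this project adopts below (superagent for HTTP, cheerio for HTML parsing); the URLs and selectors are purely illustrative:

const superagent = require('superagent')
const cheerio = require('cheerio')

// Static capture: fetch the HTML and extract data from the markup
const staticCrawl = async () => {
  const res = await superagent.get('https://example.com/list.html') // illustrative URL
  const $ = cheerio.load(res.text)
  return $('.item a').map((i, el) => $(el).text()).get()
}

// Dynamic capture: call the API the front end uses and read the JSON directly
const dynamicCrawl = async () => {
  const res = await superagent.get('https://example.com/api/list?page=1') // illustrative URL
  return res.body
}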

Global configuration

For convenience, I register a few helper functions and path constants globally.

const path = require('path')
const base = require('app-root-dir')

// require globally
global.r = (p = base.get(), m = '') => require(path.join(p, m))

// Global path configuration
global.APP = {
  R: base.get(),
  C: path.resolve(base.get(), 'config.js'),
  P: path.resolve(base.get(), 'package.json'),
  A: path.resolve(base.get(), 'apis'),
  L: path.resolve(base.get(), 'lib'),
  S: path.resolve(base.get(), 'src'),
  D: path.resolve(base.get(), 'data'),
  M: path.resolve(base.get(), 'model')
}
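
With these helpers in place, any module in the project can be loaded from one of the anchored directories instead of through fragile relative paths. For example, this is how the library wrappers and the configuration file are required later in this article:

// the superagent wrapper under lib/ and the target list in config.js
const superAgent = r(APP.L, 'superagent')
const { targets } = r(APP.C)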

For unified management, I wrote all the page addresses to crawl into a configuration file:

// All capture targets
const targets = {
  // Nuggets front end related articles
  juejinFront: {
    url: 'https://web-api.juejin.im/query',
    method: 'POST',
    options: {
      headers: {
        'X-Agent': 'Juejin/Web',
        'X-Legacy-Device-Id': '1559199715822',
        'X-Legacy-Token': 'eyJhY2Nlc3NfdG9rZW4iOiJoZ01va0dVNnhLV1U0VGtqIiwicmVmcmVzaF90b2tlbiI6IkczSk81TU9QRjd3WFozY2IiLCJ0b2tlbl90eXBlIjoibWFjIiwiZXhwaXJlX2luIjoyNTkyMDAwfQ==',
        'X-Legacy-Uid': '5c9449c15188252d9179ce68'
      }
    }
  },
  // All kinds of movies on Movie Heaven
  movie: {
    url: 'https://www.dy2018.com'
  },
  // Pixabay
  pixabay: {
    url: 'https://pixabay.com'
  },
  // Douban high score movies
  douban: {
    url: 'https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=recommend&page_limit=20&page_start=0'
  }
}

As shown above, some targets are static pages while others are dynamic APIs. Simulating the latter may require extra request headers, and POST requests need a JSON body; all of this is configured uniformly here.

General class library

For static pages I used the Cheerio library

Cheerio is the Node.js counterpart of jQuery: it can parse pages and extract the relevant information from them. The API it exposes is almost identical to jQuery's, so you can think of it as jQuery on the server. I wrap it simply as follows:

const cheerio = require('cheerio')

const $ = html => cheerio.load(html, {
  ignoreWhitespace: true,
  xmlMode: true
})

const $select = (html, selector) => $(html)(selector)

// Node properties
const $attr = (html, attr) => $(html).attr(attr)


module.exports = {
  $,
  $select,
  $attr
}
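
A quick sketch of how these helpers behave; the HTML snippet is made up for illustration, and the wrapper is assumed to live under lib/cheerio.js, which matches how it is required later:

const { $, $select } = r(APP.L, 'cheerio')

const html = '<ul><li class="item"><a href="/detail/1">First</a></li></ul>'

$select(html, 'li.item a').each((index, ele) => {
  console.log(ele.attribs.href) // '/detail/1'
})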

Superagent is a full-featured server-side HTTP library. It can fetch static pages and hand them to Cheerio for analysis, and it can also retrieve the data returned by dynamic APIs. Based on this, I made a simple wrapper:

// Encapsulate the SuperAgent library
const superagent = require('superagent')
const { isEmpty } = require('lodash')

// Some pages need transcoding, e.g. from GBK to UTF-8
const charset = require('superagent-charset')
const debug = require('debug')('superAgent')

charset(superagent)

const allowMethods = ['GET', 'POST']

const errPromise = new Promise((resolve, reject) => {
  return reject('no url or method is not supported')
}).catch(err => err)

/*
 * options contains the post data and headers, such as
 * {
 *   json: {a: 1},
 *   headers: {ACCEPT: 'json'}
 * }
 */

// mode distinguishes between dynamic and static fetching;
// unicode is the page encoding used for static pages
const superAgent = (url, {method = 'GET', options = {}} = {}, mode = 'dynamic', unicode = 'gbk') => {
  if(!url || !allowMethods.includes(method)) return errPromise
  const { headers } = options

  let postPromise

  if(method === 'GET') {
    postPromise = superagent.get(url)
    if(mode === 'static') {
      // Static pages need to be decoded according to their encoding
      postPromise = postPromise.charset(unicode)
    }
  }

  if(method === 'POST') {
    const { json } = options
    // POST requests send a JSON body
    postPromise = superagent.post(url).send(json)
  }

  // Set the request headers if provided
  if(headers && !isEmpty(headers)) {
    postPromise = postPromise.set(headers)
  }

  return new Promise(resolve => {
    return postPromise
      .end((err, res) => {
        if(err) {
          console.log('err', err)
          // Resolve instead of rejecting so a failed request does not stop the crawl
          return resolve(`There is a ${err.status} error has not been resolved`)
        }
        // Static page: return the text content of the page
        if(mode === 'static') {
          debug('output html in static mode')
          return resolve(res.text)
        }
        // Dynamic API: return the parsed body
        return resolve(res.body)
      })
  })
}

module.exports = superAgent

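
A quick sketch of how the wrapper is called in both modes, using the targets configured earlier (douban is a JSON API, Movie Heaven is a GBK-encoded static site):

const superAgent = r(APP.L, 'superagent')
const { targets } = r(APP.C)

// Dynamic mode (the default): resolves with the parsed JSON body
superAgent(targets.douban.url).then(body => console.log(body))

// Static mode: resolves with the decoded HTML text
superAgent(targets.movie.url, {}, 'static').then(html => console.log(html.length))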

In addition, we need to read and write the captured data:

const fs = require('fs')
const path = require('path')
const debug = require('debug')('readFile')

// Read files from the data folder by default
module.exports = (filename, filepath = APP.D) => {
  const file = path.join(filepath, filename)
  if(fs.existsSync(file)) {
    return fs.readFileSync(file, 'utf8')
  } else {
    debug(`Error: the file does not exist`)
  }
}

And the matching write helper, which creates the target folder if it does not exist yet:

const fs = require('fs')
const path = require('path')
const debug = require('debug')('writeFile')

// By default, write everything into the corresponding file in the data folder
module.exports = (filename, data, filepath) => {
  const writeData = JSON.stringify(data, null, '\t')
  const dir = path.join(filepath || APP.D)
  const lastPath = path.join(dir, filename)
  if(!fs.existsSync(dir)) {
    fs.mkdirSync(dir)
  }
  try {
    fs.writeFileSync(lastPath, writeData)
  } catch (err) {
    debug(`Error: some error occurred, ${err.message}`)
  }
}
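
A minimal sketch of the two helpers working together; the file name demo.json is made up for illustration:

const readFile = r(APP.L, 'readFile')
const writeFile = r(APP.L, 'writeFile')

// Write an object to data/demo.json, then read it back
writeFile('demo.json', { hello: 'world' })
const raw = readFile('demo.json') // a JSON string, or undefined if the file is missing
console.log(raw && JSON.parse(raw))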

Once everything is ready, we can start grabbing pages.

Fetching dynamic API

Take Juejin (the Nuggets) as an example: its requests need to be analysed and then simulated.

Each response comes back with an "after" cursor, and the next page is requested by putting that "after" value into the POST JSON of the following request. The other parameters are fixed and can simply be hard-coded in the crawler.

const { get } = require('lodash')
const superAgent = r(APP.L, 'superagent')
const { targets } = r(APP.C)
const writeFile = r(APP.L, 'writeFile')
const { juejinFront } = targets

let totalPage = 10 // Just grab ten pages

const getPostJson = ({after = ''}) => {
  return {
    extensions: {query: {id: '653b587c5c7c8a00ddf67fc66f989d42'}},
    operationName: '',
    query: '',
    variables: {limit: 10, category: '5562b415e4b00c57d9b94ac8', after, order: 'POPULAR', first: 20}
  }
}

// Save all article data
let data = []
let paging = {}

const fetchData = async (params = {}) => {
  const {method, options: {headers}} = juejinFront
  const options = {method, options: {headers, json: getPostJson(params)}}
  // Initiate a request
  const res = await superAgent(juejinFront.url, options)
  const resItems = get(res, 'data.articleFeed.items', {})
  data = data.concat(resItems.edges)
  paging = {
    total: data.length,
    ...resItems.pageInfo
  }
  if(resItems.pageInfo.hasNextPage && totalPage > 1) {
    fetchData({after: resItems.pageInfo.endCursor})
    totalPage--
  } else {
    // Write to the data folder once all pages have been requested
    writeFile('juejinFront.json', {paging, data})
  }
}

module.exports = fetchData

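
Running this fetcher is just a matter of requiring the module and calling it. A minimal sketch, assuming the file above is saved under src as juejinFront.js (the exact file name is illustrative):

// e.g. in an entry script or a scheduled task
const fetchJuejinFront = r(APP.S, 'juejinFront') // hypothetical file name
fetchJuejinFront() // writes data/juejinFront.json once all pages are fetched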

Fetching static HTML

Take Movie Heaven as an example.

Analysing Movie Heaven shows that the site has list pages and detail pages. To get the magnet links you have to open a detail page, and the detail page links are found on the list pages. So we first request a list page, extract the detail page URLs from it, and then parse each detail page to get the magnet links.

The list page selector is '.co_content8 ul table'. The DOM node set returned by Cheerio is array-like, and its each() API is equivalent to the array forEach method; this is how we walk the list and grab the detail links. After entering a detail page, we grab the magnet links in a similar way. The code also uses ES2017's async/await syntax, which is a convenient way to fetch data asynchronously.

const path = require('path')
const debug = require('debug')('fetchMovie')
const superAgent = r(APP.L, 'superagent')
const { targets } = r(APP.C)
const writeFile = r(APP.L, 'writeFile')
const { $, $select } = r(APP.L, 'cheerio')

const { movie } = targets

// Movie categories, found by analysing the site
const movieTypes = {
  0: 'drama',
  1: 'comedy',
  2: 'action',
  3: 'love',
  4: 'sciFi',
  5: 'cartoon',
  7: 'thriller',
  8: 'horror',
  14: 'war',
  15: 'crime',
}

const typeIndex = Object.keys(movieTypes)

// Parse the list page; the node selector is '.co_content8 ul table'
const fetchMovieList = async (type = 0) => {
  debug(`fetch ${movieTypes[type]} movie`)
  // Save movie data: title, magnet links, score
  let data = []
  let paging = {}
  let currentPage = 1
  const totalPage = 30 // number of pages to grab
  while(currentPage <= totalPage) {
    const url = movie.url + `/${type}/index${currentPage > 1 ? '_' + currentPage : ''}.html`
    const res = await superAgent(url, {}, 'static')
    // Get the array-like node set
    const $ele = $select(res, '.co_content8 ul table')
    // Traverse it
    $ele.each((index, ele) => {
      const li = $(ele).html()
      $select(li, 'td b .ulink').last().each(async (idx, e) => {
        const link = movie.url + e.attribs.href
        // Request the detail page from here
        const { magneto, score } = await fetchMoreInfo(link)
        const info = {title: $(e).text(), link, magneto, score}
        data.push(info)
        // Sort by score in descending order
        data.sort((a, b) => b.score - a.score)
        paging = { total: data.length }
      })
    })
    writeFile(`${movieTypes[type]}Movie.json`, { paging, data }, path.join(APP.D, `movie`))
    currentPage++
  }
}

// Parse the detail page; the selector is '.bd2 #Zoom table a'
const fetchMoreInfo = async link => {
  if(!link) return null
  let magneto = []
  let score = 0
  const res = await superAgent(link, {}, 'static')
  $select(res, '.bd2 #Zoom table a').each((index, ele) => {
    // Do not restrict to magnet: links here, some movies do not have them
    // if(/^magnet/.test(ele.attribs.href)) {}
    magneto.push(ele.attribs.href)
  })
  $select(res, '.position .rank').each((index, ele) => {
    score = Math.min(Number($(ele).text()), 10).toFixed(1)
  })
  return { magneto, score }
}

// Fetch all categories of movies concurrently
const fetchAllMovies = () => {
  typeIndex.map(index => {
    fetchMovieList(index)
  })
}

module.exports = fetchAllMovies
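
The movie fetcher is started the same way. A minimal sketch, assuming the file above is saved under src as movie.js (again, the file name is illustrative):

const fetchAllMovies = r(APP.S, 'movie') // hypothetical file name
fetchAllMovies() // writes data/movie/<type>Movie.json for every configured category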

The data processing

The captured data could be stored in a database; for now I simply write it to local files. The local data can also serve as an API data source. For example, with the Movie Heaven data I can write a local API and use it as the backend while developing locally.

const path = require('path')
const router = require('koa-router')()
const readFile = r(APP.L, 'readFile')
const formatPaging = r(APP.M, 'formatPaging')

// router.prefix('/api');
router.get('/movie/:type', async ctx => {
  const {type} = ctx.params
  const totalData = readFile(`${type}Movie.json`, path.join(APP.D, 'movie'))
  const formatData = await formatPaging(ctx, totalData)
  ctx.body = formatData
})

module.exports = router.routes()
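
To expose this route, the router only needs to be mounted on the Koa app from the beginning of the article. A minimal sketch, assuming the route file above is saved as apis/movie.js (which matches the APP.A path in the global configuration) and that the global r/APP helpers have already been registered:

const Koa = require('koa')
const app = new Koa()

// Mount the crawled-data routes; the file name is illustrative
app.use(r(APP.A, 'movie'))

app.listen(8080)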

To make feed-style loading easier when serving data to the front end, I manually maintain paging information:

// Manually generate paging data
const {getQuery, addQuery} = r(APP.L, 'url')
const {isEmpty} = require('lodash')

module.exports = (ctx, originData) => {
  return new Promise((resolve) => {
    const {url, header: {host}} = ctx
    if(!url || isEmpty(originData)) {
      return resolve({
        data: [],
        paging: {}
      })
    }
    const {data, paging} = JSON.parse(originData)
    const query = getQuery(url)
    const limit = parseInt(query.limit) || 10
    const offset = parseInt(query.offset) || 0
    const isEnd = offset + limit >= data.length
    const prev = addQuery(`http://${host}${url}`, {limit, offset: Math.max(offset - limit, 0)})
    const next = addQuery(`http://${host}${url}`, {limit, offset: Math.max(offset + limit, 0)})
    const formatData = {
      data: data.slice(offset, offset + limit),
      paging: Object.assign({}, paging, {prev, next, isEnd})
    }
    return resolve(formatData)
  })
}


If convenient, we could also write the data into a database, and then we would have the whole crawler-backend-frontend chain in one go, haha.

In the API above, paging is controlled by the limit and offset parameters, which the caller can customise; requesting the "next" URL returns the next page, which is how the feed stream is implemented.
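
For example, with the service running locally on port 8080, the feed could be consumed roughly like this (the host, port and field values are illustrative):

const http = require('http')

// Request the first page of action movies from the local service
http.get('http://localhost:8080/movie/action?limit=10&offset=0', res => {
  let body = ''
  res.on('data', chunk => (body += chunk))
  res.on('end', () => {
    const { data, paging } = JSON.parse(body)
    console.log(data.length) // up to 10 items
    console.log(paging.next) // URL of the next page, e.g. ...?limit=10&offset=10
    console.log(paging.isEnd) // true once offset + limit reaches the total
  })
})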

✨ ✨ ✨

Of course, there is much more to learn about crawlers. Some sites have anti-crawler restrictions that require rotating through an IP pool, and some require simulated login. There is still a lot to explore.