The crawler uses several modules: cheerio, iconv-lite, async, and request.

Cheerio is an HTML parsing module with jQuery-like syntax; you can think of it as jQuery for Node.js.

iconv-lite is used to decode GBK-encoded pages (common on Chinese websites) into normal JavaScript strings, so the text does not come out garbled.

Request is an HTTP client module. It is very simple to use, and beyond plain HTTP requests it also supports HTTPS, request redirection, streaming, form submission, HTTP authentication, OAuth sign-in, custom HTTP headers, and more.

Async is an asynchronous flow-control module; here we mainly use async.mapLimit(coll, limit, iteratee, callback):

```javascript
async.mapLimit(urls, 10, function (url, callback) {
  fetchUrl(url, callback, id)
}, function (err, results) {
  // TODO
})
```

The first argument, coll, is the array of URLs to request. The second, limit, caps the number of concurrent requests. The third, iteratee, is a function called once per URL; it receives the URL and a callback, and passing a result to that callback (here, the image addresses found on each page) hands it off to the final callback's results parameter. results is an array that collects every result, in the same order as the input, once all requests have finished.

The full code:

```javascript
// Import modules
const http = require('http')
const fs = require('fs')
const cheerio = require('cheerio')
const iconv = require('iconv-lite')
const request = require('request')
const async = require('async')

const urlList = [] // address list
var id = 0 // counter

// The page addresses all share the same format, so just concatenate them
for (var i = 193; i > 190; i--) {
  urlList.push('http://jandan.net/ooxx/page-' + i)
}

function getPages(url, callback) {
  http.get(url, res => {
    const html = []
    res.on('data', (chunk) => {
      html.push(chunk)
    })
    res.on('end', () => {
      // Decode the response; if the site is GBK-encoded, pass 'gbk' here
      // instead of 'utf8', otherwise the text may come out garbled
      const html1 = iconv.decode(Buffer.concat(html), 'utf8')
      // Parse the fetched page with cheerio using jQuery syntax
      const $ = cheerio.load(html1, {
        decodeEntities: false
      })
      const link = []
      $('.view_img_link').each((i, v) => {
        // Collect each image address
        link.push($(v).attr('href'))
      })
      callback(null, link)
    })
  })
}
```

Words without pictures count for nothing, so here's a screenshot.

PS: I actually wrote a crawler for an adult website first, but I was too shy to post it.