The crawler uses several modules: cheerio, iconv-lite, async, and request.

Cheerio is an HTML parsing module with jQuery-like syntax; you can think of it as jQuery for Node.js.

iconv-lite is used to decode GBK-encoded pages (common on Chinese websites) into normal JavaScript strings, so the text does not come out garbled.

Request is an HTTP client module. It is very simple to use, and beyond plain HTTP requests it also supports HTTPS, request redirection, streaming, form submission, HTTP authentication, OAuth sign-in, custom HTTP headers, and more.

Async is an asynchronous flow-control module; here we mainly use async.mapLimit(coll, limit, iteratee, callback):

```javascript
async.mapLimit(urls, 10, function (url, callback) {
  fetchUrl(url, callback, id)
}, function (err, results) {
  // TODO
})
```

The first argument, coll, is the array of URLs to request. The second, limit, caps the number of concurrent requests. The third, iteratee, is a function called once per URL; it receives the URL and a callback, and passing a result to that callback (here, the image addresses found on each page) hands it off to the final callback's results parameter. results is an array that collects every result, in the same order as the input, once all requests have finished.

The full code:

```javascript
// Import modules
const http = require('http')
const fs = require('fs')
const cheerio = require('cheerio')
const iconv = require('iconv-lite')
const request = require('request')
const async = require('async')

const urlList = [] // address list
var id = 0 // counter

// The page addresses all share the same format, so just concatenate them
for (var i = 193; i > 190; i--) {
  urlList.push('http://jandan.net/ooxx/page-' + i)
}

function getPages(url, callback) {
  http.get(url, res => {
    const html = []
    res.on('data', (chunk) => {
      html.push(chunk)
    })
    res.on('end', () => {
      // Decode the response; if the site is GBK-encoded, pass 'gbk' here
      // instead of 'utf8', otherwise the text may come out garbled
      const html1 = iconv.decode(Buffer.concat(html), 'utf8')
      // Parse the fetched page with cheerio using jQuery syntax
      const $ = cheerio.load(html1, {
        decodeEntities: false
      })
      const link = []
      $('.view_img_link').each((i, v) => {
        // Collect each image address
        link.push($(v).attr('href'))
      })
      callback(null, link)
    })
  })
}
```

Words without pictures count for nothing, so here's a screenshot.

PS: I actually wrote a crawler for an adult website first, but I was too shy to post it.