First, introduction

Node.js's single-threaded, event-driven model can achieve high throughput on a single machine, which makes it well suited to I/O-intensive programs such as web crawlers.

Recently I wrote a crawler that crawls Zhihu relationship chains: given the URL of a user's homepage, it can crawl that user's relationship chain: github.com/starkwang/Z…

Second, the implementation of the crawler

The request module is used for data requests, Express responds to HTTP requests, ECharts draws the front-end charts, and WebSocket is used to exchange intermediate data with the front end.

Two Zhihu APIs are mainly used:

// Get the target user's followers
POST https://www.zhihu.com/node/ProfileFollowersListV2
Parameters:
    params: {
        offset: 40,           // a multiple of 20; starting from 0, each request pulls 20 followers
        order_by: "created",  // just fill in "created"
        hash_id: "d965f32a168564f9e58ad3a48a1585a4"  // the target user's unique hash_id on Zhihu
    },
    _xsrf: "c6ef5534d3dbb6a54057826864799289"  // the xsrf parameter given in the cookies

// Get the users the target user follows (followees)
POST https://www.zhihu.com/node/ProfileFolloweesListV2
Parameters:
    params: {
        offset: 40,           // a multiple of 20; starting from 0, each request pulls 20 followees
        order_by: "created",  // just fill in "created"
        hash_id: "d965f32a168564f9e58ad3a48a1585a4"  // the target user's unique hash_id on Zhihu
    },
    _xsrf: "c6ef5534d3dbb6a54057826864799289"  // the xsrf parameter given in the cookies

The crawler works as follows:

  1. Get the lists of the target user's followers and followees, and find the people who follow each other (friends).

  2. Repeat step 1 for every friend in the friend list to get each friend's own friend list.

  3. Iterate over the results of step 2 to find which friends follow each other.

Third, part of the code

First, we write a getUser method: it takes a user's homepage URL, requests the page, and parses out the user's nickname, hash_id, number of followees, and number of followers.

var request = require('request');
var Promise = require('bluebird');
var config = require('../config');

function getUser(userPageUrl) {
    return new Promise(function(resolve, reject) {
        request({
            method: 'GET',
            url: userPageUrl,
            headers: {
                'cookie': config.cookie
            }
        }, function(err, res, body) {
            if (err) {
                reject(err);
            } else {
                resolve(parse(body));
            }
        });
    });
}

function parse(html) {
    var user = {};

    // hash_id is embedded in the "current_people" data on the profile page
    var reg1 = /data-name=\"current_people\">\[.*\"(\S*)\"\]<\/script>/g;
    reg1.exec(html);
    user.hash_id = RegExp.$1;

    // the followee and follower counts appear in this order in the profile sidebar,
    // so executing the same global pattern twice picks them up one after the other
    var reg2 = /<span[^>]*>\n(\d*)/g;
    reg2.exec(html);
    user.followeeAmount = parseInt(RegExp.$1);
    reg2.exec(html);
    user.followerAmount = parseInt(RegExp.$1);

    // the user's nickname, taken from the page title
    var reg3 = /(.*) - 知乎<\/title>/g;
    reg3.exec(html);
    user.name = RegExp.$1;

    return user;
}

module.exports = getUser;

Next we need a fetchFollwerOrFollwee method, which takes the user object above as input and, using the APIs described in Part 2, fetches all of that user's followees or followers. It is used like this (with ES6 syntax):

getUser('someURL')
    .then(user => fetchFollwerOrFollwee({user: user, isFollowees: false}))
    .then(list => console.log(list))

The specific code can be found here, so it won't be pasted in the post.
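For a rough idea of its shape, the following is a minimal sketch based only on the API described in Part 2; the config._xsrf field, the 20-per-page loop and the untouched response handling are assumptions here, and the real implementation is the one in the repository.

var request = require('request');
var Promise = require('bluebird');
var config = require('../config'); // assumed to hold the cookie and the _xsrf value, as in getUser

// Hypothetical sketch: page through the follower/followee API in steps of 20
// until the counts that getUser reported are exhausted.
function fetchFollwerOrFollwee(options) {
    var user = options.user;
    var url = options.isFollowees
        ? 'https://www.zhihu.com/node/ProfileFolloweesListV2'
        : 'https://www.zhihu.com/node/ProfileFollowersListV2';
    var total = options.isFollowees ? user.followeeAmount : user.followerAmount;

    var offsets = [];
    for (var offset = 0; offset < total; offset += 20) {
        offsets.push(offset);
    }

    return Promise.map(offsets, function(offset) {
        return new Promise(function(resolve, reject) {
            request.post({
                url: url,
                headers: { 'cookie': config.cookie },
                form: {
                    params: JSON.stringify({
                        offset: offset,
                        order_by: 'created',
                        hash_id: user.hash_id
                    }),
                    _xsrf: config._xsrf
                }
            }, function(err, res, body) {
                if (err) {
                    reject(err);
                } else {
                    resolve(body); // each page still has to be parsed into a list of users
                }
            });
        });
    });
}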

The next thing to do is to combine getUser and fetchFollwerOrFollwee into a getFriends method, where the input is the user’s page URL and the output is the user’s friends list, something like this:

function getFriends(someURL){
    return getUser(someURL)
        .then(user => fetchFollwerOrFollwee(...))
        .then((followersList, followeesList) => findFriends(followersList, followeesList))
}
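The findFriends helper is not shown in the post; a minimal sketch, assuming every entry in the two lists carries the hash_id that getUser parses out, could look like this:

// Hypothetical helper: "friends" are the users who appear in both the
// follower list and the followee list, matched here by hash_id.
function findFriends(followersList, followeesList) {
    var followerIds = new Set(followersList.map(u => u.hash_id));
    return followeesList.filter(u => followerIds.has(u.hash_id));
}

The findSameFriends helper used below can be written the same way, intersecting one user's friend list with myFriends.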

We can then encapsulate the final searchSameFriend method: the input is a user and a friend list myFriends, and the output is all of that user's friends who also appear in myFriends.

function searchSameFriend(user, myFriends){
    return getFriends(user.url)
        .then(userFriends => findSameFriends(userFriends, myFriends))
        .then(sameFriends => console.log(sameFriends))
}

The final promise flow for the entire crawler looks something like this:

function Spider(){
    return getUser(URL)
        .then(user => getFriends(user))
        .then(userFriends => 
            Promise.map(userFriends, friend => searchSameFriend(friend,userFriends))
        )
}
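One detail worth noting: Promise.map is bluebird's, not a native Promise method, and it accepts a concurrency option. Capping it keeps the crawler from firing a request for every friend at once; the limit of 5 below is an illustrative value, not something from the original code.

Promise.map(userFriends, friend => searchSameFriend(friend, userFriends), { concurrency: 5 })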

Of course, there is also some WebSocket and front-end data-interaction code, which is omitted here.
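As a rough idea of that omitted part, here is a minimal sketch of how intermediate results could be pushed to the browser; it assumes Express plus the ws package (the post does not say which WebSocket library is used), and the ECharts page itself is left out:

var express = require('express');
var WebSocket = require('ws'); // assumption: the "ws" package

var app = express();
app.use(express.static('public')); // serves the ECharts front-end page

var server = app.listen(3000);
var wss = new WebSocket.Server({ server: server });

// push intermediate crawl results (e.g. each friend pair found) to the browser
function broadcast(data) {
    wss.clients.forEach(function(client) {
        if (client.readyState === WebSocket.OPEN) {
            client.send(JSON.stringify(data));
        }
    });
}

The crawler would then call broadcast whenever a result arrives, and the front-end page would feed each message into the ECharts relationship graph.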