JS crawler

Recently I wanted to write a crawler script. Most of the tutorials I found online are based on Python; JS ones are rare. What tools does a JS crawler need? Essentially Axios and Cheerio, plus of course a Node.js installation.
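Both libraries come from npm (npm install axios cheerio). As a quick sanity check, here is a minimal sketch of how the two fit together: axios downloads a page and cheerio gives you a jQuery-like API over the HTML. The URL is just a placeholder, not one of the sites used later.

const axios = require('axios');
const cheerio = require('cheerio');

// Minimal sketch: download a page with axios, parse it with cheerio.
async function demo() {
    let res = await axios.get('https://example.com'); // placeholder URL
    let $ = cheerio.load(res.data);
    console.log($('title').text()); // print the page title
}

demo();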

With the basic tools installed, the happy coding begins. Axios is usually used for front-end/back-end interaction; here I use it to fetch the HTML. (When I tried crawling Douban, I ran into Ajax-rendered pages.)

Then I came across a pirated-novel website, and I couldn't resist pointing the crawler at it.

The first step is to get the HTML. Open the page and inspect the elements. The link I'm accessing is http://www.shuquge.com/txt/108449, the index page of the novel I want to crawl, and I find that the chapter URLs sit under .listmain dl dd a.

Here is my code

const axios = require('axios');
const cheerio = require('cheerio');

let instance = axios.create({
    baseURL: 'http://www.shuquge.com/txt/108449',
    timeout: 2500,
});

// Get the HTML of the novel's index page
async function getHtml() {
    let html = await instance.get('index.html').then(res => res.data);
    return getUrl(html);
}

// Get the chapter URLs from the index HTML
function getUrl(html) {
    let $ = cheerio.load(html);
    let list = Array.from($('.listmain dl dd a'));
    let urlArr = [];
    list.forEach(ele => urlArr.push(ele.attribs.href)); // each chapter link's href is at ele.attribs.href
    return urlArr;
}

These two functions give me the URLs of every chapter of the novel, which is the foundation for crawling the entire book. All that's left is to request each of these URLs, extract the chapter content, and save it.

function sendGetRequest(url) {
    return instance.get(url);
}

const fs = require('fs');

async function getNovel() {
    let urlArr = await getHtml();
    fetchUrl(10, urlArr).then(res => {
        console.log('Result is ' + res.length);
        for (let i = 0; i < res.length; i++) {
            console.log(res[i]);
            fs.appendFile('Three inch world.txt', res[i], function () {
                console.log('Wrote chapter ' + i + ' successfully');
            });
        }
    });
}

I have since modified this code. The original idea was that the chapters are fetched in order, so they are also written to the file in order; a simple for loop is enough to grab a single novel.
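For reference, that sequential version looks roughly like this. It is only a sketch: it reuses getHtml() and sendGetRequest() from above, plus the getContent() helper that appears in the full code at the end.

// Sequential sketch: fetch and write one chapter at a time, so order is guaranteed.
async function getNovelSequential() {
    let urlArr = await getHtml();
    for (let i = 0; i < urlArr.length; i++) {
        let res = await sendGetRequest(urlArr[i]);            // wait for each chapter in turn
        fs.appendFileSync('Three inch world.txt', getContent(res.data));
        console.log('Wrote chapter ' + i);
    }
}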

But how could it be that simple? A friend told me it would be much faster to add concurrency and send n requests at a time. I searched online and found very little on the topic. My first idea: set a max value, increment a counter every time I send a request, and stop sending new requests once the counter reaches n.

There are two problems with this thinking: (1) the number of requests that complete can be smaller than the number of URLs I need to fetch, and (2) each request is only dispatched once; after the counter hits the upper limit, nothing continues to execute.
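To make the two problems concrete, the counter idea amounts to something like this (a deliberately broken sketch, not code I kept):

// Flawed sketch: stop once `max` requests have been sent.
// Problem 1: only the first n URLs are ever requested, so fewer chapters
//            complete than there are URLs to fetch.
// Problem 2: nothing ever frees up a slot, so once the cap is reached
//            no new request is sent and the loop simply ends.
function naiveConcurrent(n, urlArr) {
    let max = 0;
    for (let i = 0; i < urlArr.length; i++) {
        if (max >= n) break;   // hit the cap: stop sending
        max++;
        sendGetRequest(urlArr[i]).then(res => {
            console.log('Got chapter ' + i);  // max is never decremented here
        });
    }
}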

My next idea was also very simple: wrap everything in an async function and await each batch of concurrent requests inside it. But that is little different from synchronous execution, just batch by batch: each batch of n requests has to fully return before the next batch starts, so the slowest request in a batch holds everything else up. I wasn't happy with that result either.
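That batch idea would look roughly like this, a sketch using Promise.all, again assuming the getContent() helper from the final code:

// Batch sketch: send n requests, wait for the whole batch, then send the next n.
// The slowest request in each batch blocks everything behind it.
async function fetchInBatches(n, urlArr) {
    let result = [];
    for (let i = 0; i < urlArr.length; i += n) {
        let batch = urlArr.slice(i, i + n).map(url => sendGetRequest(url));
        let responses = await Promise.all(batch);
        responses.forEach(res => result.push(getContent(res.data)));
    }
    return result;
}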

Here is my optimized code.

// howMany is how many to fetch at a time, urlArr is the array of chapter URLs
async function fetchUrl(howMany, urlArr) {
    return new Promise((res, rej) => {
        let result = [];  // saves the crawled content
        let count = 0;    // index of the next URL in the array
        let task = [];    // task queue
        let remain = 0;   // number of requests still in flight
        // create a request promise and run it
        let doTask = (indexOfTasks, indexOfUrls = count) => {
            let req = sendGetRequest(urlArr[indexOfUrls]);
            req.then(data => {
                remain--;
                // save this chapter's content before checking whether we're done
                result[indexOfUrls] = getContent(data.data);
                if (count >= urlArr.length) {
                    if (remain === 0) return res(result);
                    return;
                }
                doTask(indexOfTasks);
                count++;
                remain++;
            }, err => {
                console.log('Request failed ' + err);
                doTask(indexOfTasks, indexOfUrls);
            });
            console.log('Sending request ' + count + ', concurrency is ' + task.length);
            task[indexOfTasks] = req;
        };

        for (let i = 0; i < howMany; i++) {
            doTask(i);
            count++;
            remain++;
        }
    });
}

There's quite a bit going on here, so let me explain this block of code.

for (let i = 0; i < howMany; i++) {
    doTask(i);
    count++;
    remain++;
}

This sends howMany requests up front, using count to keep track of how many requests have been dispatched and remain to track how many outstanding requests are still in the queue.

For example, if I send out 10 requests first, it’s like asking 10 workers to do a job. When the workers finish their work, they let other workers take their place, and there are only 10 positions.

When the worker has finished his task, he tells his colleague what to do next and asks him to sit at his station.

When the factory’s remaining tasks are zero (remain equals zero), workers can leave work.

But because it's concurrent, you don't know which worker finishes first, i.e. the chapters come back out of order. That's why I use a closure to record each URL's position in the array: the requests run concurrently, but each result is written into the result array at its own position.

Once that problem was solved, I thought about what to do when something goes wrong: reuse that same index to look up the URL that failed and fetch its content again.

Crawl multiple novels

Once that was working, I realized that a script which needs to be told a URL every time it crawls a novel isn't smart enough. So I analyzed the homepage HTML and extracted the URLs and names of the popular novels at the top of each section. Through those URLs the script can crawl the site's top novels on its own, without me inspecting the HTML each time. The final code looks like this.

const cheerio = require('cheerio');
const axios = require('axios');
const fs = require('fs');

// Get the popular novel names and URLs from the homepage
async function getHotNovelUrl() {
    let novel = {
        Names: null,
        Urls: null
    };
    let instance = axios.create({
        baseURL: 'http://www.shuquge.com/',
        timeout: 2500,
    });
    let result = await instance.get().then(res => res.data);
    let $ = cheerio.load(result);
    novel.Names = Array.from($('.block .image a img')).map(ele => ele.attribs.alt);
    novel.Urls = Array.from($('.top dl dt a')).map(ele => 'txt/' + ele.attribs.href.match(/[0-9]/g).join('') + '/');
    console.log(novel);
    return novel;
}

// Get a novel's index page HTML; url is the novel's path on the site
async function getHtml(url) {
    let instance = axios.create({
        baseURL: 'http://www.shuquge.com/' + url,
        timeout: 2500
    });
    let html = await instance.get('index.html').then(res => res.data);
    return getUrl(html);
}

// Get the chapter URLs from the index HTML
function getUrl(html) {
    let $ = cheerio.load(html);
    // browser equivalent: document.querySelectorAll('.listmain dl dd a')[0].getAttribute('href')
    let list = Array.from($('.listmain dl dd a'));
    let urlArr = [];
    list.forEach(ele => urlArr.push(ele.attribs.href));
    return urlArr;
}

// Extract the chapter title and body text from a chapter page
function getContent(html) {
    let $ = cheerio.load(html);
    let content = $('.content h1').text() + $('#content').text();
    return content;
}

function sendGetRequest(baseURL, url) {
    console.log(baseURL, url);
    let instance = axios.create({
        baseURL: 'http://www.shuquge.com/' + baseURL,
        timeout: 2500
    });
    return instance.get(url);
}

async function getNovel() {
    let novels = await getHotNovelUrl();
    let urlArr = [];
    for (let i = 0; i < novels.Urls.length; i++) {
        await getHtml(novels.Urls[i]).then(result => urlArr.push(result));
    }
    console.log(urlArr);
    for (let i = 0; i < urlArr.length; i++) {
        await fetchUrl(50, urlArr[i], novels.Urls[i]).then(res => {
            console.log('Result is ' + res.length);
            for (let j = 0; j < res.length; j++) {
                console.log(res[j]);
                fs.appendFile(novels.Names[i] + '.txt', res[j], function () {
                    console.log('Wrote chapter ' + j + ' successfully');
                });
            }
        });
    }
}

// howMany is how many to fetch at a time, urlArr is the array of chapter URLs
async function fetchUrl(howMany, urlArr, baseURL) {
    return new Promise((res, rej) => {
        let result = [];  // saves the crawled content
        let count = 0;    // index of the next URL in the array
        let task = [];    // task queue
        let remain = 0;   // number of requests still in flight
        // create a request promise and run it
        let doTask = (indexOfTasks, indexOfUrls = count) => {
            let req = sendGetRequest(baseURL, urlArr[indexOfUrls]);
            req.then(data => {
                remain--;
                // save this chapter's content before checking whether we're done
                result[indexOfUrls] = getContent(data.data);
                if (count >= urlArr.length) {
                    if (remain === 0) return res(result);
                    return;
                }
                doTask(indexOfTasks);
                count++;
                remain++;
            }, err => {
                console.log('Request failed ' + err);
                doTask(indexOfTasks, indexOfUrls);
            });
            console.log('Sending request ' + count + ', concurrency is ' + task.length);
            task[indexOfTasks] = req;
        };

        for (let i = 0; i < howMany; i++) {
            doTask(i);
            count++;
            remain++;
        }
    });
}


getNovel();

I hope this helps. One last note: this code is for learning only; please don't put unnecessary load on other people's servers.