
A crawler is also called a web spider or web robot. You probably use a search engine every day, and crawlers are an essential part of search engines: they crawl content so it can be indexed. Big data and data analysis are hot topics now, and where does the data come from? A lot of it is crawled from the web. So let's talk about web crawlers.


[TOC]

How a crawler works

As shown in the flow chart, a crawl starts from a seed URL: the crawler downloads the page, parses it and stores the content, then de-duplicates the URLs extracted from the page and adds them to a queue of URLs waiting to be crawled. It then takes the next URL from the queue and repeats the same steps. Pretty simple, isn't it?
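A minimal sketch of that loop might look like the following. Note that `fetchPage`, `savePage` and `extractUrls` are hypothetical helpers standing in for your HTTP client, storage and HTML parser; they are not from the original post.

```js
// Minimal crawler loop: seed URL -> download -> parse -> enqueue new URLs.
// fetchPage(), savePage() and extractUrls() are hypothetical helpers.
async function crawl(seedUrl) {
    const queue = [seedUrl];     // URLs waiting to be crawled
    const seen = new Set(queue); // de-duplication

    while (queue.length) {
        const url = queue.shift();
        const html = await fetchPage(url);      // download the page
        savePage(url, html);                    // store / index the content
        for (const link of extractUrls(html)) { // parse out new URLs
            if (!seen.has(link)) {
                seen.add(link);
                queue.push(link);
            }
        }
    }
}
```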

Breadth-first (BFS) or depth-first (DFS)?

As mentioned above, after crawling a page the crawler picks the next URL from the waiting queue. But how should it choose: follow a URL extracted from the page it just crawled, or stay with the peer URLs of the current URL? Peer URLs here means URLs that were extracted from the same parent page. This choice is what divides the crawl strategies.

Breadth-first Strategy (BFS)

The breadth-first strategy crawls all the URLs extracted from the current page before moving down to the URLs extracted from those pages. If the diagram above represents the relationships between the pages, the BFS crawl order would be A -> (B, D, F, G) -> (C, E).

Depth-first Strategy (DFS)

The depth-first strategy crawls a page, then immediately follows one of the URLs parsed from it, and keeps going deeper until there is nothing left on that branch before backtracking. For the same diagram, the DFS crawl order would be A -> B -> C -> D -> E -> F -> G.
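The only real difference between the two strategies is how the next URL is taken from the frontier: first-in-first-out gives BFS, last-in-first-out gives DFS. A sketch, reusing the hypothetical helpers from the loop above:

```js
// BFS takes from the front of the frontier (a queue), DFS from the back (a stack).
async function crawlWithStrategy(seedUrl, strategy = 'bfs') {
    const frontier = [seedUrl];
    const seen = new Set(frontier);

    while (frontier.length) {
        // 'bfs': FIFO -> A, then B, D, F, G, then C, E
        // 'dfs': LIFO -> keep following links deeper before backtracking
        const url = strategy === 'bfs' ? frontier.shift() : frontier.pop();
        const html = await fetchPage(url);
        for (const link of extractUrls(html)) {
            if (!seen.has(link)) {
                seen.add(link);
                frontier.push(link);
            }
        }
    }
}
```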

Download the web page

Downloading a web page sounds as simple as typing a link into a browser and letting the browser display it. Of course, it turns out not to be that simple.

Simulated login

Some pages only show their content after you log in, so how does a crawler log in? The login process essentially boils down to obtaining access credentials (a cookie, a token, ...).

const request = require('request');

let cookie = '';
let j = request.jar();

async function login() {
    // Reuse the cookie if we already have one.
    if (cookie) {
        return cookie;
    }
    return new Promise((resolve, reject) => {
        request.post({
            url: 'url',        // the login endpoint
            form: {
                m: 'username',
                p: 'password'
            },
            jar: j             // cookie jar that stores the session
        }, function (err, res, body) {
            if (err) {
                reject(err);
                return;
            }
            cookie = j.getCookieString('url');
            resolve(cookie);
        });
    });
}

This is a simple example: log in once to obtain a cookie, then attach that cookie to every subsequent request.
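For example, a follow-up request could reuse the same cookie jar, or send the cookie string explicitly in a header. This is a sketch under those assumptions; the URL is just a placeholder.

```js
// Reuse the credentials from login() on every later request.
async function fetchWithLogin(url) {
    const cookie = await login();
    return new Promise((resolve, reject) => {
        request.get({
            url: url,
            jar: j,                     // same cookie jar used by login()
            headers: { Cookie: cookie } // or pass the cookie string explicitly
        }, (err, res, body) => err ? reject(err) : resolve(body));
    });
}
```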

Get web content

Some pages are rendered entirely on the server, with no CGI endpoint to fetch data from, so the content can only be parsed out of the HTML. Other content is not so easy to get at: sites like LinkedIn do not serve the final content directly, and the page has to be executed by a browser before the final HTML structure exists. So what is the solution? A programmable browser. Puppeteer is a headless-browser project open-sourced by the Google Chrome team; it lets you drive Chrome programmatically and simulate a user visiting a page. Here is a simulated login using Puppeteer:

const puppeteer = require('puppeteer');

let page; // the logged-in page, shared with crawlData() below

async function login(username, password) {
    const browser = await puppeteer.launch();
    page = await browser.newPage();
    await page.setViewport({
        width: 1400,
        height: 1000
    });
    await page.goto('https://maimai.cn/login');
    console.log(page.url());
    // Type the credentials into the login form.
    await page.type('input[type=text]', username, { delay: 100 });
    await page.type('input[type=password]', password, { delay: 100 });
    // Click submit and wait for the navigation it triggers.
    await Promise.all([
        page.waitForNavigation(),
        page.$eval('input[type=submit]', el => el.click())
    ]);
    return page;
}

By executing login(), you get a page that is logged in just as if you had signed in through a browser, so you can read content out of the rendered HTML, or request the site's CGI endpoints directly:

async function crawlData(index, data) {
    // cinfo (company id / encoded name) and page (the logged-in puppeteer page)
    // are defined outside this function.
    let dataUrl = `https://maimai.cn/company/contacts?count=20&page=${index}&query=&dist=0&cid=${cinfo.cid}&company=${cinfo.encodename}&forcomp=1&searchTokens=&highlight=false&school=&me=&webcname=&webcid=&jsononly=1`;
    await page.goto(dataUrl);
    // The endpoint returns JSON, which the browser wraps in a <pre> tag.
    let res = await page.evaluate(() => {
        return document.body.querySelector('pre').innerHTML;
    });
    console.log(res);
    res = JSON.parse(res);
    if (res && res.result == 'ok' && res.data.contacts && res.data.contacts.length) {
        data = data.concat(res.data.contacts.map((item) => {
            let contact = item.contact;
            console.log(contact.name);
            return {
                name: contact.name,
                occupation: contact.line4.split(', ')[0],
                company: contact.company,
                title: contact.position
            };
        }));
        // Keep paging until a page comes back empty.
        return crawlData(++index, data);
    }
    return data;
}
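A hypothetical way to wire the two functions together; the credentials are placeholders, and cinfo is assumed to be defined elsewhere with the target company's id and encoded name.

```js
// Hypothetical driver for the two functions above.
(async () => {
    const page = await login('username', 'password'); // logged-in puppeteer page
    const contacts = await crawlData(1, []);          // start at page 1 with an empty result set
    console.log(`crawled ${contacts.length} contacts`);
    await page.browser().close();                     // shut the headless browser down
})();
```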

Some sites, such as Lagou, require a fresh cookie for every crawl. A headless browser works well there too, because you no longer have to worry about managing cookies yourself on each crawl.

In closing

Of course, there is more to crawling than this; most of the work is analyzing the target site and finding the right crawl strategy. And Puppeteer is useful for far more than crawling: as a programmable headless browser it also serves automated testing and similar tasks.