Recently I came across an article about web crawlers, and since I happen to work with crawlers myself, I wanted to write up a walkthrough to share. Let's go step by step.

Step 1: Install the core crawler dependency, puppeteer. If google.com is unreachable from your network (the bundled Chromium download will fail), run set PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1 before running npm i puppeteer. If everything went fine, create index.js in the project root directory.

//index.js
const puppeteer=require('puppeteer');
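Before moving on, it's worth a quick smoke test to confirm puppeteer can actually launch its Chromium. A minimal sketch (the target URL is just an example, and smoke-test.js is a throwaway file name of my own choosing):

//smoke-test.js
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto('https://example.com');
    console.log(await page.title()); // prints the page title if everything works
    await browser.close();
})();

Run it with node smoke-test.js; if a title prints and the process exits cleanly, the install is good.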

Step 2: Choose a site whose resources you want to crawl. As a Bilibili user, I'll take Bilibili as the example. I often check the rankings, so today we'll crawl the ranking list, address: www.bilibili.com/ranking/all… (the full URL appears in the code below).

Step 3: Analyze how to crawl it. Open Chrome, press F12, then Ctrl + Shift + C, and hover over a leaderboard entry to inspect its markup. If you have any experience with simple crawlers, your first instinct is probably to grab the page content and extract from it. A more elegant way is to crawl the API. To find the API, switch the DevTools panel we just opened to the Network tab, click a link to trigger a page load, and you'll see the request records. The data content would normally show up under the XHR tab, but oddly, after some digging, the data turns out to appear under the JS tab.
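Since the data arrives as a network response rather than static markup, one alternative to scraping the DOM is to listen for those responses directly with puppeteer's page.on('response') hook. A rough sketch, assuming the endpoint's URL contains 'ranking' (a guess of mine; substitute whatever filter matches what you see in the Network panel), registered before page.goto so nothing is missed:

// Capture JSON responses whose URL mentions 'ranking' (hypothetical filter)
page.on('response', async response => {
    if (response.url().includes('ranking')) {
        try {
            const data = await response.json(); // throws if the body isn't JSON
            console.log(response.url(), data);
        } catch (e) {
            // not a JSON body; ignore it
        }
    }
});

In this article we'll stick to extracting from the rendered page, which needs no knowledge of the endpoint's shape.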

Step 4: Write the crawl code. Back to our index.js:

//index.js
const puppeteer = require('puppeteer');
const devices = require('puppeteer/DeviceDescriptors');
const iPad = devices['iPad landscape']; // https://github.com/GoogleChrome/puppeteer/blob/master/DeviceDescriptors.js

// Define some commands
const program = require('commander');
program.version('0.0.1')
    .option('-t, --top_10', 'show top 10')
    .parse(process.argv);

// Record the result; if you want to write to a database, hook it up here instead
const log4js = require('log4js');
log4js.configure({
    appenders: { log: { type: 'file', filename: './log/log.log' } },
    categories: { default: { appenders: ['log'], level: 'info' } }
});
const logger = log4js.getLogger('log');

const ifOpenBrowser=false;
const lanchConf={
    headless:!ifOpenBrowser,
    // executablePath: 'C:/Users/xxx/Downloads/chromium/chrome-win32/chrome.exe', // Mac users: check the docs and change this yourselves
};
const sleep = (time) => new Promise(resolve => setTimeout(resolve, time));
async function repeat(time,fn,gapTime=500){
    if(time>0){
        // console.log('do fn',time);
        await fn();
        await sleep(gapTime);
        return repeat(--time,fn,gapTime)
    }
    // console.log('final');
    return {msg:'done'}
}
const banList = ['.png', '.jpg'];

puppeteer.launch(lanchConf).then(async browser => {
    // Open a new page
    const page = await browser.newPage();
    // Emulate an iPad to get a wider viewport
    await page.emulate(iPad);
    // Enable request interception
    await page.setRequestInterception(true);
    // Block unwanted requests: aborting requests with a .png or .jpg suffix would reduce
    // resource consumption, but the image addresses can't be read while they are blocked,
    // so both branches let the request through here
    page.on('request', interceptedRequest => {
        if (banList.some(val => interceptedRequest.url().match(val))) {
            interceptedRequest.continue(); // swap in interceptedRequest.abort() if you don't need the images
        } else {
            interceptedRequest.continue();
        }
    });
    // Jump to our target page
    await page.goto('https://www.bilibili.com/ranking/all/0/0/3', {
        waitUntil: 'networkidle0' // page fully loaded
    });
    // Scroll to the bottom a number of times, otherwise lazy-loaded images may not appear
    await repeat(20, async () => {
        await page.keyboard.press('PageDown');
    }, 200);
    // Get a handle on the list element (you could also parse the HTML with cheerio instead)
    const listHandle = await page.$('.rank-list');
    const titles = await listHandle.$$eval('.info .title', nodes => nodes.map(n => n.innerText));
    const authors = await listHandle.$$eval('.detail>a>.data-box', nodes => nodes.map(n => n.innerText));
    const pts = await listHandle.$$eval('.pts div', nodes => nodes.map(n => n.innerText));
    const links = await listHandle.$$eval('.title', nodes => nodes.map(n => n.getAttribute('href')));
    const views = await listHandle.$$eval('.detail>.data-box', nodes => nodes.map(n => n.innerText));
    const images = await listHandle.$$eval('img', nodes => nodes.map(n => n.getAttribute('src')));
    // Serialize the result
    const res = [];
    for (let i = 0; i < 100; i++) {
        res[i] = {
            rank: i + 1,
            title: titles[i],
            author: authors[i],
            pts: pts[i],
            link: links[i],
            view: views[i],
            image: images[i]
        };
    }
    // Output data on the command line
    if (program.top_10) console.log(res.slice(0, 10));
    // Write the data to the log
    logger.info(res);
    // Close the browser
    await browser.close();
});
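The logger.info(res) call above writes the array into ./log/log.log as log lines. If you'd rather keep a machine-readable snapshot (or stage the data for the database hookup mentioned earlier), one option is dumping it as JSON; a minimal sketch, with ./log/ranking.json being an arbitrary path of my own choosing:

// Alongside (or instead of) log4js: dump the result array as JSON
const fs = require('fs');
fs.writeFileSync('./log/ranking.json', JSON.stringify(res, null, 2));

Drop those two lines in right before browser.close() to get one snapshot per run.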

Once you've written the above (read through the configuration options carefully), install the remaining dependencies with npm i commander log4js. Then open a terminal in the project directory and run node .; the result will be written to the log directory in the project root. If you want to see the top 10 entries on the command line, run node . -t or node . --top_10.

Step 5: Upload the code to GitHub.

PS: If you installed with set PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1, remember to set executablePath in the launch config to a local Chromium. Also, Chromium seems to download normally if you install the dependencies with cnpm, so that's worth a try.
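If you go the local-Chromium route, one way to keep the machine-specific path out of the code is an environment variable; a small sketch, where CHROMIUM_PATH is a variable name of my own choosing:

// Use a local Chromium when CHROMIUM_PATH is set, otherwise fall back to the bundled one
const lanchConf = {
    headless: true,
    executablePath: process.env.CHROMIUM_PATH || undefined,
};

With that, running set CHROMIUM_PATH=C:/path/to/chrome.exe before node . switches browsers without touching the source.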