Preface: Recently I wanted to crawl some data, and it suddenly occurred to me: why not use Node.js for it instead of Python? After all, I haven't touched Python in years. That's how this article came about (it is purely a record of my development experience). There aren't many crawlers written in Node out there, so this article may be a bit rough; please bear with me, and feel free to point out any shortcomings or mistakes.

This project is developed in _VSCode_.

1. Create a project folder and run `npm init` to initialize the project.

2. Install Puppeteer: `npm i puppeteer`.

3. Create the entry file `src/index.js`.

4. In `index.js`, require Puppeteer first:

`const puppeteer = require('puppeteer')`

5. Since I'm just practicing here, I'm using the Yicai website for now. You can crawl whatever site fits your own needs.

Now, on to the main part.

The whole project runs inside a self-executing async function:

```javascript
;(async () => {})();
```
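As a standalone sketch, here is the same pattern with a dummy `await` in place of the real Puppeteer calls. The wrapper lets us use `await` inside it, and the leading semicolon guards against the function expression being accidentally concatenated onto a preceding statement.

```javascript
// Minimal sketch of the self-executing async wrapper used in this article.
// The resolved Promise stands in for real Puppeteer calls.
;(async () => {
  const value = await Promise.resolve('ready'); // any awaitable work goes here
  console.log(value); // prints "ready"
})();
```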

There are plenty of basic Puppeteer tutorials online, so I won't go over the basics here. I'll just post the code directly.

```javascript
const puppeteer = require('puppeteer')
const baseUrl = 'https://www.yicai.com';

;(async () => {
  const browser = await puppeteer.launch({
    headless: false,        // launch with a visible browser window
    slowMo: 100,            // slow each action down (100 ms here; pick to taste)
    args: ['--no-sandbox'],
    dumpio: false,
    devtools: true          // dev mode
  });
  const page = await browser.newPage();
  await page.goto(baseUrl, { waitUntil: 'networkidle2' });
  await page.waitFor(2000);
  // wait for the "load more" button node to appear
  await page.waitForSelector('.u-btn')
})();
```

Because this site doesn't have pagination like other sites do, but a "load more" button instead, the strategy for now is to click that button to load more data.

So once the "load more" button has appeared, I click it in a loop:

```javascript
for (let index = 0; index < 1; index++) { // index < 1: only crawl one page for now
  await page.waitFor(2000);
  await page.click('.u-btn')
}
```
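To make the loop easier to reason about (and to crawl more than one page later), it can be factored into a small helper. `clickLoadMore` below is a hypothetical stand-in for the real `page.waitFor(2000)` + `page.click('.u-btn')` pair, so the control flow can run and be checked without a browser.

```javascript
// Sketch: click the "load more" button `pages` times.
// `clickLoadMore` is a hypothetical stand-in for the page interaction.
async function loadMorePages(pages, clickLoadMore) {
  for (let index = 0; index < pages; index++) {
    await clickLoadMore();
  }
}

// Usage with a stub that just counts the clicks:
let clicks = 0;
loadMorePages(3, async () => { clicks++; })
  .then(() => console.log(clicks)); // prints 3
```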

The `.u-btn` in the code above is the class of the button being clicked; we can grab it from the page's element inspector.

Next, we should start crawling our data.

`await page.evaluate(pageFunction)` is what we use to run a script inside the page; it is similar to running commands in the browser console.

```javascript
const result = await page.evaluate(() => {
  let $ = window.$; // the site already loads jQuery, so it's available here
  // var items = $('.m-con a')
  var items = $('#headList').children('a')
  var links = []
  if (items.length >= 1) {
    items.each((index, item) => {
      let it = $(item)
      let articleTitle = it.find('h2').text()
      let articleIntroduction = it.find('p').text()
      let imageAddress = it.find('img').attr('src')
      let createdTime = it.find('span').text()
      let detailPage = it.attr('href')
      links.push({
        articleTitle,
        articleIntroduction,
        imageAddress,
        createdTime,
        detailPage
      })
    })
  }
  return links
})
```
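One thing worth knowing about `page.evaluate` is that its return value travels from the browser back to Node by serialization, roughly like a JSON round-trip: plain objects and arrays (such as our `links`) survive, but functions and DOM nodes do not. A browser-free sketch of that behavior (`simulateEvaluate` is a made-up helper for illustration, not a Puppeteer API):

```javascript
// Rough model of how page.evaluate returns data to Node: the callback runs
// "elsewhere" and only a serializable copy of its return value comes back.
// `simulateEvaluate` is a hypothetical helper, not part of Puppeteer.
function simulateEvaluate(pageFunction) {
  return JSON.parse(JSON.stringify(pageFunction()));
}

const data = simulateEvaluate(() => ({ title: 'hello', tags: ['a', 'b'] }));
console.log(data.title); // prints "hello"
```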

`var items = $('.m-con a')` — this requires analyzing the element structure of the page we're crawling. See the analysis below for details.

My own requirement is to grab the list data under the "headline" tab. But when I logged the result there were 75 entries, while on the page itself I could only find 25 no matter how many times I checked. So I opened F12 again and finally discovered that the site toggles its tabs with `display: none` and `display: block`: on first load it renders all three tabs at once, and each tab panel has its own id.

So instead of `$('.m-con a')`, we use `$('#headList').children('a')` to get the `<a>` tags under `#headList`, and process the data from those.
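The mapping from page fields to result fields can be shown with plain data; the sample item below is entirely made up, just to illustrate the object shape the crawler ends up returning.

```javascript
// Hypothetical parsed fields from one <a> card under #headList,
// mapped into the same object shape the evaluate() callback pushes.
const sampleItem = {
  h2: 'Sample headline',
  p: 'Sample introduction',
  imgSrc: '/images/sample.jpg',
  span: '2024-01-01',
  href: '/news/sample.html'
};

const link = {
  articleTitle: sampleItem.h2,
  articleIntroduction: sampleItem.p,
  imageAddress: sampleItem.imgSrc,
  createdTime: sampleItem.span,
  detailPage: sampleItem.href
};
console.log(link.articleTitle); // prints "Sample headline"
```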

Of course, this is decided case by case, depending on the structure of each page.

OK, now that the script is written, we can run it…

Run the command `node src/index.js`.

Okay, so that’s what we printed out.

Finally, here is the complete code:

```javascript
const puppeteer = require('puppeteer')
const baseUrl = 'https://www.yicai.com';

;(async () => {
  const browser = await puppeteer.launch({
    headless: false,        // launch with a visible browser window
    slowMo: 100,            // slow each action down (100 ms here; pick to taste)
    args: ['--no-sandbox'],
    dumpio: false,
    devtools: true          // dev mode
  });
  const page = await browser.newPage();
  await page.goto(baseUrl, { waitUntil: 'networkidle2' });
  await page.waitFor(2000);
  // wait for the "load more" button node to appear
  await page.waitForSelector('.u-btn')

  for (let index = 0; index < 1; index++) {
    await page.waitFor(2000);
    await page.click('.u-btn')
  }

  const result = await page.evaluate(() => {
    let $ = window.$;
    // var items = $('.m-con a')
    var items = $('#headList').children('a')
    var links = []
    if (items.length >= 1) {
      items.each((index, item) => {
        let it = $(item)
        let articleTitle = it.find('h2').text()
        let articleIntroduction = it.find('p').text()
        let imageAddress = it.find('img').attr('src')
        let createdTime = it.find('span').text()
        let detailPage = it.attr('href')
        links.push({
          articleTitle,
          articleIntroduction,
          imageAddress,
          createdTime,
          detailPage
        })
      })
    }
    return links
  })

  console.log(result);
  await page.close();
})();
```

In the next chapter, we'll look at how to go into the detail pages and crawl the detail data. Thanks for reading; if anything is unclear, feel free to leave a comment.