Digression

Lao Wang recently adopted a pen name. Not just a pen name, but a courtesy name and an art name as well.

Shangguan Chase-Wind; courtesy name Chase-Wind; art name Chase-Wind the Layman.

If I had to give it an explanation, it would be “Chasing the wind at home.”

Lao Wang’s way of writing is actually his way of thinking.

Puppeteer

The best teacher of the unknown is obviously the search engine, and arguably the best of the search engines is Google search.

Google search Puppeteer

Puppeteer documentation

GitHub: https://github.com/puppeteer/puppeteer

English documentation: https://pptr.dev

Chinese documentation: https://zhaoqize.github.io/puppeteer-api-zh_CN

Puppeteer profile

The following introduction is excerpted from the Chinese document.

Puppeteer, pronounced /puh·puh·teer/, is a Node library that provides a high-level API for controlling Chromium or Chrome via the DevTools protocol. Puppeteer runs in headless mode by default, but it can be configured to run with a visible (non-headless) browser window by changing its launch configuration. Typical uses include:

  • Generate PDFs of pages.
  • Crawl an SPA (single-page application) and generate pre-rendered content, i.e. SSR (server-side rendering).
  • Automate form submission, UI testing, keyboard input, and so on.
  • Create an up-to-date automated testing environment, running tests directly in the latest version of Chrome with the latest JavaScript and browser features.
  • Capture a timeline trace of a site to help diagnose performance issues.
  • Test browser extensions.
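As a rough illustration of the first item in the list, a PDF export might look like the sketch below. The helper name `savePageAsPdf` and the file name in the usage comment are made up for this example. It assumes an already launched Puppeteer browser is passed in, and uses `networkidle2` as the "page is loaded" signal.

```javascript
// Sketch: render a URL to a PDF file with an already launched browser.
// `savePageAsPdf` is an illustrative helper, not part of Puppeteer.
async function savePageAsPdf(browser, url, outputPath) {
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: 'networkidle2' });
    await page.pdf({ path: outputPath, format: 'A4' });
  } finally {
    // Close the tab even if navigation or rendering fails.
    await page.close();
  }
  return outputPath;
}

// Usage (assumes puppeteer is installed):
// const browser = await require('puppeteer').launch();
// await savePageAsPdf(browser, 'https://pptr.dev', 'pptr.pdf');
// await browser.close();
```

Passing the browser in, rather than launching it inside the helper, lets several exports share one browser instance.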

Project background

Lao Wang has taken over the “solicitation” board of the Electric Duck community and needs to contact a lot of people to invite them to post solicitation threads there.

The sign-in thread

I happened to find that a Douban group has a sign-in thread. How can its contents be collected? Copying and pasting by hand is not practical, so it is time to write a crawler.

Hands-on code

Step 1: Create the project

  • Create a directory named douban
Create a project
  • Create the file douban.js
  • Paste the sample code from the official website
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://douban.com');
    await page.screenshot({path: 'example.png'});
    await browser.close();
})();
  • Install Puppeteer with npm

Don’t rush: the code cannot run yet. Open a terminal in the project root directory and install Puppeteer with npm:

npm i puppeteer

You need to wait for Chromium to download. If your network connection is poor, you will have to find your own workaround.
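One possible workaround, sketched under assumptions: install the `puppeteer-core` package, which does not download a browser, and point it at a Chrome or Chromium binary already present on the machine through the `executablePath` launch option. The environment variable name `CHROME_PATH` and the helper `buildLaunchOptions` are hypothetical choices for this example.

```javascript
// Sketch: build launch options that reuse a locally installed browser.
// CHROME_PATH is a hypothetical variable name chosen for this example.
function buildLaunchOptions(env = process.env) {
  const options = { headless: true, timeout: 50000 };
  if (env.CHROME_PATH) {
    // executablePath tells Puppeteer which browser binary to start.
    options.executablePath = env.CHROME_PATH;
  }
  return options;
}

// Usage (assumes puppeteer-core is installed and CHROME_PATH is set):
// const browser = await require('puppeteer-core').launch(buildLaunchOptions());
```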

Installing Puppeteer
  • Modify the package.json file
{
  "name": "douban",
  "version": "1.0.0",
  "scripts": {
    "start": "node ./douban.js"
  },
  "dependencies": {
    "puppeteer": "^3.1.0"
  }
}

Step 2: Simulate login

Login is required when accessing the target page.

Need to log in
  • Analyze the structure of the login page

I chose password login to reduce complexity.

Login page

What do we need to do?

  • Open the page
  • Click the password login tab
  • Enter the account name
  • Enter the password
  • Click the login button

  • Code sample
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: true,
        timeout: 50000
    })

    const page = await browser.newPage()

    // Go to the Douban login page
    await page.goto('https://accounts.douban.com/passport/login', {
        waitUntil: 'networkidle2' // Treat the page as loaded once the network is nearly idle
    });

    // Switch to the password login tab
    const clickPhoneLogin = await page.$('.account-tab-account')
    await clickPhoneLogin.click()

    // Type the account name and password (placeholders)
    const name = 'xxxxxxxxxx'
    await page.type('input[id="username"]', name, {delay: 0})

    const pwd = 'xxxxxxxxxx'
    await page.type('input[id="password"]', pwd, {delay: 1})

    // Get the login button element
    const loginElement = await page.$('div.account-form-field-submit > a')

    // Click the button to log in
    await loginElement.click()

    await page.waitForNavigation()

    await browser.close()
})();

The final result

Simulated login
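A slightly more defensive variant of the click-and-wait step, offered as a sketch: `page.$` resolves to `null` when nothing matches the selector, and a fast navigation can finish before `waitForNavigation` is even called. The helper name `clickAndWaitForNavigation` is invented for this example; the selector in the usage comment is the one from the login code.

```javascript
// Sketch: wait for an element, then click it and wait for the navigation
// that the click triggers. Starting both promises together avoids missing
// a navigation that completes very quickly.
async function clickAndWaitForNavigation(page, selector) {
  await page.waitForSelector(selector);
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle2' }),
    page.click(selector),
  ]);
}

// Usage inside the login script:
// await clickAndWaitForNavigation(page, 'div.account-form-field-submit > a');
```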

Step 3: Crawl data

With the groundwork above in place, I won’t go into detail here.

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: false,
        timeout: 50000
    })

    const page = await browser.newPage()

    // Go to the Douban login page
    await page.goto('https://accounts.douban.com/passport/login', {
        waitUntil: 'networkidle2' // Treat the page as loaded once the network is nearly idle
    });

    // Switch to the password login tab
    const clickPhoneLogin = await page.$('.account-tab-account')
    await clickPhoneLogin.click()

    const name = 'xxxxxxxx'
    await page.type('input[id="username"]', name, {delay: 0})

    const pwd = 'xxxxxxxx'
    await page.type('input[id="password"]', pwd, {delay: 1})

    // Get the login button element
    const loginElement = await page.$('div.account-form-field-submit > a')

    // Click the button to log in
    await loginElement.click()

    await page.waitForNavigation()

    // Target page URL
    let url = 'https://www.douban.com/group/topic/112565224/?start='
    // Page-turn parameters
    let pages = [0, 100, 200, 300, 400, 500]

    // Define the crawl function
    async function next(url) {
        await page.goto(url, {
            waitUntil: 'networkidle2'
        })

        // $$eval passes every matching element, so forEach is available
        return await page.$$eval("div.reply-doc.content > p", e => {
            let a = []
            e.forEach(element => {
                a.push(element.innerText)
            })
            return a
        })
    }

    // Concatenate the text strings
    let data = ''

    for (const index of pages) {
        let res = await next(url + index)
        data = res.join('\n\n\n-----------------------------------------------------------\n\n') + data
    }

    // Check the data
    console.log(data)

    await browser.close()
})();

The final result

Crawl data
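The hard-coded page-turn parameters can also be generated. A minimal sketch, assuming each page of the thread holds 100 replies, as the `start` values in the crawl code suggest; `pageOffsets` and `pageUrls` are illustrative names.

```javascript
// Sketch: generate the `start` offsets instead of listing them by hand.
function pageOffsets(pageCount, pageSize = 100) {
  return Array.from({ length: pageCount }, (_, i) => i * pageSize);
}

// Build the full URL for every page of the thread.
function pageUrls(baseUrl, pageCount, pageSize = 100) {
  return pageOffsets(pageCount, pageSize).map((offset) => baseUrl + offset);
}

// Usage:
// pageOffsets(6) gives [0, 100, 200, 300, 400, 500]
```

If the thread grows, only the page count needs to change.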

Step 4: Write data

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
    const browser = await puppeteer.launch({
        headless: false,
        timeout: 50000
    })

    const page = await browser.newPage()

    await page.setViewport({
        width: 1920,
        height: 1080
    })

    // Go to the Douban login page
    await page.goto('https://accounts.douban.com/passport/login', {
        waitUntil: 'networkidle2' // Treat the page as loaded once the network is nearly idle
    });

    // Switch to the password login tab
    const clickPhoneLogin = await page.$('.account-tab-account')
    await clickPhoneLogin.click()

    const name = 'xxxxxxxx'
    await page.type('input[id="username"]', name, {delay: 0})

    const pwd = 'xxxxxxxx'
    await page.type('input[id="password"]', pwd, {delay: 1})

    // Get the login button element
    const loginElement = await page.$('div.account-form-field-submit > a')

    // Click the button to log in
    await loginElement.click()

    await page.waitForNavigation()

    // Target page URL
    let url = 'https://www.douban.com/group/topic/112565224/?start='
    // Page-turn parameters
    let pages = [0, 100, 200, 300, 400, 500]

    // Define the crawl function
    async function next(url) {
        await page.goto(url, {
            waitUntil: 'networkidle2'
        })

        // $$eval passes every matching element, so forEach is available
        return await page.$$eval("div.reply-doc.content > p", e => {
            let a = []
            e.forEach(element => {
                a.push(element.innerText)
            })
            return a
        })
    }

    // Concatenate the text strings
    let data = ''

    for (const index of pages) {
        let res = await next(url + index)
        data = res.join('\n\n\n-----------------------------------------------------------\n\n') + data
    }

    // Write to the file
    fs.writeFile('douban.txt', data, 'utf8', function (error) {
        if (error) {
            console.log(error);
            return false;
        }
        console.log('Write succeeded');
    })

    await browser.close()
})();
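The callback form of `fs.writeFile` works, but inside an `async` function the promise-based API may read more naturally, since the script can wait for the write to finish before closing the browser. A sketch; `saveText` is an illustrative name.

```javascript
const fs = require('fs');

// Sketch: write text to a file and wait until the write has finished.
async function saveText(filePath, text) {
  await fs.promises.writeFile(filePath, text, 'utf8');
  return filePath;
}

// Usage inside the async function:
// await saveText('douban.txt', data);
// console.log('Write succeeded');
```

A failed write then rejects the promise, so it can be handled with an ordinary `try`/`catch`.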

Reflections on the exercise

  • The code still needs optimizing; the page-turning logic in particular is poorly written.
  • It could be split into modules. In this code, the simulated login, the crawling of the target, and the file writing are all lumped together.
  • That’s all for now.
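One way the split into modules might look, as a sketch rather than a finished design. The function names and the default separator are invented here; the selectors and the login URL are the ones used earlier in this article, and `page` is a Puppeteer page created by the caller.

```javascript
// Sketch: each concern as a separate function that shares a `page`.

// 1. Log in with an account name and password.
async function login(page, username, password) {
  await page.goto('https://accounts.douban.com/passport/login', {
    waitUntil: 'networkidle2',
  });
  await page.click('.account-tab-account');
  await page.type('input[id="username"]', username);
  await page.type('input[id="password"]', password);
  await Promise.all([
    page.waitForNavigation(),
    page.click('div.account-form-field-submit > a'),
  ]);
}

// 2. Collect the reply texts from one page of the thread.
async function crawlReplies(page, url) {
  await page.goto(url, { waitUntil: 'networkidle2' });
  return page.$$eval('div.reply-doc.content > p', (nodes) =>
    nodes.map((node) => node.innerText)
  );
}

// 3. Crawl every page in order and join all replies into one string.
async function crawlThread(page, baseUrl, offsets, separator = '\n\n-----\n\n') {
  const replies = [];
  for (const offset of offsets) {
    replies.push(...(await crawlReplies(page, baseUrl + offset)));
  }
  return replies.join(separator);
}
```

Unlike the script in Step 4, this version keeps replies in page order and puts the separator between pages as well; writing the result to disk would be a fourth, independent step.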

The complete code

https://gist.github.com/w3cfed/75217423f86cc9106976d5beffca745b

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
    const browser = await puppeteer.launch({
        headless: false,
        timeout: 50000
    })

    const page = await browser.newPage()

    // Go to the Douban login page
    await page.goto('https://accounts.douban.com/passport/login', {
        waitUntil: 'networkidle2' // Treat the page as loaded once the network is nearly idle
    });

    // Switch to the password login tab
    const clickPhoneLogin = await page.$('.account-tab-account')
    await clickPhoneLogin.click()

    const name = 'xxxxxxx'
    await page.type('input[id="username"]', name, {delay: 0})

    const pwd = 'xxxxxxxx'
    await page.type('input[id="password"]', pwd, {delay: 1})

    // Get the login button element
    const loginElement = await page.$('div.account-form-field-submit > a')

    // Click the button to log in
    await loginElement.click()

    await page.waitForNavigation()

    // Target page URL
    let url = 'https://www.douban.com/group/topic/112565224/?start='
    // Page-turn parameters
    let pages = [0, 100, 200, 300, 400, 500]

    // Define the crawl function
    async function next(url) {
        await page.goto(url, {
            waitUntil: 'networkidle2'
        })

        // $$eval passes every matching element, so forEach is available
        return await page.$$eval("div.reply-doc.content > p", e => {
            let a = []
            e.forEach(element => {
                a.push(element.innerText)
            })
            return a
        })
    }

    // Concatenate the text strings
    let data = ''

    for (const index of pages) {
        let res = await next(url + index)
        data = res.join('\n\n\n-----------------------------------------------------------\n\n') + data
    }

    // Write to the file
    fs.writeFile('douban.txt', data, 'utf8', function (error) {
        if (error) {
            console.log(error);
            return false;
        }
        console.log('Write succeeded');
    })

    await browser.close()
})();
