Crawling web images with Node and Puppeteer

I. Project setup

1. Initialize the project
npm init -y
2. Install the dependency
npm install -S puppeteer
3. Directory structure
├─ node_modules
├─ package.json
├─ package-lock.json
├─ README.md
├─ src
   ├─ index.js
   ├─ imgs
   ├─ utils
      ├─ srcToImg.js

II. Implementation

1. Getting the image DOM elements
// src/index.js
const puppeteer = require('puppeteer')
const path = require('path')
const srcToImg = require('./utils/srcToImg')

;(async function () {
  // Create a browser object
  // Debugging variant, commented out in the original (option values not shown there):
  // const browser = await puppeteer.launch({
  //   slowMo: …,
  //   devtools: … // open the console
  // })
  const browser = await puppeteer.launch()
  // Create a page
  const page = await browser.newPage()
  // Navigate to the target URL
  await page.goto('https://image.baidu.com')
  // Select the input element by its id on the current page, then focus it
  await page.focus('#kw')
  // Simulate keyboard input of the search keyword
  await page.keyboard.sendCharacter('rem 3840 * 1080')
  // Select the search button by class name and trigger its click event
  await page.click('.s_newbtn')
  // Fires when the page has finished loading
  page.on('load', async function () {
    const sources = await page.evaluate(() => {
      // Get all image tags by class name
      const images = document.getElementsByClassName('main_img')
      // Return an array with the src attribute of every image tag
      return [...images].map(img => img.src)
    })
    for (let src of sources) {
      await srcToImg(src, path.resolve(__dirname, 'imgs'))
    }
    // Close the browser
    await browser.close()
  })
})()
  • First, import the required modules.
  • Use puppeteer to create a browser object.
  • Use the browser object to create a page.
  • Navigate to the target URL.
  • Look up the id of the input box, select the DOM element by that id, and focus it.
  • Simulate keyboard input to type the search keyword.
  • Look up the class name of the search button, select the DOM element by that class name, and trigger a click to start the search.
  • Listen for the page's load event. Once loading completes, call page.evaluate with a callback; the callback runs in the context of the page.
  • In the callback, get all image tags by class name and return an array of their src attribute values.
  • Iterate over the array and call srcToImg to download each image into the imgs folder.
  • After downloading, close the browser.
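The listing above attaches the load listener only after the click has been sent, so a very fast navigation could in principle be missed. A minimal sketch of one alternative, assuming the click triggers a full page navigation (as the original load listener implies): start page.waitForNavigation together with the click, then run the extraction. The names searchAndCollect and collectImageSrcs are hypothetical; the selectors are the ones used in the article.

```javascript
// Runs inside the page via page.evaluate: returns the src of every element
// with the given class name, skipping elements whose src is empty.
function collectImageSrcs(className) {
  const images = document.getElementsByClassName(className)
  return Array.from(images, img => img.src).filter(Boolean)
}

// Hypothetical helper: types the keyword, clicks search, waits for the
// resulting navigation to finish, and returns the collected src values.
async function searchAndCollect(page, keyword) {
  await page.focus('#kw')
  await page.keyboard.sendCharacter(keyword)
  // Start waiting before clicking so the navigation cannot be missed.
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'load' }),
    page.click('.s_newbtn')
  ])
  // The function is serialized and executed in the page; the class name
  // is passed as an argument because outer variables are not available there.
  return page.evaluate(collectImageSrcs, 'main_img')
}
```

Inside the async wrapper of src/index.js this could be used as `const sources = await searchAndCollect(page, keyword)`, after which the download loop and browser.close() can run in sequence without an event listener.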
2. Downloading the images to local disk
// src/utils/srcToImg.js
const http = require('http')
const https = require('https')
const path = require('path')
const { promisify } = require('util')
const { createWriteStream, writeFile } = require('fs')
const myWriteFile = promisify(writeFile)

module.exports = async function (src, dir) {
  // Matches strings ending in jpg | jpeg | png | gif
  const reg = /\.(jpg|jpeg|png|gif)$/
  if (reg.test(src)) {
    // It is an image URL
    await urlToImg(src, dir)
  } else {
    // Otherwise treat it as a base64 image
    await base64ToImg(src, dir)
  }
}

// Store images referenced by URL in the imgs folder
const urlToImg = async (url, dir) => {
  const reg = /^https:/
  // Get the file extension from the URL
  const ext = path.extname(url)
  const file = path.join(dir, `${Date.now()}${ext}`)
  // Pick the client that matches the protocol
  // (named mod to avoid shadowing Node's module object)
  const mod = reg.test(url) ? https : http
  mod.get(url, res => {
    res.pipe(createWriteStream(file)).on('finish', () => {
      console.log('write complete')
    })
  })
}

// Store base64 images in the imgs folder
const base64ToImg = async (str, dir) => {
  // data:image/jpeg;base64,......
  const reg = /^data:(.+);base64,(.+)$/
  const matches = str.match(reg)
  try {
    // image/jpeg => jpeg => jpg
    const ext = matches[1].split('/')[1].replace('jpeg', 'jpg')
    const file = path.join(dir, `${Date.now()}.${ext}`)
    await myWriteFile(file, matches[2], 'base64')
  } catch (err) {
    console.log('invalid base64')
  }
}
  • There are two kinds of image sources, image links and base64-encoded images, so a regular expression decides which kind it is and each is handled separately.
  • If it is an image link, urlToImg is called. It picks the http or https module depending on the URL's protocol, calls that module's get method to fetch the image, and writes it to the imgs folder.
  • If it is a base64 image, base64ToImg is called. It uses a regular expression to match the base64 image prefix and the image data, and writes the image data to the imgs folder.
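Note that urlToImg is declared async but returns before the file has been written, so awaiting it does not wait for the download. The sketch below shows one way both branches could be tightened, under stated assumptions: parseDataUri and downloadToFile are hypothetical helper names, not part of the original module, and the data URI pattern here is stricter than the article's (it requires a type/subtype MIME and returns null instead of throwing).

```javascript
const http = require('http')
const https = require('https')
const path = require('path')
const { createWriteStream } = require('fs')

// Hypothetical helper: splits a base64 data URI into a file extension and
// its base64 payload. Returns null when the string is not such a URI.
function parseDataUri(str) {
  const matches = /^data:[^/;]+\/([^;]+);base64,(.+)$/.exec(str)
  if (!matches) return null
  // image/jpeg => jpeg => jpg
  const ext = matches[1].replace('jpeg', 'jpg')
  return { ext, data: matches[2] }
}

// Hypothetical helper: downloads url into dir and resolves with the file
// path only after the write stream has finished; rejects on request or
// write errors.
function downloadToFile(url, dir) {
  return new Promise((resolve, reject) => {
    const client = /^https:/.test(url) ? https : http
    const file = path.join(dir, `${Date.now()}${path.extname(url)}`)
    client
      .get(url, res => {
        res
          .pipe(createWriteStream(file)) // pipe returns the write stream
          .on('finish', () => resolve(file))
          .on('error', reject)
      })
      .on('error', reject)
  })
}
```

With this shape, `await downloadToFile(src, dir)` in the download loop would complete only once the image is on disk, and the result of parseDataUri(src) can be written with writeFile(file, data, 'base64') as in the original base64ToImg.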