Everything Can Be Crawled – Puppeteer

Before we start

The previous article showed how to quickly crawl a simple video website with Puppeteer. Its logic was simple, though: it took a shortcut and ran the details page logic directly, so it could only crawl one TV show or movie at a time. That is tedious, and it is not much use if you want to build your own video site. There is also the trickier question of what to do when the details page uses token authentication.

This article is for learning and discussion only. If anything here infringes on your rights, contact me and I will change it.

What this article covers

This article covers the following:

  1. Crawling every movie/TV series on a video site in one go
  2. Automatically filling in a basic form
  3. page.click()
  4. Cracking a slider CAPTCHA

Video site

Here is what crawling a video site involves:

  1. Find a website you like
  2. Crawl the list page and get the details page address of each video (note: list pages are normally paginated)
  3. Go to a details page
  4. Get the video's playback address (navigate once more if you also need to select a resource)
  5. Close the details page

Repeat steps 3 to 5. Once everything on the current list page has been crawled, move on to the second list page, and so on. When all list pages have been crawled, close the browser.
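The flow above can be sketched as a plain control loop. This is only a minimal sketch, not the crawler used in this article: `crawlSite`, `getDetailUrls`, and `crawlDetail` are hypothetical names, and the two callbacks stand in for the Puppeteer logic.

```javascript
// Sketch of the crawl flow: for every list page, collect the details page
// addresses (step 2), then visit them one by one (steps 3 to 5).
// getDetailUrls and crawlDetail are hypothetical callbacks supplied by the caller.
const crawlSite = async (listUrls, getDetailUrls, crawlDetail) => {
  const results = []
  for (const listUrl of listUrls) {
    const detailUrls = await getDetailUrls(listUrl)
    for (const detailUrl of detailUrls) {
      // Sequential on purpose: only one details page is handled at a time
      results.push(await crawlDetail(detailUrl))
    }
  }
  return results
}

// Usage with stub callbacks (no browser involved)
crawlSite(
  ['list-1', 'list-2'],
  async listUrl => [`${listUrl}/a`, `${listUrl}/b`],
  async detailUrl => ({ url: detailUrl })
).then(results => console.log(results.length)) // prints 4
```

Keeping the loop separate from the page logic makes it easy to swap the stubs for real Puppeteer calls later.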

Once you know the steps, let’s go

Find the website and analyze it

To avoid any suspicion of advertising, the video URL in the demo is a fake address. If you want the real one, leave a comment.

This is its list page, and as you can see, it is paginated. There are two ways to handle this:

  1. Analyze its paging rules
  2. Use page.click()

Let's start with the approach that analyzes the paging rules:

http://www.xxxxxxx.xx
http://www.xxxxxxx.xx/page/2
http://www.xxxxxxx.xx/page/3
...

I don’t need to tell you what the rule is
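For readers who prefer to see it spelled out, here is a small sketch that generates the list page addresses from that pattern. `buildPageUrls` is a hypothetical helper name, and the rule it encodes (page 1 is the bare address, later pages append `/page/n`) is read off the three addresses above. The crawler in this article simply requests `/page/${i}` for every page, including the first.

```javascript
// Builds the list page addresses from the pattern shown above:
// page 1 is the bare site address, page n (n >= 2) is `${baseUrl}/page/${n}`.
// buildPageUrls is a hypothetical helper name.
const buildPageUrls = (baseUrl, pageCount) => {
  const urls = []
  for (let n = 1; n <= pageCount; n++) {
    urls.push(n === 1 ? baseUrl : `${baseUrl}/page/${n}`)
  }
  return urls
}

console.log(buildPageUrls('http://www.xxxxxxx.xx', 3))
// [ 'http://www.xxxxxxx.xx',
//   'http://www.xxxxxxx.xx/page/2',
//   'http://www.xxxxxxx.xx/page/3' ]
```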

Start crawling

List page

If you are not familiar with the Puppeteer API, see the previous article, "Everything Can Be Crawled".

const puppeteer = require('puppeteer')

// pageSize: how many list pages you want to crawl
const findAllMovie = async (pageSize) => {
  console.log('Start visiting this site')
  const browser = await puppeteer.launch({
    executablePath: puppeteer.executablePath(),
    headless: false
  })
  for (let i = 1; i <= pageSize; i++) {
    // Holds the details page addresses found on this list page
    const arr = []
    const targetUrl = `https://www.xxxxxxx.xx/page/${i}`
    const page = await browser.newPage()
    // Open the list page
    await page.goto(targetUrl, {timeout: 0, waitUntil: 'domcontentloaded'})
    // Root node of the list
    const baseNode = 'ul#post_container'
    // $ is assumed to be the jQuery loaded by the target page itself
    const movieList = await page.evaluate(sel => {
      const movieBox = Array.from($(sel).find('li'))
      const ctn = movieBox.map(v => {
        const url = $(v).find('.article h2 a').attr('href')
        return {url: url}
      })
      return ctn
    }, baseNode)
    arr.push(...movieList)
    // Crawl the details pages; detailMovie closes the page when it is done
    await detailMovie(arr, page)
  }
  await browser.close()
  console.log('Visit Over')
  return {msg: 'Sync done'}
}

Details page

// Crawl every details page in arr, then save the results
const detailMovie = async (arr, page) => {
  const detailArr = []
  console.log('Number of movies per page: ' + arr.length)
  for (let i = 0; i < arr.length; i++) {
    await page.goto(arr[i].url, {
      timeout: 0,
      waitUntil: 'domcontentloaded'
    })
    const baseNode = '.article_container.row.box'
    // As on the list page, $ is assumed to be the jQuery loaded by the target page
    const movieList = await page.evaluate(sel => {
      const movieBox = Array.from($(sel).find('#post_content').find('p'))
      // Site-specific selector for the playback link. The id was garbled in the
      // source ("Blu-ray HD") and is reconstructed here as an assumption.
      const urlBox = $(sel).find('#蓝光高清 td a').attr('href')
      const imgUrl = $(movieBox[0]).find('a').attr('href')
      const info = $(movieBox[1]).text()
      return [{
        imgUrl: imgUrl,
        name: info,
        urlBox: urlBox
      }]
    }, baseNode)
    console.log(movieList)
    detailArr.push(...movieList)
    console.log('Finished crawling page ' + detailArr.length)
  }
  console.log('Start adding data to database')
  // addMovie is not defined in this article; presumably it is the author's database helper
  await addMovie(detailArr)
  await page.close()
  return detailArr
}

The process

Feels good, doesn't it?!

The result

Your own video site is now built.

Form

What makes Puppeteer so powerful is that it can simulate user operations, for example running a Baidu search on its own.

Implementation

It's simple. One method does the job:

page.type(selector, text[, options])

  • selector: selector of the element to type into. If several elements match, the first one is used
  • text: the text to type
  • options
    • delay: time to wait between key presses, in milliseconds. The default is 0

Note: a keydown, keypress/input, and keyup event is sent for each character of the text.


await page.type('#mytextarea', 'Hello'); // Types instantly
await page.type('#mytextarea', 'World', {delay: 100}); // Types slower, like a user
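A form usually has more than one field, so the same call can be wrapped in a small loop. The following is a minimal sketch under stated assumptions: `fillForm` is a hypothetical helper name, `page` is expected to be a Puppeteer Page, and the selectors in the usage note are made up for illustration.

```javascript
// Types each value into its field, one field after another.
// `fields` maps a selector to the text to type into it.
// fillForm is a hypothetical helper name; `page` is expected to be a Puppeteer Page.
const fillForm = async (page, fields, delay = 100) => {
  for (const [selector, text] of Object.entries(fields)) {
    // page.type returns a promise; awaiting it keeps the keystrokes
    // of different fields from interleaving
    await page.type(selector, text, { delay })
  }
}
```

It would be called as `await fillForm(page, {'#username': 'alice', '#password': 'secret'})`, where both selectors are placeholders for whatever the target form uses.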

Next, try it on Baidu


const puppeteer = require('puppeteer');

// Try out Puppeteer: search Baidu and open the first result
const getForm = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://baidu.com');
  await page.type('#kw', 'puppeteer', {delay: 100}); // After opening Baidu, slowly type "puppeteer" into the search box
  await page.click('#su'); // Then click Search
  await page.waitForSelector('.result a'); // Wait for the results instead of sleeping for a fixed time
  const targetLink = await page.evaluate(() => {
    return document.querySelector('.result a').href;
  });
  console.log(targetLink);
  await page.goto(targetLink);
  await browser.close();
}

In the example above, we also use the page.click method.

page.click(selector[, options])

  • selector: selector of the element to click. If several elements match, the first one is clicked.
  • options
    • button: left, right, or middle. The default is left
    • clickCount: the default is 1
    • delay: time to wait between mousedown and mouseup, in milliseconds. The default is 0

This method finds an element matching the selector, scrolls it into view if needed, and then clicks it via page.mouse. It throws an error if the selector matches no element.

Note that if click() triggers a navigation, there is a separate page.waitForNavigation() promise to wait on. The correct way to click and wait for the navigation looks like this:

const [response] = await Promise.all([
  page.waitForNavigation(waitOptions),
  page.click(selector, clickOptions),
]);
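The reason for Promise.all is ordering: the wait has to be registered before the click is sent, otherwise a fast navigation could finish before anything is listening for it. A minimal sketch of a reusable wrapper follows. `clickAndWait` is a hypothetical helper name, `page` is expected to be a Puppeteer Page, and the default `waitUntil` value is just one reasonable choice.

```javascript
// Starts waiting for the navigation before the click is dispatched, so a fast
// navigation cannot complete unnoticed.
// clickAndWait is a hypothetical helper name; `page` is expected to be a Puppeteer Page.
const clickAndWait = async (page, selector, waitOptions = { waitUntil: 'domcontentloaded' }) => {
  const [response] = await Promise.all([
    page.waitForNavigation(waitOptions), // registered first
    page.click(selector) // then the click that triggers the navigation
  ])
  return response
}
```

Clicking first and only then awaiting page.waitForNavigation() risks waiting for a navigation that has already happened, which ends in a timeout.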

Cracking a simple slider CAPTCHA + emulating a mobile phone

Steps

  1. Find the slider
  2. Calculate the position of the slider
  3. Dispatch the events
  4. Drag
  5. Release
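The steps above can be sketched as one helper. This is a simplified sketch, not the article's exact code: `dragSlider` is a hypothetical name, `page` is expected to be a Puppeteer Page, and it uses mouse events only, so pages that listen for touch events need an extra dispatch step.

```javascript
// Drags the element matched by `selector` horizontally by `distance` pixels.
// dragSlider is a hypothetical helper name; `page` is expected to be a Puppeteer Page.
const dragSlider = async (page, selector, distance, steps = 25) => {
  // 1. Find the slider
  const handle = await page.waitForSelector(selector)
  // 2. Calculate its position (boundingBox() resolves to null if the element is not visible)
  const box = await handle.boundingBox()
  if (!box) throw new Error(`No visible element for ${selector}`)
  const startX = box.x + box.width / 2
  const startY = box.y + box.height / 2
  // 3. and 4. Press on the center of the slider and drag in small steps
  await page.mouse.move(startX, startY)
  await page.mouse.down()
  await page.mouse.move(startX + distance, startY, { steps })
  // 5. Release
  await page.mouse.up()
  return { startX, startY, endX: startX + distance }
}
```

The `steps` option makes Puppeteer send intermediate mousemove events instead of jumping straight to the end point, which is closer to how a real drag looks.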

Implementation

Analyzing the difficulties

Set up the emulator


 const devices = require('puppeteer/DeviceDescriptors');
 const iPhone6 = devices['iPhone 6'];
 await page.emulate(iPhone6)


The full version

const puppeteer = require('puppeteer')
const devices = require('puppeteer/DeviceDescriptors')

const getYzm = async () => {
  const iPhone6 = devices['iPhone 6']
  const conf = {
    headless: false,
    defaultViewport: {
      width: 1300,
      height: 900
    },
    slowMo: 30
  }
  const browser = await puppeteer.launch(conf)
  const page = await browser.newPage()
  await page.emulate(iPhone6)
  await page.goto('https://www.dingtalk.com/oasite/register_h5_new.htm')
  // The slider CAPTCHA checks navigator.webdriver, so set this property to false before sliding.
  // navigator.webdriver is a read-only property that indicates whether the user agent is controlled by automation.
  await page.evaluate(() => {
    Object.defineProperty(navigator, 'webdriver', {get: () => false})
  })
  // Incorrect input triggers the CAPTCHA
  await page.type('#mobileReal', '15724564118')
  await page.click('.am-button')
  await page.type('#mobileReal', ' ')
  await page.keyboard.press('Backspace')
  await page.click('._2q5FIy80')
  // Wait for the slider to appear
  const slide_btn = await page.waitForSelector('#nc_1_n1t', {timeout: 30000})
  // Calculate the slider position
  const rect = await page.evaluate((slide_btn) => {
    // getBoundingClientRect() returns the size of the element and its position relative to the viewport
    const {top, left, bottom, right} = slide_btn.getBoundingClientRect()
    return {top, left, bottom, right}
  }, slide_btn)
  console.log(rect)
  rect.left = rect.left + 10
  rect.top = rect.top + 10
  const mouse = page.mouse
  await mouse.move(rect.left, rect.top)
  // The page expects touch events, but page.mouse only sends mouse events,
  // so a touch has to be dispatched before sliding.
  // An H5 page needs the events dispatched manually to mimic an app's event dispatch mechanism.
  await page.touchscreen.tap(rect.left, rect.top)
  await mouse.down()
  const start_time = new Date().getTime()
  await mouse.move(rect.left + 800, rect.top, {steps: 25})
  await page.touchscreen.tap(rect.left + 800, rect.top)
  console.log(new Date().getTime() - start_time)
  await mouse.up()
  console.log(await page.evaluate('navigator.webdriver'))
  console.log('end')
  // await page.close()
}
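
The navigator.webdriver override works because Object.defineProperty installs a getter directly on the object, and that getter shadows the inherited one. Here is a small sketch of the mechanism that runs outside the browser. `fakeNavigator` is a hypothetical stand-in, since the real navigator only exists inside a page.

```javascript
// Stand-in for the browser's navigator: `webdriver` is an inherited getter
// that reports true (fakeNavigator is hypothetical, for illustration only)
const fakeNavigator = Object.create({ get webdriver () { return true } })

// Same call as in the crawler: an own getter now shadows the inherited one
Object.defineProperty(fakeNavigator, 'webdriver', { get: () => false })

console.log(fakeNavigator.webdriver) // false
```

One caveat: page.evaluate runs after the page has loaded, so a script that reads the flag during page load would still see the original value. Puppeteer also has page.evaluateOnNewDocument, which runs a function before the page's own scripts, and that may be the safer place for this override.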