The Puppeteer documentation is extensive. This article collects optimizations for common scenarios: launch arguments, anti-crawl countermeasures, and request interception.

Optimizing launch arguments

puppeteer.launch({
  args: [
    '--no-sandbox', // disable the Chromium sandbox
    '--disable-setuid-sandbox', // disable the setuid sandbox
    '--disable-dev-shm-usage', // do not use /dev/shm for shared-memory temporary files
    '--disable-accelerated-2d-canvas', // disable accelerated 2D canvas rendering
    '--disable-gpu' // disable GPU hardware acceleration
  ]
})
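The sandbox and `/dev/shm` flags are mostly needed when Chromium runs inside a container or CI job. Elsewhere the sandbox is usually better left enabled. A minimal sketch of building the argument list conditionally is shown here. `buildLaunchOptions` and its `inContainer` option are hypothetical names for illustration and are not part of Puppeteer.

```javascript
// Hypothetical helper: builds the options object passed to puppeteer.launch().
// Assumption: the caller knows whether it runs in a container (e.g. Docker).
function buildLaunchOptions({ inContainer = false } = {}) {
  // Rendering flags from the list above, applied everywhere.
  const args = [
    '--disable-accelerated-2d-canvas',
    '--disable-gpu'
  ]
  if (inContainer) {
    // Disabling the sandbox weakens isolation, so add it only where required.
    args.push('--no-sandbox', '--disable-setuid-sandbox')
    // /dev/shm is often small in containers.
    args.push('--disable-dev-shm-usage')
  }
  return { args }
}

// Usage (requires puppeteer to be installed):
// const puppeteer = require('puppeteer')
// const browser = await puppeteer.launch(buildLaunchOptions({ inContainer: true }))
```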

Anti-crawl strategy (attack and defense)

  • Evading detection
// Make navigator.webdriver return false so that page scripts checking it do not detect automation
await page.evaluateOnNewDocument(() => {
  Object.defineProperty(navigator, 'webdriver', {
    get: () => false
  })
})
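The override works because a property defined directly on an object shadows the one inherited from its prototype. The sketch below reproduces this with a stand-in object so that it runs outside a browser. `FakeNavigatorProto`, `fakeNavigator`, and `hideWebdriver` are illustrative names, and the prototype-getter layout is an assumption about how the browser exposes `webdriver`.

```javascript
// Stand-in for the browser's navigator: `webdriver` is a getter on the prototype.
const FakeNavigatorProto = {
  get webdriver() { return true }
}
const fakeNavigator = Object.create(FakeNavigatorProto)

// Same technique as the page script: define an own getter that returns false.
function hideWebdriver(nav) {
  Object.defineProperty(nav, 'webdriver', { get: () => false })
}

const before = fakeNavigator.webdriver // read through the prototype getter
hideWebdriver(fakeNavigator)
const after = fakeNavigator.webdriver // read through the own getter
console.log(before, after)
```

This hides only one signal. A site may inspect other properties, so it is best treated as one piece of a defense rather than a complete one.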

Request interception

Filter out requests that are irrelevant to the task

  1. When only the DOM content is parsed, images, audio, and similar resources are not needed
const blockTypes = new Set(['image', 'media', 'font'])
// Enable request interception
await page.setRequestInterception(true)

page.on('request', request => {
  // Filter by request type
  const resourceType = request.resourceType()
  const shouldBlock = blockTypes.has(resourceType)
  if (shouldBlock) {
    request.abort()
  } else {
    request.continue()
  }
})
  2. Third-party reporting requests (analytics and the like) do not need to be waited for
await page.setRequestInterception(true)

page.on('request', request => {
  const url = request.url()
  // console.log(url)
  if (url.includes('google.com') || url.includes('baidu.com')) {
    request.abort()
  } else {
    request.continue()
  }
})
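Note that `url.includes('google.com')` also matches URLs that merely mention the domain in a path or query string. A stricter variant, sketched here, compares the parsed hostname instead. `isBlockedHost` and `blockedDomains` are hypothetical names, and the domain list is the one from the example above.

```javascript
const blockedDomains = ['google.com', 'baidu.com']

// Returns true when the URL's host is a blocked domain or one of its subdomains.
function isBlockedHost(rawUrl, domains = blockedDomains) {
  let hostname
  try {
    hostname = new URL(rawUrl).hostname
  } catch (err) {
    return false // not a parseable absolute URL, so let it through
  }
  return domains.some(d => hostname === d || hostname.endsWith('.' + d))
}

// Inside the request handler:
// isBlockedHost(request.url()) ? request.abort() : request.continue()
```

Matching on the hostname avoids aborting a first-party request such as `https://example.com/?ref=google.com`.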
  3. Intercept one specific request and modify it for business purposes (same approach as above)
await page.setRequestInterception(true) // Enable request interception
page.on('request', request => {
  const url = request.url()
  if (url === 'https://abc.com/api/getData') {
    // Override the request; url, method, postData, and headers can be overridden
    return request.continue({
      headers: Object.assign({}, request.headers(), {
        'Auth-abc': 'abcde'
      })
    })
  }
  // Every other request must still be continued, otherwise it stalls
  request.continue()
})
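In Puppeteer's default interception mode a request can be resolved only once, so when the three rules above are used on the same page they need to live in a single handler rather than in three separate `page.on('request', ...)` listeners. The following sketch shows one way to combine them. `decide` and `handleRequest` are hypothetical names, and the URL, header, and domain values are the placeholders used in the earlier examples.

```javascript
const blockTypes = new Set(['image', 'media', 'font'])
const blockedHosts = ['google.com', 'baidu.com']
const apiUrl = 'https://abc.com/api/getData'

// Pure decision function: maps a request's properties to an action.
function decide({ url, resourceType, headers }) {
  if (blockTypes.has(resourceType)) return { action: 'abort' }
  if (blockedHosts.some(h => url.includes(h))) return { action: 'abort' }
  if (url === apiUrl) {
    // Keep the original headers and add the custom one.
    return { action: 'continue', overrides: { headers: { ...headers, 'Auth-abc': 'abcde' } } }
  }
  return { action: 'continue', overrides: {} }
}

// Applies the decision to a Puppeteer request (or any object with the same methods).
function handleRequest(request) {
  const result = decide({
    url: request.url(),
    resourceType: request.resourceType(),
    headers: request.headers()
  })
  return result.action === 'abort'
    ? request.abort()
    : request.continue(result.overrides)
}

// Usage:
// await page.setRequestInterception(true)
// page.on('request', handleRequest)
```

Keeping `decide` free of Puppeteer calls makes the rules easy to unit test with plain objects.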
