Apify is a Node.js-based crawler framework that integrates libraries commonly used in crawler services, such as Puppeteer and Cheerio, and aims to fill the functional gaps of web crawlers in complex scenarios: a unified crawler task entry point, error capture and retry for crawler tasks, crawler task queues/lists, monitoring of the crawler's internal state, a proxy pool, and so on.

A standardized crawler API

Developers can focus on writing the target crawl logic, while the failure-handling logic for crawler tasks is configured via handleFailedRequestFunction.

const Apify = require('apify');

Apify.main(async () => {
    // Initialize the crawler request list
    const requestList = new Apify.RequestList({
        sources: [
            { url: 'http://www.google.com/' },
            { url: 'http://www.example.com/' },
            { url: 'http://www.bing.com/' },
            { url: 'http://www.wikipedia.com/' },
        ],
    });
    await requestList.initialize();

    const crawler = new Apify.BasicCrawler({
        // Concurrency limits
        minConcurrency: 10,
        maxConcurrency: 50, // illustrative value; the original figure was lost
        // Retry failed requests at most once
        maxRequestRetries: 1,
        // Increase the timeout for processing a page
        handlePageTimeoutSecs: 60,
        // Crawler request list
        requestList,
        // Page handler function: target page crawl logic
        handlePageFunction: async ({ request, response }) => {
            // crawl logic for the target page
        },
        // Handling logic after a request has failed n retries
        handleFailedRequestFunction: async ({ request }) => {
            // error handling logic
        },
    });

    // Start the crawler
    await crawler.run();
});

Extended crawlers based on puppeteer and cheerio

CheerioCrawler and PuppeteerCrawler build on BasicCrawler and integrate the page-HTML processing capabilities of the two libraries, respectively.

const crawler = new Apify.CheerioCrawler({
    // Crawler request list
    requestList,
    // - html: the page HTML as a string
    // - $: a Cheerio object wrapping the target HTML
    handlePageFunction: async ({ request, html, $ }) => {
        console.log(`Processing ${request.url}...`);
        const title = $('title').text();
        const h1texts = [];
        $('h1').each((index, el) => {
            h1texts.push({
                text: $(el).text(),
            });
        });
    },
});

For the Puppeteer crawler, Apify encapsulates the Puppeteer workflow of creating and destroying browsers and pages; developers only need to write the crawl logic for the target page.

const crawler = new Apify.PuppeteerCrawler({
    // Crawler task queue
    requestQueue,
    // Puppeteer launch options
    launchPuppeteerOptions: {},
    handlePageFunction: async ({ request, page }) => {
        // page is created in the Puppeteer context, so all Puppeteer APIs are available
        const pageFunction = $posts => {
            const data = [];
            $posts.forEach($post => {
                data.push({
                    title: $post.querySelector('.title a').innerText,
                    rank: $post.querySelector('.rank').innerText,
                    href: $post.querySelector('.title a').href,
                });
            });
            return data;
        };
        const data = await page.$$eval('.athing', pageFunction);
    },
});

Crawler internal state monitoring

Apify internally implements a SystemStatus class that polls the crawler service's internal state (memory, CPU, event loop, etc.) at regular intervals. When minConcurrency and maxConcurrency are set on a crawler, Apify dynamically scales the crawl concurrency in proportion to the current SystemStatus. When the current system state cannot sustain the current crawl load, Apify temporarily stops scheduling tasks from the requestList or requestQueue until SystemStatus returns to normal. The lifecycle of crawler tasks can also be managed dynamically by setting autoscaledPoolOptions.

{
    // Executes a single crawl task
    runTaskFunction: async () => {
    },
    // Resolves to true when another task is ready to start
    isTaskReadyFunction: async () => {
    },
    // Resolves to true when all tasks are finished and the pool should stop
    isFinishedFunction: async () => {
    },
}
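
For context, the snippet below is a minimal standalone sketch of how these three functions might drive an AutoscaledPool directly. The URL list, concurrency values, and task logic are illustrative assumptions, based on the Apify.AutoscaledPool class of the older SDK:

const Apify = require('apify');

Apify.main(async () => {
    // Illustrative work list; in a real crawler the tasks would come from a requestList/requestQueue
    const urls = ['http://www.example.com/a', 'http://www.example.com/b'];

    const pool = new Apify.AutoscaledPool({
        minConcurrency: 1,
        maxConcurrency: 10,
        // Executes one unit of work
        runTaskFunction: async () => {
            const url = urls.shift();
            // ... fetch and process url
        },
        // True while there is work that can be started right now
        isTaskReadyFunction: async () => urls.length > 0,
        // True when everything is done and run() should resolve
        isFinishedFunction: async () => urls.length === 0,
    });

    await pool.run();
});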

Dynamically control the execution of the task queue/list through the AutoscaledPool instance.

handlePageFunction: async ({ autoscaledPool }) => {
    autoscaledPool.run()                 // returns a long-running Promise
    autoscaledPool.pause([timeoutSecs])  // pause; returns a Promise
    autoscaledPool.resume()              // resume
}

Built-in proxy pool

In crawler services, to evade the target site's anti-crawler mechanisms and hide the browser's origin behind proxies, one often has to write complex logic to assign a different proxy IP to each request. Apify has a proxy pool built in: pass an array of proxy addresses via the proxyUrls option, and Apify automatically maintains the pool internally, randomly assigning a proxy to each crawl task.
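
For illustration, a minimal sketch of passing proxyUrls to a crawler (assuming an SDK version where CheerioCrawler accepts the proxyUrls option directly; the proxy addresses are placeholders):

const crawler = new Apify.CheerioCrawler({
    requestList,
    // Placeholder proxy addresses; Apify rotates them across crawl tasks
    proxyUrls: [
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
    ],
    handlePageFunction: async ({ request, $ }) => {
        // crawl logic
    },
});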

Local data store

Based on the fs module, Apify implements two data structures for storing data locally: DataSet and KeyValueStore.

DataSet
// Write one row of data to the default local store
await Apify.pushData({ col1: 123, col2: 'val2' });

// Open a named Dataset
const dataset = await Apify.openDataset('some-name');

// Write multiple rows
await dataset.pushData([{ foo: 'bar2', col2: 'val2' }, { col3: 123 }]);
KeyValueStore
// Write a key-value pair to the default KeyValueStore
await Apify.setValue('OUTPUT', { myResult: 123 });

// Open a named KeyValueStore
const store = await Apify.openKeyValueStore('some-name');

// Write a record. A JavaScript object is automatically converted to JSON;
// strings and binary buffers are stored as-is
await store.setValue('some-key', { foo: 'bar' });

// Read a record. JSON is automatically parsed into a JavaScript object;
// text data is returned as a string and other data as a binary buffer
const value = await store.getValue('some-key');

// Drop (delete) the store
await store.drop();

Rich utility functions

Apify.utils.log (leveled logging)
const { log } = Apify.utils;

log.setLevel(log.LEVELS.ERROR);
log.debug('Debug message');
log.info('Info message');
log.error('Error message');
Apify.utils.puppeteer

Puppeteer utility functions:

  • Apify.utils.puppeteer.addInterceptRequestHandler(page, handler) ⇒ Promise // intercept and handle page requests
  • Apify.utils.puppeteer.gotoExtended(page, request, [gotoOptions]) ⇒ Promise<Response> // pre-process page navigation
  • Apify.utils.puppeteer.injectFile(page, filePath, [options]) ⇒ Promise // inject a local script file into the page
  • Apify.utils.puppeteer.blockRequests(page, [options]) ⇒ Promise // block requests matching given URL patterns
  • Apify.utils.puppeteer.cacheResponses(page, cache, responseUrlRules) ⇒ Promise // cache page responses
  • Apify.utils.puppeteer.compileScript(scriptString, context) ⇒ function // execute additional scripts in the page context
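
As a quick illustration, here is a hedged sketch of blockRequests (assuming the urlPatterns option documented for the older SDK; the target URL is a placeholder):

const Apify = require('apify');
const { puppeteer } = Apify.utils;

Apify.main(async () => {
    const browser = await Apify.launchPuppeteer();
    const page = await browser.newPage();

    // Skip static assets to speed up crawling; must be called before navigation
    await puppeteer.blockRequests(page, {
        urlPatterns: ['.jpg', '.png', '.gif', '.css'],
    });

    await page.goto('http://www.example.com/');
    // ... crawl logic
    await browser.close();
});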

Apify solves many pain points of Node.js crawler services. Writing crawlers with Apify greatly reduces the amount of code and lets developers focus on the crawl logic for the target pages. Alongside the advantages described above, Apify still has some shortcomings, for example it does not support a dynamic requestQueue or manual proxy switching. Even so, the features it currently supports already cover most common business needs.

References

  • Apify – SDK official documentation