Background

WebMagic is a simple and flexible Java crawler framework; efficient and easy-to-maintain crawler applications can be built on it quickly. The crawler's scheduler uses the Task's UUID attribute to associate a site with its URLs, deduplicating URLs with a set so that the same page is not collected twice.

This article analyzes how the order of setting the Spider's task id (UUID) and calling setScheduler affects the crawler task, and what to watch out for in practice.

RedisScheduler overview

Scheduler is the WebMagic component for URL management. Generally speaking, a Scheduler has two functions:

  1. Manage the queue of URLs waiting to be crawled
  2. Deduplicate URLs that have already been crawled
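The two duties above can be sketched with a minimal in-memory scheduler (illustrative code, not WebMagic's real classes): a queue of pending URLs plus a set of seen URLs for deduplication.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of a scheduler's two duties: a queue of URLs waiting to
// be crawled, and a set of already-seen URLs used for deduplication.
public class SimpleScheduler {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    // Enqueue a URL only if it has never been seen before.
    public boolean push(String url) {
        if (seen.add(url)) {
            queue.offer(url);
            return true;
        }
        return false; // duplicate, dropped
    }

    // Next URL to crawl, or null when the queue is empty.
    public String poll() {
        return queue.poll();
    }
}
```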

WebMagic ships with several commonly used Scheduler implementations. In project development we chose RedisScheduler, which supports distributed deployment: it stores crawled-URL information in Redis and uses the Task's UUID as the key, so multiple machines can fetch cooperatively at the same time.

To manage URLs and deduplicate them, RedisScheduler stores three kinds of information in Redis, each keyed with the task's UUID as a suffix:

  1. Queue of URLs to crawl: queue_taskUuid
  2. Set of crawled URLs: set_taskUuid, used for deduplication
  3. Extra Request information: item_taskUuid, which stores each Request's additional fields
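The key layout above can be sketched as follows; the prefixes match the key names listed, while the helper class and the sample task id are purely illustrative.

```java
// Sketch of how RedisScheduler-style keys are derived from the task UUID.
// Prefixes mirror the three key names described above; the class itself
// is illustrative, not WebMagic source code.
public class RedisKeySketch {
    static final String QUEUE_PREFIX = "queue_";
    static final String SET_PREFIX = "set_";
    static final String ITEM_PREFIX = "item_";

    static String queueKey(String uuid) { return QUEUE_PREFIX + uuid; }
    static String setKey(String uuid)   { return SET_PREFIX + uuid; }
    static String itemKey(String uuid)  { return ITEM_PREFIX + uuid; }

    public static void main(String[] args) {
        String taskUuid = "news-task-01"; // hypothetical task id
        System.out.println(queueKey(taskUuid)); // queue_news-task-01
        System.out.println(setKey(taskUuid));   // set_news-task-01
        System.out.println(itemKey(taskUuid));  // item_news-task-01
    }
}
```

Because every key embeds the UUID, all three structures silently move to different keys if the UUID changes, which is the root of the pitfall discussed later.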

Spider creation process

The typical code for using Spider with RedisScheduler is:

   Spider spider = Spider.create(pageProcessor);
   spider.addUrl(url).setUUID(taskId)
         .addPipeline(pipeline)
         .setScheduler(new RedisScheduler(SpringUtil.getBean(JedisPool.class)))
         .setExecutorService(threadPool);
   return spider;

Because RedisScheduler keys the crawler's intermediate data by UUID, the source code shows that once the scheduler is set, its queue data is bound to the UUID in effect at that moment:

   /**
    * set scheduler for Spider
    *
    * @param scheduler scheduler
    * @return this
    * @see Scheduler
    * @since 0.2.1
    */
   public Spider setScheduler(Scheduler scheduler) {
       checkIfRunning();
       Scheduler oldScheduler = this.scheduler;
       this.scheduler = scheduler;
       if (oldScheduler != null) {
           Request request;
           while ((request = oldScheduler.poll(this)) != null) {
               this.scheduler.push(request, this);
           }
       }
       return this;
   }

When a new scheduler is set, the pending requests in the old scheduler's queue are moved into the new scheduler's queue.
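The migration loop in setScheduler can be modeled with plain queues (toy types, not WebMagic's Scheduler/Request classes): drain the old queue and push every pending request into the new one, preserving order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy version of the loop in Spider.setScheduler: poll the old queue until
// it is empty, pushing each pending request into the new queue.
public class SchedulerSwap {
    public static void migrate(Deque<String> oldQueue, Deque<String> newQueue) {
        String request;
        while ((request = oldQueue.poll()) != null) {
            newQueue.offer(request); // offer at the tail keeps FIFO order
        }
    }

    public static void main(String[] args) {
        Deque<String> oldQueue = new ArrayDeque<>();
        oldQueue.offer("https://example.com/1"); // hypothetical seed URLs
        oldQueue.offer("https://example.com/2");
        Deque<String> newQueue = new ArrayDeque<>();
        migrate(oldQueue, newQueue);
        System.out.println(newQueue); // [https://example.com/1, https://example.com/2]
    }
}
```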

Points to note

While testing the distributed crawler, after calling the Spider's setScheduler, the author set the Spider's UUID to a new value, hoping to shard the crawl. The crawler then finished within a second, with zero pages downloaded. Why?

   public Spider setUUID(String uuid) {
       this.uuid = uuid;
       return this;
   }

setUUID simply resets the id field; it does not update the Scheduler's association with the task.

Therefore, crawler creation should follow a fixed order: set the UUID first, then set the Scheduler. Once set, the task id should not be changed.

If you must change the task id, you should also reset the scheduler queue so that the correspondence is updated.
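One way to reset the correspondence, continuing the toy Redis-as-map model above (the helper and key names are illustrative, not a WebMagic API), is to move the pending queue from the old key to the new one when the task id changes:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of the remedy: when the task id changes, re-key the pending
// queue so the scheduler's data follows the new UUID. The map stands in
// for Redis.
public class RekeyQueue {
    public static void rekey(Map<String, Deque<String>> store,
                             String oldUuid, String newUuid) {
        Deque<String> pending = store.remove("queue_" + oldUuid);
        if (pending != null) {
            store.computeIfAbsent("queue_" + newUuid, k -> new ArrayDeque<>())
                 .addAll(pending);
        }
    }
}
```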