Cendertron: the road to distributed deployment and stability optimization for security crawlers

Cendertron is a Web 2.0 dynamic crawler and sensitive-information-leak detection tool based on Puppeteer, which provides URL targets for the subsequent basic and POC scans of Chaos Scanner. We have introduced the basic usage of Cendertron before; here we briefly describe the crawler parameter design and cluster architecture used in real scanning scenarios. It has to be said that an elegant design still needs to be tempered by plenty of real data and accumulated experience: compared with the previous version of Cendertron, most of the changes in this iteration come from getting the details right.

Flexible cluster deployment based on Docker Swarm

In the Docker in Practice series, we introduced the concepts and configuration of Docker and Docker Swarm in detail. Here we also use the routing mesh mechanism provided by Docker to expose multiple nodes on the same port. This requires us to store part of each crawler node's state centrally, and Redis is used as that centralized store.

In fact, both the POC and crawler nodes in Chaos Scanner follow this scheduling model, except that POC scan nodes rely on RabbitMQ for task distribution:

The logical flow of the whole crawler in scan scheduling is as follows:

Here we can write the Compose file, docker-compose.yml, based on the base image:

version: '3'
services:
  crawlers:
    image: cendertron
    ports:
      - '${CENDERTRON_PORT}:3000'
    deploy:
      replicas: 2
    volumes:
      - wsat_etc:/etc/wsat

volumes:
  wsat_etc:
    driver: local
    driver_opts:
      o: bind
      type: none
      device: /etc/wsat/

Here we mount the Redis configuration into the container as a volume. In Chaos Scanner, the unified registry for different nodes is simplified into this single configuration file:

{
  "db": {
    "redis": {
      "host": "x.x.x.x",
      "port": 6379,
      "password": "xx-xx-xx-xx"
    }
  }
}

After Redis is configured, we can create the service by using the following command:

# Create the service
> docker stack deploy wsat --compose-file docker-compose.yml --resolve-image=changed

# Scale to the specified number of instances
> docker service scale wsat_crawlers=5

Cendertron also provides a way to create a scan for multiple targets at once, using | as the separator between different URLs:

POST /scrape

{
  "urls": "http://baidu.com|http://google.com"
}
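As a quick client-side illustration, several targets can be submitted in one request as sketched below; this assumes Node 18+ for the built-in fetch, and the service address is a placeholder:

// Submit multiple scan targets in a single request, joined with `|`
async function submitScan(endpoint: string, urls: string[]) {
  const resp = await fetch(`${endpoint}/scrape`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ urls: urls.join('|') })
  });
  return resp.json();
}

// Any Swarm node works thanks to the routing mesh; the address is a placeholder
submitScan('http://x.x.x.x:3000', ['http://baidu.com', 'http://google.com']).then(
  console.log
);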

After the cluster is running, we can see the state of the containers started on a single machine by using the ctop command:

Using the htop command, we can see that the CPUs of the entire system are fully utilized:

Failure-oriented design and monitoring first

In our series on Testing and High Availability Assurance, we discussed the principles of failure-oriented design in high-availability architecture in detail:

One of the most important principles is monitoring coverage. At the design stage we assume that the online system will run into problems, and add corresponding control measures so that once something does go wrong it can be remedied in time. For a scenario as varied as crawling, we need to be able to review the state of the system at any moment, so that we always know where the current strategy and parameters fall short.

In the cluster context, crawler state is stored in Redis, and each crawler reports its state periodically. The reported state automatically expires; if a node's state is missing when we inspect the system, it means that crawler has hung within this time window.
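The reporting side can be sketched roughly as follows: each node periodically writes its status under a key with a TTL, so a node that stops reporting simply disappears from Redis. The key name, interval, and TTL here are illustrative assumptions, not Cendertron's actual values.

import Redis from 'ioredis';

const redis = new Redis({ host: 'x.x.x.x', port: 6379, password: 'xx-xx-xx-xx' });

const NODE_ID = 'a8621dc0-afb3-11e9-94e5-710fb88b1291'; // scheduler id, placeholder
const REPORT_INTERVAL = 30 * 1000; // report every 30s (assumed)
const EXPIRE_SECONDS = 90; // state vanishes after three missed reports (assumed)

async function reportStatus(status: object) {
  // SET with EX: the key expires automatically if this node stops reporting
  await redis.set(
    `crawler:status:${NODE_ID}`,
    JSON.stringify({ ...status, reportTime: Date.now() }),
    'EX',
    EXPIRE_SECONDS
  );
}

setInterval(() => {
  reportStatus({ localRunningCrawlerCount: 1, pageQueueLen: 31 }).catch(console.error);
}, REPORT_INTERVAL);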

We still use the GET /_ah/health endpoint to check the status of the entire system, as shown below:

{
  "success": true,
  "mode": "cluster",
  "schedulers": [
    {
      "id": "a8621dc0-afb3-11e9-94e5-710fb88b1291",
      "browserStatus": [
        {
          "targetsCnt": 4,
          "useCount": 153,
          "urls": [
            { "url": "" },
            { "url": "about:blank" },
            { "url": "" },
            { "url": "http://180.100.134.161:8091/xygjitv-web/#/enter_index_db/film" }
          ]
        }
      ],
      "runingCrawlers": [
        {
          "id": "dabd6260-b216-11e9-94e5-710fb88b1291",
          "entryPage": "http://180.100.134.161:8091/xygjitv-web/",
          "progress": "0.44",
          "startedAt": 1564414684039,
          "option": {
            "depth": 4,
            "maxPageCount": 500,
            "timeout": 1200000,
            "navigationTimeout": 30000,
            "pageTimeout": 60000,
            "isSameOrigin": true,
            "isIgnoreAssets": true,
            "isMobile": false,
            "ignoredRegex": ".*logout.*",
            "useCache": true,
            "useWeakfile": false,
            "useClickMonkey": false,
            "cookies": [
              {
                "name": "PHPSESSID",
                "value": "fbk4vjki3qldv1os2v9m8d2nc4",
                "domain": "180.100.134.161:8091"
              },
              {
                "name": "security",
                "value": "low",
                "domain": "180.100.134.161:8091"
              }
            ]
          },
          "spiders": [
            {
              "url": "http://180.100.134.161:8091/xygjitv-web/",
              "type": "page",
              "option": {
                "allowRedirect": false,
                "depth": 1
              },
              "isClosed": true,
              "currentStep": "Finished"
            }
          ]
        }
      ],
      "localRunningCrawlerCount": 1,
      "localFinishedCrawlerCount": 96,
      "reportTime": "2019-7-29 23:38:34"
    }
  ],
  "cache": ["Crawler#http://baidu.com"],
  "pageQueueLen": 31
}
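On the monitoring side, a small watchdog can poll this endpoint and warn when an expected scheduler id is absent from schedulers, i.e. its state has already expired in Redis. The following is only a rough sketch; the endpoint address and the list of expected ids are placeholders.

// Poll the health endpoint and flag schedulers whose state has expired
const HEALTH_URL = 'http://x.x.x.x:3000/_ah/health'; // placeholder address
const EXPECTED_SCHEDULERS = ['a8621dc0-afb3-11e9-94e5-710fb88b1291']; // placeholder ids

async function checkCluster() {
  const resp = await fetch(HEALTH_URL);
  const health = (await resp.json()) as { schedulers: Array<{ id: string }> };

  const aliveIds = new Set(health.schedulers.map((s) => s.id));
  for (const id of EXPECTED_SCHEDULERS) {
    if (!aliveIds.has(id)) {
      // The node missed its report window and is presumed hung
      console.warn(`Scheduler ${id} has not reported recently; it may need a restart`);
    }
  }
}

setInterval(() => checkCluster().catch(console.error), 60 * 1000);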

Parameter tuning

Due to network jitter and other factors, Cendertron cannot guarantee absolute stability and consistency; it is more of a trade-off between efficiency and performance. src/config.ts contains all of Cendertron's built-in parameters:

export interface ScheduleOption {
  // Number of concurrent crawlers
  maxConcurrentCrawler: number;
}

export const defaultScheduleOption: ScheduleOption = {
  maxConcurrentCrawler: 1
};

export const defaultCrawlerOption: CrawlerOption = {
  // Crawl depth
  depth: 4,
  // Maximum number of pages a crawler can crawl
  maxPageCount: 500,
  // The default timeout is 20 minutes
  timeout: 20 * 60 * 1000,
  // The navigation timeout is 30s
  navigationTimeout: 30 * 1000,
  // Single page timeout is 60 seconds
  pageTimeout: 60 * 1000,

  isSameOrigin: true,
  isIgnoreAssets: true,
  isMobile: false,
  ignoredRegex: '.*logout.*',
  // Whether to use cache
  useCache: true,
  // Whether to scan for sensitive files
  useWeakfile: false,
  // Whether to use click simulation
  useClickMonkey: false
};

export const defaultPuppeteerPoolConfig = {
  max: 1, // default
  min: 1, // default
  // how long a resource can stay idle in pool before being removed
  idleTimeoutMillis: Number.MAX_VALUE, // default
  // maximum number of times an individual resource can be reused before being destroyed; set to 0 to disable
  acquireTimeoutMillis: defaultCrawlerOption.pageTimeout * 2,
  maxUses: 0, // default
  // function to validate an instance prior to use; see https://github.com/coopernurse/node-pool#createpool
  validator: () => Promise.resolve(true), // defaults to always resolving true
  // validate resource before borrowing; required for `maxUses` and `validator`
  testOnBorrow: true // default
  // For all opts, see opts at https://github.com/coopernurse/node-pool#createpool
};
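When tuning for a specific target, individual fields can be overridden on top of these defaults instead of editing the source. The profile below is only an illustrative sketch built from the config above; the values are not recommendations.

import { defaultCrawlerOption } from './config';

// Example: a shallower but faster scan profile for large sites (values are illustrative)
const fastScanOption = {
  ...defaultCrawlerOption,
  depth: 2, // crawl fewer levels
  maxPageCount: 200, // cap the number of pages per target
  pageTimeout: 30 * 1000, // give up on slow pages sooner
  useWeakfile: true // also probe for sensitive files on this run
};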

Further reading

You can read the author's series of articles through any of the following channels; they cover a variety of fields, such as technical overviews, programming languages and theory, Web and big front end, server-side development and infrastructure, cloud computing and big data, data science and artificial intelligence, product design, and so on:

  • Browse online on Gitbook; each series corresponds to its own Gitbook repository.
  • Awesome Lists, Awesome CheatSheets, Awesome Interviews, Awesome RoadMaps, Awesome-CS-Books-Warehouse
  • Programming Language Theory, Java in Practice, JavaScript in Practice, Go in Practice, Python in Practice, Rust in Practice
  • Software Engineering, Data Structures and Algorithms, Design Patterns, Software Architecture; Modern Web Development Fundamentals and Engineering Practices; Large Front-end Hybrid Development and Data Visualization; Server-side Development Practices and Engineering Architecture; Distributed Infrastructure; Data Science, Artificial Intelligence and Deep Learning; Product Design and User Experience