Crawlab is a Golang-based distributed crawler management platform that supports crawlers written in Python, Node.js, Java, Go, PHP, and other languages, as well as a variety of crawler frameworks.

Since its launch in March this year, the project has been well received by crawler enthusiasts and developers, with many users saying they plan to use Crawlab to build their own crawler platforms. Over several months of iteration, we have successively released scheduled tasks, data analysis, website information, configurable crawlers, automatic field extraction, result downloads, crawler uploads, and other features, making Crawlab more practical and comprehensive and genuinely helping users solve the problem of managing crawlers at scale.

Crawlab mainly addresses the problem of managing large numbers of crawlers. For example, a mix of Scrapy and Selenium projects monitoring hundreds of websites is hard to manage together, and managing them from the command line is costly and error-prone. Crawlab supports any language and any framework, and with task scheduling and task monitoring it makes it easy to monitor and manage large-scale crawler projects effectively.

  • View the demo
  • GitHub: github.com/tikazyq/cra…

What's new

In v0.3.0, a significant update has been made: the original Celery-based Python backend has been replaced with Golang. The changes are as follows:

  • Golang backend: The original Python code has been refactored in Golang, improving stability and performance
  • Node topology: Visualizes the node topology
  • Node system information: You can view each node's system information, such as operating system, number of CPUs, executables, and so on
  • Node monitoring upgrade: Nodes now register and are monitored through Redis
  • File management: Ability to modify crawler files, with code highlighting
  • Login/registration/user management: Using Crawlab now requires login; user registration and user management are supported, with role-based permission management added
  • Automatic crawler deployment: Crawlers are automatically synchronized/deployed to all online nodes
  • Smaller Docker image: Through a multi-stage build, the Docker image size is reduced from 1.3 GB to 700 MB
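The smaller image in the last item comes from Docker's multi-stage build feature: the Go binary is compiled in a full build image, and only the resulting binary is copied into a slim runtime image. A minimal sketch of the idea (image names and paths here are illustrative, not Crawlab's actual Dockerfile):

```dockerfile
# Stage 1: compile the Go backend in a full toolchain image
FROM golang:1.12 AS backend-build
WORKDIR /app
COPY . .
RUN go build -o /crawlab main.go

# Stage 2: copy only the compiled binary into a slim runtime image
FROM debian:stretch-slim
COPY --from=backend-build /crawlab /usr/local/bin/crawlab
CMD ["crawlab"]
```

Because the final image never contains the Go toolchain or intermediate build artifacts, it ends up substantially smaller than a single-stage build.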

Why was Crawlab refactored

The original intent of refactoring in Golang was to fix some fundamental bugs, such as scheduled tasks not triggering reliably and nodes not automatically going offline. The refactored API is much more stable and performant: the task list now responds in milliseconds rather than hundreds of milliseconds. The refactor also streamlines the user workflow. Previously, crawlers had to be deployed manually, and users needed many clicks before a crawler could run; now all crawlers are deployed automatically. The trade-off is that after uploading a crawler, users may have to wait up to a minute while the crawler files are deployed to all nodes via GridFS before it can run (although the master node, of course, can run it immediately). Additional features were also added, such as user permissions (providing basic rights management), node topology, file management, and so on. Overall, this update makes Crawlab more stable and practical.
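To make the automatic deployment idea concrete, here is a minimal Go sketch of how a worker node might decide whether its local copy of a crawler is stale and needs to be re-downloaded. The checksum-comparison approach and all function names here are illustrative assumptions, not Crawlab's actual implementation (which stores and distributes crawler files via GridFS):

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// fileChecksum returns the MD5 hex digest of a crawler file's contents.
func fileChecksum(data []byte) string {
	sum := md5.Sum(data)
	return hex.EncodeToString(sum[:])
}

// needsSync reports whether the node's local copy differs from the
// master's copy, meaning the node should fetch the latest version.
func needsSync(masterData, nodeData []byte) bool {
	return fileChecksum(masterData) != fileChecksum(nodeData)
}

func main() {
	master := []byte("print('spider v2')")
	node := []byte("print('spider v1')")
	fmt.Println(needsSync(master, node))   // prints true: node is stale
	fmt.Println(needsSync(master, master)) // prints false: already in sync
}
```

In a real deployment this check would run periodically on each node, pulling the updated files from central storage whenever the checksums diverge.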

Crawlab screen preview

Login page

Home page

Node list

Node topology

Crawler list

Crawler overview

Crawler analytics

Crawler files

Task detail – scraped results

Scheduled tasks

Why is there no configurable crawler?

Unfortunately, due to time constraints, configurable crawlers have not yet been ported to the new version of Crawlab, but we will add this functionality back later.

What’s next

  • Log management: more centralized log management
  • SQL database support: store results in mainstream databases such as MySQL and Postgres
  • Configurable crawlers
  • Exception monitoring: log errors, zero-value anomalies, and so on
  • Statistical visualization: more charting capabilities

Of course, if you have a better idea, please feel free to suggest it.

Community

If you find Crawlab useful in your daily development or at your company, please add the author's WeChat account tikazyQ1 with the note "Crawlab", and the author will add you to the discussion group. Feel free to star the project on GitHub, and if you run into any problems, open an issue on GitHub. Contributions to Crawlab's development are also very welcome.