Preface

Developing a product is always both a pain and a joy. On the one hand, it is painful: you worry that you do not understand the real needs of users, you fear that you are building behind closed doors, and you doubt whether the technology can be realized at all. On the other hand, small achievements, recognition from users, and continually helping people solve problems bring enough joy to keep going. Crawlab is exactly this kind of open source project for me, both painful and joyful. Since its first commit in March last year, it has recently reached 5K stars on GitHub and has grown into the most popular open source crawler management platform. Crawlab has appeared on GitHub Trending many times and has become known to developers all over the world. It has also been mirrored on Gitee and featured on OSChina, making it familiar to more developers in China. The community keeps improving as well: the WeChat group has close to 1.2K members, with people asking questions and exchanging experience there every day, and many enthusiastic users on GitHub raise issues that help us optimize the product.

From the initial Flask-based task scheduling to the self-developed Golang scheduling engine, Crawlab has gone through many iterations. As the product matures, it continues to evolve, and I believe more useful features will arrive soon, including many drawn from users' valuable feedback.

The graph below shows the cumulative trend of Crawlab’s GitHub stars. As it shows, on the way to 5K stars Crawlab experienced two large growth spurts along with sustained smaller growth.

The purpose of this article is to document the milestones that the author and the community have reached together. Back in August last year (eight months ago), I wrote an article, “How to Build a Thousand-Star GitHub Project,” about how to attract attention on GitHub and build a popular product.

Project introduction

Crawlab is a distributed crawler management platform written in Golang that supports multiple programming languages and crawler frameworks. Readers unfamiliar with crawler management platforms can refer to the article “How to Quickly Build a Practical Crawler Management Platform,” which introduces them in detail.

View the Demo

Since its launch in April 2019, the project has been well received by crawler enthusiasts and developers, and more than half of surveyed users report that they already use Crawlab as their company’s crawler management platform. After several months of iteration, we successively launched scheduled tasks, data analytics, configurable crawlers, an SDK, message notifications, Scrapy support, Git synchronization, and other features, making Crawlab more practical and more comprehensive, so that it can genuinely help users solve the problem of managing crawlers.

Crawlab mainly addresses the problem of managing a large number of crawlers. For example, a team monitoring hundreds of websites with a mix of Scrapy and Selenium projects will find them hard to manage all at once, and managing them from the command line is costly and error prone. Crawlab supports any language and any framework, and with task scheduling and task monitoring it becomes easy to monitor and manage large-scale crawler projects effectively.

Crawlab makes it easy to integrate developers’ existing crawlers. With the CLI tool, you can upload any crawler project to Crawlab, where it is synchronized to all nodes to form a distributed architecture. In addition, Crawlab’s SDK makes it easy to surface scraped data in the Crawlab interface, where you can view and download task results (as shown in the following figure).
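As a rough illustration of this integration flow, here is a minimal, hypothetical sketch of a spider reporting its results. The `save_item` import is assumed to come from the Crawlab Python SDK (treat its exact name and signature as an assumption); the fallback simply prints JSON lines when the SDK is not installed, and the `scrape` function is invented as a stand-in for real scraping logic:

```python
import json

# `crawlab.save_item` is assumed to be provided by the Crawlab SDK;
# this import and its signature are an assumption, not a confirmed API.
try:
    from crawlab import save_item
except ImportError:
    # Fallback for running outside a Crawlab node: emit one JSON line per item.
    def save_item(item):
        print(json.dumps(item, ensure_ascii=False))

def scrape():
    # Placeholder for real scraping logic (requests, Scrapy, Selenium, ...).
    return [{"title": "Example page", "url": "https://example.com"}]

if __name__ == "__main__":
    for item in scrape():
        save_item(item)  # each saved item appears as a row in the task results
```

A script structured like this runs unchanged on any node once uploaded via the CLI; the items it saves are what the task-result view and download described above would display.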

Project development

Crawlab is now a year old and has more than five thousand stars on GitHub, but that alone does not say much. The number of stars on GitHub does not really prove anything; you can even buy stars on Taobao. Another interesting fact is that many projects with thousands of stars on GitHub are “Markdown projects.” What is a Markdown project? A project with few executable code files, consisting mostly of Markdown files full of technical knowledge: interview questions, knowledge summaries, Tang and Song poetry, and so on. The prevalence of these Markdown projects reflects the knowledge anxiety of developers. In fact, it is often more rewarding to focus on actually using a project, reading and understanding its source code, and even tinkering with a few lines of it yourself. I am a self-taught programmer. I do not particularly enjoy underlying principles and theoretical derivation; I like to pick up the keyboard and immerse myself in building things. That is why I like to find the pain points in a product and solve them with technical means.

The following figure shows the development history of the Crawlab project.

At the beginning of the project, Flask-based task scheduling was used to implement the distributed scheduling logic. The language I was most familiar with at the time was Python, and I had not yet learned Java, Golang, or C++, so I chose Python as the main programming language; it was the fastest way for me to get started, though it also foreshadowed the framework change to come.

Once we had our first users, they gave us all sorts of feedback. Among other things, they recommended Docker for deployment, which later became the first-choice way to deploy Crawlab. Someone else proposed the concept of a configurable crawler (it was not called that at the time; “configurable crawler” is my name for it), and I implemented it in Python.

Annoyingly, in v0.2.x scheduled tasks had all sorts of bugs: sometimes a task was executed twice or more; sometimes it did not run on time; sometimes it did not run at all. More worryingly, as the number of crawlers grew, the pressure on the back end grew with it, and returning a result could take a second or even several seconds. Even I had a hard time using it. So I began to consider whether the Python architecture was fundamentally unsuited to what we needed.

It so happened that I had bought a Golang course on the Juejin booklet platform at the time, and it occurred to me that I should use Golang to refactor the Crawlab back end. So, learning as I practiced, I refactored Crawlab from the Python version to the Golang version and released it as v0.3. The refactored Crawlab was several steps up, easily beating the Python version in both performance and stability: the scheduling bugs were gone, responses were no longer delayed, and concurrency was high. Even better, Golang is a statically typed language, which conveniently avoids some of the low-level type errors dynamic typing allows (at the cost of more code). I think refactoring Crawlab with Golang was the most successful decision of the project.

Compared with the popularity boost from the Golang refactoring, the v0.4.x releases were less dramatic. Many of v0.4.x’s features were driven by user feedback, including message notifications, permission management, installing dependencies from the UI, and Scrapy support. These features were developed for the many users who need to run a crawler management platform inside a company. I do not know how many companies are actually running Crawlab today, but I believe that as Crawlab keeps improving, more small and medium-sized companies, and even large ones, will be able to deploy Crawlab out of the box and recommend it to other users in need.

Lessons from the project

The lessons from Crawlab are numerous. Many people ask me what has kept me building a free product for so long; many also ask why there is no commercial version. I think these questions are entirely natural. In my opinion, an idea alone is not enough to make a good open source project, and of course the idea of making money from it will lead the project astray. Here are some of the things that I think make a popular open source project.

Look for pain points and try to solve them

Many people have pain points in their work and life. If you can spot these pain points (“pain,” note, not “itch”), you are more likely to find an opportunity to fix them. We can look around us and try to find pain points. Crawlab, for example, was born while thinking about a problem at work. My department ran hundreds of crawlers, including Selenium crawlers and other types. Our crawler management and implementation approach at the time had many limitations, which led to problems such as poor scalability and difficult troubleshooting. We had a web UI, but it served only the business and paid no attention to the crawlers themselves. I wondered whether this problem was unique to our company or common to almost every company that needed crawlers.

Of course, it is not enough just to find the pain point; you also need to validate it. To validate this hypothesis, I spent half a month building a minimum viable product (MVP), Crawlab v0.1, which had only the most basic function of executing crawler scripts. The first release was followed by a steady stream of positive feedback and suggestions for improvement. The star count reached 30 on the first day and rose to 100 over the following two days. This validated my hypothesis that difficulty in managing crawlers is a common problem: people thought Crawlab was a good idea and were willing to try it, which gave me more motivation to keep improving the product. So the problems around you are a good place to start.

Improve the product through user research

Many people develop products behind closed doors and expect users to fall in love with them. This is a trap for technical people, and we need to be vigilant not to fall into this kind of complacency. How do you understand user needs? One effective way is user research.

In “How to Build a Thousand-Star GitHub Project,” I mentioned two ways to do user research. One is to ask directly. I often ask users in the WeChat groups how they use Crawlab, what could be improved, where it is hard to use, what bugs there are, and so on. Much of the time I get feedback, sometimes important feedback. The other way is questionnaires. This approach is more objective and yields quantitative data about usage, which is very helpful for understanding users. For example, I regularly design questionnaires with the Wenjuanxing survey tool and post them in the WeChat groups, usually receiving dozens to hundreds of responses. This sample is sufficient for a survey, and Wenjuanxing helps analyze the distribution of answers for each question, so usage and demand can be seen at a glance.

Don’t underestimate the power of product promotion

This part is about marketing and operations. When you release your product, you should let users know about it and try it as soon as possible; that way you get instant feedback and the chance to improve. There are various promotion channels. First, write articles: with every release I write and publish articles on platforms such as Juejin, SegmentFault, V2EX, and OSChina, introducing new features and product plans so that more users can learn about and try Crawlab. Second, do SEO: Crawlab’s documentation site is submitted to the Baidu index, so that Baidu continuously indexes Crawlab’s pages and, according to its ranking algorithm, places brand terms such as “Crawlab” and “crawler management platform” near the top of the results. Third, build a demo platform, the simplest way for users to try the product: users first see your product and decide, based on its appearance and features, whether to install and use it further. Practice has proved this a very effective method.

Conclusion

Crawlab, as a crawler management platform, is now in its second year. Crawlab is a rising star: compared with its predecessors Gerapy, SpiderKeeper, and ScrapydWeb, it is younger, more flexible, and more practical, which is why so many people are trying it. Building open source products is a long-term endeavor, and not everyone can create an overnight sensation; patience and craftsmanship are needed. Craftsmanship here does not mean polishing the product toward perfection, but making it more grounded, more user-friendly, more satisfying to users, and better able to solve their problems. Therefore, we cannot work behind closed doors, blindly pursuing technical perfection while ignoring users’ real problems. Crawlab still has a long way to go in solving users’ problems, but we are not worried, because we now have a strong development team, a growing community, and users who constantly give feedback. I believe that in its second year Crawlab will solve more users’ problems, make crawling simple, and reach its second 5K stars.

I hope this article is helpful to your work and study. If you have any questions, add the author on WeChat (tikazyQ1) or leave a comment below; the author will try to answer. Thank you very much!

References

  • GitHub: github.com/crawlab-tea…
  • Demo: crawlab.cn/demo