Author: Yin Chengwen, Maintainer of Chaos Mesh

Beijing has been bitterly cold lately. Friends keep posting photos of temperatures around minus 20 degrees Celsius, and in weather like this I always want to find something warm to do. Chaos Mesh recently reached the first anniversary of its open source release (December 31, 2020), so I want to look back and share my experience of growing together with Chaos Mesh, partly to celebrate and partly in the hope of bringing everyone a little warmth this cold winter.

Falling in love with PingCAP

Before we start the Chaos Mesh story, let me tell you a little bit about myself and PingCAP.

My first real contact with PingCAP was in 2016, when I attended a tech talk by PingCAP CTO Huang Dongxu. At the time I was working on a Go project, followed the Go ecosystem closely, and admired TiDB, a star project in the Go community. I expected an in-depth talk covering databases, the CAP theorem, and more. Instead, Dongxu ended up chatting with us for an hour about Unix philosophy. What happened to the databases we were promised? What about the CAP theorem? I believe many people in the audience felt the same way I did: completely dumbfounded. But it made PingCAP all the more interesting to me.

My next contact with PingCAP came when I accompanied classmates to Beijing for an interview. I happened to see a PingCAP internship posting in the Go community and was immediately drawn to it. Encouraged by my classmates, I submitted a resume to try my luck. At about 8 o'clock that evening, I got a call from Brother Qiu (Cui Qiu, one of PingCAP's founders), who said they were out on a TB (team building), eating skewers and watching football at a barbecue restaurant, and asked whether I wanted to come over for a chat. I was shocked. Who invites someone to an interview at a barbecue restaurant at night? When we arrived, they really were watching football. I remember it was a Chinese national team match, and this remarkable interview went on alongside the game. In the end the match finished and the Chinese team lost, but I received an offer. Thank you, Chinese team, for giving me this chance!

Before Chaos Mesh

That is how I got involved with PingCAP. Now here is how I got involved with Chaos Mesh.

Before I started my internship at PingCAP, I attended a Meetup organized by PingCAP one Saturday morning. The small conference room was packed, and most people were standing. I remember one of the talks was "Deep Exploration of Distributed System Testing" by Liu Qi, another PingCAP founder and its CEO. Qi's talk left a deep impression on me. For the first time, I learned that testing could be done this way, with all kinds of fault injection methods designed to abuse our systems. Looking back now, what Qi shared that day was essentially the idea of chaos engineering, and I had no idea then that this topic would become my ongoing work in the years that followed.

After I started my internship, my first task was to run performance stress tests on TiDB. This is a very simple task if all you want is to produce a set of numbers. However, if you need to find the performance bottlenecks in the current cluster and work out how to optimize the cluster topology, it is not so easy. It was because of this task that I began to learn TiDB's architecture and the legendary dark art of parameter tuning. You may think this has nothing to do with chaos engineering, but it does. In chaos engineering, status checks and load simulation are two essential steps. From this task onward, much of my work has involved testing, or making mischief.

The CTO's mischief

Most of the time we want the system environment to be as stable as possible, but reality is often different. To better verify the system's stability and its ability to recover quickly, our CTO Dongxu launched a surprise attack on a business test cluster in our IDC. It was the eve of a very important user going live, and a business workload was running on our internal test cluster. To test the system's reliability and its ability to recover from failures on its own, Dongxu logged on to the TiDB servers late at night, forcibly deleted physical files, rebooted machines, and pulled all kinds of wild stunts. I remember many of our senior engineers breaking out in a cold sweat. Fortunately, in the end our system stood up to the CTO's mischief.

Programmers are lazy, and after this incident we started plotting how to be lazy. For one thing, manual testing is hard to sustain. For another, testing TiDB comprehensively is a real problem: building a database is not the hard part, but proving the correctness and robustness of a distributed system is very challenging. Doing that efficiently and automatically, so that every release is put through all kinds of trials, is an even bigger challenge. So we set out on the road to automated chaos engineering, and the Schrodinger project took off.

Starting the Schrodinger journey

The name of the project reflects our design philosophy: think of the cluster under test as a cat in a box, keep injecting faults into the box, and finally check the state of the cat.

In technical terms, Schrodinger's core idea is simple. With K8s as the foundation, different TiDB test clusters and test cases run in a controlled container cluster (the box), and faults are then injected at the underlying container platform.
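The box idea can be illustrated with a small Python sketch. This is not Schrodinger's actual code: `ToyCluster`, `kill_replica`, `restart_replica`, and `run_box` are hypothetical names, and the majority-of-replicas health rule is an assumption made only for the example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical stand-in for a test cluster running inside the box."""
    replicas: int = 3
    down: set = field(default_factory=set)

    def healthy(self) -> bool:
        # Assumed rule for this toy: the cluster is healthy while a
        # strict majority of its replicas is still up.
        return len(self.down) < self.replicas / 2


def kill_replica(cluster: ToyCluster, rng: random.Random) -> None:
    """Fault: take one running replica down."""
    up = [i for i in range(cluster.replicas) if i not in cluster.down]
    if up:
        cluster.down.add(rng.choice(up))


def restart_replica(cluster: ToyCluster, rng: random.Random) -> None:
    """Recovery: bring one failed replica back."""
    if cluster.down:
        cluster.down.remove(rng.choice(sorted(cluster.down)))


def run_box(cluster: ToyCluster, rounds: int, seed: int = 0) -> list:
    """Inject a fault, check the cat, recover; return every check result."""
    rng = random.Random(seed)
    results = []
    for _ in range(rounds):
        kill_replica(cluster, rng)
        results.append(cluster.healthy())
        restart_replica(cluster, rng)
    return results
```

Calling `run_box(ToyCluster(), rounds=5)` keeps a three-replica toy cluster healthy through every round, while a single-replica cluster fails each check. A real setup would of course inject faults into actual containers and verify the database itself.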

When we started the Schrodinger project in September 2017, our first challenge was how to manage multiple TiDB clusters. We had two options at the time: one was to manage the clusters with the mature TiDB Ansible; the other was to use TiDB Operator, which we had only just started building. After a brief conversation with the boss, Schrodinger became TiDB Operator's first user, primarily because of our unwavering commitment to cloud native.

Schrodinger, some time later

After more than a year, the Schrodinger project stabilized. Meanwhile, as the TiDB ecosystem matured and peripheral tools such as TiCDC, TiDB Data Migration, and TiDB Lightning appeared, the demand for testing kept growing. Gradually, we found it harder and harder to bring these tools into Schrodinger. I think it was in early 2019, over dinner with the head of the department, that we talked about this. He proposed that we abstract the fault injection capability out on its own, define faults as CRD objects, and watch and manage those fault objects with a controller. Hearing this, I felt as if a new window had been opened.
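The shape of that idea, a declarative fault object plus a controller that reconciles it, can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not the Chaos Mesh implementation: `FaultSpec`, `FaultController`, and the field names are hypothetical, and a real controller would watch the Kubernetes API instead of being called directly.

```python
from dataclasses import dataclass


@dataclass
class FaultSpec:
    """Hypothetical declarative fault object, playing the role of a custom resource."""
    name: str
    target: str            # component the fault should hit
    action: str            # illustrative label for the kind of fault
    deleted: bool = False  # set to True when the user removes the object


class FaultController:
    """Toy controller that moves actual state toward what the objects declare."""

    def __init__(self) -> None:
        self.active = {}  # fault name -> target currently under fault

    def reconcile(self, spec: FaultSpec) -> str:
        if spec.deleted:
            # The object is gone: recover the target if a fault was applied.
            if self.active.pop(spec.name, None) is not None:
                return "recovered"
            return "noop"
        if spec.name not in self.active:
            # The object exists but the fault is not applied yet: inject it.
            self.active[spec.name] = spec.target
            return "injected"
        # Declared and actual state already match.
        return "noop"
```

Reconciling the same object twice injects the fault only once, and marking it deleted triggers recovery. That idempotent behavior is what makes faults manageable as ordinary objects rather than one-off scripts.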

Chaos Mesh this year

In September 2019, I submitted the first commit of Chaos Mesh, half a year overdue. As is probably the case with many projects, the first Chaos Mesh commit was a single line that initialized the README file.

After a month of development, Chaos Mesh finally had its most basic functions. Around this time the second Chaos Mesh developer, Keao, joined. He was still an intern then, but he was already a formidable engineer. Over the following month, Keao pushed for using Kubebuilder to replace the original controller, which further streamlined the logic of the multiple controllers in Chaos Mesh and brought the project closer to the ecosystem. At first it was a little frustrating to watch my code being optimized away.

The road of open source

After more than three months of intensive development, during which many chaos tests were successfully migrated to Chaos Mesh, we decided to open source the tool by the end of the year. We hoped it could help those who needed it, and we also hoped the community would help drive Chaos Mesh forward.

The days before the open source release were the busiest for all of us (me, Jango, and Keao): testing, documentation, videos, and articles. Some may ask: isn't open sourcing just a matter of making the code public? From our past experience, simply publishing the code is far from enough for a good open source tool. If you want users to adopt it quickly and with confidence, the necessary tests, introductory tutorials, and explanations of how it works are all essential, and thorough documentation is especially important.

Implementing a feature is often easy; making it easy for users to pick up quickly takes far more effort. We often see users of open source tools stumble from one pitfall into the next, and the cause is usually missing documentation. If that happens to enough users, the tool will eventually be abandoned by the community. To release a tool that was ready to use out of the box, we worked hard in the days leading up to the release to close the gaps, test, and improve the documentation. Hard work pays off, and we open sourced Chaos Mesh on the last day of 2019.

Chaos Mesh hit the front page of Hacker News the day it was announced, and topped the GitHub Go trending list for several days. Its popularity was beyond my expectations, and I felt happy and under pressure at the same time.

Joining CNCF

Cloud native has been in the project's DNA since the inception of Chaos Mesh, and becoming the de facto standard for chaos engineering in the cloud native world has always been our unwavering goal. To better achieve that goal and let more people around the world benefit from Chaos Mesh, and drawing on how rapidly the TiKV project grew after it was hosted by CNCF, we started exploring hosting Chaos Mesh at CNCF soon after it was open sourced.

After some investigation, we found that the CNCF ecosystem happened to lack a project that promotes a standard for chaos engineering, which strengthened our determination to host Chaos Mesh at CNCF. I have to say that PingCAP's idealism and firm global strategy allow us to pursue any project without reservation, thinking not about a few people but about the whole ecosystem and the whole world. That made us all the more determined to share Chaos Mesh with the world, and CNCF was the best choice and platform for doing so. After a short preparation, we began the long application process.

The application period had its share of nerve-racking episodes. CNCF changed its rules for selecting Sandbox projects, which directly delayed our application, and a company with the same name as Chaos Mesh even appeared. All these surprises kept us on edge for a while. At the same time, to better support the community's growth, we built a more complete automated testing process, launched the official Chaos Mesh website, added developer guides, and more, laying a solid foundation for the future of the Chaos Mesh community. Finally, at the CNCF TOC meeting in July 2020, Chaos Mesh was approved as a Sandbox project.

Joining CNCF was an important milestone for Chaos Mesh in the past year, and it also had a profound impact on me: it was the first time I gave a talk in English and the first time I organized a community meeting with others, and I gained new insights and ideas about the goals of the Chaos Mesh project.

Transition

As the Chaos Mesh project grew, the small team grew too, bringing in new people: professional front-end engineers and more experienced partners. The community began to thrive. I remember that when we first started developing new features, we would start working on something simply because it seemed to make sense. Looking back, that model was efficient, but it lacked deliberation, and we often implemented features with no practical use. The growth of the community forced us to change our roles, because Chaos Mesh was no longer a project of a few people but a community project, and we were just members of that community. Chaos Mesh has also gradually built up a user base, and every PR we submit has to be accountable to the community and its users. We created an RFCs repository to collect and discuss Chaos Mesh requirements and designs. This way, our designs can be discussed and endorsed by the community, and to some extent the process also attracts more partners to help the Chaos Mesh project grow.

We have always believed that a good open source project cannot be built by one person or a few people alone, and Chaos Mesh has done a lot of work to attract more people to join the project.

In addition to the RFCs repository, we categorized Issues in more detail, and we use Issues to publicize the problems we run into and our release plans. For those who are new to Chaos Mesh, Issues labeled "Good First Issue" are a good place to start. If there are features you would like to see added, feel free to leave a comment on our Release Issues. We also provide complete developer documentation to help developers quickly start their Chaos Mesh development journey. Of course, if you want to talk with us further, you can join the CNCF Slack and search for the #project-chaos-mesh channel to take part in our discussions. We also hold community meetings on the last Thursday of every month to discuss Chaos Mesh issues and follow-up plans, and we regularly invite partners in the community to share their Chaos Mesh experiences. We look forward to more partners joining us to create a more open and friendly Chaos Mesh community!

I won't go into much detail here about the growth of the Chaos Mesh project itself. If you are interested, look out for our first anniversary summary of Chaos Mesh.

In closing

At the end of 2016 I joined PingCAP by chance, and just as unexpectedly I became attached to chaos engineering. Over these four years, I have witnessed the exploration and application of chaos engineering at PingCAP, and I have watched Chaos Mesh go from an idea to an implementation. None of this is an epic or triumphant tale; it is just a bit of personal sentimentality at the special moment of this anniversary. Time moves on, and my Chaos Mesh story has only just begun. In the days ahead there will be more challenges and more wonderful things! I will work with the community to make Chaos Mesh a true chaos engineering standard! Chaos Mesh, to the future!

Finally, here is the link to the Chaos Mesh community survey. Fill it in for a surprise: bit.ly/2LzES5o