This article is reprinted from the WeChat official account “Qiniuyun” (Qiniu Cloud).

At the ECUG conference on January 5th, Liu Qi, founder and CEO of PingCAP, gave a talk titled “Chaos Engineering at PingCAP”, sharing his experience with and thinking about Chaos Engineering. The following is a transcript of his speech.

Let me start with TiDB. TiDB is a distributed database that is compatible with the MySQL protocol and the Spark API. It is currently one of the largest, most widely used, and most popular NewSQL databases.

A typical TiDB scenario looks like this: when people would otherwise have to shard their data across databases and tables, with TiDB there is no need to bother with that. Another scenario is a complex workload, for example OLTP and OLAP coexisting in the same system. Usually when you pick a database you have to decide whether it is for OLAP or OLTP, but users do not want database people to lecture them on what OLAP and OLTP are; they just want SQL to run fast enough and things to stay as simple as possible. Other users, for example in Japan and the US, migrate from Amazon Aurora to TiDB because Aurora struggles once the data reaches tens of terabytes.

What is PingCAP? You probably know TiDB but not PingCAP. In fact, TiDB is developed by PingCAP.

So what is CNCF? CNCF stands for the Cloud Native Computing Foundation. Many people know Kubernetes but not CNCF, just as many people know TiDB but not PingCAP; Kubernetes is the hottest project under the CNCF. At present, PingCAP ranks sixth globally in overall code contributions to CNCF projects; Huawei ranks seventh, just below us, and these are the only two Chinese companies in the top eight.

Next, we will use TiDB as an example to demonstrate how Chaos Mesh is used. First, let us look at the architecture of TiDB, shown below. Why bring up the architecture of TiDB?

As shown, TiDB is roughly divided into two layers:

  • The computing layer is commonly known as the SQL layer.

  • The storage layer includes TiKV, which uses a row-oriented storage model, and TiFlash, which uses a column-oriented storage model.

In addition, there is the scheduler, PD (Placement Driver), which is responsible for global control of the entire system.
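For readers who want to map these components onto a concrete deployment, here is a minimal sketch of how such a cluster might be declared with TiDB Operator’s TidbCluster custom resource on Kubernetes. This example is not from the talk; the field names follow the TiDB Operator documentation, but the manifest is heavily trimmed and the exact schema depends on the operator version.

```yaml
# Illustrative, trimmed TidbCluster manifest (not from the talk).
# Required fields such as images and storage requests are omitted.
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: basic
spec:
  pd:
    replicas: 3        # PD: the scheduler / global controller
  tidb:
    replicas: 2        # computing (SQL) layer
  tikv:
    replicas: 3        # row-oriented storage
  tiflash:
    replicas: 1        # column-oriented storage
```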

When we got into this, I did not know it would be so hard; only later did I understand why, and after hearing me out you may find it even harder than you thought. Over the years we have run into compiler bugs, operating system bugs, and file system bugs, and we have even been bitten by silent data loss; fortunately, we eventually fixed them. These are problems most people rarely encounter, but we have encountered them all. We also hit a batch of disks purchased by a cloud storage vendor whose hardware drivers had problems: part of the data read back was not the same as the data that had been written. I do not know whether any of you have seen these magical phenomena, but as a database vendor, we have.

This is a GitHub error page. You have probably seen it before; if you are lucky, it shows up only a couple of times a year. What does this page tell us? It tells us to refresh and try again. By saying that, GitHub shows it anticipates failures and has a way of dealing with them: when a problem occurs, users roughly know what to do and the site can recover. Does our own system have the same kind of mechanism? Recently, though, GitHub went down again, and this time it seemed to take hours to recover.

These black swans are what we have been living with for years. Problems I never hit in nearly ten years of writing programs before have all shown up in the past few years. You believe all swans are white until you see a black one, and you believe your stack is solid until compiler and operating system bugs get in your way.

This is an operating system bug we encountered, and we later wrote an article describing how we found it. There was a bug in the Linux kernel that caused the page cache to fail to flush to disk. My point is that operating system bugs are closer to us than we think: no matter how well we write the application on top, we may still be hit by problems in the operating system underneath. But when we write programs, we usually do not think that far.

Everyone assumed file systems were stable and free of bugs, and nobody realized how fragile they were until a tool appeared that was originally built to find security vulnerabilities and was later used to find other system bugs. Take that tool and fuzz the existing file systems, and here is how long it took to hit the first bug in each of them.

These file systems looked pretty solid at the time. Ext4 was considered quite good, and it was: it took two hours to find the first bug. The two file systems we used to recommend were XFS and ext4, but after we were burned by an XFS issue we dropped it, so now we only recommend ext4. It is an interesting tool because you can see at a glance how mature each file system is: whoever holds out the longest is, in practice, the most stable file system. So sometimes we have to trade off stability against performance depending on the business requirements.

Speaking of Chaos Engineering, I do not know what comes to your mind. In Chinese, “Chaos” is translated as hundun (混沌); the concept may sound a little vague, but today we will see what kinds of problems it can solve.

Here are a few articles I wrote about distributed system testing when I had just started the company. Even then I knew that testing distributed systems is incredibly difficult. I remember having to go through hundreds of microservices looking for the one that had died, which was interesting: if that one shaky microservice out of hundreds happens to be related to your storage or to your login, how does the whole system react? Users cannot log in, or logging in becomes very slow, or every operation becomes sluggish.

One of the pioneers in this history was Netflix, which created something called Chaos Monkey: in a system with hundreds or thousands of redundant services, it randomly kills one of them and you see what happens.

I do not know whether any of you have actually done this in production, but I suspect not. We talked about black swans earlier, and beyond black swans there is also Murphy’s law: anything that can go wrong will go wrong. If you do not kill a node yourself, it will be killed by something else sooner or later, and all kinds of accidents will happen.

One of the most inspiring things is that in 2014 Netflix created a new position called Chaos Engineer, an engineer whose job is to randomly kill nodes in production. You have probably been through this situation: in the middle of the night, development suddenly gets pulled out of bed, everyone is staring at the problem at the same time, and nobody knows what it is. The first reaction is “it’s not my problem,” and you have to say it fast. So whose problem is it? In fact, the boss does not care whose problem it is; the boss cares whether it can be found as soon as possible and when service can be restored. A system like this helps you a great deal.

At the end of December 2019, we launched this project on GitHub, called Chaos Mesh. It is one of the fastest-growing projects by GitHub stars.

The idea is very simple. Usually your system has a normal, steady state that everybody understands. Starting from that steady state, you make a hypothesis: if I kill a node, what do I think should happen? Then you run the experiment, and in my experience the hypothesis is almost always wrong. You assume that after killing the node the system stays stable, that after a few seconds there is little impact or only a small wobble. That is probably not a good assumption, because the failure may start to affect other components, which in turn affect other services; it is like a traffic jam in the system, and once the jam forms, it stays there as long as its cause is there.

Then you verify, find out that the assumption behind the experiment you just ran does not hold, improve your system, and go around the loop again. The idea is that simple.

Above is the diagram we designed for the whole system. I do not know how you feel when you see the monkey; my first reaction was that Chinese and foreign cultures are strikingly similar. Abroad it is called the Chaos Monkey; what do we call it in China? Among the Four Great Classical Novels, which one features an extremely powerful monkey? It had never occurred to Heaven that one monkey could have such an impact, and in the end the whole system had to be changed so that the monkey could be fitted into it.

It is exactly the same idea as the Chaos Monkey: you integrate the Monkey into your system, and eventually you have to create a job for it, a position for it, a process for it. In this respect, thousands of years of Chinese culture suddenly have something in common with foreign culture: everyone is the same, everyone needs a monkey, and one monkey is very effective. Before the Monkey appeared, Yang Jian was invincible and Heaven’s immortal peaches had never been stolen and eaten.

The picture above shows the reaction after we open sourced it: it immediately hit #1 on GitHub Trending for Go and reached the front page of Hacker News, ranking 10th at the time. People all over the world found it great, exciting, and inspiring, and so did we.

In fact, we all want to invent a monkey. It might be an ordinary monkey, or it might be Sun Wukong. What kind of monkey do we actually need? So before we built this system, we did a survey: what capabilities should the Chaos Mesh we build have, and what had we already been doing ourselves?

Take CPU burn, for example: it is the equivalent of a process in your system that burns up the CPU with an infinite loop, or worse, a multithreaded infinite loop that eats almost all of the CPU. What does that look like in practice? It is a bit like CPU overheating: after the CPU overheats, it starts to lower its frequency, so your system suddenly slows down, perhaps because of poor ventilation in the machine room. The worst part is that the boss does not know why, and when nobody knows whose problem it is, everyone gets called up. Mem burn is similar: on a machine with 20 GB of memory, a program suddenly takes 19 GB, and you can easily run out of memory. Simulating a memory leak is easy to understand and easy to do. This is not available in Chaos Mesh yet, but it will be soon.
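For a sense of what such an experiment looks like in practice, here is a sketch of a StressChaos resource, the Chaos Mesh CRD that later shipped for CPU and memory pressure. It is not from the talk (the speaker notes the feature was still on the way at the time), and the field names follow later Chaos Mesh documentation, so they may differ from whatever version you run.

```yaml
# Illustrative only: a StressChaos experiment that burns CPU and memory
# on one TiDB pod. Field names follow later Chaos Mesh releases and may
# differ across versions.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-mem-burn
  namespace: chaos-testing
spec:
  mode: one                      # pick a single pod from the selector
  selector:
    labelSelectors:
      app.kubernetes.io/component: tidb
  stressors:
    cpu:
      workers: 4                 # four busy loops
      load: 100                  # each at 100% load
    memory:
      workers: 1
      size: "19GB"               # grab 19 GB on a 20 GB machine
  duration: "10m"
```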

Having talked about so many benefits, how about some real combat? We somewhat sheepishly posted some of the bugs the monkey found in our own system.

A former colleague of ours posted about Chaos Mesh and said, “I used to think I was pretty good at writing code.” I asked, “And then?” He said, “Forget it, I’ve been fixing bugs for weeks.” Trust me, when your application faces this it will be no accident; it is the same for everyone, because you have never tested these cases, and it would be strange if the system behaved normally in a situation where you have never even defined what normal means. Have you ever wondered: if you write data to disk and then read back different data, how does your program behave? It is scary. One program did not even crash, because it never verified what it had written; there was no checksum after the data, so it read the corrupted bytes back and used them, and god knows what that turns into. And the point is that it does not die; it keeps running through the system and affects other things, like a virus replicating itself, with unpredictable results. What can you do? This system does it for you. Here are some bugs we found ourselves.

While the workload was running, we killed one of the storage nodes from the earlier diagram. We expected QPS to drop and then recover. And it looked like yes: QPS dropped to zero, and after a few seconds it seemed to come back up, so everything appeared fine. But when we looked at the whole picture, we found it actually took quite a long time, about 10 minutes, for QPS to return to normal. The expectation was a very quick recovery, but that is not what happened. Needless to say, we found a bug behind it later.

K8s has the advantage of being easy to install: three steps and you are done. After that, I define the behavior. For example, if I want to kill random nodes in the system, how do I do it? I choose which pods are candidates to be killed and how the kill happens, say once every 2 minutes; the configuration is easy to understand. We just need to apply the YAML file, and when we no longer want to run the experiment, we simply stop it. That moment, when the experiment stops and the system is supposed to recover, is when people are most nervous, because there is a high probability that your system has not recovered. That is our experience: we always find problems.
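As a concrete illustration (not taken verbatim from the talk), a recurring pod-kill experiment along these lines can be expressed roughly like the early Chaos Mesh examples below; the exact schema, in particular the scheduler/cron part, has changed across Chaos Mesh versions.

```yaml
# Illustrative sketch of a recurring pod-kill experiment, in the style of
# early Chaos Mesh examples. Field names vary between Chaos Mesh versions
# (newer releases wrap recurring experiments in a Schedule object).
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-one-tikv
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one                      # kill a single randomly chosen pod
  selector:
    namespaces:
      - tidb-cluster             # hypothetical namespace of the TiDB cluster
    labelSelectors:
      app.kubernetes.io/component: tikv
  scheduler:
    cron: "@every 2m"            # repeat every 2 minutes
```

Applying it is an ordinary `kubectl apply -f` of the manifest, and deleting the resource stops the experiment, which is exactly the nervous moment described above.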

This is a graph where we kill a node every 5 minutes, and it looks like the QPS will always come back. Hopefully, your online system looks like this, and it still looks good.

You can think of K8s as an operating system: a Node corresponds to a machine, an operator corresponds to systemd, a pod corresponds to a process, and a sidecar corresponds to a thread. That last one takes a little time to sink in.

In fact, all of the above can be summed up as “All in K8s”: the whole system is built entirely on K8s, and if your system does not run on K8s, you cannot use it to run these experiments. One interesting thing: we were talking to a user in the US about what deployment mode we use, and I said we use Ansible. They asked, do you have K8s? If not, I’m not going to look at it. In the US, K8s is the politically correct choice; if your system is not running on K8s, there is nothing to talk about and nobody will engage. It is like a blind date: do you have a car, do you have a house? If not, goodbye. So if you have not learned it yet, learn it quickly, or you will have trouble communicating with Americans in the future.

This is the overall CRD structure, focusing on several common faults that can be constructed through CRDs, such as network partitions, packet loss, packet retransmission, bandwidth limits, and so on. Then there is the file system: we can interfere with file operations so that the data read back is not the same as the data written, or make a write fail, or make a read fail.
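As an example of the network side (again an illustrative sketch rather than a configuration from the talk, with field names following the Chaos Mesh documentation), a partition cutting one TiKV pod off from the others might be declared like this:

```yaml
# Illustrative NetworkChaos experiment: partition one TiKV pod from the
# others. Field names follow the Chaos Mesh docs and may vary by version.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition-tikv
  namespace: chaos-testing
spec:
  action: partition              # other actions include loss, delay, duplicate
  mode: one
  selector:
    labelSelectors:
      app.kubernetes.io/component: tikv
  direction: both                # cut traffic in both directions
  target:
    mode: all
    selector:
      labelSelectors:
        app.kubernetes.io/component: tikv
  duration: "30s"
```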

Kubernetes has an API Server, and the model is easy to understand: you can think of N monkeys, each of which does only one thing. One monkey only handles network faults, another only handles I/O faults, and each is its own CRD. For the network we use iptables, and as you know iptables can do all sorts of things to traffic, such as isolating a node or letting data in but not out; for I/O we use FUSE.

These are common faults. Let us look at I/O delay. We once hit a problem where a single write to a cloud disk took 5 seconds; if flushing a disk inside a virtual machine takes 5 seconds, there must be something wrong with that disk. I remember someone wrote an article asking: what is worse than a disk that dies? A disk that suddenly runs at a tenth of its normal speed and you do not know why. That is worse than one that simply breaks.
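To make that failure mode reproducible, an I/O delay experiment could be sketched as an IOChaos resource like the one below. This is not from the talk; the fields follow later Chaos Mesh documentation (early versions injected I/O faults through a sidecar with a different configuration format), and the paths shown are hypothetical, so treat it as illustrative.

```yaml
# Illustrative IOChaos experiment: add a 5-second latency to half of the
# file operations under TiKV's data directory. Fields follow later Chaos
# Mesh releases; early versions configured I/O faults differently.
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-delay-tikv
  namespace: chaos-testing
spec:
  action: latency
  mode: one
  selector:
    labelSelectors:
      app.kubernetes.io/component: tikv
  volumePath: /var/lib/tikv      # hypothetical mount path of the data volume
  path: "/var/lib/tikv/**/*"
  delay: "5s"                    # mimic the 5-second cloud-disk write
  percent: 50
  duration: "2m"
```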

Here’s how it works.

As for future plans, they include a Verifier on top of which more functions can be added, for example specifying what operations to perform at a certain point in time, or even replacing data in the middle of a run and then observing whether the system state goes wrong. Of course, it will also get a nice interface; right now everything is driven from the command line. We also want to support fault injection for services on the cloud, rather than assuming those services are always fine. Programs are written by humans, and black swans will appear.

Finally, I would like to talk about the importance of observability for the whole system. I do not know whether you have done database operations. Usually when the business side says “I have a problem,” my first reaction is: what does your workload look like? Do you read more or write more, and how much of each? The other side says, “I’m not really sure.” What do you do then? At that point you need a system that tells you what the business side is doing without them having to explain it to you, because their explanation may not be accurate; they have their own understanding, and everyone’s understanding differs. That is the scary part, and it is also why, if you have a system that is good at observing itself, nobody has to tell me anything.

Roughly what does this system show? Look at the picture above. What does the brightest yellow mean? It is the write hotspot of the entire system. What is a write hotspot? It shows up as a bright diagonal line like this one. What does that line tell us? Without knowing anything else about the workload, one glance tells me there are six tables in the system that are continuously appending data. Likewise, if we see a large purple region that is very evenly distributed, it means writes there are fairly random. So with this view you do not have to interview anyone; one look tells you what is going on: which tables are appending, which tables are being written randomly, and which are hot. What is the benefit? When there is a problem you can locate it quickly; with such a system everything is clear.

The project is at pingcap/chaos-mesh on GitHub, and we welcome you to follow us on Twitter: @chaos_mesh. That is all I have to share with you. Thank you.

Read the original: mp.weixin.qq.com/s/dYv7neg6P…