Once, in a fit of pique, I wrote Vivian (see "Test-Driven Development of Nginx Configuration"). This time, while waiting for the customer's approval process, I wrote a load testing tool.

Background

The client's site is built on WordPress and hosted on an EC2 virtual machine. Interestingly, the application's MySQL database lives on the same virtual machine; a previous RDS migration failed for reasons unknown. The application and the database are like the Chopstick Brothers, stuck together, which means the whole thing cannot scale horizontally with an Auto Scaling Group. In other words, everything is on one virtual machine.

What I am trying to do is turn this back into an architecture that scales automatically, is highly available, performs well thanks to caching, consumes less thanks to monitoring, is more secure thanks to version control, and can be deployed semi-automatically through a continuous delivery pipeline. You can reread that last sentence in bold. Yes, right now they do not even have version control; all operations are done on the server by copying files over with scp and shuffling them around with mv.

Unfortunately, starting last week this "Chopstick Brothers" application began going down randomly at night, with errors suggesting the database had been deleted. The logs, however, showed that the MySQL storage engine had failed to load due to insufficient memory.

The point of splitting the "Chopstick Brothers" apart is to separate the database from the application, which requires restarting and splitting some services. These operations cause downtime, and in order to measure that downtime and make better decisions, the customer wanted to rehearse the whole procedure in a test environment that simulates the working state of production. I designed a plan covering the following points:

  1. Measure how long each operation that might cause an outage actually keeps the site down.

  2. Measure how much performance improvement RDS can provide.

  3. Identify the root cause of the downtime across the entire architecture.

  4. Find the performance inflection point under 500 concurrent user accesses.

  5. Measure the application's resource consumption.

The client had already purchased NewRelic and Flood.io (an entry I submitted for Volume 17 of the Technology Radar). However, allocating a Flood.io account required an extra round of approval before it could be used, which meant I could not use it until the next day.

I figured there might already be a tool on GitHub that met this simple need. I searched around, but there wasn't one.

So, in a fit of pique, I wrote such a test tool in Python in about two hours.

Tool design

There are only two hard things in Computer Science: cache invalidation and naming things.

— Phil Karlton

Cache invalidation and naming things are two of the hardest problems in computer science. In honor of this, I initially named the tool after the customer who made the request (Dave), but that would not be easy to remember. So Wade (Web Application Downtime Estimation) became the tool's name. It is very simple and can be found at github.com/wizardbyron…

If I wanted to know the downtime, I would have to keep making HTTP requests and log every response whose status is not 200 OK. I did not want the tool to run as an infinite loop, so I needed to add a time limit. I expected to use it in the following way:

wade -t 10 -u https://www.baidu.com

Here -t specifies how long to run (10 in this example) and -u the URL to be tested. I expected the tool to continuously print the timestamp and HTTP status code of each request.

For example:

[2018-07-05 22:30:57]status:200
[2018-07-05 22:31:08]status:200
[2018-07-05 22:31:15]status:200
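The core loop is simple. Below is a minimal sketch of what it might look like in Python with the requests library; this is an illustration of the idea, not the actual Wade source, and the function name is mine:

import time

import requests


def probe(url: str, duration: int) -> None:
    """Hit `url` repeatedly for `duration` seconds, printing each status."""
    deadline = time.time() + duration
    while time.time() < deadline:
        try:
            status = requests.get(url, timeout=10).status_code
        except requests.RequestException:
            status = 0  # connection refused or timed out: count as downtime
        print(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}]status:{status}")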

We could use BDD to construct an end-to-end automated test for a command-line tool like this. The requirement itself was very simple, and I completed it in about half an hour.
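As a sketch of what such an end-to-end check might look like (hypothetical, with pytest standing in for a full BDD framework, and assuming wade is on the PATH):

import re
import subprocess


def test_wade_prints_one_status_line_per_request():
    # Given the tool, when I run it briefly against a reachable URL...
    result = subprocess.run(
        ["wade", "-t", "5", "-u", "https://www.baidu.com"],
        capture_output=True, text=True, timeout=60,
    )
    # ...then every line should look like "[2018-07-05 22:30:57]status:200".
    pattern = re.compile(r"^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\]status:\d+$")
    lines = result.stdout.strip().splitlines()
    assert lines, "expected at least one status line"
    for line in lines:
        assert pattern.match(line)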

In practice, the program has to be running before you perform any action that might cause an outage. Since I did not know how long these operations would take, I let it run for 30 minutes.

Good news: in the tests, these operations only caused about 5 seconds of downtime.

However, I seem to have forgotten one important thing…

That is, the tool is single-process, which makes the test quite different from the real world. I needed to generate 500 concurrent HTTP requests, so I made it multi-process. I expected to use it in the following way:

wade -t 10 -n 5 -u https://www.baidu.com

-n indicates the number of processes.

With multiple processes, the output had to change. A multi-process tool's output needs to show what each process did, in a form that can be summarized. So I expected the tool to output something like this:

{'Thread': 0, '2XX': 2, '3XX': 0, '4XX': 0, '5XX': 0}
{'Thread': 3, '2XX': 2, '3XX': 0, '4XX': 0, '5XX': 0}
{'Thread': 1, '2XX': 4, '3XX': 0, '4XX': 0, '5XX': 0}
{'Thread': 4, '2XX': 4, '3XX': 0, '4XX': 0, '5XX': 0}
{'Thread': 2, '2XX': 4, '3XX': 0, '4XX': 0, '5XX': 0}

In fact, a 3XX response should count as correct in some cases, while 4XX and 5XX responses should be counted separately. So I improved the tool to bucket responses by status class.
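Below is a minimal sketch of what the multi-process version might look like; again, this is illustrative rather than the actual Wade source, and the worker/run names are mine:

import time
from multiprocessing import Pool

import requests


def worker(args):
    """One load-generating process: hit the URL until the time runs out."""
    thread_id, url, duration = args
    counts = {'Thread': thread_id, '2XX': 0, '3XX': 0, '4XX': 0, '5XX': 0}
    deadline = time.time() + duration
    while time.time() < deadline:
        try:
            # Disable redirect following so 3XX responses stay visible.
            status = requests.get(url, timeout=10,
                                  allow_redirects=False).status_code
        except requests.RequestException:
            status = 500  # lump connection failures in with server errors
        key = f"{status // 100}XX"
        if key in counts:
            counts[key] += 1
    return counts


def run(url, duration, processes):
    # One worker per requested process; print each summary as it arrives.
    with Pool(processes) as pool:
        jobs = [(i, url, duration) for i in range(processes)]
        for counts in pool.imap_unordered(worker, jobs):
            print(counts)


if __name__ == '__main__':
    run('https://www.baidu.com', duration=10, processes=5)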

After retesting, I found the answers to my questions:

I successfully migrated the database to RDS and stopped the MySQL process on the test environment instance, which yielded a 40-fold performance improvement. It turns out the application's database needs at least 10 GB of memory to work properly.

When I kept the requests going with 500 processes, I crashed the server. Input response became sluggish, and the shell returned -bash: fork: Cannot allocate memory. By reducing the number of processes, I found that a single user request takes up about 110 MB of memory, which means the host would need at least 64 GB of memory to accommodate 500 concurrent users.
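The arithmetic behind that estimate, as a quick check:

# Rough sizing from the observed numbers above.
per_request_mb = 110                    # memory per in-flight user request
concurrent_users = 500
required_gb = per_request_mb * concurrent_users / 1024
print(f"{required_gb:.1f} GB")          # ~53.7 GB, so 64 GB is the next
                                        # standard host size that fits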

As the number of concurrent users increases, memory usage increases with it. When physical memory runs out, the Swap space is used; when Swap runs out too, the application responds slowly because there is no memory left to allocate. Even after I terminated the test requests, there was no relief. My guess is that the earlier requests were queued up on the HTTP server side, and until those requests finished or timed out and released their resources, subsequent requests kept queuing up behind them.

That is… it looks like I launched a denial-of-service attack on this server.

After installing NewRelic, I found that the application performs worst when loading the home page, and that most resources are consumed by SELECT queries. So I judge that there is a problem with a table or its data that makes the load heavy. Second, a page cache could be added for the home page, or a cache on the database side, to reduce resource usage; after all, the home page is the most frequently visited page.

Finally, as the architecture evolves, we can treat the data measured by Wade as an acceptance or smoke test, integrated into a continuous deployment pipeline and executed after infrastructure changes or application deployments. We need non-functional, architecture-level automated tests to protect application architecture refactoring.
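As a sketch, such a pipeline gate could run Wade after a deployment and fail the step when error responses show up; everything here (URL, thresholds, and the dict-per-line output format shown earlier) is an assumption:

import ast
import subprocess
import sys

# Hypothetical smoke-test gate: run wade briefly against the freshly
# deployed site and fail if any process saw client or server errors.
result = subprocess.run(
    ["wade", "-t", "60", "-n", "5", "-u", "https://www.example.com"],
    capture_output=True, text=True,
)
errors = 0
for line in result.stdout.strip().splitlines():
    counts = ast.literal_eval(line)     # one per-process dict per line
    errors += counts.get('4XX', 0) + counts.get('5XX', 0)
if result.returncode != 0 or errors > 0:
    sys.exit(f"smoke test failed: {errors} error responses")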

Reflection: less is more

Without this tool, getting the answers above would mean switching back and forth between three services (AWS CloudWatch, NewRelic, Flood.io) and collecting the required data from each. With so much data, it is hard to find the one simple figure that directly reflects the problem. While I was waiting for my account to be approved, I wrote this tool. It covers the basic scenario the customer cares about and the relationships between the data points, which none of the three tools can satisfy on its own (NewRelic falls only a little short). Each is a powerful tool in its own domain and for its own customers, yet a real customer need like this one, finding the factors that affect downtime under normal pressure, can be difficult to satisfy.

So, for non-target users and scenarios, a product's rich features and data become noise that has to be filtered out. The more diverse the user scenarios a product tries to cover, the more noise it introduces, and its value-added, high-value services get lost in that noise.

Alipay and the banks' mobile apps have exactly this problem: they can do everything.

Finally

For a tool finished in under two hours, Wade lacks the variety of automated tests it ought to have. Still, as Wade's design process shows, even though I did not write automated tests, the results were consistent with the expectations I set in advance. In this sense, TDD is also an activity that records the design process of a program carried out in the brain.

If you are interested in this tool, PRs are welcome: github.com/wizardbyron…