Tasteless to eat, yet a pity to throw away

At a time when "quarterly or monthly releases" of enterprise applications are considered industry best practice, it is common to maintain a complete environment for integration testing before deploying the application to production. However, the integration test environment and the integration tests themselves suffer from the following problems:

  • The environment itself is fragile and usually involves some manual configuration, so the cost of maintaining it is high;
  • Because of environmental factors, integration tests are unstable, unreliable, and slow to give feedback. Failures are hard to diagnose, and the tests repeat coverage of functionality already verified by isolated component tests.

Integration testing thus becomes a bottleneck for continuous delivery: a "chicken rib", as the Chinese idiom goes, tasteless to eat but a pity to throw away. Therefore, the latest ThoughtWorks Technology Radar (Issue 16, 2017) recommends that companies hold off on building enterprise-level integration test environments and instead release key components incrementally to production. Incremental release relies on several important techniques: contract testing, decoupling release from deployment, focusing on mean time to recovery, and QA in production.

Technical feasibility of Danshari (decluttering)

Here is a look at each of the four techniques recommended by the Technology Radar: how to ensure the quality of your application without system-level integration testing, and how independent, incremental releases can help your organization.

Consumer-driven contract testing

Consumer-driven contract testing is an important part of microservice testing, mainly used to cover the contractual relationship between two services. The following example illustrates what contract testing is and how it differs from API testing.

Suppose there is a socket at home. When buying an appliance, you need to make sure its plug matches the socket; in other words, there is a contract between the socket and the plug.

The contract test here is a compatibility test between the socket and the plug, covering:

  • Three-prong or two-prong;
  • Voltage: 220V or 110V;
  • Pin shape: British standard or Chinese standard;
  • and so on.

An API test, by contrast, is a functional test of the socket itself, and needs to cover:

  • The socket supplies power normally;
  • The voltage is 220V as expected;
  • The grounded pin is actually grounded;
  • and so on.

That is to say, API tests need to cover all aspects of the API's own functionality, while contract tests focus on the format, number, and types of the parameters in API calls, and do not necessarily involve the API's functionality or specific data.

For more on consumer-driven contract testing, see the article Consumer-Driven Contracts: A Service Evolution Pattern.
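To make the distinction concrete, here is a minimal consumer-side contract check in Python. It does not use a real contract-testing framework such as Pact; the contract fields and the provider payload are illustrative assumptions, not an actual API.

```python
# A minimal consumer-driven contract check, written without a contract
# framework such as Pact; all names and fields are illustrative only.

# The contract the consumer relies on: field names and their types,
# not the concrete values the provider happens to return.
ORDER_CONTRACT = {
    "id": int,
    "status": str,
    "total": float,
}

def verify_contract(response_body: dict, contract: dict) -> list:
    """Return a list of contract violations (empty list means compatible)."""
    violations = []
    for field, expected_type in contract.items():
        if field not in response_body:
            violations.append(f"missing field: {field}")
        elif not isinstance(response_body[field], expected_type):
            violations.append(
                f"field {field}: expected {expected_type.__name__}, "
                f"got {type(response_body[field]).__name__}"
            )
    return violations

# In a real setup this payload would come from the provider's test double
# or a recorded interaction; here it is a hard-coded stand-in.
provider_response = {"id": 42, "status": "paid", "total": 99.5}

assert verify_contract(provider_response, ORDER_CONTRACT) == []
# A provider that renames "total" to "amount" breaks the contract:
assert verify_contract({"id": 1, "status": "new", "amount": 3.0},
                       ORDER_CONTRACT) == ["missing field: total"]
```

Note that the check passes for any concrete values as long as the shape matches, which is exactly the point: the contract test covers structure, not the provider's functionality.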

Decoupling release from deployment

Deployment means installing components or infrastructure into the production environment; this is invisible to users and does not affect the service. Release means making the deployed components visible to users, which does affect the service. By using feature toggles, you can decouple deployment from release, achieving continuous deployment with controlled release and reducing the risk of component changes. The product manager can then flexibly control which features are released to end users according to business needs, helping maximize business value for the enterprise.
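The feature-toggle idea can be sketched as follows. The toggle name, the in-memory toggle store, and the checkout function are illustrative assumptions; a real system would read toggles from a config service or database so they can be flipped without redeploying.

```python
# A minimal feature-toggle sketch: the new code is deployed, but the
# feature stays invisible to users until the toggle is flipped.

FEATURE_TOGGLES = {
    "new_checkout_flow": False,  # deployed, but not yet released
}

def is_enabled(feature: str) -> bool:
    # In production this would read from a config service or database,
    # so toggles can be flipped without a redeploy.
    return FEATURE_TOGGLES.get(feature, False)

def checkout(cart):
    if is_enabled("new_checkout_flow"):
        return f"new flow: {len(cart)} items"
    return f"old flow: {len(cart)} items"

print(checkout(["book", "pen"]))             # old flow until release
FEATURE_TOGGLES["new_checkout_flow"] = True  # "release" without redeploying
print(checkout(["book", "pen"]))             # new flow now visible
```

The key property is that flipping the toggle is a release decision the product manager can make independently of the deployment that shipped the code.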

Focus on mean time to recovery

Consider these two scenarios and decide which is better:

  • System outages are rare, perhaps once or twice a year, but recovery capability is very weak: once the system goes down, it takes at least 24 hours to restore.
  • System outages occur frequently, but the system recovers so quickly that users do not even notice.

In the first scenario, the mean time between failures is long, but the impact of each failure on users is severe. The second scenario is better: although failures are frequent, the mean time to recovery is very short and the user experience is unaffected.

Traditional Ops teams focus on how often failures occur. With the development of continuous delivery and monitoring technologies, "quick recovery" has become feasible. Instead of worrying about errors and failures occurring, monitor and analyze them so that the system can recover rapidly; this can save some complex integration tests and also reduce the impact of ever-present security attacks.
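The trade-off between the two scenarios can be quantified with the standard availability formula, availability = MTBF / (MTBF + MTTR). The concrete outage numbers below are illustrative assumptions, not figures from the text:

```python
# Back-of-envelope availability comparison for the two scenarios above,
# using availability = MTBF / (MTBF + MTTR).

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, given mean time between
    failures (MTBF) and mean time to recovery (MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Scenario 1: roughly two outages a year, each taking 24 hours to recover.
rare_but_slow = availability(mtbf_hours=365 * 24 / 2, mttr_hours=24)

# Scenario 2: an outage every week, each recovered in about 3 minutes.
frequent_but_fast = availability(mtbf_hours=7 * 24, mttr_hours=0.05)

print(f"rare failures, slow recovery: {rare_but_slow:.4%}")
print(f"frequent failures, fast recovery: {frequent_but_fast:.4%}")
assert frequent_but_fast > rare_but_slow
```

With these assumed numbers, the frequently-failing-but-fast-recovering system is up a larger fraction of the time, which is why shifting attention from failure frequency to MTTR pays off.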

QA in production

The production environment is used by real users and, unlike the test environment, product functionality usually cannot be tested there directly. Simply carrying the QA techniques of the test environment over into production is not straightforward; one key technique that does apply in production is monitoring. Monitoring captures and analyzes information about the production environment, which can then be used to optimize the development and testing process as well as the enterprise's business. For more on QA in production, see the article QA in Production.

Contract testing helps with continuous independent deployment, and monitoring is one of the key techniques for reducing mean time to recovery and doing QA well in production. Next, I will share some of our project's practices around contract testing and log monitoring, and show how letting go of system-level integration testing can work in practice.

Project practice of Danshari

The project is an old one that has been under development for seven or eight years. The team has adjusted the integration tests several times, but after the "seven-year itch" they still feel like a chicken rib:

  • Execution on the pipeline is very unstable; builds "hang randomly" and do not truly reflect real problems;
  • The system is complex, so the integration tests are anything but simple: each successful run takes more than half an hour, and instability has at worst left QA without a deployable package for more than a week;
  • In the past six months, the integration tests have rarely found real application defects;
  • As the system grows more complex, the cost of maintaining the integration tests keeps rising.

As you can see, integration testing has seriously hindered continuous delivery on this project and had to be cut off.

(1) Test strategy adjustment

The first part of Danshari starts with the integration tests themselves and adjusts the test strategy as follows:

  • Add no new automated tests at the system-function level; push coverage of the original integration tests down to unit tests (UT) and API tests as far as possible;
  • Guarantee the contract relationships between services that UT and API tests cannot cover by adding contract tests;
  • Remove the original integration tests from CI to speed up pipeline packaging, and add a smoke test that runs in the QA environment;
  • Whenever a defect is found, whether in the test environment or in production, add a corresponding test.
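The smoke-test step above can be sketched as follows. The endpoint list is an illustrative assumption, and the HTTP client is injected as a function so the check stays self-contained and works with any client:

```python
# A minimal smoke test for the QA environment: hit a few critical
# endpoints and check only that they respond without errors. The URLs
# are illustrative assumptions, not the project's actual endpoints.

CRITICAL_ENDPOINTS = [
    "http://qa.example.com/health",
    "http://qa.example.com/api/orders",
]

def smoke_test(endpoints, fetch):
    """fetch(url) -> HTTP status code; returns the list of failing
    endpoints, so an empty list means the deployment looks healthy."""
    failures = []
    for url in endpoints:
        try:
            status = fetch(url)
            if status >= 400:
                failures.append((url, status))
        except Exception as exc:  # connection refused, timeout, ...
            failures.append((url, repr(exc)))
    return failures

# Stand-in for a real HTTP client while the QA environment is stubbed out.
fake_statuses = {"http://qa.example.com/health": 200,
                 "http://qa.example.com/api/orders": 500}
failures = smoke_test(CRITICAL_ENDPOINTS, fake_statuses.__getitem__)
assert failures == [("http://qa.example.com/api/orders", 500)]
```

Unlike the old integration suite, a smoke test of this shape checks only that the critical paths respond at all, so it stays fast and stable enough to run on every QA deployment.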

The overall idea is to follow the structure of the test pyramid and push tests down to the lowest feasible layer. The integration tests are not completely eliminated, but are reduced to covering only the most critical paths. The final test structure is shown in the figure below:

(2) Log monitoring, analysis, and optimization

Without integration testing on the pipeline, builds go directly to the QA or production environments, so strengthening log monitoring becomes especially important. The second part of Danshari is therefore the monitoring, analysis, and optimization of logs.

Log data collection

The project uses the log analysis tool Splunk and collects log data in the following ways:

  • Dashboards configured in Splunk monitor API request failure logs, error logs, and API execution times, as shown below:
  • Early-warning emails: for API requests with serious performance problems, a warning email is sent, as shown in the following figure:
  • Proactive searches of error logs to locate problems reported by users in the production environment.
  • The same mechanisms can also be used in the QA and staging environments to detect problems as early as possible through log analysis.
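As a rough illustration of the early-warning idea, the sketch below scans API logs for slow requests. In the project this is done with Splunk alerts; the log entries, threshold, and endpoint names here are invented for illustration.

```python
# Scan recent API logs for requests slower than a threshold and flag
# them for alerting. Log format and threshold are illustrative only.

SLOW_THRESHOLD_MS = 2000

# Each entry: (endpoint, response time in milliseconds)
recent_api_logs = [
    ("/api/orders", 180),
    ("/api/search", 2600),
    ("/api/orders", 150),
    ("/api/report", 5400),
]

def slow_requests(logs, threshold_ms=SLOW_THRESHOLD_MS):
    """Return (endpoint, ms) pairs whose response time exceeds the threshold."""
    return [(ep, ms) for ep, ms in logs if ms > threshold_ms]

for endpoint, ms in slow_requests(recent_api_logs):
    # A real implementation would send this via an email/alerting service.
    print(f"WARNING: {endpoint} took {ms} ms (threshold {SLOW_THRESHOLD_MS} ms)")
```

In Splunk the same logic lives in a saved search with an alert action, so the scan runs on a schedule rather than in application code.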

Log data utilization

The log data collected in the ways above is analyzed and used for optimization in the following respects:

  • Find and locate functional problems in the system, analyze users' behavior and habits, and optimize the business;
  • Find and locate non-functional problems such as security and performance issues, then fix and optimize them;
  • Discover and analyze the shortcomings of the logging itself, standardize it, and further increase the value of logs for problem diagnosis, creating a virtuous cycle. Canonical logging involves a common log output path, a common log format, clearly defined log levels, and adding necessary logs as part of the story development process, with QA participating in log review. Here is the project's newly optimized logging format as seen in Splunk:
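The canonical-logging points above (a common format, clearly defined levels) might look like this with Python's standard logging module; the exact format string is an illustrative assumption, not the project's actual format:

```python
# One shared format and clearly defined levels, configured in one place
# so every module logs consistently. Format string is illustrative.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)

log = logging.getLogger("order-service")

log.info("order created, id=%s", 42)         # normal business event
log.warning("payment retry, attempt=%d", 2)  # recoverable anomaly
log.error("payment failed, order=%s", 42)    # needs investigation
```

Because every entry carries a timestamp, level, and service name in the same layout, Splunk searches and dashboards can filter and aggregate across services without per-service parsing rules.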

The project has only just begun to break away from integration testing and is still feeling its way forward. The most direct effect so far is that the pipeline produces packages noticeably faster: where we once went several days without a package, we now get several packages a day. Most exciting was what happened just this week: a bug reported shortly before the end of one workday was fixed and ready for testing the next morning.

In closing

System-level integration testing has all sorts of problems, and its failures do not necessarily uncover many real defects; the preceding sections analyzed letting it go from both technical feasibility and project practice. But integration testing is not only for finding problems: it is a quality-control safety net, and the necessary tests on critical paths are irreplaceable. Therefore, we advocate reducing the number of integration tests and keeping a reasonable ratio of tests at each layer.

Cutting off system-level integration testing requires the team to be able to deliver continuously, incrementally, and stably, and to ensure that users and the normal operation of the business are not affected. The process is not accomplished overnight, and the hard-won integration tests are not easy to give up. It requires many balances and trade-offs, and throughout the process the quality and risks of the system must be watched constantly, with corresponding adjustments made in time.


For more insights, please follow our WeChat official account: Sitvolk
