Introduction: A development pain point: why is a stable test environment so hard to achieve? For the production environment, accuracy and stability matter most, and we recommend application-centric practices based on OAM and IaC. For test environments, isolation, low cost, and stable dependencies are paramount, and we recommend building isolated test environments on top of a stable environment: reuse the stable environment, and generate test environments through traffic isolation and data isolation. Building environments this way resolves resource conflicts in the R&D process.

Column planning | Yachun

Volunteer editors | Jimmy, Lu Ruixing

Here are the details.

The concept of an environment is familiar to most developers. A stable, predictable and low-cost environment is also a common aspiration.

As shown in the figure below, we divide environments into the production environment, the test environment, and the development environment. In most cases these are further separated, as by the firewall in the figure, into the online environment and the offline environments.

In practice, however, the use and division of the environment can vary depending on the size of the company and the cost of development.

For example, when cost is a constraint, the production environment comes first, because serving users is the core priority. The test environment comes second. Before a change goes online, we verify it in a test environment that resembles production. Only a change that passes verification there is released to production, which keeps the system stable through the transition.

The production environment

For the production environment, accurate and stable operation is essential, and that requirement generates a large number of operations and maintenance (O&M) and governance demands.

Where a test environment might run a service on a single node, the production environment must also account for backup, active/standby failover, traffic diversion, and disaster recovery to keep running stably.

Accuracy and stability are the biggest differences between the production environment and other environments. They bring a large number of configuration requirements for O&M and service governance. Maintaining these configurations effectively is the reason we manage them with the OAM model and the IaC approach, as shared in the previous article.

AppStack is a cloud-native application delivery platform based on OAM. Enterprises can use declarative definitions such as application orchestration, placeholders, and variables so that a single orchestration deploys differently across multiple environments. Versions and baselines also make it possible to bring up or roll back an environment with one click.
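The idea of one orchestration with placeholders, filled differently per environment, can be illustrated with a minimal Python sketch. This is not AppStack's or OAM's actual syntax: the template fields, the `ENV_VARIABLES` table, and the `render` function are hypothetical names chosen for illustration.

```python
from string import Template

# One shared orchestration with placeholders (hypothetical fields).
ORCHESTRATION = Template("app: $app\nimage: $image\nreplicas: $replicas\n")

# Per-environment variables; only the differences live here (assumed values).
ENV_VARIABLES = {
    "test": {"replicas": 1},
    "production": {"replicas": 4},
}


def render(env: str, app: str, image: str) -> str:
    """Fill the shared orchestration with the variables of one environment."""
    return ORCHESTRATION.substitute(app=app, image=image, **ENV_VARIABLES[env])


if __name__ == "__main__":
    print(render("test", "orders", "orders:1.4"))
    print(render("production", "orders", "orders:1.4"))
```

Keeping the differences in a small variable table means a change to the shared orchestration reaches every environment at once, while per-environment values stay explicit and reviewable.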

The production environment contains many kinds of configuration, such as application configuration, application images, application O&M configuration, and infrastructure O&M configuration. These configurations and images are watched over and managed by different people.

Developers modify the code, and each code release changes the image and the configuration. Application O&M staff modify the application O&M configuration. Infrastructure O&M staff modify the infrastructure configuration. Every one of these changes alters the production environment and may therefore introduce risk.

Development and operations should therefore share responsibility for operating and managing the production environment.

The test environment

The test environment is another important type of environment. It comes in two kinds: the integration environment and the pre-release environment. The integration environment is mainly used for integration testing or functional verification. The pre-release environment is a production-like environment used mainly for acceptance.

The goal of the test environment is to support independent testing through isolation, reuse, and simulation, using as few resources as possible.

For example, if an application interacts with an external service and that service has problems, a simulated version of it can be used in the test environment instead.

Take a big data product as an example. It is tempting to conclude that its environment requirements are too high for a test environment to be practical. It relies on many technical services such as Hive, Kafka, and MySQL, and the machine requirements are steep: Hive and Kafka each need many machines. In addition, Redis is needed for caching and ZooKeeper for service discovery. Standing up a complete environment just for testing is obviously inefficient. Yet if 50 developers share one test environment, conflicts are so frequent that testing is barely possible.

To solve this problem, services and applications can be layered, in this case into three layers. The bottom layer is the common infrastructure services, such as Hive and Kafka. The middle layer is the small independent services such as Redis and ZooKeeper. In a test environment a single instance of each is perfectly adequate, and they can run on one virtual machine. The top layer is the application, where we deploy only what the required tests need.

The test environments are then managed as follows. All common services are shared base services: every test environment depends on them, and the data of each environment is isolated through logical mechanisms such as namespaces. Each test environment gets its own set of the independent services, Redis and ZooKeeper.

Because the application layer deploys only the required applications, a test environment can be brought up with minimal resource consumption. Many tests use few resources, and if you build a complete environment for each of them, you will find that resource utilization is low 99.99% of the time.
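The three-layer split can be written down as a small planning function. The sketch below is only an illustration under assumed conventions: the `EnvSpec` class, the `service/namespace` naming, and the `feature-42` environment name are all hypothetical, not part of any real platform.

```python
from dataclasses import dataclass

# Layer 1: shared base services, isolated per environment by namespace only.
SHARED_BASE_SERVICES = ("hive", "kafka")
# Layer 2: small services deployed once per environment as a single instance.
PER_ENV_SERVICES = ("redis", "zookeeper")


@dataclass(frozen=True)
class EnvSpec:
    """Describes one temporary test environment (hypothetical model)."""
    name: str
    apps: tuple  # Layer 3: only the applications this test actually needs

    def deployment_plan(self) -> dict:
        return {
            # Nothing is deployed for shared services, only a namespace is claimed.
            "shared_namespaces": [f"{s}/{self.name}" for s in SHARED_BASE_SERVICES],
            "dedicated_services": [f"{self.name}-{s}" for s in PER_ENV_SERVICES],
            "applications": [f"{self.name}-{a}" for a in self.apps],
        }


if __name__ == "__main__":
    print(EnvSpec("feature-42", ("orders",)).deployment_plan())
```

Only the middle and top layers consume new resources per environment, which is what keeps the cost of each additional test environment small.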

It is also important that all test environments be temporary. If a test environment is kept as a long-lived environment, users come to treat it as their own, for example by naming it so that others cannot use it, which is hugely wasteful given that resources are limited. We want the resources of our test environments to form a pool: environments are created from it on demand and destroyed on completion, so the resources can be reused. This also requires improving testing efficiency and completing more tests in the shortest possible time.
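A pool of temporary environments can be modeled with leases that expire. The following is a minimal sketch, assuming a fixed pool capacity and a time-to-live per environment. `EnvironmentPool` and its methods are hypothetical names, and time is passed in explicitly to keep the example deterministic.

```python
class EnvironmentPool:
    """Fixed-capacity pool of temporary test environments (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._leases = {}  # environment name -> expiry timestamp

    def reap(self, now: float) -> list:
        """Destroy every environment whose lease has expired."""
        expired = [name for name, expiry in self._leases.items() if expiry <= now]
        for name in expired:
            del self._leases[name]
        return expired

    def acquire(self, name: str, now: float, ttl: float) -> bool:
        """Create an environment if the pool has room, after reclaiming expired ones."""
        self.reap(now)
        if name in self._leases or len(self._leases) >= self.capacity:
            return False
        self._leases[name] = now + ttl
        return True

    def release(self, name: str) -> None:
        """Destroy an environment as soon as its tests are finished."""
        self._leases.pop(name, None)


if __name__ == "__main__":
    pool = EnvironmentPool(capacity=1)
    print(pool.acquire("feature-a", now=0, ttl=60))
    print(pool.acquire("feature-b", now=10, ttl=60))
    print(pool.acquire("feature-b", now=61, ttl=60))
```

Because every lease carries an expiry, no environment can quietly turn into someone's permanent private environment: it must be renewed deliberately or it returns to the pool.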

The development environment

Besides the production and test environments discussed above, the development environment is the one developers work in most. The toolchains used for development and builds, for example, all belong to it. In a development environment, our focus is on running the service locally.

Ideally, the development environment can communicate with other services in both directions, which raises three problems. First, how does the development environment access services in the underlying environment, such as another Service? Second, how do other services reach the service we are developing? Third, how do we isolate our requests and data from other development environments? These are similar to the problems we met with test environments, so development environments need a similar approach. KT-Connect, open-sourced by the Alibaba Cloud Yunxiao team, is a tool designed to solve this problem.

Other tools are available for the development environment as well, as shown in the figure above. It is worth comparing them with the ones you already use.

Test environment pain points

Many companies and many people say that there are not enough test environments and that the ones they have are unstable. What challenges do we face in a test environment, especially with distributed applications? After the move to microservices, the challenges of distribution have become more and more obvious, and many of them are related to the environment.

For example, an application change that has not been properly validated enters the integration environment by accident, so its quality is not assured when it gets there. In the integration test phase, the relationships between applications are very complex: when one service is unstable, the other services on the same call chain are likely to be unstable too.

As a result, daily integration testing is often done poorly. Because the earlier stage offers no guarantee, the changed application ends up occupying the pre-release environment instead. The pre-release environment is relatively expensive and cannot be occupied by one person for long, so to give everyone a turn, changes from many people are batched together. Pre-release then becomes a long-lived environment, the time spent in pre-release grows, and the whole development and delivery cycle grows with it. In continuous delivery, the test environment therefore poses many challenges: instability, resource shortages, integration problems, and so on.

At present, most test environment problems come from services that are not effectively governed. There are many services, they are tightly coupled, and once one fails, all the others are affected. When the services in an environment are constantly changing, unstable versions may be deployed at any time, and the entire environment becomes unstable.

The consequence of an unstable integration environment is that a large amount of testing moves to pre-release, which becomes a bottleneck, and from there moves online. In the end, every application uses the online environment as its backstop.

In summary, the test environment mainly faces the following two challenges:

The first is how to resolve dependencies between services. For example, if A depends strongly on C, A's functions succeed only when C works, so after C changes, A must be verified as well to confirm that the change to C is correct.

The second is the stability of the environment itself, which has two aspects: the stability of the machines and the stability of the services.

Machine stability mainly means coping effectively with hard disk and network faults and doing system backup and disaster recovery well.

Service stability mainly means ensuring the availability of each individual service. If each application is 90% available and a request passes through 10 of them, the overall availability is 0.9 to the power of 10, or about 35%, assuming the failures are independent. The availability of the system as a whole is therefore very low.
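The multiplication above can be checked in a few lines of Python. The sketch assumes that the applications fail independently and that a request needs all of them, which is the simplest model and not a measurement of any real system. `serial_availability` is a name introduced here for illustration.

```python
import math


def serial_availability(availabilities) -> float:
    """Availability of a chain in which every service must be up at once."""
    return math.prod(availabilities)


if __name__ == "__main__":
    # Ten applications at 90% each: 0.9 ** 10 is roughly 0.349.
    print(round(serial_availability([0.9] * 10), 3))
    # The same chain at 99% each is far healthier, roughly 0.904.
    print(round(serial_availability([0.99] * 10), 3))
```

The comparison shows why the stability of each individual service matters so much in a test environment: small per-service gaps compound quickly along a call chain.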

How to ensure the stability of the test environment

We described two challenges for the test environment above. Every test environment needs to be stable, so that testing does not fall back on the online environment and put it at risk. So how do we ensure the stability of the test environment?

Common practices in test environments include two-node deployment, N+1 deployment, and isolation.

For example, an application should be deployed with at least two Pods that are never restarted at the same time, so that at least one of them is always serving. In a test environment it does happen that a service has only one replica, and when that service is deployed and restarts, the entire test becomes unavailable. Two-node deployment is a good quick fix for this, but it also consumes a lot of resources.

To reduce the resource cost of two-node deployment, the N+1 deployment mode emerged: instances are replaced one at a time in a rolling manner, so only one instance is changing at any moment while all the others keep serving. This is also the default approach in Kubernetes (K8s), which typically starts new instances before removing the old ones.
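The rolling replacement described above can be sketched as a small simulation. This is an illustrative model, not Kubernetes code: `rolling_replace` and its parameters are hypothetical, and the model assumes exactly one surge instance is started before each old instance is retired.

```python
def rolling_replace(replicas: int, old: str, new: str) -> list:
    """Return the list of running versions after every step of an N+1 rollout."""
    running = [old] * replicas
    snapshots = [list(running)]
    for _ in range(replicas):
        running.append(new)   # start one extra instance: N+1 are running
        snapshots.append(list(running))
        running.remove(old)   # retire one old instance: back to N
        snapshots.append(list(running))
    return snapshots


if __name__ == "__main__":
    for step in rolling_replace(3, "v1", "v2"):
        print(step)
```

At every step at least N instances are running, so the service stays available throughout, and the extra cost is a single instance rather than a full second copy of the deployment.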

To keep the test system stable we also need isolation: as far as possible, every application other than the ones we are modifying should be stable.

At Alibaba, the team introduced a pre-integration environment for projects, known internally as the project environment. It is an isolated environment, brought up separately for a single feature during the development stage.

To sum up, the pre-integration environment is isolated and unaffected by anyone else. The other services it depends on come from a stable environment, so the dependencies are stable enough for independent development and testing.

In the early days, the environment that the pre-integration environment depended on was still the daily integration environment. That was certainly better than doing nothing and putting changes straight into daily integration. But we then found that the daily integration environment itself was the problem: at the start there was no guarantee that every change submitted to the integration environment had been validated, so the daily integration environment could still contain many problems. In essence we were back to the question of how to govern the daily integration environment and keep it relatively stable.

To solve these problems, we introduced the concept of a stable environment. We had isolated environments, but the base environment they depended on was unstable. Would a stable environment solve the problem?

What is a stable environment? A version that can be released online is stable, and the online environment is certainly a stable environment. Our stable environment is therefore made up of application services whose versions match what is running online. If online is stable, this environment is stable, so we can create isolated environments on top of it and keep the whole setup stable.

Once a stable base environment exists, each application is deployed to the base environment after it has been deployed to production, so the base environment is always something test environments can depend on. With that dependency in place, the environment we develop in is completely isolated: it contains only the few changed applications that are closely related to our work, and every other dependent service comes from the base environment.
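The fallback from an isolated environment to the base environment can be sketched as a routing rule. This is a minimal illustration under assumed conventions, not the mechanism of any particular platform: the `registry` layout, the `resolve` function, and the environment and service names are all hypothetical.

```python
BASE_ENV = "base"  # assumed name of the stable base environment


def resolve(service: str, env_tag, registry: dict) -> tuple:
    """Pick the instance that should serve a call carrying the given env tag.

    registry maps environment name -> {service name -> address}.
    """
    isolated = registry.get(env_tag, {})
    if service in isolated:
        # The service was changed for this feature, so use the isolated copy.
        return env_tag, isolated[service]
    # Everything else is served by the stable base environment.
    return BASE_ENV, registry[BASE_ENV][service]


if __name__ == "__main__":
    registry = {
        "base": {"a": "a.base:8080", "c": "c.base:8080"},
        "feature-1": {"a": "a.feature-1:8080"},
    }
    print(resolve("a", "feature-1", registry))
    print(resolve("c", "feature-1", registry))
```

With this rule an isolated environment only needs to deploy the services it changes, since every other call resolves to the base environment automatically.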

The base environment has been mentioned above, but what exactly is it? The base environment is a stable environment. Once a stable integration environment exists, isolated environments can be built on it: feature testing runs in the isolated environment, where the traffic it depends on can also be found. The base environment does carry costs. Deployment cost is relatively low, and the machine resources it occupies are rarely a problem for a large company, though they can be for a small one. The main cost is maintenance: monitoring the base environment and fixing its problems takes a real investment of manpower.

The maintainers of the base environment are generally not its users, so a mature mechanism is needed to keep it running stably over the long term. Now consider: if we do not build a new base environment, which existing environment is the most stable? The online environment. Earlier we placed a firewall in front of it, because we fear security risks and polluted data. But if isolation, service routing, monitoring, and security protection are all good enough, the production environment itself can serve as the base environment.

Using the production environment as the base environment requires solving two important problems. The first is traffic isolation, which is comparatively easy: isolation has moved from being resource-oriented to being traffic-oriented, and many ready-made techniques exist. The second is data isolation, which is a big challenge. Data takes many forms, and message queues, ordinary databases, and data warehouses each behave differently. There are many troublesome details here, but each specific case has a workable solution.
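As one possible illustration of data isolation, test traffic can carry an environment tag that selects shadow tables in a database and filters messages in a queue. The sketch below is an assumption-laden example rather than a description of how any specific system does it: the `x-test-env` header, the `__` table suffix, and both function names are hypothetical.

```python
from typing import Optional

ENV_HEADER = "x-test-env"  # hypothetical tag carried by test traffic


def shadow_table(table: str, env: Optional[str]) -> str:
    """Untagged (production) traffic uses the real table, test traffic a shadow copy."""
    return table if env is None else f"{table}__{env}"


def should_consume(headers: dict, consumer_env: Optional[str]) -> bool:
    """A consumer handles a message only if its tag matches the consumer's environment.

    Production consumers run with consumer_env=None and so skip every tagged message.
    """
    return headers.get(ENV_HEADER) == consumer_env


if __name__ == "__main__":
    print(shadow_table("orders", None))
    print(shadow_table("orders", "f1"))
    print(should_consume({ENV_HEADER: "f1"}, None))
    print(should_consume({ENV_HEADER: "f1"}, "f1"))
```

Databases and message queues need different handling, as noted above: a table name can be rewritten at write time, whereas a queue needs the tag to travel with each message so that consumers can filter on it.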

Summary

To sum up, accuracy and stability matter most for the production environment, and we recommend application-centric practices based on OAM and IaC. For test environments, isolation, low cost, and stable dependencies are paramount, and we recommend building isolated test environments on top of a stable environment: reuse the stable environment, and generate test environments through traffic isolation and data isolation. Building environments this way resolves resource conflicts in R&D. The next chapter will focus on collaboration in R&D.

This article is the original content of Aliyun and shall not be reproduced without permission.