Driven by the open-source communities around container technology, continuous delivery, and orchestration systems, together with the microservices development philosophy, cloud applications have become an irreversible trend as virtualization technology matures and distributed frameworks gain adoption.

Cloud native brings standardization, loose coupling, observability, and scalability, opening new opportunities for decoupling delivery infrastructure from the business, more flexible environment management, and lossless releases. At the same time, under a microservice architecture the number of services grows explosively, the delivery-infrastructure workload grows with it, and the topology among services becomes complex. This makes it hard to evaluate the impact of an upgrade, hard to locate problems, and extremely costly to maintain individual test environments, posing great challenges to efficient delivery.

Full text: about 6,228 words; estimated reading time: 17 minutes.

Since Aipanpan moved its products fully onto the cloud in April 2020, it has decoupled the business from its infrastructure through servitization, letting business teams focus on business development itself and greatly improving delivery efficiency. Intelligently generated contract test cases guarantee the reliability of inter-service invocation, and full-link grayscale capability enables offline environment multiplexing and lossless releases, achieving efficient product delivery.

1. Business background

Aipanpan is a typical toB business with the following characteristics:

  • In terms of product form, the product line is long, covering core product capabilities such as customer acquisition, chat, follow-up, and insight;

  • In terms of market environment, competition is extremely fierce, placing higher requirements on the efficiency and quality of product development;

  • In terms of R&D mode, product and engineering adopt agile thinking, which requires continuous innovation and trial and error to quickly complete PoC and MVP development;

  • In terms of deployment form, besides providing SaaS services, there is also a demand for diversified delivery.

Scrum teams are divided by business domain, as shown below:

2. Challenges to the efficiency system

2.1 Sharp increase in infrastructure costs caused by service explosion

There are 200+ active modules, with 8 new modules added every month. The cost of connecting to and maintaining pipelines, monitoring, and other infrastructure rises sharply. The infrastructure each module must connect to is as follows:

2.2 Complex topology makes problems hard to locate and regression scope hard to evaluate

The topology among services is complex, as shown in the figure above. This complexity brings the following problems:

1. It is difficult to evaluate the impact of an upgrade, leading to many missed regression tests;
2. It is difficult to locate online problems;
3. The environment is large in scale, and joint-debugging tests are costly.

2.3 Contradiction between increasingly frequent release requirements and release costs rising with topology complexity

There are many modules, the topology is complex, and modules depend on one another. Each release puts 100+ modules online; the manually controlled process is high-risk and inefficient. With business release requirements becoming ever more frequent, ensuring both the efficiency and the safety of the release process is a great challenge.

3. Overall ideas for efficiency improvement

At the process and mechanism level, we build an agile system centered on user value and flow efficiency, covering the following aspects:

  • Agile iteration mechanism: the core concepts of user value and flow efficiency ensure consistent team goals and transparent information;

  • Requirement splitting management: a standardized, visualized, and automated mechanism that accelerates small-batch requirement flow at controllable cost and verifies value quickly;

  • Branch mode and environment management: based on cloud native's powerful traffic-control capability, a full-link grayscale environment is realized on Istio, enabling a simple, flexible, and low-risk branch mode;

  • Whole-process data measurement: target metrics reveal the status quo, process metrics surface problems, tasks are created automatically, and problems are driven to closure collaboratively.

At the technical level, automation and intelligence improve the whole process, including the following aspects:

  • Infrastructure: building infrastructure services decoupled from the business;

  • Automation: a reasonably layered automation system under microservices, guaranteeing effective quality recall at controllable cost;

  • Release capability: one-click operation, efficient execution, and a visual, perceivable, controllable release experience;

  • Tool empowerment: rich tool capabilities address efficiency pain points in R&D and testing (under construction; not covered in detail in this article).

The plan is explained below from four technical directions.

4. DevOps infrastructure services decoupled from the business

As mentioned above, the biggest infrastructure problem is the exploding cost of DevOps infrastructure access and maintenance caused by the exploding number of services. To deal with it, we borrowed ideas from serverless: turn the infrastructure into a service, decouple it from the business, and operate and maintain it independently. Previously, besides development and testing work, our business R&D and QA teams spent a great deal of time on matters unrelated to the core business: creating new applications and connecting them to logging, configuration, environments, pipelines, monitoring, and so on (left side of the figure below). Moreover, any infrastructure upgrade, such as an SDK upgrade of the log platform or adding a unified security-scanning step to the pipeline, required cooperation from every business team and was hard to push through.

If we provide this infrastructure to business teams as a service, business R&D and QA can focus on core business issues, dramatically improving team effectiveness (right side of the figure below). At the same time, infrastructure upgrades become imperceptible to the business, and rolling out infrastructure capabilities is no longer difficult.

How to build a decoupled, service-oriented infrastructure?

4.1 Infrastructure standardization

The first step in decoupling from the business is standardizing the infrastructure; only a standardized process can scale into servitized technical infrastructure. We carried out standardization mainly in the following areas:

1. Module standardization: code structure, packaging process, standard container, image management, deployment process

2. Standard assembly line

3. Standard basic services: APM component, configuration center, publishing platform, resource management

4. R&D mode

4.2 Declarative infrastructure

The second step in decoupling from the business is to build declarative infrastructure capability on top of standardization. By declarative, we mean a declarative infrastructure experience for the business team: the team only declares a few basic attributes in a standard configuration, and all infrastructure is connected automatically and maintained at zero cost to the business. The construction covers two aspects:

Access: One-click access at the minute level

Our approach is to use scaffolding as the lever for one-click infrastructure access. As shown below, scaffolding is the entry point for creating new modules: all new code repositories are created through it, and it automatically generates a complete framework codebase with the standard components integrated.

When scaffolding creates a new module, it reads the declared module attributes, such as whether to access APM, the module's code type, and its service type, and automatically performs pipeline creation, basic-component access, cluster-environment application, configuration-file generation, and other operations. From code-repository creation to full infrastructure access, a new service can be deployed to the test cluster in under 10 minutes.

  • Scaffolding: automatically generates framework code, including access to basic APM components, the API management platform, etc.;

  • ConfigMap: automatically generates standard application configurations and proactively triggers the access service when configurations are added or changed;

  • Access service: parses ConfigMap configurations and schedules the corresponding infrastructure services to complete access initialization.
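To make the flow concrete, here is a minimal Python sketch of how an access service might translate a declared module configuration into initialization steps. All attribute names and step names are hypothetical, not the actual platform's API.

```python
# Illustrative sketch of declaration-driven one-click access (names invented).
# A new module declares a few attributes; the access service parses them and
# schedules the matching infrastructure-access steps.

MODULE_DECLARATION = {
    "name": "demo-service",
    "language": "java",        # module code type
    "service_type": "http",    # module service type
    "enable_apm": True,        # whether to access APM
}

def plan_access_steps(decl):
    """Translate a module declaration into ordered access-initialization steps."""
    steps = ["create_code_repo", "create_pipeline", "apply_cluster_env",
             "generate_config_files"]
    if decl.get("enable_apm"):
        steps.append("register_apm")
    if decl.get("service_type") == "http":
        steps.append("register_api_platform")
    return steps

steps = plan_access_steps(MODULE_DECLARATION)
```

Because every step is derived from the declaration, adding a new infrastructure service only means extending the planner, with no change on the business side.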

Runtime: dynamic operation according to the declared content, with zero-cost service upgrade and maintenance

The basic components all provide services in sidecar mode, so at run time they are naturally decoupled from the business; the focus is therefore on decoupling the pipeline from the business at run time. We modeled the pipeline, parameterized it, and combined it with the business's declared attributes. As shown in the figure below, each pipeline run is generated dynamically in real time from the five parts on the left: general CI/CD configuration, pipeline templates, task scripts, task policies, and the business's declared attributes. Except for the service's own declaration file, all parts are operated and maintained independently by the infrastructure group, so task optimization, additions, and unified configuration changes are transparent to the service. As shown on the right, the infrastructure group can modify a pipeline template or task script to optimize or add pipeline steps.
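A toy Python sketch of the idea, assuming invented template and attribute names: the pipeline is rendered at run time from a shared template plus the module's declared attributes, so template changes by the infrastructure team reach every module without any per-module work.

```python
# Illustrative: render a concrete pipeline from a shared template plus the
# module's declared attributes. Template entries and attributes are invented.

PIPELINE_TEMPLATE = ["checkout", "build", "{test_stage}", "package", "deploy"]

def render_pipeline(template, declaration):
    """Fill template placeholders based on the module's declaration."""
    test_stage = ("contract_test" if declaration.get("run_contract_tests")
                  else "unit_test")
    return [step.format(test_stage=test_stage) for step in template]

pipeline = render_pipeline(PIPELINE_TEMPLATE, {"run_contract_tests": True})
```

Editing `PIPELINE_TEMPLATE` (e.g. inserting a security-scan step) changes every module's next run, which is exactly the transparency described above.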

4.3 Intelligent infrastructure

After servitization, the infrastructure is an independently operated service, and all problems must be maintained and investigated by the infrastructure team itself. The third step in decoupling from the business is therefore to build infrastructure with high stability, high efficiency, and low operation and maintenance cost. Our idea is to guarantee efficiency and stability through intelligent policies: before, during, and after each pipeline run, a policy-based "supervisor" watches task execution, analyzes and follows up, and repairs problems.

Analyzing common pipeline stability and efficiency problems, such as unstable environments, unstable underlying resources, and network anomalies, they generally fall into three types: occasional problems recoverable by retry, relatively complex problems requiring manual troubleshooting, and blocking problems requiring manual repair. On the efficiency side, there are many repetitive and wasted runs; for example, a change that only adds a log line still has to run the whole test process, wasting resources and slowing execution. See the left side of the figure below:

For these scenarios, we added configurable policy judgments before and after pipeline runs to decide whether a task should be skipped, queued, retried, and so on, improving both stability and efficiency.

Typical scenarios:

Automatic red-light analysis: after a task fails, the cause is automatically analyzed from error codes in the log and annotated; the resulting statistics then guide targeted optimization.

Queuing strategy: before tasks such as automation execute, the dependent environment is automatically checked, reducing red lights caused by running against a broken environment.
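The two scenarios above can be sketched as a pair of policy hooks. This is a minimal illustration under assumed error-code patterns, not the real policy configuration:

```python
# Minimal sketch of the configurable policy layer: a pre-run hook queues tasks
# whose environment is unhealthy, and a post-run hook classifies failures.
# The patterns and action names are invented for illustration.

import re

RETRYABLE_PATTERNS = [r"connection reset", r"timeout", r"registry unavailable"]

def pre_run_policy(env_healthy, queue_if_unhealthy=True):
    """Queue the task instead of failing when its dependent environment is down."""
    if env_healthy:
        return "run"
    return "queue" if queue_if_unhealthy else "skip"

def post_run_policy(exit_code, log_text):
    """Decide the follow-up action after a task finishes."""
    if exit_code == 0:
        return "pass"
    if any(re.search(p, log_text, re.IGNORECASE) for p in RETRYABLE_PATTERNS):
        return "retry"            # occasional problem, recoverable by retry
    return "notify_owner"         # needs manual troubleshooting
```

Keeping the patterns in configuration rather than code is what lets the infrastructure team tune the "supervisor" without touching any pipeline.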

5. Hierarchical automation system

Automation is an important topic in continuous delivery. How does the layering of automation change under cloud-native microservices?

Unlike the traditional three-layer test pyramid, automation under a cloud-native architecture focuses on end-to-end testing of the system: individual services are relatively simple internally, but the service topology is complex, so the actual layered testing looks more like an inverted pyramid.

However, end-to-end testing is expensive. Considering the input-output ratio, Aipanpan's layered automation is built according to the structure in the lower-right corner, with interface DIFF testing, contract testing, and pure front-end DIFF testing as the three core parts requiring no manual intervention.

5.1 Interface DIFF automation based on full-link gray environment

5.1.1 Full-link gray scheme

Our interface DIFF testing is built on a powerful full-link grayscale environment capability, a bonus of the cloud-native architecture. First, let us introduce our full-link gray scheme.

Based on the flexible routing capability of Istio, we developed a CRD Operator and built Aipanpan's "full-link gray release" platform on an architecture of grouped multi-dimensional routing over a homogeneous base. The solution supports multiple scenarios: offline environment multiplexing, capacity evaluation for online safety, and Canary release.

5.1.2 Test environment multiplexing

Test-environment multiplexing means using limited resources to logically isolate multiple environments out of one base environment, supporting parallel development and joint debugging.

As shown in the figure below, different branches correspond to different features. Through traffic dyeing plus traffic-rule routing, each branch gets a logically isolated environment, supporting parallel development. Once traffic is dyed orange at the front end, the full-link request travels along the orange link.
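The routing rule at each hop can be sketched in a few lines of Python. This is an illustration of the dye-and-fallback idea only; the header name and workload registry are assumptions, and the real implementation lives in Istio routing rules:

```python
# Sketch of traffic dyeing + rule routing: a request carries a color tag
# (injected at the front end); at each hop it is routed to the branch workload
# of that color if one exists, otherwise it falls back to the base environment.

WORKLOADS = {
    "order-service":   {"base", "orange"},   # has an orange branch deployment
    "payment-service": {"base"},             # base environment only
}

def route(service, headers):
    """Pick the workload version for one hop of the full-link gray environment."""
    color = headers.get("x-traffic-color", "base")
    available = WORKLOADS.get(service, {"base"})
    return color if color in available else "base"
```

The fallback to `base` is what makes a branch environment "logical": only the services actually changed on a branch need a colored deployment.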

5.1.3 DIFF test based on multiplexing

With the logically isolated test environments described above, each time a new branch environment is pulled and its code updated, regression testing can be performed by replaying traffic to both the base environment (running the code last deployed online) and the new branch environment and comparing their responses. Our DIFF scheme is as follows:

The scheme has the following advantages:

  • Interface DIFF based on traffic replay covers real online user scenarios to the maximum extent;

  • The whole process is automated, with no manual participation;

  • Traffic-filtering and DIFF policies are configurable, making extension and optimization easy;

  • Tasks run distributed, supporting massive concurrency.
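The comparison step itself can be sketched as follows, a minimal illustration assuming JSON responses and an invented ignore-list; real DIFF policies would be richer (nested paths, value normalization, etc.):

```python
# Sketch of the response-DIFF step: replay one recorded request against the
# base and branch environments, then compare returns while ignoring fields
# the DIFF policy declares noisy (timestamps, trace ids, ...).

IGNORED_FIELDS = {"timestamp", "trace_id"}

def diff_responses(base, branch, ignored=IGNORED_FIELDS):
    """Return the set of top-level keys whose values differ, minus ignored ones."""
    keys = (set(base) | set(branch)) - ignored
    return {k for k in keys if base.get(k) != branch.get(k)}

base_resp   = {"code": 0, "data": {"total": 3}, "timestamp": 1}
branch_resp = {"code": 0, "data": {"total": 4}, "timestamp": 2}
diffs = diff_responses(base_resp, branch_resp)
```

A non-empty `diffs` set flags the replayed request for review; an empty set counts as a passed regression case.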

5.2 Contract testing guarantees recall of inter-service invocation problems

5.2.1 What is contract testing

In a microservice architecture, dependencies between services are complex, and each service is usually maintained by an independent team. API calls dominate both between services and between the front end and back end. In this situation, an API developed by team A also serves teams B and C. It passes all tests at first; but in a later iteration team B asks for an adjustment to a field, team A makes the change and passes its tests, and after launch team C's functionality turns out to be broken.

The root cause of the above problem is as follows:

As a service provider serves more and more consumers, the impact of service changes becomes difficult to assess, and changes cannot be synchronized to all consumers in time. As a result, it is usually the consumers who discover problems and give feedback, after losses have already occurred. To avoid this, we introduced contract testing.

The core idea of contract testing is to establish, in a consumer-driven way, contracts between the provider and each consumer. After the provider changes, testing whether any consumer's contract is broken ensures the safety of the service upgrade. Contracts also serve as a means of decoupling testing between the two sides: with the contract as the intermediate standard, the team can verify offline that what the provider offers meets consumer expectations, without requiring consumer and provider to be online at the same time.
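The core check can be illustrated in a few lines. The contract format below is invented for illustration (real tools such as Pact use a richer interaction format), but it captures the idea: replay every consumer's expectations against the provider's new response shape.

```python
# Sketch of the core contract check: each consumer holds a contract listing
# the response fields it relies on; after a provider change, every contract
# is verified against the new response shape.

def verify_contract(contract, provider_response):
    """Return the fields the consumer relies on that are now missing."""
    return [f for f in contract["required_fields"] if f not in provider_response]

team_b_contract = {"consumer": "team-b", "required_fields": ["id", "name"]}
team_c_contract = {"consumer": "team-c", "required_fields": ["id", "name", "price"]}

# Provider change requested by team B accidentally drops "price".
new_response = {"id": 1, "name": "widget"}
broken = [c["consumer"] for c in (team_b_contract, team_c_contract)
          if verify_contract(c, new_response)]
```

Run before release, this check surfaces exactly the team-C breakage described in the story above, instead of letting the consumer discover it in production.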

5.2.2 Common contract test schemes

Common contract-testing schemes include truly consumer-driven ones such as Pact, where contracts are generated and maintained by the consumer side, and all consumer contracts are pulled for testing after the provider's code is updated. This decouples integration testing while ensuring the service meets all consumer requirements (lower-left figure).

There are also non-consumer-driven schemes such as Spring Cloud Contract, where the provider writes the contracts and provides mock services that consumers can test against based on the contract files. These only solve the integration-test decoupling problem (lower-right figure).

5.2.3 Aipanpan's contract test scheme

Aipanpan's scheme is a compromise. On the one hand, contracts are still written by the service provider, in line with team habit; on the other hand, we wanted to preserve the consumer-driven property that the service meets the needs of all consumers. We therefore generate contracts on the provider side but supplement them with simulated consumer contract cases derived from online logs and call-chain parsing, with the whole process fully automated.

5.2.4 Implementation of contract test technology

Step 1: Introduce Swagger and push all interfaces to access it, so that interface documentation on the interface management platform stays in quasi-real-time sync with the actual code. The detailed implementation steps are as follows:

Step 2: Automatically generate the contract case from the interface documentation

Once interface information is synchronized with the code, basic contract test cases are generated automatically from the interface documentation. Each time interface information is uploaded to the platform, the system inspects the uploaded content and automatically triggers generation of new cases and verification of old ones. Verification runs the contract cases associated with the modified interface to detect whether the update breaks the original contract; the results are recorded in a report and pushed to the corresponding team for annotation, and the annotations decide whether a case should be updated.
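Case generation from the synchronized documentation might look like the following sketch. The OpenAPI-style spec fragment and case shape are assumptions for illustration; real generation would also cover parameter types, status codes, and example values:

```python
# Illustrative: derive a basic contract case from an OpenAPI-style interface
# document. The spec fragment and case format are invented.

SPEC = {
    "path": "/api/v1/orders/{id}",
    "method": "get",
    "response_fields": {"id": "integer", "name": "string"},
}

def generate_case(spec):
    """Build a contract case asserting the documented response fields exist."""
    return {
        "request": {"method": spec["method"].upper(), "path": spec["path"]},
        "assertions": sorted(spec["response_fields"]),
    }

case = generate_case(SPEC)
```

Because the spec is regenerated from code on every upload, cases produced this way cannot drift from the implementation the way hand-written ones do.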

Step 3: Intelligently analyze consumer-side characteristics from call-chain and log information, and generate cases simulating the consumer side

As shown in the figure below, each service's consumers can be extracted from call-chain information; each consumer's contract can then be obtained by analyzing its logs, and cases can be generated automatically and associated with the interface.
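The consumer-extraction step is essentially a group-by over call-chain spans. A hedged sketch, with span fields simplified to caller/callee (real trace data would carry span ids, timestamps, and endpoints):

```python
# Sketch of consumer extraction from call-chain data: each span records a
# caller and a callee, so grouping spans by callee yields the consumer set
# of every service.

from collections import defaultdict

def consumers_by_service(spans):
    """Map each provider service to the set of services that call it."""
    result = defaultdict(set)
    for span in spans:
        result[span["callee"]].add(span["caller"])
    return dict(result)

spans = [
    {"caller": "web-frontend",  "callee": "order-service"},
    {"caller": "report-job",    "callee": "order-service"},
    {"caller": "order-service", "callee": "payment-service"},
]
consumers = consumers_by_service(spans)
```

The per-consumer request/response pairs mined from logs then fill in the contract details for each entry in this map.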

5.3 Intelligent fault locating reduces automation maintenance costs

Automation is a good way to improve efficiency, but for a long time its instability and the high cost of investigating failures have either deterred teams from building automation or caused it to be abandoned. To improve stability and reduce follow-up cost, we built automatic locating and repair for case failures, so that an intelligent assistant helps maintain case runs. Here is an example of our automatic locating in action:

After an automated case fails, the automatic locating service is called to annotate the failed case, and failures are classified according to the annotation result.

For example, environment problems are retried automatically, batches of unknown failures are sent to the automation team for troubleshooting, and "element not found" failures are sent to business QA.

The implementation scheme is shown below. It includes basic locating capabilities and basic data acquisition; on top of these, a configuration layer provides configuration parsing and scheduling, letting us flexibly combine locating policies to quickly support problem locating in different scenarios.
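The configurable dispatch described above can be sketched as a rule table mapping failure signatures to a category and a follow-up action. Patterns and action names here are invented for illustration:

```python
# Minimal sketch of the locating layer: configurable rules map a failure
# signature to a category and a dispatch action (auto retry, send to the
# automation team, send to business QA).

import re

RULES = [
    (r"ConnectTimeout|ServiceUnavailable", "environment",     "auto_retry"),
    (r"NoSuchElement|element not found",   "element_missing", "notify_business_qa"),
]

def locate(failure_log):
    """Classify one failed case and decide who follows up."""
    for pattern, category, action in RULES:
        if re.search(pattern, failure_log, re.IGNORECASE):
            return category, action
    return "unknown", "notify_automation_team"
```

Because the rules live in configuration, QA can add a new failure signature without touching the locating service itself.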

6. Efficient and safe continuous release

6.1 Release Dilemma

  • Different types of modules connect to different platforms and processes, making unified release difficult; any change to the underlying release mode requires upgrading every module, so migration costs are high;

  • With many modules, complex topology, and inter-module dependencies, each release puts 100+ modules online; the manually controlled process is high-risk and inefficient, and recording and analyzing the rollout is also costly;

  • The overall rollout process is not visible, and risk perception lags behind.

How to solve the above problems?

6.2 Unified multi-platform deployment and release engine

Based on cloud native, we built a unified multi-platform deployment and release engine, seamlessly integrated with CI/CD, achieving a highly standardized release process while supporting multiple release strategies, as shown below:

The unified CD release platform enables unified release for all module types, and underlying deployment migrations are imperceptible to services.

6.3 Release orchestration design

With a unified release platform in place, we aim for a fully automated release process to solve the complexity and inefficiency of going online.

We analyzed what must be done before, during, and after a release (figure below, left) and sorted out the data needed to complete the whole release process automatically (figure below, right), including module code-freeze information, dependency information, configuration information, and so on. From this data, the service release topology and rollout steps are generated automatically according to fixed orchestration logic. After manual confirmation of the generated topology and steps, the corresponding release services are called automatically, release-process data is collected automatically, and a release summary is generated.
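At the heart of generating a release topology from dependency information is a topological sort into release batches. A minimal sketch with invented dependency data; the real orchestration also weaves in configuration and code-freeze checks:

```python
# Sketch of rollout orchestration: from declared inter-module dependencies,
# a topological sort yields release batches in which every module goes out
# only after the modules it depends on.

def release_batches(deps):
    """deps: module -> set of modules it depends on. Returns ordered batches."""
    deps = {m: set(d) for m, d in deps.items()}   # defensive copy
    batches = []
    while deps:
        ready = sorted(m for m, d in deps.items() if not d)
        if not ready:
            raise ValueError("dependency cycle detected")
        batches.append(ready)
        for m in ready:
            del deps[m]
        for d in deps.values():
            d -= set(ready)
    return batches

batches = release_batches({
    "web":   {"order", "user"},
    "order": {"user"},
    "user":  set(),
})
```

Modules within one batch have no mutual dependencies, so they can be released concurrently, which is what replaces the manually sequenced 100+ module rollout.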

6.4 One-click release with a visual, perceivable, and controllable process

With the release process automated, to perceive problems in time and reduce release risk, we built visualization of the release process and combined it with APM, Canary release, and other strategies to ensure release safety.

Visual release process: the dependency topology at service granularity is displayed in real time, making progress visually perceivable.

Canary release strategy: lossless release, with timely risk perception and recall.

7. Overall gains

The number of stories per iteration increased by 85.8%, the release cycle became stable, the R&D and test cycle shortened by 30%, and bugs per thousand lines dropped from 1.5 to 0.5.

8. Future outlook

1. Empower the development, coding, and testing process through local IDE plug-in tools to improve R&D efficiency;

2. Build a quality-risk identification system through white-box capability and apply it to admission, pre-release, grayscale, and other scenarios.

