Summary: What exactly is cloud-native DevOps? In our view, cloud-native DevOps makes full use of cloud-native infrastructure, builds on microservice and serverless architectures together with open standards and language- and framework-agnostic technology, and adds continuous delivery and intelligent self-operation capabilities, so that an enterprise achieves higher service quality and lower operating costs than traditional approaches and development can iterate quickly and focus on the business.

1. What is cloud native DevOps

Let’s start with a simple example of what cloud-native DevOps is and how it differs from traditional DevOps.

Pictured above is a food stall where the chef works very hard to slice, stir-fry, cook, and sell all kinds of food. Everything from purchasing raw materials to processing, sales, and after-sales service is handled by one or two people. This is a very typical DevOps scenario: the team takes care of everything end to end. When the chef is highly skilled and a strong seller, the stall can be very efficient with little waste. The problem is that it is hard to scale, because the process is non-standard and depends heavily on the chef’s personal ability.

Now look at this picture of Nanjing Da Pai Dang. Although “food stall” is in its name, it is obviously not the stall we described above. Walk into any of its restaurants and you will find that the chefs can focus on providing better dishes for customers, developing new dishes, and trialing and promoting them with small batches of users. Whether the number of customers grows or shrinks, they adapt quickly, and new stores can open rapidly. This can be thought of as cloud-native DevOps.

So what exactly is cloud-native DevOps? In our view, cloud-native DevOps makes full use of cloud-native infrastructure, builds on microservice and serverless architectures together with open standards and language- and framework-agnostic technology, and adds continuous delivery and intelligent self-operation capabilities, so that an enterprise achieves higher service quality and lower operating costs than traditional approaches and development can iterate quickly and focus on the business.

As shown in the chart above, cloud-native DevOps rests on two principles: compliance with open standards, and independence from any particular language or framework; two foundations: microservice/serverless application architecture, and serverless infrastructure (BaaS/FaaS); and two capabilities: intelligent self-operation and continuous delivery.

The two principles are complying with open standards and staying independent of any specific language or framework. Compared with being tied to a particular language or framework, this gives technology upgrades and iteration greater flexibility, better vitality, and a healthier ecosystem. The two foundations: microservice and serverless application architectures make DevOps feasible, while serverless infrastructure is resource- and demand-oriented and delivers greater elasticity. On top of these two principles and two foundations, we can build the two capabilities: continuous delivery and intelligent self-operation.

2. Alibaba Cloud native DevOps upgrade case

Let’s start with an example of a cloud-native DevOps transformation by a team at Alibaba.

Case background: an overseas e-commerce team at Alibaba faced many sites, high site-construction costs, rapidly changing requirements, slow delivery, and high operations costs in overseas markets. How could it upgrade smoothly to cloud-native DevOps, solve these problems, and improve business delivery efficiency? Here is how we did it.

(1) Architecture upgrade: service-governance sidecar and mesh

The first step is an architecture upgrade: sink the service-governance code out of the application into a sidecar, and use a service mesh to host capabilities such as environment routing. In the figure above, each green dot represents application code and each orange dot represents service-governance code, which ships inside the container as a binary package. As the service-governance system grew, it came to include many things: log collection, monitoring instrumentation, operations hooks, and so on. We call such a container a “rich container”. The problem is obvious: even an upgrade or adjustment to log collection requires the application to be rebuilt and redeployed, although the change has nothing to do with the application itself. And because there is no separation of concerns, a bug in log collection can affect the application itself.

The first thing we did to keep the application focused on the application was to move all service-governance code out of the application container into a sidecar, so that governance code and application code run in two separate containers. At the same time, we migrated some of the older governance capabilities, such as test routing and link tracing, into the mesh sidecar. The application is slimmed down and only needs to care about its own code.
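In Kubernetes terms, this split can be sketched as a Pod with two containers. The image names and ports below are hypothetical, not the team's actual artifacts:

```yaml
# Sketch: application code and service-governance code run in two
# separate containers of the same Pod (names and images hypothetical).
apiVersion: v1
kind: Pod
metadata:
  name: payment-app
  labels:
    app: payment
spec:
  containers:
    - name: app                  # application code only
      image: registry.example.com/payment-app:1.0.0
      ports:
        - containerPort: 8080
    - name: governance-sidecar   # log collection, tracing, routing, etc.
      image: registry.example.com/governance-sidecar:2.3.1
```

Because the sidecar ships as its own image, a log-collection fix can be rolled out by upgrading the sidecar image alone, without rebuilding or redeploying the application.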

The benefit is that business developers can focus on business-relevant application code without being coupled to service governance.

This first step is smooth because we can migrate service governance into the sidecar gradually, without worrying about the cost of a single big-bang migration.

(2) Architecture upgrade: from build decoupling and release decoupling to operations decoupling

In the second step we decoupled at three levels: build, release, and operations.

Anyone familiar with microservice and serverless architectures knows that a business can only move fast when it can be developed, tested, released, and operated independently, because that minimizes its coupling with everyone else.

But we also know that as the business grows more complex and the application keeps evolving, the application accumulates more and more business code. The application in the figure below, a payment application, contains code specific to individual businesses: some code serves the particular requirements of Hema, some serves the particular requirements of Tmall, and some is general (platform) code that serves all business scenarios.

Obviously, from the perspective of development efficiency, letting each business party change its own business code reduces communication cost and improves R&D efficiency. But this introduces a new problem: even a change that touches no shared business logic still requires a full regression of every business carried by the application, and if other businesses change in the same period, they all have to be integrated and released together. When changes are frequent, teams queue up for integration, and the cost of integration testing and coordination becomes very high.

Our goal is for each business to be developed, released, and operated independently. To get there smoothly, the first thing is to decouple them at build time. For a relatively independent business, we build its code into a separate container image, orchestrate it as an init container of the Pod, and mount it into the main application container’s storage when the Pod starts.
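The init-container trick can be sketched as follows; the module and image names are hypothetical, and the main application is assumed to load plug-in modules from a directory:

```yaml
# Sketch: a business module built as its own image is copied into the
# main application's file system at Pod start-up via an init container.
apiVersion: v1
kind: Pod
metadata:
  name: payment-app
spec:
  volumes:
    - name: biz-code
      emptyDir: {}               # shared space for the business module
  initContainers:
    - name: hema-module          # hypothetical business-specific module
      image: registry.example.com/hema-module:0.9.0
      command: ["cp", "-r", "/module/.", "/target/"]
      volumeMounts:
        - name: biz-code
          mountPath: /target
  containers:
    - name: app
      image: registry.example.com/payment-app:1.0.0
      volumeMounts:
        - name: biz-code
          mountPath: /opt/app/modules   # app loads modules from here
```

The business module now has its own image and its own build pipeline, so its build is decoupled from the main application even though, at this point, they still release together.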

At this point, however, the businesses are still released and operated together with the application, so we need to separate those as well.

Application intimacy can be roughly divided into three levels:

  • Super intimate: in the same process, communicating through function calls
  • In different containers of the same Pod, communicating through IPC
  • In the same network, communicating through RPC

Based on each business’s characteristics, we can gradually split its code out into an IPC or RPC service, so it can be released and operated independently. At this point, the build, release, and operations of the application container are fully decoupled.
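The third level, an independently released RPC service, might look like the sketch below: a Deployment plus a Service so the main application can reach the extracted business code through a stable name. All names and images are hypothetical:

```yaml
# Sketch: a business module split out as an independent RPC service,
# so it can be released and operated on its own schedule.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tmall-payment            # hypothetical extracted business service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tmall-payment
  template:
    metadata:
      labels:
        app: tmall-payment
    spec:
      containers:
        - name: service
          image: registry.example.com/tmall-payment:0.1.0
          ports:
            - containerPort: 9090
---
apiVersion: v1
kind: Service
metadata:
  name: tmall-payment            # main app calls it via this stable name
spec:
  selector:
    app: tmall-payment
  ports:
    - port: 9090
      targetPort: 9090
```

Rolling out a new version of this Deployment touches neither the main application nor the other businesses it serves.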

(3) IaC and GitOps

The third step looks at development and operations together. A thorny problem in many R&D scenarios is that each environment and each business has many special configurations of its own, and release and operations staff must constantly modify or select the correct configuration for the situation. This configuration is really part of the release, just like the application code itself, and maintaining it through a console in the traditional way is very expensive.

In a cloud-native context, we believe IaC (Infrastructure as Code) and GitOps are the better choice. In addition to a code repository for each application, we keep an IaC repository containing the application’s image version and all of its configuration. When a code change needs to be released or a configuration changes, the change is pushed to the IaC repository as a commit. The GitOps engine automatically detects the IaC change, translates it into an OAM-compliant configuration, and applies the change to the corresponding environment based on the OAM model. Both development and operations can see exactly what changed from the IaC repository’s version history, and every release is complete.
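An entry in such an IaC repository might look roughly like the OAM application manifest below (KubeVela-style; the component type, field values, and hostnames are illustrative assumptions, not the team's actual configuration):

```yaml
# Sketch of an IaC-repository entry: image version and configuration
# live together as code; the GitOps engine renders it via the OAM model.
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: payment-app
spec:
  components:
    - name: payment
      type: webservice
      properties:
        image: registry.example.com/payment-app:1.0.0  # bump to release
        env:
          - name: DB_HOST
            value: db.staging.example.com              # per-env config
```

Releasing becomes a Git commit that bumps the image tag or edits an `env` value, and the repository history doubles as a complete, auditable release record.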

(4) BaaS of resources

The final step is turning resources into BaaS. Consider how an application normally consumes a resource. We submit a resource request in the corresponding console, describing the specification and requirements we need; after approval we receive the resource’s connection string and credentials and add them to the application configuration; any later change means going back to the console and synchronizing the approval with a code release; and operating and monitoring the resource usually happens in yet another console. As resources diversify, the operations cost becomes very high, especially when building a new site. Following the principle of describing resources declaratively and using them on demand, we simplified resource usage for all applications by defining these resources in IaC. Every resource is declaratively described, enabling intelligent management and on-demand use. At the same time, all of our resources are common cloud resources with standard protocols, which greatly reduces migration cost and allowed us to move business teams onto cloud-native infrastructure gradually. The two key points of resource BaaS are:

  • Describe resource requirements declaratively, manage them intelligently, and use them on demand
  • Adopt common cloud resources and align with standard protocols
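A declarative resource claim could take a shape like the sketch below. This is a hypothetical custom resource, not a real Alibaba Cloud API; it only illustrates the pattern of stating requirements in IaC and letting a controller provision the resource and hand back credentials:

```yaml
# Hypothetical declarative resource claim (illustrative only): the
# application states what it needs; a controller provisions the database
# and writes the connection details into a Secret the app can mount.
apiVersion: baas.example.com/v1
kind: DatabaseClaim
metadata:
  name: payment-db
spec:
  engine: mysql
  version: "8.0"
  storageGB: 50
  writeConnectionSecretTo:
    name: payment-db-conn        # app reads credentials from this Secret
```

With claims like this checked into the IaC repository, standing up a new site means applying the same manifests against a new environment instead of repeating console requests and approvals.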

3. Cloud Effect drives efficient implementation of cloud native DevOps

What we have shared above is an internal Alibaba practice, which relies on its internal R&D collaboration platform, Aone. The public cloud version of Aone is called Aliyun Cloud Effect. How can we implement cloud-native DevOps using Cloud Effect?

As the previous example shows, implementing cloud-native DevOps is a systematic effort involving methodology, architecture, collaboration, and engineering; within that effort, cloud-native DevOps itself belongs to the category of lean delivery.

Above is a picture of Cloud Effect’s cloud-native DevOps solution. Here, we divide users into two roles:

  • Technical lead or architect
  • Engineers, including development, testing, operations, and so on

A technical lead or architect needs to define and govern the enterprise’s R&D activities as a whole. At a high level, the R&D process must be operable, observable, governable, and changeable.

First, he defines the enterprise’s R&D collaboration model, for example Agile R&D or Lean Kanban. Second, he needs to understand the overall product architecture: which cloud products to use, and how those products are coordinated and managed. Third, he decides the team’s R&D mode: how to collaborate well and how to control R&D quality. Fourth, he determines the release strategy: grayscale release or blue-green deployment, what the grayscale policy is, and so on. Finally, he sets the monitoring strategy for services: which monitoring platform services connect to, how service status is detected, global monitoring configuration, and so on.

Front-line development, test, and operations engineers care about a smooth and efficient workflow. Once the Cloud Effect project-collaboration platform receives a requirement or task, engineers can code, commit, build, integrate, release, and test through Cloud Effect, deploying to the staging and production environments, so the R&D mode and release strategy configured by the administrator are genuinely enforced. Each environment is triggered and advanced automatically, with no manual coordination needed.

The data generated throughout the development process forms an organic whole, producing rich insights that drive the team toward continuous improvement. When a team hits a bottleneck or gets stuck during R&D, it can also obtain professional diagnosis and R&D guidance from the Cloud Effect expert team.

To sum up, the Cloud Effect cloud-native DevOps solution deeply integrates a complete DevOps toolchain, guided by the ALPD methodology and grounded in experts’ best practices, to help enterprises step up to cloud-native DevOps.

Next, let’s look at a specific case.

An Internet company has an R&D team of about 30 people and no full-time operations staff. Its products comprise more than 20 microservices and dozens of front-end applications (web, mini-programs, apps, etc.). Its business is growing very fast, and facing rapidly growing customer numbers and demands, the original Jenkins + ECS script-based deployment approach gradually could no longer keep up, particularly with the problem of zero-downtime deployment and upgrades. This led the team to Cloud Effect and eventually a full migration to cloud-native DevOps.

The R&D team faces three major pain points:

  • A large number of customers with urgent demands
  • No full-time operations staff, and cloud-native technologies such as Kubernetes have a steep learning curve
  • Complex IT infrastructure and time-consuming delivery

To solve these problems, Cloud Effect approached the migration from three angles: basic capability, release capability, and operations capability.

First, Aliyun ACK was introduced to upgrade the infrastructure on top of the existing ECS resources, and the applications were containerized. For service governance and application architecture, the stack was simplified from the full Spring Cloud suite to Spring Boot, with service discovery and governance supported by standard Kubernetes capabilities.

Second, automated container deployment was implemented through Cloud Effect pipelines, combined with a grayscale deployment strategy to achieve gray release, automatic scaling, and automatic restarts on failure. Zero-downtime deployment and fast rollback of any release were also achieved on the pipeline, which saved machine cost and compensated for the lack of full-time operations staff.
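The zero-downtime part of such a pipeline typically rests on a Kubernetes rolling-update strategy plus a readiness probe, sketched below with hypothetical names and paths:

```yaml
# Sketch of a zero-downtime rolling update: maxUnavailable: 0 keeps
# full capacity during the rollout, and the readiness probe gates
# traffic so only healthy new Pods receive requests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service            # hypothetical service name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0          # never drop below the desired replicas
      maxSurge: 1                # add one new Pod at a time
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: service
          image: registry.example.com/order-service:1.4.2
          readinessProbe:        # route traffic only once the Pod is ready
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
```

Rollback is then a single step, for example `kubectl rollout undo deployment/order-service`, which the pipeline can invoke automatically when a release fails its checks.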

Third, the automated Cloud Effect pipeline and branch protection standardized the R&D mode, including code review, code scanning, and test quality gates, improving feedback efficiency and release quality.

Below is an architectural diagram of the overall solution.

4. Cloud native DevOps upgrade path

We have divided cloud native DevOps into five phases.

The first stage: fully manual delivery and operations. This is the starting point: the application architecture is not service-oriented; there is no cloud infrastructure, or IaaS only; there is no continuous integration and no test automation; deployment, release, and operations are all manual. Very few enterprises remain at this stage.

The second stage: tool-assisted delivery and operations. First, the application architecture becomes service-oriented, adopting microservices to improve service quality. Second, isolated tools such as GitLab and Jenkins are introduced to solve parts of the problem. Continuous integration of individual modules begins, but there are generally no automated quality gates, and releases are merely assisted by automation tools.

The third stage: limited continuous delivery and automated operations. Basic capability is further strengthened: the infrastructure is containerized and built on CaaS. A complete toolchain, such as the Cloud Effect DevOps platform, is introduced to connect R&D data end to end. Releases can be deployed continuously, though with some human intervention. Automated testing becomes mainstream, services as a whole are observable, and operations become service-oriented and declarative.

The fourth stage: continuous delivery with manually assisted self-operations. Developers focus further on business development. Serverless architectures are widely adopted in the application layer, and continuous deployment becomes unattended. Grayscale release and rollback are automated as far as possible, with occasional intervention. Observability is upgraded from the application level to the business level, and partial self-operation is achieved with human assistance.

The fifth stage: full-link continuous delivery and self-operations. This is the ultimate goal we are pursuing. At this stage, all applications and infrastructure adopt serverless architectures, and end-to-end continuous delivery is unattended. Release rollback and grayscale are fully automated, and infrastructure and services completely operate and maintain themselves. Developers truly only need to care about developing and iterating the business.

However, the devil is in the details: there are still many problems to solve in a real implementation. With the help of a tool platform such as Cloud Effect and expert consulting based on ALPD, teams can avoid detours and reach these goals faster.

This article is original content from Alibaba Cloud and may not be reproduced without permission.