Brief introduction to the seven principles of cloud native architecture

Summary: As an architectural pattern, cloud native architecture governs the core of application architecture through a set of principles. These principles help technical leads and architects select technologies more efficiently and accurately, as described below.

Principle of service orientation

In software development, once the code base and the development team grow beyond a certain size, the application has to be refactored. Separating concerns through modularization and componentization reduces application complexity, improves development efficiency, and lowers maintenance costs.

As shown in Figure 1, as the business grows, the load that a single application can bear gradually reaches its upper limit. Even if the application is reworked to break through the bottleneck of vertical scaling (scale-up) and to support horizontal scaling (scale-out), it will still run into computational complexity and storage capacity problems under globally concurrent access. The monolithic application therefore has to be split further and re-divided into distributed applications along business boundaries, so that applications no longer share data directly but communicate through agreed contracts, which improves scalability.

Figure 1: Service-oriented scaling of applications

The principle of service-oriented design means using a service-oriented architecture to split business units that have different life cycles, so that each unit can iterate independently. This speeds up overall iteration and keeps iteration stable. A service-oriented architecture also adopts interface-oriented programming, which increases software reuse and strengthens horizontal scalability. In addition, the principle emphasizes abstracting the relationships between business modules at the architectural level, so that policy control and governance can be applied to service traffic rather than network traffic, regardless of the programming language in which the services are written.
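The idea of communicating through agreed contracts and interface-oriented programming can be illustrated with a minimal sketch. The service names, the `StockReply` shape, and the in-memory implementation are all hypothetical; a real system would put a network protocol behind the same contract.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class StockReply:
    """Response shape agreed between the two services (hypothetical)."""
    sku: str
    available: int


class InventoryContract(Protocol):
    """The agreed contract: callers depend on this, not on an implementation."""

    def check_stock(self, sku: str) -> StockReply: ...


class InMemoryInventory:
    """One possible implementation; its data stays private to the service."""

    def __init__(self, stock):
        self._stock = dict(stock)

    def check_stock(self, sku: str) -> StockReply:
        return StockReply(sku=sku, available=self._stock.get(sku, 0))


class OrderService:
    """Depends only on the contract, so the inventory side can change freely."""

    def __init__(self, inventory: InventoryContract):
        self._inventory = inventory

    def can_order(self, sku: str, quantity: int) -> bool:
        return self._inventory.check_stock(sku).available >= quantity


orders = OrderService(InMemoryInventory({"book-1": 3}))
print(orders.can_order("book-1", 2))  # True
```

Because `OrderService` never touches the inventory data directly, the inventory implementation can be replaced or scaled independently as long as the contract holds.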

Service-oriented design has been practiced successfully in the industry. The most influential and widely praised example is Netflix’s large-scale microservice practice in its production system. Through it, Netflix has not only served more than 167 million subscribers worldwide and carried more than 15% of global Internet bandwidth traffic, but has also contributed outstanding microservice components such as Eureka, Zuul, and Hystrix to the open source community.

Service orientation is practiced not only by overseas companies; companies in China are also well aware of it. With the growth of the Internet in recent years, both emerging Internet companies and large traditional enterprises have accumulated good practices and success stories. Alibaba’s service-oriented practice started in 2008 with the Colorful Stone project and, after more than 10 years of development, has steadily supported each year’s promotion events. Taking “Double 11” in 2019 as an example, Alibaba’s distributed system reached a peak of 544,000 transactions per second, and real-time computing processed 2.55 billion transactions per second. Alibaba has shared its service-oriented practice with the industry through open source projects such as Apache Dubbo, Nacos, Sentinel, Seata, and ChaosBlade. These components are also integrated with Spring Cloud, and Spring Cloud Alibaba has become the successor to Spring Cloud Netflix.

With the rise of cloud native, the principle of service orientation keeps evolving and is being applied in real businesses, but enterprises also run into many challenges when putting it into practice. For example, compared with a self-built data center, services on a public cloud may run on a huge resource pool, which significantly raises the overall rate of machine failures; pay-as-you-go billing increases the frequency of scale-out and scale-in operations; and the new environment requires applications to start faster, to have no strong dependencies on one another, and to be schedulable at random across nodes of different specifications. Many such practical issues have to be considered, but it is foreseeable that they will be resolved one by one as cloud native architecture evolves.

Principle of elasticity

The principle of elasticity means that the deployment scale of a system adjusts automatically as business volume changes, without fixed hardware and software resources having to be prepared according to up-front capacity planning. Good elasticity not only changes an enterprise’s IT cost model, so that it no longer pays for extra idle hardware and software resources, but also better supports explosive business growth, so that opportunities are no longer missed for lack of resources.

In the cloud native era, the threshold for building IT systems has dropped sharply, which greatly improves how efficiently enterprises turn business plans into products and services. This is especially evident in the mobile Internet and gaming industries, where there are many cases of an app’s user base growing exponentially once it becomes a hit. Exponential business growth severely tests the performance of an enterprise’s IT systems. In traditional architectures, developers and operations staff typically scramble to tune system performance, but even their best efforts cannot fully remove the bottlenecks; eventually the system cannot cope with the influx of users and the application crashes.

Besides exponential growth, peaks in business traffic are another major challenge. For example, a movie ticket booking system sees more traffic in the afternoon than in the early morning and more on weekends than on weekdays, and food delivery ordering systems tend to peak around lunch and dinner. In traditional architectures, coping with such pronounced peaks means that the enterprise has to provision and pay for large amounts of computing, storage, and network resources in advance for peak traffic, and these resources sit idle most of the time.

Therefore, in the cloud native era, enterprises should make their application architecture elastic as early as possible when building IT systems, so that they can respond flexibly to different scenarios as the business grows rapidly and take full advantage of the technology and cost benefits of cloud native.

To build an elastic system architecture, four basic principles need to be followed.

1. Split the application by function

A large, complex system may consist of hundreds or even thousands of services. When designing the architecture, architects should follow one rule: keep related logic together and split unrelated logic into separate services. Services find each other through standard service discovery and communicate through standard interfaces. Loose coupling allows each service to scale elastically on its own and avoids cascading failures between upstream and downstream services.
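As a rough illustration of services finding each other through service discovery, the sketch below keeps an in-memory registry and hands out instances in round-robin order. The `ServiceRegistry` class, its method names, and the addresses are assumptions made for this example and do not correspond to any particular discovery product.

```python
from itertools import count


class ServiceRegistry:
    """In-memory stand-in for a service discovery system (hypothetical API)."""

    def __init__(self):
        self._instances = {}  # service name -> list of "host:port" addresses
        self._cursors = {}    # service name -> round-robin counter

    def register(self, service, address):
        self._instances.setdefault(service, []).append(address)
        self._cursors.setdefault(service, count())

    def deregister(self, service, address):
        self._instances.get(service, []).remove(address)

    def resolve(self, service):
        """Return one registered address, rotating across instances."""
        instances = self._instances.get(service, [])
        if not instances:
            raise LookupError(f"no live instance registered for {service!r}")
        return instances[next(self._cursors[service]) % len(instances)]


registry = ServiceRegistry()
registry.register("inventory", "10.0.0.1:8080")
registry.register("inventory", "10.0.0.2:8080")
```

Callers ask the registry for `"inventory"` instead of hard-coding an address, so instances can be added or removed as the service scales without any change on the caller side.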

2. Support horizontal sharding

Splitting an application by function does not fully solve the elasticity problem. After an application has been decomposed into many services, any single service will eventually hit a bottleneck as user traffic grows. Each service therefore needs to be designed for horizontal sharding, so that it can be divided into logical units that each handle part of the user traffic and the service as a whole scales well. The biggest challenge lies in the database: because a database is stateful, partitioning data sensibly and providing a correct transaction mechanism is a very complex undertaking. In the cloud native era, however, the cloud native database services offered by cloud platforms solve most of these hard distributed-systems problems, so an enterprise that builds its elastic system on platform capabilities obtains an elastic database layer as well.
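One simple way to divide a service into logical units is to route each user to a shard by hashing a stable key. The sketch below assumes a user ID as the key and a fixed shard count; both are illustrative choices rather than anything prescribed by the article.

```python
import hashlib


def shard_for(user_id: str, shard_count: int) -> int:
    """Map a user ID to a shard index in [0, shard_count) using a stable hash."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer; the result is the same
    # on every machine and in every process, unlike Python's built-in hash().
    return int.from_bytes(digest[:8], "big") % shard_count


# The same user always lands on the same logical unit.
print(shard_for("user-42", 4) == shard_for("user-42", 4))  # True
```

Plain modulo hashing remaps most keys when the shard count changes; schemes such as consistent hashing reduce that movement, at the cost of more machinery.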

3. Automated deployment

Bursts of traffic are usually unpredictable, and the common remedy has been to scale the system out manually so that it can serve more users. Once the architecture has been fully split, an elastic system also needs automated deployment, so that scale-out is triggered automatically by predefined rules or by external traffic-burst signals. This meets the tight time window for absorbing a burst, and scaling in automatically after the peak reduces resource usage and operating cost.
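A predefined scaling rule can be as simple as a proportional formula: keep the load per replica near a target by resizing the replica count, within fixed bounds. The function below is a minimal sketch of such a rule; the parameter names and the default bounds are assumptions for illustration.

```python
import math


def desired_replicas(current_replicas, load_per_replica, target_load_per_replica,
                     min_replicas=1, max_replicas=20):
    """Proportional scaling rule, clamped to [min_replicas, max_replicas]."""
    if target_load_per_replica <= 0:
        raise ValueError("target load must be positive")
    raw = math.ceil(current_replicas * load_per_replica / target_load_per_replica)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(4, 90, 60))  # 6: scale out during a burst
print(desired_replicas(4, 15, 60))  # 1: scale in after the peak
```

The upper bound caps cost when a burst is larger than planned, and the lower bound keeps a minimum of capacity available after scale-in.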

4. Support service degradation

An elastic system needs to plan for abnormal situations in advance, for example by managing services in tiers. When the elasticity mechanism fails, elastic resources run short, or peak traffic exceeds expectations, the architecture must be able to degrade service: it lowers the quality of some non-critical services or switches off some enhancements to free resources, and expands the capacity behind important functions so that the product’s main functions are not affected.
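The degradation decision described above can be sketched as a small planning function: critical features always stay on, and non-critical ones are enabled only while capacity remains. The `Feature` type, the cost units, and the cheapest-first ordering are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    critical: bool
    cost: int  # capacity units the feature consumes (hypothetical unit)


def plan_degradation(features, capacity):
    """Keep every critical feature; add non-critical ones while capacity lasts."""
    enabled = [f for f in features if f.critical]
    used = sum(f.cost for f in enabled)
    optional = sorted((f for f in features if not f.critical), key=lambda f: f.cost)
    for feature in optional:  # cheapest enhancements are restored first
        if used + feature.cost <= capacity:
            enabled.append(feature)
            used += feature.cost
    return [f.name for f in enabled]


catalog = [
    Feature("checkout", critical=True, cost=50),
    Feature("search", critical=True, cost=30),
    Feature("recommendations", critical=False, cost=25),
    Feature("reviews", critical=False, cost=10),
]
print(plan_degradation(catalog, capacity=100))  # ['checkout', 'search', 'reviews']
```

With 100 capacity units, recommendations are switched off to leave room for the main functions; with enough capacity, every feature is enabled again.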

There are many successful cases of large-scale elastic systems in China and abroad, the most representative being Alibaba’s annual “Double 11” promotion. To cope with a traffic peak 100 times the normal level, Alibaba buys elastic resources from Alibaba Cloud every year to deploy its applications, releases them after the event, and pays on demand, which greatly reduces the resource cost of the promotion. Another example is Sina Weibo’s elastic architecture. When a trending social event occurs, Sina Weibo uses its elastic system to scale application containers out to Alibaba Cloud to handle the surge of search and repost requests. With scale-out that responds within minutes, it greatly reduces the resource cost generated by trending topics.

As cloud native technology develops, ecosystems such as FaaS and Serverless are maturing, and building large-scale elastic systems is becoming easier. When an enterprise adopts FaaS, Serverless, and similar concepts as design principles for its system architecture, the system gains elastic scaling, and the enterprise no longer has to pay extra for “maintaining the elastic system itself”.

Observability principle

Unlike the passive capabilities provided by monitoring, service probing, and APM (Application Performance Management) systems, observability places more emphasis on initiative. In distributed systems such as cloud computing, logs, traces, and metrics make it possible to see clearly the duration, return values, and parameters of the multiple service calls triggered by a single click in an app, and even to drill down into every third-party software call, SQL request, node topology, and network response. With this capability, operations, development, and business staff can track how the software is running in real time and gain unprecedented correlation analysis capabilities, continuously improving business health and user experience.

With the broad adoption of cloud computing, enterprise application architecture has changed significantly and is gradually moving from traditional monolithic applications to microservices. In a microservice architecture, loose coupling between services makes version iterations faster and cycles shorter. At the infrastructure layer, Kubernetes and similar platforms have become the default for containers, and services can be continuously integrated and deployed through pipelines. These changes minimize the risk of service changes and improve R&D efficiency.

In a microservice architecture, a failure can occur anywhere in the system, so observability has to be designed in systematically to reduce MTTR (mean time to repair).

To build observability systems, there are three basic principles to follow.

1. Comprehensive data collection

Metrics, traces, and logs are the “three pillars” of a complete observability system. Observability requires that all three types of data be collected, analyzed, and displayed in full.

(1) Metrics

Metrics are KPI values measured over multiple consecutive time periods. They are usually divided along the layers of the software architecture into system resource metrics (such as CPU utilization, disk usage, and network bandwidth), application metrics (such as error rate, service level agreement (SLA), service satisfaction (Apdex), and average latency), and business metrics (such as number of user sessions, order volume, and revenue).
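As one concrete application metric, Apdex turns a set of response times into a satisfaction score between 0 and 1: requests at or below a threshold T count as satisfied, requests between T and 4T count half, and slower requests count zero. The sketch below assumes response times in seconds and a threshold chosen by the team.

```python
def apdex(response_times, threshold):
    """Apdex score: (satisfied + tolerating / 2) / total samples."""
    if not response_times:
        raise ValueError("at least one sample is required")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


samples = [0.1, 0.2, 0.6, 1.0, 3.0]  # seconds, made-up values
score = apdex(samples, threshold=0.5)  # 2 satisfied, 2 tolerating, 1 frustrated
```

For these samples the score works out to (2 + 2/2) / 5 = 0.6.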

(2) Tracing

Tracing uses a unique trace ID to record and reconstruct the entire course of a distributed call, following the data from the browser or mobile client through the servers to the execution of SQL or the initiation of a remote call.
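The role of the trace ID can be shown with a minimal in-process sketch: every step of a call records a span that carries the same trace ID and points to its parent, so the whole path can be restored later. The `Tracer` and `Span` names and the span names are hypothetical; production systems propagate the ID across processes in request headers.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str              # shared by every step of one distributed call
    span_id: str               # unique per step
    parent_id: Optional[str]   # None for the entry point
    name: str


class Tracer:
    def __init__(self):
        self.spans = []

    def start_trace(self, name):
        return self._record(uuid.uuid4().hex, None, name)

    def start_child(self, parent, name):
        # The child inherits the trace ID and remembers who called it.
        return self._record(parent.trace_id, parent.span_id, name)

    def _record(self, trace_id, parent_id, name):
        span = Span(trace_id, uuid.uuid4().hex[:16], parent_id, name)
        self.spans.append(span)
        return span

    def call_path(self, trace_id):
        """Restore the recorded steps of one distributed call, in order."""
        return [s.name for s in self.spans if s.trace_id == trace_id]


tracer = Tracer()
root = tracer.start_trace("GET /checkout")
rpc = tracer.start_child(root, "order-service.create")
sql = tracer.start_child(rpc, "SQL insert into orders")
```

Filtering the collected spans by `root.trace_id` yields the full path from the entry request down to the SQL statement, even when spans from other requests are interleaved.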

(3) Logs

Logs usually record the execution of a running application, code debugging output, errors, exceptions, and similar information; NGINX logs, for example, can record the remote IP, request time, and data size. Log data needs to be stored centrally and be searchable.

2. Data correlation analysis

It is especially important for an observability system to create more correlations between its data. When a fault occurs, effective correlation analysis makes it possible to delimit and locate the fault quickly, which improves the efficiency of fault handling and reduces unnecessary losses. In general, information such as the application’s server address and service interface is attached as additional attributes to metrics, call chains, and logs, and the observability system is given some ability to be customized, so that it can flexibly meet the needs of more complex operations scenarios.

3. Unified monitoring view and presentation

Monitoring views in multiple forms and dimensions help operations and development staff quickly find bottlenecks and eliminate hidden risks. Monitoring data should be presented not only as trend charts, bar charts, and the like; the views should also support drill-down analysis and customization to suit complex real-world scenarios, so as to meet the needs of operations monitoring, release management, troubleshooting, and more.

As cloud native technology develops, scenarios based on heterogeneous microservice architectures will grow more numerous and more complex, and observability is the foundation on which all automation capabilities are built. Only with comprehensive observability can system stability truly be improved and MTTR reduced. How to build a full-stack observability system covering system resources, containers, networks, applications, and business is therefore a question every enterprise needs to consider.

Principle of resilience

Resilience refers to the ability of software to withstand anomalies in the hardware and software components it depends on. These anomalies typically include hardware failures, hardware resource bottlenecks (such as exhausted CPU or network card bandwidth), business traffic beyond the software’s design capacity, failures or disasters that affect the data center, failures of dependent software, and other factors that may make the business unavailable.

Once a business goes live, it may encounter all kinds of uncertain inputs and unstable dependencies for most of its running time. When such abnormal scenarios occur, the business needs to preserve service quality as far as possible to meet today’s “always online” expectations, typified by Internet services. The core design idea of resilience is therefore design for failure: considering how, under all kinds of abnormal dependencies, to reduce their impact on the system and on service quality and to return to normal as quickly as possible.

Practices and common architectures for resilience include asynchronous service calls, retry/rate limiting/degradation/circuit breaking/backpressure, primary-secondary mode, cluster mode, high availability across multiple AZs (Availability Zones), unitization, cross-region disaster recovery, and multi-site active-active disaster recovery.
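Of the practices listed above, circuit breaking is easy to show in a few lines. The sketch below is a simplified, single-threaded breaker written for illustration: after a number of consecutive failures it stops calling the dependency and returns a fallback, then allows a trial call once a cool-down has passed. The class name, thresholds, and injectable clock are assumptions of this example.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after  # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None           # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                return fallback()        # open: fail fast, spare the dependency
            self._opened_at = None       # half-open: let one trial call through
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            return fallback()            # degrade instead of propagating the error
        self._failures = 0               # a success closes the circuit fully
        return result
```

Because the failure count is only reset on success, a failed trial call re-opens the circuit immediately, which keeps pressure off a dependency that has not yet recovered.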

The following example shows in detail how resilience is designed into a large system. “Double 11” is a battle Alibaba cannot afford to lose, so its system design has to follow the principle of resilience strictly. For example, the unified access layer enforces security policies through traffic scrubbing to defend against attacks, and fine-grained rate limiting keeps peak traffic stable so that the back end keeps working normally. To improve overall high availability, Alibaba achieves cross-region active-active disaster recovery through its unit mechanism and same-city active-active disaster recovery through its same-city mechanism, maximizing the service quality of each IDC (Internet Data Center). Within a single IDC, microservices and container technology enable stateless migration of workloads, multi-replica deployment improves availability, and messaging provides asynchronous decoupling between microservices, reducing dependencies between services and raising system throughput. Each application, for its part, maps out its own dependencies, sets up degradation switches, and keeps hardening itself through failure drills, ensuring that Alibaba’s “Double 11” promotion runs normally and stably.

As digitalization accelerates, more and more digital services are becoming infrastructure for the functioning of society and the economy. At the same time, the systems that support them are growing more complex, and the risk from uncertain quality in the services they depend on keeps rising. Systems must therefore be designed with sufficient resilience to cope with all kinds of uncertainty. Resilient design is especially important for core business paths in core industries (such as financial payment and e-commerce transaction paths), for business traffic entry points, and for paths with complex dependencies.

Principle of full-process automation

Technology is a “double-edged sword”. Containers, microservices, DevOps, and large numbers of third-party components reduce the complexity of distributed systems and speed up iteration, but they also increase the complexity of the software stack and the number of components, which inevitably makes software delivery more complex. Without proper control, applications cannot benefit from cloud native technology. By practicing IaC, GitOps, OAM, and Operators, and by using the many automated delivery tools in CI/CD pipelines, enterprises can standardize their internal software delivery processes, or automate them on top of that standardization. In other words, the entire delivery and operation of software is automated through self-describing configuration data and an end-state-oriented delivery process.

There are four basic principles that need to be followed in order to achieve large-scale automation.

1. Standardization

Automation starts with standardizing the infrastructure on which the business runs, through containerization, IaC, and OAM, and then standardizing the definition of applications and even the delivery process. Only with standardization can the business be freed from dependence on specific people and platforms and be operated in a unified, large-scale, automated way.

2. End-state orientation

End-state orientation means describing the desired configuration of infrastructure and applications declaratively, continuously watching the actual running state, and letting the system adjust itself repeatedly until it converges on the end state. The principle emphasizes that changes should not be made by assembling a series of procedural commands through ticketing and workflow systems; instead, the end state is set and the system decides for itself how to carry out the change.
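The difference between issuing procedural commands and declaring an end state can be sketched with a tiny reconciliation loop. Here the desired state is just a mapping from service name to replica count, and the "system" is a dictionary; both are simplifications chosen for this example.

```python
def plan(desired, actual):
    """Compare declared and observed replica counts and derive the actions."""
    actions = []
    for name in sorted(desired.keys() | actual.keys()):
        want, have = desired.get(name, 0), actual.get(name, 0)
        if want > have:
            actions.append(("start", name, want - have))
        elif want < have:
            actions.append(("stop", name, have - want))
    return actions


def reconcile(desired, actual, max_rounds=10):
    """Apply planned actions repeatedly until the observed state matches."""
    state = dict(actual)
    for _ in range(max_rounds):
        actions = plan(desired, state)
        if not actions:
            return state  # converged on the declared end state
        for verb, name, amount in actions:
            state[name] = state.get(name, 0) + (amount if verb == "start" else -amount)
            if state[name] == 0:
                del state[name]
    raise RuntimeError("state did not converge")


desired = {"web": 3, "worker": 2}
observed = {"web": 1, "cron": 1}
print(plan(desired, observed))
# [('stop', 'cron', 1), ('start', 'web', 2), ('start', 'worker', 2)]
```

The operator only edits `desired`; the loop works out which instances to start or stop, and running it again on an already converged state produces no actions.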

3. Separation of concerns

How well automation works depends not only on the capabilities of tools and systems but also on the people who set goals for the system, so the right goal setters must be found. When describing the end state, the configuration that concerns each main role (application development, application operations, and infrastructure operations) should be separated. Each role sets only the configuration it cares about and is good at, which keeps the overall end state reasonable.

4. Design for failure

Full-process automation requires that the automated process be controllable and that its impact on the system be predictable. An automated system cannot be expected to be error-free, but it can be ensured that even when something goes wrong, the scope of the error is controllable and acceptable. An automated system therefore needs to follow the same best practices as manual changes: changes can be rolled out gradually (gray release), their results are observable, they can be rolled back quickly, and their impact is traceable.
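The four properties named above (gradual rollout, observable results, fast rollback, traceability) can be combined in one short sketch. The fleet is modeled as a dictionary of instance name to version, and the health check is a callback supplied by the caller; the step percentages are arbitrary example values.

```python
def gray_rollout(fleet, new_version, is_healthy, steps=(10, 50, 100)):
    """Upgrade `fleet` (instance name -> version) in growing batches.

    Returns (succeeded, history). On failure the fleet is restored in place.
    """
    previous = dict(fleet)  # snapshot that makes rollback possible
    names = sorted(fleet)
    history = []            # trace of how far each step went
    for percent in steps:
        target = (percent * len(names) + 99) // 100  # ceiling division
        for name in names[:target]:
            fleet[name] = new_version
        history.append((percent, target))
        if not is_healthy(fleet):  # observe the result before continuing
            fleet.clear()
            fleet.update(previous)  # fast rollback to the snapshot
            return False, history
    return True, history


fleet = {f"node-{i}": "v1" for i in range(10)}
succeeded, history = gray_rollout(fleet, "v2", is_healthy=lambda f: True)
print(succeeded, history)  # True [(10, 1), (50, 5), (100, 10)]
```

If the health check fails at any step, only the instances upgraded so far were ever exposed to the change, and the history shows at which step the rollout stopped.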

Self-healing of business instances is a typical process automation scenario. After a business moves to the cloud, the platform greatly reduces the probability of server failure through various technical means, but it cannot eliminate software failures in the business itself. These include crashes caused by defects in the application, out-of-memory (OOM) errors caused by insufficient resources, and hangs caused by excessive load, as well as problems in system software such as the kernel and daemon processes and interference from other applications or jobs co-located on the same machine. As the business grows, the risk of software failure rises. Traditional troubleshooting requires operations staff to step in and perform repair actions such as restarting or relocating instances, but at large scale they struggle to keep up with all the faults and may even have to work overnight. Service quality is hard to guarantee, and customers, developers, and operations staff are all left dissatisfied.

To make fault handling automatic, cloud native applications require developers to describe, through standard declarative configuration, how to check the application’s health, how to start the application, and which service discovery and CMDB (Configuration Management Database) entries it needs to mount and register after start-up. With this standard configuration, the cloud platform can probe applications repeatedly and repair them automatically when a fault occurs. To guard against false positives in fault detection itself, the application’s operations staff can also set the proportion of instances allowed to be unavailable according to their capacity, so that the platform preserves business availability while recovering from faults automatically. Instance self-healing frees developers and operations staff from tedious operations work and handles faults promptly, ensuring business continuity and high service availability.
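The safeguard against false positives can be sketched as a restart budget: however many instances the health check flags, the platform repairs at most the configured share of the fleet at a time. The function name, the percentage parameter, and the choice to pick instances in name order are assumptions of this example.

```python
def pick_restarts(health, max_unavailable_percent):
    """Choose which failing instances may be restarted right now.

    `health` maps instance name -> True if its health check passed.
    """
    budget = len(health) * max_unavailable_percent // 100
    failing = sorted(name for name, passed in health.items() if not passed)
    return failing[:budget]  # never take more than the budget offline at once


health = {f"i{n}": n not in (1, 4, 7) for n in range(10)}
print(pick_restarts(health, max_unavailable_percent=20))  # ['i1', 'i4']
```

Even if a faulty probe reports every instance as unhealthy, at most 20% of the fleet is restarted in one round, so the remaining instances keep serving traffic.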

Zero trust principle

Traditional security architectures based on the perimeter model build a wall between trusted and untrusted resources; for example, the corporate intranet is trusted and the Internet is not. In this design, once intruders penetrate the perimeter, they can freely access the resources inside it. Cloud native architecture, the spread of remote working, and the use of mobile devices such as phones for work have completely broken the physical perimeter of the traditional security architecture. Because applications and data are hosted in the cloud, employees working from home can also share data with partners.

Today, boundaries are no longer defined by the physical location of an organization, but have expanded to all places where access to an organization’s resources and services is required, and traditional firewalls and VPNs are no longer able to handle these new boundaries reliably and flexibly. Therefore, we need a new security architecture that can flexibly adapt to the characteristics of cloud native and mobile environments, so that data security can be effectively protected no matter where employees work, where devices are connected, and where applications are deployed. If this new security architecture is to be implemented, it will rely on the zero-trust model.

Whereas traditional security architectures assume that everything inside a firewall is secure, the zero-trust model assumes that firewall boundaries have been breached and that every request comes from an untrusted network, so every request needs to be validated. In short, “Never trust, always verify.” In the zero-trust model, each request is strongly authenticated and authorized based on security policies. The user identity, device identity and application identity related to the request will serve as the core information to judge whether the request is secure or not.

If we discuss security architecture around the boundary, then the boundary of traditional security architecture is the physical network, while the boundary of zero-trust security architecture is the identity, which includes the identity of people, devices, applications and so on. To implement a zero-trust security architecture, there are three basic principles to follow.

1. Explicit verification

Every access request is authenticated and authorized. Authentication and authorization should be based on user identity, location, device information, service and workload information, data classification, and anomaly detection. For example, for communication between internal applications in an enterprise, access cannot be granted simply because the source IP is internal; instead, the identity and device information of the source application should be determined and then evaluated against the current policy.
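The internal-IP example can be made concrete with a small authorization check that ignores the network location entirely and decides on verified identity, device state, and policy. The request fields, the service names, and the policy set are hypothetical; how an identity is actually verified (for example, with certificates or tokens) is outside this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccessRequest:
    source_ip: str               # recorded, but never used as proof of identity
    app_identity: Optional[str]  # verified workload identity, None if unverified
    device_compliant: bool
    target: str


# Hypothetical policy: which caller identity may reach which target.
ALLOWED_CALLS = {("order-service", "inventory-api")}


def authorize(request, allowed_calls=ALLOWED_CALLS):
    """Grant access only to a verified identity on a compliant device."""
    if request.app_identity is None or not request.device_compliant:
        return False
    return (request.app_identity, request.target) in allowed_calls


internal_but_anonymous = AccessRequest("10.0.0.7", None, True, "inventory-api")
verified_caller = AccessRequest("203.0.113.9", "order-service", True, "inventory-api")
print(authorize(internal_but_anonymous), authorize(verified_caller))  # False True
```

The request from an internal address is rejected because it carries no verified identity, while the verified caller is accepted regardless of where it connects from.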

2. Least privilege

For each request, grant only the permissions that are necessary at that moment, and make the permission policy adaptive to the current request context. For example, HR employees should have access to HR-related applications but not to finance applications.
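The HR example corresponds to a default-deny permission check: anything not explicitly granted is refused. The roles, application names, and actions below are made up for illustration, and the sketch leaves out the context-adaptive part of the policy.

```python
# Hypothetical grants: role -> application -> allowed actions.
GRANTS = {
    "hr": {"hr-portal": {"read", "write"}},
    "finance": {"ledger": {"read", "write"}},
}


def is_allowed(role, application, action, grants=GRANTS):
    """Default deny: allow only what is explicitly granted to the role."""
    return action in grants.get(role, {}).get(application, set())


print(is_allowed("hr", "hr-portal", "read"))  # True
print(is_allowed("hr", "ledger", "read"))     # False
```

Unknown roles, applications, and actions all fall through to a refusal, so adding a new application grants nothing until someone decides who needs it.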

3. Assume breach

Assume that the physical perimeter has been breached. The blast radius must be strictly contained by dividing the overall network into multiple segments that are aware of users, devices, and applications. Encrypt all sessions, and use data analytics to maintain visibility into the security state.

The evolution from traditional security architecture to zero-trust architecture will have a profound impact on software architecture, which is embodied in the following three aspects.

First, security policies cannot be configured based on IP addresses. In a cloud native architecture, an IP address cannot be assumed to be bound to a service or application, because it may change at any time due to automatic elasticity and similar techniques. An IP address therefore cannot represent an application’s identity, and security policies cannot be built on it.

Second, identity should become infrastructure. Authorizing communication between services and human access to services requires that the identity of the visitor is clearly known. In an enterprise, human identity management is often part of the security infrastructure, but the identity of an application also needs to be managed.

Third, standardize the delivery pipeline. In an enterprise, R&D work is usually distributed: code version management, build, test, and release processes are all relatively independent. This decentralized pattern means the security of services running in production cannot be effectively guaranteed. If version management, build, and release are standardized, the security of application releases can be strengthened centrally.

In general, building a zero-trust model covers identity, devices, applications, infrastructure, network, data, and more. Zero trust is implemented step by step. For example, if none of the traffic inside an organization is encrypted, the first step should be to ensure that traffic from visitors accessing applications is encrypted, and then to encrypt all traffic progressively. If you adopt a cloud native architecture, you can directly leverage the security infrastructure and services provided by the cloud platform to help your enterprise quickly implement a zero-trust architecture.

Principle of continuous architecture evolution

Technology and business are both developing very fast today. In engineering practice, few architectural patterns can be clearly defined at the outset and remain applicable throughout the software life cycle; instead, they need to be refactored continuously within a certain scope to adapt to changing technical and business requirements. Likewise, a cloud native architecture should and must be able to evolve continuously, rather than being a closed architecture that is fixed once designed. In addition to factors such as incremental iteration and sensible target selection, design therefore also needs to take into account architecture governance and risk control standards at the organizational level (for example, an architecture control committee), as well as the characteristics of the business itself. Especially when the business iterates at high speed, particular attention should be paid to keeping architecture evolution and business development in balance.

1. Features and values of an evolutionary architecture

Evolutionary architecture means designing for extensibility and loose coupling from the very beginning of software development, so that subsequent changes are easier and upgrades and refactoring cost less. Such changes can occur at any stage of the software life cycle, including in development practices, release practices, and overall agility.

The fundamental reason evolutionary architectures matter in industrial practice is the consensus in modern software engineering that change is hard to predict and expensive to accommodate. Evolutionary architecture cannot avoid refactoring, but it emphasizes evolvability: when the architecture needs to evolve because of changes in technology, organization, or the external environment, the project as a whole can still follow the principle of strongly bounded contexts, ensuring that the logical divisions described in domain-driven design become physical isolation. Through a standardized and highly scalable infrastructure system, evolutionary architecture adopts many advanced cloud native application architecture practices, such as standardized application models and modular operation and maintenance capabilities, and thereby achieves physical modularity, reusability, and separation of responsibilities across the entire system architecture. In an evolutionary architecture, each service is decoupled from the others at the structural level, so replacing a service is as easy as replacing a Lego brick.
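The Lego analogy can be sketched in code. In the example below the caller depends only on an agreed contract, so one implementation can replace another without any change to the caller. The PaymentService contract and both implementations are hypothetical names chosen for illustration.

```python
from typing import Protocol


class PaymentService(Protocol):
    """Contract that callers depend on (hypothetical, for illustration)."""

    def pay(self, order_id: str, amount_cents: int) -> str: ...


class CardPayment:
    def pay(self, order_id: str, amount_cents: int) -> str:
        return f"card:{order_id}:{amount_cents}"


class WalletPayment:
    def pay(self, order_id: str, amount_cents: int) -> str:
        return f"wallet:{order_id}:{amount_cents}"


def checkout(service: PaymentService, order_id: str, amount_cents: int) -> str:
    """Depends only on the contract, so the implementation can be swapped."""
    return service.pay(order_id, amount_cents)
```

Calling checkout(CardPayment(), "o1", 500) and checkout(WalletPayment(), "o1", 500) exercises two different services through the same call site. In a real system the boundary would be a network API rather than an in-process class, but the decoupling principle is the same.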

2. Application of Evolutionary Architecture

In the practice of modern software engineering, evolutionary architecture has different practices and manifestations at different levels of the system.

In business-oriented application architectures, evolutionary architecture is often inseparable from microservice design. For example, in Alibaba's Internet e-commerce applications (such as the familiar Taobao and Tmall), the whole system is designed as thousands of fine-grained components with clear boundaries. The purpose is to make it easier for developers to make non-breaking changes, and to prevent inappropriate coupling from steering changes in unpredictable directions, which would hinder the evolution of the architecture. As you can see, software built on an evolutionary architecture supports some degree of modularity, which is typically embodied in classic layered architectures and microservice best practices.

At the platform development level, evolutionary architecture is embodied mainly in Capability Oriented Architecture (COA). With the gradual popularization of cloud native technologies such as Kubernetes, standardized cloud native infrastructure is rapidly becoming the capability provider for platform architectures. The Open Application Model (OAM), which builds on this foundation, is a COA practice that starts from the application architecture perspective and further modularizes the standardized infrastructure by capability.
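A capability-oriented design can be sketched roughly as follows: an application component declares the capabilities it needs, and the platform resolves each one to a provider. This is a simplified illustration and not the OAM specification; the Component type, the registry contents, and the bind function are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """An application component and the capabilities it asks the platform for."""
    name: str
    capabilities: list = field(default_factory=list)


# Hypothetical registry: capability name -> provider offered by the platform.
PLATFORM_CAPABILITIES = {
    "autoscaling": "hpa-provider",
    "ingress": "gateway-provider",
}


def bind(component: Component, registry: dict) -> dict:
    """Map each requested capability to its provider; fail on unknown ones."""
    missing = [c for c in component.capabilities if c not in registry]
    if missing:
        raise LookupError(f"{component.name}: unsupported capabilities {missing}")
    return {c: registry[c] for c in component.capabilities}
```

Because the component names only capabilities and never a concrete provider, the platform can replace a provider behind a capability without touching the application definition.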

3. Architecture evolution in cloud native environment

Evolutionary architecture is still in a period of rapid growth and adoption. However, the software engineering community has reached a consensus that the software world is constantly changing and is dynamic rather than static. Architecture is not a simple equation either; it is a snapshot of an ongoing process. Therefore, at both the business application level and the platform development level, evolutionary architecture is an inevitable trend. Many architecture renewal projects in the industry show that when architecture is neglected, keeping an application up to date takes a huge amount of effort. Good architectural planning, by contrast, can reduce the cost of introducing new technologies. This requires applications and platforms to satisfy three things at the architectural level: standardization, separation of responsibilities, and modularization. In the cloud native era, the Open Application Model (OAM) is rapidly becoming an important driver of architecture evolution.

Conclusion

As we can see, cloud native architectures are built and evolved on the core characteristics of cloud computing (for example, elasticity, automation, and resilience), combined with business goals and characteristics, to help enterprises and technical staff fully realize the technical benefits of cloud computing. As the exploration of cloud native continues, its technologies keep expanding and its scenarios keep growing richer, and cloud native architecture will continue to evolve. Throughout these changes, however, the typical architectural design principles remain important guides for architecture design and technology implementation.

This article is original content from Aliyun and may not be reproduced without permission.