Brief introduction: As an architectural pattern, cloud native architecture uses a number of principles to govern the core application architecture. These principles help technical leads and architects select technologies more efficiently and accurately, and they are described in detail in this article.

Servitization principle

In software development, once the amount of code and the size of the development team grow beyond a certain point, the application needs to be refactored. Separating concerns through modularization and componentization reduces application complexity, improves development efficiency, and lowers maintenance costs.

As shown in Figure 1, as the business continues to grow, the capacity of a monolithic application gradually reaches its upper limit. Even if the application is reworked to move past the bottleneck of vertical scaling (scale up) and to support horizontal scaling (scale out), problems with computational complexity and storage capacity remain when the application faces concurrent access from around the globe. Monolithic applications therefore need to be split further into distributed applications along business boundaries, so that the applications no longer share data directly but communicate through agreed contracts, which improves scalability.



Figure 1: Extending an application through servitization

The servitization design principle means using a service-oriented architecture to separate business units that have different life cycles, so that each unit can iterate independently. This speeds up overall iteration and keeps iteration stable. A service-oriented architecture also relies on programming to interfaces, which increases software reuse and strengthens the ability to scale horizontally. In addition, the principle emphasizes abstracting the relationships between business modules at the architectural level, so that policy control and governance can be applied to service traffic rather than network traffic, regardless of the programming language in which each service is written.

The servitization design principle has many success stories in the industry. One of the most influential and most acclaimed is Netflix’s large-scale adoption of microservices in its production systems. Through this practice, Netflix not only serves as many as 167 million subscribers worldwide and accounts for more than 15% of global Internet bandwidth, but has also contributed outstanding microservice components such as Eureka, Zuul, and Hystrix to the open source community.

It is not only overseas companies that practice servitization; companies in China are also highly aware of it. With the growth of the Internet in recent years, both emerging Internet companies and large traditional enterprises have built good practices and success stories in servitization. Alibaba’s servitization practice began with the “Multicolored Stone” project in 2008 and, after more than 10 years of development, has stably supported each year’s major promotion events. Taking the 2019 “Double 11” as an example, Alibaba’s distributed systems handled a peak of 544,000 orders per second, and real-time computing throughput reached 2.55 billion per second. Alibaba has shared its servitization practice with the industry through open source projects such as Apache Dubbo, Nacos, Sentinel, Seata, and ChaosBlade. Meanwhile, Spring Cloud Alibaba, which integrates these components with Spring Cloud, has become the successor to Spring Cloud Netflix.

With the rise of cloud native, the servitization principle continues to evolve and to be applied in real businesses, but enterprises also run into many challenges along the way. For example, compared with a self-built data center, servitization on a public cloud may draw on a huge resource pool, which significantly increases the rate of machine failures; pay-as-you-go billing increases the frequency of scaling operations; and the new environment raises many practical requirements, such as faster application startup, no strong dependencies between applications, and the ability to schedule applications across nodes of different specifications. These issues can, however, be expected to be resolved one by one as cloud native architecture evolves.

Elasticity principle

The elasticity principle means that the deployment scale of a system adjusts automatically as business volume changes, so fixed hardware and software resources do not need to be prepared in advance based on capacity planning. Good elasticity changes the IT cost model of an enterprise, because the enterprise no longer has to bear the cost of idle hardware and software resources. It also better supports explosive business growth, because growth is no longer held back by insufficient hardware and software reserves.

In the cloud era, the threshold for enterprises to build IT systems has dropped sharply, which greatly improves how efficiently enterprises turn business plans into products and services. This is especially true in the mobile Internet and gaming industries, where there are many cases of an app’s user base growing exponentially once it becomes popular. Exponential business growth puts the performance of enterprise IT systems to a severe test. In a traditional architecture, developers and operations staff facing this challenge are often worn out by performance tuning, and even their best efforts may not fully remove the bottlenecks. In the end, the system cannot cope with the continuing influx of users and the application crashes.

Besides exponential growth, the peak characteristics of a business are another important challenge. For example, a movie ticket booking system sees far more traffic in the afternoon than in the early morning, and traffic on weekends can be several times higher than on weekdays. Similarly, food delivery ordering systems tend to peak around lunch and dinner. In a traditional architecture, handling scenarios with such distinct peaks requires an enterprise to provision and pay for large amounts of computing, storage, and network resources sized for peak traffic, even though those resources sit idle most of the time.

Therefore, in the cloud native era, enterprises building IT systems should make the application architecture elastic as early as possible, so that they can respond flexibly to the needs of different scenarios as the business grows rapidly and can make full use of the technical and cost advantages of cloud native.

To build an elastic system architecture, four basic principles need to be followed.

1. Split the application by function

A large, complex system may consist of hundreds or even thousands of services. When designing the architecture, architects should follow this principle: keep related logic together and split unrelated logic into separate services. Services find each other through standard service discovery and communicate through standard interfaces. Loose coupling between services allows each service to scale elastically on its own, which avoids cascading failures between upstream and downstream services.
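The idea of services locating each other through service discovery can be illustrated with a minimal in-memory sketch. The `ServiceRegistry` class, its method names, and the addresses are hypothetical and chosen only for illustration; a production system would use a dedicated registry rather than a dictionary in one process.

```python
class ServiceRegistry:
    """Toy in-memory service registry with round-robin resolution (illustrative only)."""

    def __init__(self):
        self._instances = {}  # service name -> list of instance addresses
        self._cursor = {}     # service name -> next round-robin position

    def register(self, service, address):
        """Record that `address` now serves `service`."""
        self._instances.setdefault(service, []).append(address)

    def deregister(self, service, address):
        """Remove an instance, for example when it is scaled in."""
        self._instances.get(service, []).remove(address)

    def resolve(self, service):
        """Return one instance address, rotating through the registered instances."""
        instances = self._instances.get(service)
        if not instances:
            raise LookupError(f"no instances registered for {service!r}")
        index = self._cursor.get(service, 0) % len(instances)
        self._cursor[service] = index + 1
        return instances[index]
```

Because callers only know the service name, instances can be added or removed as the service scales without any change on the caller side.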

2. Support horizontal partitioning

Splitting an application by function does not fully solve the elasticity problem. Once an application has been broken down into many services, individual services eventually run into bottlenecks as user traffic grows. Each service therefore needs to be designed for horizontal partitioning, so that it can be divided into logical units that each handle part of the user traffic, which gives the service itself good scalability. The biggest challenge is the database, because database systems are inherently stateful, so partitioning data sensibly and providing a correct transaction mechanism can be a very complex undertaking. In the cloud native era, however, the cloud native database services offered by cloud platforms solve most of these complex distributed system problems, so enterprises that build elastic systems on the capabilities of a cloud platform obtain elasticity for their database systems as well.
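As a minimal sketch of how a request might be routed to one logical unit, the function below maps a user ID to a shard index with a stable hash. The name `shard_for` and the choice of SHA-256 with a simple modulo are assumptions made for illustration; note that a plain modulo remaps most users when the shard count changes, which is one reason real systems often use more elaborate partitioning schemes.

```python
import hashlib


def shard_for(user_id: str, shard_count: int) -> int:
    """Map a user ID to a shard index in [0, shard_count) using a stable hash."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # hashlib is used instead of the built-in hash(), whose output for strings
    # can differ between Python processes.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Every request from the same user lands on the same unit, so each unit only needs to hold that user's share of the data and traffic.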

3. Automatic deployment

Sudden traffic bursts are usually unpredictable, so the common remedy is to scale out the system manually so that it can support more users. Once the architecture has been fully split, an elastic system also needs automated deployment, so that scale-out is triggered automatically by predefined rules or by external signals of a traffic surge. This shortens the time during which a burst affects the system, and it lets the system scale in automatically after the peak, which reduces the cost of the resources it uses.
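One way to express a "predefined rule" for automatic scale-out and scale-in is a proportional rule that compares observed load per instance with a target. The sketch below is an assumption for illustration, not a rule taken from this article; `desired_replicas` and its parameters are hypothetical names.

```python
import math


def desired_replicas(current: int, observed_load: float, target_load: float,
                     min_replicas: int = 1, max_replicas: int = 100) -> int:
    """Return the replica count that brings per-instance load back to the target.

    `observed_load` and `target_load` are per-instance values in the same unit,
    for example CPU percent or requests per second.
    """
    if current < 1 or target_load <= 0:
        raise ValueError("current must be >= 1 and target_load must be > 0")
    raw = math.ceil(current * observed_load / target_load)
    # Clamp so that a bad signal can neither remove all capacity nor grow without bound.
    return max(min_replicas, min(max_replicas, raw))
```

The same rule covers both directions: when load exceeds the target the result grows, and after the peak it shrinks again, bounded by the configured minimum and maximum.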

4. Support service degradation

An elastic system needs to plan its response to abnormal situations in advance, for example by managing services in tiers. When the elasticity mechanism fails, elastic resources are insufficient, or peak traffic exceeds expectations, the architecture must be able to degrade service: it lowers the quality of some non-critical services or turns off some enhanced features to free up resources, and it expands the capacity behind important functions so that the main functions of the product are not affected.

There are many successful examples of large-scale elastic systems in China and abroad, the most representative of which is Alibaba’s annual “Double 11” promotion. To cope with peak traffic hundreds of times higher than normal, Alibaba buys elastic resources from Aliyun every year to deploy its applications, releases them after the event, and pays only for what it uses, which significantly reduces the resource cost of the promotion. Another example is the elastic architecture of Sina Weibo. When a trending social event occurs, Sina Weibo uses its elastic system to scale application containers out to Aliyun to handle the large volume of search and repost requests the event generates. By scaling out on demand within minutes, the system greatly reduces the resource cost of handling trending searches.

As cloud native technologies develop, the ecosystem around technologies such as FaaS and Serverless is maturing, and building large-scale elastic systems is becoming steadily easier. When an enterprise adopts concepts such as FaaS and Serverless as design principles for its system architecture, the system gains elastic scaling, and the enterprise no longer has to pay extra for “maintaining the elastic system itself”.
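The degradation decision can be sketched as a small planning function: critical features always stay on, and non-critical features stay on only while capacity remains. The `Feature` type, the abstract `cost` units, and the cheapest-first ordering are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    critical: bool
    cost: int  # hypothetical capacity units consumed by the feature


def plan_degradation(features, capacity):
    """Return the names of features to keep enabled under the given capacity."""
    # Critical features are always kept, even if they alone exceed capacity.
    enabled = [f for f in features if f.critical]
    remaining = capacity - sum(f.cost for f in enabled)
    # Non-critical features are added cheapest first while capacity allows.
    for feature in sorted((f for f in features if not f.critical), key=lambda f: f.cost):
        if feature.cost <= remaining:
            enabled.append(feature)
            remaining -= feature.cost
    return [f.name for f in enabled]
```

For example, with a critical checkout feature costing 50 units and two optional features costing 30 and 10, a capacity of 65 keeps checkout and only the cheaper optional feature.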

Observability principle

Observability differs from the capabilities provided by systems such as monitoring, service health probing, and application performance management (APM), which are largely passive. Observability places more emphasis on initiative: in distributed systems such as cloud computing environments, logs, distributed traces, and metrics are actively collected so that the duration, return values, and parameters of the many service calls triggered by a single click in an app are clearly visible, and it is even possible to drill down into each third-party software call, SQL request, node topology, and network response. With this capability, operations and maintenance (O&M), development, and business personnel can follow how the software is running in real time and gain unprecedented correlation analysis capabilities to continuously optimize business health and user experience.

With the overall development of cloud computing, enterprise application architecture has changed significantly and is gradually moving from traditional monolithic applications to microservices. In a microservice architecture, the loosely coupled design of services makes version iterations faster and cycles shorter. At the infrastructure layer, Kubernetes and similar platforms have become the default platform for containers, and services can be continuously integrated and deployed through pipelines. These changes minimize the risk of service changes and improve R&D efficiency.

In a microservice architecture, a failure can occur anywhere in the system, so observability needs to be designed in systematically to reduce the mean time to repair (MTTR).

To build observability systems, three basic principles need to be followed.

1. Comprehensive data collection

Metrics, traces, and logs are the “three pillars” of a complete observability system. Observability requires that all three types of data be fully collected, analyzed, and displayed.

(1) Metrics. Metrics are key performance indicators (KPIs) measured over multiple consecutive time periods. Metrics are usually layered according to the software architecture into system resource metrics (such as CPU utilization, disk usage, and network bandwidth), application metrics (such as error rate, service level agreement (SLA), Apdex service satisfaction, and average latency), and business metrics (such as the number of user sessions, order volume, and turnover).

(2) Tracing. Tracing means recording and reconstructing the entire course of a distributed call through a unique identifier, the TraceId, which follows the data all the way from the browser or mobile client through server-side processing to SQL execution or outgoing remote calls.

(3) Logs. Logs usually record the execution flow of a running application, code debugging output, and error and exception information. For example, Nginx logs can record the remote IP address, request time, data size, and other information. Log data needs to be stored centrally and be retrievable.

2. Correlation analysis of data

For an observable system, establishing connections between data is especially important. When a fault occurs, effective correlation analysis makes it possible to delimit and locate the fault quickly, which improves troubleshooting efficiency and reduces unnecessary losses. Typically, information such as the server address and service interface of the application is attached as additional attributes to metrics, call chains, and logs. The observable system should also offer some customization capability so that it can flexibly meet the needs of more complex O&M scenarios.
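The following toy sketch shows how metrics, trace spans, and logs can carry the same trace ID and shared attributes so that they can be correlated later. The `Telemetry` class and its methods are hypothetical and keep everything in memory; they stand in for whatever collection pipeline is actually used.

```python
import time
import uuid
from contextlib import contextmanager


class Telemetry:
    """Collects counters, trace spans, and logs that share one trace ID."""

    def __init__(self):
        self.counters = {}  # metrics: "<span name>.<outcome>" -> count
        self.spans = []     # tracing data
        self.logs = []      # log records

    def new_trace_id(self):
        return uuid.uuid4().hex

    def log(self, trace_id, message, **attrs):
        self.logs.append({"trace_id": trace_id, "message": message, **attrs})

    @contextmanager
    def span(self, trace_id, name, **attrs):
        """Time a unit of work and record its outcome as a span and a counter."""
        start = time.perf_counter()
        outcome = "ok"
        try:
            yield
        except Exception:
            outcome = "error"
            raise
        finally:
            self.spans.append({"trace_id": trace_id, "name": name, "outcome": outcome,
                               "duration_s": time.perf_counter() - start, **attrs})
            key = f"{name}.{outcome}"
            self.counters[key] = self.counters.get(key, 0) + 1

    def correlate(self, trace_id):
        """Return every span and log record that belongs to one request."""
        return {
            "spans": [s for s in self.spans if s["trace_id"] == trace_id],
            "logs": [r for r in self.logs if r["trace_id"] == trace_id],
        }
```

Extra keyword arguments such as a host address or interface name play the role of the additional attributes described above: they are bound to every record, so a fault seen in one data type can be looked up in the others.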

3. Unified monitoring view and display

Monitoring views of various forms and dimensions help operators and developers quickly find system bottlenecks and eliminate system risks. Monitoring data should not be limited to metric trend charts and bar charts. To handle complex real-world scenarios, views also need drill-down analysis and customization capabilities that meet the requirements of scenarios such as O&M monitoring, version release management, and troubleshooting.

As cloud-native technologies evolve, the scenarios based on heterogeneous microservices architecture will become more numerous and more complex, and observability is the foundation upon which all automated capabilities are built. Only by achieving comprehensive observability can the stability of the system be truly improved and the MTTR be reduced. Therefore, how to build a full-stack observable system of system resources, containers, networks, applications, and businesses is a problem that every enterprise needs to think about.

Resilience principle

Resilience refers to the ability of software to keep working when the hardware or software components it depends on fail. Such failures usually include hardware faults, hardware resource bottlenecks (such as exhausted CPU or network adapter bandwidth), service traffic that exceeds the designed capacity of the software, faults or disasters that affect the normal operation of the equipment room, and faults in dependent software, any of which may make a service unavailable.

Once a business goes live, it may face all kinds of uncertain inputs and unstable dependencies for most of its running time. When these abnormal scenarios occur, the business needs to preserve service quality as far as possible in order to meet the “always online” requirement typified by today’s connected services. The core design idea of resilience is therefore design for failure: considering how to reduce the impact of anomalies on the system and on service quality, and how to return to normal as quickly as possible, when dependencies misbehave in various ways.

The practices and common architectures of the resilience principle include asynchronous services; retry, rate limiting, degradation, circuit breaking, and backpressure; primary/secondary mode; cluster mode; high availability (HA) across multiple availability zones (AZs); unitization; cross-region disaster recovery; and geo-distributed multi-active disaster recovery.

The following example illustrates how resilience is designed into a large system. “Double 11” is a battle that Alibaba cannot afford to lose, so the design of its systems must strictly follow the resilience principle. For example, traffic scrubbing at the unified access layer implements security policies and defends against hacker attacks, and fine-grained rate limiting keeps peak traffic stable so that the back end works normally. To improve overall high availability, Alibaba implements cross-region multi-active disaster recovery through its unitization mechanism and same-city active-active disaster recovery through its same-city disaster recovery mechanism, which maximizes the service quality of its Internet data centers (IDCs). Within a single IDC, microservices and container technologies allow stateless services to be migrated, multi-replica deployment improves availability, and messaging decouples microservices asynchronously to reduce service dependencies and improve system throughput. Each application also sorts out its own dependencies, sets degradation switches, and continuously strengthens the robustness of the system through fault drills, so that Alibaba’s “Double 11” promotion runs normally and stably.

As digitalization accelerates, more and more digital services are becoming infrastructure for the operation of society and the economy as a whole. At the same time, the systems supporting these services are growing more complex, and the risk posed by the uncertain quality of the services they depend on keeps rising. Systems must therefore be designed with sufficient resilience to cope with all kinds of uncertainty. Resilience design is especially crucial for the core business links of core industries (such as payment in finance and transactions in e-commerce), for business traffic entry points, and for links with complex dependencies.
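Of the practices listed, circuit breaking is easy to show in a few lines. The sketch below is a deliberately simplified breaker; the class name, the default thresholds, and the injectable `clock` parameter are assumptions for illustration, and it is not thread-safe.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0      # consecutive failures seen so far
        self.opened_at = None  # time the circuit opened, or None when closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                # Fail fast instead of piling more load onto a failing dependency.
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, so this one trial call is let through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

A caller that catches `CircuitOpenError` can return a degraded response right away, which ties circuit breaking to the degradation practice in the same list.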

Full-process automation principle

Technology is a “double-edged sword”. The use of containers, microservices, DevOps, and large numbers of third-party components reduces the complexity of distributed systems and speeds up iteration, but it also increases the complexity of the software technology stack and the number of components, which inevitably makes software delivery more complex. If this is not kept under control, applications cannot benefit from the advantages of cloud native technology. By practicing IaC, GitOps, OAM, and Operators, and by using the many automated delivery tools in a CI/CD (continuous integration/continuous delivery) pipeline, enterprises can standardize their internal software delivery processes and then automate them on that basis. Self-describing configuration data and an end-state-oriented delivery process make it possible to automate both software delivery and O&M.

To achieve large-scale automation, four basic principles need to be followed.

1. Standardization

Automation begins by standardizing the infrastructure of business operations through containerization, IaC, and OAM, and further standardizing the process of defining applications and even delivering them. Only standardization can remove the dependence of business on specific people and platforms, and realize business unification and large-scale automation.

2. End-state orientation

End-state orientation means describing the expected configuration of infrastructure and applications declaratively and continuously watching their actual running state, so that the system itself keeps making changes and adjustments until it converges on the desired end state. The principle emphasizes that application changes should not be made by assembling a series of procedural commands directly through a work order system or workflow system. Instead, the end state should be declared, and the system should decide how to carry out the change.
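A minimal sketch of this idea is a reconcile step that compares a declared end state with the observed state and derives the actions itself. The state model here (a mapping from workload name to replica count) and the function names `reconcile` and `apply_actions` are assumptions made for illustration.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Compute the actions that move `actual` toward `desired`.

    Both arguments map a workload name to its replica count.
    """
    actions = []
    for name, want in sorted(desired.items()):
        have = actual.get(name, 0)
        if have < want:
            actions.append(("scale_up", name, want - have))
        elif have > want:
            actions.append(("scale_down", name, have - want))
    # Anything running that is no longer declared gets removed.
    for name in sorted(set(actual) - set(desired)):
        actions.append(("delete", name, actual[name]))
    return actions


def apply_actions(actions: list, actual: dict) -> dict:
    """Return the state that results from applying the actions (a stand-in for real changes)."""
    state = dict(actual)
    for verb, name, count in actions:
        if verb == "scale_up":
            state[name] = state.get(name, 0) + count
        elif verb == "scale_down":
            state[name] -= count
        else:
            del state[name]
    return state
```

The operator only edits `desired`; running the reconcile step repeatedly is safe because it produces no actions once the actual state matches the declaration.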

3. Separation of concerns

What automation ultimately achieves depends not only on the capabilities of the tools and systems, but also on the people who set the goals for the system, so make sure you find the right goal setters. When describing the final state of the system, separate the configuration of the main roles, such as application R&D, application operation and maintenance, and infrastructure operation and maintenance. Each role only needs to set the system configuration that it cares about and is good at, so as to ensure that the set system final state is reasonable.

4. Design for failure

To automate all processes, it is important to ensure that the automated processes are controllable and that their impact on the system is predictable. We cannot expect automated systems to be error-free, but we can ensure that even when an exception occurs, the scope of the error is manageable and acceptable. Therefore, when carrying out changes, automated systems also need to follow the best practices for manual changes: changes are rolled out gradually (grayscale release), results are observable, changes can be rolled back quickly, and the impact of changes is traceable.

Self-healing of business instances is a typical process automation scenario. After services are migrated to the cloud, the cloud platform uses various technical means to greatly reduce the probability of server faults, but it cannot eliminate software faults in the services themselves. Software faults include crashes caused by defects, out-of-memory (OOM) errors caused by insufficient resources, and hangs caused by excessive load; problems in system software such as the kernel and daemon processes; and interference from co-located applications or jobs. As the business grows, the risk of software failure keeps rising. Traditional troubleshooting requires operations staff to step in and perform repair actions such as restarting or replacing instances. At large scale, however, operations staff struggle to keep up with all the faults and may even have to work overnight, service quality is hard to guarantee, and neither customers nor development and operations staff are satisfied.

To enable automatic fault recovery, cloud native applications require developers to use standard declarative configuration to describe how application health is checked, how the application is started, which service discovery registry the application must be mounted to and registered with after startup, and its configuration management database (CMDB) information. With this standard configuration, the cloud platform can probe applications repeatedly and perform automated repair when failures occur. In addition, to guard against false positives in fault detection, application O&M personnel can set the proportion of instances that may be unavailable based on their own capacity, so that the cloud platform preserves service availability while recovering from faults automatically. Instance self-healing not only frees developers and O&M personnel from tedious operational work, but also handles faults promptly to ensure business continuity and high service availability.
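The practices named here (gradual rollout, observable results, quick rollback, traceable impact) can be combined in a small staged-rollout sketch. The `rollout` function, the `error_rate_at` callback, and the result dictionary are hypothetical; in a real system the callback would read monitoring data after traffic has been shifted.

```python
def rollout(stages, error_rate_at, max_error_rate):
    """Shift traffic to a new version stage by stage, rolling back on high error rates.

    stages: increasing traffic percentages, for example [1, 10, 50, 100].
    error_rate_at: callable that returns the observed error rate at a given percentage.
    """
    history = []  # kept so the impact of the change stays traceable
    for percent in stages:
        rate = error_rate_at(percent)
        history.append((percent, rate))
        if rate > max_error_rate:
            # Roll back quickly: send all traffic back to the old version.
            return {"status": "rolled_back", "traffic_percent": 0, "history": history}
    final = stages[-1] if stages else 0
    return {"status": "completed", "traffic_percent": final, "history": history}
```

Because each stage exposes only part of the traffic, a defect that appears at 10% never reaches the remaining users, which keeps the scope of an error manageable.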
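The unavailability limit described above can be sketched as a budget on how many instances the platform may restart at once. The function name `pick_restarts`, the percentage parameter, and the alphabetical ordering are assumptions for illustration only.

```python
def pick_restarts(health: dict, max_unavailable_percent: int) -> list:
    """Choose which unhealthy instances may be restarted right now.

    health maps an instance name to True (healthy) or False (unhealthy).
    At most `max_unavailable_percent` of all instances are restarted at once,
    so a burst of false positives cannot take the whole service down.
    """
    if not 0 <= max_unavailable_percent <= 100:
        raise ValueError("max_unavailable_percent must be between 0 and 100")
    budget = len(health) * max_unavailable_percent // 100
    unhealthy = sorted(name for name, ok in health.items() if not ok)
    return unhealthy[:budget]
```

Instances left out of one round remain candidates for the next probe cycle, so recovery still completes, just without exceeding the declared limit.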

Zero trust principle

Traditional security architecture based on the boundary model is designed to build a wall between trusted and untrusted resources. For example, the Intranet of a company is trusted, but the Internet is not. In this security architecture design pattern, once intruders infiltrate the border, they can access the resources within the border at will. The adoption of cloud-native architecture, the spread of telecommuting by employees, and the use of mobile devices such as phones to handle work have completely broken the physical boundaries of traditional security architecture. Employees working from home can also share data with partners because applications and data are hosted in the cloud.

Today, boundaries are no longer defined by the physical location of an organization but extend to every place from which the organization’s resources and services need to be accessed, and traditional firewalls and VPNs can no longer handle these new boundaries reliably and flexibly. We therefore need a new security architecture that adapts flexibly to the characteristics of cloud native and mobile environments and protects data effectively no matter where employees work, where devices connect from, or where applications are deployed. Implementing this new security architecture depends on the zero-trust model.

Whereas traditional security architectures assume that everything inside the firewall is secure, the zero-trust model assumes that the firewall boundary has already been breached and that every request comes from an untrusted network, so every request needs to be verified. Simply put, “never trust, always verify.” Under the zero-trust model, each request must be strongly authenticated and then authorized based on the security policy. The user identity, device identity, application identity, and other identities associated with the request serve as the core information for deciding whether the request is secure.

If we talk about security architecture around boundaries, the boundary of traditional security architecture is the physical network, while the boundary of zero-trust security architecture is identity, which includes the identity of the person, the device, the application, and so on. Implementing a zero-trust security architecture requires following three basic principles.

1. Explicit verification

Each access request is authenticated and authorized. Authentication and authorization are based on information such as user identity, location, device information, service and workload information, data classification, and anomaly detection. For example, communication between internal applications cannot be authorized simply because the source IP address is an internal one. Instead, the identity and device information of the source application should be determined, and the communication should then be authorized based on the current policy.

2. Minimum permissions

For each request, only the permissions required at that moment should be granted, and the permission policy should adapt to the context of the current request. For example, employees in the HR department should have access to HR-related applications but not to the finance department’s applications.

3. Assume a breach

If the physical boundary is breached, the blast radius of the breach must be strictly contained by segmenting the whole network into multiple parts according to users, devices, and applications. All sessions are encrypted, and data analysis techniques are used to maintain visibility into the security state.
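A minimal sketch of an identity-based authorization check follows, reusing the HR and finance example from the text. The `Request` fields, the `POLICY` table, and the user and application names are hypothetical, and the sketch assumes that the user and device identities were already strongly authenticated upstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    user: str             # authenticated user identity (assumed verified upstream)
    device_trusted: bool  # result of a device identity check (assumed)
    source_ip: str        # carried for auditing only
    app: str              # application being accessed


# Hypothetical least-privilege policy: each user is granted only the apps they need.
POLICY = {
    "alice@hr": {"hr-portal"},
    "bob@finance": {"finance-ledger"},
}


def authorize(request: Request, policy: dict = POLICY) -> bool:
    """Decide on identity and device state; the source IP is deliberately ignored."""
    if not request.device_trusted:
        return False
    return request.app in policy.get(request.user, set())
```

An internal source address gives no advantage here: a request from `10.0.0.5` for an application outside the user's grant is denied just as it would be from the public Internet.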

The evolution from traditional security architecture to zero-trust architecture has a profound impact on software architecture, which is embodied in the following three aspects.

First, you cannot configure security policies based on IP addresses. In the cloud native architecture, you cannot assume that IP is bound to services or applications. This is because IP may change at any time due to the application of technologies such as automatic elasticity. Therefore, you cannot use IP to represent the identity of an application and establish security policies on this basis.

Second, identity should become infrastructure. Authorization for communication between services and human access to services is based on the clear knowledge of the identity of the visitor. In an enterprise, human identity management is often part of the security infrastructure, but application identity also needs to be managed.

Third, release pipelines should be standardized. In an enterprise, development work is often decentralized, with code versioning, building, testing, and release each handled in a relatively independent way. This decentralized pattern means that the security of services running in the real production environment cannot be effectively guaranteed. If the processes of versioning, building, and releasing code are standardized, the security of application releases can be strengthened centrally.

In general, the construction of the zero-trust model includes identity, device, application, infrastructure, network, data and other parts. The implementation of zero trust is a gradual process. For example, when all traffic moving within the organization is not encrypted, the first step should be to ensure that the traffic of visitors accessing the application is encrypted, and then to implement the encryption of all traffic gradually. With a cloud-native architecture, you can directly use the secure infrastructure and services provided by the cloud platform to help enterprises quickly implement a zero-trust architecture.

Continuous architecture evolution principle

Today, technology and business are both moving so fast that, in engineering practice, few architectural patterns can be clearly defined at the outset and remain applicable throughout the software life cycle. Instead, architectures need to be refactored continually, within limits, to meet changing technical and business requirements. In the same way, a cloud native architecture should and must be capable of continuous evolution rather than being a closed architecture that is designed once and then fixed. So besides factors such as incremental iteration and sound target selection, the design also needs to take into account architecture governance and risk control standards at the organizational level (for example, an architecture control committee) as well as the characteristics of the business itself. Especially when the business is iterating rapidly, more attention should be paid to keeping architecture evolution and business development in balance.

1. Features and values of an evolutionary architecture

An evolutionary architecture is one that is designed from the start of software development to be extensible and loosely coupled, which makes later changes easier and refactoring for upgrades cheaper. Such changes can take place at any stage of the software life cycle and can affect development practices, release practices, and overall agility.

The fundamental reason evolutionary architecture matters in industry practice is the consensus in modern software engineering that change is hard to predict and that reworking a system afterward is extremely costly. An evolutionary architecture cannot avoid refactoring either, but it emphasizes evolvability: when the architecture needs to evolve because of changes in technology, organization, or the external environment, the project as a whole can still follow the principle of strongly bounded contexts, ensuring that the logical divisions described in domain-driven design become physical isolation. Built on a standardized and highly scalable infrastructure, an evolutionary architecture makes extensive use of advanced cloud native application architecture practices, such as standardized application models and modular O&M capabilities, to achieve physical modularity, reusability, and separation of responsibilities across the entire system architecture. In an evolutionary architecture, each service of the system is structurally decoupled from the other services, and replacing a service is as easy as replacing a Lego brick.

2. Application of evolutionary architecture

In modern software engineering practice, evolutionary architecture has different practices and manifestations at different levels of the system.

In application architecture for business development, evolutionary architecture is usually inseparable from microservice design. For example, in Alibaba's Internet e-commerce applications (such as the familiar Taobao and Tmall), the overall system architecture is deliberately designed as thousands of fine-grained components with clear boundaries. The purpose is to make it easier for developers to make non-destructive changes, and to prevent inappropriate coupling from steering changes in unpredictable directions and thereby impeding the evolution of the architecture. As you can see, software with an evolutionary architecture supports some degree of modularity, which is usually reflected in classic layered architectures and in microservice best practices.

At the platform development level, evolutionary architecture is mainly reflected in Capability Oriented Architecture (COA). With the gradual popularization of cloud-native technologies such as Kubernetes, standardized cloud-native infrastructure is rapidly becoming the provider of platform architecture capabilities. The Open Application Model (OAM), which builds on this idea, is a COA practice that, from the perspective of the application architecture, modularizes standardized infrastructure by capability.
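The capability-oriented idea can be sketched in a few lines of Python. The sketch is only loosely inspired by the OAM notion of a component with attached operational capabilities; it is not the OAM schema or any real API. The capability names (`scaling`, `ingress`), the registry, and the output format are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

# A capability turns its parameters into platform-level settings.
Capability = Callable[[dict], dict]


def scaling(params: dict) -> dict:
    """Hypothetical capability: how many replicas to run."""
    return {"replicas": params.get("replicas", 1)}


def ingress(params: dict) -> dict:
    """Hypothetical capability: expose the component under a route."""
    return {"route": f"{params['domain']}{params.get('path', '/')}"}


# Hypothetical registry of the capabilities this platform offers.
PLATFORM_CAPABILITIES: dict[str, Capability] = {
    "scaling": scaling,
    "ingress": ingress,
}


@dataclass
class Component:
    """An application component plus the capabilities it requests."""

    name: str
    image: str
    capabilities: dict[str, dict] = field(default_factory=dict)


def render(component: Component) -> dict:
    """Combine the workload with the settings of each requested capability."""
    result = {"name": component.name, "image": component.image}
    for cap_name, params in component.capabilities.items():
        if cap_name not in PLATFORM_CAPABILITIES:
            raise ValueError(f"platform does not provide capability: {cap_name}")
        result.update(PLATFORM_CAPABILITIES[cap_name](params))
    return result
```

The application describes what it needs by capability name, and the platform decides how each capability is implemented. Adding a new capability means registering one more entry, without changing `Component` or `render`.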

3. Architecture evolution in the cloud-native era

At present, evolutionary architecture is still in a stage of rapid growth and popularization. However, there is a consensus across the software engineering community that the software world is constantly changing: it is dynamic, not static. Architecture is not a simple equation either; it is a snapshot of an ongoing process. Therefore, in both business applications and platform development, evolutionary architecture is an inevitable trend. Many architecture update efforts in the industry illustrate the same problem: when the architecture is neglected, the effort required to keep an application up to date is enormous. Good architectural planning, by contrast, reduces the cost of introducing new technologies, which requires both the application and the platform to satisfy three conditions at the architectural level: standardization, separation of responsibilities, and modularization. In the cloud-native era, the Open Application Model (OAM) is rapidly becoming an important driver of evolutionary architecture.

Conclusion

We can see that cloud-native architectures are built and evolved on the basis of the core features of cloud computing (e.g., elasticity, automation, resilience), in combination with business goals and characteristics, to help enterprises and technologists fully realize the technical benefits of cloud computing. As the exploration of cloud native continues, its technologies keep expanding and its scenarios keep growing richer, and cloud native architecture will continue to evolve with them. Throughout these changes, however, the typical architectural design principles remain important guides for architectural design and technical implementation.
