Authors: Zong Quan, Yu Zeng

Alibaba trinity strategy

Alibaba Cloud has long advocated a trinity strategy of open source, self-research, and commercialization. Let us start with our understanding of it.

Years of software development have taught us that there are some key elements to developing great software:

  • communication
  • feedback
  • practice

In software development, we cannot “create” business requirements behind closed doors. Business scenarios and product features need to be refined through use, and open source gives us a platform for joint innovation on which we can define norms and standards together. When different vendors follow the same standards, customers are not exposed to lock-in: they can migrate at any time, always choose the best vendor, and run their business in the simplest, most convenient, and most economical way.

When customers evaluate the Alibaba Cloud service mesh, one important criterion is compatibility with community Istio, because they worry about being locked in and becoming heavily dependent on Alibaba Cloud.

Turning to self-research, some readers may ask whether open source and self-research contradict each other. The answer is no.

The self-research we mean here builds on open source. It does not abandon the open source version and reinvent the wheel. Self-research means that we understand the open source product deeply:

  • We have mastered all of the source code;
  • We are able to modify every line of code;
  • We may also have business-specific requirements of our own, including scenarios that cannot be standardized.

Based on self-research and a deep command of the open source products, we move features that meet common customer needs to the cloud and package them as cloud products that customers can use out of the box. This is the original intention of commercialization.

Inside Alibaba Group, open source, self-research, and commercialization in fact form a technology flywheel.

For Alibaba’s engineers, Singles’ Day (Double 11) is an annual “feast”. To give customers a smooth shopping experience and merchants a wider range of promotions, the Alibaba e-commerce platform’s requirements for efficiency, reliability, and scale multiply under the pressure of Double 11, which pushes engineers to their full potential. As a core piece of the basic technology stack, Alibaba middleware also goes through a comprehensive evolution and upgrade for Double 11 every year.

Alibaba has launched a number of well-known projects in the open source community, including Dubbo, RocketMQ, Nacos, and Seata, and encourages developers to build the middleware ecosystem together, including technologies related to service mesh.

Embracing open source service mesh technology

Alibaba Cloud has been investigating and practicing service mesh technology for a long time. In 2018, Istio officially released version 1.0 and came to wide public attention. Even before that, Alibaba had started contributing to open source projects in the related ecosystem.

Alibaba Cloud also has open source service frameworks in the microservice ecosystem, such as Dubbo and Spring Cloud Alibaba. Because e-commerce serves as a large proving ground, it is fair to say that Alibaba Cloud has deep expertise in microservices. We compared features side by side and weighed the advantages and disadvantages of the Sidecar mode against the original mode. Along the way, we also actively contributed to open source projects in the Istio and microservice ecosystem, for example the Envoy Dubbo Filter, the RocketMQ Filter, the Nacos MCP feature, Spring Cloud Alibaba, and Sentinel.

Many service frameworks are popular today. How can businesses built on different frameworks interoperate? A service framework is like the rails of a railway: it is the foundation of interoperability. Only when service frameworks can interoperate can higher-level business interoperability be achieved. Unifying on a common standard and building a new generation of service framework is therefore an inevitable trend.

Dubbo and HSF are both microservice RPC frameworks used within Alibaba. Through the continuous iteration of Alibaba’s business, they have provided solid underlying microservice capabilities and carried one Double 11 promotion after another.

With the wave of cloud native adoption, overall resource cost optimization, DevOps, and similar trends, some shortcomings of the original microservice frameworks Dubbo and HSF have gradually surfaced: limited multi-language support, no separation of configuration from code logic, SDK upgrades that require pushing every business team to act, and interoperability problems when acquired businesses use different frameworks.

Some internal businesses of Alibaba Group began to use service mesh technology to transform the underlying microservice framework. While bringing the Dubbo framework onto the mesh, the Alibaba Cloud service mesh team contributed the Envoy Dubbo Filter, which allows existing Dubbo services to join the mesh and capture the incremental value that the service mesh brings.

Meanwhile, the Dubbo community itself is moving forward in the cloud native space. To adapt better to cloud native scenarios (on the infrastructure side, Kubernetes has become the de facto standard for resource scheduling), the Dubbo team is evolving Dubbo 2.0 into Dubbo 3.0 and has proposed Proxyless Mesh.

As services move to the cloud, the infrastructure on which applications are deployed varies with the many paths to the cloud and with the transition from existing architectures to cloud native ones, and microservices on the cloud are becoming more diverse. Cross-language, cross-vendor, and cross-environment invocation will inevitably call for unified protocols and frameworks based on open standards to meet interoperability requirements. These are exactly the scenarios that service mesh is good at, which gives it plenty of room to prove its value.

Dubbo 3.0 community release is now available with the following core changes:

  • Application-level service discovery
  • Evolution of the Dubbo 2.0 protocol into the gRPC-based Triple protocol
  • Proxyless Mesh, with no sidecar

Adopting a mesh does not happen overnight. Existing workloads go through an intermediate transition stage, much like moving a business to the cloud. Existing services built on traditional microservice frameworks such as Dubbo and Spring Cloud use registries such as Nacos, Eureka, and ZooKeeper, and the mesh needs to be compatible with and adapt to them. Nacos has prioritized support for the MCP over xDS protocol of the Istio control plane, which allows Istiod to connect directly to the Nacos registry.

Open source products usually cannot be used directly in large-scale production environments. They need adaptation and tuning, as well as productized packaging. One example is Intel’s mTLS acceleration scheme.

Intel has submitted an implementation to upstream Envoy, but Istiod does not yet support it. As a cloud product, we want to give customers out-of-the-box capabilities. Service mesh ASM extends the control plane Istiod to support Intel’s open source mTLS acceleration scheme. Because the scheme depends on the CPU model of the underlying resources (Ice Lake), ASM adaptively enables or disables acceleration according to where the user’s services are actually deployed. With multiBuffer acceleration enabled and Alibaba Cloud g7 generation ECS instances used as nodes, QPS improves by nearly 80%.
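The adaptive switch can be pictured with a minimal sketch. It assumes a hypothetical per-node `cpu_family` value and a hypothetical set of supported families; the names and the decision rule are illustrative and are not taken from ASM itself.

```python
from dataclasses import dataclass

# Assumption: CPU families on which multi-buffer crypto acceleration is usable.
SUPPORTED_CPU_FAMILIES = {"icelake"}


@dataclass(frozen=True)
class Node:
    name: str
    cpu_family: str  # hypothetical field, for example read from a node label


def acceleration_plan(nodes):
    """Map each node name to True/False: enable acceleration only where the CPU supports it."""
    return {n.name: n.cpu_family.lower() in SUPPORTED_CPU_FAMILIES for n in nodes}


plan = acceleration_plan([Node("node-a", "IceLake"), Node("node-b", "Skylake")])
```

In this sketch the decision is made per node, so a workload scheduled onto a node without the required CPU simply runs without acceleration.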

When it comes to service mesh, one frequently raised question is: “What is the difference between it and Dapr?”

Dapr uses the Sidecar architecture and runs as a separate process alongside the application, providing functions such as service invocation, network security, and distributed tracing. It is therefore natural to ask how Dapr compares with service mesh solutions like Istio.

Dapr and service mesh do have some overlapping capabilities. However, service mesh focuses on network concerns, whereas Dapr focuses on providing building blocks that make it easier for developers to build applications as microservices. Dapr is developer-centric, while service mesh is infrastructure-centric. In addition, Dapr does not provide traffic control functions such as routing or traffic splitting.

The two can of course be deployed together, with the Dapr sidecar and the service mesh sidecar both running in the application environment.

Adoption and practice of service mesh at Alibaba

As described above, Alibaba has open sourced a number of products for the microservice ecosystem. These products all started from internal business scenarios. After they had been incubated and tested by large-scale internal business, the teams concluded that external customers had similar needs and open sourced them.

The same is true for the Istio-based mesh. Internal businesses of the group began exploring mesh adoption very early.

As the overall architecture diagram shows, Alibaba Group provides a console for mesh users. From the application’s perspective, the console integrates CI/CD, permission management, safe production, the SRE operations system, and other platforms, and offers a unified portal once an application has joined the mesh. Following the DevOps concept, users can manage the whole life cycle of their applications. They can also use the service governance, full-link grayscale, and safe production capabilities of the mesh, so that application owners can operate and troubleshoot on their own.

The core capabilities of the mesh support RPC protocols such as Dubbo, MetaQ (RocketMQ), and LWP, and extend the mesh with capabilities such as full-link traffic coloring, routing policies, and a plug-in marketplace.

Alibaba Group also supports integration with third-party systems through OpenAPI and the Kubernetes API.

On top of the community Istio architecture, Alibaba Group has integrated deeply with internal middleware (Diamond and ConfigServer). This keeps the mesh compatible with the way businesses already work and lets them join the mesh seamlessly. Some mesh businesses also need Nacos, so the ASM product supports scenarios with multiple registries, Nacos included.

In addition, the operations plane is abstracted so that service traffic governance rules (VirtualService, DestinationRule, and so on) can be configured through the UI console. Integration with OpenKruise makes it possible to enable, disable, and hot upgrade Sidecars at Pod granularity, and integration with Prometheus, Grafana, and the group’s internal alerting service ARMS provides observability and monitoring for microservices.

Evolution path of service mesh at Alibaba Group

The evolution of Alibaba Group’s service mesh can be divided into three stages: non-intrusive adoption at partial scale, non-intrusive adoption at full scale, and the cloud native end state. At present, the group’s mesh adoption is in the second stage.

Stage 1: Existing workloads go through a transition stage when they join the mesh, and this transition must be relatively non-intrusive so that business developers do not notice it. This is the background and premise for adopting a non-intrusive solution. The mesh also needs to cover the existing microservice governance capabilities while providing its own incremental value.

Stage 2: Adoption at full scale, while solving the resource cost and performance problems that scale brings. The Sidecar CRD is used to load service configuration lazily and to isolate configuration, and Metrics are optimized and trimmed to reduce Sidecar memory overhead. In addition, the Dubbo/HSF Filter is optimized with lazy encoding and decoding (lazy codec) to improve data plane performance and reduce latency.
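The idea behind lazy codec is to decode only what routing needs and to leave the payload untouched unless something asks for it. A minimal sketch follows, assuming a made-up wire format (a 4-byte big-endian header length, a JSON header, then a JSON body). This format is purely illustrative and is not the Dubbo or Hessian2 encoding.

```python
import json
import struct


def encode(header, body):
    """Build a frame in the illustrative format: length prefix, header, body."""
    h = json.dumps(header).encode()
    return struct.pack(">I", len(h)) + h + json.dumps(body).encode()


class LazyFrame:
    """Decode the header on first access; decode the body only on request."""

    def __init__(self, raw):
        self.raw = raw
        (self._hlen,) = struct.unpack_from(">I", raw, 0)
        self._header = None
        self.body_decoded = False  # lets callers observe whether the body was touched

    @property
    def header(self):
        if self._header is None:
            self._header = json.loads(self.raw[4:4 + self._hlen])
        return self._header

    def body(self):
        self.body_decoded = True
        return json.loads(self.raw[4 + self._hlen:])


frame = LazyFrame(encode({"tag": "gray"}, {"order_id": 42}))
routing_tag = frame.header["tag"]  # enough for a routing decision
```

A proxy that routes on `frame.header` can forward `frame.raw` unchanged, which avoids the cost of decoding and re-encoding the body.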

As internal services evolve from Dubbo 2.0/HSF to Dubbo 3.0, the mesh eventually reaches the cloud native end-state solution.

Stage 3: As the infrastructure evolves to Kubernetes, service discovery and service governance sink into the infrastructure in cloud native scenarios. With the mesh, business logic is decoupled from service governance and configuration is separated from code logic, which enables better DevOps. Businesses also benefit from the rich, extensible traffic scheduling capabilities and observability of the mesh.

Dubbo/HSF RPC supports a variety of serialization formats, and the mesh does not support some of them well, Java serialization for example.

Therefore, in the first stage of mesh adoption, the Sidecar does not encode or decode Java serialization and passes such traffic through. For Hessian2 serialization, the mesh implements full codec support and uses lazy codec for performance reasons. On this basis, we can mark (color) this kind of traffic and, by extending VirtualService, implement tag routing and fallback. This also enables specific business scenarios such as canary release and full-link grayscale.
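Tag routing with fallback can be sketched in a few lines. The header name `x-traffic-tag`, the baseline tag `base`, and the endpoint records are hypothetical names chosen for illustration; the sketch shows the selection logic only, not how the extended VirtualService expresses it.

```python
def pick_endpoints(endpoints, headers, tag_header="x-traffic-tag", fallback_tag="base"):
    """Prefer endpoints whose tag matches the request; otherwise fall back to the baseline."""
    wanted = headers.get(tag_header, fallback_tag)
    matched = [e for e in endpoints if e["tag"] == wanted]
    if matched:
        return matched
    # No instance carries the requested tag: fall back to the baseline version.
    return [e for e in endpoints if e["tag"] == fallback_tag]


endpoints = [
    {"addr": "10.0.0.1:20880", "tag": "base"},
    {"addr": "10.0.0.2:20880", "tag": "gray"},
]
gray_targets = pick_endpoints(endpoints, {"x-traffic-tag": "gray"})
fallback_targets = pick_endpoints(endpoints, {"x-traffic-tag": "dev"})
```

The fallback is what lets a full-link grayscale environment contain only the services that changed: calls to any service without a tagged instance land on the baseline.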

The mesh SDK layer of internal services will gradually be upgraded to the Dubbo 3.0 SDK. With the mesh enabled, the Dubbo 3.0 SDK only provides RPC and a few other capabilities, which corresponds to the ThinSDK mode. After joining the mesh, the Sidecar supports the protocol better and resource cost is reduced. When the Sidecar fails, the SDK can quickly switch back to FatSDK mode without the business noticing.

For services within the group, traffic scheduling is complicated, especially for large-scale services. For example, a service may be deployed across multiple data centers and regions, and within a single region there may be routes for multiple versions and environments of the service.

This involves routing and back-end cluster selection in different dimensions, which may include:

  • Regional routing
  • Data center routing
  • Unit-based routing
  • Environment routing
  • Multi-version routing

The group’s e-commerce scenario is particularly typical. With this in mind, the internally extended Istio implements traffic marking and routing by introducing new CRDs, RouteChain and TrafficLabel, and by extending VirtualService.
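Routing across several dimensions can be pictured as a chain of filters, each narrowing the candidate set. The sketch below is one possible reading and is not the RouteChain implementation: the dimension keys, the endpoint records, and the rule that a dimension is skipped when it would leave no candidates are all assumptions.

```python
DIMENSIONS = ["region", "zone", "unit", "env", "version"]  # assumed keys, in priority order


def route_chain(endpoints, request, dimensions=DIMENSIONS):
    """Narrow the candidate endpoints one dimension at a time.

    A dimension is skipped when the request does not specify it, or when
    applying it would leave no candidates (an assumed fallback rule).
    """
    candidates = list(endpoints)
    for dim in dimensions:
        wanted = request.get(dim)
        if wanted is None:
            continue
        narrowed = [e for e in candidates if e.get(dim) == wanted]
        if narrowed:
            candidates = narrowed
    return candidates


endpoints = [
    {"addr": "a", "region": "r1", "env": "prod", "version": "v1"},
    {"addr": "b", "region": "r1", "env": "prod", "version": "v2"},
    {"addr": "c", "region": "r2", "env": "prod", "version": "v2"},
]
same_region_v2 = route_chain(endpoints, {"region": "r1", "version": "v2"})
any_region_v2 = route_chain(endpoints, {"region": "r9", "version": "v2"})
```

Ordering the dimensions decides which constraint wins when they conflict, which is why the chain is expressed as an ordered list.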

Alibaba Cloud service mesh ASM, a commercial product, also exposes these capabilities to varying degrees. Scenarios such as canary release, A/B testing, and full-link grayscale can be built on them.

Cloud product: Alibaba Cloud service mesh ASM

So far we have introduced Alibaba’s service mesh practice in open source and in large-scale adoption. Next, we share the design of the cloud product within the cloud native trinity. By summarizing its experience from real business scenarios, Alibaba Cloud continues to drive the technology forward and has accumulated a series of core service mesh technologies.

In terms of large-scale adoption: on-demand dynamic push of rule configuration, Sidecar hot upgrade without disrupting large-scale services, support for the most comprehensive range of heterogeneous computing infrastructure, and support for multiple registries and platforms.

In terms of traffic governance: fine-grained traffic control, on-demand dynamic interception by traffic protocol and port, zero-configuration request label routing and traffic coloring, and fine-grained governance of a variety of protocols.

In terms of observability: integrated intelligent operations that combine logs, monitoring, and tracing. Observability is further enhanced with eBPF to achieve non-intrusive, full-link observability and to help troubleshoot services quickly.

In terms of security: support for SPIFFE/SPIRE, zero-trust networking, enhanced authentication mechanisms, and gradual, step-by-step adoption of mTLS.

In terms of performance optimization: network acceleration through eBPF, combining software and hardware optimization.

Alibaba Cloud service mesh ASM is the industry’s first Istio-compatible managed service mesh platform and supports the complete set of service mesh capabilities: fine-grained application traffic management, end-to-end observability, security, and high availability. It supports complex scenarios such as multiple languages, multiple environments, multiple microservice frameworks, and multi-protocol interconnection. The ASM technical architecture has been fully upgraded to V2.0: the core control plane components are hosted, the Standard and Professional editions share a unified architecture, and upgrades across community versions are supported smoothly. ASM also adds a number of capabilities on top of the community standard, mainly traffic management and protocol enhancements, multiple zero-trust security capabilities, and integration with multiple registries such as Nacos and Consul. In addition, mesh diagnosis quickly analyzes the health of the mesh and works with control plane alerts to enable a fast response.

Service mesh ASM is fully integrated with other cloud services, including observability services such as tracing, Prometheus monitoring, and Log Service. Integration with AHAS provides single-service rate limiting, cluster rate limiting, and adaptive rate limiting. Combined with the microservice engine MSE, service governance is supported with a consistent governance experience for multiple clusters across VPCs. For custom extensions, ASM supports the OPA security engine, WebAssembly, and other extension mechanisms.

Users can work with service mesh technology through the ASM console, OpenAPI, declarative cloud native APIs, and the Kubeconfig of the data plane and the control plane. By refining its control plane and management plane, service mesh ASM provides an “Anywhere Service Mesh” for services running on heterogeneous computing infrastructure, covering everything from the gateway to Sidecar injection on the data plane. It supports multiple kinds of infrastructure, including Container Service ACK, Serverless Kubernetes, edge clusters, externally registered Kubernetes clusters, and ECS virtual machines.

Functional design of service mesh ASM

ASM implements full-link grayscale based on traffic marking and label routing. In a microservice architecture, building a complete test environment to verify new features before they go online is time-consuming, and it becomes harder as the number of microservices grows. “Traffic marking” combined with “routing by label” is a general approach that addresses problems such as test environment management and full-link grayscale release in production. Because it is built on service mesh technology, the approach is independent of the development language and can be adapted to different Layer 7 protocols. Service mesh ASM currently supports HTTP/gRPC and Dubbo. ASM introduces a new TrafficLabel CRD that defines where the Sidecar obtains the label for traffic passing through it. Full-link traffic is logically isolated, marked (colored), and routed by label. With service mesh ASM, developers no longer each need to deploy a complete environment, which enables multi-environment management and greatly reduces R&D costs.

Service mesh ASM supports canary release. Release is the last step before a feature goes online, and problems accumulated during development tend to surface at this point. Release is also a complex process in itself, in which mistakes and omissions of key operations are common. Canary release is flexible to configure and its policy can be customized: traffic is shifted by percentage or by specific request content (such as particular accounts or parameters), so a problem does not affect all users. To use it, tag the application with an environment label, then use TrafficLabel to mark user traffic, for example attaching the gray label when the User-id HTTP header satisfies User-id % 100 == 20, and deliver the label routing rule through VirtualService. Traffic from user 120 is then routed to the gray environment, while traffic from user 121 is routed to the normal environment. Canary release in service mesh ASM supports routing by traffic percentage and routing by request characteristics (such as HTTP headers and method parameters), integrates with the service mesh gateway, and supports the HTTP, gRPC, and Dubbo protocols.
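The labeling rule in this example is simple modular arithmetic, which a short sketch makes concrete. The environment names `app-gray` and `app-normal` are hypothetical, and the functions only mimic what the TrafficLabel and VirtualService rules express declaratively.

```python
ENVIRONMENTS = {"gray": "app-gray", "normal": "app-normal"}  # hypothetical environment names


def traffic_label(headers):
    """Label a request gray when User-id % 100 == 20, otherwise normal."""
    user_id = int(headers["User-id"])
    return "gray" if user_id % 100 == 20 else "normal"


def route(headers, environments=ENVIRONMENTS):
    """Pick the target environment from the request's traffic label."""
    return environments[traffic_label(headers)]


target_120 = route({"User-id": "120"})  # 120 % 100 == 20, so the gray environment
target_121 = route({"User-id": "121"})  # 121 % 100 == 21, so the normal environment
```

With this rule roughly one user in a hundred reaches the gray environment, and a given user always lands in the same environment.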

Besides full-link grayscale and canary release based on traffic marking and label routing, service mesh ASM also supports progressive release together with KubeVela. KubeVela is an out-of-the-box, modern application delivery and management platform that simplifies delivering applications to hybrid environments, while remaining flexible enough to keep up with the iteration pressure of a rapidly changing business. The Open Application Model (OAM) behind KubeVela is highly extensible in both design and implementation: it is fully application-centric, offers programmable delivery workflows, and is infrastructure-independent. Combined with KubeVela, Alibaba Cloud service mesh ASM supports complex canary release processes by converting the configuration defined in KubeVela into traffic governance rules and delivering them to the data plane.

Alibaba Cloud service mesh ASM provides zero-trust security capabilities. Communication over plain HTTP within a microservice network is not secure: once an internal service is compromised, an attacker can use that machine as a springboard to attack the rest of the network. Service mesh ASM reduces the attack surface in cloud native environments and provides the infrastructure required for zero-trust application networks. Managing service-to-service security with ASM provides end-to-end encryption, service-level authentication, and fine-grained authorization policies for the service mesh.

ASM zero-trust security has the following advantages over traditional security mechanisms built into application code:

  • The life cycle of ASM Sidecar proxies is independent of the application, which makes these proxies easier to manage.
  • ASM supports dynamic policy configuration, making it easier to update policies immediately without redeploying the application.
  • ASM provides the ability to authenticate end-user credentials attached to a request, such as JWT.
  • ASM’s centralized control architecture enables an enterprise’s security team to build, manage, and deploy security policies that are applicable across the enterprise.

Authentication and authorization systems are deployed as services in the mesh. Like other services in the mesh, they are themselves protected by the mesh, including encryption in transit, identity, policy enforcement points, and authentication and authorization of end-user credentials. The policy control plane defines and manages the various types of authentication policies. The mesh control plane assigns identities to workloads in the mesh and rotates certificates automatically, and the Sidecar proxy on the data plane enforces the policies. In the figure, the rule configured by the user allows only the transaction service to call the order service and denies calls from the shopping cart service to the order service.
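The rule in that example amounts to an allow list evaluated at the order service's Sidecar. A minimal sketch, assuming service identities are plain strings and modeling only calls into the order service (the names are illustrative):

```python
# Assumption: one allow rule, written as (source service, destination service).
ALLOW_RULES = {("transaction", "order")}


def is_allowed(source, destination, allow_rules=ALLOW_RULES):
    """Allow a call only if an explicit rule exists; any other caller is denied."""
    return (source, destination) in allow_rules


transaction_ok = is_allowed("transaction", "order")
cart_ok = is_allowed("shopping-cart", "order")
```

In the mesh, the source identity would come from the workload certificate presented over mTLS, so the application cannot simply claim to be another service.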

Because service mesh ASM hosts the control plane and manages multiple data plane clusters, traffic governance CRs live on the control plane, and users operate governance rules through the Kube API of the control plane. The new version of the service mesh aims to:

1. Support the habits users have formed in unmanaged mode, so that Istio resources can be read and written in the Kubernetes cluster on the data plane;

2. Support common command-line tools such as Helm;

3. Remain compatible with the API operations of other open source software that runs in single-cluster add-on mode.

To this end, Alibaba Cloud service mesh ASM supports accessing Istio resources through the Kube API of the data plane cluster. Both access methods are available at the same time, and users can choose either as needed.

ASM is compatible with community standards and supports smooth upgrades of the control plane. The data plane can be upgraded in two ways: rolling upgrade and hot upgrade. For rolling upgrade, set the upgrade strategy to RollingUpdate, and Pods with an injected Sidecar are upgraded to the new version automatically when they are released. The figure mainly illustrates the second way, the hot upgrade that Alibaba Cloud service mesh ASM implements together with the OpenKruise project. It upgrades the data plane without interrupting service, so the application is unaware of the upgrade. A SidecarSet configuration is generated automatically when the application is released or updated, and updating the SidecarSet configuration completes the data plane upgrade. This capability is currently in grayscale in the new version.

Service mesh ASM, together with Alibaba Cloud Application High Availability Service (AHAS), can apply flow control to applications deployed in the service mesh. It currently supports single-instance rate limiting, cluster rate limiting, and adaptive rate limiting. Service mesh ASM also natively supports Istio’s global and local rate limiting. Global rate limiting uses a global gRPC service to provide rate limiting for the entire mesh, while local rate limiting limits the request rate of each service instance.
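Local rate limiting is commonly modeled as a token bucket kept by each instance. The following is a minimal sketch of that model, not the code the Sidecar runs; the `rate` and `capacity` values and the injectable `clock` parameter are choices made for illustration.

```python
import time


class TokenBucket:
    """Per-instance limiter: refill `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock  # injectable so the behavior can be checked deterministically
        self.last = clock()

    def allow(self):
        """Consume one token if available and report whether the request may pass."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


fake_now = [0.0]
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]  # third request exceeds the burst
fake_now[0] = 1.0                                         # one second later, one token is back
after_refill = [bucket.allow(), bucket.allow()]
```

Because every instance keeps its own bucket, the limit scales with the number of instances. A mesh-wide quota needs the global variant, where instances consult a shared rate limit service.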

Service mesh ASM also supports the MCP over xDS protocol to connect to the registry of the microservice engine MSE and synchronize service information to the mesh. MSE Nacos supports the MCP protocol natively, so users only need to enable Nacos registry integration when creating or updating an ASM instance. Registry services are then synchronized to the service mesh, which makes it easy to bring Dubbo and Spring Cloud services onto the mesh without modifying any business code.

Finally, we share several customer cases that show how customers use service mesh ASM to shorten the adoption cycle of service mesh technology, reduce the effort of troubleshooting, and save control plane resource costs.

1. As Dongfeng Nissan’s business grew, the “Twelve Zodiac” setup created earlier (twelve complete test environments) could no longer meet the many concurrent demands, and environments even had to be allocated by lottery. By adopting Alibaba Cloud service mesh ASM, the team built an “Infinite Zodiac” system based on traffic management, which provides environments automatically and on demand. With the O&M, upgrade, and product support that ASM provides, the production team can focus on the value that service mesh delivers.

2. To cope with global business expansion and integrated operations, business applications are deployed across regions based on Alibaba Cloud service mesh ASM and Container Service ACK. A regional access strategy optimizes the customer access experience, effectively reducing access latency and improving response speed.

3. Shangmi Technology adopted Alibaba Cloud service mesh ASM to build its intelligent digital business solution, an integrated software and hardware system for smart POS, and uses ASM to solve core problems such as gRPC load balancing, tracing, and unified traffic management.

This article has shared the thinking and practice behind the trinity strategy for Alibaba’s service mesh technology. It touched on some features of Alibaba Cloud service mesh ASM, including recently released ones such as historical version management of Istio resources, access to Istio resources through the Kubernetes API of the data plane cluster, cross-region failover and cross-region traffic distribution, control plane log collection and log alerting, and progressive release based on KubeVela. For more information about traffic management, observability, zero-trust security, solutions, and other product features, see the product documentation of Alibaba Cloud service mesh ASM. If you are interested in service mesh ASM, search for group number 30421250 to join the ASM user group and explore service mesh technology together.
