Preface

Service Mesh claims to be the next generation of microservice architecture technology, one that can effectively solve the service governance pain points of current microservice architectures. Since its introduction in 2016, Service Mesh has been a hot topic in the architecture field.

In the previous article, we talked about containerization, which solves the problem of scaling microservices by standardizing the deployment process. Containerization, however, does not solve problems at service runtime. A Service Mesh can standardize service communication and service governance, reducing the communication and rework costs caused by inconsistent governance standards across services and improving the efficiency of global service governance.

Next, let’s dive into the Service Mesh, starting with its conceptual definition.

Concept definition

The concept of a Service Mesh was first introduced by William Morgan, CEO of Buoyant, initially at an in-house briefing. In 2017, William published What’s a Service Mesh? And Why Do I Need One?, which provides an authoritative definition of the term:

A service mesh is a dedicated infrastructure layer for handling service-to-service communication. It’s responsible for the reliable delivery of requests through the complex topology of services that comprise a modern, cloud native application. In practice, the service mesh is typically implemented as an array of lightweight network proxies that are deployed alongside application code, without the application needing to be aware.


However, this article was actually updated on October 12, 2020, and the definition of the Service Mesh has also changed:

A service mesh is a tool for adding observability, security, and reliability features to applications by inserting these features at the platform layer rather than the application layer.


Obviously, this definition covers a much broader range of functionality than just handling communication between services. Indeed, Service Mesh has evolved into a standardized, systematic, non-invasive distributed service governance platform and is one of the key components of the cloud native technology stack.

Of course, grasping a Service Mesh from its definition alone is tricky. Next, let’s look at the Service Mesh from an evolutionary perspective, which is much more straightforward.

The evolution

One of the classic articles on the evolution of the Service Mesh is Phil Calçado’s “Pattern: Service Mesh.” Much of this section is distilled and summarized from that article.

Starting from the first generation of networked computer systems, communication evolved as follows:

Initially, computers were rare and expensive, so every link between two nodes was carefully designed and maintained. As computers became cheaper and more common, the number of connections and the amount of data passing through them increased dramatically. As people came to rely more and more on networked systems, engineers needed to ensure that the software they built could meet the quality of service their users required.

There are many questions to answer in order to reach the desired level of quality: machines need ways to find each other, to handle multiple simultaneous connections over the same wire, to communicate when not directly connected, to route packets across the network, to encrypt traffic, and so on. Flow control, for example, prevents server A from flooding downstream server B with more packets than it can handle. Initially, this logic had to be implemented by each service itself, so a service contained both its business logic and its flow-control logic, coupled together.
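To make the coupling concrete, here is a minimal sketch of the kind of flow-control logic a service once had to carry itself. A token bucket is just one common approach, and all names and parameters here are illustrative, not taken from any particular system:

```python
import time

class TokenBucket:
    """Toy flow control: allow at most `rate` sends per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        # Caller should back off instead of overwhelming the downstream server.
        return False
```

Every service that wanted this protection had to embed something like it next to its business logic, which is exactly the coupling described above.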

Later, to spare every service from implementing the same network transmission logic, protocols such as TCP/IP were developed to solve flow control and many other general problems of network transmission. This technology moved down the stack, out of the services themselves, and became part of the operating system’s network layer.

The era of microservices faces something similar. In addition to its business logic, each microservice must deal with non-business concerns such as service discovery, circuit breaking, load balancing, monitoring, and tracing. Initially, these non-business functions were also implemented by the developers of each service, coupled with the business logic, as shown below:

Later, to avoid rewriting the same logic in every service, a number of microservice development frameworks emerged, such as Twitter’s Finagle, Facebook’s Proxygen, Dubbox, and Spring Cloud, which implement functions such as service discovery and circuit breaking. With them, developers can add these capabilities to a service with a small amount of framework code.

This may seem perfect, but several serious pain points remain:

  • The frameworks are large, with a steep learning curve. Although a framework hides some implementation details of distributed communication, developers must spend considerable effort mastering and managing the complex framework itself, and in practice it is not easy to track down and solve problems inside it. Spring Cloud, the most widely used, has a dozen or so components, and most people need three to six months to master it.
  • Focusing on inter-service communication slows business iteration. The core goal of a business development team should be meeting business requirements, but much time and effort now goes into non-business concerns, significantly slowing business iteration.
  • Cross-language support is poor. An important characteristic of microservices is language independence, but a development framework usually supports only one or a few specific languages. Services written in languages the framework does not support are hard to integrate into the microservice architecture, and implementing different modules of the system in whichever language suits them best becomes difficult.
  • Framework components are hard to upgrade. The framework is linked into a service as a library, and in complex projects with many dependencies, keeping library versions compatible is very difficult. Framework upgrades also cannot be made transparent to the service: a service may be forced to upgrade because of a library change that has nothing to do with its business.

In this situation, a natural thought arises: if technologies for network transmission such as flow control could be moved down the stack, can the microservices framework also be moved down into a separate base layer?

The ideal approach would be to add this layer to the network stack itself, but that is not feasible because of standardization issues.

Some pioneers found a solution: implement it as a set of proxies. The main idea is that instead of a service connecting directly to its downstream dependencies, all traffic passes transparently through a small piece of software that adds the required functionality. These proxies are typically deployed as sidecars alongside the business services, providing them with additional capabilities.
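As a toy illustration of the sidecar idea, here is a minimal Python sketch of a proxy that forwards one request to an upstream service and relays the response back; the function name and structure are my own, and real sidecars such as Linkerd or Envoy intercept traffic transparently and add far more (retries, metrics, mTLS, and so on) at this hop:

```python
import socket
import threading

def start_sidecar(upstream_host: str, upstream_port: int) -> int:
    """Minimal sidecar sketch: listen on an ephemeral local port, forward one
    request to the upstream service, relay the response back, then shut down.
    Returns the port the sidecar is listening on."""
    server = socket.create_server(("127.0.0.1", 0))

    def serve() -> None:
        conn, _ = server.accept()
        with conn, socket.create_connection((upstream_host, upstream_port)) as up:
            up.sendall(conn.recv(4096))   # the application is unaware of this extra hop
            conn.sendall(up.recv(4096))   # relay the upstream response back
        server.close()

    threading.Thread(target=serve, daemon=True).start()
    return server.getsockname()[1]
```

The key property is that the business service keeps talking plain TCP to “its dependency,” while the sidecar owns the connection and can add functionality in between.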

In this model, each service has an accompanying proxy, and services communicate with one another through these sidecar proxies. From a global perspective, the deployment looks like this:

You can see that the interconnected proxies form a mesh network, which is what we call a Service Mesh. This, however, was only the first generation of Service Mesh, represented by products such as Linkerd, Envoy, and NginMesh. Its defining characteristic is that all service communication and governance functions are handled in the proxy, resulting in a heavyweight proxy. Because the proxy carries so many features and functions, it must be updated and modified very frequently, which increases the probability of problems and hurts its stability. At the same time, the proxy carries all microservice communication traffic, so its stability requirements are high: any failure of the proxy has a large impact on the stability of the whole system.

To resolve this tension between frequent upgrades and stability, the policy and configuration decision logic was separated from the proxy into an independent control plane, with the proxy itself becoming the data plane. This is the second generation of Service Mesh, modeled as follows:

The data plane is responsible for communication between microservices, including RPC communication, service discovery, load balancing, circuit breaking and degradation, and rate limiting and fault tolerance. It can be thought of as a language-independent process that separates communication and service governance capabilities from language-specific microservice frameworks like Spring Cloud and Dubbo, with a greater emphasis on generality and extensibility.

The control plane manages the data plane and defines policies for service discovery, routing, traffic control, and telemetry. Policies can be global or targeted at a specific data plane node. The control plane delivers policies to each data plane node through some mechanism, and the nodes use these policies when communicating with each other.
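The push model described above can be sketched in a few lines of Python; the class and field names here are my own, not from any real product:

```python
class DataPlaneNode:
    """A proxy instance that applies whatever policy the control plane pushes."""

    def __init__(self, name: str):
        self.name = name
        self.policy: dict = {}

    def apply(self, policy: dict) -> None:
        self.policy = policy


class ControlPlane:
    """Holds the desired policies and pushes them to every registered node."""

    def __init__(self):
        self.nodes: list[DataPlaneNode] = []
        self.global_policy: dict = {}

    def register(self, node: DataPlaneNode) -> None:
        self.nodes.append(node)
        node.apply(self.global_policy)   # a newly joined node receives the current policy

    def set_policy(self, policy: dict) -> None:
        self.global_policy = policy
        for node in self.nodes:          # push to all nodes, rather than each node polling
            node.apply(policy)
```

The point of the split is visible even in this toy: decision logic changes only in `ControlPlane`, while the nodes on the traffic path stay simple and stable.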

The second generation Service Mesh is represented by Istio.

Linkerd

Linkerd is the first Service Mesh project, developed by Buoyant; it joined the CNCF in early 2018. There are currently two versions: Linkerd 1.x and Linkerd 2.x.

Linkerd 1.x is a first-generation Service Mesh implementation. First released in January 2016, it is written in Scala, and its final version is 1.7.4. Linkerd 1.x offers two deployment models: per-host and sidecar. In the per-host model, one Linkerd instance is deployed on each host, and all application service instances on that host route traffic through it.

In the sidecar model, each application service instance is deployed with its own Linkerd instance; this is useful primarily for instance- or container-based rather than host-based deployments.

There are also three configurations for how application services and Linkerd communicate: service-to-Linkerd, Linkerd-to-service, and Linkerd-to-Linkerd.

Linkerd 1.x is relatively simple, but it is no longer recommended: parts of its design are outdated, its memory consumption is high, and TCP requests are not supported.

Linkerd 2.x is the recommended second-generation Service Mesh product. It is not an upgrade of the 1.x line; rather, it was completely rewritten in Go and Rust, specifically for Kubernetes. The architecture of Linkerd 2.x is as follows:

As you can see, Linkerd 2.x is divided into a data plane and a control plane. The data plane consists of lightweight proxies deployed as sidecar containers alongside application service instances. The control plane is a set of components running in a dedicated Kubernetes namespace; these components aggregate telemetry data, provide a user-facing API, and supply control data to the data plane proxies, together driving the behavior of the data plane.

The data plane and control plane of Linkerd 2.x are tightly coupled. The advantage is simple configuration and low complexity; the disadvantage is poor extensibility. The industry trend, in contrast, is for the data plane and control plane to be decoupled and to communicate through standard APIs.

Envoy

Envoy is an open source edge and service proxy designed for cloud native applications. Originally built at Lyft and later donated to the CNCF, Envoy is a high-performance C++ distributed proxy designed for individual services and applications, as well as a communication bus and “universal data plane” for large microservice “service mesh” architectures. Envoy’s design draws on solutions such as Nginx, HAProxy, hardware load balancers, and cloud load balancers. Envoy runs alongside each application and abstracts the network by providing common functionality in a platform-independent way. When all service traffic in an infrastructure flows through the Envoy mesh, problem areas become easy to visualize with consistent observability, overall performance can be tuned, and underlying functionality can be added in a single place.

In a Service Mesh, Envoy acts purely as a general-purpose data plane. Envoy does not have its own control plane, but it provides standard APIs for other control planes to integrate with. This is crucial, and it is why Envoy became more popular than Linkerd 1.x. Today Envoy can be called the de facto standard data plane of the cloud native era: Istio, Kuma, AWS App Mesh, and others all use Envoy as their default data plane.

Envoy’s architecture looks like this:

When Envoy receives a request, it passes the request through a FilterChain, processing it with various L3/L4/L7 filters, routes it to the specified cluster, obtains a target address through load balancing, and forwards it. Each link in this chain can be configured statically or discovered dynamically. Dynamic discovery is achieved through xDS: the underlying series of discovery services are collectively called xDS.
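The request path just described (filters, then route matching, then cluster load balancing) can be sketched as a toy model; this is not Envoy’s actual API, and all names are illustrative:

```python
import itertools

class Cluster:
    """Toy upstream abstraction: a named set of endpoints with round-robin load balancing."""

    def __init__(self, name: str, endpoints: list[str]):
        self.name = name
        self._rr = itertools.cycle(endpoints)

    def pick_endpoint(self) -> str:
        return next(self._rr)


def route(path: str, routes: dict[str, Cluster], filters: list) -> str:
    """Run the request path through each filter, match a route prefix to a
    cluster, then pick a concrete endpoint via load balancing."""
    for f in filters:                    # stand-in for Envoy's L3/L4/L7 filter chain
        path = f(path)
    for prefix, cluster in routes.items():
        if path.startswith(prefix):      # stand-in for route matching rules
            return cluster.pick_endpoint()
    raise LookupError("no route matched")
```

Each of these stages corresponds to one of the dynamically discoverable resources described next.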

Each configuration resource in the xDS API has a type associated with it. Currently, eight resource types are supported. Among them, the four core resources are Listener, Router, Cluster, and Filter.

  • Listener: The basis of Envoy’s work. Briefly, a Listener is a listening port that Envoy opens to receive connections from downstream. Multiple Listeners can be configured, and they are almost completely isolated from one another. A Listener’s configuration includes its listening address and FilterChain. The xDS corresponding to the Listener is the Listener Discovery Service (LDS). LDS is the basis for Envoy to function properly: without LDS, Envoy cannot listen on any port (unless the startup configuration provides a static Listener), and all other xDS services lose their meaning.

  • Cluster: An abstraction of upstream services; each upstream service is abstracted into a Cluster. A Cluster contains the service’s connection pool, timeout, endpoint addresses and ports, type (which determines how Envoy obtains the Cluster’s endpoints), and so on. The xDS corresponding to the Cluster is the Cluster Discovery Service (CDS). Typically, the CDS service pushes all accessible services it has discovered to Envoy. A closely related service is the Endpoint Discovery Service (EDS): CDS delivers Cluster resources, and if a Cluster’s type is EDS, all of its endpoints must be delivered by the xDS service rather than resolved through DNS. The service that delivers those endpoints is EDS.

  • Router: The bridge between upstream and downstream. After a Listener receives a downstream connection and data, the Router decides which Cluster the data goes to; it defines the rules for data distribution. Although most of the time a Router can be understood as an HTTP route, Envoy supports multiple protocols, such as Dubbo and Redis, so a Router is any set of rules and resources used to bridge Listeners and back-end services (not limited to HTTP). The xDS corresponding to the Router is the Route Discovery Service (RDS). A Router’s core configuration includes matching rules and target Clusters, and may also include retries, traffic splitting, and rate limiting.

  • Filter: Loosely speaking, a plug-in. Envoy gains its powerful extensibility through the Filter mechanism, and many of its core functions are implemented as Filters. For example, HTTP traffic and service governance rely on two plug-ins: HttpConnectionManager (a network Filter responsible for protocol parsing) and Router (responsible for traffic distribution). Through the Filter mechanism, Envoy can in theory support any protocol, convert between protocols, and modify and customize request traffic in every respect. The Filter mechanism brings not only strong extensibility but also excellent maintainability, since it lets users enhance every aspect of Envoy without touching the community source code. Filter itself has no dedicated xDS for discovering configuration; all Filter configuration is embedded in LDS, RDS, and CDS.
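To illustrate how these discovery services fit together, here is a toy xDS-style cache in Python: the control plane pushes typed resources, and a Cluster of type EDS resolves its endpoints from EDS pushes rather than DNS. The class and constant names are illustrative only, not Envoy’s real API:

```python
# Resource type names loosely modeled on xDS; the cache itself is a toy.
LDS, RDS, CDS, EDS = "Listener", "Route", "Cluster", "Endpoint"

class XdsCache:
    """Toy xDS cache: the control plane pushes typed, named resources, and the
    proxy always reads its current configuration from here, with no restart."""

    def __init__(self):
        self.resources: dict[str, dict[str, object]] = {LDS: {}, RDS: {}, CDS: {}, EDS: {}}

    def push(self, type_name: str, name: str, resource: object) -> None:
        # Last write wins, mimicking an accepted xDS update for that resource.
        self.resources[type_name][name] = resource

    def endpoints_for(self, cluster: str) -> object:
        # A Cluster of type EDS resolves its endpoints from pushed EDS
        # resources rather than from DNS.
        return self.resources[EDS].get(cluster, [])
```

A later push simply replaces the resource, which is how endpoint sets can change at runtime without restarting the proxy.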

The relationship between these four core resources and their xDS is shown below:

In addition, since May 2019 the CNCF has run a working group, initially including delegates from the Envoy and gRPC projects, to develop a standard data plane API called UDPA (Universal Data Plane API). UDPA aims to provide the de facto standard for L4/L7 data plane configuration, similar to the role OpenFlow plays for L2/L3/L4 in SDN. UDPA evolves from the Envoy xDS API and covers service discovery, load balancing, route discovery, listener configuration, security discovery, load reporting, health check delegation, and more.

The progress of UDPA is slow, but it is certain that xDS is gradually moving towards UDPA and will be based on UDPA in the future.

Istio

Istio, the representative of the second generation of Service Mesh, was founded by Google, IBM, and Lyft. Google and IBM are the main developers, while Lyft’s contribution centers on Envoy, which acts as Istio’s data plane. Since Istio’s launch, the community response has been enthusiastic, and today it can be called the de facto standard control plane of the cloud native era.

Istio has four main functions:

  • Connect: Intelligently control the flow of traffic and API calls between services, conduct a range of tests, and roll out upgrades gradually with red/black deployments.
  • Secure: Automatically protects your services through managed authentication, authorization, and encryption of communication between services.
  • Control: Applying policies and ensuring that they are enforced so that resources are fairly distributed among consumers.
  • Observe: Diversified, automated tracking, monitoring, and logging of all your services to keep abreast of what is happening in real time.

Let’s take a look at Istio’s overall logical architecture:

The data plane consists of a set of intelligent proxies deployed as sidecars, which mediate and control all network communication between microservices and collect and report telemetry on all mesh traffic. The control plane consists of the Pilot, Citadel, and Galley components, which manage and configure the proxies to route traffic.

Pilot provides the Envoy sidecars with service discovery, traffic management capabilities for intelligent routing (for example, A/B testing and canary releases), and resiliency features (timeouts, retries, circuit breakers, and so on). Pilot translates high-level routing rules that control traffic behavior into environment-specific configurations and propagates them to the sidecars at runtime. Pilot abstracts platform-specific service discovery mechanisms and synthesizes them into a standard format that any sidecar conforming to the Envoy API can consume.

Citadel enables strong service-to-service and end-user authentication with built-in identity and certificate management. You can use Citadel to upgrade unencrypted traffic in the service mesh. With Citadel, operators can enforce policies based on service identity rather than on relatively unstable Layer 3 or Layer 4 network identifiers. Starting with version 0.5, Istio’s authorization feature can control who can access your services.

Galley is the configuration verification, extraction, processing, and distribution component of Istio. It is responsible for isolating the remaining Istio components from the details of getting the user configuration from the underlying platform, such as Kubernetes.

As a side note, starting with Istio 1.5 the control plane components are packaged into a single binary called istiod; before that, they were deployed as separate microservices.

Putting it into practice

After all this discussion, what matters most is how to bring a Service Mesh into a production project. Note that in most production projects a Service Mesh is not adopted from scratch; rather, the existing microservice architecture is modified and upgraded.

As we know, in the original microservice architecture the microservice framework occupies the core position, but a Service Mesh architecture must replace that framework, which amounts to heart-replacement surgery. Furthermore, the transition from a traditional microservice architecture to a Service Mesh architecture must be smooth, without disrupting service. It is therefore a big challenge.

To make the transition smooth, you cannot switch all microservices from the framework to the Service Mesh in one step; during the transition, the microservice framework and the Service Mesh must coexist. We can start with a relatively peripheral, independent business line as a pilot, run it for a while, and then gradually expand to other business lines, including the core business, until everything has been migrated.

As for technology selection, there is little debate: Istio + Envoy has become almost the standard. The Envoy proxy runs alongside each microservice as a sidecar, intercepting and forwarding its traffic through iptables, transparently to the service. Service governance functions, including service discovery, load balancing, circuit breaking, rate limiting, and monitoring, all operate on traffic, so implementing them is not intrusive to microservices. Microservice frameworks, by contrast, are known to be intrusive to business code.

The main work of switching system architecture from traditional microservices to Service Mesh can be divided into four steps:

  1. Container environment construction;
  2. Service Mesh environment construction;
  3. Removal of microservices framework functionality;
  4. Inject microservices into the Service Mesh platform.

Because building the Service Mesh environment depends on the container environment, the first step must be containerization. And because the microservice framework is intrusive to services while the Service Mesh is not, the main change to the business code is removing the dependency on the microservice framework.

Finally, I would like to emphasize that I don’t have any practical experience with Service Mesh, so these are just my personal thoughts. If there are any mistakes, you are welcome to point them out.

Conclusion

Service Mesh is gaining momentum, and the ecosystem is definitely moving toward standardization, just like the container ecosystem. Given the trend, it’s important to understand the technology well enough to prepare for the future.

This is the end of this series, and while there may be no end to the evolution of architectures, my current understanding of more advanced architectures, such as mid-platform and Serverless, is too shallow to cover in depth.


Previous articles:

Evolution of the transaction system architecture (VI): Containerization

Evolution of the transaction system architecture (V): Service governance

Evolution of the transaction system architecture (IV): Distributed transactions

Evolution of the transaction system architecture (III): Microservices

Evolution of the transaction system architecture (II): Version 2.0

Evolution of the transaction system architecture (I): Version 1.0