A Service Mesh is an infrastructure layer that enables secure, fast, and reliable communication between services. If you are building a cloud native application, you need a Service Mesh.

Over the past year, the Service Mesh has become a key component of the cloud native stack. Many companies that handle heavy traffic, such as PayPal, Lyft, Ticketmaster, and Credit Karma, already run a Service Mesh in their production applications.

In January, Linkerd, the open source Service Mesh, became an official project of the CNCF (Cloud Native Computing Foundation). But what exactly is a Service Mesh? And why is it suddenly so important?

In this article, I will define the Service Mesh and trace its evolution through application architectures over the past decade. I’ll explain how it differs from API gateways, edge proxies, and enterprise service buses. Finally, I’ll describe where the Service Mesh is going and what to expect.

What is a Service Mesh?

The Service Mesh is an infrastructure layer that handles communication between services. Cloud native applications have complex service topologies, and the Service Mesh ensures that requests can be reliably shuttled through these topologies. In practice, a Service Mesh is usually composed of a set of lightweight network proxies that are deployed alongside the application, without the application needing to know about their existence.
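To make “lightweight network proxies deployed alongside the application, without the application needing to know about their existence” concrete, here is a minimal sidecar-style TCP forwarder in Python. It is an illustrative sketch, not how Linkerd is built: the addresses are hypothetical, and a real mesh proxy layers routing, load balancing, TLS, and telemetry on top of this kind of relay.

```python
import socket
import threading

# Hypothetical addresses: the application talks to localhost:15001 and
# never learns that this sidecar relays its traffic to the real upstream.
LISTEN_ADDR = ("127.0.0.1", 15001)
UPSTREAM_ADDR = ("10.0.0.7", 8080)

def pipe(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes from src to dst until the connection closes."""
    try:
        while chunk := src.recv(4096):
            dst.sendall(chunk)
    finally:
        dst.close()

def handle(client: socket.socket) -> None:
    upstream = socket.create_connection(UPSTREAM_ADDR)
    # Relay both directions; the application sees an ordinary TCP
    # connection and needs no knowledge of the proxy in the path.
    threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
    pipe(upstream, client)

def main() -> None:
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(LISTEN_ADDR)
    server.listen()
    while True:
        client, _ = server.accept()
        threading.Thread(target=handle, args=(client,), daemon=True).start()

if __name__ == "__main__":
    main()
```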

With the rise of cloud native applications, Service Mesh is becoming an independent infrastructure layer. In the cloud native model, an application can consist of hundreds of services, each of which may have thousands of instances, and each instance may change continuously. Communication between services is not only incredibly complex, but also fundamental to runtime behavior. Managing communication between services is critical to ensuring end-to-end performance and reliability.

Is Service Mesh a network model?

A Service Mesh is essentially an abstraction layer on top of TCP/IP. It assumes that the underlying L3/L4 network is capable of point-to-point byte transfer (of course, it also assumes that the network environment is unreliable, so the Service Mesh must be capable of handling network failures).

In some ways, the Service Mesh is similar to TCP/IP. TCP abstracts the mechanism for transferring bytes between network endpoints, while the Service Mesh abstracts the mechanism for routing requests between service nodes.

The Service Mesh does not care what message bodies are or how they are encoded. The goal of an application is to “get something from A to B,” and all the Service Mesh does is achieve that goal and handle any failures that might occur along the way.

Unlike TCP, the Service Mesh has a higher purpose: to provide unified, application-level visibility and control at runtime. The Service Mesh lifts interservice communication out of the invisible infrastructure and makes it a first-class citizen of the ecosystem, where it can be monitored, managed, and controlled.

What can a Service Mesh do?

Transferring service requests in cloud native applications is a complex task. Linkerd, for example, uses a number of powerful techniques to manage this complexity: circuit breaking, latency-aware load balancing, eventually consistent service discovery, retries, and timeouts. These techniques need to be combined and coordinated, and their interactions with the environment are subtle.

For example, when a request flows through Linkerd, the following sequence of events occurs (a toy sketch of this pipeline follows the list):

  1. Linkerd uses dynamic routing rules to determine which service the request is destined for. Is it a production service or a staging service? A service in the local data center or in the cloud? The latest version of the service or an older one? These routing rules can be configured dynamically and applied globally or to arbitrary slices of traffic.
  2. After determining the target service, Linkerd retrieves the corresponding pool of instances from the service discovery endpoint. If this information diverges from what Linkerd has observed in practice, it must decide which source of information to trust.
  3. Linkerd selects the instance most likely to return a fast response, based on factors such as its observed latency on recent requests.
  4. Linkerd sends the request to the selected instance and logs the latency and response type.
  5. If the selected instance is down, unresponsive, or unable to process the request, Linkerd retries the request on another instance (provided the request is idempotent).
  6. If an instance continues to return errors, Linkerd removes it from the load balancing pool and periodically retries it later (the instance may have failed temporarily).
  7. If the request times out, Linkerd proactively fails it rather than adding more load with further retries.
  8. Linkerd captures these behaviors in the form of metrics and distributed traces and exports them to a centralized metrics system.

Beyond this request flow, Linkerd can also initiate and terminate TLS, perform protocol upgrades, dynamically shift traffic, and fail over between data centers.
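The sketch below is a toy, single-process rendition of the pipeline above, written in Python. The route table, the EWMA latency store, the ejection window, and the caller-supplied send function are all assumptions made for illustration; the point is only to show how routing, latency-aware balancing, retries, ejection, and timeouts compose, not to mirror Linkerd’s implementation.

```python
import random
import time

# Hypothetical stand-ins for what a real mesh gets from its control plane.
ROUTES = {"users": ["10.0.0.7:8080", "10.0.0.8:8080", "10.0.0.9:8080"]}
ewma_latency: dict[str, float] = {}   # per-instance moving-average latency
ejected_until: dict[str, float] = {}  # instances removed from the pool

def healthy_instances(service: str) -> list[str]:
    """Steps 1-2: resolve the route, skipping ejected instances (step 6)."""
    now = time.monotonic()
    pool = [i for i in ROUTES[service] if ejected_until.get(i, 0.0) <= now]
    return pool or ROUTES[service]  # if everything is ejected, try anyway

def pick_instance(pool: list[str]) -> str:
    """Step 3: 'power of two choices' - sample two instances and prefer
    the one with the lower observed latency."""
    if len(pool) == 1:
        return pool[0]
    a, b = random.sample(pool, 2)
    return a if ewma_latency.get(a, 0.0) <= ewma_latency.get(b, 0.0) else b

def record_latency(instance: str, elapsed: float) -> None:
    """Steps 4 and 8: fold the observed latency into a moving average
    (a stand-in for exporting metrics to a central system)."""
    prev = ewma_latency.get(instance, elapsed)
    ewma_latency[instance] = 0.9 * prev + 0.1 * elapsed

def request(service: str, send, idempotent: bool,
            retries: int = 2, timeout: float = 1.0):
    """Steps 4-7: send, measure, retry idempotent failures on another
    instance, and eject instances that error."""
    attempts = retries + 1 if idempotent else 1  # step 5: retry only if idempotent
    for _ in range(attempts):
        instance = pick_instance(healthy_instances(service))
        start = time.monotonic()
        try:
            reply = send(instance, timeout=timeout)  # caller-supplied transport
            record_latency(instance, time.monotonic() - start)
            return reply
        except Exception:
            record_latency(instance, time.monotonic() - start)
            # Step 6: drop the misbehaving instance from the pool for a while.
            ejected_until[instance] = time.monotonic() + 10.0
    raise RuntimeError(f"all attempts to reach {service} failed")  # step 7: give up
```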

These features of Linkerd provide resilience both locally and at the application layer. Large-scale distributed systems share a defining trait: small, localized failures can accumulate into system-wide outages. The job of the Service Mesh is to shed load and fail fast, stopping failures from spreading through the entire system before the underlying services reach their limits.
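One way to picture this fail-fast behavior is the circuit breaker mentioned earlier. The minimal sketch below, with arbitrary threshold and cooldown values chosen purely for illustration, stops calling an instance after repeated failures and lets a single probe request through once a cooldown elapses; a production breaker tracks considerably more state.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: fail fast instead of piling more load
    onto an instance that keeps erroring (illustrative thresholds)."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cooldown before probing again
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True   # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.reset_after:
            return True   # half-open: let a single probe request through
        return False      # open: fail fast without touching the instance

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()
```

A caller would check allow() before each request and report the outcome back through record_success() or record_failure(), so that a persistently failing instance stops receiving traffic almost immediately.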

Why do we need a Service Mesh?

The Service Mesh is not new functionality. Web applications have always had to manage the complexity of communication between services themselves, and the outline of the Service Mesh can be traced through how applications have evolved over the past decade.

Mid-sized Web applications circa 2000 typically used a three-tier model: application logic, web serving logic, and storage logic. Interaction between layers was not trivial, but its complexity was bounded: a request traversed at most two hops. There was no “mesh,” but there was still communication logic between hops.

As scale grew, this structure became inadequate. Companies like Google, Netflix, and Twitter, faced with massive traffic, implemented what was effectively a precursor to the cloud native approach: the application layer was broken into many services (also known as microservices), turning the tiers into a topology. Such systems needed a common communication layer, usually in the form of a “rich client” library, such as Twitter’s Finagle, Netflix’s Hystrix, and Google’s Stubby.

In many ways, libraries like Finagle, Stubby, and Hystrix were the original Service Meshes. The cloud native model adds two elements to this microservices model: containers (like Docker) and orchestration layers (like Kubernetes). Containers provide resource isolation and dependency management, and the orchestration layer abstracts and pools the underlying hardware.

These three components enable applications to scale and tolerate local failures in the cloud. But as the number of services and instances grows, the orchestration layer is scheduling instances constantly, the path a request takes across the service topology becomes extremely complex, and services may be written in any language, which makes the earlier “rich client” library approach impractical.

This combination of complexity and urgency gave rise to the Service Mesh: a dedicated layer for communication between services, decoupled from application code and able to keep up with the highly dynamic nature of the underlying environment.

The future of Service Mesh

Although the use of the Service Mesh in cloud native systems has grown rapidly, there is still a lot of room for it to expand. Serverless computing (such as Amazon’s Lambda) fits naturally into the Service Mesh’s model of naming and linking, which underscores the mesh’s role in the cloud native ecosystem.

Service identity and access policy are still rudimentary in cloud native environments, and the Service Mesh is well placed to become the foundation they are built on. Just as TCP/IP did before it, the Service Mesh will take the abstraction of the underlying infrastructure a step further.

Conclusion

Service Mesh is a key component in the cloud native stack. The Linkerd project became an official CNCF project more than a year after its launch, and now has a large number of contributors and users. Linkerd’s users range from startups (like Monzo) to massive Internet companies (like PayPal, Ticketmaster, Credit Karma) to centuries-old companies (like Houghton Mifflin Harcourt).

This article is based on “What’s a Service Mesh? And why do I need one?”, republished with the author’s permission: buoyant.io/2017/04/25/…