Zhao Huabing is a senior engineer at Tencent Cloud, an Istio Member, a Managing Member of ServiceMesher, a contributor to the Istio project, and the founder of the Aeraki project. He is keen on open source, networking, and cloud computing, and currently works mainly on open source and research and development for service mesh. Tang Yang is an infrastructure engineer at Zhihu and a contributor to the Istio and Argo projects, focusing on open source, cloud native, and microservices. He is currently responsible for the research and development of Zhihu’s service mesh.

Note: This article is compiled from the talk “How to Manage Any Layer-7 Traffic in an Istio Service Mesh?” given by Zhao Huabing (Tencent Cloud) and Tang Yang (Zhihu) at IstioCon 2021.

Hi, the topic we want to share with you today is how to extend Istio to support any Layer 7 protocol. As one of the most popular open source projects in the cloud native space, Istio has become the de facto standard for Service Mesh. Tencent Cloud also provides a managed Service Mesh service, TCM (Tencent Cloud Mesh), which is enhanced based on Istio and fully compatible with the Istio API, to help our users quickly leverage the traffic management and service governance capabilities of Service Mesh with minimal migration and maintenance costs. Today I am very glad to have this opportunity to share some of our experience in this process.

Service Mesh provides an application-transparent infrastructure layer that addresses common challenges in distributed applications and microservices, such as: How do you find a service provider? How do you secure communication between services? How do you know the call relationships between services? How do you manage traffic, for example for a grayscale (canary) release? The Service Mesh is implemented by deploying a Sidecar Proxy alongside the application. The Sidecar Proxy intercepts the application’s inbound and outbound traffic, then analyzes and processes it, so that traffic management, security and encryption, and telemetry data collection are achieved without modifying the application code. To provide these service governance capabilities, the Sidecar Proxy needs to process traffic not only at layers 3 and 4 of the OSI network model but, more importantly, at layer 7. At layer 7, Istio supports only HTTP and gRPC by default. However, microservices often use other Layer 7 protocols as well. When migrating these microservice applications to the Service Mesh, we want to manage all of these Layer 7 protocols in a consistent way so that we can take full advantage of the cloud native capabilities of the Service Mesh infrastructure.

In today’s talk, I will introduce several approaches to extending Istio’s traffic management capabilities to other Layer 7 protocols and compare their advantages and disadvantages. I’ll show you how to use the Aeraki open source project to manage any Layer 7 protocol in Istio, including Dubbo, Thrift, Redis, etc. To give you an idea of how Aeraki works, I will show an example of Traffic Splitting for a Thrift service using Aeraki. Tang Yang from Zhihu will also show some interesting real-life examples of how Aeraki is used.

Common Layer 7 protocols in a Service Mesh

As shown in the figure below, a typical microservice application usually uses the following Layer 7 protocols:

  • Synchronous invocation: Different services call each other using RPC (remote procedure call). Common RPC protocols include gRPC, Thrift, and Dubbo. HTTP can also be regarded as a kind of RPC, one limited to a small set of methods such as GET, PUT, and POST. To meet the needs of their specific business scenarios, some large companies also use private RPC protocols.

  • Asynchronous messaging: In addition to RPC, asynchronous messaging is a common mode of microservice communication, using systems such as Kafka, RabbitMQ, and ActiveMQ.

  • Various databases and caching systems: Redis, MySQL, MongoDB, etc.

What management capabilities do we expect from Service Mesh when adding such a microservice application to the Service Mesh?

Ideally, we want the Service Mesh to manage traffic for all Layer 7 protocols used in microservices, including RPC, Messaging, Cache, DB, and so on. For example:

  • Request-based load balancing: Multiple independent requests from the same TCP connection can be distributed to different back-end servers to achieve smarter and more reasonable load balancing.

  • Layer 7 Header-based traffic routing: Routes based on attributes in the Layer 7 Header, such as the service name in the Dubbo request or the Key in the Redis request.

  • Inject delays or errors into client request responses to test resilience for microservices.

  • Provide application-level security, such as authentication based on the JWT Token in the HTTP Header or authentication of Redis servers.

  • Request level telemetry data, including request success rate, request time, call trace, and so on.

To achieve these traffic management and Service governance capabilities, the Service Mesh needs to analyze and process layer 7 protocol headers in TCP packets. That is, the Service Mesh must be able to manage layer 7 protocols, not just TCP.
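To make the first capability in the list above concrete, here is a minimal sketch contrasting connection-level balancing (what a plain TCP proxy can do) with request-level balancing (what becomes possible once the proxy decodes the Layer 7 protocol). The backend names and the round-robin policy are assumptions made for illustration; this is not how Envoy implements load balancing.

```python
from itertools import cycle

BACKENDS = ["server-a", "server-b", "server-c"]  # hypothetical backend names


def balance_per_connection(connections):
    """Layer 4 view: pick one backend per TCP connection, so every
    request on that connection lands on the same backend."""
    picker = cycle(BACKENDS)
    assignment = {}
    for conn_id, requests in connections.items():
        backend = next(picker)
        assignment[conn_id] = [backend for _ in requests]
    return assignment


def balance_per_request(connections):
    """Layer 7 view: the proxy sees individual requests, so each one can
    go to a different backend even when they share a connection."""
    picker = cycle(BACKENDS)
    return {
        conn_id: [next(picker) for _ in requests]
        for conn_id, requests in connections.items()
    }


# One long-lived connection carrying three independent requests.
connections = {"conn-1": ["req-1", "req-2", "req-3"]}
print(balance_per_connection(connections))
print(balance_per_request(connections))
```

With a single long-lived connection, the connection-level version sends all three requests to one backend, while the request-level version spreads them across all three.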

In Istio, however, protocols other than HTTP and gRPC can only be handled at OSI layers 3 to 6. This means that their traffic can only be routed based on layer 3 IP addresses, layer 4 TCP ports, or layer 6 SNI (Server Name Indication); only TCP-level metrics can be collected, such as the number of TCP packets sent and received or the number of TCP connections opened and closed; and only mTLS can be used for authentication and access control, at the connection level. In other words, for these protocols we still have to handle traffic control, observability, security, and authentication in the application code, even though these are common concerns that the Service Mesh infrastructure should take care of. This defeats the purpose of moving microservices to the Service Mesh, which is to sink the common concerns of microservice communication and governance from the application code down to the Service Mesh infrastructure layer.

How do I extend Istio’s protocol management capabilities?

If we want to manage these Layer 7 protocols in Istio, how do we do that? Suppose we have a BookInfo microservice application that uses a protocol called AwesomeRPC instead of HTTP for remote calls between services.

Let’s take a look at how we can implement traffic management for the AwesomeRPC protocol in Istio, for example routing requests from ProductPage to different versions of Reviews based on the user name field in the request header, in order to implement a grayscale (canary) release.

The most obvious way to do this is to modify the Istio code directly. First, we need to support the AwesomeRPC protocol in Istio’s VirtualService CRD. The enhanced VirtualService CRD is shown in the left-most rule configuration in the following figure. AwesomeRPC and HTTP routes are semantically similar in that both route based on the values of certain attributes in the header. Therefore, we only need to change the protocol type from HTTP to AwesomeRPC and can directly reuse the HTTPRoute structure in VirtualService to represent AwesomeRPC routing rules. Then, in the Pilot code, we need to generate the actual configuration required by Envoy from the AwesomeRPC service definition and the routing rules defined in the VirtualService, and send it to the data plane Envoy via xDS. This assumes that an AwesomeRPC Filter plugin has already been written using Envoy’s Filter extension mechanism, implementing the AwesomeRPC codec, header parsing, dynamic routing, and the other functions required on the data plane.

When the Envoy Filter is already implemented, adding a new Layer 7 protocol to the control plane in this way is relatively simple. However, because we change Istio’s source code, we have to maintain a private branch of Istio ourselves, which brings additional maintenance costs and makes it difficult to keep up with Istio’s rapid pace of iteration.

If you don’t want to maintain your own Istio code branch, a viable alternative is to use Istio’s EnvoyFilter CRD. EnvoyFilter is a flexible and powerful configuration mechanism provided by Istio. We can use EnvoyFilter to patch the default Envoy configuration generated by Pilot, adding, modifying, or removing parts of it, and thereby change the default behavior of Envoy in the Istio Service Mesh as desired.

As shown in the figure below, because Pilot does not understand the AwesomeRPC protocol, the AwesomeRPC service is just a TCP service to Pilot. In the default configuration generated by Pilot, a TCP Proxy is used in the FilterChain of the Outbound Listener corresponding to the AwesomeRPC service to process its traffic. We select this TCP Proxy in the Match section of the EnvoyFilter and, in the Operation section, replace it with an AwesomeRPC Filter configured with Traffic Splitting rules. Pilot modifies the default Envoy configuration it generates according to the EnvoyFilter and then delivers it to the Envoy on the data plane. In this way, we implement support for the AwesomeRPC protocol in Istio via EnvoyFilter.

Let’s look at a real-world example using the Thrift protocol. Thrift is a lightweight, multi-language open source RPC framework under the Apache Foundation. Envoy already supports Thrift, but Istio provides only limited support for it and cannot implement advanced traffic management functions such as Traffic Splitting. If we want Istio to provide the Traffic Splitting shown in the bottom right corner of the figure for the Thrift service, we can do so through EnvoyFilter.

(The source code for this example is available at github.com/aeraki-fram…)

First, we need to create an EnvoyFilter, as shown on the left side of the figure, to handle outbound traffic from the client. The Match condition of this EnvoyFilter selects the tcp_proxy in the Outbound Listener $(thrift-sample-server-VIP)_9090, and the Patch section replaces it with a thrift_proxy. In this thrift_proxy, we configure routes according to the Traffic Splitting requirements: 30% of the traffic is routed to Server V1 and 70% to Server V2. We also need to create an EnvoyFilter for the Thrift server side, as shown in the upper right, to handle incoming traffic on the server. The server-side EnvoyFilter is simpler than the client-side one because no routing rules are needed on the server; we only need to replace tcp_proxy with thrift_proxy. Although this thrift_proxy has no routing rules, it still provides many Layer 7 service communication and governance capabilities, including request-level load balancing and request-level metrics.
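As a rough illustration of the client-side EnvoyFilter described above, the following sketch builds the resource as a Python dictionary and prints it as JSON. The resource name, the VIP `10.0.0.1`, the host name, and the `stat_prefix` are placeholders, and the field names follow the Istio EnvoyFilter API and the Envoy v3 `ThriftProxy` filter as I understand them; verify them against the versions you run rather than treating this as the authors’ exact configuration.

```python
import json


def thrift_envoyfilter(service_vip, port, host, weights):
    """Build an EnvoyFilter that replaces tcp_proxy with thrift_proxy on the
    outbound listener of one service. `weights` maps subset name -> percent."""
    clusters = [
        # Istio names outbound clusters "outbound|<port>|<subset>|<host>".
        {"name": f"outbound|{port}|{subset}|{host}", "weight": weight}
        for subset, weight in weights.items()
    ]
    thrift_proxy = {
        "name": "envoy.filters.network.thrift_proxy",
        "typed_config": {
            "@type": "type.googleapis.com/envoy.extensions.filters."
                     "network.thrift_proxy.v3.ThriftProxy",
            "stat_prefix": f"outbound|{port}||{host}",  # placeholder prefix
            "route_config": {
                "name": "thrift-routes",  # hypothetical route name
                "routes": [{
                    # An empty method name matches every Thrift method.
                    "match": {"method_name": ""},
                    "route": {"weighted_clusters": {"clusters": clusters}},
                }],
            },
        },
    }
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "EnvoyFilter",
        "metadata": {"name": "thrift-sample-server-outbound"},  # hypothetical
        "spec": {"configPatches": [{
            "applyTo": "NETWORK_FILTER",
            "match": {
                "context": "SIDECAR_OUTBOUND",
                "listener": {
                    # The listener name embeds the service's cluster IP.
                    "name": f"{service_vip}_{port}",
                    "filterChain": {"filter": {
                        "name": "envoy.filters.network.tcp_proxy"}},
                },
            },
            "patch": {"operation": "REPLACE", "value": thrift_proxy},
        }]},
    }


envoy_filter = thrift_envoyfilter(
    "10.0.0.1", 9090, "thrift-sample-server.thrift.svc.cluster.local",
    {"v1": 30, "v2": 70})
print(json.dumps(envoy_filter, indent=2))
```

Note how much of the resource depends on details that are specific to one Istio version or one cluster: the listener name, the filter names, and the cluster naming scheme.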

As you can see from the introduction and examples above, the EnvoyFilter CRD is like a Swiss Army knife in Istio: it allows very flexible customization of the Envoy configuration generated by Pilot, which makes it possible to manage Layer 7 protocols. But EnvoyFilter also brings some thorny problems:

  • EnvoyFilter exposes Envoy’s low-level implementation details directly to operations personnel: they must know the configuration details of the Envoy Filter, which are closely tied to the Filter’s internal implementation, such as the Filter name and the configuration format inside the Filter. As a result, creating an EnvoyFilter is tightly coupled to code details and is difficult to hand over directly to operations. A more reasonable approach is to hide these implementation details behind a user-oriented, high-level configuration language, such as VirtualService and DestinationRule in Istio.

  • The matching criteria in EnvoyFilter depend on the structure and element names of the Envoy configuration generated by Pilot, such as the name of the Listener and the composition of the FilterChain. These structures and names can change between Istio versions, so an EnvoyFilter that used to work well may break after an upgrade.

  • The matching criteria in EnvoyFilter also depend on things that are specific to a particular K8s cluster, such as the Service Cluster IP, which means that one EnvoyFilter cannot be used for the same Service in several different clusters. When the Service is rebuilt, its Cluster IP changes, so the Cluster IP in the Match condition of the corresponding EnvoyFilter must be changed as well.

  • We need to create an EnvoyFilter for each Service. When the Mesh manages many services, manually creating hundreds or thousands of EnvoyFilters is tedious and error-prone.

  • As far as Istio is concerned, the Patch part of EnvoyFilter is basically a black box, so Istio can only carry out very limited verification of EnvoyFilter’s correctness. This makes debugging EnvoyFilter very difficult, and when EnvoyFilter fails to work as expected, it’s hard to know exactly what’s wrong with it.

Because of the above problems, although EnvoyFilter can be used to manage Layer 7 protocols in Istio, it is very difficult to manage and maintain these EnvoyFilters in a production system, especially in a medium to large Service Mesh.

Aeraki: Manage any Layer 7 protocol in Istio

Since EnvoyFilter is difficult to manage and maintain manually, we created the Aeraki (pronounced: [air-rah-ki]) project to automate this process. Aeraki is the Greek word for “breeze,” and we hope that Aeraki will help Istio sail further on its cloud native journey.

The basic working principle of Aeraki is shown in the figure below: Aeraki pulls service data from Istio, generates Envoy configuration according to ServiceEntry and Aeraki traffic rules, and pushes the generated configuration to Istio in the form of EnvoyFilters. In short, you can think of Aeraki as an Operator that manages Layer 7 protocols in Istio.
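The pull, generate, and push loop described above can be pictured with a small sketch. This is only a conceptual model, not Aeraki’s actual code: the `ServicePort` type, the `plan_patches` function, and the sample addresses are all hypothetical, and only the Envoy filter names are real.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServicePort:
    """Hypothetical view of one service port pulled from the control plane."""
    host: str
    cluster_ip: str
    port: int
    protocol: Optional[str]  # Layer 7 protocol already detected, or None


# Envoy network filters that can take the place of tcp_proxy.
PROXY_FILTERS = {
    "thrift": "envoy.filters.network.thrift_proxy",
    "dubbo": "envoy.filters.network.dubbo_proxy",
    "redis": "envoy.filters.network.redis_proxy",
}


def plan_patches(service_ports):
    """Describe, for each Layer 7 port, which listener to patch and which
    filter should replace the default tcp_proxy."""
    patches = []
    for sp in service_ports:
        new_filter = PROXY_FILTERS.get(sp.protocol or "")
        if new_filter is None:
            continue  # plain TCP service: leave the default configuration
        patches.append({
            "listener": f"{sp.cluster_ip}_{sp.port}",
            "replace": "envoy.filters.network.tcp_proxy",
            "with": new_filter,
        })
    return patches


host = "thrift-sample-server.thrift.svc.cluster.local"
before = plan_patches([ServicePort(host, "10.0.0.5", 9090, "thrift")])
# The service is rebuilt and gets a new cluster IP: rerunning the same
# function produces a patch that targets the new listener name.
after = plan_patches([ServicePort(host, "10.0.0.9", 9090, "thrift")])
print(before[0]["listener"], "->", after[0]["listener"])
```

Because the patches are regenerated from current service data on every run, cluster-specific values such as the cluster IP never need to be edited by hand.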

Compared with directly modifying the Istio code or manually creating EnvoyFilters to extend Istio’s traffic management capabilities, Aeraki offers the following benefits:

  • There is no need to modify the Istio code, which saves the extra work of maintaining a private branch of Istio and makes it possible to follow Istio’s version iterations and upgrade quickly.

  • Aeraki is deployed as an independent component in the control plane of the Mesh and can easily be integrated with Istio as a plug-in to extend Istio’s traffic management capabilities.

  • The default configurations for protocols are generated automatically by Aeraki and are adjusted automatically based on the Istio version and K8s cluster information, which saves a great deal of manual EnvoyFilter creation and maintenance work.

  • Aeraki provides an abstraction on top of the Envoy configuration, offering users a layer of configuration CRDs for managing these Layer 7 protocols. These high-level CRDs hide Envoy configuration details and mask the differences between the default Envoy configurations generated by different Istio versions, which makes them operations-friendly. For RPC protocols such as Thrift and Dubbo, Aeraki directly adopts Istio’s VirtualService and DestinationRule because their semantics are similar to HTTP. For non-RPC protocols, Aeraki defines new CRDs, such as RedisService and RedisDestination. We will explain later how to use these configuration CRDs to customize rules, such as implementing Traffic Splitting.

Like Istio, Aeraki uses port names to identify protocol types. The port name must follow the naming rule tcp-<Layer 7 protocol name>-xxx. For example, the port of a Thrift service could be named tcp-thrift-service. Note that we must keep the “tcp-” prefix in the port name, because to Istio this is a TCP service. Aeraki generates the Envoy configuration according to the Layer 7 protocol in the port name and replaces the tcp_proxy that Istio generates by default.
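The naming rule above is easy to express in code. The following is a minimal sketch of how a protocol could be read from a port name; the `layer7_protocol` function and the `SUPPORTED_PROTOCOLS` set are illustrative assumptions and are not taken from Aeraki’s source.

```python
SUPPORTED_PROTOCOLS = {"thrift", "dubbo", "redis"}  # illustrative subset


def layer7_protocol(port_name):
    """Return the Layer 7 protocol encoded in a port name of the form
    tcp-<protocol>-<suffix>, or None if the name does not follow the rule
    or names a protocol outside the supported set."""
    parts = port_name.lower().split("-")
    if len(parts) < 2 or parts[0] != "tcp":
        return None
    protocol = parts[1]
    return protocol if protocol in SUPPORTED_PROTOCOLS else None


print(layer7_protocol("tcp-thrift-service"))  # thrift
print(layer7_protocol("http-web"))            # None
```

In this sketch a port whose name lacks the “tcp-” prefix is simply ignored, which mirrors the point that the prefix must be kept for the port to be treated as a TCP service first.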

Let’s see how Aeraki can be used to implement the Traffic Splitting use case for the Thrift service above. First, we declare the Layer 7 protocol type of the Thrift service in the port name of the Service definition, for example tcp-thrift-hello-server, and then create a VirtualService that routes Thrift requests to different service versions in the specified proportions. Aeraki generates the required Envoy configuration based on the service definition and the VirtualService and sends it to Istio via EnvoyFilter.
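For comparison with the hand-written EnvoyFilter, here is a sketch of the kind of VirtualService the paragraph above describes, again built as a Python dictionary. It assumes that Thrift routes are written in the standard `http` section of an Istio VirtualService, which matches the statement that Aeraki reuses VirtualService for RPC protocols but should be checked against the Aeraki version in use. The resource name and host are placeholders, and the `v1` and `v2` subsets are assumed to be defined in a DestinationRule.

```python
import json


def thrift_traffic_split(host, weights):
    """Build a VirtualService that splits traffic between subsets of one
    host. `weights` maps subset name -> percentage and must sum to 100."""
    if sum(weights.values()) != 100:
        raise ValueError("route weights must add up to 100")
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "VirtualService",
        "metadata": {"name": "thrift-traffic-split"},  # hypothetical name
        "spec": {
            "hosts": [host],
            # Assumption: Thrift rules reuse the HTTP route structure.
            "http": [{
                "route": [
                    {"destination": {"host": host, "subset": subset},
                     "weight": weight}
                    for subset, weight in weights.items()
                ],
            }],
        },
    }


virtual_service = thrift_traffic_split(
    "thrift-sample-server.thrift.svc.cluster.local", {"v1": 30, "v2": 70})
print(json.dumps(virtual_service, indent=2))
```

Unlike the EnvoyFilter, nothing in this resource refers to listener names, filter names, or cluster IPs.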

As you can see, managing Thrift with Aeraki is much simpler than manually creating EnvoyFilters. If you don’t need special traffic rules, it’s even easier: just declare the Thrift protocol in the port name according to the naming convention, and Aeraki will generate the required Envoy configuration without any extra work.

Want to try out Aeraki’s Thrift, Dubbo, and Redis service management capabilities for yourself? To install an Istio cluster with the Aeraki plug-in and the corresponding demo program, run the following two commands in a terminal connected to a K8s cluster.

`git clone https://github.com/aeraki-framework/aeraki.git`
`aeraki/demo/install-demo.sh`

You can also visit the Aeraki online demo to view dashboards of the monitoring metrics collected from the Thrift, Dubbo, and Redis services: aeraki.zhaohuabing.com:3000/d/pgz7wp-Gz…

Enhance the Service Mesh with Aeraki

Let’s look at some examples of using Aeraki’s Layer 7 protocol management capabilities to enhance a Service Mesh.

Mask differences in development/production environments

We often need to access different back-end resources in development, test, and production environments, such as connecting to different Redis caches or MySQL databases. Usually, we have to change the back-end resource addresses in the configuration file shipped with the application in order to switch back-end resources between environments. With Aeraki’s help, we can use the Service Mesh to mask the configuration differences between back-end resources so that applications can access them in the same way in every environment.

As shown in the following figure, we need to access Redis services in the Dev, Staging, and Prod environments. These three Redis services have different IP addresses, access passwords, and deployment modes. In a development environment, we might use a single Redis instance to save resources and simplify deployment. In test and production environments, we use Redis clusters to ensure high availability and scalability, or we may directly use a managed Redis service from a cloud provider. When switching between these three environments, we need to configure different IP addresses and access passwords. If Redis is deployed in different modes, we may even need to modify the client code to switch between Redis single-instance mode and Redis cluster mode, which greatly reduces the efficiency of development, testing, and rollout.

With the RedisService and RedisDestination CRDs provided by Aeraki, we can mask the differences between these Redis service providers and allow clients to access the back-end Redis service in a uniform way.

Before adopting Aeraki, we needed to configure different IP addresses and Redis access passwords in each environment. With Aeraki, the client can use the same code and configuration everywhere, and the Redis configuration is switched between environments by modifying the Aeraki CRDs, which greatly reduces the cost of switching. Even if Redis changes from a single instance to a Redis cluster, clients can access it in the same way.

Use traffic mirroring for comparison testing

Some databases or database proxies use the same network protocol. For example, TiDB, OceanBase, Aurora, and Kingshard are all compatible with the MySQL protocol, while Twemproxy, Codis, Tendis, and Pika all use the Redis protocol. Business requirements sometimes force us to migrate from one implementation to another. Before migrating, we need to run comparison tests on the performance, functionality, and compatibility of the different implementations.

For example, in the following scenario, we initially used a single Redis instance as a cache. As the online business kept growing, this Redis instance became an access bottleneck, and we want to switch to Twemproxy to scale Redis horizontally. By using Aeraki to mirror online Redis traffic to a Twemproxy test environment, we can fully test Twemproxy with real business data and evaluate its impact on the online business.

Use full-traffic fault injection to test system resilience

Istio can implement fault injection for HTTP and gRPC, but this is not enough. In a distributed system, application services, databases, caches, messaging systems, and so on may all become unavailable because of network or other problems. With Aeraki, we can simulate all of these possible points of failure to test the resilience of the system and ensure that, after a partial failure, the system can either heal itself or degrade gracefully and remain usable instead of crashing as a whole.

Summary

A Service Mesh carries a large amount of Layer 7 traffic, including RPC, Database, Cache, Messaging, and other Layer 7 protocols. However, Istio provides Layer 7 management capabilities only for HTTP and gRPC and has limited support for other Layer 7 protocols. The Aeraki open source project gives Istio the ability to support any Layer 7 protocol in a non-intrusive manner and provides user-oriented high-level configuration CRDs that make it easy to manage the traffic of these protocols and achieve advanced traffic management capabilities such as grayscale (canary) releases. Aeraki already supports Thrift, Dubbo, Redis, Kafka, and Zookeeper, and more protocols will follow soon. Aeraki is positioned as a non-intrusive Istio enhancement tool set. In addition to protocol extension, Aeraki also focuses on solving other common problems encountered when using Istio, including efficiency optimization, configuration simplification, integration with third-party service discovery, and function extension. If you’d like to learn more about Aeraki, visit the project on GitHub at github.com/aeraki-fram… .

Reference links:

  • IstioCon talk “How to Manage Any Layer-7 Traffic in an Istio Service Mesh?” video playback: www.bilibili.com/video/BV1XN…

  • IstioCon talk “How to Manage Any Layer-7 Traffic in an Istio Service Mesh?” slides download: zhaohuabing.com/slides/how-…

  • Aeraki GitHub homepage: github.com/aeraki-fram…

  • Aeraki online demo: aeraki.zhaohuabing.com:3000/d/pgz7wp-Gz…