About the author

Yang Dihang, Istio community member and architect at NetEase Shufan, is responsible for configuration management in the Qingzhou service mesh, leads the design and development of the Slime components, and has participated in building the service meshes of NetEase Yanxuan and NetEase Media. He has three years of experience in feature extension and performance optimization of the Istio control plane.


Slime is an open-source service mesh component from the NetEase Shufan microservices team. Acting as a CRD manager for Istio, it aims to deliver the higher-order features of Istio/Envoy through simpler configuration. Slime currently contains three very useful sub-modules:

  1. Configuration lazy loading: no need to configure SidecarScope manually; configuration and service discovery information are loaded on demand.
  2. HTTP plugin management: new CRDs, PluginManager and EnvoyPlugin, wrap the poorly readable, hard-to-maintain EnvoyFilter, making plugin extension far more convenient.
  3. Adaptive rate limiting: rate-limiting policies are adjusted automatically based on monitoring data.

In the future, our team will open up more practical features in Slime; we hope Slime can help users better steer Istio, this little sailboat.

1. Background

As a new generation of microservice architecture, the service mesh uses the sidecar pattern to decouple business logic from microservice governance logic, reducing the development and operations costs of microservice frameworks. Clear separation of responsibilities, ease of maintenance, observability, and multi-language support have gradually made it a focal point of the microservices conversation. Istio + Envoy, its most widely used implementation, remains front and center, with Istio, backed by Google, looming on the horizon as a de facto industry standard.

He who would wear the crown must bear its weight. Standing in the eye of the storm, Istio has won praise but also attracted plenty of criticism. To be fair, Istio offers an effective set of high-level abstractions: by configuring CRs such as VirtualService and DestinationRule, features like version-based traffic splitting, canary releases, and load balancing can be realized. However, when facing higher-order microservice governance features such as local rate limiting, black/white lists, and service degradation, this abstraction falls short. Istio's initial answer was to lift functions that originally belonged in the data plane into Mixer adapters. Although this solved the extensibility problem, the performance of Mixer's centralized architecture was widely questioned. In the end, Istio cut off its own arm: newer versions abandoned Mixer, leaving the extension of higher-order features a blank space in the current version. On the other hand, Istio pushes configuration in full, which means that in large-scale mesh scenarios a huge amount of configuration must be pushed. To reduce the push volume, users have to work out the dependency relationships between services in advance and configure SidecarScope for configuration isolation, which undoubtedly adds to the mental burden on operators. Ease of use and performance have become mutually exclusive.

In response to some of Istio's current shortcomings, our team started the Slime project. Implemented as a Kubernetes operator, it acts as a CRD manager for Istio and interoperates with Istio seamlessly, without any customization. Slime has a modular internal architecture and currently contains three very useful sub-modules:

  1. Configuration lazy loading: no need to configure SidecarScope manually; configuration and service discovery information are loaded on demand, which solves the problem of full pushes.
  2. HTTP plugin management: new CRDs, PluginManager and EnvoyPlugin, wrap the poorly readable, hard-to-maintain EnvoyFilter, making plugin extension far more convenient.
  3. Adaptive rate limiting: rate-limiting policies are adjusted automatically based on monitoring data, which makes up for the shortcomings of Istio's rate limiting.

2. Configuration lazy loading

As the scale of business on the service mesh gradually grew, the first problem we hit was the performance cost of full configuration pushes, which seriously affected both the data plane and the control plane:

  1. a) Envoy takes longer to start; b) memory overhead increases; c) the Envoy main thread is occupied processing pushes, blocking event pushes from Pilot.
  2. To let Istio support clusters of a certain scale, we had to require service owners to declare, at release time, the services their service depends on, and to set SidecarScope so that the configuration and service discovery information of unrelated services is masked (a sketch follows this list). This met resistance in practice: on the one hand, service dependency information is not easy to obtain; on the other hand, a misconfiguration by the business side leads to invocation failures. That made onboarding to the mesh prohibitive for business teams.
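For reference, the manual isolation described in item 2 is expressed through Istio's Sidecar resource (which SidecarScope corresponds to internally). A minimal sketch, assuming a service svc-a in namespace default whose only dependency is svc-b:

apiVersion: networking.istio.io/v1alpha3
kind: Sidecar
metadata:
  name: svc-a
  namespace: default
spec:
  workloadSelector:
    labels:
      app: svc-a
  egress:
  - hosts:
    - "istio-system/*"                           # keep mesh infrastructure visible
    - "default/svc-b.default.svc.cluster.local"  # the one declared dependency

Every service team would have to write and maintain such a resource by hand, and any dependency missing from the egress hosts list breaks calls to it.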

Is there a way for a service to load configuration on demand? The most obvious idea is to derive it from service invocation relationships. But without the callee's service discovery information, a call cannot succeed, which services with low fault tolerance cannot accept; and invocation relationships collected from failed calls are unreliable anyway. In other words, if there were a way for a call to succeed even when the caller lacks the callee's configuration and service discovery information, then configuration lazy loading (loading on demand) could be implemented by automatically generating SidecarScope.

Slime's answer is a fallback route. The backend of this fallback route is a globally shared sidecar, called global-sidecar, which holds the full configuration and service discovery information. A call whose service discovery information is missing is hijacked by the fallback route to the global-sidecar, which acts as a secondary proxy and forwards it to the corresponding backend service. After proxying the call, the global-sidecar reports the invocation information to Slime, and Slime updates the Scope accordingly. After the first invocation, the service perceives the callee's information and the global-sidecar no longer needs to forward, as shown in the figure below.

When the called service's name is redirected to another service by a routing rule in a VirtualService, Slime can only add the called service to the Scope; the service discovery information of the redirect target is still missing, so calling the service again returns a 503. To solve this, we introduced our own CRD, ServiceFence, which lets us build mappings between service names and backend services: Slime adds both the called service and the redirect target to the Scope, avoiding the problem. ServiceFence also manages the lifecycle of the generated SidecarScope, automatically cleaning up invocation relationships that have not been used for a long time. All of these CRs are generated and maintained automatically; users need not care about ServiceFence or SidecarScope resources. To enable configuration lazy loading, they only need to label the Service (svc) that requires it with istio.dependency.servicefence/status: "true".
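For example, assuming a service named svc-a in the default namespace, enabling lazy loading is just one label on the Service (a minimal sketch):

apiVersion: v1
kind: Service
metadata:
  name: svc-a
  namespace: default
  labels:
    istio.dependency.servicefence/status: "true"  # Slime generates and maintains ServiceFence/SidecarScope from here
spec:
  selector:
    app: svc-a
  ports:
  - name: http
    port: 80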

3. HTTP plugin management

In gateway scenarios, traffic management is more complex and custom plugins are needed to process traffic. Before Slime's plugin module was developed, plugin extension could only be achieved through EnvoyFilter. EnvoyFilter is xDS-level configuration; managing and maintaining such configuration takes a great deal of effort, and the error rate is extremely high.

To simplify plugin management, we decided to build a layer of abstraction over EnvoyFilter. HTTP plugin configuration in xDS has two parts. One lives in LDS, as a SubFilter of HttpConnectionManager, and determines which plugins are loaded and in what order they execute. The other lives in RDS, at two granularities, virtualHost-level perFilterConfig and route-level perFilterConfig, and determines the plugin behavior for the current host or route.
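A rough, abbreviated sketch of those two sections (plugin and host names are borrowed from the gateway example later in this section; the "..." elisions stand for the full plugin settings):

# LDS: the HttpConnectionManager filter chain decides which plugins load, and in what order
http_filters:
- name: com.netease.iprestriction
- name: com.netease.resty
- name: envoy.filters.http.router      # the router filter always comes last
# RDS: perFilterConfig at virtualHost or route granularity decides plugin behavior
virtual_hosts:
- name: gwtest.com
  routes:
  - name: abc
    per_filter_config:
      com.netease.iprestriction: { ... }  # settings for this route only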

The LDS part is abstracted as PluginManager, which lets us enable and disable plugins with an enable switch. PluginManager also manages plugin execution priority: the plugin order is consistent with the order of the LDS plugin chain, and the earlier a plugin appears, the higher its execution priority, as shown in the following figure.
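A hypothetical PluginManager might look like the sketch below; the field names are our assumptions for illustration, mirroring the EnvoyPlugin example further down, and may differ from the actual CRD:

apiVersion: microservice.netease.com/v1alpha1
kind: PluginManager
metadata:
  name: gateway-proxy
  namespace: gateway-system
spec:
  workload_labels:
    app: gateway-proxy
  plugins:
  - enable: true
    name: com.netease.iprestriction  # listed first, so highest execution priority
  - enable: true
    name: com.netease.resty
  - enable: false                    # disabled: dropped from the LDS filter chain
    name: com.netease.ratelimit     # hypothetical plugin name for illustration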

The RDS part is abstracted as EnvoyPlugin, whose host/route fields set the effective scope of a plugin configuration. EnvoyPlugin better fits the gateway's configuration model: on a gateway console, a backend service is often mapped to several API endpoints under a certain host. Suppose we need to configure a self-developed IP-restriction plugin and a trace-sampling plugin for service A, whose endpoints on the gateway are /abc and /xyz. The plugin configuration for service A maps to:

apiVersion: microservice.netease.com/v1alpha1
kind: EnvoyPlugin
metadata:
  name: gateway-proxy-svc-a
  namespace: gateway-system
spec:
  gateway:
  - gateway-system/gateway-proxy
  host:
  - gwtest.com
  route:
  - name: abc
  - name: xyz
  plugins:
  - name: com.netease.iprestriction
    inline:
      settings:
        list:
        - 1.11.1.
        type: BLACK
  - name: com.netease.resty
    inline:
      settings:
        plugins:
        - config:
            sample_rate: 0.001
            whitelist:
            - aaa
          name: neTraceSample

EnvoyPlugin does not care about each plugin's specific configuration (that is passed through in a type.Struct); it cares about the plugin's effective scope. Users can aggregate plugin configuration along whatever dimension they need, which better matches how plugin users think and leaves less redundancy in the higher-level configuration. The figure below shows the mapping of EnvoyPlugin to the xDS level. The xDS configuration still expands, but at least what we manage is an ordered, aggregated array rather than a huge plugin tree:

4. Adaptive rate limiting

With Mixer removed, implementing rate limiting in the service mesh became very complicated. Global rate limiting requires deploying an additional RLS (Rate Limit Server), and even local rate limiting requires Envoy's built-in local-ratelimit plugin, configured through EnvoyFilter, so users once again face complex EnvoyFilter configuration (see the sketch below). Moreover, compared with the mature rate-limiting components of second-generation microservice frameworks, the local rate-limiting plugin is fairly rudimentary: it cannot adapt to load, and it can only be configured at the granularity of a single instance.
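To make "complex" concrete, here is roughly what a 30 QPS local limit looks like when written by hand, following the EnvoyFilter pattern from the Istio documentation (an illustrative sketch; the service label and values are assumptions):

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: a-local-ratelimit
  namespace: default
spec:
  workloadSelector:
    labels:
      app: a
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.local_ratelimit
        typed_config:
          "@type": type.googleapis.com/udpa.type.v1.TypedStruct
          type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
          value:
            stat_prefix: http_local_rate_limiter
            token_bucket:                 # 30 tokens refilled every second ~= 30 QPS
              max_tokens: 30
              tokens_per_fill: 30
              fill_interval: 1s
            filter_enabled:               # enable for 100% of requests...
              runtime_key: local_rate_limit_enabled
              default_value:
                numerator: 100
                denominator: HUNDRED
            filter_enforced:              # ...and actually enforce, not shadow, the limit
              runtime_key: local_rate_limit_enforced
              default_value:
                numerator: 100
                denominator: HUNDRED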

To address these shortcomings of service rate limiting in Istio, we developed the adaptive rate-limiting module. For ease of use, we also designed a new API for it, SmartLimiter. The architecture has two main parts: the logic that converts SmartLimiter into EnvoyFilter, and the collection of monitoring data. Slime can collect CPU, memory, and replica-count data from the Kubernetes metrics server, and we also provide a Metric Discovery Server (MDS) interface through which user-defined monitoring metrics can be synchronized to the rate-limiting component.

SmartLimiter's configuration reads close to natural language. For example, to trigger rate limiting on service A when CPU usage exceeds 80%, with a limit of 30 QPS, the SmartLimiter is defined as follows:

apiVersion: microservice.netease.com/v1alpha1
kind: SmartLimiter
metadata:
  name: a
  namespace: default
spec:
  descriptors:
  - action:
      fill_interval:
        seconds: 1
      quota: "30/{pod}"         # the service-wide quota of 30 is divided evenly across pods
    condition: "{CPU} > 0.8"    # the {CPU} template is filled in automatically from the monitoring metric

The resulting rate-limiting behavior is shown in the figure below:

5. How to get and use Slime

Slime's source code is now open; you can follow the latest developments of Slime here, and our team will open up more practical features in Slime going forward. You can also read the Slime getting-started guide, where we walk through a simple Bookinfo-based example that we hope will help you.

Finally, Slime is still at an early stage, and we would love to see more Meshers join us or suggest ways to help us improve it.

We hope Slime helps you sail Istio better!