Introduction to a service mesh solution

Service architecture has evolved from the monolith to microservices, and in the cloud-native era microservices are evolving into the service mesh. Relying on technologies such as containers, service governance becomes an independent infrastructure layer, separated from the services it governs.

Introduction to Service Governance

The software field has long pursued high availability, high performance, and easy extensibility. As software grows more complex and requires collaborative development by many people, the availability, performance, and scalability of services face growing challenges. Iterating on a huge, complex monolithic service, where everything must be done at once, is a developer's nightmare. To change this situation, the natural step is to split a single large, complex service, that is, servitization. An application then consists of multiple microservices that invoke each other through interfaces to provide the complete service. In 2014, software architecture guru Martin Fowler published a survey article on the microservices architecture; see Reference 1 for a Chinese translation. Under the microservices architecture, software is broken down into bounded microservices that can be developed and deployed independently and communicate with each other.

After a service is split into microservices, service boundaries are clear, functions and logic are well defined, and maintenance becomes easier. With the popularity of the Spring Cloud suite, microservices have become almost a default choice for software development, and frameworks in various other languages (Golang, PHP, etc.) imitate Spring Cloud to provide microservice support. Splitting a service into microservices brings the following obvious benefits.

  • Clear boundaries between services
  • Each service is easier to maintain
  • Services can be deployed independently
  • Services can be scaled independently
  • Better fault isolation
  • Multiple technology stacks can coexist easily

However, as is often said in software, "there is no silver bullet," and microservices are no exception. A large number of small services brings new problems, such as:

  • A distributed architecture complicates design
  • The system is split into many services, making deployment complex
  • System throughput increases, but so does response time
  • Troubleshooting is complex and spans many services
  • Diverse technologies increase overall maintenance complexity

In response to these problems, the concept of service governance emerged. Service governance aims to provide the infrastructure or mechanisms to solve the management problems of large numbers of microservices. Microservice governance frameworks or suites generally provide the following capabilities.

  • Service registration and discovery
  • Distributed tracing
  • Service resilience and fault tolerance
  • API gateway
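As a concrete illustration of the first capability, service registration and discovery can be sketched in a few lines of Python. This is only a minimal, in-memory sketch whose class and method names are invented for illustration; real registries (ZooKeeper, CoreDNS, a gateway's upstream table) also handle health checks, leases, and change notification.

```python
import itertools

class ServiceRegistry:
    """Minimal in-memory service registry (illustrative only)."""

    def __init__(self):
        self._instances = {}   # service name -> list of "host:port" addresses
        self._cursors = {}     # service name -> round-robin iterator

    def register(self, name, address):
        # A real registry would also track instance health and expiry.
        self._instances.setdefault(name, []).append(address)
        self._cursors[name] = itertools.cycle(self._instances[name])

    def discover(self, name):
        # Return all known instances of a service.
        return list(self._instances.get(name, []))

    def pick(self, name):
        # Round-robin load balancing over the registered instances.
        if name not in self._cursors:
            raise LookupError(f"no instances registered for {name!r}")
        return next(self._cursors[name])

registry = ServiceRegistry()
registry.register("user-service", "10.0.0.1:8080")
registry.register("user-service", "10.0.0.2:8080")
print(registry.discover("user-service"))
print(registry.pick("user-service"), registry.pick("user-service"))
```

The `pick` method is where a client-side load-balancing policy would plug in; round-robin is the simplest possible choice.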

Similarly, as services are broken into more microservices, automation tools are needed to maintain efficiency, typically including automated testing and deployment. Fully building a custom infrastructure that supports microservices is challenging, especially when multiple development languages must be supported. In the era of cloud computing, with the popularity of container technology, the cloud-native technology stack brings a new approach to microservice governance, after Spring Cloud and its imitators defined the de facto conventions for microservice infrastructure.

Introduction to the service mesh

When it comes to service meshes, Linkerd has to be mentioned: current service mesh technology grew out of Linkerd. In a service mesh, service governance capability is not delivered to applications as an SDK; instead, a separate infrastructure layer provides governance for services. Reference 2 describes in detail the evolution from microservices to the service mesh and is highly recommended for anyone interested in that process. This article only briefly describes how service governance is implemented in this pattern.

Referring to the diagram in Reference 2, as shown in the figure above, the service mesh attaches a sidecar proxy to each service; the sidecar provides governance capabilities such as service discovery, circuit breaking, rate limiting, load balancing, timeout and retry, trace collection, and more. The sidecar proxies are connected into a mesh structure, as shown in the figure below. A centralized control plane controls the behavior of each sidecar proxy, achieving precise control over each proxied service. In the service mesh pattern, the proxied service obtains governance through inter-process communication with its sidecar; the service itself does not need to care about governance and is completely decoupled from the governance infrastructure.
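The division of labor described above can be modeled in a small Python sketch (all names here are hypothetical, and the "sidecar" is just a local object rather than a separate process): the application makes a plain call, while the sidecar transparently applies a policy such as retry, pushed down from the control plane, before forwarding to the upstream.

```python
class Sidecar:
    """Toy sidecar proxy: applies a retry policy configured by a control plane."""

    def __init__(self, upstream, max_retries=2):
        self.upstream = upstream          # the real service being called
        self.max_retries = max_retries    # policy pushed from the control plane

    def call(self, request):
        last_error = None
        # Retries happen here, invisibly to the application.
        for _ in range(1 + self.max_retries):
            try:
                return self.upstream(request)
            except ConnectionError as err:
                last_error = err
        raise last_error

# A flaky upstream that fails once, then succeeds.
attempts = {"n": 0}
def flaky_upstream(request):
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise ConnectionError("transient failure")
    return f"ok: {request}"

# The application only sees a plain local call; governance lives in the sidecar.
sidecar = Sidecar(flaky_upstream, max_retries=2)
print(sidecar.call("GET /users"))  # succeeds on the second attempt
```

In the real pattern, the application talks to the sidecar over loopback or UDS and the upstream is reached over the network; the decoupling is the same.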

Current state of service governance at the online school

At the online school, most services still rely on the API gateway for service governance. The API gateway provides very rich functionality: service discovery, load balancing, rate limiting, authentication, and so on. Built on OpenResty, it uses Nginx routing mechanisms to implement service discovery and provides rich plug-ins for other governance features. It supports services that access each other over HTTP very well and has proven a mature solution for providing stable, reliable service during several major events. Besides the API gateway, some services use RPCX for governance: ZooKeeper provides the service registry, and RPCX supplies governance capabilities such as timeout and retry. This part of service governance also works very well. In addition, with the gradual adoption of containerization at the online school, some services directly use the CoreDNS-based service discovery provided by K8s and rely on the mapping from K8s Service virtual IP addresses to Pod addresses for load balancing. These services leverage the basic capabilities of K8s for service registration, discovery, and load balancing; other governance capabilities, such as trace collection and resilience, must be implemented by the services themselves.

Only a few widely used service governance schemes are listed here; different technology stacks offer different governance modes to choose from. Each works well for its intended goals, but the governance systems are incompatible with one another and cannot reuse existing basic capabilities well. To solve this problem, the service mesh provides common service governance capabilities across these different scenarios.

Solution selection

Among service mesh solutions, the most active open source project is Istio. Despite claiming to support VMs and other deployment forms, Istio remains an open source service mesh solution deeply customized for K8s. As shown in the figure below, the Istio service mesh is divided into a data plane and a control plane, and relies on the Envoy proxy. The control plane, istiod, consists of three parts: Pilot for service discovery, Citadel for security and certificates, and Galley for configuration management. The data plane uses Lyft's open source Envoy proxy to implement layer 4/7 proxying for services and provide capabilities such as service discovery, load balancing, traffic management, and resilience.

Project adoption practice in the platform R&D department

The open source Istio solution is not out-of-the-box; many aspects need to be extended according to the realities of the business. Judging from technical articles published by major companies, Istio deployments are deeply customized to meet business requirements. Typical examples include optimizing mesh proxy performance, optimizing the supported service scale, optimizing traffic hijacking, and porting OpenResty plug-in extensions. Companies customize Istio to varying degrees according to their business profile and needs. Based on the current business characteristics of the online school, we summarize the following requirements for the service mesh.

  • Unify service governance infrastructure capabilities
  • Support multiple languages and scenarios
  • Be compatible with the API gateway, meet high stability requirements, and support smooth migration and smooth degradation
  • Extreme performance is not a priority
  • Ease of use is a key goal, especially since many business lines are new to containers and the product must be easy to use
  • Integrate with logging, monitoring, and other infrastructure rather than building new infrastructure from scratch
  • Extend the open source solution without making incompatible customizations to it

After determining the above goals, we developed some plug-ins on top of open source Istio and integrated with basic services such as logging. The overall product architecture is shown in the figure below. The entire service mesh is currently based on K8s and containers, and is compatible with the cloud platform's logging base services and the unified trace standard. EnvoyFilter is extended to implement functions such as rate limiting, retry, and authentication. Given that the Istio community is very active and iterates quickly (a major version is released every quarter), we avoided modifying Istio or Envoy code; the special features we need are added through Lua or WASM plug-ins. At present, both extension methods still have significant limitations. For the HTTP protocol, the processing flow is almost fully supported and extension is relatively convenient. For non-HTTP protocols such as RPCX, we currently implement RPCX protocol parsing in WASM. Although the WASM extension approach is promoted by the Istio community and regarded as the standard for future Envoy extensions, it currently has limitations; for example, modifying the processing flow of a request is not supported. Most business traffic currently uses HTTP, and most requirements can be met using these regular extensions.
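As an illustration of the kind of logic such extensions implement, the snippet below sketches a token-bucket rate limiter. In our mesh the actual limiter runs inside Envoy as a Lua or WASM filter; this Python version only demonstrates the algorithm, and all names and numbers are illustrative.

```python
class TokenBucket:
    """Token-bucket rate limiter: `rate` tokens/second, burst up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now):
        # Refill tokens for the elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # request passes the filter
        return False      # request would be rejected (e.g. with HTTP 429)

bucket = TokenBucket(rate=1, capacity=2)
# Two requests at t=0 consume the burst; the third is rejected.
print([bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0)])  # [True, True, False]
# One second later a token has been refilled.
print(bucket.allow(1.0))  # True
```

A filter like this runs on every request in the sidecar, which is why the mesh can enforce rate limiting without any change to the application.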

Service Traffic Access

Traffic access consists of two parts: how to work with the API gateway to bring traffic into the service mesh, and how traffic is exchanged between the proxied service and its sidecar. For external traffic access, we considered several schemes and finally selected the one shown in the figure below. After traffic enters the K8s cluster, it enters the mesh network through the Istio ingress gateway; the ingress gateway inside the mesh optionally handles traffic at the mesh edge. We also considered an alternative: injecting a sidecar directly into the K8s ingress so that the ingress could forward traffic straight to Pods. However, we use iptables for traffic hijacking, and to avoid NAT problems the K8s ingress is deployed in host network mode, where iptables hijacking cannot be used. So on the mesh side we added the Istio-recommended gateway to handle ingress traffic at the edge.

Traffic hijacking within a Pod takes different forms depending on requirements. Meituan's OCTO solution provides SDKs for common development languages, and the application communicates with the proxy through UDS. The SDK approach is quite common, and many major companies adopt it; its main motivations are concerns about the difficulty of managing iptables and about performance bottlenecks. Considering that an SDK is somewhat intrusive to the business, and that we do not demand extreme performance, we chose iptables for traffic hijacking so the mesh can be adopted quickly and transparently. Test data so far show that for a single Pod at 2000 QPS, the performance loss of hijacking traffic to the Envoy proxy via iptables is smaller than expected after optimization: overall loss is less than 2%. Since 2K QPS per Pod covers the majority of applications, traffic hijacking with iptables is not a bottleneck for overall performance.

Smooth project migration

For a new technology, a huge adoption cost will undoubtedly hinder its promotion, so we designed a smooth migration scheme compatible with the API gateway. The following figure shows several typical access paths: part of the traffic of App1 and App2 is on the service mesh, while all the traffic of App3 is on K8s. Whether an application is on the mesh or on K8s, App3 is still accessed through the old API gateway, so projects can connect to the mesh without any modification. When App2 on K8s accesses App1, traffic is split at the API gateway, and part of it can be forwarded to the mesh as required. When App2 on the mesh accesses App1, the hijacked traffic is rewritten to App1 on the mesh and no longer passes through the API gateway. To enable smooth degradation, the mesh cluster can be rolled back and degraded in an emergency, and it is independent of the current online K8s cluster. This also makes it easy for users to migrate to or try out the mesh: almost no changes are required, just synchronize the application to the mesh cluster on the container platform, and the corresponding webhooks automatically inject the sidecar.

Fast smooth degradation

As described in the previous section, service mesh traffic hijacking uses a webhook for sidecar injection. After the sidecar is injected, iptables rules are installed by an init container; these rules transparently hijack traffic to and from the application. For online traffic, smooth degradation is required: when the mesh proxy fails, the service must be able to come off the mesh smoothly. Solutions that forward traffic to the proxy through an SDK usually detect the health of the proxy; once the proxy is abnormal, the SDK forwards traffic directly to the target service, bypassing the proxy, thus achieving smooth degradation. We use iptables for hijacking, so when an application needs to exit the mesh we only need to remove the iptables hijacking rules. We developed a webhook that, when injecting the sidecar, also injects our iptables tool, which is responsible for smoothly removing and adding the hijacking rules. Together with the mesh console and gateway, we can easily manage the traffic of a single application, a namespace, or the entire cluster to achieve smooth degradation.
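The idea behind the degradation tool can be sketched as follows. This fragment only builds command strings for hypothetical hijack rules in the style of a REDIRECT-mode setup; the real rule set injected by the init container is more involved, and the chain name and port here are illustrative assumptions (15001 is Istio's conventional outbound capture port).

```python
PROXY_CHAIN = "MESH_REDIRECT"   # hypothetical chain name
ENVOY_PORT = 15001              # port the sidecar proxy listens on (assumed)

def hijack_commands():
    # Commands that install transparent hijacking: outbound TCP traffic from
    # the application is redirected to the local Envoy sidecar.
    return [
        f"iptables -t nat -N {PROXY_CHAIN}",
        f"iptables -t nat -A {PROXY_CHAIN} -p tcp -j REDIRECT --to-ports {ENVOY_PORT}",
        f"iptables -t nat -A OUTPUT -p tcp -j {PROXY_CHAIN}",
    ]

def degrade_commands():
    # Removing the jump rule and flushing the chain takes the Pod off the mesh:
    # traffic flows directly to its destination again, bypassing the sidecar.
    return [
        f"iptables -t nat -D OUTPUT -p tcp -j {PROXY_CHAIN}",
        f"iptables -t nat -F {PROXY_CHAIN}",
        f"iptables -t nat -X {PROXY_CHAIN}",
    ]

for cmd in hijack_commands() + degrade_commands():
    print(cmd)
```

Because degradation is just deleting rules, it requires no restart of the application container, which is what makes it "smooth".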

Vision of the future

Most of the functions currently supported by the service mesh were developed over the past several months. The project is jointly built by the container team and the online school platform R&D department: existing functions are used to onboard services, and the project continues to evolve according to business needs. It currently integrates with infrastructure such as logging and tracing and provides unified service governance capabilities. So far, with the full support of colleagues in the platform R&D department, 10 projects on the mesh carry online traffic, and another 6 projects are ready to cut over online traffic. We have done extension development on top of open source Istio, but much work remains for the mesh as a whole.

Lazy loading of service configurations

The service mesh sidecar implements service discovery and load balancing, so service information is stored in the sidecar. In large-scale or multi-cluster deployments, service information is abundant, and by default every sidecar stores the full information. The Istio community is aware of this problem and has introduced a way to restrict the scope of service information a sidecar receives, so that a service's proxy stores only the information of the services it depends on. But sorting out service dependencies by hand is tedious and error-prone. The community has a corresponding open source project for lazy loading of service configuration: NetEase's open source Slime. When service A accesses service B, A's sidecar initially has no information about B, so A's proxy sends the request to a fallback service C that holds the full service information. After C handles the request from A, the information about B is pushed to A's sidecar. In this way, A's dependency on B is loaded lazily. Supporting lazy loading greatly reduces sidecar memory usage and avoids tedious manual dependency sorting.
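The lazy-loading flow can be modeled in a few lines of Python. The sketch below is a simplification under stated assumptions (all names are invented, and Slime's real mechanism works on Istio resources and Envoy routes): a sidecar with an empty route table detours unknown destinations through a fallback that knows every service, and the missing route is learned as a side effect.

```python
# Full service table held by the fallback ("service C" in the text above).
GLOBAL_SERVICES = {"service-b": "10.0.1.5:8080", "service-d": "10.0.1.9:8080"}

class LazySidecar:
    """Sidecar that starts with no routes and learns them on first use."""

    def __init__(self):
        self.routes = {}  # initially empty: no dependency list to maintain

    def call(self, service):
        if service in self.routes:
            # Fast path: route already learned, go direct.
            return f"direct -> {self.routes[service]}"
        # Slow path: unknown destination, detour through the fallback proxy,
        # which also pushes the learned route back to this sidecar.
        address = GLOBAL_SERVICES[service]
        self.routes[service] = address
        return f"via fallback -> {address}"

sidecar_a = LazySidecar()
print(sidecar_a.call("service-b"))  # first call detours through the fallback
print(sidecar_a.call("service-b"))  # later calls go direct
```

The memory saving comes from `routes` holding only the services actually called, rather than a copy of the whole `GLOBAL_SERVICES` table.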

Envoy grayscale release

The service mesh uses Envoy as its proxy. Currently, the entire mesh cluster uses a single version of Envoy, and upgrades to new versions are rolled out uniformly. Uniform changes are risky, and the community has plans to explore a grayscale release mechanism for Envoy. Grayscale release of Envoy on our mesh also needs to be solved. There are several options, such as adding a project list to the sidecar-injection webhooks so that listed projects use a new Envoy version for grayscale updates. Likewise, proxied services do not expect their Pods to be recreated because of an Envoy update, so in-place Pod upgrades also need to be addressed; solutions for this exist in the community.

Monitoring integration

One of the important values of the service mesh is its excellent observability. Whether proxying layer 4 or layer 7 services, rich monitoring metrics help us quickly alert on, locate, and handle various faults. However, Envoy exposes so many metrics that enabling their collection puts great pressure on a standalone Prometheus; how to collect mesh monitoring metrics efficiently and integrate them with the alerting system still needs gradual improvement.

Support for more custom plug-ins

At present, there is no good way to manage the various plug-ins of the service mesh; many of them exist in K8s in the form of CRDs. Properly managing these plug-ins and opening plug-in capabilities to more developers are also issues the mesh platform needs to consider.

The service mesh takes a new approach to unified service governance, and each major company implements it to a different degree, with different requirements and different adoption paths; they basically adopt Istio or transformed Istio schemes. The transparent, unified governance capabilities of the service mesh separate business from infrastructure in a more elegant form, allowing the two to evolve independently. But the new technology is neither mature overnight nor out of the box; we still need to adapt and extend it according to actual needs. As one of the cores of the cloud-native technology stack, the service mesh has gradually been adopted by major companies. As more technical forces join the cloud-native tide, the current deficiencies of the service mesh will be remedied quickly, and it will gradually be widely adopted and accepted across industries.

References

  1. Martin Fowler, "Microservices," Chinese translation: developer.aliyun.com/article/385…
  2. The service mesh pattern: philcalcado.com/2017/08/03/…
  3. Introduction to Meituan's OCTO 2.0 service mesh: tech.meituan.com/2021/03/08/…
  4. Istio documentation: istio.io/latest/docs…
  5. NetEase's service mesh configuration management project: github.com/Slime-io/Sl…

Gao Jun, Good Future (TAL)
