0. Foreword

There are many articles about Servicemesh concepts, architectures, methodologies, and standard implementations, but few about how a Servicemesh can be implemented effectively and reliably, or about the difficult choices we face along the way. From that perspective, this article explores how to evolve a system toward a Servicemesh architecture, based on the author's experience in production environments. It is intended for readers who have some knowledge of Servicemesh concepts or who want to explore the field.

1. A glimpse of the changing concept of service governance

Before we talk about Servicemesh, let's look back at the history of service governance. As we all know, services have evolved from the monolithic application through horizontal layering and vertical splitting to the rise of the microservice idea, with its dazzling array of decomposition methodologies such as the "two weeks" refactoring principle and the "two pizza" team. These are well-worn arguments, and they are not the focus of this chapter. This chapter focuses instead on a core problem that must be faced throughout the development and decomposition of service governance: how can services be connected after they are split apart? That is:

How do we connect a complex set of service nodes into a network?

For the sake of space, let’s briefly review some of the major currents of thought on this issue.

1.1 Server Proxy

I'll call this the centralized proxy phase. It is also the simplest solution: a centralized, single-point service cluster takes on most of the service governance functions. Centralized deployment of LVS, Nginx, HAProxy, Tengine, Mycat, Atlas, Codis, and the like is the most common approach. Its advantages are centralized maintenance of functionality, language independence, and a low barrier to entry, so it has a large audience in large and medium-sized companies and especially in startups. But its problems are just as obvious. Take the most popular combination, domain name + HTTP REST call + Nginx: the call path has to pass through DNS (and DNS caches), Keepalived, (LVS,) and Nginx. Such a long invocation chain with multiple single-point systems causes stability and performance problems as service scale grows. Similar single points of failure have occurred at Dingding Rental and Meituan: Dingding Rental once suffered a large business impact from failures of Nginx, DNS, and Codis-proxy, while Meituan's DNS and MGW have had similar problems, which prompted Meituan to launch a technical project to convert intranet HTTP calls into direct Thrift connections. In general, this approach still has its place. It suits startups, and especially cross-language scenarios, where a Server Proxy can serve as a quick-start solution. But when using it, you need a clear understanding of its defects in order to make the right decisions in the subsequent evolution.

1.2 Smart Client



The simple and direct Server Proxy brought us convenience, but also plenty of problems. To deal with them, the Smart Client rose rapidly and showed strong vitality. Let's call this the framework stage. To solve the single-point, long-chain problem of the Server Proxy, nodes now talk to each other over direct connections. This removes the single point and the long invocation chain in one stroke, and direct connections open the door to greater performance optimization and stability improvements. Representative open source products (service frameworks or RPC frameworks) include Uber's Ring-Pop, Alibaba's Dubbo, Ant's Sofa-Rpc, Dangdang's Dubbox, Weibo's Motan, Dianping's Pigeon/Zebra, Baidu's BRPC/Stargate, gRPC, Thrift, and so on. Two big ideas drive this stage:

1. If hard doesn't work, go soft: hardware-based service governance falls short, so switch to software-based service governance.

2. If far doesn't work, bring it close: remote centralized deployment has problems, so pull governance into the application process itself.

In this way, many of the drawbacks of the Server Proxy trend are solved.

Having praised the Smart Client at such length, does it have problems of its own? It certainly does, and they are obvious and serious.

  1. First, we know that the ethos of microservices favors autonomy, including autonomy of technology stack. The Smart Client is in effect a rich client, and it severely limits our choice of technology stack: for each language you must re-implement essentially all of the governance capabilities, and then carry the maintenance and upgrade cost across all those languages, a burden whose growth most companies cannot bear. Uber had to ship Node and Go versions of Ring-Pop, Weibo's Motan has released multiple language versions, and gRPC and Thrift have not escaped this fate either. Since most companies now mix two or more languages in actual development, this problem must be faced.

  2. Second, the Smart Client is embedded in the application process as an SDK, which completely mixes the operation and maintenance of service governance with the maintenance of business applications. In real work scenarios this raises the operational difficulty of service governance by orders of magnitude: the governance team is forced to face many coexisting versions of its framework code, and the nightmare that rolling out a single new version may take more than half a year.

Of course, despite the problems above, the Smart Client still dominates high-concurrency, high-traffic business scenarios, because no particularly mature alternative exists. It also works well when languages are kept as converged as possible, for example using the Java Smart Client for core services while letting low-traffic Node.js services make do with HTTP REST plus domain names as a governance compromise.

1.3 Local Proxy

Both the Server Proxy and the Smart Client have unavoidable problems. Are there other solutions? In response, the Local Proxy came into being. Since centralized deployment has single-point problems and the rich client has coupling problems, why not compromise between the two? The idea becomes:

Perform governance locally, at the process level. A local process-level proxy avoids both the problems of centralized single-point deployment and the language dependence and application coupling of the rich client.

This line of thought gradually became all the rage, and governance solutions built on it sprang up like bamboo shoots after rain.

Airbnb's SmartStack uses four components (ZooKeeper, Nerve, Synapse, and HAProxy) to cover the whole core service governance workflow; it is a simple solution.

Ctrip's OSP takes a similar approach. The main difference from Airbnb is that the roles of Synapse and HAProxy are combined into a single proxy.

In the cloud field, where cross-language support and operational efficiency are weighed even more heavily, this idea is all the more dominant. In the Mesos + Marathon cloud architecture, a similar scheme uses HAProxy for routing, with a central control node refreshing the corresponding routing information.

The same is true of K8s, Google's own child. Out of concern for proxy performance, it takes a compromise approach and uses iptables rule injection for forwarding (an approach which, of course, has its own unavoidable problems).

Each of these approaches has its own problems, but the biggest one is this:

How do we handle the performance degradation it introduces? Whether iptables or an agent process does the governance, forwarding, and communication, the question is unavoidable: how much performance does iptables cost under high traffic and high concurrency? How large is the gap compared with the many applications already running rich-client direct connections? Among currently known products, QPS and RT both suffer considerable loss under high traffic, and some solutions lose as much as 20% of their performance, which is clearly unacceptable in many scenarios.

At this point Servicemesh, the final trump card of the Local Proxy school, officially took the stage; 2018 has even been called the first year of Servicemesh. Its ideas, as I see them, are these:


1. Sacrifice a certain amount of performance and resources in exchange for a high degree of autonomy and operability in overall service governance;

2. Separate execution from control: cut apart the data plane and the control plane;

3. Virtualization, standardization, productization and specification definition.

Servicemesh freed itself from the chaotic, many-flavored Local Proxy school and proposed a more systematic way of thinking. This article will not spend space on more conceptual descriptions of Servicemesh, since there are many such articles on the web. Following the lead of Istio (which was originally designed to help applications move to the cloud), other companies have built their own solutions based on, or with reference to, Istio:

  1. Ant rewrote Envoy in Golang and built Sofa-Mosn and Sofa-Pilot on that basis;
  2. it also sank the rate-limiting capability of Istio's much-criticized Mixer down into the data plane;
  3. Tencent revamped and integrated its internal TSF service framework on the basis of Envoy;
  4. Weibo developed Motan-Mesh from Motan-Go and integrated its own service governance system;
  5. Huawei's ServiceComb did something similar, sinking the Mixer completely;
  6. Buoyant, a company founded by former Twitter engineers, launched Conduit, with its data plane written in Rust, which likewise sinks the Mixer's functionality.

However, Servicemesh still has several problems of its own. Besides performance, a growing body of rethinking has emerged around the mesh:

Where exactly should the line between the control plane and the data plane be drawn? Is the current split too idealistic?

Opinions differ on this for now, so I will not expand on it here. Although Servicemesh is still in its infancy and many of its problems are still being explored, judging by the development trend of microservices, the three ideas of Servicemesh listed above are definitely the direction of the future.

1.4 Summary

This chapter has reviewed the development of service governance through the thinking of its three major stages. Returning to our original question, we can see that each stage's solution has scenarios where it fits and scenarios where it does not: there is no best solution, only the most appropriate one. Following the logic of these three stages, we can also see that service governance itself moves through repeated exploration and entanglement, spiraling upward.

2. Have you considered the drain on resources?

A mesh is essentially a parasite on the business machine: it consumes the machine's resources. In testing, we found that the memory consumption of a mesh implemented in C++ or Go is quite controllable: by default it occupies only a few megabytes, and even under high concurrency it generally grows only to tens of megabytes, which is negligible on an ordinary 8 GB or 16 GB application machine. So the extra memory can basically be ignored. CPU consumption, however, is relatively high, and tends to approach the CPU usage of the business itself. This means that once the mesh is added, a business may effectively be left with only half of its original CPU resources. That is a much bigger problem.

On this issue, the prevailing view in the industry is that, since the resource utilization of a normal business machine is under 10%, this extra usage has no substantive impact on the business in practice; rather, it puts idle resources to better use and avoids waste, making the business and the mesh mutually beneficial.

This logic will certainly hold for a long time to come. However, I think it raises two new questions:

  1. Resources do not sit idle indefinitely. We have noticed that more and more businesses pay close attention to resource usage because of cost allocation, and one of the goals of cloud native is precisely to raise machine resource utilization. Following this trend, once the utilization problem has been solved reasonably well, the mesh's CPU usage will become prominent. How will that problem be solved then? If the mesh is bound to dedicated CPU cores, or given its own pod resources to isolate it from each business instance, considerable cost waste inevitably follows.
  2. Beyond the utilization figure, there is the problem of business peaks and troughs. Every business has its ups and downs: food delivery peaks at every mealtime, hotels peak on every holiday, movie tickets peak every Spring Festival. Peaks and troughs imply that some resource redundancy must exist. So even though average utilization is low, the CPU of some businesses will soar at peak time, even to saturation. If a mesh has been introduced, the business side will directly feel its processing capacity drop by half during the peak, and I'm sure they will be shocked to hear it. How do we deal with that? Is there any solution other than doubling the machines on the business side?

This may seem like an unsolvable proposition, because this is simply what Servicemesh's architecture is: resources do not come out of nowhere. Still, I would like to ask: can we break out of the Servicemesh architecture, or optimize it?

Review the three important trends we mentioned earlier in the history of service governance:

Server Proxy

Smart Client

Local Proxy

Servicemesh belongs to the Local Proxy school, which solves problems such as strong coupling with the business side, strong language dependence, and single points of failure. But are the other schools of thought therefore useless? Obviously not: the other approaches still have strong vitality and value. Our solution is to fall back on Server Proxy thinking when idle resources run short, adopting a logically central mesh (a "Central Mesh") to solve the problems above:

  1. The Sidecar performs idle-resource detection;
  2. when idle resources are about to run short, it tells the SDK to switch traffic over to the Central Mesh;
  3. the Central Mesh then does all of the Sidecar's work on its behalf.

The Central Mesh holds all the information needed in its region and carries all of a Sidecar's capabilities. In other words, it acts as the backup for the Sidecars in its region: when a Sidecar fails, or idle resources are no longer sufficient for it to run properly, traffic is proactively switched over to the Central Mesh.

The Central Mesh is "logically" central because it need not be one central cluster; it can be deployed close by to minimize network latency and the extra risks of a single point. For example, it can be deployed per equipment room, per region, per gateway, or even on the nearest host. A minimal sketch of the switch-over logic appears below.
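
To make the idea concrete, here is a minimal sketch in Go of the Sidecar-side fail-over decision. Everything in it is hypothetical: the addresses, the 15% idle threshold, and the cpuIdlePercent probe are illustrative stand-ins, not part of any real mesh implementation.

```go
// Hypothetical sketch of Sidecar-side fail-over to a logical Central Mesh.
// The addresses, threshold, and probe below are illustrative assumptions.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

const (
	localSidecarAddr = "127.0.0.1:15001"                     // assumed local Sidecar listener
	centralMeshAddr  = "central-mesh.region1.internal:15001" // assumed regional Central Mesh address
	idleThreshold    = 15.0                                  // fail over when idle CPU drops below 15%
)

// Dispatcher decides, per request, whether traffic goes to the local
// Sidecar or to the Central Mesh. The flag is atomic so the hot path
// never takes a lock.
type Dispatcher struct {
	useCentral atomic.Bool
}

// Target returns the address the SDK should send the next request to.
func (d *Dispatcher) Target() string {
	if d.useCentral.Load() {
		return centralMeshAddr
	}
	return localSidecarAddr
}

// watchIdle samples idle CPU via the supplied probe (a stand-in for a real
// reading of /proc/stat) and flips traffic when headroom runs out.
func (d *Dispatcher) watchIdle(cpuIdlePercent func() float64) {
	for range time.Tick(5 * time.Second) {
		d.useCentral.Store(cpuIdlePercent() < idleThreshold)
	}
}

func main() {
	d := &Dispatcher{}
	go d.watchIdle(func() float64 { return 50.0 }) // stub probe: pretend 50% idle
	fmt.Println("dispatching to", d.Target())
}
```

The key design point is that the hot path reads only one atomic flag, so the fallback check adds no lock contention to request dispatch.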


3. Have you considered performance losses?

Performance loss is an unavoidable problem: with the extra hop of forwarding and the added service governance logic, performance is naturally worse than direct RPC. In our actual performance tests, mesh mode degrades performance by 20-50% compared with direct connection, even without iptables, which costs more still. Of course, this added latency is measured in milliseconds and is acceptable for most business needs; for them the impact on service performance is minimal.

However, we still need to consider the following potential problems:

  1. Business application problems. Some high-concurrency business scenarios already run at low latency (a few milliseconds) and are latency-sensitive, and a single call chain may fan out into eight or ten RPC calls. If each hop adds a millisecond or two through the mesh, the chain as a whole can accumulate enough extra latency to degrade the business noticeably, and in serious cases to cause timeouts or exhausted thread pools.
  2. Infrastructure traffic problems. If the future trend of Servicemesh is for all traffic, not just business traffic, to be meshed, then we must consider meshing extremely latency-sensitive storage traffic such as Redis, which further lowers our tolerance for extra latency. Redis serves very high concurrency at extremely low latency and is highly delay-sensitive; any extra millisecond can degrade its effective availability dramatically, or even cause a business failure.

Therefore it is important to anticipate the performance degradation, and to optimize and squeeze Servicemesh's performance limits as far as possible, rather than simply choosing a mesh and leaving performance behind. For how far squeezing can go, think of the care Netty famously lavishes on its EventLoop.

In terms of mesh communication performance optimization, several points can be considered:

  1. Local inter-process communication. Because the mesh and the business process live on the same machine, local IPC can be used to accelerate communication. There are many options, such as mmap, Unix domain sockets, pipes, and signals, among which mmap's performance stands out. Traffic-shm, an asynchronous lock-free IPC framework that comfortably sustains millions of TPS, communicates over mmap. In our tests, mmap combined with a suitable event-notification mechanism outperformed TCP by more than 30% in some high-concurrency scenarios. (A Unix domain socket sketch follows this list.)
  2. Threading model. High-performance servers build their threading model on the Reactor pattern, and there are several ways to combine it with thread or coroutine pools: Nginx uses one master and multiple worker processes, each worker running one Reactor on one thread (newer versions add a thread-pool mechanism); Evio and Envoy use a one-Reactor-plus-worker-pool style; Netty uses multiple Reactors with multiple thread pools. Whatever the arrangement, avoid blocking designs in the mesh.
  3. Byte reuse. We habitually allocate a fresh buffer for every request to hold its data, but as concurrency grows, such allocation causes heavy overhead and reclamation pressure. It pays to allocate memory with a buddy allocator, a slab allocator, or something similar: Netty uses the buddy algorithm for its off-heap memory, Nginx uses a slab mechanism, and Mosn adds multiple size classes and sync.Pool caching on top of Golang's allocator. (A sync.Pool sketch follows this list.)
  4. Memory alignment. The operating system manages memory in pages. If you move data by manipulating memory addresses directly (as with mmap), a lack of alignment forces you to pull in memory you don't need and pay the cost of extra copying and stitching, which directly hurts performance. Disruptor is likewise optimized around memory alignment.
  5. Lock-free design. The first instinct when handling concurrency safely is often to reach for a lock. Consider instead hardware-level CAS operations in place of ordinary locks, a single-threaded design like Redis, or Envoy's approach of keeping a thread pool but binding each connection to a single thread, which sidesteps concurrency problems entirely. (A CAS sketch follows this list.)
  6. Pooling. Threads are expensive, and you cannot simply spawn thousands of them, so thread pools are the default. Coroutines are a subtler case: goroutines are celebrated as lightweight threads, and you can easily start tens of thousands without breaking a sweat, but unbounded goroutine creation still seriously hurts real processing performance. Moreover, because of Golang's allocation strategy, some goroutine metadata is not reclaimed after use; the Go developers' philosophy is roughly "if the system has seen this traffic once, it may see it again, so stay prepared." So coroutines should be pooled as well. Motan-go and the Golang version of Thrift currently do not do this, while Sofa-Mosn pools its goroutines accordingly. (A goroutine-pool sketch closes this list.)
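
First, a minimal sketch of item 1, local IPC over a Unix domain socket. The socket path is an assumption, and the echo handler merely stands in for real forwarding and governance logic:

```go
// Minimal sketch: a Sidecar listening on a Unix domain socket so the
// business process can reach it without crossing the TCP stack.
// The socket path is an assumption; io.Copy stands in for proxy logic.
package main

import (
	"io"
	"log"
	"net"
	"os"
)

const sockPath = "/var/run/mesh.sock" // assumed rendezvous point

func main() {
	os.Remove(sockPath) // clear a stale socket left by a previous run
	ln, err := net.Listen("unix", sockPath)
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Print(err)
			continue
		}
		// One goroutine per local connection; a production mesh would hand
		// the connection to a Reactor loop instead (see item 2).
		go func(c net.Conn) {
			defer c.Close()
			io.Copy(c, c) // echo: placeholder for forwarding/governance
		}(conn)
	}
}
```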
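
Next, a sketch of item 3, byte reuse with Go's sync.Pool. A single size class is assumed here; real meshes such as Mosn layer multiple size classes on top of the same idea:

```go
// Sketch: request buffers are borrowed from a pool and recycled instead of
// reallocated per request, easing allocator and GC pressure under load.
package main

import "sync"

const bufSize = 4096 // assumed single size class

var bufPool = sync.Pool{
	New: func() any { return make([]byte, bufSize) },
}

// handle borrows a buffer for the duration of one request.
func handle(process func(buf []byte)) {
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf) // recycle rather than leave it to the GC
	process(buf)
}

func main() {
	handle(func(buf []byte) { copy(buf, "hello") })
}
```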
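
Item 5 can be illustrated with a CAS loop, replacing a mutex with the hardware compare-and-swap primitive (for a plain counter, atomic.AddInt64 would be simpler; the explicit loop just shows the technique):

```go
// Sketch: lock-free increment via compare-and-swap instead of a mutex.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// add retries until its CAS wins; no goroutine ever blocks on a lock.
func add(counter *int64, delta int64) {
	for {
		old := atomic.LoadInt64(counter)
		if atomic.CompareAndSwapInt64(counter, old, old+delta) {
			return
		}
	}
}

func main() {
	var c int64
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); add(&c, 1) }()
	}
	wg.Wait()
	fmt.Println(c) // always 100, with no locks taken
}
```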
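
Finally, a sketch of item 6, a fixed goroutine pool that bounds concurrency instead of spawning one goroutine per request. The worker count and queue depth are assumptions to be tuned per workload:

```go
// Sketch: a fixed set of workers drains a bounded task queue, capping the
// number of live goroutines regardless of request volume.
package main

import (
	"fmt"
	"sync"
)

type Pool struct {
	tasks chan func()
	wg    sync.WaitGroup
}

func NewPool(workers int) *Pool {
	p := &Pool{tasks: make(chan func(), 1024)} // bounded queue applies backpressure
	p.wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer p.wg.Done()
			for task := range p.tasks {
				task()
			}
		}()
	}
	return p
}

// Submit blocks once the queue is full, pushing back on callers.
func (p *Pool) Submit(task func()) { p.tasks <- task }

// Close drains outstanding tasks and waits for the workers to exit.
func (p *Pool) Close() { close(p.tasks); p.wg.Wait() }

func main() {
	p := NewPool(4)
	for i := 0; i < 10; i++ {
		n := i
		p.Submit(func() { fmt.Println("task", n) })
	}
	p.Close()
}
```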

There are many other performance optimizations that I won’t list here.

4. Interaction between Sidecar functions

One of the primary reasons for adopting Servicemesh is to break the strong coupling with the business and to sink service governance capabilities downward. But as we sank them, we realized just how much service governance contains: dynamic configuration, flow control, circuit breaking, failure drills, load balancing, routing, communication, service registration and discovery, centralized logging, distributed tracing, monitoring, hotspot detection, and more, all crammed into one thin Sidecar. Once we do this, shouldn't we also examine whether these many governance functions interfere with, depend on, influence, or even conflict with one another, at both the organizational and the technical level? Yes, and that is exactly where the problems extend.

  • For example, how do you ensure that high-volume but non-critical traffic, such as log collection, never affects core business traffic?
  • For example, how do you ensure that upgrading one feature does not affect core business communication?
  • For example, how do multiple teams jointly maintain one Sidecar?

These are all real potential problems. Of course, you can reach for isolation, hot deployment, or clever repository splitting. But once everything has sunk into one process, can you really manage a capability set maintained by seven or eight teams well?

At this point our suggestion is: take the Sidecar apart. Split your Sidecar according to sensible rules and the development stage of your mesh; once it is split, many things may simply work out. As of this writing, Ant Financial, for example, has already split a separate DBMesh out of its Sofa-Mosn.

Note, however, that the Sidecar should not be split too finely, or you end up with a flood of sidecars and heavy operation, maintenance, and upgrade costs. Doesn't this feel a little like the process of servitization itself, where you take things apart, simplify them, and introduce new problems? That is the beauty of our trade: you can always find similarities in seemingly unrelated places.


5. Only service subscription, and no service registration?

Pilot can subscribe to services and bridge them into the xDS interface system. But why does it have no ability to register services? My guess is that this is because Servicemesh evolved out of the Local Proxy in a cloud native context, and in cloud native solutions the Local Proxy is never responsible for registration: either registration is entrusted to the platform (Mesos, Marathon, K8s, and the like all have existing solutions), or Consul/etcd/ZK integration completes it. Either way, the Local Proxy simply focuses on the work of a reverse proxy.

For real production use, this is not so friendly. A business that has grown past a certain stage usually has its own service governance framework, with its own way of registering and subscribing to services. It is unlikely to move its entire publish-subscribe system onto a cloud native platform just to mesh its service governance; that would be putting the cart before the horse. So adaptation is inevitable, and during adaptation, in order to keep service publish and subscribe working, teams have had to modify the Sidecar deeply, adding registration capability and connecting the Sidecar to a third-party registry. The complexity of that work defeats Servicemesh's own ambition of having the control plane mask differences in infrastructure.

So I believe Servicemesh needs to completely hide the existence of any specific registry: both publish and subscribe should go through Pilot, presenting a unified facade to users. Then, however the registry is switched in the future, no deep intrusion into the Sidecar is needed. A sketch of this idea follows.
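
To make the argument concrete, here is a hedged sketch of what that masking layer could look like. None of these types exist in Istio or Pilot; they only illustrate the shape of a registry abstraction that carries both publish and subscribe, so that swapping Consul for ZK touches one adapter rather than the Sidecar:

```go
// Hypothetical sketch: a control-plane-side registry abstraction that
// carries both publish and subscribe, so the Sidecar never talks to a
// concrete registry. These types are illustrative only.
package registry

// Instance describes one service endpoint.
type Instance struct {
	Service string
	Addr    string
	Meta    map[string]string
}

// Registry is the single seam between the mesh and any concrete backend
// (Consul, etcd, ZooKeeper, an in-house system, ...).
type Registry interface {
	Register(inst Instance) error
	Deregister(inst Instance) error
	// Subscribe streams the current instance list of a service and its
	// subsequent changes.
	Subscribe(service string) (<-chan []Instance, error)
}
```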


6. How to split the control plane and the data plane?

This is a perennial problem. The Mixer in the control plane is by now the wall that everyone is pushing over. Istio stubbornly insists on splitting it out to handle concerns such as rate limiting and data telemetry; the former creates a serious performance bottleneck (even with the cache Istio later added inside Envoy), and the latter can double traffic consumption. Many Servicemesh implementations therefore eliminate or simplify the Mixer. There are plenty of articles about this online, so I won't repeat them here.

Istio's design is idealistic. The idea is to mask infrastructure variability, let the control plane back the Sidecar with effectively unlimited capacity, and strip as much complex logic as possible out of the Sidecar, so that the Sidecar stays as stable and reliable as it can be. But the harsh reality is that wherever there is communication there are problems, and the complexity of distributed environments comes largely from the network.

If, however, the Mixer is to be sunk wholesale into the Sidecar, how do we keep the Sidecar itself stable and reliable under all that complex logic, frugal with resources, with as few dependencies as possible, still small and beautiful?

Although how to cut the control plane and the data plane remains a difficult proposition, and Istio has drawn some ridicule on this front, from a pioneer's perspective Istio successfully consolidated the long-developing Local Proxy school and raised it to the level of a methodology. Istio's biggest contribution is that it pushed the whole industry into thinking systematically about the control plane and the data plane: a shift from tactics to strategy, from technology to the ideas behind the technology.

7. Conclusion

We have analyzed, from several angles, problems that may exist in Servicemesh development to date, and offered some solutions based on actual production experience, in the hope of being of some help. Of course, every choice is difficult and there is no standard answer; combining these trade-offs with each company's actual situation is where ability and value show. Despite its current problems, Servicemesh is bound to be the trend of future development, and it offers us great room for imagination and real liberation of manpower.