Author: Chen Peng is an R&D engineer at Baidu, currently working on the cloud native team of Baidu's Infrastructure Department. He has led and participated in large-scale Service Mesh rollouts for core Baidu businesses such as the Baidu App, Feed, and Baidu Maps, and has in-depth research and hands-on experience in cloud native, Service Mesh, Istio, and related areas.

For more content, see the CloudNative Community: https://cloudnative.to/

Preface

If you want to implement a Service Mesh in a production environment in 2021, Istio is definitely on your radar.

Istio is one of the most popular Service Mesh technologies, with an active community and numerous real-world deployments. But if you really want to roll out Istio at scale in your production environment, there are many dangers lurking beneath the tip of the seemingly beautiful iceberg.

This article summarizes the author's experience and reflections from two years of deep involvement in developing and operating Istio in a production environment handling tens of billions of requests. The hope is that readers can pick up some reference points before introducing Istio into their own production environments, prepare more thoroughly, and step into the Istio "pit" more smoothly.

If you are not familiar with the concept of Service Mesh, you should first read Service Mesh in the Cloud Native Era.

Considerations before using Istio

Istio cannot be completely transparent to your applications

After service communication and governance-related functions are migrated to the Sidecar process, the SDK in your application usually needs to make some corresponding changes.

For example, the SDK needs to turn off some features, such as retries. A typical risk scenario: the SDK retries m times and the Sidecar retries n times, so a single failing request fans out into an m × n retry storm.
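To see why stacked retries multiply rather than add, here is a minimal arithmetic sketch; the function name and retry counts are illustrative:

```python
def total_attempts(sdk_retries: int, sidecar_retries: int) -> int:
    """Worst-case number of requests the backend can see for one call.

    Each of the SDK's (1 + m) attempts is a fresh request to the Sidecar,
    and the Sidecar may retry each of them up to n more times.
    """
    return (1 + sdk_retries) * (1 + sidecar_retries)

# 3 SDK retries stacked on 3 Sidecar retries: up to 16 requests hit
# the backend for a single logical call.
print(total_attempts(3, 3))  # 16
```

This is why retries should live in exactly one layer once a Sidecar is in the path.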

In addition, pass-through of headers such as trace headers requires an SDK update. If your SDK contains other special logic or features, it may need careful handling to work well with the Istio Sidecar.
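As a sketch of what such pass-through involves, the helper below copies the B3 trace headers that Istio's tracing documentation asks applications to forward from inbound requests to outbound calls; the function name is illustrative:

```python
# Headers Istio expects the application itself to forward, so that spans
# from inbound and outbound calls join into a single trace (B3 propagation).
TRACE_HEADERS = [
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
]

def extract_trace_headers(inbound_headers: dict) -> dict:
    """Collect the trace headers from an inbound request so they can be
    attached to outbound calls. Lookup is case-insensitive."""
    lowered = {k.lower(): v for k, v in inbound_headers.items()}
    return {h: lowered[h] for h in TRACE_HEADERS if h in lowered}

print(extract_trace_headers({"X-B3-TraceId": "abc123", "Accept": "*/*"}))
```

Without this forwarding in the SDK, each hop starts a new, disconnected trace.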

Istio has limited support for non-Kubernetes environments

When a business migrates to Istio, it may not yet have moved to Kubernetes and may still run on a legacy PaaS system. This presents a number of challenges:

  • The legacy PaaS may lack a container network, so Istio's service discovery and traffic hijacking may have to be adapted to the legacy infrastructure to work properly
  • If a single instance on the legacy PaaS cannot manage multiple containers well (analogous to Kubernetes' Pod and Container concepts), deploying and operating a large number of Istio Sidecars becomes a major challenge
  • Without Kubernetes' webhook mechanism, Sidecar injection also becomes less transparent and may need to be coupled into the business deployment logic

Only HTTP is a first-class citizen

Istio provides full native support for HTTP, but in real business scenarios proprietary protocols are common, and Istio offers no native support for them.

As a result, services using proprietary protocols may be forced to fall back to plain TCP for basic request routing, losing many features along the way, including Istio's powerful content-based routing, such as weighted routing based on headers, paths, and so on.
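For contrast, here is a sketch of the kind of content-based routing that only HTTP traffic gets: a VirtualService that sends requests carrying an `x-canary: "true"` header to subset v2 and splits the remaining traffic 90/10. The service, subset, and header names are placeholders:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
    # Requests with the canary header go straight to v2.
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: reviews
            subset: v2
    # Everyone else is split 90/10 between v1 and v2.
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 90
        - destination:
            host: reviews
            subset: v2
          weight: 10
```

A TCP route, by comparison, can only match on things like destination port and source labels; none of this header- or path-level logic is available.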

The cost of expanding Istio is not cheap

Although Istio's overall architecture is designed to be highly extensible, the system as a whole is so complex that if you actually try to extend Istio, you will find the cost is anything but cheap.

Take extending Istio to support a proprietary protocol as an example. First, you need to add the protocol to Istio's API code base. Second, you need to modify the Istio code base to implement handling and configuration delivery for the new protocol. Finally, you need to implement a corresponding Filter in Envoy to perform protocol parsing and routing.

Along the way, you may also face engineering challenges such as compiling several of these complex code bases (especially if your development environment does not work well with Docker, or you cannot reach certain overseas networks).

Even after all this work is done, you may not be able to contribute it back to the community, and the community has little appetite for proprietary-protocol extensions, so your code can drift away from upstream and cause problems with subsequent upgrades.

Performance issues for Istio in large clusters

In Istio's default working mode, each Sidecar receives information about every service in the entire cluster. If you have deployed Istio's official Bookinfo sample application and inspected it through Envoy's config dump interface, you will see that about 20 million lines of configuration have been pushed to the Envoys of just a handful of services.

As you can imagine, in larger clusters, Envoy's memory overhead, Istio's CPU overhead, the timeliness of xDS delivery, and so on all become more pronounced.

Istio works this way so that things are usable out of the box without much configuration; besides, in some scenarios it simply cannot figure out the exact call relationships between services. It therefore pushes the full service configuration to every Sidecar, even though each Sidecar only accesses a small number of services.

Of course, there is a solution to this problem. You can declare the service invocation relationships explicitly through the Sidecar CRD so that each Envoy receives only the service information it needs, dramatically reducing Envoy's resource overhead. The catch is that you must first be able to sort out the invocation relationships in your line of business.
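A minimal sketch of such a Sidecar resource, assuming workloads in a hypothetical `frontend` namespace only ever call services in their own namespace and in a `backend` namespace:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: frontend
spec:
  egress:
    - hosts:
        - "./*"              # services in the same namespace
        - "backend/*"        # plus everything in the backend namespace
        - "istio-system/*"   # Istio's own services (e.g. telemetry)
```

With this in place, Sidecars in `frontend` no longer receive configuration for unrelated services elsewhere in the cluster.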

xDS distribution has no staged release mechanism

When you change a service's policy configuration, xDS has no ability to roll it out in stages: every Envoy that calls the service receives the updated configuration immediately. In change-sensitive, demanding production environments, this can be risky or even unacceptable.

If your production environment strictly requires a staged release process for every change, you may need to implement such a mechanism yourself.

Is there an escape hatch when Istio components fail?

What makes the Sidecar architecture represented by Istio special is that the Sidecar sits directly on the business traffic path, rather than being a bypass component of the overall system like some other infrastructure (such as Kubernetes).

So in the early stages of adopting Istio, you have to ask: what happens to a service if its Sidecar process dies? Is there an escape hatch? Can it fall back to direct connection mode?

During an Istio rollout, whether this fallback is lossless determines whether core services can be brought onto the Service Mesh at all.

The Istio technology architecture has not matured as expected

Although Istio 1.0 was released long ago, each community iteration shows that the architecture is still in flux, especially in the major releases around 1.5. Sweeping changes, including the removal of the Mixer component, the consolidation of the control plane into a single architecture, and support for only newer versions of Kubernetes, were very unfriendly to users already running Istio in production, because of the various incompatibilities the upgrades introduced.

Fortunately, the community has become aware of this problem, and in 2021 a dedicated community group was formed to focus on improving Istio's compatibility and user experience.

Istio lacks a mature product ecosystem

Istio is a technical solution, not a product solution.

If you use it in a production environment, you may also need to address visual interfaces, permissions and account systems, integration with your company's existing technology components, and the surrounding product ecosystem. Operating it from the command line alone may not meet your organization's requirements for permissions, auditing, and ease of use.

Istio's built-in Kiali is rudimentary and far from production-ready, so you may need to build a more complete product on top of Istio yourself.

Istio currently addresses a limited range of problems

Istio currently focuses on service invocation between distributed systems, but there are some complex semantics and functions of distributed systems that are not included in Istio’s Sidecar runtime, such as message publishing and subscription, state management, resource binding, and so on.

Cloud native applications will continue to evolve toward multiple Sidecar runtimes, or toward folding more distributed capabilities into a single Sidecar runtime, making services themselves more lightweight and completely decoupling applications from infrastructure.

If your production environment has business systems interfacing with many complex distributed middleware components, Istio may not yet be able to fully address your application's cloud native requirements.

Final thoughts

Reading all this, do you feel a little frustrated, and losing faith in Istio?

Don't worry. Istio is still one of the most popular and successful Service Mesh implementations. The fact that Istio keeps changing shows it has a vibrant community, and we should have confidence in something new. Istio's community is constantly listening to end users and evolving in the directions we expect.

At the same time, Istio is still worth trying out of the box if your production environment is not very large, your services are already hosted on Kubernetes, and you only need the capabilities Istio provides natively.

However, if your production environment is complex, with heavy technical debt, many proprietary features and policy requirements, or a large service scale, you need to weigh these factors carefully before adopting Istio, to assess the complexity and potential cost of introducing it into your system.