Summary of 10 common exceptions when using Istio:

  1. Service port naming restriction

  2. The delivery order of traffic rules is incorrect

  3. Request interruption analysis

  4. Startup sequence of sidecar and user containers

  5. Interworking between the Ingress Gateway and the Service ports

  6. VirtualService scope

  7. VirtualService host fragmentation is not well supported

  8. Full-link tracing is not fully transparent

  9. mTLS causes connection interruption

  10. Restriction on the user service's listening address

1. Service port naming restriction

Istio supports multiple platforms, but its compatibility with Kubernetes is the best: the design philosophy, core team, and community overlap heavily. Even so, the fit between Istio and Kubernetes is not entirely free of friction. A typical example is that Istio requires Kubernetes Services to name their ports according to the protocol they carry.
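For example, under this convention a Service port carrying HTTP traffic is named with an `http` prefix, following the `<protocol>[-<suffix>]` pattern described in the Istio docs. All names below are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: demo-svc          # illustrative name
spec:
  selector:
    app: demo
  ports:
  - name: http-web        # "http" prefix tells Istio this port speaks HTTP
    port: 80
    targetPort: 8080
  - name: grpc-api        # a gRPC port is prefixed "grpc"
    port: 9090
    targetPort: 9090
```

Without the protocol prefix (e.g. a port named just `web`), Istio treats the traffic as plain TCP unless protocol sniffing identifies it, and layer-7 rules for that port are ignored.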

Traffic anomalies caused by missing or incorrect port names are the most common problem when joining the mesh: protocol-specific flow control rules simply do not take effect. You can locate such anomalies by checking the filter type in the port's LDS (Listener Discovery Service) configuration.

Why

The Kubernetes network is unaware of the application layer. The main traffic-forwarding logic of Kubernetes takes place on the node, implemented by iptables/IPVS, and these rules do not care what protocol the application layer speaks.

The core capability of Istio is managing layer-7 traffic, but the prerequisite is that Istio must know which protocol each managed service speaks. Istio delivers different Envoy filters depending on the port's protocol, and Kubernetes resources carry no layer-7 protocol information, so the user has to provide it explicitly.

Istio’s solution: Protocol Sniffing

Protocol Sniffing Overview:

  • Detect the TLS `CLIENT_HELLO` and extract SNI, ALPN, NPN, and similar information
  • Try to detect plaintext application-layer content against the known structure of common protocols: check for the HTTP/2 connection preface (per the HTTP/2 spec), and check the HTTP header structure to decide whether it is HTTP/1.x
  • Timeout control and a packet-size limit are applied during sniffing; if nothing is identified, the connection falls back to being treated as plain TCP
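The plaintext part of the idea can be sketched in a few lines. This is a toy illustration only, not Envoy's actual detector (which also inspects TLS and applies the timeouts and size limits mentioned above); the HTTP/2 preface bytes are per RFC 7540 §3.5:

```python
# The fixed byte sequence every HTTP/2 connection must start with (RFC 7540 §3.5).
HTTP2_PREFACE = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"

# Plausible starts of an HTTP/1.x request line.
HTTP1_METHODS = (b"GET ", b"POST ", b"PUT ", b"DELETE ", b"HEAD ",
                 b"OPTIONS ", b"PATCH ", b"CONNECT ", b"TRACE ")

def sniff(first_bytes: bytes) -> str:
    """Toy protocol sniffer: classify the first bytes of a connection."""
    if first_bytes.startswith(HTTP2_PREFACE):
        return "http2"
    if first_bytes.startswith(HTTP1_METHODS):
        return "http"
    return "tcp"  # nothing recognized: fall back to raw TCP handling
```

The second failure case below follows directly from this structure: a private protocol whose first packet happens to start with `GET ` would be classified as HTTP even though the rest of the stream is not.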

Best practices

Protocol sniffing reduces the configuration needed for newcomers to use Istio, but it may introduce nondeterministic behavior, which should be avoided in a production environment.

Some examples of sniffing failures:

  • The client and server use some non-standard layer-7 protocol that both sides parse correctly, but that Istio's automatic sniffing logic is not guaranteed to recognize. Take HTTP as an example: the standard line break is CRLF (0x0d 0x0a), but many HTTP libraries emit and accept a bare LF (0x0a) as the separator.
  • Some custom private protocols start with data that resembles an HTTP packet, while the rest of the stream is in a proprietary format. With sniffing disabled, the stream is routed as plain L4 TCP and behaves as the user expects. With sniffing enabled, the stream is initially identified as L7 HTTP, the subsequent data fails to parse as HTTP, and the traffic is interrupted.

You are advised not to rely on protocol sniffing in production. Services joining the mesh should name their ports with the proper protocol prefix.

2. The delivery order of traffic rules is incorrect

Abnormal description

During batch updates of traffic rules, a short-lived traffic exception (503) occasionally appears, with the Envoy RESPONSE_FLAGS field containing the "NR" flag (No Route configured); it recovers automatically after a short time.

Cause analysis

When you apply a yaml containing multiple VirtualService and DestinationRule objects with kubectl apply -f, the order in which these objects propagate and take effect is not guaranteed. For example, if a subset defined in a DestinationRule is referenced in a VirtualService, the propagation and activation of the DestinationRule may lag behind that of the VirtualService.

Best practice: Make before break

Split the update from a single batch step into multiple steps, ensuring that no nonexistent subset is referenced at any point:

When adding a DestinationRule subset, apply the DestinationRule first, and only reference the subset from the VirtualService after the subset has taken effect.

Before deleting a DestinationRule subset, first remove the reference to that subset from the VirtualService and wait for the change to take effect, then delete the subset from the DestinationRule.
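For example, to add a subset `v2` (all names below are illustrative), push the DestinationRule first and, only after it has propagated, push the VirtualService that references `v2`:

```yaml
# Step 1: apply the DestinationRule and wait for it to take effect
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: demo-dr
spec:
  host: demo-svc
  subsets:
  - name: v1
    labels: {version: v1}
  - name: v2                  # the newly added subset
    labels: {version: v2}
---
# Step 2: only now reference the new subset from the VirtualService
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: demo-vs
spec:
  hosts: ["demo-svc"]
  http:
  - route:
    - destination: {host: demo-svc, subset: v2}
```

Deletion runs the same steps in reverse: update the VirtualService away from the subset, wait, then drop the subset from the DestinationRule.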

3. Request interruption analysis

When a request fails, is it caused by Istio flow control rules or by the business application's own response? And on which specific pod does the break occur?

This is the most common dilemma with a mesh: introducing Envoy proxies into microservices makes it hard for users to quickly locate the problem when traffic does not behave as expected. An abnormal response received by the client, such as 403, 404, 503, or a connection reset, may be the result of flow control by any sidecar along the path, or it may be a legitimate response from a service.

Envoy flow model

Traffic passing through an Envoy is divided into Downstream and Upstream phases. In each phase there are two traffic endpoints, the originating end and the receiving end of the connection:

During this process, Envoy computes a set of eligible forwarding destinations, the UPSTREAM_CLUSTER, based on user rules, and then selects one host from that set according to the load-balancing policy as the endpoint to forward traffic to. That host is the UPSTREAM_HOST.

Together these form the five-tuple of a request processed by Envoy. They are the most important fields in the Envoy access log; from them we can see exactly where the traffic came from and where it was going:

  • UPSTREAM_CLUSTER
  • DOWNSTREAM_REMOTE_ADDRESS
  • DOWNSTREAM_LOCAL_ADDRESS
  • UPSTREAM_LOCAL_ADDRESS
  • UPSTREAM_HOST
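These fields can be surfaced in the access log via the corresponding Envoy command operators. Below is a sketch of an Istio `meshConfig` access-log format; the exact placement of `meshConfig` depends on your install method (e.g. under `spec.meshConfig` in an IstioOperator resource):

```yaml
meshConfig:
  accessLogFile: /dev/stdout
  accessLogFormat: >
    [%START_TIME%] "%REQ(:METHOD)% %REQ(:PATH)%" %RESPONSE_CODE% %RESPONSE_FLAGS%
    cluster=%UPSTREAM_CLUSTER% up_host=%UPSTREAM_HOST%
    down_remote=%DOWNSTREAM_REMOTE_ADDRESS% down_local=%DOWNSTREAM_LOCAL_ADDRESS%
    up_local=%UPSTREAM_LOCAL_ADDRESS% req_id=%REQ(X-REQUEST-ID)%
```

With `RESPONSE_FLAGS` and the five-tuple in every log line, the examples below can be diagnosed by reading a single entry.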

Log Analysis Example

Two key questions can be answered from the logs:

  • Where is the break point?
  • What’s the reason?

Example 1: A normal client-server request

You can see that the logs on both ends contain the same request ID, so the two entries can be chained together for traffic analysis.

Example 2: No healthy upstream, for example, a target deployment with 0 available replicas

Flag “UH” in the log indicates that there is no healthy host in the upstream cluster.

Example 3: No route configured, for example, DestinationRule lacks a corresponding subset

Flag NR in logs indicates that no route is found.

Example 4: Upstream connect failure, for example, the service is not listening properly on the port.

In logs, flag UF indicates that the upstream connection failed, which pinpoints the traffic breakpoint.

4. Startup sequence of sidecar and user containers

Abnormal description

The sidecar pattern is popular in the Kubernetes world, but current Kubernetes (v1.17) has no first-class notion of a sidecar; the sidecar role of a container is assigned subjectively by the user.

A common frustration for Istio users is the startup sequence of sidecar and user containers:

The order in which the sidecar and the user container start is not deterministic. If the user container starts first and sends a request while the sidecar (Envoy) is not ready yet, the request is still intercepted and redirected to the not-yet-started Envoy, and the request fails.

Similar exceptions occur during the Pod termination phase, again due to the uncertainty of the sidecar and normal container life cycles.

The solution

At present, the common workarounds are:

  • Delay the business container's startup by a few seconds, or retry on failure
  • Have the startup script actively probe whether Envoy is ready, e.g. 127.0.0.1:15020/healthz/ready
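The second workaround can be sketched as a small polling helper. The readiness endpoint `127.0.0.1:15020/healthz/ready` is the one named above; the `fetch` parameter is an illustrative hook for testing without a real sidecar:

```python
import time
import urllib.request

def wait_for_envoy(url="http://127.0.0.1:15020/healthz/ready",
                   timeout_s=60, interval_s=1.0, fetch=None):
    """Block until the Envoy sidecar's readiness endpoint answers 200.

    `fetch` is injectable for testing; by default it performs a real HTTP GET
    and returns the status code.
    """
    if fetch is None:
        def fetch(u):
            return urllib.request.urlopen(u, timeout=2).status
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if fetch(url) == 200:
                return True          # sidecar ready: safe to start the app
        except OSError:
            pass                     # sidecar not listening yet
        time.sleep(interval_s)
    return False                     # gave up waiting
```

An entrypoint script would call this (or an equivalent `curl` loop) before exec'ing the business process.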

Neither workaround is ideal. To solve the pain point once and for all, the Kubernetes community planned a built-in sidecar feature starting with version 1.18: it changes the pod startup lifecycle so that sidecar containers start after the init containers complete, and business containers start only after the sidecar containers are ready, guaranteeing the startup order. In the pod termination phase, the SIGTERM signal is sent to sidecar containers only after all normal containers have reached a terminated state.

5. Interworking between the Ingress Gateway and the Service ports

An Ingress Gateway rule may not take effect because the port the Gateway listens on is not opened on the corresponding Kubernetes Service. To understand this, we first need the relationship between the Istio Ingress Gateway and the Kubernetes Service.

In the figure above, although the Gateway declares that it wants to manage ports B and C, its corresponding Service (exposed through a Tencent Cloud CLB) only opens ports A and B. As a result, only the traffic entering via LB port B can actually be managed by the Istio Gateway.

  • An Istio Gateway and a Kubernetes Service are not directly related; both bind to pods via selectors and are therefore only indirectly associated
  • The Istio Gateway CRD only delivers the user's flow control rules to the mesh edge nodes; traffic still has to enter the mesh through the LB
  • Tencent Cloud TKE Mesh implements dynamic port linkage between the Gateway definition and the Service, letting users focus on in-mesh configuration
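A sketch of the mismatch (ports and names are illustrative): the Gateway declares two ports, but the ingress gateway's Service only exposes one of them, so only that one is reachable:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: demo-gw
spec:
  selector:
    istio: ingressgateway
  servers:
  - port: {number: 8080, name: http-b, protocol: HTTP}  # port B: also open on the Service, works
    hosts: ["*"]
  - port: {number: 9090, name: http-c, protocol: HTTP}  # port C: NOT open on the Service, unreachable
    hosts: ["*"]
---
apiVersion: v1
kind: Service
metadata:
  name: istio-ingressgateway
spec:
  selector:
    istio: ingressgateway
  ports:
  - name: http-b
    port: 8080          # only port B is exposed by the Service/LB
```

The Gateway's port C rule is delivered to the edge Envoy, but no traffic can ever reach it.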

6. VirtualService scope

VirtualService carries most of the routing rules; it can apply either to the data-plane proxies inside the mesh or to the proxies at the mesh edge.

The gateways attribute of a VirtualService specifies its scope of effect:

  • If VirtualService.gateways is empty, Istio assigns it the default value mesh, meaning the rules take effect inside the mesh
  • If you want the VirtualService to apply to a specific edge gateway, you must assign the gateway names explicitly: gateway-name1,gateway-name2...
  • If you want the VirtualService to apply both inside the mesh and on edge gateways, you must explicitly add the value mesh to VirtualService.gateways, e.g. mesh,gateway-name1,gateway-name2...

A common problem is the third case: the VirtualService initially works inside the mesh, but when its rules are extended to an edge gateway, users often add only the specific gateway name and forget to keep mesh.

Istio sets a default value for VirtualService.gateways to simplify configuration, but a convenience feature can end up behaving like a bug.
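The third case can be written as follows (gateway name illustrative): keeping `mesh` in the list preserves the in-mesh behavior while extending the rules to the edge gateway:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: demo-vs
spec:
  hosts: ["demo-svc"]
  gateways:
  - mesh            # keep this, or the rules stop applying inside the mesh
  - demo-gw         # the edge gateway the rules should also apply to
  http:
  - route:
    - destination: {host: demo-svc}
```

Omitting `mesh` here is exactly the mistake described above: the edge gateway starts routing, and the sidecars silently stop.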

7. VirtualService host fragmentation is not well supported

Abnormal case

After adding or modifying a VirtualService for a host, you may find that the rules do not take effect. If other VirtualServices already declare rules for the same host, the rule contents may not conflict, yet some of the rules silently fail to apply.

Background

  • VirtualService rules are aggregated by host
  • As the business grows, VirtualService content grows rapidly. Flow control rules for one host may be maintained by different teams; for example, security rules are separated from business rules, and different services are split by subpath

Istio's handling of cross-resource VirtualService definitions:

  • At the mesh edge (gateway), flow control rules for the same host can be spread across multiple VirtualService objects. Istio aggregates them automatically, but the result depends on definition order, and users must avoid conflicts themselves
  • Inside the mesh (sidecar), flow control rules for the same host cannot be spread across multiple VirtualService objects; if several VirtualServices target the same host, only the first one takes effect, and there is no conflict detection

VirtualService therefore does not support host sharding well, and team maintenance responsibilities cannot be cleanly decoupled: a configurator must know all the flow control rules for the target host before modifying a VirtualService with confidence.

Istio: VirtualService chaining (planned for 1.6)

Istio plans to support VirtualService delegation chains in 1.6:

  • VirtualServices support sharding plus delegation chains
  • Teams can flexibly shard VirtualServices for the same host, for example by SecOps/NetOps/business, with each team maintaining its own independent VirtualService
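Assuming the delegation feature as shipped in Istio 1.6, a root VirtualService might hand subpaths to per-team resources roughly like this (all names illustrative; note that a delegate VirtualService defines no hosts or gateways of its own):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: root-vs
spec:
  hosts: ["example.com"]
  gateways: ["demo-gw"]
  http:
  - match:
    - uri: {prefix: /shop}
    delegate:               # hand /shop traffic to the shop team's VirtualService
      name: shop-vs
      namespace: shop
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: shop-vs
  namespace: shop
spec:                       # no hosts/gateways: this resource only exists as a delegate
  http:
  - route:
    - destination: {host: shop-svc}
```

Each team then owns only its own delegate resource, without needing to know the whole host's rule set.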

8. Full-link tracing is not fully transparent

Abnormal case

After a microservice joins the service mesh, its distributed tracing data does not join up into a single trace.

Why

In the Service Mesh telemetry system, call-chain tracing is not completely zero-intrusion; it requires a small modification in the user's business code. Specifically, when issuing an (HTTP/gRPC) RPC, the application must copy the B3 trace headers present on the upstream (inbound) request into the downstream (outbound) request headers. These headers include x-request-id and the B3 headers x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled, and x-b3-flags.

Some users find it hard to understand why the application must pass these headers along explicitly, given that inbound and outbound traffic is already fully intercepted by Envoy, which can fully control and modify it.

To Envoy, inbound and outbound requests are completely independent; the relationship between them is invisible to Envoy. In fact, only the application can decide whether requests are causally related. Consider a special business scenario: Pod X receives request A, and its business logic is to send a request to Pod Y every 10 seconds, say B1, B2, B3. How do these fan-out requests Bx (x = 1, 2, 3...) relate to request A? The business may consider A the parent request of each Bx, or treat each Bx as an independent top-level request.
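A minimal sketch of the required application change (Python purely for illustration; the header names are the standard B3 set from Istio's tracing docs): copy the trace headers from the inbound request onto every outbound request.

```python
# Headers the app must forward so Envoy/the tracing backend can stitch
# inbound and outbound spans into one trace.
TRACE_HEADERS = frozenset([
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
])

def propagate(inbound_headers: dict) -> dict:
    """Return the subset of inbound headers to copy onto downstream RPCs."""
    return {k: v for k, v in inbound_headers.items()
            if k.lower() in TRACE_HEADERS}
```

An HTTP handler would call `propagate(request.headers)` and merge the result into the headers of every RPC it issues on behalf of that request.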

9. mTLS causes connection interruption

In scenarios where Istio mTLS is enabled, connection termination is a high-frequency exception:

The cause is related to the mTLS configuration in DestinationRule, which is a fragile piece of interface design in Istio.

  • When global mTLS is enabled through MeshPolicy, mTLS works fine as long as no other DestinationRule exists in the mesh
  • If a DestinationRule is later added to the mesh, its TLS settings can override the global mTLS setting per subset (and the default is disabled!). Users tend to pay no attention to the TLS attributes when writing a DestinationRule (leaving them blank), so mTLS silently becomes disabled once the DestinationRule is applied, resulting in connection termination
  • To fix this, the user has to add the TLS attribute and enable it explicitly in every DestinationRule
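The workaround in the last bullet looks like this (names illustrative): every DestinationRule must explicitly restate ISTIO_MUTUAL, or it silently turns mTLS off for its host:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: demo-dr
spec:
  host: demo-svc
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL   # must be restated, otherwise this rule overrides global mTLS
  subsets:
  - name: v1
    labels: {version: v1}
```

Leaving `trafficPolicy.tls` blank here is exactly what triggers the connection terminations described above.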

This mTLS user interface is extremely unfriendly. Although mTLS is transparent by default globally, so the business is unaware of its existence, the moment the business defines a DestinationRule it must know whether mTLS is currently enabled and adjust accordingly. Imagine the security team owning the mTLS configuration while business teams own their DestinationRules: the coupling between the teams becomes severe.

10. Restriction on the user service's listening address

Abnormal description

If the service process in the user container listens on the pod IP rather than 0.0.0.0, traffic cannot reach it through Istio and routing fails. This is another scenario that challenges Istio's design goal of maximizing transparency.

Cause analysis

The iptables rules in the istio-proxy container:

ISTIO_IN_REDIRECT points to virtualInbound on port 15006; ISTIO_REDIRECT points to virtualOutbound on port 15001.

The key is rule two: if the destination is not 127.0.0.1/32, forward to port 15006 (virtualInbound, where Envoy listens). This causes any traffic addressed to the pod IP to be redirected back into Envoy.

Interpretation of this rule:

# Redirect app calls back to itself via Envoy when using the service VIP
# or endpoint address, e.g. appN => Envoy (client) => Envoy (server) => appN.

The rule is intended for this scenario: suppose the current Pod A belongs to Service A, and the user container in the pod calls Service A by its service name; Envoy's load-balancing logic may forward that call back to the current pod's IP. Istio expects this scenario to still have server-side traffic management. As shown:
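The underlying socket behavior can be demonstrated directly: a listener bound to one specific address (like a pod IP) rejects connections addressed to any other local address, while a 0.0.0.0 listener accepts them all. This sketch uses loopback aliases (127.0.0.x), which Linux routes by default:

```python
import socket

def can_connect(bind_addr: str, connect_addr: str) -> bool:
    """Bind a listener to bind_addr, then try to connect to connect_addr
    on the same port. Returns whether the connection was accepted."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((bind_addr, 0))           # port 0: let the OS pick a free port
    srv.listen(1)
    port = srv.getsockname()[1]
    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    cli.settimeout(0.5)
    try:
        cli.connect((connect_addr, port))
        return True                    # listener accepted the connection
    except OSError:
        return False                   # refused: nothing listening on that address
    finally:
        cli.close()
        srv.close()
```

By analogy, an app bound only to its pod IP refuses the redirected traffic Envoy delivers to a different local address, whereas binding to 0.0.0.0 works in both cases.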

Modification Suggestions

You are advised to set the service listening address to 0.0.0.0 rather than a specific IP before onboarding the application to Istio. If the business side finds this change difficult, refer to the solution shared earlier: routing exception analysis for services listening on the pod IP in Istio.
