Background

Recently, I started studying Istio. I had already installed and configured Istio in a cluster that previously did not have it, and everything worked normally.

However, in a cluster whose Istio had just been upgraded to 1.11.0, none of the pods injected with the Istio sidecar could access HTTPS. Configuration could not be delivered; the underlying IP connection worked, but the TLS handshake failed.

  • When the configuration could not be delivered, a 404 was returned.
  • After some time, TLS could not be established and 000 was returned.
  • No external HTTPS service could be accessed.

After some initial debugging, the behavior became even stranger, and it differed from pod to pod:

Since the shell output from the time of the incident is no longer available, only a textual description follows.

  • During the TLS handshake, the server abruptly closed the connection.
  • During the TLS handshake, the server returned a certificate that did not match the requested host. For example, curl https://baidu.com reported that the server certificate was for *.example.com.
  • After skipping certificate validation, the server responded with unexpected content.

The analysis process

The preliminary analysis

Since external HTTPS could not be accessed, the first suspicion was that some access policy or other security-related configuration had been set in Istio, or that the HTTPS proxying was misconfigured.

For some time after the problem occurred, we compared the Istio configuration (all CRDs) of a healthy Istio cluster with that of the problem cluster. No special configuration was found.

However, the problem cluster did have a few special characteristics:

  1. A ServiceEntry is configured in Istio for external access (public network services), as follows:

    $ kubectl -n default get serviceentries.networking.istio.io
    NAME                       HOSTS                                                                                   LOCATION        RESOLUTION   AGE
    external-svc-https-qqapi   [baike.baidu.com www.baidu.com www.bing.com apis.baidu.com nlp.xiaoi.com api.map.baidu.com music.baidu.com api.spotify.com api.cognitive.microsoft.com dict.baidu.com baike.baidu.com ...]   MESH_EXTERNAL                72m

    Some private domain names have been removed; there are dozens of domain names in this configuration.

    When we asked around, the answer was: these domain names need to be added to a ServiceEntry so that services in the mesh can access external services normally. (A sketch of such a ServiceEntry is shown after this list.)

  2. The cluster also runs a dns-controller, a controller developed in-house some time ago for cross-cluster DNS discovery and similar tasks. For this case it essentially gives in-mesh domain names something like a CNAME. For example, service-api runs in the abc namespace, so its default access domain is service-api.abc; inside Istio it also needs to be reachable as service-api.hello. The controller therefore generates a ServiceEntry for the service in that namespace to act as the "CNAME".
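
For context, the MESH_EXTERNAL ServiceEntry from item 1 looks roughly like the sketch below, trimmed to two hosts; any field not visible in the kubectl output above is an assumption.

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  namespace: default
  name: external-svc-https-qqapi
spec:
  hosts:                # trimmed; the real object lists dozens of domains
    - www.baidu.com
    - api.spotify.com
  location: MESH_EXTERNAL
  ports:
    - name: https-443   # port name and protocol assumed
      number: 443
      protocol: TLS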

Here comes the first question: why does a ServiceEntry have to be added to access a public domain name?

By default, Istio does not restrict mesh workloads from accessing the public network, and this configuration is not required in our healthy clusters.
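
For reference, whether a ServiceEntry is required for external access is governed by the mesh-wide outboundTrafficPolicy. The excerpt below is illustrative rather than taken from the problem cluster; ALLOW_ANY is Istio's default and lets workloads reach hosts that are not in the service registry, while REGISTRY_ONLY would require a ServiceEntry.

meshConfig:
  outboundTrafficPolicy:
    mode: ALLOW_ANY   # REGISTRY_ONLY would block external hosts without a ServiceEntry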

Before this, sidecar memory usage had exceeded 1 GB, because a sidecar receives the configuration of the entire mesh by default. To reduce memory usage, we had already used a Sidecar resource to limit the configuration to the current namespace:

apiVersion: networking.istio.io/v1beta1
kind: Sidecar
spec:
  egress:
    - hosts:
        - "./*"
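
In this Sidecar resource, "./*" means only services in the sidecar's own namespace are pushed to it. One rough way to gauge the effect is to compare how many clusters the sidecar receives before and after applying it; the pod name below is only a placeholder.

$ istioctl proxy-config clusters <pod-name>.<namespace> | wc -l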

Trying to solve it

After the problem arose, we tried several directions:

  1. Some feedback suggested that the problem was that a ServiceEntry had not been added, meaning the ServiceEntry for accessing the public network. We tried adding ServiceEntries for the affected domain names, but the problem persisted for other domains. Still, it was an improvement: the domains that were added could now be accessed successfully. In other words, it helped, but it did not solve everything, and the first question remained unanswered.
  2. A new namespace, istio-demo, was created and the bookinfo sample application was deployed (the deployment steps are sketched after this list). The sample deployed successfully, but the same problem occurred there as well.
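
For reference, deploying the bookinfo sample into the new namespace followed roughly the standard steps from the Istio release bundle (paths as shipped in the Istio distribution):

$ kubectl create namespace istio-demo
$ kubectl label namespace istio-demo istio-injection=enabled
$ kubectl -n istio-demo apply -f samples/bookinfo/platform/kube/bookinfo.yaml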

In later debugging, we tried to establish a TLS connection from inside a pod.

Besides the case where the server closed the connection directly during the handshake, we also encountered:

$ curl https://baidu.com
SSL: certificate subject name '*.example.com' does not match target host name 'baidu.com'

Here comes the second question: where does '*.example.com' come from? Is there a man in the middle? If so, who is it, and where is that certificate configured?
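
To look more closely at exactly which certificate a host serves, something like openssl s_client can be run from inside the pod. This is a generic check rather than output from the incident:

$ openssl s_client -connect baidu.com:443 -servername baidu.com </dev/null 2>/dev/null \
    | openssl x509 -noout -subject -issuer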

The turning point

Since the problem manifested as HTTPS failing to establish a connection, our attention was drawn toward things like TLS proxies.

Until:

$ curl -k  https://baidu.com
404 page not found

This was a request to a well-known domain name, so why did it return "404 page not found"? That response looked like it came from some internal service written in Go.

Accessing other HTTPS domains had the same effect, as if all HTTPS traffic had been diverted to some internal service.

Istio sidecar configuration dump

During this process, we used the istioctl proxy-status command to check the sidecars' configuration synchronization status, and no obvious anomaly was found.

In frustration, I tried using the istioctl proxy-config command to inspect the sidecar configuration, i.e., the Envoy configuration.

$ istioctl proxy-config all sample-c59f744df-8kq7m.istio-demo
...
0.0.0.0      443   SNI: music.baidu.com             Cluster: outbound|443||music.baidu.com
0.0.0.0      443   SNI: i.xiaoi.com                 Cluster: outbound|443||i.xiaoi.com
0.0.0.0      443   ALL                              Cluster: outbound|443||traefik.cutom-name
0.0.0.0      443   SNI: dict.baidu.com              Cluster: outbound|443||dict.baidu.com
0.0.0.0      443   SNI: datarobotapi.bdia.com.cn    Cluster: outbound|443||datarobotapi.bdia.com.cn
...

In the output of this command, among thousands of rules, one stands out: 0.0.0.0 443 ALL Cluster: outbound|443||traefik.cutom-name.

Intuition said something was wrong with this rule: it forwards traffic on port 443 to traefik.cutom-name, the host created by the dns-controller's "CNAME" ServiceEntry.

Because I did not understand Istio's Envoy rules well at the time, I could only guess; someone familiar with them would likely have seen the problem at a glance.
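
If the full dump is overwhelming, the output can be narrowed down to the port 443 listener rules, for example (using the same pod as above):

$ istioctl proxy-config listeners sample-c59f744df-8kq7m.istio-demo --port 443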

If we could prove that the "404 page not found" came from Traefik, then this rule was the problem. So:

$ curl -k -vvv https://baidu.com
...
* Server certificate:
*  subject: ...=traefik...
...
404 page not found

The debug information printed by curl shows the TLS certificate used by the server, and its Subject contains "traefik". In other words, the certificate is the one self-issued by Traefik, and the backend service responding is Traefik.

It can be concluded that some incorrect configuration resulted in a mapping of 0.0.0.0:443 -> traefik.cutom-name.

I only understood the purpose of these rules later, after reading Envoy's LDS documentation. The rule 0.0.0.0 443 ALL Cluster: outbound|443||traefik.cutom-name means: match any destination IP address, port 443, and any SNI, and route that mesh traffic to the destination service traefik.cutom-name.
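
The same filter-chain match can also be inspected in Envoy's own terms by dumping the admin configuration from the istio-proxy container; this is a generic sketch, and the resulting file is large.

$ kubectl -n istio-demo exec sample-c59f744df-8kq7m -c istio-proxy -- \
    curl -s localhost:15000/config_dump > config_dump.json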

Becoming clear

Since traefik.cutom-name is related to the dns-controller mentioned above, it must be related to the ServiceEntry that the controller generates. Take one of them:

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  namespace: abc
  name: traefik-cutom-name
spec:
  hosts:
    - traefik.cutom-name
  location: MESH_INTERNAL
  ports:
    - name: https-443
      number: 443
      protocol: TLS
  resolution: DNS

The dns-controller generates the ServiceEntry above. At first glance it does not look problematic, but something is definitely wrong.

After scrolling through the ServiceEntry documentation, I finally found the problem:

It is set to MESH_INTERNAL but does not specify the backend of the service: neither the DNS name of the target service nor an IP address is given. This ServiceEntry exists to act as a "CNAME", yet it never specifies what the "CNAME" should point to.

There are two solutions:

  1. Specify the address as the Kubernetes pod IP (a rough sketch of this option follows the list).
  2. Specify the DNS name of the target service, in this case traefik.abc, where abc is the namespace name.
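
A rough sketch of what the first option might look like is shown below; the IP address is hypothetical, and since pod IPs change, the controller would have to keep it up to date.

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  namespace: abc
  name: traefik-cutom-name
spec:
  hosts:
    - traefik.cutom-name
  location: MESH_INTERNAL
  ports:
    - name: https-443
      number: 443
      protocol: TLS
  resolution: STATIC        # endpoints are given as fixed IP addresses
  endpoints:
    - address: 10.0.0.10    # hypothetical pod IP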

We did not need the first option. Using the second option, the ServiceEntry needs to be changed to:

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: traefik-cutom-name
  namespace: abc
spec:
  endpoints:
    - address: traefik.abc # specify the endpoint; see the Istio documentation for details
  hosts:
    - traefik.cutom-name
  location: MESH_INTERNAL
  ports:
    - name: https-443
      number: 443
      protocol: TLS
  resolution: DNS

After fixing the problem, check the config dump again:

$ istioctl proxy-config all sample-55c7969547-z92zk.istio-demo
0.0.0.0      80    ALL                          PassthroughCluster
0.0.0.0      443   ALL                          PassthroughCluster
0.0.0.0      443   SNI: traefik.cutom-name      Cluster: outbound|443||traefik.cutom-name

The rules now look correct. PassthroughCluster is a special cluster that indicates traffic passes through Istio without modification.

Analysis

The questions that came up during the debugging process can now also be answered.

  • Q: Why did we need to add a ServiceEntry to access a public domain name?

    Previously, because of this bug, HTTPS traffic to the public network was forwarded to an internal service, so external services could not be reached without a MESH_EXTERNAL ServiceEntry. Adding a MESH_EXTERNAL ServiceEntry inserts a more specific matching rule, so the match succeeds and the traffic leaves Istio correctly. Example: 0.0.0.0 443 SNI: music.baidu.com Cluster: outbound|443||music.baidu.com

    After this bug is fixed, ServiceEntries no longer need to be configured for public network services.

  • Q: Where does '*.example.com' come from?

    *.example.com is actually the certificate used by one of the backend services; because of the misconfiguration, traffic was distributed to the wrong place. In fact, all HTTPS connections were going to the wrong backend.

  • Q: Why did different pods behave inconsistently, some failing the handshake and some returning the wrong certificate?

    In the sidecar configuration there were multiple entries like 0.0.0.0 443 ALL. When the sidecar rules are generated, one backend is effectively picked at random as the service for this address. If the chosen backend is something like MySQL, the TLS handshake cannot succeed.

  • Q: Why is a ServiceEntry without a specified address matched as 0.0.0.0?

    This is by Istio's design; looking at the source code (conversion.go#L57), 0.0.0.0 is the default value it uses.

  • Q: Why was the problem only discovered after this upgrade?

    Because of the Sidecar resource. The same problem existed with the previous Istio version, but it was masked by the ServiceEntry for external HTTPS, which lives in the default namespace. After the upgrade we added the namespace-scoped Sidecar resource, so the ServiceEntry in the default namespace was no longer visible to other namespaces, and HTTPS traffic in those namespaces was diverted to the wrong backend service.