To better manage growing services and traffic, the Houzz infrastructure team recently migrated its web server services from Amazon Elastic Compute Cloud (Amazon EC2) to a Kubernetes cluster. The migration resulted in a 33% reduction in resource usage and a 30% improvement in home-page latency.

The overall architecture of the Kubernetes cluster consists of multiple applications, including front-end (FE) applications written in NodeJS and back-end (BE) services written in HHVM. The FE applications communicate with the BE services using the Apache Thrift protocol over HTTP. Horizontal Pod Autoscaling (HPA) is enabled for each application. Communication within the cluster and with external services is managed by Istio and passes through the Envoy sidecar.
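
As a concrete illustration of this setup, a minimal CPU-based HPA for one of the applications might look like the sketch below; the Deployment name, replica bounds, and target utilization are all hypothetical.

```yaml
# Minimal sketch of a CPU-based HPA, assuming a Deployment named "fe-app".
# The name, replica bounds, and target utilization are illustrative only.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: fe-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fe-app
  minReplicas: 10
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
```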

The migration posed many challenges, and the main purpose of this article is to share the best practices we developed along the way.

Delayed Pod startup

When we started the Kubernetes migration, we noticed that delayed Pod startup sometimes occurred on newly provisioned nodes. It took about six minutes for Envoy to become ready, which blocked the other containers from starting. From the Envoy logs, we observed that pilot-agent kept reporting that Envoy was not ready and suggested checking whether Istiod was still running.

We implemented a DaemonSet whose sole job was to resolve the FQDN of the Istiod service. From its metrics, we observed that DNS name resolution kept timing out for the first few minutes after a new node booted, and we assumed that Envoy was hitting the same timeout.
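
A minimal sketch of such a DaemonSet is shown below. It simply resolves the Istiod FQDN in a loop and logs the result; our actual probe exported metrics, and the names and image here are illustrative.

```yaml
# Sketch of a DNS-probe DaemonSet (names and image are illustrative).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: istiod-dns-probe
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod-dns-probe
  template:
    metadata:
      labels:
        app: istiod-dns-probe
    spec:
      containers:
      - name: probe
        image: busybox:1.32
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Resolve the Istiod service FQDN every 5 seconds and log the outcome.
          while true; do
            if nslookup istiod.istio-system.svc.cluster.local >/dev/null 2>&1; then
              echo "$(date) resolve ok"
            else
              echo "$(date) resolve FAILED"
            fi
            sleep 5
          done
```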

We determined that the root cause was dnsRefreshRate, which defaults to 5 minutes in Istio 1.5.2 and roughly matches the observed delay. Because the DNS client on a new node only becomes ready after some Pods have already started, the long retry interval prevented Envoy from detecting the DNS client's readiness in a timely manner. By forcing Envoy to retry more frequently, we reduced the additional Pod startup delay from 360 seconds to 60 seconds.

Note that in Istio 1.6, the default dnsRefreshRate has been changed to 5 seconds.
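
For reference, the refresh rate can be set at install time through the proxy values. The snippet below is a sketch using the IstioOperator API; the 60s value is shown to match the reduced delay described above and is an assumption, not necessarily the exact value we used.

```yaml
# Sketch: lowering Envoy's DNS refresh rate at install time.
# 60s is illustrative; Istio 1.6 later changed the default to 5s.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      proxy:
        dnsRefreshRate: 60s   # default in Istio 1.5.2 was 300s (5 minutes)
```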

HHVM Pod warm-up and cascading scaling

Our BE services run on HHVM, which has high CPU usage and high latency before its code cache warms up. The warm-up phase typically takes a few minutes, so it interacts badly with the default 15-second HPA synchronization period, the interval at which the HPA evaluates CPU usage metrics and adjusts the desired number of Pods. When new Pods were created because of increased load, the HPA detected their higher CPU utilization and scaled out even more Pods. This positive feedback loop continued until the new Pods were fully warmed up or the maximum number of Pods was reached. Once the new Pods finished warming up, the HPA detected a significant drop in CPU utilization and scaled in a large number of Pods. This cascading scaling caused instability and latency spikes.

We made two changes to address the cascading scaling problem. First, we improved the HHVM warm-up process following the official recommendations: CPU usage during warm-up dropped from 11 times normal usage to 1.5 times, and CPU usage once a Pod started serving traffic dropped from 4 times normal usage to 1.5 times.

Second, we increased the HPA synchronization period from 15 seconds to 10 minutes. Although the HPA now responds more slowly to increased load, it avoids cascading scaling because most Pods finish warming up and return to normal CPU usage within 10 minutes. We found this to be a trade-off worth making.
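
The synchronization period is a controller-level setting rather than a per-HPA field. The excerpt below sketches the relevant kube-controller-manager flag; the surrounding manifest layout and image version are illustrative, and on managed control planes this flag may not be directly configurable.

```yaml
# Excerpt from a kube-controller-manager static Pod manifest: the HPA sync
# period is a controller-wide flag, not a per-HPA setting.
spec:
  containers:
  - name: kube-controller-manager
    image: k8s.gcr.io/kube-controller-manager:v1.18.0   # illustrative version
    command:
    - kube-controller-manager
    - --horizontal-pod-autoscaler-sync-period=10m   # default is 15s
```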

Load balancing

Load imbalance was the most notable challenge we encountered during the migration to Kubernetes, although it only occurred in the largest virtual services. The symptom was that some Pods failed the readiness check under heavy load, more requests were then routed to those Pods, and they oscillated between the ready and unready states. Adding more nodes or Pods in this situation only produced more flapping Pods. When this happened, latency and error counts increased significantly. The only way to mitigate the problem was to forcibly scale down the deployment to kill the constantly flapping Pods without adding new ones. However, this was not a sustainable solution, as more Pods soon started flapping. We rolled back the migration several times because of this problem.

To facilitate troubleshooting, we added extra logging that was triggered when the load became unbalanced and one availability zone (AZ) received significantly more requests than the other two. We suspected the imbalance was caused by a positive feedback loop in the least-request load-balancing strategy we were using at the time. We tried several other strategies (round robin, locality-aware, and random), but none of them solved the problem.
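
For reference, the load-balancing strategy is selected per destination, so switching among the simple strategies we tried is a one-line change in a DestinationRule, roughly as sketched below; the host name is hypothetical, and locality-aware balancing is configured separately.

```yaml
# Sketch: selecting the load-balancing strategy for the BE service
# (host name is hypothetical).
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: be-service
spec:
  host: be-service
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN   # alternatives we tried: LEAST_CONN, RANDOM
```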

After ruling out the load-balancing strategy, we looked for positive feedback loops in two other areas: retries of failed requests and outlier detection.

Although Istio’s official documentation states that failed requests are not retried by default, the actual default number of retry attempts is 2. Retries can cause cascading failures because more requests are sent after some requests fail. In addition, we observed behaviors in outlier detection (also known as passive health checking) that we could not explain, so we decided to disable both features. After that, the imbalance disappeared and we were able to migrate 95% of our requests to Kubernetes. We kept 5% on the old platform for performance comparison and tuning. Initially, we were not sure which of the two features, retries or outlier detection, was responsible for the load imbalance, although we now believe it was related to retries.
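
Disabling the two features amounted to setting the retry count to zero on the virtual service routes and removing the outlierDetection block from the corresponding destination rules. A sketch with hypothetical names is shown below.

```yaml
# Sketch: explicitly disabling retries for the BE routes (names hypothetical).
# Outlier detection was disabled simply by removing the outlierDetection
# block from the corresponding DestinationRule.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: be-service
spec:
  hosts:
  - be-service
  http:
  - route:
    - destination:
        host: be-service
    retries:
      attempts: 0   # override Istio's implicit default of 2 retry attempts
```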

After upgrading Istio to version 1.6, making some performance improvements, and migrating 100% of requests to Kubernetes, we tried re-enabling outlier detection, a risk we were willing to take because the change could be undone in seconds. At the time of this writing, we have not encountered the load imbalance again. That said, it is difficult to prove our theory, since the cluster configuration on the current version of Istio differs from the configuration in place when the imbalance occurred.

Performance degradation after releases

We observed that latency on Kubernetes increased over time after each release, so we built a dashboard showing the inbound/outbound latencies reported by the Envoys in the ingress gateway, the FE application Pods, and the BE service Pods. The dashboard showed that the overall increase was driven by the inbound latency reported by the Envoys in the BE Pods, which includes both the service's own latency and the latency added by Envoy itself. Since service latency did not increase significantly, the proxy latency was considered the driver of the increase. We also found that Envoy memory usage in the BE Pods grew over time with each release, leading us to suspect that the latency increase was caused by an Envoy memory leak in the BE Pods. We exec'd into a BE Pod and listed the connections held by the Envoy container and the main container: there were about 2,800 connections in Envoy and 40 in the main container. Of the 2,800 connections, the vast majority were to FE Pods (the BE Pods' clients).

To address the Envoy memory leak, we made a few changes (a configuration sketch follows this list):

  • Reduced the idleTimeout for connections between FE and BE Pods from the default of 1 hour to 30 seconds. This change reduced the number of errors and increased the request success rate, but it also increased the number of new connections per second between the FE and BE containers.

  • Reduced Envoy's concurrency (the number of worker threads) in the FE Pods from 16 to 2. This change offset most of the increase in connections per second introduced by the first change.

  • Set a 300 MB memory limit on the Envoy container in the BE Pods. This produced the expected behavior: when Envoy's memory usage exceeds the limit, the Envoy container is restarted while the main container keeps running, and memory usage stays low afterward. Some Pods are briefly unready while Envoy restarts, but this change complements the first two: they reduce Envoy's memory usage, while the limit restarts Envoy whenever its usage still grows beyond it. Restarting Envoy causes far less downtime than restarting the main container, which incurs a few minutes of HHVM warm-up time.
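
A minimal sketch of the three changes is shown below, assuming hypothetical service and Deployment names. The memory limit is expressed here as the mesh-wide sidecar default, although in practice it can also be scoped to individual workloads.

```yaml
# 1) 30-second idle timeout for FE -> BE connections (default is 1 hour).
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: be-service                 # hypothetical name
spec:
  host: be-service                 # hypothetical BE service host
  trafficPolicy:
    connectionPool:
      http:
        idleTimeout: 30s
---
# 2) Two Envoy worker threads in FE Pods, via a Pod-template annotation.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fe-app                     # hypothetical name
spec:
  selector:
    matchLabels:
      app: fe-app
  template:
    metadata:
      labels:
        app: fe-app
      annotations:
        proxy.istio.io/config: |
          concurrency: 2
    spec:
      containers:
      - name: app
        image: fe-app:latest       # hypothetical image
---
# 3) 300Mi memory limit on the Envoy sidecar, shown as the mesh-wide injector
#    default. When Envoy exceeds the limit, only the istio-proxy container is
#    restarted; the main container (and its warmed HHVM cache) keeps running.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      proxy:
        resources:
          limits:
            memory: 300Mi
```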

After addressing the post-release performance degradation, we migrated 100% of our requests to Kubernetes and shut down the old host environment.

Cluster bottlenecks

As we migrated more requests to the largest virtual service in Kubernetes, we began hitting resources that are shared cluster-wide among the virtual services, including the API server, the DNS server, and the Istio control plane. During one incident, we observed an error spike lasting one to two minutes across all virtual services, which we traced to failures resolving the DNS name of the BE virtual service from the FE Pods. The error spikes also correlated with DNS resolution errors and a drop in DNS request volume. Ideally, in-mesh service calls should not depend on DNS at all: the Envoy in the FE Pod should route the outbound HTTP request to the endpoints of the BE service regardless of which IP address the request targets. However, we found that the NodeJS Thrift client library performs a DNS lookup for the service IP even though that IP is never actually used. To take DNS out of the critical path, we deployed an Istio Sidecar resource that binds the BE service host in the virtual service to a local socket address.

Although Istio aims to be fully transparent to applications, this approach required small code changes: in addition to replacing the DNS name with a local IP address and port number in the application code, we also had to set the Host header explicitly. A side benefit of the Sidecar resource is optimized memory usage. By default, Istio pushes an upstream cluster for every service in the Kubernetes cluster to each Envoy, whether it is needed or not, and maintaining those unnecessary configurations carries a significant cost in Envoy container memory. With the Sidecar solution, we isolated DNS server failures from service calls on the critical path, reduced QPS on the DNS server from 30,000 to 6,000, and reduced average Envoy memory usage from 100 MB to 70 MB.
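
A rough sketch of such a Sidecar resource is shown below. It restricts the Envoy configuration in the FE Pods to the clusters they actually need and exposes the BE service on a local socket address, so the application can call 127.0.0.1 on a fixed port with an explicit Host header instead of resolving a DNS name. The namespaces, labels, port number, and host names are all hypothetical.

```yaml
# Sketch of a Sidecar resource for the FE workloads (all names hypothetical).
apiVersion: networking.istio.io/v1alpha3
kind: Sidecar
metadata:
  name: fe-sidecar
  namespace: frontend
spec:
  workloadSelector:
    labels:
      app: fe-app
  egress:
  - port:
      number: 8090
      protocol: HTTP
      name: http-be
    bind: 127.0.0.1        # local socket address; no DNS lookup needed
    hosts:
    - "backend/be-service.backend.svc.cluster.local"
  - hosts:
    - "istio-system/*"     # keep control-plane and telemetry destinations reachable
```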

Another error spike we encountered was related to inconsistent cluster membership information when a node terminated. Although Kubernetes should handle node terminations gracefully, there is a special case that causes an error spike: when an Istiod Pod is running on the terminating node. After the node terminated, some FE Pods took about 17 minutes to receive updates from the new Istiod Pod. Until they received those updates, they had inconsistent views of the BE cluster membership. Given this, it is likely that the cluster membership in these problematic FE Pods was stale, causing them to send requests to terminated or not-yet-ready BE Pods.

We found that the tcpKeepalive option plays a role in detecting terminated Istiod Pods. In our Istio settings, keepaliveTime, keepaliveProbes, and keepaliveInterval were set to their default values of 300 seconds, 9 probes, and 75 seconds, respectively. In theory, it can therefore take Envoy at least 300 seconds plus 9 times 75 seconds (16.25 minutes) to detect a terminated Istiod Pod. We solved this problem by customizing the tcpKeepalive options to lower values.

Building a large Kubernetes cluster is challenging, but also very rewarding. We hope you find our experience useful.

The original link: blog.houzz.com/challenges-…