A few years ago, running on Kubernetes was considered an impossible challenge at Zendesk. The prevailing view at the time was:

There are smaller things we can run on Kubernetes, but it doesn’t make sense for our original Ruby on Rails monolith. And even if we tried, we don’t know whether Kubernetes could handle it.

But everything changes, and the seeds were planted at the Zendesk Dev Leads Summit in 2017, where Jon Moter hosted a session titled “Zendesk’s Kubernetes.” That’s where I got the idea to explore further.

I started by using Kubernetes in my local development environment. We already had a Docker-based setup, and I just wanted to get the application to boot while running in Kubernetes. But it kept crashing, and I was stuck on questions like:

  • What do the various Kubernetes errors actually mean?
  • How do I view logs?
  • How do I figure out why something failed?

The first few months turned out to be about learning how to use Kubernetes itself, rather than about running our application on it. I hadn’t expected that at all, but it was fun.

My first goal was to get the application to boot in a local development environment without immediately throwing errors. Once that worked, I started exercising other APIs. Exceptions were being thrown that I couldn’t explain, and that’s when I started untangling configuration problems. I realized configuration had to be sorted out before Kubernetes could become part of our deployment environment.

Configuration issues

It’s worth a bit of background on configuration. Zendesk followed a path typical of Ruby on Rails applications: start with a few YAML files… then YAML files plus environment variables… then overrides for company-specific settings layered on top of those files. We had all of it, and we used Chef to “bake” the YAML files and environment variables onto EC2 instances across our data centers. The result: configuration management chaos.

The first thing you try is injecting all of the configuration through environment variables. That’s where we started, and it works well for a lot of things. Then you get to structured data, and things get interesting. In retrospect, the first challenge was our network configuration. Environment variable values are finite on UNIX systems: on most systems a single variable isn’t really limited, but the environment as a whole is, at tens of megabytes or so, and we were talking about megabytes of structured data. On top of that, our deployment tool, which injects the environment variables, stores each value in a text column capped at 2048 characters. In practice we couldn’t put all of our structured data into any single environment variable, so we had to split it up. That means breaking the data into chunks, one chunk per key. We ended up with about 20 keys sharing the same prefix, with the suffix indicating which part of the structure each chunk contains.

The original definition was a YAML file, along these lines.
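(A minimal sketch; the file name, keys, and hostnames are illustrative, not our real configuration.)

    # config/network.yml (hypothetical): structured network configuration,
    # maintained by Chef and read by the Rails app at boot.
    production:
      databases:
        primary:
          host: db-primary.example.internal
          port: 3306
        replica:
          host: db-replica.example.internal
          port: 3306
      caches:
        - host: memcached-1.example.internal
          port: 11211
        - host: memcached-2.example.internal
          port: 11211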

Then we tried expressing the same data as environment variables.
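(Again only a sketch; the variable names are hypothetical, shown here as the env section of a Kubernetes Deployment for illustration.)

    # Hypothetical: one chunk of the structure per variable, all sharing a
    # common prefix; the suffix names the part of the structure it carries.
    env:
      - name: ZENDESK_NETWORK_DATABASES
        value: '{"primary":{"host":"db-primary.example.internal","port":3306}}'
      - name: ZENDESK_NETWORK_CACHES
        value: '[{"host":"memcached-1.example.internal","port":11211}]'
      # ...and roughly 20 such keys in total, one per section of the original YAML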

This was great at first, because it covered 80% of our scenarios. But a configuration made up of 40 environment variables wasn’t the right answer, and it wasn’t going to get us into a pre-production environment, so we started using Kubernetes ConfigMaps.

“Network configuration” was the first big one we tackled: we wrote a small service that takes the YAML files maintained by Chef and loads them into ConfigMaps in the Kubernetes cluster. After that initial success, we started loading more ConfigMaps.
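
A rough sketch of the resulting shape; the names and contents are illustrative, not our actual manifests:

    # Hypothetical ConfigMap holding the Chef-maintained YAML.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: classic-network-config
    data:
      network.yml: |
        production:
          databases:
            primary:
              host: db-primary.example.internal
              port: 3306
    ---
    # Fragment of the Deployment's pod spec: mount it so the Rails app still
    # sees an ordinary config file on disk.
    volumes:
      - name: network-config
        configMap:
          name: classic-network-config
    containers:
      - name: classic
        volumeMounts:
          - name: network-config
            mountPath: /app/config/network.yml
            subPath: network.yml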

We hit some interesting edge cases around secret values that live in files. Until then we had been perfectly happy keeping secrets on disk with restricted file permissions, but with ConfigMaps we didn’t want to carry that habit over. We had a Chef recipe that produced the same “data shape” everywhere, and what it filled in depended on where it was evaluated: when it ran on the monolith’s EC2 instances it populated the secret values one way, and for other applications it populated them another way. The shape of the configuration was basically the same, but each application’s secrets were different.
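
In Kubernetes terms, the closest equivalent to a restricted file on disk is a Secret mounted as a volume with a tight file mode. A minimal sketch of that idea (the names and contents are illustrative, and not necessarily exactly what we shipped):

    # Hypothetical Secret holding a per-application credentials file.
    apiVersion: v1
    kind: Secret
    metadata:
      name: classic-database-credentials
    type: Opaque
    stringData:
      database.yml: |
        production:
          password: "not-a-real-password"
    ---
    # Fragment of the pod spec: mount it read-only with mode 0400, mirroring
    # the old "restricted file on disk" habit.
    volumes:
      - name: database-credentials
        secret:
          secretName: classic-database-credentials
          defaultMode: 0400
    containers:
      - name: classic
        volumeMounts:
          - name: database-credentials
            mountPath: /app/config/database.yml
            subPath: database.yml
            readOnly: true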

For Kubernetes to make it into our staging environment at all, we needed to narrow the scope to a minimum so we could build confidence. Our main focus was Unicorn web traffic, for several reasons. The first is pure observability: it is by far our most heavily instrumented code path, and the instrumentation keeps getting better. We have response time metrics, error rate metrics, per-endpoint metrics, and so on, which makes success easy to measure: “Does it work as well as it used to, or have I broken something?”

The second reason for choosing HTTP requests is that the unit of work is small: you get a request, you do something, you send a response. That makes them the easiest thing to roll back. We could watch the metrics as we rolled things out and flip traffic back with minimal impact, which is something we had to take into account.

With that observability giving us confidence, and with ConfigMaps in place, Kubernetes was suddenly running our original Ruby on Rails monolith (Classic) in a staging environment. Once it was up, we could deploy it reliably and send traffic to it. I remember it being an ordinary day when we said, “Oh, this actually works. Hello, Classic is now running on Kubernetes.”

More configuration issues

We had been running in staging for a while, finding and fixing plenty of bugs, when new kinds of bugs started to appear. For example, developer A changes the configuration, along with the code change that goes with it. Fine. Then developer B deploys code to our staging environment and picks up developer A’s configuration change without knowing anything about the accompanying code change… We ended up with a form of split brain: our configuration inputs were out of sync with the way we update our code. It got very confusing, very fast. We started seeing behavior change, not because of Kubernetes changes, but because of configuration changes.

Then something special happened; it was a turning point. A change to a README file was merged. It deployed automatically to our staging environment and then… suddenly, staging was broken. People were confused: “We changed a Markdown file, the most innocuous thing we could possibly have done. How does that break staging?” We immediately rolled back to the old tag, but… that didn’t fix it. The lesson is that in a complex environment, or really in any environment, versioning your configuration is critical. It illustrates just how hard configuration can be, and it ultimately led us to standardize configuration across the company.

Distributing the load

By this point we had reached a stage where we felt good and wanted to move into production. But before doing that, we needed to know how the new infrastructure would behave under load. Was the load balancing good enough? Would the proxy layer spread requests evenly, or would we end up with bottlenecks and request queuing?

This is interesting because of Zendesk’s history. On our previous infrastructure, the Ruby monolith scaled vertically to accommodate growth. At some point we bought dedicated hardware: bare-metal instances with 256GB of RAM and 64-bit CPUs, each running on the order of 100 Unicorn workers. We put a load balancer in front of it, but the proxy layer never had to be particularly clever about how it balanced, and it didn’t much matter, because we had that dedicated hardware. Part of that was a conscious decision, and part was simply the technology available at the time. That structure even survived our move to AWS, with an AWS Elastic Load Balancer (this was before Application and Network Load Balancers) in front of the EC2 instances.

In the Kubernetes world, that option isn’t available so easily. To make the application stable, we had to understand the configuration and all of its different characteristics, and work out its vertical and horizontal scaling behavior. That meant going back over how Unicorn works and how load is distributed at the NGINX layer. Revisiting decisions made eight years ago is painful but valuable, and it led to some good optimizations, such as changing NGINX’s load balancing strategy, which had a big effect on our resource utilization.

When we started migrating to Kubernetes, we tried to match the deployment to the previous environment. We made the pods as big vertically as we could, but unfortunately they were only about a quarter the size of what we had before. Once we started balancing requests across them, we worried about individual instances becoming completely saturated. So we built a clunky tool called the Classic API Traffic Generator. It takes a snapshot of the top 20 request types in production and looks at the shape and proportion of each. The generator then replays those requests against the staging environment, weighted so that the mix roughly matches production. Search is one of our biggest workloads, so the generator leans heavily on the search endpoints, and the overall request shape comes out roughly similar.
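
A toy sketch of the idea behind such a generator; the endpoints, weights, and staging URL below are made up, and this is not the actual tool:

    # Replay a production-like mix of requests against staging, weighted by how
    # often each request type appeared in a (hypothetical) production snapshot.
    require "net/http"
    require "uri"

    STAGING = "https://staging.example.com"

    # Top request types and their observed share of traffic (illustrative values).
    WEIGHTED_REQUESTS = {
      "/api/v2/search.json?query=status%3Aopen" => 0.40,
      "/api/v2/tickets.json"                    => 0.35,
      "/api/v2/users/me.json"                   => 0.25,
    }.freeze

    # Pick a path at random, in proportion to its weight.
    def pick_request
      r = rand
      WEIGHTED_REQUESTS.each do |path, weight|
        return path if (r -= weight) <= 0
      end
      WEIGHTED_REQUESTS.keys.last
    end

    1_000.times do
      path = pick_request
      response = Net::HTTP.get_response(URI(STAGING + path))
      puts "#{response.code} #{path}"
    end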

Staging turned up plenty of interesting discoveries. The first was that we hadn’t sized our resources properly. We run a certain number of Unicorn processes in a single Kubernetes pod, and that pod is allowed a certain amount of CPU. That’s fine until the load gets heavy: with enough requests queued up, the pod would crash, and we were left digging into it going, “I only sent it ten requests. Why is the app falling over?” In the end, the CPU and memory limits simply weren’t high enough for the application’s actual profile.
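
A sketch of the kind of sizing involved; the worker count and numbers below are illustrative, not our production values:

    # Fragment of a pod spec: if a pod runs, say, 10 Unicorn workers, the CPU
    # and memory requests/limits have to account for all of them, not just the
    # master process, or the pod gets throttled and killed under load.
    containers:
      - name: classic
        env:
          - name: UNICORN_WORKERS
            value: "10"
        resources:
          requests:
            cpu: "8"
            memory: 16Gi
          limits:
            cpu: "10"
            memory: 20Gi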

Another big challenge in reducing deployment time was the act of deploying to our production Kubernetes environment at all; in other words, getting a new set of code onto something already running in the Kubernetes cluster. Deploying an immutable codebase is actually more work, and takes more time, than changing files on disk, and when we first tested it at scale it was very slow. Over the years we had put a lot of time and effort into speeding up deployment by adding tooling that pre-stages build artifacts. That meant getting new code running on our dedicated EC2 hardware took only a few minutes: we update the code on disk, Unicorn finishes its in-flight requests, the worker processes restart, and the new code is running.

In the Kubernetes environment it’s a different story. You can’t just connect to a pod and send a signal to trigger something like a Unicorn hot reload, and even if we could, we couldn’t get the new code into an existing container, so all of that hot-reload machinery was lost to us. A full-scale deployment of Kubernetes pods means pulling images and creating pods across the whole fleet, and deployment times ranged anywhere from 2-3 minutes to 30 minutes. There is no silver bullet, but we took some important steps to speed deployment up.

One of the places Kubernetes deployments lost time was in cycling all of the instances. We would have to scale the cluster up, and we didn’t have the headroom to run a second copy of an application this large. The obvious answer is “just make the cluster permanently bigger.” The dream, of course, would be to run two full copies: start a brand new cluster, move traffic to it, validate it while the old cluster still exists, and send traffic back at the flick of a switch if there are any problems. At times we genuinely wondered whether we needed to double our EC2 footprint just to be sure we could deploy. But doubling our spend wasn’t an option, so we had to find another way forward.

With the help of our Kubernetes experts, we spent a lot of time working out how to take advantage of rolling deployments. Once we started down that path, the focus shifted to the container itself: “How do we help it start faster? How do we make it behave better in Kubernetes?”
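
A sketch of what that looks like in a Deployment spec; the replica count and batch sizes are illustrative:

    # Fragment of a Deployment spec: replace pods in batches instead of
    # standing up a whole second copy of the fleet.
    spec:
      replicas: 300            # illustrative fleet size
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 10%        # extra pods allowed during the rollout
          maxUnavailable: 0    # never drop below desired serving capacity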

There is a lot going on inside that container. The first thing we needed to understand was the Unicorn worker startup sequence: how long does it take from Kubernetes scheduling our workload to that workload actually being useful? It turned out boot took 60 to 90 seconds. That meant that during a rolling deployment, each new batch of pods needed up to 90 seconds before the old batch could be taken down: not a good thing when you are rolling through hundreds of pods.
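
Kubernetes decides when a new pod can take traffic, and when an old one can be retired, based on readiness. A sketch of the kind of readiness probe involved (the endpoint and timings here are illustrative, not our exact configuration):

    # Fragment of a pod spec (hypothetical endpoint and timings): a new pod
    # only counts as ready once the app actually answers, so a 60-90 second
    # boot stalls the rolling deployment batch by batch.
    containers:
      - name: classic
        readinessProbe:
          httpGet:
            path: /health/readiness
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 5
          failureThreshold: 6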

We started by producing a flame graph of the startup process locally and running the boot in various modes; in Rails terms, with and without eager loading. When building the Docker container, we tried to preload as much as possible. This is where we found one of the most effective ways to cut deployment time, which I still find funny. During Unicorn startup, after a worker is forked, it re-establishes all of its connections to the database and cache stores. And right there we found… a sleep. After the reconnect call there was a one-second sleep, with a comment: “Allow time for connection reset.” But we run 20 to 30 Unicorn workers in a container, and re-establishing a connection is effectively atomic: it happens when the reconnect call is made, and there is nothing to wait for. Removing that one sleep line immediately saved us about 30 seconds. Deciding to “just wait” is rarely the right choice when writing an application; at scale, it is definitely the wrong one.
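
A sketch of the shape of what we found; this is not our actual code, just an illustrative unicorn.rb after_fork hook with the kind of sleep we removed:

    # Hypothetical config/unicorn.rb fragment.
    after_fork do |server, worker|
      # Each freshly forked worker re-establishes its own connections.
      ActiveRecord::Base.establish_connection
      Rails.cache.reconnect if Rails.cache.respond_to?(:reconnect)

      # The line we removed: reconnecting happens synchronously when the calls
      # above return, so there is nothing to wait for. Multiplied across 20-30
      # workers, this added roughly half a minute to every boot.
      # sleep 1  # "Allow time for connection reset"
    end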

Eventually we got deployment times down to a level we were happy with and began ramping up in production. We did that gradually, mainly for safety: some code paths only ever run in production, not in staging, and we didn’t want to discover some horrible error there. We were shifting traffic generated by real customers and couldn’t risk making Zendesk slower or doubling the error rate.

Since we couldn’t afford to run two full copies of our EC2 footprint, rolling out gradually let us build confidence bit by bit. We would shift a small amount of traffic to validate it, let it sit for a few days, gain confidence from our instrumentation, and then shrink the old fleet by another 5 percent. We kept going like that, step by step, for three to four months.

It was like a game of leapfrog, or a ratchet: we would take a step up, and sometimes a step back. It slowed the rollout down, but in a good way. At the beginning we moved in increments of 1 or 2 percent; by the end, we were comfortable moving in increments of 10 or 20 percent.

Finally we reached the point where we could tell our outer proxy layer to send 100 percent of traffic to the new infrastructure. After nearly two years of effort, we were ready to shut down the EC2 instances… except the old infrastructure wasn’t done with us yet. The monolith contains most of the API, and besides our many external users we also have internal services making requests to it, and we didn’t really know how all of those internal services were reaching it. Now we had to hunt these stragglers down, find out why they were talking to us, and work out how to move them onto the new infrastructure. That step was frustrating, because we thought we were finished. But what I like about it is that it finally gave us a standard that ensures services talk to our monolith in a consistent way. For us, that is a significant architectural improvement.

We are now well into Kubernetes and seeing some unintended consequences. Reaching this new level of scale has meant changing many things, including the way we use Consul, our networking, DNS configuration, etcd clustering, node updates, and our cluster autoscaler.

Not all of the surprises were bad, though. Having the monolith run on the same infrastructure as all of our newer services means that when we build a common component, we can now target Kubernetes alone. We iterate faster, and everyone benefits from the same standard processes. For example, we recently reworked our logging infrastructure, and the monolith picked up all of the new logging essentially for free.

To sum up, we are now 100% on Kubernetes and it feels great. In the Kubernetes world we have autoscaling, and our background workers are more resilient. Kubernetes is also forcing us to be more consistent, which is a good thing. I recently archived the internal Slack channel #classic-k8s-rollout. It was one of my favorite channels, and a bittersweet moment, but now it’s time to move on.