Source: Distributed Laboratory

After a few years running the infrastructure team at Ridecell, I wanted to take a break to record some thoughts and lessons learned.

1 Kubernetes is not just hype

I’ve been active in the Kubernetes community for a long time, so this was not a surprise to me, but when something gets hyped this much, it’s always worth double checking. In just over two years, my team completed a full migration from Ansible+Terraform to pure Kubernetes, and in the process more than tripled our deployment rate while cutting deployment errors to “I can’t remember when one last happened” levels. We also improved operational visibility, automated a lot of uninteresting but important tasks, and shortened average recovery times when infrastructure outages occur.

Kubernetes is not magic, but when used by a team that understands it, it can be a very powerful tool.

2 Traefik + cert-manager + external-dns is a great combination

The combination of Traefik as the Ingress controller, cert-manager generating certificates through LetsEncrypt, and external-dns managing edge DNS records makes HTTP routing and administration smooth as butter. I’ve been critical of Traefik 2.0 for removing a lot of the 1.x annotations, but they are finally back in 2.2, albeit in a different form. As an edge proxy, Traefik is a solid choice with excellent metrics integration, the fewest moving parts of any Ingress controller, and a responsive development team. cert-manager is a fantastic tool to pair with any Ingress scheme; if you use TLS in your Kubernetes cluster and aren’t already running it, check it out now.

External-dns is not as popular as the other two, but it is no less important for automating the step of keeping DNS records in sync with what is actually deployed.
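To make the combination concrete, here is a minimal sketch of how the three tools cooperate on a single Ingress. The hostname, Service name, secret name, and cluster-issuer name are all placeholders, and the annotations assume cert-manager’s standard ingress-shim behavior and a Traefik ingress class.

```yaml
# Sketch: one Ingress driving Traefik, cert-manager, and external-dns (names are placeholders).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app
  annotations:
    kubernetes.io/ingress.class: traefik          # routed by Traefik
    cert-manager.io/cluster-issuer: letsencrypt   # cert-manager issues the TLS certificate
spec:
  rules:
    - host: app.example.com                       # external-dns creates this DNS record
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-app
                port:
                  number: 80
  tls:
    - hosts:
        - app.example.com
      secretName: example-app-tls                 # populated by cert-manager
```

One manifest, and routing, TLS, and DNS are all handled without any manual steps.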

If anything, these tools may make it too easy to set up new HTTPS endpoints. Over the years we ended up with dozens of unique certificates, which created a lot of noise in Certificate Transparency searches and in LetsEncrypt’s own certificate expiration warnings. In the future I will carefully consider which hostnames can be served from a globally configured wildcard certificate to reduce the total number of certificates in use.
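For that consolidation, something like the following cert-manager Certificate could cover many hostnames with a single cert; the issuer name and domain are placeholders, and a DNS-01 capable issuer is assumed since LetsEncrypt requires it for wildcards.

```yaml
# Sketch: a single wildcard certificate shared by many hostnames (placeholder names).
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: default
spec:
  secretName: wildcard-example-com-tls
  issuerRef:
    name: letsencrypt-dns        # must use a DNS-01 solver for wildcard names
    kind: ClusterIssuer
  dnsNames:
    - "example.com"
    - "*.example.com"
```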

3 Prometheus is impressive, Thanos is not overkill

This was the first time I used Prometheus as the primary metrics system, and it has certainly earned its place as the dominant tool in the field. We chose the Prometheus Operator to manage it, which was a good choice and made it easier to push scrape and rule configuration out to the applications that needed them. I wish I had used Thanos from the start. I assumed it would be overkill, but it was easy to configure and helped a lot with cross-region queries and with reducing Prometheus resource usage, even though we weren’t directly using the active-active high availability setup.
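As a rough sketch of what this looks like with the Prometheus Operator: a ServiceMonitor lets each application own its scrape configuration, and the thanos block on the Prometheus resource adds the sidecar. The label selectors, the Thanos image version, and the object-storage secret name below are assumptions.

```yaml
# Sketch: per-application scrape config via ServiceMonitor, plus a Thanos sidecar.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  labels:
    team: infra                    # matched by the Prometheus selector below
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: main
spec:
  replicas: 2
  serviceMonitorSelector:
    matchLabels:
      team: infra
  thanos:
    image: quay.io/thanos/thanos:v0.31.0   # assumed version
    objectStorageConfig:                   # assumed secret containing objstore.yml
      name: thanos-objstore
      key: objstore.yml
```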

The biggest frustration I had with this stack was dashboard management in Grafana: how to store and organize dashboards. There has been a huge growth in tools for managing dashboards, via YAML files, JSON files, Kubernetes custom objects, and just about anything else you can think of. But the root problem is that it’s hard for any tool to author a dashboard from scratch, because Grafana has a million different configuration options, panel modes, and so on. We ended up treating Grafana as a stateful system, managing all dashboards directly in it, but I didn’t like that solution. Is there a better workflow here?
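One of the file-based approaches in that list, for reference, is shipping dashboards as labelled ConfigMaps that a Grafana sidecar loads; the label key below is the common default from the community Helm chart, and the dashboard JSON is obviously just a stub, so treat this as an illustration rather than our workflow.

```yaml
# Sketch: a dashboard shipped as a ConfigMap for Grafana's dashboard sidecar to pick up.
# The exported dashboard JSON would normally go under data; a stub is shown here.
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-app-dashboard
  labels:
    grafana_dashboard: "1"        # label the sidecar is configured to watch for
data:
  example-app.json: |
    {"title": "Example App", "panels": []}
```

It works, but it still leaves the hard part unsolved: producing and reviewing that JSON in the first place.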

4 GitOps is the way to go

If you use Kubernetes, you should use GitOps. There are many tools to choose from, ranging from the simplest option of adding a job that runs kubectl apply to your existing CI system, all the way up to dedicated systems such as ArgoCD and Flux. I’m firmly in the ArgoCD camp: it was a solid choice to start with, and it has only gotten better over the years. Just this week the first release of GitOps Engine went live, putting ArgoCD and Flux on a shared underlying library, so both should improve even faster from here. If you don’t like the workflows of these tools, it’s also easier than ever to build new ones. We had an unplanned disaster recovery game day a few months ago when someone accidentally deleted most of the namespaces in the test cluster. Thanks to GitOps, we recovered by running make apply in the bootstrap repository and waiting for the system to rebuild itself. That said, it’s also important to have Velero backups of stateful data that doesn’t live in Git (such as cert-manager certificates; they can be reissued, but you might run into LetsEncrypt’s rate limits).
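A minimal sketch of what one ArgoCD Application in such a bootstrap repository might look like is below. The repo URL, path, and namespaces are placeholders; automated sync with pruning and self-heal is what makes “wait for the system to rebuild itself” work after a deletion.

```yaml
# Sketch: an ArgoCD Application with automated sync (repo URL and paths are placeholders).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/infra.git
    targetRevision: main
    path: apps/example-app
  destination:
    server: https://kubernetes.default.svc
    namespace: example-app
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes made directly in the cluster
```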

The biggest problem we encountered was the decision to keep all of our core infrastructure in one repository. I still think a single repository is the right design, but I would split things across several ArgoCD instances rather than putting everything in one big “infra” instance as we do now. A single instance leads to longer convergence times and a noisy UI, and it doesn’t buy much as long as you are in the habit of splitting your Kustomize definitions properly.
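For what it’s worth, “splitting the Kustomize definitions” here just means each service keeping its own small kustomization that a higher-level one composes; the layout below is a hypothetical illustration, not our actual tree.

```yaml
# Hypothetical top-level kustomization.yaml composing per-service definitions.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - services/traefik
  - services/cert-manager
  - services/external-dns
  - services/monitoring
```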

5 We should create more operators

I’ve been active in custom operator development from the beginning, and we’ve had great success with it. We started with one custom resource and controller to deploy our primary web application and slowly expanded to all the other automation that it and other applications needed. Simple infrastructure services deployed with plain Kustomize and ArgoCD work fine, but we reached for an operator whenever we wanted to control something outside Kubernetes (such as creating AWS IAM roles, used via kiam) or needed some level of state machine to drive a process (such as a Django application deployment with SQL migrations). As part of this, we also built a very thorough test suite for all of our custom objects and controllers, which greatly improved operational stability and our own confidence that the system works correctly.
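To make that concrete, here is a purely hypothetical example of the kind of custom resource such an operator exposes to application teams. The DjangoApp kind, its fields, and the status phases are inventions for illustration, not our actual CRD.

```yaml
# Hypothetical custom resource: the operator reconciles it by creating the Deployment,
# managing the AWS IAM role, and running SQL migrations as a state machine before rollout.
apiVersion: apps.example.com/v1
kind: DjangoApp
metadata:
  name: main-webapp
spec:
  image: registry.example.com/webapp:1.2.3
  replicas: 4
  iamRole: webapp-production        # operator creates and updates the IAM role
  migrations:
    runBefore: true                 # gate the rollout on successful SQL migrations
status:
  phase: Migrating                  # e.g. Pending -> Migrating -> RollingOut -> Ready
```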

There are more and more ways to build operators these days, but I’m still quite happy with KubeBuilder (although to be fair, we modified the project structure considerably over time, so it’s fairer to say we use controller-runtime and controller-tools directly rather than KubeBuilder itself). Whatever language and framework you prefer, there is probably an operator toolkit available for it, and you should definitely use one.

6 Secret management is still a problem

Kubernetes has its own Secret object for managing secret data at runtime, whether used by containers or by other objects, and that part of the system works fine. But the long-term workflow around Secrets is still a bit messy. Committing a raw Secret to Git is bad for a number of reasons I hope I don’t need to list, so how do we manage these objects? My solution was to develop a custom EncryptedSecret type that uses AWS KMS to encrypt each value, with a controller running in Kubernetes that decrypts it back into a normal Secret, plus a command line tool for the decrypt-edit-re-encrypt cycle. Using KMS means we can restrict use of the KMS key with IAM rules for access control, and because only the values are encrypted, the files produce reasonable diffs. There are now community operators based on Mozilla Sops that provide much the same workflow, although Sops is a bit frustrating for the local edit workflow. Overall there is still a lot of work to be done in this space; people should expect an auditable, versioned, code-reviewed workflow for secrets, just like everything else in the GitOps world.
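Conceptually the in-house type looked something like the hypothetical sketch below (the kind and field names are illustrative, not the real CRD): KMS ciphertexts live in Git, and a controller decrypts them into a plain Secret of the same name.

```yaml
# Hypothetical EncryptedSecret: values are KMS ciphertexts committed to Git;
# a controller decrypts them into a normal Secret with the same name.
apiVersion: secrets.example.com/v1
kind: EncryptedSecret
metadata:
  name: example-app
  namespace: example-app
spec:
  data:
    DATABASE_PASSWORD: "AQICAHj...base64-kms-ciphertext..."   # each value encrypted separately
    VENDOR_API_TOKEN: "AQICAHj...base64-kms-ciphertext..."    # so diffs stay per-key
```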

As a related issue, the weakness of Kubernetes’ RBAC model is most obvious with Secrets. In almost all cases, the Secret used for a thing must be in the same namespace as the thing that uses it, which often means Secrets for many different things end up in the same namespace (database passwords, vendor API tokens, TLS certificates). And if you want to give someone (or something, since the same problem applies to operators) access to one of them, they get access to all of them. Keep your namespaces as small as possible: anything that can live in its own namespace should. Your RBAC policies will thank you later.
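The sketch below shows the limitation: resourceNames can narrow a get to a single Secret, but list and watch cannot be restricted that way, so anything that needs to enumerate Secrets effectively sees the whole namespace. Names here are placeholders.

```yaml
# Sketch: per-object restriction works for "get", but not for "list" or "watch",
# so any consumer that lists Secrets sees every Secret in the namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: read-one-secret
  namespace: shared
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["example-app-db"]   # only effective for get
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["list", "watch"]            # cannot be narrowed by resourceNames
```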

7 Native CI and log analysis are still open issues

The two biggest gaps I’ve come across in the ecosystem are CI and log analysis. There are plenty of CI systems that deploy on Kubernetes, such as Jenkins, Concourse, and Buildkite, but few of them feel truly native. JenkinsX is probably the closest thing to a native experience, but it’s built on such an enormous amount of complexity that I find it a shame. Prow is very native too, but it’s also very bespoke, so it’s not an easy tool to adopt. Tekton Pipelines and Argo Workflows both have the low-level plumbing for a native CI system, but finding a way to expose them to my development team never got past a theoretical operator. Argo-CI seems to have been abandoned, but the Tekton team seems to be actively pursuing this use case, so I’m hopeful it will improve.
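To give a sense of the “low-level plumbing” Tekton provides, a pipeline is just a chain of Tasks like the sketch below; the task names and params are placeholders, and all of the developer-facing ergonomics (triggers, per-repo config, status UI) still have to be layered on top.

```yaml
# Sketch of a Tekton Pipeline chaining two Tasks (task names and params are placeholders).
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: build-and-deploy
spec:
  params:
    - name: git-url
      type: string
  tasks:
    - name: build
      taskRef:
        name: build-image            # assumed Task that builds and pushes an image
      params:
        - name: url
          value: $(params.git-url)
    - name: deploy
      runAfter: ["build"]
      taskRef:
        name: deploy-manifests       # assumed Task that applies the manifests
```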

Log collection is basically a solved problem, with the community converging on Fluent Bit running as a DaemonSet that forwards to a few Fluentd pods, which then ship logs to whatever system you use for storage and analysis. On the storage side, ElasticSearch and Loki are the main open competitors, each with its own analysis front end (Kibana and Grafana respectively). It’s that last part that is the main source of my frustration. Kibana is mature and has rich analysis features, but you have to use the commercial version to get basics like user authentication, and per-user permissions are still pretty vague. Loki is much newer, with fewer analysis tools (substring searches and per-line label searches) and no design for permissions so far. That’s fine if you are careful to ensure all log output is safe for every engineer to see, but be prepared for some pointed questions in SOC/PCI audits and the like.
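The collection side of that pipeline is mostly configuration. A minimal sketch of the Fluent Bit half, tailing container logs and forwarding to a Fluentd service, might look like the ConfigMap below; the namespace, service name, and paths are assumptions.

```yaml
# Sketch: Fluent Bit config to tail container logs and forward to Fluentd (names assumed).
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [INPUT]
        Name    tail
        Path    /var/log/containers/*.log
        Tag     kube.*

    [OUTPUT]
        Name    forward
        Match   *
        Host    fluentd.logging.svc.cluster.local
        Port    24224
```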

8 Epilogue

Kubernetes is not the complete, turnkey solution many make it out to be, but with some careful engineering and an extraordinary community ecosystem, it can be an unparalleled platform. Take the time to learn each of the underlying components and you’ll be well on your way to container happiness, and hopefully you can avoid some of my mistakes along the way.