After several years of managing the infrastructure team at Ridecell, I wanted to take a break to jot down some thoughts and lessons learned.

1 Kubernetes is more than hype

I’ve been active in Kubernetes for a long time, so this wasn’t a surprise to me, but when something gets hyped up this much, it’s always good to double-check. In a little over two years, my team completed a full migration from Ansible+Terraform to pure Kubernetes, and in the process we more than tripled our deployment rate while reducing deployment errors to the “I can’t remember the last time it happened” level. We also improved operational visibility, automated a number of boring but important tasks, and cut the average time to recover from infrastructure outages.

Kubernetes is not magic, but when used by a team that understands it, it is a very powerful tool.

2 Traefik + cert-manager + external-dns is a great combination

The combination of Traefik as the Ingress controller, cert-manager generating certificates through Let’s Encrypt, and external-dns managing edge DNS records makes HTTP routing and management as smooth as butter. I’ve complained about the choice to remove many of the 1.x annotation features in Traefik 2.0, but they’re finally back in 2.2, albeit in a different form. As an edge proxy, Traefik is a solid choice with excellent metrics integration, the fewest moving parts of any Ingress controller, and a responsive development team. cert-manager is an amazing tool to pair with any Ingress solution, so if you do TLS in a Kubernetes cluster but haven’t started using it yet, it’s time to learn about it.

external-dns gets less attention than the other two, but it is just as important for automating the step of keeping DNS records in sync with your Ingress hosts.
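
To make the division of labor concrete, here is a minimal sketch of how the three pieces meet in a single Ingress object: Traefik routes the traffic, cert-manager issues the certificate named in the TLS section, and external-dns publishes a record for the host. The hostname, service name, and issuer name are all hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    kubernetes.io/ingress.class: traefik
    # cert-manager watches this annotation and issues a certificate into the
    # Secret named under spec.tls (the issuer name is hypothetical).
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  rules:
    - host: app.example.com   # external-dns creates the DNS record from this host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
  tls:
    - hosts:
        - app.example.com
      secretName: my-app-tls
```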

If anything, these tools may make it a little too easy to set up new HTTPS endpoints. Over the years we’ve ended up with dozens of unique certificates, which creates a lot of noise in Certificate Transparency searches and in Let’s Encrypt’s own certificate expiration warnings. In the future I will carefully consider which hostnames can be served from a globally configured wildcard certificate, to reduce the total number of certificates in use.
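
One way to do that is to issue the wildcard once as a cert-manager Certificate and point each Ingress at the resulting Secret instead of requesting per-host certificates. A rough sketch, assuming a ClusterIssuer (here called letsencrypt-dns) that can solve DNS-01 challenges, which wildcard certificates require:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: traefik            # alongside the ingress controller that serves it
spec:
  secretName: wildcard-example-com-tls
  issuerRef:
    name: letsencrypt-dns       # hypothetical DNS-01 ClusterIssuer
    kind: ClusterIssuer
  dnsNames:
    - "*.example.com"
    - example.com
```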

3 Prometheus rocks, and Thanos is not overkill

This was my first time using Prometheus as the primary metrics system, and it deserves its place as the premier tool in the field. We chose prometheus-operator to manage it, which was a good choice: it makes it much easier to distribute scrape and rule configuration out to the applications that need them. Using Thanos from the start was also the right call. I expected it to be overkill, but it was easy to configure and helped a lot with cross-region queries and with reducing Prometheus resource usage, even though we never directly used its active-active high-availability setup.
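
The way prometheus-operator distributes that configuration is through per-application objects such as ServiceMonitor and PrometheusRule. A minimal ServiceMonitor sketch, with hypothetical app and label names:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  labels:
    release: prometheus    # must match the Prometheus object's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app          # Services carrying this label get scraped
  endpoints:
    - port: metrics        # named port on the Service
      interval: 30s
```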

The biggest problem I had with this stack was Grafana data management: how to store and organize dashboards. There has been a huge growth in tools for managing dashboards as YAML files, JSON files, Kubernetes custom objects, and just about anything else you can think of. But the root problem is that it’s hard for any tool to author a dashboard from scratch, because Grafana has a million different configuration options, panel modes, and so on. We ended up treating it as a stateful system, with all the dashboards managed in it directly, but I’m not happy with that solution. Is there a better workflow out there?
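
For what it’s worth, the community pattern we see most often is dashboard JSON stored in ConfigMaps that a Grafana sidecar watches and imports; a sketch, assuming the grafana_dashboard label convention used by the sidecar in the common Grafana Helm charts, with a placeholder dashboard body:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # the dashboard sidecar picks up ConfigMaps with this label
data:
  my-app.json: |
    {
      "title": "My App",
      "panels": []
    }
```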

4 GitOps is the way to go

If you use Kubernetes, you should use GitOps. There are many tools to choose from, from the simplest option of adding a job that runs kubectl apply to your existing CI system, all the way up to dedicated systems such as ArgoCD and Flux. I’m firmly in the ArgoCD camp, though: it’s a solid choice to start with, and it has only gotten better over the years. Just this week the first release of gitops-engine went live, putting ArgoCD and Flux on a shared underlying engine, making both faster and better; and if you don’t like the workflow of either tool, it’s now even easier to build new ones. A few months ago we had an unplanned disaster-recovery game day when someone accidentally deleted most of the namespaces in a test cluster. Thanks to GitOps, we recovered by running make apply in our bootstrap repository and waiting for the system to rebuild itself. That said, it’s also important to have Velero backups of stateful data that won’t survive a rebuild from Git (such as cert-manager certificates; they can be reissued, but you may run into Let’s Encrypt rate limits).
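
For reference, the unit ArgoCD syncs is an Application object pointing at a path in a Git repository; a minimal sketch with hypothetical repository, path, and namespace names:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: infra
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/infrastructure.git  # hypothetical repo
    targetRevision: main
    path: clusters/production                               # Kustomize directory to sync
  destination:
    server: https://kubernetes.default.svc   # the cluster ArgoCD runs in
    namespace: infra
  syncPolicy:
    automated:
      prune: true      # delete objects removed from Git
      selfHeal: true   # revert manual drift
```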

The biggest problem we hit was the decision to keep all of our core infrastructure in a single repository. I still think a single repository is the right design, but I would split things into separate ArgoCD Applications rather than the one big “infra” Application we have today. Using a single Application results in longer convergence times and a noisy UI, and it brings little benefit as long as we are disciplined about splitting up our Kustomize definitions.
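
Splitting things up that way mostly means keeping each component’s Kustomize definition self-contained so it can become its own ArgoCD Application; a sketch of one such per-component overlay, with hypothetical directory names:

```yaml
# clusters/production/traefik/kustomization.yaml  (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: traefik
resources:
  - ../../../base/traefik      # shared base definition for the component
patchesStrategicMerge:
  - resource-limits.yaml       # production-only tweaks live next to this file
```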

5 We should create more operators

I was on board with building custom operators from the start, and we’ve had great success with them. We started with a single custom resource and controller for deploying our main web application, and slowly expanded to all the other automation that application and others needed. Simple infrastructure services work fine with plain Kustomize and ArgoCD, but we reach for an operator when we want to control things outside the cluster (such as creating AWS IAM roles from Kubernetes and using them via kiam), or when some level of state machine is needed (such as a Django application deployment with SQL migrations). As part of this we also built a very thorough test suite for all of our custom objects and controllers, which greatly improved operational stability and our own confidence that the system works correctly.
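
To give a flavor of what that looks like, here is a purely hypothetical custom resource in the spirit of ours: the API group, kind, and fields are invented for illustration rather than our actual API, but they show the sort of state (image, migrations, IAM role) a controller can drive through a state machine.

```yaml
apiVersion: apps.example.com/v1beta1   # hypothetical API group and version
kind: DjangoApp                        # hypothetical kind
metadata:
  name: dispatch
  namespace: dispatch
spec:
  image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/dispatch:v1.42.0
  replicas: 6
  migrations: pre-deploy     # controller runs SQL migrations before rolling out pods
  iamRole: dispatch-app      # controller creates the AWS IAM role and wires it up via kiam
```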

There are more and more ways to build operators these days, but I’m still quite happy with KubeBuilder (although, to be fair, we have reworked the project structure considerably over time, so it’s more accurate to say we use controller-runtime and controller-tools directly rather than KubeBuilder itself). Whichever language and framework you prefer, there is probably an operator toolkit available for it, and you should definitely use it.

6 Secret management is still a challenge

Kubernetes has its own Secret object for managing secret data at runtime and exposing it to containers or other objects, and that part of the system works well enough. But the long-term workflow around Secrets is still a bit messy. Committing a raw Secret to Git is bad for a number of reasons I hope I don’t have to list, so how do we manage these objects? My solution was a custom EncryptedSecret type where each value is encrypted with AWS KMS, a controller running in Kubernetes that decrypts it back into a normal Secret as usual, and a command-line tool for the decrypt-edit-re-encrypt loop. Using KMS means we can do access control by limiting use of the KMS key through IAM rules, and encrypting only the values keeps the files reasonably diffable. There are now a few community operators based on Mozilla SOPS that provide roughly the same workflow, although SOPS can be a bit frustrating for the local editing loop. Overall there is a lot of work still to be done in this space; one should expect an auditable, versioned, code-reviewed workflow, just like everything else in the GitOps world.
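
Roughly, the workflow looks like the sketch below. EncryptedSecret is our in-house type, so the API group, version, and field names here are illustrative stand-ins rather than a published API; the in-cluster controller decrypts each KMS ciphertext and writes out a normal Secret with the same name.

```yaml
apiVersion: secrets.example.com/v1beta1   # illustrative group/version, not a published API
kind: EncryptedSecret
metadata:
  name: database-credentials
  namespace: dispatch
data:
  # Each value is an AWS KMS ciphertext (base64), produced and edited via the CLI tool;
  # the controller decrypts them into a regular Secret of the same name.
  DATABASE_PASSWORD: AQICAHh...truncated...
  SENTRY_DSN: AQICAHg...truncated...
```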

As a related issue, the weaknesses of Kubernetes’ RBAC model show up most clearly with Secrets. In almost all cases, the Secret used by a thing must live in the same namespace as the thing that uses it, which often means Secrets for many different things end up in the same namespace (database passwords, vendor API tokens, TLS certificates), and if you want to give someone, or something (operators have the same problem), access to one of them, they get access to all of them. Keep your namespaces as small as possible: anything that can go in its own namespace should. Your RBAC policy will thank you.
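
As a concrete illustration of the namespace-per-thing advice, access can then be granted one Secret-holding namespace at a time with an ordinary Role and RoleBinding (all names here are hypothetical):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: read-db-secrets
  namespace: database          # grants access only to Secrets in this namespace
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployer-read-db-secrets
  namespace: database
subjects:
  - kind: ServiceAccount
    name: deployer             # whichever person or operator needs the access
    namespace: ci
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: read-db-secrets
```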

7 Native CI and log analysis are still open issues

The two big ecosystem potholes I ran into were CI and log analysis. There are plenty of CI systems that deploy on Kubernetes, such as Jenkins, Concourse, and Buildkite, but very few solutions feel completely native. JenkinsX is probably the closest thing to a native experience, but it’s built on a huge amount of complexity, which I think is a shame. Prow is also very native, but it is quite bespoke, so it’s not an easy tool to pick up. Tekton Pipelines and Argo Workflows both have the low-level plumbing for a native CI system, but finding a way to expose that to my development team never got beyond a theoretical operator. Argo-CI seems to be abandoned, but the Tekton team appears to be actively pursuing this use case, so I’m hopeful for some improvements.

Log collection is mostly a solved problem, with the community consolidating around Fluent Bit running as a DaemonSet, sending to a few Fluentd pods, which then forward to whatever system you use for storage and analysis. On the storage side, Elasticsearch and Loki are the main open contenders, each with its own analysis front end (Kibana and Grafana respectively). It’s this last part that has been the main source of my frustration. Kibana has been around far longer and has a rich set of analysis features, but you have to use the commercial edition to get basic operational features like user authentication, and per-user permissions are still fairly murky. Loki is much newer, with far fewer analysis features (substring search and per-line label search) and no design for permissions yet. That’s fine if you can ensure all log output is safe for every engineer to see, but be prepared for some pointed questions during SOC/PCI and similar audits.
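
The collection half of that pipeline is straightforward to run; here is a pared-down sketch of a Fluent Bit DaemonSet, where the image tag and ConfigMap name are placeholders and the ConfigMap holding the inputs, parsers, and forward output to Fluentd is omitted:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:1.9     # placeholder tag
          volumeMounts:
            - name: varlog                 # container logs on the node
              mountPath: /var/log
              readOnly: true
            - name: config                 # inputs, parsers, and the forward output to Fluentd
              mountPath: /fluent-bit/etc/
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: config
          configMap:
            name: fluent-bit-config        # placeholder name
```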

8 Epilogue

Kubernetes is not the all-in-one, turnkey solution that many claim, but with some careful engineering and an extraordinary community ecosystem, it can be an unbeatable platform. Take the time to learn each of the underlying components and you’ll be well on your way to container happiness, and hopefully you’ll avoid some of my mistakes along the way.