Six factors you must consider when building a Prometheus platform

About the author

Loris Degioanni, founder and CTO of Sysdig, is also the founder of Falco, a container security tool.

This article was adapted from Rancher Labs

Prometheus is now widely used by enterprises and organizations to monitor their containers and microservices. Large companies, however, often run into difficulties along the way: as the number of applications grows, scaling metrics monitoring becomes a significant challenge.

The growing number of containers complicates the situation

Monitoring a monolithic environment is usually easier: the set of physical servers and virtual machines is static, and the number of metrics to monitor is limited. With containers and the move to microservice architectures, however, the number of instances to track and monitor has skyrocketed.

If servers in the data center are pets that need our constant attention, cloud instances are more like cattle (there are so many that you don’t have to care about individual instances), and containers are more like bees. They are numerous, sometimes hundreds per machine, and new ones appear all the time; under a container orchestration engine such as Kubernetes, they can be very short-lived. This makes them harder to track and monitor, and they can cause a lot of damage if you mishandle them.

As environments become more complex and distributed, the number of entities you need to monitor grows. You may also want to monitor more attributes of each entity, so that you have an accurate picture of what is happening now, or of what happened when you are troubleshooting or responding to an incident. The latter is especially difficult in ephemeral environments: by the time you try to find the root cause of a problem, the resources involved are often already gone. The monitoring solution must therefore retain enough history for forensics.

Popular monitoring tool: Prometheus

More and more teams that need cloud monitoring are turning to Prometheus, an open source CNCF project. Prometheus has become the monitoring tool of choice for developers to collect and understand metrics in cloud native environments. It is supported by a large community, with 6,300 contributors from over 700 companies, 13,500 commits, and 7,200 pull requests.

By default, the typical cloud native application stack (Kubernetes, Nginx, MongoDB, Kafka, Golang, and so on) exposes Prometheus metrics. Prometheus itself is a single Go program that scales vertically, which makes it easy to deploy on a single container or host. In other words, Prometheus is extremely easy to start with, and you can quickly monitor your first Kubernetes cluster, but it also means that monitoring becomes more complex as the infrastructure grows.

Scaling problems with application growth

As environments grow, you need to track rapidly increasing volumes of time series data, and you eventually reach a point beyond which a single Prometheus instance cannot keep up. The most straightforward option then is to run a set of Prometheus servers across the enterprise, but this presents its own challenges. For example, it is not easy to manage and merge data across dozens or even hundreds of Prometheus servers. Likewise, supporting enterprise workflows, single sign-on, role-based access control, and SLA or compliance requirements is not easy. As applications grow, running a comprehensive monitoring solution without interrupting developers’ work becomes a question of manageability and reliability.

Enterprises have adopted several approaches to solve this problem.

The simple approach is to run a separate Prometheus server for each namespace or cluster. This becomes unsustainable at a certain scale, and it has the further disadvantage of creating many disconnected data silos. Troubleshooting becomes cumbersome, because most problems span multiple services, teams, or clusters. Not only is it hard to find the same metrics in every environment, you also have to stitch the data together to understand what is going on.
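To make the stitching burden concrete, here is a minimal Python sketch of what merging per-cluster results by hand might look like. It assumes each server has already answered an instant query through the Prometheus HTTP API (`/api/v1/query`) and that the JSON bodies have been decoded. The function name, the server names, and the `cluster` label used to record each sample's origin are hypothetical choices for illustration, not part of Prometheus.

```python
from __future__ import annotations

from typing import Any


def merge_instant_vectors(responses: dict[str, dict[str, Any]]) -> list[dict[str, Any]]:
    """Combine instant-query responses from several Prometheus servers.

    `responses` maps a server name (hypothetical, e.g. "cluster-a") to the
    decoded JSON body returned by that server's /api/v1/query endpoint.
    """
    merged = []
    for server, body in responses.items():
        if body.get("status") != "success":
            continue  # skip servers that returned an error
        data = body.get("data", {})
        if data.get("resultType") != "vector":
            continue  # this sketch only handles instant vectors
        for sample in data.get("result", []):
            labels = dict(sample["metric"])
            labels["cluster"] = server  # assumed label recording the origin
            timestamp, value = sample["value"]
            merged.append(
                {"metric": labels, "timestamp": float(timestamp), "value": float(value)}
            )
    return merged
```

Even this small helper has to decide what to do about failed servers, clashing labels, and differing timestamps, which is the kind of work that multiplies as the number of servers grows.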

Another common approach is to combine multiple Prometheus servers using open source tools such as Cortex or Thanos. These capable tools let you query the servers centrally, gather their data, and present it in unified dashboards. However, like any data-intensive distributed system, they require a lot of skills and resources to operate.

Six factors to consider

For companies that started with Prometheus and are now looking for a commercial solution for global monitoring, it is important not to lose the standardized work already built around Prometheus: dashboards, alerts, exporters, and so on. That is not the only consideration, however. If you want to keep using Prometheus, evaluate candidates against the following criteria:

1. Compatibility with all Prometheus features

Your vendor, tool, or SaaS solution needs to be able to ingest data from any entity that produces Prometheus metrics, whether that is on-premises Kubernetes or a cloud service. Ingesting Prometheus metrics is relatively trivial, but don’t overlook the small features that matter in your environment, such as the ability to relabel metrics or add labels as they are ingested into storage. These small things add up to a large difference in what data can be collected.
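As an illustration of what relabeling at ingestion involves, the following Python sketch imitates a simplified version of the Prometheus `replace` relabeling action: it joins the values of the source labels, matches them against a fully anchored regular expression, and writes the expanded replacement into a target label. The function name and sample labels are hypothetical, and the sketch uses Python's `re` module rather than the RE2 syntax Prometheus uses, so treat it as a model of the idea rather than of Prometheus's exact behavior.

```python
import re


def relabel_replace(labels, source_labels, regex, target_label,
                    replacement="$1", separator=";"):
    """Apply one simplified `replace` relabeling step and return a new label dict."""
    # Join the source label values, treating missing labels as empty strings.
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # Prometheus anchors relabel regexes at both ends, so use fullmatch.
    match = re.fullmatch(regex, joined)
    result = dict(labels)
    if match is None:
        return result  # no match: labels are left untouched
    # Translate Prometheus-style $1 / ${1} references into Python's \g<1>.
    template = re.sub(r"\$\{?(\w+)\}?", r"\\g<\1>", replacement)
    result[target_label] = match.expand(template)
    return result


# Example: copy a Kubernetes pod label into a shorter `app` label.
pod_labels = {"__meta_kubernetes_pod_label_app": "checkout"}
relabeled = relabel_replace(
    pod_labels, ["__meta_kubernetes_pod_label_app"], "(.+)", "app"
)
```

A candidate solution that cannot express this kind of transformation would force you to change label names at the source, which in turn can break existing queries and dashboards.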

2. PromQL compatibility

PromQL, the Prometheus query language, was created by the authors of Prometheus to extract the information stored in it. PromQL lets you query metrics for specific services or users, and aggregate or break down the data. For example, you can use it to display the CPU usage of each application across all containers, or to select only the Cassandra containers and display a single value per cluster. It can be said that PromQL unlocks the true value of Prometheus, so integrating Prometheus metrics into a product that does not fully support PromQL defeats the purpose of using Prometheus.

3. Hot-swap support

To be truly compatible with Prometheus, the solution must be a hot-swappable replacement that works with your existing dashboards, alerts, and scripts. For example, many enterprises that use Prometheus use Grafana for dashboards. This open source tool integrates nicely with Prometheus, including at the query level, and can be used to generate a range of useful charts and dashboards. Commercial products that claim to be compatible with Prometheus should therefore be compatible with tools such as Grafana. It is not enough for the solution to let you view numbers in Grafana; you need to be able to take existing Grafana dashboards as they are and apply them to the data stored in the commercial solution.
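The two examples above might be written roughly as follows. This is a sketch under assumptions: `container_cpu_usage_seconds_total` is the CPU counter exposed by cAdvisor, but the `app` and `cluster` labels and the `container="cassandra"` selector depend entirely on how your environment labels its metrics, and the server address is a placeholder. The helper only builds a URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`); it does not send the request.

```python
from urllib.parse import urlencode

# CPU usage per application across all containers (assumes an `app` label).
CPU_PER_APP = "sum by (app) (rate(container_cpu_usage_seconds_total[5m]))"

# Only Cassandra containers, reduced to one value per cluster
# (assumes `container` and `cluster` labels exist on the series).
CASSANDRA_PER_CLUSTER = (
    'sum by (cluster) '
    '(rate(container_cpu_usage_seconds_total{container="cassandra"}[5m]))'
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL of an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Placeholder address; substitute your own Prometheus-compatible endpoint.
url = instant_query_url("http://prometheus.example:9090", CPU_PER_APP)
```

One practical way to test PromQL compatibility is to send the queries you already rely on to a candidate product's query endpoint and compare the results with those from your own Prometheus server.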

4. Access control

Access control is another issue to consider when evaluating tools, this time on the security side. Support for user authentication through industry-standard protocols, including LDAP, Google OAuth, SAML, and OpenID, lets companies isolate and secure resources through service-based access control.

5. Troubleshooting

Kubernetes simplifies deployment, elastic scaling, and management of containerized applications and microservices. This helps keep services running, but to identify and resolve the root causes of problems such as performance degradation, deployment failures, and connection errors, you need to be able to collect and visualize infrastructure, application, and performance data from across the environment. Without simultaneous access to real-time information and contextual data, it is almost impossible to correlate metrics across your environment; with it, you can solve problems faster.

6. Compatibility with existing alerts

Finally, if you are looking for a commercial solution to help solve Prometheus scalability issues, make sure it supports alerting at every level. The key is full support for Alertmanager functionality, which in turn requires complete integration and PromQL compatibility.

If you find a commercial tool that meets the above criteria, you should be able to integrate it easily into your existing Prometheus setup and avoid the scalability issues your company has encountered. Developers have good reason to love Prometheus, so thorough due diligence before adopting a commercial solution will ensure they can keep using their preferred metrics.