Besides the code of individual applications, our project consists of dozens of microservices. All of them need to be monitored, and it is simply not feasible for DevOps engineers to do all of that by hand. So we built a monitoring system as a service for developers: they can configure monitoring on their own, build multidimensional reports with it, and set thresholds that trigger alerts. DevOps engineers only need to provide the infrastructure and the documentation.

This post is a written version of my RIT++ talk; we receive many requests for written versions of RIT++ talks. If you attended the conference or watched the video, you will find that this article matches the talk. If not, this article will show you how we evolved to our current solution, how we implemented it, and what we plan to do next.

Then: Layout and planning

How did we arrive at our current monitoring system? To answer that, we need to go back to 2015, when it looked like this:

[Figure: the monitoring setup as of 2015]

We had 24 nodes dedicated to monitoring, with a pile of cron jobs, scripts, and daemons that monitored things, sent messages, ran tasks, and so on. We realized that the further we went in this direction, the harder the system would be to maintain, the less sense it would make to keep developing it, and the easier it would be for it all to get out of control.

We decided to keep and develop only the reusable parts of the existing system and discard the rest, so we picked 19 applications to keep working on. These were just Graphite instances, data aggregators, and Grafana dashboards. So what does the new system look like? See below:

[Figure: the new monitoring architecture]

We have a metrics store: Graphite on fast SSD disks plus metric aggregators. Grafana handles the dashboards and Moira handles alerting. We also wanted to build a system that would catch all anomalies.

Standard: Monitoring 2.0

That is where we stood in 2015. But we needed to develop not just the infrastructure and the service itself, we also needed documentation, so we wrote a collaboration standard we call Monitoring 2.0. The system requirements are as follows:

  • 24/7 high availability,
  • metric storage interval of 10 seconds,
  • structured storage of metrics and reports,
  • SLA > 99.99%,
  • all event data collected over UDP.

We need UDP because we have enormous traffic and a huge number of metric-generating events; writing all of them to Graphite in real time would overwhelm the store. We also chose a first-level prefix for every metric.

Each prefix denotes a type: we have metrics for servers, networks, containers, resources, applications, and so on. This gives us clear, strict, typed filtering: we accept metrics whose first segment matches a known prefix and drop everything else. That is what we had in 2015. What does the system look like today?

Today: How the monitoring components interact

First, we monitor applications: our PHP code, services, microservices, in short, all the code our developers write. Every application sends metrics over UDP to the Brubeck aggregator (a statsd implementation in C), which performed best in our tests. Brubeck sends the aggregated data on to Graphite over TCP.

Brubeck has an especially handy metric type: timers. For example, for every user who connects to a service you send that request's response time to Brubeck. Even if a million response times come in, the aggregator emits only about ten data series: the request count plus the minimum, maximum, and average response times, and so on. The data then goes to Graphite, where everyone can see it.
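On the application side this costs almost nothing. Here is a minimal Python sketch of sending a timer to a statsd-compatible aggregator such as Brubeck; the host, port, and the "applications." prefix are placeholders, not our real names:

    import socket
    import time

    # Assumed aggregator address; replace with your Brubeck host and port
    # (8125 is the conventional statsd port).
    STATSD_ADDR = ("brubeck.example.internal", 8125)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send_timer(name, value_ms):
        # statsd text format: "<name>:<value>|ms". The first segment of the
        # name is the first-level type prefix used for filtering downstream.
        sock.sendto(("%s:%d|ms" % (name, value_ms)).encode("ascii"), STATSD_ADDR)

    start = time.time()
    time.sleep(0.05)                      # stand-in for real request handling
    send_timer("applications.my_service.response_time",
               (time.time() - start) * 1000)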

We also collect hardware and OS metrics, system metrics, and the metrics of Munin, the traditional monitoring system we used until 2015. We gather all of this with CollectD (it ships with a large number of plugins that can poll just about any resource on the host it is installed on; you only have to tell it where to send the data) and forward it to Graphite. CollectD also supports Python plugins and shell scripts, so you can build your own collectors: CollectD gathers the data locally or remotely (with curl, say) and pushes it to Graphite. All the collected data then goes to carbon-c-relay, Graphite's carbon relay rewritten in C. It acts as a router: it takes everything the aggregators send and routes it to the storage nodes, checking every metric as it goes. First, the metric must match one of the prefixes; second, it must conform to Graphite's naming rules. Anything else is thrown away.
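To make the routing and filtering step concrete, here is a sketch of what such a carbon-c-relay configuration might look like. The store hosts and the set of allowed first-level prefixes are invented for the example; the point is simply "forward what starts with a known prefix, send the rest to the blackhole":

    cluster graphite
        fnv1a_ch replication 1
            graphite-store-1:2003
            graphite-store-2:2003
        ;

    # accept only metrics whose first segment is one of the known prefixes
    match ^(servers|applications|containers|network|resources)\.
        send to graphite
        stop
        ;

    # everything else is thrown away
    match *
        send to blackhole
        stop
        ;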

carbon-c-relay then sends the data to the Graphite cluster. As the primary metric store we use go-carbon, a carbon-cache rewritten in Go; its multithreading makes it far more capable than the original carbon-cache. It receives the data and writes it to disk in the Whisper format (a standard format with a reference Python implementation). To read data out of the store we use graphite-api, which is considerably more efficient than the standard Graphite-web.
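Reading works through the usual Graphite render API. A quick sketch of pulling a series as JSON from graphite-api (the host and metric path are placeholders):

    import json
    import urllib.request

    # Assumed graphite-api endpoint and metric path; replace with your own.
    url = ("http://graphite-api.example.internal/render"
           "?target=applications.my_service.response_time"
           "&from=-1h&format=json")

    with urllib.request.urlopen(url) as resp:
        for series in json.load(resp):
            # each series: {"target": ..., "datapoints": [[value, timestamp], ...]}
            values = [v for v, _ in series["datapoints"] if v is not None]
            if values:
                print(series["target"], "last hour average:",
                      sum(values) / len(values))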

What happens to the data next? It goes to Grafana. We use the Graphite cluster as the main data source and Grafana as a single entry point for viewing monitoring data and building dashboards. For each of their services, developers build their own dashboards and get graphs of their application's metrics. Besides Grafana we also use SLAM, a Python daemon that computes SLAs from Graphite data. As I said, we have dozens of microservices, each with its own requirements; SLAM checks each service's documented requirements against the data in Graphite and evaluates whether its availability target is being met.
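SLAM itself is an internal tool, so what follows is only a hypothetical sketch of the idea: read the availability target a service documents, pull the matching series from Graphite, and check whether the target is met. All names, metric paths, and the shape of the service description are assumptions:

    import json
    import urllib.request

    GRAPHITE = "http://graphite-api.example.internal"     # assumed endpoint

    def fetch(target, frm="-7d"):
        url = "%s/render?target=%s&from=%s&format=json" % (GRAPHITE, target, frm)
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)[0]["datapoints"]        # [[value, ts], ...]

    # Hypothetical description of one service, normally taken from its docs.
    service = {
        "name": "my_service",
        "errors_metric": "applications.my_service.errors_percent",
        "sla": 99.99,                                      # target availability, %
    }

    points = [v for v, _ in fetch(service["errors_metric"]) if v is not None]
    good = sum(1 for v in points if v == 0)
    availability = 100.0 * good / len(points) if points else 0.0

    verdict = "OK" if availability >= service["sla"] else "SLA violated"
    print("%s: %.3f%% (target %.2f%%) %s"
          % (service["name"], availability, service["sla"], verdict))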

Alerting is the next step. It is built on Moira, a powerful system with its own built-in Graphite, developed by the SKB Kontur team, written in Python and Go, and 100% open source. Moira receives data in the same structure as Graphite, so even if the store goes down for some reason, alerting keeps working.

We run Moira in Kubernetes, with a Redis cluster as its primary database, which makes the setup fault-tolerant. Moira compares every incoming metric against its list of triggers; metrics that match no trigger are discarded, which lets it digest gigabytes of metrics per minute.

We also hooked up LDAP, so every employee in the company can set up notifications for existing triggers. Because Moira contains its own Graphite, it supports all of Graphite's functions: you can take a line from a Grafana panel, copy it into Moira, see how the data looks, set a threshold, and you have an alert. No special skills are needed. Moira can notify via SMS, email, Jira, Slack, and more, and it can also run user scripts: when a trigger fires and a script or binary is subscribed to it, Moira executes the binary and passes JSON to its stdin. Your program has to parse that JSON and do whatever you need with it, such as sending the event to Telegram or automatically creating a ticket in Jira.
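As an illustration of the script channel, here is a rough sketch of a handler that reads the trigger event from stdin and forwards it to a Telegram chat. The exact structure of the JSON Moira passes depends on the Moira version, so the field names below ("trigger", "events") are assumptions, and the bot token and chat id are placeholders:

    #!/usr/bin/env python3
    import json
    import sys
    import urllib.parse
    import urllib.request

    BOT_TOKEN = "123456:ABC..."       # placeholder Telegram bot token
    CHAT_ID = "-1000000000000"        # placeholder chat id

    payload = json.load(sys.stdin)    # Moira writes the event as JSON to stdin

    # Field names are illustrative; inspect the real payload for your version.
    trigger = payload.get("trigger", {}).get("name", "unknown trigger")
    states = [e.get("state", "?") for e in payload.get("events", [])]
    text = "Moira: %s -> %s" % (trigger, ", ".join(states) or "fired")

    data = urllib.parse.urlencode({"chat_id": CHAT_ID, "text": text}).encode()
    urllib.request.urlopen(
        "https://api.telegram.org/bot%s/sendMessage" % BOT_TOKEN, data=data)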

For alerting we also have a solution of our own based on Imagotag: we adapted the electronic price tags used in stores and now use them to display Moira triggers, with their time and state. Some developers have unsubscribed from Slack messages and email and use this board instead.

Since product features drive our business, we also monitor Kubernetes with this system: we deployed Heapster in the cluster to collect data and send it to Graphite. The result looks like this:

[Figure: Kubernetes monitoring dashboard]

Monitoring components

Here are all the components we used, all of which are open source.

Graphite:

  • go-carbon: github.com/lomik/go-carbon
  • whisper: github.com/graphite-project/whisper
  • graphite-api: github.com/brutasse/graphite-api

Carbon-c-relay:

github.com/grobian/carbon-c-relay

Brubeck:

github.com/github/brubeck

Collectd:

collectd.org

Moira:

github.com/moira-alert

Grafana:

grafana.com

Heapster:

github.com/kubernetes/heapster

Statistics

Here are some performance figures for our system:

Aggregator (Brubeck)

  • Data points received: ~300,000 per second
  • Interval for sending data to Graphite: 30 seconds
  • Resource usage: ~6% CPU (for the server that runs the full set of services); ~1 GB of RAM; ~3 Mbit/s of internal network bandwidth

Graphite (go-carbon)

  • Data points received: ~1,600,000 per minute
  • Data flush interval: 30 seconds
  • Retention: 30-second resolution for 35 days, 5-minute for 90 days, 10-minute for 365 days (so you can see how a service behaves over a long period)
  • Resource usage: ~10% CPU; ~20 GB of RAM; ~30 Mbit/s of internal network bandwidth
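The retention policy above maps directly onto the Whisper storage schema that go-carbon reads. A sketch of the corresponding storage-schemas.conf entry (the section name and pattern are placeholders; in practice there would be one section per metric prefix):

    [default]
    pattern = .*
    retentions = 30s:35d,5m:90d,10m:365d

Each Whisper file is pre-allocated for its entire retention, which is one of the reasons disk pressure grows with the number of metrics (more on that in the last section).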

Flexibility

We benefit a great deal from the flexibility of the monitoring service. Why is it so flexible? First, its components are interchangeable, both the components themselves and their versions. Second, it is easy to maintain: because the whole project is built on open source solutions, you can modify the code, change things, and implement the features you need. We use a very common stack, mainly Go and Python, so new features are easy to add.

Here is a practical example. In Graphite, a metric is a file whose name is the metric name, and it lives at a path; on Linux, file names are limited to 255 characters. One day, internal users from the database team came to us and said: "We want to monitor our SQL queries, and they are not 255 characters, they can be 8 MB each. We want to display them in Grafana, see the query's parameters and, better yet, see the hottest queries. It would be great if this showed up in real time and, in theory, it should be hooked up to alerting as well."

Here is an example of an SQL query from Postgrespro.ru:

We set up a Redis server and use our CollectD plugins to connect to Postgres, pull the data, and send it to Graphite. But we replace the metric name with a hash of the query; the same hash goes to Redis as a key, with the full SQL query as the value. What remains is for Grafana to connect to Redis and fetch that query. We had opened up the Graphite API because it is the main interface through which all the monitoring components talk to Graphite, and we added a new function to it, aliasByHash(): it takes the metric name, uses it as the key for a Redis lookup, and returns the full SQL query as the series name. So a query that in theory could not be displayed can now be displayed, along with its other stats: number of calls, number of rows, total time, and so on.
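Here is a simplified sketch of the write side of this trick. The host names and metric layout are invented, the redis client library is assumed to be installed, and in reality this logic lives inside our CollectD plugin rather than a standalone script:

    import hashlib
    import socket
    import time

    import redis                                  # pip install redis

    r = redis.Redis(host="redis.example.internal")              # assumed host
    carbon = socket.create_connection(("graphite.example.internal", 2003))

    sql = "SELECT id, name FROM users WHERE last_seen > now() - interval '1 day'"
    calls, total_time_ms = 42, 1234.5             # example stats for this query

    # The metric name carries only a short hash of the query text...
    query_hash = hashlib.sha1(sql.encode("utf-8")).hexdigest()
    # ...while the full text is stored in Redis under that hash.
    r.set(query_hash, sql)

    now = int(time.time())
    for stat, value in (("calls", calls), ("total_time", total_time_ms)):
        line = "databases.postgres.queries.%s.%s %s %d\n" % (
            query_hash, stat, value, now)
        carbon.sendall(line.encode("ascii"))
    carbon.close()

On the read side, the aliasByHash() function we added to the Graphite API does the reverse lookup: it takes the hash from the metric name, fetches the full SQL text from Redis, and returns it as the series alias that Grafana displays.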

Conclusion

Availability. Our monitoring service is available 24/7 for any application and any code. If you have access to the metric store, you can write data to it yourself; the language and the format don't matter, you only need to know how to open a socket, push your data, and close the socket.

Reliability. All components are fault-tolerant and handle our current load well.

A low barrier to entry. You don't need to master query languages or Grafana internals to use the system. Just develop your application, open a socket to Graphite, send your data, and close the socket; then open Grafana, create a dashboard, and watch your application's metrics, with notifications delivered through Moira.
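"Open a socket, send, close" really is the whole protocol. Carbon's plaintext format is one line per data point: metric name, value, Unix timestamp. The host and metric name below are placeholders:

    import socket
    import time

    # Assumed carbon-c-relay / Graphite endpoint; replace with your own.
    sock = socket.create_connection(("graphite.example.internal", 2003))
    sock.sendall(("applications.my_service.deploys 1 %d\n"
                  % int(time.time())).encode("ascii"))
    sock.close()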

Self-service. All of this is available as self-service, with no help from DevOps engineers needed. That is an obvious win: you can start monitoring your project right away, without asking anyone for help or for custom development.

What are we after?

These are not just abstract ideas but concrete goals, along with what has already been achieved.

  1. Anomaly detection. We want to build a service that connects to the Graphite store and runs various algorithms over the monitoring data. There are already algorithms that show what we want to see; we have the data and we know what to do with it.
  2. Metadata. We have many services that are constantly updated and supported, and the people working on them keep changing. Maintaining documentation by hand is not feasible, so we are injecting metadata into our microservices: who developed the service, which languages it interacts with, its SLA requirements, and where and to whom notifications should go (a hypothetical sketch of such a description follows this list). When a service is deployed, all of its monitoring entities are created automatically, and you get two links: one to the trigger and one to the Grafana dashboard.
  3. Self-service monitoring. We think every developer should use this system: here you can always see where your application's load is, what is happening to it, and where the problems and bottlenecks are. If part of a service goes down, you learn about it not from a call from customer support but from your own alerts, and you can open the logs right away and see what happened.
  4. High performance. Our project keeps growing: it now generates almost 2,000,000 metric values per minute, up from 500,000 a year ago. We keep adding more, and sooner or later Graphite (Whisper) will overload the disk subsystem. As I said, our monitoring system is very adaptable thanks to its interchangeable components: some teams scale by adding machines dedicated to Graphite, but we went a different way and are using ClickHouse as the monitoring data store. The migration is almost complete, and I will write about it in more detail soon: what we did and how, what challenges we ran into and how we solved them, how we carried out the migration, and the basic components involved and their configuration.
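To make the metadata idea (point 2) more concrete, here is a purely hypothetical sketch of what such a per-service description could look like; none of these field names are final, they simply mirror the list above:

    # service-metadata.yaml (hypothetical format and field names)
    name: my_service
    owners:
      - ivan.petrov                      # who developed and supports the service
    language: php
    sla: 99.99                           # documented availability target, percent
    notifications:
      slack: "#my-service-alerts"
      email: my-service-team@example.com
    # On deploy, triggers and a Grafana dashboard are generated from this file,
    # and the developer gets back two links: the trigger and the dashboard.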

tech.olx.com/monitoring-…

Thanks to Dongyu for correcting this article.
