When it comes to TiDB monitoring, the first things that come to mind are Prometheus and Grafana, two mature monitoring products that most readers are already familiar with. Prometheus collects metrics from the TiDB cluster, and Grafana turns that data into visual dashboards.

As a DBA, these two tools are indispensable for keeping a TiDB cluster healthy. Whether during routine inspections or troubleshooting, we open Grafana to check cluster status, study the various metrics, and track down problems one by one.

This article describes the pitfalls we ran into with Prometheus on a real project, and they really are pitfalls! If you are doing custom development around TiDB's Prometheus, or building your own monitoring and alerting system, this article should help.

Project background

First, some background on the project. The main task was to build a TiDB cluster with a large number of nodes. The company also had its own requirements for monitoring: it runs not just this one TiDB cluster but also MySQL and Oracle clusters, and the monitoring and alerting for all database clusters had to be integrated into a single monitoring and alerting platform.

In short, we developed our own monitoring and alerting platform to manage all database clusters. For the TiDB cluster, we kept TiDB's Prometheus as the collector of cluster metrics, queried the Prometheus data with PromQL, displayed the results, and wired them into the corresponding alerting system. You can think of it as reimplementing a simplified Grafana and Alertmanager.

Integrating Prometheus with a custom platform

TiDB's Prometheus ships with a set of alert rules that cover almost every aspect of the TiDB cluster, so they can be reused directly. They live as rule YAML files in the conf directory under the Prometheus deployment directory.
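As a rough illustration, here is a minimal sketch of pulling the alert names and expr fields out of those rule files. It assumes the Prometheus 2.x rule-group format and that PyYAML is installed; the path and file-name pattern are placeholders that vary by deployment method:

import glob
import yaml  # requires PyYAML

# The path and file-name pattern are assumptions; adjust to your deployment.
for path in glob.glob('/data/prometheus/conf/*.rules.yml'):
    with open(path) as f:
        doc = yaml.safe_load(f)
    for group in doc.get('groups', []):
        for rule in group.get('rules', []):
            if 'alert' in rule:
                print(rule['alert'], '=>', rule['expr'])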

From these rules we can extract the ones we care about and integrate them into our own platform. The integration roughly works as follows:

  • First, find the alert you want to integrate in the rule YAML files (for example, TiDB_server_panic_total, which fires when the number of panics in the TiDB server increases).

  • Another example: if the start time of the TiDB process changes within 5 minutes, the TiDB service has been restarted (see the sketch after this list).

  • Call Prometheus' HTTP API and run the PromQL query against http://prometheus_address/api/v1/query


import requests
response = requests.get('http://%s/api/v1/query' % prometheus_address, params={'query': query})


  • After obtaining the query results, process them and push them to your own monitoring and alerting platform.

  • PromQL can be debugged through http://prometheus_address/graph. The syntax is fairly simple: as long as you know the relevant TiDB metric and the basic PromQL functions, you can easily get the results you want. If you have trouble, take the expr expression from the rule YAML files and make a few small changes; that covers most scenarios.
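To make the restart example above concrete, here is a minimal sketch of querying the Prometheus HTTP API from Python. The address, the job label, and the PromQL expression are illustrative assumptions, not necessarily the exact rule shipped with TiDB:

import requests

prometheus_address = '127.0.0.1:9090'  # placeholder: point this at your Prometheus

# Illustrative PromQL: the process start time changed within the last 5 minutes,
# i.e. tidb-server has restarted. The job label depends on your scrape configuration.
query = 'changes(process_start_time_seconds{job="tidb"}[5m]) > 0'

response = requests.get('http://%s/api/v1/query' % prometheus_address,
                        params={'query': query})
response.raise_for_status()

# /api/v1/query returns JSON; data.result lists every series matching the expression.
for series in response.json()['data']['result']:
    instance = series['metric'].get('instance', 'unknown')
    print('tidb-server restarted recently on %s' % instance)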

Integrating Prometheus into a custom monitoring and alerting system really is about that simple; what you do with the data afterwards is down to your own business logic and design.

A long list of pitfalls

The integration itself is not hard; the biggest effort is probably deciding what to do with the data you get from Prometheus. However, you will hit quite a few pitfalls along the way, and most of them come from documentation and version differences.

Let me outline them:

We all know that TiDB releases new versions quickly, and some monitoring metrics change from version to version. However, TiDB itself and the monitoring-rule code do not move at the same pace, so you already have two things changing independently. At that point you have to cross-check the various documents to reconcile them, such as the TiDB cluster alert rules page in the PingCAP Docs. Then a third problem appears: the alerting documentation is updated even more slowly and has its own mistakes. Scared yet?

It can be very difficult to confirm anything when three things are changing at once. Let me give you an example:

TiKV_batch_request_snapshot_nums

In the documentation, this alert means that the Coprocessor CPU usage of a TiKV node exceeds 90%. However, if you drop the "above 90%" condition and simply try to get the Coprocessor CPU usage, you will find that the PromQL query returns no data from Prometheus.

The reason is that tikv_thread_cpu_seconds_total no longer has any thread names starting with cop, which means you cannot get the Coprocessor CPU usage from tikv_thread_cpu_seconds_total at all. The rule YAML shipped with Prometheus had the same problem, so in Grafana for some versions of TiDB you will find that the Coprocessor CPU panel is simply empty.

So what is the root cause? TiDB introduced the unified read pool (UnifyReadPool), which merges the Coprocessor thread pool with the Storage read pool. Coprocessor CPU usage is therefore accounted for inside the unified read pool instead of being reported separately.
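As a hedged illustration (the exact thread-name prefixes depend on your TiKV version and dashboard revision), the difference looks roughly like this:

# Old style: filter Coprocessor threads by the "cop_" name prefix.
# On versions with the unified read pool this typically returns no data.
old_cop_cpu = 'sum(rate(tikv_thread_cpu_seconds_total{name=~"cop_.*"}[1m])) by (instance)'

# With the unified read pool, Coprocessor work runs in the unified read pool
# threads, so CPU usage has to be read from those thread names instead.
unified_pool_cpu = 'sum(rate(tikv_thread_cpu_seconds_total{name=~"unified_read_po.*"}[1m])) by (instance)'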

Of course, there are many other problematic metrics: some have been deprecated, some renamed, and some merged or split. We have to test thoroughly while building on them; otherwise the cluster may run into trouble while the monitoring and alerting stay silent, which is very dangerous. A small sanity check like the one below can help.
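For example, a simple helper like this (a sketch; the address and the list of expressions are placeholders) can be run after every upgrade to flag any expression that no longer returns data:

import requests

def expr_returns_data(prometheus_address, expr):
    # True if the PromQL expression matches at least one series right now.
    resp = requests.get('http://%s/api/v1/query' % prometheus_address,
                        params={'query': expr})
    resp.raise_for_status()
    return len(resp.json()['data']['result']) > 0

# Placeholder address and a hypothetical list of expressions our platform depends on.
exprs = [
    'sum(rate(tikv_thread_cpu_seconds_total{name=~"cop_.*"}[1m])) by (instance)',
    'sum(rate(tikv_thread_cpu_seconds_total{name=~"unified_read_po.*"}[1m])) by (instance)',
]

for expr in exprs:
    if not expr_returns_data('127.0.0.1:9090', expr):
        print('WARNING: no data for %s' % expr)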

Conclusion

Due to limited space, I will not list every problem one by one. This article mainly records a real project in which we built custom monitoring and alerting for a TiDB cluster, in the hope that it helps you master and use TiDB's Prometheus, and it shares some of the problems we hit with Prometheus along the way so that you can watch out for them yourself.

As you can see, TiDB and its documentation, while very capable by now, still have problems big and small that you will run into regularly as a member of the community. Since TiDB is an open source project, we can not only enjoy the benefits of open source but also contribute back: when we find these problems, we can go to the GitHub repository and submit a small PR to help improve it.

For example, I also noticed earlier that the Prometheus metric node_cpu was later renamed to node_cpu_seconds_total. Before the documentation caught up, I submitted a PR to the PingCAP docs repository and got a Contributor badge, which was pretty cool.