Chen Shuang, a member of the Personal Bar Team and a TiDB R&D engineer at PingCAP, participated in TiDB Hackathon 2019 and won the third prize with her project “Manage many as one with SQL”.

Extreme ease of use has always been a goal of PingCAP. Before this project, TiDB hid the complexity of distribution behind a MySQL-compatible interface, freeing users from manually sharding databases and tables and letting them use TiDB as if it were a standalone database.

However, MySQL compatibility is only the first step toward ease of use, and it mainly improves the developer experience. For DBAs, operating and maintaining a distributed system is still not easy. So what problems does operating and maintaining a distributed database involve? They fall roughly into the following areas:

  • Component status information is scattered.

  • O&M operations must span multiple nodes.

  • O&M scripts are written over and over again and are hard to standardize.

After TiDB Hackathon 2019, we have a unified answer to the questions above: manage the entire TiDB cluster with SQL.

Use SQL to query cluster information

In the demo, we use SQL to retrieve the following information about the cluster (a rough sketch of such queries follows the list):

  • Topology and version information of all nodes.

  • Configuration information about all nodes.

  • The requests currently being processed on all nodes, that is, the PROCESSLIST.

  • Slow query information about all nodes.

  • Server hardware information of all nodes.

  • Load information of all nodes.
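The queries below are a rough sketch of what this looks like, assuming cluster-level system tables in information_schema named with the CLUSTER_ prefix described later in this post; the exact table and column names in the Hackathon demo may differ.

```sql
-- Topology and version information of every node (TiDB, TiKV, PD)
SELECT TYPE, INSTANCE, VERSION, START_TIME
FROM information_schema.CLUSTER_INFO;

-- Configuration of every node
SELECT TYPE, INSTANCE, `KEY`, VALUE
FROM information_schema.CLUSTER_CONFIG;

-- Statements currently being processed on all TiDB nodes
SELECT * FROM information_schema.CLUSTER_PROCESSLIST;

-- Slow queries collected from all TiDB nodes in the last 30 minutes
SELECT * FROM information_schema.CLUSTER_SLOW_QUERY
WHERE Time > NOW() - INTERVAL 30 MINUTE;

-- Hardware and load information of all nodes
SELECT * FROM information_schema.CLUSTER_HARDWARE;
SELECT * FROM information_schema.CLUSTER_LOAD;
```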

Provide more system memory tables

Previously, TiDB had only a few system tables exposing system information. The Hackathon project also added more system tables so that more information can be retrieved through SQL:

  • Node hardware information: CPU, memory, network interface card (NIC), and disk information.

  • Node load information: CPU and memory load, network and disk traffic.

  • Node configuration information.

  • Node monitoring information, such as QPS, KV Duration, and TSO Duration.

Each of these system tables also has a cluster-level counterpart prefixed with CLUSTER_. They are not all shown here.
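As one example of the relationship (a sketch, using SLOW_QUERY; the INSTANCE column on the cluster-level table is an assumption about how rows from different nodes are labeled):

```sql
-- Slow queries recorded only by the TiDB node this session is connected to
SELECT Query_time, Query
FROM information_schema.SLOW_QUERY
ORDER BY Query_time DESC
LIMIT 10;

-- The same data aggregated from every TiDB node; the INSTANCE column
-- identifies which node each row came from
SELECT INSTANCE, Query_time, Query
FROM information_schema.CLUSTER_SLOW_QUERY
ORDER BY Query_time DESC
LIMIT 10;
```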

Use SQL to dynamically change the configuration of the cluster

The following shows how to dynamically modify the configuration of all nodes in a cluster using SQL:

As the GIF shows, we can update the CLUSTER_CONFIG system table with an UPDATE statement to change the cluster configuration, without any complex O&M commands. The same statement can also be narrowed down to change a single configuration item on a single node.
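A minimal sketch of what such an UPDATE might look like, assuming CLUSTER_CONFIG exposes TYPE, INSTANCE, KEY, and VALUE columns; the configuration items and the instance address are illustrative only.

```sql
-- Change the log level on every TiDB node in the cluster
UPDATE information_schema.CLUSTER_CONFIG
SET VALUE = 'warn'
WHERE TYPE = 'tidb' AND `KEY` = 'log.level';

-- Narrow the change down to one configuration item on one TiKV node
UPDATE information_schema.CLUSTER_CONFIG
SET VALUE = '1GB'
WHERE TYPE = 'tikv'
  AND INSTANCE = '127.0.0.1:20160'
  AND `KEY` = 'storage.block-cache.capacity';
```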

Use SQL to diagnose faults

When a failure occurs in a TiDB cluster, DBAs previously had to troubleshoot each failure by matching its symptoms against a troubleshooting manual. With the basic capabilities described above, much of this can be automated. Some common troubleshooting scenarios in TiDB include:

  • Finding out why system writes are slow, for example because of a node outage.

  • Locating the full-link log trail of a slow query across the cluster.

  • Locating nodes whose disk capacity or memory metrics are abnormal.

After this Hackathon, the above results can be obtained with a few simple combinations of SQL statements. These capabilities are becoming a reality step by step.
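As a rough illustration of such a combination, a query along these lines could flag nodes whose disks are nearly full; the DEVICE_TYPE, NAME, and VALUE columns of CLUSTER_LOAD are assumptions about how load metrics are laid out.

```sql
-- Find nodes whose disk usage exceeds 80%
SELECT TYPE, INSTANCE, DEVICE_NAME, VALUE AS used_percent
FROM information_schema.CLUSTER_LOAD
WHERE DEVICE_TYPE = 'disk'
  AND NAME = 'used-percent'
  AND VALUE + 0 > 80;
```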

For a mature distributed database like TiDB, it is very important to operate, maintain, manage, and monitor the cluster in an easy-to-use and unified way. As the number of internal components grows, the fragmentation of monitoring and diagnostic tools, combined with TiDB's naturally distributed nature, makes this problem even harder. For a database, SQL is the most natural and scalable way to do this, and TiDB is working in this direction. This project is a good example.

— Huang Dongxu (PingCAP | CTO)

For DBAs, simplicity and efficiency come first, and SQL is the most powerful and efficient tool a DBA has. A distributed database differs from a traditional one: its many components interact in complex ways, performance diagnosis is difficult, and locating problems takes a long time, all of which give DBAs real headaches. TiDB, as a new-generation NewSQL database, has always adhered to the idea of extreme ease of use. Going back to SQL to manage the entire distributed database cluster saves DBAs a great deal of energy and time and is a perfect fit for how they work.

— Wentao Jin (PingCAP | DBA)

Project Design Details

Prior to this project, anyone querying TiDB's system tables was bound to hit the problem that some of them only contained data for the current TiDB node, not for all nodes. Examples include PROCESSLIST (statements currently executing) and SLOW_QUERY (the in-memory slow query table).

To read system information from all nodes and modify the configuration of all nodes, data exchange between TiDB nodes has to be enabled first. Let's take a look at the architecture of the current TiDB cluster:

Under the existing architecture, the other components already communicate over RPC, so it is natural for TiDB nodes to communicate over RPC as well, for the following reasons:

  • Using the HTTP API to read information from other TiDB nodes scales poorly, while an RPC framework is more flexible and can meet future needs for data exchange between TiDB nodes.

  • Computation can be pushed down to each TiDB node. Some information, such as the in-memory slow query table, can be very large; without operator pushdown to filter the data on each node, there would be a lot of redundant network overhead, just as with pushdown from TiDB to TiKV.

So we ultimately chose to add an RPC service to TiDB. The resulting architecture is as follows:

The new RPC service shares port 10080 with TiDB's existing HTTP service, so no new port is added. As the EXPLAIN results show, statements like the following are successfully pushed down to each TiDB node.
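The EXPLAIN output is not reproduced here; as a hedged illustration, a statement of the following shape is the kind whose WHERE conditions would be pushed down, so each TiDB node filters its own slow-query data before returning rows.

```sql
EXPLAIN
SELECT INSTANCE, Query_time, Query
FROM information_schema.CLUSTER_SLOW_QUERY
WHERE Time > NOW() - INTERVAL 10 MINUTE
  AND Query_time > 1;
```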

Final thoughts

Due to time constraints, the features above were not fully completed during the Hackathon, but they are now being developed as a community incubator project. I'm sure you will see them soon.

We also welcome anyone interested to join our community working group, the SQL Diagnostics Working Group, to further refine and implement this project and improve TiDB's usability.

Encouragingly, this is just the beginning.

pingcap.com/blog-cn/man…