Inspection is essential to keeping a system running smoothly and effectively; its purpose is to find hidden risks in time. Inspections are everywhere in daily life, such as electrical inspections and fire inspections, and it is this work that lets us live and work in a stable environment. Inspection is just as crucial for databases and other IT systems, where it reduces risk and improves service stability.

This article introduces the framework and inspection content of Meituan's MySQL database inspection system. We hope it helps you understand what database inspection is, how Meituan designed its inspection system architecture, and how the system helps keep MySQL services running stably.

I. Background

To ensure the stable operation of the database, the following core functional components are essential:

Database inspection is one of the most important parts of the operation and maintenance (O&M) system. It helps us find hidden risks in the database and deal with them before they cause problems. For large-scale clusters, flexible and robust automated inspection is critical.

Every system starts from a primitive stage. Our earliest inspection setup consisted of a central control machine, periodic inspection scripts, and a front-end display. Over time, this old scheme gradually exposed several problems:

  • Scheduled inspection tasks depend on the central control machine, which is a single point of failure.
  • Inspection results are scattered across different database tables and cannot be aggregated.
  • There is no unified development standard for inspection scripts, so their execution success rate cannot be guaranteed.
  • Each inspection item needs its own interface for fetching data and its own front-end changes for displaying results, which is cumbersome.
  • DBAs have to open the front end on their own initiative to see the hidden risks that inspection found and then deal with them, which slows down overall remediation.

Therefore, we need a flexible and stable inspection system to help us solve these pain points and ensure the stability of the database.

II. Design principles

The inspection system was designed around the following three principles:

  • Stability: as a tool for ensuring database stability, the inspection system's own stability must be guaranteed.
  • Efficiency: the system is user-centered and turns complexity into simplicity, lowering the cost of use so that newcomers can quickly start managing hidden risks. It also makes new inspections faster to deploy. The O&M environment (architecture, versions, basic modules) changes constantly and new inspection requirements keep emerging, so faster deployment means earlier protection.
  • Operability: hidden risks are managed based on data, including driving remediation and reviewing remediation efficiency, trends, and weak spots.

III. System architecture

The architecture diagram of Meituan's MySQL database inspection system is as follows. Following the diagram from bottom to top, we briefly introduce the main modules of the inspection system:

1. Execution layer

Inspection execution environment: this consists of multiple inspection executors, each with the inspection task scripts deployed. Every executor periodically pulls the latest scripts from the inspection Git repository. The scripts are managed with Python Virtualenv + Git, which makes it easy to add new executors.

Task scheduling: inspection tasks are scheduled by Crane, a distributed scheduled-task system developed by Meituan's Infrastructure Department, which removes the single point of failure of traditional scheduled tasks. Crane randomly assigns a task to one of the executors and, if that executor fails, assigns the task to another one. Generally, one inspection task corresponds to one inspection item, and the task applies a set of rules to specific inspection targets to decide whether a hidden risk exists.
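
The random assignment with reassignment on failure can be illustrated with a minimal sketch. This is not Crane's implementation or API: the function `run_with_failover`, the executor names, and the choice to try every remaining executor before giving up are all assumptions made for illustration.

```python
import random
from typing import Callable, Optional, Sequence


def run_with_failover(
    task: Callable[[str], object],
    executors: Sequence[str],
    rng: Optional[random.Random] = None,
) -> object:
    """Run `task` on a randomly chosen executor; on failure, try another one."""
    rng = rng or random.Random()
    candidates = list(executors)
    rng.shuffle(candidates)  # random assignment order (assumed policy)
    last_error: Optional[Exception] = None
    for executor in candidates:
        try:
            return task(executor)
        except Exception as exc:  # a failed run triggers reassignment
            last_error = exc
    raise RuntimeError("inspection task failed on every executor") from last_error


def check_example(executor: str) -> str:
    """Hypothetical inspection task; 'exec-1' simulates a broken executor."""
    if executor == "exec-1":
        raise ConnectionError("executor unreachable")
    return f"ran on {executor}"


result = run_with_failover(check_example, ["exec-1", "exec-2"])
print(result)
```

Whichever executor is picked first, the task ends up running on the healthy one, so no single machine is a point of failure for the schedule.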

Inspection targets: besides production databases, inspection also covers peripheral database products such as high-availability components and middleware, so that all risks that could cause database failures are covered.

2. Storage layer

Inspection database: stores inspection data. To standardize and simplify the process, hidden risks found by inspections are saved to the database through a common entry function, which provides the following:

  • Automatically fills in the owner of the hidden risk, the time it was discovered, and similar information;
  • Idempotent storage;
  • Semi-structured inspection results can be stored, because different inspection items produce different attributes. For example, the results of inspection A carry a middleware type, while the results of inspection B carry the number of CPU cores of the primary.
  • For table-level hidden risks, if the affected tables are shards split across databases and tables, they are automatically merged into one logical table before being stored.
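
A common entry function with these four properties might look like the sketch below. It is only an illustration under stated assumptions: SQLite stands in for the real inspection database so the example is self-contained, and the table schema, the `lookup_owner` placeholder, and the shard naming rule (a trailing `_<digits>` suffix) are all hypothetical.

```python
import json
import re
import sqlite3
from datetime import datetime, timezone
from typing import Optional

# Hypothetical schema; the primary key is what makes repeated writes idempotent.
SCHEMA = """
CREATE TABLE IF NOT EXISTS hidden_risk (
    inspection_item TEXT NOT NULL,
    cluster         TEXT NOT NULL,
    target          TEXT NOT NULL,
    owner           TEXT NOT NULL,
    found_at        TEXT NOT NULL,
    attributes      TEXT NOT NULL,
    PRIMARY KEY (inspection_item, cluster, target)
)
"""


def logical_table(name: str) -> str:
    """Collapse shard names such as orders_0007 into orders (assumed naming rule)."""
    return re.sub(r"_\d+$", "", name)


def lookup_owner(cluster: str) -> str:
    """Placeholder owner lookup; a real system would query its own metadata."""
    return f"dba-of-{cluster}"


def save_hidden_risk(
    conn: sqlite3.Connection,
    inspection_item: str,
    cluster: str,
    table: str,
    attributes: dict,
    now: Optional[datetime] = None,
) -> None:
    """Store one hidden risk; owner and discovery time are filled in automatically."""
    found_at = (now or datetime.now(timezone.utc)).isoformat()
    conn.execute(
        # OR IGNORE keeps the first record, so re-running an inspection is harmless.
        "INSERT OR IGNORE INTO hidden_risk VALUES (?, ?, ?, ?, ?, ?)",
        (
            inspection_item,
            cluster,
            logical_table(table),
            lookup_owner(cluster),
            found_at,
            json.dumps(attributes, sort_keys=True),  # semi-structured attributes
        ),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
for shard in ("orders_0001", "orders_0002", "orders_0001"):
    save_hidden_risk(conn, "large_table", "cluster_a", shard, {"middleware": "none"})
rows = conn.execute("SELECT target, owner FROM hidden_risk").fetchall()
print(rows)
```

Three writes for two shards of the same table leave a single row for the logical table `orders`, which shows the idempotent storage and the shard merging together.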

Inspection script Git repository: manages the inspection scripts. To make it easy for DBAs to add inspections, we built several common functions during system construction, which lowers the cost of developing new inspections and eases the migration of old inspection scripts to the new system.

3. Application layer

Integration with the database O&M platform: provides the entry points for viewing hidden-risk details, configuring inspections, and managing whitelists. To make hidden-risk management more efficient, we made the following design choices:

  • The hidden-risk details page shows how many days each hidden risk has existed, which makes it easier to trace its cause.
  • When configuring the display for a new inspection, a solution for the hidden risk must be provided. This ensures that hidden risks are handled according to defined rules and avoids secondary failures caused by incorrect handling.

Hidden-risk operations back end: the main purpose of this module is to drive the remediation of hidden risks.

  • Operations reports: help managers track remediation progress from a global perspective, covering core content such as hidden-risk trends, backlog distribution, distribution of new hidden risks, and the average remediation cycle, so that remediation can be driven from the top down. The report data is also computed by Crane scheduled tasks.
  • Reminder function: urges DBAs to handle hidden risks. A reminder contains the details of the hidden risk, when it occurred, the solution, and so on. Reminder modes include Daxiang ("Elephant", Meituan's internal messaging tool) messages and alarms, and the mode can be configured according to the severity of the inspection.
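
As a rough sketch of configuring the reminder mode by severity, the example below builds reminder payloads from a hidden-risk record. The severity levels, channel names, and the `HiddenRisk` fields are hypothetical, and nothing is actually sent; a real system would hand the payloads to its messaging and alarm services.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical mapping from inspection severity to reminder channels.
CHANNELS_BY_SEVERITY: Dict[str, List[str]] = {
    "low": ["message"],
    "high": ["message", "alarm"],
}


@dataclass
class HiddenRisk:
    item: str       # inspection item that found the risk
    detail: str     # specific content of the hidden risk
    found_at: str   # when it occurred
    solution: str   # solution configured for the inspection item
    severity: str


def build_reminders(risk: HiddenRisk) -> List[dict]:
    """Return one reminder payload per channel configured for the severity."""
    text = (
        f"[{risk.item}] {risk.detail} (found {risk.found_at}). "
        f"Solution: {risk.solution}"
    )
    # Unknown severities fall back to a plain message (assumed default).
    channels = CHANNELS_BY_SEVERITY.get(risk.severity, ["message"])
    return [{"channel": channel, "text": text} for channel in channels]


risk = HiddenRisk(
    item="table_without_unique_key",
    detail="db1.orders has no unique key",
    found_at="2020-01-01",
    solution="add a unique key",
    severity="high",
)
reminders = build_reminders(risk)
print([r["channel"] for r in reminders])
```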

External data service: provides hidden-risk data to other platforms and projects within Meituan so that inspection data delivers more value.

  • Integration with the wevin platform (a risk discovery and operations platform built by Meituan's SRE team, mainly for RD users): wevin receives the hidden-risk data reported by each service party, shows the risk points of each service from the RD perspective, organized by organizational structure, and tracks RD's remediation progress. The inspection system pushes hidden risks that need RD involvement, such as large tables and tables without a unique key, to RD through wevin.
  • O&M weekly report: aimed mainly at the RD leads and DBAs of each business line, it presents the running status of the business line's databases and their existing problems as a static report. Hidden risks found by inspection are one part of the report.

IV. Inspection items

Inspection items are divided by owner into DBA items and RD items. DBAs are mainly responsible for basic database functional components and hidden risks that may affect service stability. RD is mainly responsible for service faults or performance problems caused by table design defects and non-standard database usage. A few inspection items, such as the disk available space forecast, need both to take part in remediation. There are currently 64 inspection items, distributed as follows:

  • Cluster: checks hidden risks at the cluster level, such as cluster topology and core parameters.
  • Machine: mainly checks hidden risks at the server hardware level.
  • Schema/SQL: checks hidden risks in table structure design, database usage, and SQL quality.
  • High availability/backup/middleware/alarm: mainly checks whether the related core functional components have hidden risks.

The following describes the inspection items by listing several inspection tasks:

V. Results

Meituan's MySQL inspection system has been running stably for nearly a year, and 49 inspection items have been launched on the new system. Through continuous operation of the system and the joint efforts of the team, we have remediated more than 8,000 core hidden risks in total. Over the past three months the average remediation cycle has been no more than 4 days, which keeps the total number of outstanding hidden risks very low and effectively safeguards database stability.

The following trend chart shows the number of outstanding hidden risks over the past year. The sudden increases occur when new inspection items come online. Overall, the backlog of hidden risks shows a clear decline.

Besides driving internal remediation, we also actively push RD to remediate hidden risks through the wevin platform, and RD has remediated more than 5,000 hidden risks this way.

To improve the user experience, we have also invested heavily in accuracy: every inspection goes through strict testing and verification before it goes online.

Compared with other parties connected to wevin, the hidden risks reported by DBAs rank high in total volume, conversion rate, and response rate, which indicates that RD recognizes the hidden risks we report.

Indicator description:

  • Feedback rate = (number of risk events with feedback up to now / total number of risk events generated up to now) × 100%;
  • Feedback accuracy = (number of risk events with feedback marked accurate up to now / total number of risk events with feedback up to now) × 100%;
  • Conversion rate = (number of risk events marked accurate by users that need handling up to now / total number of risk events generated up to now) × 100%.
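
The three indicators can be computed directly from event counts, as in the sketch below. The field names and the sample numbers are made up for illustration and are not Meituan data; returning 0 when a denominator is zero is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class RiskEventCounts:
    generated: int              # total risk events generated so far
    fed_back: int               # events that have received feedback
    accurate: int               # events whose feedback marked them accurate
    accurate_needs_action: int  # accurate events that need handling


def percent(numerator: int, denominator: int) -> float:
    """Ratio as a percentage; 0.0 when nothing has been counted yet (assumed)."""
    return 0.0 if denominator == 0 else 100.0 * numerator / denominator


def feedback_rate(c: RiskEventCounts) -> float:
    return percent(c.fed_back, c.generated)


def feedback_accuracy(c: RiskEventCounts) -> float:
    return percent(c.accurate, c.fed_back)


def conversion_rate(c: RiskEventCounts) -> float:
    return percent(c.accurate_needs_action, c.generated)


counts = RiskEventCounts(
    generated=200, fed_back=150, accurate=120, accurate_needs_action=90
)
print(feedback_rate(counts), feedback_accuracy(counts), conversion_rate(counts))
```

Note that feedback accuracy is measured against events with feedback, while the other two indicators are measured against all generated events.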

VI. Future plans

Besides continuing to improve and add inspection items, the inspection system will keep exploring and iterating in the following directions:

  • Improve automation, including CI and auditing;
  • Strengthen operations capability by further refining the importance of each hidden risk to help prioritize decisions;
  • Automatic repair of hidden risks.

About the author

Wang Qi, a member of the DBA group in the Infrastructure Department, joined Meituan in 2018 and is responsible for MySQL database operation and maintenance, the database inspection system, monitoring, automated O&M weekly reports, and construction of the O&M data mart.
