• GreatSQL community original content is not allowed to use without authorization, please contact xiaobian and indicate the source.

[toc]

MGR is similar to traditional master-slave replication, in that it pays attention to the running status of each node and whether the transaction of the Secondary node is delayed. This article describes how to monitor MGR node status, transaction status, and so on.

1. Monitor node status

Query performance_schema.replication_group_members to find the status of each node in the MGR table:

mysql> select * from performance_schema.replication_group_members; +---------------------------+--------------------------------------+--------------+-------------+--------------+-------- -----+----------------+ | CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION | +---------------------------+--------------------------------------+--------------+-------------+--------------+-------- + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + | group_replication_applier | af39db70 c9 ec - 94-00155-6850-11 d064000 | 192.168.6.27 | 4306 | The ONLINE | PRIMARY | 8.0.25 | | group_replication_applier | b05c0838 ec - 6850-11 - a06b - 00155 d064000 | 192.168.6.27 | 4307 | ONLINE | SECONDARY | 8.0.25 | | group_replication_applier | b0f86046 ec - 6850-11-92, fe - 00155 d064000 | 192.168.6.27 | 4308 | The ONLINE | SECONDARY | 8.0.25 | +---------------------------+--------------------------------------+--------------+-------------+--------------+-------- -----+----------------+Copy the code

The main columns in the output result can be interpreted as follows:

  • The MEMBER_ID column value is the server_UUID of each node, which uniquely identifies each node. In command line mode, MEMBER_ID is passed to specify each node when the UDF is called.
  • MEMBER_ROLE indicates the role of each node. PRIMARY indicates that the node can accept read/write transactions; SECONDARY indicates that the node can only accept read-only transactions. If only one node is PRIMARY and all other nodes are SECONDARY, the node is in single-primary mode. If all nodes are PRIMARY, the node is in multi-primary mode.
  • MEMBER_STATERefers to the state of each node, and there are several states: ONLINE, RECOVERING, OFFLINE, ERROR, UNREACHABLE, etc. The following states are introduced respectively.
    • ONLINE: indicates that the node is in the normal state and can provide services.
    • RECOVERING means that nodes are performing distributed recovery and waiting to join the cluster. In this case, clone data may be copied from donor node or transmitted to binlog.
    • OFFLINE: Indicates that the node is OFFLINE. Note that there may also be a brief moment of OFFLINE status when you are about to join or rejoin a cluster.
    • ERROR: indicates that the node is in an ERROR state and cannot be a member of the cluster. It is also possible to be in this state when a node is performing a distributed restore or applying a transaction. When the node is in ERROR state, it cannot participate in the decision of cluster transaction. When a node is joining or rejoining the cluster, it may also display an ERROR state before completing the compatibility check and becoming an official MGR node.
    • UNREACHABLE. When the group communication message sending and receiving times out, the fault detection mechanism marks the node as suspected. The node is suspected to be unable to connect to other nodes, for example, when a node is disconnected unexpectedly. When other nodes on a node are in UNREACHABLE state, it may mean that some nodes have been partitioned, that is, multiple nodes are split into two or more subsets. Nodes in the subsets can communicate with each other, but subsets cannot.

When the node is not ONLINE, you should immediately raise an alarm and check what is happening.

Performance_schema.replication_group_members data is updated when node status changes or a node joins or exits. The status information is exchanged and shared among nodes and can be viewed on any node.

2. Monitor MGR transaction status

Another important concern is the transaction status of the Secondary node, more specifically, the transaction to be authenticated and the transaction queue size to be applied. To check whether the COUNT_TRANSACTIONS_IN_QUEUE and COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE columns of the non-primary node are large, run the following SQL: COUNT_TRANSACTIONS_IN_QUEUE

mysql> SELECT MEMBER_ID AS id, COUNT_TRANSACTIONS_IN_QUEUE AS trx_tobe_verified, COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE AS trx_tobe_applied, COUNT_TRANSACTIONS_CHECKED AS trx_chkd, COUNT_TRANSACTIONS_REMOTE_APPLIED AS trx_done, COUNT_TRANSACTIONS_LOCAL_PROPOSED AS proposed FROM performance_schema.replication_group_member_stats; +--------------------------------------+-------------------+------------------+----------+----------+----------+ | id | trx_tobe_verified | trx_tobe_applied | trx_chkd | trx_done | proposed | +--------------------------------------+-------------------+------------------+----------+----------+----------+ | 4ebd3504-11d9-11ec-8f92-70b5e873a570 | 0 | 0 | 422248 | 6 | 422248 | | 549b92bf-11d9-11ec-88e1-70b5e873a570 | 0 | 238391  | 422079 | 183692 | 0 | | 5596116c-11d9-11ec-8624-70b5e873a570 | 2936 | 238519 | 422115 | 183598 | 0 | | ed5fe7ba-37c2-11ec-8e12-70b5e873a570 | 2976 | 238123 | 422167 | 184044 | 0 | +--------------------------------------+-------------------+------------------+----------+----------+----------+Copy the code

Count_transactions_in_applier_queue specifies the size of the transaction queue waiting to be applied. COUNT_TRANSACTIONS_IN_QUEUE specifies the size of the transaction queue waiting to be authenticated. If either of these values is greater than 0, there is some degree of delay.

You can also look at the changes in the two values to see whether the two queues are gradually growing or shrinking to determine whether the Primary node is “running too fast” or the Secondary node is “running too slowly”.

By the way, when flow control is enabled, Group_replication_flow_control_applier_threshold and group_REPLICATION_FLOW_control_certifier_threshold exceed the corresponding thresholds Default threshold is 25000), which triggers the flow control mechanism.

3. Other monitoring

Alternatively, you can look at the gap between the received transaction and the completed transaction to determine:

mysql> SELECT RECEIVED_TRANSACTION_SET FROM performance_schema.replication_connection_status WHERE channel_name = 'group_replication_applier' UNION ALL SELECT variable_value FROM performance_schema.global_variables WHERE variable_name  = 'gtid_executed'\G *************************** 1. row *************************** RECEIVED_TRANSACTION_SET: 6cfb873b-573f-11ec-814a-d08e7908bcb1:1-3124520 *************************** 2. row *************************** RECEIVED_TRANSACTION_SET: 6cfb873b-573f-11ec-814a-d08e7908bcb1:1-3078139Copy the code

GTID = 3124520; GTID = 3078139; GTID = 3124520;

By the way, you can keep an eye on how this difference changes to estimate whether the local node will catch up with the delay in time or increase the delay.

In addition, when the original primary node fails and you want to manually select a node as the new primary node, you should first determine which node has a larger transaction GTID value and select this node first.

4. Summary

This article describes the main concerns that MGR monitors, including node state and replication delay state, and how to predict whether replication delay will continue to grow or catch up in time.

References, documents

  • MySQL 8.0 Reference Manual
  • Database kernel development – Wenzheng Lake
  • Group Replication principle – Song Libing

disclaimer

Due to my limited skills, errors and omissions are unavoidable in this column. Do not directly copy the commands and methods in the document and apply them to the online production environment. Please be sure to fully understand and verify in the test environment before formal implementation, to avoid damage or damage to the production environment.

Enjoy GreatSQL 🙂

Article recommendation:

GreatSQL Quarterly Report (2021.12.26)

Mp.weixin.qq.com/s/FZ_zSBHfl…

Technology sharing | sysbench pressure measuring tool usage

Mp.weixin.qq.com/s/m16LwXWy9…

Failure analysis | Linux disk IO utilization rate is high, the analysis of the correct position

Mp.weixin.qq.com/s/7cu_36jfs…

Technology sharing | flashbacks in MySQL implementation and improvement

Mp.weixin.qq.com/s/6jepwEE0D…

Wanta #20, how does index push down perform data filtering

Mp.weixin.qq.com/s/pt6mr3Ge1…

About GreatSQL

GreatSQL is a MySQL branch maintained by Wanli Database, which focuses on improving the reliability and performance of MGR. It supports InnoDB parallel query feature and is a MySQL branch version suitable for financial applications.

Gitee:

Gitee.com/GreatSQL/Gr…

GitHub:

Github.com/GreatSQL/Gr…

Bilibili:

Space.bilibili.com/1363850082/…

Wechat &QQ Group:

You can search to add GreatSQL Community Assistant wechat friends, and send the verification message “Add group” to join the GreatSQL/MGR communication wechat group

QQ group: 533341697

Wechat assistant: Wanlidbc

This article is published by OpenWrite!