Solution to a Bug – Did the Master/Slave Switchover "Fail"?

Preface

Database master/slave switchover is a very interesting topic. Handling the switchover smoothly and reliably is a prerequisite for business continuity. In this post, I'll walk through a small problem I ran into during a master/slave switchover.

Fault scenarios

In a recent production master/slave switchover, most applications cut over cleanly, but a few were still connected to the old master (now the new slave).

This left the developers of those applications puzzled, so they turned to me for help.

How was the problem discovered?

An application developer received a CAT monitoring alarm and found that requests in application A had been failing steadily on a few machines. He remembered that the master/slave switchover drill had been run just the night before, so he checked with netstat -anp and found that the machine was still connected to the old database!

```
netstat -anp | grep 1521
tcp        0      0 1.2.3.4:54100      1.1.1.1:1521       ESTABLISHED
```

The developer's conclusion was that the requests were failing because the master/slave switchover had not actually taken effect. At first glance, that seems to make perfect sense.

To investigate

What was going on? The switchover had completed successfully 8 hours earlier; why was this connection still there? So I pinged the database's domain name:

```
ping db.prd
64 bytes from db.prd (2.2.2.2): icmp_seq=1 ttl=64 time=0.02 ms
```



That's weird. DNS had already been switched over, so why was the application still connected to the old database?

First guess: DNS delay

The first thing that came to mind was a delay between the master/slave switchover and the DNS change taking effect. For example, if DNS only takes effect 2 minutes after the switchover completes, connections created during that window would still go to the old master.

That is a normal situation, and it is the DBA's job to kill all remaining connections to the old master. When I asked, the DBA said they had indeed killed all of them, and showed me the database's connection statistics (via SQL) on the spot: there was no connection from the machine in question. Yet the connection on the application machine was still ESTABLISHED.

Most of the application's machines were connected to the old database!

At this point the developers reported back to me that most of the machines for this application were still connected to the old database. If it were a DNS delay, it could hardly be such a coincidence; there were more than 40 of them!

And DNS on these machines already resolved to the new database.

Were the old connections never killed on the DB side?

Had the DBA skipped the step of killing the connections? But that contradicted the database statistics he had shown me. So I asked the DBA to run netstat on the old master's machine, and sure enough, the connection really did exist there!

```
netstat -anp | grep 54100
tcp        0      0 1.1.1.1:1521       1.2.3.4:54100      ESTABLISHED
```

Was there really something wrong with the database's statistics?

Getting the connection creation time

To verify my DNS-delay hypothesis, I used a few tricks to get the creation time of this connection. First, netstat -anp | grep 1521 to find the connection. Since it clearly belongs to the application's Java process, we get that process's PID directly: 8299.

```
netstat -anp | grep 1521
tcp        0      0 1.2.3.4:54100      1.1.1.1:1521       ESTABLISHED

netstat -anp | grep java
tcp        0      0 1.2.3.4:54100      1.1.1.1:1521       ESTABLISHED  8299/java
```

Now that we have the PID, we can cat /proc/8299/net/tcp to get all of that process's connection information, and then grep for 05F1, which is 1521 in hexadecimal (at the moment there is only one connection to port 1521 on this machine).

```
...  local_address  rem_address  ...  inode     ...
...  xxx:D354       xxx:05F1     ...  23456789  ...
```

This gives us the inode number of the socket (1.2.3.4:54100 <-> 1.1.1.1:1521). With the inode in hand, the rest is easy; we just run:

```
ls -all -h /proc/8299/fd | grep 23456789    # 23456789 is the inode
... Jan 29 17:43 222 -> socket:[23456789]
```

So this connection was created on January 29. But the switchover happened on March 19, which means the connection had already been alive for nearly two months! It therefore cannot be the DNS-delay problem described above, because the connection was never re-established at all.
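
As an aside, with more than 40 machines involved, it helps to script this lookup rather than repeat it by hand. Below is a minimal Java sketch (a hypothetical helper, not something from the original troubleshooting) that automates the first half: it scans /proc/<pid>/net/tcp for sockets whose remote port matches a given value and prints the socket inode, which can then be matched against /proc/<pid>/fd exactly as above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper (not from the original article): scan /proc/<pid>/net/tcp for
// sockets connected to a given remote port and print their inode numbers.
// Assumes a Linux /proc layout; ports in that file are hexadecimal (1521 -> 05F1).
public class FindDbSockets {
    public static void main(String[] args) throws IOException {
        String pid = args.length > 0 ? args[0] : "8299";         // PID seen in netstat -anp
        int remotePort = args.length > 1 ? Integer.parseInt(args[1]) : 1521;
        String hexPort = String.format("%04X", remotePort);      // 1521 -> "05F1"
        List<String> lines = Files.readAllLines(Paths.get("/proc", pid, "net", "tcp"));
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            // On data rows, cols[2] is rem_address ("HEXIP:HEXPORT") and cols[9] is the inode.
            if (cols.length > 9 && cols[2].endsWith(":" + hexPort)) {
                System.out.println("rem_address=" + cols[2] + "  inode=" + cols[9]);
            }
        }
    }
}
```

The creation time itself still comes from the timestamp of the matching fd entry under /proc/<pid>/fd, as shown with ls above.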

The DB was restarted, so how can the old connection still exist?

Seeing this creation time, my first reaction was: did the DBA really kill the connections? I asked whether it could be a problem with their statistics. The DBA replied that they had restarted the database, so how could any old connection possibly survive? Then he checked the creation time of the DB process:

```
ps -eo lstart,cmd | grep <db process name>
Mar 19 17:52:32 2021  <db process>
```

Judging from the process start time, the database really was restarted on March 19. And yet this ghostly connection, created on January 29, belonged to a process started on March 19. On the face of it, that makes no sense.

But since that is what Linux reports (and for now I trust it), there must be some other, less obvious mechanism at work.

The child inherits the connection from the parent

After a little thought, I found one possibility: the parent process creates the connection first, then forks; the child inherits the connection's file descriptor; the parent exits, leaving only the child. In that case, the connection can appear to have been created before the process that now holds it was started.



To verify this, I wrote a simple C program and confirmed that this is indeed what happens. Example code:

```c
/* main.c */
......
int main(int argc, char* argv[]) {
    ......
    if ((client_fd = accept(sockfd, (struct sockaddr*)&remote_addr, &sin_size)) == -1) {
        printf("accept error!\n");
    }
    printf("Received a connection\n");
    // Sleep for two minutes so the fork happens well after the connection was created
    sleep(2 * 60);
    if (!fork()) {
        // The child process keeps the inherited connection and just sleeps
        while (1) {
            sleep(100000);
        }
    } else {
        // The parent process closes its copy of the connection and exits
        close(client_fd);
    }
    return 0;
}
```

I asked the DBA about their restart procedure: they do not kill -9 every process, but follow the standard database restart steps. (kill -9'ing all processes at once would indeed close their connections, but nobody dares use such a brutal operation on a database.)

If the commercial database we use relies on a mechanism like the one above, the phenomenon we observed would follow. However, since the database itself has lost the session, the connection is already dead as far as the database is concerned (which is also why it no longer shows up in the database's statistics). And since the connection is still sitting there untouched on the application side, it can never have been used again after the switchover; any attempt to use it would certainly have failed.

Business code logic

If the reasoning above holds, then on the machines still holding the old connection, no request was ever executed over it, and so no error was reported. Following this logic, only a machine where the business actually hit an error would end up with a new, working connection, because the application uses the Druid data source, which discards a connection after it fails and creates a fresh one in its place. Sure enough, the machines that had been steadily reporting errors were connected to the new database (it was the developers discovering that the other machines were still connected to the old one that brought me in), and those connections were created on March 29. On the machines without application errors, the connections still hung on the old database; I picked a few to check, and every one of them had been created on January 29.

But why are they still reporting errors?

Since the connection was now fine (pointing to the new database), why were errors still being reported? Was the business code written so badly that once it failed, it failed forever? So I pulled up the application's source code. It uses a database connection to obtain sequence numbers. A careful read of the code showed that it did not handle the database error properly: after the error it fell into a problematic branch and never fetched sequences from the database again (so the business logic never recovered).

Why do only a few machines report errors?

Because the application caches sequence numbers in memory in large ranges, SQL is only executed once the cached range is exhausted. So only the machines whose in-memory range ran out ever reached the problematic code branch.
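
To make this failure mode concrete, here is a minimal sketch of a range-cached sequence allocator with the kind of broken error branch described above. The class, method, and SQL are hypothetical (the article does not show the real business code); only the overall shape matches what was described: values are handed out from an in-memory range, the database is touched only on refill, and a single failed refill poisons the allocator forever.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.sql.DataSource;

// Hypothetical sketch of the failure mode: sequence values are cached in memory as a
// range, so SQL only runs when the range is exhausted -- and one failed refill sets a
// flag that is never cleared.
public class RangeSequence {
    private final DataSource dataSource;
    private long current;
    private long max;                 // end of the cached range
    private volatile boolean broken;  // the "problematic branch": once set, never reset

    public RangeSequence(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public synchronized long next() {
        if (broken) {
            // Bug: after one failed refill we give up forever, even though a retry on a
            // fresh connection (now pointing at the new master) would succeed.
            throw new IllegalStateException("sequence unavailable");
        }
        if (current >= max) {
            try {
                refillFromDb(1000);   // only here does SQL actually run
            } catch (SQLException e) {
                broken = true;        // the error from the dead connection lands here
                throw new IllegalStateException("sequence refill failed", e);
            }
        }
        return current++;
    }

    private void refillFromDb(int step) throws SQLException {
        // Hypothetical SQL: fetch the next block of ids in a single round trip.
        try (Connection conn = dataSource.getConnection();
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT seq.NEXTVAL FROM dual")) {
            rs.next();
            long base = rs.getLong(1) * step;
            current = base;
            max = base + step;
        }
    }
}
```

Machines whose cached range had not yet run out never executed any SQL, so they never touched the dead connection and never errored; machines that did run out errored once, got a fresh connection from the pool, but then stayed stuck in the broken branch.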

Why didn't a heartbeat detect this?

At this point you might be wondering: doesn't the connection pool run a heartbeat check? Unfortunately, no. The application uses the Druid data source, and the Druid version it runs does not do periodic heartbeat detection of idle connections.
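
For context, here is a sketch of the kind of Alibaba Druid configuration that lets the pool validate and evict idle connections on its own. This is not the application's actual configuration; the property names belong to DruidDataSource, and keepAlive in particular only exists in newer Druid releases than the one described here.

```java
import com.alibaba.druid.pool.DruidDataSource;

// Sketch: Druid settings that make the pool check idle connections by itself.
// Not the application's real configuration; adjust the validation query to the database.
public class DataSourceConfig {
    public static DruidDataSource buildDataSource(String url, String user, String password) {
        DruidDataSource ds = new DruidDataSource();
        ds.setUrl(url);
        ds.setUsername(user);
        ds.setPassword(password);

        // Validate idle connections with a lightweight query before handing them out.
        ds.setValidationQuery("SELECT 1 FROM dual");
        ds.setTestWhileIdle(true);
        // Run the evictor periodically so idle (possibly dead) connections get checked.
        ds.setTimeBetweenEvictionRunsMillis(60_000);
        // Keep-alive pings for long-idle connections (available in newer Druid versions).
        ds.setKeepAlive(true);
        return ds;
    }
}
```

Even with validation enabled there is still a window in which a dead connection can be handed out, so the application-level error handling still has to be able to recover, which is the real lesson of this incident.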

Did the master/slave switch succeed?

The master/slave switchover itself was certainly successful, as shown by the fact that the other applications worked fine after cutting over. A brief loss of database traffic during a master/slave switchover is expected. But if an application does not recover after the switch, you need to take a closer look at the application code itself.

Conclusion

Database master/slave switchover is a frequent and important operation, and handling it well is a necessary condition for business continuity. It is not only the DBAs' responsibility; we also need to write robust code at the application layer to keep production stable.