After two days of nonstop effort, we finally recovered the production server data that had been deleted by mistake. I am recording the accident and how we solved it here, both as a warning to myself and as a reminder to others not to make the same mistake. I also hope that anyone who runs into a similar problem can find some inspiration here.


Background of the accident

I asked a female colleague to install Oracle on a production server. She studied the procedure as she went and installed Oracle, but felt that the installation had not gone correctly, so she decided to uninstall it and reinstall.

To delete the Oracle installation directory, she ran the following command:

rm -rf $ORACLE_BASE/*

If ORACLE_BASE is not set, the command becomes:

rm -rf /*
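One way to guard against this failure mode is to make the script refuse an empty or root path before anything is deleted. The following is only a sketch, not what was run on the server: safe_clean_dir is a hypothetical helper name, and the only shell feature it relies on is the standard ${var:?} expansion, which aborts when the variable is unset or empty.

```shell
#!/bin/sh
# Hypothetical helper (not from the original incident): delete the contents
# of a directory only if the path is non-empty, exists, and is not "/".
safe_clean_dir() {
    dir="${1:-}"
    if [ -z "$dir" ]; then
        echo "refusing: directory argument is empty" >&2
        return 1
    fi
    if [ ! -d "$dir" ]; then
        echo "refusing: $dir is not a directory" >&2
        return 1
    fi
    # Resolve symlinks and relative parts to the physical path.
    real=$(cd "$dir" && pwd -P) || return 1
    case "$real" in
        /|//)
            echo "refusing: target resolves to the root directory" >&2
            return 1
            ;;
    esac
    # ${real:?} aborts instead of expanding to "" if the variable is empty.
    rm -rf -- "${real:?}"/*
}
```

With this helper, the cleanup would be written as safe_clean_dir "$ORACLE_BASE", which fails with an error message instead of expanding to /* when ORACLE_BASE is unset.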

Wait, she was logged in as root. As a result, every file on the disk was deleted, including the Tomcat applications, the MySQL database, and so on...

Wasn't the MySQL database running? Can Linux delete files that are in use? Either way, everything was deleted. In the end only a single Tomcat log file was left, presumably because it was so large that its deletion had not finished yet.


I saw the remorse in her eyes. I was the one who had assigned her this task, without explaining what was at stake and without giving her any training, so the responsibility had to be mine alone. How could I let her carry the blame?

I called the machine room and had the disk attached to another server. Checking over SSH confirmed that all the files were gone. This server had been running a customer's production system for half a year, so we had to recover it as quickly as possible.

The backup file turned out to be only 1 KB, containing just a few lines of familiar mysqldump comments. The most recent usable backup was from December 2013.

I remembered a story a leader once told: when a production system went down, every backup turned out to have a problem. The burned CDs were scratched and the tape drive was broken (he is an industry veteran, presumably from the days when CDs were used for backups). I never expected that story to come true for me. What now?


Once he understood the situation, the department leader prepared for the worst with Plan B: he would personally take a team, including product manager AA, to the customer's city on Sunday and talk to the customer's leadership on Monday, while BB and CC would go to the account administrator and try to win the customer over...

Lifesaver: ext3grep

I hurried online to look for ways to recover deleted data and found a tool called ext3grep, which can recover files deleted with rm -rf. Our disk was formatted as ext3, and there were many success stories online.

That lit a glimmer of hope. I quickly unmounted the disk to prevent new writes from overwriting the sectors of the deleted files, then downloaded and installed ext3grep.

Run the command that scans for file names:

ext3grep /dev/vgdata/LogVol00 --dump-names

It printed all the deleted files with their paths. I was overjoyed: the files were all still there, so we would not need Plan B.

The tool cannot restore files by directory. The only bulk option is the restore-all command:

ext3grep /dev/vgdata/LogVol00 --restore-all

The result: there was not enough free disk space for a full restore, so I could only restore files one at a time. I tried several files, and some succeeded while others failed:

ext3grep /dev/vgdata/LogVol00 --restore-file var/lib/mysql/aqsh/tb_b_attench.MYD

My heart sank. Had the sectors of the deleted files already been overwritten? The odds of a full recovery did not look good, so I would recover whatever I could. Perhaps the important data would turn out to be in the MYD files that could still be restored.

Redirect all the file names to a single file:

ext3grep /dev/vgdata/LogVol00 --dump-names >/usr/allnames.txt

Filter the file names that belong to the MySQL database into mysqltbname.txt.
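The filtering step could look roughly like the sketch below. It assumes that allnames.txt contains one relative path per line, in the same form that is passed to --restore-file; the real ext3grep output may need extra cleanup first. filter_mysql_files is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the MyISAM table files (.frm, .MYD, .MYI) that
# live under one database directory, one path per line, without duplicates.
# Usage: filter_mysql_files NAMES_FILE DB_DIR
filter_mysql_files() {
    names_file="$1"
    db_dir="${2%/}"   # drop a trailing slash if present
    # index() == 1 is a literal prefix match, so dots in the path are not
    # treated as regex wildcards.
    awk -v prefix="$db_dir/" '
        index($0, prefix) == 1 && $0 ~ /\.(frm|MYD|MYI)$/ { print }
    ' "$names_file" | LC_ALL=C sort -u
}
```

For the paths used in this article, the call would be filter_mysql_files /usr/allnames.txt var/lib/mysql/aqsh > mysqltbname.txt.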

Write a script to restore the files:

while read -r LINE
do
    echo "begin to restore file $LINE"
    # Restore one file per line of the list
    ext3grep /dev/vgdata/LogVol00 --restore-file "$LINE"
    if [ $? -ne 0 ]; then
        # Report the failure but continue with the remaining files
        echo "restore failed: $LINE"
        # exit 1
    fi
done < ./mysqltbname.txt


The script ran for about 20 minutes and recovered more than 40 files. That was not enough: we had nearly 100 tables, each with three files (FRM, MYD, and MYI), so there should have been around 300 files!
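To see which tables came back incomplete, the recovered directory can be checked for the three files per table. This is a sketch under assumptions: report_incomplete_tables is a hypothetical helper, and the directory passed to it is wherever the recovery tool wrote its output (ext3grep is assumed here to write under RESTORED_FILES in the current directory).

```shell
#!/bin/sh
# Hypothetical helper: for every table that has at least one recovered file
# in DIR, report which of the .frm/.MYD/.MYI files are missing.
# Usage: report_incomplete_tables DIR
report_incomplete_tables() {
    dir="${1%/}"
    # Collect table names from whatever files were recovered.
    for f in "$dir"/*.frm "$dir"/*.MYD "$dir"/*.MYI; do
        [ -e "$f" ] || continue      # skip unmatched glob patterns
        base=${f##*/}
        echo "${base%.*}"
    done | LC_ALL=C sort -u | while read -r table; do
        missing=""
        for ext in frm MYD MYI; do
            [ -e "$dir/$table.$ext" ] || missing="$missing $ext"
        done
        [ -z "$missing" ] || echo "$table: missing$missing"
    done
}
```

A call such as report_incomplete_tables RESTORED_FILES/var/lib/mysql/aqsh lists only tables with at least one recovered file. Tables for which nothing came back do not appear, so the output should be compared against the full table list as well.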


I attached the recovered files to an existing database, set the file permissions to 777, and restarted MySQL. Part of the data was back, but the customer's important attendance data and mobile report data (reportedly the basis for the customer's employee performance reviews) had not been recovered.

What now? Along the way I tried another tool, extundelete. Its syntax is basically the same as ext3grep's and the principle should be the same too, but it is said to be able to restore by directory.

Well, give it a try:

extundelete /dev/vgdata/LogVol00 --restore-directory var/lib/mysql/aqsh

As expected, nothing could be recovered! Those files had been destroyed. I reported to my boss that we should go ahead with Plan B... and went home from work feeling helpless. (It was the weekend, so I went back to rest and think it over.)

Brainwave: Binlog


The next morning I woke up early (with this weighing on my mind), grabbed my laptop, and went to the company. (This weekend was a write-off anyway. Getting away without criticism, a formal report, a fine, or dismissal would be good enough, so who cared about a weekend?)

I kept running ext3grep and extundelete, since those were the only two options, and put a copy of the system on a test server to see whether the data could be pieced together.

Restore the mysqldump backup to the test server.

Wait, wait, isn't there a binlog? Our service requires binlog to be enabled. Maybe we can recover the data from the binlog.

Search the dumped file names for binlog files. There are three:

  • mysql-bin.000001
  • mysql-bin.000009
  • mysql-bin.000010

Restore mysql-bin.000001:

ext3grep /dev/vgdata/LogVol00 --restore-file var/lib/mysql/mysql-bin.000001

Surprisingly, it failed... I then tried restoring the most recent file, mysql-bin.000010, and that one was restored.

Copy it to the test server with scp and replay the binlog:

mysqlbinlog /usr/mysql-bin.000010 | mysql -uroot -p

I entered the password and it hung (a good sign). After a long wait it finally finished. I opened the app and, thank CCTV, thank MTV, the data was back!
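When more than one binlog file is recovered, the files should be replayed in sequence order, and the MySQL documentation recommends passing them to a single mysqlbinlog invocation rather than piping each file separately. The sketch below shows one way to do that; list_binlogs and replay_binlogs are hypothetical helper names, the mysql-bin prefix is taken from the file names above, and the paths are assumed to contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: print the binlog files in DIR in sequence order.
# The glob requires a digit after the dot, so mysql-bin.index is skipped.
# Usage: list_binlogs DIR [PREFIX]
list_binlogs() {
    dir="${1%/}"
    prefix="${2:-mysql-bin}"
    for f in "$dir/$prefix".[0-9]*; do
        [ -f "$f" ] || continue
        echo "$f"
    done | LC_ALL=C sort   # zero-padded numbers sort correctly as text
}

# Hypothetical helper: replay every binlog in DIR through one mysqlbinlog
# process. Defined here for illustration; it needs a running MySQL server.
# Usage: replay_binlogs DIR
replay_binlogs() {
    # Word splitting of the file list is intentional here.
    mysqlbinlog $(list_binlogs "$1") | mysql -uroot -p
}
```

Running list_binlogs /usr first is a cheap way to confirm the order before calling replay_binlogs /usr on the test server.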

Afterword


We were very lucky to recover the data, but the process was nerve-racking. I was also afraid of the consequences of the mistake and of dragging my colleagues and leaders into shared liability.

I hope to remember this incident and never make the same mistake again. My reflections on the accident are as follows:

  • When I assigned my colleague to maintain the server, I did not explain the seriousness of the task in advance, and she did not treat it with enough care. Management and process were both chaotic. On a live production system, any change must be planned before it is made.
  • The automatic backup was broken and nobody checked it. The offline backup job downloaded a 1 KB file from the server every time and nobody ever noticed. Responsibilities at work need to be clearly assigned.
  • After the accident, the data loss was not detected promptly, so new data was written to the disk and some files became unrecoverable. We need application monitoring that sends an SMS alert to the person responsible as soon as the service becomes abnormal.
  • One more, added based on the comments: root should not be used for this kind of work. Set up users with different privilege levels on the server.
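The backup problem in particular is cheap to catch automatically. The sketch below is a minimal check that a dump file is not suspiciously small and ends with the "-- Dump completed" comment that mysqldump writes by default. check_dump is a hypothetical helper, the 10240-byte default threshold is an arbitrary assumption, and the trailer check does not apply if comments were disabled when the dump was taken.

```shell
#!/bin/sh
# Hypothetical helper: sanity-check a mysqldump output file.
# Usage: check_dump FILE [MIN_BYTES]
check_dump() {
    file="$1"
    min_bytes="${2:-10240}"   # assumed threshold; tune to the real data size
    if [ ! -f "$file" ]; then
        echo "FAIL: $file does not exist" >&2
        return 1
    fi
    size=$(wc -c < "$file")
    size=$((size + 0))        # strip padding that some wc versions add
    if [ "$size" -lt "$min_bytes" ]; then
        echo "FAIL: $file is only $size bytes" >&2
        return 1
    fi
    # mysqldump ends a successful dump with "-- Dump completed ..." unless
    # comments are turned off.
    if ! tail -n 5 "$file" | grep -q -- '-- Dump completed'; then
        echo "FAIL: $file has no 'Dump completed' trailer" >&2
        return 1
    fi
    echo "OK: $file ($size bytes)"
}
```

Run after each backup job, a non-zero exit status from check_dump can feed whatever alerting is already in place.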

During this accident, several colleagues who had nothing to do with the project or the accident came to help, looking up material and helping with tests. One colleague was even helping with data recovery tests at 1:00 in the morning.

Meanwhile the product manager, despite the enormous pressure of facing the customer, did not panic or blame the developers and operators, but let everyone calm down and think of a solution.

The department leaders also took the initiative to help us find solutions, worked overtime with us on testing, and tracked progress in real time. Through everyone's joint effort, the matter finally came to a reasonably successful conclusion. Next, we will hold a group review on Monday morning to sum up the lessons learned and do our best to avoid such accidents.

Links to tools used in this article:

ext3grep: https://code.google.com/p/ext3grep/

Compiling and installing it requires quite a few dependency packages; you can search online for installation instructions. Unfortunately, the HOWTO written by the author is blocked here. I have downloaded the PDF of the HOWTO, and reading it will give you a deeper understanding of the Linux file system.


The tool has a bug: after the following error it stops and does not continue:

ext3grep: init_directories.cc:534: void init_directories(): Assertion `lost_plus_found_directory_iter != all_directories.end()' failed.


Because this error makes the recovery fail, the author released a patch (download address: patch download). I don't understand why the author didn't include this patch in a new release.

extundelete: its function is similar to ext3grep's, and the principle should be similar as well. It claims to be able to restore by directory, but I could not get that to work here.

Have you ever deleted files by mistake? How did you deal with it? Feel free to share your tips in the comments.