From the WeChat public account: New World Grocery Store

What is a code monkey's greatest sorrow?

In both of these accidents, the greatest sorrow was calling the DBA in the middle of the night to beg for help recovering the data. A programmer and a PM can battle back and forth as equals, but standing before the DBA I was purely a supplicant, entirely at the DBA's mercy!

To avoid having to grovel before the DBA again in the future, I am writing down the two accidents I experienced as a reminder to myself.

First data rollback

The PM is the producer of requirements and the programmer is their consumer: a classic producer-consumer model. The root cause of this accident was therefore a requirement from the PM, so the author concludes that as long as the PM stops producing requirements, there can be no more accidents.

Alas! Wake up! Stop dreaming!

Returning to the accident itself, the author will first describe the background at the time.

The PM had a large batch of data that urgently needed updating in production. How urgent? Urgent enough to bypass QA validation altogether: test directly in production with a small amount of data, then update all the remaining data once that sample passed.

Given the business scenario at the author's company, the author carried out the data update in the following steps.

1. Use mysqldump to back up the data to be updated.

# Dump only the rows about to be updated. --replace emits REPLACE INTO
# statements (so re-running the file restores the old values by primary key),
# --single-transaction takes a consistent snapshot, and -t skips CREATE TABLE.
mysqldump --replace -f --single-transaction -t \
  -h hostname -u user -P 3936 -p dbname tablename \
  --where="Id in (1, 2, 3)" > tablename.sql
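
For reference, the resulting tablename.sql contains statements of roughly this form (values illustrative). Because REPLACE INTO overwrites rows that share a primary key, re-running the file restores the backed-up rows to their pre-update values:

-- Illustrative dump content: re-applying it rolls the listed rows back
REPLACE INTO `tablename` VALUES (1, 'old value'), (2, 'old value'), (3, 'old value');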

2. Write a script that directly calls the existing online data-update interface (the author tested it in the test environment during development).

3. First update a small amount of data online and refresh the cache for that portion of the data, then have the PM verify the sample (a cache-refresh sketch follows this list).

4. After the PM confirms that this sample is correct, update the remaining data online.
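
The cache refresh in step 3 amounted to dropping the cached entries for the updated rows so the next read would rebuild them from the database. A minimal sketch, assuming for illustration a Redis cache with a hypothetical "tablename:<Id>" key scheme (the article does not say which cache was actually used):

# Hypothetical Redis cache and key scheme; deleting the keys forces the
# next read to rebuild the cache from the freshly updated rows
redis-cli -h cachehost DEL "tablename:1" "tablename:2" "tablename:3"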

At first glance the steps above don't look like a big deal, but the actual result was a slap in the face. Below, the author digs into what actually caused this accident.

1. The update interface's logic was unclear, which caused errors in the online data update.

The interface belongs to an old service with very little documentation. Because the author had not sorted out all of its logic, some data parameters were passed incorrectly when calling it, which led to errors in the online data update.

2. The update interface's implementation had a problem: after the service was invoked, rows in an associated table were deleted and had to be restored.

If it had only been the first problem, the author's own backup of REPLACE INTO statements would have done the job, but clearly there was more to it than that. When the accident happened, the author combed through the service logic again and found that the interface also updated a table associated with the main table, and that its update logic first deleted the existing associated rows and then inserted new ones. The kicker: the implementer of the interface had packed all request parameters into a single associative array and passed that same array to every function. With every business function receiving identical parameters, it was very hard, on a first pass through the logic, to work out what data each function actually needed.

The associated rows had been deleted and the author had no backup of them, so in the end there was nothing for it but to grovel before the DBA to roll the data back.
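
In hindsight, one extra backup would have made this recoverable without the DBA: dump the associated table with the same trick as the main table. A sketch, where tablename_assoc and main_id are hypothetical names for the associated table and its foreign-key column:

# Hypothetical associated table and foreign-key column; appends its rows'
# REPLACE INTO statements to the same backup file
mysqldump --replace -f --single-transaction -t \
  -h hostname -u user -P 3936 -p dbname tablename_assoc \
  --where="main_id in (1, 2, 3)" >> tablename.sql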

Warning ⚠️: sloppy code, two lines of a code monkey's tears!

3. Testing directly in production without QA safeguards.

Although the author ran a simple test in the test environment, a programmer cannot devote enough energy to do QA's job thoroughly, and the PM was clearly not professional at it either. With quality assurance missing as a link in the chain, the errors spread much further online than they should have.

4. The changes were not verified in a cache-free environment.

At first, the PM's verification of the small sample turned up no problems, but once all the data had been updated and the cache was gradually rebuilt, the data errors and deletions began to surface. The reason: the author had refreshed the cache only for the data the PM wanted to verify, not for the associated data, so the problems emerged gradually as that part of the cache expired on its own.
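
Verifying straight against the database, with no cache in the path, would have surfaced the damage immediately. A minimal sketch, again using the hypothetical associated-table names from above:

-- Check the main rows directly in the database, bypassing the cache
SELECT * FROM tablename WHERE Id IN (1, 2, 3);
-- Hypothetical associated table: this is where the deletions would show up
SELECT * FROM tablename_assoc WHERE main_id IN (1, 2, 3);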

Aftermath

After the DBA rolled back the data, the batch update continued! The merciless PM had the author fix the problem in the middle of the night and carry on with verification. The one consolation was that only a small amount of data was updated that night, with the rest left for the next day. In the end the author fixed the problem and updated all the data successfully.

Second data rollback

The PM once again raised a batch data-update requirement, and this time the author was full of confidence; after all, it was almost identical to the first one. The only difference was that a portion of the data specified by the PM must not be updated (the data the PM provided for that portion was faulty, so it was to be left alone).

But how can anyone never make mistakes? The author forgot that this portion did not need updating, and in the end both the correct and the faulty data were pushed online. I never expected that after living through the first data rollback I would run into a second one. My mental state was truly shattered.

What was done was done; all the author could do was fix it. Based on the actual business scenario, there were two candidate data recovery solutions:

Solution 1:

1. First determine, by data ID, which rows need repair (the author had logged the data IDs when running the update script).

2. Parse the rows to be restored out of the backup SQL and concatenate them into a new recovery SQL file.

3. Call the service to delete the newly added rows (while modifying data, the update interface also inserts rows into other associated tables).

4. Run the SQL generated in step 2 to restore the data (steps 2–4 are sketched just after this list).
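
For concreteness, a minimal sketch of steps 2–4, reusing the hypothetical tablename_assoc / main_id names from earlier and assuming the statements parsed out of the backup in step 2 were collected into a file named recovery.sql:

-- Step 3 (done through the service in practice; equivalent effect in SQL):
-- delete the associated rows that the update interface newly inserted
DELETE FROM tablename_assoc WHERE main_id IN (1, 2, 3);
-- Step 4: re-apply the filtered REPLACE INTO statements, which overwrite
-- the bad rows by primary key, e.g.
--   mysql -h hostname -u user -P 3936 -p dbname < recovery.sql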

Solution 2:

Ask the DBA for help recovering the data.

Solution 1 would let the author restore the data independently while preserving the rows that had been updated correctly, but the procedure is fiddly and the recovery itself might introduce new problems. So in the end, shamelessly, the author went to the DBA to restore the data once more.

After restoring the data, the DBA sent the author the following SQL, which was used to recover the online data:

-- Swap the intact backup table into place: park the damaged table under a
-- new name (keeping it around for inspection), then rename the backup to
-- take over as the live table
alter table table_a rename to table_a_bk_2;
alter table table_a_bk rename to table_a;

Boy, the DBA's hint could hardly have been more obvious, so I quietly emailed an application for an account with DDL permissions. Looking back, it is now very clear: for any future batch update of online data, backing up the whole table beats using a read-only account to dump REPLACE INTO statements.

-- Full table backup SQL statement
CREATE TABLE table_a_bk AS SELECT * FROM table_a;
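
One caveat worth noting: in MySQL, CREATE TABLE ... AS SELECT copies the data but not the indexes or primary key, so a backup made this way cannot simply be renamed into service the way the DBA did above. A structure-preserving variant:

-- LIKE copies the full table structure, including indexes and keys
CREATE TABLE table_a_bk LIKE table_a;
INSERT INTO table_a_bk SELECT * FROM table_a;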

Conclusion

Below are some lessons the author took away from these two accidents, offered in the hope that they can serve as a reference.

1. Code logic should be clear, and function parameters should have meaningful names. Never make one catch-all parameter carry every piece of data, and comment your code as generously as you can.

2. Hold online data in awe, and be clear about the business logic before operating on any data.

3. Before operating on online data, try to pre-verify the changes in a cache-free environment.

4. Everyone makes mistakes, which is exactly why QA is needed as a second layer of assurance.

5. Both of the author's accidents grew out of rushed "urgent" requirements. Whenever possible, always operate on data through the normal process.

6. Backups really matter! After these two accidents, the author believes the full-table backup scheme described above is the more reasonable choice and the easiest to recover from.

Finally, I sincerely hope this article can be of some help to every reader.