Many friends have asked: without a “double everything” capacity expansion, can we still achieve a smooth data migration that does not affect the service?

What scenarios are applicable?

Internet systems have many business scenarios with “large data volume, high concurrency, and high business complexity”. The typical layered system architecture is as follows:

(1) Upstream is the business layer (biz), which implements the personalized business logic;

(2) The midstream is the service layer, which encapsulates data access;

(3) Downstream is the data layer (DB), which stores the persistent business data;

The advantage of this layered architecture is that the service layer shields callers from the complexity of the downstream data layer: caching details, database and table sharding, and the choice of storage engine do not need to be exposed. Upstream callers only see a convenient RPC interface, so when the data layer changes, none of the callers need to upgrade; only the service layer does.
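
As a minimal sketch (the names UserService and get_user and the dict-based storage are hypothetical stand-ins), the caller only depends on the service interface, while the caching and sharding rules stay hidden inside it:

```python
# Hypothetical service-layer facade: callers only see get_user(); the caching
# and sharding rules are internal details that can change without upgrading callers.
class UserService:
    def __init__(self, cache: dict, shards: list):
        self._cache = cache        # stands in for a real cache cluster
        self._shards = shards      # one dict per underlying database

    def get_user(self, uid: int):
        if uid in self._cache:
            return self._cache[uid]
        shard = self._shards[uid % len(self._shards)]   # sharding rule is internal
        user = shard.get(uid)
        self._cache[uid] = user
        return user
```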

This kind of architecture often faces the following requirements:

(1) Changes to the underlying table structure: with a large volume of data, some columns need to be added, some removed, and some modified;

(2) Changes to the number of shards: as the data volume keeps growing, the number of underlying databases needs to increase, and not necessarily by doubling (for example, from 2 databases to 3);

(3) Changes to the underlying storage: the underlying storage engine is switched from one database to another (for example, from MongoDB to MySQL);

All of these requirements involve data migration. How to migrate the data smoothly, keep the migration from stopping the service, and ensure that the system serves continuously are the questions discussed in this article.

Scheme 1: Shutdown scheme

Before discussing smooth migration schemes, let’s first look at the non-smooth, shutdown-based migration scheme, which consists of three main steps.

Step 1: Post a notice along the lines of “to provide better service to our users, the system will be down for maintenance from 0:00 to 4:00”, and shut the service down during that window, when the system has little or no traffic.

Step 2: After the shutdown, run an offline data migration tool to move the data. A different tool is developed for each of the three requirements described above (a sketch of the resharding case follows this list):

(1) Table structure change: develop a tool that copies data from the old tables into the new tables;

(2) Shard count change: develop a tool that redistributes data from the 2 old databases into the 3 new databases;

(3) Storage change: develop a tool that imports data from MongoDB into MySQL;
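
As a rough illustration of the resharding case, here is a minimal sketch in which dicts keyed by primary key stand in for the real databases (the names and the uid-modulo rule are assumptions for the example):

```python
# Hypothetical offline resharding sketch: redistribute rows from the 2 old
# databases into 3 new ones by re-applying the hash rule uid % len(new_dbs).
# The service is assumed to be shut down, so the old databases are not written to.
def reshard(old_dbs: list, new_dbs: list) -> None:
    for old_db in old_dbs:
        for uid, row in old_db.items():
            new_dbs[uid % len(new_dbs)][uid] = row

# Example: reshard([{1: "a", 3: "c"}, {2: "b", 4: "d"}], [{}, {}, {}])
```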

Step 3: Restore the service and switch the traffic to the new database. Depending on the requirement, the service may or may not need to be upgraded:

(1) Table structure change: the service needs to be upgraded to access the new tables;

(2) Shard count change: the service does not need to be upgraded; only the database routing configuration needs to change;

(3) Storage change: the service needs to be upgraded to access the new storage;

Overall, the shutdown scheme is straightforward and simple, but it hurts service availability. Many game companies adopt similar schemes for server upgrades, region splits, and server merges.

Besides hurting availability, this scheme has another drawback: the upgrade must be completed within the announced window, which puts heavy pressure on the development, testing, and operations engineers. If a problem such as data inconsistency appears, it must be fixed within the window, otherwise the only option is to roll back. Experience shows that the more pressure people are under, the more likely they are to make mistakes, and this drawback is, to some extent, fatal.

In any case, the shutdown scheme is not the focus of today’s discussion; let’s look at the common smooth data migration schemes.

Scheme 2: Log-chasing scheme

The log-chasing scheme is a highly available, smooth migration scheme. It consists of five main steps.

Before data migration, upstream service applications access the old data through the old service.

Step 1: Upgrade the service to write a log entry for every “data modification on the old library” (modification here means insert, delete, and update). The log does not need to record the full data; it mainly records:

(1) which database was modified;

(2) which table was modified;

(3) the unique primary key of the modified row;

There is no need to record the new values or the detailed data format. The advantage is that no matter how the business details change, the log format stays fixed, which keeps the solution general.
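
A minimal sketch of such a change log, assuming JSON lines and the hypothetical names below (log_change, old_db_changes.log):

```python
import json
import time

# Hypothetical write-log hook: the upgraded service calls log_change() right
# after every insert/delete/update on the old library. Only "which database,
# which table, which primary key" is recorded, so the format never depends on
# the business schema.
def log_change(db_name: str, table: str, pk, logfile: str = "old_db_changes.log") -> None:
    record = {"ts": time.time(), "db": db_name, "table": table, "pk": pk}
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")

# e.g. in the service's update path, right after writing the old library:
#   log_change("user_db", "t_user", uid)
```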

This service upgrade carries little risk:

(1) Write interfaces are few, so the change is small;

(2) The upgrade only adds some logging and has no impact on business functionality;

Step 2: Develop a data migration tool and migrate the data. This tool, like the offline migration tool above, copies data from the old library to the new library.

This small tool carries little risk:

(1) Throughout the process, the old library keeps serving online traffic;

(2) The tool itself is simple;

(3) If a problem is found at any time, the data in the new library can be wiped and the migration restarted;

(4) The migration can be rate-limited and run slowly, so the engineers are under no time pressure;

After the data migration is complete, can we switch to the new library to provide services?

The answer is no. While the migration is running, the old library is still serving online traffic and its data may change at any moment. Those changes are not reflected in the new library, so the old and new libraries are inconsistent; we cannot switch over yet and must first catch the data up.

What has changed?

Precisely the data recorded in the log from Step 1.

Step 3: Develop a small tool that reads the log and migrates data, to make up for the differences produced while Step 2 was running. What this tool needs to do is:

(1) read the log to find out which database, which table, and which primary key changed;

(2) read the record for that primary key from the old library;

(3) replace the record for that primary key in the new library;

In every case, the principle is that the old library’s data prevails.
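
A minimal sketch of this replay tool, reusing the log format sketched in Step 1; dicts keyed by (table, primary key) stand in for the real databases:

```python
import json

# Hypothetical log-replay sketch: for every change logged in Step 1, re-read
# the row from the old library and overwrite it in the new library. Because
# the old library always wins, replaying the same entry twice is harmless.
def replay_log(logfile: str, old_db: dict, new_db: dict) -> None:
    with open(logfile) as f:
        for line in f:
            change = json.loads(line)
            key = (change["table"], change["pk"])
            if key in old_db:
                new_db[key] = old_db[key]    # insert or update: copy the old value
            else:
                new_db.pop(key, None)        # delete: remove the row from the new library
```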

This small tool also carries little risk:

(1) Throughout the process, the old library keeps serving online traffic;

(2) The tool itself is simple;

(3) If a problem is found at any time, the process can restart from Step 2;

(4) The log replay can be rate-limited and run slowly, so the engineers are under no time pressure;

After the log has been replayed, can we switch to the new library?

The answer is still no. While the log is being replayed, the old library may change again, producing new inconsistencies, so we still cannot switch over. As you can see, the log-replay catch-up program is essentially a while(1) loop, and bringing the new library level with the old one is a process of “infinite approximation”.

When will the data be exactly the same?

Step 4: While the log replay keeps catching the data up, develop a small data verification tool that compares the data in the old library with the data in the new library, until the two are completely consistent.
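
A minimal sketch of such a comparison, again with dicts keyed by (table, primary key) standing in for the real databases; a real tool would page through the tables in primary-key order with a rate limit:

```python
# Hypothetical verification sketch: report every row that is missing, different,
# or extra in the new library. An empty result means the libraries are consistent.
def verify(old_db: dict, new_db: dict) -> list:
    diffs = []
    for key, old_row in old_db.items():
        if new_db.get(key) != old_row:
            diffs.append(key)                # missing or mismatched in the new library
    for key in new_db.keys() - old_db.keys():
        diffs.append(key)                    # extra row in the new library
    return diffs
```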

The risk of this small tool is still minimal:

(1) Throughout the process, the old library keeps serving online traffic;

(2) The tool itself is simple;

(3) If a problem is found at any time, the process can restart from Step 2;

(4) The comparison can be rate-limited and run slowly, so the engineers are under no time pressure;

Step 5: Once the comparison shows the data is completely consistent, switch the traffic to the new library. The new library now serves the requests and the migration is complete.

If the data in Step 4 stays at, say, 99.9% consistent and never becomes fully consistent, that is normal. You can make the old library read-only for a few seconds, wait for the log replay program to completely catch up, and then switch the traffic.

At this point the upgrade is complete; the system kept serving online throughout the entire process, and service availability was never affected.

Scheme 3: Dual-write scheme

The dual-write scheme is also a highly available, smooth migration scheme. It consists of four main steps.

Before data migration, upstream service applications access the old data through the old service.

Step 1: Upgrade the service so that every “data modification on the old library” (insert, delete, update) is also performed on the new library; this is the “double write”. Concretely:

(1) insert into the old library and the new library at the same time;

(2) delete from the old library and the new library at the same time;

(3) update the old library and the new library at the same time;

Because the new library has no data yet, a double write may report a different number of affected rows on the new library than on the old one. This does not affect business functionality at all: until the traffic is switched, the old library is the one serving the business.
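
A minimal sketch of such a double-write wrapper (the names and the callable-based interface are assumptions for the example): the old library stays authoritative, and the write to the new library is best-effort.

```python
# Hypothetical double-write sketch: the business result always comes from the
# old library; the same operation is replayed on the new library, and any
# failure there is logged but never surfaced to the caller.
def double_write(old_db, new_db, op, *args):
    result = op(old_db, *args)      # old library: authoritative, errors propagate
    try:
        op(new_db, *args)           # new library: best effort, affected rows may differ
    except Exception as exc:
        print("new-library write failed, ignored:", exc)
    return result

# Example with dict stand-ins:
#   def upsert(db, uid, row): db[uid] = row
#   double_write(old_db, new_db, upsert, 42, {"name": "alice"})
```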

This service upgrade carries little risk:

(1) Write interfaces are few, so the change is small;

(2) Whether or not the write to the new library succeeds has no impact on business functionality;

Step 2: Develop a data migration tool and migrate the data. This is the third time the migration tool appears in this article; as before, it copies data from the old library to the new library.

This small tool carries little risk:

(1) Throughout the process, the old library keeps serving online traffic;

(2) The tool itself is simple;

(3) If a problem is found at any time, the data in the new library can be wiped and the migration restarted;

(4) The migration can be rate-limited and run slowly, so the engineers are under no time pressure;

After the data migration is complete, can we switch to the new library to provide services?

The answer is yes: because of the double writes in the previous step, the old and new libraries should, in theory, be exactly the same once the migration finishes.

But double writes keep happening on both libraries while the migration tool is running, so how do we argue that the data is fully consistent when the migration finishes?

Picture the data during the migration as follows:

(1) The left side is the data in the old library, and the right side is the data in the new library;

(2) The migration proceeds segment by segment, in primary-key order from min to max. Assume the migration has progressed to position now (a sketch of this segment-by-segment copy follows the list below), and consider the modifications that can happen while it runs:

  • If a double insert happens during the migration, the row is inserted into both the old and the new library, so consistency is not broken;
  • If a double delete happens during the migration, there are two cases:

Case 1: the deleted row falls in the range [min, now], i.e. it has already been migrated; it is deleted from both the old and the new library, so consistency is not broken;

Case 2: the deleted row falls in the range [now, max], i.e. it has not been migrated yet; the delete reports affected rows = 1 on the old library and affected rows = 0 on the new library, but since the row is now gone from the old library, the migration tool will never copy it to the new library, so consistency is still intact;

  • If a double update happens during the migration, it can be treated as a delete followed by an insert, so by the same reasoning the data stays consistent.
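
To make the segment-by-segment copy concrete, here is a minimal sketch with dicts keyed by a numeric primary key standing in for the real databases; “now” is the highest key migrated so far, and overwriting an already double-written row with the old-library value is harmless, which is what makes the copy idempotent:

```python
# Hypothetical segmented-migration sketch: copy rows in min -> max primary-key
# order, advancing the "now" watermark one batch at a time.
def migrate_segments(old_db: dict, new_db: dict, batch_size: int = 1000) -> None:
    now = 0                               # nothing migrated yet
    keys = sorted(old_db)                 # migrate in primary-key order
    for i in range(0, len(keys), batch_size):
        segment = keys[i:i + batch_size]
        for uid in segment:
            new_db[uid] = old_db[uid]     # idempotent copy of the segment
        now = segment[-1]                 # the range [min, now] has been migrated
```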

There is, however, one extreme corner case:

(1) the migration tool has just read a row X from the old library;

(2) before X is written to the new library, a double delete removes X from both the old and the new library;

(3) the migration tool then inserts X into the new library;

In this case the new library ends up with one extra row X that the old library no longer has.

This is exactly why, to guarantee consistency, data verification is still needed before switching the traffic.

Step 3: After the migration completes, run the data verification tool to compare the old library with the new library. The expectation is that they are completely consistent; if the extreme inconsistency described in Step 2 does occur, the old library’s data prevails.

The risk of this small tool is still minimal:

(1) Throughout the process, the old library keeps serving online traffic;

(2) The tool itself is simple;

(3) If a problem is found at any time, the process can restart from Step 2;

(4) The comparison can be rate-limited and run slowly, so the engineers are under no time pressure;

Step 4: Once the data is verified to be consistent, switch the traffic to the new library; the smooth migration is complete.

At this point the upgrade is complete; the system kept serving online throughout the entire process, and service availability was never affected.

Conclusion

For the many Internet business scenarios with “large data volume, high concurrency, and high business complexity”, data migration is needed when facing:

(1) changes to the underlying table structure;

(2) changes to the number of shards;

(3) changes to the underlying storage;

Two common schemes achieve “smooth data migration, no service interruption during the migration, and continuous system availability”.

The log-chasing scheme, in five steps:

(1) Upgrade the service and record the log of “data modification on the old library”;

(2) Develop a data migration tool for data migration;

(3) Develop a small tool to read logs to make up for data differences;

(4) Develop a data comparison tool to verify data consistency;

(5) Switch the traffic to the new library to complete the smooth migration;

The dual-write scheme, in four steps:

(1) Upgrade the service so that every “data modification on the old library” is double-written to the new library;

(2) Develop a data migration tool for data migration;

(3) Develop a data comparison tool to verify data consistency;

(4) Switch the traffic to the new library to complete the smooth migration;

The thought process is more important than the conclusion.