Preface

I ran into a former colleague the other day and we exchanged a few brief pleasantries. Looking at this friend, who grilled me pretty hard back then, I thought of the dual-system migration and merger we did, because he was the one who tested that project of mine. Looking back, the work still holds up, so I am writing it down to share it and to remember it.

Background

My previous company was a P2P finance company, and the industry was in full swing in Hangzhou at the time. Two things happened: first, our financial platform was connected to a bank depository system; second, the company acquired a local P2P company in Hangzhou (which, I have to say, was still fairly strong at the time). So the company decided to migrate the acquired users' data into our own system. That way the acquired platform would not need its own bank depository integration, the data would be easier to manage and maintain, and it would serve as a template for acquiring more platforms later (ambitious, I have to say; but ambition is no match for a red-headed government document, hahaha).

Analysis

In what follows the two platforms are referred to as system A and system B. System B is the one being migrated and needs to be merged into system A. For this kind of work, I think we should look at it from two angles:

  1. On the business side, the migrated data needs to be compatible with the existing system;
  2. On the technical side, the migration needs to keep the data consistent on both sides, minimize the impact on users, and ideally be a smooth migration.

There is actually not much to say about the business side, because every pair of systems differs, and how to stay compatible with the existing business depends on the actual situation. For example, the user tables of systems A and B will have somewhat different fields, some extra, some missing, perhaps even conceptual conflicts; which to keep is a choice you have to make. One special scenario does need attention: a user who has registered and has activity in both system A and system B. How to handle this also depends on the business. In our case we kept system B's record, but only as a marker rather than as the real user, so that the system A user could still be looked up from the system B record. Since this article is about the solution, the rest is mostly from a technical point of view. Designing the solution is the overall task, and first we need to be clear about what this migration should achieve:

  1. Data consistency;
  2. Migrate as smoothly as possible.

Data consistency is not that hard here, even though it is the most essential part. Making the migration smooth is the real challenge. When I hear "smooth migration", my first reaction is the garbage collectors I read about when studying the JVM, especially the classic metaphor of sweeping the floor while people keep spitting out melon-seed shells. This is exactly the same scene: a smooth migration means that systems A and B, especially system B, keep serving without stopping, so new data can be generated during the migration, that is, DML operations from user traffic.

Solution

This does not seem easy to do, so is there a solution? The one I could think of is similar to MySQL master-slave replication. Most companies use a MySQL master-slave architecture to ensure availability, and how does a slave catch up with the master after it crashes and is repaired? Right, the idea is backup + binlog. MySQL's binlog, which is very important to MySQL, records every DML operation on every row; for a single record, collecting the binlog entries that touch it keeps the data consistent across multiple copies, or put differently, replaying those binlog entries yields the record in its final state. This reminded me of a requirement I did at my first company: aggregating MySQL tables into a wide table by listening to the MySQL binlog, sending the events to Kafka, and consuming the Kafka messages to generate and maintain the derived table, keeping it consistent with the source tables. Unfortunately, there was no company-level engineering service for listening to MySQL binlogs, so we could not work directly at the MySQL level. The idea is still the same, though: incremental processing can be done at the code level.
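For reference, here is a minimal sketch of that binlog-to-Kafka idea, assuming some binlog listener (Canal or similar) has already published each derived DML statement as a plain-text Kafka message; the topic name, connection strings and credentials are made up for illustration:

```java
import org.apache.kafka.clients.consumer.*;
import java.sql.*;
import java.time.Duration;
import java.util.*;

// Illustrative consumer that replays binlog-derived SQL against the target (wide) table.
// Topic name, servers and credentials are placeholders, not real configuration.
public class BinlogSqlReplayer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "wide-table-sync");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/target", "user", "pass");
             Statement stmt = conn.createStatement()) {
            consumer.subscribe(Collections.singletonList("binlog-sql"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each message is assumed to carry one DML statement; apply it in arrival order.
                    stmt.execute(record.value());
                }
            }
        }
    }
}
```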

What’s involved

System A

System A does not play a very important role in the migration process, but it is worth mentioning that, because system B's users will be added, part of the user ID range has to be reserved. For example, system A's user table already had more than 2,000,000 (200w) rows and system B had more than 500,000 (50w), so system A's auto-increment ID was bumped to 3,000,000 and new IDs grow from there, while system B's users are inserted with the mapping (2,400,000 + system B user ID). Why do it this way? Mainly to make it easy to locate the data that needs to be calibrated. After the migration, calibrating the data is essential; otherwise you could only match users by their unique personal information, which becomes very fragile when a user exists in both system A and system B.
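A minimal sketch of that ID mapping might look like this; the class name and the exact offsets are only illustrative, taken from the 240w/300w figures above:

```java
// Minimal sketch of the ID mapping described above; the offsets mirror the
// 240w / 300w figures in the text and are not the real production values.
public final class UserIdMapping {
    private static final long SYSTEM_B_OFFSET = 2_400_000L;    // B users land in the 240w+ range
    private static final long SYSTEM_A_NEW_START = 3_000_000L; // A's auto_increment is bumped here

    private UserIdMapping() {}

    /** Maps a system B user ID to its new ID in system A. */
    public static long toSystemAId(long systemBUserId) {
        return SYSTEM_B_OFFSET + systemBUserId;
    }

    /** True if an ID in system A originally came from system B, which makes calibration easy. */
    public static boolean isMigratedFromB(long systemAUserId) {
        return systemAUserId >= SYSTEM_B_OFFSET && systemAUserId < SYSTEM_A_NEW_START;
    }
}
```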

System B

For system B, users must be able to keep using it normally during the migration. Some performance loss is acceptable, but availability cannot be lost. Queries are unaffected; but for insert, update and delete operations, the related SQL has to be recorded. A backup is cut at a certain point in time and the migration works from that backup, so the user's inserts, updates and deletes after that point need to be synchronized as incremental SQL. Of course, there are tricky situations: what if the user's data is being migrated right when new SQL is generated? What if the migration is already working through the incremental SQL when system B generates yet more SQL? All of these key scenarios need to be analyzed.

The migration and merge program

There is not much to say about the migration program itself, because the heavier part is the business-level migration and merge logic. The only thing worth noting is the coordination with system B described earlier: after the data migration and merge are complete, the incremental SQL still has to be executed.

The calibration procedure

After the migration, an observation period of a few days should be set aside, during which a calibration program runs regularly every day to check the consistency of this part of the data: user information, amounts, product information, and so on.
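As an illustration only, a daily calibration pass could look roughly like this; the table and column names (user_account, balance) are assumptions, not the real schema:

```java
import java.sql.*;

// Rough sketch of the daily calibration pass: for each migrated user, compare a few
// key fields (here only the balance) between system B's backup and system A.
// Table and column names are invented for illustration.
public class CalibrationJob {
    public static void run(Connection systemA, Connection systemB) throws SQLException {
        String queryB = "SELECT user_id, balance FROM user_account";
        String queryA = "SELECT balance FROM user_account WHERE user_id = ?";
        try (Statement sb = systemB.createStatement();
             ResultSet rs = sb.executeQuery(queryB);
             PreparedStatement pa = systemA.prepareStatement(queryA)) {
            while (rs.next()) {
                long bUserId = rs.getLong("user_id");
                long aUserId = 2_400_000L + bUserId; // same offset mapping as the migration
                pa.setLong(1, aUserId);
                try (ResultSet ra = pa.executeQuery()) {
                    if (!ra.next() || ra.getBigDecimal("balance").compareTo(rs.getBigDecimal("balance")) != 0) {
                        System.out.printf("calibration mismatch for user %d -> %d%n", bUserId, aUserId);
                    }
                }
            }
        }
    }
}
```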

Plan

Come on, I have said a lot already and I suspect you are bored and tempted to close the page. What? You need to close it right now? Please stay and do me this favor, it isn't easy for me to type all this out.



Pictured above are two flow charts: one for system B and one for the migration program.

Before the migration starts, two things need to be done:

  1. Write a flag into Redis for every user of system B to track their migration state. For example, the key is com:showyool:user:123 and the value is 0, where 0 means the migration has not started and 1 means the migration is in progress.
  2. Send these user IDs to RocketMQ. When the migration program is started on multiple machines, each instance consumes the messages and stores the IDs in an in-memory ConcurrentLinkedQueue; the thread pool in the migration application then takes individual users from this queue and processes them concurrently (a sketch of both steps follows below).
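Here is a rough, illustrative sketch of those two preparation steps, assuming Jedis and the RocketMQ Java client; the consumer group, topic name and name-server address are invented:

```java
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import redis.clients.jedis.Jedis;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentLinkedQueue;

// Rough sketch of the two preparation steps: seed a migration flag per user in Redis,
// and have each migration instance consume user IDs from RocketMQ into an in-memory queue.
// Topic name, name-server address and key prefix follow the text but are illustrative.
public class MigrationBootstrap {
    public static final ConcurrentLinkedQueue<Long> PENDING_USERS = new ConcurrentLinkedQueue<>();

    // Step 1: mark every system B user as "not yet migrated" (0).
    public static void seedRedisFlags(Jedis jedis, Iterable<Long> systemBUserIds) {
        for (Long userId : systemBUserIds) {
            jedis.set("com:showyool:user:" + userId, "0");
        }
    }

    // Step 2: consume user IDs from RocketMQ and hand them to the local worker threads.
    public static void startUserIdConsumer() throws Exception {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("migration-consumer-group");
        consumer.setNamesrvAddr("127.0.0.1:9876");
        consumer.subscribe("MIGRATION_USER_ID", "*");
        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, ctx) -> {
            msgs.forEach(msg -> PENDING_USERS.offer(
                    Long.parseLong(new String(msg.getBody(), StandardCharsets.UTF_8))));
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });
        consumer.start();
    }
}
```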

System B flow chart

Now we can walk through them, starting with the flow chart of system B:

  1. When a user's DML request comes in, we first try to acquire the distributed lock KEY1; this is a user-level lock whose purpose is to coordinate with the migration program.
  2. If the lock is acquired, we check Redis for com:showyool:user:xxx. It does not matter whether the value is 0 or 1: as long as the key exists, the user's data has not been fully migrated, and because we hold KEY1 we do not need to worry about the migration program (we will see why shortly). In that case we perform the DML normally, store the assembled SQL in MySQL, and release KEY1. If the key does not exist, the user's data has already been migrated, so we perform the DML and send the SQL to RocketMQ. (You may ask why the first case stores the SQL in MySQL while the second sends it to RocketMQ. The assumption was that a message sent to RocketMQ might be consumed very quickly, and in the first case the data had not been migrated yet, so it would be wrong to apply SQL against a record that does not exist. Storing it in MySQL lets us replay it once the migration is done. The ordering of the SQL is guaranteed by both MySQL and RocketMQ.)
  3. Going back a level: if KEY1 could not be acquired, we read the user's flag from Redis. If it does not exist, the user's data has already been migrated and this is just ordinary user traffic, meaning an earlier user request simply grabbed the lock first; we perform the DML normally and send the SQL to RocketMQ. If the flag is 0, this is ordinary concurrent user traffic and the migration of this user has not started yet, so we perform the DML and store the SQL in MySQL. The remaining, more complicated case is a flag of 1: the migration program holds KEY1, has set the flag to 1 and is migrating this user right now. System B can still perform the DML normally, but then it must acquire a second user-level distributed lock, KEY2. Why another lock? This is exactly the scenario we analyzed earlier; let's look at the timing cases below.

While the migration program is fetching and executing the incremental SQL, new SQL from system B can arrive at three possible points in time. A quick analysis:

  1. The first new SQL arrives before the migration program has started processing the incremental SQL, so this point in time is fine;
  2. The second arrives after the migration program has already read the incremental SQL; yet another statement appearing now is essentially the classic phantom-read problem from database theory, and this must not happen;
  3. The third arrives after the migration program has finished processing the incremental SQL, so it would never be picked up; the situation is similar to the second one.

So what about the last two cases? The third new SQL no longer needs to be stored in MySQL; it can be sent directly to RocketMQ to be consumed and applied. The troublesome one is the second case, which simply must not happen, so we need concurrency control, and that is why the second distributed lock is introduced. If KEY2 is acquired successfully, we check whether the user's flag still exists in Redis: if not, the migration of this user has finished and we are in the third case, so the SQL is sent to RocketMQ; if it still exists, we are in the first case and the SQL is stored in MySQL. If KEY2 cannot be acquired, it is either contention from other user traffic or the migration program is replaying the incremental SQL, so we simply retry and re-acquire the lock; this is what prevents the second case.
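Putting the whole system B write path together, a simplified sketch could look like the following, using Redisson for the two user-level locks; doDml, saveSqlToMysql and sendSqlToRocketMq are placeholder helpers rather than anything from the original project:

```java
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;
import java.util.concurrent.TimeUnit;

// Simplified sketch of system B's write path during the migration, using Redisson locks.
// doDml / saveSqlToMysql / sendSqlToRocketMq stand in for the real business code.
public class SystemBDmlInterceptor {
    private final RedissonClient redisson;

    public SystemBDmlInterceptor(RedissonClient redisson) { this.redisson = redisson; }

    public void handleDml(long userId, String sql) throws InterruptedException {
        RLock key1 = redisson.getLock("lock:key1:" + userId);
        if (key1.tryLock(0, 30, TimeUnit.SECONDS)) {
            try {
                String flag = (String) redisson.getBucket("com:showyool:user:" + userId).get();
                doDml(sql);
                if (flag != null) {
                    saveSqlToMysql(userId, sql);      // not migrated yet: park SQL for later replay
                } else {
                    sendSqlToRocketMq(userId, sql);   // already migrated: forward SQL immediately
                }
            } finally {
                key1.unlock();
            }
            return;
        }

        // KEY1 is held by someone else (other user traffic, or the migration program).
        String flag = (String) redisson.getBucket("com:showyool:user:" + userId).get();
        doDml(sql);
        if (flag == null) {
            sendSqlToRocketMq(userId, sql);           // migration already finished
        } else if ("0".equals(flag)) {
            saveSqlToMysql(userId, sql);              // migration not started: plain concurrency
        } else {
            // flag == "1": the migration program is working on this user right now.
            RLock key2 = redisson.getLock("lock:key2:" + userId);
            while (!key2.tryLock(0, 30, TimeUnit.SECONDS)) {
                Thread.sleep(50);                     // retry until the incremental-SQL replay is done
            }
            try {
                boolean migrated = redisson.getBucket("com:showyool:user:" + userId).get() == null;
                if (migrated) {
                    sendSqlToRocketMq(userId, sql);   // third case: replay finished, go straight to MQ
                } else {
                    saveSqlToMysql(userId, sql);      // first case: replay not started, store in MySQL
                }
            } finally {
                key2.unlock();
            }
        }
    }

    private void doDml(String sql) { /* execute the user's own insert/update/delete */ }
    private void saveSqlToMysql(long userId, String sql) { /* append to the incremental SQL table */ }
    private void sendSqlToRocketMq(long userId, String sql) { /* publish for immediate replay */ }
}
```

Note that in every branch the user's DML is executed normally; the branches only differ in whether the resulting SQL is parked in MySQL for later replay or sent straight to RocketMQ.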

Flow chart of the migration program

If you have understood the flow chart of system B above, the flow of the migration program will be easy to follow. Still, in the spirit of being thorough, let me walk through it: (I'm so nice, why haven't you hit the like button yet?)

  1. Threads in the thread pool take a user from the ConcurrentLinkedQueue and start migrating that user.
  2. First, the worker tries to acquire the distributed lock KEY1. If it succeeds, it sets the user's flag in Redis to 1, indicating that the migration has started. Then comes the long migration and merge work, which is purely business-level code, so I won't go into it. When the merge is complete, the worker acquires KEY2, which means it is about to execute the incremental SQL; in most cases this lock is obtained very quickly. It then executes the incremental SQL, deletes the user's flag from Redis, and finally releases KEY2 and KEY1. If KEY2 cannot be acquired, user traffic is currently working on this user, so the worker simply retries (a sketch of the worker follows after this list).
  3. If KEY1 cannot be acquired, user traffic has entered system B and is holding it. The worker then records this user in Redis under a new key such as com:showyool:fail:user:123, whose value is the number of failures, waits a while and tries KEY1 again. On success it runs the flow above; on failure it keeps incrementing the counter. After 15 failures in total, the worker gives up on this user and moves on to the next one. Such a user is probably a very active one, so they may need to be handled separately, although the handling is similar to the main flow. (During the actual migration we never saw a user fail even once or twice, which was quite a surprise. Yes, a surprise. You tell me, translate it: what is a surprise.)
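And here is a corresponding sketch of the migration worker, again with Redisson; migrateAndMerge, replayIncrementalSql and requeueUser are stand-ins for the business code, and PENDING_USERS is the queue from the earlier bootstrap sketch:

```java
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;
import java.util.concurrent.TimeUnit;

// Rough sketch of one migration worker, following the flow described above.
// migrateAndMerge / replayIncrementalSql / requeueUser are placeholders for the real code.
public class MigrationWorker implements Runnable {
    private static final int MAX_FAILURES = 15;
    private final RedissonClient redisson;

    public MigrationWorker(RedissonClient redisson) { this.redisson = redisson; }

    @Override
    public void run() {
        Long userId;
        while ((userId = MigrationBootstrap.PENDING_USERS.poll()) != null) {
            try {
                migrateUser(userId);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    private void migrateUser(long userId) throws InterruptedException {
        RLock key1 = redisson.getLock("lock:key1:" + userId);
        int failures = 0;
        // Keep competing with user traffic for KEY1; give up after 15 attempts.
        while (!key1.tryLock(0, 10, TimeUnit.MINUTES)) {
            failures++;
            redisson.getBucket("com:showyool:fail:user:" + userId).set(String.valueOf(failures));
            if (failures >= MAX_FAILURES) {
                requeueUser(userId);   // very active user: handle separately later
                return;
            }
            Thread.sleep(1_000);
        }
        try {
            redisson.getBucket("com:showyool:user:" + userId).set("1"); // mark as migrating
            migrateAndMerge(userId);                                    // the long business-level copy/merge

            RLock key2 = redisson.getLock("lock:key2:" + userId);
            key2.lock();
            try {
                replayIncrementalSql(userId);                               // usually finishes very quickly
                redisson.getBucket("com:showyool:user:" + userId).delete(); // user is now fully migrated
            } finally {
                key2.unlock();
            }
        } finally {
            key1.unlock();
        }
    }

    private void migrateAndMerge(long userId) { /* copy system B data into system A */ }
    private void replayIncrementalSql(long userId) { /* execute the SQL stored in MySQL, in order */ }
    private void requeueUser(long userId) { /* e.g. send the user ID back to RocketMQ */ }
}
```

The blocking key2.lock() stands in for the "retry until user traffic releases it" step; in practice the lock is held only for the short incremental-SQL replay.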

Implementation

The core ideas and the scheme are described above; now let me share how it was actually carried out.

  1. Close system B's registration entry and direct new registrations to system A.
  2. Start the operation at 2 a.m. This time was chosen because, first, most users are asleep by 2 o'clock, so there are few user DML operations; second, a large number of scheduled tasks run in the early morning, such as calculating interest and penalty interest (you can see that Yu'ebao returns on Alipay are not displayed exactly at midnight either, because they are computed by scheduled tasks). Those scheduled tasks generate a lot of DML while they run, so starting at 2 a.m. avoids that window; they were usually finished before 1:30.
  3. At 2 o'clock, take system B's service offline, release the prepared code to systems A and B, and deploy the migration and merge program. At the database level, back up system B and change the auto_increment of system A's user table to 3,000,000. Before system B comes back online, set the user flags in Redis, for example com:showyool:user:123 = 0. Strictly speaking this is not a completely smooth migration, since it involves a short service stop, but fortunately this step does not take much time.
  4. Bring the inspection service online and start the inspection and calibration program. I don't have much to add beyond what was said above: this program is just a business-level calibration, also there for data consistency. In addition, send the user IDs to RocketMQ; the migration and merge instances consume the topic and store the IDs in their in-memory ConcurrentLinkedQueue, and threads in the thread pool keep taking user IDs from this queue. Once a user ID is taken, the migration of that user begins.
  5. The next step is to watch the numbers, for example how many users have been migrated and how many remain.
  6. As for the failed users mentioned in the flow chart above, none appeared in practice, but if one had, the user ID would simply be re-sent to RocketMQ to be processed again.
  7. The last step, of course, is testing... and by then I was watching the sunrise, not sure whether I was dozing off or just spacing out.

Afterword

My thoughts pulled me back from that time to the bald uncle I am now, laughing, like a familiar tacit understanding. It is a good memory, and one of the flowers that bloomed along the road as I grew. My past work may not be perfect, and criticism and corrections are welcome, but I hope to keep going bravely, with you, with me, and with Java.

Finally

If this post helped you, please give it a thumbs up. Also attached here is my GitHub address: github.com/showyool/ju…