Background

The historical business data is stored in MySQL. There is an operation record table, video_log: every time a user creates or updates a video, or an auditor reviews one, a corresponding row is appended to video_log. The table is insert-only, so one video maps to many log rows. If there are 100,000 new videos per day and each video produces 5 logs on average, that is 500,000 logs per day, 500,000 * 30 = 15 million records per month, and 15 million * 12 = 180 million records per year. There are already more than 200 million rows online. The log itself is not consumer-facing; it is mainly queried when investigating problems, so some delay is tolerable. However, as the data keeps accumulating, queries get slower and slower and start to hurt efficiency, which is why a redesign was proposed.

Solution 1: Back up old data

Since the log is not the most critical data and does not require strong real-time availability (it is queried when investigating problems), the first idea keeps the core schema unchanged and simply migrates the older data out. The probability of suddenly needing operation records from more than a year ago is very small, and if they are needed, the query can go through an offline channel. With this design, all we need is a scheduled script that runs every day at around 4 a.m. (when business traffic is at its lowest) and pulls the old data out. The extracted data can be written to offline storage (most companies have a Hive-based data warehouse), so the online video_log table stops growing.
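
As a concrete illustration, here is a minimal sketch of such an archive job in Go. It assumes video_log has an auto-increment id and a created_at column; the cutoff window, batch size, and the export step (printed as TSV here) are placeholders for whatever actually feeds the offline Hive store.

```go
// Sketch of the nightly archive job for Solution 1. Assumes video_log has an
// auto-increment id and a created_at column; cutoff, batch size, and the
// "export" step (stdout here) are illustrative placeholders.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

const batchSize = 1000

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/video?parseTime=true")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	cutoff := time.Now().AddDate(-1, 0, 0) // keep roughly one year online

	for {
		rows, err := db.Query(
			"SELECT id, video_id, action, created_at FROM video_log WHERE created_at < ? ORDER BY id LIMIT ?",
			cutoff, batchSize)
		if err != nil {
			log.Fatal(err)
		}
		var lastID int64
		n := 0
		for rows.Next() {
			var id, videoID int64
			var action string
			var createdAt time.Time
			if err := rows.Scan(&id, &videoID, &action, &createdAt); err != nil {
				log.Fatal(err)
			}
			// Export the row to the offline store (placeholder: print as TSV).
			fmt.Printf("%d\t%d\t%s\t%s\n", id, videoID, action, createdAt.Format(time.RFC3339))
			lastID, n = id, n+1
		}
		rows.Close()
		if n == 0 {
			break // nothing older than the cutoff remains
		}
		// Remove the exported batch so the online table stops growing.
		if _, err := db.Exec(
			"DELETE FROM video_log WHERE created_at < ? AND id <= ?", cutoff, lastID); err != nil {
			log.Fatal(err)
		}
	}
}
```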

Solution 2: Split the table

Splitting the table is also an option. Compared with solution 1, its advantage is that all the data can still be queried in real time; the disadvantage is that the code has to be modified. The steps are:

  1. First, confirm the sharding key. Since video_log is bound to a video, it is natural to choose video_id as the sharding key.
  2. Having decided what to shard by, confirm the number of tables. Set a small goal of 3 years, with a maximum of 100 million rows per table (our queries are simple). Based on the statistics above, 3 years means roughly 3 * 180 million = 540 million rows, so we need 540 million / 100 million ≈ 6 tables (routing by video_id is sketched after this list).
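
A minimal sketch of the routing, assuming the 6 new tables are named video_log_0 through video_log_5 (the names and count are illustrative): the shard is simply video_id modulo the table count.

```go
// Sketch of shard routing for Solution 2. Table names and the count of 6
// are illustrative; the shard is chosen by video_id modulo the table count.
package main

import "fmt"

const shardCount = 6

// shardTable returns the physical table that holds logs for a given video.
func shardTable(videoID int64) string {
	return fmt.Sprintf("video_log_%d", videoID%shardCount)
}

func main() {
	fmt.Println(shardTable(1234567)) // prints video_log_1
}
```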

The next step is to change the code and sort out how old and new data are read and written:

  1. New data is written directly to the new (sharded) tables.
  2. Since the log table is insert-only, UPDATE and DELETE do not exist, so those cases need no handling.
  3. A video's logs temporarily live in two places (the old table and the new tables), so for now query both and merge the results (see the dual-read sketch after this list).
  4. Synchronize the old data into the new tables.
  5. Remove the code path that reads the old table.
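
A minimal sketch of the transitional dual read in step 3. queryOldTable and queryNewTable are hypothetical helpers standing in for the real SELECTs against the old video_log table and the sharded video_log_N tables; they are stubbed out here so the sketch compiles.

```go
// Sketch of step 3 (transitional dual read). The two query helpers are
// hypothetical stand-ins for the real SELECTs against the old table and the
// new sharded tables.
package main

import (
	"fmt"
	"sort"
)

type LogRow struct {
	ID        int64
	VideoID   int64
	Action    string
	CreatedAt int64 // unix timestamp
}

// Stubs: real code would run the same SELECT against the old and new tables.
func queryOldTable(videoID int64) ([]LogRow, error) { return nil, nil }
func queryNewTable(videoID int64) ([]LogRow, error) { return nil, nil }

// listVideoLogs reads both the old table and the new shard during the
// migration window and merges the results into one time-ordered slice.
func listVideoLogs(videoID int64) ([]LogRow, error) {
	oldRows, err := queryOldTable(videoID)
	if err != nil {
		return nil, err
	}
	newRows, err := queryNewTable(videoID)
	if err != nil {
		return nil, err
	}
	merged := append(oldRows, newRows...)
	// Deduplicate by primary key in case a row has already been copied over.
	seen := make(map[int64]bool, len(merged))
	out := merged[:0]
	for _, r := range merged {
		if !seen[r.ID] {
			seen[r.ID] = true
			out = append(out, r)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].CreatedAt < out[j].CreatedAt })
	return out, nil
}

func main() {
	logs, _ := listVideoLogs(1234567)
	fmt.Println(len(logs))
}
```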

Solution 3: Migrate to TiDB

The drawback of solution 2 is fairly obvious: do we split the tables again in 3 years? It feels like there is always a debt hanging over us. So we turned our attention to TiDB. TiDB is a distributed database; once we switch to it, we no longer need to care about splitting tables ourselves. One thing to keep in mind, precisely because TiDB is distributed, is that its primary keys are not ordered or continuous. The whole migration can be divided into the following four steps:

  1. Start double writing first (record the current MySQL id at the moment double writing begins; a sketch follows this list)
  2. Synchronize the old data (identified by the id recorded in step 1)
  3. Cut reads over to TiDB (once the old data has finished synchronizing)
  4. Turn off the double writing
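
A minimal sketch of step 1 (double write), assuming two database handles opened with the standard MySQL driver (TiDB speaks the MySQL protocol). This is one common way to do it: MySQL stays the source of truth, the TiDB write is best-effort, and recording the MySQL id at the start of double writing happens elsewhere. Column names and DSNs are illustrative.

```go
// Sketch of double writing (step 1). MySQL remains authoritative until the
// cutover; the TiDB write is best-effort and failures are only logged.
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

var mysqlDB, tidbDB *sql.DB

func insertVideoLog(videoID int64, action string) error {
	// 1. Write to MySQL first; it is still the source of truth.
	res, err := mysqlDB.Exec(
		"INSERT INTO video_log (video_id, action) VALUES (?, ?)", videoID, action)
	if err != nil {
		return err
	}
	mysqlID, _ := res.LastInsertId()

	// 2. Best-effort write to TiDB. The MySQL id is deliberately not carried
	//    over; TiDB assigns its own primary key (see the pitfalls below).
	if _, err := tidbDB.Exec(
		"INSERT INTO video_log (video_id, action) VALUES (?, ?)", videoID, action); err != nil {
		log.Printf("double write to TiDB failed (mysql id %d): %v", mysqlID, err)
	}
	return nil
}

func main() {
	var err error
	if mysqlDB, err = sql.Open("mysql", "user:pass@tcp(mysql-host:3306)/video"); err != nil {
		log.Fatal(err)
	}
	if tidbDB, err = sql.Open("mysql", "user:pass@tcp(tidb-host:4000)/video"); err != nil {
		log.Fatal(err)
	}
	if err := insertVideoLog(42, "create"); err != nil {
		log.Fatal(err)
	}
}
```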

Pitfalls of synchronizing the old data

Migrating to TiDB looks simple, but the job script that synchronizes the old data hides several pitfalls.

  1. What if the job breaks and has to be restarted? Besides the time cost of re-running, the rows that were already synchronized would be inserted again, so duplicates have to be dealt with. One way is to add a flag column to the old table marking whether a row has been synchronized, and update it after each row is copied. Disadvantage: the online table is huge, and adding a column to it is risky and may block online traffic.
  2. Another idea is to restart from scratch and simply skip rows whose primary key has already been inserted (sketched below). This looks perfect, but TiDB is distributed and its primary key ids are not continuous: if an id that TiDB assigns (through the double writes) happens to equal an id synchronized over from MySQL, the later insert fails.
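
For concreteness, the "skip what is already there" idea in pitfall 2 roughly amounts to something like the following sketch (the rejected approach, not the final one): the sync carries the MySQL primary key and relies on INSERT IGNORE to make restarts idempotent. The comments note why this breaks once TiDB has been assigning its own ids through the double writes.

```go
// Sketch of the rejected idea from pitfall 2: carry the MySQL primary key and
// rely on INSERT IGNORE so a restarted sync silently skips rows it already
// copied.
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func syncRowNaive(tidbDB *sql.DB, id, videoID int64, action, createdAt string) error {
	// INSERT IGNORE makes the restart idempotent on the primary key, but it
	// cannot tell "this row was already synced" apart from "TiDB happened to
	// hand this id to a double-written row": one of the two rows is lost or
	// the later insert fails. That is why the final scheme drops the primary
	// key from the sync instead.
	_, err := tidbDB.Exec(
		"INSERT IGNORE INTO video_log (id, video_id, action, created_at) VALUES (?, ?, ?, ?)",
		id, videoID, action, createdAt)
	return err
}

func main() {
	tidbDB, err := sql.Open("mysql", "user:pass@tcp(tidb-host:4000)/video")
	if err != nil {
		log.Fatal(err)
	}
	if err := syncRowNaive(tidbDB, 1, 42, "create", "2024-01-01 00:00:00"); err != nil {
		log.Fatal(err)
	}
}
```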

The final synchronization script

Taking duplicate data, the cost of restarting a job, and the efficiency of the whole synchronization into account, I settled on the following plan:

  1. Run batches in parallel to improve efficiency: based on processing capacity and the expected completion time, the old data was split into about 10 batches. 10 jobs run different batches without interfering with each other, and each job copies 100 rows per round.
  2. Record the progress so a restart resumes from the breakpoint: after each round, record the current synchronization position (the current id, stored in Redis). Even after a restart, the job can read the last position from Redis and continue from there (see the sketch after this list).
  3. Avoid primary key conflicts: synchronize every field except the primary key, and let TiDB assign its own ids.
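
A minimal sketch of one such job, assuming the go-sql-driver MySQL driver and the go-redis client; the batch boundaries, Redis key name, column list, and DSNs are all illustrative. Each job owns one id range, copies 100 rows per round without the primary key, and checkpoints the last synced MySQL id in Redis after every round.

```go
// Sketch of one sync job under the final scheme: batched copy from MySQL to
// TiDB, primary key excluded, progress checkpointed in Redis.
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
	"github.com/redis/go-redis/v9"
)

const roundSize = 100

func main() {
	ctx := context.Background()

	mysqlDB, err := sql.Open("mysql", "user:pass@tcp(mysql-host:3306)/video")
	if err != nil {
		log.Fatal(err)
	}
	tidbDB, err := sql.Open("mysql", "user:pass@tcp(tidb-host:4000)/video")
	if err != nil {
		log.Fatal(err)
	}
	rdb := redis.NewClient(&redis.Options{Addr: "redis-host:6379"})

	// This job owns the id range [batchStart, batchEnd); other jobs get other ranges.
	batchStart, batchEnd := int64(0), int64(20_000_000)
	checkpointKey := fmt.Sprintf("video_log_sync:%d", batchStart)

	// Resume from the Redis checkpoint if one exists.
	cursor := batchStart
	if v, err := rdb.Get(ctx, checkpointKey).Int64(); err == nil {
		cursor = v
	}

	for cursor < batchEnd {
		rows, err := mysqlDB.Query(
			"SELECT id, video_id, action, created_at FROM video_log WHERE id > ? AND id < ? ORDER BY id LIMIT ?",
			cursor, batchEnd, roundSize)
		if err != nil {
			log.Fatal(err)
		}
		var lastID int64
		n := 0
		for rows.Next() {
			var id, videoID int64
			var action, createdAt string
			if err := rows.Scan(&id, &videoID, &action, &createdAt); err != nil {
				log.Fatal(err)
			}
			// Copy every field except the primary key; TiDB assigns a new id.
			if _, err := tidbDB.Exec(
				"INSERT INTO video_log (video_id, action, created_at) VALUES (?, ?, ?)",
				videoID, action, createdAt); err != nil {
				log.Fatal(err)
			}
			lastID = id
			n++
		}
		rows.Close()
		if n == 0 {
			break // this batch is done
		}
		cursor = lastID
		// Checkpoint after each round; a crash mid-round can re-insert at most
		// one round of rows (wrap the round in a TiDB transaction if needed).
		if err := rdb.Set(ctx, checkpointKey, cursor, 0).Err(); err != nil {
			log.Fatal(err)
		}
	}
	log.Printf("batch [%d, %d) synced", batchStart, batchEnd)
}
```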

In the end, the data migration was completed smoothly through the four cut-over steps of solution 3 plus the efficient synchronization script.