When you use Taobao, Ele. me, Meituan and other apps, I believe you have seen the historical orders, and only the orders in recent months can be queried. This is because, as the business grows, the database grows rapidly. The data volume reaches tens of millions, even hundreds of millions. Such a large amount of data, so that the platform’s order query is very slow, I also said before, tens of millions of table query is already very slow. At this time, if the operator fails, several more times, CPU surge, request blocking, and finally downtime. So, at this point, there’s an optimization called hot and cold separation, which is what you see in your app, where you can only query data from the last few months (hot data) and if you want to query cold data you can’t query it on your app.

What is hot and cold separation

As the name implies, it is divided into two repositories, one is cold storage and the other is hot storage. The data that is not commonly used a few months ago is put into the cold storage, and the latest data is put into the hot storage. This is the same with the new generation and old generation in the JVM. Some data that is used frequently will end up in the old generation, and some data that is not used often will be destroyed during the new generation GC, and the data that is not used will be stored in the new generation GC

When is hot and cold separated?

  1. Old data is no longer modified, that is, only read but not written
  2. Users do not care about the old data, the old data for users have no use for the case, such as orders, school management system, the class with the previous class of students.

How to design hot and cold data separation?

In the case of an order, we just need to determine when the order is placed and whether it is completed. For example, an order is more than six months and the order has been completed, then the order can be considered as cold data, a user can not go to apply for a refund six months before the order, it can only be said that the user is porcelain, it is more to prevent this user let him check the order before six months. Of course, is half a year or three months or a month, it depends on your actual scene. It can also be based on other conditions, such as whether it is completed or not. The completed data is considered cold data, and the incomplete data is considered hot data

How to separate hot and cold data?

Here in fact, many people will think of using the way to listen to the database change log binlog to trigger, but this way, Ali Cloud RDS is not super permission. It is also impossible to distinguish hot from cold by time. When data is cold, no operation is performed during the period, and data concurrency needs to be considered.

The second way is through timed tasks:

You can regularly scan the hot data at midnight every day, identify the cold and hot, and then synchronize it to the cold database. Scheduled tasks also have a disadvantage that they cannot be implemented in real time and may have a large amount of data that cannot be processed at one time.

Here we also need to ensure consistency of data. Data consistency can be guaranteed through distributed transactions. But adding and deleting is costly. There is another option:

  1. In the hot database, flag the data to be moved: flag=1. (In practice, the values used to identify fields are numbers, for ease of understanding.)
  2. Find all data to be moved (flag=1) : This step is to ensure that some previous threads fail for some reason and some data to be moved is not moved
  3. To save a piece of data in a cold database, we need to add a judgment in the save logic to ensure idempotency (in this case, we need to enclose transactions). In plain English, we need to ensure that the logic can continue if the data we save already exists in the cold database
  4. Delete the corresponding data from the hot database

In scheduled task mode, if the amount of data is large and cannot be processed at a time, what should I do?

We can do this with the idea of a buffer, and we do it in batches, say, a hundred

  1. Add an identifier to the hot database: flag=1;
  2. Find the first 50 pieces of data to move
  3. Keep a copy of data in the freezer
  4. Delete the corresponding data from the hot repository
  5. Loop execution

At the same time, if it is too large, we can also enable multi-threaded concurrent execution. But this requires good concurrent code design.

Inadequate separation of hot and cold:

  1. It is still slow for users to query cold data
  2. Services cannot modify cold data. If the amount of cold data is large, you need to divide the cold data into databases and tables