In the previous installment (“These secret parameters teach you the correct use of TDengine clustering”), we discussed how to configure vnodes properly to shard TDengine data. In this installment, we continue by looking at how TDengine partitions data along the time dimension.
First, take a look at the relevant description on the official website:
“In addition to vnode sharding, TDengine also partitions time-series data by time period. Each data file contains time-series data for only one period, and the length of the period is determined by the database configuration parameter DAYS. Partitioning by time period also makes it easy to implement data retention policies efficiently: once a data file exceeds the specified number of days (the system configuration parameter KEEP), it is deleted automatically. In addition, different time periods can be stored in different paths and on different storage media, which facilitates hot and cold data management and multi-level storage.
In general, TDengine splits big data along two dimensions, vnode and time, so that it can be managed efficiently in parallel and scaled horizontally.”
It can be seen that the keep parameter plays a very important role in this process. However, keep is also a parameter that users commonly misunderstand.
Its official description reads: “The number of days data is kept in the database, in days (default: 3650)”. It works together with the DAYS parameter: “The duration of the data stored in one data file, in days (default: 10)”.
From the user’s point of view, this description means that data can no longer be queried once it is older than the keep period. In practice, however, it is common to see data outside that time range still appear in query results.
First, let’s take a brief look at the storage logic of TDengine. After data is written to the database, it is first held in the buffer pool in memory. When a threshold is reached (the buffer is one third full) or the database service is shut down, the data in memory is flushed to disk, into the data directory of the vnode that holds the table (by default /var/lib/taos/vnode/vnodeX/tsdb/data). The X in vnodeX can be found with the show vgroups command.
Here’s an example:
To test this, simply insert an arbitrary row of data and restart the service: systemctl restart taosd. The data that was just written to memory is now flushed to the hard drive.
Note: Restarting the service is a useful test operation because it triggers a flush from memory to disk. Currently, the automatic deletion mechanism is triggered only when data is flushed (a trigger during initialization will be added later). If no further data is written to the database, an expired data file will therefore not be deleted.
Now you can find your data files. As you can see in the image below, there were no files in this directory before the restart. After the restart, a set of three files numbered 1880 appeared.
Broadly speaking, all three files are data files, and in the rest of this article “data file” refers to such a group of three files.
Now, let’s go back to a real-world scenario.
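To list such groups without reading the directory by eye, a small script can help. The following is only a minimal sketch: it assumes that the files of one group share the same base name and differ only in their extension, which is an assumption about the file naming and not something stated above. The function name group_data_files is made up for this illustration.

```python
from collections import defaultdict
from pathlib import Path


def group_data_files(data_dir):
    """Group the files in data_dir by base name (file name without extension).

    Assumption: all files of one data file group share the same base name.
    Returns a dict mapping base name -> sorted list of extensions.
    """
    groups = defaultdict(list)
    for path in Path(data_dir).iterdir():
        if path.is_file():
            groups[path.stem].append(path.suffix)
    return {stem: sorted(suffixes) for stem, suffixes in groups.items()}
```

Calling group_data_files on the data directory of a vnode returns one entry per group, so the length of the result is the number of data files in the sense used in this article.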
If you have ever tested your data retention policy, the following scenario will be familiar. When creating the database, we specify keep as 10 and days as 10. A data file is generated on January 1, yet on January 19 the data inserted on January 1 can still be queried. So you exit the taos shell to check the data directory, and sure enough, the data file generated on January 1 has not been deleted.
Strange – is the keep parameter not in effect?
To answer this question, we need to know one more thing, the design of the DAYS parameter. DAYS is defined as “the length of time a data file holds” and is based on the system time: the period starts on the date the data file is first generated and is measured against the system time. It is counted in whole calendar days, not in 24-hour intervals. Once a file is more than DAYS days old, a new data file is generated the next time data is flushed.
In fact, 99.9% of the time when you find that old data can still be queried, it is not that KEEP doesn’t work. The underlying reason is that TDengine waits until all data in a data file has expired before deleting the file. The scenario above (keep 10, days 10) is exactly this case: the data file generated on January 1 may also contain data from January 10, which on January 19 is not yet 10 days old and so must not be deleted. As a result, the whole file is retained, including the data from January 1 to January 9.
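The rule “a file is deleted only after its newest possible data has expired” can be written down as a small date calculation. This is a simplified model for illustration, not TDengine’s actual code: the function name file_is_deletable is hypothetical, the year 2021 is chosen arbitrarily, and the model ignores the fact that deletion is only triggered when data is flushed to disk.

```python
from datetime import date, timedelta


def file_is_deletable(file_start, days, keep, today):
    """Return True once every day covered by the file is older than keep days.

    file_start: first calendar day covered by the data file
    days:       number of calendar days one data file covers (DAYS)
    keep:       number of days data must be retained (KEEP)
    """
    last_day_in_file = file_start + timedelta(days=days - 1)
    expires_after = last_day_in_file + timedelta(days=keep)
    return today > expires_after


# Scenario from the text: keep 10, days 10, file starting on January 1.
start = date(2021, 1, 1)  # the year is arbitrary
print(file_is_deletable(start, 10, 10, date(2021, 1, 19)))  # False
print(file_is_deletable(start, 10, 10, date(2021, 1, 21)))  # True
```

In this model the file covers January 1 to January 10, so its last day only passes the 10-day keep window once January 20 is over, which is why the January 1 data is still visible on January 19.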
That’s the answer to the title of the article.
It can be seen that, because data is grouped into files by DAYS, the smaller DAYS is, the more precise automatic deletion becomes. So why not simply make DAYS smaller? You can. But in terms of performance, a smaller DAYS means more data files, so more files have to be opened and closed frequently, which adds overhead. The default of 10 days is therefore a compromise.
This brings us to two new questions:
1. Under what circumstances does TDengine delete expired files?
2. How can we quickly tell if the auto-delete mechanism is working properly?
We can answer both questions with one scenario:
Question 1: The answer appears if we let the scenario above (keep 10, days 10) run on. On January 21, the third data file is generated, and the last day of data in the first data file finally exceeds the KEEP value. At this point KEEP takes effect and the first data file is removed from storage. Back in TDengine, this data can no longer be queried.
Question 2: Simply count the data files in the vnode directory. In the case above (keep 10, days 10), the directory holds at most two data files, covering days 1 to 10 and days 11 to 20. By the time the data file for days 21 to 30 is generated, the file for days 1 to 10 has already been deleted, so at most two are retained. The formula is keep/days + 1. As long as the number of data files in the vnode directory is less than or equal to keep/days + 1, the automatic deletion mechanism can be considered to be working properly.
However, when keep is not divisible by days, the situation is slightly different:
Assume keep=3 and days=2. Under this configuration, the first data file stores days 1-2 and the second stores days 3-4. The day 2 data in the first file does not expire until day 5 (2+3) ends, so the file for days 1-2 is not deleted until day 6 begins. During day 5, therefore, three files coexist: the file for days 1-2, the file for days 3-4, and the newly created file that starts on day 5.
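This is what the official documentation means when it says: “Given days and keep, the total number of data files in a typical working vnode is: keep/days + 1”. With keep/days rounded up, keep=3 and days=2 gives 2 + 1 = 3 files, which matches the example above.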
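Both scenarios can be checked with a short day-by-day simulation. The sketch below is an idealized model and not TDengine code: it assumes that data arrives every day, that file k covers days k*days+1 to (k+1)*days, and that a file disappears as soon as its last day is more than keep days old. It also reads keep/days in the formula as rounded up, an assumption chosen because it matches the keep=3, days=2 example. The function names are made up for this illustration.

```python
import math


def expected_max_files(keep, days):
    """The keep/days + 1 formula, with keep/days rounded up (assumption)."""
    return math.ceil(keep / days) + 1


def max_coexisting_files(keep, days, horizon=365):
    """Simulate day by day and return the largest number of files alive at once."""
    peak = 0
    for today in range(1, horizon + 1):
        alive = 0
        # Files 0 .. (today - 1) // days have been created by `today`.
        for k in range((today - 1) // days + 1):
            last_day = (k + 1) * days
            if today <= last_day + keep:  # file not yet fully expired
                alive += 1
        peak = max(peak, alive)
    return peak


for keep, days in [(10, 10), (3, 2)]:
    print(keep, days, max_coexisting_files(keep, days), expected_max_files(keep, days))
```

For keep=10, days=10 both functions return 2, and for keep=3, days=2 both return 3, in line with the two scenarios described above. The default horizon of 365 simulated days is only long enough for small keep values; pass a larger horizon for larger ones.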
So, as long as the number of files in your vnode directory matches the results of the two scenarios above, there is no need to worry that the auto-delete mechanism is not working properly.
Dear reader, do you now understand the automatic deletion mechanism of TDengine? If not, I have been remiss.