Abstract:

In this series, OpenTSDB (on HBase), KairosDB (on Cassandra), BlueFlood, and Heroic have been compared and analyzed, followed by InfluxDB, the top-ranked TSDB. This article turns to the storage engine behind Prometheus 2.0.

Prometheus

In the earlier articles in this series, OpenTSDB (on HBase), KairosDB (on Cassandra), BlueFlood, and Heroic were compared and analyzed; finally InfluxDB, the top-ranked TSDB, was examined. InfluxDB is a TSDB developed from the ground up, and what interests me most is its underlying TSM, a storage engine based on LSM and optimized for time series workloads. InfluxDB has shared its journey from LevelDB to BoltDB and finally to the decision to build TSM itself, describing in depth the pain points of each phase, the core problems each transition had to solve, and the core design ideas behind TSM. This is the kind of sharing I prefer: not a claim about which technology is best, but a step-by-step account of how the whole technology evolved. The thorough analysis of the problems encountered at each stage, and of the reasons behind the final technical choices, is impressive, and there is a lot to learn from it.

However, InfluxDB's write-up of TSM is not detailed enough; it mostly describes policies and behaviors. The article "Writing a Time Series Database from Scratch" describes the design of a TSDB storage engine, and that engine is not a concept or a toy but a real production system: a completely rewritten storage engine for Prometheus 2.0, released in November 2017. The new engine is said to bring "huge performance improvements"; the changes are so large that it is not backward compatible, and it should bring quite a few surprises to the industry.

This post is mainly an interpretation of that article; most of the content comes from the original, slightly abridged. If you want to know more, it is recommended that you read the original English text.

The data model

Prometheus, like other mainstream time series databases, defines its data model as a metric name, one or more labels, and a metric value. The metric name plus a set of labels is the unique identifier of a time series. Queries can select time series by label conditions, supporting both simple and complex conditions. Setting aside the parts Prometheus does not discuss, the storage engine design has to account for data storage (far more writes than reads), data reclamation (retention), and data queries, based on the characteristics of time series data.
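As a rough illustration of this model (the type and field names below are hypothetical, not Prometheus's actual implementation), a time series and its samples could be sketched like this:

```go
// A minimal sketch of the data model described above; illustrative only.
// Labels identify a series together with the metric name,
// e.g. http_requests_total{method="GET", status="200"}.
type Labels map[string]string

// Sample is one data point: a timestamp and a value.
type Sample struct {
	Timestamp int64
	Value     float64
}

// Series is uniquely identified by its metric name plus its label set and
// holds an append-only list of samples ordered by timestamp.
type Series struct {
	Metric  string // e.g. "http_requests_total"
	Labels  Labels
	Samples []Sample
}
```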

The figure above is a simple view of how all data points are distributed: time on the horizontal axis, time series on the vertical axis, and every point in the region a data point. Each time Prometheus receives data, it receives one vertical line in this region. This is a nice way to describe it: at any given moment each time series produces only one data point, but many time series produce data at that same moment, and connecting those points forms a vertical line. This characteristic is important, and it shapes the optimization strategies for data writing and compression.
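To make the "vertical line" concrete, here is a hypothetical sketch reusing the Series and Sample types from above: one scrape produces exactly one sample per active series, all sharing the same timestamp.

```go
// Hypothetical sketch of the "vertical line" write pattern.
type SeriesID uint64

// ScrapeBatch is the vertical slice received at one point in time.
type ScrapeBatch struct {
	Timestamp int64                // shared by every sample in the batch
	Values    map[SeriesID]float64 // one value per time series
}

// appendBatch touches many series but only a single point in time.
// It assumes every series in the batch already exists in the map.
func appendBatch(series map[SeriesID]*Series, b ScrapeBatch) {
	for id, v := range b.Values {
		s := series[id]
		s.Samples = append(s.Samples, Sample{Timestamp: b.Timestamp, Value: v})
	}
}
```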

V2 Storage engine

This article focuses on some of the design ideas behind the new V3 storage engine; the old engine is V2. V2 stores the data points of each time series in a separate file. Under this design strategy, several issues are worth discussing:

  1. Write optimization: write optimization for both SSDs and HDDs follows the same principles of sequential writes and batched writes. However, as noted above, each batch of data Prometheus receives is a vertical line made up of one data point from each of many different time series. In the V2 design, a single time series corresponds to a single file, so every write means writing a tiny amount of data to a great many different files. The V2 storage engine's answer to this is chunked writing: writes to a single time series are batched, so data points must accumulate in memory, per time series, for a certain period before a chunk can be written out. Besides batching writes, the chunk-write policy also improves hot-data query efficiency and the compression ratio. V2 uses the same compression algorithm as Facebook's Gorilla and can compress a 16-byte data point down to 1.37 bytes on average, a roughly 12x saving in memory and disk space (a simplified sketch of this style of encoding appears after this list). Because chunked writing requires data to accumulate in server memory for a while, the hot data effectively lives in memory, and queries over it are very efficient.
  2. Query optimization: query patterns over time series data vary widely: a single data point of one series, many series at one point in time, or many series over a time range. On the data model diagram above, such a query is a rectangular block of data in the two-dimensional plane. For both SSDs and HDDs, a disk-read-friendly query is one that needs only a small number of random seeks plus a large amount of sequential reading. This depends heavily on how the data is laid out on disk, and ultimately on how it is written, but the layout does not have to be perfect at write time; it can be improved later by reorganizing the data.
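As referenced in point 1, the following is a heavily simplified, hypothetical sketch of the two core ideas behind Gorilla-style compression, delta-of-delta timestamps and XOR-ed float values; a real encoder packs these quantities into variable-length bit fields and stores the very first sample in full, which is how a 16-byte data point shrinks to 1.37 bytes on average.

```go
import "math"

// gorillaChunk tracks just enough state to compute what a Gorilla-style
// encoder would emit for each new sample. Illustrative only.
type gorillaChunk struct {
	prevTS      int64  // timestamp of the previous sample
	prevDelta   int64  // previous timestamp delta
	prevValBits uint64 // raw IEEE-754 bits of the previous value
	count       int
}

// appendSample returns the delta-of-delta of the timestamp and the XOR of
// the value bits against the previous value. A real encoder would write the
// first sample in full and then emit these as variable-length bit sequences.
func (c *gorillaChunk) appendSample(ts int64, v float64) (dod int64, xor uint64) {
	valBits := math.Float64bits(v)
	if c.count > 0 {
		delta := ts - c.prevTS
		dod = delta - c.prevDelta
		xor = valBits ^ c.prevValBits
		c.prevDelta = delta
	}
	c.prevTS = ts
	c.prevValBits = valBits
	c.count++
	return dod, xor
}
```

Because samples on one time series usually arrive at a fixed interval and change slowly, both dod and xor are zero most of the time, which is what makes this style of encoding so compact.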
The V2 storage engine already had some good optimizations in place, chiefly chunked writes and the in-memory cache of hot data, and both carry over into V3. Beyond these two points, however, V2 still has a number of flaws:

  • The number of files grows with the number of time series over time, slowly exhausting inodes.
  • Even with chunked writes, a single write that involves too many time series still demands high IOPS.
  • It is impossible to keep every file open all the time, so a query may need to reopen a large number of files, increasing query latency.
  • Reclaiming old data requires scanning and rewriting a large number of files, which takes a long time.
  • Data must accumulate in memory for a while before being written out as chunks. V2 uses a checkpoint mechanism to guard against losing this in-memory data, but the checkpoint interval is usually longer than the acceptable window of data loss, and restoring data from the checkpoint when a node recovers also takes a long time.
For the time series index, the V2 storage engine uses LevelDB to store the mapping from labels to time series. Once the number of time series reaches a certain scale, queries become very inefficient. In typical scenarios the cardinality of time series is fairly small, because the application environment rarely changes; as long as operations are stable, the set of time series stays stable too. But if labels are chosen badly, for example if a dynamic value such as the program version number is used as a label, then the label value changes with every upgrade. Over time this produces more and more stale time series (what Prometheus calls series churn), the index keeps growing, and index query efficiency suffers.

V3 Storage engine

The V3 engine was completely redesigned to address these issues in V2. It can be regarded as a simplified LSM tree optimized for time series scenarios, and it can be understood through the design ideas of LSM. Let's first look at the file directory structure of V3's data.

All data is stored under the data directory. The next level down consists of directories prefixed with 'b-' and suffixed with sequentially increasing IDs, each representing a Block. Each Block contains a chunks directory, an index file, and a meta.json file. The chunk files inside the chunks directory are the same concept as chunks in V2; the only difference is that one chunk now contains data from many time series instead of just one. index is the index for the chunks within the Block; it can quickly locate a time series, and its data chunks, by label. meta.json is a simple description of the Block's data and state. To understand the V3 design, you need to answer the following questions: 1. What is the storage format of the chunk files? 2. What is the storage format of the index, and how does it support fast lookups? 3. Why does the last Block contain a wal directory but no chunks directory?
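Putting that description together, the on-disk layout looks roughly like the following abridged illustration (directory names and file counts are only indicative):

```
data
├── b-000001
│   ├── chunks
│   │   ├── 000001
│   │   ├── 000002
│   │   └── 000003
│   ├── index
│   └── meta.json
├── b-000002
│   ├── chunks
│   │   └── 000001
│   ├── index
│   └── meta.json
└── b-000003
    ├── meta.json
    └── wal
        ├── 000001
        └── 000002
```

Note that the last Block holds a wal directory instead of chunk files, which is exactly question 3 above and is answered in the next section.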

Design idea

Prometheus shards data into Blocks along the time dimension. Each Block is like an independent small database, covering its own time range with no overlap between Blocks. Once the data in a Block's chunks has been dumped to files it can no longer be modified; only the most recent Block can receive new data. New data arriving at the latest Block is kept in an in-memory structure; to prevent data loss, it is first written to a write-ahead log (WAL).
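A minimal, hypothetical sketch of that write path, reusing the SeriesID, Series, and Sample types from the earlier sketches: a sample is appended to the WAL before it is applied to the in-memory structure, so the memory state of the newest Block can be rebuilt after a crash.

```go
// WAL is whatever append-only log the newest Block writes to before
// acknowledging a sample. Illustrative interface only.
type WAL interface {
	Append(id SeriesID, ts int64, v float64) error
}

// headBlock is the newest Block: data lives in memory, backed by the WAL,
// and has not yet been dumped to chunk files.
type headBlock struct {
	wal    WAL
	series map[SeriesID]*Series
}

func (h *headBlock) add(id SeriesID, ts int64, v float64) error {
	// 1. Make the write durable in the WAL.
	if err := h.wal.Append(id, ts, v); err != nil {
		return err
	}
	// 2. Apply it to the in-memory series (assumed to exist already).
	s := h.series[id]
	s.Samples = append(s.Samples, Sample{Timestamp: ts, Value: v})
	return nil
}
```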

V3 borrows heavily from the LSM design idea and adds some optimizations for the characteristics of time series data, which brings many benefits:

  1. When querying data in a time range, irrelevant Blocks can be excluded quickly (see the sketch after this list). Each Block also has its own index, which effectively solves the "invalid timeline" series-churn problem encountered in V2.
  2. Dumping in-memory data into chunk files allows large amounts of data to be written sequentially and efficiently, which is friendly to both SSDs and HDDs.
  3. As in V2, the most recent data is kept in memory; the newest data is also the hottest, and serving it from memory gives the most efficient queries.
  4. Reclaiming old data becomes very simple and efficient: only a small number of directories need to be removed.
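As referenced in point 1 above, because every Block covers a disjoint time range, a query only needs to open the Blocks whose range overlaps the queried interval. A hypothetical sketch:

```go
// blockMeta records the time range a Block covers; names are illustrative.
type blockMeta struct {
	MinTime, MaxTime int64  // time range covered by the Block
	Dir              string // directory holding chunks, index, meta.json
}

// blocksForRange returns the Blocks overlapping [mint, maxt]; every other
// Block is skipped without touching disk.
func blocksForRange(blocks []blockMeta, mint, maxt int64) []blockMeta {
	var hit []blockMeta
	for _, b := range blocks {
		if b.MaxTime >= mint && b.MinTime <= maxt {
			hit = append(hit, b)
		}
	}
	return hit
}
```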
Blocks in V3 cover a span of two hours, a value that can be neither too large nor too small. If it is too large, two hours of data has to be held in memory; if it is too small, there are too many Blocks and a query has to touch more files. Two hours is the compromise, but a query over a large time range still inevitably spans many Blocks, for example 84 of them for one week. Like other engines, V3 optimizes this with compaction: merging small Blocks into larger ones. Compaction can also do other work, such as deleting expired data or restructuring chunk data to support more efficient queries. The article does not give a full account of how this compaction is carried out; the article on InfluxDB shows how it handles the same task, using several different compaction strategies at different stages.

The figure above is a schematic of how expired data is reclaimed in V3, which is much simpler than in V2. If an entire Block has expired, its directory is simply deleted. A Block in which only part of the data has expired cannot be reclaimed yet; it has to wait until the whole Block expires, or until a compaction. One point worth noting: as historical data keeps being compacted, Blocks grow larger and cover longer time ranges, which makes them harder to reclaim. An upper limit on Block size therefore has to be enforced, usually configured based on the retention window.

The points above cover the basic design of data storage, which is fairly straightforward. As with any time series database, there is an index alongside the data store. The V3 index structure is relatively simple and follows the example given in the article:

Each time series is assigned a unique ID, and an inverted index maps every label pair to the sorted list of IDs of the series that carry that label; a query looks up the list for each of its label conditions and intersects them. For the latest Block, which is still receiving writes, V3 holds the entire index in memory as an in-memory structure and only persists it to a file once the Block is closed. This makes it easy to maintain the series-to-ID mapping and the label-to-ID-list mapping in memory, and query efficiency is high. Prometheus also makes an assumption about label cardinality: "A real-world dataset of ~4.4 million series with about 12 labels each has less than 5,000 unique labels." Holding all of that in memory is a small order of magnitude and poses no problem at all. InfluxDB uses a similar strategy, while some other TSDBs use Elasticsearch directly as their index engine.
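A rough sketch of that inverted-index idea, reusing the SeriesID type from the earlier sketches (illustrative only, not Prometheus's actual code):

```go
// postings maps a label pair ("name=value") to the ascending-sorted list of
// IDs of the series carrying that label pair.
type postings map[string][]SeriesID

func (p postings) lookup(name, value string) []SeriesID {
	return p[name+"="+value]
}

// intersect returns the IDs present in both ascending-sorted slices,
// e.g. intersect(p.lookup("job", "api"), p.lookup("method", "GET")).
func intersect(a, b []SeriesID) []SeriesID {
	var out []SeriesID
	for i, j := 0, 0; i < len(a) && j < len(b); {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}
```

Because the ID lists are kept sorted, intersecting them is a linear merge, which keeps multi-condition label queries cheap even with millions of series.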

Conclusion

An LSM-like storage engine has natural advantages for time series data, where writes far outnumber reads. Some TSDBs are built directly on distributed databases that embed open-source LSM engines, such as HBase or Cassandra; some build their own layer on top of LevelDB/RocksDB; and others, such as InfluxDB and Prometheus, develop the storage engine entirely from scratch, because the specific characteristics of time series data leave room for deeper optimization, for example in indexing and compaction. The design of the Prometheus V3 engine is indeed very similar to InfluxDB's, with a high degree of agreement in their optimizations, and more changes will no doubt follow as new requirements emerge.
