Prometheus includes a time series database stored locally on disk, and it also supports integration with remote storage systems, such as the free cloud storage API provided by Grafana Cloud, by filling in the remote_write interface information in the Prometheus configuration file.

This article does not cover the remote storage interface; it mainly introduces the implementation principles of Prometheus' local storage of time series data.

What is time series data?


Before studying the storage principles of Prometheus TSDB, let's first look at what time series data is, taking time series databases such as Prometheus TSDB and InfluxDB as the context.

Time series data usually appears in the form of (key, value) pairs, the set of values collected for a series at successive points in time; that is, each data point is a tuple consisting of a timestamp and a value.

identifier->(t0,v0),(t1,v1),(t2,v2)...

Data model for Prometheus TSDB

<metric name>{<label name>=<label value>, ... }

A concrete instance looks like this:

requests_total{method="POST", handler="/messages"}  

During storage, the __name__ label holds the metric name and the timestamp is marked with the @ symbol, which together form a complete time series sample.

---------------------------------------- key ----------------------------------------- value ---------
{__name__="requests_total", method="POST", handler="/messages"} @ 1649483597.197        52

A time series is a strictly chronologically increasing sequence of data points that can be addressed by a metric. Abstracted onto a two-dimensional plane, the horizontal axis represents monotonically increasing time and the metrics span the vertical axis. When extracting sample data, the values can be obtained by giving a time window and a metric.
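
As a rough illustration of this model (not the actual Prometheus code), here is a minimal sketch in Go; the Sample and Series types and the Range helper are hypothetical names introduced only for this example:

package main

import "fmt"

// Sample is one (timestamp, value) data point.
type Sample struct {
	T int64   // Unix timestamp in milliseconds
	V float64 // sampled value
}

// Series is identified by its metric name plus label set,
// and holds a time-ordered list of samples.
type Series struct {
	Labels  map[string]string // includes __name__
	Samples []Sample          // sorted by T, strictly increasing
}

// Range returns the samples that fall inside the time window [mint, maxt].
func (s *Series) Range(mint, maxt int64) []Sample {
	var out []Sample
	for _, sp := range s.Samples {
		if sp.T >= mint && sp.T <= maxt {
			out = append(out, sp)
		}
	}
	return out
}

func main() {
	s := Series{
		Labels: map[string]string{
			"__name__": "requests_total",
			"method":   "POST",
			"handler":  "/messages",
		},
		Samples: []Sample{{T: 1649483597197, V: 52}},
	}
	fmt.Println(s.Range(1649483000000, 1649484000000))
}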

How is time series data stored in Prometheus TSDB?


After this brief introduction to time series data, let's look at Prometheus TSDB storage (the V3 engine).

Prometheus TSDB overview

In the figure above, the Head Block is TSDB’s memory Block, and the gray Block is a persistent Block on disk.

First, an incoming sample (t, v) enters the Head block. To prevent loss of in-memory data, it is also written to the WAL, and it stays in memory for a period of time. It is then flushed to disk and memory-mapped (m-map). When these memory-mapped chunks, or the chunks still in memory, age past a certain point in time, they are written to disk as a persistent block. Multiple blocks are then merged as they grow old, and they are cleaned up once their retention period has expired.

Lifetime of samples in Head

When a sample comes in, it is written into the active chunk (the red block) in the Head, which is the only unit that can be actively written to; a write-ahead log (WAL) record is also made to prevent loss of in-memory data.

Once the active chunk is full (it spans more than 2 hours or holds 120 samples), the older data is cut off into head_chunk1.

head_chunk1 is flushed to disk and then memory-mapped. The active chunk continues to receive writes, is cut in turn, memory-mapped, and so on.

The memory mapping should hold only the most recent and most frequently used data, so Prometheus TSDB flushes older data into persistent blocks on disk, as shown in figures 1-4.
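
The cut-and-memory-map cycle described above can be sketched as follows. This is a simplified illustration under the stated assumptions (a chunk is cut after 120 samples or 2 hours); the head and chunk types and the append function are hypothetical, not the real Prometheus API:

package main

import "fmt"

const (
	maxSamplesPerChunk = 120                // assumption taken from the text above
	chunkRangeMs       = 2 * 60 * 60 * 1000 // 2 hours in milliseconds
)

type chunk struct {
	minT, maxT int64
	samples    int
}

type head struct {
	active *chunk  // the only chunk that accepts writes
	mmaped []chunk // chunks that have been flushed to disk and memory-mapped
}

// append writes a sample into the active chunk, cutting a new one
// (and memory-mapping the old one) when it is full.
func (h *head) append(t int64, v float64) {
	if h.active == nil {
		h.active = &chunk{minT: t, maxT: t}
	}
	// In the real TSDB the sample would also be recorded in the WAL here.
	if h.active.samples >= maxSamplesPerChunk || t-h.active.minT >= chunkRangeMs {
		h.mmaped = append(h.mmaped, *h.active) // flush + m-map the full chunk
		h.active = &chunk{minT: t, maxT: t}    // start a fresh active chunk
	}
	h.active.samples++
	h.active.maxT = t
	_ = v // the value itself would be compressed into the chunk
}

func main() {
	h := &head{}
	for i := int64(0); i < 300; i++ {
		h.append(i*1000, float64(i))
	}
	fmt.Printf("memory-mapped chunks: %d, samples in active chunk: %d\n",
		len(h.mmaped), h.active.samples)
}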

At this point, the basic structure of the Prometheus TSDB data directory becomes clearer.

./data
├── 01BKGV7JBM69T2G1BGBGM6KB12     # block ID
│   └── meta.json
├── 01BKGTZQ1SYQJTR4PB43C8PD98     # block ID
│   ├── chunks                     # chunk files of the block
│   │   └── 000001
│   ├── tombstones
│   ├── index
│   └── meta.json
├── chunks_head                    # Head memory-mapped chunks
│   └── 000001
└── wal                            # write-ahead log
    ├── 000000002
    └── checkpoint.00000001
        └── 00000000

The role of checkpoint in WAL

We need to periodically delete old WAL data; otherwise the disk will eventually fill up, and a TSDB restart will spend a long time replaying WAL events. So any data in the WAL that is no longer needed must be cleaned up. The checkpoint operation filters the data being removed from the WAL and writes what is still needed into new segments.

Suppose there are six WAL data segments, as follows:

data
└── wal
    ├── 000000
    ├── 000001
    ├── 000002
    ├── 000003
    ├── 000004
    └── 000005

Now suppose we want to clean up the sample data before time T, which sits in the first four data segments:

The checkpoint operation will traverse all records in segments 000000, 000001, 000002, and 000003 in order and:

  1. Delete all series records for series that are no longer in the Head.
  2. Discard all samples with a timestamp before T.
  3. Delete all tombstone records that apply to times before T.
  4. Rewrite the remaining series, sample, and tombstone records (in the same order as they appear in the WAL).

The checkpoint is named checkpoint.X, where X is the last segment covered by the checkpoint.

This gives us the new WAL layout below. When the WAL is replayed, the checkpoint is consulted first: the data segments inside the checkpoint are replayed, followed by 000004, the segment after checkpoint.000003.

data
└── wal
    ├── checkpoint.000003
    │   ├── 000000
    │   └── 000001
    ├── 000004
    └── 000005
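
To make the checkpoint steps above concrete, here is a minimal sketch of the filtering a checkpoint performs. The record types and the checkpoint function are simplified, hypothetical stand-ins, not the real WAL record encoding:

package main

import "fmt"

// Simplified WAL record kinds.
type seriesRec struct{ ref uint64 }                      // series creation
type sampleRec struct{ ref uint64; t int64 }             // sample
type tombstoneRec struct{ ref uint64; mint, maxt int64 } // deletion request

// checkpoint filters the records of the old segments: it keeps only
// series still in the Head, samples at or after time T, and tombstones
// that still cover times at or after T, in their original order.
func checkpoint(records []interface{}, inHead map[uint64]bool, T int64) []interface{} {
	var kept []interface{}
	for _, r := range records {
		switch rec := r.(type) {
		case seriesRec:
			if inHead[rec.ref] { // 1. drop series no longer in the Head
				kept = append(kept, rec)
			}
		case sampleRec:
			if rec.t >= T { // 2. drop samples before T
				kept = append(kept, rec)
			}
		case tombstoneRec:
			if rec.maxt >= T { // 3. drop tombstones entirely before T
				kept = append(kept, rec)
			}
		}
	}
	return kept // 4. rewritten in the same order as in the WAL
}

func main() {
	recs := []interface{}{
		seriesRec{ref: 1}, sampleRec{ref: 1, t: 900},
		sampleRec{ref: 1, t: 1100}, tombstoneRec{ref: 1, mint: 0, maxt: 500},
	}
	fmt.Println(checkpoint(recs, map[uint64]bool{1: true}, 1000))
}
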
Persistent storage of blocks

We have seen the wal and chunks_head storage structures. Next come blocks. What is a persistent block? When is it created? And why are blocks merged?

Block’s directory structure

01BKGTZQ1SYQJTR4PB43C8PD98         # block ID
├── chunks                         # chunk files of the block
│   └── 000001
├── tombstones                     # deletion records
├── index                          # index file
└── meta.json                      # block metadata

A Block on disk is a collection of chunks within a fixed time range, with its own index: a directory containing multiple files. Each Block has a unique, sortable ID (a ULID). When a sample in a Block needs to be updated or modified, Prometheus TSDB can only rewrite the whole Block, and the new Block gets a new ID (for the sake of the index discussed later). If samples need to be deleted, Prometheus TSDB records the deletion in tombstones instead of touching the original samples.

A tombstone can be thought of as a deletion marker that records which time ranges to ignore when reading a series. Tombstones are the only files in a Block that are created and modified after the data has been written; they store deletion requests.

The data structure of a tombstone record is shown below: the series reference plus the start and end times to be ignored.

┌────────────────────────┬─────────────────┬─────────────────┐
│ series ref <uvarint64> │ mint <varint64> │ maxt <varint64> │
└────────────────────────┴─────────────────┴─────────────────┘
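
To illustrate how such a record is used at read time, here is a small sketch that drops the samples of a series falling inside a tombstoned interval. The types and the filterDeleted function are illustrative, not the actual TSDB code:

package main

import "fmt"

// tombstone mirrors the record above: a series reference plus the
// time interval [mint, maxt] to ignore when reading that series.
type tombstone struct {
	seriesRef  uint64
	mint, maxt int64
}

type sample struct {
	t int64
	v float64
}

// filterDeleted returns only the samples of a series that are not
// covered by any tombstone interval for that series.
func filterDeleted(ref uint64, samples []sample, tombs []tombstone) []sample {
	var out []sample
	for _, s := range samples {
		deleted := false
		for _, tb := range tombs {
			if tb.seriesRef == ref && s.t >= tb.mint && s.t <= tb.maxt {
				deleted = true
				break
			}
		}
		if !deleted {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	samples := []sample{{100, 1}, {200, 2}, {300, 3}}
	tombs := []tombstone{{seriesRef: 7, mint: 150, maxt: 250}}
	fmt.Println(filterDeleted(7, samples, tombs)) // the sample at t=200 is ignored
}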

meta.json

meta.json contains all the metadata for the entire Block.

{
    "ulid": "01EM6Q6A1YPX4G9TEB20J22B2R",
    "minTime": 1602237600000,
    "maxTime": 1602244800000,
    "stats": {
        "numSamples": 553673232,
        "numSeries": 1346066,
        "numChunks": 4440437
    },
    "compaction": {
        "level": 1,
        "sources": [
            "01EM65SHSX4VARXBBHBF0M0FDS",
            "01EM6GAJSYWSQQRDY782EA5ZPN"
        ]
    },
    "version": 1
}

It records, in human-readable form, the start and end times of the block, the number of samples, series, and chunks, and the compaction information. The version field tells Prometheus how to parse the metadata.
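
The fields above map naturally onto a small Go struct. Here is a sketch for parsing meta.json; the struct and the file path are illustrative (the JSON field names follow the example shown):

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// BlockMeta mirrors the meta.json fields shown above.
type BlockMeta struct {
	ULID    string `json:"ulid"`
	MinTime int64  `json:"minTime"`
	MaxTime int64  `json:"maxTime"`
	Stats   struct {
		NumSamples uint64 `json:"numSamples"`
		NumSeries  uint64 `json:"numSeries"`
		NumChunks  uint64 `json:"numChunks"`
	} `json:"stats"`
	Compaction struct {
		Level   int      `json:"level"`   // how many times this block has been compacted
		Sources []string `json:"sources"` // ULIDs of the blocks it was built from
	} `json:"compaction"`
	Version int `json:"version"` // tells Prometheus how to parse the metadata
}

func main() {
	// Path is illustrative; point it at a real block directory to try it.
	b, err := os.ReadFile("./data/01BKGTZQ1SYQJTR4PB43C8PD98/meta.json")
	if err != nil {
		fmt.Println("read meta.json:", err)
		return
	}
	var meta BlockMeta
	if err := json.Unmarshal(b, &meta); err != nil {
		fmt.Println("parse meta.json:", err)
		return
	}
	fmt.Printf("block %s covers [%d, %d], %d series\n",
		meta.ULID, meta.MinTime, meta.MaxTime, meta.Stats.NumSeries)
}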

Block merge

We can see from the previous figure that the first Block is created when the memory-mapped chunks span 2 hours (the default). As Prometheus keeps creating blocks, they need to be maintained periodically to make efficient use of the disk and to maintain query performance.

The main job of Block merging is to write one or more existing source blocks (parent blocks) into a new Block and, finally, to delete the source blocks and replace them with the new merged block.

Why do I need to merge blocks?

  1. As described for tombstones above, Prometheus records deletions in a separate tombstones file while the data itself remains on disk. Therefore, when the tombstoned series exceed a certain percentage, the deleted data needs to be actually removed from disk.
  2. If the sample data values fluctuate very little, most of the data in two adjacent blocks will be the same. Merging these blocks can save disk space by reducing duplicate data.
  3. When the query hits more than 1 Block, the results of each Block must be merged, which may incur some additional overhead.
  4. If there are overlapping blocks (overlapping in time), querying them also deduplicates the samples between the blocks, and merging these overlapping blocks avoids the need for deduplication.

As shown in the example above, we have a sequential set of blocks [1, 2, 3, 4]. Blocks 1, 2, and 3 can be merged into a single new block, leaving the layout [1, 4]; or the blocks can be compacted in pairs, leaving [1, 3]. All of the time series data still exists, but there are now fewer blocks overall, which significantly reduces the cost of queries.
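
A rough sketch of what a merge does to the block metadata: the new block covers the union of the source time ranges, its compaction level increases, and it records the source ULIDs. The blockMeta type and mergeMeta function are illustrative only, not the actual compaction code (which also rewrites chunks, drops tombstoned data, and deduplicates overlapping samples):

package main

import "fmt"

type blockMeta struct {
	ULID             string
	MinTime, MaxTime int64
	Level            int
	Sources          []string
}

// mergeMeta derives the metadata of a merged block from its source blocks.
func mergeMeta(newULID string, sources []blockMeta) blockMeta {
	out := blockMeta{ULID: newULID, MinTime: sources[0].MinTime, MaxTime: sources[0].MaxTime}
	for _, s := range sources {
		if s.MinTime < out.MinTime {
			out.MinTime = s.MinTime
		}
		if s.MaxTime > out.MaxTime {
			out.MaxTime = s.MaxTime
		}
		if s.Level > out.Level {
			out.Level = s.Level
		}
		out.Sources = append(out.Sources, s.ULID)
	}
	out.Level++ // the merged block sits one compaction level higher
	return out
}

func main() {
	b1 := blockMeta{ULID: "01EM65SHSX4VARXBBHBF0M0FDS", MinTime: 0, MaxTime: 7200000, Level: 1}
	b2 := blockMeta{ULID: "01EM6GAJSYWSQQRDY782EA5ZPN", MinTime: 7200000, MaxTime: 14400000, Level: 1}
	fmt.Printf("%+v\n", mergeMeta("01EM6Q6A1YPX4G9TEB20J22B2R", []blockMeta{b1, b2}))
}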

How are blocks removed?

Prometheus TSDB takes a simple approach to source data deletion: blocks in the data directory that fall entirely outside our retention time window are deleted.

As shown in the figure below, block 1 can be safely removed, while block 2 must remain until it falls completely behind the retention boundary.

Because blocks are merged, older data tends to sit in ever larger blocks. There must therefore be an upper limit on merging so that blocks do not grow to span the entire database; this limit is usually set as a percentage of the retention window.
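
A minimal sketch of this retention check, under the assumption that a block is deleted only once its entire time range is older than the retention boundary (the block type and beyondRetention function are illustrative names):

package main

import (
	"fmt"
	"time"
)

type block struct {
	dir              string
	minTime, maxTime int64 // milliseconds
}

// beyondRetention reports whether a block falls completely behind the
// retention boundary and can therefore be safely deleted.
func beyondRetention(b block, now time.Time, retention time.Duration) bool {
	boundary := now.Add(-retention).UnixMilli()
	return b.maxTime < boundary // block 2 in the figure stays until its maxTime crosses this line
}

func main() {
	now := time.Now()
	retention := 15 * 24 * time.Hour // Prometheus' default retention time is 15 days
	blocks := []block{
		{dir: "old-block", maxTime: now.Add(-16 * 24 * time.Hour).UnixMilli()},
		{dir: "recent-block", maxTime: now.Add(-2 * time.Hour).UnixMilli()},
	}
	for _, b := range blocks {
		fmt.Println(b.dir, "delete:", beyondRetention(b, now, retention))
	}
}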

How is data retrieved from a huge number of series?


Inverted indexes are used in the Prometheus TSDB V3 engine to provide quick lookups of data items based on a subset of their contents, for example finding all series with the label app="nginx" without walking through every series and checking whether it contains that label.

First we assign a unique ID to each series; looking up a series by ID is O(1). Then we build an inverted list for each label. For example, if the series carrying the label app="nginx" have IDs 1, 11, and 111, then the inverted index for that label is [1, 11, 111]. So if n is the total number of series and m is the size of the result, the complexity of a query using the inverted index is O(m); that is, the query cost is determined by m. In the worst case, however, if every series carries the "nginx" label, the complexity obviously degrades to O(n). If queries only ever used a single label we could live with that, but that is not the case.
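
A simplified sketch of such an inverted (postings) index: each label pair maps to a list of series IDs. The postings type and its methods are illustrative, not the actual index format:

package main

import (
	"fmt"
	"sort"
)

// postings maps a "name=value" label pair to the IDs of the
// series that carry that label.
type postings map[string][]uint64

func (p postings) add(id uint64, labels map[string]string) {
	for name, value := range labels {
		key := name + "=" + value
		p[key] = append(p[key], id)
	}
}

// get returns the series IDs for one label pair, sorted ascending.
func (p postings) get(name, value string) []uint64 {
	ids := append([]uint64(nil), p[name+"="+value]...)
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	p := postings{}
	p.add(1, map[string]string{"app": "nginx", "env": "dev"})
	p.add(11, map[string]string{"app": "nginx", "env": "ops"})
	p.add(111, map[string]string{"app": "nginx"})
	fmt.Println(p.get("app", "nginx")) // [1 11 111]
}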

It is common for a label to be associated with millions of series, and a query usually combines several labels, for example selecting series with app="dev" AND app="ops". The worst-case complexity is then O(n^2), and with more labels it grows to O(n^3), O(n^4), O(n^5)... This is unacceptable. So what do we do?

What if we sorted the inverted lists?

= "app dev" - > [100150, 0200, 00511, 66] = "app ops" - >,4,8,10,50,100,20000 [2]Copy the code

Their intersection is [100, 20000]. To compute this quickly, we advance a cursor through each list starting from its smaller end; whenever the two cursors point at equal values, that value is added to the result set. The search cost is obviously much lower: over k inverted lists the complexity is O(k*n) rather than the worst-case O(n^k).
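
Here is a sketch of that two-cursor intersection on sorted posting lists (illustrative only; the real TSDB works on iterators over its postings format). The lists in main mirror the example above:

package main

import "fmt"

// intersect walks two sorted posting lists with one cursor each,
// emitting an ID only when both cursors point at the same value.
// Cost is O(len(a) + len(b)) instead of comparing every pair.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++ // advance the cursor that is behind
		default:
			j++
		}
	}
	return out
}

func main() {
	dev := []uint64{1, 5, 11, 66, 100, 150, 200, 20000}
	ops := []uint64{2, 4, 8, 10, 50, 100, 20000}
	fmt.Println(intersect(dev, ops)) // [100 20000]
}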

All that remains is to maintain the index: by keeping the mappings between series and IDs, and between labels and their inverted lists, up to date, queries stay efficient.


This has been a brief introduction to Prometheus TSDB storage. There are plenty of details not covered in this article, such as how the WAL performs compression and replay, how mmap works, and the on-disk file formats of the TSDB. For further reading, see the blog: iqsing.github.io


This article references:

Prometheus TSDB, the blog series by Prometheus maintainer Ganesh Vernekar

Writing a Time Series Database from Scratch, by Prometheus developer Fabian Reinartz

PromCon 2017: Storing 16 Bytes at Scale – Fabian Reinartz
