Currently, only the Memory, FileSystem, and RocksDB state backends are available for Flink jobs, and RocksDB is the only option for large amounts of state data (from GB up to TB). RocksDB's performance is highly dependent on tuning: if all defaults are used, read and write performance can be poor.

However, RocksDB's configuration is extremely complex, with upwards of 100 tunable parameters, and there is no one-size-fits-all optimization scheme. If we only consider Flink state storage, though, we can still summarize some relatively general optimization ideas. This article first introduces some basic knowledge and then enumerates the methods.

**Note:** The content of this article is based on our practice with Flink 1.9 running in production. In version 1.10 and later, due to the TaskManager memory model refactoring, RocksDB memory became part of off-heap managed memory by default, eliminating some of the manual adjustments below. If performance is still poor and manual intervention is required, set the `state.backend.rocksdb.memory.managed` parameter to `false` to disable RocksDB's managed memory.
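For reference, here is a minimal sketch of setting that key programmatically; the helper class is my own illustration, and putting the same key into flink-conf.yaml is equivalent:

```java
import org.apache.flink.configuration.Configuration;

public class RocksDBMemorySettings {

    // Equivalent to "state.backend.rocksdb.memory.managed: false" in
    // flink-conf.yaml; only meaningful on Flink 1.10 and later.
    public static Configuration disableManagedMemory() {
        Configuration conf = new Configuration();
        conf.setString("state.backend.rocksdb.memory.managed", "false");
        return conf;
    }
}
```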

State R/W on RocksDB

The read and write path of RocksDB when used as the Flink state backend differs slightly from the general case, as shown in the following figure.

Each registered state in a Flink job corresponds to a column family, which contains its own set of memtables and SSTables. A write operation goes to the active memtable first; when the memtable is full, it is marked immutable and later flushed to disk to form an SSTable. A read operation looks for the target data in the active memtable, the immutable memtables, the block cache, and the SSTables, in that order. SSTables are also merged according to a compaction strategy, producing the layered LSM tree.
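As a concrete illustration (my own minimal example, not from the original article), each state descriptor registered as below gets its own column family when the RocksDB backend is used, and every read/write of the state goes through the path just described:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Apply on a keyed stream, e.g. stream.keyBy(...).flatMap(new CountPerKey()).
public class CountPerKey extends RichFlatMapFunction<String, Long> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        // With the RocksDB backend, this descriptor becomes a column family
        // named "count", with its own memtables and SSTables.
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Types.LONG));
    }

    @Override
    public void flatMap(String value, Collector<Long> out) throws Exception {
        Long current = count.value();  // read: memtables -> block cache -> SSTables
        long next = (current == null) ? 1L : current + 1L;
        count.update(next);            // write: lands in the active memtable
        out.collect(next);
    }
}
```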

In particular, since Flink persists a snapshot of RocksDB's data to the file system on every checkpoint cycle, there is no need for a write-ahead log (WAL), so the WAL and fsync can be safely turned off.

I have explained RocksDB in detail before, including the concepts of read amplification, write amplification, and space amplification. Tuning RocksDB is essentially a balancing act between these three factors. For Flink jobs, which emphasize real-time performance, read amplification and write amplification deserve the most attention.

Tuning MemTable

As the write cache of an LSM Tree system, the memtable greatly affects write performance. Here are some parameters worth noting; a configuration sketch follows the list. For easy comparison, each entry gives the original RocksDB parameter name and the corresponding Flink configuration key, separated by a vertical bar.

  • write_buffer_size | state.backend.rocksdb.writebuffer.size: the size of a single memtable; the default is 64 MB. When a memtable reaches this threshold, it is marked immutable. Generally, increasing this parameter reduces the impact of write amplification, at the cost of heavier L0/L1 compaction after flushes.

  • max_write_buffer_number | state.backend.rocksdb.writebuffer.count: the maximum number of memtables (active plus immutable); the default is 2. If all memtables are full but the flush speed cannot keep up, writes stall. Therefore, if memory is sufficient, or if a mechanical hard disk (with slow flushes) is used, it is advisable to raise this to 4.

  • min_write_buffer_number_to_merge | state.backend.rocksdb.writebuffer.number-to-merge: the minimum number of immutable memtables to be merged before a flush; the default is 1. For example, if this parameter is set to 2, a flush is not triggered until there are at least two immutable memtables (with only one, it waits). Increasing this value lets more changes be merged before flushing, reducing write amplification, but it may increase read amplification because more memtables must be checked when reading. In our tests, a value of 2 or 3 works relatively well.
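Here is a hedged sketch of applying these three settings with Flink 1.9's OptionsFactory interface (on newer versions the equivalent is RocksDBOptionsFactory / ConfigurableRocksDBOptionsFactory, mentioned in the conclusion); the setter names follow the RocksJava API as I know it:

```java
import org.apache.flink.contrib.streaming.state.OptionsFactory;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

public class MemTableTuningFactory implements OptionsFactory {

    @Override
    public DBOptions createDBOptions(DBOptions currentOptions) {
        return currentOptions; // nothing DB-wide to change for memtables
    }

    @Override
    public ColumnFamilyOptions createColumnFamilyOptions(ColumnFamilyOptions currentOptions) {
        return currentOptions
                .setWriteBufferSize(128 * 1024 * 1024)  // write_buffer_size: 128 MB
                .setMaxWriteBufferNumber(4)             // max_write_buffer_number
                .setMinWriteBufferNumberToMerge(2);     // min_write_buffer_number_to_merge
    }
}
```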

Tuning Block/Block Cache

A block is the basic storage unit of an SSTable, and the block cache acts as the read cache, holding recently used blocks under an LRU policy; it greatly affects read performance. Two parameters matter here; a sketch follows the list.

  • block_size | state.backend.rocksdb.block.blocksize: the block size; the default is 4 KB. In production it is usually increased appropriately, typically to 32 KB; for mechanical disks it can be raised to 128 KB–256 KB to make full use of sequential read throughput. Note, however, that if the block size grows while the block cache size stays the same, the number of blocks held in the cache shrinks, which increases read amplification.

  • block_cache_size | state.backend.rocksdb.block.cache-size: the size of the block cache; the default is 8 MB. Per the read path described above, a larger block cache can effectively keep hot-data reads from falling through to the SSTables, so if memory allows, setting it to 128 MB or even 256 MB can improve read performance significantly.
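A matching sketch for the block parameters; the BlockBasedTableConfig setters are the RocksJava calls that I believe correspond to these two keys, and the helper class is my own framing:

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.ColumnFamilyOptions;

public class BlockTuning {

    // Apply block_size and block_cache_size; call this from
    // createColumnFamilyOptions() in an options factory.
    public static ColumnFamilyOptions tune(ColumnFamilyOptions cfOptions) {
        BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
                .setBlockSize(32 * 1024L)                // block_size: 32 KB
                .setBlockCacheSize(256 * 1024 * 1024L);  // block_cache_size: 256 MB
        return cfOptions.setTableFormatConfig(tableConfig);
    }
}
```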

Tuning Compaction

Compaction is one of the most expensive operations in any LSM-tree-based storage engine, and when compaction falls behind it can stall both reads and writes. It is recommended that you read my earlier article about RocksDB's compaction strategies for background, which won't be repeated here. A configuration sketch follows the list.

  • compaction_style | state.backend.rocksdb.compaction.style: the compaction algorithm. We keep the default LEVEL (i.e., leveled compaction), and the parameters below are based on this style as well.

  • target_file_size_base | state.backend.rocksdb.compaction.level.target-file-size-base: the size threshold of a single SSTable file at level L1; the default is 64 MB. For each level up, the threshold is multiplied by target_file_size_multiplier (which defaults to 1, meaning SSTables at every level are at most the same size). Obviously, increasing this value reduces the frequency of compactions and hence write amplification, but it also means old data cannot be cleaned up in a timely way, which increases read amplification. This parameter is not easy to tune; setting it larger than 256 MB is not recommended.

  • max_bytes_for_level_base | state.backend.rocksdb.compaction.level.max-size-level-base: the threshold on the total data size of level L1; the default is 256 MB. For each level up, the threshold is multiplied by max_bytes_for_level_multiplier (default 10). Since the thresholds of the upper levels are all derived from it, adjust it carefully: it is advisable to make it a multiple of target_file_size_base, and not too small, for example 5 to 10 times.

  • level_compaction_dynamic_level_bytes | state.backend.rocksdb.compaction.level.use-dynamic-size: relates to the parameters above. When enabled, the per-level size thresholds are computed dynamically (the multiplication factor effectively becomes a division factor working down from the highest level), so that more data sits in the highest level. This reduces space amplification and keeps the structure of the whole LSM tree more stable. It is strongly recommended for mechanical hard disks.
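The corresponding sketch for the leveled-compaction parameters above (setter names per the RocksJava API; the values are just the illustrative ones from the text):

```java
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.CompactionStyle;

public class CompactionTuning {

    // Apply the leveled-compaction settings discussed above; call this from
    // createColumnFamilyOptions() in an options factory.
    public static ColumnFamilyOptions tune(ColumnFamilyOptions cfOptions) {
        return cfOptions
                .setCompactionStyle(CompactionStyle.LEVEL)        // compaction_style
                .setTargetFileSizeBase(64 * 1024 * 1024L)         // target_file_size_base: 64 MB
                .setMaxBytesForLevelBase(5 * 64 * 1024 * 1024L)   // max_bytes_for_level_base: 5x
                .setLevelCompactionDynamicLevelBytes(true);       // use-dynamic-size
    }
}
```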

Generic Parameters

  • max_open_files | state.backend.rocksdb.files.open: as the name implies, the maximum number of files a RocksDB instance can keep open; the default is -1, meaning no limit. Since SSTable indexes and Bloom filters reside in memory and occupy file descriptors by default, if this value is too small the indexes and Bloom filters cannot be loaded properly, which drags down read performance significantly.

  • max_background_compactions / max_background_flushes | state.backend.rocksdb.thread.num: the maximum number of concurrent background threads responsible for flushes and compactions; the default is 1. Note that Flink merges these two parameters into one (handled via the DBOptions.setIncreaseParallelism() method). Since flushes and compactions are relatively heavy operations, raise it if there is enough CPU headroom; in our practice it is usually set to 4 (see the sketch after this list).
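And a sketch for the DBOptions side: setIncreaseParallelism() is the call the article itself mentions, while setMaxOpenFiles() is the RocksJava counterpart of files.open as I know it:

```java
import org.rocksdb.DBOptions;

public class GeneralTuning {

    // Apply max_open_files and the shared flush/compaction thread count;
    // call this from createDBOptions() in an options factory.
    public static DBOptions tune(DBOptions dbOptions) {
        return dbOptions
                .setMaxOpenFiles(-1)         // max_open_files: unlimited
                .setIncreaseParallelism(4);  // flush + compaction threads
    }
}
```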

Conclusion

Besides setting parameters as above, users can also implement the ConfigurableRocksDBOptionsFactory interface to create their own DBOptions and ColumnFamilyOptions instances and pass in custom parameters, which is more flexible. For more inspiration, readers can refer to Flink's predefined sets of RocksDB parameters (in the PredefinedOptions enumeration).
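For completeness, here is a hedged usage sketch wiring a custom factory (reusing the MemTableTuningFactory sketch from earlier) into a Flink 1.9 job; the checkpoint URI is a placeholder:

```java
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class JobSetup {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder checkpoint URI; "true" enables incremental checkpoints.
        RocksDBStateBackend backend =
                new RocksDBStateBackend("hdfs:///flink/checkpoints", true);

        // Start from a predefined option set, then layer custom tweaks on top.
        backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM);
        backend.setOptions(new MemTableTuningFactory());

        env.setStateBackend(backend);
        // ... build the rest of the pipeline, then call env.execute(...)
    }
}
```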

This article is reprinted from LittleMagic's blog. Original link: www.jianshu.com/p/bc7309b03…