Global Dictionary Explanation

Why do you need a global dictionary?

In the field of OLAP data analysis, count distinct is a very common requirement. Depending on the accuracy required of the result, it can be divided into approximate deduplication and precise deduplication.

With large data sets, precise deduplication with fast query response is challenging. The most common technique for precise deduplication is the bitmap: for integer data, we can record the values directly in a bitmap, but real data also contains other types, such as String. To deduplicate those precisely, we first need to build a dictionary that maps every value to an integer, and then count with the bitmap.

We all know that Kylin uses precomputation to speed up big data analysis. When a Cube is built incrementally, all segments in the Cube must use the same dictionary, namely the global dictionary, to avoid the deduplication errors that would arise if each segment built its own dictionary separately.

Evolution of the global dictionary

Kylin has supported the global dictionary since version 1.5.3, but the way it was built at that time had obvious drawbacks:

The global dictionary was built at a single point on the Job Server, so as data grew, the build time became uncontrollable. Kylin 3.1 added a Hive-based distributed global dictionary build, which solved this problem; see the Kylin distributed global dictionary documentation for details. However, to accommodate the new build and query engine, Kylin 4.0 implements yet another way of building a global dictionary, this time based on Spark. This article describes in detail how the global dictionary is implemented in Kylin 4.0.

Spark based global dictionary

Kylin 4.0 builds the global dictionary with distributed encoding on Spark, which reduces the pressure on any single node and removes the limit on dictionary size imposed by the integer type.

Design and Principle

The structure of the global dictionary

  • Each build task generates a new global dictionary
  • The dictionary of each new build task is saved by version number; old global dictionaries are gradually deleted
  • A dictionary consists of one metadata file and multiple dictionary files; each dictionary file is called a bucket
  • Each bucket is divided into two mappings (Map<String, Long>), which together form a complete mapping relationship

Bucket concepts and transformations

Kylin introduces the concept of buckets, which can be understood as partitioning the data into several buckets (i.e., multiple partitions) for parallel processing. The first time the dictionary is built, the values in each bucket are encoded starting from 1; after all buckets are encoded, the global dictionary value of each entry is obtained by adding its bucket's offset. In the code, the two encodings are stored in two HashMaps: one holds the relative dictionary values within a bucket, and the other holds the absolute dictionary values across all buckets.

The following figure shows how bucket number 1 of the dictionary changes over several build tasks. Each build creates a new version of the bucket (v1, v2, v3, and so on); the reason for versioning is explained later. Curr (current) and Prev (previous) are the two HashMaps in a bucket, storing respectively the relative encodings of the dictionary values added in the current build, and the absolute encodings of all the dictionary values built previously.

Build steps

  • Create a flat table with Spark and get the DISTINCT values of the columns that need precise deduplication
  • Determine the number of shards from the number of distinct values, and decide whether the buckets need to be expanded
  • Repartition the data into multiple partitions, encode each partition separately, and store each as its own dictionary file
  • Assign a version number to the current build task
  • Save the dictionary files and the metadata (the number of buckets and each bucket's offset)
  • Delete old versions if the retention condition is met
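
The steps above can be sketched end to end. All names here are illustrative (this is not Kylin's API), and the threshold and default bucket count are placeholder values, not Kylin's actual defaults:

```python
# Hypothetical sketch of the build flow: deduplicate, size the buckets,
# encode per bucket, then persist dictionary files plus metadata under a
# new timestamp-based version.
import time

def build_global_dict(column_values, threshold=500_000,
                      default_buckets=10, store=None):
    store = store if store is not None else {}   # stand-in for HDFS
    distinct = set(column_values)                # DISTINCT values to encode
    # Bucket count: enough that no bucket exceeds the threshold.
    num_buckets = max(default_buckets, -(-len(distinct) // threshold))

    buckets = [[] for _ in range(num_buckets)]   # repartition by hash
    for v in distinct:
        buckets[hash(v) % num_buckets].append(v)

    version = int(time.time() * 1000)            # timestamp version number
    offsets, offset = [], 0
    for i, b in enumerate(buckets):
        # One dictionary file per bucket; codes are offset + local rank.
        mapping = {v: offset + j + 1 for j, v in enumerate(sorted(b))}
        store[(version, f"bucket-{i}")] = mapping
        offsets.append(offset)
        offset += len(b)
    # Metadata records the bucket count and each bucket's offset.
    store[(version, "meta")] = {"buckets": num_buckets, "offsets": offsets}
    return version, store
```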

The first build

  • Calculate the number of buckets: the maximum of (the number of dictionary values to build divided by the single-bucket threshold) and the default bucket count
  • Create the buckets and distribute the data to them for encoding
  • Generate the metadata file, recording the offset of each bucket
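
The bucket sizing rule in the first step can be written out directly. The threshold and default values below are illustrative placeholders, not Kylin's configured defaults:

```python
# Sketch of first-build bucket sizing: take the larger of the configured
# default bucket count and the count needed so that no bucket holds more
# dictionary values than the per-bucket threshold.
import math

def bucket_count(dict_size, threshold_per_bucket, default_buckets):
    return max(default_buckets, math.ceil(dict_size / threshold_per_bucket))

# Small dictionary: the default bucket count dominates.
assert bucket_count(1_200_000, threshold_per_bucket=500_000, default_buckets=10) == 10
# Large dictionary: the threshold forces more buckets.
assert bucket_count(20_000_000, threshold_per_bucket=500_000, default_buckets=10) == 40
```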

The following are the related configuration items and their default values.


Non-first builds

  • Determine whether the buckets need to be expanded, based on the number of new dictionary values
  • Reallocate the previously encoded dictionary values to the expanded buckets
  • Read the latest dictionary data from the previous version and distribute it to the corresponding buckets
  • Encode the new values into the buckets
  • Dictionary values from the previous build remain unchanged
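
The key invariant of a non-first build is that old codes never move. A simplified sketch (illustrative names, and a global counter instead of Kylin's per-bucket offset arithmetic):

```python
# Sketch of a non-first build: previous encodings are redistributed into
# the (possibly expanded) bucket set unchanged, and only unseen values
# receive new codes, continuing after the previous maximum.

def rebuild(prev_absolute, new_values, num_buckets):
    # Step 1: redistribute the old dictionary into the expanded buckets;
    # these are the "Prev" maps (shown only to illustrate the step).
    prev_buckets = [dict() for _ in range(num_buckets)]
    for v, code in prev_absolute.items():
        prev_buckets[hash(v) % num_buckets][v] = code

    # Step 2: bucket only the values not seen before (the "Curr" side).
    curr_buckets = [[] for _ in range(num_buckets)]
    for v in set(new_values) - set(prev_absolute):
        curr_buckets[hash(v) % num_buckets].append(v)

    # Step 3: assign absolute codes to the new values, continuing after
    # the previous build's maximum so old encodings stay valid.
    merged = dict(prev_absolute)
    next_code = max(prev_absolute.values(), default=0) + 1
    for bucket in curr_buckets:
        for v in sorted(bucket):
            merged[v] = next_code
            next_code += 1
    return merged

d1 = {"a": 1, "b": 2}
d2 = rebuild(d1, ["b", "c", "d"], num_buckets=4)
assert d2["a"] == 1 and d2["b"] == 2        # old codes unchanged
assert sorted(d2.values()) == [1, 2, 3, 4]  # new values appended
```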

Version control

Global dictionaries are isolated between builds by assigning each build a timestamp-based version number. Versioning was added because build tasks may execute concurrently, which the global dictionary build code does not otherwise support. Through version control, each encoding pass can read the previously built global dictionary in full, ensuring that the latest version always contains the complete global dictionary encoding. Moreover, whenever a Cube's global dictionary is read, the latest version is selected. The dictionary is ultimately stored by version on the file storage system (HDFS in this case), as shown in the figure below.
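
Selecting the latest version is then just a matter of comparing the timestamped directory names. The `v_<timestamp>` naming below is illustrative, not Kylin's actual on-HDFS layout:

```python
# Sketch of timestamp-based version selection: each build writes under a
# directory named by its version, and readers always pick the largest
# timestamp, i.e., the most recently completed build.

def latest_version(version_dirs):
    # e.g. "v_1612137600000" -> 1612137600000; largest timestamp wins.
    return max(version_dirs, key=lambda d: int(d.split("_")[1]))

assert latest_version(["v_1609459200000", "v_1612137600000"]) == "v_1612137600000"
```

Because a reader only ever opens the newest complete version, a build that is still writing its own new version directory never disturbs concurrent readers.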


  1. Why does a BucketDictionary need two maps?

At the start of the build, the dictionary values assigned to each bucket are given a relative encoding starting from 1, and these relative values are stored in one HashMap. Once relative encoding is complete, the offset of each bucket, that is, the number of dictionary values in it, is known. Since the buckets are ordered, each value's absolute encoding is then computed as its relative encoding plus the offsets of all preceding buckets, and these absolute encodings are stored in the second HashMap.

  2. Is there a data skew problem?

In practice, the probability that a hotspot prevents the build from completing is very small, and even billions of distinct values generally will not fail the build through skew alone. Columns with very high cardinality can indeed cause this problem, but the number of encoding buckets can be scaled up almost without limit, so unless a single key is a hotspot, it is easy to complete the build by adjusting the parameters.

  3. Why can the number of global dictionary entries exceed the maximum cardinality of an integer (about 2.1 billion)?

Because a new bitmap data structure, Roaring64Bitmap, was introduced. After the global dictionary encoding is complete, the encoded values are compressed into binary and stored in a Roaring64Bitmap object, where the bitmap stores values as Longs rather than Integers.
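
The effect of the 64-bit bitmap can be illustrated without the library itself. A plain Python set stands in for Roaring64Bitmap here (the real structure additionally compresses the longs it holds):

```python
# Sketch of why a 64-bit bitmap lifts the ~2.1 billion cap: dictionary
# codes are stored as 64-bit longs, so ids beyond the signed 32-bit
# integer range remain representable.

INT_MAX = 2**31 - 1          # 2_147_483_647, the old integer ceiling
LONG_MAX = 2**63 - 1         # the new ceiling with 64-bit storage

bitmap = set()               # stand-in for a Roaring64Bitmap object
for code in (INT_MAX - 1, INT_MAX + 1, 5_000_000_000):
    bitmap.add(code)

assert 5_000_000_000 in bitmap               # a code above the 32-bit limit
assert all(c <= LONG_MAX for c in bitmap)    # every code fits in a long
```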