Original author: Burak Yucesoy. Translated and edited by Ma Shu.

It is very common to run SELECT COUNT(DISTINCT) on a database. Applications often have large analytics dashboards that highlight distinct counts, such as distinct users, distinct products, or distinct visits. While traditional SELECT COUNT(DISTINCT) queries work well on a single machine, they are hard to answer in a distributed system: you can’t simply push the query down to the worker nodes and add up the results, because the same value may appear on more than one worker node. Instead, you can do one of the following:

  • Pull all the distinct data onto one machine and count it there. (Poor scalability)

  • Do Map/Reduce. (Scalable, but slow)

That’s where approximation algorithms, or sketches, come in. Sketches are probabilistic algorithms that efficiently produce approximate results within a mathematically provable margin of error. There are many such sketches, but today we will focus on just one: HyperLogLog (HLL). HLL is very good at estimating the number of distinct elements in a list. First, let’s take a look at the internal structure of HLL to understand why the algorithm scales to the distinct-count problem and how it can be applied in a distributed setting. Then, let’s look at some examples of using HLL.

What does HLL do behind the scenes?

Hash all the elements.

HLL, like almost all other probabilistic counting algorithms, relies on the data being uniformly distributed. Because real-world data is usually not uniform, HLL first hashes every element to make the distribution more even. By uniform distribution we mean that each bit of the hashed value has a 0.5 probability of being 0 or 1; we’ll see shortly why this is useful. Besides uniformity, hashing lets HLL handle all data types the same way: as long as your data type has a hash function, you can use HLL for cardinality estimation.
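As an illustration of the hashing step, here is a minimal Python sketch. The helper name hash64 and the choice of SHA-256 are assumptions made for this example; HLL itself does not prescribe a particular hash function.

# Minimal sketch of the hashing step: map any element to a 64-bit integer
# whose bits are (approximately) uniformly distributed. SHA-256 is used here
# purely for illustration; any well-mixing hash function would do.
import hashlib

def hash64(element):
    digest = hashlib.sha256(str(element).encode("utf-8")).digest()
    # Keep the first 8 bytes as an unsigned 64-bit integer.
    return int.from_bytes(digest[:8], "big")

# Each of the 64 bits is roughly equally likely to be 0 or 1.
print(format(hash64("user_42"), "064b"))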

Observe rare data patterns

After all elements have been hashed, HLL looks at the binary representation of each hashed element and checks for bit patterns that are unlikely to occur. If such a rare pattern shows up, it means we are dealing with a large dataset.

To do this, HLL looks at the leading zero bits in the hash value of each element and records the length of the run of leading zeros. Roughly speaking, to observe k leading zeros we need about 2^(k+1) trials (i.e., hashed values). Therefore, if the maximum number of leading zeros seen in a dataset is k, HLL concludes that there are approximately 2^(k+1) distinct elements.
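A toy estimator built only on this observation might look like the following sketch, reusing the hypothetical hash64 helper from above. It is far coarser than real HLL, which refines the estimate considerably.

# Toy cardinality estimate from the maximum number of leading zeros.
import hashlib

def hash64(element):
    digest = hashlib.sha256(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def leading_zeros(h, bits=64):
    # Number of leading zero bits in a `bits`-wide hash value.
    return bits - h.bit_length()

def naive_estimate(elements):
    # If the maximum number of leading zeros seen is k, guess 2^(k+1).
    k = max(leading_zeros(hash64(e)) for e in elements)
    return 2 ** (k + 1)

print(naive_estimate(f"user_{i}" for i in range(100_000)))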

This is a very straightforward, simple way of estimating. However, it has some important features, especially in distributed environments.

  • HLL has a very small memory footprint. For a maximum cardinality of n, we only need to store log log n bits. For example, if we hash elements into 64-bit integers, the leading-zero count is at most 64, so 6 bits are enough to store each estimate. This saves a lot of memory compared with the naive method, where you need to remember every value you have seen.

  • We only need one pass (scan) over the data to find the maximum number of leading zeros.

  • We can work with streaming data. After computing the maximum number of leading zeros, we don’t have to re-scan the whole dataset when new data arrives; we simply compute the number of leading zeros of each new element, compare it with the current maximum for the dataset, and update the maximum if necessary.

  • We can efficiently combine the estimates of two separate datasets: the maximum number of leading zeros of the merged dataset is simply the larger of the two maxima. This allows us to shard the data, estimate the cardinality of each shard, and combine the results (a sketch of streaming updates and merging follows this list). This property is called additivity, and additivity is what lets us use HLL in distributed systems.
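The sketch below illustrates the streaming and merge properties under the same simplified single-maximum model (again using the hypothetical hash64 helper); it is not the full HLL data structure.

# Streaming updates and merging with a single "maximum leading zeros" value.
import hashlib

def hash64(element):
    digest = hashlib.sha256(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def leading_zeros(h, bits=64):
    return bits - h.bit_length()

class MaxZerosSketch:
    # Keeps only the maximum number of leading zeros observed so far.
    def __init__(self):
        self.max_zeros = 0

    def add(self, element):
        # Streaming update: compare the new element with the running maximum.
        self.max_zeros = max(self.max_zeros, leading_zeros(hash64(element)))

    def merge(self, other):
        # Additivity: the merged maximum is the larger of the two maxima.
        merged = MaxZerosSketch()
        merged.max_zeros = max(self.max_zeros, other.max_zeros)
        return merged

    def estimate(self):
        return 2 ** (self.max_zeros + 1)

# Two shards processed independently, then merged without rereading the data.
shard_a, shard_b = MaxZerosSketch(), MaxZerosSketch()
for i in range(50_000):
    shard_a.add(f"user_{i}")
for i in range(25_000, 75_000):          # overlaps with shard_a
    shard_b.add(f"user_{i}")
print(shard_a.merge(shard_b).estimate())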

Stochastic averaging

If you think this is not a very good estimate, you’re right. First of all, our estimates are always powers of two. Second, if the data is not distributed uniformly enough, the estimate can be wildly off.

One possible fix for these problems is to repeat the process with different hash functions and take the average of the results. This would work, but hashing all the data multiple times is expensive. HLL solves the problem with a technique called stochastic averaging. Basically, we split the data into buckets, apply the algorithm to each bucket individually, and then average the per-bucket results. We use the first few bits of the hash value to determine which bucket an element belongs to, and the remaining bits to compute the maximum number of leading zeros.

We can also tune the accuracy by choosing how many buckets to split the data into. Since each bucket needs only log log n bits, we end up using very little memory even with many buckets, and such a small footprint matters when working with large-scale data. To merge two estimates, we merge them bucket by bucket and then take the average, which is why we keep the maximum number of leading zeros for each bucket.
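Here is a toy Python version of the bucketing idea, again built on the hypothetical hash64 helper: the first b bits of each hash pick a bucket, the remaining bits feed the leading-zero count, and the final estimate averages the per-bucket estimates. Real HLL uses a harmonic mean plus bias-correction terms rather than this simple average, so treat the output as illustrative only; in practice you would use a production implementation such as the PostgreSQL HLL extension mentioned below.

# Toy stochastic averaging: split hashes into 2^b buckets, keep the maximum
# leading-zero count per bucket, and average the per-bucket estimates.
import hashlib

def hash64(element):
    digest = hashlib.sha256(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class BucketedSketch:
    def __init__(self, bucket_bits=11):
        self.b = bucket_bits
        self.m = 1 << bucket_bits                # number of buckets, e.g. 2048
        self.buckets = [0] * self.m              # max leading zeros per bucket

    def add(self, element):
        h = hash64(element)
        bucket = h >> (64 - self.b)              # first b bits pick the bucket
        rest = h & ((1 << (64 - self.b)) - 1)    # remaining 64 - b bits
        zeros = (64 - self.b) - rest.bit_length()
        self.buckets[bucket] = max(self.buckets[bucket], zeros)

    def merge(self, other):
        # Bucket-wise merge: keep the larger maximum in every bucket.
        merged = BucketedSketch(self.b)
        merged.buckets = [max(x, y) for x, y in zip(self.buckets, other.buckets)]
        return merged

    def estimate(self):
        # Each bucket sees roughly n / m of the elements, so scale back up.
        avg = sum(2 ** (k + 1) for k in self.buckets) / self.m
        return self.m * avg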

What else?

There are a few other things HLL does to improve the accuracy of its estimates, but observing bit patterns and stochastic averaging remain the key ideas. After these refinements, HLL can estimate the cardinality of a dataset with a typical error rate of about 2% while using only 1.5 kB of memory. Of course, you can improve the accuracy by using more memory. We won’t go into the details of the remaining steps, but there are many articles about HLL online.

HLL in distributed systems

As we mentioned, HLL is additive, which means you can split a dataset into several parts, run the HLL algorithm on each part separately to find the number of distinct elements in that part, and then merge the intermediate HLL results to get the distinct count over all the data, without ever going back to the original data.

If you work with large volumes of data kept on different physical machines, you can use HLL to compute the distinct count over all of it without pulling the whole dataset to a single location. In fact, Citus can do this for you: there is an HLL extension for PostgreSQL that is fully compatible with Citus. If you have the HLL extension installed and run COUNT(DISTINCT) queries on distributed tables, Citus uses HLL automatically. Once it is configured, you don’t have to do anything extra.

Using HLL

Setup

For this example we will use Citus Cloud and the GitHub events dataset. You can learn more about Citus Cloud here. Assuming you have created your Citus Cloud instance and connected to it with psql, you can create the HLL extension as follows:

CREATE EXTENSION hll;

You should create the extension on both the master and the worker nodes. Then enable approximate distinct counting by setting the citus.count_distinct_error_rate configuration value. Lower values give more accurate results but take more time and memory to compute. We recommend setting it to 0.005:

SET citus.count_distinct_error_rate TO 0.005;

We will use the github_events table and the large_events.csv dataset:

CREATE TABLE github_events
(
    event_id bigint,
    event_type text,
    event_public boolean,
    repo_id bigint,
    payload jsonb,
    repo jsonb,
    user_id bigint,
    org jsonb,
    created_at timestamp 
);

SELECT create_distributed_table('github_events', 'user_id');

\COPY github_events FROM large_events.csv CSV

Example

After distributing the table, we can run a regular COUNT(DISTINCT) query to find out how many distinct users created events:

SELECT
    COUNT(DISTINCT user_id)
FROM
    github_events;

It should return something like this:

 count
--------
 264227

(1 row)

This query looks as if it has nothing to do with HLL. However, if citus.count_distinct_error_rate is set to a value greater than 0 and the query contains COUNT(DISTINCT), Citus automatically uses HLL. For a simple case like this you don’t even need to change the query. The exact distinct count of users who created events is 264198, so our error rate is just over 0.0001 ((264227 − 264198) / 264198 ≈ 0.00011).

We can also filter the results with constraints. For example, we can query the number of distinct users who created a PushEvent:

SELECT
    COUNT(DISTINCT user_id)
FROM
    github_events
WHERE
    event_type = 'PushEvent'::text;

It returns:

 count
--------
 157471

(1 row)

Similarly, the exact distinct count for this query is 157154, so our error rate is slightly above 0.002.

Conclusion

If COUNT(DISTINCT) scalability is a problem for you in Postgres, take a look at HLL; it can be very useful if an approximate count is good enough for your use case. If you have any questions about scaling count distinct further with Citus, please get in touch.


If anything has been translated improperly, corrections are welcome, as are discussions of the technical issues.

Original link: CitusData