Comics: What is a Bitmap algorithm?

Two months ago —

In order to meet the statistical requirements of user labels, Xiao Hui uses Mysql to design the following table structure, and labels of each dimension correspond to a column of Mysql table:

How do you count all the programmers born in the 1990s?

Use an SQL statement to find the intersection:

Select count (distinct Name) as number of users from table whare age = ’90 ‘and Occupation =’ programmer ‘;

How do you figure out the total number of iPhone users or those born in the 2000s?

Use a union SQL statement:

Select count (distinct Name) number of AS users from table whare Phone = ‘apple’ or age = ’00 ‘;

Two months later —

— — — — — — — –

1. Given a bitmap of length 10, each bit corresponds to 10 integers ranging from 0 to 9. All bits of the bitmap are zeros.

2. Store the integer 4 in the bitmap at the position marked 4 and set this bit to 1.

3. Store the integer 2 in the bitmap with the subscript 2 and set this bit to 1.

4. Store the integer 1 into the bitmap at the location marked 1. Set this bit to 1.

5. Store the integer 3 into the bitmap at the location marked with subscript 3. Set this bit to 1.

What elements are stored in the bitmap at this point? 4,3,2,1, obviously.

Bitmaps are not only easy to query, they can also remove duplicate integers.

1. Create a mapping between the user name and user ID.

2. Make each label store all the user ids that contain it, and each label is a separate Bitmap.

3. In this way, it becomes clear at a glance to realize users’ de-duplication and query statistics:

1. How to find programmers who use iPhone?

2. How to find all male or post-2000 users?

A week later……

Take the user data of the previous period as an example. The basic information of the user is as follows. According to the age label, it can be divided into the post-90s and post-00s Bitmap:

In a more graphic way, the Bitmap of post-90s users is as follows:

At this time can we directly get non-post-90 users? Do the non-operation directly?

Obviously, there is actually only one non-post-90s user, instead of the 8 results obtained in the figure, so non-operation cannot be carried out directly.

In the same example, we are given a Bitmap of users born after 90 and a Bitmap of full users. Finally, what is required is the part that exists in all users but does not exist in post-90s users.

So how do we figure that out? We can use xOR operations where the same bit is 0 and the different bits are 1.

25769803776L = 11000000000000000000000000000000000B

8589947086L = 1000000000000000000011000011001110B

1. Parse Word0 and learn that the number of empty words across the current RLW is 0, followed by three consecutive ordinary words.

2. Calculate that the maximum NUMBER of consecutive common Word ids behind the current RLW is 64 X (0 + 3) -1 = 191.

3. Since 191 < 400003, the new ID must come after the next RLW (Word4).

4. Parsing Word4, it is known that the number of empty words across the current RLW is 6247, followed by a consecutive ordinary Word.

5. Calculate that the maximum ID of consecutive common Word behind the current RLW (Word4) is 191 + (6247 + 1) X64 = 400063.

6. Because 400003 < 400063, the correct position of the new ID 400003 is in the plain Word (Word5) behind the current RLW (Word4).

The final insert result is as follows:

The official instructions are as follows:

* Though you can set the bits in any order (e.g., set(100).set(10), set(1),
* you will typically get better performance if you set the bits in increasing order (e.g., set(1), set(10), set(100)).
* 
* Setting a bit that is larger than any of the current set bit
* is a constant time operation. Setting a bit that is smaller than an 
* already set bit can require time proportional to the compressed
* size of the bitmap, as the bitmap may need to be rewritten.


Copy the code

A few notes:

1. The initial technology selection of this project is not Mysql, but HANA, the in-memory database. In order to facilitate understanding, the original storage scheme is written as Mysq database.

1. The Bitmap optimization method introduced in this paper has been simplified to a certain extent, and the logic in the source code is much more complex. For example, the location of insert data 400003 is different from the actual step.

2. If students are interested, they can read the source code themselves, or even try to implement their own Bitmap algorithm. It takes a lot of time, but it’s a great way to learn.

Maven dependencies for EWAHCompressedBitmap are as follows:Copy the code

< the dependency > < groupId > com. Googlecode. Javaewah < / groupId > < artifactId > javaewah < / artifactId > < version > 1.1.0 < / version > </dependency>Copy the code

— — the END — — –

If you like this article, please long click the picture below to follow the subscription number programmer Xiao Grey and watch more exciting content

Related Posts

JAVA concurrency ReentrantLock principle analysis

What are the recent Java projects at Gitee that are noteworthy? This article tells you

Quick code simplification in one minute, using Lombok to eliminate getters, setters, and Equals methods