The little knowledge

  1. In real projects, we often need aggregate statistics, such as a user aged 20-30 who likes reading technical books, listening to music, a programmer who likes to stay at home, and so on. If mysql is used for union, first, statements grow longer as labels grow longer, and second, statements are aggregated, grouped, and de-reloaded. How can this situation be resolved?
  2. For example, the popular interview question is to find 100 repeated numbers in a billion integers, or given any integer and determine whether it is in the billion.

BitMap principle

Bitmaps are used to mark elements with bits. The key is the element and the value is usually 0 or 1, which greatly saves storage space.

Now there are (2, 3, 4, 5, 7)5 numbers, given any number in the range 0-7, check whether it is in this set:

  1. Create an array of Byte types in the range (0-7) and set the array’s bit to 1.
  2. The Byte array is then iterated over. If the Byte array position is 1, the number exists.
  3. The same can be done for a billion integers. An int number is 4 bytes and 32 bits. If you use bits to mark positive integers, you can save 32 times the memory space.
  4. 1 billion digits, 4 billion bytes, 32 billion bits, about 4g memory, using bitMap markup, only 125M memory space can be stored, greatly saving memory space.

This is very similar to the idea of bucket sort.

BitMap in practice

For 10 million numbers, is any given number in it? The divide-and-conquer idea uses an int array as a bitMap. Divide an array into 32 groups with (0-31) positions in each group. If the given array’s bit in the specified array is 0, it does not exist.

  1. A [I] = a[N/32]

  2. Index = a[I] % 32

The above two operations can be changed to bitwise operations because bitwise operations are very efficient and consume very few CPU clock cycles. Conclusion: For multiples of 2, %2^n = &(2^n-1), modular operation is equal to and budget, for example: A % 16 = A&15, where 15 needs to be hexadecimal, that is, 0x0F.

In 10000000 numbers ranging from 1 to 100000, a specified number is given to determine whether it is in the set

Code:

public class BitMapActual {
    // 10 million data sets
    private static final int N = 10000000;

    / / bitmap array
    private static int[] a = new int[N/32 + 1]; //int equals 32 bits, so the data length is (N/32+1).

    /** * tags the collection data *@param n
     */
    public static void addValue(int n) {
        // Locate the array number equivalent to n/32
        int row = n >> 5;

        // Locate the slot position in the array equal to n%32
        int offset = n & 0x1F;

        // Set array slot to 1
        a[row] |= 1 << offset;
    }

    /** * determines whether the given number exists *@param n
     * @return* /
    public static boolean exits(int n) {
        // Locate the array number
        int row = n >> 5;

        // Locate the slot position in the array
        int offset = n & 0x1F;

        // 1 << position left position in a[I] and return true if slot position has value
        return (a[row] & ( 1<< offset)) ! =0;
    }

    public static void main(String[] args) {
        // Initialize an array of length N
        int num[] = new int[N];
        Random random = new Random();
        for (int i = 0; i < N; i++) {
            // Random number range is (0-100000)
            int item = random.nextInt(100000);
            num[i] = item;
        }

        BitMapActual map = new BitMapActual();
        / / 1
        for(int i = 0; i < N; i++){
            map.addValue(num[i]);
        }

        int temp = 200;
        if(map.exits(temp)){
            System.out.println("temp:" + temp + "has already exists");
        } else {
            System.out.println("temp:" + temp + "has no exists"); }}}Copy the code

Note: this code has some shift operations, interested can be studied