Common mass data processing problem solving ideas

The basic idea

Hash sharding. Compute the hash code for each data, then perform the modulo operation, so that the same data will enter the same shard.
Dictionary tree/hash table. For each shard, the number of occurrences of the statistics.
The heap. Use the heap to solve the topK problem.
Merge. Since there is no duplicate data between each shard, the merging process is simple.

Common topic

Given two files a and B, each containing 5 billion urls, each containing 64 bytes, with a memory limit of 4 gigabytes, let you find the common URL of a and B.

Hash (s)%1000 to get 2000 files a0, A1… a999,b0,b1,… B999. Ai and BJ are respectively processed by putting the data in AI into a hash table and judging whether the data in BJ appears in the hash table. If so, it is public data.

Find the non-repeating integers in the 250 million integers. Memory is not enough to hold the 250 million integers.

Use a 2-bit BitMap. 00 indicates that the number does not exist, 01 indicates that the number occurs once, and 11 indicates that the number occurs many times. The memory required is about 2^32*2bit=1GB when 250 million numbers are put into the BitMap.

10 million strings, some of which are duplicate, you need to get rid of all the duplicates and keep the unduplicated strings. How to design and implement?

Using hash sharding, divide 10 million strings into 100 small files, and then deduplicate each small file, using dictionary tree. Finally, combine the 100 results.

A text file, find the first 10 words that often appear, but this time the file is relatively long, say it is hundreds of millions of lines or a billion lines, in short, can not read into memory at a time, ask the optimal solution.

First, hash sharding is carried out to divide all data into 100 small files. Then, the dictionary tree is used to count the number of times for each small file, and then topK is calculated. Minimum heap can be used, and finally merge.

Combine multiple sets into a set without intersection: Given a set of strings, the format is: (aaa, BBB, CCC),(CCC, DDDD),(eee),(MMM). It is required to merge the sets whose intersection is not empty, and it is required that there is no intersection between the merged sets. For example, the above example should output (AAA, BBB, CCC, DDD),(eee),(MMM).

Use and search set implementation.

Common mass data processing problem solving ideas

The basic idea

Common topic

Given two files a and B, each containing 5 billion urls, each containing 64 bytes, with a memory limit of 4 gigabytes, let you find the common URL of a and B.

Find the non-repeating integers in the 250 million integers. Memory is not enough to hold the 250 million integers.

10 million strings, some of which are duplicate, you need to get rid of all the duplicates and keep the unduplicated strings. How to design and implement?

A text file, find the first 10 words that often appear, but this time the file is relatively long, say it is hundreds of millions of lines or a billion lines, in short, can not read into memory at a time, ask the optimal solution.

Related Posts

Extensions of synchronized: Reentry locks &AQS

Kettle Based Data Integration Platform (2)

Use cookies for authentication in ASP.NET Core applications