1. Massive log data: extract the IP that visits Baidu the most times in a day. Solution: First, take the IP addresses of the visits to Baidu out of that day's logs and write them into one large file, one per line. Note that an IP address is 32 bits, so there are at most 2^32 distinct IPs. A mapping method can be used, such as taking each IP modulo 1000, to split the large file into 1000 small files. Then find the most frequent IP in each small file (a hash_map can be used for the frequency statistics) together with its frequency. Finally, among those 1000 per-file winners, pick the IP with the highest frequency; that is the answer.
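A minimal Python sketch of this split-and-count idea, assuming the extracted Baidu-visit IPs already sit one per line in a hypothetical ip_log.txt (opening 1000 bucket files at once may require raising the OS open-file limit):

```python
import zlib
from collections import Counter

NUM_BUCKETS = 1000

def split_into_buckets(path="ip_log.txt"):
    """Map each IP to one of 1000 small files via a stable hash mod 1000."""
    buckets = [open(f"ip_bucket_{i}.txt", "w") for i in range(NUM_BUCKETS)]
    try:
        with open(path) as f:
            for line in f:
                ip = line.strip()
                if ip:
                    buckets[zlib.crc32(ip.encode()) % NUM_BUCKETS].write(ip + "\n")
    finally:
        for b in buckets:
            b.close()

def most_frequent_ip():
    """Count each small file independently and keep the overall winner."""
    best_ip, best_count = None, 0
    for i in range(NUM_BUCKETS):
        with open(f"ip_bucket_{i}.txt") as f:
            counts = Counter(line.strip() for line in f)
        if counts:
            ip, count = counts.most_common(1)[0]
            if count > best_count:
                best_ip, best_count = ip, count
    return best_ip, best_count
```

Because all occurrences of a given IP hash into the same bucket, each per-bucket winner carries that IP's true total count, so the global maximum over the winners is correct.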

2. A search engine records, via a log file, every search string a user submits; each string is 1 to 255 bytes long. Suppose there are 10 million records (the query strings are highly repetitive: the total is 10 million, but no more than 3 million after removing duplicates; the more often a query string is repeated, the more users it has and the more popular it is). Count the 10 most popular query strings, using no more than 1 GB of memory. Solution: Step one, preprocess this batch of data and complete the frequency counting with a hash table in O(N) time. Step two, use a heap to find the Top K; the time complexity is N'·logK. That is, with a heap structure we can find and adjust an element in O(logK) time. So we maintain a min-heap of size K (10 in this case) and iterate over the 3 million distinct queries, comparing each with the root element. The total cost is O(N) + N'·O(logK), where N = 10 million and N' = 3 million. Alternatively, a trie tree can be used, with the keyword field storing the number of occurrences of each query string (0 if it has not appeared); a min-heap of 10 elements is then used to pick the queries with the highest occurrence frequencies.
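A minimal sketch of the hash-count plus size-K min-heap step in Python, assuming the queries are available as an in-memory iterable (all names here are illustrative):

```python
import heapq
from collections import Counter

def top_k_queries(queries, k=10):
    """Count with a hash table, then keep a size-k min-heap of (count, query)."""
    counts = Counter(queries)              # O(N) counting pass
    heap = []                              # min-heap ordered by count
    for query, count in counts.items():    # N' distinct queries
        if len(heap) < k:
            heapq.heappush(heap, (count, query))
        elif count > heap[0][0]:           # beats the smallest of the current top k
            heapq.heapreplace(heap, (count, query))
    return sorted(heap, reverse=True)      # most popular first

# Toy usage:
print(top_k_queries(["foo", "bar", "foo", "baz", "foo", "bar"], k=2))
```

Each of the N' distinct queries touches the heap at most once, which is exactly the O(N) + N'·O(logK) cost described above.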

3. There is a 1 GB file in which each line is a word; no word is longer than 16 bytes, and the memory limit is 1 MB. Return the 100 words with the highest frequency. Solution: Read the file sequentially, compute hash(x) % 5000 for each word x, and write the word into the corresponding one of 5000 small files (named x0, x1, …, x4999). Each file is then about 200 KB on average. If some file is still larger than 1 MB, keep splitting it in the same way until no small file exceeds 1 MB. For each small file, count the words it contains and their frequencies (with a trie tree, hash_map, etc.), take the 100 most frequent words (a min-heap of 100 nodes can be used), and write those 100 words and their frequencies to a new file, producing another 5000 files. The last step is to merge (sort and merge) these 5000 files.
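A sketch of the final merge step, assuming the per-bucket results were written as "word count" lines into hypothetical files top100_0.txt … top100_4999.txt; since every word's full count lives in exactly one bucket, the global top 100 is just the top 100 over all candidates:

```python
import heapq

def candidate_stream(num_buckets=5000):
    """Yield (count, word) pairs from every per-bucket result file."""
    for i in range(num_buckets):
        with open(f"top100_{i}.txt") as f:
            for line in f:
                word, count = line.rsplit(" ", 1)
                yield int(count), word

def merge_top100(k=100):
    # nlargest keeps only about k items in memory while consuming the stream.
    return heapq.nlargest(k, candidate_stream())
```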

4. There are 10 files of 1 GB each; every line of every file contains a user Query, and queries may repeat across files. You are asked to sort the queries by frequency. Solution: Option 1 (a sketch of this option appears after Option 3 below): Read the 10 files sequentially and write each query into one of another 10 files according to hash(query) % 10. Each newly generated file is also about 1 GB (assuming the hash function distributes queries evenly). For each of these files, use a hash_map(query, query_count) to count the occurrences of each query, sort by the number of occurrences with quick/heap/merge sort, and output the sorted queries and their query_count to a file. This produces 10 sorted files. Finally, merge-sort the 10 files (a combination of internal and external sorting).

Option 2: In general, the total number of distinct queries is limited, but each is repeated many times, so it may be possible to hold all distinct queries in memory at once. In that case we can use a trie tree or hash_map to count the occurrences of each query directly, and then do a quick/heap/merge sort by occurrence count.

Option 3: Similar to Option 1, except that after hashing the data into multiple files, those files can be processed on a distributed architecture (such as MapReduce) and the results merged at the end.
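A minimal Python sketch of Option 1's per-bucket count-and-sort followed by the external merge, assuming the 10 bucket files were already produced by the hash split (file names here are illustrative):

```python
import heapq
from collections import Counter

def sort_bucket(i):
    """Count queries in bucket i, then write them out sorted by descending count."""
    with open(f"bucket_{i}.txt") as f:
        counts = Counter(line.rstrip("\n") for line in f)
    with open(f"sorted_{i}.txt", "w") as out:
        for query, count in counts.most_common():
            out.write(f"{count}\t{query}\n")

def merge_sorted_buckets(num_buckets=10):
    """External merge: each sorted_*.txt is already ordered, heapq.merge combines them lazily."""
    files = [open(f"sorted_{i}.txt") for i in range(num_buckets)]
    # Negate the count so that each stream is ascending, as heapq.merge requires.
    streams = [((-int(line.split("\t", 1)[0]), line) for line in f) for f in files]
    with open("all_sorted.txt", "w") as out:
        for _, line in heapq.merge(*streams):
            out.write(line)
    for f in files:
        f.close()
```

heapq.merge keeps only one line from each file in memory at a time, which is the "internal plus external sort" combination the option describes.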

6. Find the integers that appear exactly once among 250 million integers, given that memory cannot hold all 250 million integers at once. Solution: Scheme 1: Use a 2-bitmap (2 bits allocated per number: 00 means absent, 01 means seen once, 10 means seen multiple times, 11 is unused). This requires 2^32 × 2 bits ≈ 1 GB of memory, which is acceptable. Scan the 250 million integers and update the corresponding bits in the bitmap: 00 changes to 01, 01 changes to 10, and 10 stays the same. When the scan is done, walk the bitmap and output every integer whose bits are 01. Scheme 2: Similar to question 1, the data can be divided into small files; find the non-repeating integers within each small file and sort them, then merge the results, removing duplicate elements.
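A sketch of the 2-bitmap bookkeeping in Python, using a bytearray over the full 32-bit unsigned range (about 1 GB) with the 00/01/10 states described above:

```python
def find_unique(numbers):
    """Return the 32-bit unsigned integers in `numbers` that occur exactly once."""
    # 2 bits per value -> 4 values per byte -> 2**30 bytes (~1 GB).
    bitmap = bytearray(2 ** 30)
    for n in numbers:
        byte, slot = divmod(n, 4)
        shift = slot * 2
        state = (bitmap[byte] >> shift) & 0b11
        if state == 0b00:                      # unseen -> seen once
            bitmap[byte] |= 0b01 << shift
        elif state == 0b01:                    # seen once -> seen many
            bitmap[byte] = (bitmap[byte] & ~(0b11 << shift)) | (0b10 << shift)
    # Walk the bitmap and emit every value whose state is 01.
    return [
        byte_index * 4 + slot
        for byte_index, b in enumerate(bitmap) if b
        for slot in range(4)
        if (b >> (slot * 2)) & 0b11 == 0b01
    ]
```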

7. Tencent interview question: Given 4 billion unsigned int numbers with no duplicates, and then given another number, how do you quickly determine whether that number is among the 4 billion? Solution: Scheme 1: Request 512 megabytes of memory, with one bit representing each unsigned int value. Read the 4 billion numbers and set the corresponding bits. Then read the number to be queried and check whether its bit is 1: 1 means the number exists, 0 means it does not. Scheme 2 (attributed to dizengrong): Since 2^32 is more than 4 billion, a given number may or may not be among them. Represent each of the 4 billion numbers in 32-bit binary, and suppose they start out in one file. Divide the 4 billion numbers into two classes by the highest bit: those whose highest bit is 0 and those whose highest bit is 1, and write the two classes into two files; one file then contains at most 2 billion numbers and the other at least 2 billion (this halves the search space). Compare the highest bit of the number to be searched, descend into the corresponding file, and split that file again by the next highest bit. Continuing in this way finds the answer, with time complexity O(log n).
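A sketch of Scheme 1's bitmap membership test in Python; bytearray(2**29) is the 512 MB bit array covering every unsigned int (the loading loop is shown as a hypothetical usage comment):

```python
class UintBitmap:
    """512 MB bitmap with one bit per 32-bit unsigned int value."""

    def __init__(self):
        self.bits = bytearray(2 ** 29)   # 2^32 bits / 8 bits per byte

    def add(self, n):
        self.bits[n >> 3] |= 1 << (n & 7)

    def __contains__(self, n):
        return bool(self.bits[n >> 3] & (1 << (n & 7)))

# Usage: load the 4 billion numbers once, then each query is O(1).
# bm = UintBitmap()
# for n in stream_of_numbers():   # hypothetical source of the 4 billion ints
#     bm.add(n)
# print(12345 in bm)
```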

8. How do you find the single most repeated item in a mass of data? Solution: First hash the items and map them by modulo into small files, find the most repeated item in each small file and record its count, then pick the item with the largest count among those per-file winners (refer to the previous problems).

9. Given tens of millions or hundreds of millions of data items (with duplicates), find the N items that appear most frequently. Solution: Tens of millions or hundreds of millions of items should fit in the memory of today's machines, so consider using a hash_map, binary search tree, red-black tree, etc. to count occurrences. The next step is to extract the top N items by occurrence count, which can be done with the heap mechanism described in question 2.

10. A text file of about 10,000 lines, one word per line: report the 10 most frequent words, give the idea, and give the time complexity analysis. Solution: Scheme 1: Consider time efficiency. Use a trie tree to count the occurrences of each word; the time complexity is O(n·le), where le is the average word length. Then find the 10 most frequent words, which can be done with the heap as in the previous problems, in O(n·log 10). So the total time is the larger of O(n·le) and O(n·log 10).

11. Find the largest 100 of 100w (1,000,000) numbers. Scheme 1: As mentioned in the previous problems, use a min-heap of 100 elements; the complexity is O(100w·lg 100). Scheme 2: Adopt the idea of quicksort: after each partition, consider only the part larger than the pivot; once that part contains only slightly more than 100 elements, sort it with a conventional algorithm and take the first 100. The complexity is O(100w·100). Scheme 3: Use local elimination. Take the first 100 elements, sort them, and call the result sequence L. Then scan each remaining element x and compare it with the smallest of the 100 sorted elements; if x is larger, delete the smallest element and insert x into L using the idea of insertion sort. Repeat until all elements have been scanned. The complexity is O(100w·100).
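A small sketch of Scheme 3's local elimination in Python, keeping a sorted list L of the 100 largest values seen so far and inserting with insertion-sort logic via bisect (assumes at least k input numbers):

```python
import bisect

def largest_k(numbers, k=100):
    """Local elimination: maintain a sorted list of the k largest seen so far."""
    it = iter(numbers)
    L = sorted(next(it) for _ in range(k))   # first k elements, sorted ascending
    for x in it:
        if x > L[0]:             # beats the current smallest of the top k
            del L[0]             # drop the smallest
            bisect.insort(L, x)  # insertion-sort style insert keeps L sorted
    return L                     # the k largest, in ascending order

# Toy usage:
print(largest_k(range(1_000_000), k=5))
```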
