🚀 author: “Big Data Zen”

🚀 Column introduction: This column shares big data interview questions about Hadoop, Spark, Flink, ZooKeeper, Flume, Kafka, Hive, HBase, and other big data technologies.


1. How does the HA NameNode work?

The main responsibilities of the ZKFailoverController (ZKFC) are:

1) Health monitoring: it periodically sends health-check commands to the NameNode it monitors to determine whether that NameNode is healthy. If the machine is down and the heartbeat fails, ZKFC marks it as unhealthy.

2) Session management: if the NameNode is healthy, ZKFC keeps a session open in ZooKeeper. If that NameNode is also the Active one, ZKFC additionally holds an ephemeral (transient) ZNode in ZooKeeper. When the NameNode dies, the session expires and the ZNode is deleted; the standby NameNode then acquires the lock, is promoted to primary, and marks its state as Active.

3) When the failed NameNode is restarted, it registers with ZooKeeper again, finds that the lock ZNode is already held, and automatically switches to the Standby state. This cycle repeats, ensuring high availability. Note that at most two NameNodes can currently be configured.

4) Master election: as described above, the ephemeral ZNode maintained in ZooKeeper implements a preemptive lock mechanism that decides which NameNode becomes Active.
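The preemptive lock in 4) can be illustrated with the plain ZooKeeper Java client. This is only a minimal sketch of the idea, not ZKFC's real implementation; the lock path, the connect string, and the "nn1" payload are assumptions.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ActiveElectionSketch {
    // Hypothetical lock path; ZKFC keeps its own znode layout under /hadoop-ha/<nameservice>.
    private static final String LOCK_PATH = "/hadoop-ha/mycluster/ActiveStandbyElectorLock";

    public static void main(String[] args) throws Exception {
        // Open a session to the ZooKeeper ensemble (connect string is an assumption).
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> { });
        try {
            // EPHEMERAL: the znode disappears automatically when this session dies,
            // which is what allows the standby NameNode to grab the lock after a crash.
            zk.create(LOCK_PATH, "nn1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("Lock acquired -> become Active");
        } catch (KeeperException.NodeExistsException e) {
            // Another NameNode already holds the lock -> stay Standby and watch the znode.
            zk.exists(LOCK_PATH, true);
            System.out.println("Lock already held -> stay Standby");
        }
    }
}
```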

2. Hadoop serialization and deserialization, and custom bean object serialization

1) Serialization and deserialization:

(1) Serialization is the conversion of an in-memory object into a sequence of bytes (or another data transfer protocol) for storage (persistence) and network transmission.

(2) Deserialization converts a received byte sequence (or another data transfer protocol), or data persisted on disk, back into an in-memory object.

(3) Java serialization is a heavyweight framework (Serializable): a serialized object carries a lot of additional information (checksums, headers, the inheritance hierarchy, and so on), which makes it inefficient to transfer over the network. Hadoop therefore has its own serialization mechanism (Writable), which is compact and efficient.

2) Steps and notes for serializing a custom bean object:

(1) The Writable interface must be implemented.
(2) An empty-argument constructor is required for deserialization.
(3) Override the serialization method write().
(4) Override the deserialization method readFields().
(5) Note that the fields must be read during deserialization in exactly the same order they were written during serialization.
(6) To display the result in a file, override toString() and separate fields with "\t" for later use.
(7) If the custom bean is transferred as a key, it must also implement the Comparable interface, because the shuffle process in the MapReduce framework must sort keys.
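A minimal sketch of such a bean, following the steps above; FlowBean and its two fields are placeholder names chosen for illustration (the same hypothetical bean is referred to again in question 7):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// (1) Implement the Writable interface.
public class FlowBean implements Writable {
    private long upFlow;
    private long downFlow;

    // (2) Empty constructor, needed by the framework when deserializing.
    public FlowBean() { }

    public FlowBean(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
    }

    // (3) Serialization: write the fields out.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
    }

    // (4)/(5) Deserialization: read the fields back in exactly the same order they were written.
    @Override
    public void readFields(DataInput in) throws IOException {
        upFlow = in.readLong();
        downFlow = in.readLong();
    }

    // (6) Tab-separated output so the result file is easy to process later.
    @Override
    public String toString() {
        return upFlow + "\t" + downFlow;
    }

    public long getUpFlow()   { return upFlow; }
    public long getDownFlow() { return downFlow; }
    public long getSumFlow()  { return upFlow + downFlow; }
}
```

If the bean is used as a key, step (7) applies: implement WritableComparable instead and add a compareTo() method (see question 7).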

3. What is InputSplit in a running Hadoop task?

FileInputFormat slicing process (source-code level):

(1) The program first locates the directory where the input data is stored.

(2) It then traverses each file in the directory to plan the slices.

(3) For the first file, ss.txt (300 MB in this example):
a) Get the file size: fs.sizeOf(ss.txt).
b) Compute the slice size: computeSplitSize(Math.max(minSize, Math.min(maxSize, blockSize))) = blockSize = 128 MB.
c) By default, slice size = block size.
d) Start cutting slices: the first slice is ss.txt 0–128 MB, the second 128–256 MB, the third 256–300 MB (each time a slice is cut, check whether the remaining part is larger than 1.1 times the slice size; if it is not, the remainder stays in a single slice).
e) Write the slice information into a slice-planning file.
f) The core of the slicing process happens in the getSplits() method.
g) Data slicing only divides the input data logically; it does not split the file into slices on disk. An InputSplit records only metadata about the split, such as the start position, length, and list of nodes.
h) Note: a block is the physical unit in which data is stored in HDFS, while a slice (split) is a logical partition of the data.

(4) The slice-planning file is submitted to YARN, and MrAppMaster on YARN determines the number of MapTasks to launch based on it.
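The slice-size rule and the 1.1x factor can be condensed into a few lines. This is a simplified sketch of the logic in FileInputFormat.getSplits(), not the real source code; the 300 MB file matches the ss.txt example above.

```java
// Simplified sketch of FileInputFormat's slicing rule (sizes in bytes).
public class SplitSizeSketch {
    static final double SPLIT_SLOP = 1.1;   // a new slice is cut only while remaining > 1.1 x splitSize

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        // Default: minSize = 1, maxSize = Long.MAX_VALUE  =>  splitSize = blockSize (128 MB)
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;                       // 128 MB HDFS block
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        long fileSize  = 300L * 1024 * 1024;                       // the 300 MB ss.txt example

        long remaining = fileSize;
        int n = 0;
        while ((double) remaining / splitSize > SPLIT_SLOP) {      // cut full-size slices
            long start = fileSize - remaining;
            System.out.printf("slice %d: %dM - %dM%n", ++n, start >> 20, (start + splitSize) >> 20);
            remaining -= splitSize;
        }
        if (remaining > 0) {                                       // the tail stays in one final slice
            System.out.printf("slice %d: %dM - %dM%n", ++n, (fileSize - remaining) >> 20, fileSize >> 20);
        }
    }
}
```

Running this prints the three slices from the example: 0–128M, 128–256M, 256–300M.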

4. How to determine the number of Map and Reduce tasks for a job?

1) Number of maps: splitSize = max(minSize, min(maxSize, blockSize)). The number of maps is determined by how many splits the input data is divided into: default_num = total_size / split_size. 2) Number of reduces: set with job.setNumReduceTasks(x), where x is the desired number of Reduce tasks; if this is not set, the default is 1.
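A minimal driver fragment showing where these two knobs live; the input path and the chosen values are assumptions, and the mapper/reducer setup is omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitAndReduceConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-and-reduce-demo");

        // Number of maps is driven by splitSize = max(minSize, min(maxSize, blockSize)):
        // raising minSize above the block size gives fewer, larger splits (fewer MapTasks);
        // lowering maxSize below the block size gives more, smaller splits (more MapTasks).
        FileInputFormat.setMinInputSplitSize(job, 1L);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        FileInputFormat.addInputPath(job, new Path("/demo/input"));   // hypothetical path

        // Number of reduces is set explicitly; if omitted it defaults to 1.
        job.setNumReduceTasks(3);
    }
}
```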

5. What determines the number of MapTasks?

The parallelism of MapTasks in the map phase of a job is determined by the number of splits computed by the client when the job is submitted.

6. Working mechanism of MapTask and ReduceTask

Working mechanism of MapTask

(1) Read stage: the MapTask parses key/value pairs one by one from the InputSplit through the RecordReader.

(2) Map stage: each parsed key/value pair is handed to the user-written map() function, which produces a series of new key/value pairs.

(3) Collect stage: when processing of a record inside the user-written map() function is finished, OutputCollector.collect() is usually called to output the result. Inside that call, the generated key/value pair is partitioned (by calling the Partitioner) and written into a ring memory buffer.

(4) Spill stage: when the ring buffer fills up, MapReduce writes the data to the local disk, generating a temporary file. Before writing to disk, the data is sorted locally and, if necessary, merged or compressed.

(5) Combine stage: when all data has been processed, the MapTask merges all temporary files once, ensuring that only one data file is ultimately produced.

Working mechanism of ReduceTask

(1) Copy stage: the ReduceTask remotely copies a piece of data from each MapTask; if a piece exceeds a certain size threshold it is written to disk, otherwise it is kept in memory.

(2) Merge stage: while copying data remotely, the ReduceTask starts two background threads that merge files in memory and on disk, to prevent excessive memory usage or too many files on disk.

(3) Sort stage: according to MapReduce semantics, the input to the user-written reduce() function is a group of data aggregated by key. To cluster data with the same key, Hadoop uses a sort-based strategy. Since each MapTask has already sorted its own output locally, the ReduceTask only needs to perform one merge sort over all the data.

(4) Reduce stage: the reduce() function writes the computed results to HDFS.
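The map() and reduce() hooks mentioned above can be made concrete with the classic word-count pair; a minimal sketch of the standard pattern, with class names chosen for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: the RecordReader hands each line to map(), which emits (word, 1) pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text word = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            word.set(token);
            context.write(word, ONE);   // collected into the ring buffer, partitioned and sorted
        }
    }
}

// Reduce stage: after copy/merge/sort, reduce() receives one key together with all of its values.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // the final result ends up in HDFS
    }
}
```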

7. Describe the sorts in MapReduce and the stages in which sorting occurs

1) Types of sorting:

(1) Partial sort: MapReduce sorts the data set by the keys of the input records, guaranteeing that each output file is internally sorted.

(2) Total sort: how do you produce a single globally sorted file with Hadoop? The simplest way is to use only one partition, but that is extremely inefficient for large files, because a single machine has to process all of the output, completely losing the parallelism that MapReduce provides. Alternative: first create a series of sorted files; then concatenate them; finally produce a globally sorted file. The main idea is to use a partitioner that describes the global ordering of the output. For example, you can create three partitions for the file being analyzed: words whose first letter is A–G go to the first partition, H–N to the second, and O–Z to the third.

(3) Auxiliary sort (GroupingComparator grouping): the MapReduce framework sorts records by key before they reach the reducer, but the values associated with a key are not sorted. Their order is not even fixed across runs, because they come from different map tasks that finish at different times in different runs. In general, most MapReduce programs avoid making the reduce function depend on the order of the values. However, sometimes you need to sort and group keys in a particular way in order to sort the values.

(4) Secondary sort: if compareTo() in a custom sort contains two comparison conditions, it is a secondary sort. The WritableComparable bean overrides the compareTo() method, e.g. @Override public int compareTo(FlowBean o) { return this.getSumFlow() > o.getSumFlow() ? -1 : 1; } (a complete two-condition example is sketched after this list).

2) Stages where sorting occurs:

(1) On the map side, sorting happens during the spill, after partitioning (and again when the spill files are merged).
(2) On the reduce side, sorting happens after the copy phase and before reduce().
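Since a secondary sort just means two comparison conditions in compareTo(), a self-contained sketch of such a key might look like the following; the field names and sort directions are assumptions for illustration.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key with two comparison conditions in compareTo() = secondary sort.
public class FlowSortKey implements WritableComparable<FlowSortKey> {
    private long sumFlow;
    private long upFlow;

    public FlowSortKey() { }

    public FlowSortKey(long sumFlow, long upFlow) {
        this.sumFlow = sumFlow;
        this.upFlow = upFlow;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(sumFlow);
        out.writeLong(upFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        sumFlow = in.readLong();
        upFlow = in.readLong();
    }

    // First condition: sumFlow descending; second condition: upFlow ascending when sumFlow ties.
    @Override
    public int compareTo(FlowSortKey o) {
        if (this.sumFlow != o.sumFlow) {
            return this.sumFlow > o.sumFlow ? -1 : 1;
        }
        return Long.compare(this.upFlow, o.upFlow);
    }

    @Override
    public String toString() {
        return sumFlow + "\t" + upFlow;
    }
}
```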

8. Describe the shuffle phase in MapReduce and how to optimize it

Partitioning, sorting, spilling to disk, and copying the data to the corresponding Reduce machines; to optimize the shuffle, add a Combiner and compress the spill files.

9. Describe the functions of the Combiner in MapReduce, common usage scenarios, situations where a Combiner is not needed, and the differences between Combiner and Reducer.

1) The significance of the Combiner is to locally aggregate the output of each MapTask in order to reduce network transmission. 2) The prerequisite for applying a Combiner is that it must not affect the final business logic; in addition, the output KV type of the Combiner must match the input KV type of the Reducer. 3) The difference between the Combiner and the Reducer lies in where they run: the Combiner runs on the node of each MapTask, while the Reducer receives the global output of all Mappers.
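In the word-count case, summing is associative, so the reducer class can double as the combiner. A minimal driver fragment assuming the hypothetical WordCountMapper/WordCountReducer sketched under question 6; the input/output paths are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
        job.setJarByClass(WordCountWithCombiner.class);

        job.setMapperClass(WordCountMapper.class);
        // The combiner pre-aggregates (word, 1) pairs on the map side, shrinking shuffle traffic;
        // because summing does not change the final business logic, the reducer can be reused here.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/demo/input"));     // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```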

10. If no Partitioner is defined, how is the data partitioned before it is sent to the Reducer?

If no custom Partitioner is defined, the default partitioning algorithm is used: the hashCode of each record's key modulo (%) the number of Reduce tasks gives the partition number.
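That default behaviour corresponds to Hadoop's HashPartitioner. A custom partitioner that reproduces it might look like the sketch below; the Text/IntWritable key and value types are assumptions.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Reproduces the default hash partitioning: (hashCode & Integer.MAX_VALUE) % numReduceTasks.
// The bitwise AND keeps the result non-negative even for negative hash codes.
public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

A custom partitioner is registered in the driver with job.setPartitionerClass(HashLikePartitioner.class).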

11. What happens when a single point in MapReduce carries too much load, and how is the load balanced?

Load balancing is implemented through the Partitioner.

12. How does MapReduce achieve TopN?

Customize the sorting (for example with a GroupingComparator or a custom comparable key) so that the largest results come first, and have the reducer output only the first N records; this achieves the TopN output.
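One way to do the final cut is sketched below: assuming the shuffle already delivers keys in descending order (through a custom comparable key or comparator), the reducer simply stops emitting after N groups. TOP_N and the key/value types are assumptions; with a single reducer this yields the global top N.

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Emits only the first N key groups it sees; relies on the keys arriving largest-first.
public class TopNReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    private static final int TOP_N = 10;   // hypothetical cut-off
    private int emitted = 0;

    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        if (emitted < TOP_N) {
            context.write(key, NullWritable.get());
            emitted++;
        }
    }
}
```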

13. The role of DistributedCache in Hadoop

One of the most important applications of the distributed cache is the join operation: if one table is large and the other is small, we can broadcast the small table, that is, keep a copy on every compute node, and then perform the join on the map side. In my experiments this is far more efficient than an ordinary reduce-side join, and the broadcasting is implemented with the distributed cache.

DistributedCache copies the cached files to the slave nodes before any task of the job runs there; the files are copied only once per job, and cached archive files are unpacked on the slave nodes. The local file is first copied to HDFS, and the client tells DistributedCache its location in HDFS via the addCacheFile() and addCacheArchive() methods. When the files are distributed, symbolic links can be created based on the URI and fragment identifier of each file. When the user needs the list of all valid files in the cache, the getLocalCacheFiles() and getLocalCacheArchives() methods return arrays of objects pointing to the local file paths.
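A minimal sketch of registering cached files with the Hadoop 2 Job API; the HDFS paths and the #smalltable / #lookup aliases are assumptions (how a map task reads the cached file is sketched under question 14).

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileRegistration {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed-cache-demo");

        // The file is copied from HDFS to every slave node once per job;
        // "#smalltable" creates a symbolic link with that name in the task's working directory.
        job.addCacheFile(new URI("hdfs:///demo/dim/small_table.txt#smalltable"));

        // Cached archives (zip / tar.gz) are additionally unpacked on the slave nodes.
        job.addCacheArchive(new URI("hdfs:///demo/dim/lookup_data.tar.gz#lookup"));
    }
}
```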

14. How to join two tables using MapReduce?

1) Reduce-side join: in the map phase, the map function reads both file1 and file2. To distinguish the key/value pairs coming from the two sources, a tag is attached to each pair, e.g. tag=0 for data from file1 and tag=2 for data from file2. 2) Map-side join: this is an optimization for the case where one of the two tables is very large and the other is small enough to fit in memory. We can distribute copies of the small table so that each MapTask keeps one in memory (for example in a hash table), and then scan only the large table: for each key/value record of the large table, look up records with the same key in the hash table and, if there is a match, output the joined record. A sketch of this pattern is shown below.
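A sketch of the map-side join described in 2): the small table is loaded into a HashMap in setup() from the cached file registered in question 13, and map() only scans the big table. The file layout (tab-separated, id in the first column) and class names are assumptions.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    // Small table: id -> name, held entirely in memory by every MapTask.
    private final Map<String, String> smallTable = new HashMap<>();
    private final Text outKey = new Text();

    @Override
    protected void setup(Context context) throws IOException {
        // "smalltable" is the symbolic link created by the #smalltable fragment in addCacheFile().
        try (BufferedReader reader = new BufferedReader(new FileReader("smalltable"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");       // assumed layout: id \t name
                smallTable.put(fields[0], fields[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");   // assumed layout: id \t amount
        String name = smallTable.get(fields[0]);          // hash lookup replaces the reduce-side join
        if (name != null) {
            outKey.set(fields[0] + "\t" + name + "\t" + fields[1]);
            context.write(outKey, NullWritable.get());
        }
    }
}
```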

15. What kinds of computation cannot be sped up by MapReduce?

1) Small amounts of data. 2) Many scattered small files. 3) Cases where an index is the better access mechanism. 4) Transaction processing. 5) When there is only one machine.

16. What three words does ETL stand for?

ETL is short for Extraction-Transformation-Loading, that is, data extraction, transformation, and loading.

17. Briefly describe the similarities and differences between the Hadoop 1.x and Hadoop 2.x architectures

1) Hadoop 2 adds YARN to solve the resource scheduling problem. 2) It adds ZooKeeper-based support for reliable high availability.

18. Why is YARN created, what problems does it solve, and what are its advantages?

1) The main feature of YARN is that it completely decouples the user's running programs from the YARN framework. 2) All kinds of distributed computing programs (MapReduce is just one of them) can run on YARN, such as MapReduce, Storm, and Spark.

19. HDFS data compression algorithm?

Common compression algorithms in Hadoop include bzip2, gzip, LZO, and Snappy. LZO and Snappy are supported only when the native libraries are installed in the operating system.

Snappy is the most widely used in enterprise development.
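A hedged configuration sketch enabling Snappy for the intermediate (map output) data and gzip for the final job output; the property names are the standard Hadoop 2 ones, and Snappy requires the native library mentioned above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Compress intermediate map output with Snappy to cut shuffle traffic.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compression-demo");

        // Compress the final job output with gzip.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}
```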

20. Hadoop scheduler summary

(1) FIFO Scheduler: the default scheduler in Hadoop. It selects jobs to execute first by job priority and then by arrival time.

(2) Capacity Scheduler: supports multiple queues, each of which can be configured with a certain amount of resources and uses a FIFO policy internally. During scheduling, an appropriate queue is chosen as follows: compute, for each queue, the ratio of running tasks to allocated computing resources and pick the queue with the lowest ratio. A job is then chosen within that queue by priority and submission time, while taking user resource limits and memory limits into account.

(3) Fair Scheduler: similar to the Capacity Scheduler, it supports multiple queues and multiple users; the amount of resources in each queue can be configured, and jobs in the same queue share the queue's resources fairly.

In fact there are more than three Hadoop schedulers; recently, many schedulers aimed at new types of applications have appeared.

21. MapReduce 2.0 fault tolerance

1) MRAppMaster fault tolerance: if the MRAppMaster fails, YARN's ResourceManager restarts it. The maximum number of restarts can be set by the user; the default is 2. Once the maximum is exceeded, the job fails. 2) Map Task / Reduce Task fault tolerance: each task periodically reports heartbeats to the MRAppMaster. Once a task dies, the MRAppMaster requests resources for it again and reruns it. The maximum number of reruns can be set by the user; the default is 4.

Conclusion

The Hadoop interview questions are split into two parts because there is a lot of material; feel free to jump to the part you need. Follow me for more big data content and the installation packages mentioned in this article.