The last article pulled together the hot Hive interview questions, and the likes and shares it received made the time spent on that kind of knowledge consolidation feel well worth it. In this article, let's look at how to prepare Hadoop so you can get the upper hand in the interview.


1. What is Hadoop?

This question looks humble, but it is actually a classic trap. People often prepare everything else about big data so thoroughly that they get caught off guard when asked what Hadoop is, stumble through the answer, and leave a bad impression on the interviewer. When you answer, start from the thing itself and then broaden to the general picture; interviewers often use this question to judge whether your basic understanding is solid.

In a narrow sense, Hadoop is a software framework for the distributed processing of large amounts of data, handling it in a reliable, efficient, and scalable way. It consists of three parts: HDFS, MapReduce, and YARN.

In a broad sense, Hadoop refers to an ecosystem: the open-source components and products built around big data technology, such as HBase, Hive, Spark, Zookeeper, Kafka, and Flume.

2. Can you tell me the difference between Hadoop and Spark?

Don't be surprised if you are asked this. Interviewers can often tell whether you have really studied a technology by the way you compare it with another.

  • Type: Hadoop is a basic platform that covers computing, storage, and scheduling; Spark is a distributed computing tool
  • Scenario: Hadoop suits batch processing over large data sets; Spark suits iterative, interactive, and streaming computation
  • Cost: Hadoop has low hardware requirements and runs on cheap machines; Spark needs plenty of memory, so the hardware is relatively expensive
  • Programming paradigm: MapReduce exposes a low-level API and adapts poorly to many algorithms; Spark builds RDDs into a DAG (directed acyclic graph) and exposes a higher-level, easier-to-use API
  • Intermediate data: MapReduce stores intermediate results on HDFS disks, giving high latency; Spark keeps intermediate RDD results in memory, giving low latency
  • Execution model: MapReduce tasks run as processes and start slowly; Spark tasks run as threads and start quickly

3. What are the common Hadoop distributions? What are their features, and how do you choose among them?

This one comes entirely down to hands-on experience; if you have not paid attention to it in your day-to-day work, it is hard to answer well.

Because Hadoop has grown quickly and its features keep being updated and improved, there are many Hadoop versions on the market, and they can look cluttered. The mainstream choices today are the following:

  • Apache Community Edition

The Apache community edition is fully open source, free, and non-commercial. The Apache community maintains many Hadoop version branches, and some of them carry bugs, so you have to consider the compatibility of Hadoop, HBase, and Hive yourself when choosing versions. Deploying this edition also demands more from Hadoop developers and operations staff.

  • Cloudera version

Cloudera's distribution (CDH) is open source and comes in both free and commercial editions. It is developed and maintained on top of the Apache community version of Hadoop. Because a large amount of compatibility testing against other frameworks is done during its development, users do not need to worry about version compatibility between Hadoop, HBase, Hive, and the rest, which saves a lot of the time that would otherwise go into debugging compatibility.

  • Hortonworks version

Hortonworks Hadoop is open source and free, and is available in both commercial and non-commercial editions. It is modified from Apache Hadoop with redeveloped components and features; the commercial edition is the most powerful and complete.

Based on these characteristics, we generally use CDH when first getting to know big data, while Apache or Hortonworks is what you are most likely to meet at work.

4. Can you briefly introduce the differences between Hadoop 1.0, 2.0, and 3.0?

Interviewers who ask this kind of question usually have solid fundamentals themselves, and they use these "detail" questions to see which candidates stand out.

Hadoop 1.0 consists of the distributed storage system HDFS and the distributed computing framework MapReduce. HDFS is made up of one NameNode and multiple DataNodes, and MapReduce is made up of one JobTracker and multiple TaskTrackers. Hadoop 1.0 suffered from single points of failure, poor scalability, low performance, and support for only a single programming model.

To overcome the shortcomings of Hadoop 1.0, Hadoop 2.0 introduced the following key features:

  • YARN: a new general-purpose resource management system introduced in Hadoop 2.0 that completely replaces the JobTracker of Hadoop 1.0. The resource-management and job-tracking functions of the MRv1 JobTracker are split into the ResourceManager and the ApplicationMaster. YARN also lets multiple applications and frameworks share unified resource scheduling and management
  • NameNode single point of failure solved: Hadoop 2.2.0 addresses both the NameNode single point of failure and its memory limitation, and offers NFS, QJM, and Zookeeper-based shared storage options
  • HDFS snapshots: a read-only image of HDFS (or a subtree of it) at a point in time, important for protecting against data deletion or loss. For example, you can periodically create snapshots of important files or directories and use them to restore data if it is deleted or lost
  • Support for Windows: a major improvement in Hadoop 2.2.0 is the introduction of support for running on Windows
  • Append: the new version of Hadoop introduced append operations on files

At the same time, the newer releases made two important enhancements to HDFS: support for heterogeneous storage tiers, and memory caching of HDFS data via the DataNodes.

Compared with Hadoop 2.0, Hadoop 3.0 is a new release built directly on JDK 1.8, and it introduces several important features:

  • HDFS erasure coding: lets HDFS save a significant amount of storage space without compromising reliability
  • Multiple NameNode support: Hadoop 3.0 adds support for multiple NameNodes, although only one NameNode instance can be in the Active state at any time. In other words, starting with Hadoop 3.0 you can deploy one Active NameNode and several Standby NameNodes in the same cluster.
  • MR native task optimization
  • YARN memory and disk I/O isolation based on cgroups
  • YARN container resizing

Space is limited, so these are only some of the features, but they are enough to handle the interview.

5. Common Hadoop port numbers

There are only a few commonly used Hadoop ports, so just memorize the ones that come up most often:

dfs.namenode.http-address: 50070
dfs.datanode.http-address: 50075
SecondaryNameNode: 50090
dfs.datanode.address: 50010
fs.defaultFS: 8020 or 9000
yarn.resourcemanager.webapp.address: 8088
History server web UI: 19888

6. Briefly introduce the process of building a Hadoop cluster

This question really is basic; here is a brief overview.

Before the formal setup, six preparation steps are needed:

Preparation:

  1. Disable the firewall
  2. Disable SELinux
  3. Change the host name
  4. Configure password-free SSH
  5. Map host names to IP addresses
  6. Install JDK 1.8

Setup (a small connectivity check is sketched after the list):

  • Download and unpack the Hadoop package
  • Configure the Hadoop core files (core-site.xml, hdfs-site.xml, and so on)
  • Format the NameNode
  • Start the cluster
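
As a quick sanity check after startup, a small client program can connect to HDFS and list the root directory. This is only a minimal sketch; the fs.defaultFS address (hdfs://node01:8020) is a placeholder for whatever host and port your own cluster uses.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClusterSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address: use the fs.defaultFS you configured in core-site.xml
        conf.set("fs.defaultFS", "hdfs://node01:8020");

        // If this call succeeds, the NameNode is up and reachable
        try (FileSystem fs = FileSystem.get(conf)) {
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }
}
```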

7. Introduce the HDFS read and write process

This question is very basic and comes up with unusually high frequency, but don't be intimidated by the HDFS read/write flow, and don't try to recite it word for word. The two diagrams below are what you should have in your head; learn to picture the flow rather than memorize sentences. A small client-side sketch follows the two diagrams.

  • HDFS data reading process

  • HDFS data writing process
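
To connect the diagrams to something concrete, here is a minimal sketch of the client side of both flows using the HDFS FileSystem API. The path name is made up for illustration; under the hood, create() drives the write pipeline through the NameNode and DataNodes, and open() drives the block-by-block read.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/demo.txt");   // hypothetical path

        // Write: the client asks the NameNode for target DataNodes,
        // then streams packets through the DataNode pipeline
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client asks the NameNode for block locations,
        // then pulls each block from the nearest DataNode
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```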

8. Introduce the Shuffle process of MapReduce and provide Hadoop optimization solutions (including compression, small files, and cluster optimization).

The full process of a MapReduce job reading data from HDFS and writing results back actually consists of 10 steps.

The most important, and most difficult, of these is the shuffle phase. When the interviewer asks you to focus on shuffle, the picture above on its own is not enough.

You can say:

  1. The process after Map and before Reduce is called Shuffle

  2. After the Map method, data first passes through the partitioner, which tags each record with its partition, and then enters the ring buffer. The ring buffer defaults to 100 MB; when it reaches 80% full, a spill to disk is triggered. Before spilling, the data is sorted by key index in lexicographic order. Many spills produce many spill files, which then have to be merge-sorted together. A Combiner can also be applied to the spill files, provided the operation is an aggregation such as a sum (an average cannot be combined this way). Finally the files are written to disk by partition, waiting to be pulled by the Reduce side.

  3. Each Reduce task pulls the data of its partition from the Map side. The pulled data is first held in memory and spilled to disk when memory is insufficient. Once all the data has been pulled, a merge sort combines the data in memory and on disk. Before entering the Reduce method, the data can also be grouped. (The job-setup sketch below shows where these shuffle hooks are configured.)
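
The pieces of the shuffle described above map onto a handful of hooks on the Job object. This is a hedged sketch assuming a word-count-style job with Text keys and IntWritable counts; HashPartitioner and IntSumReducer are standard library classes used here only as examples.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class ShuffleHooksDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "shuffle-hooks-demo");
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Step 2: every map output record is assigned a partition before
        // it enters the ring buffer (HashPartitioner is the default)
        job.setPartitionerClass(HashPartitioner.class);

        // Step 2: pre-aggregate each spill file on the map side; only safe
        // for operations like sums and counts, not averages
        job.setCombinerClass(IntSumReducer.class);

        // Step 3: number of reduce tasks, i.e. the number of partitions to pull
        job.setNumReduceTasks(4);
    }
}
```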

At this point your mouth may be dry and you may want to take a break, but the interviewer might follow up with a question that leaves a stronger impression: how would you optimize Hadoop, starting from MapReduce?

You may have ten thousand horses galloping through your mind, but you still have to think about how best to answer:

1) Impact of HDFS small files

  • They affect the lifetime of the NameNode, because every file's metadata is stored in NameNode memory
  • They affect the number of tasks the computing engine launches, for example one Map task is generated for each small file

2) Handling small files on the data input side (see the sketch after this list)

  • Merge small files: archive them with Hadoop Archives (HAR), or use a custom InputFormat to pack them into SequenceFile files
  • Use CombineFileInputFormat (e.g. CombineTextInputFormat) as the input format to handle a large number of small files on the input side
  • For jobs with many small files, JVM reuse can be enabled
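
A hedged sketch of the second and third points, assuming a plain text-input job; the split size and reuse count are just example values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFileJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // JVM reuse: let one JVM run several tasks instead of one JVM per task
        conf.setInt("mapreduce.job.jvm.numtasks", 10);

        Job job = Job.getInstance(conf, "small-file-demo");

        // Pack many small files into fewer splits, so fewer Map tasks are launched
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L); // 128 MB per split
    }
}
```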

3) Map phase (a configuration sketch follows this list)

  • Increase the ring buffer size, for example from 100 MB to 200 MB
  • Raise the spill threshold of the ring buffer, for example from 80% to 90%
  • Reduce the number of merge passes over spill files, for example by merging 20 files at a time instead of 10
  • Where the business logic allows it, use a Combiner to pre-merge data and cut down I/O
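
These knobs correspond to standard MapReduce properties; the values below are just the examples from the list, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;

public class MapPhaseTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 200);            // ring buffer: 100 MB -> 200 MB
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f); // spill threshold: 80% -> 90%
        conf.setInt("mapreduce.task.io.sort.factor", 20);         // merge 20 spill files at a time
        return conf;
    }
}
```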

4) Reduce phase (a configuration sketch follows this list)

  • Set the numbers of Map and Reduce tasks sensibly, neither too few nor too many: too few and tasks wait on each other, stretching out the processing time; too many and tasks compete for resources, causing errors such as processing timeouts
  • Let Map and Reduce overlap: tune mapreduce.job.reduce.slowstart.completedmaps so that once Maps have run to a certain point, Reduce starts running too, cutting the Reduce waiting time
  • Avoid using Reduce where you can, since joining data sets through Reduce causes a lot of network traffic
  • Increase the number of parallel fetchers each Reduce uses to pull data from the Map side
  • If the cluster can afford it, increase the memory the Reduce side uses to buffer incoming data
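
A hedged sketch of the reduce-side knobs mentioned above; the values are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;

public class ReducePhaseTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // Start reducers once a fraction of the maps have finished, instead of waiting for all of them
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.8f);
        // More parallel copiers pulling map output per reduce task (default is 5)
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        // Keep part of the fetched map output in reducer memory instead of spilling it all to disk
        conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.5f);
        return conf;
    }
}
```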

5) I/O transfer

  • Use data compression to reduce network I/O time
  • Use the binary SequenceFile format

6) Overall (a configuration sketch follows this list)

  • The default MapTask memory size is 1 GB; you can increase it to 4 GB
  • The default ReduceTask memory size is 1 GB; you can increase it to 4-5 GB
  • You can increase the number of CPU cores for MapTask and ReduceTask
  • Increase the number of CPU cores and the memory size of each Container
  • Adjust the maximum number of retries for each Map Task or Reduce Task
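
The items above map onto these MapReduce and YARN properties; a hedged sketch using the example values from the list.

```java
import org.apache.hadoop.conf.Configuration;

public class OverallTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.memory.mb", 4096);      // MapTask memory: 1 GB -> 4 GB
        conf.setInt("mapreduce.reduce.memory.mb", 4096);   // ReduceTask memory: 1 GB -> 4 GB
        conf.setInt("mapreduce.map.cpu.vcores", 2);        // more cores per MapTask
        conf.setInt("mapreduce.reduce.cpu.vcores", 2);     // more cores per ReduceTask
        conf.setInt("mapreduce.map.maxattempts", 6);       // retries per Map task (default 4)
        conf.setInt("mapreduce.reduce.maxattempts", 6);    // retries per Reduce task (default 4)
        // Container ceiling on the YARN side (set in yarn-site.xml in practice)
        conf.setInt("yarn.scheduler.maximum-allocation-mb", 8192);
        return conf;
    }
}
```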

7) Compression

For compression, the comparison of the common codecs (see the figure) is worth remembering.

Tip: if asked in the interview, the usual answer is that the compression codec is Snappy, which is very fast but not splittable (you can add that in chained MR jobs, the Reduce-side output can be compressed with Bzip2, so that the map tasks of the next job can split the data). A configuration sketch follows.
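
A hedged sketch of that answer in code: Snappy for the intermediate map output, Bzip2 for the job output so a downstream job's maps can split it.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionSetup {
    public static Job compressedJob() throws Exception {
        Configuration conf = new Configuration();
        // Intermediate (map output) compression: fast, splittability doesn't matter here
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compression-demo");
        // Final output compression: Bzip2 is splittable, so the next job's maps can split it
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
        return job;
    }
}
```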

9. Introduce the Job submission process of Yarn

There are two versions of this answer, a detailed one and a brief one; which to use depends on the occasion. In most cases the brief version is perfectly fine, and the detailed version at most adds some supplementary content:

  • A detailed version

  • A brief version

The steps of the brief version are as follows (a client-side sketch follows the list):

  1. The client submits the application to the ResourceManager, including everything needed to start the ApplicationMaster: the ApplicationMaster program itself, the command to launch it, and the user program
  2. The ResourceManager starts a Container to run the ApplicationMaster
  3. While starting up, the ApplicationMaster registers itself with the ResourceManager, and after startup it keeps a heartbeat with the RM
  4. The ApplicationMaster sends requests to the ResourceManager to apply for a number of Containers
  5. The ApplicationMaster initializes the Containers it was granted; once a Container's launch information is ready, the AM communicates with the corresponding NodeManager and asks it to start the Container
  6. The NodeManager starts the Container
  7. While a Container is running, the ApplicationMaster monitors it; the Container reports its progress and status to the AM over RPC
  8. When the application finishes, the ApplicationMaster deregisters itself from the ResourceManager and allows its Containers to be reclaimed
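
From the client's side, step 1 corresponds roughly to the YarnClient API. This is only a hedged skeleton: the launch command and resource sizes are placeholders, and a real submission also has to ship local resources, environment variables, and security tokens.

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // Ask the ResourceManager for a new application
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");

        // Describe how to launch the ApplicationMaster (command is a placeholder)
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("java -jar my-app-master.jar"));
        ctx.setAMContainerSpec(amContainer);
        ctx.setResource(Resource.newInstance(1024, 1)); // 1 GB, 1 vcore for the AM container

        // Step 1 of the list above: submit the application to the ResourceManager
        yarnClient.submitApplication(ctx);
    }
}
```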

10. Introduce the default Yarn scheduler, the scheduler categories, and the differences between them

In fact, Yarn does not take up much of the interview; what usually gets asked is the job execution flow or the scheduler categories, and the answers tend to be similar. The following answer is for reference:

1) Hadoop schedulers fall into three categories (a configuration sketch follows the list):

  • FIFO Scheduler: schedules jobs on a first-in, first-out basis
  • Capacity Scheduler: allows multiple task queues to run at the same time, but inside a single queue it is still first in, first out. This is the default scheduler in Hadoop 2.7.2
  • Fair Scheduler: the first program to start can use the resources of other queues (up to 100%); when another queue has a task submitted, the queue that borrowed resources has to give them back, and returning them takes time. This is the default Yarn scheduler in the CDH distribution
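
The scheduler is selected through the yarn.resourcemanager.scheduler.class property (normally in yarn-site.xml); here is a hedged sketch of setting it programmatically for the Capacity and Fair schedulers.

```java
import org.apache.hadoop.conf.Configuration;

public class SchedulerChoice {
    public static Configuration capacity() {
        Configuration conf = new Configuration();
        conf.set("yarn.resourcemanager.scheduler.class",
                "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");
        return conf;
    }

    public static Configuration fair() {
        Configuration conf = new Configuration();
        conf.set("yarn.resourcemanager.scheduler.class",
                "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
        return conf;
    }
}
```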

11. What Hadoop parameter tuning do you know about?

If you position yourself at the intern level, you don't have to dig this deep into performance tuning; but for an engineer with even a little work experience, tuning is very important.

Common Hadoop parameter tuning includes the following (a small sketch follows the list):

  • Configure multiple directories in hdfs-site.xml ahead of time; otherwise changing them later requires restarting the cluster
  • The NameNode has a worker thread pool that handles concurrent heartbeats from the DataNodes and concurrent metadata operations from clients:

dfs.namenode.handler.count = 20 * log2(cluster size)

For example, if the cluster size is 10, set this parameter to 60

  • Keep the edit log storage path dfs.namenode.edits.dir on a different disk from the image file storage path dfs.namenode.name.dir to minimize write latency
  • yarn.nodemanager.resource.memory-mb: the physical memory YARN may use on a node, 8192 MB by default. YARN does not detect the node's total physical memory, so if your node has less than 8 GB you need to lower this value
  • yarn.scheduler.maximum-allocation-mb: the maximum physical memory a single task can request, 8192 MB by default
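
A hedged sketch that computes the handler-count formula and applies the settings above; the cluster size and directory paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;

public class NameNodeHandlerTuning {
    public static Configuration tuned(int clusterSize) {
        // dfs.namenode.handler.count = 20 * log2(cluster size)
        int handlerCount = (int) Math.round(20 * Math.log(clusterSize) / Math.log(2));

        Configuration conf = new Configuration();
        conf.setInt("dfs.namenode.handler.count", handlerCount);
        // Keep edit logs and the fsimage on separate disks (paths are placeholders)
        conf.set("dfs.namenode.name.dir", "/disk1/dfs/name");
        conf.set("dfs.namenode.edits.dir", "/disk2/dfs/edits");
        return conf;
    }
}
```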

12. Are you familiar with Hadoop benchmarks?

This is a question that depends entirely on project experience; if you can't answer it yet, just note the following:

After building a Hadoop cluster, we need to test HDFS read/write performance and MapReduce computing capability. The test JAR packages ship with Hadoop under the share directory (for example the hadoop-mapreduce-client-jobclient tests JAR, which contains TestDFSIO).

13. How did you handle a Hadoop outage?

I believe that by the time we get here, some readers can barely hold on.

But enough of that; you can't say I've run out of material ٩(❛ᴗ❛)۶ — back to the question:

If a MapReduce job causes the system to go down, you need to control the number of concurrent Yarn tasks and the maximum memory each task requires. The parameter to adjust is yarn.scheduler.maximum-allocation-mb (the maximum physical memory that can be allocated to a single task; the default is 8192 MB).

If the NameNode crashes because of excessive file writes, increase the Kafka storage size and throttle the write speed from Kafka into HDFS. Kafka acts as a cache during peak hours, and the data synchronization catches up automatically after the peak.

14. Can you give an example of how you solved the problem of Hadoop data skew?

Performance optimization and data skew: if you don't prepare these before the interview, you can expect to suffer. Here is a reliable answer you can use for reference:

1) Combine on the Map side in advance to reduce the amount of data transferred (see the sketch below)

Adding a Combiner to the Mapper is equivalent to reducing in advance: identical keys are aggregated inside each Mapper, which cuts down the amount of data transferred during shuffle and the amount of computation on the Reducer side.

This approach is not very effective when the keys that cause the skew are spread in large numbers across many different Mappers.
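
A hedged sketch, assuming a word-count-style job with Text keys and IntWritable counts; SumCombiner is a made-up class name standing in for whatever aggregation the job actually does.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();   // aggregate identical keys inside the Mapper's output
        }
        context.write(key, new IntWritable(sum));
    }

    public static void wire(Job job) {
        job.setCombinerClass(SumCombiner.class);  // "reduce in advance" on the map side
    }
}
```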

2) When a large number of skewed keys are spread across different Mappers

In this case, there are roughly the following methods:

  • Local aggregation plus global aggregation (see the sketch below)

In the map stage of the first MapReduce job, add a random prefix from 1 to N to the keys that cause the skew, so that even identical keys are spread across multiple Reducers for local aggregation, which greatly reduces their volume.

In the second MapReduce job, strip the random prefix off the keys and perform the global aggregation.

The idea: run MapReduce twice. The first pass hashes the salted keys randomly to different Reducers to balance the load; the second pass removes the random prefix and reduces by the original key.

This method runs MapReduce twice, so its performance is slightly lower.
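
A hedged sketch of the first-pass mapper: the class name and the number of salts are made up, and a real job would pair this with a reducer that does the local aggregation.

```java
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// First pass: salt the key with a random prefix 0..N-1 so one hot key
// is spread across several reducers for local aggregation
public class SkewedKeyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int N = 10;                  // number of salts (example value)
    private static final IntWritable ONE = new IntWritable(1);
    private final Random random = new Random();
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String key = line.toString().trim();
        outKey.set(random.nextInt(N) + "_" + key);    // e.g. "3_hotKey"
        context.write(outKey, ONE);
    }
}
```

The second job's mapper then simply splits on the "_" separator to drop the prefix before the global aggregation.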

  • Increase the number of Reducers to improve parallelism
JobConf.setNumReduceTasks(int)
  • Implement a custom Partitioner (see the sketch below)

Customize the hash function according to the data distribution so that keys are spread evenly across the different Reducers.
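
A hedged sketch of such a Partitioner: routing one known hot key to its own reducer is only one example of "customize the hash according to the data distribution", and the key name is made up.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route a known hot key to a dedicated reducer and hash everything else evenly
public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
    private static final String HOT_KEY = "hotKey";    // made-up example key

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions > 1 && HOT_KEY.equals(key.toString())) {
            return numPartitions - 1;                   // reserve the last reducer for the hot key
        }
        return (key.hashCode() & Integer.MAX_VALUE) % Math.max(1, numPartitions - 1);
    }
}
```

Wire it in with job.setPartitionerClass(SkewAwarePartitioner.class), usually together with a larger number of reduce tasks.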


Bonus

To encourage everyone to keep summarizing what they learn, the mind map for this article is available as well. If you need it, follow the blogger's personal WeChat public account [Simman Fungus] and reply "Mind map" in the background to get it.

Conclusion

I am very glad to see that you made it this far. If you have any good ideas or suggestions, leave them in the comments or send me a private message directly. I will cover some big data interview scenarios in a later article.

Like, favorite, and share in one go; make it a habit ~

The articles keep coming; search for "ape man bacteria" on WeChat to read them first, along with mind maps, big data books, high-frequency big data interview questions, and plenty of notes from people who made it into top companies... Looking forward to your follow!