1.1 Common Hadoop port numbers (Hadoop 2.x)

  • dfs.namenode.http-address: 50070
  • dfs.datanode.http-address: 50075
  • dfs.namenode.secondary.http-address: 50090 (SecondaryNameNode web UI)
  • dfs.datanode.address: 50010
  • fs.defaultFS: 8020 or 9000 (NameNode RPC)
  • yarn.resourcemanager.webapp.address: 8088
  • mapreduce.jobhistory.webapp.address: 19888 (JobHistory Server web UI)

1.2 Hadoop configuration files and simple cluster setup

(1) Configuration file:

  • core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml
  • hadoop-env.sh, yarn-env.sh, mapred-env.sh, slaves

(2) Simple cluster setup process:

  • Install the JDK
  • Configure passwordless SSH login
  • Configure the Hadoop core files listed above
  • Format the NameNode

1.3 HDFS read and write flow

This is very important. Although Hadoop has reached 3.x and storage options are increasingly diverse, HDFS is still the mainstream storage layer, so we need to know its read and write flows.

1.3.1 HDFS read process

1.3.2 HDFS write process

1.3.3 MapReduce process

1.3.3.1 Shuffle mechanism

1) Everything between the end of the Map method and the start of the Reduce method is called Shuffle.

2) After the Map method, each record first passes through the partitioner, which marks it with a partition, and is then written into the ring buffer. The ring buffer defaults to 100 MB; when it is 80% full, its contents are spilled to disk. Before each spill, the data is sorted by key in lexicographic order (the sort manipulates the index records rather than the data itself). Many spills produce many spill files, which must then be merge-sorted together. A Combiner can also be applied to the spill files, provided the operation is a pure aggregation (e.g., a sum); it cannot be used for operations like averaging. Finally, the merged output is stored on disk by partition, waiting to be pulled by the Reduce side. (The sketch after item 3 shows how a partitioner and Combiner are attached to a job.)

3) Each Reduce pulls the data of its partition from the Map side, keeping it in memory and spilling to disk when memory runs short. Once all the data has been pulled, a merge sort combines the in-memory and on-disk data. Before entering the Reduce method, the data can additionally be grouped (via a grouping comparator).
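A minimal sketch of how these shuffle hooks attach to a job. The FirstLetterPartitioner is a hypothetical example, and Hadoop's bundled IntSumReducer stands in as the Combiner, assuming the job's map output is (Text, IntWritable) counts:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class ShuffleWiring {

    // Hypothetical partitioner: tags each record with a partition number
    // before it enters the ring buffer (step 2 above).
    public static class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String s = key.toString();
            return (s.isEmpty() ? 0 : s.charAt(0)) % numPartitions;
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "shuffle-demo");
        job.setJarByClass(ShuffleWiring.class);
        job.setPartitionerClass(FirstLetterPartitioner.class);
        // The Combiner pre-aggregates each spill on the map side; safe here
        // because summing is associative (not valid for averages, see step 2).
        job.setCombinerClass(IntSumReducer.class);
        job.setNumReduceTasks(4);
    }
}
```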

1.4 Hadoop optimization

1.4.1 Impact of HDFS small files

  • (1) They shorten the NameNode's lifespan, because every file's metadata is kept in NameNode memory.
  • (2) They inflate the task count of the compute engine; for example, each small file spawns its own Map task.

1.4.2 Handling small files on the input side

  • (1) Merge small files: archive them (HAR), or use a custom InputFormat to pack them into SequenceFile files.
  • (2) Use CombineFileInputFormat as the input format to handle a large number of small files on the input side (see the sketch after this list).
  • (3) Enable JVM reuse for jobs with many small files.
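A minimal sketch of the CombineFileInputFormat approach, using its concrete subclass CombineTextInputFormat; the /input path and the 128 MB cap are illustrative values:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SmallFileJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-files");
        job.setJarByClass(SmallFileJob.class);
        FileInputFormat.addInputPath(job, new Path("/input")); // hypothetical path
        // Pack many small files into each split so one Map task reads several files.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at 128 MB (value in bytes).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);
    }
}
```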

1.4.3 Map phase

  • (1) Increase the ring buffer size, e.g., from 100 MB to 200 MB.
  • (2) Raise the ring buffer's spill threshold, e.g., from 80% to 90%.
  • (3) Reduce the number of merge passes over the spill files (merge more files per pass).
  • (4) Provided the business logic allows it, use a Combiner to pre-merge data and cut down I/O.
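Points (1)-(3) map to real MapReduce parameters; a minimal sketch (the values shown are illustrative):

```java
import org.apache.hadoop.conf.Configuration;

public class MapSideTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // (1) Ring buffer: 100 MB default -> 200 MB.
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        // (2) Spill threshold: 0.80 default -> 0.90.
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);
        // (3) Merge more spill files per pass (default 10), so fewer merge rounds.
        conf.setInt("mapreduce.task.io.sort.factor", 20);
        return conf;
    }
}
```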

1.4.4 Reduce phase

  • (1) Set the numbers of Map and Reduce tasks sensibly: neither too few nor too many. Too few, and tasks queue up, stretching out processing time; too many, and tasks compete for resources, causing errors such as timeouts.
  • (2) Let Map and Reduce overlap: tune the slowstart.completedmaps parameter so that Reduce starts once the Maps have progressed far enough, cutting Reduce's waiting time.
  • (3) Avoid Reduce where possible; using Reduce to join data sets incurs heavy network cost.
  • (4) Increase the parallelism with which each Reduce fetches data from the Map side.
  • (5) If cluster resources permit, increase the memory used to buffer data on the Reduce side.
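Points (2), (4), and (5) correspond to these parameters (values are illustrative):

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceSideTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // (2) Start reducers once 70% of maps have finished (default 0.05).
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.70f);
        // (4) Parallel copier threads fetching map output (default 5).
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        // (5) Share of reducer heap used to buffer shuffled data (default 0.70).
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.80f);
        return conf;
    }
}
```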

1.4.5 I/O transfer

  • (1) Compress data to cut network I/O time; install the Snappy and LZO compression codecs.
  • (2) Use the binary SequenceFile format.
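A minimal sketch of enabling Snappy for intermediate map output (assumes the native Snappy libraries are installed on the cluster):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class ShuffleCompression {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // Compress map output to shrink shuffle traffic between Map and Reduce.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);
        return conf;
    }
}
```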

1.4.6 Overall

  • (1) A MapTask defaults to 1 GB of memory; it can be raised to 4-5 GB.
  • (2) A ReduceTask defaults to 1 GB of memory; it can be raised to 4-5 GB.
  • (3) Increase the number of CPU cores per MapTask and ReduceTask.
  • (4) Increase the CPU cores and memory of each Container.
  • (5) Adjust the maximum retry count for each Map Task and Reduce Task.
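These settings correspond to the following parameters (values are illustrative):

```java
import org.apache.hadoop.conf.Configuration;

public class ContainerSizing {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.memory.mb", 4096);    // (1) MapTask memory (default 1024)
        conf.setInt("mapreduce.reduce.memory.mb", 4096); // (2) ReduceTask memory (default 1024)
        conf.setInt("mapreduce.map.cpu.vcores", 2);      // (3) vcores per MapTask (default 1)
        conf.setInt("mapreduce.reduce.cpu.vcores", 2);
        conf.setInt("mapreduce.map.maxattempts", 6);     // (5) retry limit (default 4)
        conf.setInt("mapreduce.reduce.maxattempts", 6);
        return conf;
    }
}
```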

1.5 Compression

| Compression format | Bundled with Hadoop? | Algorithm | File extension | Splittable? | Program changes needed after adopting this format |
| --- | --- | --- | --- | --- | --- |
| DEFLATE | Yes, usable directly | DEFLATE | .deflate | No | None; handled the same as plain text |
| Gzip | Yes, usable directly | DEFLATE | .gz | No | None; handled the same as plain text |
| bzip2 | Yes, usable directly | bzip2 | .bz2 | Yes | None; handled the same as plain text |
| LZO | No, must be installed | LZO | .lzo | Yes | An index must be built, and the input format must be specified |
| Snappy | No, must be installed | Snappy | .snappy | No | None; handled the same as plain text |

Tip: If this comes up in an interview, the usual answer is Snappy, whose hallmark is speed; its drawback is that it cannot be split. (You can add that in chained MR jobs, the Reduce output can be compressed with bzip2 so that the following Map stage can split its input.)
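A minimal sketch of that chained-MR trick, compressing a job's final output with bzip2 so a downstream job can split it (job is assumed to be an already-configured org.apache.hadoop.mapreduce.Job):

```java
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplittableOutput {
    // Compress the Reduce output with bzip2; the next job's Map tasks
    // can then split the resulting .bz2 files.
    public static void compressOutput(Job job) {
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
    }
}
```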

1.6 Input split (slicing) mechanism

1) Splits are computed purely from the file's byte length, without regard to content.
2) Split size defaults to the block size.
3) Each file is split on its own; the data set as a whole is not considered.

Split size formula: splitSize = max(minSize, min(maxSize, blockSize)), where minSize defaults to 1 and maxSize to Long.MAX_VALUE, so the default split equals one block.
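This mirrors the computeSplitSize logic in Hadoop's FileInputFormat; a minimal sketch:

```java
public class SplitSize {
    // splitSize = max(minSize, min(maxSize, blockSize));
    // with defaults minSize = 1 and maxSize = Long.MAX_VALUE it equals blockSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        // 128 MB block with default min/max -> split of 134217728 bytes.
        System.out.println(computeSplitSize(134217728L, 1L, Long.MAX_VALUE));
    }
}
```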

1.7 YARN job submission process

1.7.1 YARN's default scheduler, scheduler types, and their differences

1) Hadoop schedulers are divided into three categories:

  • FIFO Scheduler, Capacity Scheduler, and Fair Scheduler.
  • The default resource scheduler in Apache Hadoop 2.7.2 is the Capacity Scheduler.

2) Differences:

  • FIFO scheduler: first in, first out; only one task in the queue executes at a time.

  • Capacity scheduler: multiple queues; within each queue tasks run first in, first out, one at a time, so queue-level parallelism equals the number of queues.

  • Fair scheduler: multiple queues; within a queue, resources are allocated to tasks according to the size of each task's resource deficit, and multiple tasks run concurrently, so parallelism is greater than or equal to the number of queues.
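The scheduler is selected by a class name in yarn-site.xml; a minimal sketch of the equivalent programmatic setting, swapping in the Fair Scheduler class shipped with Hadoop:

```java
import org.apache.hadoop.conf.Configuration;

public class SchedulerChoice {
    public static Configuration useFairScheduler() {
        Configuration conf = new Configuration();
        // Apache Hadoop 2.7.2 defaults to the Capacity Scheduler; this swaps in Fair.
        conf.set("yarn.resourcemanager.scheduler.class",
                 "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
        return conf;
    }
}
```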

1.8 Hadoop parameter tuning

1) Configure multiple directories in the hdfs-site.xml file ahead of time; changing them later requires a cluster restart.
2) The NameNode has a worker thread pool that handles concurrent heartbeats from DataNodes and concurrent metadata operations from clients. A common rule of thumb is dfs.namenode.handler.count = 20 × log2(cluster size); for example, a cluster of 8 nodes gives 20 × 3 = 60.
3) Point the edit-log path dfs.namenode.edits.dir and the image (fsimage) path dfs.namenode.name.dir at separate disks to minimize write latency.
4) yarn.nodemanager.resource.memory-mb sets how much memory YARN may use on a node; the default is 8192 (MB). YARN does not detect the node's total physical memory, so if your node has less than 8 GB of memory, you need to lower this value.
5) yarn.scheduler.maximum-allocation-mb sets the maximum physical memory a single task can request; the default is 8192 (MB).
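A minimal sketch of applying rule (2); the cluster size of 8 is a hypothetical example:

```java
import org.apache.hadoop.conf.Configuration;

public class NameNodeHandlers {
    // Rule of thumb from above: dfs.namenode.handler.count = 20 * log2(clusterSize).
    static int handlerCount(int clusterSize) {
        return (int) Math.round(20 * (Math.log(clusterSize) / Math.log(2)));
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("dfs.namenode.handler.count", handlerCount(8)); // 8 nodes -> 60
        System.out.println(conf.getInt("dfs.namenode.handler.count", 10));
    }
}
```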

1.9 Handling Hadoop downtime

1) If an MR job brings the system down: control the number of concurrent YARN tasks and the maximum memory each task may request. Parameter to adjust: yarn.scheduler.maximum-allocation-mb (the maximum physical memory a single task can request; default 8192 MB).
2) If the NameNode goes down because data is written too fast: increase Kafka's storage capacity and throttle the write rate from Kafka into HDFS. Kafka acts as a buffer during peak hours, and the sync automatically catches up once the peak has passed.
