This is the 27th day of my participation in the Gwen Challenge.

1 Shuffle optimization

  1. The Map phase

    1. Increase the ring buffer size, for example from 100 MB to 200 MB.
    2. Raise the ring buffer's spill threshold, for example from 80% to 90%.
    3. Reduce the number of merge passes over spill files by merging more at a time, for example 20 files instead of 10.
    4. Provided business logic is not affected, use a Combiner to pre-aggregate data and reduce I/O.
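The three buffer knobs above correspond to standard MapReduce properties; a minimal mapred-site.xml sketch follows, where the values are the illustrative ones from the list, not universal recommendations:

```xml
<!-- mapred-site.xml: Map-side shuffle tuning (illustrative values) -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>200</value> <!-- ring buffer size in MB; default 100 -->
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.90</value> <!-- spill threshold; default 0.80 -->
</property>
<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>20</value> <!-- spill files merged per pass; default 10 -->
</property>
```

The Combiner is set per job in code (`job.setCombinerClass(...)`) and should only be used when pre-aggregation does not change the final result.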
  2. Reduce phase

    1. Set the number of Map and Reduce tasks appropriately: neither too few nor too many. With too few, tasks queue up and processing time is prolonged; with too many, tasks compete for resources and errors such as processing timeouts occur.
    2. Let Map and Reduce overlap: tune the mapreduce.job.reduce.slowstart.completedmaps parameter so that Reduce starts once Map has progressed far enough, cutting the Reduce wait time.
    3. Avoid Reduce where possible; joining data sets through Reduce causes heavy network traffic.
    4. Increase the parallelism with which each Reduce fetches data from Map.
    5. If cluster resources permit, increase the memory for buffering shuffle data on the Reduce side.
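The Reduce-side settings above also live in mapred-site.xml; the values below are illustrative and assume a cluster with spare memory:

```xml
<!-- mapred-site.xml: Reduce-side shuffle tuning (illustrative values) -->
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>0.70</value> <!-- start the Reduce copy phase after 70% of maps finish; default 0.05 -->
</property>
<property>
  <name>mapreduce.reduce.shuffle.parallelcopies</name>
  <value>10</value> <!-- parallel fetch threads per Reduce; default 5 -->
</property>
<property>
  <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
  <value>0.80</value> <!-- share of Reduce heap used to buffer map outputs; default 0.70 -->
</property>
```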
  3. IO transfer

Use data compression to reduce network I/O time. Install the Snappy and LZOP compression codecs.

  • Compression:
    1. For Map input, consider data volume and splittability. Bzip2 is splittable, and LZO supports splitting as well. Note: LZO files must be indexed to be splittable.
    2. For Map output, speed matters most: use a fast codec such as Snappy or LZO.
    3. For Reduce output, it depends on the requirement. If the output is the next MR job's input, consider splittability; for long-term storage, Gzip's higher compression ratio is preferable.
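The choices above can be wired up as follows; the codec classes are the standard Hadoop ones, and using Snappy assumes the native library is installed:

```xml
<!-- mapred-site.xml: Snappy for intermediate (map) output, Gzip for final output -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value> <!-- speed-oriented -->
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value> <!-- ratio-oriented, for archival -->
</property>
```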
  4. Overall
    1. NodeManager memory defaults to 8 GB and should be adjusted to the actual server configuration. For example, on a server with 128 GB of RAM, configure roughly 100 GB via yarn.nodemanager.resource.memory-mb.

    2. The maximum memory per container defaults to 8 GB. Adjust it flexibly to the task's data volume; for example, for 128 MB of data, 1 GB of memory is enough (yarn.scheduler.maximum-allocation-mb).

    3. mapreduce.map.memory.mb: upper limit of memory allocated to a MapTask. If a task exceeds it, the container is killed with an error like "Container is running beyond physical memory limits. Current usage: 565 MB of 512 MB physical memory used; Killing container." The default is 1 GB. For 128 MB of data the default is fine; for larger volumes, increase MapTask memory to 4-5 GB.

    4. mapreduce.reduce.memory.mb: upper limit of memory allocated to a ReduceTask. The default is 1 GB. For 128 MB of data the default is fine; for larger volumes, increase ReduceTask memory to 4-5 GB.

    5. mapreduce.map.java.opts: controls the MapTask JVM heap size. (If the heap is too small, the task throws java.lang.OutOfMemoryError.)

    6. mapreduce.reduce.java.opts: controls the ReduceTask JVM heap size. (If the heap is too small, the task throws java.lang.OutOfMemoryError.)

    7. Increase the number of CPU cores allocated to MapTask and ReduceTask.

    8. Increase the number of CPU cores and memory size of each Container.

    9. Configure multiple directories (disks) in hdfs-site.xml.
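A sketch tying the memory items together for the 128 GB example; the values are illustrative, and -Xmx is kept below the container limit to leave room for off-heap usage:

```xml
<!-- yarn-site.xml (illustrative, 128 GB server) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>102400</value> <!-- ~100 GB available to containers -->
</property>

<!-- mapred-site.xml: raise task memory for inputs well above 128 MB -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx3276m</value> <!-- ~80% of the container limit -->
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx3276m</value>
</property>

<!-- hdfs-site.xml: spread DataNode storage across disks (paths are examples) -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data1/dfs/dn,/data2/dfs/dn</value>
</property>
```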

2 YARN scheduler

  1. Hadoop schedulers fall into three categories:

    FIFO, Capacity Scheduler, and Fair Scheduler. Apache Hadoop's default resource scheduler is the Capacity Scheduler; CDH's default is the Fair Scheduler.

  2. Differences:

    FIFO scheduler: a single first-in-first-out queue; not used in production.
    Capacity Scheduler: supports multiple queues. Queue resource allocation: the queue with the lowest resource usage is served first. Job resource allocation: resources are granted by job priority and submission order. Container resource allocation follows locality (same node, then same rack, then a different rack).
    Fair Scheduler: supports multiple queues and gives every job in a queue a fair share of its resources. When resources are insufficient, they are allocated according to each job's deficit (the gap between its fair share and its actual usage).

  3. How to choose in a production environment?

    Large companies: if concurrency requirements are high, choose the Fair Scheduler, provided server performance can keep up. Small and medium companies, whose cluster resources are limited, choose the Capacity Scheduler.

  4. How do you create queues in production?

    1. By default the scheduler has only a single default queue, which cannot meet production requirements.
    2. By framework: give Hive, Spark, and Flink their own queues.
    3. By business module: login/registration, shopping cart, orders, business department 1, business department 2.
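With the Capacity Scheduler, queues are declared in capacity-scheduler.xml; the queue name hive and the percentages below are illustrative:

```xml
<!-- capacity-scheduler.xml: add a second queue beside default -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,hive</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>40</value> <!-- sibling capacities must sum to 100 -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.hive.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.hive.maximum-capacity</name>
  <value>80</value> <!-- elastic upper bound when the cluster has idle capacity -->
</property>
```

Jobs are then routed to a queue at submission time, e.g. with -Dmapreduce.job.queuename=hive.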
  5. What are the benefits of creating multiple queues?

    1. It prevents a careless employee's runaway job (for example, accidentally recursive code that never terminates) from exhausting all cluster resources.
    2. It enables task degradation: in special periods, queues for important tasks can be guaranteed sufficient resources.

Business department 1 (important) => Business department 2 (less important) => Orders => Shopping cart => Login/registration