Here I record a few parameter settings I use often at work, along with the actual effects of adjusting them.

Company cluster resources for reference: about 600 active nodes on average, each with roughly 200 GB of available memory, for a total of about 116 TB available.

1. set hive.exec.parallel=true;

Enables job parallelism. This parameter goes into almost every HQL script I write. The default number of parallel threads is 8.

If the cluster resources are sufficient, you can increase the number of parallel jobs:

set hive.exec.parallel.thread.number=16; (I seldom use this in production and keep the default, because it eats a lot of resources and may affect other tasks; you can get flagged by the ops team and criticized by email! Of course, it depends on your specific situation.)

This matters because a single table's load usually runs more than 20 jobs each time, and with multiple related dimensions and complicated field logic,

the job count for one table can exceed 100; an INSERT script in a recent requirement reached 169 jobs.

Even so, it ran in about an hour in the test environment, on roughly 100 million rows, a bit over 100 GB.
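To make the parallelism concrete, here is the kind of statement that benefits: the two branches of a UNION ALL have no dependency on each other, so Hive can submit them as parallel jobs (the table and column names here are made up for illustration):

set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8; -- the default; raise only if the cluster has headroom

-- The two branches below do not depend on each other, so they can run as parallel jobs:
select 'web' as channel, count(*) as pv from web_log where dt = '2020-01-01'
union all
select 'app' as channel, count(*) as pv from app_log where dt = '2020-01-01';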

2. set hive.map.aggr=true;

Performs partial aggregation on the map side, which is more efficient but needs more memory; set it according to your company's resources.

I usually don’t turn this on if my script involves a small amount of data.
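For what it's worth, a minimal sketch of how I would pair it with a heavy GROUP BY (hypothetical table and column names):

set hive.map.aggr=true; -- partial aggregation on the map side: trades memory for a smaller shuffle
select sign_date, count(*) as cnt
from user_sign
group by sign_date;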

3. set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

This has been the default since Hive 0.5. It merges small files before the map phase runs, so for jobs that would otherwise spawn a huge number of maps,

enabling this parameter together with the split-size parameters below (a combined example follows item 6) has cut the number of maps by more than half in actual production.

In one job, the map count was 577 before enabling it and only 196 after, which greatly improved the program's running time.

4. set mapred.max.split.size=256000000;

The maximum input size per map (one split), which determines how files are merged; used in conjunction with parameter 3.

The default value is 256000000.

mapred.min.split.size has a default of 10000000.

dfs.block.size defaults to 128 MB. Hive cannot change this one; it can only be modified in the HDFS configuration.

In Hive, the split size does not have to be equal to or smaller than the block size; it can be larger than the block size, and that is what determines the number of maps.

<1> If Hive is processing compressed files and the compression codec does not support splitting, the number of maps depends on the actual sizes of the stored file blocks.

If a file block is itself large, say around 500 MB, then the split handled by each map will be at least around 500 MB.

In that case we cannot increase the number of maps by lowering the split size; we can only decrease the number of maps by raising it.

So when Hive processes files in a compression format that does not support splitting, parameters can only reduce the number of maps, never increase it; Hive's tuning options for non-splittable compressed files are limited.

<2> If the files Hive processes are uncompressed, or compressed with a splittable codec, and the input format is CombineHiveInputFormat,

then the following four parameters control the number of maps. Their relationships and priorities are as follows:

Generally speaking, the values of these parameters should satisfy:

max.split.size >= min.split.size >= min.size.per.node >= min.size.per.rack

Their priority ordering is:

max.split.size <= min.split.size <= min.size.per.node <= min.size.per.rack

Summary: so when tuning the number of maps, first check whether compression is enabled and whether the codec supports splitting, and then look at the parameter settings!

5. set mapred.min.split.size.per.node=256000000;

The minimum split size on a node (the value that decides whether files spread across DataNodes need to be merged);

used in conjunction with parameters 3 and 4.

6. set mapred.min.split.size.per.rack=256000000;

The minimum split size within a rack (the value that decides whether files under different switches need to be merged);

used together with parameters 3, 4, and 5, as in the combined example below.
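Putting parameters 3 through 6 together, the small-file-merge header of a script looks like this (the values are the ones quoted above; adjust them to your own cluster):

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=256000000;          -- maximum bytes per split (per map)
set mapred.min.split.size.per.node=256000000; -- merge threshold within a node
set mapred.min.split.size.per.rack=256000000; -- merge threshold within a rack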

7. set hive.exec.mode.local.auto=true;

Enables local mode. I used this parameter a lot while learning, but it is seldom used in actual production,

because it is meant for small data sets, running the whole task on a single machine, which is not how production jobs work!
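For completeness, local mode has two companion thresholds that bound when Hive is allowed to fall back to a single machine; a sketch for a learning environment (the values shown are the Hive defaults):

set hive.exec.mode.local.auto=true;
-- Hive only chooses local mode when the job stays below both thresholds:
set hive.exec.mode.local.auto.inputbytes.max=134217728; -- total input under 128 MB
set hive.exec.mode.local.auto.input.files.max=4;        -- at most 4 input files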

8. set hive.exec.reducers.bytes.per.reducer=512000000;

The amount of data processed by each reduce task. The default is 256 MB (it was 1 GB before Hive 0.14.0). Our company sets 512 MB, written as 512000000, i.e. 512*1000*1000, because network transmission counts in units of 1000 rather than 1024.

Setting this parameter smaller increases the number of reduces, which can improve running efficiency.

Of course, more reduces are not always better: starting and initializing them costs resources and time.

Moreover, each reduce produces one output file, so too many reduces create a small-files problem when those files become the input of the next task.

9. hive.exec.reducers.max

The default value is 1009 (it was 999 before Hive 0.14.0).

The number of reducers is then computed with a simple formula: N = min(parameter 9, total input size / parameter 8).

In other words, under the old 1 GB default, if the total reduce input (i.e. the map output) is under 1 GB, there is only one reduce.
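A quick worked example with our 512 MB setting: if a job's map output totals 10 GB (a made-up figure), the reducer count is N = min(1009, 10,000,000,000 / 512,000,000) = min(1009, ~20) = 20.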

10. set mapred.reduce.tasks=15;

Sets the number of reduces directly (use with caution in actual production).

So when should you set the reduce count manually? For example, when cluster resources are tight and the automatically calculated number of reduces is not enough:

if the job fails with OOM (out of memory), you can manually set the number of reduces higher than the estimate, so the job runs slowly but at least runs to completion.

So when is there only one Reduce?

<1> when the map output is smaller than hive.exec.reducers.bytes.per.reducer

<2> when mapred.reduce.tasks is manually set to 1

<3> when ORDER BY is used (global sorting is handled by a single reduce)

<4> when a Cartesian product occurs in a table join

<5> an aggregation with no GROUP BY, e.g. select count(*) from tablename;

which can be rewritten as select sign_date, count(*) from tablename group by sign_date; so that the work is spread over multiple reduces.

11. set mapred.job.reuse.jvm.num.tasks=10;

This targets scenarios with lots of small files or a very large number of tasks, most of which finish quickly. JVM startup is a big cost when Hive launches MapReduce jobs, especially ones with tens of thousands of tasks; JVM reuse lets each JVM instance be reused N times within the same job.

12. set hive.exec.dynamic.partition=true;

Enables dynamic partitioning.

13. set hive.exec.dynamic.partition.mode=nonstrict;

Allows all partition columns to be dynamic. The default is strict, which requires that at least one partition be static.
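A typical pairing (hypothetical table names): one INSERT that loads many daily partitions at once, with the partition value taken from the data itself:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- dt is the partition column; its value comes from the last column of the SELECT
insert overwrite table ods_orders partition (dt)
select order_id, user_id, amount, dt
from stg_orders;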

14. set hive.groupby.skewindata=true;

Performs load balancing when the data is skewed, i.e. makes GROUP BY tolerate skewed data. It is essentially the Hive equivalent of a Combiner pre-aggregation in MapReduce.

Note: it only supports aggregation on a single field.

Two MR jobs are generated. In the first job, the map output is distributed randomly to the reducers, which counteracts the skew caused by some keys having far more rows than others:

each reduce performs a partial aggregation and emits its result, so rows with the same GROUP BY key may land on different reduces, achieving load balancing.

The second MR job then distributes the pre-aggregated results to the reduces by the GROUP BY key (this time guaranteeing that identical keys land on the same reduce) and completes the final aggregation.
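The two-job plan is roughly equivalent to the manual rewrite below; this is a sketch with made-up names to show the idea, not Hive's literal plan. A random salt spreads the hot keys for a partial aggregate, then a second aggregation on the real key finishes the job:

set hive.groupby.skewindata=true; -- the one-line version

-- Roughly what it does under the hood, written by hand:
select sign_date, sum(cnt) as cnt             -- job 2: final aggregate on the real key
from (
  select sign_date, bucket, count(*) as cnt   -- job 1: partial aggregate on the salted key
  from (
    select sign_date, cast(rand() * 100 as int) as bucket -- random salt spreads hot keys
    from user_sign
  ) salted
  group by sign_date, bucket
) partial_agg
group by sign_date;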

15. set hive.auto.convert.join=true;

Enables map join (eligible joins are automatically converted to map joins).

16. set hive.mapjoin.smalltable.filesize=512000000;

The size threshold for the small table in a map join; it is effectively the switch that decides whether a join gets converted to a map join.
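A sketch of where it kicks in (hypothetical tables): a large fact table joined to a dimension table under the threshold, so the small table is broadcast into every map and the shuffle/reduce stage is skipped entirely:

set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=512000000; -- tables under ~512 MB become the in-memory side

select o.order_id, o.amount, d.city_name
from orders o          -- large fact table, streamed by the maps
join dim_city d        -- small dimension table, loaded into memory on each map
  on o.city_id = d.city_id;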

17. set hive.exec.compress.output=true;

Enables output compression. Our company keeps the default compression codec, Deflate.

Available compression codecs include:

<1> org.apache.hadoop.io.compress.GzipCodec

<2> org.apache.hadoop.io.compress.DefaultCodec

<3> com.hadoop.compression.lzo.LzoCodec

<4> com.hadoop.compression.lzo.LzopCodec

<5> org.apache.hadoop.io.compress.BZip2Codec

To specify the codec:

set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
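Put together, the output-compression block I would prepend to a final INSERT looks like this (the extra mapreduce.output.fileoutputformat.compress flag is the underlying MapReduce switch):

set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec;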

A note on the three split-size values used in the small-file merge above:

blocks larger than 128 MB are split at the 128 MB boundary; pieces between 100 MB and 128 MB are split at 100 MB; and anything smaller than 100 MB (small files plus the leftovers from splitting) is merged together.