Hive compression

For data-intensive tasks, I/O operations and network data transfers can take quite a long time to complete. By enabling compression in Hive, you can improve Hive query performance and save storage space on HDFS clusters.

HiveQL statements are eventually converted into MapReduce jobs in Hadoop, and MapReduce jobs can compress the processed data.

We’ll start by explaining where compression can be applied in a MapReduce job. The input data can be stored compressed and is decompressed before it enters the map phase. The map output can be compressed to reduce network I/O during the shuffle (map and reduce tasks usually run on different nodes); the reducers copy the compressed data and decompress it. Finally, the reduce output can be compressed to reduce disk usage.
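Each of these points corresponds to Hadoop job properties. As a quick sketch (using the newer mapreduce.* property names; the older mapred.* names used later in this article are their deprecated aliases):

-- Input: no property needed; Hadoop infers the codec from the file extension (.gz, .bz2, ...)

-- Compress the map output for the shuffle:
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final job output:
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;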

Hive intermediate data compression

Once submitted, a complex Hive query is typically transformed into a series of multi-stage MapReduce jobs that are linked through the Hive engine to complete the query. Therefore, the intermediate output here refers to the output of the previous MapReduce job, which will be used as the input data of the next MapReduce job.

We can enable compression on Hive intermediate output by setting the hive.exec.compress.intermediate property, either with the set command in the Hive shell or by modifying the hive-site.xml configuration file.

hive.exec.compress.intermediate: defaults to false; setting it to true enables intermediate data compression, i.e. the mapper output is compressed during the shuffle phase of the intermediate MapReduce jobs. You can use the set command to set these properties in the Hive shell:

set hive.exec.compress.intermediate=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

or

set hive.exec.compress.intermediate=true;
set mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;

It can also be configured in hive-site.xml:

<property>
   <name>hive.exec.compress.intermediate</name>
   <value>true</value>
   <description>
     This controls whether intermediate files produced by Hive between multiple map-reduce jobs are compressed.
     The compression codec and other options are determined from Hadoop config variables mapred.output.compress*
   </description>
</property>
<property>
   <name>hive.intermediate.compression.codec</name>
   <value>org.apache.hadoop.io.compress.SnappyCodec</value>
   <description/>
</property>

Hive final output compression

hive.exec.compress.output: the result data that Hive writes out to a table usually needs to be compressed. This parameter controls whether that feature is enabled; set it to true to compress the resulting files.

mapred.output.compression.codec: once hive.exec.compress.output is set to true, choose a suitable codec, for example SnappyCodec. The settings are as follows (the two forms are equivalent ways of enabling output compression):

set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

or

set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzopCodec;

It can also be configured through hive-site.xml:

<property>
  <name>hive.exec.compress.output</name>
  <value>true</value>
  <description>This controls whether the final outputs of a query (to a local/HDFS file or a Hive table) is compressed. The compression  codec and other options are determined from Hadoop config variables mapred.output.compress*</description>
</property>

Common compression formats

Hive supports compression formats such as bzip2, gzip, deflate, snappy, and LZO. Hive relies on Hadoop's compression support, so a newer Hadoop version supports more compression methods. You can configure them in $HADOOP_HOME/conf/core-site.xml:

<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

Note that before we enable compression in Hive, we need to configure Hadoop support, because Hive only specifies which compression algorithm to use; the codecs themselves must be configured in Hadoop.

Comparing the common compression formats:

In terms of compression ratio: bzip2 > zlib > gzip > deflate > snappy > LZO > LZ4. The ranking varies across test scenarios, so this is only a general guide. Bzip2, zlib, gzip, and deflate produce the smallest output but consume too much time in computation.

In terms of compression speed: LZ4 > LZO > snappy > deflate > gzip > bzip2. LZ4, LZO, and snappy compress and decompress quickly, at the cost of a lower compression ratio.

Therefore, LZ4, LZO, and snappy compression are often used in production environments to keep computation efficient.
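For example, a common combination (a sketch using the properties introduced above; tune the codec choice to your own workload) is a fast codec for intermediate data and a higher-ratio codec for the final output:

set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;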

Compression format   Encoder/decoder class
DEFAULT              org.apache.hadoop.io.compress.DefaultCodec
Gzip                 org.apache.hadoop.io.compress.GzipCodec
Bzip2                org.apache.hadoop.io.compress.BZip2Codec
Snappy               org.apache.hadoop.io.compress.SnappyCodec
LZO                  com.hadoop.compression.lzo.LzopCodec

For files that are compressed using Gzip or Bzip2, we can load them directly into a table; Hive automatically decompresses the data for us:

CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
 
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;
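As a quick sanity check (hypothetical, reusing the raw table from the example above), querying the table should return the log lines with decompression handled transparently:

SELECT count(*) FROM raw;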

Native Libraries

Hadoop is developed in Java, so most compression algorithms are implemented in Java. However, some compression algorithms are not well suited to a Java implementation, so Hadoop provides supplementary support through native libraries. Besides the bzip2, LZ4, snappy, and zlib methods provided in the native libraries, you can install additional libraries as required (such as snappy and LZO). Using the compression methods provided by the native libraries can improve performance by about 50%.

Use the following command to view the loading status of native libraries:

hadoop checknative -a

You can compress Hive tables in either of two ways: configure MapReduce compression, or enable compression properties on the Hive table itself. Since Hive converts SQL jobs into MapReduce jobs, configuring MapReduce compression directly achieves compression. For convenience, Hive also supports compression properties on certain tables, which compress data automatically.
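As a sketch of the table-level approach (assuming the ORC file format is available on your cluster; the table name is made up for illustration), the codec can be declared in TBLPROPERTIES so that writes are compressed without any session-level settings:

CREATE TABLE logs_orc (line STRING)
  STORED AS ORC
  TBLPROPERTIES ("orc.compress"="SNAPPY");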

Available compression codecs in Hive

To enable compression in Hive, we first need to find out which compression codecs are available on the Hadoop cluster. We can list the available compression codecs using the set command below.

hive> set io.compression.codecs;
io.compression.codecs=
  org.apache.hadoop.io.compress.GzipCodec,
  org.apache.hadoop.io.compress.DefaultCodec,
  org.apache.hadoop.io.compress.BZip2Codec,
  org.apache.hadoop.io.compress.SnappyCodec,
  com.hadoop.compression.lzo.LzoCodec,
  com.hadoop.compression.lzo.LzopCodec

Demo

First we create an uncompressed table tmp_no_compress:

CREATE TABLE tmp_no_compress ROW FORMAT DELIMITED LINES TERMINATED BY '\n'
AS SELECT * FROM log_text;

Let’s look at the output without compression.
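One way (assuming the default warehouse location /user/hive/warehouse; adjust the path for your setup) is to list the table's files from the Hive shell:

dfs -ls /user/hive/warehouse/tmp_no_compress;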

Setting compression properties in Hive Shell:

set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

Then we create a compressed table tmp_compress in the same way:

CREATE TABLE tmp_compress ROW FORMAT DELIMITED LINES TERMINATED BY '\n'
AS SELECT * FROM log_text;

Let’s look at the output after setting the compression property:
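Listing the compressed table's directory in the same way (same path assumption as before) should now show files with a .gz suffix, since GzipCodec was selected:

dfs -ls /user/hive/warehouse/tmp_compress;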

Conclusion

  1. Data compression can occur at three stages: the input data can be compressed, the intermediate data can be compressed, and the output data can be compressed.
  2. Hive only enables compression and selects the compression method; the real configuration is done in Hadoop, and the compression itself is performed by MapReduce.
  3. For data-intensive tasks, I/O operations and network data transfers can take quite a long time to complete. By enabling compression in Hive, you can improve Hive query performance and save storage space on HDFS clusters.