
Data compression algorithms

Common compression formats in the big data field include Gzip, Snappy, LZO, LZ4, Bzip2, and ZSTD.

Why data compression?

Compression is typically used to optimize storage (it reduces the space required) and to make full use of network bandwidth. Because big data systems process massive volumes of data, compression is very important.

In many enterprise scenarios, source data arrives in text formats (CSV, TSV, XML, JSON, and so on). These files are human-readable, but they take up a lot of storage space.

In big data processing, however, data should be as machine-friendly as possible. Serializing and compressing this human-readable data into a machine-oriented binary form significantly reduces the storage space required.

Below are some commonly used compression formats, known as codecs, which handle compression/serialization and decompression/deserialization.

Gzip (extension: .gz)

GNU Zip (gzip) is a well-known compression format that is widely used across the Internet. You can use it to compress HTTP requests and responses and thereby make efficient use of a web site's or web application's bandwidth.
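As a point of reference, this web-style use case needs nothing beyond the JDK's built-in GZIPOutputStream. A minimal sketch compressing a response-like payload (the payload content is made up for illustration):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipResponseDemo {
    public static void main(String[] args) throws IOException {
        // A repetitive, text-heavy payload, like a typical HTML/JSON response.
        byte[] body = "{\"status\":\"ok\",\"msg\":\"hello\"}\n".repeat(1000)
                .getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(body);
        }
        System.out.printf("raw=%d bytes, gzip=%d bytes%n", body.length, bos.size());
    }
}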

Advantages

Supported by Hadoop itself; high compression ratio; processing gzip files in an application is the same as processing plain text directly; the Hadoop native library is available.

Disadvantages

Does not support splitting.

What is Hadoop Native?

Because of performance concerns and the absence of certain Java class libraries, Hadoop provides its own native implementations of some components. These components are kept in a separate dynamically linked library, called libhadoop.so on Unix-like platforms.
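To check whether the native library was actually picked up at runtime, Hadoop exposes a small utility class (there is also the hadoop checknative command). A minimal sketch:

import org.apache.hadoop.util.NativeCodeLoader;

public class NativeCheck {
    public static void main(String[] args) {
        // True only if libhadoop.so was found on java.library.path and loaded.
        System.out.println("native hadoop library loaded: "
                + NativeCodeLoader.isNativeCodeLoaded());
    }
}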

Snappy (extension: .snappy)

This codec, developed by Google (and formerly known as Zippy), is considered to have the best performance among medium-compression-ratio codecs. For this format, performance matters more than compression ratio. Snappy is one of the most widely used formats, largely because of its excellent performance.

Advantages

Fast compression speed; the Hadoop native library is supported.

Disadvantages

Does not support splitting; low compression ratio; not supported by Hadoop out of the box and must be installed separately; Linux ships no corresponding command-line tool.
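Outside of Hadoop, Snappy is commonly used from Java through the snappy-java library (org.xerial.snappy, an assumed dependency here). A minimal sketch:

import java.nio.charset.StandardCharsets;
import org.xerial.snappy.Snappy;

public class SnappyDemo {
    public static void main(String[] args) throws Exception {
        byte[] input = "2021-08-08 INFO request handled in 3ms\n".repeat(1000)
                .getBytes(StandardCharsets.UTF_8);

        // Snappy aims for speed, not ratio: expect a modest size reduction.
        byte[] compressed = Snappy.compress(input);
        byte[] restored = Snappy.uncompress(compressed);

        System.out.printf("raw=%d, snappy=%d, round-trip ok=%b%n",
                input.length, compressed.length, restored.length == input.length);
    }
}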

LZO (extension: .lzo)

Licensed under the GNU General Public License (GPL) and very similar to Snappy, LZO has a medium compression ratio and high compression and decompression performance. It is a lossless data compression algorithm that focuses on decompression speed.

Advantages

Relatively fast compression/decompression with a reasonable compression ratio; supports splitting, making it one of the most popular compression formats in Hadoop; the Hadoop native library is supported; the lzop command can be installed on Linux and is easy to use.

Disadvantages

Not supported by Hadoop out of the box and must be installed separately. LZO supports splitting, but LZO files must first be indexed; otherwise Hadoop treats them as ordinary, unsplittable files. (To support splitting, the job's InputFormat must also be set to the LZO-specific format, as sketched below.)
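The index is typically built with the indexer tool that ships with the hadoop-lzo project, after which the job must use the LZO-aware input format. A sketch of the job-setup side, assuming the hadoop-lzo library is on the classpath (class names as published by that project; verify them against the version you install):

import com.hadoop.mapreduce.LzoTextInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class LzoJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "lzo-aware job");

        // Without an LZO-aware input format, Hadoop treats .lzo files as
        // ordinary files and each whole file goes to a single mapper.
        // LzoTextInputFormat uses the .index files produced by the
        // hadoop-lzo indexer to generate splits.
        job.setInputFormatClass(LzoTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // ... mapper/reducer/output setup elided ...
    }
}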

LZ4 (extension: .lz4)

Advantages

Good overall performance; high compression ratio; fast initialization; compression speed and stability are also good.

Disadvantages

  1. LZ4 decompression is cumbersome: the size of the original byte array must be supplied, which adds development work (see the sketch below)
  2. Does not support splitting
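Point 1 is visible directly in the API of the commonly used lz4-java library (net.jpountz, an assumed dependency here): the fast decompressor must be told the original length up front. A minimal sketch:

import java.nio.charset.StandardCharsets;
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;

public class Lz4Demo {
    public static void main(String[] args) {
        LZ4Factory factory = LZ4Factory.fastestInstance();
        byte[] input = "2021-08-08 INFO request handled in 3ms\n".repeat(1000)
                .getBytes(StandardCharsets.UTF_8);

        LZ4Compressor compressor = factory.fastCompressor();
        byte[] compressed = compressor.compress(input);

        // The fast decompressor must be given the ORIGINAL length --
        // this is the bookkeeping burden mentioned above, so the length
        // must be stored alongside the compressed bytes.
        LZ4FastDecompressor decompressor = factory.fastDecompressor();
        byte[] restored = decompressor.decompress(compressed, input.length);

        System.out.printf("raw=%d, lz4=%d, ok=%b%n",
                input.length, compressed.length, restored.length == input.length);
    }
}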

Bzip2 (extension: .bz2)

More or less similar to gzip, but with a higher compression ratio. As you might expect, Bzip2 decompresses data more slowly than gzip. One important aspect is that it supports splitting, which matters when HDFS is used for storage. If the data is only stored and rarely queried, this compression format is a good choice.

Advantages

Supports splitting; high compression ratio, higher than gzip's; supported by Hadoop itself, though without native-library support; the bzip2 command ships with Linux and is easy to use.

Disadvantages

Slow compression/decompression speed; no native-library support.
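From Java, bzip2 streams can be produced and consumed via Apache Commons Compress (an assumed dependency here). A minimal round-trip sketch:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

public class Bzip2Demo {
    public static void main(String[] args) throws Exception {
        byte[] input = "a fairly repetitive payload ".repeat(1000)
                .getBytes(StandardCharsets.UTF_8);

        // Compress: bzip2 favors ratio over speed.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (BZip2CompressorOutputStream out = new BZip2CompressorOutputStream(bos)) {
            out.write(input);
        }

        // Decompress and verify the round trip.
        try (BZip2CompressorInputStream in =
                new BZip2CompressorInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            byte[] restored = in.readAllBytes();
            System.out.printf("raw=%d, bz2=%d, ok=%b%n",
                    input.length, bos.size(), restored.length == input.length);
        }
    }
}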

ZSTD (extension: .zst)

ZSTD is a newer lossless compression algorithm that Facebook open-sourced in 2016. Its strengths are its compression ratio and its compression/decompression performance. ZSTD also has a distinctive feature: a training mode that generates dictionary files, which can greatly improve the compression ratio of small payloads compared with traditional approaches.

  1. For text compression scenarios with large volumes of data, ZSTD is the best choice when weighing compression ratio against compression performance, followed by LZ4.
  2. For small-payload compression scenarios, ZSTD's dictionary mode makes the compression results even more impressive (see the sketch after this list).
  3. Note that raw ZSTD files are not splittable in Hadoop/Spark; splitting requires a splittable container format such as SequenceFile, ORC, or Parquet.
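A sketch of both plain compression and dictionary training using the zstd-jni bindings (com.github.luben, an assumed dependency; API names as I recall them, so verify against your installed version):

import java.nio.charset.StandardCharsets;
import com.github.luben.zstd.Zstd;
import com.github.luben.zstd.ZstdDictTrainer;

public class ZstdDemo {
    public static void main(String[] args) {
        byte[] input = "2021-08-08 INFO request handled in 3ms\n".repeat(1000)
                .getBytes(StandardCharsets.UTF_8);

        // Plain one-shot compression at level 3; like LZ4, decompression
        // needs to be told the original size.
        byte[] compressed = Zstd.compress(input, 3);
        byte[] restored = Zstd.decompress(compressed, input.length);
        System.out.printf("raw=%d, zstd=%d, ok=%b%n",
                input.length, compressed.length, restored.length == input.length);

        // Dictionary training for small payloads: feed many sample records,
        // then train a shared dictionary (here 16 KB).
        ZstdDictTrainer trainer = new ZstdDictTrainer(1024 * 1024, 16 * 1024);
        for (int i = 0; i < 1000; i++) {
            trainer.addSample(("{\"user\":" + i + ",\"action\":\"click\"}")
                    .getBytes(StandardCharsets.UTF_8));
        }
        byte[] dictionary = trainer.trainSamples();
        // The dictionary bytes can then be loaded via ZstdDictCompress /
        // ZstdDictDecompress when (de)compressing each small record.
        System.out.println("dictionary size: " + dictionary.length);
    }
}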

Data compression specification

Evaluating compression schemes

Compression schemes can be evaluated against the following three criteria (a small measurement sketch follows this list):

  1. Compression ratio: the more a codec shrinks the file, the better
  2. Compression/decompression time: the faster the better
  3. Splittability: whether files in the compressed format can be split, so that a single file can be processed by multiple Mapper tasks in parallel
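A rough, JDK-only illustration of how the first two criteria can be measured, using gzip as the codec (single-run wall-clock timing, so treat the numbers as indicative only):

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressionBench {
    public static void main(String[] args) throws Exception {
        byte[] input = "2021-08-08 INFO request handled in 3ms\n".repeat(200_000)
                .getBytes(StandardCharsets.UTF_8);

        long t0 = System.nanoTime();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(input);
        }
        double seconds = (System.nanoTime() - t0) / 1e9;

        // Ratio as in the table below: compressed size / original size.
        System.out.printf("ratio=%.1f%%, time=%.3f s, throughput=%.1f MB/s%n",
                100.0 * bos.size() / input.length,
                seconds,
                input.length / 1e6 / seconds);
    }
}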

Comparison of common compression formats

Compression format | Compression ratio | Compression speed | Decompression speed | Splittable
gzip   | 13.4% | 21 MB/s  | 118 MB/s | no
bzip2  | 13.2% | 2.4 MB/s | 9.5 MB/s | yes
lzo    | 20.5% | 135 MB/s | 410 MB/s | yes
snappy | 22.2% | 172 MB/s | 409 MB/s | no

(Here the compression ratio is the compressed size as a percentage of the original, so lower means better compression.)

Hadoop encoders/decoders

Compression format | Encoder/decoder
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec
Gzip    | org.apache.hadoop.io.compress.GzipCodec
BZip2   | org.apache.hadoop.io.compress.BZip2Codec
LZO     | com.hadoop.compression.lzo.LzopCodec
Snappy  | org.apache.hadoop.io.compress.SnappyCodec
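These codec classes are usually resolved at runtime rather than hard-coded. A minimal sketch using Hadoop's CompressionCodecFactory, which maps a file extension to the matching codec from the table above and decompresses the stream:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]); // e.g. a path ending in .gz or .bz2

        // Picks the codec based on the file extension.
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
        if (codec == null) {
            System.err.println("No codec found for " + path);
            return;
        }
        FileSystem fs = path.getFileSystem(conf);
        try (InputStream in = codec.createInputStream(fs.open(path))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}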

Using data compression

Compressing Hive intermediate data

# Set to true to enable intermediate data compression (default: false, disabled)
set hive.exec.compress.intermediate=true;
# Set the compression codec for intermediate data
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

Compressing Hive output data

set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
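The mapred.* keys above are the older property names. In a plain MapReduce job, the same two knobs look roughly like this (a sketch using the current mapreduce.* names):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output
        // (the analogue of hive.exec.compress.intermediate).
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-output job");
        // Compress the final job output
        // (the analogue of hive.exec.compress.output).
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

        // ... input/mapper/reducer setup elided ...
    }
}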