“This is the second day of my participation in the First Challenge 2022.”

Hello everyone, I am Huaijin Shake Yu, a big data newcomer and an all-round dad who can both code and tutor, with two little money-eaters at home, Jia and Jia.

If you like my article, you can [follow ⭐]+[like 👍]+[comment 📃]. Your triple support is my motivation, and I look forward to growing together with you ~


Preface

Hive can compress both intermediate and final output data, which reduces disk and network I/O and improves throughput.

Generally, a format with a high compression ratio takes up less space but compresses and decompresses more slowly, and vice versa. The commonly used compression formats are GZIP, BZIP2, and Snappy.

Compression format   Compressed size   Compression speed   Splittable
GZIP                 medium            medium              no
BZIP2                small             slow                yes
Snappy               large             fast                no on its own (yes inside container formats such as SequenceFile, ORC, or Parquet)

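Snappy is one of the most commonly used compression methods for big data storage because it compresses and decompresses quickly. Note that a plain Snappy-compressed text file is not splittable by itself; Snappy data can be split only when it is stored inside a block-based container format such as SequenceFile, ORC, or Parquet.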
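The trade-off between compressed size and speed can be observed with the JDK alone. The following is a minimal sketch, not a benchmark: it uses java.util.zip.Deflater (the DEFLATE algorithm behind GZIP) at its fastest and strongest levels as a stand-in, because Snappy and BZIP2 are not in the standard library. The class name, helper methods, and sample data are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical helper class for illustration; not part of Hive or Hadoop.
class CompressionTradeoff {

    // Compresses data with DEFLATE at the given level and returns the compressed size in bytes.
    static int compressedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // Builds deterministic JSON-like sample lines (made-up data).
    static byte[] sampleData(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("{\"id\":").append(i)
              .append(",\"name\":\"user").append(i)
              .append("\",\"score\":").append((i * 7) % 100)
              .append("}\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData(20000);
        System.out.println("original: " + data.length + " bytes");
        for (int level : new int[] {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION}) {
            long start = System.nanoTime();
            int size = compressedSize(data, level);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println("level " + level + ": " + size + " bytes in " + micros + " us");
        }
    }
}
```

On typical text data the higher level should produce the smaller output and take longer, but exact sizes and timings depend on the data and the machine.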

Cause

While running a batch job today, reading a Snappy file on HDFS failed with an error. The table uses org.apache.hadoop.hive.serde2.JsonSerDe to parse JSON files stored as TEXTFILE and compressed with Snappy, so I wanted to look at the actual content inside the Snappy file.

Solution

1. View the content from the CLI

hadoop fs -text /XXX/XXX.snappy

You can use the -text command to print the decompressed content to the terminal, or redirect its output to a file.

The disadvantage is that it is inconvenient for more complex processing or for producing statistics.

2. Parse with code

Once the file is parsed in code, the data can be read and processed, statistics can be produced, and abnormal records can be located directly.
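As a minimal sketch of what "locating abnormal records" could look like, the helper below scans a decompressed stream line by line and reports the line numbers that do not look like a JSON object. The class and method names are hypothetical, and the check is deliberately crude (it is not a JSON parser). In a real job, the InputStream would be the decompressed stream obtained from the Hadoop codec; here an in-memory stream stands in for it.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Hadoop or Hive.
class JsonLineScanner {

    // Returns the 1-based numbers of non-empty lines that do not start with '{' and end with '}'.
    // This is only a rough shape check, assumed to be good enough to spot truncated records.
    static List<Integer> findSuspectLines(InputStream in) {
        List<Integer> suspects = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            int lineNo = 0;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue; // blank lines are ignored, not reported
                }
                if (!(trimmed.startsWith("{") && trimmed.endsWith("}"))) {
                    suspects.add(lineNo);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Made-up sample: the second record is truncated.
        String sample = "{\"id\":1}\n{\"id\":2\n{\"id\":3}\n";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println("Suspect lines: " + findSuspectLines(in));
    }
}
```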

CompressionCodec has two methods for compressing and decompressing data. To compress data that is being written to an output stream, call createOutputStream(OutputStream out) to obtain a CompressionOutputStream; data written to it is passed to the underlying stream in compressed form. Conversely, to decompress data that is being read from an input stream, call createInputStream(InputStream in) to obtain a CompressionInputStream, which reads the compressed data from the underlying stream and returns it uncompressed.
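The same stream-wrapping pattern exists in the JDK, which makes it easy to try without a Hadoop installation. The sketch below is an analogy only: it uses java.util.zip.GZIPOutputStream and GZIPInputStream in place of the streams returned by a Hadoop CompressionCodec, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; a JDK-only analogy to the Hadoop codec streams.
class StreamWrappingDemo {

    // Wraps the sink in a compressing stream: plain bytes go in, compressed bytes reach the sink.
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gzipOut = new GZIPOutputStream(sink)) {
            gzipOut.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The compressing stream is closed here, so the GZIP trailer has been written.
        return sink.toByteArray();
    }

    // Wraps the source in a decompressing stream: compressed bytes are read, plain bytes come out.
    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (GZIPInputStream gzipIn =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gzipIn.read(buffer)) != -1) {
                result.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = "{\"id\":1}\n{\"id\":2}\n".getBytes(StandardCharsets.UTF_8);
        byte[] restored = decompress(compress(raw));
        System.out.print(new String(restored, StandardCharsets.UTF_8));
    }
}
```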

Compression format   Codec class
GZIP                 org.apache.hadoop.io.compress.GzipCodec
BZIP2                org.apache.hadoop.io.compress.BZip2Codec
Snappy               org.apache.hadoop.io.compress.SnappyCodec
DEFLATE              org.apache.hadoop.io.compress.DefaultCodec

The code for compression and decompression is as follows:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

// The class name is illustrative.
public class SnappyFileDemo {

    public static void main(String[] args) throws IOException {
        decompres("d:\\a1-k01-1642561371606.snappy");
    }

    public static void compress(String filername, String method) throws ClassNotFoundException, IOException {
        // 1 Create the input stream for the file to compress
        File fileIn = new File(filername);
        InputStream in = new FileInputStream(fileIn);

        // 2 Load the codec class by its fully qualified name
        Class<?> codecClass = Class.forName(method);
        Configuration conf = new Configuration();

        // 3 Instantiate the corresponding codec
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);

        // 4 Create the output file, appending the codec's default extension
        File fileOut = new File(filername + codec.getDefaultExtension());
        OutputStream out = new FileOutputStream(fileOut);
        CompressionOutputStream cout = codec.createOutputStream(out);

        // 5 Copy the stream with a 5 MB buffer
        IOUtils.copyBytes(in, cout, 1024 * 1024 * 5, false);

        // 6 Close resources
        in.close();
        cout.close();
        out.close();
    }

    public static void decompres(String filename) throws FileNotFoundException, IOException {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // 1 Determine the codec from the file name
        CompressionCodec codec = factory.getCodec(new Path(filename));

        // 2 Check whether a matching codec exists
        if (null == codec) {
            System.out.println("Cannot find codec for file " + filename);
            return;
        }

        // 3 Create the decompressing input stream
        InputStream cin = codec.createInputStream(new FileInputStream(filename));

        // 4 Create the output stream
        File fout = new File(filename + ".decoded");
        OutputStream out = new FileOutputStream(fout);

        // 5 Copy the stream with a 5 MB buffer
        IOUtils.copyBytes(cin, out, 1024 * 1024 * 5, false);

        // 6 Close resources
        cin.close();
        out.close();
    }
}
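The decompression method relies on CompressionCodecFactory.getCodec to pick a codec from the file name. The following is a simplified, JDK-only sketch of that idea, assuming each codec is matched by its default file extension (.gz, .bz2, .snappy, .deflate). It is not the factory's actual implementation, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for suffix-based codec lookup; not Hadoop's real implementation.
class CodecBySuffix {

    private static final Map<String, String> CODEC_BY_SUFFIX = new LinkedHashMap<>();

    static {
        CODEC_BY_SUFFIX.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODEC_BY_SUFFIX.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODEC_BY_SUFFIX.put(".snappy", "org.apache.hadoop.io.compress.SnappyCodec");
        CODEC_BY_SUFFIX.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
    }

    // Returns the codec class name whose extension ends the file name, or null if none matches.
    static String codecClassFor(String filename) {
        for (Map.Entry<String, String> entry : CODEC_BY_SUFFIX.entrySet()) {
            if (filename.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(codecClassFor("a1-k01-1642561371606.snappy"));
        System.out.println(codecClassFor("a1-k01-1642561371606.snappy.decoded"));
    }
}
```

Under this assumption, a file whose name does not end in a known extension yields no codec, which corresponds to the "Cannot find codec for file" branch in the code above.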

Conclusion


You can follow the public account “Huaijin Shake Yu jia and Jia” to download resources.