
Hadoop data compression

1. Overview

  • Advantages and disadvantages of compression

    Advantages of compression: Reduces disk I/O and disk storage space.

    Disadvantages of compression: increased CPU overhead.

  • Compression principles

    Computation-intensive jobs: use compression sparingly.

    I/O-intensive jobs: use compression.

2. Compression codecs supported by MapReduce

  • Comparison of compression algorithms

  • Comparison of compression performance

    http://google.github.io/snappy
    
    Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

3. Choosing a compression method

When choosing a compression method, consider the following: compression/decompression speed, compression ratio (the storage size after compression), and whether the compressed file can be split.

3.1 Gzip compression

Advantages: high compression ratio.

Disadvantages: does not support splits; average compression/decompression speed.

3.2 Bzip2 compression

Advantages: high compression ratio; supports splits.

Disadvantages: Slow compression/decompression.

3.3 LZO compression

Advantages: fast compression/decompression speed; supports splits.

Disadvantages: average compression ratio; additional indexes must be created to support splitting.

3.4 Snappy compression

Advantages: fast compression and decompression speed.

Disadvantages: does not support splits; average compression ratio.

3.5 Choosing where to compress

Compression can be enabled at any stage of a MapReduce job: on the input data, on the Map output, or on the Reduce output.

4. Set compression parameters

  • To support multiple compression/decompression algorithms, Hadoop introduces codecs

  • To enable compression in Hadoop, configure the following parameters
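
A minimal sketch of how the commonly used compression parameters can be set in code, assuming the standard Hadoop property names; the class name CompressionConfSketch and the codec choices here (Snappy for the Map output, Bzip2 for the Reduce output) are illustrative assumptions, not requirements.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;

    public class CompressionConfSketch {
        public static Configuration buildConf() {
            Configuration conf = new Configuration();

            // Input compression needs no job-level switch: the codec is inferred
            // from the file extension, using the codecs registered in io.compression.codecs.

            // Compress the intermediate (Map) output
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    SnappyCodec.class, CompressionCodec.class);

            // Compress the final (Reduce) output
            conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
            conf.setClass("mapreduce.output.fileoutputformat.compress.codec",
                    BZip2Codec.class, CompressionCodec.class);

            return conf;
        }
    }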

5. Compression practice cases

5.1 Compressing the Map output

Even if your MapReduce job's input and output files are uncompressed, you can still compress the intermediate results of the Map tasks. Because this intermediate data is written to disk and transferred over the network to the Reduce nodes, compressing it can noticeably improve job performance. Only two properties need to be set; the code below shows how.

  • Compression formats supported in the Hadoop source code: BZip2Codec, DefaultCodec.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
    
        public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    
            // 1 Obtain the job
            Configuration conf = new Configuration();
    
            // Enable map output compression
            conf.setBoolean("mapreduce.map.output.compress", true);
    
            // Set the map output compression codec
            conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);
    
            Job job = Job.getInstance(conf);
    
            // 2 Set the jar package path
            job.setJarByClass(WordCountDriver.class);
    
            // 3 Associate mapper and reducer
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
    
            // 4 Set the KV type for the map output
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
    
            // 5 Sets the final output kV type
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
    
            // 6 Set the input path and output path
            FileInputFormat.setInputPaths(job, new Path("D:\\input\\inputword"));
            FileOutputFormat.setOutputPath(job, new Path("D:\\hadoop\\output888"));
    
            // 7 Submit the job
            boolean result = job.waitForCompletion(true);
    
            System.exit(result ? 0 : 1);
        }
    }
  • The Mapper stays the same

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private Text outK = new Text();
        private IntWritable outV = new IntWritable(1);
    
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // 1 gets a row
            String line = value.toString();
            // 2 Split the line into words
            String[] words = line.split(" ");
            // 3 Loop over the words and write them out
            for (String word : words) {
                // Wrap the word in outK
                outK.set(word);
                // Write the KV pair
                context.write(outK, outV);
            }
        }
    }
  • Reducer remains unchanged

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable outV = new IntWritable();
    
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            outV.set(sum);
            context.write(key, outV);
        }
    }

5.2 Compressing the Reduce output

This case is based on the WordCount example.

  • Modify the driver

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
    
        public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    
            // 1 Obtain the job
            Configuration conf = new Configuration();
    
            Job job = Job.getInstance(conf);
    
            // 2 Set the jar package path
            job.setJarByClass(WordCountDriver.class);
    
            // 3 Associate mapper and reducer
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
    
            // 4 Set the KV type for the map output
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
    
            // 5 Sets the final output kV type
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
    
            // 6 Set the input path and output path
            FileInputFormat.setInputPaths(job, new Path("D:\\input\\inputword"));
            FileOutputFormat.setOutputPath(job, new Path("D:\\hadoop\\output888"));
    
    
            // Enable output compression on the Reduce end
            FileOutputFormat.setCompressOutput(job, true);
    
            // Set the compression codec
            // FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
            // FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
           FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
    
    
            // 7 Submit the job
            boolean result = job.waitForCompletion(true);
    
            System.exit(result ? 0 : 1);
        }
    }
  • Mapper and Reducer remain unchanged
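
With DefaultCodec, the output files of the job above end in .deflate. The following is a minimal, optional sketch of how such a compressed output file could be read back using Hadoop's CompressionCodecFactory; the class name OutputDecompressSketch and the exact file path are assumptions for illustration, not part of the original case.

    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class OutputDecompressSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Assumed output file written by the driver above
            Path compressed = new Path("D:\\hadoop\\output888\\part-r-00000.deflate");
            FileSystem fs = compressed.getFileSystem(conf);

            // Infer the codec from the file extension (.deflate -> DefaultCodec)
            CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(compressed);
            if (codec == null) {
                System.err.println("No codec found for " + compressed);
                return;
            }

            // Wrap the raw stream in the codec's decompressing stream and print the text
            try (InputStream in = codec.createInputStream(fs.open(compressed))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }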

Common errors and solutions

  • Importing the wrong packages is a common source of errors, especially for Text and CombineTextInputFormat.

  • The first input key type of a Mapper must be LongWritable or NullWritable, not IntWritable; otherwise a cast exception is reported.

  • java.lang.Exception: java.io.IOException: Illegal partition for 13926435656 (4) indicates that the number of partitions and the number of ReduceTasks do not match; adjust the number of ReduceTasks (see the sketch after this list).

  • If the number of partitions is greater than 1 but the number of ReduceTasks is 1, is the partitioning step still performed? The answer is no. In the MapTask source code, partitioning is only carried out after checking that the number of reduces is greater than 1; otherwise it is skipped.

  • Unsupported major.minor version 52.0: the cause is a JDK version mismatch (JDK 1.7 on Windows and JDK 1.8 on Linux). Solution: unify the JDK versions.

  • A type conversion exception is usually caused by setting the Map output or final output types incorrectly in the driver. It is also reported if the keys output by Map cannot be sorted (the key type must implement WritableComparable).

  • An input file cannot be found when running wc.jar on the cluster. Cause: the input file for the WordCount case must not be placed in the root directory of the HDFS cluster.

  • When customizing an OutputFormat, note that the close() method of the RecordWriter must close the stream resources; otherwise the output files will contain no data.
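
For the two partition-related items above, the rule of thumb is that the ReduceTask count set in the driver must match the number of partitions the Partitioner can return (or be exactly 1, in which case partitioning is skipped). Below is a hypothetical sketch; PhonePartitioner, its bucketing logic, and the value 5 are illustrative assumptions, not from the original case.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical partitioner that spreads keys over 5 partitions (0..4)
    public class PhonePartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Purely illustrative bucketing; a real partitioner would apply business rules
            return (key.toString().hashCode() & Integer.MAX_VALUE) % 5;
        }
    }

    // In the driver, keep the two settings in step, otherwise
    // "Illegal partition for ..." is thrown at runtime:
    //   job.setPartitionerClass(PhonePartitioner.class);
    //   job.setNumReduceTasks(5);   // 5 partitions -> 5 ReduceTasks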

Related links

Big data Hadoop-MapReduce learning journey chapter 5

Big data Hadoop-MapReduce learning journey chapter 4

Big data Hadoop-MapReduce learning journey chapter 3

Big data Hadoop-MapReduce learning journey chapter 2

Big data Hadoop-MapReduce learning journey chapter 1