
Hadoop data compression

1. Overview

  • Advantages and disadvantages of compression

    Advantages of compression: Reduces disk I/O and disk storage space.

    Disadvantages of compression: increased CPU overhead.

  • Compression principles

    Computation-intensive jobs: use compression sparingly.

    I/O-intensive jobs: use compression.

2. Compression codecs supported by MapReduce

  • Comparison of compression algorithms

  • Comparison of compression performance

    http://google.github.io/snappy
    
    Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

3. Choosing a compression method

When choosing a compression method, consider the following: compression/decompression speed, compression ratio (the storage size after compression), and whether the compressed file can be split.

3.1 Gzip compression

Advantages: high compression ratio.

Disadvantages: does not support splits; average compression/decompression speed.

3.2 Bzip2 compression

Advantages: high compression ratio; supports splits.

Disadvantages: Slow compression/decompression.

3.3 LZO compression

Advantages: fast compression/decompression speed; supports splits.

Disadvantages: average compression ratio; additional indexes must be created to support splitting.

3.4 Snappy compression

Advantages: fast compression and decompression speed.

Disadvantages: does not support splits; average compression ratio.

3.5 Choosing where to compress

Compression can be enabled at any stage of a MapReduce job: on the input data, on the Map output, or on the Reduce output.

4. Set compression parameters

  • To support multiple compression/decompression algorithms, Hadoop introduces codecs

  • To enable compression in Hadoop, configure the following parameters
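
A minimal sketch of how the commonly used compression parameters can be set in code, assuming the standard Hadoop property names; the class name CompressionConfSketch and the codec choices here (Snappy for the Map output, Bzip2 for the Reduce output) are illustrative assumptions, not requirements.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;

    public class CompressionConfSketch {
        public static Configuration buildConf() {
            Configuration conf = new Configuration();

            // Input compression needs no job-level switch: the codec is inferred
            // from the file extension, using the codecs registered in io.compression.codecs.

            // Compress the intermediate (Map) output
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    SnappyCodec.class, CompressionCodec.class);

            // Compress the final (Reduce) output
            conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
            conf.setClass("mapreduce.output.fileoutputformat.compress.codec",
                    BZip2Codec.class, CompressionCodec.class);

            return conf;
        }
    }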

5. Compression practice cases

5.1 Compressing the Map output

Even if your MapReduce job's input and output files are uncompressed, you can still compress the intermediate results of the Map tasks. Because this intermediate data is written to disk and transferred over the network to the Reduce nodes, compressing it can noticeably improve job performance. Only two properties need to be set; the code below shows how.

  • Compression formats supported in the Hadoop source code: BZip2Codec, DefaultCodec.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
    
        public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    
            // 1 Obtain the job
            Configuration conf = new Configuration();
    
            // Enable map output compression
            conf.setBoolean("mapreduce.map.output.compress", true);
    
            // Set the map output compression codec
            conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);
    
            Job job = Job.getInstance(conf);
    
            // 2 Set the jar package path
            job.setJarByClass(WordCountDriver.class);
    
            // 3 Associate mapper and reducer
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
    
            // 4 Set the KV type for the map output
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
    
            // 5 Sets the final output kV type
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
    
            // 6 Set the input path and output path
            FileInputFormat.setInputPaths(job, new Path("D:\\input\\inputword"));
            FileOutputFormat.setOutputPath(job, new Path("D:\\hadoop\\output888"));
    
            // 7 Submit the job
            boolean result = job.waitForCompletion(true);
    
            System.exit(result ? 0 : 1);
        }
    }
  • The Mapper stays the same

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private Text outK = new Text();
        private IntWritable outV = new IntWritable(1);
    
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // 1 gets a row
            String line = value.toString();
            // 2 Split the line into words
            String[] words = line.split(" ");
            // 3 Loop over the words and write them out
            for (String word : words) {
                // Wrap the word in outK
                outK.set(word);
                // Write the KV pair
                context.write(outK, outV);
            }
        }
    }
  • Reducer remains unchanged

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable outV = new IntWritable();
    
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            outV.set(sum);
            context.write(key, outV);
        }
    }

5.2 Compressing the Reduce output

This case is based on the WordCount example.

  • Modify the driver

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
    
        public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    
            // 1 Obtain the job
            Configuration conf = new Configuration();
    
            Job job = Job.getInstance(conf);
    
            // 2 Set the jar package path
            job.setJarByClass(WordCountDriver.class);
    
            // 3 Associate mapper and reducer
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
    
            // 4 Set the KV type for the map output
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
    
            // 5 Sets the final output kV type
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
    
            // 6 Set the input path and output path
            FileInputFormat.setInputPaths(job, new Path("D:\\input\\inputword"));
            FileOutputFormat.setOutputPath(job, new Path("D:\\hadoop\\output888"));
    
    
            // Enable output compression on the Reduce end
            FileOutputFormat.setCompressOutput(job, true);
    
            // Set the compression codec
            // FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
            // FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
           FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
    
    
            // 7 Submit the job
            boolean result = job.waitForCompletion(true);
    
            System.exit(result ? 0 : 1);
        }
    }
  • Mapper and Reducer remain unchanged
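
With DefaultCodec, the output files of the job above end in .deflate. The following is a minimal, optional sketch of how such a compressed output file could be read back using Hadoop's CompressionCodecFactory; the class name OutputDecompressSketch and the exact file path are assumptions for illustration, not part of the original case.

    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class OutputDecompressSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Assumed output file written by the driver above
            Path compressed = new Path("D:\\hadoop\\output888\\part-r-00000.deflate");
            FileSystem fs = compressed.getFileSystem(conf);

            // Infer the codec from the file extension (.deflate -> DefaultCodec)
            CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(compressed);
            if (codec == null) {
                System.err.println("No codec found for " + compressed);
                return;
            }

            // Wrap the raw stream in the codec's decompressing stream and print the text
            try (InputStream in = codec.createInputStream(fs.open(compressed))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }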

Common errors and solutions

  • Importing the wrong packages is a common source of errors, especially for Text and CombineTextInputFormat.

  • The first input key type of a Mapper must be LongWritable or NullWritable, not IntWritable; otherwise a cast exception is reported.

  • java.lang.Exception: java.io.IOException: Illegal partition for 13926435656 (4) indicates that the number of partitions and the number of ReduceTasks do not match; adjust the number of ReduceTasks (see the sketch after this list).

  • If the number of partitions is greater than 1 but the number of ReduceTasks is 1, is the partitioning step still performed? The answer is no. In the MapTask source code, partitioning is only carried out after checking that the number of reduces is greater than 1; otherwise it is skipped.

  • Unsupported major.minor version 52.0: the cause is a JDK version mismatch (JDK 1.7 on Windows and JDK 1.8 on Linux). Solution: unify the JDK versions.

  • A type conversion exception is usually caused by setting the Map output or final output types incorrectly in the driver. It is also reported if the keys output by Map cannot be sorted (the key type must implement WritableComparable).

  • An input file cannot be found when running wc.jar on the cluster. Cause: the input file for the WordCount case must not be placed in the root directory of the HDFS cluster.

  • When customizing an OutputFormat, note that the close() method of the RecordWriter must close the stream resources; otherwise the output files will contain no data.
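
For the two partition-related items above, the rule of thumb is that the ReduceTask count set in the driver must match the number of partitions the Partitioner can return (or be exactly 1, in which case partitioning is skipped). Below is a hypothetical sketch; PhonePartitioner, its bucketing logic, and the value 5 are illustrative assumptions, not from the original case.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical partitioner that spreads keys over 5 partitions (0..4)
    public class PhonePartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Purely illustrative bucketing; a real partitioner would apply business rules
            return (key.toString().hashCode() & Integer.MAX_VALUE) % 5;
        }
    }

    // In the driver, keep the two settings in step, otherwise
    // "Illegal partition for ..." is thrown at runtime:
    //   job.setPartitionerClass(PhonePartitioner.class);
    //   job.setNumReduceTasks(5);   // 5 partitions -> 5 ReduceTasks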

Related links

Big data Hadoop-MapReduce learning journey chapter 5

Big data Hadoop-MapReduce learning journey chapter 4

Big data Hadoop-MapReduce learning journey chapter 3

Big data Hadoop-MapReduce learning journey chapter 2

Big data Hadoop-MapReduce learning journey chapter 1