Original article: itweknow.cn/detail?id=6… You are welcome to visit.

MapReduce is a programming model built around the ideas of "Map" and "Reduce": input data is distributed to Map functions for processing, and the results are then aggregated and written out by Reduce functions. Conceptually this is somewhat similar to the Stream API in Java 8, which interested readers may also want to look at. A MapReduce job is divided into two phases, the Map phase and the Reduce phase. Each phase takes key-value pairs as input and output, and the types of the keys and values are specified by us. Typically the map input key is of type LongWritable, the offset of the start of a line relative to the start of the file, and the value is of type Text, the text content of that line.
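For readers curious about the analogy, here is a minimal sketch (not from the original article) of the same character-counting idea written with the Java 8 Stream API; the class and variable names are made up for illustration:

import java.util.Map;
import java.util.stream.Collectors;

public class StreamCharCount {
    public static void main(String[] args) {
        String line = "hello hadoop";
        // "Map" step: turn the line into a stream of single-character strings.
        // "Reduce" step: group equal characters and count them.
        Map<String, Long> counts = line.chars()
                .mapToObj(c -> String.valueOf((char) c))
                .collect(Collectors.groupingBy(s -> s, Collectors.counting()));
        System.out.println(counts);
    }
}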

Prerequisites

  • A Maven project.
  • A Linux machine or virtual machine running Hadoop, or access to a Hadoop cluster.

The general steps for writing a MapReduce program are: (1) the Map program, (2) the Reduce program, and (3) the driver program. Let's write a simple example in that order: a job that counts the number of occurrences of each character in a file and writes the counts out.

Project dependencies

Let's first take care of the dependencies by adding the following to pom.xml.

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
</dependency>
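Note that ${hadoop.version} refers to a Maven property that should match the Hadoop version running on your cluster. If your POM does not define it yet, a minimal sketch looks like this (the version number below is only a placeholder, not a recommendation):

<properties>
    <!-- Set this to the Hadoop version of your cluster. -->
    <hadoop.version>2.7.3</hadoop.version>
</properties>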

The Map program

We extend the Mapper class and override its map method. In the Map phase, the input is the raw data read from HDFS: the input key is the offset of the start of the line relative to the start of the file, and the value is the text of that line. The output is also a key-value pair, and we can choose its types; here the key is Text (a single character) and the value is LongWritable (a count of 1). This output is then sent to the Reduce phase for further processing.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CharCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Convert this line of text to an array of characters.
        char[] chars = value.toString().toCharArray();
        for (char c : chars) {
            // Emit the character with a count of 1 for each occurrence.
            context.write(new Text(c + ""), new LongWritable(1));
        }
    }
}
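A small design note that goes beyond the original article: the mapper above allocates a new Text and a new LongWritable for every character. A common Hadoop idiom is to reuse a single instance of each, since the framework copies the data when write() is called. A minimal sketch of that variant (the class name ReusingCharCountMapper is made up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReusingCharCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // Reused output objects; the framework copies key/value data on each write() call.
    private final Text outKey = new Text();
    private final LongWritable one = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (char c : value.toString().toCharArray()) {
            outKey.set(String.valueOf(c));
            context.write(outKey, one);
        }
    }
}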

The Reduce program

We extend the Reducer class and override its reduce method. In this example, the input of the Reduce phase is the output of the Map phase, and its output is the final result. You may have noticed that the second argument of the reduce method is an Iterable: MapReduce groups the Map output by key, so all counts emitted for the same character arrive together as the input of a single reduce call. For example, if the Map phase emits ('h', 1) three times, reduce is called once with key 'h' and the values [1, 1, 1], and it writes ('h', 3).

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CharCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum up all the counts emitted for this character.
        long count = 0;
        for (LongWritable value : values) {
            count += value.get();
        }
        context.write(key, new LongWritable(count));
    }
}

The driver

So far we have the Map and Reduce programs; we still need a driver to run the whole job. Here we initialize a Job object, which specifies the execution specification of the entire MapReduce job and lets us control how it runs. Through it we specify where the JAR is found, the Mapper and Reducer classes, the output types of the Map phase and of the job as a whole, and the input and output paths.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CharCountDriver {

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // Hadoop locates the job's JAR from the classpath of the driver class.
        job.setJarByClass(cn.itweknow.mr.CharCountDriver.class);

        // Specify the mapper.
        job.setMapperClass(CharCountMapper.class);
        // Specify the reducer.
        job.setReducerClass(CharCountReducer.class);

        // Output key-value pair types of the Map phase.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // Output key-value pair types of the whole job.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Input file path.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Output directory path.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean res = job.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}

You may have noticed that we initialized an empty Configuration without setting anything on it. When the job runs on a machine that is already running Hadoop, it picks up that machine's configuration by default. How to configure it explicitly in a program will be covered in a future article.
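If you do want to point the job at a specific cluster from code, one option is to set the relevant properties on the Configuration object before creating the Job. A minimal sketch, assuming a NameNode reachable at hdfs://namenode:9000 (the host and port are placeholders, not values from this article):

Configuration configuration = new Configuration();
// Point the job at a specific HDFS instance instead of the local machine's defaults.
configuration.set("fs.defaultFS", "hdfs://namenode:9000");
Job job = Job.getInstance(configuration);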

Running the MapReduce job

  1. Package the job: we need to build our MapReduce program into a JAR.

    mvn package -Dmaven.test.skip=true

    The generated JAR package can be found in the target directory.

  2. Copy the JAR package to the Hadoop machine.

  3. Prepare the file to be counted on HDFS. The file I prepared is under /mr/input/ on HDFS, and its content is as follows (the upload commands are sketched after this list).

    hello hadoop hdfs.I am coming.
  4. Run the JAR.

    hadoop jar mr-test-1.0-SNAPSHOT.jar cn.itweknow.mr.CharCountDriver /mr/input/ /mr/output/
  5. Check the output directory. The listing is as follows; the final result is stored in the /mr/output/part-r-00000 file.

    root@test: ~# hadoop fs -ls /mr/output
    Found 2 items
    -rw-r--r--   1 root supergroup          0 2018-12-24 10:33 /mr/output/_SUCCESS
    -rw-r--r--   1 root supergroup         68 2018-12-24 10:33 /mr/output/part-r-00000

    View the content of the result file:

    root@test: ~# hadoop fs -cat /mr/output/part-r-00000
     	4
    .	2
    I	1
    a	2
    c	1
    d	2
    e	1
    f	1
    g	1
    h	3
    i	1
    l	2
    m	2
    n	1
    o	4
    p	1
    s	1
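For reference, here is a minimal sketch of how the input file from step 3 could be placed on HDFS before running the job; the local file name input.txt is only an example:

hadoop fs -mkdir -p /mr/input
hadoop fs -put input.txt /mr/input/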

Finally, the source code for this article is available here.