Running the WordCount program from IDEA

1. Preface

Goal

Use IDEA to submit jobs to a Hadoop pseudo-distributed cluster running in a virtual machine, and run WordCount v1.0, the official MapReduce example.

Environment

Windows 10
IDEA 2020.2.2
CentOS 7.6
Hadoop 2.9.2
Maven 3.6.3
JDK 1.8

2. Preparing the Hadoop environment in IDEA

Installing the plug-in

JetBrains provides a plug-in for connecting to Hadoop clusters, which makes it convenient to browse HDFS from within IDEA.

Go to File -> Settings -> Plugins, search for Big Data Tools, and install it.

Connecting to HDFS

The first method is to select a Hadoop installation path, but this refers to a Hadoop installation on your own machine (Windows 10).

The second method is to connect to a remote Hadoop instance. Make sure the Hadoop cluster is running before you test the connection. The second method is used here.
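
If the connection test fails, it can help to first confirm that the NameNode port is reachable from Windows at all. The following is only a rough sketch using the plain JDK, not part of the plug-in or of Hadoop. The class name `NameNodePortCheck` is hypothetical, and port 9000 is assumed from the `hdfs://ip:9000/...` arguments used later in this post. An open port only shows that something is listening; it does not prove that HDFS is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: checks whether a TCP port accepts connections.
class NameNodePortCheck {

    // Returns true if a TCP connection to host:port succeeds within the timeout.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout, or unknown host.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: pass the VM address as the first argument; 9000 is the
        // NameNode RPC port used in the program arguments of this post.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9000;
        System.out.println(host + ":" + port + " reachable: "
                + isReachable(host, port, 3000));
    }
}
```

If this prints `false`, the usual suspects are a stopped cluster, a firewall on the CentOS side, or a NameNode that is bound only to the loopback address.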

Adding dependencies

Be careful to select versions that match your Hadoop installation (2.9.2 here).

<dependencies>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.8.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.9.2</version>
    </dependency>
</dependencies>

3. Running the WordCount program in IDEA

Preparing the input and output folders

$ hadoop fs -cat /demo/wordcount/input/file01
Hello World Bye World
$ hadoop fs -cat /demo/wordcount/input/file02
Hello Hadoop Goodbye Hadoop

Running the program

Use the WordCount v1.0 source code. If you don't want to delete the output folder manually before each run, add the following code snippet:

PS: If an output folder already exists in HDFS, you can delete it manually first (in case of a folder permission problem). The snippet is only there so that the program can be run repeatedly from IDEA.

// Delete the output folder if it already exists, so the job can be rerun.
// Requires org.apache.hadoop.fs.FileSystem and org.apache.hadoop.fs.Path.
Path path = new Path(args[1]);
FileSystem fileSystem = path.getFileSystem(conf);
if (fileSystem.exists(path)) {
    fileSystem.delete(path, true); // true = delete recursively
}

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);

Set the program arguments, namely args[0] and args[1], where ip stands for the address of the Hadoop host:

hdfs://ip:9000/demo/wordcount/input hdfs://ip:9000/demo/wordcount/output
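
A common mistake is to pass a bare path such as `/demo/wordcount/input`, which, without cluster configuration on the classpath, is likely to be resolved against the local Windows file system instead of HDFS. As a small sanity check, the sketch below uses only `java.net.URI` from the JDK to verify that an argument is a fully qualified HDFS URI. The class and method names are hypothetical and are not part of Hadoop.

```java
import java.net.URI;

// Hypothetical helper: validates the shape of the two program arguments.
class HdfsArgsCheck {

    // True if the argument has the form hdfs://host:port/absolute/path.
    static boolean isQualifiedHdfsUri(String arg) {
        URI uri = URI.create(arg);
        return "hdfs".equals(uri.getScheme())
                && uri.getHost() != null
                && uri.getPort() != -1
                && uri.getPath() != null
                && uri.getPath().startsWith("/");
    }

    public static void main(String[] args) {
        // "ip" is the placeholder used in this post; replace it with the real address.
        String[] examples = {
            "hdfs://ip:9000/demo/wordcount/input",
            "hdfs://ip:9000/demo/wordcount/output"
        };
        for (String example : examples) {
            URI uri = URI.create(example);
            System.out.println(uri.getScheme() + " | " + uri.getHost() + " | "
                    + uri.getPort() + " | " + uri.getPath()
                    + " | qualified: " + isQualifiedHdfsUri(example));
        }
    }
}
```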

The results
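
To cross-check the job output, the counting logic can be reproduced locally without Hadoop. The sketch below is not the WordCount v1.0 source. It is a plain-Java approximation (hypothetical class `LocalWordCount`) that splits each line on whitespace and sums one count per token, assuming the input folder contains only file01 and file02 as shown above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Local, Hadoop-free approximation of the WordCount logic.
class LocalWordCount {

    // "Map": split each line into tokens. "Reduce": sum one count per token.
    // A TreeMap keeps the words sorted, like the keys in the job output.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                counts.merge(tokens.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Contents of file01 and file02 from the input folder.
        List<String> lines = Arrays.asList(
                "Hello World Bye World",
                "Hello Hadoop Goodbye Hadoop");
        for (Map.Entry<String, Integer> entry : count(lines).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

For the two sample files this prints Bye 1, Goodbye 1, Hadoop 2, Hello 2 and World 2, one word per line. The part file in the HDFS output folder should contain the same counts, which `hadoop fs -cat` on that folder can confirm.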
