1. Preface


Goal

Submit a job from IDEA to a Hadoop pseudo-distributed cluster running in a virtual machine, and run WordCount v1.0, the official MapReduce example.

Environment
  • Windows 10
  • IntelliJ IDEA 2020.2.2
  • CentOS 7.6
  • Hadoop 2.9.2
  • Maven 3.6.3
  • JDK 1.8

2. Preparing the Hadoop environment in IDEA


Install the plug-in

JetBrains provides a plug-in for connecting to a Hadoop cluster, which makes it convenient to connect to HDFS from within IDEA.

Go to File -> Settings -> Plugins, search for Big Data Tools, and install it.

Connect to HDFS

There are two ways to configure the connection. The first is to select a Hadoop installation path, but that refers to a Hadoop installation on your own machine (Windows 10).

The second is to connect to a remote Hadoop cluster. Make sure the cluster is started before you test the connection. The second method is used here.
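Before testing the connection, it can help to confirm that the NameNode port on the virtual machine is reachable from Windows. The following is a minimal sketch using only the JDK, not part of the plug-in or of Hadoop. The class name NameNodePortCheck is hypothetical, and the default port 9000 is an assumption taken from the hdfs:// URIs used later in this article. An open port only shows that something is listening; it does not prove that HDFS is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: checks whether a TCP port accepts connections.
class NameNodePortCheck {

    // Returns true if a TCP connection to host:port succeeds within timeoutMillis.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout, or unresolved host.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed defaults; pass the VM's address and the NameNode port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9000;
        System.out.println(host + ":" + port + " reachable: "
                + isReachable(host, port, 2000));
    }
}
```

If the check reports false while the cluster is running, the firewall on the virtual machine or the address the NameNode is bound to is a likely place to look.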

Add the dependencies

Be careful to select versions that match your cluster (Hadoop 2.9.2 here).

<dependencies>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.8.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.9.2</version>
    </dependency>
</dependencies>

3. Running the WordCount program in IDEA


Prepare the input folder and output folder

$ hadoop fs -cat /demo/wordcount/input/file01
Hello World Bye World
$ hadoop fs -cat /demo/wordcount/input/file02
Hello Hadoop Goodbye Hadoop

Run the program

Start from the WordCount v1.0 source code of the official MapReduce example. If you don't want to manually delete the output folder before each run, add the following code snippet:

PS: If an output folder already exists in HDFS and the program cannot remove it (for example, because of a folder permission problem), delete it manually first. The snippet is only there so that the job can be run repeatedly from IDEA.

// Delete the output folder if it already exists, so the job can be rerun.
Path path = new Path(args[1]);
FileSystem fileSystem = path.getFileSystem(conf);
if (fileSystem.exists(path)) {
    fileSystem.delete(path, true); // true = delete recursively
}

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, path);
System.exit(job.waitForCompletion(true) ? 0 : 1);

Set the program arguments, namely args[0] and args[1] (replace ip with the address of the Hadoop host):

hdfs://ip:9000/demo/wordcount/input hdfs://ip:9000/demo/wordcount/output
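To make the structure of these two arguments explicit, here is a small sketch that builds them from a host and port and then splits them back into their parts with java.net.URI. It uses only the JDK. The class name WordCountArgs and the address 192.168.56.101 are hypothetical; substitute the address of your own virtual machine.

```java
import java.net.URI;

// Hypothetical helper: builds the two program arguments for the WordCount job.
class WordCountArgs {

    // Returns { input URI, output URI } under /demo/wordcount on the given NameNode.
    static String[] build(String nameNodeHost, int port) {
        String base = "hdfs://" + nameNodeHost + ":" + port + "/demo/wordcount";
        return new String[] { base + "/input", base + "/output" };
    }

    public static void main(String[] args) {
        // Assumed VM address; 9000 is the port used in the URIs above.
        String[] programArgs = build("192.168.56.101", 9000);
        for (String arg : programArgs) {
            URI uri = URI.create(arg);
            System.out.println("scheme=" + uri.getScheme()
                    + " host=" + uri.getHost()
                    + " port=" + uri.getPort()
                    + " path=" + uri.getPath());
        }
        // The line to paste into the run configuration's program arguments.
        System.out.println(String.join(" ", programArgs));
    }
}
```

The scheme, host, and port identify the NameNode, and the path is the folder inside HDFS. The first argument is read as args[0] (input) and the second as args[1] (output).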

Results
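As a rough cross-check that needs no cluster, the counting that WordCount performs on file01 and file02 can be reproduced in plain Java. This is only a sketch of the map and reduce logic (split each line on whitespace, then sum a 1 per word). It does not use Hadoop, and the class name LocalWordCount is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java sketch of WordCount's logic; it does not submit a MapReduce job.
class LocalWordCount {

    // Splits each line on whitespace (the "map" step) and sums one count
    // per token (the "reduce" step). TreeMap keeps the words sorted.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                counts.merge(tokens.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Hello World Bye World",        // contents of file01
                "Hello Hadoop Goodbye Hadoop"); // contents of file02
        for (Map.Entry<String, Integer> entry : count(lines).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

For this input the sketch counts Bye 1, Goodbye 1, Hadoop 2, Hello 2, and World 2, which is what the part file in the HDFS output folder should contain after a successful run.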