1. Single input format

Specify ParquetInputFormat as the input format:

job.setMapperClass(ParquetMap.class);
job.setInputFormatClass(ParquetInputFormat.class);
ParquetInputFormat.addInputPath(job, new Path(args[1]));
ParquetInputFormat.setReadSupportClass(job, CheckLevelRunner.MyReadSupport.class);

public static final class MyReadSupport extends DelegatingReadSupport<Group> {
    public MyReadSupport() {
        super(new GroupReadSupport());
    }

    @Override
    public org.apache.parquet.hadoop.api.ReadSupport.ReadContext init(InitContext context) {
        return super.init(context);
    }
}

static class ParquetMap extends Mapper<Void, Group, Text, Text> {
    @Override
    protected void map(Void key, Group value, Context context) {
        try {
            String md5sha1 = value.getString("key1", 0);
            // outputKey and outputValue are built from the record fields (elided in the original)
            context.write(new Text(outputKey), new Text(outputValue));
        } catch (Exception e) {
            return;
        }
    }
}

If Parquet encounters an empty file, parsing fails and the corresponding map task errors out.

You can work around this with the MapReduce fault-tolerance parameter:

mapreduce.map.failures.maxpercent: when the percentage of failed map tasks exceeds this value, the whole job fails; the default is 0. It is set to 5 here. Since one map task is created per input file, the job succeeds as long as empty files make up no more than 5% of the input, and fails otherwise.

job.getConfiguration().set("mapreduce.map.failures.maxpercent", "5");
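To make the threshold concrete, here is a minimal sketch of the rule this parameter applies. The class and helper method are hypothetical illustrations, not part of the Hadoop API:

```java
public class FailurePercentDemo {
    // Hypothetical helper mirroring the mapreduce.map.failures.maxpercent check:
    // the job fails only when the failed-map percentage exceeds the threshold.
    static boolean jobSucceeds(int totalMaps, int failedMaps, int maxPercent) {
        // One map task is created per input file, so failedMaps corresponds
        // to the number of empty Parquet files in the scenario above.
        double failureRate = 100.0 * failedMaps / totalMaps;
        return failureRate <= maxPercent;
    }

    public static void main(String[] args) {
        System.out.println(jobSucceeds(100, 4, 5)); // 4% <= 5%, job succeeds
        System.out.println(jobSucceeds(100, 6, 5)); // 6% > 5%, job fails
    }
}
```

Note that with the default value of 0, a single failed map task (one empty file) fails the entire job.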

2. Multiple input formats

One input directory contains TEXT files and the other contains PARQUET files. Use MultipleInputs to assign a separate mapper to each input source.

// Set multiple inputs, each with its own mapper
MultipleInputs.addInputPath(job, new Path(path1), TextInputFormat.class, NormalMap.class);
MultipleInputs.addInputPath(job, new Path(path2), ParquetInputFormat.class, ParquetMap.class);
ParquetInputFormat.setReadSupportClass(job, CheckLevelRunner.MyReadSupport.class);

3. Problems encountered when calling an HTTP interface from MapReduce

After the program was deployed to the server, it failed with: Exception in thread "main" java.lang.NoSuchFieldError. The cause is that the Hadoop classpath bundles HttpClient 3.1, so the 3.1 HttpClient from the Hadoop package is loaded at runtime instead of the version the program was compiled against.
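One common way to resolve this kind of classpath conflict, assuming a Hadoop 2.x (MRv2/YARN) cluster, is to tell MapReduce to put the user's jars ahead of Hadoop's bundled ones on the task classpath:

```java
// Workaround sketch: prefer the job's own jars over Hadoop's bundled
// commons-httpclient 3.1 when resolving classes in task JVMs.
job.getConfiguration().setBoolean("mapreduce.job.user.classpath.first", true);
```

Alternatively, the HttpClient dependency can be shaded and relocated inside the job jar (for example with the Maven Shade plugin) so it cannot clash with the version Hadoop ships.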
