This is the 16th day of my participation in the November Gwen Challenge. Check out the event details: The Last Gwen Challenge 2021.

A first look at data processing with Flink


This article provides sample code for Flink batch and stream processing. It is not a model of standard coding practice, only entry-level code for getting a feel for the framework.

Batch processing with Flink

Here is a line-by-line explanation of what the code does. The Flink environment must be set up first; since this is a batch job, the environment can be obtained directly from ExecutionEnvironment. Because batch processing works on a fixed data set, the data can simply be declared as a DataSet. After each line is split into words, the records are first grouped, with the grouping key being the 0th field of the tuple (the word). The grouped records are then summed over field 1 of the tuple (the count). Once this is done, the total number of occurrences of each word can be read off the result.

String filePath = "D:\\projects\\flinkProject\\src\\main\\resources\\word.txt";
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> inputData = env.readTextFile(filePath);
DataSet<Tuple2<String,Integer>> outData = inputData.flatMap(new MyFlatter()).groupBy(0).sum(1);
outData.print();
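To make concrete what flatMap followed by groupBy(0).sum(1) computes, here is a minimal plain-Java sketch (no Flink dependency, class name and sample lines are made up for illustration) that reproduces the same word-count aggregation on a small in-memory data set:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchWordCountSketch {
    // Mimics flatMap -> groupBy(0) -> sum(1) on an in-memory "data set".
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {    // flatMap: split each line into words
                counts.merge(word, 1, Integer::sum); // groupBy(0).sum(1): add 1 per occurrence
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello flink", "hello world");
        System.out.println(wordCount(lines)); // {hello=2, flink=1, world=1}
    }
}
```

In the real job, Flink distributes this grouping and summing across parallel tasks; the sketch only shows the logical result.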

Stream processing with Flink

For stream processing with Flink, note first that the environment must be declared as a StreamExecutionEnvironment. Because streaming data arrives continuously, it differs from a fixed batch data set, so a socket is used as the data source here: whatever is sent to the listening port becomes the source data that Flink processes. As in the batch example, the data needs to be grouped, with the grouping key being the 0th field of the tuple. Note that, unlike batch processing, grouping is done with keyBy. After grouping, field 1 of the tuple is summed to obtain the desired counts.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> inputStream = env.socketTextStream("localhost",8888);

DataStream<Tuple2<String,Integer>> outStream = inputStream.flatMap(new MyFlatter())
        .keyBy(0).sum(1);
outStream.print().setParallelism(4);

env.execute();
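One difference from batch worth seeing concretely: a keyed streaming sum emits an updated running count for every arriving element, rather than one final total. The plain-Java sketch below (no Flink dependency, names invented for illustration) simulates what keyBy(0).sum(1) emits as words arrive one at a time:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StreamWordCountSketch {
    // Mimics keyBy(0).sum(1): for each incoming word, emit the updated running count.
    static List<String> runningCounts(List<String> words) {
        Map<String, Integer> state = new HashMap<>(); // keyed state: one counter per word
        List<String> emitted = new ArrayList<>();
        for (String word : words) {                   // elements arrive one at a time
            int updated = state.merge(word, 1, Integer::sum);
            emitted.add("(" + word + "," + updated + ")");
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(runningCounts(Arrays.asList("a", "b", "a")));
        // [(a,1), (b,1), (a,2)]
    }
}
```

To feed the real job, something must be listening on the socket before the program starts, for example netcat in listen mode (nc -lk 8888 on the same machine); each line typed there then flows into Flink.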

The data processing code

The batch and stream examples above both use a function class that splits the data and turns it into tuples. The class splits each line on whitespace and, for every resulting word, emits a tuple whose first value is the word and whose second value is 1. After this step the data is ready for grouping and summing.

public class MyFlatter implements FlatMapFunction<String, Tuple2<String,Integer>> {
    @Override
    public void flatMap(String o, Collector<Tuple2<String,Integer>> collector) throws Exception {
        String[] words = o.split(" ");
        for (String word : words) {
            collector.collect(new Tuple2<>(word, 1));
        }
    }
}
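The splitting step can be checked on its own without Flink. This small sketch (class and method names invented for illustration) reproduces what MyFlatter emits for a single input line:

```java
import java.util.ArrayList;
import java.util.List;

public class FlatterSketch {
    // Reproduces MyFlatter.flatMap for one line: one (word, 1) pair per token.
    static List<String> flatten(String line) {
        List<String> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            out.add("(" + word + ",1)");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(flatten("to be or not"));
        // [(to,1), (be,1), (or,1), (not,1)]
    }
}
```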

Conclusion

Flink is a popular data processing framework that supports both batch and stream processing. It has unique strengths within the broader data ecosystem and is well worth learning. This article is only a superficial introduction to the code; the details remain to be explored.

Afterword

  • How many affairs of rise and fall through the ages? On and on, the Yangtze keeps rolling.