Flink uses DataSet and DataStream to represent data sets. DataSet is used for batch processing, where data is finite; DataStream is used for stream processing, where data is unbounded. Data in a data set is immutable, that is, elements cannot be added or removed. We create a DataSet or DataStream from a data source, and derive new data sets by applying map, filter, and other transform operations. A Flink program consists of these basic steps:

Get an execution environment

Create input data

Apply transform operations to the data sets (hereafter collectively referred to as transforms)

Output data

Below we’ll cover the basic APIs involved in writing Flink programs.

Input and output

First, you need an execution environment, and Flink offers the following three ways to obtain one:

getExecutionEnvironment()

createLocalEnvironment()

createRemoteEnvironment(String host, int port, String... jarFiles)

The code below creates the execution environment using the first method, getExecutionEnvironment().

Batch:

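A minimal batch sketch, assuming the words.txt file mentioned below sits in the working directory:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class BatchJob {
    public static void main(String[] args) throws Exception {
        // Obtain the batch execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Create the input source from a text file
        DataSet<String> text = env.readTextFile("words.txt");
        // print() writes the data set to the console and triggers execution
        text.print();
    }
}
```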


Stream processing:

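The streaming counterpart, again a sketch assuming the same words.txt file; note the explicit execute() call at the end:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamJob {
    public static void main(String[] args) throws Exception {
        // Obtain the streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Create the input source from a text file
        DataStream<String> text = env.readTextFile("words.txt");
        // print() only declares a console sink here ...
        text.print();
        // ... execute() actually triggers the streaming job
        env.execute("stream job");
    }
}
```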


Contents of the words.txt file:

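The exact contents do not matter; any plain-text file of whitespace-separated words works, for example:

```
hello flink
hello world
```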


The code above creates the execution environment and uses env to create the input source. You can call print on a data set to output it to the console, or call a method like writeAsText to output it to another medium. Note that the last line of the streaming example calls execute; in stream processing this method must be called explicitly to trigger execution of the program.

The code above can be run directly in the IDE, just like a normal Java program, in which case Flink starts a local environment to execute it. Alternatively, the program can be packaged and submitted to a Flink cluster. The example above contains the basic skeleton of a Flink program, but performs no further transform operations on the data set. Let’s briefly introduce the basic transforms.

The map operation

The map operation is similar to map in MapReduce: it parses and processes each record. The samples are as follows.

Batch:

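A sketch of the batch version, assuming a DataSet&lt;String&gt; in which each element is a single word (the fromElements source is a stand-in for real input):

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class BatchMap {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Assume each element is a single word
        DataSet<String> words = env.fromElements("hello", "flink", "hello");

        // Map each word to a (word, 1) pair
        DataSet<Tuple2<String, Integer>> pairs = words.map(
            new MapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public Tuple2<String, Integer> map(String word) {
                    return new Tuple2<>(word, 1);
                }
            });

        pairs.print();
    }
}
```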


Stream processing:

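The streaming version of the same sketch, identical except for the environment and data set types:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamMap {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Assume each element is a single word
        DataStream<String> words = env.fromElements("hello", "flink", "hello");

        // Map each word to a (word, 1) pair
        DataStream<Tuple2<String, Integer>> pairs = words.map(
            new MapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public Tuple2<String, Integer> map(String word) {
                    return new Tuple2<>(word, 1);
                }
            });

        pairs.print();
        env.execute("stream map");
    }
}
```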


The batch and stream versions are written identically except for the data set types. Each word is mapped to a (word, 1) tuple. Besides map there are other transforms, such as filter, which drops unwanted records; readers are encouraged to try it out.

Specifying the key

Big data processing often aggregates data along some dimension, which requires specifying a key. groupBy is used to specify the key on a DataSet, and keyBy is used to specify the key on a DataStream. Here we use keyBy as an example.

Flink’s data model is not based on key-value pairs, so keys are virtual: they can be regarded as functions defined over the data.

Define the key in the Tuple

KeyedStream<Tuple2<String, Integer>, Tuple> keyed = words.keyBy(0); // 0 refers to the first element of the Tuple2

KeyedStream<Tuple2<String, Integer>, Tuple> keyed = words.keyBy(0, 1); // 0,1 uses the first and second elements of the tuple together as the key

For nested tuples

DataStream<Tuple3<Tuple2<Integer, Float>, String, Long>> ds;

ds.keyBy(0) will use the whole Tuple2<Integer, Float> as the key.

Specify keys with field expressions

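A minimal sketch along the lines of the example in the Flink documentation, assuming a simple WC POJO with public word and count fields:

```java
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FieldExpressionExample {
    // An ordinary POJO: public no-arg constructor and public fields
    public static class WC {
        public String word;
        public int count;
        public WC() {}
        public WC(String word, int count) { this.word = word; this.count = count; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<WC> words = env.fromElements(new WC("hello", 1), new WC("flink", 1));

        // Key the stream by the "word" field of WC
        KeyedStream<WC, Tuple> keyed = words.keyBy("word");
        keyed.sum("count").print();

        env.execute("field expression example");
    }
}
```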


This specifies the word field of the WC object as the key. The field expression syntax is as follows:

For Java objects, use the field name as the key, as in the example above

For Tuple types, use the field name (f0, f1, ...) or the offset (starting from 0) to specify the key; for example, f0 and 5 refer to the first and sixth fields of a Tuple, respectively

Nested fields are also supported: for example, f1.user.zip selects the zip field of the user object inside the second field of the Tuple as the key

The wildcard character * selects the complete type as the key.

Examples of field expressions

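The field expressions listed below assume POJOs modeled on the example in the Flink documentation (shown here as nested classes; Tuple3 is Flink's, IntWritable is Hadoop's):

```java
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.hadoop.io.IntWritable;

public static class WC {
    public ComplexNestedClass complex; // nested POJO field
    private int count;
    // getter / setter for the private count field
    public int getCount() { return count; }
    public void setCount(int c) { this.count = c; }
}

public static class ComplexNestedClass {
    public Integer someNumber;
    public float someFloat;
    public Tuple3<Long, Long, String> word; // a nested Tuple3
    public IntWritable hadoopCitizen;       // a Hadoop writable type
}
```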

These are valid field expressions for the classes above:

count: the count field of the WC class

complex: all fields of the complex field (recursively)

complex.word.f2: the third field of the word Tuple3 in ComplexNestedClass

complex.hadoopCitizen: the hadoopCitizen field of ComplexNestedClass

Specify keys with a key selector

The key can also be specified by a key selector function, which receives each element as input and returns the key for that element, as in the following example:

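A sketch of a key selector over the (word, 1) stream from the map example; the selector receives each tuple and returns its first field as the key:

```java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KeySelectorExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<String, Integer>> words =
            env.fromElements(Tuple2.of("hello", 1), Tuple2.of("flink", 1));

        // Return the word (first field) as the key -- equivalent to keyBy(0)
        KeyedStream<Tuple2<String, Integer>, String> keyed = words.keyBy(
            new KeySelector<Tuple2<String, Integer>, String>() {
                @Override
                public String getKey(Tuple2<String, Integer> value) {
                    return value.f0;
                }
            });

        keyed.sum(1).print();
        env.execute("key selector example");
    }
}
```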


You can see that the implementation is the same as keyBy(0).

That’s how Flink specifies keys.

Conclusion

This article introduced the basic skeleton of a Flink program: get the environment, create the input source, transform the data sets, and output the results. Because data processing often aggregates along different dimensions (different keys), the article focused on how keys are specified in Flink. Future articles will continue to cover the Flink API.