The Flink DataStream API is divided into three main parts: Source, Transformation, and Sink. A Source is where data enters the job; Flink ships with many built-in sources, Kafka being the most commonly used. Transformations are the operators that apply user-defined processing logic to the data, such as map and flatMap. A Sink is the data output, writing processed data to external storage; Flink likewise ships with many built-in sinks, such as Kafka and HDFS. Beyond the built-in Sources and Sinks, users can implement their own. Since the built-in ones are straightforward to use, they are out of scope for this article. We will start with a custom Source, then walk through the common operators in detail, and finally implement a custom Sink.

The data source

Flink provides several built-in data sources, such as file-based, socket-based, and collection-based sources. If these do not meet the requirements, users can implement custom sources. Below, MySQL is used as an example of a custom data source; this source feeds all the operations in the rest of this article:

/**
 * @Created with IntelliJ IDEA.
 * @author : jmx
 * @Date: 2020/4/14
 * @Time: 17:34
 * Note: RichParallelSourceFunction and SourceContext must declare the generic type.
 */
public class MysqlSource extends RichParallelSourceFunction<UserBehavior> {
    public Connection conn;
    public PreparedStatement pps;
    private String driver;
    private String url;
    private String user;
    private String pass;

    /**
     * Called only once, when the source starts; used to obtain the database connection.
     * @param parameters
     * @throws Exception
     */
    @Override
    public void open(Configuration parameters) throws Exception {
        // Initialize database connection parameters
        Properties properties = new Properties();
        URL fileUrl = TestProperties.class.getClassLoader().getResource("mysql.ini");
        FileInputStream inputStream = new FileInputStream(new File(fileUrl.toURI()));
        properties.load(inputStream);
        inputStream.close();
        driver = properties.getProperty("driver");
        url = properties.getProperty("url");
        user = properties.getProperty("user");
        pass = properties.getProperty("pass");
        // Get the data connection
        conn = getConnection();
        String scanSQL = "SELECT * FROM user_behavior_log";
        pps = conn.prepareStatement(scanSQL);
    }

    @Override
    public void run(SourceContext<UserBehavior> ctx) throws Exception {
        ResultSet resultSet = pps.executeQuery();
        while (resultSet.next()) {
            ctx.collect(UserBehavior.of(
                    resultSet.getLong("user_id"),
                    resultSet.getLong("item_id"),
                    resultSet.getInt("cat_id"),
                    resultSet.getInt("merchant_id"),
                    resultSet.getInt("brand_id"),
                    resultSet.getString("action"),
                    resultSet.getString("gender"),
                    resultSet.getLong("timestamp"))); }}@Override
    public void cancel(a) {}/** * close the connection */
    @Override
    public void close() {
        if (pps != null) {
            try {
                pps.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
        if (conn != null) {
            try {
                conn.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * Get the database connection.
     * @return
     * @throws SQLException
     */
    public Connection getConnection() throws IOException {
        Connection connection = null;
        try {
            // Load the driver
            Class.forName(driver);
            // Get the connection
            connection = DriverManager.getConnection(url, user, pass);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return connection;
    }
}

The source above extends RichParallelSourceFunction and implements its main methods: open(), run(), cancel(), and close().

RichParallelSourceFunction supports setting the parallelism of the source. The difference between RichParallelSourceFunction and RichSourceFunction is that the former allows a parallelism greater than one, while the latter does not support changing the parallelism via setParallelism(); its parallelism defaults to 1, and setting anything else raises the following error:

Exception in thread "main" java.lang.IllegalArgumentException: The maximum parallelism of non parallel operator must be 1.

In addition, RichParallelSourceFunction provides the open() and close() methods. If the source needs a connection, it can be initialized in open() and released in close(). The difference between Rich***Function and an ordinary Function is explained in detail later; for now, just note the pattern. The configuration in the code above is read from a configuration file; due to space constraints, the full code for this article is on GitHub (see the address at the end of this article).
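
To see the source in action, here is a minimal sketch of a job that plugs it into an execution environment (a sketch assuming the UserBehavior class and the MysqlSource above are on the classpath; the parallelism of 2 is purely illustrative):

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MysqlSourceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Because the source extends RichParallelSourceFunction, a parallelism greater than 1 is allowed
        DataStreamSource<UserBehavior> userBehavior = env.addSource(new MysqlSource()).setParallelism(2);
        userBehavior.print();
        env.execute("MysqlSource demo");
    }
}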

The basic transformation

Flink provides a large number of operators. The common ones fall into the categories below. Note: this article does not cover time- and window-based operators; those are detailed in "Flink Time and Window-Based Operators".

Note: the operations in this article are all based on the user-defined MySQL Source above; its fields are interpreted as follows:

userId;     // user ID
itemId;     // item ID
catId;      // item category ID
merchantId; // seller ID
brandId;    // brand ID
action;     // user behavior, one of ("pv", "buy", "cart", "fav")
gender;     // gender
timestamp;  // timestamp of when the behavior occurred, in seconds
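
The article never shows the UserBehavior class itself. A minimal sketch consistent with the fields above and with the UserBehavior.of(...) call in the source (field types inferred from the getLong/getInt/getString calls) could look like this:

public class UserBehavior {
    public long userId;
    public long itemId;
    public int catId;
    public int merchantId;
    public int brandId;
    public String action;
    public String gender;
    public long timestamp;

    public UserBehavior() {} // empty constructor, required for Flink POJO types

    public static UserBehavior of(long userId, long itemId, int catId, int merchantId,
                                  int brandId, String action, String gender, long timestamp) {
        UserBehavior behavior = new UserBehavior();
        behavior.userId = userId;
        behavior.itemId = itemId;
        behavior.catId = catId;
        behavior.merchantId = merchantId;
        behavior.brandId = brandId;
        behavior.action = action;
        behavior.gender = gender;
        behavior.timestamp = timestamp;
        return behavior;
    }
}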

Map

Explanation

DataStream → DataStream: takes one element and returns one element, as follows:

SingleOutputStreamOperator<String> userBehaviorMap = userBehavior.map(new RichMapFunction<UserBehavior, String>() {
            @Override
            public String map(UserBehavior value) throws Exception {
                String action = "";
                switch (value.action) {
                    case "pv":
                        action = "Browse";
                        break;
                    case "cart":
                        action = "Add to cart";
                        break;
                    case "fav":
                        action = "Favorite";
                        break;
                    case "buy":
                        action = "Buy";
                        break;
                }
                return action;
            }
        });

Schematic diagram

A map operation that converts a raindrop shape into a corresponding circular shape

flatMap

Explanation

DataStream → DataStream: takes one element and returns zero, one, or more elements. The flatMap operator can be seen as a generalization of filter and map, i.e., it can implement both. The FlatMapFunction behind the flatMap operator defines a flatMap method that can emit zero, one, or more events as results by passing data to a Collector object. For example:

SingleOutputStreamOperator<UserBehavior> userBehaviorflatMap = userBehavior.flatMap(new RichFlatMapFunction<UserBehavior, UserBehavior>() {
            @Override
            public void flatMap(UserBehavior value, Collector<UserBehavior> out) throws Exception {
                if (value.gender.equals("Female")) {
                    out.collect(value);
                }
            }
        });
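
The example above uses flatMap only as a filter (zero or one output per input). To illustrate the "more than one" case, here is a hedged sketch that splits a hypothetical DataStream<String> named lines into comma-separated tokens; the stream and the delimiter are assumptions for illustration:

SingleOutputStreamOperator<String> tokens = lines.flatMap(new RichFlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                // One input line may produce zero, one, or many outputs
                for (String token : value.split(",")) {
                    out.collect(token);
                }
            }
        });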

Schematic diagram

Drop the yellow raindrops, turn the blue raindrops into circles, and pass the green ones through unchanged

Filter

Explanation

DataStream → DataStream: the filter operator evaluates each element against a predicate; elements for which it returns true are kept, the rest are filtered out. As follows:

  SingleOutputStreamOperator<UserBehavior> userBehaviorFilter = userBehavior.filter(new RichFilterFunction<UserBehavior>() {
            @Override
            public boolean filter(UserBehavior value) throws Exception {
                return value.action.equals("buy"); // keep only purchase events
            }
        });

Schematic diagram

Filter out the red and green raindrops, keeping the blue ones.

keyBy

Explanation

DataStream → KeyedStream: logically partitions the stream into disjoint partitions; all records with the same key are assigned to the same partition. Internally, keyBy() is implemented via hash partitioning. There are three ways to define a key: (1) by field position, such as keyBy(0), used for tuple data types, where the key is the tuple element at that position; (2) by field expression, used for tuples, POJOs, and case classes; (3) by a KeySelector, which extracts the key from input events. A KeySelector example follows the code below.

SingleOutputStreamOperator<Tuple2<String, Integer>> userBehaviorkeyBy = userBehavior.map(new RichMapFunction<UserBehavior, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(UserBehavior value) throws Exception {
                return Tuple2.of(value.action.toString(), 1);
            }
        }).keyBy(0) // Scala tuple numbering starts at 1; Java tuple numbering starts at 0
           .sum(1); // rolling aggregation
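
For comparison, the same keying can be expressed with a KeySelector (option 3 above). A sketch, assuming a DataStream<Tuple2<String, Integer>> named actionCounts built by the map step above:

KeyedStream<Tuple2<String, Integer>, String> keyedByAction = actionCounts
        .keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
            @Override
            public String getKey(Tuple2<String, Integer> value) throws Exception {
                return value.f0; // key by the action string rather than by position
            }
        });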

Schematic diagram

KeyBy operations that partition events based on shapes

Reduce

Explanation

KeyedStream → DataStream: performs a rolling aggregation, combining the current element with the value returned by the previous reduce call and returning a new value. Applying a ReduceFunction to a KeyedStream means every arriving event is aggregated with the current reduce result, producing a new DataStream. This operator does not change the data type, so the input and output stream types are always the same.

SingleOutputStreamOperator<Tuple2<String, Integer>> userBehaviorReduce = userBehavior.map(new RichMapFunction<UserBehavior, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(UserBehavior value) throws Exception {
                return Tuple2.of(value.action.toString(), 1);
            }
        }).keyBy(0) // Scala tuple numbering starts at 1; Java tuple numbering starts at 0
          .reduce(new RichReduceFunction<Tuple2<String, Integer>>() {
              @Override
              public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1, Tuple2<String, Integer> value2) throws Exception {
                  return Tuple2.of(value1.f0, value1.f1 + value2.f1); // rolling aggregation, similar to sum
              }
          });

Schematic diagram

Aggregations

KeyedStream → DataStream: Aggregations (rolling aggregations) are applied to a KeyedStream and produce a DataStream containing the aggregate results (e.g., sum, min). A rolling aggregation keeps one aggregate result per key; as new elements flow through the operator, it updates the result from the previous result value and the current element. The built-in rolling aggregations are listed below, followed by a short sketch:

  • sum(): rolling sum of the specified field of the elements flowing through the operator;
  • min(): rolling minimum of the specified field of the elements flowing through the operator;
  • max(): rolling maximum of the specified field of the elements flowing through the operator;
  • minBy(): rolling minimum seen so far, returning the event that holds that value;
  • maxBy(): rolling maximum seen so far, returning the event that holds that value.
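
As a hedged sketch of these aggregations on the (action, count) tuple stream built in the keyBy example (same imports as the examples above; the variable name keyedCounts is an assumption):

KeyedStream<Tuple2<String, Integer>, Tuple> keyedCounts = userBehavior
        .map(new MapFunction<UserBehavior, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(UserBehavior value) throws Exception {
                return Tuple2.of(value.action, 1);
            }
        })
        .keyBy(0);

keyedCounts.sum(1);   // rolling count per action
keyedCounts.max(1);   // rolling maximum of f1; non-aggregated fields keep the first value seen
keyedCounts.maxBy(1); // rolling maximum of f1, emitting the whole element that holds it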

union

Explanation

DataStream* → DataStream: merges two or more streams of the same type into a new stream that contains all the elements of all the input streams. The union operator does not deduplicate; every input event is forwarded to the downstream operator.

userBehaviorkeyBy.union(userBehaviorReduce).print(); // merge the two streams; more than two streams can be unioned at once

Schematic diagram

connect

Explanation

DataStream, DataStream → ConnectedStreams: combines the events of two streams and returns a ConnectedStreams object; the two streams may have different data types. ConnectedStreams provides map() and flatMap() operators that take a CoMapFunction and a CoFlatMapFunction respectively. Note that CoMapFunction / CoFlatMapFunction gives no control over the order in which the two inputs are processed: the corresponding method is invoked whenever an event arrives on either input.

ConnectedStreams<UserBehavior, Tuple2<String, Integer>> behaviorConnectedStreams = userBehaviorFilter.connect(userBehaviorkeyBy);
        SingleOutputStreamOperator<Tuple3<String, String, Integer>> behaviorConnectedStreamsmap = behaviorConnectedStreams.map(new RichCoMapFunction<UserBehavior, Tuple2<String, Integer>, Tuple3<String, String, Integer>>() {
            @Override
            public Tuple3<String, String, Integer> map1(UserBehavior value1) throws Exception {
                return Tuple3.of("first", value1.action, 1);
            }
            @Override
            public Tuple3<String, String, Integer> map2(Tuple2<String, Integer> value2) throws Exception {
                return Tuple3.of("second", value2.f0, value2.f1);
            }
        });

split

Explanation

DataStream → SplitStream: splits one stream into two or more streams, the opposite of union. The split streams have the same data type as the input stream, and each incoming event can be routed to zero, one, or more named outputs. DataStream.split() takes an OutputSelector function that defines the split rule, assigning events that meet different conditions to outputs named by the user. (In newer Flink versions split() is deprecated in favor of side outputs; see the sketch after the example below.)

 SplitStream<UserBehavior> userBehaviorSplitStream = userBehavior.split(new OutputSelector<UserBehavior>() {
            @Override
            public Iterable<String> select(UserBehavior value) {
                ArrayList<String> userBehaviors = new ArrayList<String>();
                if (value.action.equals("buy")) {
                    userBehaviors.add("buy");
                } else {
                    userBehaviors.add("other");
                }
                return userBehaviors;
            }
        });
        userBehaviorSplitStream.select("buy").print();
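
For Flink versions where split() is deprecated or removed, side outputs express the same routing. A sketch of the equivalent logic, where the OutputTag name "other" is an arbitrary choice:

final OutputTag<UserBehavior> otherTag = new OutputTag<UserBehavior>("other") {};

SingleOutputStreamOperator<UserBehavior> buyStream = userBehavior
        .process(new ProcessFunction<UserBehavior, UserBehavior>() {
            @Override
            public void processElement(UserBehavior value, Context ctx, Collector<UserBehavior> out) throws Exception {
                if (value.action.equals("buy")) {
                    out.collect(value);          // main output: purchase events
                } else {
                    ctx.output(otherTag, value); // side output: everything else
                }
            }
        });

DataStream<UserBehavior> otherStream = buyStream.getSideOutput(otherTag);
buyStream.print();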

Schematic diagram

Sink

Flink provides many built-in sinks, such as writeAsText, print, HDFS, Kafka, etc. Below, a custom MySQL Sink is implemented; it can be compared with the custom MysqlSource above:

/**
 * @Created with IntelliJ IDEA.
 * @author : jmx
 * @Date: 2020/4/16
 */
public class MysqlSink extends RichSinkFunction<UserBehavior> {
    PreparedStatement pps;
    public Connection conn;
    private String driver;
    private String url;
    private String user;
    private String pass;
    /**
     * Initialize the connection in the open() method.
     * @param parameters
     * @throws Exception
     */
    @Override
    public void open(Configuration parameters) throws Exception {
        // Initialize database connection parameters
        Properties properties = new Properties();
        URL fileUrl = TestProperties.class.getClassLoader().getResource("mysql.ini");
        FileInputStream inputStream = new FileInputStream(new File(fileUrl.toURI()));
        properties.load(inputStream);
        inputStream.close();
        driver = properties.getProperty("driver");
        url = properties.getProperty("url");
        user = properties.getProperty("user");
        pass = properties.getProperty("pass");
        // Get the data connection
        conn = getConnection();
        String insertSql = "insert into user_behavior values(?,?,?,?,?,?,?,?);";
        pps = conn.prepareStatement(insertSql);
    }

    /** * close the connection */
    @Override
    public void close() {
        if (conn != null) {
            try {
                conn.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
        if (pps != null) {
            try {
                pps.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * The invoke() method is called for each record to insert the data.
     * @param value
     * @param context
     * @throws Exception
     */
    @Override
    public void invoke(UserBehavior value, Context context) throws Exception {
        pps.setLong(1, value.userId);
        pps.setLong(2, value.itemId);
        pps.setInt(3, value.catId);
        pps.setInt(4, value.merchantId);
        pps.setInt(5, value.brandId);
        pps.setString(6, value.action);
        pps.setString(7, value.gender);
        pps.setLong(8, value.timestamp);
        pps.executeUpdate();
    }
    /**
     * Get the database connection.
     * @return
     * @throws SQLException
     */
    public Connection getConnection() throws IOException {
        Connection connection = null;
        try {
            // Load the driver
            Class.forName(driver);
            // Get the connection
            connection = DriverManager.getConnection(url, user, pass);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return connection;
    }
}
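
Finally, a minimal sketch of a job wiring the custom source to the custom sink (class names follow the code above; the filter step and job name are illustrative):

public class MysqlSourceSinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new MysqlSource())
           .filter(value -> value.action.equals("buy")) // keep only purchase events
           .addSink(new MysqlSink());
        env.execute("MySQL source-to-sink demo");
    }
}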

About RichFunction

Careful readers will have noticed that the operator examples above all use Rich functions. In many cases, a function needs to run some initialization or access its context before processing data. For this, the DataStream API provides the Rich function classes, which offer additional functionality over ordinary functions.

When using a RichFunction, two additional methods can be implemented:

  • open(): the initialization method, called once per task before the first call to the transformation method (such as map). Note that the Configuration parameter is only used in the DataSet API, not in the DataStream API, so it can be ignored there.
  • close(): the function's termination method, called once per task after the last call to the transformation method, typically used to release resources.

In addition, the getRuntimeContext() method gives access to the function's context information (RuntimeContext), such as the function's parallelism, the index of its subtask, and the name of the task executing it, as well as partitioned state.
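
As a small sketch of what accessing the RuntimeContext looks like (the printed fields are examples of what it exposes; the class name is illustrative):

public class InstrumentedMap extends RichMapFunction<UserBehavior, UserBehavior> {
    @Override
    public void open(Configuration parameters) throws Exception {
        // Context information available to every Rich function
        int subtask = getRuntimeContext().getIndexOfThisSubtask();
        int parallelism = getRuntimeContext().getNumberOfParallelSubtasks();
        String taskName = getRuntimeContext().getTaskName();
        System.out.println("subtask " + subtask + " of " + parallelism + " in task " + taskName);
    }

    @Override
    public UserBehavior map(UserBehavior value) throws Exception {
        return value; // pass-through; only the lifecycle is instrumented
    }
}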

Conclusion

This article first implemented a custom MySQL Source, then applied a series of operators to it with a detailed look at each common operator, and finally implemented a custom MySQL Sink, along with an explanation of Rich functions.

Code address: github.com/jiamx/study…
