Development history of Flink

In 2008, a team at the Technical University of Berlin launched the Stratosphere project with the goal of building the next generation of big data analysis engines

On April 16, 2014, Stratosphere became an Apache Incubator project. From Stratosphere 0.6, it was officially renamed Flink and written in Java language

Flink 0.7.0, released on November 04, 2014, introduces the most important feature: Streaming API

2016-03-08, Flink 1.0.0, Scala support

In early 2019, Alibaba acquired Ververica, the company that owns Flink products, and created its own product Blink (a branch of Flink).

What is the Flink

The definition of Flink

Apache Flink is a framework and distributed processing engine for state calculation over unbounded and bounded data streams.

By definition, Apache Flink is a framework, a processing engine, that performs calculations on data streams.

We also learned that it supports two types of data flows, unbounded and bounded, and supports stateful calculations of data flows.

So what is computation with states, what is unbounded data flow, and what is bounded data flow?

What is data flow

Any type of data can form a data flow. Credit card transactions, sensor measurements, machine logs, records of user interactions on websites or mobile apps, all of these can form data streams.

What is bounded data flow

A bounded data flow defines both the beginning and end of a flow. Bounded data streams can be calculated after all the data has been ingested. The data in a bounded data stream can be sorted, so no sequential ingestion is required. Bounded data flow processing is often referred to as bounded data flow

What is unbounded data flow

An unbounded data flow defines the beginning of a flow, but not the end of the flow. This data stream produces data endlessly (at least in theory). Because the input is infinite, we can’t wait for all the data to arrive before processing, we have to continuously process incoming data. Processing unbounded data flows often requires that data be ingested in a particular order, such as the order in which events occur

Apache Flink specializes in processing streams, and batches are regarded as bounded data streams by Apache Flink, so Flink is called a stream-batch all-in-one data computing engine

What is state computing

What is stateless computation

Each time data calculation only considers the current data, not the calculation results of the previous data

What is stateful computation

Each time data calculation is performed, the calculation is based on the previous data calculation result (state), and each calculation result is saved to the storage medium

In practical applications, most scenarios need to record states, such as statistics of a user’s number of visits in a period of time, the maximum speed of a vehicle in a period of time, etc. Therefore, whether a flow processing framework supports calculation with state is a very important feature

Distributed computing

The support of bounded and unbounded data flow and state calculation is only the feature of Flink framework, while distributed computing is the core of Flink big data framework which is different from ordinary data computing framework.

By deploying instances on multiple machines to form a computing cluster, Flink can divide large tasks that cannot be handled by one computer into multiple small tasks and distribute them to other instances in the cluster for distributed computing. (Details will be covered in a later chapter)

Flink definition summary

  • Apache Flink is the framework and processing engine
  • Apache Flink supports distributed computing
  • Apache Flink supports both unbounded and bounded data streams
  • Apache Flink supports stateful calculations of data streams

In addition to the main features mentioned above, Flink has many other features, such as

  • High throughput, low latency, high performance
  • Independent memory management based on the JVM
  • supportEvent Time semanticsAnd combined with theWatermarkMechanism that can handle out-of-order data
  • Supports highly flexible Window operations such as time, count and session
  • Fault tolerance based on the lightweight Distributed Snapshot (CheckPoint) implementation, ensuring exactly-once semantics
  • Save Points
  • Support for storing state in multiple media (memory, file, RocksDB)

.

We’ll talk about that later

Flink learning materials

  • Flink website: flink.apache.org/zh/
  • Flink the latest stable version of the official document: ci.apache.org/projects/fl…
  • Flink Chinese Community Video courses: ververica.cn/developers/…
  • Flink source: github.com/apache/flin…

Thanks to alibaba’s promotion, Flink has a large number of Chinese materials. In fact, the best way to learn Flink is to follow the official documents step by step.