Flink profile

Apache Flink is a distributed big data processing engine capable of stateful computation for both finite and infinite data streams. Can be deployed in a variety of cluster environments for rapid calculations of data sizes.

Processing flow

Core components:

Flink features

Support batch processing and data stream processing
Supports both high throughput and low latency
Flexible Windows (time, slide, roll, session, custom trigger) supported with different time semantics (time, ingest, processing)
Fault tolerant guarantees that are handled only once
Automatic backpressure mechanism
Support CEP, such as anti-fraud and risk control rule engine
Compatibility with MapReduce/Storm interfaces, allowing reuse of its code
Integrate YARN, HDFS, HBase, and other Hadoop ecological components

Flink compares with other frameworks

	Record ACK	Micro-batching	Transactional updates	Distributed snapshots
Typical representative	Apache Storm	Storm Trident ,Spark Streaming	Google Cloud Dataflow	Apache Flink
Semantic guarantee	At least once	Exactly once	Exactly once	Exactly once
delay	low	high	Lower (things delay)	low
throughput	low	high	Higher (depending on the amount of food stored)	high
Calculation model	flow	The number of	flow	flow
Fault tolerant overhead	high	low	Low (depends on the throughput of things stored)	low
Flow control	poor	poor	good	good
Business flexibility (separation of business and fault tolerance)	Part of the	Tightly coupled	The separation of	The separation of

Deployment way

Standalone

Flink runs on a cluster consisting of a master node and one or more worker nodes.

Flink on YARN
Separation mode (A long-running Flink cluster on YARN)
1. A permanent process is required in Yarn. The resource size is determined during Yarn startup. After Yarn is started, an Application is resident in Yarn, which is the permanent process of Flink JobManager in Yarn.
2. JobManager can be fault-tolerant. If JobManager fails, Yarn automatically restarts a JobManager without affecting job submission.
3. Different jobs are affected by the size of resources. Resources are allocated at startup. If a task uses up the given resources, other tasks can be executed only after the resources are released.

Client mode (A single Flink job on YARN)
1. Flink serves only as a client and does not need to reside in the Yarn process.
2. The JobManager starts only after the Job is submitted.
3. Resources of each Job are isolated. Resources of different jobs are independent, and resource allocation of jobs is affected only by the Yarn resource policy.

Fault-tolerant processing

Checkpoint — Checkpoint makes the state of Fink very fault tolerant. Flink can restore the state and calculation position of the job through Checkpoint, so Flink job has a high fault tolerant execution semantics. A Checkpoint is to create a snapshot of the status of all tasks at a certain time and store it to State Backend. — Lightweight fault tolerance — Checkpoint to ensure Exactly once semantics — automatic recovery of internal failures without human intervention
Savepoint
- Savepoints is a self-testing Savepoints feature stored in an external file system that provides the ability to stop-restore or upgrade Flink programs. It uses Flink’s checkpoint mechanism to create a full (non-incremental) state snapshot of a stream job and writes checkpoint data and metadata to an external file system. And it doesn’t expire, it doesn’t get overwritten unless you manually delete it.
- The historical version of the state in the flow process
- Has the function of replay
- External recovery (application restart and upgrade)
- Two trigger modes
  - Cancel with Savepoint
  - Manual active trigger
State Backend

After the Checkpoint function is enabled, the status is persisted at each Checkpoint to prevent data loss and restore consistency. How the internal state is presented, and how and where the checkpoint-based persistent state is stored, depends on the state backend chosen.
Out of the box, Flink integrates the following state Backend systems:
- MemoryStateBackend(default) : Stores data as objects in the Java Heap. Based on HeapKeyedStateBackend, MemoryStateBackend may cause OOM.
- FsStateBackend: Saves data to File systems, such as HDFS and Local File. HeapKeyedStateBackend may cause OOM.
RocksDBStateBackend : Using RocksDB to store state, RocksDB overcomes the memory limitation of HeapKeyedStateBackend and persists to a remote file system. It maintains state in the local file system. KeyedStateBackend writes state directly to the local RocksDB. When configuring a remote FileSystem(such as HDFS), the local data is directly copied to the FileSystem during Checkpoint operation. Restore from FileSystem to local during a fail over.

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

Flink research report

Flink profile

Flink features

Flink compares with other frameworks

Deployment way

Fault-tolerant processing

Flink research report

Flink profile

Flink features

Flink compares with other frameworks

Deployment way

Fault-tolerant processing

Related Posts

Nginx configures various response headers to prevent XSS, clickjacking, and Frame malicious attacks

GitHub distributed version control system

How to build a crawler proxy service?