“This article was submitted to the Good Articles Call for Submissions event: back-end and front-end dual-track submissions, with a ¥20,000 prize pool.”

1: Flink and big data

Since Alibaba acquired Data Artisans (now Ververica), the company behind Flink, the Flink ecosystem has matured rapidly. Flink is being adopted at speed across the large internet companies and has almost become the de facto standard for data development. In particular, with the recent surge of interest in the lakehouse (integrated lake and warehouse) architecture, CDC has taken Flink to yet another level. Below we walk through Flink’s cluster architecture and deployment modes.

2: Flink cluster architecture

1: JobManager

Each cluster has at least one JobManager, the management node. It manages the compute resources of the entire cluster, manages jobs, schedules their execution, and coordinates checkpoints.

2: TaskManager

Each cluster has multiple TaskManagers (TMs), which are responsible for providing the compute resources on which tasks actually run.

3: Client

The client runs the application’s main() method locally to build the corresponding JobGraph, submits that JobGraph to the JobManager for execution, and then monitors the job’s execution status.

3: Flink deployment modes

The deployment modes compare as follows:

Session Mode
- Description: the JobManager and TaskManagers are shared; all submitted jobs run in a single runtime.
- Advantages: resources are fully shared, which improves resource utilization; all jobs are managed inside one Flink session cluster, making them easy to operate and maintain.
- Disadvantages: poor resource isolation; the TaskManagers are not deployed natively, so the cluster is not easy to scale out.

Per-Job Mode
- Description: each job starts its own JobManager and TaskManagers, i.e. a dedicated runtime per job.
- Advantages: full resource isolation between jobs; resources can be requested according to each job’s needs. This is the mode generally used in production.
- Disadvantages: relatively wasteful of resources, since every job’s JobManager consumes resources of its own.

Application Mode (introduced in 1.11)
- Description: the application’s main() runs on the cluster rather than on the client; each application gets its own runtime, and one application can contain multiple jobs.
- Advantages: effectively reduces bandwidth consumption and client load, and enables resource sharing within an application.
- Disadvantages: currently only YARN and Kubernetes are supported, and it is still rarely used in production.
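As a rough sketch, the three modes map onto different CLI invocations of a Flink distribution on YARN; the jar name `./myJob.jar` and entry class `com.example.MyJob` below are placeholders, not from the original article:

```shell
# Session mode: start a long-running session first, then submit jobs into it.
./bin/yarn-session.sh --detached
./bin/flink run -c com.example.MyJob ./myJob.jar

# Per-job mode: each submission spins up its own dedicated cluster on YARN.
./bin/flink run -t yarn-per-job -c com.example.MyJob ./myJob.jar

# Application mode (1.11+): main() is executed on the cluster side, not the client.
./bin/flink run-application -t yarn-application -c com.example.MyJob ./myJob.jar
```

The `-t` flag selects the execution target; note that application mode uses the separate `run-application` action.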

4: lakehouse, data lake & CDC

1: lakehouse

Over the past two years, Alibaba has presented the lakehouse widely at conferences. My personal understanding is that it fully embraces object storage on the cloud, a single copy of storage serving multiple compute engines, and a cloud-native architecture in which storage and compute are separated. For this architecture, Aliyun’s MaxCompute is a recommended choice, because building it yourself is very expensive for small and medium-sized enterprises.

2: data lake

The data lake is an old concept that has been revived in recent years, and the industry still has no unified definition of it. Roughly: a data lake is a centralized repository that can store data of any structure and apply it to big data processing, real-time analytics, machine learning, and related scenarios, whereas a data warehouse is a central repository of curated information. A data lake is Schema On Read: schema information is only needed when the data is used. A data warehouse is Schema On Write: the schema must be designed before the data is stored. Because there are no restrictions on writing, a data lake makes data collection much easier. Common solutions are compared below:
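The Schema On Read / Schema On Write distinction can be illustrated with a minimal sketch (the schema, field names, and helper functions here are invented for illustration): the warehouse validates records against a predefined schema at write time, while the lake accepts any bytes and applies a schema only when queried.

```python
import json

# Schema on Write (data warehouse style): schema is enforced at ingestion.
WAREHOUSE_SCHEMA = {"user_id": int, "amount": float}

def warehouse_write(table, record):
    """Reject records that do not match the predefined schema."""
    for field, ftype in WAREHOUSE_SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"schema violation on field {field!r}")
    table.append(record)

# Schema on Read (data lake style): store raw bytes, parse only when queried.
def lake_write(store, raw_bytes):
    store.append(raw_bytes)            # anything can be written

def lake_read(store, wanted_fields):
    """Apply a schema at query time; skip records that do not fit it."""
    rows = []
    for raw in store:
        try:
            obj = json.loads(raw)
            rows.append({f: obj[f] for f in wanted_fields})
        except (ValueError, KeyError):
            continue                   # tolerate junk already in the lake
    return rows

warehouse, lake = [], []
warehouse_write(warehouse, {"user_id": 1, "amount": 9.9})
lake_write(lake, b'{"user_id": 1, "amount": 9.9, "extra": "ok"}')
lake_write(lake, b"not json at all")   # accepted on write...
print(lake_read(lake, ["user_id", "amount"]))  # ...filtered on read
```

This is also why the lake is easier to fill but pushes data-quality work to query time, while the warehouse pays that cost up front.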

Each solution’s positioning, current support, shortcomings, and notes:

Hudi
- Positioning: as the name (Hadoop Upserts Deletes and Incrementals) indicates, its main strengths are upserts, deletes, and incremental data processing.
- Current support: 1) incremental write, delete, and merge on Hadoop, where the introduction of a timeline allows efficient processing of incremental data; 2) time travel.
- Shortcomings: Hudi depends strongly on Spark, and Hudi-on-Flink deployments are still rare.
- Note: Spark is the officially recommended engine, while many Chinese companies recommend Flink, mainly because Flink is true stream processing.

Iceberg
- Positioning: a standard table format with efficient ETL.
- Current support: the main compute engines supported are Spark 2.4.5, Spark 3.x, Flink 1.11, and Presto; maintenance tasks such as snapshot expiration, small-file compaction, and incremental subscription/consumption can be performed.
- Note: among the three open-source projects Delta, Iceberg, and Hudi, Iceberg was the first to complete Flink integration.

Delta
- Positioning: a data lake storage layer that supports update/delete/merge on streams.
- Current support: only the Spark engine.
- Shortcomings: Delta is so deeply integrated with Spark that adapting it to other engines will take time.
- Note: Delta Lake, Databricks’ open-source project, focuses on solving the inherent problems of storage formats such as Parquet at the Spark level, and brings additional capabilities.

3: CDC

Change Data Capture (CDC) refers to capturing data changes. With CDC we can fetch committed changes from a database and send them downstream for consumption; these changes include INSERTs, DELETEs, UPDATEs, and so on. Typical uses:
1: With Flink SQL, data can be synchronized from one system to another, e.g. from MySQL to Elasticsearch.
2: An aggregate view over the source database can be materialized in real time.
3: Since only incremental changes are synchronized, data can be replicated in real time with low latency.
4: Joining a temporal table on event time yields accurate results.
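The “materialize an aggregate view” idea can be sketched without Flink at all: consume a changelog of INSERT / UPDATE / DELETE events keyed by primary key, retracting old values on update and delete. The event shape and field names below are invented for this toy sketch; Flink CDC does the same incrementally and at scale.

```python
from collections import defaultdict

def apply_changes(events):
    """Replay a changelog, maintaining both the replicated table and
    a per-user sum(amount) — a materialized aggregate view."""
    rows = {}                       # primary key -> current row
    totals = defaultdict(float)     # user -> sum(amount)
    for ev in events:
        op, row = ev["op"], ev["row"]
        key = row["id"]
        if op == "INSERT":
            rows[key] = row
            totals[row["user"]] += row["amount"]
        elif op == "UPDATE":
            old = rows[key]
            totals[old["user"]] -= old["amount"]   # retract the old value
            rows[key] = row
            totals[row["user"]] += row["amount"]
        elif op == "DELETE":
            old = rows.pop(key)
            totals[old["user"]] -= old["amount"]
    return dict(totals)

changelog = [
    {"op": "INSERT", "row": {"id": 1, "user": "a", "amount": 10.0}},
    {"op": "INSERT", "row": {"id": 2, "user": "a", "amount": 5.0}},
    {"op": "UPDATE", "row": {"id": 2, "user": "a", "amount": 7.0}},
    {"op": "DELETE", "row": {"id": 1, "user": "a", "amount": 10.0}},
]
print(apply_changes(changelog))    # {'a': 7.0}
```

The retract-then-add step on UPDATE is the key point: without it, the aggregate would drift away from the source table.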

4: Summary

Flink can achieve sub-second processing latency, and more and more business domains now place strict demands on latency, such as anti-fraud, monitoring engines, and real-time data analytics. Especially in the Internet of Things era, IoT technology can be used to analyze everyone’s vital signs in real time: early warning, risk control, and intelligent assistants let the data understand you better. The future has arrived; let us welcome the new era together.