I. What is FlinkX

FlinkX is a distributed offline/real-time data synchronization framework based on Flink that enables efficient data synchronization between a variety of heterogeneous data sources. It was initially developed by Kangaroo Cloud in 2016, is maintained by a stable R&D team, and has been open-sourced on GitHub (see the open source address at the end of the article), where it is maintained together with the open source community. Batch and stream processing have now been unified, so both offline computing and stream computing data synchronization tasks can be built on FlinkX.

FlinkX abstracts different source databases into Reader plug-ins and different target databases into Writer plug-ins. It has the following features:

- Built on Flink, with support for distributed execution;
- Bidirectional read and write: a database can serve as either a source or a target;
- Support for more than 20 heterogeneous data sources with bidirectional collection, such as MySQL, Oracle, SQLServer, Hive, and HBase;
- High scalability and flexibility: a newly extended data source can immediately interoperate with existing data sources.

II. FlinkX application scenarios

FlinkX is mainly used in the data synchronization/data integration module of big data platforms. It is usually combined with an efficient underlying synchronization plug-in and an interface-based configuration mode, so that developers can develop big data synchronization tasks simply and quickly: business database data is synchronized to the big data storage platform for data modeling and development, and after data development, the result data processed by the big data platform is synchronized back to the business application database for use by the enterprise's data business.

III. A detailed look at how FlinkX works

FlinkX is based on Flink; the reasons for this choice and its advantages are detailed at:

mp.weixin.qq.com/s/uQbGLY3_c…

Engine is Kangaroo Cloud's encapsulated task scheduling engine. Data synchronization tasks configured on the web UI are first submitted to this scheduling engine. The Template module then loads the Reader and Writer plug-ins corresponding to the source and target databases according to the task configuration. The Reader plug-in implements the InputFormat interface to obtain a DataStream object from the source database, and the Writer plug-in implements the OutputFormat interface to associate the target database with that DataStream object. In this way, the DataStream object chains reading and writing together, and the assembled Flink job is submitted to the Flink cluster to run.
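To make that assembly concrete, here is a minimal sketch that assumes only Flink's DataStream API and uses the built-in text formats as stand-ins for real Reader and Writer plug-ins (the class name, paths and job name are illustrative, not FlinkX code): an InputFormat is wrapped into a source DataStream, an OutputFormat is attached as its sink, and the resulting job is submitted to the cluster.

```java
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.api.java.io.TextOutputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SyncJobSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // "Reader" side: an InputFormat wrapped into a source DataStream.
        TextInputFormat reader = new TextInputFormat(new Path("/tmp/source"));
        DataStream<String> rows = env.createInput(reader, BasicTypeInfo.STRING_TYPE_INFO);

        // "Writer" side: an OutputFormat attached as the sink of the same stream.
        TextOutputFormat<String> writer = new TextOutputFormat<>(new Path("/tmp/target"));
        rows.writeUsingOutputFormat(writer);

        // The assembled job is then submitted to the Flink cluster.
        env.execute("flinkx-style sync job");
    }
}
```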

Earlier, FlinkX relied on Flink's sharding and accumulator features to handle incremental synchronization, multi-channel control, dirty data management and error management during data synchronization. In 2019, based on Flink's checkpoint mechanism, it implemented features such as resumable transfer and stream data resumption. Let's take a look at these new features.

(1) Resumable data transfer (breakpoint continuation)

During data synchronization, suppose a task needs to synchronize 500 GB of data to the target database and has already been running for 15 minutes, but fails at the 400 GB mark because of insufficient cluster resources, network problems, or other factors. If the task has to start over from scratch, the developer will go mad. FlinkX supports resumable transfer based on the checkpoint mechanism: if a synchronization task fails for the reasons above, it only needs to continue from the breakpoint, saving rerun time and cluster resources.

Flink's checkpoint feature is the core of its fault tolerance. It periodically generates snapshots of operator/task state according to the configuration and persists this state data. When the Flink program crashes unexpectedly, the re-run program can selectively recover from these snapshots, correcting data anomalies caused by the failure.
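As a rough illustration of how a job turns this mechanism on, the following sketch (with illustrative values, not FlinkX defaults) enables periodic checkpointing and retains the latest checkpoint so that a failed or cancelled synchronization task can later resume from it.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot operator/task state every 60 seconds with exactly-once semantics.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Keep the latest checkpoint when the job is cancelled or fails, so the
        // sync task can be restarted from that snapshot instead of from scratch.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
    }
}
```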

If a task fails, the system automatically retries it. If the retry succeeds, the system continues synchronizing data from the breakpoint location, reducing manual operation and maintenance (O&M).

(2) Real-time collection and resumable running

In June 2019, the Kangaroo Cloud data stack R&D team unified batch and stream data collection based on FlinkX. It can collect data sources such as MySQL binlog, Filebeat, and Kafka in real time and write to targets such as Kafka, Hive, HDFS, and Greenplum. Collection tasks also support limiting job concurrency and job rate, as well as dirty data management. Based on the checkpoint mechanism, real-time collection tasks can also be resumed: Flink periodically saves snapshots that record the data reading position, so when the generation of business data or the collection program is interrupted, fault recovery can continue from a historically saved breakpoint, ensuring data integrity. This feature is implemented in Kangaroo Cloud's StreamWorks product; you are welcome to learn more about it.

(3) Dirty data management for stream data

The BatchWorks offline computing product already implements dirty data management for offline data synchronization, with error control based on Flink accumulators: when the number of errors reaches the configured threshold, the task fails. Real-time stream data collection now supports this function as well. While data is written from the source database to the target database, error records are stored so that the dirty data generated during synchronization can be analyzed and handled later. However, because stream collection tasks run continuously, the task is not stopped when the error count reaches the threshold; users need to analyze and handle the dirty data afterwards.
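The sketch below illustrates the accumulator-based error control described above. It is an assumed example rather than FlinkX source code; the accumulator name "dirty-records", the threshold, and the parsing logic are all illustrative.

```java
import org.apache.flink.api.common.accumulators.LongCounter;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.types.Row;

public class DirtyDataMapper extends RichMapFunction<String, Row> {
    private final LongCounter dirtyCounter = new LongCounter();
    private final long maxErrors = 100;   // configured error threshold (illustrative)

    @Override
    public void open(Configuration parameters) {
        // Register the counter so the error count is visible to the job.
        getRuntimeContext().addAccumulator("dirty-records", dirtyCounter);
    }

    @Override
    public Row map(String line) {
        try {
            // Naive parse standing in for real type conversion.
            String[] fields = line.split(",");
            return Row.of(Long.parseLong(fields[0].trim()), fields[1]);
        } catch (Exception e) {
            dirtyCounter.add(1);
            if (dirtyCounter.getLocalValue() > maxErrors) {
                // A batch sync task would fail here; a stream collection task
                // keeps running and only records the error.
                throw new RuntimeException("dirty record threshold exceeded", e);
            }
            return null; // downstream is expected to filter out nulls
        }
    }
}
```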

(4) Writing data to Greenplum and OceanBase data sources

Greenplum is an MPP database based on PostgreSQL that supports the storage and management of massive data and has been adopted by many enterprises. Recently, the data stack implemented writes from multiple data sources to Greenplum based on FlinkX, supporting incremental synchronous writes for some databases in addition to full synchronization. OceanBase is a scalable relational database for the financial field developed by Alibaba, and its usage is basically the same as that of MySQL. Reading and writing OceanBase is also based on a JDBC connection, supporting synchronization and writing of tables and fields, incremental writes to OceanBase, and control over the job synchronization channel and concurrency.

Transactions are not used by default when writing to relational databases such as Greenplum, because a task failure on a very large data volume would have a huge impact on the business database. However, transactions must be enabled when resumable transfer is turned on; if the database does not support transactions, resumable transfer cannot be implemented. With resumable transfer enabled, a transaction is committed each time Flink generates a snapshot, writing the current data into the database. If the task fails between two snapshots, the data of the uncommitted transaction is not written to the database, and the task resumes synchronization from the position recorded in the last snapshot. In this way, data is synchronized accurately even if the task fails many times.
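The following sketch shows this "commit on snapshot" idea in simplified form. It is an assumption-based illustration, not FlinkX's actual writer; the JDBC URL, credentials, and SQL are placeholders. Rows are buffered inside an open transaction and only committed when Flink takes a checkpoint, so a failure between two snapshots leaves no partial data in the target database.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class TransactionalJdbcSink extends RichSinkFunction<String>
        implements CheckpointedFunction {

    private transient Connection connection;
    private transient PreparedStatement statement;

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection(
                "jdbc:postgresql://gp-host:5432/db", "user", "pwd");   // placeholder
        connection.setAutoCommit(false);                               // keep a transaction open
        statement = connection.prepareStatement("INSERT INTO t(col) VALUES (?)");
    }

    @Override
    public void invoke(String value, Context context) throws Exception {
        statement.setString(1, value);
        statement.addBatch();                                          // buffered inside the transaction
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        statement.executeBatch();
        connection.commit();                                           // commit once per checkpoint
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) {
        // On restore, synchronization continues from the position in the last
        // snapshot; position handling is omitted in this sketch.
    }

    @Override
    public void close() throws Exception {
        statement.close();
        connection.close();
    }
}
```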

IV. Closing remarks

Through internal use at Kangaroo Cloud and practice in big data platform projects, FlinkX already supports the data sources listed below, and with its high scalability it will continue to support more data sources.

This article was first published in: Data Stack Research Club

Data Stack is a cloud-native, one-stop data middle-platform PaaS, and we have an interesting open source project on GitHub called FlinkX. FlinkX is a Flink-based data synchronization tool that can collect static data such as MySQL and HDFS, as well as real-time changing data such as MySQL binlog and Kafka. It is a data synchronization engine that unifies whole-domain, heterogeneous, and batch-stream data. Welcome to find us in the GitHub community and play ~