Brief introduction: content shared at the Shanghai-station Flink Meetup, covering Tencent's case studies of landing a data lake in ten-billion-level data scenarios.

This article is edited from "Real-Time Data Into the Lake", shared by Chen Junjie, senior engineer of Tencent Data Lake R&D, at the Shanghai Flink Meetup on April 17. The contents are as follows:

  1. Introduction to Tencent Data Lake
  2. Landing ten-billion-level data scenarios
  3. Future planning
  4. Conclusion

GitHub address: https://github.com/apache/flink. Everyone is welcome to give Flink a like and a star ~

I. Introduction to Tencent Data Lake

As can be seen from the figure above, the whole platform is fairly large, covering data access, upper-level analysis, middle-layer management (such as task management, analysis management, and engine management), and the Table Format layer at the very bottom.

II. Landing Ten-Billion-Level Data Scenarios

1. Traditional platform architecture

As shown in the figure above, traditional platform architectures came in only two flavors, the Lambda architecture and the Kappa architecture:

  • In the Lambda architecture, batch and stream are separated, so operations must run two sets of clusters, one for Spark/Hive and one for Flink. This brings several problems:

    • First, operation and maintenance costs are relatively high;
    • Second, development costs. On the business side, for example, one person may write Spark while another writes Flink or SQL; overall, the development cost is not friendly to data analysts.
  • The second is the Kappa architecture: messages are queued, transmitted to the bottom layer, and analyzed downstream. Its strength is speed; being based on Kafka gives it a degree of real-time capability.

Both architectures have their pros and cons. The biggest problem is that storage may not be unified, leaving the data links fragmented. At present our platform has integrated Iceberg. The following sections walk through the problems we encountered, scenario by scenario, and how we solved them.

2. Scenario 1: Mobile QQ security data entering the lake

Mobile QQ security data entering the lake is a typical scenario.

In the current business scenario, data from the message queue TubeMQ is written by Flink into Iceberg as the ODS layer; Flink then joins it with a user dimension table to build a wide table, runs some queries on it, and puts the results into COS, where further analysis may happen in BI scenarios.

The process looks ordinary, but note that the mobile QQ user dimension table has 2.8 billion rows and the message queue carries ten billion messages per day, so it faces real challenges.

  • Small File Challenge

    1. Flink Writer produces small files

      Flink writes without a shuffle, so data is scattered out of order, resulting in many small files.

    2. Strict latency requirements

      The checkpoint interval is short and commits are frequent, which aggravates the small file problem.

    3. Small file explosion

      Within a few days, metadata and data small files explode at the same time, putting enormous pressure on the cluster.

    4. Merging small files magnifies the problem

      Launching an Action to merge small files, in order to solve the small file problem, ends up producing even more files.

    5. Data deletion cannot keep up

      Expiring snapshots and deleting orphan files requires scanning a huge number of files, which puts pressure on the NameNode.
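To see why the files pile up so quickly, here is a back-of-envelope sketch. The `files_per_day` helper and all the numbers in it are illustrative assumptions, not Tencent's real configuration:

```python
# Back-of-envelope estimate of how many data files a Flink -> Iceberg job
# produces per day. At every checkpoint, each writer subtask commits one
# file per partition it has seen, so short checkpoint intervals multiply
# the file count.

def files_per_day(parallelism: int,
                  partitions_per_subtask: int,
                  checkpoint_interval_s: int) -> int:
    """Files committed per day = commits/day * writers * partitions/writer."""
    commits_per_day = 24 * 3600 // checkpoint_interval_s
    return commits_per_day * parallelism * partitions_per_subtask

# A 60 s checkpoint with 100 subtasks, each touching 30 partitions, already
# yields over four million files a day -- hence the "small file explosion".
print(files_per_day(parallelism=100, partitions_per_subtask=30,
                    checkpoint_interval_s=60))
```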

  • The solution

    1. Flink sync merge

      • Add small file merge Operators;
      • Added Snapshot auto-cleanup mechanism.

        1) snapshot.retain-last.nums

        2) snapshot.retain-last.minutes

    2. Spark Asynchronous Merges

      • Add background services for small file merging and orphan file deletion;
      • Add small file filtering logic, gradually delete small files;
      • Added merge logic by partition to avoid creating too many deleted files at one time causing task OOM.
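The snapshot auto-cleanup policy above can be sketched as follows: keep the newest N snapshots, plus any snapshot younger than M minutes, and expire the rest. The `Snapshot` class and `snapshots_to_expire` function are hypothetical stand-ins for illustration, not Iceberg's actual API:

```python
# Minimal sketch of snapshot retention: keep the newest `retain_last_nums`
# snapshots plus anything younger than `retain_last_minutes`; everything
# else is eligible for expiration.
from dataclasses import dataclass

@dataclass
class Snapshot:
    snapshot_id: int
    timestamp_ms: int  # commit time, epoch millis

def snapshots_to_expire(snapshots, now_ms, retain_last_nums, retain_last_minutes):
    """Return ids of snapshots that are safe to delete, newest first."""
    ordered = sorted(snapshots, key=lambda s: s.timestamp_ms, reverse=True)
    keep = {s.snapshot_id for s in ordered[:retain_last_nums]}
    cutoff = now_ms - retain_last_minutes * 60_000
    keep |= {s.snapshot_id for s in ordered if s.timestamp_ms >= cutoff}
    return [s.snapshot_id for s in ordered if s.snapshot_id not in keep]
```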
  • Flink sync merge

After all data files are committed, a commit result is generated. We take this commit result, generate compaction tasks from it, send them to multiple Task Managers to do the rewrite work, and finally commit the result back to the Iceberg table.

The key, of course, is what the CompactTaskGenerator does. At first we wanted to merge as much as possible, so we scanned the whole table, which touched a huge number of files. But the tables are very large and hold so many small files that a single full scan could bring the whole Flink job down.

So we switched to scanning the data incrementally after each merge: start from the last Replace operation, look at what has been added since, and pick out whatever fits the rewrite strategy.

Several configurations control this, such as how many snapshots must accumulate, or how many files must exist, before a merge is triggered; users can set them, and we also provide defaults so that users benefit from these features without being aware of them.
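The incremental selection idea can be sketched like this: walk only the snapshots committed since the last "replace" (compaction) snapshot and collect the newly added files that fit the rewrite policy, here simplified to "smaller than a target size". All names, the threshold, and the snapshot tuple layout are illustrative assumptions:

```python
# Sketch of an incremental CompactTaskGenerator: instead of scanning the
# whole table, look only at snapshots after the last replace operation.

TARGET_FILE_SIZE = 128 * 1024 * 1024  # 128 MB, an assumed rewrite threshold

def pick_rewrite_candidates(snapshots, last_replace_id):
    """snapshots: ordered list of (snapshot_id, operation, added_files),
    where added_files maps file path -> size in bytes.
    last_replace_id: id of the last replace snapshot, or None."""
    candidates = []
    seen_replace = last_replace_id is None
    for snap_id, operation, added in snapshots:
        if not seen_replace:
            # Skip everything up to and including the last replace snapshot.
            seen_replace = snap_id == last_replace_id
            continue
        candidates.extend(p for p, size in added.items()
                          if size < TARGET_FILE_SIZE)
    return candidates
```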

  • The Fanout Writer's pitfalls

With the Fanout Writer, a large data volume can mean many partitions. Mobile QQ data, for example, is partitioned by province and city; since that was still too coarse, buckets were added on top. In this case each Task Manager may hold many partitions, and each partition keeps a Writer open, so the number of Writers grows very large.

We did two things here:

  • The first is keyBy support: do a keyBy on the partition set by the user, so that rows of the same partition land in one Task Manager, which avoids opening many partition Writers. Of course, there is a performance cost to doing so.
  • The second is an LRU Writer, maintaining a map in memory so that the least-recently-used Writers can be closed.
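A minimal sketch of the LRU Writer idea: cap the number of open partition writers per task and close (flush) the least-recently-used one when a record for a new partition arrives. The class, the buffered-list "writer", and the flush bookkeeping are stand-ins for illustration, not Flink's actual writer API:

```python
# LRU map of open partition writers: at most `capacity` writers stay open;
# writing to a new partition evicts the least-recently-used one.
from collections import OrderedDict

class LruPartitionWriters:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.writers = OrderedDict()   # partition -> list of buffered records
        self.closed = []               # partitions flushed due to eviction

    def write(self, partition, record):
        if partition not in self.writers:
            if len(self.writers) >= self.capacity:
                # Evict the least-recently-used partition writer.
                evicted, _records = self.writers.popitem(last=False)
                self.closed.append(evicted)  # a real writer would flush here
            self.writers[partition] = []
        else:
            self.writers.move_to_end(partition)  # mark as recently used
        self.writers[partition].append(record)
```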

3. Scenario 2: Index analysis of news platforms

The upper half of the figure is the online index architecture for news articles, built on Iceberg for unified stream and batch processing. On the left, Spark collects the dimension table on HDFS; on the right is the access system. After ingestion, a window-based join with the dimension table is performed, and the result is written to the index stream table.

  • Functions

    • Quasi real-time detail layer;
    • Real-time streaming consumption;
    • Stream MERGE INTO;
    • Multidimensional analysis;
    • Offline analysis.
  • Scene features

    The above scenario has the following characteristics:

    • Order of magnitude: a single index table exceeds 100 billion rows, a single batch is 20 million rows, and the daily average is 100 billion rows;
    • Delay requirements: minute level of end-to-end data visibility;
    • Data sources: full volume, quasi-real-time increments, message flows;
    • Consumption mode: streaming consumption, batch loading, point check, row update, multi-dimensional analysis.
  • Challenge: MERGE INTO

    Some users raised the need for MERGE INTO, so we thought about it from three angles:

    • Function: merge the stream table produced by each batch join into the real-time index table for downstream use;
    • Performance: downstream has strict requirements on index freshness, so the MERGE INTO must keep pace with the upstream batch consumption window;
    • Usability: Table API? Action API? SQL API?
  • The solution

    1. The first step

      • Referencing Delta Lake, design a JoinRowProcessor;
      • Use Iceberg's WAP (write-audit-publish) mechanism to write temporary snapshots.
    2. The second step

      • Optionally skip the cardinality check;
      • Optionally hash only, without sorting, when writing.
    3. The third step

      • Support the DataFrame API;
      • Spark 2.4 supports SQL;
      • Spark 3.0 uses the community version.
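The MERGE INTO semantics described above can be modeled with a dict-based sketch: each micro-batch of keyed rows is merged into the target table, updating when the key exists and inserting when it does not. This models the behavior only; the real implementation rides on Spark plus Iceberg's WAP snapshots, and the `merge_into` helper here is purely illustrative:

```python
# Dict-based model of MERGE INTO: matched keys update, unmatched keys insert.

def merge_into(target: dict, batch: list) -> dict:
    """target: key -> row dict; batch: list of (key, row dict) changes."""
    for key, row in batch:
        if key in target:
            target[key].update(row)   # WHEN MATCHED THEN UPDATE
        else:
            target[key] = dict(row)   # WHEN NOT MATCHED THEN INSERT
    return target
```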

4. Scenario 3: Advertising data analysis

  • Advertising data mainly has the following characteristics:

    • Order of magnitude: 100-billion-level, PB-scale data per day, about 2 KB per record;
    • Data source: SparkStreaming increments into the lake;
    • Data features: labels keep increasing, Schema keep changing;
    • Usage: Interactive query analysis.
  • Challenges and solutions:

    • Challenge 1: the schema is complex and deeply nested; flattened, it approaches ten thousand columns, and a single write triggers OOM.

      Solution: Parquet's default page size is 1 MB per column, so the page size needs to be set according to executor memory.

    • Challenge 2: 30 days of base data blows up the cluster.

      Solution: provide an Action for lifecycle management, distinguishing the file lifecycle from the data lifecycle.

    • Challenge 3: interactive queries.

      Solution: 1) column projection; 2) predicate pushdown.
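For the page-size fix in challenge 1 above, the sizing logic can be sketched roughly as follows. With ~10,000 flattened columns, one default 1 MB page buffer per column alone can exceed executor memory, so the page size has to be derived from the memory actually available. The `safe_page_size` formula, the headroom fraction, and the floor value are all illustrative assumptions:

```python
# Spread a fraction of executor memory across one open page per column,
# with a floor so pages never become pathologically tiny.

def safe_page_size(executor_mem_bytes: int, num_columns: int,
                   headroom: float = 0.5, floor: int = 64 * 1024) -> int:
    """Return a per-column Parquet page size that fits in executor memory."""
    budget = int(executor_mem_bytes * headroom)
    return max(floor, budget // num_columns)

# 4 GB executor, 10,000 columns: the default 1 MB pages would need ~10 GB
# of buffers, while this sizing caps each page at roughly 210 KB.
print(safe_page_size(4 * 1024**3, 10_000))
```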

III. Future planning

The future planning is mainly divided into the kernel side and the platform side.

1. The kernel side

In the future, we hope to have the following plans on the kernel side:

  • More data access

    • Incremental lake entry support;
    • V2 Format support;
    • ROW Identity support.
  • Faster queries

    • Index support;
    • Alluxio acceleration layer support;
    • MOR optimization.
  • Better data governance

    • Data Governance Action;
    • SQL Extension support;
    • Better metadata management.

2. The platform side

On the platform side, we have the following plans:

  • Data governance as a service

    • Metadata cleanup as a service;
    • Data governance as a service.
  • Incremental lake support

    • Spark consumes CDC into the lake;
    • Flink consumes CDC into the lake.
  • Metric monitoring and alarms

    • Write metrics;
    • Small file monitoring and alarms.

IV. Conclusion

After applying and practicing this in mass production, we can summarize three takeaways:

  • Availability: real-world use across multiple business lines has confirmed that Iceberg can withstand daily data volumes at the ten-billion level and beyond.
  • Ease of use: the barrier to adoption is still high, and more work is needed before users can pick it up easily.
  • Scenario support: there are currently fewer lake-ingestion scenarios than Hudi supports, and incremental reads are also missing, so we need to work hard to catch up.


Copyright Notice:The content of this article is contributed by Aliyun real-name registered users, and the copyright belongs to the original author. Aliyun developer community does not own the copyright and does not bear the corresponding legal liability. For specific rules, please refer to User Service Agreement of Alibaba Cloud Developer Community and Guidance on Intellectual Property Protection of Alibaba Cloud Developer Community. If you find any suspected plagiarism in the community, fill in the infringement complaint form to report, once verified, the community will immediately delete the suspected infringing content.